# Intro

We have a binary classification problem, given a fact we must acknowledge wether it is verifiable or not.

First, we will setup the environment, install the necessary dependencies, download the dataset and explore it to get a grasp of its nuances. Then we will prepare the dataset for training, creating the necessary splits to train a BERT based model and, at the end, we will evaluate it and perform a small error analysis.

This notebook calls the necessary steps and explains the results obtained. The full source code and instructions to perform the process locally or on your own Colab can be found in the following github repository: https://github.com/geblanco/newtral_technical_test

## Choosing the model

This is a crucial step on every process involving machine learning, briefly, after a first researching step (not reported here) we will employ a RoBERTa-based model pre-trained in spanish from the ground up (found [here](https://huggingface.co/bertin-project/bertin-roberta-base-spanish)). Albeit being very novel, this model achieves very good results in spanish and is, to the best of our knowledge, the first realistic attempt to create a pre-trained model in spanish.

In [1]:
# base modules
import os
import sys
import pandas as pd
from pathlib import Path

In [2]:
# constants
base_dir = Path("/content/newtral_technical_test/")
data_url = "https://ml-coding-test.s3.eu-west-1.amazonaws.com/ml_test_data.csv"
data_dir = base_dir.joinpath("data")
data_file = data_dir.joinpath("ml_test_data.csv")
models_dir = Path("/content/drive/My Drive/Colab Notebooks/models")
model_name = "bertin-project/bertin-roberta-base-spanish"

In [3]:
# setup storage
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [4]:
# get code
![[ ! -d "newtral_technical_test" ]] && git clone https://github.com/geblanco/newtral_technical_test

Cloning into 'newtral_technical_test'...
remote: Enumerating objects: 38, done.[K
remote: Counting objects: 100% (38/38), done.[K
remote: Compressing objects: 100% (28/28), done.[K
remote: Total 38 (delta 14), reused 32 (delta 10), pack-reused 0[K
Unpacking objects: 100% (38/38), done.


In [5]:
os.chdir("newtral_technical_test")
!pwd
sys.path.append("./src")

/content/newtral_technical_test


In [6]:
!pip install -q -r requirements.txt

[?25l[K     |████▍                           | 10 kB 24.7 MB/s eta 0:00:01[K     |████████▊                       | 20 kB 27.7 MB/s eta 0:00:01[K     |█████████████▏                  | 30 kB 12.1 MB/s eta 0:00:01[K     |█████████████████▌              | 40 kB 9.5 MB/s eta 0:00:01[K     |██████████████████████          | 51 kB 5.4 MB/s eta 0:00:01[K     |██████████████████████████▎     | 61 kB 6.0 MB/s eta 0:00:01[K     |██████████████████████████████▊ | 71 kB 5.9 MB/s eta 0:00:01[K     |████████████████████████████████| 74 kB 2.2 MB/s 
[K     |████████████████████████████████| 302 kB 11.0 MB/s 
[K     |████████████████████████████████| 2.9 MB 36.1 MB/s 
[K     |████████████████████████████████| 112 kB 13.4 MB/s 
[K     |████████████████████████████████| 636 kB 51.0 MB/s 
[K     |████████████████████████████████| 208 kB 54.7 MB/s 
[K     |████████████████████████████████| 80 kB 9.2 MB/s 
[K     |████████████████████████████████| 3.3 MB 8.8 MB/s 
[K     |████████

In [9]:
# download data
from dl_data import maybe_download
maybe_download(data_url, data_dir, overwrite=True)

Downloading https://ml-coding-test.s3.eu-west-1.amazonaws.com/ml_test_data.csv
Writing /content/newtral_technical_test/data/ml_test_data.csv


# Data exploration

We are given a dataset entailing facts and a flag indicating whether the fact is verifiable or not (classes 1/0 respectively).

To get familiarized with the data we will explore:
- number of examples
- repeated examples
- class representation

To do so, we will use `pandas`, it provides many easy to use data manipulation tools.

In [10]:
# load data
data = pd.read_csv(data_file)
print(f"Number of examples: {len(data)}")
print(f"Number of invalid examples: {data['text'].isnull().sum()}")
print("First samples:")
print(f"{data.head()}\n")
print("Some stats:")
print(f"{data.describe()}\n")
# remove duplicates: compare lowercased sentences
lower = data.copy()
lower["text"] = lower["text"].apply(lambda x: x.lower())
dups = lower.duplicated()
print(f"Number of duplicates {len(dups[dups == True])}")
data.drop(dups[dups == True].index, inplace=True)
data.to_csv(data_file)

Number of examples: 15000
Number of invalid examples: 0
First samples:
                                                text  claim
0          Le hace la primera pregunta Lucía Méndez?      0
1  También debo decir que nunca, nunca se se habl...      0
2  Y ahora lo que estamos viendo efectivamente es...      0
3  Por ejemplo el secretario de Defensa norteamer...      0
4  Y en Radio y Televisión Española afortunadamen...      0

Some stats:
              claim
count  15000.000000
mean       0.072067
std        0.258607
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        1.000000

Number of duplicates 5


# First conclusions on the data

From the previous exploration we see that there are no invalid examples and that sentences look already clean, so there is no need to clean it further. Additionally, there is a reasonable number of examples to train a Deep Learning model (though more examples could be better), and only a few examples are repeated.

Finally, from the stats we see that there are around 7.2% of verifiable examples (mean == 0.072067), and around 92.8% of unverifiable examples present in the dataset. This is worrying as a it will make generalization more difficult, is easy to learn to always respond that the example is unverifiable and still get around 92% of accuracy.

A clear conclusion is that `accuracy` metric won't be representative to evaluate the model and that is crucial to ensure class proportions are preserved in the splits we create out of the original dataset.


# Data preparation

In this section we will divide the dataset into splits for training and evaluating the model and get numeric-based features out of each sentence (embed each sentence) so that the model can ingest it.

We will create three splits: train/dev/test with the following proportions: 70% train, 10% validation and 20% test split. To do so and still keep the original class proportions we can use the `StratifiedShuffleSplit` function from `sklearn`. The full source code can be found [here](https://github.com/geblanco/newtral_technical_test/blob/master/src/prepare_data.py#L82)

Creating the features is just a matter of passing each sentence through the embedding model, in this case a RoBERTa-based spanish one.

After the whole process we will end up wit the following tree structure:
```sh
data
├── cached_dev.csv_RobertaTokenizerFast
├── cached_test.csv_RobertaTokenizerFast
├── cached_train.csv_RobertaTokenizerFast
├── dev.csv
├── ml_test_data.csv
├── test.csv
└── train.csv
```

In [11]:
from transformers import RobertaTokenizerFast

from prepare_data import maybe_split
from featurize import featurize_files

maybe_split(
    input_file=data_file,
    output_dir=data_dir,
    dev_size=0.1,
    test_size=0.2,
    random_state=42,
    overwrite=True
)

splits = ["train", "dev", "test"]
tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
split_paths = [data_dir.joinpath(f"{split}.csv") for split in splits]
featurize_files(
    tokenizer=tokenizer,
    files=split_paths,
    output_dir=data_dir,
    overwrite=True
)

Saved split to /content/newtral_technical_test/data/train.csv
Saved split to /content/newtral_technical_test/data/dev.csv
Saved split to /content/newtral_technical_test/data/test.csv


Downloading:   0%|          | 0.00/835k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/502k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.40M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/292 [00:00<?, ?B/s]

Creating features for /content/newtral_technical_test/data/train.csv




Saving features to /content/newtral_technical_test/data/cached_train.csv_RobertaTokenizerFast
Creating features for /content/newtral_technical_test/data/dev.csv
Saving features to /content/newtral_technical_test/data/cached_dev.csv_RobertaTokenizerFast
Creating features for /content/newtral_technical_test/data/test.csv
Saving features to /content/newtral_technical_test/data/cached_test.csv_RobertaTokenizerFast


# Model training

---

__Disclaimer__: Due to Google Colab restrictions, the whole hyper-parameter/train procedure is slow and the kernel disconnects. To perform it correctly, it requires saving and restoring the state. To avoid this, we will depict the process here, but do the training offline.

---

Next, we want to train the model on the featurized dataset created above. To get the best out of our model we will perform some hyper-parameter tuning, `hf Trainer` class already supports this, we just have to setup the parameters to search for ([SOURCE CODE](https://github.com/geblanco/newtral_technical_test/blob/master/src/hypersearch.py)).

We will tune the following hyper-parameters:
- `learning_rate`: Between `1e-4 and 1e-2`
- `num_train_epochs`: Between `1` and `5`
- `per_device_train_batch_size`: In the range `[4, 8, 16, 32]`
- `gradient_accumulation_steps`: Between `1` and `2`

The [training script](https://github.com/geblanco/newtral_technical_test/blob/master/src/modeling.py) does exactly this. In essence, it will try each parameter, train the model, evaluate it on the development set and repeat. After the process is completed, we retrieve the best model and save the winning parameters to [config.yaml](https://github.com/geblanco/newtral_technical_test/blob/master/config.yaml).

When training models that depend upon randomly initialized parameters, we must take into account that the results can be hugely influenced by that specific initialization and other initializations perform poorly. In other words, it is a spurious result and hence, not representative of the models real performance. To overcome this, we will report the results averaged across `n` different models, each one trained with the same hyper-parameters, but different initialization, that is, changing the seed for each training (also, saving the seed of the best model aids in reproducibility later). In this case, we will fix `n = 10`.

This procedure can be launched locally by issueing:
```bash
python src/modeling.py data --output_dir models --hypersearch validation_steps=10
```

Summarizing, it will do:
- search the best hyper-parameters.
- train 10 different models with the best parameters.
- save the best out of those 10 models.

# Model evaluation

We will evaluate the model with the test dataset and measure global, micro and macro accuracy, global and per class f1 score, precision and recal for each class.

In [12]:
# first, some imports
import yaml
import json
import torch
import warnings
import argparse

from sklearn.metrics import (
    f1_score,
    accuracy_score,
    classification_report,
)
from transformers import (
    Trainer,
    TrainingArguments,
    RobertaTokenizerFast,
    RobertaForSequenceClassification,
)

from data_classes import FactsDataset
from modeling import compute_metrics

In [13]:
# utility function to load featurized dataset
def load_dataset(file_path):
    data_dict = torch.load(file_path)
    dataset = FactsDataset(
        features=data_dict["features"],
        labels=data_dict["labels"]
    )
    return dataset

In [14]:
train_dataset = load_dataset(data_dir.joinpath("cached_train.csv_RobertaTokenizerFast"))
dev_dataset = load_dataset(data_dir.joinpath("cached_dev.csv_RobertaTokenizerFast"))
test_dataset = load_dataset(data_dir.joinpath("cached_test.csv_RobertaTokenizerFast"))

train_params = yaml.safe_load(open("./config.yaml", "r"))["params"]
model_path = models_dir.joinpath(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_path).to("cuda:0")
training_args = TrainingArguments(
    **train_params,
    output_dir=models_dir,
    save_strategy="no",
    evaluation_strategy="no",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    compute_metrics=compute_metrics,
)
# evaluate the model
print("Evaluating model...")
model_path = models_dir.joinpath(model_name)
preds = trainer.predict(test_dataset)
compute_metrics(preds, print_results=True)
print(json.dumps(preds.metrics, indent=2) + "\n")

***** Running Prediction *****
  Num examples = 1000
  Batch size = 8


Evaluating model...


              precision    recall  f1-score   support

           0       0.99      0.99      0.99       925
           1       0.92      0.91      0.91        75

    accuracy                           0.99      1000
   macro avg       0.96      0.95      0.95      1000
weighted avg       0.99      0.99      0.99      1000

{
  "test_loss": 0.03965435177087784,
  "test_acc": 0.987,
  "test_f1": 0.912751677852349,
  "test_macro": 0.9528642235831167,
  "test_micro": 0.987,
  "test_runtime": 18.4537,
  "test_samples_per_second": 54.19,
  "test_steps_per_second": 6.774
}



# Interpretation of results

Looking solely at accuracy tells us that the model performs pretty well (`98.7%`), but as we have a very unbalanced dataset, it is not very reliable. Instead, we should probably look at global f1 score (`~ 0.91`), that balances precision-recall. Though it is good, there is still room for improvement.

Also, micro and macro accuracy tell us two things: taking into account how much examples of each class there are, the model performs reasonably well, but the metric is dominated by one of the classes. As we don't know the risk of incorrecly classifying an example, is difficult to asses the cost of putting this model to production.

Digging deeper on each class' performance, the model performs very well on unverifiable facts (class 0, `f1 = 0.99`), but is definitely improvable on verifiable facts (class 1, `f1 = 0.91`), in this case, it is failing to identify them (low recall on class 1), misclasifying them as class 0 examples. This could be due to the model failing to capture meaningful enough features for each class' examples and because there are so few examples of verifiable facts. The model is struggling to learn them correctly and more examples would probably help.

# Error analysis

We will extract predictions from the test set and see some correct and incorrectly classified examples both for class 1 and 0.

In [15]:
import nltk
import random
import numpy as np
import pandas as pd

nltk.download('punkt')

from nltk.tokenize import word_tokenize
from transformers import pipeline, RobertaForSequenceClassification

test_data = pd.read_csv(data_dir.joinpath("test.csv"))
tokenizer = RobertaTokenizerFast.from_pretrained("bertin-project/bertin-roberta-base-spanish")

pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)
texts, labels = data["text"].values.tolist(), data["claim"].values.tolist()

preds = pipe(texts)
pred_labels = [int(l["label"].split("_")[-1]) for l in preds]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


loading file https://huggingface.co/bertin-project/bertin-roberta-base-spanish/resolve/main/vocab.json from cache at /root/.cache/huggingface/transformers/f47efb87887425ef9a4ef795bfaa907d57ac9a650d733c7ca621b9eced3235e8.a80f232f572026f92499b14999a8ed4e044e04cf3d01b9f2be298c98e78e8498
loading file https://huggingface.co/bertin-project/bertin-roberta-base-spanish/resolve/main/merges.txt from cache at /root/.cache/huggingface/transformers/aba9e0895dea47dd4208a36012ffd3eb21eb4c5f7ce0be6547afb37cdd4ddef4.a0dfc41f9d0f03a56ba7a5401d770f6e43071045a0bd79073380d408d17a0d92
loading file https://huggingface.co/bertin-project/bertin-roberta-base-spanish/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/baad57d0f574d3e660cafb14601d0ecebe83f25071d59f3e51d225d75285b773.2a5dc806edc00ab3a329cb22b9973596ca75b24ba0e5e4963bf1308de7237a3d
loading file https://huggingface.co/bertin-project/bertin-roberta-base-spanish/resolve/main/added_tokens.json from cache at None
loading file

In [16]:
def pretty_print(sentence, max_len=100):
    text = ""
    curr_sentence = ""
    sentences = []
    tokens = word_tokenize(sentence)
    for idx, tok in enumerate(tokens):
        if len(curr_sentence) + len(tok) > max_len:
            sentences.append(curr_sentence)
            curr_sentence = ""
        
        curr_sentence += tok + " "
        if idx == len(tokens) -1:
            sentences.append(curr_sentence)

    return "\n  ".join(sentences)

def print_class(correct, incorrect, class_name, num_samples=5):
    print(f"\nExamples of {class_name}")
    print("=" * 15)
    for samples, name in [(correct, "correct"), (incorrect, "incorrect")]:
        # sample 5 random examples
        indices = random.sample(range(len(samples)), k=min(num_samples, len(samples)))
        print(name)
        for sent in np.array(samples)[indices]:
            print(pretty_print(f"- {sent}"))

        print("\nStats:")
        lens = np.array([len(sent) for sent in samples])
        print(f"Mean sentence length: {round(lens.mean(), 2)}")
        print(f"Std sentence length: {round(lens.std(), 2)}")
        print(f"Amount of sentences: {len(samples)}")
        print()

In [17]:
# correct examples of class 1
correct_class_1 = []
# examples of class 1 missclassified as class 0
incorrect_class_1 = []

# correct examples of class 0
correct_class_0 = []
# examples of class 0 missclassified as class 1
incorrect_class_0 = []

for sent, pred_label, label in zip(texts, pred_labels, labels):
    correct = label == pred_label
    if label and not correct:
        incorrect_class_1.append(sent)
    elif label and correct:
        correct_class_1.append(sent)
    elif not label and not correct:
        incorrect_class_0.append(sent)
    else:
        correct_class_0.append(sent)

In [18]:
print_class(correct_class_1, incorrect_class_1, "class 1")
print_class(correct_class_0, incorrect_class_0, "class 0")


Examples of class 1
correct
- No ha convocado ni siquiera las comunidades autónomas . 
- La llegada este mes de 17 millones de dosis , ha supuesto un salto cuantitativo que el sistema ha 
  sabido absorber . 
- Yo no conozco ningún otro país en el que la oposición se dedique a deteriorar y atacar la 
  confianza en el país , en los medios internacionales en el Parlamento Europeo . 
- En Cataluña estamos alrededor de 450 personas que están ingresadas en el hospital y sí que se han 
  reducido respecto a ayer el número de personas que están en la UCI alrededor de 140 . 
- La primera es que en España ya ocurre y en el año 2015 el Comité Olímpico Internacional ya lo dijo 
  con mucha claridad que las mujeres participan en las competiciones de mujeres y los hombres 
  participan en las competiciones de hombres . 

Stats:
Mean sentence length: 171.24
Std sentence length: 95.17
Amount of sentences: 970

incorrect
- Además hemos hecho una formación profesional que ha incorporado 64000 plazas 

# Verifiable facts:

__Correct:__

- Seem to represent facts with numeric evidences, all sentences include the subject or item of the action.
- Sentences refer to facts, one speaker outlining a fact, not directly speaking to other person.
- A human could identify what the sentence is referring to.

__Incorrect:__

- Although following the same structure as correctly classified verifiable facts (subject/item included, speaker announcing a fact), in these samples, either the relevant part of the sentence is not the numeric evidence or the sentence is too big.

# Unverifiable facts:

__Correct:__

- Seem to be short or chopped sentences, personal opinions, sentences whose subject of the action is not present or direct questions from the speaker to other parties.
- Reading some of those sentences, don't make sense to me.

__Incorrect:__

- It is remarkable that correctly classified unverifiable facts are shorter than the rest (~70 chars less in average, with a smaller deviation), this might indicate that the model is learning to use the length of the sentence as a meaningful feature to decide whether an example is verifiable or not. In other words, if the sentence is big, it tends to classify it as verifiable, failing to classify some class 0 examples.

# Conclusions

At sight of these errors and results, I would conclude that the model is heavily relying on making sense out the text, the length of the given sentence, whether or not there are numeric symbols (be it written or digits) and detecting the subject of the action inside the sentence. When some of these features are misleading, the model fails.

# Final conclusions and further study

We were given the task of fine-tuning a BERT based model on a collection entailing verifiable and unverifiable facts in spanish. Being the task in spanish we employed the novel model BERTin, based on the RoBERTa variant of BERT. After searching for the best hyper-parameters and validating the results are not spurious, we obtain a model performing reasonably well.

Having a baseline to compare our model to would give us better insights on how well the model performs, but as the task is not standard, we could have just created one. Also, a future step in this direction is testing the model in other collections.

With the proposed solution we see that there is still some room for improvement, following are some ideas that could enhance the model's performance:
- Get more features: We could try extracting more features from each sentence (syntactic, semantic analysis) and feeding that to the model along with each sentence.
- The dataset is heavily unbalanced, augmenting the data could improve the performance. We could gather more news or train a generative model (gpt-2...)
- As the data in spanish is scarce, we could train another model in English with a similar task and transfer learning the weights to our model, though it would require an extra effort to clean the new dataset and adapt the weights to the spanish embedding layer.
- Transfer learning from other collection, we could train with a similar collection (i.e.: [FakeNewsCorpusSpanish](https://github.com/jpposadas/FakeNewsCorpusSpanish)) and later train with the given collection, though this won't work if the collection are too different.
- We could test other architectures or models, i.e.: [FakeBERT](https://link.springer.com/article/10.1007/s11042-020-10183-2) implements a BERT+LSTM/CNN, though it only performs slightly better that us in accuracy (`98.90` vs `98.7`, it's other collection, but similar task)