# Onclusive Machine Learning Challenge
### Building an ML system to verify the veracity of claims from the dataset PUBHEALTH 

The general approach is transfer learning. Training a comprehensive NLP model from scratch would require a huge amount of data and computing resources, and many pre-trained general-purpose models are already publicly available. All we have to do is select a suitable one and fine-tune it for our purpose.

Loading the necessary libraries:

In [2]:
import numpy as np
import pandas as pd
from datasets import load_dataset, load_metric
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
import mlflow

Loading the PUBHEALTH dataset

In [2]:
dataset = load_dataset("health_fact")

Using custom data configuration default
Reusing dataset health_fact (C:\Users\david\.cache\huggingface\datasets\health_fact\default\1.1.0\99503637e4255bd805f84d57031c18fe4dd88298f00299d56c94fc59ed68ec19)


Removing invalid entries

In [3]:
dataset = dataset.filter(lambda example: example['label'] != -1)

Loading cached processed dataset at C:\Users\david\.cache\huggingface\datasets\health_fact\default\1.1.0\99503637e4255bd805f84d57031c18fe4dd88298f00299d56c94fc59ed68ec19\cache-41180d96da5bd5f3.arrow
Loading cached processed dataset at C:\Users\david\.cache\huggingface\datasets\health_fact\default\1.1.0\99503637e4255bd805f84d57031c18fe4dd88298f00299d56c94fc59ed68ec19\cache-e37781bb9e63676d.arrow
Loading cached processed dataset at C:\Users\david\.cache\huggingface\datasets\health_fact\default\1.1.0\99503637e4255bd805f84d57031c18fe4dd88298f00299d56c94fc59ed68ec19\cache-bd791c5953b7d1ab.arrow
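Once the unlabelled rows (label `-1`) are removed, it can be useful to look at how the remaining labels are distributed, because class imbalance affects how the per-class F1 scores later in the notebook should be read. The following is a minimal sketch on toy data. The helper names `drop_unlabelled` and `label_distribution` are hypothetical, and only the four integer labels 0 to 3 are assumed, matching `num_labels=4` used below.

```python
from collections import Counter

NUM_LABELS = 4  # matches num_labels=4 passed to the model in this notebook


def drop_unlabelled(labels):
    """Keep only valid labels, mirroring the `label != -1` filter above."""
    return [label for label in labels if label != -1]


def label_distribution(labels, num_labels=NUM_LABELS):
    """Return the share of each label index among the given labels."""
    counts = Counter(labels)
    total = len(labels)
    return {index: counts.get(index, 0) / total for index in range(num_labels)}


# Toy labels for illustration only; these are not taken from PUBHEALTH.
toy_labels = [2, 2, 0, -1, 1, 2, -1, 0]
clean_labels = drop_unlabelled(toy_labels)
print(label_distribution(clean_labels))
```

On the real data, the same helper could be called on `dataset["train"]["label"]` after the filter.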


The first pre-trained model tried was the DistilBERT base model (uncased) (https://huggingface.co/distilbert-base-uncased), a distilled version of the BERT base model. It is a good starting point because it needs far fewer computing resources to fine-tune and run. The drawback is that its maximum input length is only 512 tokens, so longer input text has to be truncated.

Before the text is fed to the model, it must be tokenized: the tokenizer encodes each string into the integer token IDs that the transformer expects.

In [4]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [5]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # for dynamic padding
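To illustrate what "dynamic padding" means, here is a minimal pure-Python sketch: each batch is padded only up to the length of its own longest sequence, rather than to a fixed global length. This is not the library's implementation. The function name `pad_batch` is hypothetical, and the padding ID of 0 is an assumption.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of the longest one.

    `batch` is a list of token-ID lists; `pad_id` is an assumed padding ID.
    """
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[101, 7592, 102], [101, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the padded length depends on the batch, short batches stay short, which saves computation compared with padding everything to 512 tokens.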

In [6]:
def tokenize_function(examples):
    return tokenizer(examples["main_text"], truncation=True)  # ideally the "claim" field would be included too; it is skipped here because of the length limit

tokenized_datasets = dataset.map(tokenize_function, batched=True) 
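One possible way to include the `claim` field despite the 512-token limit is to encode the claim and the main text as a sentence pair and truncate only the second segment. The sketch below is an alternative to the cell above, not what the notebook actually ran. The function name `tokenize_claim_and_text` is hypothetical, and it assumes that every claim on its own fits well within `max_length`.

```python
def tokenize_claim_and_text(examples, tokenizer, max_length=512):
    """Encode (claim, main_text) pairs, truncating only the main text.

    `tokenizer` is expected to be a Hugging Face tokenizer such as the one
    loaded above; passing it in explicitly keeps the function easy to test.
    """
    return tokenizer(
        examples["claim"],          # first segment: kept intact
        examples["main_text"],      # second segment: truncated if too long
        truncation="only_second",
        max_length=max_length,
    )
```

It could then be applied with `dataset.map(lambda batch: tokenize_claim_and_text(batch, tokenizer), batched=True)`. Whether this improves the scores would need to be checked experimentally.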




In [None]:
# smaller subset of data was used to test the code first
# small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(100)) 
# small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))

Specifying the chosen model

In [8]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=4)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

Setting up the training details

In [3]:
training_args = TrainingArguments(output_dir="models/distilbert-base-uncased-mod")  
# No hyperparameters were set at this stage, so the defaults are used. A hyperparameter search should follow to improve model performance.

In [44]:
metric = load_metric("f1")  # F1 is a suitable metric for classification problems, including multi-class ones
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    final_metric = {}
    for key in ['micro','macro','weighted']:
        final_metric[key] = metric.compute(predictions=predictions, references=labels, average=key)
    final_metric['individuals'] = metric.compute(predictions=predictions, references=labels, average=None)
    return final_metric
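To make the three averaging modes concrete, here is a minimal pure-Python sketch of micro, macro and weighted F1 for single-label multi-class predictions. It is an illustration, not the code used in this notebook, and the names `f1_per_class` and `f1_averages` are hypothetical. It returns plain floats rather than nested dictionaries, which matters because the evaluation log further down shows MLflow dropping metric values of type `dict`.

```python
from collections import Counter


def f1_per_class(predictions, references, num_classes):
    """F1 score of each class, treating that class as the positive one."""
    scores = []
    for c in range(num_classes):
        pairs = list(zip(predictions, references))
        tp = sum(1 for p, r in pairs if p == c and r == c)
        fp = sum(1 for p, r in pairs if p == c and r != c)
        fn = sum(1 for p, r in pairs if p != c and r == c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def f1_averages(predictions, references, num_classes=4):
    """Return micro, macro and weighted F1 as plain floats."""
    per_class = f1_per_class(predictions, references, num_classes)
    support = Counter(references)
    total = len(references)
    # With exactly one label per example, micro F1 equals accuracy.
    micro = sum(1 for p, r in zip(predictions, references) if p == r) / total
    macro = sum(per_class) / num_classes
    weighted = sum(per_class[c] * support.get(c, 0) for c in range(num_classes)) / total
    return {"f1_micro": micro, "f1_macro": macro, "f1_weighted": weighted}


print(f1_averages([0, 1, 2, 2], [0, 1, 1, 2], num_classes=3))
```

Macro F1 gives every class equal weight, so it is the most sensitive of the three to poor performance on rare classes.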

In [31]:
trainer = Trainer(
    model=model,
    args=training_args,
    # train_dataset=small_train_dataset,   # smaller subset of data was used to test the code first
    # eval_dataset=small_eval_dataset,
    train_dataset=tokenized_datasets["train"],  # then input the whole training set
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [32]:
# Workaround for an error raised during training: convert the collator's output to a plain dict
old_collator = trainer.data_collator
trainer.data_collator = lambda data: dict(old_collator(data))

Run the training

In [33]:
mlflow.end_run()  # end any MLflow run that is still active from a previous execution
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: main_text, sources, explanation, fact_checkers, claim, claim_id, subjects, date_published.
***** Running training *****
  Num examples = 9804
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3678


Saving model checkpoint to models/distilbert-base-uncased-mod\checkpoint-500
Configuration saved in models/distilbert-base-uncased-mod\checkpoint-500\config.json


{'loss': 0.335, 'learning_rate': 4.320282762370854e-05, 'epoch': 0.41}


Model weights saved in models/distilbert-base-uncased-mod\checkpoint-500\pytorch_model.bin
tokenizer config file saved in models/distilbert-base-uncased-mod\checkpoint-500\tokenizer_config.json
Special tokens file saved in models/distilbert-base-uncased-mod\checkpoint-500\special_tokens_map.json
Saving model checkpoint to models/distilbert-base-uncased-mod\checkpoint-1000
Configuration saved in models/distilbert-base-uncased-mod\checkpoint-1000\config.json


{'loss': 0.3503, 'learning_rate': 3.640565524741708e-05, 'epoch': 0.82}


Model weights saved in models/distilbert-base-uncased-mod\checkpoint-1000\pytorch_model.bin
tokenizer config file saved in models/distilbert-base-uncased-mod\checkpoint-1000\tokenizer_config.json
Special tokens file saved in models/distilbert-base-uncased-mod\checkpoint-1000\special_tokens_map.json
Saving model checkpoint to models/distilbert-base-uncased-mod\checkpoint-1500
Configuration saved in models/distilbert-base-uncased-mod\checkpoint-1500\config.json


{'loss': 0.2477, 'learning_rate': 2.9608482871125614e-05, 'epoch': 1.22}


Model weights saved in models/distilbert-base-uncased-mod\checkpoint-1500\pytorch_model.bin
tokenizer config file saved in models/distilbert-base-uncased-mod\checkpoint-1500\tokenizer_config.json
Special tokens file saved in models/distilbert-base-uncased-mod\checkpoint-1500\special_tokens_map.json
Saving model checkpoint to models/distilbert-base-uncased-mod\checkpoint-2000
Configuration saved in models/distilbert-base-uncased-mod\checkpoint-2000\config.json


{'loss': 0.2127, 'learning_rate': 2.281131049483415e-05, 'epoch': 1.63}


Model weights saved in models/distilbert-base-uncased-mod\checkpoint-2000\pytorch_model.bin
tokenizer config file saved in models/distilbert-base-uncased-mod\checkpoint-2000\tokenizer_config.json
Special tokens file saved in models/distilbert-base-uncased-mod\checkpoint-2000\special_tokens_map.json
Saving model checkpoint to models/distilbert-base-uncased-mod\checkpoint-2500
Configuration saved in models/distilbert-base-uncased-mod\checkpoint-2500\config.json


{'loss': 0.2056, 'learning_rate': 1.6014138118542688e-05, 'epoch': 2.04}


Model weights saved in models/distilbert-base-uncased-mod\checkpoint-2500\pytorch_model.bin
tokenizer config file saved in models/distilbert-base-uncased-mod\checkpoint-2500\tokenizer_config.json
Special tokens file saved in models/distilbert-base-uncased-mod\checkpoint-2500\special_tokens_map.json
Saving model checkpoint to models/distilbert-base-uncased-mod\checkpoint-3000
Configuration saved in models/distilbert-base-uncased-mod\checkpoint-3000\config.json


{'loss': 0.1125, 'learning_rate': 9.216965742251224e-06, 'epoch': 2.45}


Model weights saved in models/distilbert-base-uncased-mod\checkpoint-3000\pytorch_model.bin
tokenizer config file saved in models/distilbert-base-uncased-mod\checkpoint-3000\tokenizer_config.json
Special tokens file saved in models/distilbert-base-uncased-mod\checkpoint-3000\special_tokens_map.json
Saving model checkpoint to models/distilbert-base-uncased-mod\checkpoint-3500
Configuration saved in models/distilbert-base-uncased-mod\checkpoint-3500\config.json


{'loss': 0.1221, 'learning_rate': 2.419793365959761e-06, 'epoch': 2.85}


Model weights saved in models/distilbert-base-uncased-mod\checkpoint-3500\pytorch_model.bin
tokenizer config file saved in models/distilbert-base-uncased-mod\checkpoint-3500\tokenizer_config.json
Special tokens file saved in models/distilbert-base-uncased-mod\checkpoint-3500\special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




{'train_runtime': 695.7751, 'train_samples_per_second': 42.272, 'train_steps_per_second': 5.286, 'train_loss': 0.22084757165457647, 'epoch': 3.0}


TrainOutput(global_step=3678, training_loss=0.22084757165457647, metrics={'train_runtime': 695.7751, 'train_samples_per_second': 42.272, 'train_steps_per_second': 5.286, 'train_loss': 0.22084757165457647, 'epoch': 3.0})
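The figures in the training log can be reproduced with a little arithmetic. The sketch below assumes an initial learning rate of 5e-5 that decays linearly to zero with no warmup. That schedule is inferred from the logged values, since no hyperparameters were set explicitly in this notebook.

```python
import math

NUM_EXAMPLES = 9804  # "Num examples" in the training log
BATCH_SIZE = 8       # "Instantaneous batch size per device"
NUM_EPOCHS = 3       # "Num Epochs"
INITIAL_LR = 5e-5    # assumed initial learning rate (inferred, see above)

# The last batch of each epoch is smaller, so the step count is rounded up.
steps_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS


def linear_decay_lr(step, total=total_steps, initial_lr=INITIAL_LR):
    """Learning rate after `step` optimizer steps under linear decay to zero."""
    return initial_lr * (total - step) / total


print(total_steps)           # compare with "Total optimization steps"
print(linear_decay_lr(500))  # compare with the learning_rate logged at checkpoint-500
```

Under these assumptions, the computed step count matches the 3678 optimization steps in the log, and the learning rate at step 500 agrees with the logged value of about 4.32e-05.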

In [47]:
trainer.save_model()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Saving model checkpoint to models/distilbert-base-uncased-mod2
Configuration saved in models/distilbert-base-uncased-mod2\config.json
Model weights saved in models/distilbert-base-uncased-mod2\pytorch_model.bin
tokenizer config file saved in models/distilbert-base-uncased-mod2\tokenizer_config.json
Special tokens file saved in models/distilbert-base-uncased-mod2\special_tokens_map.json


In [46]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: main_text, sources, explanation, fact_checkers, claim, claim_id, subjects, date_published.
***** Running Evaluation *****
  Num examples = 1214
  Batch size = 8
Trainer is attempting to log a value of "{'f1': 0.6927512355848435}" of type <class 'dict'> for key "eval_micro" as a metric. MLflow's log_metric() only accepts float and int types so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.5484052734706103}" of type <class 'dict'> for key "eval_macro" as a metric. MLflow's log_metric() only accepts float and int types so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.6936196071712313}" of type <class 'dict'> for key "eval_weighted" as a metric. MLflow's log_metric() only accepts float and int types so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': ar

{'eval_loss': 1.8061708211898804,
 'eval_micro': {'f1': 0.6927512355848435},
 'eval_macro': {'f1': 0.5484052734706103},
 'eval_weighted': {'f1': 0.6936196071712313},
 'eval_individuals': {'f1': array([0.63729809, 0.34269663, 0.84      , 0.37362637])},
 'eval_runtime': 13.4377,
 'eval_samples_per_second': 90.343,
 'eval_steps_per_second': 11.311,
 'epoch': 3.0}
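The reported scores can be cross-checked against each other. The sketch below uses only the numbers printed above. It checks that the macro F1 is the unweighted mean of the four per-class scores and that the micro F1, which equals accuracy in a single-label multi-class setting, corresponds to a whole number of correct predictions.

```python
# Values copied from the evaluation output above.
per_class_f1 = [0.63729809, 0.34269663, 0.84, 0.37362637]  # 'eval_individuals'
reported_macro = 0.5484052734706103                        # 'eval_macro'
reported_micro = 0.6927512355848435                        # 'eval_micro'
num_eval_examples = 1214                                   # "Num examples"

# Macro F1: every class counts equally, regardless of its size.
macro_from_classes = sum(per_class_f1) / len(per_class_f1)

# Micro F1 equals accuracy here, so it implies a count of correct predictions.
implied_correct = round(reported_micro * num_eval_examples)

print(macro_from_classes)
print(implied_correct, "of", num_eval_examples)
```

The gap between the micro score (about 0.69) and the macro score (about 0.55) comes from the two classes with per-class F1 below 0.4, which pull the unweighted mean down.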