# Text Classification using BERT on the Health Fact Dataset

### Project Outline

**Task 1**: Introduction (this section)

**Task 2**: Exploratory Data Analysis and Preprocessing

**Task 3**: Training/Validation Split

**Task 4**: Loading Tokenizer and Encoding our Data

**Task 5**: Setting up BERT Pretrained Model

**Task 6**: Creating Data Loaders

**Task 7**: Setting Up Optimizer and Scheduler

**Task 8**: Defining our Performance Metrics

**Task 9**: Creating our Training Loop

**Task 10**: Loading and Evaluating our Model

## Task 1: Introduction

### What is BERT

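BERT (Bidirectional Encoder Representations from Transformers) is a large-scale, transformer-based language model that can be fine-tuned for a variety of tasks, including the text classification task in this notebook.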

For more information, the original paper can be found [here](https://arxiv.org/abs/1810.04805). 

[HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html)

[Bert documentation](https://characters.fandom.com/wiki/Bert_(Sesame_Street))

## Task 2: Exploratory Data Analysis and Preprocessing

We will use the Health Fact (PUBHEALTH) dataset, loaded with the Hugging Face datasets library. Each instance pairs a public health claim with a veracity label (true, false, unproven, mixture) and an explanation text that justifies the label.

In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-1.7.0-py3-none-any.whl (234 kB)
Collecting huggingface-hub<0.1.0
  Downloading huggingface_hub-0.0.9-py3-none-any.whl (37 kB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
Collecting tqdm<4.50.0,>=4.27
  Downloading tqdm-4.49.0-py2.py3-none-any.whl (69 kB)
Installing collected packages: tqdm, xxhash, huggingface-hub, datasets
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.56.2
    Uninstalling tqdm-4.56.2:
      Successfully uninstalled tqdm-4.56.2
Successfully installed datasets-1.7.0 huggingface-hub-0.0.9 tqdm-4.49.0 xxhash-2.0.2


In [3]:
import torch
import pandas as pd
from tqdm.notebook import tqdm
from datasets import list_datasets
datasets_list = list_datasets()
len(datasets_list)

935

In [4]:
print(', \n'.join(dataset for dataset in datasets_list))

acronym_identification, 
ade_corpus_v2, 
adversarial_qa, 
aeslc, 
afrikaans_ner_corpus, 
ag_news, 
ai2_arc, 
air_dialogue, 
ajgt_twitter_ar, 
allegro_reviews, 
allocine, 
alt, 
amazon_polarity, 
amazon_reviews_multi, 
amazon_us_reviews, 
ambig_qa, 
amttl, 
anli, 
app_reviews, 
aqua_rat, 
aquamuse, 
ar_cov19, 
ar_res_reviews, 
ar_sarcasm, 
arabic_billion_words, 
arabic_pos_dialect, 
arabic_speech_corpus, 
arcd, 
arsentd_lev, 
art, 
arxiv_dataset, 
ascent_kb, 
aslg_pc12, 
asnq, 
asset, 
assin, 
assin2, 
atomic, 
autshumato, 
babi_qa, 
banking77, 
bbaw_egyptian, 
bbc_hindi_nli, 
bc2gm_corpus, 
best2009, 
bianet, 
bible_para, 
big_patent, 
billsum, 
bing_coronavirus_query_set, 
biomrc, 
blended_skill_talk, 
blimp, 
blog_authorship_corpus, 
bn_hate_speech, 
bookcorpus, 
bookcorpusopen, 
boolq, 
bprec, 
break_data, 
brwac, 
bsd_ja_en, 
bswac, 
c3, 
c4, 
cail2018, 
caner, 
capes, 
catalonia_independence, 
cawac, 
cbt, 
cc100, 
cc_news, 
ccaligned_multilingual, 
cdsc, 
cdt, 
cfq, 
chr_en, 
cif

In [5]:
from datasets import load_dataset
data = load_dataset('health_fact', split='train')

Downloading and preparing dataset health_fact/default (download: 23.74 MiB, generated: 64.34 MiB, post-processed: Unknown size, total: 88.08 MiB) to /root/.cache/huggingface/datasets/health_fact/default/1.1.0/99503637e4255bd805f84d57031c18fe4dd88298f00299d56c94fc59ed68ec19...


Dataset health_fact downloaded and prepared to /root/.cache/huggingface/datasets/health_fact/default/1.1.0/99503637e4255bd805f84d57031c18fe4dd88298f00299d56c94fc59ed68ec19. Subsequent calls will reuse this data.


In [6]:
print(data.description)

PUBHEALTH is a comprehensive dataset for explainable automated fact-checking of
public health claims. Each instance in the PUBHEALTH dataset has an associated
veracity label (true, false, unproven, mixture). Furthermore each instance in the
dataset has an explanation text field. The explanation is a justification for which
the claim has been assigned a particular veracity label.

The dataset was created to explore fact-checking of difficult to verify claims i.e.,
those which require expertise from outside of the journalistics domain, in this case
biomedical and public health expertise.

It was also created in response to the lack of fact-checking datasets which provide
gold standard natural language explanations for verdicts/labels.

NOTE: There are missing labels in the dataset and we have replaced them with -1.



In [7]:
print("Columns {}".format(data.num_columns))
print("Rows {}".format(data.num_rows))
print("Columns names {}".format(data.column_names))

Columns 9
Rows 9832
Columns names ['claim', 'claim_id', 'date_published', 'explanation', 'fact_checkers', 'label', 'main_text', 'sources', 'subjects']


In [8]:
data.features

{'claim_id': Value(dtype='string', id=None),
 'claim': Value(dtype='string', id=None),
 'date_published': Value(dtype='string', id=None),
 'explanation': Value(dtype='string', id=None),
 'fact_checkers': Value(dtype='string', id=None),
 'main_text': Value(dtype='string', id=None),
 'sources': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=4, names=['false', 'mixture', 'true', 'unproven'], names_file=None, id=None),
 'subjects': Value(dtype='string', id=None)}
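
The `ClassLabel` feature above stores labels as integers, and the dataset description notes that missing labels are encoded as -1. A minimal sketch of a lookup between the two representations follows; the names `LABEL_NAMES`, `id2label`, `label2id`, and `decode_label` are hypothetical helpers and not part of the original notebook.

```python
# Label names copied from the ClassLabel feature printed above.
LABEL_NAMES = ['false', 'mixture', 'true', 'unproven']

# Hypothetical helper dictionaries (not part of the original notebook).
id2label = dict(enumerate(LABEL_NAMES))
label2id = {name: idx for idx, name in id2label.items()}


def decode_label(label_id):
    """Return the label name, or 'missing' for ids outside 0..3 such as -1."""
    return id2label.get(label_id, 'missing')


print(decode_label(0))   # false
print(decode_label(3))   # unproven
print(decode_label(-1))  # missing
```

A mapping like this makes the integer labels in the tables that follow easier to read.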

In [9]:
data = data.to_pandas()

In [10]:
data['claim_and_explanation'] = data['claim'].map(str) + ' ' + data['explanation'].map(str)
data

| | claim | claim_id | date_published | explanation | fact_checkers | label | main_text | sources | subjects | claim_and_explanation |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | "The money the Clinton Foundation took from fr... | 15661 | April 26, 2015 | "Gingrich said the Clinton Foundation ""took m... | Katie Sanders | 0 | "Hillary Clinton is in the political crosshair... | https://www.wsj.com/articles/clinton-foundatio... | Foreign Policy, PunditFact, Newt Gingrich, | "The money the Clinton Foundation took from fr... |
| 1 | Annual Mammograms May Have More False-Positives | 9893 | October 18, 2011 | This article reports on the results of a study... | | 1 | While the financial costs of screening mammogr... | | Screening,WebMD,women's health | Annual Mammograms May Have More False-Positive... |
| 2 | SBRT Offers Prostate Cancer Patients High Canc... | 11358 | September 28, 2016 | This news release describes five-year outcomes... | Mary Chris Jaklevic,Steven J. Atlas, MD, MPH,K... | 1 | The news release quotes lead researcher Robert... | https://www.healthnewsreview.org/wp-content/up... | Association/Society news release,Cancer | SBRT Offers Prostate Cancer Patients High Canc... |
| 3 | Study: Vaccine for Breast, Ovarian Cancer Has ... | 10166 | November 8, 2011 | While the story does many things well, the ove... | | 2 | The story does discuss costs, but the framing ... | http://clinicaltrials.gov/ct2/results?term=can... | Cancer,WebMD,women's health | Study: Vaccine for Breast, Ovarian Cancer Has ... |
| 4 | Some appendicitis cases may not require ’emerg... | 11276 | September 20, 2010 | We really don’t understand why only a handful ... | | 2 | "Although the story didn’t cite the cost of ap... | | | Some appendicitis cases may not require ’emerg... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9827 | The Sturgis motorcycle rally in 2020 resulted ... | 35948 | September 10, 2020 | They want to know if mass-events (protests, co... | Dan Evon | 3 | In September 2020, social media was abuzz over... | | Politics Medical, COVID-19 | The Sturgis motorcycle rally in 2020 resulted ... |
| 9828 | AstraZeneca's infant respiratory drug prioriti... | 401 | September 25, 1995 | Britain’s AstraZeneca said a potential medicin... | | 2 | The “Breakthrough Therapy” and “Prime” designa... | | Health News | AstraZeneca's infant respiratory drug prioriti... |
| 9829 | Testicular cancer deaths double with after 40 ... | 2023 | February 10, 2011 | Men diagnosed with testicular cancer at 40 yea... | | 2 | This was true even when initial treatment and ... | http://bit.ly/fGNEw9 | Health News | Testicular cancer deaths double with after 40 ... |
| 9830 | The FDA published “conclusive proof” that the... | 38118 | November 22, 2017 | FDA Confirms DTaP Vaccine Causes Autism in Nov... | Rich Buhler & Staff | 0 | The FDA hasn’t confirmed a link between DTaP v... | https://www.truthorfiction.com/marshall-kamena... | Medical | The FDA published “conclusive proof” that the... |


In [11]:
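data = data[['claim_and_explanation','label']]
# Drop rows with missing labels (encoded as -1); copy so that later
# column assignments modify an independent DataFrame rather than a slice
data = data[data['label']!=-1].copy()
data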

| | claim_and_explanation | label |
|---|---|---|
| 0 | "The money the Clinton Foundation took from fr... | 0 |
| 1 | Annual Mammograms May Have More False-Positive... | 1 |
| 2 | SBRT Offers Prostate Cancer Patients High Canc... | 1 |
| 3 | Study: Vaccine for Breast, Ovarian Cancer Has ... | 2 |
| 4 | Some appendicitis cases may not require ’emerg... | 2 |
| ... | ... | ... |
| 9827 | The Sturgis motorcycle rally in 2020 resulted ... | 3 |
| 9828 | AstraZeneca's infant respiratory drug prioriti... | 2 |
| 9829 | Testicular cancer deaths double with after 40 ... | 2 |
| 9830 | The FDA published “conclusive proof” that the... | 0 |


## Task 3: Training/Validation Split

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
X_train, X_val, y_train, y_val = train_test_split(
    data.index.values,
    data.label.values,
    test_size=0.2,
    random_state=42,
    stratify=data.label.values
)
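
Passing `stratify=data.label.values` asks `train_test_split` to keep the class proportions roughly equal in both splits. The following is a toy, pure-Python sketch of that idea and not scikit-learn's actual implementation; `stratified_split` and the made-up `toy_labels` are assumptions introduced only for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Toy stratified split: returns (train_idx, val_idx) index lists."""
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for validation.
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Made-up imbalanced labels: 50 of class 0, 30 of class 1, 20 of class 2.
toy_labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, val_idx = stratified_split(toy_labels)
val_counts = [sum(toy_labels[i] == c for i in val_idx) for c in (0, 1, 2)]
print(len(train_idx), len(val_idx), val_counts)
```

In this toy run each class contributes 20% of its examples to the validation split, so a rare class is not left out of validation by chance.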

In [14]:
data['data_type'] = ['not_set']*data.shape[0]

In [15]:
data.loc[X_train,'data_type'] = 'train'
data.loc[X_val,'data_type'] = 'val'

In [16]:
data

| | claim_and_explanation | label | data_type |
|---|---|---|---|
| 0 | "The money the Clinton Foundation took from fr... | 0 | train |
| 1 | Annual Mammograms May Have More False-Positive... | 1 | val |
| 2 | SBRT Offers Prostate Cancer Patients High Canc... | 1 | train |
| 3 | Study: Vaccine for Breast, Ovarian Cancer Has ... | 2 | train |
| 4 | Some appendicitis cases may not require ’emerg... | 2 | train |
| ... | ... | ... | ... |
| 9827 | The Sturgis motorcycle rally in 2020 resulted ... | 3 | train |
| 9828 | AstraZeneca's infant respiratory drug prioriti... | 2 | train |
| 9829 | Testicular cancer deaths double with after 40 ... | 2 | train |
| 9830 | The FDA published “conclusive proof” that the... | 0 | train |


In [17]:
data['label'].unique()

array([0, 1, 2, 3])

## Task 4: Loading Tokenizer and Encoding our Data

In [18]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [19]:
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased",
    do_lower_case=True
)


In [20]:
# Tokenize the training texts. padding=True pads every sequence to the longest
# one in the call; truncation=True cuts inputs at the model's maximum length.
encoded_data_train = tokenizer.batch_encode_plus(
    data[data.data_type=='train'].claim_and_explanation.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding=True,
    truncation=True,
    return_tensors='pt'
)

# Tokenize the validation texts with the same settings
encoded_data_val = tokenizer.batch_encode_plus(
    data[data.data_type=='val'].claim_and_explanation.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding=True,
    truncation=True,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(data[data.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(data[data.data_type=='val'].label.values)

In [21]:
dataset_train = TensorDataset(input_ids_train,attention_masks_train,labels_train)
dataset_val = TensorDataset(input_ids_val,attention_masks_val,labels_val)
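
To make the effect of `padding=True`, `truncation=True`, and `return_attention_mask=True` concrete, here is a simplified pure-Python sketch. It is not the tokenizer's real implementation (the real truncation also preserves the trailing special token, which this sketch ignores); `pad_and_truncate` and the token ids in `toy_batch` are illustrative assumptions.

```python
PAD_ID = 0  # assumed [PAD] id; the model's word embedding uses padding_idx=0


def pad_and_truncate(sequences, max_length=512):
    """Truncate each id list to max_length, then pad to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)

    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Arbitrary token ids, chosen only for illustration.
toy_batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
ids, mask = pad_and_truncate(toy_batch, max_length=4)
print(ids)   # [[101, 7592, 102, 0], [101, 7592, 2088, 999]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

The resulting rectangular lists are what allows the encoded inputs to be stacked into tensors and wrapped in a `TensorDataset`.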

## Task 5: Setting up BERT Pretrained Model

In [22]:
from transformers import BertForSequenceClassification

In [23]:
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=4,  # false, mixture, true, unproven
    output_attentions=False,
    output_hidden_states=False
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Task 6: Creating Data Loaders

In [24]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [25]:

dataloader_train = DataLoader(
    dataset_train,
    sampler = RandomSampler(dataset_train),
    batch_size = 16
)

dataloader_val = DataLoader(
    dataset_val,
    sampler = SequentialSampler(dataset_val),
    batch_size = 16
)
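
Each data loader groups its dataset into batches of 16, so `len(dataloader)` equals the dataset size divided by the batch size, rounded up. A minimal pure-Python sketch of that grouping is shown below; `iter_batches` and `toy_dataset` are hypothetical stand-ins and do not reproduce `DataLoader` features such as sampling or tensor collation.

```python
import math


def iter_batches(items, batch_size=16):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


toy_dataset = list(range(50))  # stand-in for 50 encoded examples
batches = list(iter_batches(toy_dataset, batch_size=16))

print(len(batches))               # 4
print([len(b) for b in batches])  # [16, 16, 16, 2]
print(math.ceil(len(toy_dataset) / 16) == len(batches))  # True
```

The last batch is smaller whenever the dataset size is not a multiple of the batch size.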

## Task 7: Setting Up Optimizer and Scheduler

In [26]:
from transformers import AdamW, get_linear_schedule_with_warmup

In [27]:
optimizer = AdamW(
    model.parameters(),
    lr=1e-5,
    eps=1e-8,
)

In [28]:
epochs = 5
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(dataloader_train)*epochs
)
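
With `num_warmup_steps=0`, the linear schedule starts at the full learning rate and decays it linearly to zero over `num_training_steps`. The sketch below reproduces that multiplier rule as I understand it from the library's documentation; `linear_schedule_multiplier` is a hypothetical helper, and the step count of 500 is an assumed figure rather than a value from this notebook.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
total_steps = 100 * 5  # assumed: 100 batches per epoch for 5 epochs

for step in (0, 250, 500):
    factor = linear_schedule_multiplier(step, 0, total_steps)
    print(step, factor, base_lr * factor)
```

Because `scheduler.step()` is called once per batch, the total step count must be the number of batches per epoch times the number of epochs, which is what `len(dataloader_train)*epochs` computes.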

## Task 8: Defining our Performance Metrics

We evaluate with the weighted F1 score. The helper below flattens predictions and labels in the same way as the accuracy function in [this tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#41-bertforsequenceclassification).

In [29]:
import numpy as np

In [30]:
from sklearn.metrics import f1_score

In [31]:
def f1_score_func(preds, labels):
    
    preds_flat = np.argmax(preds,axis=1).flatten()
    labels_flat = labels.flatten()
    
    return f1_score(labels_flat,preds_flat,average='weighted')
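
With `average='weighted'`, the F1 score is computed per class and then averaged using each class's share of the true labels as its weight. The pure-Python sketch below illustrates that calculation on made-up labels; `weighted_f1` is a hypothetical helper written for explanation and is not scikit-learn's implementation.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        n_pred = sum(p == cls for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


# Made-up labels: one class-0 example is misclassified as class 1.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.8333
```

Weighting by support means frequent classes dominate the score, which is worth keeping in mind when the label distribution is imbalanced.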

## Task 9: Creating our Training Loop

In [32]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
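
Fixing the seeds makes random operations, such as the shuffling done by a random sampler, repeatable from run to run. The toy sketch below shows the principle with Python's standard `random` module only; `sample_order` is a hypothetical helper, and the sketch says nothing about GPU-level nondeterminism.

```python
import random


def sample_order(seed, n=8):
    """Return a shuffled ordering of range(n) from a generator seeded with seed."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    return order


first_run = sample_order(17)
second_run = sample_order(17)

print(first_run == second_run)  # True: same seed, same shuffle
```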

In [33]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model.to(device)

print(device)

cuda


In [34]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals


In [37]:
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch),
                        leave=False,
                        disable=False)
    
    for batch in progress_bar:
        
        # Reset gradients left over from the previous step
        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {
            'input_ids' : batch[0],
            'attention_mask' : batch[1],
            'labels' : batch[2]
        }
        
        outputs = model(**inputs)
        
        # With labels supplied, the first output is the mean loss over the batch
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()
        
        # Clip the global gradient norm to 1.0 to limit the size of each update
        torch.nn.utils.clip_grad_norm_(model.parameters(),1.0)
        
        optimizer.step()
        scheduler.step()
        
        # The loss is already a batch mean, so report it directly
        progress_bar.set_postfix({'training_loss' : '{:.3f}'.format(loss.item())})
        
    # Save a checkpoint after every epoch
    torch.save(model.state_dict(), f'BERT_ft_epoch{epoch}.model')
    
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train) 
    
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score: {val_f1}')

  0%|          | 0/5 [00:00<?, ?it/s]

Epoch 1:   0%|          | 0/491 [00:00<?, ?it/s]

Epoch 1
Training loss: 0.6218421251613594

  0%|          | 0/123 [00:00<?, ?it/s]

Validation loss: 0.6117300202206868
F1 Score: 0.73277405705906

Epoch 2:   0%|          | 0/491 [00:00<?, ?it/s]

Epoch 2
Training loss: 0.509000963233639

  0%|          | 0/123 [00:00<?, ?it/s]

Validation loss: 0.6428659796957078
F1 Score: 0.7272646451188487

Epoch 3:   0%|          | 0/491 [00:00<?, ?it/s]

Epoch 3
Training loss: 0.42261358678280214

  0%|          | 0/123 [00:00<?, ?it/s]

Validation loss: 0.6796860910528074
F1 Score: 0.7399883169018384

Epoch 4:   0%|          | 0/491 [00:00<?, ?it/s]

Epoch 4
Training loss: 0.36687057318124167

  0%|          | 0/123 [00:00<?, ?it/s]

Validation loss: 0.6981886942091027
F1 Score: 0.7334435536257927

Epoch 5:   0%|          | 0/491 [00:00<?, ?it/s]

Epoch 5
Training loss: 0.342659136435954


  0%|          | 0/123 [00:00<?, ?it/s]

Validation loss: 0.698319235952889
F1 Score: 0.7334435536257927
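
The training loop calls `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before each optimizer step. As a rough illustration of what clipping by global norm does, here is a pure-Python sketch over a flat list of gradient values; `clip_by_global_norm` is a hypothetical helper, and the small `1e-6` stabilizer reflects my understanding of the PyTorch behavior rather than something stated in this notebook.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together if their joint L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale factor is capped at 1 so small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

print(norm_before)           # 5.0
print(round(norm_after, 3))  # 1.0
```

Every gradient is scaled by the same factor, so clipping changes the length of the update but not its direction.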


## Task 10: Loading and Evaluating our Model

In [39]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=4,
                                                      output_attentions=False,
                                                      output_hidden_states=False)


In [40]:
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [41]:
# Load the fine-tuned weights saved after epoch 5 onto the active device
model.load_state_dict(
    torch.load('BERT_ft_epoch5.model', map_location=device)
)

<All keys matched successfully>

In [42]:
_, predictions, true_vals = evaluate(dataloader_val)

  0%|          | 0/123 [00:00<?, ?it/s]

In [43]:
f1_score_func(predictions, true_vals)

0.7334435536257927
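
The notebook reloads the epoch 5 checkpoint, but a checkpoint was saved after every epoch, so a different one could be chosen based on the validation metrics. The sketch below, offered as one possible selection rule and not as the original author's method, picks an epoch from the figures in the training log; `val_metrics`, `best_f1_epoch`, and `best_loss_epoch` are names introduced here for illustration.

```python
# (validation loss, weighted F1) per epoch, copied from the training log.
val_metrics = {
    1: (0.6117300202206868, 0.73277405705906),
    2: (0.6428659796957078, 0.7272646451188487),
    3: (0.6796860910528074, 0.7399883169018384),
    4: (0.6981886942091027, 0.7334435536257927),
    5: (0.698319235952889, 0.7334435536257927),
}

# Two common selection rules: highest F1 or lowest validation loss.
best_f1_epoch = max(val_metrics, key=lambda e: val_metrics[e][1])
best_loss_epoch = min(val_metrics, key=lambda e: val_metrics[e][0])

# File name follows the pattern used by torch.save in the training loop.
checkpoint_path = f'BERT_ft_epoch{best_f1_epoch}.model'

print(best_f1_epoch, best_loss_epoch, checkpoint_path)
```

In this log the validation loss rises after epoch 1 while the training loss keeps falling, which may indicate overfitting. Note that a score chosen this way is measured on the same validation data used for selection, so it is likely to be somewhat optimistic.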