# Sentiment Analysis with Deep Learning using BERT

### Prerequisites

- Intermediate-level knowledge of Python 3 (NumPy and Pandas preferably, but not required)
- Exposure to PyTorch usage
- Basic understanding of Deep Learning and Language Models (BERT specifically)

### Project Outline

**Task 1**: Introduction (this section)

**Task 2**: Exploratory Data Analysis and Preprocessing

**Task 3**: Training/Validation Split

**Task 4**: Loading Tokenizer and Encoding our Data

**Task 5**: Setting up BERT Pretrained Model

**Task 6**: Creating Data Loaders

**Task 7**: Setting Up Optimizer and Scheduler

**Task 8**: Defining our Performance Metrics

**Task 9**: Creating our Training Loop

## Introduction

### What is BERT

BERT is a large-scale transformer-based Language Model that can be finetuned for a variety of tasks.

For more information, the original paper can be found [here](https://arxiv.org/abs/1810.04805). 

[HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html)

[Bert documentation](https://characters.fandom.com/wiki/Bert_(Sesame_Street)) ;)

<img src="Images/BERT_diagrams.pdf" width="1000">

## Exploratory Data Analysis and Preprocessing

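This walkthrough was written for the SMILE Twitter dataset:

_Wang, Bo; Tsakalidis, Adam; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob; Jensen, Eric (2016): SMILE Twitter Emotion dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3187909.v2_

The code cells below, however, load `train.csv` and `test.csv` from the `janatahack-independence-day-2020-ml-hackathon` Kaggle input directory. Each row holds an article `TITLE` and `ABSTRACT` plus six binary topic columns, and the rest of the notebook trains a binary classifier on the `Physics` column.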

In [1]:
import numpy as np
import pandas as pd
import torch
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split

In [2]:
df_train=pd.read_csv("/kaggle/input/janatahack-independence-day-2020-ml-hackathon/train.csv")
df_test=pd.read_csv("/kaggle/input/janatahack-independence-day-2020-ml-hackathon/test.csv")
df_train.set_index('ID', inplace=True)

In [3]:
df_train.head()

| ID | TITLE | ABSTRACT | Computer Science | Physics | Mathematics | Statistics | Quantitative Biology | Quantitative Finance |
|---|---|---|---|---|---|---|---|---|
| 1 | Reconstructing Subject-Specific Effect Maps | Predictive models allow subject-specific inf... | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | Rotation Invariance Neural Network | Rotation invariance and translation invarian... | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | Spherical polyharmonics and Poisson kernels fo... | We introduce and develop the notion of spher... | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | A finite element approximation for the stochas... | The stochastic Landau--Lifshitz--Gilbert (LL... | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | Comparative study of Discrete Wavelet Transfor... | Fourier-transform infra-red (FTIR) spectra o... | 1 | 0 | 0 | 1 | 0 | 0 |


In [4]:
df_train["text"]=df_train["TITLE"]+df_train["ABSTRACT"]
df_test["text"]=df_test["TITLE"]+df_test["ABSTRACT"]
del df_train["TITLE"]
del df_train["ABSTRACT"]
#del df_train["ID"]
#main_test_ids=df_test["ID"]
main_test_title=df_test["TITLE"]
main_test_abstract=df_test["ABSTRACT"]
del df_test["TITLE"]
del df_test["ABSTRACT"]
#del df_test["ID"]

df_train.head()

| ID | Computer Science | Physics | Mathematics | Statistics | Quantitative Biology | Quantitative Finance | text |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | Reconstructing Subject-Specific Effect Maps P... |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | Rotation Invariance Neural Network Rotation i... |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | Spherical polyharmonics and Poisson kernels fo... |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | A finite element approximation for the stochas... |
| 5 | 1 | 0 | 0 | 1 | 0 | 0 | Comparative study of Discrete Wavelet Transfor... |


In [5]:
df_train["text"][1]
df_train_classes=df_train.drop("text",axis=1)
df_train_classes.head()

| ID | Computer Science | Physics | Mathematics | Statistics | Quantitative Biology | Quantitative Finance |
|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 1 | 0 | 0 |


In [6]:
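# rename() returns a new DataFrame; df_train itself is unchanged, so the `Physics` column is still available below
df_train.rename(columns={"Physics": "label"})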

| ID | Computer Science | label | Mathematics | Statistics | Quantitative Biology | Quantitative Finance | text |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | Reconstructing Subject-Specific Effect Maps P... |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | Rotation Invariance Neural Network Rotation i... |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | Spherical polyharmonics and Poisson kernels fo... |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | A finite element approximation for the stochas... |
| 5 | 1 | 0 | 0 | 1 | 0 | 0 | Comparative study of Discrete Wavelet Transfor... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 20968 | 1 | 1 | 0 | 0 | 0 | 0 | Contemporary machine learning: a guide for pra... |
| 20969 | 0 | 1 | 0 | 0 | 0 | 0 | Uniform diamond coatings on WC-Co hard alloy c... |
| 20970 | 1 | 0 | 0 | 0 | 0 | 0 | Analysing Soccer Games with Clustering and Con... |
| 20971 | 0 | 0 | 1 | 1 | 0 | 0 | On the Efficient Simulation of the Left-Tail o... |


In [7]:
possible_labels = df_train.Physics.unique()

In [8]:
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [9]:
df_train['label'] = df_train.Physics.replace(label_dict)

In [10]:
df_train.head()

| ID | Computer Science | Physics | Mathematics | Statistics | Quantitative Biology | Quantitative Finance | text | label |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | Reconstructing Subject-Specific Effect Maps P... | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | Rotation Invariance Neural Network Rotation i... | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | Spherical polyharmonics and Poisson kernels fo... | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | A finite element approximation for the stochas... | 0 |
| 5 | 1 | 0 | 0 | 1 | 0 | 0 | Comparative study of Discrete Wavelet Transfor... | 0 |


## Training/Validation Split

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_val, y_train, y_val = train_test_split(df_train.index.values, 
                                                  df_train.label.values, 
                                                  test_size=0.15, 
                                                  random_state=17, 
                                                  stratify=df_train.label.values)

In [13]:
df_train['data_type'] = ['not_set']*df_train.shape[0]

In [14]:
df_train.loc[X_train, 'data_type'] = 'train'
df_train.loc[X_val, 'data_type'] = 'val'
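
The `stratify` argument keeps the class balance roughly the same in the training and validation parts. The idea can be sketched in plain Python as below. This is an illustration only, not scikit-learn's implementation: the helper `stratified_split` and the toy data are hypothetical, and the per-class rounding is an assumption made for simplicity.

```python
import random
from collections import defaultdict


def stratified_split(ids, labels, val_fraction=0.15, seed=17):
    """Toy stratified split: shuffle within each class, then cut each class separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in zip(ids, labels):
        by_label[label].append(idx)

    train_ids, val_ids = [], []
    for label in sorted(by_label):
        members = by_label[label][:]
        rng.shuffle(members)
        # Take the same fraction from every class so proportions are preserved.
        n_val = round(len(members) * val_fraction)
        val_ids.extend(members[:n_val])
        train_ids.extend(members[n_val:])
    return train_ids, val_ids


# Hypothetical toy data: 200 rows, 140 of class 0 and 60 of class 1 (30% positives).
ids = list(range(1, 201))
labels = [0] * 140 + [1] * 60
label_of = dict(zip(ids, labels))

train_ids, val_ids = stratified_split(ids, labels)
val_share_class1 = sum(label_of[i] for i in val_ids) / len(val_ids)
train_share_class1 = sum(label_of[i] for i in train_ids) / len(train_ids)
print(len(train_ids), len(val_ids), val_share_class1, train_share_class1)
```

Both parts end up with the same share of class 1 as the full toy set, which is the property the stratified split is meant to provide.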

## Loading Tokenizer and Encoding our Data

In [15]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset



In [16]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)

In [17]:
encoded_data_train = tokenizer.batch_encode_plus(
    df_train[df_train.data_type=='train'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    df_train[df_train.data_type=='val'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)


input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df_train[df_train.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df_train[df_train.data_type=='val'].label.values)
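
To make the roles of `input_ids` and `attention_mask` concrete, here is a minimal sketch of fixed-length padding in plain Python. It is a toy stand-in, not the tokenizer's actual code: the helper `pad_or_truncate` and the token ids are hypothetical, and unlike a real BERT tokenizer this version simply cuts off the tail when truncating, so it does not keep the final `[SEP]` marker.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # drop anything beyond max_length
    n_pad = max_length - len(ids)         # number of padding positions needed
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask


# Hypothetical token ids; 101 and 102 stand in for the [CLS] and [SEP] markers.
short = [101, 7592, 2088, 102]
long = [101] + list(range(1000, 1010)) + [102]

ids_short, mask_short = pad_or_truncate(short, max_length=8)
ids_long, mask_long = pad_or_truncate(long, max_length=8)
print(ids_short, mask_short)
print(ids_long, mask_long)
```

Every row has the same length, which is what allows the encoded examples to be stacked into one tensor. The mask tells the model which positions to ignore.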

In [18]:
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

In [19]:
len(dataset_train)

17826

In [20]:
len(dataset_val)

3146

## Setting up BERT Pretrained Model

In [21]:
from transformers import BertForSequenceClassification

In [22]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)




## Creating Data Loaders

In [23]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [24]:
batch_size = 32

dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train), 
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val, 
                                   sampler=SequentialSampler(dataset_val), 
                                   batch_size=batch_size)

## Setting Up Optimizer and Scheduler

In [25]:
from transformers import AdamW, get_linear_schedule_with_warmup

In [26]:
optimizer = AdamW(model.parameters(),
                  lr=1e-5, 
                  eps=1e-8)

In [27]:
epochs = 3

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
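
A minimal sketch of what a linear schedule with warmup does to the learning rate, assuming the usual definition: a linear ramp from 0 during warmup, then a linear decay to 0 at the last step. The helper `linear_schedule_multiplier` is hypothetical and written for illustration. The example count (17826), batch size, epoch count and base learning rate are taken from this notebook.

```python
import math


def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


train_examples = 17826   # len(dataset_train) in this notebook
batch_size = 32
epochs = 3
base_lr = 1e-5

steps_per_epoch = math.ceil(train_examples / batch_size)  # batches per epoch
total_steps = steps_per_epoch * epochs

for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_schedule_multiplier(step, 0, total_steps)
    print(step, lr)
```

With `num_warmup_steps=0` there is no ramp-up: the learning rate starts at its full value and falls linearly to zero over the whole run.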

## Defining our Performance Metrics

The per-class accuracy approach is based on the accuracy function in [this tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#41-bertforsequenceclassification).

In [28]:
import numpy as np

In [29]:
from sklearn.metrics import f1_score

In [30]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

In [31]:
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')

## Creating our Training Loop

The training loop is adapted from an older version of HuggingFace's `run_glue.py` script, available [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128).

In [32]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [33]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


In [34]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [35]:
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
        # The loss is already averaged over the batch; len(batch) is 3 (the tuple size), not the batch size.
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
         
        
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

Epoch 1
Training loss: 0.20950657587939053
Validation loss: 0.1779989796794123
F1 Score (Weighted): 0.9347624985736873


Epoch 2
Training loss: 0.13625500710367303
Validation loss: 0.17558387196575753
F1 Score (Weighted): 0.9387008994532907


Epoch 3
Training loss: 0.10732416884713275
Validation loss: 0.19434567115674115
F1 Score (Weighted): 0.9387905619288796
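
In the log above, training loss keeps falling while validation loss is lowest in epoch 2 and rises in epoch 3, which is a common sign of overfitting. One simple way to choose which saved checkpoint to reload is to take the epoch with the lowest validation loss. The sketch below does that with the logged values rounded to four decimals. The `history` list is assembled by hand for illustration, and selecting on validation loss rather than F1 is an assumption.

```python
# Metrics copied from the training log above, rounded to four decimals.
history = [
    {"epoch": 1, "train_loss": 0.2095, "val_loss": 0.1780, "val_f1": 0.9348},
    {"epoch": 2, "train_loss": 0.1363, "val_loss": 0.1756, "val_f1": 0.9387},
    {"epoch": 3, "train_loss": 0.1073, "val_loss": 0.1943, "val_f1": 0.9388},
]

# Pick the epoch whose validation loss is smallest.
best = min(history, key=lambda record: record["val_loss"])

# File name pattern used by torch.save(...) in the training loop.
checkpoint = f"finetuned_BERT_epoch_{best['epoch']}.model"
print(best["epoch"], checkpoint)
```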



In [36]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [38]:
model.load_state_dict(torch.load('./finetuned_BERT_epoch_2.model', map_location=torch.device('cpu')))

<All keys matched successfully>

In [39]:
_, predictions, true_vals = evaluate(dataloader_validation)

In [40]:
accuracy_per_class(predictions, true_vals)

Class: 0
Accuracy: 2176/2244

Class: 1
Accuracy: 779/902
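
Because there are only two classes, the per-class counts above determine the whole confusion matrix: every validation example of one class that was not predicted correctly must have been predicted as the other class. The sketch below rebuilds overall accuracy and the support-weighted F1 from those counts in plain Python. It is a hand-rolled check, not a replacement for `f1_score`, and the variable names are illustrative.

```python
# Correct predictions and class sizes from the accuracy_per_class output above.
correct = {0: 2176, 1: 779}
support = {0: 2244, 1: 902}

# False negatives per class: examples of that class predicted as the other class.
fn = {c: support[c] - correct[c] for c in support}
# In the binary case, false positives of one class are the false negatives of the other.
fp = {0: fn[1], 1: fn[0]}

# Per-class F1 = 2*TP / (2*TP + FP + FN).
f1 = {c: 2 * correct[c] / (2 * correct[c] + fp[c] + fn[c]) for c in support}

total = sum(support.values())
accuracy = sum(correct.values()) / total
# Weighted F1 averages the per-class F1 scores, weighted by class size.
weighted_f1 = sum(support[c] * f1[c] for c in support) / total

print(f"accuracy={accuracy:.4f} weighted_f1={weighted_f1:.4f}")
```

The weighted F1 comes out at about 0.9387, in line with the value logged for epoch 2, the checkpoint loaded above.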



In [None]:
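# BertForSequenceClassification has no predict() method; encode the test text and reuse evaluate() instead.
# evaluate() expects labels, so placeholder zeros are passed and the returned loss is ignored.
encoded_data_test = tokenizer.batch_encode_plus(df_test.text.values, add_special_tokens=True, return_attention_mask=True, pad_to_max_length=True, max_length=256, return_tensors='pt')
dataset_test = TensorDataset(encoded_data_test['input_ids'], encoded_data_test['attention_mask'], torch.zeros(len(df_test), dtype=torch.long))
_, test_logits, _ = evaluate(DataLoader(dataset_test, sampler=SequentialSampler(dataset_test), batch_size=batch_size))
Ypred = np.argmax(test_logits, axis=1)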
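
A sequence classification head returns one row of logits per example. The predicted class is the index of the largest logit, and a softmax turns the row into probabilities if a confidence value is wanted. A minimal sketch in plain Python follows. The logits are made-up numbers and the `softmax` helper is written here for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    highest = max(row)
    exps = [math.exp(x - highest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three examples and two classes (0 and 1).
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]]

# Predicted class = position of the largest logit in each row.
predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]
probabilities = [softmax(row) for row in logits]

print(predicted)
print([round(max(p), 3) for p in probabilities])
```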