<a href="https://colab.research.google.com/github/bahgat-ahmed/BERT/blob/main/Sentiment_Analysis_with_Deep_Learning_using_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with Deep Learning using BERT

### Prerequisites

- Intermediate-level knowledge of Python 3 (NumPy and Pandas preferably, but not required)
- Exposure to PyTorch usage
- Basic understanding of Deep Learning and Language Models (BERT specifically)

### Project Outline

**Task 1**: Introduction (this section)

**Task 2**: Exploratory Data Analysis and Preprocessing

**Task 3**: Training/Validation Split

**Task 4**: Loading Tokenizer and Encoding our Data

**Task 5**: Setting up BERT Pretrained Model

**Task 6**: Creating Data Loaders

**Task 7**: Setting Up Optimizer and Scheduler

**Task 8**: Defining our Performance Metrics

**Task 9**: Creating our Training Loop

**Task 10**: Loading and Evaluating our Model

## Task 1: Introduction

### What is BERT

BERT is a large-scale transformer-based Language Model that can be finetuned for a variety of tasks.

For more information, the original paper can be found [here](https://arxiv.org/abs/1810.04805). 

[HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html)

[Bert documentation](https://characters.fandom.com/wiki/Bert_(Sesame_Street) ;)

<img src="Images/BERT_diagrams.pdf" width="1000">

## Task 2: Exploratory Data Analysis and Preprocessing

We will use the SMILE Twitter dataset.

_Wang, Bo; Tsakalidis, Adam; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob; Jensen, Eric (2016): SMILE Twitter Emotion dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3187909.v2_

In [1]:
import torch
import pandas as pd
from tqdm.notebook import tqdm

In [2]:
df = pd.read_csv('Data/smile-annotations-final.csv',
                names=['id', 'text', 'category'])
df.set_index('id', inplace=True) # since we know the id will be unique

In [3]:
df.head()

Unnamed: 0_level_0,text,category
id,Unnamed: 1_level_1,Unnamed: 2_level_1
611857364396965889,@aandraous @britishmuseum @AndrewsAntonio Merc...,nocode
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy
614877582664835073,@Sofabsports thank you for following me back. ...,happy
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy


In [4]:
df.text.iloc[0]

'@aandraous @britishmuseum @AndrewsAntonio Merci pour le partage! @openwinemap'

In [5]:
df.category.value_counts()

nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|angry               2
sad|disgust             2
sad|disgust|angry       1
Name: category, dtype: int64

In [6]:
df = df[~df.category.str.contains('\|')]

In [7]:
df = df[df.category != 'nocode']

In [8]:
df.category.value_counts()

happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64

In [9]:
possible_labels = df.category.unique()

In [10]:
label_dict = {}
for index, possibl_label in enumerate(possible_labels):
    label_dict[possibl_label] = index

In [11]:
label_dict

{'angry': 2,
 'disgust': 3,
 'happy': 0,
 'not-relevant': 1,
 'sad': 4,
 'surprise': 5}

In [12]:
df['label'] = df.category.replace(label_dict)
df.head(10)

Unnamed: 0_level_0,text,category,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy,0
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy,0
614877582664835073,@Sofabsports thank you for following me back. ...,happy,0
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy,0
611570404268883969,@NationalGallery @ThePoldarkian I have always ...,happy,0
614499696015503361,Lucky @FitzMuseum_UK! Good luck @MirandaStearn...,happy,0
613601881441570816,Yr 9 art students are off to the @britishmuseu...,happy,0
613696526297210880,@RAMMuseum Please vote for us as @sainsbury #s...,not-relevant,1
610746718641102848,#AskTheGallery Have you got plans to privatise...,not-relevant,1
612648200588038144,@BarbyWT @britishmuseum so beautiful,happy,0


## Task 3: Training/Validation Split

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X_train, X_val, y_train, y_val = train_test_split(
df.index.values,
df.label.values,
test_size=0.15,
random_state=17,
stratify=df.label.values)

In [15]:
df['data_type'] = ['not_set']*df.shape[0]

In [16]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [17]:
df.groupby(['category', 'label', 'data_type']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,text
category,label,data_type,Unnamed: 3_level_1
angry,2,train,48
angry,2,val,9
disgust,3,train,5
disgust,3,val,1
happy,0,train,966
happy,0,val,171
not-relevant,1,train,182
not-relevant,1,val,32
sad,4,train,27
sad,4,val,5


## Task 4: Loading Tokenizer and Encoding our Data

In [18]:
!pip install transformers



In [19]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset # to set up our datasets that's usable in a pytorch environment

In [20]:
tokenizer = BertTokenizer.from_pretrained(
'bert-base-uncased',
do_lower_case=True, # it makes sense to conver everything to lower-case since we are using bert-uncased
)

In [21]:
# we want to convert all our sentences (tweets) from language into encoded form
encoded_data_train = tokenizer.batch_encode_plus(
df[df.data_type=='train'].text.values,
add_special_tokens=True, # this is  just bert's way of knowing where the sentence begins and when a new one begins
return_attention_masks=True,
    pad_to_max_length=True, # for masking padded tokens because they do not contain any relevant information to our problem
    max_length=256, # a single tweet won't have more than 256 words in it
    return_tensors='pt') # how we want to return these sentences (pt) because we are using pytorch

# we will almost do the same thing with the validation set but it should be done separate from the training set
encoded_data_val = tokenizer.batch_encode_plus(
df[df.data_type=='val'].text.values,
add_special_tokens=True,
return_attention_masks=True,
    pad_to_max_length=True, 
    max_length=256,
    return_tensors='pt')

# encoded_data_train returns a dictionary
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values) # make tensor out of our original data

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)

Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Keyword arguments {'return_attention_masks': True} not recognized.
Keyword arguments {'return_attention_masks': True} not recognized.
Keyword arguments {'return_attention_masks': True} not recognized.
Keyword arguments {'return_attention_masks': True} not recognized.
Keyword arguments {'return_attention_masks': True} not recognized.
Keyword arguments {'return_attention_masks': True} not recognized.
Keyword arguments {'return_attention_masks': True} not recognized.
Keyword arguments {'return_attention_masks': True} not recognized.
Keyword arguments {'return_attention_masks': True} not recognized.
Keyword arguments {'re

In [22]:
# we will use a TensorDataset which is the standard way of using a dataset in the pytorch library
dataset_train = TensorDataset(input_ids_train, 
                              attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, 
                              attention_masks_val, labels_val)

In [23]:
len(dataset_train)

1258

In [24]:
len(dataset_val)

223

## Task 5: Setting up BERT Pretrained Model

In [25]:
# we are essentially treating each tweet as its own unique sequence so one sequence will be classified into one of six classes
from transformers import BertForSequenceClassification

In [26]:
# num labels is basically saying how many output labels must this final layer of bert have
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', 
                                      num_labels=len(label_dict),
                                      output_attentions=False,
                                      output_hidden_states=False) # which is the state just before prediction which might be useful in encoding situations but for us we don't really need them

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Task 6: Creating Data Loaders

In [27]:
# data loaders essentially offer a nice way to iterate through your datasets in batches
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [28]:
batch_size = 32 # as the original authors suggestions

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train), # it also help to prevent the model from learning sequence-based differences when it is training
    batch_size=batch_size
)

dataloader_val = DataLoader(
    dataset_val,
    sampler=RandomSampler(dataset_val), # it also help to prevent the model from learning sequence-based differences when it is training
    batch_size=batch_size # and that is because we do not actually have to do many computations on this, we do not have to do any form of back propagation
)

## Task 7: Setting Up Optimizer and Scheduler

In [29]:
# Adam is a stochastic optimization approach
# our scheduler is what controls the learning rate (it adjusts the learning rate as training goes on based on how it is going)
from transformers import AdamW, get_linear_schedule_with_warmup

In [30]:
optimizer = AdamW(
    model.parameters(),
    lr=1e-5, # 2e-5 -> 5e-5 the recommended range in the BERT paper but this also depends mostly on you specific dataset that you're tuning the BERT model on
    eps=1e-8
)

In [31]:
epochs = 10

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(dataset_train)*epochs, # how many iterations that this should run for (it is how many times would you want you learning rate to actually change)
)

## Task 8: Defining our Performance Metrics

Accuracy metric approach originally used in accuracy function in [this tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#41-bertforsequenceclassification).

In [32]:
import numpy as np

In [33]:
from sklearn.metrics import f1_score

In [34]:
#preds = [0.9 0.05 0.05 0 0 ]
#preds = [1 0 0 0 0 ] that is how we want the outputto be so we want the flat value, which would be something like this vector

In [35]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted') # weighted average which weights each class based on how many samples exist 

In [36]:
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
              

## Task 9: Creating our Training Loop

Approach adapted from an older version of HuggingFace's `run_glue.py` script. Accessible [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128).

In [37]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [38]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


In [39]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val): # wrapping it around tqdm to know how long it is taking
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1] # we use our logits as our predictions
        loss_val_total += loss.item()
        # detach().cpu() is in the case that you are using a GPU, you want to pull the values onto your CPU in order to use them with numpy 
        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals


In [40]:
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0
    # progress_bar let us see how many batches have been trained and how many to go to know what is happening if the training took unexptected long time (if you are training somehting big for instance) due to crashing or hanging or somethng else
    progress_bar = tqdm(dataloader_train,
                        desc='Epoch {:1d}'.format(epoch),
                        leave=False, # to override itself each new epoch)
                        disable=False)
    for batch in progress_bar:
        
        model.zero_grad() # not necessary unless using RNN
        # our batch containes three items since our dataloader has three different variables
        batch = tuple(b.to(device) for b in batch)
        # inputs to the BERT pretrained model
        inputs = {
            'input_ids'      : batch[0],
            'attention_mask' : batch[1],
            'labels'         : batch[2]
            
        }
        # ** is for unbatching a dictionary strainght into the inputs and it works out quite well
        outputs = model(**inputs)
        # the BERT model returns loss and logits (hidden layer units so they are before passing through the activation function)
        # here we don't excute prediction so we care about the loss output from the BERT model
        
        loss = outputs[0]
        loss_train_total += loss.item()
        # we will use the loss function built into BERT to backpropagate 
        # predictions are needed in the evaluate function and not here in the training function (here we are updaing the model's weights)
        loss.backward()
        # model.parameters() are all our weights
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        # upadate the progress_bar to show the loss per batch
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item()/len(batch))})
        
    # this will have all the model layers and weights
    torch.save(model.state_dict(), f'Models/BERT_ft_epoch{epoch}.model') # ft stands for fine-tune
    
    # finally we can just report a couple of things
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (weighted) loss: {val_f1}')

HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))

HBox(children=(FloatProgress(value=0.0, description='Epoch 1', max=40.0, style=ProgressStyle(description_width…


Epoch 1
Training loss: 1.1002346470952034


HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))


Validation loss: 0.8009054831096104
F1 Score (weighted) loss: 0.6656119824269878


HBox(children=(FloatProgress(value=0.0, description='Epoch 2', max=40.0, style=ProgressStyle(description_width…


Epoch 2
Training loss: 0.7622748166322708


HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))


Validation loss: 0.7874705961772374
F1 Score (weighted) loss: 0.6953185953656175


HBox(children=(FloatProgress(value=0.0, description='Epoch 3', max=40.0, style=ProgressStyle(description_width…


Epoch 3
Training loss: 0.6346930161118507


HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))


Validation loss: 0.6044950868402209
F1 Score (weighted) loss: 0.7338077597972313


HBox(children=(FloatProgress(value=0.0, description='Epoch 4', max=40.0, style=ProgressStyle(description_width…


Epoch 4
Training loss: 0.5016121059656143


HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))


Validation loss: 0.5253081960337502
F1 Score (weighted) loss: 0.7891069515803997


HBox(children=(FloatProgress(value=0.0, description='Epoch 5', max=40.0, style=ProgressStyle(description_width…


Epoch 5
Training loss: 0.40804899781942366


HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))


Validation loss: 0.4757124291998999
F1 Score (weighted) loss: 0.8057853849554728


HBox(children=(FloatProgress(value=0.0, description='Epoch 6', max=40.0, style=ProgressStyle(description_width…


Epoch 6
Training loss: 0.31254596207290886


HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))


Validation loss: 0.4877690630299704
F1 Score (weighted) loss: 0.8347055538954001


HBox(children=(FloatProgress(value=0.0, description='Epoch 7', max=40.0, style=ProgressStyle(description_width…


Epoch 7
Training loss: 0.2598301130346954


HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))


Validation loss: 0.5374378263950348
F1 Score (weighted) loss: 0.8437200697788302


HBox(children=(FloatProgress(value=0.0, description='Epoch 8', max=40.0, style=ProgressStyle(description_width…


Epoch 8
Training loss: 0.2178127020597458


HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))


Validation loss: 0.5144038274884224
F1 Score (weighted) loss: 0.8408866929001592


HBox(children=(FloatProgress(value=0.0, description='Epoch 9', max=40.0, style=ProgressStyle(description_width…


Epoch 9
Training loss: 0.1637041580863297


HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))


Validation loss: 0.5272139608860016
F1 Score (weighted) loss: 0.8389368885572283


HBox(children=(FloatProgress(value=0.0, description='Epoch 10', max=40.0, style=ProgressStyle(description_widt…


Epoch 10
Training loss: 0.12217284790240228


HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))


Validation loss: 0.522835122687476
F1 Score (weighted) loss: 0.8487165264346768



## Task 10: Loading and Evaluating our Model

In [41]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [50]:
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [54]:
# BERT_ft_epoch6.model the second smallest validation loss and a reasonable F1 score
model.load_state_dict(torch.load('Models/BERT_ft_epoch6.model',
                      map_location=torch.device('cpu'))) # I trained the model on GPU but I want to move it back to the CPU

<All keys matched successfully>

In [55]:
# we don't want our loss anymore here. We just care about our predictions and our true values
_, predictions, true_vals = evaluate(dataloader_val)

HBox(children=(FloatProgress(value=0.0, max=7.0), HTML(value='')))




In [56]:
accuracy_per_class(predictions, true_vals)

Class: happy
Accuracy: 165/171

Class: not-relevant
Accuracy: 22/32

Class: angry
Accuracy: 3/9

Class: disgust
Accuracy: 0/1

Class: sad
Accuracy: 0/5

Class: surprise
Accuracy: 1/5

