# Project Overview

- This project addresses sentiment analysis of tweets by fine-tuning BERT, a deep learning language model.
- BERT is a large-scale, transformer-based language model that can be fine-tuned for different tasks.
- The original BERT paper can be found [here](https://arxiv.org/pdf/1810.04805.pdf).
- For the experiments, we use the [SMILE Twitter Emotion dataset](https://www.researchgate.net/publication/305676818_SMILE_Twitter_Emotion_Classification_using_Domain_Adaptation/link/5798c71c08aed51475e877ee/download).

In [2]:
import torch
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot
from tqdm.notebook import tqdm
%matplotlib inline

# Data Exploration

In [22]:
df = pd.read_csv("smile-annotations-final.csv", names=['id', 'text', 'category'])
df.head()

| | id | text | category |
|---|---|---|---|
| 0 | 611857364396965889 | @aandraous @britishmuseum @AndrewsAntonio Merc... | nocode |
| 1 | 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy |
| 2 | 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy |
| 3 | 614877582664835073 | @Sofabsports thank you for following me back. ... | happy |
| 4 | 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy |


Since the id is unique for every tweet, we use it as the index of the dataframe.

In [23]:
df.set_index('id', inplace=True)
df.head()

| id | text | category |
|---|---|---|
| 611857364396965889 | @aandraous @britishmuseum @AndrewsAntonio Merc... | nocode |
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy |


In [24]:
df.text.iloc[100]

"* @hist_astro @britishmuseum It looks like there's some #ArtisticLicence involved in that sketch of #SummerSolstice #sunrise at #Stonehenge."

In [25]:
df.category.value_counts()

nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|angry               2
sad|disgust             2
sad|disgust|angry       1
Name: category, dtype: int64

In [26]:
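# Drop tweets annotated with more than one emotion (categories joined by '|').
# regex=False matches the literal character and avoids the invalid '\|' escape.
df = df[~df.category.str.contains('|', regex=False)]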
df.category.value_counts()

nocode          1572
happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64

In [27]:
df = df[df.category!='nocode']
df.category.value_counts()

happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64

In [28]:
labels = df.category.unique()
labels_di = {}

for index, label in enumerate(labels):
    labels_di[label] = index
print(labels_di)

{'happy': 0, 'not-relevant': 1, 'angry': 2, 'disgust': 3, 'sad': 4, 'surprise': 5}


In [44]:
# Map each category name to its integer label id
df['label'] = df.category.map(labels_di)
print(df.label.value_counts())
print(df.category.value_counts())

0    1137
1     214
2      57
5      35
4      32
3       6
Name: label, dtype: int64
happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64


# Data Separation

Since the dataset is imbalanced, a plain random train/validation split is risky: a class with very few samples (for example, `disgust` with only 6 tweets) might be missing from the training or validation set, which would hurt the model's ability to generalize and make the evaluation unreliable. Therefore, we use a stratified split, which keeps the class proportions approximately the same in the training and validation sets.

In [45]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(df.index.values, df.label.values, 
                                                 stratify=df.label.values, test_size=0.15,
                                                 shuffle=True, random_state=2020)
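
To make the idea of stratification concrete, here is a minimal hand-rolled sketch that uses only the standard library. It is not how `train_test_split` is implemented; the helper name `stratified_split` and the toy label counts are assumptions made purely for illustration. The sketch splits each class separately, so that even a class with very few samples ends up in both parts.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.15, seed=2020):
    """Hypothetical helper: split indices so that each class keeps roughly
    the same share in both parts (illustration only)."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        # Keep at least one validation sample and at least one training sample per class.
        n_val = max(1, round(len(indices) * test_size))
        n_val = min(n_val, len(indices) - 1)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy labels mimicking a heavily imbalanced dataset (the counts are made up).
toy_labels = ['happy'] * 100 + ['angry'] * 10 + ['disgust'] * 4
train_idx, val_idx = stratified_split(toy_labels)

train_counts = Counter(toy_labels[i] for i in train_idx)
val_counts = Counter(toy_labels[i] for i in val_idx)
print(train_counts)
print(val_counts)
```

Every class appears in both counters, which a purely random split of such a small minority class does not guarantee.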

Now we can check whether the samples were distributed properly between the training and validation sets. To do so, we add a column called 'data_type' to the dataframe, which records whether each sample belongs to the training or the validation set.

In [50]:
df['data_type'] = ['not_set'] * df.shape[0]
df.head()

| id | text | category | label | data_type |
|---|---|---|---|---|
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy | 0 | not_set |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy | 0 | not_set |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy | 0 | not_set |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy | 0 | not_set |
| 611570404268883969 | @NationalGallery @ThePoldarkian I have always ... | happy | 0 | not_set |


In [51]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'
df.head()

| id | text | category | label | data_type |
|---|---|---|---|---|
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy | 0 | val |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy | 0 | train |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy | 0 | train |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy | 0 | train |
| 611570404268883969 | @NationalGallery @ThePoldarkian I have always ... | happy | 0 | train |


In [52]:
df.groupby(['category', 'label', 'data_type']).count()

| category | label | data_type | text |
|---|---|---|---|
| angry | 2 | train | 48 |
| angry | 2 | val | 9 |
| disgust | 3 | train | 5 |
| disgust | 3 | val | 1 |
| happy | 0 | train | 966 |
| happy | 0 | val | 171 |
| not-relevant | 1 | train | 182 |
| not-relevant | 1 | val | 32 |
| sad | 4 | train | 27 |
| sad | 4 | val | 5 |


# Tokenizing and Encoding the data

In [53]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [54]:
tokenizer = BertTokenizer.from_pretrained(
            'bert-base-uncased',
            do_lower_case=True)


In [63]:
# padding='max_length' replaces the deprecated pad_to_max_length=True;
# truncation=True explicitly cuts tweets longer than max_length tokens.
encoded_train = tokenizer.batch_encode_plus(
                df[df['data_type']=='train'].text.values,
                add_special_tokens=True,
                return_attention_mask=True,
                padding='max_length',
                truncation=True,
                max_length=256, return_tensors='pt')

encoded_val = tokenizer.batch_encode_plus(
                df[df['data_type']=='val'].text.values,
                add_special_tokens=True,
                return_attention_mask=True,
                padding='max_length',
                truncation=True,
                max_length=256, return_tensors='pt')
print(f"{encoded_train.keys()}\n{encoded_val.keys()}")


dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])


In [64]:
input_ids_train = encoded_train['input_ids']
attention_masks_train = encoded_train['attention_mask']
labels_train = torch.tensor(df[df['data_type'] == 'train'].label.values)

input_ids_val = encoded_val['input_ids']
attention_masks_val = encoded_val['attention_mask']
labels_val = torch.tensor(df[df['data_type'] == 'val'].label.values)

In [65]:
dataset_train = TensorDataset(input_ids_train,
                            attention_masks_train,
                            labels_train)

dataset_val = TensorDataset(input_ids_val,
                           attention_masks_val,
                           labels_val)
print(f"There are {len(dataset_train)} samples in the training data!")
print(f"There are {len(dataset_val)} samples in the validation data!")

There are 1258 samples in the training data!
There are 223 samples in the validation data!
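
The `input_ids` and `attention_mask` tensors built above can be pictured with a small pure-Python sketch. The helper `pad_and_mask` and the word-piece ids in the example are made up for illustration; the special-token ids (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0) are those of the `bert-base-uncased` vocabulary. The sketch only mimics the shape of the tokenizer output, not its actual implementation.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased

def pad_and_mask(token_ids, max_length):
    """Hypothetical helper: add special tokens, truncate, pad, and build the mask."""
    # Reserve two positions for [CLS] and [SEP], truncating the rest if needed.
    body = token_ids[:max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)

    # Pad up to max_length; padded positions get mask 0 so the model ignores them.
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask

# Made-up word-piece ids standing in for a short tweet.
ids, mask = pad_and_mask([2054, 1037, 3376, 2154], max_length=8)
print(ids)   # [101, 2054, 1037, 3376, 2154, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

With `max_length=256`, every tweet becomes a row of 256 ids, and the mask marks which of those positions hold real tokens.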


# Formulating a Model

In [71]:
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                     num_labels = len(labels_di),
                                     output_attentions=False,
                                     output_hidden_states=False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

# Creating DataLoaders

In [72]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

bs = 8
dataloader_train = DataLoader(dataset_train,
                             sampler=RandomSampler(dataset_train),
                             batch_size=bs)
# Validation does not need shuffling, so iterate in a fixed order
dataloader_val = DataLoader(dataset_val,
                           sampler=SequentialSampler(dataset_val),
                           batch_size=bs*4)

# Setting an optimizer and scheduler

In [73]:
from transformers import AdamW, get_linear_schedule_with_warmup

epochs = 10
optimizer = AdamW(model.parameters(),
                 lr=1e-5, eps=1e-8)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                           num_warmup_steps=0,
                                           num_training_steps=len(dataloader_train)*epochs)
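
For intuition, the learning-rate multiplier of a linear schedule with warmup can be sketched in plain Python. The function name `linear_warmup_multiplier` is hypothetical, and the total of 1580 steps assumes 158 batches per epoch (1258 training samples at batch size 8) over 10 epochs. With `num_warmup_steps=0`, as in the cell above, the rate decays linearly from `lr` to zero.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Hypothetical helper: factor applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
total_steps = 158 * 10  # batches per epoch * epochs (assumed from the setup above)

# Learning rate at the start, the midpoint, and the end of training.
for step in (0, 790, 1580):
    factor = linear_warmup_multiplier(step, 0, total_steps)
    print(step, base_lr * factor)
```

A non-zero warmup would add a short ramp at the start instead of beginning at the full learning rate.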

# Defining evaluation metrics

Since the dataset is imbalanced, we use the F1 score, which is better suited to imbalanced data than plain accuracy. Below, it is averaged over the classes, weighted by their support (`average='weighted'`).

In [74]:
from sklearn.metrics import f1_score

def f1_score_func(preds, targs):
    # preds: logits of shape (n_samples, n_classes); targs: integer label ids
    preds_flat = np.argmax(preds, axis=1).flatten()
    targs_flat = targs.flatten()
    return f1_score(targs_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, targs):
    # Map label ids back to category names for printing
    label_di_inverse = {v: k for k, v in labels_di.items()}
    preds_flat = np.argmax(preds, axis=1).flatten()
    targs_flat = targs.flatten()

    for label in np.unique(targs_flat):
        y_preds = preds_flat[targs_flat == label]
        y_true = targs_flat[targs_flat == label]
        print(f"Class: {label_di_inverse[label]}")
        print(f"Accuracy: {len(y_preds[y_preds == label])}/{len(y_true)}\n")
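
To see what the weighted F1 score measures, the following sketch computes it from scratch on toy data. The helper names `f1_per_class` and `weighted_f1` and the label counts are made up for illustration; this is a plain-Python rendering of the usual definition, not scikit-learn's implementation.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """Hypothetical helper: return {label: F1} computed from scratch."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class is never hit.
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each by its class support."""
    support = Counter(y_true)
    scores = f1_per_class(y_true, y_pred)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)

# Toy data (made up): 90 samples of class 0, 10 of class 1,
# and a degenerate model that always predicts class 0.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy:    {accuracy:.3f}")
print(f"per-class:   {f1_per_class(y_true, y_pred)}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.3f}")
```

The per-class scores show that the minority class is never predicted correctly, which the overall accuracy alone hides. This is also why a per-class report is worth printing next to the aggregate score.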

# Training a BERT model

In [79]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
model.to(device)

cuda


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [80]:
def evaluate(dataloader_val):
    
    model.eval()
    
    loss_val_total = 0
    preds, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids'    : batch[0],
                 'attention_mask': batch[1],
                 'labels'        : batch[2]}
        
        with torch.no_grad():
            
            outputs = model(**inputs)
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()
        
        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        preds.append(logits)
        true_vals.append(label_ids)
        
    loss_val_avg = loss_val_total / len(dataloader_val)
    
    preds = np.concatenate(preds, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
    
    return loss_val_avg, preds, true_vals

In [82]:
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    loss_train_total = 0
    progress_bar = tqdm(dataloader_train,
                       desc='Epoch {:1d}'.format(epoch),
                       leave=False, disable=False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids'    : batch[0],
                 'attention_mask': batch[1],
                 'labels'        : batch[2]}
        
        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        
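        # set_postfix expects a dict (or keyword arguments), not a plain string;
        # the loss returned by the model is already averaged over the batch
        progress_bar.set_postfix({'training_loss': f'{loss.item():.3f}'})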
        
    torch.save(model.state_dict(), f'Models/BERT_ft_epoch{epoch}.model')
    tqdm.write(f'\nEpoch {epoch}')
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f"Training loss: {loss_train_avg}")
    
    val_loss, preds, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(preds, true_vals)
    tqdm.write(f"Validation loss: {val_loss}")
    tqdm.write(f"F1 Score: {val_f1}")


RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 4.00 GiB total capacity; 2.77 GiB already allocated; 22.64 MiB free; 2.91 GiB reserved in total by PyTorch)