<a href="https://colab.research.google.com/github/anwarbabukm/Sentiment_Analysis_DistilBert/blob/main/DistilBert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Sentiment Analysis using DistilBert on IMDB Movie Reviews***

*Our goal is to build a model that takes a movie review and predicts whether it is positive or negative. DistilBERT processes the review and passes the information it extracts on to a neural network classifier. DistilBERT is a smaller version of BERT developed and open-sourced by the team at HuggingFace; it is lighter and faster than BERT while roughly matching its performance. The classifier takes the output of DistilBERT and labels the review as either positive or negative.*

*For DistilBERT, we use a model that is already pre-trained and has a grasp of the English language. This model, however, is neither trained nor fine-tuned to do sentence classification.*


In [None]:
!pip install transformers  #install transformer using pip


## ***Reading the datasets containing the movie reviews and sentiments***

In [None]:
#Import pandas and read the Excel files containing the movie reviews
import pandas as pd
train = pd.read_excel('/content/drive/MyDrive/train.xlsx') #training data
test = pd.read_excel('/content/drive/MyDrive/test.xlsx') #test data
train.shape, test.shape  #shapes of the imported dataframes

((25000, 2), (25000, 2))

In [None]:
train.dtypes #provides the data type of training dataset

Reviews      object
Sentiment     int64
dtype: object

In [None]:
test.dtypes  #provides the data type of test dataset

Reviews      object
Sentiment     int64
dtype: object

In [None]:
train.head() #first 5 reviews and sentiments from the training data

Unnamed: 0,Reviews,Sentiment
0,"I saw this film at the London Premiere, and I ...",0
1,"What a bad, bad film!!! I can't believe all th...",0
2,The photography on the DVD is so dark I though...,0
3,It seems a shame that Greta Garbo ended her il...,0
4,Dear me... Peter Sellers was one of the most o...,0


In [None]:
train.tail() #last 5 reviews and sentiments from the training data

*The sentiment is encoded as a numerical category: 0 denotes a negative review and 1 denotes a positive review.*

In [None]:
def random_splitting(train,sample_size):
    train_data=train.sample(frac=sample_size) #randomly picks the fraction `sample_size` of the rows
    print('The shape of randomly picked data:',train_data.shape)
    return train_data


*The random_splitting() function randomly picks a fraction of the training data for the sentiment analysis. The fraction can be chosen as needed; here I use a sample size of 0.5, so it randomly picks 50% of the training data.*
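
*A minimal sketch of how the sampling behaves, on a small made-up dataframe rather than the IMDB data. The random_state=42 argument is an assumption added here for illustration (the notebook does not set a seed); with a fixed seed, repeated runs pick the same rows.*

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the notebook's dataframes
reviews = pd.DataFrame({"Reviews": [f"review {i}" for i in range(10)],
                        "Sentiment": [0, 1] * 5})

# frac=0.5 keeps half of the rows; random_state makes the pick repeatable
first = reviews.sample(frac=0.5, random_state=42)
second = reviews.sample(frac=0.5, random_state=42)

print(first.shape)                        # (5, 2)
print(first.index.equals(second.index))   # True
```

*Without a seed, each run of random_splitting() selects a different subset, so reported accuracies can vary slightly between runs.*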

In [None]:
#split the data into consecutive batches (2500 rows each in this notebook) so that DistilBERT can process them one at a time
def data_split(data,batches):
    n = batches  #batch size (number of rows per batch)
    list_df = [data[i:i+n] for i in range(0,data.shape[0],n)] #slices the data into batches of n rows
    return list_df

*The data_split() function splits a dataset into batches before it is passed to the DistilBERT model. The 12,500 sampled training reviews are split into 5 batches of 2,500 each.*
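
*The slicing logic can be sketched on a plain list, which slices the same way as a dataframe. The helper name split_into_batches is hypothetical. The second example shows what happens when the number of rows is not a multiple of the batch size.*

```python
import math

def split_into_batches(rows, batch_size):
    """Return consecutive slices of `rows`, each at most `batch_size` long."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

rows = list(range(12500))                  # stand-in for 12,500 sampled reviews
batches = split_into_batches(rows, 2500)
print(len(batches), [len(b) for b in batches])

# If the length is not a multiple of the batch size, the last batch is shorter,
# so the number of batches is ceil(n / batch_size), not int(n / batch_size).
uneven = split_into_batches(list(range(5200)), 2500)
print(len(uneven), math.ceil(5200 / 2500), int(5200 / 2500))
```

*For the sizes used in this notebook the division is exact, so both counts agree.*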

# ***DistilBert Model***

In [None]:
#importing necessary packages required for running
import numpy as np
import torch
import transformers as trans
import warnings
warnings.filterwarnings('ignore')

***Initializing the Pre-trained Model***

*Here we load the pretrained distilbert-base-uncased weights. The model is uncased, which means it does not distinguish between english and English.
It has 6 layers, a hidden size of 768, 12 attention heads and 66M parameters.*

*DistilBERT was pretrained on the same data as BERT: BookCorpus, a dataset consisting of 11,038 unpublished books, and English Wikipedia (excluding lists, tables and headers).*



In [None]:
#DistilBERT
model_class, tokenizer_class, pretrained_weights = (trans.DistilBertModel, trans.DistilBertTokenizer, 'distilbert-base-uncased')

# Load pretrained model and tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)


*The first step is to use the DistilBERT tokenizer to split each review into tokens. Then we add the special tokens needed for sentence classification ([CLS] at the first position and [SEP] at the end of the sentence). Finally, the tokenizer replaces each token with its id from the embedding table, which is a component we get with the trained model.*
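
*The three steps can be illustrated with a toy encoder. This is only a sketch: it splits on whitespace instead of using WordPiece, and TOY_VOCAB is a made-up vocabulary. The ids of the special tokens (0, 100, 101, 102) follow the bert-base-uncased vocabulary, while the word ids are invented for the example.*

```python
# Toy vocabulary for illustration only; word ids here are invented.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "a": 1, "great": 2, "movie": 3, "bad": 4}

def toy_encode(text, max_length=100):
    # Step 1: split the text into tokens (the real tokenizer uses WordPiece)
    tokens = text.lower().split()
    # Truncate, leaving room for the two special tokens
    tokens = tokens[:max_length - 2]
    # Step 2: add [CLS] at the start and [SEP] at the end
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: replace each token with its id; unknown words map to [UNK]
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]

ids = toy_encode("A great movie")
print(ids)  # [101, 1, 2, 3, 102]
```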


In [None]:
#Tokenization and DistilBERT feature extraction on the given batches
def DistilBert_model(list_df):
    batch_features = []  #one array of [CLS] vectors per batch
    print('Status:')
    for count, df in enumerate(list_df, start=1):
        print('working on:',count,'set of data')
        #had to restrict max_length to 100, as the system keeps crashing for larger values
        tokenized = df['Reviews'].apply(lambda x: tokenizer.encode(str(x), add_special_tokens=True, max_length=100, truncation=True))

        #Padding: fill every review with 0s up to the length of the longest review in the batch
        max_len = max(len(i) for i in tokenized.values)
        padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

        input_ids = torch.tensor(padded)  #converting the token ids into a tensor before inputting to DistilBERT

        with torch.no_grad():  #inference only, so no gradients are needed
            last_hidden_states = model(input_ids)  #running DistilBERT on the token ids
        #keep only the vector at position 0, which belongs to the [CLS] token
        batch_features.append(last_hidden_states[0][:,0,:].numpy())

    #join the batches into one single array for the machine learning model
    return np.concatenate(batch_features, axis=0)

*In this function, last_hidden_states holds the outputs of DistilBERT. Its first element is a tensor with the shape (number of examples, max number of tokens in the sequence, number of hidden units in the DistilBERT model). For one batch this is 2500 (the batch size), at most 100 (the number of tokens, capped by max_length) and 768 (the number of hidden units in the DistilBERT model).*

*Because of the computational cost of the large dataset and feature size, I had to restrict the max_length of the tokenizer to 100, as values above 100 made my Colab session crash constantly.*

*The output is one vector per input token, and each vector is made up of 768 numbers (floats). Because this is a sentence classification task, we ignore all except the first vector (the one associated with the [CLS] token). This is the vector we pass as the input to the neural network classification model. Stacking these vectors over all five batches gives a feature array of shape (12500, 768).*
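
*The padding and the [CLS] selection can be sketched with NumPy alone. The token ids below are hypothetical, and a random array stands in for the DistilBERT output, so only the shapes are meaningful. The attention mask is an addition for illustration; it marks which positions hold real tokens and which hold padding.*

```python
import numpy as np

# Hypothetical token ids for two reviews of unequal length
token_ids = [[101, 7, 8, 102], [101, 9, 102]]

# Pad with 0 up to the longest sequence in the batch
max_len = max(len(ids) for ids in token_ids)
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in token_ids])

# 1 for real tokens, 0 for padding
attention_mask = (padded != 0).astype(int)

# Stand-in for the model output: (batch, tokens, hidden units)
rng = np.random.default_rng(0)
hidden = rng.normal(size=(padded.shape[0], max_len, 768))

# Keep only position 0 of every sequence, i.e. the [CLS] vector
cls_features = hidden[:, 0, :]
print(padded.shape, attention_mask.tolist(), cls_features.shape)
```

*With the real model, such a mask could be passed as model(input_ids, attention_mask=torch.tensor(attention_mask)) so that padding positions are ignored by attention. The DistilBert_model() function in this notebook does not pass a mask.*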


In [None]:
from sklearn.model_selection import train_test_split

def train_test(X,Y):
    #hold out 20% of the data as a validation set
    x_train,x_val,y_train,y_val=train_test_split(X,Y,test_size=0.2)
    return x_train,x_val,y_train,y_val

*The train_test() function splits the data into a training set and a validation set; here 20% of the data is held out for validation.*

In [None]:
from keras import models
from keras import layers 

def model_fit(x_train,x_val,y_train,y_val):
 base_model = models.Sequential()
 base_model.add(layers.Dense(64, activation='relu', input_dim=768))
 base_model.add(layers.Dense(64, activation='relu'))
 base_model.add(layers.Dense(1, activation='sigmoid'))
 print(base_model.summary())
 base_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
 base_model.fit(x_train
                       , y_train
                       , epochs=20                     
                       , batch_size=5
                       , validation_data=(x_val,y_val)
                       , verbose=1)
 return base_model

*model_fit() builds and trains the neural network on the training set and reports its accuracy on the validation set after every epoch.*
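
*As a quick check on the size of this network, the parameter count of each Dense layer can be worked out by hand: one weight per input-unit pair plus one bias per unit. The helper dense_params is a hypothetical name used only for this sketch.*

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# (inputs, units) for the three Dense layers defined in model_fit()
layer_sizes = [(768, 64), (64, 64), (64, 1)]
counts = [dense_params(i, u) for i, u in layer_sizes]
print(counts, sum(counts))  # [49216, 4160, 65] 53441
```

*These values agree with the Param # column that Keras prints in the model summary.*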

In [None]:
frac=0.5 
print('Randomly picked',100*frac,'% data')
train_data = random_splitting(train,frac) #randomly picking 50% of the train data = 12.5k reviews
batch_size=2500
list_df= data_split(train_data,batch_size) #splitting the 12.5k reviews into 5 batches of 2.5k reviews
print('Split the data into',int(len(train_data)/batch_size),'batches of',batch_size,'each')
features= DistilBert_model(list_df) #running tokenization and distilbert model on the datasets
print('DistilBert was succesfully run on the full datasets')

Randomly picked 50.0 % data
The shape of randomly picked data: (12500, 2)
Split the data into 5 batches of 2500 each
Status:
working on: 1 set of data
working on: 2 set of data
working on: 3 set of data
working on: 4 set of data
working on: 5 set of data
DistilBert was succesfully run on the full datasets


In [None]:
train_data['Sentiment'].head()
labels=train_data['Sentiment']

In [None]:
labels[:5]

9479     0
13173    1
12585    1
18821    0
7452     0
Name: Sentiment, dtype: int64

***Choosing the test data for prediction***

In [None]:
frac= 0.2
test_data = random_splitting(test,frac) #randomly picking 20% of the test data for prediction
print('Randomly picked',100*frac,'% data')
batch=2500
list_tdf= data_split(test_data,batch) #split into batches
print('Split the data into',int(len(test_data)/batch),'batches')
x_test= DistilBert_model(list_tdf) #running tokenization and distilbert model on the datasets
print('DistilBert was succesfully run on the full datasets')
y_test=test_data['Sentiment']

The shape of randomly picked data: (5000, 2)
Randomly picked 20.0 % data
Split the data into 2 batches
Status:
working on: 1 set of data
working on: 2 set of data
DistilBert was succesfully run on the full datasets


***Data splitting and training of the neural network***

In [None]:
x_train,x_val,y_train,y_val = train_test(features,labels)
print('Train data shape:',x_train.shape,', Validation data shape:',x_val.shape)
base_model = model_fit(x_train,x_val,y_train,y_val)

Train data shape: (10000, 768) , Validation data shape: (2500, 768)
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 64)                49216     
_________________________________________________________________
dense_13 (Dense)             (None, 64)                4160      
_________________________________________________________________
dense_14 (Dense)             (None, 1)                 65        
Total params: 53,441
Trainable params: 53,441
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


***Prediction on test datasets***

In [None]:
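#prediction on the test datasets: the sigmoid output is thresholded at 0.5 to get the class (0 or 1)
#(predict_classes() is not available in recent Keras releases)
y_pred=(base_model.predict(x_test,batch_size=1,verbose=1) > 0.5).astype('int32').ravel()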



***Performance metrics of the classification model***

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.81      0.72      0.77      2470
           1       0.76      0.84      0.79      2530

    accuracy                           0.78      5000
   macro avg       0.78      0.78      0.78      5000
weighted avg       0.78      0.78      0.78      5000



In [None]:
from sklearn.metrics import accuracy_score
accu= accuracy_score(y_test,y_pred)
print('The Accuracy on test data is:',accu*100)

The Accuracy on test data is: 78.08


*The neural network predicted the sentiment of the test data with an accuracy of 78.08%. The model was built from 12.5k training reviews (10,000 for training and 2,500 for validation) and evaluated on 5k randomly selected test reviews.*
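
*The figures in the classification report follow from four counts: true positives, false positives, false negatives and true negatives. The sketch below computes them for the positive class on a small made-up set of labels, not on the notebook's predictions; binary_report is a hypothetical helper name.*

```python
def binary_report(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 0]
report = binary_report(y_true_demo, y_pred_demo)
print(report)
```

*Precision answers "of the reviews predicted positive, how many were positive", while recall answers "of the positive reviews, how many were found".*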

# ***EDA - Easy Data Augmentation***

*EDA (Easy Data Augmentation) is a set of techniques for boosting performance on text classification tasks. EDA consists of four simple but powerful operations: 1) synonym replacement, 2) random insertion, 3) random swap and 4) random deletion. On five text classification tasks, EDA has been shown to improve performance for both convolutional and recurrent neural networks. It demonstrates particularly strong results for smaller datasets: on average across the five datasets, training with EDA on only 50% of the available training set achieved the same accuracy as normal training with all available data.*

In [None]:
#Applying EDA - Easy Data Augmentation

In [3]:
!pip install numpy nltk gensim textblob googletrans
import nltk 
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
!pip install textaugment


In [4]:
from textaugment import EDA
t=EDA()

*EDA is imported from the textaugment package and instantiated as the object t.*

### ***Examples of Easy Data Augmentation***

In [None]:
t.random_insertion('Movie is awesome, And it needs some improvemnt in cinematography')

'Movie is awesome, And film it needs some improvemnt in cinematography'

*Random insertion finds a random synonym of a random word in the sentence that is not a stop word and inserts that synonym at a random position in the sentence. In this case 'film', a synonym of the word 'movie', is inserted into the sentence.*

In [5]:
t.random_deletion('film was not upto the expectation and expected a lot from such a great director')

'film was not upto the expectation and expected from such a director'

*Random deletion removes each word in the sentence with probability p. Here three words are removed from the sentence: 'a', 'lot' and 'great'.*
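
*A minimal pure-Python sketch of the idea, not the textaugment implementation. The function name sketch_random_deletion, the default p=0.1 and the seed argument are assumptions made for this example.*

```python
import random

def sketch_random_deletion(sentence, p=0.1, seed=None):
    """Drop each word independently with probability p."""
    rng = random.Random(seed)
    words = sentence.split()
    if len(words) <= 1:
        return sentence
    # A word is kept when its random draw is at least p
    kept = [w for w in words if rng.random() >= p]
    # Never return an empty sentence: fall back to one random word
    if not kept:
        kept = [rng.choice(words)]
    return " ".join(kept)

text = "film was not upto the expectation"
print(sketch_random_deletion(text, p=0.3, seed=1))
```

*A larger p removes more words and makes the augmented review drift further from the original.*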

In [None]:
t.random_swap('the best movie experience till date. No words to express the cinematography')

'date. best movie experience till the No words to express the cinematography'

*Random swap chooses two words in the sentence at random and swaps their positions. 
The words 'the' and 'date.' are swapped in this sentence.*
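
*The swap can be sketched in a few lines of plain Python. This is an illustration of the idea, not the textaugment implementation; sketch_random_swap, n and seed are hypothetical names.*

```python
import random

def sketch_random_swap(sentence, n=1, seed=None):
    """Swap two randomly chosen word positions, n times."""
    rng = random.Random(seed)
    words = sentence.split()
    if len(words) < 2:
        return sentence
    for _ in range(n):
        # Pick two distinct positions and exchange the words
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

original = "the best movie experience till date"
swapped = sketch_random_swap(original, n=1, seed=0)
print(swapped)
```

*The swapped sentence contains the same words as the original, so only the word order carries the perturbation.*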

In [None]:
t.synonym_replacement('movie is okayish. The acting was good but needs improvement')

'flick is okayish. The acting was good but needs improvement'

*Synonym replacement randomly chooses n words from the sentence that are not stop words and replaces each of them with one of its synonyms chosen at random. In the above sentence the word 'movie' is replaced by its synonym 'flick'.*

In [None]:
def data_augmentation(train):
    train1=[]  #augmented reviews
    label1=[]  #sentiment label of each augmented review
    train=train.sample(frac=0.5)  #randomly pick half of the rows (6250 of the 12.5k reviews)
    for i in range(len(train)):
        review=train.iloc[i,0]
        #each of the four EDA operations produces one new review
        train1.append(t.random_insertion(review))
        train1.append(t.random_deletion(review))
        train1.append(t.random_swap(review))
        train1.append(t.synonym_replacement(review))
        #the four new reviews keep the sentiment label of the original review
        label1.extend([int(train.iloc[i,1])]*4)
    return train1,label1

*The data_augmentation() function performs the Easy Data Augmentation. It randomly picks 50% of the selected training data and applies the four EDA operations to each review. This produces 25,000 new reviews, which are appended to the original 12.5k reviews.*
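
*The bookkeeping can be sketched without the EDA library. In the example below, four built-in string methods stand in for the four EDA operations (an assumption for illustration only), and augment_with_labels is a hypothetical helper. The point is that every augmented copy keeps the label of the review it came from.*

```python
def augment_with_labels(reviews, labels, augmenters):
    """Apply every augmenter to every review and repeat the label for each copy."""
    out_reviews, out_labels = [], []
    for review, label in zip(reviews, labels):
        for augment in augmenters:
            out_reviews.append(augment(review))
            out_labels.append(label)
    return out_reviews, out_labels

# Stand-ins for insertion, deletion, swap and synonym replacement
stand_ins = [str.upper, str.lower, str.title, str.strip]
new_reviews, new_labels = augment_with_labels(
    ["Good film", "bad film"], [1, 0], stand_ins)
print(len(new_reviews), new_labels)

# Dataset sizes used in this notebook
n_sampled = 12500
n_augmented = (n_sampled // 2) * 4          # half of the rows, four copies each
n_total = n_sampled + n_augmented
print(n_augmented, n_total, n_total // 2500)  # 25000 37500 15
```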


In [None]:
def to_df(trains,labels_data):
    train=np.array(trains) #the reviews are converted to array
    label=np.array(labels_data) #sentiments are converted to array
    arr = np.column_stack((train,label)) #joins the review and sentiment columns; the sentiments become strings here
    train=pd.DataFrame(arr,columns=['Reviews','Sentiment']) #the joined array is converted to a dataframe
    train=train.reset_index(drop=True)
    print('The shape of dataset which contain only EDA applied reviews:->', train.shape) #shape of newly formed datasets
    return train

*The to_df() function converts the newly formed EDA-augmented reviews and their labels into a dataframe.*

In [None]:
train_EDA,label_EDA=data_augmentation(train_data) #EDA process is done using this function
t_data=to_df(train_EDA,label_EDA) #converts the EDA applied datasets into dataframe
train_data_EDA=pd.concat([train_data,t_data],sort=False) #both original datasets and EDA applied datasets are merged.
train_data_EDA=train_data_EDA.sample(frac=1) #shuffling is done
print('After applying EDA methods the total dataset for the model is increased to ->',train_data_EDA.shape)
batches=2500 
list_df= data_split(train_data_EDA,batches) #the whole dataset is split into batches of a fixed size; we have chosen 2500 as the batch size.
print('Split the data into',int(len(train_data_EDA)/batches),'batches')
features= DistilBert_model(list_df) #running tokenization and distilbert model on the datasets
print('Finished....')
print('DistilBert was succesfully run on the full datasets')


The shape of dataset which contain only EDA applied reviews:-> (25000, 2)
After applying EDA methods the total dataset for the model is increased to -> (37500, 2)
Split the data into 15 batches
Status:
working on: 1 set of data
working on: 2 set of data
working on: 3 set of data
working on: 4 set of data
working on: 5 set of data
working on: 6 set of data
working on: 7 set of data
working on: 8 set of data
working on: 9 set of data
working on: 10 set of data
working on: 11 set of data
working on: 12 set of data
working on: 13 set of data
working on: 14 set of data
working on: 15 set of data
Finished....
DistilBert was succesfully run on the full datasets


In [None]:
 train_data_EDA['Sentiment'].dtypes #checking the data type of Sentiment column

dtype('O')

In [None]:
labels_eda=train_data_EDA['Sentiment'].astype('int64') #as the data type of Sentiment column is Object, We convert it to 'int64' for the neural network
labels_eda.shape, labels_eda.dtypes

((37500,), dtype('int64'))

### ***Training and prediction after applying EDA***

In [None]:
 x_train,x_val,y_train,y_val = train_test(features,labels_eda) #The datasets are split into train and validation set.
 base_model = model_fit(x_train,x_val,y_train,y_val) #neural network

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_15 (Dense)             (None, 64)                49216     
_________________________________________________________________
dense_16 (Dense)             (None, 64)                4160      
_________________________________________________________________
dense_17 (Dense)             (None, 1)                 65        
Total params: 53,441
Trainable params: 53,441
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


*Prediction is run on the same test set that was used before applying EDA, so the model performance with and without EDA can be compared directly.*

***The performance metrics of the model after applying EDA***

In [None]:
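#predict_classes() is not available in recent Keras releases, so the sigmoid output is thresholded at 0.5 instead
y_pred=(base_model.predict(x_test,batch_size=1,verbose=1) > 0.5).astype('int32').ravel()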
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.78      0.71      0.74      2470
           1       0.74      0.81      0.77      2530

    accuracy                           0.76      5000
   macro avg       0.76      0.76      0.76      5000
weighted avg       0.76      0.76      0.76      5000




*The Model performance on the test data before the application of EDA is : 78.08%*

*The Model performance on the test data after the application of EDA is: 76%*

*Based on the model performance, the accuracy has fallen after applying the EDA techniques. The model has overfitted the augmented data, which lowered its precision, recall and accuracy on the test set.* 

*I had planned to use a CNN model for this task, but Colab kept crashing because of the high computational requirements on the datasets. So I moved on with a fully connected neural network and noted the accuracy before and after applying EDA.*