
## Downloading and Importing required libraries
We use the Hugging Face `transformers` library for BERT.

In [1]:
!pip install -q transformers



In [37]:
import numpy as np
import pandas as pd
import torch
import transformers
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV,train_test_split
from sklearn.metrics import classification_report,accuracy_score
import warnings

warnings.filterwarnings('ignore')

## Loading Dataset
We use the [SST-2](https://nlp.stanford.edu/sentiment/code.html) sentiment analysis dataset, which consists of Rotten Tomatoes movie reviews.



In [93]:
data = pd.read_csv('https://raw.githubusercontent.com/amantayal44/BERT-sentiment-analysis/master/train.tsv', delimiter='\t',header=None)

In [94]:
data.head()

|   | 0 | 1 |
|---|---|---|
| 0 | a stirring , funny and finally transporting re... | 1 |
| 1 | apparently reassembled from the cutting room f... | 0 |
| 2 | they presume their audience wo n't sit still f... | 0 |
| 3 | this is a visually stunning rumination on love... | 1 |
| 4 | jonathan parker 's bartleby should have been t... | 1 |


In [9]:
# some examples
print('\n'.join(data[0][:5]))

a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
apparently reassembled from the cutting room floor of any given daytime soap
they presume their audience wo n't sit still for a sociology lesson , however entertainingly presented , so they trot out the conventional science fiction elements of bug eyed monsters and futuristic women in skimpy clothes
this is a visually stunning rumination on love , memory , history and the war between art and commerce
jonathan parker 's bartleby should have been the be all end all of the modern office anomie films


In [10]:
data[1].value_counts()

1    3610
0    3310
Name: 1, dtype: int64

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6920 entries, 0 to 6919
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       6920 non-null   object
 1   1       6920 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 108.2+ KB


The full dataset has 6,920 examples. We use only the first 2,000 for training and hold out the remaining 4,920 for evaluation (the test set in the sections that follow).

## BERT model and tokenizer
We will use DistilBERT, a version of BERT that is smaller, much faster, and requires a lot less memory.

In [12]:
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = transformers.DistilBertModel.from_pretrained('distilbert-base-uncased')


## Tokenizing Sentences

In [14]:
tokens = data[0].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

Setting `add_special_tokens=True` adds the `[CLS]` and `[SEP]` tokens at the start and end of each sentence, respectively. The model output for the `[CLS]` token is what we use for classification.

In [15]:
# example: map each token id of the first sentence back to its token
for token in tokens[0]:
  print('{} ----> {}'.format(token,tokenizer.decode([token])))

101 ----> [CLS]
1037 ----> a
18385 ----> stirring
1010 ----> ,
6057 ----> funny
1998 ----> and
2633 ----> finally
18276 ----> transporting
2128 ----> re
16603 ----> imagining
1997 ----> of
5053 ----> beauty
1998 ----> and
1996 ----> the
6841 ----> beast
1998 ----> and
5687 ----> 1930s
5469 ----> horror
3152 ----> films
102 ----> [SEP]


The training set is the first 2,000 examples; the remaining 4,920 form the test set.

In [16]:
train_tokens = tokens[:2000]
train_labels = data[1][:2000]
test_tokens = tokens[2000:]
test_labels = data[1][2000:]

## Padding and creating mask for train tokens

In [19]:
max_len = max([len(s) for s in train_tokens.values])

In [20]:
max_len

59

In [22]:
padded_tokens = np.array([i + [0]*(max_len-len(i)) for i in train_tokens.values])

We also pass an attention mask to the transformer so that the padded positions are masked out in the attention layers.

In [23]:
attention_mask = np.where(padded_tokens != 0 ,1,0)

In [24]:
padded_tokens.shape

(2000, 59)
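
To make the padding and masking step concrete, here is a minimal sketch on a toy batch of two sequences. The helper name `pad_and_mask` and the toy ids are illustrative assumptions, not part of the notebook. The sketch builds the mask from the true sequence lengths; the notebook's `padded_tokens != 0` gives the same result as long as id 0 appears only as padding.

```python
import numpy as np

def pad_and_mask(token_lists, pad_id=0):
    """Right-pad variable-length token id lists and build the matching attention mask."""
    max_len = max(len(t) for t in token_lists)
    padded = np.array([t + [pad_id] * (max_len - len(t)) for t in token_lists])
    # 1 for real tokens, 0 for padding, derived from each sequence's true length
    lengths = np.array([len(t) for t in token_lists])
    mask = (np.arange(max_len)[None, :] < lengths[:, None]).astype(int)
    return padded, mask

# Toy batch (hypothetical): 101 = [CLS], 102 = [SEP], as in the tokenizer output above
toy = [[101, 1037, 102], [101, 1037, 6057, 3152, 102]]
padded, mask = pad_and_mask(toy)
print(padded)
print(mask)
```

The shorter sequence is padded to length 5, and its mask has zeros in exactly the two padded positions.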

## Extracting features from BERT model

In [25]:
inputs = torch.tensor(padded_tokens)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
  outputs = model(inputs, attention_mask=attention_mask)

The model outputs a d = 768 dimensional encoding for every token in the sentence. We use the output for the `[CLS]` token (position 0) as the feature vector for the classification task.

In [27]:
train_features = outputs[0][:,0,:].numpy()
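
The slice `[:, 0, :]` can be checked on a stand-in array. The sketch below uses a random NumPy array in place of `outputs[0]` (an assumption made so that it runs without downloading the model); the layout `(batch, tokens, hidden)` and the indexing are the same as for the torch tensor.

```python
import numpy as np

batch_size, seq_len, hidden = 4, 59, 768
# Stand-in for outputs[0]: random values with the layout (batch, tokens, hidden)
last_hidden_state = np.random.default_rng(0).normal(size=(batch_size, seq_len, hidden))

# Keep every sentence, only token position 0 ([CLS]), and all hidden units
cls_features = last_hidden_state[:, 0, :]
print(cls_features.shape)  # (4, 768)
```

Each row of `cls_features` is one sentence-level feature vector, which is what the scikit-learn classifiers are trained on.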

## Logistic Regression

In [40]:
parameters = {'C': np.array([0.0001*(2**i) for i in range(20)])}
grid_search = GridSearchCV(LogisticRegression(), parameters)
grid_search.fit(train_features, train_labels)

print('best parameters: ', grid_search.best_params_)
print('best score: ', grid_search.best_score_)

best parameters:  {'C': 0.2048}
best score:  0.8425
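
The grid for `C` doubles at every step, starting from 0.0001. As a small sketch, the same grid can be written in vectorized form (the name `grid_vec` is an assumption for illustration); the selected value 0.2048 is the entry at index 11, that is 0.0001 · 2^11.

```python
import numpy as np

# Grid as written in the notebook
grid = np.array([0.0001 * (2 ** i) for i in range(20)])
# Equivalent vectorized form (illustrative)
grid_vec = 0.0001 * 2.0 ** np.arange(20)

print(np.allclose(grid, grid_vec))
print(grid[0], grid[11], grid[-1])
```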


In [42]:
lr = LogisticRegression(C=0.2048)
lr.fit(train_features,train_labels)

LogisticRegression(C=0.2048, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [44]:
pred = lr.predict(train_features)
print(classification_report(train_labels,pred))

              precision    recall  f1-score   support

           0       0.86      0.89      0.88       959
           1       0.90      0.87      0.88      1041

    accuracy                           0.88      2000
   macro avg       0.88      0.88      0.88      2000
weighted avg       0.88      0.88      0.88      2000



## K Nearest Neighbours

In [49]:
parameters = {'n_neighbors': [1,3,5]}
grid_search = GridSearchCV(KNeighborsClassifier(), parameters)
grid_search.fit(train_features, train_labels)

print('best parameters: ', grid_search.best_params_)
print('best score: ', grid_search.best_score_)

best parameters:  {'n_neighbors': 5}
best score:  0.7330000000000001


In [54]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_features,train_labels)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [55]:
pred = knn.predict(train_features)
print(classification_report(train_labels,pred))

              precision    recall  f1-score   support

           0       0.83      0.82      0.83       959
           1       0.84      0.85      0.84      1041

    accuracy                           0.84      2000
   macro avg       0.84      0.83      0.84      2000
weighted avg       0.84      0.84      0.84      2000



## Support Vector Classifier

In [56]:
parameters = {'C': np.array([0.0001*(2**i) for i in range(20)])}
grid_search = GridSearchCV(SVC(), parameters)
grid_search.fit(train_features, train_labels)

print('best parameters: ', grid_search.best_params_)
print('best score: ', grid_search.best_score_)

best parameters:  {'C': 3.2768, 'kernel': 'rbf'}
best score:  0.8435


In [59]:
svc = SVC(C=3.2768)
svc.fit(train_features,train_labels)

SVC(C=3.2768, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [60]:
pred = svc.predict(train_features)
print(classification_report(train_labels,pred))

              precision    recall  f1-score   support

           0       0.85      0.91      0.88       959
           1       0.91      0.85      0.88      1041

    accuracy                           0.88      2000
   macro avg       0.88      0.88      0.88      2000
weighted avg       0.88      0.88      0.88      2000



## Comparing models on test dataset

In [61]:
test_tokens.shape

(4920,)

In [65]:
# split the test tokens into two batches of sizes 2500 and 2420 to limit memory use
test_tokens = [test_tokens[:2500],test_tokens[2500:]]

In [70]:
lr_pred = []
knn_pred = []
svc_pred = []

for j in range(2):
  max_len = max([len(s) for s in test_tokens[j].values])
  padded_tokens = np.array([i + [0]*(max_len-len(i)) for i in test_tokens[j].values])
  attention_mask = np.where(padded_tokens != 0 ,1,0)
  
  inputs = torch.tensor(padded_tokens)
  attention_mask = torch.tensor(attention_mask)
  with torch.no_grad():
    outputs = model(inputs, attention_mask=attention_mask)
    
  features = outputs[0][:,0,:].numpy()
  lr_pred.append(lr.predict(features))
  knn_pred.append(knn.predict(features))
  svc_pred.append(svc.predict(features))
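
The loop above hardcodes two batches. As a sketch of a more general split, the hypothetical helper `batch_slices` below yields index ranges for any batch size; with 4,920 items and a batch size of 2,500 it reproduces the two batches used here.

```python
def batch_slices(n_items, batch_size):
    """Yield (start, stop) index pairs that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

sizes = [stop - start for start, stop in batch_slices(4920, 2500)]
print(sizes)  # [2500, 2420]
```

Each `(start, stop)` pair could then be used to slice the test tokens before padding, so a smaller batch size can be chosen if memory is tight.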


In [75]:
lr_prediction = np.concatenate(lr_pred) 
knn_prediction = np.concatenate(knn_pred)
svc_prediction = np.concatenate(svc_pred)

In [77]:
# logistic regression 
print(classification_report(test_labels,lr_prediction))

              precision    recall  f1-score   support

           0       0.81      0.86      0.83      2351
           1       0.86      0.82      0.84      2569

    accuracy                           0.84      4920
   macro avg       0.84      0.84      0.84      4920
weighted avg       0.84      0.84      0.84      4920



In [78]:
# knn 
print(classification_report(test_labels,knn_prediction))

              precision    recall  f1-score   support

           0       0.73      0.73      0.73      2351
           1       0.75      0.75      0.75      2569

    accuracy                           0.74      4920
   macro avg       0.74      0.74      0.74      4920
weighted avg       0.74      0.74      0.74      4920



In [79]:
# svm
print(classification_report(test_labels,svc_prediction))

              precision    recall  f1-score   support

           0       0.80      0.87      0.84      2351
           1       0.87      0.80      0.84      2569

    accuracy                           0.84      4920
   macro avg       0.84      0.84      0.84      4920
weighted avg       0.84      0.84      0.84      4920



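Conclusion: despite being trained on less than one third of the dataset, the classifiers performed almost as well on the test set as on the training set (0.84 versus 0.88 accuracy for both logistic regression and the support vector classifier). This shows how well BERT generates features for sentences; it is like an ImageNet moment for sentences. Good accuracy was also achieved with very few labeled samples. The support vector classifier and logistic regression outperform the KNN classifier, which reaches 0.74 test accuracy.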
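
The comparison can be summarized with a few lines of Python. The accuracies below are copied from the classification reports above; the dictionary layout and variable names are illustrative assumptions.

```python
# (train accuracy, test accuracy) as reported in the classification reports above
reported = {
    "logistic regression": (0.88, 0.84),
    "knn": (0.84, 0.74),
    "svc": (0.88, 0.84),
}

# Gap between training and test accuracy for each classifier
gaps = {name: round(train - test, 2) for name, (train, test) in reported.items()}

for name, (train, test) in reported.items():
    print(f"{name:<20} train={train:.2f} test={test:.2f} gap={gaps[name]:.2f}")
```

The gap is 0.04 for logistic regression and the SVC, and 0.10 for KNN, which is consistent with KNN generalizing worst on these features.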

## Prediction

In [90]:
def prediction(sentence):
  # tokenize a single sentence; a batch of one needs no padding or mask
  token_ids = tokenizer.encode(sentence,add_special_tokens=True)
  inputs = torch.tensor([token_ids])
  with torch.no_grad():
    outputs = model(inputs)
  # [CLS] embedding as a NumPy array, matching the format of the training features
  features = outputs[0][:,0,:].numpy()
  pred_lr = lr.predict(features)[0]
  pred_svc = svc.predict(features)[0]
  sentiment = {0:'Negative',1:'Positive'}
  print("lr => {}".format(sentiment[pred_lr]))
  print("svc => {}".format(sentiment[pred_svc]))

In [91]:
prediction("overall movie story is fine,songs are good but acting is crap")

lr => Negative
svc => Negative


In [92]:
prediction("movie story is crap but songs and acting is great")

lr => Positive
svc => Positive
