Himanshu Mahajan

Thapar Institute of Engineering and Technology, Patiala

Sentiment analysis classifies the polarity of a given text at the document, sentence, or
feature/aspect level: it determines whether the opinion expressed in a document, a sentence, or an
entity feature/aspect is positive or negative.
This notebook performs sentiment analysis of movie reviews using the IMDB dataset.

Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
import re
from copy import deepcopy as d
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from bs4 import BeautifulSoup
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.svm import LinearSVC

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Loading Dataset

In [None]:
dataset = pd.read_csv('IMDB Dataset.csv')

print(dataset.shape)
dataset.head()

(50000, 2)


   review                                             sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [None]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [None]:
dataset.describe()

        review                                             sentiment
count   50000                                              50000
unique  49582                                              2
top     Loved today's show!!! It was a variety and not...  negative
freq    5                                                  25000


In [None]:
dataset['sentiment'].value_counts()

negative    25000
positive    25000
Name: sentiment, dtype: int64

Converting sentiments into numerical form

In [None]:
dataset['sentiment']=dataset['sentiment'].replace({'positive':1, 'negative':0})
dataset.head()

   review                                             sentiment
0  One of the other reviewers has mentioned that ...  1
1  A wonderful little production. <br /><br />The...  1
2  I thought this was a wonderful way to spend ti...  1
3  Basically there's a family where a little boy ...  0
4  Petter Mattei's "Love in the Time of Money" is...  1


Removal of HTML contents

In [None]:
# Work on a copy so the raw reviews stay available
dataset_noHTML = dataset.copy()
# Parse each review once and keep only its text. Assigning the whole column
# avoids the chained .iloc assignment that triggers SettingWithCopyWarning.
dataset_noHTML['review'] = dataset_noHTML['review'].apply(
    lambda text: BeautifulSoup(text, 'html.parser').get_text())

dataset_noHTML['review'].iloc[1]

'A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well done.'
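
As a rough illustration of what tag stripping does to a review, the sketch below uses the standard library's `html.parser` rather than BeautifulSoup, so it is a stand-in for the cell above and not the same implementation. The class name `TextExtractor` and the helper `strip_tags` are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags such as <br /> produce no data
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


plain = strip_tags("A wonderful little production. <br /><br />The filming technique")
print(plain)
```

The `<br />` tags disappear without leaving a separator behind, which is why the sentences in the cleaned review above run directly into each other.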

Removing special characters, digits, etc. except alphabet.

In [None]:
dataset_onlyAlphabet = dataset_noHTML.copy()
# Drop text enclosed in square brackets, then replace every non-letter with a space
dataset_onlyAlphabet['review'] = (
    dataset_onlyAlphabet['review']
    .str.replace(r'\[[^\]]*\]', ' ', regex=True)
    .str.replace(r'[^a-zA-Z]', ' ', regex=True)
)

dataset_onlyAlphabet['review'].iloc[1]

'A wonderful little production  The filming technique is very unassuming  very old time BBC fashion and gives a comforting  and sometimes discomforting  sense of realism to the entire piece  The actors are extremely well chosen  Michael Sheen not only  has got all the polari  but he has all the voices down pat too  You can truly see the seamless editing guided by the references to Williams  diary entries  not only is it well worth the watching but it is a terrificly written and performed piece  A masterful production about one of the great master s of comedy and his life  The realism really comes home with the little things  the fantasy of the guard which  rather than use the traditional  dream  techniques remains solid then disappears  It plays on our knowledge and our senses  particularly with the scenes concerning Orton and Halliwell and the sets  particularly of their flat with Halliwell s murals decorating every surface  are terribly well done '
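
The effect of the two substitutions can be tried on a short string with only the standard library. This is a minimal sketch: the sample sentence and the helper name `keep_letters` are invented for illustration, and the first pattern assumes the intent is to drop text enclosed in square brackets.

```python
import re


def keep_letters(text):
    """Drop [bracketed] spans, then replace every non-letter with a space."""
    text = re.sub(r'\[[^\]]*\]', ' ', text)
    return re.sub(r'[^a-zA-Z]', ' ', text)


sample = "Great film [spoiler ahead] - 10/10!"
cleaned = keep_letters(sample)
print(repr(cleaned))
print(cleaned.split())
```

Each removed character leaves a space behind, so the cleaned text contains runs of spaces; these vanish later when the review is split on whitespace.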

Converting everything to lowercase

In [None]:
dataset_lowercase = dataset_onlyAlphabet.copy()
# Lowercase the whole column at once instead of assigning row by row
dataset_lowercase['review'] = dataset_lowercase['review'].str.lower()

dataset_lowercase['review'].iloc[1]

'a wonderful little production  the filming technique is very unassuming  very old time bbc fashion and gives a comforting  and sometimes discomforting  sense of realism to the entire piece  the actors are extremely well chosen  michael sheen not only  has got all the polari  but he has all the voices down pat too  you can truly see the seamless editing guided by the references to williams  diary entries  not only is it well worth the watching but it is a terrificly written and performed piece  a masterful production about one of the great master s of comedy and his life  the realism really comes home with the little things  the fantasy of the guard which  rather than use the traditional  dream  techniques remains solid then disappears  it plays on our knowledge and our senses  particularly with the scenes concerning orton and halliwell and the sets  particularly of their flat with halliwell s murals decorating every surface  are terribly well done '

Splitting into words, removal of stopwords, and lemmatization

In [None]:
dataset_final = dataset_lowercase.copy()
# Build the stopword set and the lemmatizer once, not once per word or review
stop_words = set(stopwords.words('english'))
lm = WordNetLemmatizer()

def clean_tokens(text):
    # Split on whitespace, drop stopwords, lemmatize, and join back into a string
    return ' '.join(lm.lemmatize(word) for word in text.split() if word not in stop_words)

dataset_final['review'] = dataset_final['review'].apply(clean_tokens)

dataset_final['review'].iloc[1]

'wonderful little production filming technique unassuming old time bbc fashion give comforting sometimes discomforting sense realism entire piece actor extremely well chosen michael sheen got polari voice pat truly see seamless editing guided reference williams diary entry well worth watching terrificly written performed piece masterful production one great master comedy life realism really come home little thing fantasy guard rather use traditional dream technique remains solid disappears play knowledge sens particularly scene concerning orton halliwell set particularly flat halliwell mural decorating every surface terribly well done'
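
The stopword-filtering part of this step can be sketched without NLTK. The miniature stop list below is hypothetical (NLTK's English list is much longer), and lemmatization is left out because it needs the WordNet data. The point is that the set is built once and reused for every membership test.

```python
# Hypothetical miniature stop list, for illustration only
STOP_WORDS = frozenset({"a", "the", "is", "and", "of", "it", "to"})


def drop_stop_words(text, stop_words=STOP_WORDS):
    """Split on whitespace and keep only the words not in the stop list."""
    return " ".join(word for word in text.split() if word not in stop_words)


filtered = drop_stop_words("the acting is superb and the plot is a joy")
print(filtered)
```

Looking a word up in a prebuilt set takes roughly constant time, whereas rebuilding the set for every word repeats the same work millions of times over 50,000 reviews.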

Data Splitting

In [None]:
x_train, x_test, y_train, y_test = train_test_split(dataset_final['review'], dataset_final['sentiment'], test_size=0.25, random_state=42)

To vectorize the reviews, we apply: 1. Bag of Words model (CountVectorizer)

In [None]:
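# Count vectorizer for bag of words.
# Note: max_df=1 is an integer, which scikit-learn treats as an absolute document
# count, so only n-grams that occur in at most one training review are kept.
# The bag-of-words scores reported below were produced with this setting; a float
# such as max_df=1.0 (a proportion of documents) would keep the full vocabulary.
cv=CountVectorizer(min_df=0,max_df=1,binary=False,ngram_range=(1,3))
# transformed train reviews
cv_train=cv.fit_transform(x_train)
# transformed test reviews
cv_test=cv.transform(x_test)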
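
The `max_df` argument behaves differently for integers and floats: an integer is an absolute document count, a float is a proportion of documents. The sketch below imitates that rule in plain Python on a toy corpus, with unigrams only. It is not scikit-learn's code, and `select_vocabulary` is a made-up name.

```python
from collections import Counter


def select_vocabulary(docs, max_df):
    """Keep terms whose document frequency does not exceed max_df.

    An int is read as an absolute document count, a float as a
    proportion of the corpus (the rule documented for CountVectorizer).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    limit = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, count in doc_freq.items() if count <= limit)


docs = ["good movie", "bad movie", "good plot"]
only_rare = select_vocabulary(docs, 1)     # integer: at most one document
everything = select_vocabulary(docs, 1.0)  # float: at most 100% of documents
print(only_rare)
print(everything)
```

With the integer setting, the words shared across reviews ("good", "movie") are discarded and only terms unique to a single document remain, which leaves a classifier little to generalize from.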

To vectorize the reviews, we apply: 2. TF-IDF model (TfidfVectorizer)

In [None]:
tfidf = TfidfVectorizer(ngram_range = (1, 3))

tfidf_train = tfidf.fit_transform(x_train)
tfidf_test = tfidf.transform(x_test)
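
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF. It assumes the defaults documented for scikit-learn's `TfidfVectorizer`: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The function name `tfidf_vectors` and the toy corpus are illustrative, and only unigrams are handled.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency of each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: counts[t] * idf[t] for t in counts}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


vectors = tfidf_vectors(["good movie", "bad movie"])
for vec in vectors:
    print({term: round(weight, 3) for term, weight in vec.items()})
```

In this toy corpus "movie" appears in both documents, so it receives a lower weight than "good" or "bad", which each appear in only one.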

Using Logistic Regression as the first model

In [None]:
#training the model
lr1=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)
lr2=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)

#Fitting the model for Bag of words
lr_cv=lr1.fit(cv_train ,y_train)
print(lr_cv)

#Fitting the model for tfidf features
lr_tfidf=lr2.fit(tfidf_train,y_train)
print(lr_tfidf)


LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)


Logistic Regression model prediction on test dataset

In [None]:
#Predicting the model for bag of words
lr_cv_predict=lr1.predict(cv_test)
print(lr_cv_predict)

##Predicting the model for tfidf features
lr_tfidf_predict=lr2.predict(tfidf_test)
print(lr_tfidf_predict)

[0 1 0 ... 0 1 0]
[0 1 0 ... 0 1 1]


Accuracy of Logistic Regression Model

In [None]:
#Accuracy score for bag of words
lr_cv_score=accuracy_score(y_test,lr_cv_predict)
print("lr_cv_score :",lr_cv_score)

#Accuracy score for tfidf features
lr_tfidf_score=accuracy_score(y_test,lr_tfidf_predict)
print("lr_tfidf_score :",lr_tfidf_score)

lr_cv_score : 0.59384
lr_tfidf_score : 0.88528


Confusion Matrix for Logistic Regression

In [None]:
#confusion matrix for bag of words
cm_cv=confusion_matrix(y_test,lr_cv_predict,labels=[1,0])
print(cm_cv)

#confusion matrix for tfidf features
cm_tfidf=confusion_matrix(y_test,lr_tfidf_predict,labels=[1,0])
print(cm_tfidf)

[[1372 4971]
 [ 106 6051]]
[[5705  638]
 [ 796 5361]]
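
The accuracy scores printed earlier can be recomputed from these matrices. Because `labels=[1,0]` was passed, rows are the true classes in the order positive, negative, and columns are the predicted classes in the same order. The helper `binary_metrics` is a made-up name for this sketch.

```python
def binary_metrics(cm):
    """cm rows = true [positive, negative], columns = predicted [positive, negative]."""
    (tp, fn), (fp, tn) = cm
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision_pos": tp / (tp + fp),
        "recall_pos": tp / (tp + fn),
    }


bow_metrics = binary_metrics([[1372, 4971], [106, 6051]])
tfidf_metrics = binary_metrics([[5705, 638], [796, 5361]])
for name, metrics in (("bag of words", bow_metrics), ("tf-idf", tfidf_metrics)):
    print(name, {key: round(value, 5) for key, value in metrics.items()})
```

For the bag-of-words model, the positive class has high precision but a recall of about 0.22: the model labels most reviews negative, which matches the large top-right entry (4971) of its matrix.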


Classification report for Logistic Regression

In [None]:
#Classification report for bag of words
# target_names follow the sorted label order (0 = negative, 1 = positive)
lr_cv_report=classification_report(y_test,lr_cv_predict,target_names=['Negative','Positive'])
print('LR (CV):')
print(lr_cv_report)

#Classification report for tfidf features
lr_tfidf_report=classification_report(y_test,lr_tfidf_predict,target_names=['Negative','Positive'])
print('\nLR (TF-IDF):')
print(lr_tfidf_report)

LR (CV):
              precision    recall  f1-score   support

    Negative       0.55      0.98      0.70      6157
    Positive       0.93      0.22      0.35      6343

    accuracy                           0.59     12500
   macro avg       0.74      0.60      0.53     12500
weighted avg       0.74      0.59      0.53     12500


LR (TF-IDF):
              precision    recall  f1-score   support

    Negative       0.89      0.87      0.88      6157
    Positive       0.88      0.90      0.89      6343

    accuracy                           0.89     12500
   macro avg       0.89      0.89      0.89     12500
weighted avg       0.89      0.89      0.89     12500



Using Linear Support Vector Classifier (LinearSVC) as the second model

In [None]:
#training the model
linear_svc1 = LinearSVC(C=0.5, random_state=42)
linear_svc2 = LinearSVC(C=0.5, random_state=42)

#Fitting the model for Bag of words
svc_cv=linear_svc1.fit(cv_train,y_train)
print(svc_cv)

#Fitting the model for tfidf features
svc_tfidf=linear_svc2.fit(tfidf_train,y_train)
print(svc_tfidf)

LinearSVC(C=0.5, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=42, tol=0.0001,
          verbose=0)
LinearSVC(C=0.5, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=42, tol=0.0001,
          verbose=0)


SVC model prediction on test dataset

In [None]:
#Predicting the model for bag of words
predict_cv = linear_svc1.predict(cv_test)
print(predict_cv)

##Predicting the model for tfidf features
predict_tfidf = linear_svc2.predict(tfidf_test)
print(predict_tfidf)

[0 0 0 ... 0 0 0]
[0 1 0 ... 0 1 1]


Accuracy of SVC Model

In [None]:
#Accuracy score for bag of words
print('Accuracy score (CV): \n', accuracy_score(y_test, predict_cv))

#Accuracy score for tfidf features
print('Accuracy score (TFIDF): \n', accuracy_score(y_test, predict_tfidf))

Accuracy score (CV): 
 0.50656
Accuracy score (TFIDF): 
 0.90296


Confusion Matrix for SVC model

In [None]:
#confusion matrix for bag of words
print('Confusion Matrix (CV): \n', confusion_matrix(y_test, predict_cv))

#confusion matrix for tfidf features
print('Confusion Matrix (TFIDF): \n', confusion_matrix(y_test, predict_tfidf))

Confusion Matrix (CV): 
 [[6149    8]
 [6160  183]]
Confusion Matrix (TFIDF): 
 [[5468  689]
 [ 524 5819]]
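
These matrices were produced without a `labels` argument, so rows and columns follow the sorted label order (0 = negative first), whereas the logistic regression cell passed `labels=[1,0]`. The pure-Python sketch below shows how the same predictions yield differently oriented matrices. The function `confusion` and the toy labels are illustrative, not scikit-learn's implementation.

```python
def confusion(y_true, y_pred, labels):
    """Count (true, predicted) pairs; row/column order follows `labels`."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix


y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1]
default_order = confusion(y_true, y_pred, labels=[0, 1])   # negative class first
positive_first = confusion(y_true, y_pred, labels=[1, 0])  # positive class first
print(default_order)
print(positive_first)
```

Read with negative first, the bag-of-words matrix above says that 6160 of the 6343 positive reviews were predicted negative, consistent with the near-chance accuracy of 0.50656.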


Classification report for SVC model

In [None]:
#Classification report for bag of words 
print('SVC (CV):')
print('Classification Report (CV): \n', classification_report(y_test, predict_cv, target_names = ['Negative', 'Positive']))

#Classification report for tfidf features
print('\nSVC (TF-IDF):')
print('Classification Report (TFIDF): \n', classification_report(y_test, predict_tfidf, target_names = ['Negative', 'Positive']))

SVC (CV):
Classification Report (CV): 
               precision    recall  f1-score   support

    Negative       0.50      1.00      0.67      6157
    Positive       0.96      0.03      0.06      6343

    accuracy                           0.51     12500
   macro avg       0.73      0.51      0.36     12500
weighted avg       0.73      0.51      0.36     12500


SVC (TF-IDF):
Classification Report (TFIDF): 
               precision    recall  f1-score   support

    Negative       0.91      0.89      0.90      6157
    Positive       0.89      0.92      0.91      6343

    accuracy                           0.90     12500
   macro avg       0.90      0.90      0.90     12500
weighted avg       0.90      0.90      0.90     12500

