**Project Objective:** To perform sentiment analysis on the IMDB movie reviews dataset

**Author:** Amarjeet S Cheema

**Link to download the data:** https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews


In [3]:
#Import necessary libs
import tensorflow as tf
import tensorflow_hub as hub
import pandas as pd


In [4]:
#Check tensorflow version
tf.__version__

'2.4.0'

In [5]:
#Import necessary libs
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout

In [13]:
#Load the data to the dataframe
data= pd.read_csv('/content/IMDB Dataset.csv')


In [14]:
#Check the data
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [15]:
#Create a copy of the original dataset to work on (optional, keeps the raw data untouched)
df=data.copy()

In [16]:
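#Drop the rows that contain missing values (dropna must be called, not just referenced)
df.dropna(inplace=True)
#Renumber the rows; the old index is kept as a new 'index' column
df.reset_index(inplace=True)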

In [17]:
#Separate Independent and Dependent variables
X=df['review']
X.head()

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. <br /><br />The...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: review, dtype: object

In [18]:
#Separate Independent and Dependent variables
y=df['sentiment']
y.head()

0    positive
1    positive
2    positive
3    negative
4    positive
Name: sentiment, dtype: object

In [19]:
#Check a sample review
df['review'][6]

"I sure would like to see a resurrection of a up dated Seahunt series with the tech they have today it would bring back the kid excitement in me.I grew up on black and white TV and Seahunt with Gunsmoke were my hero's every week.You have my vote for a comeback of a new sea hunt.We need a change of pace in TV and this would work for a world of under water adventure.Oh by the way thank you for an outlet like this to view many viewpoints about TV and the many movies.So any ole way I believe I've got what I wanna say.Would be nice to read some more plus points about sea hunt.If my rhymes would be 10 lines would you let me submit,or leave me out to be in doubt and have me to quit,If this is so then I must go so lets do it."

In [20]:
#check some rows using df.head()
df.head()

Unnamed: 0,index,review,sentiment
0,0,One of the other reviewers has mentioned that ...,positive
1,1,A wonderful little production. <br /><br />The...,positive
2,2,I thought this was a wonderful way to spend ti...,positive
3,3,Basically there's a family where a little boy ...,negative
4,4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [21]:
#Check the number of reviews
len(df)

50000

In [22]:
#Download the stopwords from nltk - We only need to do this one time
import nltk
nltk.download('stopwords') 


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
#Text preprocessing: keep letters only, lowercase, remove stop words, and stem
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
ps = PorterStemmer()
stop_words = set(stopwords.words('english')) # build the stop-word set once instead of once per word
corpus = []
for i in range(0, len(df)):
    review = re.sub('[^a-zA-Z]', ' ', df['review'][i]) # replace every character other than a-z and A-Z with a space
    review = review.lower()
    review = review.split() # split the review into a list of words
    
    review = [ps.stem(word) for word in review if word not in stop_words] # stem every word that is not a stop word
    review = ' '.join(review) # join the stemmed words back into one space-separated string
    corpus.append(review) # add the cleaned review to the corpus
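
Stop-word lookups are fastest against a `set`, and the regex `[^a-zA-Z]` turns `<br />` tags into stray `br` tokens, which show up in the cleaned reviews. The following is a minimal sketch, not the notebook's pipeline: it assumes a tiny hypothetical stop-word set in place of NLTK's list, skips stemming, and strips HTML tags before the letter filter. `clean_review`, `STOP_WORDS`, and `strip_html` are illustrative names.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the notebook uses
# NLTK's English list, which could be wrapped as set(stopwords.words('english')).
STOP_WORDS = {"a", "the", "is", "of", "this", "was", "to", "and"}

TAG_RE = re.compile(r"<[^>]+>")          # matches HTML tags such as <br />
NON_LETTER_RE = re.compile(r"[^a-zA-Z]")  # same character filter as the notebook


def clean_review(text, strip_html=True):
    """Lowercase, keep letters only, and drop stop words (no stemming here)."""
    if strip_html:
        text = TAG_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text).lower()
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


sample = "A wonderful little production. <br /><br />The filming is great!"
with_tags = clean_review(sample, strip_html=False)
without_tags = clean_review(sample)
print(with_tags)     # the 'br' tokens survive the letter filter
print(without_tags)  # tags removed before filtering
```

Whether removing the tags changes the final accuracy is not measured here; the sketch only shows where the `br` tokens come from.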
    

In [None]:
#Again check the number of reviews
len(corpus)

50000

In [None]:
##Check the first review after data preprocessing
corpus[0]

'one review mention watch oz episod hook right exactli happen br br first thing struck oz brutal unflinch scene violenc set right word go trust show faint heart timid show pull punch regard drug sex violenc hardcor classic use word br br call oz nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda em citi home mani aryan muslim gangsta latino christian italian irish scuffl death stare dodgi deal shadi agreement never far away br br would say main appeal show due fact goe show dare forget pretti pictur paint mainstream audienc forget charm forget romanc oz mess around first episod ever saw struck nasti surreal say readi watch develop tast oz got accustom high level graphic violenc violenc injustic crook guard sold nickel inmat kill order get away well manner middl class inmat turn prison bitch due lack street skill prison experi watch oz may becom comfort uncomfort view that get touch da

In [None]:
### Set the vocabulary size (a hyperparameter chosen by the programmer)
voc_size=5000

In [None]:
#Hash each word of every review to an integer index below voc_size (despite its name, one_hot returns word indices, not one-hot vectors)
onehot_repr=[one_hot(words,voc_size) for words in corpus]
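
Keras's `one_hot` hashes each word into the range 1 to `voc_size - 1`, so two different words can share an index, and index 0 stays free for padding. The sketch below illustrates that hashing idea with a stable MD5 hash from the standard library; it is an approximation for explanation, not Keras's exact implementation, and `hash_encode` is a hypothetical helper name.

```python
import hashlib


def hash_encode(text, n):
    """Map each word to an integer in [1, n - 1] using a stable MD5 hash."""
    indices = []
    for word in text.split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (n - 1) + 1)  # 0 is never produced
    return indices


encoded = hash_encode("good movi good act", 5000)
print(encoded)  # the first and third entries are identical (same word)
```

A smaller `n` raises the chance of collisions between unrelated words, which is the trade-off behind the vocabulary size chosen above.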


In [None]:
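#Find the length of the longest review
#Note: len(corpus[i]) counts characters, not words, so this is a generous upper bound on the number of words per review
le=[]
for i in range(len(corpus)):
  new = len(corpus[i])
  le.append(new)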
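
Calling `len` on a string returns its number of characters, while the padded sequences hold one entry per word. A small sketch on two made-up cleaned reviews (the `docs` list is illustrative, not taken from the dataset) shows how far the two measures can differ:

```python
docs = ["one review mention watch oz", "wonder littl product"]

max_chars = max(len(doc) for doc in docs)           # what len(doc) measures
max_tokens = max(len(doc.split()) for doc in docs)  # number of words per review

print(max_chars, max_tokens)  # 27 5
```

Using the word count as `maxlen` would give much shorter padded sequences, at the cost of shapes that differ from the ones shown in this notebook.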

In [None]:
#Pad every review with leading zeros so that all sequences have length sent_length
sent_length=max(le)
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[   0    0    0 ... 1199 3262 4214]
 [   0    0    0 ... 1427 1137 4254]
 [   0    0    0 ... 2108 3994 4339]
 ...
 [   0    0    0 ...  726  302 2973]
 [   0    0    0 ... 4926 2583 4084]
 [   0    0    0 ... 1047 4091 2589]]
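
With `padding='pre'`, zeros are added in front of each sequence, which is why every row printed above starts with 0. The pure-Python sketch below mimics that behaviour, together with truncation from the front (the `pad_sequences` default), on two toy sequences; `pad_pre` is a hypothetical helper written for illustration.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or front-truncate) every sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


rows = pad_pre([[7, 8], [1, 2, 3, 4, 5]], maxlen=4)
print(rows)  # [[0, 0, 7, 8], [2, 3, 4, 5]]
```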


## Remember to check your data after every step: it makes debugging much easier

In [2]:
#Check the first review embedded doc after padding
embedded_docs[0]

NameError: ignored

## **Create LSTM model from scratch**

In [None]:
embedding_vector_features=40 #every word will be converted to a 40-dimensional vector
model=Sequential() #Start building a sequential model
model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length)) #Add an embedding layer that maps each word index to a dense vector
model.add(Dropout(0.3)) #Add a dropout for regularization
model.add(LSTM(100)) #Add an LSTM layer with 100 units
model.add(Dropout(0.3)) #Add a dropout for regularization
model.add(Dense(1,activation='sigmoid')) #Add a dense output layer with a sigmoid function to predict probabilities
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy']) #Compile the model using the adam optimizer
print(model.summary()) #Check the model architecture
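
The summary output is not shown here, but the trainable parameter counts can be worked out by hand. The sketch below assumes a standard LSTM layer with four gates and a bias term; Dropout layers contribute no parameters. The variable names are illustrative.

```python
voc_size = 5000
embedding_dim = 40
lstm_units = 100

# Embedding: one 40-dimensional vector per vocabulary index
embedding_params = voc_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 200000 56400 101 256501
```

Most of the parameters sit in the embedding table, so the vocabulary size and embedding dimension dominate the model size.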

In [None]:
#Check the number of reviews and the number of labels
len(embedded_docs),y.shape

(50000, (50000,))

In [None]:
#Convert the embedded_docs and y to array
import numpy as np
X_final=np.array(embedded_docs)
y_final=np.array(y)

In [None]:
#Check the shape
X_final.shape,y_final.shape

((50000, 8350), (50000,))

In [None]:
#Check the target label
y_final

array(['positive', 'positive', 'positive', ..., 'negative', 'negative',
       'negative'], dtype=object)

In [None]:
#Create dummies for categorical target label
y_new= pd.get_dummies(y_final)


Unnamed: 0,negative,positive
0,0,1
1,0,1
2,0,1
3,1,0
4,0,1
...,...,...
49995,0,1
49996,1,0
49997,1,0
49998,1,0


In [None]:
#Either column can serve as the target; use 'positive' so that 1 means a positive review
y_label=y_new['positive'].to_numpy()
y_label

array([1, 1, 1, ..., 0, 0, 0], dtype=uint8)

In [None]:
#Test train split 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_label, test_size=0.33, random_state=42)

In [None]:
# Train the model
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f84fd8aa668>

In [None]:
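#Apply the model on the test data
#predict_classes is deprecated and missing from newer TensorFlow releases; thresholding the predicted probabilities at 0.5 gives the same labels for a sigmoid output
y_pred=(model.predict(X_test) > 0.5).astype("int32").ravel()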




In [None]:
#Check the confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred)
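
For binary 0/1 labels, scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, that is `[[TN, FP], [FN, TP]]`. The sketch below rebuilds that layout in plain Python on made-up labels; `confusion_counts` and the demo lists are hypothetical and are not the model's results.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1  # row = true class, column = predicted class
    return matrix


y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0]

demo_matrix = confusion_counts(y_true_demo, y_pred_demo)
demo_accuracy = (demo_matrix[0][0] + demo_matrix[1][1]) / len(y_true_demo)
print(demo_matrix)  # [[2, 1], [1, 2]]
```

The accuracy is the sum of the diagonal divided by the number of samples, which links this matrix to the accuracy score computed next.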

In [None]:
#Check the accuracy of the model on the testing data
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.8559393939393939
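
The score can be translated back into review counts. This sketch assumes the test split holds 33% of the 50,000 reviews, as set by `test_size=0.33`, and uses the accuracy printed above; the variable names are illustrative.

```python
n_reviews = 50000
test_fraction = 0.33
accuracy = 0.8559393939393939

n_test = round(test_fraction * n_reviews)  # reviews held out for testing
n_train = n_reviews - n_test               # reviews used for training
n_correct = round(accuracy * n_test)       # correctly classified test reviews
n_wrong = n_test - n_correct               # misclassified test reviews

print(n_train, n_test, n_correct, n_wrong)  # 33500 16500 14123 2377
```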

We can likely improve the accuracy further by tuning the hyperparameters and changing the network architecture, but doing so requires more powerful compute resources.
If you liked this project, kindly follow me at https://github.com/amarjeet-cheema