PROJECT : SENTIMENT ANALYSIS (IMDB REVIEWS) 

AUTHOR  : GANJI VARSHITHA

ROLL NO : AI20BTECH11009

ML MODEL: RNN


In [94]:
#Importing all the required libraries
import numpy as np
import pandas as pd #For loading and handling dataset
import re
from string import punctuation
import nltk #for nlp
from nltk.corpus import stopwords # for the collection of stopping words
nltk.download('stopwords')
from sklearn.model_selection import train_test_split #for splitting the data into training and testing
from tensorflow import keras
from keras.preprocessing.text import Tokenizer #To encode the text into integer array
from keras.preprocessing.sequence import pad_sequences #Helps in padding and truncating the sequence
import matplotlib.pyplot as plt #For plotting graphs
from keras.models import Sequential, load_model #We are using sequential model and we'll also load(call) the saved model
from keras.layers import Dense, LSTM, Embedding, Dropout #Layers in RNN architecture
from keras.callbacks import ModelCheckpoint #Helps to save the model


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Loading the dataset 

In [95]:
dataset = pd.read_csv('movie_data.csv')
print(dataset.tail()) #Previews the last five rows of the data
dataset.describe() #Shows summary statistics of the numeric column

                                                  review  sentiment
49995  The 1998 version of "Psycho" needed to be set ...          0
49996  IT IS So Sad. Even though this was shot with f...          0
49997  Over several years of looking for half-decent ...          0
49998  ***Possible Plot Spoilers***<br /><br />I ador...          0
49999  While I can't say whether or not Larry Hama ev...          1


       sentiment
count  50000.0
mean   0.5
std    0.500005
min    0.0
25%    0.0
50%    0.5
75%    1.0
max    1.0


In [96]:
#Exploring the data
#Finding the unique output labels
Classes = np.unique(dataset['sentiment'])
#Counting the unique reviews (whole review texts, not individual words)
Num_unique_reviews = len(np.unique(dataset['review']))

#printing observations
print('Classes: ',Classes)
print('Number of unique reviews: ',Num_unique_reviews)


Classes:  [0 1]
Number of unique reviews:  49582


In [97]:

stop_words = set(stopwords.words('english'))#creating a set of stop words (a set gives faster membership checks than a list)




def training_samples():
  dataset=pd.read_csv('movie_data.csv')
  input_data=dataset['review']
  output_data=dataset['sentiment']
  #pre-processing data
  input_data=input_data.apply(lambda x: x.lower())#making the words lowercase
  input_data = input_data.apply(lambda x:''.join([c for c in x if c not in punctuation]))#removing punctuation characters
  input_data=input_data.apply(lambda x : [i for i in x.split() if i not in stop_words]) #removing stopwords
  
  return input_data, output_data



input_data, output_data = training_samples()


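#Finding the average number of words per review (used to choose the padded length)
#Despite its name, max_length holds the mean review length, not the maximum
length = [len(i) for i in input_data]
max_length=np.mean(length)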
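
The three preprocessing steps above can be tried on a single string. This is only a minimal sketch: `clean_review`, `DEMO_STOP_WORDS` and the sample sentence are hypothetical names and data introduced for illustration, and the tiny stop-word set stands in for NLTK's full English list.

```python
from string import punctuation

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook uses NLTK's full English list instead.
DEMO_STOP_WORDS = {"i", "the", "this", "was", "a", "it"}

def clean_review(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words from one review."""
    text = text.lower()
    # Punctuation is deleted, not replaced by a space
    text = "".join(c for c in text if c not in punctuation)
    return [w for w in text.split() if w not in stop_words]

sample = "I adored this movie!<br /><br />It was a GREAT ride."
demo_tokens = clean_review(sample)
print(demo_tokens)
```

In this sketch the HTML line breaks are not removed as tags: only the characters `<`, `/` and `>` are stripped, so `br` fragments remain and can fuse with a neighbouring word. Replacing `<br />` with a space before the punctuation step would be one possible way to avoid that.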




TOKENIZE:

A neural network takes numerical input, so the review text must be encoded as integers.


*   Each unique word is indexed using fit_on_texts method
*   Training and testing inputs are converted to integers using texts_to_sequences method

Also, the reviews have different lengths, so every sequence is truncated or padded with 0 to the same length (i.e. the average review length).
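
The indexing, encoding and padding steps described above can be sketched in plain Python. This is not the Keras implementation; `build_word_index`, `encode` and `pad_post` are hypothetical helpers that only imitate the behaviour assumed here: indices start at 1 with the most frequent word first, words not in the index are skipped, and padding and truncation both happen at the end of the sequence.

```python
from collections import Counter

def build_word_index(tokenized_reviews):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(w for review in tokenized_reviews for w in review)
    # Index 0 is left unused so it can serve as the padding value.
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def encode(review, word_index):
    """Replace words with their indices; unseen words are skipped."""
    return [word_index[w] for w in review if w in word_index]

def pad_post(seq, maxlen, value=0):
    """Truncate at the end, then pad at the end, to exactly maxlen items."""
    seq = seq[:maxlen]
    return seq + [value] * (maxlen - len(seq))

reviews = [["great", "film", "great", "cast"], ["dull", "film"]]
word_index = build_word_index(reviews)
padded = [pad_post(encode(r, word_index), maxlen=3) for r in reviews]
print(word_index)
print(padded)
```

The first toy review is cut from four words to three and the second is padded with one 0, which is the same effect that padding='post' and truncating='post' have on the real reviews.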



In [98]:
tokens= Tokenizer(lower=False)# Since the data is converted to lowercase before
tokens.fit_on_texts(input_data)
input_data = tokens.texts_to_sequences(input_data)

input_data = pad_sequences(input_data,maxlen=127,padding='post',truncating='post')

train_input,test_input,train_output,test_output=train_test_split(input_data,output_data,test_size=0.2)#20% of the data is held out for testing

total_words=len(tokens.word_index) + 1 #+1 because index 0 is reserved for padding



Building RNN Architecture

In [99]:
embed=32 # dimension of the embedding vectors
LSTM_SIZE=64 #number of hidden units in the LSTM layer
model = Sequential()
model.add(Embedding(total_words,embed,input_length=127))
model.add(LSTM(LSTM_SIZE))
model.add(Dense(1,activation='sigmoid'))#activation is sigmoid as output is either 0 or 1
model.compile(optimizer='adam',loss='binary_crossentropy',metrics =['accuracy'])
print(model.summary())


Model: "sequential_18"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_16 (Embedding)     (None, 127, 32)           5809408   
_________________________________________________________________
lstm_15 (LSTM)               (None, 64)                24832     
_________________________________________________________________
dense_14 (Dense)             (None, 1)                 65        
Total params: 5,834,305
Trainable params: 5,834,305
Non-trainable params: 0
_________________________________________________________________
None
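
The parameter counts in the summary can be checked by hand. The sketch below assumes a standard LSTM layer (four weight sets: three gates plus the candidate cell state, each with input weights, recurrent weights and a bias) and infers the vocabulary size from the Embedding row, since the value of total_words is not printed in the notebook.

```python
embed_dim = 32
lstm_units = 64
vocab_size = 5809408 // embed_dim  # inferred from the Embedding row of the summary

# One embedding vector of length embed_dim per vocabulary index
embedding_params = vocab_size * embed_dim
# Four weight sets, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embed_dim + lstm_units + 1) * lstm_units)
# 64 weights plus one bias for the single sigmoid output
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(vocab_size, embedding_params, lstm_params, dense_params, total)
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the size of the model.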


Training model

In [100]:
#Using mini-batch learning with batch_size 200 and 5 epochs
#Adding a ModelCheckpoint callback which saves the model whenever the training accuracy improves on the best value so far
checkpoint=ModelCheckpoint('sentiment/LSTM.h5',monitor='accuracy',save_best_only=True,verbose=2)
model.fit(train_input,train_output,batch_size=200,epochs=5,callbacks=[checkpoint])

Epoch 1/5

Epoch 00001: accuracy improved from -inf to 0.73652, saving model to sentiment/LSTM.h5
Epoch 2/5

Epoch 00002: accuracy improved from 0.73652 to 0.93155, saving model to sentiment/LSTM.h5
Epoch 3/5

Epoch 00003: accuracy improved from 0.93155 to 0.96978, saving model to sentiment/LSTM.h5
Epoch 4/5

Epoch 00004: accuracy improved from 0.96978 to 0.98677, saving model to sentiment/LSTM.h5
Epoch 5/5

Epoch 00005: accuracy improved from 0.98677 to 0.99182, saving model to sentiment/LSTM.h5


<keras.callbacks.History at 0x7f8e3c1b89d0>
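
The training log can be reproduced on paper. The sketch below is an illustration in plain Python, not Keras code: it computes the number of mini-batches per epoch from the split sizes and replays the "save only when improved" rule on the accuracies copied from the log above.

```python
import math

n_samples, batch_size = 50000, 200
n_test = n_samples // 5            # test_size=0.2
n_train = n_samples - n_test
steps_per_epoch = math.ceil(n_train / batch_size)

# Training accuracies copied from the log above
logged_accuracy = [0.73652, 0.93155, 0.96978, 0.98677, 0.99182]

best = float("-inf")
saved_epochs = []
for epoch, acc in enumerate(logged_accuracy, start=1):
    if acc > best:                 # same rule as save_best_only=True
        best = acc
        saved_epochs.append(epoch)

print(steps_per_epoch, saved_epochs)
```

Because the monitored value is the accuracy on the training data, it rises in every epoch and the file is overwritten each time. Monitoring a held-out validation accuracy instead would be one possible alternative, but that is not what this notebook does.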

TESTING THE MODEL

In [101]:
pred = model.predict(test_input)#predicting the labels
true=0
correct=0
ptrue=0
for i,y in enumerate(test_output):
  if pred[i]>0.5:#classifying the predicted label as positive if the confidence value is greater than 0.5, and negative otherwise
    ptrue+=1
  if y==1:
    true+=1
  if ((pred[i]>0.5 and  y==1) or (pred[i]<=0.5 and y==0)):#<= so that a score of exactly 0.5 counts as negative, matching the rule above
    correct+=1

print('Number of positive sentiment predictions:',ptrue)
print('Real positive sentiment : ',true)
print('Number of negative sentiment predictions:',len(test_input)-ptrue)
print('Real negative sentiment : ',len(test_input)-true)
print('Accuracy of the model is :',(correct/len(test_input))*100)


Number of positive sentiment predictions: 5272
Real positive sentiment :  5025
Number of negative sentiment predictions: 4728
Real negative sentiment :  4975
Accuracy of the model is : 85.87
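
The printed counts are enough to recover the full confusion matrix. The sketch below only rearranges the numbers reported above; it assumes that no prediction was exactly 0.5 and that the accuracy corresponds to a whole number of correct predictions.

```python
n_test = 5272 + 4728                    # total number of test reviews
pred_pos, real_pos = 5272, 5025         # predicted and real positives
correct = round(85.87 / 100 * n_test)   # number of correct predictions

# TP + FP = pred_pos, TP + FN = real_pos, TP + TN = correct,
# and the four cells add up to n_test, which gives:
tp = (correct + pred_pos + real_pos - n_test) // 2
fp = pred_pos - tp
fn = real_pos - tp
tn = correct - tp

precision = tp / pred_pos
recall = tp / real_pos
print(tp, fp, fn, tn)
print(round(precision, 4), round(recall, 4))
```

Under these assumptions the model makes more false positive errors than false negative ones, which matches the fact that it predicts more positive reviews (5272) than there really are (5025).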
