# IMDB Sentiment Classification

## Problem

Using the movie review document classifier discussed in this chapter, generate a classifier to classify the sentiment of the reviews. Each review can have positive or negative sentiment.

## Solution

For this classification project, I downloaded the [IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment). We will load and clean the dataset, then build a convolutional neural network (CNN) on top of a word embedding to predict the sentiment of each review.

As an initial step, we import all the necessary packages. Because we are building a neural network, we use Keras with TensorFlow as the backend.

In [275]:
from keras.preprocessing.text import text_to_word_sequence, Tokenizer
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, Model
from keras.layers.embeddings import Embedding
from keras.layers import Dense  # needed for the fully connected layers of the model
from keras.layers import LSTM
import numpy as np
from keras.datasets import imdb
from keras.layers import Flatten
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
import pandas as pd
from sklearn.metrics import classification_report,confusion_matrix
import glob
from nltk.corpus import stopwords

Because the dataset is large, we download it into a separate folder and load it from there. It contains positive and negative reviews, already split into train and test sets.

In [276]:
# Dataset files
train_p_files = glob.glob("./data/train/pos/*.txt")
train_n_files = glob.glob("./data/train/neg/*.txt")
test_p_files = glob.glob("./data/test/pos/*.txt")
test_n_files = glob.glob("./data/test/neg/*.txt")
train_files = [train_p_files,train_n_files]
test_files=[test_p_files,test_n_files]

In [277]:
# X_train and y_train variables. Load positive reviews as 1 and negative reviews as 0.

X_train = []
y_train = []

for file_list in train_files:
    for txt in file_list:
        with open(txt, encoding='utf8') as f:

            # Positive reviews: the path contains both 'train' and 'pos'
            if 'train' in txt and 'pos' in txt:
                X_train.append(f.readlines())
                y_train.append(1)

            # Negative reviews: the path contains both 'train' and 'neg'
            elif 'train' in txt and 'neg' in txt:
                X_train.append(f.readlines())
                y_train.append(0)


In [300]:
# X_test and y_test variables. Load positive reviews as 1 and negative reviews as 0.

X_test = []
y_test = []

for file_list in test_files:
    for txt in file_list:
        with open(txt, encoding='utf8') as f:

            # Positive reviews: the path contains both 'test' and 'pos'
            if 'test' in txt and 'pos' in txt:
                X_test.append(f.readlines())
                y_test.append(1)

            # Negative reviews: the path contains both 'test' and 'neg'
            elif 'test' in txt and 'neg' in txt:
                X_test.append(f.readlines())
                y_test.append(0)
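
The label can also be derived from the folder name rather than from substring checks on the path. The sketch below is one possible way to do that; `load_reviews` is a hypothetical helper, not part of the notebook, and it returns each review as a single string instead of the list of lines produced by `readlines()`. The demo runs on a throwaway directory, not on the real dataset.

```python
import tempfile
from pathlib import Path


def load_reviews(split_dir):
    """Read every review under split_dir/pos and split_dir/neg.

    Returns (texts, labels) with label 1 for positive and 0 for negative.
    """
    texts, labels = [], []
    for folder, label in (("pos", 1), ("neg", 0)):
        # sorted() keeps the file order reproducible across runs
        for path in sorted(Path(split_dir, folder).glob("*.txt")):
            texts.append(path.read_text(encoding="utf8"))
            labels.append(label)
    return texts, labels


# Tiny self-contained demo on a temporary directory (made-up reviews).
with tempfile.TemporaryDirectory() as tmp:
    for folder, text in (("pos", "Great movie."), ("neg", "Dull plot.")):
        Path(tmp, folder).mkdir()
        Path(tmp, folder, "0_1.txt").write_text(text, encoding="utf8")
    demo_texts, demo_labels = load_reviews(tmp)

print(demo_texts, demo_labels)
```

With the layout used in this notebook, the call would presumably be `load_reviews("./data/train")` and `load_reviews("./data/test")`.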

In [303]:
# Sample test data
X_test[0]

["I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge."]

Once the dataset is loaded, we need to preprocess it. We will map the most frequent words to integer indices and feed those indices to a word embedding layer, whose output becomes the input to the rest of the deep neural network.

As a first step, we remove punctuation and convert the text to lowercase. The list comprehensions below do this and join the tokens back into sentences for the train and test datasets.

In [279]:
parsed_train_txt = []
parsed_test_txt = []

# text_to_word_sequence removes punctuation, converts to lowercase and tokenizes the text.

parsed_train_txt = [' '.join(text_to_word_sequence(X[0])) for X in X_train] 

parsed_test_txt = [' '.join(text_to_word_sequence(X[0])) for X in X_test] 
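
To make the cleaning step concrete, here is a rough pure-Python approximation of what `text_to_word_sequence` does: lowercase the text, replace punctuation with spaces, and split on whitespace. The function name `simple_word_sequence` and the exact `FILTERS` string are assumptions made for this sketch; the Keras implementation should be treated as the reference.

```python
# Characters treated as separators in this sketch; an assumption modelled on
# common punctuation. Apostrophes are deliberately kept inside words.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=FILTERS):
    """Lowercase the text, replace filtered characters with spaces, then split."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()


sample = "I was wrong. Kutcher played the character of Jake Fischer very well!"
tokens = simple_word_sequence(sample)
print(tokens)
print(" ".join(tokens))  # joined back into a cleaned sentence
```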


Once the unwanted symbols are removed, we need to convert the text to numbers. We will follow the steps below:

1. Initialize a tokenizer with a maximum vocabulary size.
2. The vocabulary consists of the most frequently used words in the training data. In our case we have chosen 6000.
3. Fit the tokenizer on the training data, then apply the texts_to_sequences method to the train and test datasets.
4. The output of step 3 is, for each document, a list containing the index of every frequently used word in that document.
5. To standardize the input, all documents must have a common length, so we apply pad_sequences to them.
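
Steps 1 to 4 can be sketched in plain Python to show the idea of a frequency-ranked vocabulary. This is an illustration only: `fit_word_index` and `texts_to_indices` are hypothetical helpers, ties are broken alphabetically (an assumption made here so the result is deterministic), and words outside the vocabulary are simply dropped.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_indices(texts, word_index, num_words):
    """Map words to indices, keeping only indices below num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


docs = ["the movie was great", "the plot was dull", "the movie was the best"]
index = fit_word_index(docs)
sequences = texts_to_indices(docs, index, num_words=4)
print(index)
print(sequences)
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value in step 5.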

In [280]:
# Perform tokenizing as mentioned above.

tokenizer = Tokenizer(num_words=6000)

#Fit on the train data
tokenizer.fit_on_texts(parsed_train_txt)
#print(tokenizer.word_index)

#Apply it to train and test dataset
X_train_txt = tokenizer.texts_to_sequences(parsed_train_txt)
X_test_txt = tokenizer.texts_to_sequences(parsed_test_txt)

In [281]:
# To standardize the input data, convert it to a common length. Here we have chosen 500.

X_train = pad_sequences(X_train_txt,maxlen=500)
X_test = pad_sequences(X_test_txt,maxlen=500)
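
By default, pad_sequences pads and truncates at the front of each sequence, so a short review ends up as a run of zeros followed by its word indices. A minimal sketch of that behaviour, assuming `maxlen` is at least 1 (`pad_left` is a hypothetical name):

```python
def pad_left(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last maxlen items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                       # drop items from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


print(pad_left([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4))
# prints [[0, 0, 5, 8], [3, 4, 5, 6]]
```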

In [304]:
# Train data after pad_sequences
X_train[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

We have now completed all the preprocessing steps. Next we create a deep neural network with the layers below.

1. The layers form a simple linear stack, so we create a Sequential model in Keras.
2. Add a word embedding layer to the model. The vocabulary size is 6000, the total number of words that will be considered. The other parameters are the embedding dimension (32) and the document input length (500).
3. Add a 1-dimensional convolution layer with a kernel size of 3, same padding and ReLU activation. Same padding keeps the sequence length at 500 while the filters extract local features from neighbouring words.
4. Next, apply a 1D max pooling layer. This halves the sequence length and keeps only the strongest features.
5. That completes the convolutional part. We then flatten the output and feed it into a fully connected network.
6. Add two hidden layers with 500 and 250 units to give the classifier more capacity, followed by a single sigmoid output unit.

Finally, compile the model with the RMSprop optimizer and binary cross-entropy loss.
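
The effect of steps 3 and 4 on the sequence length can be seen on a toy example. The sketch below is a single-channel, pure-Python illustration and not what Keras runs internally; `conv1d_same` and `max_pool1d` are hypothetical helpers, and an odd kernel size is assumed so that the zero padding is symmetric.

```python
def conv1d_same(signal, kernel):
    """Single-channel 1D convolution with zero 'same' padding, followed by ReLU."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    out = []
    for i in range(len(signal)):
        window = padded[i:i + len(kernel)]
        total = sum(w * k for w, k in zip(window, kernel))
        out.append(max(0.0, total))   # ReLU keeps only positive responses
    return out


def max_pool1d(signal, pool_size=2):
    """Keep the largest value in each non-overlapping window."""
    return [max(signal[i:i + pool_size])
            for i in range(0, len(signal) - pool_size + 1, pool_size)]


x = [1.0, 3.0, -2.0, 4.0, 0.0, 1.0]
features = conv1d_same(x, kernel=[1.0, 0.0, -1.0])
pooled = max_pool1d(features)
print(len(x), len(features), len(pooled))   # the convolution keeps the length, pooling halves it
```

In the real model the same thing happens along the 500-step axis, with 32 filters applied in parallel across all 32 embedding channels.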

In [290]:
#Sequential model
model = Sequential()
# 6000 voc. size, 32 dimension, 500 length
model.add(Embedding(6000,32,input_length=500))

# Add Convolution 1D layer
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))

# Add Maxpooling 1D layer
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
# Add two hidden layers
model.add(Dense(500,activation='relu'))
model.add(Dense(250,activation='relu'))

#Finally predict the output using sigmoid 
model.add(Dense(1,activation='sigmoid'))
#Compile the model
model.compile(optimizer='rmsprop',metrics=['accuracy'],loss='binary_crossentropy')
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 500, 32)           192000    
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 500, 32)           3104      
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 250, 32)           0         
_________________________________________________________________
flatten_9 (Flatten)          (None, 8000)              0         
_________________________________________________________________
dense_14 (Dense)             (None, 500)               4000500   
_________________________________________________________________
dense_15 (Dense)             (None, 250)               125250    
_________________________________________________________________
dense_16 (Dense)             (None, 1)                 251       
Total para
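
The per-layer values in the Param # column can be checked by hand from the layer sizes. The arithmetic below is a small sanity check using only the hyperparameters chosen above; the variable names are ours.

```python
vocab_size, embed_dim, seq_len = 6000, 32, 500
filters, kernel_size, pool_size = 32, 3, 2

embedding = vocab_size * embed_dim                    # one 32-d vector per word index
conv = kernel_size * embed_dim * filters + filters    # kernel weights plus one bias per filter
flat = (seq_len // pool_size) * filters               # pooled length times channels
dense_500 = flat * 500 + 500                          # weights plus biases
dense_250 = 500 * 250 + 250
output = 250 * 1 + 1

for name, count in [("embedding", embedding), ("conv1d", conv),
                    ("dense_500", dense_500), ("dense_250", dense_250),
                    ("output", output)]:
    print(f"{name:>10}: {count}")
print("sum of the rows:", embedding + conv + dense_500 + dense_250 + output)
```

Almost all of the parameters sit in the first dense layer, because it connects every one of the 8000 flattened features to 500 units.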

Once the model is built, we fit it on the training data and validate it on the test data. The batch size is 128 and the number of epochs is 10 (adjust this to your hardware).

In [291]:
# Model fitting. The labels are Python lists, so convert them to NumPy arrays first.
model.fit(X_train,np.array(y_train),validation_data=(X_test,np.array(y_test)),batch_size=128,epochs=10)

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x46116a20>

Predict the sentiment of the test dataset using Keras. The predict method returns the probability of the positive class, which is useful when more detail is needed. For now we use the predict_classes method, which returns the final sentiment label.

In [None]:
# Accuracy of the model. evaluate returns [loss, accuracy] for this model.
loss, accuracy = model.evaluate(X_test,np.array(y_test))
print("Accuracy of the model: {:.2f}%".format(accuracy*100))

In [293]:
y_pred = model.predict_classes(X_test)
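
For a single sigmoid output, the class label is just the probability thresholded at 0.5. This matters because predict_classes was removed from Sequential models in later TensorFlow releases; there, the same labels can be obtained from the output of predict. A minimal sketch with made-up probabilities (`to_classes` is a hypothetical helper):

```python
def to_classes(probabilities, threshold=0.5):
    """Turn sigmoid outputs of shape (n, 1) into 0/1 labels."""
    return [int(row[0] > threshold) for row in probabilities]


# Made-up values standing in for the output of model.predict(X_test).
example_probs = [[0.91], [0.12], [0.50], [0.67]]
print(to_classes(example_probs))   # prints [1, 0, 0, 1]
```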



In [301]:
# Create a confusion matrix using sklearn. Rows are true labels, columns are predicted labels.

print("Confusion matrix of test dataset: \n")
confusion_matrix(y_test,y_pred)

Confusion matrix of test dataset: 



array([[11256,  1244],
       [ 2036, 10464]], dtype=int64)
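
Accuracy, precision and recall for the positive class follow directly from these four numbers. The short calculation below uses the matrix exactly as printed above, with rows as true labels and columns as predictions.

```python
# Rows are true labels (0 = negative, 1 = positive), columns are predictions.
matrix = [[11256, 1244],
          [2036, 10464]]

(tn, fp), (fn, tp) = matrix
accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # share of predicted positives that are truly positive
recall = tp / (tp + fn)      # share of truly positive reviews that were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The model misses more positive reviews (2036 false negatives) than it mislabels negative ones (1244 false positives), so recall on the positive class is lower than precision.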

Each layer has its own weights. The embedding layer, for example, holds a 32-dimensional vector for each of the 6000 vocabulary entries. Below are the shapes of the model's weight arrays.

In [306]:
for w in model.get_weights():
    print(w.shape)

(6000, 32)
(3, 32, 32)
(32,)
(8000, 500)
(500,)
(500, 250)
(250,)
(250, 1)
(1,)


## Summary

1. We preprocessed the text data and converted it into integer sequences suitable for a word embedding layer.
2. We created a deep neural network with convolution and max pooling layers to extract the important features in the dataset.
3. We added fully connected hidden layers to predict the sentiment of the reviews.
4. The model gives an accuracy of around 87% on the test set (21,720 of 25,000 reviews classified correctly).
5. Accuracy might be increased by adding more hidden layers and tuning the hyperparameters. Pre-trained word embeddings can also be used to aim for higher accuracy.