# Sentiment Analysis on IMDB Reviews using LSTM and Keras
created by Hans Michael
<hr>

### Steps
<ol type="1">
    <li>Load the dataset (50K IMDB Movie Review)</li>
    <li>Clean Dataset</li>
    <li>Encode Sentiments</li>
    <li>Split Dataset</li>
    <li>Tokenize and Pad/Truncate Reviews</li>
    <li>Build Architecture/Model</li>
    <li>Train and Test</li>
</ol>

<hr>
<i>Import all the libraries needed</i>

In [3]:
import pandas as pd    # to load dataset
import numpy as np     # for numerical operations (mean, ceil)
from nltk.corpus import stopwords   # to get collection of stopwords
from sklearn.model_selection import train_test_split       # for splitting dataset
from tensorflow.keras.preprocessing.text import Tokenizer  # to encode text to int
from tensorflow.keras.preprocessing.sequence import pad_sequences   # to do padding or truncating
from tensorflow.keras.models import Sequential, Model     # the model
from tensorflow.keras.layers import Embedding, LSTM, SimpleRNN, GRU, Dense, Dropout, Attention # layers of the architecture
from tensorflow.keras.callbacks import ModelCheckpoint   # save model
from tensorflow.keras import Input
import re

<hr>
<i>Preview dataset</i>

In [4]:
data = pd.read_csv('IMDB Dataset.csv')

print(data)

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


<hr>
<b>Stop words</b> are very common words in a language (e.g. "the", "a", "an", "of") that search engines are usually programmed to ignore.

<i>Declare the English stop words</i>

In [5]:
english_stops = set(stopwords.words('english'))

<hr>

### Load and Clean Dataset

In the original dataset, the reviews are still dirty: they contain HTML tags, numbers, uppercase letters, and punctuation. This is not good for training, so the <b>load_dataset()</b> function, besides loading the dataset with <b>pandas</b>, also pre-processes the reviews by removing HTML tags, replacing non-alphabetic characters (punctuation and numbers) with spaces, removing stop words, and lowercasing every word.

### Encode Sentiments
In the same function, I also encode the sentiments as integers: 0 for negative and 1 for positive.

In [6]:
def load_dataset():
    df = pd.read_csv('IMDB Dataset.csv')
    x_data = df['review']       # Reviews/Input
    y_data = df['sentiment']    # Sentiment/Output

    # PRE-PROCESS REVIEW
    x_data = x_data.replace({'<.*?>': ''}, regex = True)          # remove html tag
    x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # remove non alphabet
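    # NOTE: stop words are matched before lowercasing, so capitalized ones ("I", "The", "A") are not removed
    x_data = x_data.apply(lambda review: [w for w in review.split() if w not in english_stops])  # remove stop words
    x_data = x_data.apply(lambda review: [w.lower() for w in review])   # lower case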
    
    # ENCODE SENTIMENT -> 0 & 1
    y_data = y_data.replace('positive', 1)
    y_data = y_data.replace('negative', 0)

    return x_data, y_data

x_data, y_data = load_dataset()

print('Reviews')
print(x_data, '\n')
print('Sentiment')
print(y_data)

Reviews
0        [one, reviewers, mentioned, watching, oz, epis...
1        [a, wonderful, little, production, the, filmin...
2        [i, thought, wonderful, way, spend, time, hot,...
3        [basically, family, little, boy, jake, thinks,...
4        [petter, mattei, love, time, money, visually, ...
                               ...                        
49995    [i, thought, movie, right, good, job, it, crea...
49996    [bad, plot, bad, dialogue, bad, acting, idioti...
49997    [i, catholic, taught, parochial, elementary, s...
49998    [i, going, disagree, previous, comment, side, ...
49999    [no, one, expects, star, trek, movies, high, a...
Name: review, Length: 50000, dtype: object 

Sentiment
0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 50000, dtype: int64
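
The printed reviews above still contain tokens such as `a`, `the` and `i`: NLTK's English stop-word list is lowercase, and the filter runs before lowercasing, so capitalized stop words pass through. A minimal sketch of the same cleaning steps with lowercasing done first is shown here. `SAMPLE_STOPS` and `clean_review` are hypothetical names, and the tiny stop-word set only stands in for `stopwords.words('english')` so that the example runs without NLTK data.

```python
import re

# Stand-in for set(stopwords.words('english')); an assumption for illustration only.
SAMPLE_STOPS = {"a", "the", "i", "it", "this", "is", "of", "and"}

def clean_review(text, stops=SAMPLE_STOPS):
    """Strip HTML tags and non-letters, lowercase, then drop stop words."""
    text = re.sub(r"<.*?>", "", text)         # remove HTML tags
    text = re.sub(r"[^A-Za-z]", " ", text)    # replace non-letters with spaces
    words = text.lower().split()              # lowercase BEFORE filtering
    return [w for w in words if w not in stops]

print(clean_review("A wonderful little production. <br /><br />The filming is great!"))
```

With this order, "A" and "The" are lowercased first and therefore match the stop-word set.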


<hr>

### Split Dataset
In this work, I split the data into an 80% training set and a 20% test set using the <b>train_test_split</b> function from Scikit-Learn, which shuffles the rows by default before splitting. Shuffling matters when a dataset is stored in label order (for example, all positive reviews first and then all negative ones), because an unshuffled split would leave the two sets with very different class balances. The rows previewed above are already mixed, so here the shuffle mainly acts as a safeguard.

In [7]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2)
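
As a rough illustration of what a shuffled 80/20 split does (not scikit-learn's actual implementation), the sketch below shuffles row indices with a fixed seed and slices them. `shuffled_split` is a hypothetical helper. In the notebook itself, passing `random_state` to `train_test_split` would make the split repeatable in a similar way.

```python
import random

def shuffled_split(xs, ys, test_size=0.2, seed=0):
    """Shuffle paired samples, then use a `test_size` share of them as the test set."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)          # fixed seed -> repeatable order
    n_test = int(round(len(xs) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return pick(xs, train_idx), pick(xs, test_idx), pick(ys, train_idx), pick(ys, test_idx)

# Toy data sorted by label, the worst case for an unshuffled split.
reviews = [f"review {i}" for i in range(10)]
labels = [1] * 5 + [0] * 5
x_tr, x_te, y_tr, y_te = shuffled_split(reviews, labels)
print(len(x_tr), len(x_te))
```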

<hr>
<i>Function for choosing the padded review length: the mean length of the training reviews (computed with <b>numpy.mean</b>), rounded up</i>

In [8]:
def get_max_length():
    review_length = []
    for review in x_train:
        review_length.append(len(review))

    return int(np.ceil(np.mean(review_length)))
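
Because the cut-off is the mean length rather than the true maximum, every review longer than average will be truncated later on. The sketch below, using made-up token lists, shows how one might check what share of reviews that affects. `mean_length` and `truncated_share` are hypothetical helpers, not part of the notebook.

```python
import math

def mean_length(reviews):
    """Mean number of tokens per review, rounded up (same rule as get_max_length)."""
    return int(math.ceil(sum(len(r) for r in reviews) / len(reviews)))

def truncated_share(reviews, max_length):
    """Fraction of reviews that are longer than max_length and would lose tokens."""
    return sum(len(r) > max_length for r in reviews) / len(reviews)

toy_reviews = [["good"] * n for n in (2, 3, 4, 11)]   # made-up lengths
cutoff = mean_length(toy_reviews)
print(cutoff, truncated_share(toy_reviews, cutoff))
```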

<hr>

### Tokenize and Pad/Truncate Reviews
A neural network only accepts numeric data, so we need to encode the reviews. I use <b>tensorflow.keras.preprocessing.text.Tokenizer</b> to encode the reviews as integers, where each unique word is automatically indexed (using the <b>fit_on_texts</b> method) based on <b>x_train</b>. <br>
<b>x_train</b> and <b>x_test</b> are then converted into integer sequences using the <b>texts_to_sequences</b> method.
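
To make the encoding concrete, here is a simplified pure-Python sketch of the idea: build a word index from the training reviews (more frequent words get smaller indices, starting at 1 so that 0 stays free for padding) and map each review to integers. `build_word_index` and `encode` are hypothetical helpers, not the Keras implementation. Words that never appeared in the training reviews are dropped, which is also what `Tokenizer` does when no `oov_token` is set.

```python
from collections import Counter

def build_word_index(train_reviews):
    """Map each word to an integer; index 0 is reserved for padding."""
    counts = Counter(w for review in train_reviews for w in review)
    # most_common() orders words by descending frequency
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(reviews, word_index):
    """Replace words by their index, skipping words unseen during fitting."""
    return [[word_index[w] for w in review if w in word_index] for review in reviews]

train = [["good", "movie"], ["bad", "movie"]]
test = [["good", "plot"]]          # "plot" is not in the training vocabulary
word_index = build_word_index(train)
print(word_index)
print(encode(test, word_index))
```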

Each review has a different length, so we need to pad the shorter ones (by adding 0) and truncate the longer ones to the same length (in this case, the mean length of all training reviews) using <b>tensorflow.keras.preprocessing.sequence.pad_sequences</b>.


<b>post</b>: pad or truncate at the end of a review<br>
<b>pre</b>: pad or truncate at the beginning of a review
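
The two modes can be illustrated with a small pure-Python stand-in for `pad_sequences`. The helper `pad_or_truncate` is hypothetical and handles a single sequence.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="post", value=0):
    """Force one integer sequence to exactly `maxlen` items."""
    if len(seq) > maxlen:
        # 'post' drops items from the end, 'pre' drops them from the start
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    pad = [value] * (maxlen - len(seq))
    # 'post' appends the padding, 'pre' puts it in front
    return seq + pad if padding == "post" else pad + seq

short_seq, long_seq = [5, 8], [1, 2, 3, 4, 5, 6]
print(pad_or_truncate(short_seq, 4))                     # [5, 8, 0, 0]
print(pad_or_truncate(short_seq, 4, padding="pre"))      # [0, 0, 5, 8]
print(pad_or_truncate(long_seq, 4))                      # [1, 2, 3, 4]
print(pad_or_truncate(long_seq, 4, truncating="pre"))    # [3, 4, 5, 6]
```

The notebook uses `post` for both options, so long reviews keep their first `max_length` words and short ones get trailing zeros.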

In [9]:
# ENCODE REVIEW
token = Tokenizer(lower=False)    # no need to lowercase again; load_dataset() already did it
token.fit_on_texts(x_train)
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)

max_length = get_max_length()

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(token.word_index) + 1   # add 1 because of 0 padding

print('Maximum review length: ', max_length)

Maximum review length:  130


## Modeling

In [10]:
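def RNN_model(EMBED_DIM, total_words):
    # max_length is read from the notebook's global scope
    inputs = Input(shape=(max_length,))     # one padded review of integer word ids
    embedding = Embedding(total_words, EMBED_DIM)(inputs)
    
    # three stacked SimpleRNN layers; all but the last return the full sequence
    rnn1 = SimpleRNN(units=20, return_sequences=True)(embedding)
    dropout1 = Dropout(0.2)(rnn1)
    rnn2 = SimpleRNN(units=10, return_sequences=True)(dropout1)
    dropout2 = Dropout(0.5)(rnn2)
    rnn3 = SimpleRNN(units=10)(dropout2)
    dropout3 = Dropout(0.5)(rnn3)

    output = Dense(units=1, activation='sigmoid')(dropout3)   # probability of a positive review

    rnn_model = Model(inputs=inputs, outputs=output)
    rnn_model.summary()   # summary() prints the table itself and returns None

    return rnn_model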

In [11]:
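def LSTM_model(EMBED_DIM, total_words):
    # max_length is read from the notebook's global scope
    inputs = Input(shape=(max_length,))     # one padded review of integer word ids
    embedding = Embedding(total_words, EMBED_DIM)(inputs)
    
    # three stacked LSTM layers; all but the last return the full sequence
    lstm1 = LSTM(units=20, return_sequences=True)(embedding)
    dropout1 = Dropout(0.2)(lstm1)
    lstm2 = LSTM(units=10, return_sequences=True)(dropout1)
    dropout2 = Dropout(0.5)(lstm2)
    lstm3 = LSTM(units=10)(dropout2)
    dropout3 = Dropout(0.5)(lstm3)

    output = Dense(units=1, activation='sigmoid')(dropout3)   # probability of a positive review

    lstm_model = Model(inputs=inputs, outputs=output)
    lstm_model.summary()   # summary() prints the table itself and returns None

    return lstm_model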

In [12]:
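def GRU_model(EMBED_DIM, total_words):
    # max_length is read from the notebook's global scope
    inputs = Input(shape=(max_length,))     # one padded review of integer word ids
    embedding = Embedding(total_words, EMBED_DIM)(inputs)
    
    # three stacked GRU layers; all but the last return the full sequence
    gru1 = GRU(units=20, return_sequences=True)(embedding)
    dropout1 = Dropout(0.2)(gru1)
    gru2 = GRU(units=10, return_sequences=True)(dropout1)
    dropout2 = Dropout(0.5)(gru2)
    gru3 = GRU(units=10)(dropout2)
    dropout3 = Dropout(0.5)(gru3)

    output = Dense(units=1, activation='sigmoid')(dropout3)   # probability of a positive review

    gru_model = Model(inputs=inputs, outputs=output)
    gru_model.summary()   # summary() prints the table itself and returns None

    return gru_model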

In [None]:
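def GRU_attention_model(EMBED_DIM, total_words):
    # NOTE: no Attention layer is wired in yet; this is currently the same stack as GRU_model
    # max_length is read from the notebook's global scope
    inputs = Input(shape=(max_length,))     # one padded review of integer word ids
    embedding = Embedding(total_words, EMBED_DIM)(inputs)
    
    gru1 = GRU(units=20, return_sequences=True)(embedding)
    dropout1 = Dropout(0.2)(gru1)
    gru2 = GRU(units=10, return_sequences=True)(dropout1)
    dropout2 = Dropout(0.5)(gru2)
    gru3 = GRU(units=10)(dropout2)
    dropout3 = Dropout(0.5)(gru3)

    output = Dense(units=1, activation='sigmoid')(dropout3)   # probability of a positive review

    gru_model = Model(inputs=inputs, outputs=output)
    gru_model.summary()   # summary() prints the table itself and returns None

    return gru_model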

In [13]:
EMBED_DIM = 32
# RNN_model / LSTM_model / GRU_model
model = LSTM_model(EMBED_DIM, total_words)

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 130)]             0         
                                                                 
 embedding (Embedding)       (None, 130, 32)           2959072   
                                                                 
 lstm (LSTM)                 (None, 130, 20)           4240      
                                                                 
 dropout (Dropout)           (None, 130, 20)           0         
                                                                 
 lstm_1 (LSTM)               (None, 130, 10)           1240      
                                                                 
 dropout_1 (Dropout)         (None, 130, 10)           0         
                                                                 
 lstm_2 (LSTM)               (None, 10)                840   
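
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with weights over the concatenated input and hidden state plus a bias, giving `4 * (units * (input_dim + units) + units)` parameters. An embedding layer has one vector per vocabulary entry. The sketch below reproduces the numbers printed above. The vocabulary size of 92471 is inferred from 2959072 / 32 rather than stated in the notebook, and the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of a standard LSTM layer (4 gates, with bias)."""
    return 4 * (units * (input_dim + units) + units)

def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

EMBED_DIM = 32
vocab_size = 2959072 // EMBED_DIM          # inferred from the summary: 92471

print(embedding_params(vocab_size, EMBED_DIM))   # 2959072
print(lstm_params(EMBED_DIM, 20))                # 4240
print(lstm_params(20, 10))                       # 1240
print(lstm_params(10, 10))                       # 840
```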

In [14]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
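
For reference, `binary_crossentropy` averages `-(y*log(p) + (1-y)*log(1-p))` over the samples, where `p` is the sigmoid output and `y` the 0/1 label. A minimal pure-Python sketch with made-up predictions follows. The clipping constant `eps` is an assumption that mirrors the usual guard against `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)      # keep p away from 0 and 1 to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(binary_crossentropy([1, 0], [0.9, 0.1]))   # confident and right: small loss
print(binary_crossentropy([1, 0], [0.1, 0.9]))   # confident and wrong: large loss
```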

<hr>

### Training
Training is simple: we fit the model on <b>x_train</b> (input) and <b>y_train</b> (output/label). The first 10,000 training samples are held out as a validation set, and the model is trained on the remaining 30,000 using mini-batches with a <b>batch_size</b> of <i>256</i> for <i>5</i> <b>epochs</b>.

A <b>ModelCheckpoint</b> callback (imported above) can save the model locally after each epoch in which the monitored metric improves. The <b>fit</b> call below does not pass one, so no checkpoint is written.
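
The save-on-improvement rule can be sketched without Keras: keep the best value seen so far and "save" only when an epoch beats it. The accuracy values and the `epochs_to_save` helper are made up for illustration. In Keras this behaviour corresponds to `ModelCheckpoint` with `save_best_only=True`.

```python
def epochs_to_save(accuracies):
    """Return the 1-based epochs at which a checkpoint would be written."""
    best, saved = float("-inf"), []
    for epoch, acc in enumerate(accuracies, start=1):
        if acc > best:                 # strictly better than every earlier epoch
            best = acc
            saved.append(epoch)
    return saved

# Made-up validation accuracies for five epochs.
print(epochs_to_save([0.70, 0.81, 0.79, 0.85, 0.85]))   # [1, 2, 4]
```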

In [15]:
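# hold out the first 10,000 training samples for validation
x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train.iloc[:10000]            # y_train is a pandas Series, so slice by position
partial_y_train = y_train.iloc[10000:]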

x_train.shape

(40000, 130)

In [16]:
history = model.fit(partial_x_train, partial_y_train, 
                    batch_size = 256,
                    shuffle = True,
                    verbose = 1, 
                    epochs = 5,
                    validation_data=(x_val, y_val))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<hr>

### Testing
To evaluate the model, we predict the sentiment for <b>x_test</b> and compare the predictions with <b>y_test</b> (expected output). The accuracy is the number of correct predictions divided by the total number of test samples; <b>model.evaluate</b> reports it together with the loss. The result is an accuracy of <b>86.53%</b>.

In [17]:
results = model.evaluate(x_test, y_test, batch_size=128)
print("test loss: ", round(results[0], 2))
print("test acc: ", round(results[1] * 100, 2), '%')   # scale first, then round, to avoid float artifacts

test loss:  0.42
test acc:  86.53 %
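
The accuracy calculation described above can be written out by hand: threshold each sigmoid output at 0.5, compare it with the label, and divide the number of matches by the number of samples. The probabilities below are made up. In the notebook they would come from `model.predict(x_test)`.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded prediction equals the label."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

labels = [1, 0, 1, 1, 0]
probs = [0.92, 0.13, 0.41, 0.77, 0.58]    # made-up sigmoid outputs
print(accuracy(labels, probs))            # 3 of 5 correct -> 0.6
```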
