# Sentiment Analysis on IMDB Reviews

<hr>

### Steps
<ol type="1">
    <li>Load the dataset (50K IMDB Movie Review)</li>
    <li>Clean Dataset</li>
    <li>Encode Sentiments</li>
    <li>Split Dataset</li>
    <li>Tokenize and Pad/Truncate Reviews</li>
    <li>Build Architecture/Model</li>
    <li>Train and Test</li>
</ol>

<hr>
<i>Import all the libraries needed</i>

In [1]:
import pandas as pd    # to load dataset
import numpy as np     # for mathematic equation
from nltk.corpus import stopwords   # to get collection of stopwords
from sklearn.model_selection import train_test_split       # for splitting dataset
from tensorflow.keras.preprocessing.text import Tokenizer  # to encode text to int
from tensorflow.keras.preprocessing.sequence import pad_sequences   # to do padding or truncating
from tensorflow.keras.models import Sequential     # the model
from tensorflow.keras.layers import Embedding, LSTM, Dense # layers of the architecture
from tensorflow.keras.callbacks import ModelCheckpoint   # save model
from tensorflow.keras.models import load_model   # load saved model
import re

<hr>
<i>Preview dataset</i>

In [2]:
data = pd.read_csv('IMDB Dataset.csv')

print(data)

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


<hr>
<b>Stop Word</b> is a commonly used words in a sentence, usually a search engine is programmed to ignore this words (i.e. "the", "a", "an", "of", etc.)

<i>Declaring the english stop words</i>

In [3]:
english_stops = set(stopwords.words('english'))

<hr>

### Load and Clean Dataset

In the original dataset, the reviews are still dirty. There are still html tags, numbers, uppercase, and punctuations. This will not be good for training, so in <b>load_dataset()</b> function, beside loading the dataset using <b>pandas</b>, I also pre-process the reviews by removing html tags, non alphabet (punctuations and numbers), stop words, and lower case all of the reviews.

### Encode Sentiments
In the same function, I also encode the sentiments into integers (0 and 1). Where 0 is for negative sentiments and 1 is for positive sentiments.

In [4]:
def load_dataset():
    df = pd.read_csv('IMDB Dataset.csv')
    x_data = df['review']       # Reviews/Input
    y_data = df['sentiment']    # Sentiment/Output

    # PRE-PROCESS REVIEW
    x_data = x_data.replace({'<.*?>': ''}, regex = True)          # remove html tag
    x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # remove non alphabet
    x_data = x_data.apply(lambda review: [w for w in review.split() if w not in english_stops])  # remove stop words
    x_data = x_data.apply(lambda review: [w.lower() for w in review])   # lower case
    
    # ENCODE SENTIMENT -> 0 & 1
    y_data = y_data.replace('positive', 1)
    y_data = y_data.replace('negative', 0)

    return x_data, y_data

x_data, y_data = load_dataset()

print('Reviews')
print(x_data, '\n')
print('Sentiment')
print(y_data)

Reviews
0        [one, reviewers, mentioned, watching, oz, epis...
1        [a, wonderful, little, production, the, filmin...
2        [i, thought, wonderful, way, spend, time, hot,...
3        [basically, family, little, boy, jake, thinks,...
4        [petter, mattei, love, time, money, visually, ...
                               ...                        
49995    [i, thought, movie, right, good, job, it, crea...
49996    [bad, plot, bad, dialogue, bad, acting, idioti...
49997    [i, catholic, taught, parochial, elementary, s...
49998    [i, going, disagree, previous, comment, side, ...
49999    [no, one, expects, star, trek, movies, high, a...
Name: review, Length: 50000, dtype: object 

Sentiment
0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 50000, dtype: int64


<hr>

### Split Dataset
In this work, I decided to split the data into 80% of Training and 20% of Testing set using <b>train_test_split</b> method from Scikit-Learn. By using this method, it automatically shuffles the dataset. We need to shuffle the data because in the original dataset, the reviews and sentiments are in order, where they list positive reviews first and then negative reviews. By shuffling the data, it will be distributed equally in the model, so it will be more accurate for predictions.

In [5]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2)

print('Train Set')
print(x_train, '\n')
print(x_test, '\n')
print('Test Set')
print(y_train, '\n')
print(y_test)

Train Set
10510    [i, really, understand, positive, user, review...
38914    [one, worst, shows, time, the, show, would, be...
31556    [many, neglect, classic, due, fact, first, d, ...
42704    [paul, schrader, brother, leonard, wrote, mish...
23723    [to, honest, i, like, executive, decision, obv...
                               ...                        
23630    [when, i, first, watched, show, first, episode...
32925    [i, think, saying, film, many, makes, film, ba...
730      [this, who, powerful, although, masterwork, wh...
48770    [i, think, one, best, tamil, movies, seen, lov...
4956     [in, comparison, sand, sandal, fare, the, egyp...
Name: review, Length: 40000, dtype: object 

34790    [i, hope, never, become, cynical, society, app...
18821    [this, hands, worst, movie, time, a, combinati...
9701     [i, usually, inclined, write, reviews, films, ...
48206    [celia, johnson, good, nurse, michael, hordern...
9780     [eddie, murphy, best, supporting, actor, what,...
 

<hr>
<i>Function for getting the maximum review length, by calculating the mean of all the reviews length (using <b>numpy.mean</b>)</i>

In [6]:
def get_max_length():
    review_length = []
    for review in x_train:
        review_length.append(len(review))

    return int(np.ceil(np.mean(review_length)))

<hr>

### Tokenize and Pad/Truncate Reviews
A Neural Network only accepts numeric data, so we need to encode the reviews. I use <b>tensorflow.keras.preprocessing.text.Tokenizer</b> to encode the reviews into integers, where each unique word is automatically indexed (using <b>fit_on_texts</b> method) based on <b>x_train</b>. <br>
<b>x_train</b> and <b>x_test</b> is converted into integers using <b>texts_to_sequences</b> method.

Each reviews has a different length, so we need to add padding (by adding 0) or truncating the words to the same length (in this case, it is the mean of all reviews length) using <b>tensorflow.keras.preprocessing.sequence.pad_sequences</b>.


<b>post</b>, pad or truncate the words in the back of a sentence<br>
<b>pre</b>, pad or truncate the words in front of a sentence

In [7]:
# ENCODE REVIEW
token = Tokenizer(lower=False)    # no need lower, because already lowered the data in load_data()
token.fit_on_texts(x_train)
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)

max_length = get_max_length()

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(token.word_index) + 1   # add 1 because of 0 padding

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum review length: ', max_length)

Encoded X Train
 [[   1   15  286 ...    0    0    0]
 [   5  158  181 ...    0    0    0]
 [  36 9241  265 ... 8709 5782  170]
 ...
 [   8  698  833 ...    0    0    0]
 [   1   31    5 ...    0    0    0]
 [  49 1790 5193 ...    0    0    0]] 

Encoded X Test
 [[   1  344   44 ...  119  742  495]
 [   8  829  158 ...    0    0    0]
 [   1  534 7611 ... 3729  195 5391]
 ...
 [   1  285   40 ... 8392  164  747]
 [  59 4007 6593 ...    0    0    0]
 [ 220  542 2580 ...    0    0    0]] 

Maximum review length:  130


<hr>

### Build Architecture/Model
<b>Embedding Layer</b>: in simple terms, it creates word vectors of each word in the <i>word_index</i> and group words that are related or have similar meaning by analyzing other words around them.

<b>LSTM Layer</b>: to make a decision to keep or throw away data by considering the current input, previous output, and previous memory. There are some important components in LSTM.
<ul>
    <li><b>Forget Gate</b>, decides information is to be kept or thrown away</li>
    <li><b>Input Gate</b>, updates cell state by passing previous output and current input into sigmoid activation function</li>
    <li><b>Cell State</b>, calculate new cell state, it is multiplied by forget vector (drop value if multiplied by a near 0), add it with the output from input gate to update the cell state value.</li>
    <li><b>Ouput Gate</b>, decides the next hidden state and used for predictions</li>
</ul>

<b>Dense Layer</b>: compute the input with the weight matrix and bias (optional), and using an activation function. I use <b>Sigmoid</b> activation function for this work because the output is only 0 or 1.

The optimizer is <b>Adam</b> and the loss function is <b>Binary Crossentropy</b> because again the output is only 0 and 1, which is a binary number.

In [8]:
# ARCHITECTURE
EMBED_DIM = 32
LSTM_OUT = 64

model = Sequential()
model.add(Embedding(total_words, EMBED_DIM, input_length = max_length))
model.add(LSTM(LSTM_OUT))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 130, 32)           2958784   
                                                                 
 lstm (LSTM)                 (None, 64)                24832     
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 2,983,681
Trainable params: 2,983,681
Non-trainable params: 0
_________________________________________________________________
None


<hr>

### Training
For training, it is simple. We only need to fit our <b>x_train</b> (input) and <b>y_train</b> (output/label) data. For this training, I use a mini-batch learning method with a <b>batch_size</b> of <i>128</i> and <i>5</i> <b>epochs</b>.

Also, I added a callback called **checkpoint** to save the model locally for every epoch if its accuracy improved from the previous epoch.

In [9]:
checkpoint = ModelCheckpoint(
    'models/LSTM.h5',
    monitor='accuracy',
    save_best_only=True,
    verbose=1
)

In [10]:
model.fit(x_train, y_train, batch_size = 128, epochs = 5, callbacks=[checkpoint])

Epoch 1/5
Epoch 1: accuracy improved from -inf to 0.69447, saving model to models\LSTM.h5
Epoch 2/5
Epoch 2: accuracy improved from 0.69447 to 0.91570, saving model to models\LSTM.h5
Epoch 3/5
Epoch 3: accuracy improved from 0.91570 to 0.95952, saving model to models\LSTM.h5
Epoch 4/5
Epoch 4: accuracy improved from 0.95952 to 0.97765, saving model to models\LSTM.h5
Epoch 5/5
Epoch 5: accuracy improved from 0.97765 to 0.98705, saving model to models\LSTM.h5


<keras.callbacks.History at 0x1e59f8a1fa0>

<hr>

### Testing
To evaluate the model, we need to predict the sentiment using our <b>x_test</b> data and comparing the predictions with <b>y_test</b> (expected output) data. Then, we calculate the accuracy of the model by dividing numbers of correct prediction with the total data. Resulted an accuracy of <b>86.63%</b>

In [11]:
y_pred = model.predict(x_test, batch_size = 128)

true = 0
for i, y in enumerate(y_test):
    if y == y_pred[i]:
        true += 1

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred) - true))
print('Accuracy: {}'.format(true/len(y_pred)*100))

Correct Prediction: 0
Wrong Prediction: 10000
Accuracy: 0.0


---

### Load Saved Model

Load saved model and use it to predict a movie review statement's sentiment (positive or negative).

In [12]:
loaded_model = load_model('models/LSTM.h5')

Receives a review as an input to be predicted

In [29]:
review = str(input('Movie Review: '))

Movie Review: First review from me. This film deserves it. A superhero film, marvel or not, does not have the right to send me on the emotional rollercoaster that End Game did. It has no right to do what it did, paying tribute to 10 years of films whilst changing the rules as to what a superhero film should be.  I laughed, lots. I cried, lots. I cheered (quietly and internally of course), lots I even punched the air in a "go on!" during one scene  I can honestly say, no film has ever moved me like that. Hours later, my heart was still pounding, the adrenaline still flowing from this epic piece of film making.  Made by fans, for fans. Are there flaws? Yes. If you over think the story of course there is. But it works. It flows. It fits.  Go with it. Enjoy the ride.  Soak up every second, it doesn't feel like 3 hours, it feels like a flow of story reaching a conclusion that will shock and move you.  I have nothing else to say.  10/10. Well done Russo brothers. Well done!  Fans! Assemble!


The input must be pre processed before it is passed to the model to be predicted

In [30]:

# Pre-process input
regex = re.compile(r'[^a-zA-Z\s]')
review = regex.sub('', review)
print('Cleaned: ', review)

words = review.split(' ')
filtered = [w for w in words if w not in english_stops]
filtered = ' '.join(filtered)
filtered = [filtered.lower()]

print('Filtered: ', filtered)

Cleaned:  First review from me This film deserves it A superhero film marvel or not does not have the right to send me on the emotional rollercoaster that End Game did It has no right to do what it did paying tribute to  years of films whilst changing the rules as to what a superhero film should be  I laughed lots I cried lots I cheered quietly and internally of course lots I even punched the air in a go on during one scene  I can honestly say no film has ever moved me like that Hours later my heart was still pounding the adrenaline still flowing from this epic piece of film making  Made by fans for fans Are there flaws Yes If you over think the story of course there is But it works It flows It fits  Go with it Enjoy the ride  Soak up every second it doesnt feel like  hours it feels like a flow of story reaching a conclusion that will shock and move you  I have nothing else to say   Well done Russo brothers Well done  Fans Assemble
Filtered:  ['first review this film deserves a superhe

Once again, we need to tokenize and encode the words. I use the tokenizer which was previously declared because we want to encode the words based on words that are known by the model.

In [31]:
tokenize_words = token.texts_to_sequences(filtered)
tokenize_words = pad_sequences(tokenize_words, maxlen=max_length, padding='post', truncating='post')
print(tokenize_words)

[[   23   623     8     4   909    39  4718     4  5919   111  2087   786
  23103    54   365     7   111  2627  3000    69    35  1741  2485  2160
   4718     4     1  1380   634     1  3676   634     1 12459  5223 19915
    170   634     1    11 12421   790    61     5    55     1  1106    58
      4    51  1604     6   529   210   401    57 10752 12250    57  6771
   1557   319     4   136    24   366   366  1691  1487   323    56    31
     13   170    30   404     7  6990     7  2297    61   264  1200 21518
     84   242 15956   137     6   529   686     6  2766    13  4662  1156
   1367   726     1    77   230    58    16   129  5849  1002    16   129
    366 20046     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0]]


This is the result of the prediction which shows the **confidence score** of the review statement.

In [32]:
result = loaded_model.predict(tokenize_words)
print(result)

[[0.9964286]]


If the confidence score is close to 0, then the statement is **negative**. On the other hand, if the confidence score is close to 1, then the statement is **positive**. I use a threshold of **0.7** to determine which confidence score is positive and negative, so if it is equal or greater than 0.7, it is **positive** and if it is less than 0.7, it is **negative**

In [33]:
if result >= 0.7:
    print('This review is positive')
else:
    print('This review is negative')

This review is positive
