<a href="https://colab.research.google.com/github/anubhavyadav111/Movie-Review-Model-Using-RNN/blob/main/Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Importing Necessary Libraries**


In [None]:
import pandas as pd    # to load dataset
import numpy as np     # for numerical operations
from nltk.corpus import stopwords   # to get collection of stopwords
from sklearn.model_selection import train_test_split       # for splitting dataset
from tensorflow.keras.preprocessing.text import Tokenizer  # to encode text to int
from tensorflow.keras.preprocessing.sequence import pad_sequences   # to do padding or truncating
from tensorflow.keras.models import Sequential     # the model
from tensorflow.keras.layers import Embedding, LSTM, Dense # layers of the architecture
from tensorflow.keras.callbacks import ModelCheckpoint   # save model
from tensorflow.keras.models import load_model   # load saved model
import re
from tensorflow.keras.layers import SimpleRNN

# **Preparing the IMDB Dataset**

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
data = pd.read_csv('IMDB Dataset.csv', encoding="latin-1")
print(data)

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


Stop words are very common words in a language (e.g. "the", "a", "an", "of") that carry little meaning on their own; search engines are often programmed to ignore them.
Declaring the English stop words:

In [None]:
import nltk
nltk.download("stopwords")
english_stops = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# **Load and Clean Dataset**
**In the original dataset, the reviews are still dirty: they contain HTML tags, numbers, uppercase letters, and punctuation. This is not good for training, so the load_dataset() function, besides loading the dataset with pandas, also pre-processes the reviews by removing HTML tags, non-alphabetic characters (punctuation and numbers), and stop words, and by lowercasing all of the reviews.**

# **Encode Sentiments**
**In the same function, we also encode the sentiments as integers: 0 for negative and 1 for positive.**

In [None]:
def load_dataset():
    df = pd.read_csv('IMDB Dataset.csv',  encoding='latin-1')
    x_data = df['review']       # Reviews/Input
    y_data = df['sentiment']    # Sentiment/Output

    # PRE-PROCESS REVIEW
    x_data = x_data.replace({'<.*?>': ''}, regex = True)          # remove html tag
    x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # remove non alphabet
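    # NOTE: stop words are filtered before lowercasing, so capitalized forms such as 'I' or 'The' are kept
    x_data = x_data.apply(lambda review: [w for w in review.split() if w not in english_stops])  # remove stop words
    x_data = x_data.apply(lambda review: [w.lower() for w in review])   # lowercase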

    # ENCODE SENTIMENT -> 0 & 1
    y_data = y_data.replace('positive', 1)
    y_data = y_data.replace('negative', 0)

    return x_data, y_data

x_data, y_data = load_dataset()

print('Reviews')
print(x_data, '\n')
print('Sentiment')
print(y_data)

Reviews
0        [one, reviewers, mentioned, watching, oz, epis...
1        [a, wonderful, little, production, the, filmin...
2        [i, thought, wonderful, way, spend, time, hot,...
3        [basically, family, little, boy, jake, thinks,...
4        [petter, mattei, love, time, money, visually, ...
                               ...                        
49995    [i, thought, movie, right, good, job, it, crea...
49996    [bad, plot, bad, dialogue, bad, acting, idioti...
49997    [i, catholic, taught, parochial, elementary, s...
49998    [i, going, disagree, previous, comment, side, ...
49999    [no, one, expects, star, trek, movies, high, a...
Name: review, Length: 50000, dtype: object 

Sentiment
0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 50000, dtype: int64
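
The printed reviews above still contain tokens such as `i`, `a`, and `the`. The likely cause is that the stop-word filter runs before lowercasing, so capitalized forms like `I` or `The` do not match NLTK's lowercase list. Below is a minimal sketch of the same cleaning steps with lowercasing done first. The function name `clean_review` and the tiny `STOPS` set are hypothetical stand-ins (the notebook uses NLTK's full English list), and HTML tags are replaced by a space here, rather than removed outright, so that adjacent words cannot merge.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stop-word list.
STOPS = {"a", "an", "the", "of", "i", "is", "it", "this", "was"}

def clean_review(text, stops=STOPS):
    """Strip HTML tags and non-letters, lowercase, then drop stop words."""
    text = re.sub(r"<.*?>", " ", text)        # remove HTML tags
    text = re.sub(r"[^A-Za-z]", " ", text)    # keep letters only
    words = text.lower().split()              # lowercase BEFORE filtering
    return [w for w in words if w not in stops]

print(clean_review("A wonderful little production. <br /><br />The filming is great!"))
# ['wonderful', 'little', 'production', 'filming', 'great']
```

Applying this order to the real dataset would change the vocabulary size and the encoded sequences reported later in the notebook, so the recorded outputs would need to be regenerated.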





# **Split Dataset**
**In this work, we split the data into an 80% training set and a 20% test set using the train_test_split method from Scikit-Learn, which shuffles the dataset by default before splitting. Shuffling matters whenever the source file is ordered by label (for example, positive reviews first and then negative reviews): without it, one class could dominate the test set. After shuffling, both splits contain a similar mix of positive and negative reviews, so the test accuracy is a fairer estimate of how the model performs.**
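
To illustrate the effect of shuffling, here is a pure-Python sketch of a seeded 80/20 split on a toy list that is sorted by label. The helper `shuffled_split` is hypothetical and is not how Scikit-Learn implements `train_test_split`; it only mirrors the idea. With the real function, passing `random_state` makes the split reproducible and `stratify=y_data` keeps the class ratio equal in both splits.

```python
import random

def shuffled_split(items, labels, test_size=0.2, seed=42):
    """Shuffle paired items/labels with a fixed seed, then cut off the test share."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)          # seeded, so the split is repeatable
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    def pick(seq, ids):
        return [seq[i] for i in ids]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy data sorted by label: 50 positives followed by 50 negatives.
reviews = [f"review {i}" for i in range(100)]
labels = [1] * 50 + [0] * 50

x_tr, x_te, y_tr, y_te = shuffled_split(reviews, labels)
print(len(x_tr), len(x_te))   # 80 20
print(sum(y_te), "positives in the test split")
```

Taking the last 20 items without shuffling would give a test split made up only of negatives in this toy case.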

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2)

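# Note: 'Train Set' heads both review splits (x_train, x_test); 'Test Set' heads both label splits (y_train, y_test)
print('Train Set')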
print(x_train, '\n')
print(x_test, '\n')
print('Test Set')
print(y_train, '\n')
print(y_test)

Train Set
37892    [this, truly, hilarious, film, one, i, seen, m...
39081    [i, especially, liked, ending, movie, i, reall...
9615     [this, great, movie, in, genre, memphis, belle...
28472    [there, something, one, characters, aging, fil...
16707    [the, main, complaint, film, fact, i, can, not...
                               ...                        
28660    [julie, andrews, rock, hudson, great, movie, m...
7282     [i, really, like, show, as, part, greek, life,...
1896     [this, movie, offers, nothing, anyone, it, suc...
48493    [distasteful, cliched, thriller, young, couple...
21320    [yes, absolutely, dreadful, and, coming, someo...
Name: review, Length: 40000, dtype: object 

23950    [jack, kate, meet, physician, daniel, farady, ...
3965     [this, wes, craven, worst, worst, horror, call...
34102    [spoilers, sex, huh, it, one, basic, parts, hu...
19971    [david, duchovny, plays, lead, role, film, now...
48034    [yes, i, sentimental, schmaltzy, but, movie, t...
 

**Function for getting the sequence length used for padding. Despite its name, get_max_length() returns the mean length of the training reviews (using numpy.mean, rounded up), not the maximum.**

In [None]:
def get_max_length():
    review_length = []
    for review in x_train:
        review_length.append(len(review))

    return int(np.ceil(np.mean(review_length)))
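
Because the function returns the mean, every review longer than that value is cut when the sequences are padded. The sketch below compares the mean with the maximum on made-up lengths and reports the share of reviews that would be truncated. `length_stats` and the toy data are hypothetical and only meant to show the trade-off.

```python
import math
import statistics

def length_stats(token_lists):
    """Return mean length (rounded up), maximum length, and share of reviews above the mean."""
    lengths = [len(tokens) for tokens in token_lists]
    mean_len = math.ceil(statistics.mean(lengths))
    longer = sum(1 for n in lengths if n > mean_len)
    return mean_len, max(lengths), longer / len(lengths)

# Five made-up reviews with 40, 80, 120, 130 and 500 tokens.
toy = [["w"] * n for n in (40, 80, 120, 130, 500)]
mean_len, max_len, share_truncated = length_stats(toy)
print(mean_len, max_len, share_truncated)   # 174 500 0.2
```

Padding to the mean keeps the input small; padding to the maximum keeps every word but makes most sequences consist largely of zeros.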

# **Tokenize and Pad/Truncate Reviews**
**A neural network only accepts numeric data, so we need to encode the reviews. I use tensorflow.keras.preprocessing.text.Tokenizer to encode the reviews into integers: the fit_on_texts method assigns an index to each unique word in x_train.**

**x_train and x_test are then converted into integer sequences using the texts_to_sequences method.**

**Each review has a different length, so we pad short reviews (by adding 0) and truncate long ones to a common length (in this case, the mean review length) using tensorflow.keras.preprocessing.sequence.pad_sequences.**

In [None]:
# ENCODE REVIEW
token = Tokenizer(lower=False)    # no need to lowercase: the data was already lowercased in load_dataset()
token.fit_on_texts(x_train)
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)

max_length = get_max_length()

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(token.word_index) + 1   # add 1 because of 0 padding
print('Total Words:', total_words)

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum review length: ', max_length)

Total Words: 92215
Encoded X Train
 [[   8  279  481 ...    0    0    0]
 [   1  167  330 ...    0    0    0]
 [   8   21    3 ...    0    0    0]
 ...
 [   8    3 1450 ...    0    0    0]
 [9331 6826  600 ...    0    0    0]
 [ 322  331 1967 ...    0    0    0]] 

Encoded X Test
 [[ 556 2163  812 ...    0    0    0]
 [   8 4906 4907 ...    0    0    0]
 [ 948  287 3763 ...  577 2224  178]
 ...
 [   8 7848 2411 ...    0    0    0]
 [  50    9   28 ... 7044  381   93]
 [  39  308 2598 ...    4  192   81]] 

Maximum review length:  130
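
To make the encoding step concrete, here is a simplified pure-Python sketch of what the tokenizer and `pad_sequences` do with `padding='post'` and `truncating='post'`. The helpers `build_word_index`, `encode`, and `pad_post` are hypothetical illustrations, not the Keras implementation: words are indexed from 1 by descending frequency, index 0 is reserved for padding, and words not seen during fitting are skipped.

```python
from collections import Counter

def build_word_index(token_lists):
    """Map each word to an integer, most frequent first; 0 is reserved for padding."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def encode(tokens, word_index):
    """Replace words by their index; words missing from the index are skipped."""
    return [word_index[w] for w in tokens if w in word_index]

def pad_post(seq, maxlen):
    """Truncate at the end, or right-pad with zeros, to exactly maxlen items."""
    return (seq[:maxlen] + [0] * maxlen)[:maxlen]

train = [["movie", "good"], ["movie", "bad", "plot"], ["good", "movie"]]
index = build_word_index(train)
print(index)   # {'movie': 1, 'good': 2, 'bad': 3, 'plot': 4}
print(pad_post(encode(["good", "movie", "unseen"], index), 4))   # [2, 1, 0, 0]
```

The index is built from the training reviews only, which is why test-time words outside the vocabulary simply disappear from the encoded sequence.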


# **Build Architecture/Model**
**Embedding Layer: in simple terms, it maps each word in the word_index to a dense vector (32 values here). The vectors are learned during training, so words used in similar contexts tend to end up with similar vectors.**

**RNN Layer: a SimpleRNN reads the review one word at a time and updates its hidden state from the current input and the previous hidden state. Unlike an LSTM, it has no gates that decide what to keep or throw away.**

**Dense Layer: multiplies its input by a weight matrix, adds a bias (optional), and applies an activation function. I use the Sigmoid activation function for this work because it squashes the output into a value between 0 and 1, which can be read as the probability of a positive review.**

**The optimizer is Adam and the loss function is Binary Crossentropy because the target is binary (0 or 1).**
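
As a small worked example of the output activation and the loss, the sketch below computes a sigmoid output and the binary cross-entropy for both possible labels. The function names and the clipping constant `eps` are assumptions for illustration; Keras applies the same formula averaged over a batch.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

p = sigmoid(2.0)
print(round(p, 4))                           # 0.8808
print(round(binary_crossentropy(1, p), 4))   # 0.1269 (confident and correct)
print(round(binary_crossentropy(0, p), 4))   # 2.1269 (confident and wrong)
```

The loss grows quickly when the model is confident and wrong, which is how a test loss can be high even when accuracy looks reasonable.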

In [None]:
rnn = Sequential()

rnn.add(Embedding(total_words, 32, input_length=max_length))
rnn.add(SimpleRNN(64, return_sequences=False, activation="relu"))   # input_shape removed: only the first layer needs it
rnn.add(Dense(1, activation='sigmoid'))   # single probability output

print(rnn.summary())
rnn.compile(loss="binary_crossentropy",optimizer='adam',metrics=["accuracy"])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 130, 32)           2950880   
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 64)                6208      
_________________________________________________________________
dense (Dense)                (None, 1)                 65        
Total params: 2,957,153
Trainable params: 2,957,153
Non-trainable params: 0
_________________________________________________________________
None
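
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for each layer type; the helper names are hypothetical, while the sizes (92215 words, 32-dimensional embeddings, 64 recurrent units) come from the notebook.

```python
def embedding_params(vocab_size, dim):
    """One vector of length `dim` per vocabulary entry."""
    return vocab_size * dim

def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + one bias per unit."""
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    """Weights + one bias per unit."""
    return units * (input_dim + 1)

total_words, embed_dim, rnn_units = 92215, 32, 64
counts = [
    embedding_params(total_words, embed_dim),
    simple_rnn_params(embed_dim, rnn_units),
    dense_params(rnn_units, 1),
]
print(counts, sum(counts))   # [2950880, 6208, 65] 2957153
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the size of the model.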


# **Training the Model**

In [None]:
history = rnn.fit(x_train,y_train,epochs = 100,batch_size=128,verbose = 1)
score = rnn.evaluate(x_test, y_test, verbose=1)

Train on 40000 samples


2023-08-25 04:48:02.339942: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2023-08-25 04:48:02.508103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties: 
name: NVIDIA A100-SXM4-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:07:00.0
2023-08-25 04:48:02.509435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 1 with properties: 
name: NVIDIA A100-SXM4-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:0f:00.0
2023-08-25 04:48:02.510767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 2 with properties: 
name: NVIDIA A100-SXM4-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:47:00.0
2023-08-25 04:48:02.512076: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 3 with properties: 
name: NVIDIA A100-SXM4-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:4e:00.0
202

Epoch 1/100
  128/40000 [..............................] - ETA: 1:54 - loss: 0.6943 - acc: 0.4688

2023-08-25 04:48:04.485691: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11


Epoch 2/100
...
Epoch 78/100
Epoch 7

In [None]:
print("Test Score:", score[0])
print("Test Accuracy:", score[1])

Test Score: 0.9657272222042084
Test Accuracy: 0.758


In [None]:
rnn.save('rnn.h5')   # save() returns None, so there is nothing to assign
loaded_model = load_model('rnn.h5')



# **Evaluation**

In [None]:
y_pred = rnn.predict(x_test, batch_size = 128)
print(y_pred)
print(y_test)
for i in range(len(y_pred)):
  if y_pred[i]>0.5:
    y_pred[i] = 1
  else:
    y_pred[i] = 0

true = 0
for i, y in enumerate(y_test):
    if y == y_pred[i]:
        true += 1

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred) - true))
print('Accuracy: {}'.format(true/len(y_pred)*100))

[[9.38848555e-01]
 [1.00796148e-01]
 [5.69291227e-03]
 ...
 [1.00878276e-01]
 [5.97474049e-04]
 [1.97202727e-01]]
23950    1
3965     0
34102    1
19971    1
48034    1
        ..
24569    0
2404     0
14261    0
4451     0
28786    1
Name: sentiment, Length: 10000, dtype: int64
Correct Prediction: 7580
Wrong Prediction: 2420
Accuracy: 75.8
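
Accuracy alone does not show which kind of mistake the model makes. The sketch below thresholds predicted probabilities at 0.5, as in the loop above, and counts true/false positives and negatives. The helper names and the five probabilities are made up for illustration and are not results from this model.

```python
def threshold(probs, cutoff=0.5):
    """Turn probabilities into 0/1 labels using a strict cutoff."""
    return [1 if p > cutoff else 0 for p in probs]

def confusion_counts(y_true, y_hat):
    """Return (true positives, true negatives, false positives, false negatives)."""
    pairs = list(zip(y_true, y_hat))
    tp = sum(1 for t, h in pairs if t == 1 and h == 1)
    tn = sum(1 for t, h in pairs if t == 0 and h == 0)
    fp = sum(1 for t, h in pairs if t == 0 and h == 1)
    fn = sum(1 for t, h in pairs if t == 1 and h == 0)
    return tp, tn, fp, fn

probs = [0.94, 0.10, 0.01, 0.80, 0.60]   # made-up model outputs
truth = [1, 0, 1, 1, 0]                  # made-up labels

preds = threshold(probs)
tp, tn, fp, fn = confusion_counts(truth, preds)
accuracy = (tp + tn) / len(truth)
print(preds)                     # [1, 0, 0, 1, 1]
print(tp, tn, fp, fn, accuracy)  # 2 1 1 1 0.6
```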


Message: **Nothing was typical about this. Everything was beautifully done in this movie, the story, the flow, the scenario, everything. I highly recommend it for mystery lovers, for anyone who wants to watch a good movie!**

# **Example Review**

In [None]:
review = str(input('Movie Review: '))

Movie Review:  movie is good


# **Pre-processing of the Entered Review**

In [None]:
# Pre-process input
regex = re.compile(r'[^a-zA-Z\s]')
review = regex.sub('', review)
print('Cleaned: ', review)

words = review.split(' ')
filtered = [w for w in words if w not in english_stops]
filtered = ' '.join(filtered)
filtered = [filtered.lower()]

print('Filtered: ', filtered)

Cleaned:  movie is good
Filtered:  ['movie good']


In [None]:
tokenize_words = token.texts_to_sequences(filtered)
tokenize_words = pad_sequences(tokenize_words, maxlen=max_length, padding='post', truncating='post')
print(tokenize_words)

[[3 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


# **Prediction**

In [None]:
result = rnn.predict(tokenize_words)
print(result)

[[0.9390594]]


In [None]:
# Note: this cell uses a stricter 0.7 cutoff than the 0.5 used in the Evaluation section
if result >= 0.7:
    print('positive')
else:
    print('negative')

positive
