
## Movie Review Prediction RNN Model

Dataset: the IMDB dataset that ships with the Keras library in Python.
The IMDB dataset in Keras is a popular dataset for sentiment analysis.
It consists of movie reviews from IMDB, each labeled as either positive or negative; imdb.load_data returns 25,000 reviews for training and 25,000 for testing.

For convenience, words are indexed by overall frequency in the dataset;
for instance, the integer "3" encodes the 3rd most frequent word in the data.

This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

<h3>Word Indexing by Frequency</h3>

In the IMDB dataset, words are indexed based on their frequency of occurrence.

This means that:

The most frequent word in the dataset is assigned the index 1.
The second most frequent word is assigned the index 2.
The third most frequent word is assigned the index 3, and so on.


Example
Let’s say we have a small dataset with the following word frequencies:

“the” appears 5000 times

“movie” appears 3000 times

“was” appears 2000 times

“great” appears 1500 times

“bad” appears 1000 times


The indexing would look like this:

“the” -> 1
“movie” -> 2
“was” -> 3
“great” -> 4
“bad” -> 5
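
As a minimal sketch of this ranking, the toy counts from the example can be sorted by frequency with Python's `collections.Counter`. The variable names here are illustrative and not part of Keras.

```python
from collections import Counter

# Toy word counts taken from the example above
counts = Counter({"the": 5000, "movie": 3000, "was": 2000, "great": 1500, "bad": 1000})

# Rank words from most to least frequent, starting at index 1
word_index = {
    word: rank
    for rank, (word, _) in enumerate(counts.most_common(), start=1)
}

print(word_index)
# {'the': 1, 'movie': 2, 'was': 3, 'great': 4, 'bad': 5}
```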


Filtering Operations

This indexing allows for quick filtering operations. For example:

---Top 10,000 Most Common Words: You can limit your dataset to only include the top 10,000 most frequent words.
This helps in reducing the vocabulary size and focusing on the most relevant words.

---Eliminate the Top 20 Most Common Words: You can exclude the top 20 most frequent words, which are often stop words like “the,” “is,” “and,” etc., that might not add significant value to the analysis.

https://keras.io/api/datasets/imdb/
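
In Keras these two filters correspond to the `num_words` and `skip_top` arguments of `imdb.load_data`, for example `imdb.load_data(num_words=10000, skip_top=20)`. The sketch below imitates their effect on a single sequence in plain Python; `filter_sequence` is a hypothetical helper written for illustration, and it assumes that filtered words are replaced by an out-of-vocabulary index (2, the Keras default) rather than dropped.

```python
def filter_sequence(sequence, num_words=None, skip_top=0, oov_index=2):
    """Replace indices outside the range [skip_top, num_words) with oov_index."""
    filtered = []
    for idx in sequence:
        too_rare = num_words is not None and idx >= num_words
        too_common = idx < skip_top
        filtered.append(oov_index if (too_rare or too_common) else idx)
    return filtered


# Made-up sequence of word indices
review = [1, 4, 25, 9999, 10000, 15000]
print(filter_sequence(review, num_words=10000, skip_top=20))
# [2, 2, 25, 9999, 2, 2]
```

Because filtered positions are replaced instead of removed, the review keeps its original length and word order.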

## Import Required Libraries

In [None]:
import numpy as np
import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.datasets import imdb
from keras.utils import pad_sequences

## Import Dataset: keep the 20,000 most frequently occurring words

In [None]:
# Load the IMDB data: https://keras.io/api/datasets/imdb/
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=20000)
# num_words=20000 keeps only the 20,000 most frequent words
# The IMDB data is already split into training and test sets by default

## Explore Imported Data

In [None]:
#Check the shape of the training data.
X_train.shape

(25000,)

In [None]:
# Check the shape of the test data.
X_test.shape

(25000,)

In [None]:
# Print a sample review (as a list of word indices):
# here, the review at index 1 of the training data
X_train[1]

[1,
 194,
 1153,
 194,
 8255,
 78,
 228,
 5,
 6,
 1463,
 4369,
 5012,
 134,
 26,
 4,
 715,
 8,
 118,
 1634,
 14,
 394,
 20,
 13,
 119,
 954,
 189,
 102,
 5,
 207,
 110,
 3103,
 21,
 14,
 69,
 188,
 8,
 30,
 23,
 7,
 4,
 249,
 126,
 93,
 4,
 114,
 9,
 2300,
 1523,
 5,
 647,
 4,
 116,
 9,
 35,
 8163,
 4,
 229,
 9,
 340,
 1322,
 4,
 118,
 9,
 4,
 130,
 4901,
 19,
 4,
 1002,
 5,
 89,
 29,
 952,
 46,
 37,
 4,
 455,
 9,
 45,
 43,
 38,
 1543,
 1905,
 398,
 4,
 1649,
 26,
 6853,
 5,
 163,
 11,
 3215,
 10156,
 4,
 1153,
 9,
 194,
 775,
 7,
 8255,
 11596,
 349,
 2637,
 148,
 605,
 15358,
 8003,
 15,
 123,
 125,
 68,
 2,
 6853,
 15,
 349,
 165,
 4362,
 98,
 5,
 4,
 228,
 9,
 43,
 2,
 1157,
 15,
 299,
 120,
 5,
 120,
 174,
 11,
 220,
 175,
 136,
 50,
 9,
 4373,
 228,
 8255,
 5,
 2,
 656,
 245,
 2350,
 5,
 4,
 9837,
 131,
 152,
 491,
 18,
 2,
 32,
 7464,
 1212,
 14,
 9,
 6,
 371,
 78,
 22,
 625,
 64,
 1382,
 9,
 8,
 168,
 145,
 23,
 4,
 1690,
 15,
 16,
 4,
 1355,
 5,
 28,
 6,
 52,
 154,
 462,
 33,


Understanding the Output
Word Indices: Each number in the list is an index that maps to a word in the dataset’s vocabulary.

Sequence: The sequence of numbers represents the order of words in the review.

Example Breakdown
Let’s take a closer look at the first few indices:

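1: This is the special token that marks the start of a sequence (the default start_char of imdb.load_data). By default the indices of actual words are also shifted by 3 (index_from=3), which leaves 0 for padding and 2 for out-of-vocabulary words, so the most frequent word appears as 4 in a loaded review.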

194, 1153, 194: These numbers correspond to specific words in the vocabulary.
For example, if 194 maps to the word “movie,” then every occurrence of 194 in the sequence represents the word “movie.”
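
To make the mapping concrete, here is a small sketch of decoding a sequence back into words. It uses a hypothetical five-word index in place of the real dictionary returned by `imdb.get_word_index()`, and it assumes the Keras defaults: real words are offset by 3, and 0, 1 and 2 stand for padding, start of sequence and out-of-vocabulary words.

```python
# Hypothetical miniature word index, shaped like the dict from imdb.get_word_index()
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4, "bad": 5}

INDEX_FROM = 3  # default offset applied by imdb.load_data
SPECIAL = {0: "<PAD>", 1: "<START>", 2: "<OOV>"}

# Map shifted indices back to words
reverse_index = {rank + INDEX_FROM: word for word, rank in toy_word_index.items()}


def decode(sequence):
    """Turn a list of word indices into a readable string."""
    words = []
    for idx in sequence:
        if idx in SPECIAL:
            words.append(SPECIAL[idx])
        else:
            words.append(reverse_index.get(idx, "<OOV>"))
    return " ".join(words)


print(decode([1, 4, 5, 6, 7, 0, 0]))
# <START> the movie was great <PAD> <PAD>
```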

<h3>Dataset vocabulary</h3>
A dataset vocabulary refers to the set of unique words or tokens that appear in a dataset.



Key Points about Dataset Vocabulary

---Unique Words: The vocabulary consists of all the unique words present in the dataset. For example, if your dataset contains the sentences “I love movies” and “I love coding,” the vocabulary would be {"I", "love", "movies", "coding"}.

---Word Indexing: Each word in the vocabulary is often assigned a unique index.
This indexing helps in converting words into numerical representations that can be processed by machine learning models.
For example, {"I": 1, "love": 2, "movies": 3, "coding": 4}.

---Frequency-Based Filtering: In some cases, the vocabulary is limited to the most frequent words.
For instance, you might only keep the top 10,000 most frequent words in the dataset to reduce complexity and focus on the most relevant words.

---Handling Out-of-Vocabulary Words: Words that are not in the vocabulary are often replaced with a special token
(e.g., `<OOV>` or a reserved index; imdb.load_data uses 2 by default and leaves 0 for padding). This helps in managing words that the model has not seen during training.
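
The points above can be sketched with the two example sentences. This is a toy illustration rather than how Keras builds its index: words are numbered in order of first appearance, and index 0 is reserved here for out-of-vocabulary words as an assumption made for the example.

```python
sentences = ["I love movies", "I love coding"]

# Assign each new word the next free index, starting at 1
vocab = {}
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1

OOV_INDEX = 0  # assumed placeholder for unseen words in this toy example


def encode(text):
    """Convert a sentence to indices, mapping unseen words to OOV_INDEX."""
    return [vocab.get(word, OOV_INDEX) for word in text.split()]


print(vocab)                   # {'I': 1, 'love': 2, 'movies': 3, 'coding': 4}
print(encode("I love pizza"))  # [1, 2, 0]
```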

In [None]:
y_train

array([1, 0, 0, ..., 0, 1, 0])

## Add Padding to make inputs of same size

The pad_sequences function in Keras is used to ensure that all sequences in a dataset have the same length.

This is important because neural networks require inputs of uniform size.

The padding='post' argument specifies that padding should be added to the end of each sequence.
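
The behaviour described above can be imitated in plain Python. `pad_post` is a hypothetical helper, not the Keras implementation. When `maxlen` is omitted it pads to the longest sequence, as `pad_sequences` does; when `maxlen` is given it keeps the first `maxlen` items, whereas Keras by default truncates from the front.

```python
def pad_post(sequences, maxlen=None, value=0):
    """Pad each sequence at the end with `value` so all have length `maxlen`."""
    if maxlen is None:
        # Default: use the length of the longest sequence
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = list(seq)[:maxlen]  # keep at most maxlen items from the start
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


toy = [[7, 8, 9, 10], [4, 5], [6]]
print(pad_post(toy))            # [[7, 8, 9, 10], [4, 5, 0, 0], [6, 0, 0, 0]]
print(pad_post(toy, maxlen=3))  # [[7, 8, 9], [4, 5, 0], [6, 0, 0]]
```

Capping the length with `maxlen` also shortens every padded review, which reduces the number of time steps a recurrent layer has to process.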

In [None]:
# Use pad_sequences to ensure all sequences have the same length.
# maxlen is not set, so each array is padded to the length of its own longest review.
X_train = pad_sequences(X_train, padding='post')
X_test = pad_sequences(X_test, padding='post')

'''
pad_sequences: This function pads sequences to the same length.
padding='post': This argument specifies that padding should be added to the end (or "post") of each sequence.

Example
Original Sequences: [1, 2, 3], [4, 5], [6]
Padded Sequences: [1, 2, 3], [4, 5, 0], [6, 0, 0]
'''


In [None]:
# Check the padding on the review at index 1
X_train[1]

array([   1,  194, 1153, ...,    0,    0,    0], dtype=int32)

In [None]:
'''
Key purposes of the Embedding layer:

---Dimensionality Reduction: Converts high-dimensional sparse vectors into dense, lower-dimensional vectors.

---Capturing Semantic Relationships: Learns word embeddings that capture the meanings and relationships between words.

---Handling Variable Input Lengths: Maps words to fixed-size vectors, making it easier to process sequences of different lengths.

---Improving Model Performance: Provides meaningful word representations, enhancing the performance of NLP tasks.

'''
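
Conceptually, an embedding layer is a trainable lookup table with one row per word index. The sketch below shows only the lookup, with small random values standing in for learned weights; the toy sizes and names are assumptions made for the example.

```python
import random

random.seed(0)

vocab_size, embed_dim = 10, 4  # toy sizes; this notebook uses 20000 and 128

# One row of embed_dim numbers per word index (random stand-ins for learned weights)
table = [
    [random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]


def embed(sequence):
    """Look up the vector for every index in the sequence."""
    return [table[idx] for idx in sequence]


vectors = embed([1, 4, 4, 0])
print(len(vectors), len(vectors[0]))  # 4 4

# The same index always yields the same vector
print(vectors[1] == vectors[2])       # True

# Number of weights in a 20000 x 128 table
n_params = 20000 * 128
print(n_params)                       # 2560000
```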


In [None]:
'''
Key purposes of the LSTM (Long Short-Term Memory) layer:

---Handling Long-Term Dependencies: Captures and retains information over long sequences, addressing the vanishing gradient problem.

---Sequence Learning: Effective for tasks involving sequential data, such as time series prediction, language modeling, and speech recognition.

---Memory Cells: Uses memory cells to store information, allowing the network to learn which information to keep or discard.

---Gating Mechanisms: Employs input, output, and forget gates to control the flow of information, enhancing the model’s ability to learn complex patterns.
'''
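
To illustrate how the gates interact, here is a sketch of one time step of a single LSTM unit with scalar input and state, following the standard LSTM equations. The weights are hypothetical values picked by hand; a real layer learns weight matrices and runs many units in parallel.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM cell (scalar input and state)."""
    f = sigmoid(w["f_x"] * x + w["f_h"] * h_prev + w["f_b"])    # forget gate
    i = sigmoid(w["i_x"] * x + w["i_h"] * h_prev + w["i_b"])    # input gate
    o = sigmoid(w["o_x"] * x + w["o_h"] * h_prev + w["o_b"])    # output gate
    g = math.tanh(w["g_x"] * x + w["g_h"] * h_prev + w["g_b"])  # candidate value
    c = f * c_prev + i * g  # keep part of the old memory, add part of the new
    h = o * math.tanh(c)    # expose part of the memory as the output
    return h, c


# Hypothetical weights, chosen by hand for illustration
weights = {
    "f_x": 0.5, "f_h": 0.1, "f_b": 1.0,
    "i_x": 0.8, "i_h": -0.2, "i_b": 0.0,
    "o_x": 0.3, "o_h": 0.4, "o_b": 0.0,
    "g_x": 1.2, "g_h": 0.5, "g_b": 0.0,
}

h, c = 0.0, 0.0
for x in [0.5, -1.0, 0.25]:
    h, c = lstm_step(x, h, c, weights)
    print(round(h, 4), round(c, 4))
```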


## Build Model

In [None]:
'''
1. Initialize a sequential model.

2. Add an Embedding layer to convert word indices to dense vectors.

3. Add an LSTM layer to capture long-term dependencies.

4. Add a Dense output layer with a single unit for binary classification (sigmoid is the usual activation here; a softmax over one unit always outputs 1.0).
'''


In [None]:
#Initializes a sequential model, which is a linear stack of layers.

model = Sequential()

'''
Adds an Embedding layer.
20000: Size of the vocabulary (top 20,000 most frequent words).
128: Dimension of the dense embedding vectors.

Purpose: Converts high-dimensional sparse vectors into dense, lower-dimensional vectors.
Parameters: 20000 (vocabulary size), 128 (embedding dimension).
'''
model.add(Embedding(20000, 128))

'''
Adds an LSTM (Long Short-Term Memory) layer.
128: Number of LSTM units.
dropout=0.2: Dropout rate for regularization,
which helps prevent overfitting by randomly setting 20% of the input units to 0
at each update during training.
'''
model.add(LSTM(128, dropout=0.2))

'''
Adds a Dense (fully connected) layer.
1: Number of output units (since this is a binary classification problem).
activation='sigmoid': Squashes the output to a probability between 0 and 1.
A softmax over a single unit would always output 1.0, so sigmoid is used instead.
'''
model.add(Dense(1, activation='sigmoid'))

# Print a summary of the model
model.summary()
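
The choice of output activation matters for a single-unit layer. As a small plain-Python sketch (the helper functions are written here for illustration), softmax normalises across the units of a layer, so with one unit it returns 1.0 for any input, whereas sigmoid maps the input to a value between 0 and 1.

```python
import math


def softmax(logits):
    """Softmax across a list of logits (one entry per output unit)."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


for z in (-3.0, 0.0, 3.0):
    print(z, softmax([z])[0], round(sigmoid(z), 4))
# -3.0 1.0 0.0474
# 0.0 1.0 0.5
# 3.0 1.0 0.9526
```

A model whose output is always 1.0 predicts the positive class for every review, so on a balanced dataset its accuracy sits near 0.5.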

## Compile Model

In [None]:
# Compile the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

## Train the Model

Caution: run this cell only when your machine is otherwise free. A single epoch can take up to about an hour; in the run logged below, each epoch took about 24 minutes (roughly 1,430 s).
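
A rough estimate of the epoch time can be worked out from the settings passed to `model.fit` in this section (batch size 256, validation split 0.2) and the 18 s/step figure reported in its training log. The step time is specific to that run and will differ on other hardware.

```python
import math

n_reviews = 25000        # size of the training set
validation_percent = 20  # validation_split=0.2
batch_size = 256
seconds_per_step = 18    # read off the training log of this run

# Reviews held out for validation, and reviews left for training
n_val = n_reviews * validation_percent // 100
n_train = n_reviews - n_val

# Number of batches per epoch (the last batch may be smaller)
steps_per_epoch = math.ceil(n_train / batch_size)
epoch_minutes = steps_per_epoch * seconds_per_step / 60

print(n_train, steps_per_epoch, round(epoch_minutes, 1))
# 20000 79 23.7
```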

In [None]:
model.fit(X_train, y_train,
          batch_size=256,
          epochs=5,
          verbose=1,
          validation_split=0.2)

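# The output below stops partway through epoch 3. This is not an error: the kernel was interrupted because training was taking too long per epoch.
# Accuracy stays near 0.50 in this log because the run used a one-unit softmax output, which always outputs 1.0.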

Epoch 1/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 1429s 18s/step - accuracy: 0.4980 - loss: 0.6937 - val_accuracy: 0.4938 - val_loss: 0.6932
Epoch 2/5
79/79 ━━━━━━━━━━━━━━━━━━━━ 1430s 18s/step - accuracy: 0.5019 - loss: 0.6933 - val_accuracy: 0.4938 - val_loss: 0.6936
Epoch 3/5
42/79 ━━━━━━━━━━━━━━━━━━━━ 10:47 18s/step - accuracy: 0.5043 - loss: 0.6933

## Test the Model

In [None]:
# evaluate returns the loss followed by the compiled metrics (here, accuracy)
loss, acc = model.evaluate(X_test, y_test,
                           batch_size=32,
                           verbose=1)
acc
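
For reference, the accuracy reported for a binary classifier is the fraction of reviews whose predicted probability falls on the correct side of a threshold. The sketch below assumes a threshold of 0.5 and uses made-up labels and probabilities; `binary_accuracy` is a hypothetical helper, not the Keras metric itself.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == label) for pred, label in zip(predictions, y_true))
    return correct / len(y_true)


# Made-up labels and predicted probabilities
y_true = [1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.6, 0.4]

print(binary_accuracy(y_true, y_prob))  # 0.5
```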