**Simple RNN Neural Network using the IMDB dataset**
- Step 1) - Load the IMDB dataset
- Step 2) Create the word index dictionary and the reverse word index dictionary
- Step 3) Decode the first review - Convert from Numeric embeddings (word indices) to actual words
- Step 4) Apply padding to x_train and x_test
- Step 5) Create Simple RNN model
- Step 5.1 Create the basic Simple RNN model
- Step 5.2) Create The Early Stopping Callback
- Step 5.3) Train the model and apply Early Stopping Callback
- Step 5.4) Print Simple RNN Model metadata (Optional)
- Step 6) Helper Functions
- Step 7) Do prediction with a sample user input

In [None]:
# Imports
import numpy as np
import tensorflow as tf

# IMDB dataset is a built-in dataset in TensorFlow that contains movie reviews along with their corresponding sentiment labels (positive or negative). IMDB dataset is present in other places too.
from tensorflow.keras.datasets import imdb 
from tensorflow.keras.preprocessing import sequence

# Sequential API is a linear stack of layers in Keras, which allows you to create a neural network by adding layers sequentially. For any Neural Network 'Sequential' is a must.
from tensorflow.keras.models import Sequential 

# Dense is for creating nodes which are connected on multiple sides (hence the term Dense) - Dense is frequently used to create the hidden layers and output layer of a neural network. 
# SimpleRNN will create the RNN nodes
# Embedding is used to create the Embedding layer which is used to convert the integer encoded words into dense vectors of fixed size.
from tensorflow.keras.layers import Embedding,SimpleRNN,Dense 


In [58]:
# Step 1 - Load the IMDB dataset
# The imdb.load_data() gives the word embeddings (in the form of numerical values) for the reviews in the dataset.
# - max_vocabulary_size = 10000 means that we will only consider the top 10,000 most frequently occurring words in the reviews and ignore the rest of the words.
max_vocabulary_size=10000 
(x_train,y_train),(x_test,y_test)=imdb.load_data(num_words=max_vocabulary_size)
print('x_test = ',x_test)

# Print the shape of the data
print(f'Training data shape: {x_train.shape}, Training labels shape: {y_train.shape}')
print(f'Testing data shape: {x_test.shape}, Testing labels shape: {y_test.shape}')

# Print the first review and its corresponding label
# - x_train and x_test - Each number in the list (x_train or x_test)is a 'numeric word index' that corresponds to a specific word in the IMDB dataset's vocabulary.
# - y_train and y_test - Each number in the list (y_train or y_test) is a label that indicates whether the review is positive (1) or negative (0).
# - x_train[0] → first review
# - x_train[1] → second review
sample_first_review=x_train[0]
sample_first_review_output=y_train[0]
print(f'First review (as word indices): {sample_first_review}')
print(f'First review output: {sample_first_review_output}')

x_test =  [list([1, 591, 202, 14, 31, 6, 717, 10, 10, 2, 2, 5, 4, 360, 7, 4, 177, 5760, 394, 354, 4, 123, 9, 1035, 1035, 1035, 10, 10, 13, 92, 124, 89, 488, 7944, 100, 28, 1668, 14, 31, 23, 27, 7479, 29, 220, 468, 8, 124, 14, 286, 170, 8, 157, 46, 5, 27, 239, 16, 179, 2, 38, 32, 25, 7944, 451, 202, 14, 6, 717])
 list([1, 14, 22, 3443, 6, 176, 7, 5063, 88, 12, 2679, 23, 1310, 5, 109, 943, 4, 114, 9, 55, 606, 5, 111, 7, 4, 139, 193, 273, 23, 4, 172, 270, 11, 7216, 2, 4, 8463, 2801, 109, 1603, 21, 4, 22, 3861, 8, 6, 1193, 1330, 10, 10, 4, 105, 987, 35, 841, 2, 19, 861, 1074, 5, 1987, 2, 45, 55, 221, 15, 670, 5304, 526, 14, 1069, 4, 405, 5, 2438, 7, 27, 85, 108, 131, 4, 5045, 5304, 3884, 405, 9, 3523, 133, 5, 50, 13, 104, 51, 66, 166, 14, 22, 157, 9, 4, 530, 239, 34, 8463, 2801, 45, 407, 31, 7, 41, 3778, 105, 21, 59, 299, 12, 38, 950, 5, 4521, 15, 45, 629, 488, 2733, 127, 6, 52, 292, 17, 4, 6936, 185, 132, 1988, 5304, 1799, 488, 2693, 47, 6, 392, 173, 4, 2, 4378, 270, 2352, 4, 1500, 7, 4, 

In [59]:
# Step 2) Create the word index dictionary and the reverse word index dictionary
# word_index = e.g. fawn: 34701 means that the word 'fawn' is represented by the index 34701 in the IMDB dataset's vocabulary.
# reverse_word_index (opp of word_index)= e.g. 34701: fawn means that the word 'fawn' is represented by the index 34701 in the IMDB dataset's vocabulary.
word_index=imdb.get_word_index()
print('word_index = ')
print(list(word_index.items())[:10]) # Print the first 10 items in the word_index dictionary

reverse_word_index = {value: key for key, value in word_index.items()}
print('reverse_word_index = ')
print(list(reverse_word_index.items())[:10])


word_index = 
[('fawn', 34701), ('tsukino', 52006), ('nunnery', 52007), ('sonja', 16816), ('vani', 63951), ('woods', 1408), ('spiders', 16115), ('hanging', 2345), ('woody', 2289), ('trawling', 52008)]
reverse_word_index = 
[(34701, 'fawn'), (52006, 'tsukino'), (52007, 'nunnery'), (16816, 'sonja'), (63951, 'vani'), (1408, 'woods'), (16115, 'spiders'), (2345, 'hanging'), (2289, 'woody'), (52008, 'trawling')]


In [60]:
# Step 3) Decode the first review - Convert from Numeric embeddings (word indices) to actual words
# This is an optional step just to see the decoding (Numeric embedding --> Text format) actual works 
# - The word indices in the sample_first_review are offset by 3 because the IMDB dataset reserves the indices 0, 1, and 2 for special tokens (padding, start
decoded_first_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in sample_first_review])
print('Decoded first review:', decoded_first_review)

Decoded first review: ? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they h

In [None]:
# Step 4 - Apply padding to x_train and x_test
# - The length of each review (i.e. the number of words in each review) can vary, 
#- which is why we need to pad the sequences to ensure that they all have 
# - the same length (max_len=500 in this case).
from tensorflow.keras.preprocessing import sequence
max_len=500

'''
Output before padding:
= (25000,) 
- This means that there are 25,000 reviews in the training set 
- and each review is represented as a list of word indices (numeric embeddings). 
- The length of each review (i.e. the number of words in each review) can vary, 
- which is why we need to pad the sequences to ensure that they all have 
- the same length (max_len=500 in this case).
'''
print('Before padding shape = ',x_train.shape) 
print('Before padding: ',x_train[0])


# Padding
# By default, pad_sequences() function does pre-padding, which means that it adds zeros at the beginning of the sequences.
x_train=sequence.pad_sequences(x_train,maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)

'''
Output after padding:
- (25000, 500)
'''
# Output 
print('after padding shape = ',x_train.shape)
print('after padding: ',x_train[0])


Before padding shape =  (25000,)
Before padding:  [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
after padd

In [None]:
# Step 5 - Create Simple RNN model
# Step 5.1 - Create the basic Simple RNN model
simple_rnn_model=Sequential()

# Embedding Layers - This layer will take the word indices (numeric embeddings) as input and convert them into dense vectors of fixed size (128 in this case).
# 128 neurons in the embedding layer.
simple_rnn_model.add(Embedding(max_vocabulary_size ,128,input_length=max_len)) 

simple_rnn_model.add(SimpleRNN(128,activation='relu')) # Simple RNN layer with 128 units and ReLU activation function

# Dense layer with 1 neuron and sigmoid activation function for binary classification
simple_rnn_model.add(Dense(1,activation="sigmoid")) # Dense layer with 1 neuron and sigmoid activation function for binary classification

# Binary Crossentropy loss function is used for binary classification problems, where the output is either 0 or 1.
simple_rnn_model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
simple_rnn_model.summary()

In [65]:
# Step 5.2) Create The Early Stopping Callback
from tensorflow.keras.callbacks import EarlyStopping
earlystopping=EarlyStopping(monitor='val_loss',patience=5,restore_best_weights=True)
earlystopping

<keras.src.callbacks.early_stopping.EarlyStopping at 0x1873dc500>

In [None]:
# Step 5.3) Train the model and apply Early Stopping Callback
'''
- When you call .fit(), Keras trains the model and returns a history object 
- that stores the training results from your model.
- It stores metrics for each epoch, such as:
    - Training loss
    - Training accuracy
    - Validation loss
    - Validation accuracy
- We can access this information through history.history, which is a dictionary containing the metrics for each epoch.
'''
history=simple_rnn_model.fit(
    x_train,y_train,epochs=10,batch_size=32,
    validation_split=0.2,
    callbacks=[earlystopping]
)
print('Training history = ',history.history)

Epoch 1/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m59s[0m 91ms/step - accuracy: 0.6014 - loss: 870.0620 - val_accuracy: 0.6620 - val_loss: 0.6169
Epoch 2/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m105s[0m 168ms/step - accuracy: 0.7398 - loss: 0.5141 - val_accuracy: 0.7358 - val_loss: 0.5209
Epoch 3/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m110s[0m 176ms/step - accuracy: 0.8306 - loss: 0.3792 - val_accuracy: 0.7710 - val_loss: 0.4863
Epoch 4/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m98s[0m 157ms/step - accuracy: 0.8267 - loss: 20717030.0000 - val_accuracy: 0.8040 - val_loss: 0.4426
Epoch 5/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m108s[0m 173ms/step - accuracy: 0.9004 - loss: 0.2516 - val_accuracy: 0.7978 - val_loss: 0.4620
Epoch 6/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m125s[0m 199ms/step - accuracy: 0.9171 - loss: 0.2318 - val_accuracy: 0.8066 - val_loss: 0.459



In [None]:
# Step 5.4) Print Simple RNN Model metadata (Optional)
# Print model weights
# - The get_weights() method of a Keras model returns a list of all the weights 
# - (1) Embedding weights
# - (2) Simple RNN weights
# - (3) Bias
# - (4) Dense layer weights
# - (5) Recurrent weights of the Simple RNN layer

# simple_rnn_model.get_weights() - Returns the weights for all layers in order of the layers in the model.
# simple_rnn_model.layer.get_weights()[0] - Returns the weights for a specific layer in the model. You can specify the layer by its index or name.
print('Model weights = ',simple_rnn_model.get_weights())

Model weights =  [array([[-0.37726715, -0.32035768, -0.20062667, ...,  0.4667506 ,
        -0.11329959, -1.0732812 ],
       [ 0.07108647,  0.03172462, -0.01864558, ..., -0.04093852,
        -0.06434023, -0.00741006],
       [-0.0056583 ,  0.0506408 ,  0.05242374, ...,  0.00637565,
         0.0645816 , -0.05299132],
       ...,
       [-0.02541414, -0.06369624, -0.06624307, ...,  0.05469832,
         0.04099702, -0.05451935],
       [ 0.06334846,  0.04406402,  0.0189462 , ..., -0.06024647,
        -0.04491138,  0.01217934],
       [-0.04668416, -0.07016868, -0.12000916, ...,  0.04813214,
         0.09461384, -0.05623053]], dtype=float32), array([[ 0.02728274, -0.147261  ,  0.03197928, ...,  0.12192511,
         0.15190785, -0.04132107],
       [ 0.13361727,  0.09165725, -0.22814906, ..., -0.00184063,
         0.11813181,  0.00085336],
       [-0.005473  ,  0.08858676, -0.15086651, ..., -0.03359535,
        -0.04997173,  0.03973466],
       ...,
       [-0.14052275, -0.03455053,  0.0960

In [69]:
# Step 5.5) Save the Simple RNN model
simple_rnn_model.save('./resources/dist/2_simple_rnn_model.h5')



In [None]:
# Step 6: Helper Functions

# Function to convert numeric embeddings (word indices) back to text format
# e.g. Input = encoded_review = [1, 14, 22, 16, 43]
# Output = "? this film was great"
'''
 - i - 3 → adjusts for IMDB’s index offset
 - reverse_word_index → maps number → word
 - '?' → used if word not found

 This function is not used in the example.
'''
def decode_review(encoded_review):
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in encoded_review])

# Function to preprocess user input 
# - Split the whole text into words
# - Convert the words to their corresponding word indices (numeric embeddings) using the word_index dictionary
# - Pad the encoded review to ensure it has the same length as the reviews in the training data (max_len=500 in this case).
'''
e.g. Input = text = "This movie was great"
e.g. Output = [[1, 14, 22, 16, 43, 0, 0, 0, ..., 0]] (padded and encoded review)
'''
def preprocess_text(text):
    words = text.lower().split() # Split the input text to words
    encoded_review = [word_index.get(word, 2) + 3 for word in words] # Convert the words to their corresponding word indices (numeric embeddings) using the word_index dictionary. If a word is not found in the word_index, it is assigned a default index of 2 (which corresponds to the "unknown" token in the IMDB dataset).
    padded_review = sequence.pad_sequences([encoded_review], maxlen=500) # Pad the encoded review to ensure it has the same length as the reviews in the training data (max_len=500 in this case).
    return padded_review

# Do prediction on user input
# e.g. Input = A movie review in text format (e.g. "This movie was fantastic! I loved it.")
# e.g. Output = Positive or Negative
def predict_sentiment(review):
    preprocessed_input=preprocess_text(review)
    prediction=simple_rnn_model.predict(preprocessed_input)
    sentiment = 'Positive' if prediction[0][0] > 0.5 else 'Negative'
    return sentiment, prediction[0][0]   

In [None]:
# Step 7 - Do prediction with a sample user input
example_review = "This movie was fantastic! The acting was great and the plot was thrilling."
sentiment,score=predict_sentiment(example_review)

print(f'Review: {example_review}')
print(f'Sentiment: {sentiment}')
print(f'Prediction Score: {score}') # Probability score indicating how positive the review is (closer to 1 means more positive, closer to 0 means more negative)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 160ms/step
Review: This movie was fantastic! The acting was great and the plot was thrilling.
Sentiment: Positive
Prediction Score: 0.5977053642272949
