# Homework: Recurrent Neural Networks for Natural Language Processing

The goal of the RNN below is to predict the _sentiment_ of a movie review (whether it is positive or negative) from the text of the review. The data set is already split into training and test sets, and the words in the movie reviews have already been tokenized.

In [1]:
#############
## IMPORTS ##
#############
# A data set of movie reviews from tensorflow which is already
# formatted for natural language processing (it has been 
# pre-processed for you)
from tensorflow.keras.datasets import imdb

# This pads or truncates every review to the same length so the
# reviews can be batched together. Truncating long reviews also
# decreases the amount of time needed for training
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Mostly tensorflow NN imports that have been used before
# except for embedding which will be discussed later in the
# notebook
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense



In [None]:
#############################
## FORMATTING THE DATA SET ##
#############################
# vocab_size is the maximum number of unique words kept in the
# vocabulary, ranked by how often each word appears in the data
# set. Here only the 10,000 most frequent words are kept; rarer
# words are replaced by an out-of-vocabulary token. A smaller
# vocabulary means fewer tokens and thus a shorter training time.
vocab_size = 10000
# maxlen is the number of tokens kept per review after padding
# or truncating
maxlen = 200

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)

# Pad sequences (Look up what this does.)
X_train = pad_sequences(X_train, maxlen=maxlen)
X_test = pad_sequences(X_test, maxlen=maxlen)
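
As a starting point for looking up `pad_sequences`, here is a minimal pure-Python sketch of its behavior, assuming the Keras defaults (`padding='pre'`, `truncating='pre'`, `value=0`). The helper name `pad_pre` is hypothetical and is not part of Keras; it only illustrates the idea on short lists.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the assumed Keras defaults: keep the LAST maxlen tokens
    and left-pad shorter sequences with `value`."""
    kept = list(seq)[-maxlen:]                      # drop tokens from the front
    return [value] * (maxlen - len(kept)) + kept    # pad at the front

short_review = [11, 12, 13]
long_review = [21, 22, 23, 24, 25, 26]

padded_short = pad_pre(short_review, maxlen=5)
padded_long = pad_pre(long_review, maxlen=4)

print(padded_short)
print(padded_long)
```

With these defaults, a long review keeps its ending rather than its beginning, which is why the printed `X_train[0]` need not start with the start-of-review token.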

In [3]:
# Display the first point in the X component of the 
# training data. Also print its length.
# The X data are just lists of numbers representing
# the movie review text
print(X_train[0])
print(len(X_train[0]))

[   5   25  100   43  838  112   50  670    2    9   35  480  284    5
  150    4  172  112  167    2  336  385   39    4  172 4536 1111   17
  546   38   13  447    4  192   50   16    6  147 2025   19   14   22
    4 1920 4613  469    4   22   71   87   12   16   43  530   38   76
   15   13 1247    4   22   17  515   17   12   16  626   18    2    5
   62  386   12    8  316    8  106    5    4 2223 5244   16  480   66
 3785   33    4  130   12   16   38  619    5   25  124   51   36  135
   48   25 1415   33    6   22   12  215   28   77   52    5   14  407
   16   82    2    8    4  107  117 5952   15  256    4    2    7 3766
    5  723   36   71   43  530  476   26  400  317   46    7    4    2
 1029   13  104   88    4  381   15  297   98   32 2071   56   26  141
    6  194 7486   18    4  226   22   21  134  476   26  480    5  144
   30 5535   18   51   36   28  224   92   25  104    4  226   65   16
   38 1334   88   12   16  283    5   16 4472  113  103   32   15   16
 5345 

In [4]:
# Display the corresponding y value
# Which is just a single number
print(y_train[0])

1


To get an idea of what the movie reviews actually look like, we can use the following chunk of code, modified from [the TensorFlow documentation page for the data set](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/get_word_index). Play around with the code below to see what the reviews look like before they are tokenized.

In [None]:
# Use the default parameters to keras.datasets.imdb.load_data
# Try different values and see how they change the decoded output.
start_char = 1
oov_char = 2
index_from = 3
# Retrieve the training sequences.
(x_train, _), _ = imdb.load_data(
    start_char=start_char, oov_char=oov_char, index_from=index_from
)
# Retrieve the word index file mapping words to indices
word_index = imdb.get_word_index()
# Reverse the word index to obtain a dict mapping indices to words
# And add `index_from` to indices to sync with `x_train`
inverted_word_index = dict(
    (i + index_from, word) for (word, i) in word_index.items()
)
# Update `inverted_word_index` to include `start_char` and `oov_char`
inverted_word_index[start_char] = "[START]"
inverted_word_index[oov_char] = "[OOV]"
# Decode the first sequence in the dataset
decoded_sequence = " ".join(inverted_word_index[i] for i in x_train[0])

decoded_sequence

"[START] this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and sh

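The index arithmetic in the decoding cell can be checked on a toy example. This sketch uses a hypothetical three-word vocabulary (`toy_word_index`) in place of `imdb.get_word_index()`, and it assumes that index 0 is the padding value added by `pad_sequences`, which the decoding cell above does not handle.

```python
# Hypothetical toy vocabulary ranked by frequency (1 = most frequent),
# standing in for imdb.get_word_index()
toy_word_index = {"the": 1, "film": 2, "brilliant": 3}

start_char, oov_char, index_from = 1, 2, 3

# Shift every word rank by index_from so ranks line up with the encoded data
inverted = {rank + index_from: word for word, rank in toy_word_index.items()}
inverted[start_char] = "[START]"
inverted[oov_char] = "[OOV]"
inverted[0] = "[PAD]"  # assumption: 0 is the padding value

# A made-up encoded review: two pads, start marker, three words, one OOV
encoded = [0, 0, 1, 4, 5, 2, 6]
decoded = " ".join(inverted[i] for i in encoded)
print(decoded)
```

Because of the `index_from` shift, the most frequent word is encoded as 4 rather than 1.
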
In [None]:
####################
## CREATE THE RNN ##
####################
# Define the RNN as a sequential neural network
model_lstm = Sequential()

# Add an embedding layer (Look this up)
# This converts word indices into dense vectors of fixed size
# input_dim = vocabulary size
# output_dim = size of embedding vectors
# input_length = length of input sequences
model_lstm.add(Embedding(vocab_size, 128, input_length=maxlen))

# Add an LSTM RNN layer with 64 neurons
model_lstm.add(LSTM(64))

# Add a dense layer to post-process the results of the LSTM
# layer and produce the output. Needs to have only a single 
# neuron as the output is one dimensional.
model_lstm.add(Dense(1, activation='sigmoid'))

# Use binary cross-entropy as the loss and accuracy as the metric,
# since this is a binary classification problem. Train with the
# Adam optimizer
model_lstm.compile(loss='binary_crossentropy', optimizer='adam', 
                   metrics=['accuracy'])

# Print a summary of the model
model_lstm.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 200, 128)          1280000   
                                                                 
 lstm (LSTM)                 (None, 64)                49408     
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 1329473 (5.07 MB)
Trainable params: 1329473 (5.07 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
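
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input weight matrix, a recurrent weight matrix, and a bias vector; the variable names are illustrative only.

```python
vocab_size = 10000    # input_dim of the embedding layer
embedding_dim = 128   # output_dim of the embedding layer
lstm_units = 64       # number of LSTM units

# One embedding vector of length 128 per vocabulary entry
embedding_params = vocab_size * embedding_dim

# Four gates, each with weights for the input (128) and the previous
# hidden state (64), plus one bias per unit
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# One weight per LSTM output plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the embedding layer, so `vocab_size` and the embedding dimension dominate the model size.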


In [6]:
##############
## TRAINING ##
##############
# Hold out the last 20% of the training data as a validation set
# to monitor how well the model generalizes after each epoch.
# Start with a small number of epochs to attempt to prevent
# overfitting
model_lstm.fit(X_train, y_train, validation_split=0.2, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x3147c3970>

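As a rough sketch of what `validation_split=0.2` means for the data sizes, assuming the 25,000-review IMDB training set and the Keras default `batch_size` of 32 (neither number is set explicitly in the cell above):

```python
import math

n_samples = 25000        # assumed size of the IMDB training set
validation_split = 0.2
batch_size = 32          # assumed Keras default for fit()

# Keras takes the validation samples from the END of the training arrays
split_at = int(n_samples * (1 - validation_split))
n_train = split_at
n_val = n_samples - split_at

# Number of weight updates (batches) per epoch
steps_per_epoch = math.ceil(n_train / batch_size)
print(n_train, n_val, steps_per_epoch)
```

The validation samples are not used to update the weights; they only provide the validation loss and accuracy reported after each epoch.
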
In [7]:
##############
## ACCURACY ##
##############
# Evaluate the trained model on the held-out test set
loss, accuracy = model_lstm.evaluate(X_test, y_test)
print("LSTM Accuracy:", accuracy)


LSTM Accuracy: 0.8563600182533264
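
To make the two numbers returned by `evaluate` more concrete, here is a minimal pure-Python sketch of binary cross-entropy and accuracy. It assumes a 0.5 decision threshold on the sigmoid output and ignores the small numerical clipping Keras applies; the labels and probabilities are made up for illustration.

```python
import math

def binary_crossentropy(y_true, y_prob):
    # Average of -[y*log(p) + (1-y)*log(1-p)] over all samples
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

def binary_accuracy(y_true, y_prob, threshold=0.5):
    # A probability above the threshold counts as a positive prediction
    preds = [1 if p > threshold else 0 for p in y_prob]
    return sum(int(p == y) for p, y in zip(preds, y_true)) / len(y_true)

y_true = [1, 0, 1, 0]            # made-up sentiment labels
y_prob = [0.9, 0.2, 0.4, 0.6]    # made-up sigmoid outputs

example_loss = binary_crossentropy(y_true, y_prob)
example_accuracy = binary_accuracy(y_true, y_prob)
print(example_loss, example_accuracy)
```

The loss penalizes confident wrong predictions more heavily than accuracy does, so the two metrics can move in different directions during tuning.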


## Assignment
0. Go through the above code and add more comments to more thoroughly explain what everything does, how everything works, and why it is there.
1. What is natural language processing? What is the goal and where is it used?
2. Research the IMDB data set and describe what is present in the X data, what is represented by the y data, and what the numerical encodings mean. How did TensorFlow process the data set from movie reviews to numbers?
3. Perform hyperparameter tuning on the above network to maximize your accuracy. Play around with the parameters of the embedding layer. Adjust the number of LSTM layers or dense layers and any numeric values. Try different activation and loss functions. You can also try formatting the data set using one-hot encoding. What is your highest accuracy and what model gave you that accuracy?
4. Redo the above network with GRU units instead of LSTM units. Add this network below so as not to alter your work above. Which type of RNN layer performs better?