# HW9.1 Sentiment classification with LSTMs and MLPs

In this homework we will work with the tf.keras framework. To simplify things, we will use the Keras built-in dataset of IMDB movie reviews. To use the Colab GPUs, run the whole exercise in Colab, then download the notebook to your laptop and push it to the GitHub repo under your own branch, in a folder named HW9.1.

## Run Notebook on bi-LSTM movie review classification in Colab

We will get started with the code example found in the Keras documentation. Head to this page, and then open the notebook in Colab.

https://keras.io/examples/nlp/bidirectional_lstm_imdb/

### Bidirectional LSTM on IMDB

**Author:** [fchollet](https://twitter.com/fchollet)<br>
**Date created:** 2020/05/03<br>
**Last modified:** 2020/05/03<br>
**Description:** Train a 2-layer bidirectional LSTM on the IMDB movie review sentiment classification dataset.

#### Setup

In [None]:
import numpy as np
import tensorflow
from tensorflow import keras
from tensorflow.keras import layers

max_features = 20000  # Only consider the top 20k words
# vocabulary size = |V| = 20000
maxlen = 200  # Keep at most 200 words of each movie review
# shorter than 200: padded with 0s (at the front, by pad_sequences' defaults)
# longer than 200: truncated (from the front, so the last 200 words are kept)

# Set random seed to make the results replicable
# For Python, NumPy, and TensorFlow
keras.utils.set_random_seed(297)
# https://keras.io/examples/keras_recipes/reproducibility_recipes/
# tensorflow.config.experimental.enable_op_determinism()
# some models can not reproduce the same output
# takes too much time to run

#### Build the model

In [None]:
# Input for variable-length sequences of integers
inputs = keras.Input(shape=(None,), dtype="int32") # input layer, input sequences can have variable lengths
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs) # embedding layer e.g., (25000, 200, 128), flexible batch size and sequence length
# Add 2 bidirectional LSTMs
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x) # each forward and backward pass consists of an LSTM layer with 64 units
x = layers.Bidirectional(layers.LSTM(64))(x) # without "return_sequences=True": produce a single output for the entire sequence
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x) # a dense layer with one unit and a sigmoid activation function: binary classifier
model = keras.Model(inputs, outputs)
model.summary()
# https://towardsdatascience.com/counting-no-of-parameters-in-deep-learning-models-by-hand-8f1716241889

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 128)         2560000   
                                                                 
 bidirectional (Bidirection  (None, None, 128)         98816     
 al)                                                             
                                                                 
 bidirectional_1 (Bidirecti  (None, 128)               98816     
 onal)                                                           
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 2757761 (10.52 MB)
Trainable params: 2757761 (1

#### Load the IMDB movie review sentiment data

In [None]:
(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(
    num_words=max_features
)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
x_train = keras.utils.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.utils.pad_sequences(x_val, maxlen=maxlen)


25000 Training sequences
25000 Validation sequences


#### Train and evaluate the model

You can use the trained model hosted on [Hugging Face Hub](https://huggingface.co/keras-io/bidirectional-lstm-imdb) and try the demo on [Hugging Face Spaces](https://huggingface.co/spaces/keras-io/bidirectional_lstm_imdb).

In [None]:
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val))
# each epoch starts with the model's parameters as they were at the end of the previous epoch

Epoch 1/2
Epoch 2/2


<keras.src.callbacks.History at 0x7ea468668160>

### Answers

Run through all the cells and try to understand what each cell is doing. Especially pay attention to the output of `model.summary()`. Look at the training data. How are they represented?

Record the parameter count of this model. Also record the final accuracy of the model.

The output of `model.summary()` lists each layer's type, its output shape, and its number of parameters.  
Param # for Embedding = `vocab_size × embedding_dim` = `max_features × 128` = `20000 × 128` = `2560000`  
Param # for the 1st Bidirectional = `num_direction × num_FFNNs_per_unit × [(num_units + input_size) × num_units + num_units]` = `2 × 4 × [(64 + 128) × 64 + 64]` = `98816`  
Param # for the 2nd Bidirectional = `num_direction × num_FFNNs_per_unit × [(num_units + input_size) × num_units + num_units]` = `2 × 4 × [(64 + 128) × 64 + 64]` = `98816` (the input size is again 128, because the 1st Bidirectional concatenates its 64 forward and 64 backward outputs)  
Param # for Dense = `(prev_layer_output_size + 1) × units` = `(128 + 1) × 1` = `129`
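The counts above can be double-checked with a few lines of plain Python. This is only a sketch of the arithmetic: the helper names (`embedding_params`, `bilstm_params`, `dense_params`) are made up for illustration and are not Keras functions.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding_dim-sized vector per vocabulary entry
    return vocab_size * embedding_dim

def bilstm_params(input_size, units):
    # 4 gate networks per LSTM, each with input weights, recurrent weights, and a bias
    one_direction = 4 * ((units + input_size) * units + units)
    # The bidirectional wrapper holds one forward and one backward LSTM
    return 2 * one_direction

def dense_params(input_size, units):
    # Weight matrix plus one bias per output unit
    return (input_size + 1) * units

counts = {
    "embedding": embedding_params(20000, 128),
    "bidirectional": bilstm_params(128, 64),
    # Input is the concatenated forward/backward output of the previous layer
    "bidirectional_1": bilstm_params(2 * 64, 64),
    "dense": dense_params(2 * 64, 1),
}
total = sum(counts.values())

for name, count in counts.items():
    print(f"{name}: {count}")
print("total:", total)
```

The total should agree with the `Total params` line printed by `model.summary()`.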

In [None]:
x_train.shape

(25000, 200)

The training data are represented as a 25000 × 200 integer matrix: 25000 is the number of sequences in the training set, and 200 is the number of words kept per movie review. Each entry is a word index (an integer below 20000), with 0 used for padding. Given a sequence of words, an LSTM processes one word at a time. The embedding layer maps each word index in the vocabulary (size = 20000) to a vector of size 128, and these word vectors are fed sequentially into the LSTM.
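To make the fixed-length representation concrete, here is a minimal pure-Python sketch of what `keras.utils.pad_sequences` does to a single review under its default settings (`padding="pre"`, `truncating="pre"`). The function `pad_sequence` and the word indices are hypothetical and only illustrate the behaviour; this is not the library's implementation.

```python
def pad_sequence(seq, maxlen, value=0):
    """Pad or truncate one list of word indices to exactly maxlen entries."""
    if len(seq) >= maxlen:
        # "pre" truncation: drop words from the front, keep the last maxlen
        return seq[len(seq) - maxlen:]
    # "pre" padding: fill the front with the padding value
    return [value] * (maxlen - len(seq)) + seq

short_review = [14, 22, 16]          # hypothetical word indices
long_review = [1, 2, 3, 4, 5, 6, 7]  # hypothetical word indices

padded_short = pad_sequence(short_review, 5)
padded_long = pad_sequence(long_review, 5)
print(padded_short)  # [0, 0, 14, 22, 16]
print(padded_long)   # [3, 4, 5, 6, 7]
```

With `maxlen=200`, every review ends up as a row of exactly 200 integers, which is why `x_train.shape` is `(25000, 200)`.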

The parameter count of this model is 2757761.             
The final accuracy (val_accuracy of the last epoch) of the model is 0.8648.

## Task 1: hyperparameters

Try changing the batch size to 16 and 64. How does the accuracy change?

In [None]:
# Reinitialize the model
# Input for variable-length sequences of integers
inputs = keras.Input(shape=(None,), dtype="int32") # input layer
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs) # embedding layer
# Add 2 bidirectional LSTMs
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x) # each forward and backward pass consists of an LSTM layer with 64 units
x = layers.Bidirectional(layers.LSTM(64))(x) # without "return_sequences=True": produce a single output for the entire sequence
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x) # a dense layer with one unit and a sigmoid activation function: binary classifier

model = keras.Model(inputs, outputs)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=16, epochs=2, validation_data=(x_val, y_val))

Epoch 1/2
Epoch 2/2


<keras.src.callbacks.History at 0x7ea3d65842b0>

When batch_size=16, the final accuracy is 0.8479, lower than that of the model trained with a batch size of 32. It is also lower than the validation accuracy after the first epoch, which suggests the model is starting to overfit.

In [None]:
# Input for variable-length sequences of integers
inputs = keras.Input(shape=(None,), dtype="int32") # input layer
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs) # embedding layer
# Add 2 bidirectional LSTMs
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x) # each forward and backward pass consists of an LSTM layer with 64 units
x = layers.Bidirectional(layers.LSTM(64))(x) # without "return_sequences=True": produce a single output for the entire sequence
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x) # a dense layer with one unit and a sigmoid activation function: binary classifier

model = keras.Model(inputs, outputs)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=64, epochs=2, validation_data=(x_val, y_val))

Epoch 1/2
Epoch 2/2


<keras.src.callbacks.History at 0x7ea3f051e8f0>

When batch_size=64, the final accuracy is 0.8669, slightly higher than that of the model trained with a batch size of 32. However, the validation accuracy of the 2nd epoch is still lower than that of the 1st epoch, which again shows signs of overfitting.
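One concrete difference between the three runs is the number of weight updates per epoch, since the optimizer takes one step per batch. The sketch below only does the arithmetic for the 25000 training sequences; it may partly explain the differences, but it does not establish them.

```python
import math

num_train = 25000  # number of training sequences reported above

# One optimizer step per batch; the last batch may be smaller than the rest
steps = {bs: math.ceil(num_train / bs) for bs in (16, 32, 64)}

for batch_size, steps_per_epoch in steps.items():
    print(f"batch_size={batch_size}: {steps_per_epoch} updates per epoch")
```

A batch size of 16 therefore performs roughly four times as many updates per epoch as a batch size of 64, so it can reach the overfitting regime in fewer epochs.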

While holding the batch size at 32, experiment with training the model longer with a few more epochs. How does the final accuracy change?

In [None]:
# Input for variable-length sequences of integers
inputs = keras.Input(shape=(None,), dtype="int32") # input layer
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs) # embedding layer
# Add 2 bidirectional LSTMs
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x) # each forward and backward pass consists of an LSTM layer with 64 units
x = layers.Bidirectional(layers.LSTM(64))(x) # without "return_sequences=True": produce a single output for the entire sequence
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x) # a dense layer with one unit and a sigmoid activation function: binary classifier

model = keras.Model(inputs, outputs)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=5, validation_data=(x_val, y_val))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7ea3ded53e20>

As the number of epochs increases, the training accuracy rises monotonically because the model has more opportunities to fit the training data, while the validation accuracy fluctuates with a downward trend. After the last epoch, the training accuracy is almost 0.97 and the validation accuracy is 0.8517. This gap indicates that the model is likely overfitting. Early stopping should be adopted: here, training could stop after the first epoch, which gives the highest validation accuracy (0.8711).

Record the best accuracy and the settings you obtained them with. Use these settings going forward.

Compare the validation accuracy of the last epoch:    

*   batch_size=32, epochs=2: 0.8648
*   batch_size=16, epochs=2: 0.8479
*   batch_size=64, epochs=2: 0.8669
*   batch_size=32, epochs=5: 0.8517    

So, for the following tasks, the settings should be batch_size=64 and epochs=2.
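The same selection can be written out in code. This sketch simply takes the maximum over the four final-epoch validation accuracies recorded above; the dictionary layout is an illustrative choice.

```python
# Final-epoch validation accuracies recorded above, keyed by (batch_size, epochs)
results = {
    (32, 2): 0.8648,
    (16, 2): 0.8479,
    (64, 2): 0.8669,
    (32, 5): 0.8517,
}

# Pick the setting with the highest final validation accuracy
best_setting = max(results, key=results.get)
best_batch_size, best_epochs = best_setting
print(f"batch_size={best_batch_size}, epochs={best_epochs}: {results[best_setting]}")
```

Note that the gap between the top two settings is about 0.002, so a different random seed could plausibly change the ranking.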




## Task 2: modify LSTM architecture

The original model has two layers of bi-directional LSTMs. Change it to only one and train again. How does the accuracy change? Write down any observations.

In [None]:
# Input for variable-length sequences of integers
inputs = keras.Input(shape=(None,), dtype="int32")
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs)
# Add 1 bidirectional LSTM
x = layers.Bidirectional(layers.LSTM(64))(x) # binary classification - produce output only at the final time step, and not the full sequence
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.summary()

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_4 (Embedding)     (None, None, 128)         2560000   
                                                                 
 bidirectional_8 (Bidirecti  (None, 128)               98816     
 onal)                                                           
                                                                 
 dense_4 (Dense)             (None, 1)                 129       
                                                                 
Total params: 2658945 (10.14 MB)
Trainable params: 2658945 (10.14 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [None]:
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=64, epochs=2, validation_data=(x_val, y_val))

Epoch 1/2
Epoch 2/2


<keras.src.callbacks.History at 0x7ea3dae874f0>

Compared to the LSTM model with two bi-directional layers, this model has fewer layers, so the training time is slightly shorter. The training accuracies are also slightly lower than those of the original model: the model is simpler, which makes it harder to fit the training data. However, the validation accuracies are slightly higher, although the differences are small, which suggests that the model with one bi-directional layer may generalize better.

## Task 3: use MLP (dense layers)

Now let's swap out the bi-LSTM layers for two dense layers with 64 hidden units each (refer to the Keras Functional API documentation for how to add a dense layer). Try to compile the model and look at the summary. How is it different from the LSTM model? How does the parameter count differ?

Now try to train the MLP model. Does it run?

If it doesn't run, can you explain why by looking at the comparison between the two model summary outputs?

Hint: in addition to adding two dense layers, you might need to also change the input dimension specification to maxlen, and do a Flatten operation between the embedding layer and the dense layer.

Try making changes to the MLP network so that you can train a model with it too.

In [None]:
# Input for variable-length sequences of integers
inputs = keras.Input(shape=(None,), dtype="int32")
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs)
# Add 2 dense layers with 64 hidden units
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.summary()

Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_6 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_5 (Embedding)     (None, None, 128)         2560000   
                                                                 
 dense_5 (Dense)             (None, None, 64)          8256      
                                                                 
 dense_6 (Dense)             (None, None, 64)          4160      
                                                                 
 dense_7 (Dense)             (None, None, 1)           65        
                                                                 
Total params: 2572481 (9.81 MB)
Trainable params: 2572481 (9.81 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


The MLP model differs from the LSTM model in that two Dense layers replace the two Bidirectional layers. The output shapes also differ: every Dense layer here keeps the sequence axis, so the final output is 3D, `(None, None, 1)`, while the last two layers of the LSTM model are 2D. The total parameter count drops from 2757761 to 2572481.  
Param # for the 1st Dense = `(num_input_units + 1) × num_output_units` = `(128 + 1) × 64` = `8256`  
Param # for the 2nd Dense = `(num_input_units + 1) × num_output_units` = `(64 + 1) × 64` = `4160`  


In [None]:
# model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=64, epochs=2, validation_data=(x_val, y_val))

It does not run.  
ValueError: `logits` and `labels` must have the same shape, received ((None, 200, 1) vs (None,)).

It does not run because the output shape does not match the labels. A Dense layer acts only on the last axis, so without flattening it is applied to each word position separately and the model emits one prediction per word, `(None, 200, 1)`, rather than one per review. Unlike an LSTM, an MLP does not step through the sequence one word at a time; it needs the whole review as a single fixed-size vector. The input layer therefore needs a specified length of `200`, and the embedded vectors need to be flattened into a one-dimensional vector of size `128` × `200` = `25600`, which is then processed by the dense layers.
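The shape mismatch can be traced by hand. The following sketch models only the shape rules of `Dense` (transform the last axis) and `Flatten` (merge all non-batch axes); `dense_shape` and `flatten_shape` are hypothetical helpers written for this illustration, not Keras APIs.

```python
def dense_shape(shape, units):
    # Dense only transforms the last axis; all leading axes pass through
    return shape[:-1] + (units,)

def flatten_shape(shape):
    # Flatten keeps the batch axis and multiplies the remaining axes together
    size = 1
    for dim in shape[1:]:
        size *= dim
    return (shape[0], size)

# (batch, maxlen, embedding_dim); None stands for an arbitrary batch size
embedded = (None, 200, 128)

# Without Flatten: the sequence axis survives all the way to the output
no_flatten = dense_shape(dense_shape(dense_shape(embedded, 64), 64), 1)

# With Flatten: one vector per review, so one prediction per review
flat = flatten_shape(embedded)
with_flatten = dense_shape(dense_shape(dense_shape(flat, 64), 64), 1)

print(no_flatten)    # (None, 200, 1)
print(flat)          # (None, 25600)
print(with_flatten)  # (None, 1)
```

Only the flattened variant yields a single prediction per review, which is what the per-review labels require. Flattening also needs a known sequence length, which is why the input shape has to change from `(None,)` to `(maxlen,)`.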

In [None]:
# Input for fixed-length sequences of integers
inputs = keras.Input(shape=(maxlen,), dtype="int32") # change the input dimension specification to "maxlen"
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs)
# Flatten operation between the embedding layer and the dense layer
x = layers.Flatten()(x) # reshapes the multidimensional input data into a one-dimensional array (vector) without modifying the actual data
# Add 2 dense layers with 64 hidden units
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.summary()

Model: "model_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_7 (InputLayer)        [(None, 200)]             0         
                                                                 
 embedding_6 (Embedding)     (None, 200, 128)          2560000   
                                                                 
 flatten (Flatten)           (None, 25600)             0         
                                                                 
 dense_8 (Dense)             (None, 64)                1638464   
                                                                 
 dense_9 (Dense)             (None, 64)                4160      
                                                                 
 dense_10 (Dense)            (None, 1)                 65        
                                                                 
Total params: 4202689 (16.03 MB)
Trainable params: 4202689 

Param # for the 1st Dense = `(num_input_units + 1) × num_output_units` = `(25600 + 1) × 64` = `1638464`    
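As a quick check, the totals of both MLP variants can be recomputed from the same Dense formula. This is a sketch of the arithmetic only; `dense_params` is an illustrative helper, and the layer sizes are taken from the two summaries above.

```python
def dense_params(input_size, units):
    # Weight matrix plus one bias per output unit
    return (input_size + 1) * units

embedding = 20000 * 128

# First attempt: Dense applied per word position (input size 128)
per_token_mlp = (
    embedding
    + dense_params(128, 64)
    + dense_params(64, 64)
    + dense_params(64, 1)
)

# Working version: Dense applied to the flattened 200 x 128 review vector
first_dense = dense_params(200 * 128, 64)
flattened_mlp = (
    embedding
    + first_dense
    + dense_params(64, 64)
    + dense_params(64, 1)
)

print("per-token MLP:", per_token_mlp)
print("flattened MLP:", flattened_mlp)
print("share of first Dense layer:", round(first_dense / flattened_mlp, 3))
```

Almost all of the increase comes from the first Dense layer, whose weight matrix grows with `maxlen`.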

In [None]:
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=64, epochs=2, validation_data=(x_val, y_val))

Epoch 1/2
Epoch 2/2


<keras.src.callbacks.History at 0x7ea3d635c280>

## Wrap up

Once you get the MLP training running, play around with it to get the best accuracy. Report the final accuracy and compare it with the previous models.

In [None]:
# Input for fixed-length sequences of integers
inputs = keras.Input(shape=(maxlen,), dtype="int32") # change the input dimension specification to "maxlen"
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs)
# Flatten operation between the embedding layer and the dense layer
x = layers.Flatten()(x) # reshapes the multidimensional input data into a one-dimensional array (vector) without modifying the actual data
# Add 2 dense layers with 64 hidden units
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)

callback = keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3) # stop the training when there is no improvement in the val_accuracy for three consecutive epochs
model = keras.Model(inputs, outputs)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=16, epochs=10, validation_data=(x_val, y_val), callbacks=[callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


<keras.src.callbacks.History at 0x7ea3dffff9d0>

In [None]:
# Input for fixed-length sequences of integers
inputs = keras.Input(shape=(maxlen,), dtype="int32") # change the input dimension specification to "maxlen"
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs)
# Flatten operation between the embedding layer and the dense layer
x = layers.Flatten()(x) # reshapes the multidimensional input data into a one-dimensional array (vector) without modifying the actual data
# Add 2 dense layers with 64 hidden units
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)

callback = keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3) # stop the training when there is no improvement in the val_accuracy for three consecutive epochs
model = keras.Model(inputs, outputs)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=10, validation_data=(x_val, y_val), callbacks=[callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


<keras.src.callbacks.History at 0x7ea3df8cdd20>

In [None]:
# Input for fixed-length sequences of integers
inputs = keras.Input(shape=(maxlen,), dtype="int32") # change the input dimension specification to "maxlen"
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs)
# Flatten operation between the embedding layer and the dense layer
x = layers.Flatten()(x) # reshapes the multidimensional input data into a one-dimensional array (vector) without modifying the actual data
# Add 2 dense layers with 64 hidden units
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)

callback = keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3) # stop the training when there is no improvement in the val_accuracy for three consecutive epochs
model = keras.Model(inputs, outputs)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=64, epochs=10, validation_data=(x_val, y_val), callbacks=[callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


<keras.src.callbacks.History at 0x7ea3dffc1e10>

In [None]:
# Input for fixed-length sequences of integers
inputs = keras.Input(shape=(maxlen,), dtype="int32") # change the input dimension specification to "maxlen"
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs)
# Flatten operation between the embedding layer and the dense layer
x = layers.Flatten()(x) # reshapes the multidimensional input data into a one-dimensional array (vector) without modifying the actual data
# Add 2 dense layers with 64 hidden units
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=16, epochs=1, validation_data=(x_val, y_val))



<keras.src.callbacks.History at 0x7ea3d6e85930>

In [None]:
# Input for fixed-length sequences of integers
inputs = keras.Input(shape=(maxlen,), dtype="int32") # change the input dimension specification to "maxlen"
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs)
# Flatten operation between the embedding layer and the dense layer
x = layers.Flatten()(x) # reshapes the multidimensional input data into a one-dimensional array (vector) without modifying the actual data
# Add 2 dense layers with 64 hidden units
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=1, validation_data=(x_val, y_val))



<keras.src.callbacks.History at 0x7ea3bd55add0>

In [None]:
# Input for fixed-length sequences of integers
inputs = keras.Input(shape=(maxlen,), dtype="int32") # change the input dimension specification to "maxlen"
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs)
# Flatten operation between the embedding layer and the dense layer
x = layers.Flatten()(x) # reshapes the multidimensional input data into a one-dimensional array (vector) without modifying the actual data
# Add 2 dense layers with 64 hidden units
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=64, epochs=1, validation_data=(x_val, y_val))



<keras.src.callbacks.History at 0x7ea3dd1dcfd0>

To find the best accuracy, the first three models use batch sizes of 16, 32, and 64 respectively, with up to 10 epochs. An early stopping criterion stops the training when `val_accuracy` has not improved for three consecutive epochs, to avoid overfitting. All three models stop training after four epochs, and all achieve their highest validation accuracy after the first epoch. The next three models therefore use batch sizes of 16, 32, and 64 respectively, with only 1 epoch. The model with a batch size of 32 achieves the highest final validation accuracy of 0.8726.
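The interaction between `patience=3` and a best epoch of 1 can be illustrated with a small pure-Python sketch of the patience rule. The function `early_stopping_epoch` is a hypothetical stand-in written for this note, not the Keras callback, and the accuracy values are invented purely to show the mechanism; they are not results from these runs.

```python
def early_stopping_epoch(val_accuracies, patience):
    """Return (epoch at which training stops, best epoch), both 1-indexed."""
    best = float("-inf")
    best_epoch = 0
    wait = 0  # consecutive epochs without improvement
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_accuracies), best_epoch

# Hypothetical values (NOT measured): validation accuracy peaks at epoch 1
hypothetical_val_acc = [0.87, 0.86, 0.86, 0.85, 0.85, 0.84, 0.84, 0.83, 0.83, 0.82]

stopped_at, best_epoch = early_stopping_epoch(hypothetical_val_acc, patience=3)
print(stopped_at, best_epoch)  # 4 1
```

When the best epoch is the first one, three epochs without improvement end training after epoch 4, matching the `Epoch 4/10` cutoffs above. By default the Keras `EarlyStopping` callback leaves the model with the weights of the last epoch it ran (unless `restore_best_weights=True` is passed), which is one reason to retrain with a single epoch as done here.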

This best MLP model achieves a higher final validation accuracy than the best LSTM, which is the one with one bi-directional layer (batch_size=64, epochs=2). However, with further hyperparameter tuning the LSTM may achieve better results: the MLP has 4202689 parameters, far more than the LSTM model, so it is even more likely to overfit. It is also worth noting that the accuracies do not differ much across batch sizes and epoch counts, so more data or simpler models may be needed to address the overfitting.