[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/francisco-ortin/data-science-course/blob/main/deep-learning/rnn/sentiment.ipynb)
[![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

# Sentiment classification with RNNs

In this notebook, we implement a sentiment analysis task to classify movie reviews as positive or negative. By analyzing the text written by the users, we predict the sentiment of each review. We use the [IMDb dataset](https://keras.io/api/datasets/imdb/), a set of 50,000 movie reviews from the [Internet Movie Database](https://en.wikipedia.org/wiki/IMDb) (IMDb).

<img src="img/imdb.png" width="400"/>

In [17]:
# make sure the required packages are installed
%pip install pandas numpy seaborn matplotlib scikit-learn keras tensorflow tensorflow-hub --quiet
# if running in colab, install the required packages and copy the necessary files
directory='data-science-course/deep-learning/rnn'
if get_ipython().__class__.__module__.startswith('google.colab'):
    !git clone --depth 1 https://github.com/francisco-ortin/data-science-course.git  2>/dev/null
    !cp --update {directory}/*.py .
    !mkdir -p img data
    !cp {directory}/data/* data/.
    !cp {directory}/img/* img/.

import os
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf
import tensorflow_hub as hub
import zipfile

Note: you may need to restart the kernel to use updated packages.


## Load the IMDb dataset

We use the 5,000 most frequent words in the dataset and cut the reviews to a maximum length of 80 words to speed up training. We use embeddings of 1024 dimensions and a large maximum number of epochs (50), because early stopping ends the training once the model stops improving. We only consider 1,000 reviews of the IMDb dataset for training because of memory restrictions (lower this number if you have memory issues).

In [18]:
# Keep only this number of most frequent words. The higher, the more memory is needed.
vocabulary_size = 5_000
# Maximum length of the reviews (cutting the reviews speeds up the training)
max_review_length = 80
# Max number of epochs to train the models (we use early stopping)
n_epochs = 50
# Embedding dimensions (ELMo embeddings have 1024 dimensions)
embedding_dim = 1024
# Max number of sentences for training. Reduce this number if you do not have enough memory.
max_sentences_train = 1_000

We load the dataset from the Keras API. Half of the reviews are used for training, and the other half for validation and testing.

In [19]:
(X_train, y_train), (X_half, y_half) = keras.datasets.imdb.load_data(num_words=vocabulary_size)
X_train, y_train = X_train[:max_sentences_train], y_train[:max_sentences_train]
X_half, y_half = X_half[:max_sentences_train], y_half[:max_sentences_train]
# get validation set as half of the test set
X_test, X_val = X_half[:len(X_half) // 2], X_half[len(X_half) // 2:]
y_test, y_val = y_half[:len(y_half) // 2], y_half[len(y_half) // 2:]
X_half, y_half = None, None  # free memory

print(f"Training sequences: {len(X_train):,}.\nValidation sequences: {len(X_val):,}.\nTesting sequences: {len(X_test):,}.")

Training sequences: 1,000.
Validation sequences: 500.
Testing sequences: 500.


## Process the dataset

We store all the word indexes returned by the IMDb dataset in a dictionary. An `index_to_word` dictionary is created to convert the token IDs back to words. We reserve the first four indices for special tokens `<PAD>`, `<START>`, `<OOV>`, and `<END>`.

In [20]:
# Let's print some reviews. We need to convert the integers (token ids) back to words.
word_to_index = {word: index+3 for word, index in keras.datasets.imdb.get_word_index().items()}  # word -> integer dictionary
# The IMDB dataset reserves the 4 first indices for special tokens <PAD>, <START>, <OOV>, <END>
index_to_word = {value: key for key, value in word_to_index.items()}  # integer -> word dictionary
index_to_word[0] = "<PAD>"
index_to_word[1] = "<START>"
index_to_word[2] = "<OOV>"
index_to_word[3] = "<END>"
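
The effect of the `+3` shift can be seen on a toy vocabulary. The following is only a sketch: `toy_word_index` and its ranks are made up for illustration and stand in for the dictionary returned by `get_word_index()`; `toy_decode` is a hypothetical helper that mirrors the decoding logic used in this notebook.

```python
# Toy stand-in for keras.datasets.imdb.get_word_index() (hypothetical words and ranks).
toy_word_index = {"the": 1, "and": 2, "film": 3}

# Shift every rank by 3 so that ids 0-3 stay free for the special tokens.
toy_word_to_index = {word: rank + 3 for word, rank in toy_word_index.items()}
toy_index_to_word = {index: word for word, index in toy_word_to_index.items()}
toy_index_to_word.update({0: "<PAD>", 1: "<START>", 2: "<OOV>", 3: "<END>"})


def toy_decode(token_ids: list[int]) -> str:
    """Map token ids back to words; ids missing from the dictionary become <OOV>."""
    return " ".join(toy_index_to_word.get(token_id, "<OOV>") for token_id in token_ids)


print(toy_decode([1, 4, 6, 2, 0]))  # <START> the film <OOV> <PAD>
```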

We show the first reviews and their corresponding sentiment. 

In [21]:
def decode_review(encoded_review: list[int]) -> str:
    """Decode a review from a list of integers to a string."""
    return ' '.join(index_to_word.get(word_index, "<OOV>") for word_index in encoded_review)

print("First reviews in training set, with the corresponding labels:")
for (i, (review, label)) in enumerate(zip(X_train[:5], y_train[:5])):
    print(f"Review {i + 1}: {decode_review(review)}.\nLabel: {label}.")

First reviews in training set, with the corresponding labels:
Review 1: <START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <OOV> is an amazing actor and now the same being director <OOV> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <OOV> and would recommend it to everyone to watch and the fly <OOV> was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <OOV> to the two little <OOV> that played the <OOV> of norman and paul they were just brilliant children are often left out of the <OOV> list i think because the stars that play them all grown up are such a big <OOV> for the who

We pad the reviews so that they all have the same length. We use the `post` mode to pad and truncate at the end of the reviews.

In [22]:
# pad the reviews to have the same length (padding and truncating at the end with "post")
X_train = keras.preprocessing.sequence.pad_sequences(X_train, maxlen=max_review_length, padding="post", truncating="post")
X_val = keras.preprocessing.sequence.pad_sequences(X_val, maxlen=max_review_length, padding="post", truncating="post")
X_test = keras.preprocessing.sequence.pad_sequences(X_test, maxlen=max_review_length, padding="post", truncating="post")
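
To make the `post` behavior concrete, here is a minimal pure-Python sketch of what padding and truncating at the end does. `pad_post` is a hypothetical helper written for illustration, not the Keras implementation, and the token ids are made up; it assumes the padding value is 0, the id reserved for `<PAD>`.

```python
def pad_post(sequences: list[list[int]], maxlen: int, value: int = 0) -> list[list[int]]:
    """Truncate and pad every sequence at the end so that all have `maxlen` items."""
    padded = []
    for sequence in sequences:
        truncated = sequence[:maxlen]                   # truncating="post": drop tokens from the end
        padding = [value] * (maxlen - len(truncated))   # padding="post": append the pad value at the end
        padded.append(truncated + padding)
    return padded


# One review longer than maxlen and one shorter (hypothetical token ids)
print(pad_post([[1, 14, 22, 16, 43], [1, 194]], maxlen=4))
# [[1, 14, 22, 16], [1, 194, 0, 0]]
```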

## One-directional LSTM model

We create a one-directional LSTM RNN whose first layer computes the embeddings with the Keras `Embedding` layer. We have to choose the size of the embedding vectors (a hyperparameter).

In [23]:
# variable length input integer sequences
inputs = keras.Input(shape=(None,), dtype="int32")
# Embed each integer to `embedding_dim` dimensional vector space
x = layers.Embedding(vocabulary_size, embedding_dim)(inputs)
# Add 1 LSTM layer
x = layers.LSTM(64)(x)
# Add a classifier (sigmoid activation function for binary classification)
outputs = layers.Dense(1, activation="sigmoid")(x)
one_directional_model = keras.Model(inputs, outputs)
one_directional_model.summary()

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_2 (Embedding)     (None, None, 1024)        5120000   
                                                                 
 lstm_6 (LSTM)               (None, 64)                278784    
                                                                 
 dense_4 (Dense)             (None, 1)                 65        
                                                                 
Total params: 5398849 (20.59 MB)
Trainable params: 5398849 (20.59 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


**Important**: The previous model has a huge number of trainable parameters (5.4M) due to the embedding layer (even though the embedding and the vocabulary sizes are not very large). It consumes lots of memory, takes a long time to train, and it requires lots of data.
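
The figures in the summary can be reproduced by hand. The sketch below uses the standard parameter-count formulas for embedding, LSTM, and dense layers; the three helper functions are hypothetical names introduced only for this illustration.

```python
def embedding_params(vocabulary_size: int, embedding_dim: int) -> int:
    """One trainable vector of `embedding_dim` values per word in the vocabulary."""
    return vocabulary_size * embedding_dim


def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with an input kernel, a recurrent kernel and a bias vector."""
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim: int, units: int) -> int:
    """One weight per (input, unit) pair plus one bias per unit."""
    return input_dim * units + units


embedding = embedding_params(5_000, 1024)
lstm = lstm_params(1024, 64)
dense = dense_params(64, 1)
print(embedding, lstm, dense, embedding + lstm + dense)
# 5120000 278784 65 5398849
```

The embedding layer alone accounts for about 95% of the parameters, which is why its size dominates the memory and data requirements.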

We compile, train, and evaluate the model. We use the Adam optimizer and the binary cross-entropy loss function. We use early stopping to avoid overfitting.

The function returns the trained model so that we can reuse it later in the notebook without retraining it.
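
As a rough illustration of how early stopping with a patience of 2 decides when to stop, consider the following pure-Python sketch. It is a simplification, not the Keras callback: `early_stopping` is a hypothetical helper, the validation losses are made up, and it assumes the monitored quantity is the validation loss.

```python
def early_stopping(val_losses: list[float], patience: int = 2) -> tuple[int, int]:
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses: the loss improves until epoch 2, then gets worse
stop_epoch, best_epoch = early_stopping([0.70, 0.69, 0.71, 0.72, 0.60])
print(stop_epoch, best_epoch)  # 4 2
```

In this made-up run, training stops after epoch 4 and never reaches the fifth value. Restoring the best weights means that the model keeps the weights from epoch 2 rather than those from the last epoch.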

In [24]:
def compile_train_evaluate(model_p: keras.Model, x_train_p: np.ndarray, y_train_p: np.ndarray,
                           x_val_p: np.ndarray, y_val_p: np.ndarray, x_test_p: np.ndarray, y_test_p: np.ndarray,
                           batch_size: int, epochs: int) -> tuple[float, float, keras.Model]:
    """
    Compile, train and evaluate the model.
    :param model_p: the model to compile, train and evaluate
    :param x_train_p: train X sequences
    :param y_train_p: train y labels
    :param x_val_p: validation X sequences
    :param y_val_p: validation y labels
    :param x_test_p: test X sequences
    :param y_test_p: test y labels
    :param batch_size: batch size
    :param epochs: number of epochs
    :return: (test_loss, test_accuracy, model)
    """
    # we compile and train the model 
    model_p.compile("adam", "binary_crossentropy", metrics=["accuracy"])
    early_stopping_callback = tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True)
    model_p.fit(x_train_p, y_train_p, batch_size=batch_size, epochs=epochs, validation_data=(x_val_p, y_val_p),
                callbacks=[early_stopping_callback])
    # Evaluate the model on the test set
    loss, accuracy = model_p.evaluate(x_test_p, y_test_p)
    return loss, accuracy, model_p


test_loss, test_accuracy, one_directional_model = compile_train_evaluate(one_directional_model, X_train, y_train, X_val, y_val, X_test, y_test,
                       32, n_epochs)
print(f"Test loss: {test_loss:.4f}.\nTest accuracy: {test_accuracy:.4f}.")

Epoch 1/50
Epoch 2/50
Epoch 3/50
Test loss: 0.6963.
Test accuracy: 0.5280.
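
To put these numbers in context, it helps to know the loss of a classifier that has learned nothing and outputs a probability of 0.5 for every review. The sketch below computes the binary cross-entropy for that case; `binary_crossentropy` is a hypothetical pure-Python helper and the labels are made up.

```python
import math


def binary_crossentropy(y_true: list[int], y_prob: list[float]) -> float:
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    eps = 1e-7  # keep probabilities away from 0 and 1 to avoid log(0)
    total = 0.0
    for label, prob in zip(y_true, y_prob):
        prob = min(max(prob, eps), 1 - eps)
        total -= label * math.log(prob) + (1 - label) * math.log(1 - prob)
    return total / len(y_true)


labels = [1, 0, 1, 0]  # hypothetical labels
chance_loss = binary_crossentropy(labels, [0.5] * len(labels))
print(f"{chance_loss:.4f}")  # 0.6931
```

The value is ln 2, whatever the labels are. A test loss of 0.6963 with an accuracy of 0.5280 is therefore very close to what such an uninformed classifier would obtain.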


## Bidirectional LSTM model

We create a bidirectional LSTM model. We use the same hyperparameters as in the previous model.

In [25]:
# variable length input integer sequences
inputs = keras.Input(shape=(None,), dtype="int32")
# Embed each integer to `embedding_dim` dimensional vector space
x = layers.Embedding(vocabulary_size, embedding_dim)(inputs)
# Add 1 bidirectional LSTM layer
x = layers.Bidirectional(layers.LSTM(64))(x)
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)
bi_lstm_model = keras.Model(inputs, outputs)
bi_lstm_model.summary()

test_loss, test_accuracy, bi_lstm_model = compile_train_evaluate(bi_lstm_model, X_train, y_train, X_val, y_val, X_test, y_test,
                                                  32, n_epochs)
print(f"Test loss: {test_loss:.4f}.\nTest accuracy: {test_accuracy:.4f}.")

Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_6 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_3 (Embedding)     (None, None, 1024)        5120000   
                                                                 
 bidirectional_5 (Bidirecti  (None, 128)               557568    
 onal)                                                           
                                                                 
 dense_5 (Dense)             (None, 1)                 129       
                                                                 
Total params: 5677697 (21.66 MB)
Trainable params: 5677697 (21.66 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Test loss: 0.6020.
Test accuracy: 0.6880.

Like the first model, the bi-LSTM has a huge number of trainable parameters (about 5.68M). It consumes lots of memory, takes a long time to train, and requires lots of data.

## ✨ Questions

1. Does the bidirectional LSTM model have a significantly higher number of trainable parameters than the one-directional LSTM model? 
2. Why?
3. Does the bidirectional LSTM model perform better than the one-directional LSTM model? 
4. Why? 

### Answers

*Write your answers here.*


## ELMo embeddings

In the following RNN, we use [ELMo embeddings](https://en.wikipedia.org/wiki/ELMo). ELMo embeddings are context-dependent embeddings, pretrained and ready to use. ELMo embeddings have 1024 dimensions. Models such as BERT and GPT produce more powerful embeddings, but require more memory and CPU resources. Nowadays, the most powerful models compute embeddings for subword units (obtained with tokenizers such as `SentencePiece`) rather than for words; we use word embeddings for simplicity.

In [26]:
elmo = hub.load("https://tfhub.dev/google/elmo/3")

def get_elmo_embeddings(sentences: list[str]) -> np.ndarray:
    """
    Get ELMo embeddings for a list of sentences.
    """
    # ELMo returns a dictionary of tensors; the 'elmo' entry holds one contextual embedding per token
    embeddings = elmo.signatures['default'](tf.constant(sentences))['elmo']
    return embeddings.numpy()  # Convert to numpy array for easier manipulation

We have to generate the ELMo embeddings from the reviews. We use the `get_elmo_embeddings` function to get the embeddings for the training, validation, and test sets. This might take time and consume a lot of memory (reduce the values of `max_sentences_train`, `vocabulary_size` and `max_review_length` if you get an out-of-memory error).
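
A back-of-the-envelope estimate shows why memory becomes an issue. The sketch below only counts the size of the resulting arrays, under the assumptions that every review yields 80 token vectors (the padded length used in this notebook) and that each value is a 32-bit float. The memory needed while ELMo computes the embeddings is not included; `embedding_memory_mib` is a hypothetical helper.

```python
def embedding_memory_mib(n_reviews: int, n_tokens: int, embedding_dim: int,
                         bytes_per_value: int = 4) -> float:
    """Size in MiB of an array of shape (n_reviews, n_tokens, embedding_dim)."""
    return n_reviews * n_tokens * embedding_dim * bytes_per_value / 2**20


# 1,000 training reviews; 500 validation and 500 test reviews
print(f"Training set: {embedding_memory_mib(1_000, 80, 1024):.2f} MiB")        # 312.50 MiB
print(f"Validation or test set: {embedding_memory_mib(500, 80, 1024):.2f} MiB")  # 156.25 MiB
```

Under these assumptions the size grows linearly with both the number of reviews and the review length, so halving either one halves the memory taken by the stored embeddings.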

In [27]:
X_train_elmo = get_elmo_embeddings([decode_review(review) for review in X_train])
X_val_elmo = get_elmo_embeddings([decode_review(review) for review in X_val])
X_test_elmo = get_elmo_embeddings([decode_review(review) for review in X_test])

Now, let's create a new RNN. *Notice* that we do not need an embedding layer because we use ELMo embeddings as the input. As a result, the model has fewer trainable parameters, so we can afford a more complex architecture and still end up with fewer parameters.

**Important**: the first RNN layer uses *dropout*. We also define a dropout layer after the last LSTM layer. Dropout is a regularization technique that helps to avoid overfitting. It works as follows: in each training iteration, some neurons are randomly set to zero, which forces the network to learn more robust features. Dropout is only used during training, not during inference. It is a powerful technique to avoid overfitting in deep learning, but it slows down the training process.
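
The mechanism can be sketched in a few lines of plain Python. This is an illustration, not the Keras layer: `dropout` is a hypothetical helper, and it uses the common "inverted dropout" formulation, in which the surviving values are scaled by `1 / (1 - rate)` so that the expected sum of the activations is unchanged.

```python
import random


def dropout(activations: list[float], rate: float, training: bool,
            rng: random.Random) -> list[float]:
    """Zero each activation with probability `rate` during training; do nothing at inference."""
    if not training:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)  # compensate for the dropped activations
    return [0.0 if rng.random() < rate else value * keep_scale for value in activations]


rng = random.Random(0)  # fixed seed so that the run is reproducible
activations = [1.0] * 10
print(dropout(activations, rate=0.2, training=True, rng=rng))   # mix of 0.0 and 1.25
print(dropout(activations, rate=0.2, training=False, rng=rng))  # unchanged
```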

In [28]:
inputs = keras.Input(shape=(None, embedding_dim), dtype="float32")
# Add 2 bidirectional LSTM layers (the first one returns the whole sequence and uses dropout)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True, dropout=0.2))(inputs)
x = layers.Bidirectional(layers.LSTM(64))(x)
# Dropout layer with 20% rate
x = layers.Dropout(0.2)(x)
# Add a classifier (sigmoid activation function for binary classification)
outputs = layers.Dense(1, activation="sigmoid")(x)
elmo_model = keras.Model(inputs, outputs)
elmo_model.summary()

Model: "model_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_7 (InputLayer)        [(None, None, 1024)]      0         
                                                                 
 bidirectional_6 (Bidirecti  (None, None, 128)         557568    
 onal)                                                           
                                                                 
 bidirectional_7 (Bidirecti  (None, 128)               98816     
 onal)                                                           
                                                                 
 dropout_2 (Dropout)         (None, 128)               0         
                                                                 
 dense_6 (Dense)             (None, 1)                 129       
                                                                 
Total params: 656513 (2.50 MB)
Trainable params: 656513 (2.

**Important**: This model has 656K trainable parameters, far fewer than the previous models (about 11.6% of the parameters of the previous bi-LSTM model). It consumes less memory, takes less time to train, and requires less data.
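
The parameter counts of the stacked model can be checked by hand. The sketch below uses the standard LSTM parameter formula and assumes that a bidirectional layer holds two independent LSTMs; the helper names are hypothetical, and the total of the previous model is copied from its summary.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with an input kernel, a recurrent kernel and a bias vector."""
    return 4 * (input_dim * units + units * units + units)


def bidirectional_lstm_params(input_dim: int, units: int) -> int:
    """A forward LSTM and a backward LSTM with separate weights."""
    return 2 * lstm_params(input_dim, units)


first = bidirectional_lstm_params(1024, 64)    # reads the 1024-dimensional ELMo vectors
second = bidirectional_lstm_params(2 * 64, 64)  # reads the 128 outputs of the first layer
dense = 2 * 64 * 1 + 1                          # 128 weights plus 1 bias
elmo_total = first + second + dense
print(first, second, dense, elmo_total)
# 557568 98816 129 656513

bi_lstm_total = 5_677_697  # total taken from the summary of the previous bi-LSTM model
print(f"{elmo_total / bi_lstm_total:.2%}")  # 11.56%
```

Most of the remaining parameters sit in the first layer, because its size is driven by the 1024-dimensional input.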

We compile, train, and evaluate the model.

In [33]:
test_loss, test_accuracy, elmo_model = compile_train_evaluate(elmo_model, X_train_elmo, y_train, X_val_elmo, y_val, X_test_elmo, y_test,
                                                              32, n_epochs)
print(f"Test loss: {test_loss:.4f}.\nTest accuracy: {test_accuracy:.4f}.")

Epoch 1/50
Epoch 2/50
Epoch 3/50
Test loss: 0.6726.
Test accuracy: 0.7320.


## Inference

We can see how accurate the model is by predicting the sentiment of some reviews. What follows are some example reviews. Add more reviews to test the model.

In [38]:
example_reviews = ["The movie was a great waste of time. I is awful and boring.",
                   "I loved the movie. The plot was amazing.",
                   "This movie is not worth watching.",
                   "Although the film is not a masterpiece, you may have a good time if your expectations are not high."]


review_embeddings = get_elmo_embeddings(example_reviews)
predictions = elmo_model.predict(review_embeddings)

for i, prediction in enumerate(predictions):
    print(f"Review {i + 1}: {example_reviews[i]}")
    print(f"Probability of being positive: {prediction[0]:.4f}.\n")

Review 1: The movie was a great waste of time. I is awful and boring.
Probability of being positive: 0.1262.

Review 2: I loved the movie. The plot was amazing.
Probability of being positive: 0.9213.

Review 3: This movie is not worth watching.
Probability of being positive: 0.3480.

Review 4: Although the film is not a masterpiece, you may have a good time if your expectations are not high.
Probability of being positive: 0.8530.
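
The model outputs probabilities, so a decision threshold is needed to obtain labels. The sketch below applies a threshold of 0.5 to the four probabilities printed above. The threshold is an assumption made for illustration and `to_label` is a hypothetical helper.

```python
probabilities = [0.1262, 0.9213, 0.3480, 0.8530]  # values printed above


def to_label(probability: float, threshold: float = 0.5) -> str:
    """Turn the probability of the positive class into a sentiment label."""
    return "positive" if probability > threshold else "negative"


for i, probability in enumerate(probabilities, start=1):
    print(f"Review {i}: {to_label(probability)}")
# Review 1: negative
# Review 2: positive
# Review 3: negative
# Review 4: positive
```

Raising the threshold makes the classifier more conservative about labeling a review as positive.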


## ✨ Questions

5. After testing some reviews, do you think the model is performing reasonably well? 
6. Why do you think, then, that the accuracy is not higher (it is 0.7320)? Enumerate as many reasons as you can to see if you understand how it works.


### Answers

*Write your answers here.*
