[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/francisco-ortin/data-science-course/blob/main/deep-learning/rnn/sentiment.ipynb)
[![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

# Sentiment classification with RNNs

In this notebook, we implement a sentiment analysis task to classify movie reviews as positive or negative. By analyzing the text written by the users, we predict the sentiment of each review. We use the [IMDb dataset](https://keras.io/api/datasets/imdb/), a set of 50,000 movie reviews from the [Internet Movie Database](https://en.wikipedia.org/wiki/IMDb) (IMDb).

<img src="img/imdb.png" width="400"/>

In [24]:
# make sure the required packages are installed
%pip install pandas numpy seaborn matplotlib scikit-learn keras tensorflow --quiet
# if running in colab, install the required packages and copy the necessary files
directory='data-science-course/deep-learning/rnn'
if get_ipython().__class__.__module__.startswith('google.colab'):
    !git clone https://github.com/francisco-ortin/data-science-course.git  2>/dev/null
    !cp --update {directory}/*.py .
    !mkdir -p img data
    !cp {directory}/data/* data/.
    !cp {directory}/img/* img/.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf
import os
import zipfile

Note: you may need to restart the kernel to use updated packages.


## Load the IMDb dataset

We keep at most the 400,000 most frequent words in the dataset. We cut the reviews to a maximum length of 80 words to speed up training. We use embeddings of 50 dimensions and allow up to 50 training epochs, since early stopping ends training once the validation loss stops improving.

In [25]:
# keep at most this many of the most frequent words (rarer words are replaced by <OOV>)
vocabulary_size = 400_000
# maximum length of the reviews (cutting the reviews speeds up training)
max_review_length = 80
# max number of epochs to train the models (we use early stopping)
n_epochs = 50
# Embedding dimensions
embedding_dim = 50

We load the dataset from the Keras API. Half of the reviews are used for training, and the other half for validation and testing.

In [26]:
# we use one half for training and the other half for validation and testing
(X_train, y_train), (x_half, y_half) = keras.datasets.imdb.load_data(num_words=vocabulary_size)
# get validation set as half of the test set
X_test, X_val = x_half[:len(x_half) // 2], x_half[len(x_half) // 2:]
y_test, y_val = y_half[:len(y_half) // 2], y_half[len(y_half) // 2:]

print(f"Training sequences: {len(X_train):,}.\nValidation sequences: {len(X_val):,}.\nTesting sequences: {len(X_test):,}.")

Training sequences: 25,000.
Validation sequences: 12,500.
Testing sequences: 12,500.


## Process the dataset

We store all the word indexes returned by the IMDb dataset in a dictionary. An `index_to_word` dictionary is created to convert the token IDs back to words. We reserve the first four indices for special tokens `<PAD>`, `<START>`, `<OOV>`, and `<END>`.

In [27]:
# Let's print some reviews. We need to convert the integers (token ids) back to words.
word_to_index = {word: index+3 for word, index in keras.datasets.imdb.get_word_index().items()}  # word -> integer dictionary
# The first 4 indices are reserved for the special tokens <PAD>, <START>, <OOV>, <END>
index_to_word = {value: key for key, value in word_to_index.items()}  # integer -> word dictionary
index_to_word[0] = "<PAD>"
index_to_word[1] = "<START>"
index_to_word[2] = "<OOV>"
index_to_word[3] = "<END>"
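
The offset logic can be checked on a toy vocabulary. The following is a minimal sketch, assuming a hypothetical three-word `toy_word_index` in place of the real dictionary returned by `get_word_index()`. Ranks start at 1, so after adding 3 the first real word lands on index 4 and indices 0 to 3 stay free for the special tokens.

```python
# Hypothetical toy word index (word -> frequency rank, starting at 1),
# standing in for keras.datasets.imdb.get_word_index()
toy_word_index = {"the": 1, "movie": 2, "great": 3}

# Shift every rank by 3 so that indices 0..3 are free for the special tokens
toy_word_to_index = {word: rank + 3 for word, rank in toy_word_index.items()}
toy_index_to_word = {index: word for word, index in toy_word_to_index.items()}
toy_index_to_word.update({0: "<PAD>", 1: "<START>", 2: "<OOV>", 3: "<END>"})


def decode_toy(token_ids: list[int]) -> str:
    """Map token ids back to words; unknown ids fall back to <OOV>."""
    return " ".join(toy_index_to_word.get(token_id, "<OOV>") for token_id in token_ids)


decoded = decode_toy([1, 4, 5, 6, 99, 0])
print(decoded)  # <START> the movie great <OOV> <PAD>
```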

We show the first reviews and their corresponding sentiment. 

In [28]:
def decode_review(encoded_review: list[int]) -> str:
    """Decode a review from a list of integers to a string."""
    return ' '.join(index_to_word.get(word_index, "<OOV>") for word_index in encoded_review)

print("First reviews in training set, with the corresponding labels:")
for (i, (review, label)) in enumerate(zip(X_train[:5], y_train[:5])):
    print(f"Review {i + 1}: {decode_review(review)}.\nLabel: {label}.")

First reviews in training set, with the corresponding labels:
Review 1: <START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such

We pad the reviews so that they all have the same length. We use the `post` mode to pad and truncate at the end of the reviews.

In [29]:
# pad the reviews to have the same length (padding and truncating at the end with "post")
X_train = keras.preprocessing.sequence.pad_sequences(X_train, maxlen=max_review_length, padding="post", truncating="post")
X_val = keras.preprocessing.sequence.pad_sequences(X_val, maxlen=max_review_length, padding="post", truncating="post")
X_test = keras.preprocessing.sequence.pad_sequences(X_test, maxlen=max_review_length, padding="post", truncating="post")
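
To make the effect of the `post` mode concrete, here is a minimal sketch that mimics it on two short sequences. The helper `pad_post` is a hypothetical, simplified stand-in written for illustration; it is not the Keras implementation of `pad_sequences`.

```python
import numpy as np


def pad_post(sequences: list[list[int]], maxlen: int, pad_value: int = 0) -> np.ndarray:
    """Simplified stand-in for pad_sequences(..., padding="post", truncating="post")."""
    padded = np.full((len(sequences), maxlen), pad_value, dtype="int32")
    for row, sequence in enumerate(sequences):
        kept = sequence[:maxlen]        # "post" truncation drops tokens from the end
        padded[row, :len(kept)] = kept  # "post" padding leaves the pad values at the end
    return padded


toy_batch = pad_post([[1, 14, 22], [1, 5, 6, 7, 8, 9]], maxlen=4)
print(toy_batch)
# [[ 1 14 22  0]
#  [ 1  5  6  7]]
```

With the Keras default (`"pre"` for both arguments), the zeros would be added at the start of short reviews and the first tokens of long reviews would be dropped instead.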

## One-directional LSTM model

We create a one-directional LSTM RNN where embeddings are computed in the first layer using the Keras `Embedding` layer. We have to choose the size of the embedding vectors (a hyperparameter).

In [30]:
# variable length input integer sequences
inputs = keras.Input(shape=(None,), dtype="int32")
# Embed each integer to `embedding_dim` dimensional vector space
x = layers.Embedding(vocabulary_size, embedding_dim)(inputs)
# Add 1 LSTM layer
x = layers.LSTM(64)(x)
# Add a classifier (sigmoid activation function for binary classification)
outputs = layers.Dense(1, activation="sigmoid")(x)
one_directional_model = keras.Model(inputs, outputs)
one_directional_model.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_3 (Embedding)     (None, None, 50)          20000000  
                                                                 
 lstm_4 (LSTM)               (None, 64)                29440     
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
Total params: 20029505 (76.41 MB)
Trainable params: 20029505 (76.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
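
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM formulation with four weight blocks (input, forget and output gates plus the cell candidate), each with input weights, recurrent weights and one bias vector.

```python
vocabulary_size = 400_000
embedding_dim = 50
lstm_units = 64

# Embedding: one vector of embedding_dim weights per vocabulary entry
embedding_params = vocabulary_size * embedding_dim
# LSTM: 4 weight blocks, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
# Dense(1): one weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 20000000 29440 65 20029505
```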


**Important**: The previous model has a huge number of trainable parameters (about 20M), almost all of them in the embedding layer: the embedding size is small, but the vocabulary is very large. The model consumes lots of memory, takes a long time to train, and requires lots of data.

We compile, train, and evaluate the model. We use the Adam optimizer and the binary cross-entropy loss function. We use early stopping to avoid overfitting.

We save the model to avoid retraining it every time we run the notebook.

In [31]:
def compile_train_evaluate(model_p: keras.Model, x_train_p: np.ndarray, y_train_p: np.ndarray,
                           x_val_p: np.ndarray, y_val_p: np.ndarray, x_test_p: np.ndarray, y_test_p: np.ndarray,
                           batch_size: int, epochs: int, zip_file_name: str, model_file_name: str) -> tuple[float, float, keras.Model]:
    """
    Compile, train and evaluate the model.
    :param model_p: the model to compile, train and evaluate
    :param x_train_p: train X sequences
    :param y_train_p: train y labels
    :param x_val_p: validation X sequences
    :param y_val_p: validation y labels
    :param x_test_p: test X sequences
    :param y_test_p: test y labels
    :param batch_size: batch size
    :param epochs: number of epochs
    :param zip_file_name: file name to store/load the model compressed
    :param model_file_name: name of the model file stored inside the zip file
    :return: (test_loss, test_accuracy, model)
    """
    # we compile and train the model if it does not exist (otherwise we load it)
    if not os.path.exists(zip_file_name):
        model_p.compile("adam", "binary_crossentropy", metrics=["accuracy"])
        early_stopping_callback = tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True)
        model_p.fit(x_train_p, y_train_p, batch_size=batch_size, epochs=epochs, validation_data=(x_val_p, y_val_p),
                    callbacks=[early_stopping_callback])
        # save the model
        model_p.save(model_file_name)
        # compress the model file
        with zipfile.ZipFile(zip_file_name, 'w', zipfile.ZIP_DEFLATED) as zip_file:
            zip_file.write(model_file_name, arcname=model_file_name)
        # remove the model file
        os.remove(model_file_name)
    else:
        # load the model; open the zip file and extract the model file
        with zipfile.ZipFile(zip_file_name, 'r') as zip_ref:
            zip_ref.extractall(".")
        # load the model
        model_p = keras.models.load_model(model_file_name)
        # remove the model file
        os.remove(model_file_name)
    # Evaluate the model on the test set
    loss, accuracy = model_p.evaluate(x_test_p, y_test_p)
    return loss, accuracy, model_p


test_loss, test_accuracy, one_directional_model = compile_train_evaluate(one_directional_model, X_train, y_train, X_val, y_val, X_test, y_test,
                       32, n_epochs, "data/one_directional_model.zip", 'one_directional_model.keras')
print(f"Test loss: {test_loss:.4f}.\nTest accuracy: {test_accuracy:.4f}.")

Test loss: 0.6025.
Test accuracy: 0.6826.
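
The role of `patience=2` and `restore_best_weights=True` can be illustrated with a small simulation. This is a simplified sketch of the patience logic, not the Keras implementation, and the validation losses are hypothetical values chosen for the example.

```python
def simulate_early_stopping(val_losses: list[float], patience: int) -> tuple[int, int]:
    """Return (stopped_epoch, best_epoch), both 0-based, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses, one per epoch
hypothetical_val_losses = [0.60, 0.50, 0.45, 0.47, 0.46, 0.44]
stopped_epoch, best_epoch = simulate_early_stopping(hypothetical_val_losses, patience=2)
print(stopped_epoch, best_epoch)  # 4 2
```

In this example training stops after two epochs without improvement, and restoring the best weights would roll the model back to the epoch with the lowest validation loss. The later, lower loss (0.44) is never reached, which is the price of a small patience value.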


## Bidirectional LSTM model

We create a bidirectional LSTM model. We use the same hyperparameters as in the previous model.

In [32]:
# variable length input integer sequences
inputs = keras.Input(shape=(None,), dtype="int32")
# Embed each integer to `embedding_dim` dimensional vector space
x = layers.Embedding(vocabulary_size, embedding_dim)(inputs)
# Add 1 Bidirectional-LSTM layer
x = layers.Bidirectional(layers.LSTM(64))(x)
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)
bi_lstm_model = keras.Model(inputs, outputs)
bi_lstm_model.summary()

test_loss, test_accuracy, bi_lstm_model = compile_train_evaluate(bi_lstm_model, X_train, y_train, X_val, y_val, X_test, y_test,
                                                  32, n_epochs, "data/bi_directional_model.zip", "bi_directional_model.keras")
print(f"Test loss: {test_loss:.4f}.\nTest accuracy: {test_accuracy:.4f}.")

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_4 (Embedding)     (None, None, 50)          20000000  
                                                                 
 bidirectional_3 (Bidirecti  (None, 128)               58880     
 onal)                                                           
                                                                 
 dense_4 (Dense)             (None, 1)                 129       
                                                                 
Total params: 20059009 (76.52 MB)
Trainable params: 20059009 (76.52 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Test loss: 0.5194.
Test accuracy: 0.7508.


As in the first model, the bi-LSTM has more than 20M trainable parameters. It consumes lots of memory, takes a long time to train, and it requires lots of data.

## ✨ Questions

1. Does the bidirectional LSTM model have a significantly higher number of trainable parameters than the one-directional LSTM model? 
2. Why?
3. Does the bidirectional LSTM model perform better than the one-directional LSTM model? 
4. Why? 

### Answers

*Write your answers here.*


## GloVe embeddings

In the following RNN, we use [GloVe embeddings](https://nlp.stanford.edu/projects/glove/). The GloVe file we use provides 400,000 pre-trained word vectors that can be reused in any NLP task. They are not context-dependent, but they are useful when the training data is limited. Models such as ELMo, BERT, and GPT produce context-dependent embeddings that are more powerful but require more memory and CPU resources. Nowadays, the most powerful models embed subword units (produced by tokenizers such as `SentencePiece`) rather than words; we use word embeddings for simplicity.

We use the 50-dimensional GloVe embeddings, the least computationally expensive of the available versions: GloVe embeddings are also provided with 100, 200, and 300 dimensions. The last one requires more than 1 GB.

We load the GloVe embeddings from the file `glove.6B.50d.zip` and create a dictionary with the embeddings.

In [33]:
def create_glove_embeddings_from_file(zip_file_name: str, txt_file_name: str) -> dict[str, np.ndarray]:
    """
    Create a dictionary of GloVe embeddings from a file.
    :param zip_file_name: the zip file name
    :param txt_file_name: the text file name inside the zip file
    :return: the dictionary of embeddings
    """
    glove_embeddings_loc = {}  # word -> vector(embedding_dim) mapping
    with zipfile.ZipFile(zip_file_name, 'r') as zip_file:
        with zip_file.open(txt_file_name, 'r') as file:
            # load every word vector in the file (one word per line)
            for line in file:
                values = line.split()
                # the zip member is read in binary mode, so decode the word to get str keys
                word = values[0].decode("utf-8")
                vector = np.asarray(values[1:], dtype='float32')
                glove_embeddings_loc[word] = vector
            # add embeddings for the padding, start and OOV special tokens
            glove_embeddings_loc["<PAD>"] = np.zeros(embedding_dim)
            glove_embeddings_loc["<START>"] = np.full(embedding_dim, 0.5)
            glove_embeddings_loc["<OOV>"] = np.ones(embedding_dim)
    return glove_embeddings_loc


glove_embeddings = create_glove_embeddings_from_file("data/glove.6B.50d.zip", "glove.6B.50d.txt")
print(f"Found {len(glove_embeddings):,} word embeddings of {embedding_dim} dimensions.")

Found 400,003 word embeddings of 50 dimensions.
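
Each line of the GloVe text file holds a word followed by the components of its vector. The following sketch parses two made-up lines in that layout from an in-memory binary stream (`toy_file`, a hypothetical stand-in for the zip member). It also shows why the word has to be decoded: a binary stream yields `bytes`, and `bytes` keys would never match the `str` words looked up later.

```python
import io

import numpy as np

# Two made-up lines in the GloVe text layout (word followed by its components);
# the values are illustrative, not real GloVe vectors
toy_file = io.BytesIO(b"good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")

toy_embeddings = {}
for line in toy_file:  # iterating over a binary stream yields bytes
    parts = line.decode("utf-8").split()  # decode so that the keys are str
    toy_embeddings[parts[0]] = np.array([float(value) for value in parts[1:]], dtype="float32")

print(sorted(toy_embeddings))        # ['bad', 'good']
print(toy_embeddings["good"].shape)  # (3,)
```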


We create a matrix with the GloVe embeddings for the words in the IMDb dataset. That matrix will be used as the embedding layer of the following RNN. To build it, we iterate over the `vocabulary_size` token indexes in the dataset, take the word for each index, and look up its GloVe embedding. If the word is not found, we use the embedding of the OOV token.

In [34]:
def get_glove_word_embedding(word_p: str, glove_embeddings_p: dict[str, np.ndarray]) -> np.ndarray:
    """
    Get the GloVe embedding for a word. If it is not found, return the embedding for the OOV token.
    """
    return glove_embeddings_p.get(word_p, glove_embeddings_p["<OOV>"])


glove_embedding_matrix = np.zeros((vocabulary_size, embedding_dim))
for word_index in range(vocabulary_size):
    word = index_to_word.get(word_index, "<OOV>")
    embedding_vector = get_glove_word_embedding(word, glove_embeddings)
    glove_embedding_matrix[word_index] = embedding_vector
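
The same construction can be followed on a toy scale. This is a minimal sketch with hypothetical three-dimensional vectors and a three-word vocabulary; row `i` of the resulting matrix holds the vector of the word whose token id is `i`, and a word without a pre-trained vector gets the OOV row.

```python
import numpy as np

toy_dim = 3
# Hypothetical pre-trained vectors, including a fallback vector for unknown words
toy_vectors = {
    "good": np.array([0.1, 0.2, 0.3]),
    "bad": np.array([-0.1, -0.2, -0.3]),
    "<OOV>": np.ones(toy_dim),
}
# Hypothetical dataset vocabulary: "zzz" has no pre-trained vector
toy_index_to_word = {0: "good", 1: "zzz", 2: "bad"}

toy_matrix = np.zeros((len(toy_index_to_word), toy_dim))
for index, word in toy_index_to_word.items():
    # Row `index` of the matrix is the vector of the word with that token id
    toy_matrix[index] = toy_vectors.get(word, toy_vectors["<OOV>"])

print(toy_matrix)
```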

Now, let's create a new RNN with the GloVe embeddings. **Notice** that the embedding layer is not trainable. We use the GloVe embeddings as they are.

In [35]:
inputs = keras.Input(shape=(None,), dtype="int32")
# We set the matrix weights to the GloVe embeddings (`weights`) and we do not train the embeddings (`trainable=False`)
x = layers.Embedding(vocabulary_size, embedding_dim, weights=[glove_embedding_matrix], trainable=False)(inputs)
# Add 2 Bidirectional-LSTM layers
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)
glove_lstm_model = keras.Model(inputs, outputs)
glove_lstm_model.summary()

Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_6 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_5 (Embedding)     (None, None, 50)          20000000  
                                                                 
 bidirectional_4 (Bidirecti  (None, None, 128)         58880     
 onal)                                                           
                                                                 
 bidirectional_5 (Bidirecti  (None, 128)               98816     
 onal)                                                           
                                                                 
 dense_5 (Dense)             (None, 1)                 129       
                                                                 
Total params: 20157825 (76.90 MB)
Trainable params: 157825 

**Important**: This model has 157,825 trainable parameters, far fewer than the previous models (about 0.79% of the parameters of each of the two previous networks). It consumes less memory, takes less time to train, and requires less data.
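
These figures follow directly from the model summaries. As a quick arithmetic check, using only the totals printed in the summaries: freezing the embedding matrix removes its 400,000 x 50 weights from the trainable count, and the remainder is a bit under 1% of either earlier model.

```python
total_params = 20_157_825        # total from the GloVe model summary
embedding_params = 400_000 * 50  # frozen GloVe embedding matrix

trainable_params = total_params - embedding_params
print(trainable_params)  # 157825

# Share of trainable parameters relative to each of the two earlier models
shares = [f"{trainable_params / previous_total:.2%}" for previous_total in (20_029_505, 20_059_009)]
print(shares)  # ['0.79%', '0.79%']
```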

We compile, train, and evaluate the model.

In [36]:
test_loss, test_accuracy, glove_lstm_model = compile_train_evaluate(glove_lstm_model, X_train, y_train, X_val, y_val, X_test, y_test,
                                                 32, n_epochs, "data/glove_model.zip", "glove_model.keras")
print(f"Test loss: {test_loss:.4f}.\nTest accuracy: {test_accuracy:.4f}.")

Test loss: 0.4449.
Test accuracy: 0.7890.


## Inference

We can see how accurate the model is by predicting the sentiment of some reviews. What follows are some example reviews. Add more reviews to test the model.

In [37]:
example_reviews = ["The movie was a great waste of time. The plot was boring.",
                   "I loved the movie. The plot was amazing.",
                   "Would not recommend this movie to anyone.",
                   "The movie is not a masterpiece. You may have a good time if your expectations are not high."]


def prepare_reviews_for_prediction(reviews_p: list[str], word_to_index_p: dict[str, int], max_review_length_p: int)\
        -> np.ndarray:
    """
    Prepare a list of reviews for prediction: include the <START> token, convert the words to lower case,
    remove punctuation, convert the words to indexes, pad the sequences and truncate them if necessary.
    :param reviews_p: the list of reviews to be prepared
    :param word_to_index_p: a dictionary mapping words to indexes
    :param max_review_length_p: the maximum length of the reviews
    :return: an array of token sequences, one for each review
    """
    sequences_loc = []
    for review in reviews_p:
        words_indexes = [1]  # start token
        for word in review.split():
            for char_to_remove in [".", ",", "!", "?"]:
                word = word.lower().replace(char_to_remove, "")
            words_indexes.append(word_to_index_p[word] if word in word_to_index_p else 2)  # OOV token
        sequences_loc.append(words_indexes)  
    # pad the sequences
    sequences_loc = keras.preprocessing.sequence.pad_sequences(sequences_loc, maxlen=max_review_length_p, 
                                                               padding="post", truncating="post")
    return sequences_loc


sequences_to_predict = prepare_reviews_for_prediction(example_reviews, word_to_index, max_review_length)
predictions = glove_lstm_model.predict(sequences_to_predict)

for i, prediction in enumerate(predictions):
    print(f"Review {i + 1}: {example_reviews[i]}")
    print(f"Probability of being positive: {prediction[0]:.4f}.\n")

Review 1: The movie was a great waste of time. The plot was boring.
Probability of being positive: 0.0108.

Review 2: I loved the movie. The plot was amazing.
Probability of being positive: 0.9680.

Review 3: Would not recommend this movie to anyone.
Probability of being positive: 0.4301.

Review 4: The movie is not a masterpiece. You may have a good time if your expectations are not high.
Probability of being positive: 0.6280.
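
The model outputs probabilities, so a cut-off is needed to turn them into class labels. The sketch below applies an assumed threshold of 0.5 to the four probabilities printed above; the threshold is a conventional choice for illustration, and other values would trade false positives for false negatives.

```python
# Probabilities copied from the output above
probabilities = [0.0108, 0.9680, 0.4301, 0.6280]
threshold = 0.5  # assumed cut-off between the negative and positive classes

labels = ["positive" if p >= threshold else "negative" for p in probabilities]
for number, (p, label) in enumerate(zip(probabilities, labels), start=1):
    print(f"Review {number}: {label} (p={p:.4f})")
```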


## ✨ Questions

5. After testing some reviews, do you think the model is performing reasonably well? 
6. Why do you think, then, that the accuracy is not higher (it is 0.7890)?


### Answers

*Write your answers here.*
