# **Text classification with an RNN**

This text classification tutorial trains a [recurrent neural network](https://developers.google.com/machine-learning/glossary/#recurrent_neural_network) on the [IMDB large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) for sentiment analysis.

### Setup

In [None]:
import numpy as np
import pandas as pd
import time
import tensorflow_datasets as tfds
import tensorflow as tf
from IPython.display import display, Markdown
import matplotlib.pyplot as plt

tfds.disable_progress_bar()
pd.set_option('display.max_columns', 500)

In [2]:
# Helper function to plot charts
def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])

### Setup input pipeline


The IMDB large movie review dataset is a *binary classification* dataset—all the reviews have either a `positive (1)` or `negative (0)` sentiment.

Download the dataset using [TFDS](https://www.tensorflow.org/datasets). See the [loading text tutorial](https://www.tensorflow.org/tutorials/load_data/text) for details on how to load this sort of data manually.


In [None]:
# download and load the dataset
dataset, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)

# TODO: extract available sets (training, testing, unsupervised)
train_dataset, test_dataset, unsupervised = ...

# print data structure
train_dataset.element_spec

In [None]:
# print the dataset's metadata
info

Initially this returns a dataset of `(text, label)` pairs:

In [None]:
for example, label in train_dataset.take(1):
  print('text: ', example.numpy().decode("utf-8"))
  print('label: ', label.numpy())

In [None]:
# Train/Validation Split
# TODO: compute the training split size. It must be 80% of the training set
train_size = ...

# TODO: get the validation set using the skip method
# skip [train_size] elements and take remaining
val_dataset = ...

# TODO: get the training set using the take method
# take first [train_size] elements
train_dataset = ...
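
The `take`/`skip` pattern described in the comments above can be tried on a toy dataset first. This is a minimal sketch, assuming a dataset of ten integers stands in for the reviews; the names `toy_dataset`, `toy_train` and `toy_val` are illustrative and not part of the exercise. The two parts are only disjoint if the dataset yields its elements in the same order on every pass.

```python
import tensorflow as tf

# Toy stand-in for the real training set: the integers 0..9.
toy_dataset = tf.data.Dataset.range(10)

# 80% of the elements go to training, the rest to validation.
toy_train_size = int(0.8 * len(toy_dataset))

toy_train = toy_dataset.take(toy_train_size)  # first toy_train_size elements
toy_val = toy_dataset.skip(toy_train_size)    # everything after them

train_items = [int(x) for x in toy_train.as_numpy_iterator()]
val_items = [int(x) for x in toy_val.as_numpy_iterator()]
print(train_items)
print(val_items)
```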

Next shuffle the data for training and create batches of these `(text, label)` pairs:

In [7]:
# BUFFER_SIZE = 10000
TRAIN_BUFFER_SIZE = len(train_dataset)
VAL_BUFFER_SIZE = len(val_dataset)
BATCH_SIZE = 64

In [None]:
# `prefetch` speeds up training by preparing the next batch while the current one is
# still being processed, which reduces idle time between training steps.

# The `shuffle` method takes a buffer size: it fills a buffer with that many elements and then samples from it at random.
# For a uniform shuffle of the whole dataset, the buffer size must be at least len(dataset).
# TODO: 
#   - shuffle training/validation sets
#   - Subdivide in batch
#   - use the prefetch method
train_dataset = ...
val_dataset = ...
test_dataset = ...
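
To see why the buffer size matters, the buffered shuffle described in the comments above can be imitated in plain Python. This is a simplified model of the idea, not the actual `tf.data` implementation; `buffered_shuffle` is a hypothetical helper written for this illustration.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Simplified model of a shuffle that only sees `buffer_size` elements at a time."""
    rng = random.Random(seed)
    iterator = iter(items)
    buffer = []
    # Fill the buffer with the first `buffer_size` elements.
    for item in iterator:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    output = []
    # Emit a random buffer element and replace it with the next input element.
    for item in iterator:
        idx = rng.randrange(len(buffer))
        output.append(buffer[idx])
        buffer[idx] = item
    # Input exhausted: emit what is left in the buffer in random order.
    rng.shuffle(buffer)
    output.extend(buffer)
    return output

data = list(range(10))
unshuffled = buffered_shuffle(data, buffer_size=1)    # buffer of 1 keeps the order
local_only = buffered_shuffle(data, buffer_size=3)    # elements move only a few places forward
full_shuffle = buffered_shuffle(data, buffer_size=10) # any permutation is possible
print(unshuffled, local_only, full_shuffle, sep="\n")
```

With a small buffer, an element can never appear earlier than `buffer_size - 1` positions before its original position, so reviews that are close together in the file stay roughly close together after shuffling.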

In [None]:
# Take the first batch and display its first 3 elements (sentence, label)
for example, label in train_dataset.take(1):
  print('texts: ', example.numpy()[:3])
  print()
  print('labels: ', label.numpy()[:3])

### **Create the text encoder**

The raw text loaded by `tfds` needs to be processed before it can be used in a model. The simplest way to process text for training is using the `TextVectorization` layer. This layer has many capabilities, but this tutorial sticks to the default behavior.

Create the layer, and pass the dataset's text to the layer's `.adapt` method:

In [None]:
# We are setting a Max Vocabulary size of 1000 words. This means our vocabulary is capped at 1000 words
VOCAB_SIZE = 1000

# TODO: set the maximum number of tokens to VOCAB_SIZE
# The TextVectorization layer maps text to an integer sequence
encoder = tf.keras.layers.TextVectorization(...)

# The adapt function will analyze the dataset, determine the frequency of individual
# string values, and create a vocabulary from them.
# The processing of each example contains the following steps:
#   1. Standardize applying: lowercasing + punctuation stripping
#   2. Split each example into words
#   3. Recombine words into tokens (usually ngrams)
#   4. Index tokens (associate a unique int value with each token)
#   5. Transform each example using this index into a vector of ints 

# TODO: use the training set to construct the vocabulary
encoder.adapt(...)
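
The processing steps listed in the comments above can be imitated in a few lines of plain Python. This is only a rough sketch of the idea, assuming whitespace splitting, single-word tokens, index 0 for padding and index 1 for unknown words; the helper names are hypothetical and the real layer handles many more cases.

```python
import re
import string
from collections import Counter

def standardize(text):
    # Step 1: lowercase and strip ASCII punctuation.
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def build_vocabulary(texts, max_tokens):
    # Steps 2-4: split into words, count them, keep the most frequent ones.
    counts = Counter(word for text in texts for word in standardize(text).split())
    # Two slots are reserved: "" for padding and "[UNK]" for unknown words.
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocabulary):
    # Step 5: replace each word by its index; unknown words map to 1.
    index = {word: i for i, word in enumerate(vocabulary)}
    return [index.get(word, 1) for word in standardize(text).split()]

toy_texts = ["Great movie, great cast!", "The movie was bad.", "Great!"]
toy_vocab = build_vocabulary(toy_texts, max_tokens=4)
toy_encoded = encode("A great movie?", toy_vocab)
print(toy_vocab)
print(toy_encoded)
```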

The `.adapt` method sets the layer's vocabulary. Here are some vocabulary tokens. After the padding and unknown tokens they're sorted by frequency:

In [None]:
# TODO: get the vocabulary from encoder
vocab = ...
display(vocab)
print(len(vocab))

Once the vocabulary is set, the layer can encode text into indices. The tensors of indices are 0-padded to the longest sequence in the batch (unless you set a fixed `output_sequence_length`):

In [None]:
encoded_example = encoder(example)[:3].numpy()
display(encoded_example[0])
print(encoded_example.shape)
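
The 0-padding can be reproduced with NumPy to make the resulting shape explicit. This is a small sketch on made-up token indices; `pad_batch` is a hypothetical helper, not part of the layer's API.

```python
import numpy as np

def pad_batch(sequences):
    """Right-pad integer sequences with 0 up to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    return batch

# Three encoded "sentences" of different lengths (made-up indices).
toy_batch = pad_batch([[5, 2, 9], [7], [3, 4]])
print(toy_batch)
print(toy_batch.shape)
```

Because 0 is reserved for padding, later layers can tell real tokens from padding by checking for a non-zero index.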

With the default settings, the process is not completely reversible. There are two main reasons for that:

1. The default value for the `standardize` argument of `TextVectorization` is `"lower_and_strip_punctuation"`.
2. The limited vocabulary size and the lack of a character-based fallback result in some unknown tokens.

In [None]:
for n in range(3):
  original = example[n].numpy().decode("utf-8")
  reenc = " ".join(vocab[encoded_example[n]])

  soriginal = original.split(" ")
  sreenc = reenc.split(" ")

  print(f"Example {n}")
  df = pd.DataFrame({"original": soriginal, "reenc": sreenc[:len(soriginal)]})
  display(df.T)
  print()

## **Create the model**


1. This model can be built as a `tf.keras.Sequential`.

2. The first layer is the `encoder`, which converts the text to a sequence of token indices.

3. After the encoder is an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.

  This index-lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a `tf.keras.layers.Dense` layer.

4. A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input on the next timestep.

5. After the RNN has converted the sequence to a single vector, the two `layers.Dense` layers do some final processing and convert this vector representation to a single logit as the classification output.


The code to implement this is below:

In [None]:
# TODO: define our model using LSTM
model = tf.keras.Sequential([
    # TODO: encoding layer
    ...,
    # TODO: Embedding layer
    tf.keras.layers.Embedding(
        input_dim=...,
        output_dim=64,
        # TODO: enable masking

        ),
    # TODO: LSTM 64
    ...,
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
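
Point 3 above says that an embedding lookup is equivalent to passing a one-hot vector through a dense layer. The NumPy sketch below checks that claim on a small random table; the sizes and names are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 4

# One trainable vector per vocabulary entry (random values stand in for trained ones).
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

token_ids = np.array([2, 5, 2, 0])

# Index lookup: pick one row per token.
looked_up = embedding_table[token_ids]

# Equivalent, but wasteful: one-hot encode, then multiply by the table.
one_hot = np.eye(vocab_size)[token_ids]
via_matmul = one_hot @ embedding_table

print(np.allclose(looked_up, via_matmul))
```

The lookup touches only `len(token_ids)` rows, while the matrix product multiplies every one-hot vector by the whole table, which is why the lookup scales better with vocabulary size.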

Note that a Keras sequential model is used here because every layer in the model has a single input and produces a single output. If you want to use a stateful RNN layer, you may want to build your model with the Keras functional API or model subclassing so that you can retrieve and reuse the RNN layer states. See the [Keras RNN guide](https://www.tensorflow.org/guide/keras/rnn#rnn_state_reuse) for more details.
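
What "reusing the state" means can be shown with a bare recurrence in NumPy. This is a sketch under simplifying assumptions: a plain tanh RNN without bias instead of an LSTM, and random weights. Processing a sequence in two chunks while carrying the state across gives the same final state as a single pass.

```python
import numpy as np

def rnn_forward(inputs, w_x, w_h, h0):
    """Plain tanh RNN: each step's output is fed back in at the next step."""
    h = h0
    for x_t in inputs:
        h = np.tanh(x_t @ w_x + h @ w_h)
    return h

rng = np.random.default_rng(0)
input_dim, units = 3, 4
w_x = rng.normal(size=(input_dim, units))
w_h = rng.normal(size=(units, units))
sequence = rng.normal(size=(6, input_dim))
h0 = np.zeros(units)

# One pass over the whole sequence.
full_pass = rnn_forward(sequence, w_x, w_h, h0)

# Two passes: the second one starts from the state left by the first.
first_half = rnn_forward(sequence[:3], w_x, w_h, h0)
resumed = rnn_forward(sequence[3:], w_x, w_h, first_half)

print(np.allclose(full_pass, resumed))
```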

The embedding layer uses masking to handle the varying sequence-lengths. All the layers after the `Embedding` support masking:

In [None]:
print([layer.supports_masking for layer in model.layers])
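
The effect of masking on a recurrent layer can be sketched in NumPy: padded timesteps are skipped and the previous state is carried over, so padding does not change the result. The code below is an illustration with a plain tanh RNN and random weights, not the Keras implementation; `masked_rnn` is a hypothetical helper.

```python
import numpy as np

def masked_rnn(embedded, mask, w_x, w_h):
    """Tanh RNN that keeps the previous state wherever the mask is False."""
    h = np.zeros(w_h.shape[0])
    for x_t, keep in zip(embedded, mask):
        if keep:
            h = np.tanh(x_t @ w_x + h @ w_h)
    return h

rng = np.random.default_rng(1)
table = rng.normal(size=(10, 3))  # stand-in embedding table
w_x = rng.normal(size=(3, 4))
w_h = rng.normal(size=(4, 4))

short_ids = np.array([4, 7, 2])
padded_ids = np.array([4, 7, 2, 0, 0])  # same sentence, 0-padded

h_short = masked_rnn(table[short_ids], short_ids != 0, w_x, w_h)
h_padded = masked_rnn(table[padded_ids], padded_ids != 0, w_x, w_h)
h_unmasked = masked_rnn(table[padded_ids], np.ones(5, dtype=bool), w_x, w_h)

print(np.allclose(h_short, h_padded))    # masking hides the padding
print(np.allclose(h_short, h_unmasked))  # without a mask the padding leaks in
```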

To find out more, see the TensorFlow guide: [Padding and Masking](https://www.tensorflow.org/guide/keras/masking_and_padding)

Compile the Keras model to configure the training process:

In [16]:
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer="adam",
    metrics=['accuracy'],
    )
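
The `from_logits=True` argument matters because the last `Dense(1)` layer has no sigmoid activation. The NumPy sketch below compares cross-entropy computed on probabilities with a commonly used numerically stable form computed directly on logits; the labels and logits are made up, and the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_probabilities(y_true, p):
    # Textbook binary cross-entropy on probabilities.
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def bce_from_logits(y_true, z):
    # Stable form on raw logits: max(z, 0) - z * y + log(1 + exp(-|z|)).
    return np.maximum(z, 0) - z * y_true + np.log1p(np.exp(-np.abs(z)))

labels = np.array([1.0, 0.0, 1.0, 0.0])
logits = np.array([2.0, -1.5, -0.3, 0.8])

loss_a = bce_from_probabilities(labels, sigmoid(logits))
loss_b = bce_from_logits(labels, logits)
print(np.allclose(loss_a, loss_b))
```

Working on logits avoids taking the logarithm of a probability that has been rounded to exactly 0 or 1 for very confident predictions.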

### **Train the model**

In [None]:
history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=10,
)

In [None]:
test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

In [None]:
plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.ylim(None, 1)
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')
plt.ylim(0, None)

Run predictions on the test set:

The model outputs a logit, so a prediction >= 0.0 is positive; otherwise it is negative.

In [None]:
predictions = model.predict(test_dataset)
print(predictions)
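
The printed values are logits, not probabilities. The sketch below, on made-up logits, shows why thresholding the logit at 0 gives the same labels as thresholding the sigmoid probability at 0.5; `logits_to_labels` is a hypothetical helper.

```python
import numpy as np

def logits_to_labels(logits):
    """Return probabilities and the labels obtained from both thresholds."""
    probabilities = 1.0 / (1.0 + np.exp(-logits))
    labels_from_logits = (logits >= 0).astype(int)
    labels_from_probabilities = (probabilities >= 0.5).astype(int)
    return probabilities, labels_from_logits, labels_from_probabilities

# Made-up model outputs with the same (batch, 1) shape as `model.predict` returns.
toy_logits = np.array([[2.3], [-0.7], [0.0], [-4.1]])
probs, by_logit, by_prob = logits_to_labels(toy_logits)

print(probs.ravel())
print(by_logit.ravel())
```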

In [None]:
for original in test_dataset.take(1):
    # TODO: get original label
    y_true = ...

    # TODO: get model prediction
    preds = ...
    # TODO: check which preds are positive (1) or negative (0)
    y_pred = ...

    print("y_true:", y_true.numpy().tolist())
    print("y_pred:", y_pred)

## GloVe 6B 50 dim

In [None]:
# Download the pre-trained GloVe embedding file (GitHub mirror or Stanford archive) and extract it
# WARNING: the download is around 1 GB, so it can take a while!
#!wget https://github.com/uclnlp/inferbeddings/raw/refs/heads/master/data/glove/glove.6B.50d.txt.gz
#!gzip -dkf glove.6B.50d.txt.gz
#!wget https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
#!unzip -q glove.6B.zip

In [None]:
# Read the embedding file
# Each row of the file has the format:
#   word embedding(word)
embeddings_index = {}
glove_dims = 50
with open(f"glove.6B.{glove_dims}d.txt", "r", encoding="utf-8") as f:
    for line in f:
        # strip the trailing newline before splitting on spaces
        values = line.rstrip().split(" ")
        word = values[0]
        embedding = np.array(values[1:], dtype="float32")
        embeddings_index[word] = embedding

display(embeddings_index)
print(len(embeddings_index))
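
The parsing logic can be tested without downloading anything by feeding it an in-memory file. This sketch assumes the same "word followed by space-separated floats" layout; the two rows and their 3-dimensional vectors are invented for the test and are not real GloVe values.

```python
import io
import numpy as np

def load_embeddings(lines):
    """Build a {word: vector} dictionary from rows of 'word v1 v2 ...'."""
    index = {}
    for line in lines:
        word, *values = line.rstrip().split(" ")
        index[word] = np.array(values, dtype="float32")
    return index

# Invented rows in the same layout as the embedding file.
fake_file = io.StringIO("house 0.1 -0.2 0.3\ncat 0.5 0.0 -1.0\n")
toy_index = load_embeddings(fake_file)

print(toy_index["house"])
print(len(toy_index))
```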

In [None]:
# TODO: get the embedding of the word "house"
...

In [None]:
# TODO: compute the embedding matrix
# The embedding matrix is built as follows:
#   - number of rows = our vocabulary length
#   - each row is the GloVe embedding of one vocabulary word
# Row 0 means: "give me the GloVe embedding of the word indexed by 0 in my vocabulary"
embeddings_matrix = np.zeros((..., ...))

hits = 0
misses = 0

# enumerate yields each word together with its index in our vocabulary
for index, word in enumerate(encoder.get_vocabulary()):
    embedding_vector = embeddings_index.get(word)
    
    if embedding_vector is not None:
        embeddings_matrix[index] = embedding_vector
        hits += 1
    else:
        misses += 1

print(f"Converted {hits} words ({misses} misses)")
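
The same loop can be traced on a tiny example to see what hits and misses look like. Everything here is invented for the illustration: a five-entry vocabulary, 3-dimensional vectors, and one word that has no pre-trained embedding.

```python
import numpy as np

# Invented vocabulary: padding token, unknown token, two known words, one unseen word.
toy_vocab = ["", "[UNK]", "movie", "great", "zzzunseen"]
toy_embeddings = {
    "movie": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "great": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
toy_dims = 3

# One row per vocabulary entry, one column per embedding dimension.
toy_matrix = np.zeros((len(toy_vocab), toy_dims))
toy_hits, toy_misses = 0, 0
for i, w in enumerate(toy_vocab):
    vector = toy_embeddings.get(w)
    if vector is not None:
        toy_matrix[i] = vector
        toy_hits += 1
    else:
        toy_misses += 1  # the row stays all zeros

print(toy_matrix)
print(toy_hits, toy_misses)
```

Rows for the padding token, the unknown token and any word missing from the pre-trained file stay at zero, so the model gets no information from those positions.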

In [None]:
embeddings_matrix

In [None]:
# TODO: use Glove embedding
model_glove = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=...,
        output_dim=...,
        # TODO: disable training for this layer

        # TODO: add embedding matrix

        mask_zero=True,
        ),
    tf.keras.layers.LSTM(glove_dims),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
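
One common way to load pre-trained vectors into an `Embedding` layer is a constant initializer combined with `trainable=False`. The sketch below shows that pattern on an invented 4x2 matrix; it is one possible approach rather than the only one, and the names `toy_matrix` and `frozen_embedding` are illustrative.

```python
import numpy as np
import tensorflow as tf

# Invented "pre-trained" matrix: 4 vocabulary entries, 2 dimensions each.
toy_matrix = np.array(
    [[0.0, 0.0], [0.0, 0.0], [0.1, 0.2], [0.3, 0.4]], dtype="float32"
)

frozen_embedding = tf.keras.layers.Embedding(
    input_dim=toy_matrix.shape[0],
    output_dim=toy_matrix.shape[1],
    # Start from the given matrix instead of random values.
    embeddings_initializer=tf.keras.initializers.Constant(toy_matrix),
    # Keep the vectors fixed during training.
    trainable=False,
    mask_zero=True,
)

token_ids = tf.constant([[2, 3, 0]])
vectors = frozen_embedding(token_ids).numpy()
print(vectors)
```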

In [28]:
model_glove.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer="adam",
    metrics=['accuracy'],
    )

In [None]:
history = model_glove.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=10,
)

In [None]:
test_loss, test_acc = model_glove.evaluate(test_dataset)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

In [None]:
preds = model_glove.predict(test_dataset.take(1))
# The model outputs logits: a logit >= 0 means positive (1), otherwise negative (0)
y_pred = [1 if x >= 0 else 0 for x in preds.flatten()]

for x, y_true in test_dataset.take(1):
    for i in range(3):
        print(x[i].numpy().decode("utf-8"))
        print("y_pred", y_pred[i])
        print("y_true", y_true[i].numpy())
        print()

## Evaluate Unsupervised Sentences

In [32]:
unsupervised_dataset = unsupervised.batch(1).prefetch(tf.data.AUTOTUNE)

In [None]:
for sentence in unsupervised_dataset:
    display(Markdown("## Sentence\n"+sentence[0][0].numpy().decode("utf-8")))
    user_input = input()

    # Each element is a (text, label) pair: pass only the text to the models
    y_pred = model.predict(sentence[0])
    y_pred_glove = model_glove.predict(sentence[0])

    print("Your label is: ", user_input)
    print("Model label is: ", int(y_pred[0][0] >= 0))
    print("Model (glove) label is: ", int(y_pred_glove[0][0] >= 0))
    time.sleep(0.1)