<div><img style="float: right; width: 120px; vertical-align:middle" src="https://www.upm.es/sfs/Rectorado/Gabinete%20del%20Rector/Logos/EU_Informatica/ETSI%20SIST_INFORM_COLOR.png" alt="ETSISI logo" />


# Text classification with RNNs<a id="top"></a>

<i><small>Author: Alberto Díaz Álvarez<br>Last updated: 2023-05-11</small></i></div>

***

## Introduction

In a previous notebook, we explored text classification with CNNs, which are well-suited for capturing local dependencies in text data. However, sometimes we need a more powerful model that can capture longer-term dependencies in the data. This is where RNNs come in.

RNNs are specifically designed to model sequential data, which makes them ideal for text classification tasks. Unlike traditional feedforward neural networks, which process inputs independently of each other, RNNs maintain a memory of previous inputs and use this information to make predictions about the current input.
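
The idea of "memory" can be illustrated with a minimal NumPy sketch of a simple (Elman-style) recurrence. This is not the layer used later in the notebook; the sizes, the random weights and the function name `simple_rnn_forward` are assumptions made only for this illustration.

```python
import numpy as np

def simple_rnn_forward(inputs, w_x, w_h, b):
    """Run a simple recurrence over a (T, D) sequence and return the final state."""
    state = np.zeros(w_h.shape[0])
    for x_t in inputs:
        # The new state mixes the current input with the previous state
        state = np.tanh(w_x @ x_t + w_h @ state + b)
    return state

rng = np.random.default_rng(0)
T, D, H = 5, 4, 3  # illustrative sizes: time steps, input features, hidden units
w_x = rng.normal(scale=0.5, size=(H, D))
w_h = rng.normal(scale=0.5, size=(H, H))
b = np.zeros(H)

sequence = rng.normal(size=(T, D))
final_state = simple_rnn_forward(sequence, w_x, w_h, b)

# Change only the FIRST time step and run the recurrence again
altered = sequence.copy()
altered[0] += 1.0
altered_state = simple_rnn_forward(altered, w_x, w_h, b)

print('Final state shape:', final_state.shape)
print('Change in final state:', np.abs(final_state - altered_state).max())
```

Because the state is fed back at every step, a change in the first element of the sequence still reaches the final state, which is what a feedforward layer applied to each element independently cannot do.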

## Goals

We will explore how to use RNNs for text classification on the same problem as in the previous notebook, where we classified product reviews written by Amazon users. We will walk through a step-by-step example of building a text classification model with RNNs using the Keras library in Python.

We will see that in reality the changes are minimal, since it is little more than changing one layer for another.

## Libraries and configuration

Next we will import the libraries that will be used throughout the notebook.

In [None]:
import os.path
import requests
from shutil import unpack_archive

import numpy as np
import pandas as pd
import tensorflow as tf

import matplotlib.pyplot as plt

We will also configure some parameters to adapt the graphic presentation.

In [None]:
plt.style.use('ggplot')
plt.rcParams.update({'figure.figsize': (20, 6), 'figure.dpi': 64})

And create the necessary directories in case they have not been created previously.

In [None]:
os.makedirs('tmp', exist_ok=True)

***

## Common parameters

We will keep the same global parameters to be able to compare both methods.

In [None]:
# How many dimensions our word vectors have (50, 100, 200 or 300)
EMBEDDING_DIM = 50
# Our vocabulary max. size (the most frequent ones will be chosen)
MAX_VOCAB_SIZE = 16384
# Maximum sentence length
MAX_SEQUENCE_LEN = 64

## Dataset processing

The process we will carry out will be the same as we did in the previous notebook.

### Dataset download

In [None]:
DATASET_URL = 'https://jmcauley.ucsd.edu/data/amazon_v2/categoryFilesSmall/Digital_Music_5.json.gz'
DATASET_ZIP = 'tmp/Digital_Music_5.json.gz'

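# Download the remote file if it does not exist
if not os.path.exists(DATASET_ZIP):
    print(f'Downloading {DATASET_ZIP}...')
    # NOTE: verify=False disables TLS certificate checks; drop it if the server's certificate validates
    r = requests.get(DATASET_URL, verify=False)
    r.raise_for_status()  # Fail before writing anything if the download went wrong
    with open(DATASET_ZIP, 'wb') as f:
        f.write(r.content)
    print('OK')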

corpus = pd.read_json(DATASET_ZIP, lines=True)
corpus.dropna(subset=['overall', 'reviewText'], inplace=True)
corpus.head()

### Preparing inputs...

In [None]:
x_train = corpus['reviewText'].astype(str).str.strip()
print(f'Training input shape: {x_train.shape}')

### ... outputs ...

In [None]:
y_train = corpus['overall'].astype(int).replace({
    1: 0,
    2: 0,
    3: 1,
    4: 2,
    5: 2,
})
print(f'Training output shape: {y_train.shape}')

### ... TextVectorization layer

**Exercise: _Create the `vectorize_layer` layer, which will be an object of the `TextVectorization` class that will translate the words to integers. Remember to take into account the tokens for "padding" and for "unknown". An output similar to the following is expected_:**

```
Vocabulary length: 16386
```

In [None]:
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=MAX_VOCAB_SIZE + 2,
    output_sequence_length=MAX_SEQUENCE_LEN,
    name='vectorization',
)
vectorize_layer.adapt(x_train)

print(f'Vocabulary length: {len(vectorize_layer.get_vocabulary())}')
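
To make the role of the two reserved tokens more concrete, here is a simplified pure-Python stand-in for the integer vectorization step. It is only a sketch: the helper names `build_vocabulary` and `vectorize` and the toy texts are hypothetical, and unlike the Keras layer it only lowercases and splits on whitespace.

```python
from collections import Counter

def build_vocabulary(texts, max_words):
    """Keep the most frequent words; ids 0 and 1 are reserved."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is the padding token and index 1 the unknown token
    return ['', '[UNK]'] + [word for word, _ in counts.most_common(max_words)]

def vectorize(text, vocabulary, sequence_len):
    """Map words to ids, then truncate or right-pad with 0 to a fixed length."""
    word_to_id = {word: i for i, word in enumerate(vocabulary)}
    ids = [word_to_id.get(word, 1) for word in text.lower().split()]
    ids = ids[:sequence_len]
    return ids + [0] * (sequence_len - len(ids))

toy_texts = ['great album', 'great sound great album', 'bad sound great sound']
toy_vocabulary = build_vocabulary(toy_texts, max_words=3)

print(toy_vocabulary)
print(vectorize('great new album', toy_vocabulary, sequence_len=5))
```

With these toy texts the vocabulary is `['', '[UNK]', 'great', 'sound', 'album']`, and `'great new album'` becomes `[2, 1, 4, 0, 0]`: the unseen word maps to 1 and the two missing positions are padded with 0. This is also why the real layer is given `MAX_VOCAB_SIZE + 2` tokens.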

### ... and embedding

Which consists of downloading the pre-trained vectors ...

In [None]:
GLOVE_URL = 'http://nlp.stanford.edu/data/glove.6B.zip'
GLOVE_FILE = 'tmp/glove.6B.zip'

# Download the compressed GloVe dataset (if you don't already have it)
if not os.path.exists(GLOVE_FILE):
    print('Downloading ...', end='')
    r = requests.get(GLOVE_URL, allow_redirects=True)
    r.raise_for_status()  # Do not leave a corrupt file behind if the download fails
    with open(GLOVE_FILE, 'wb') as f:
        f.write(r.content)
    print('OK')

# Unzip it in the directory 'tmp'
print('Unpacking ...', end='')
unpack_archive(GLOVE_FILE, 'tmp')
print('OK')

... building the weights (vectors) matrix ...

**Exercise: _Construct the weights matrix of our vocabulary words using the GloVe weights. We expect an output similar to the following_:**

```
Loading GloVe 50-d embedding... done (400000 word vectors loaded)
Creating embedding matrix with GloVe vectors... done (4449 words unassigned)
```

In [None]:
print(f'Loading GloVe {EMBEDDING_DIM}-d embedding... ', end='')
word2vec = {}
with open(f'tmp/glove.6B.{EMBEDDING_DIM}d.txt', encoding='utf-8') as f:
    for line in f:
        word, vector = line.split(maxsplit=1)
        word2vec[word] = np.fromstring(vector, 'f', sep=' ')
print(f'done ({len(word2vec)} word vectors loaded)')

print('Creating embedding matrix with GloVe vectors... ', end='')
# Our newly created embedding: a matrix of zeros
embedding_matrix = np.zeros((MAX_VOCAB_SIZE + 2, EMBEDDING_DIM))

ko_words = 0
for i, word in enumerate(vectorize_layer.get_vocabulary()):
    if word == '[UNK]':
        # The vocabulary's second entry is the unknown token; GloVe names it 'unk'
        word = 'unk'

    # Get the word vector and overwrite the row in its corresponding position
    word_vector = word2vec.get(word)
    if word_vector is not None:
        embedding_matrix[i] = word_vector
    else:
        ko_words += 1
print(f'done ({ko_words} words unassigned)')

... and loading the matrix into an `Embedding` layer

**Exercise: _Create the `embedding_layer` layer, which will be an object of the `Embedding` class that will translate the tokens to word vectors. We expect an output similar to the following_:**

```
<tf.Tensor: shape=(2, 50), dtype=float32, numpy=
array([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00],
       [-7.9149e-01,  8.6617e-01,  1.1998e-01,  9.2287e-04,  2.7760e-01,
        -4.9185e-01,  5.0195e-01,  6.0792e-04, -2.5845e-01,  1.7865e-01,
         2.5350e-01,  7.6572e-01,  5.0664e-01,  4.0250e-01, -2.1388e-03,
        -2.8397e-01, -5.0324e-01,  3.0449e-01,  5.1779e-01,  1.5090e-02,
        -3.5031e-01, -1.1278e+00,  3.3253e-01, -3.5250e-01,  4.1326e-02,
         1.0863e+00,  3.3910e-02,  3.3564e-01,  4.9745e-01, -7.0131e-02,
        -1.2192e+00, -4.8512e-01, -3.8512e-02, -1.3554e-01, -1.6380e-01,
         5.2321e-01, -3.1318e-01, -1.6550e-01,  1.1909e-01, -1.5115e-01,
        -1.5621e-01, -6.2655e-01, -6.2336e-01, -4.2150e-01,  4.1873e-01,
        -9.2472e-01,  1.1049e+00, -2.9996e-01, -6.3003e-03,  3.9540e-01]],
      dtype=float32)>
```

In [None]:
embedding_layer = tf.keras.layers.Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=embedding_matrix.shape[1],
    weights=[embedding_matrix],
    input_length=MAX_SEQUENCE_LEN,
    trainable=False,
    name='Embedding',
)

embedding_layer(np.array([0, 1]))

## Classification model based on recurrent neural networks (RNNs)

And now, instead of using CNNs, we will use RNNs, networks that are designed for this type of problem. In this case, the input dimensions are set by the `TextVectorization` layer (which converts each text into a sequence of $T$ integers) and the `Embedding` layer (which converts each integer into a vector of $D$ dimensions), so a batch of $N$ texts becomes a tensor of shape $N \times T \times D$.
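
The shape bookkeeping can be checked with a small NumPy sketch, where the embedding lookup is nothing more than row indexing. The sizes, the random `toy_embedding_matrix` and the token ids are made up for the illustration and are unrelated to the GloVe matrix built above.

```python
import numpy as np

N, T, D = 2, 4, 3       # illustrative sizes: batch, sequence length, vector size
vocabulary_size = 6     # hypothetical toy vocabulary (ids 0..5)

rng = np.random.default_rng(0)
toy_embedding_matrix = rng.normal(size=(vocabulary_size, D))
toy_embedding_matrix[0] = 0.0   # row 0 belongs to the padding token

# Output of the vectorization step: one row of T integer ids per text
token_ids = np.array([[2, 5, 3, 0],
                      [4, 1, 0, 0]])

# Embedding lookup: each id selects the matching row of the matrix
word_vectors = toy_embedding_matrix[token_ids]
print(token_ids.shape, '->', word_vectors.shape)
```

The integer tensor of shape `(N, T)` becomes a float tensor of shape `(N, T, D)`, which is the input a recurrent layer consumes one time step at a time.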

**Exercise: _Create a recurrent model that takes as input a string, vectorizes it into word ids, then into word vectors and finally, passes it through one or more recurrent layers to end up classifying it into one of the three scores_.**

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,
    embedding_layer,
    tf.keras.layers.GRU(64, activation='relu'),
    # softmax yields a single probability distribution over the three exclusive classes
    tf.keras.layers.Dense(3, activation='softmax')
])

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['sparse_categorical_accuracy'],
)
model.summary()

Although the number of parameters is similar to that of the previous model, adding units to a recurrent layer does not increase the model's parameter count very much. It does, however, greatly increase the training time. We therefore keep the recurrent layer small, so we should not expect results as good as the previous ones.
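
As a rough check of the parameter counts, the sketch below computes the trainable parameters of a GRU layer, assuming the layout Keras uses by default (`reset_after=True`, that is, separate input and recurrent bias vectors for each of the three gates). The helper `gru_parameter_count` is hypothetical and written only for this illustration.

```python
def gru_parameter_count(units, input_dim):
    """Trainable parameters of a GRU layer with two bias vectors per gate."""
    # Input weights + recurrent weights + input bias and recurrent bias
    per_gate = units * input_dim + units * units + 2 * units
    return 3 * per_gate   # update gate, reset gate and candidate state

for units in (32, 64, 128):
    print(units, gru_parameter_count(units, input_dim=50))

# Size of the frozen embedding matrix used in this notebook, for comparison
print('Embedding weights:', (16384 + 2) * 50)
```

For 50-dimensional inputs this gives 8064, 22272 and 69120 parameters for 32, 64 and 128 units, and the figure for 64 units should match the GRU row of `model.summary()`. All of these are small next to the 819300 weights of the embedding matrix, which dominates the total.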

Let's train the model and hope for the best.

In [None]:
history = model.fit(x_train, y_train, epochs=5)

Let's take a look at the training progress:

In [None]:
pd.DataFrame(history.history).plot()
plt.yscale('log')
plt.xlabel('Epoch num.')
plt.show()

Now let's see how it interprets the sentiment of a good, fair and bad review extracted from the Amazon website.

In [None]:
good = "My nephew is on the autism spectrum and likes to fidget with things so I knew this toy would be a hit. Was concerned that it may not be \"complex\" enough for his very advanced brain but he really took to it. Both him (14 yrs) and his little brother (8 yrs) both enjoyed playing with it throughout Christmas morning. I'm always happy when I can find them something unique and engaging."
poor = "I wasn't sure about this as it's really small. I bought it for my 9 year old grandson. I was ready to send it back but my daughter decided it was a good gift so I'm hoping he likes it. Seems expensive for the price though to me."
evil = "I just wanted to follow up to say that I reported this directly to the company and had no response. I have not gotten any response from my review. The level of customer service goes a long way when an item you purchase is defective and this company didn’t care to respond. No I am even more Leary about ordering anything from this company. I never asked for a refund or replacement since I am not able to return it. I’m just wanted to let them know that this was a high dollar item and I expected it to be a quality item. Very disappointed! I bought this for my grandson for Christmas. He loved it and played with it a lot. My daughter called to say that the stickers were peeling on the corners. I am not able to take it from my grandson because he is autistic and wouldn’t understand. I just wanted to warn others who are wanting to get this. Please know that this is a cool toy and it may not happen to yours so it is up to you."

probabilities = model.predict([good, poor, evil], verbose=0)
print(f'Good was classified as {np.argmax(probabilities[0])}')
print(f'Poor was classified as {np.argmax(probabilities[1])}')
print(f'Evil was classified as {np.argmax(probabilities[2])}')
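
The printed class ids follow the mapping defined when preparing the outputs (1 and 2 stars become 0, 3 stars become 1, 4 and 5 stars become 2). A small sketch of how such ids could be turned into readable labels is shown below; the names in `CLASS_NAMES` and the scores are made up for the example and are not real model output.

```python
import numpy as np

CLASS_NAMES = ['negative', 'neutral', 'positive']   # hypothetical names for classes 0, 1, 2

# Made-up scores for three reviews, NOT real model output
toy_probabilities = np.array([
    [0.10, 0.20, 0.70],
    [0.30, 0.45, 0.25],
    [0.80, 0.15, 0.05],
])

# One argmax per row: the most probable class of each review
predicted = np.argmax(toy_probabilities, axis=1)
for class_id in predicted:
    print(class_id, CLASS_NAMES[class_id])
```

Passing `axis=1` classifies the whole batch in one call instead of indexing each row separately.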

## Conclusions

We have explored how to use RNNs for text classification and compared them to CNNs. While both architectures can be effective for text classification, they have some key differences and trade-offs.

One major difference between RNNs and CNNs is that RNNs are better suited for capturing long-term dependencies in text data, while CNNs are better suited for capturing local dependencies. This makes RNNs a good choice for tasks such as sentiment analysis or language translation, where the context of a word or phrase can be crucial for determining its meaning.

However, RNNs can be more computationally expensive to train than CNNs, and they can also suffer from the vanishing gradient problem, which can make it difficult for the model to remember information from earlier in the input sequence. On the other hand, CNNs can be faster to train and can scale well to larger datasets. They are particularly well-suited for tasks such as text classification or image recognition, where local features are important.
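
The vanishing gradient problem can be observed even in a one-unit recurrence. The sketch below is a toy illustration under assumed values (a scalar weight of 0.9 and random inputs), not a measurement of the model trained above; `gradient_through_time` is a hypothetical helper.

```python
import numpy as np

def gradient_through_time(w, inputs):
    """Track |d h_t / d h_0| for the scalar recurrence h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad, history = 0.0, 1.0, []
    for x_t in inputs:
        h = np.tanh(w * h + x_t)
        grad *= w * (1.0 - h ** 2)   # chain rule: one factor per time step
        history.append(abs(grad))
    return history

rng = np.random.default_rng(0)
magnitudes = gradient_through_time(w=0.9, inputs=rng.normal(size=64))
print('After 1 step:  ', magnitudes[0])
print('After 64 steps:', magnitudes[-1])
```

Each step multiplies the gradient by a factor whose magnitude is below 1 here, so after 64 steps (the sequence length used in this notebook) almost nothing of the first input's influence is left. Gated units such as GRU and LSTM were designed to mitigate this effect.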

In the end, the choice between RNNs and CNNs for text classification will depend on the specific requirements of the task, the characteristics of the data, and the resources available for training and testing the model.

***

<div><img style="float: right; width: 120px; vertical-align:top" src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" alt="Creative Commons by-nc-sa logo" />

[Back to top](#top)

</div>