# Recurrent models 101

In the evolving landscape of Natural Language Processing (NLP), Recurrent Neural Networks (RNNs) stand out for their unique ability to process sequential data, making them particularly suited for text analysis tasks. This practical exercise is designed to provide hands-on experience with RNNs, focusing on their application in classifying product reviews as either positive or negative.

Product reviews, ubiquitous across online platforms, serve as valuable data for extracting consumer sentiments and preferences. However, the unstructured nature of text data poses a challenge for traditional machine learning models. RNNs, with their sequential data processing capability, offer a robust solution to this challenge. By considering the temporal dynamics of language, RNNs can capture contextual nuances critical for accurately interpreting sentiments.

This practical will guide you through the end-to-end process of leveraging RNNs for sentiment analysis, encompassing the following steps:

- Reading the Dataset: We begin by importing a dataset comprising product reviews, each tagged with additional information such as review summary, verification status of the purchase, timestamp, adjusted log votes, and a binary indicator of the review's sentiment (positive or negative).
- Exploratory Data Analysis (EDA): Before delving into modeling, we will conduct a preliminary analysis to understand the dataset's characteristics and distribution.
- Dataset Splitting: The dataset will be divided into training and validation sets to evaluate the model's performance.
- Text Processing and Transformation: We will preprocess the text data, converting it into a format suitable for RNN processing.
- Data Batching and Iterator Creation: This step involves creating batches of data and iterators for efficiently feeding data into the model during training.
- Utilizing Pre-trained GloVe Word Embeddings: To enhance the model's understanding of language semantics, we will use GloVe (Global Vectors for Word Representation), a pre-trained word embedding.
- Setting Hyperparameters and Building the Network: We will configure the RNN's hyperparameters and architecture.
- Training the Model: The network will be trained on the prepared dataset.
- Validation: The trained model's performance will be evaluated on the validation dataset.
- Improvement Ideas: We will explore strategies for further enhancing the model's accuracy.


An important distinction to note is that while RNNs excel in processing textual data, incorporating additional features such as timestamps, log_votes, and verification status requires integrating RNNs with traditional neural network models. This practical will thus not only familiarize you with RNNs but also illustrate how to combine them with other neural network architectures to leverage both sequential and non-sequential data for comprehensive analysis.
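
One possible way to combine the two kinds of input is sketched below: the text goes through an RNN, the final hidden state is concatenated with the numeric features, and a linear layer produces the prediction. This is a minimal illustration, not the model built later in this practical. The class name `TextAndFeaturesNet`, the toy sizes, and the choice of three numeric features (for example `verified`, `time`, and `log_votes`, suitably scaled) are assumptions.

```python
import torch
from torch import nn

class TextAndFeaturesNet(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, vocab_size, embed_size, hidden_size, num_features):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
        # The linear layer sees the RNN summary plus the extra numeric features
        self.linear = nn.Linear(hidden_size + num_features, 1)

    def forward(self, tokens, features):
        embeddings = self.embedding(tokens)   # (batch, seq_len, embed_size)
        _, hidden = self.rnn(embeddings)      # hidden: (num_layers, batch, hidden_size)
        combined = torch.cat([hidden[-1], features], dim=1)
        return self.linear(combined)          # one logit per review

torch.manual_seed(0)
demo_model = TextAndFeaturesNet(vocab_size=20, embed_size=6, hidden_size=4, num_features=3)
demo_tokens = torch.randint(0, 20, (5, 7))  # 5 reviews, 7 token indices each
demo_features = torch.rand(5, 3)            # 3 numeric features per review
demo_logits = demo_model(demo_tokens, demo_features)
print(demo_logits.shape)
```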


This practical requires the `torchtext` package.

Import `torchtext`, the `GloVe` vectors class, and the `get_tokenizer` function:

In [15]:
import torch, torchtext
from torchtext.vocab import GloVe
from torchtext.data.utils import get_tokenizer

from collections import Counter

In [3]:
import time
import numpy as np
import torch, torchtext
import pandas as pd
from torch import nn, optim
from torch.nn import BCEWithLogitsLoss
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

You can download the dataset using this link: https://drive.google.com/file/d/1k1XD0XPRSCC4tGGX1LUN-uzrmlQNJjs2/view?usp=sharing

Then upload the dataset to your Google Drive or directly to Colab.

Now let's read the dataset:

In [4]:
import pandas as pd

df = pd.read_csv('./AMAZON-REVIEW-DATA-CLASSIFICATION.csv')

In [5]:
df.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0


### Data analysis

Display some examples from the dataset

In [6]:
df.iloc[0]['reviewText']

'PURCHASED FOR YOUNGSTER WHO\nINHERITED MY "TOO sMALL FOR ME"\nLAPTOP.  IDEAL FOR LEARNING A\nFUTURE GOOD SKILL.  HER CHOICE\nOF BOOKS IS A PLUS AS WAS THIS BOOK!'

Display the number of samples

In [10]:
len(df)

70000

Display the label distribution (number of positive and negative labels)

In [7]:
df["isPositive"].value_counts()

isPositive
1.0    43692
0.0    26308
Name: count, dtype: int64
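
The counts above can be turned into proportions to see how imbalanced the classes are. The sketch below re-enters the two counts by hand, so it runs without the CSV file. The variable names are illustrative. The majority-class share is a useful baseline: a model has to beat it to be better than always predicting "positive".

```python
import pandas as pd

# Counts copied from the value_counts() output above
label_counts = pd.Series({1.0: 43692, 0.0: 26308}, name="count")
label_share = label_counts / label_counts.sum()
print(label_share.round(3))

# Accuracy of a classifier that always predicts the majority class
majority_baseline = label_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

On the real DataFrame, `df["isPositive"].value_counts(normalize=True)` gives the same proportions directly.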

Display the value count for verified

In [9]:
df["verified"].value_counts()

verified
True     47208
False    22792
Name: count, dtype: int64

Remove the rows whose `reviewText` is NaN

In [11]:
df = df.dropna(subset=['reviewText'])
len(df)

69988

### Split the dataset into training and validation

In [12]:
train_text, val_text, train_label, val_label = \
    train_test_split(df["reviewText"].tolist(),
                     df["isPositive"].tolist(),
                     test_size=0.10,
                     shuffle=True,
                     random_state=324)

### Text processing and Transformation

Use the `get_tokenizer` function from torchtext to create the `basic_english` tokenizer

In [13]:
tokenizer = get_tokenizer("basic_english")

Create a `Counter()` from `collections`; we will use it to compute the histogram of our tokens

In [16]:
counter = Counter()

Update the counter with the tokens of each sample in the training set

`counter.update(tokenizer("..."))`

In [17]:
for line in train_text:
    counter.update(tokenizer(line))

Let's display the counter. You can see that we have the count for each token

In [None]:
counter
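
Printing the full counter produces a very long output. A shorter view is the list of most frequent tokens, via `Counter.most_common`. The sketch below uses a toy list of reviews and `str.split()` as a stand-in for the torchtext tokenizer, so the names and data are assumptions.

```python
from collections import Counter

toy_reviews = ["great game , great price", "would not load", "great value"]
toy_counter = Counter()
for review in toy_reviews:
    # str.split() stands in for the torchtext tokenizer in this toy example
    toy_counter.update(review.split())

# The three most frequent tokens with their counts
top_tokens = toy_counter.most_common(3)
print(top_tokens)
```

With the real counter, `counter.most_common(20)` shows the 20 most frequent tokens of the training set.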

Create a vocabulary containing the tokens seen at least 5 times (`min_freq=5`), using `torchtext.vocab.vocab`

In [19]:
# Create a vocabulary with words seen at least 5 (min_freq) times
vocab = torchtext.vocab.vocab(counter, min_freq=5)

Add the unknown token and the pad token, using the `insert_token` function

In [None]:
# Add the unknown token
# and use it by default for unknown words
unk_token = '<unk>'
vocab.insert_token(unk_token, 0)
vocab.set_default_index(0)

# Add the pad token
pad_token = '<pad>'
vocab.insert_token(pad_token, 1)

Here are some examples

In [None]:
print(f"'home' -> {vocab['home']}")
print(f"'wash' -> {vocab['wash']}")
# unknown word (assume from test set)
print(f"'fhshbasdhb' -> {vocab['fhshbasdhb']}")
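
To make the mapping concrete, here is a plain-Python sketch of the same idea: keep the tokens that reach `min_freq`, reserve index 0 for `<unk>` and index 1 for `<pad>`, and fall back to index 0 for anything else. The helper `build_vocab` and the toy counts are hypothetical. This is not how torchtext is implemented, only an illustration of the behaviour.

```python
from collections import Counter

def build_vocab(counter, min_freq, specials=("<unk>", "<pad>")):
    """Hypothetical helper: map tokens seen at least min_freq times to indices."""
    itos = list(specials) + [tok for tok, n in counter.items() if n >= min_freq]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

toy_counter = Counter({"great": 7, "game": 5, "load": 4, "zzz": 1})
stoi, itos = build_vocab(toy_counter, min_freq=5)

def lookup(token):
    # Tokens outside the vocabulary fall back to the <unk> index (0)
    return stoi.get(token, stoi["<unk>"])

print(lookup("great"), lookup("load"), lookup("fhshbasdhb"))
```

Here `load` occurs only 4 times, so it is treated exactly like a word that was never seen.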

In [None]:
# you can use the following mapper to tokenize the data
text_transform_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]

In [None]:
print(f"Before transform:\t{train_text[37]}")
print(f"After transform:\t{text_transform_pipeline(train_text[37])}")

Let's write a function that modifies and pads our text data as needed. In this function, we'll truncate the text sequence when it exceeds a specified length (in this case, max_len=50). If the text is shorter than max_len, we'll append 1s to the end of the sequence, representing the padding token.

In [None]:
def transformText(text_list, max_len):
    # Map each text to token indices and truncate it to max_len
    transformed_data = [text_transform_pipeline(text)[:max_len] for text in text_list]

    # Pad with the <pad> index (1) if the text is shorter than max_len
    pad_index = vocab[pad_token]
    for data in transformed_data:
        data.extend([pad_index] * (max_len - len(data)))

    return torch.tensor(transformed_data, dtype=torch.int64)

In [None]:
train_text[300]

In [None]:
text = train_text[7:9]
print(f"Text: {text}\n")
print(f"Num sentences: {len(text)}\n")
tt = transformText(text, max_len=50)
print(f"Transformed text: \n{tt}\n")
print(f"Shape of transformed text: {tt.shape}")

Let's use the transformText() function and create the data loaders. Here, we use max_len=100 to keep only the first 100 tokens of each text.

Use TensorDataset to build the dataset from your token indices and labels.

In [None]:
max_len = 100
batch_size = 32

# Pass transformed and padded data to dataset
# Create data loaders
train_dataset = TensorDataset(
    transformText(train_text, max_len), torch.tensor(train_label)
)
val_dataset = TensorDataset(transformText(val_text, max_len), torch.tensor(val_label))

train_loader = DataLoader(train_dataset, batch_size=batch_size)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

### Build the model

In this example, we use GloVe word vectors, specifically the 300-dimensional vectors from the '6B' set. These vectors were trained on a corpus of 6 billion tokens and cover a vocabulary of 400,000 words, each represented by a vector of 300 numerical values. The code below loads these word vectors and builds an embedding matrix. The get_vecs_by_tokens() function looks up the GloVe vector of each token in our vocabulary, in index order, so row i of the matrix is the vector of the token with index i.

In [None]:
glove = GloVe(name="6B", dim=300)
embedding_matrix = glove.get_vecs_by_tokens(vocab.get_itos())
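
It is worth checking how many vocabulary tokens actually have a pretrained vector. The sketch below assumes that tokens absent from GloVe receive an all-zero row, which is the torchtext default when no custom `unk_init` is passed. It runs on a small hand-written matrix instead of the real `embedding_matrix`, and all names are illustrative.

```python
import torch

# Stand-in for the real embedding_matrix: 5 tokens, 4 dimensions,
# where rows 0 and 1 (e.g. <unk> and <pad>) have no pretrained vector
toy_embedding_matrix = torch.tensor([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.1, -0.3, 0.2, 0.5],
    [0.4, 0.0, -0.1, 0.2],
    [-0.2, 0.6, 0.3, -0.4],
])

# A row that is zero everywhere is treated as "not found in GloVe"
is_missing = (toy_embedding_matrix == 0).all(dim=1)
num_missing = int(is_missing.sum())
coverage = 1 - num_missing / toy_embedding_matrix.shape[0]
print(f"{num_missing} tokens without a pretrained vector, coverage {coverage:.0%}")
```

Applying the same two lines to `embedding_matrix` gives the coverage of the real vocabulary.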

Hyperparameters of the model that you can tune

In [None]:
# Size of the state vectors
hidden_size = 8

# General NN training parameters
learning_rate = 0.001
epochs = 25

# Embedding vector and vocabulary sizes
embed_size = 300  # glove.6B.300d.txt
vocab_size = len(vocab.get_itos())

Before proceeding, it's essential to ensure our data is properly formatted for the subsequent stages. Our model comprises the following layers:

- Embedding Layer: Responsible for mapping words/tokens to word vectors.
- RNN Layer: In this instance, we employ a simple RNN model, consisting of two stacked RNN layers (`num_layers=2`).
- Linear Layer: Utilized for the final prediction of 'isPositive', this layer consists of a single neuron.

Embedding documentation: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html

In [None]:
class Net(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
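        # batch_first=True: the RNN expects (batch, seq_len, features),
        # matching the (batch, max_len) index tensors from the data loaders.
        # Without it, nn.RNN would treat the batch dimension as time.
        self.rnn = nn.RNN(
            embed_size, hidden_size, num_layers=num_layers, batch_first=True
        )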

        self.linear = nn.Linear(hidden_size*max_len, 1)
        self.act = nn.Sigmoid()

    def forward(self, inputs):
        embeddings = self.embedding(inputs)
        # Call RNN layer
        outputs, _ = self.rnn(embeddings)
        # Use the output of each time step
        # Send it all together to the linear layer
        outs = self.linear(outputs.reshape(outputs.shape[0], -1))
        return self.act(outs)

model = Net(vocab_size, embed_size, hidden_size, num_layers=2)

# Initialize the weights
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
    if isinstance(m, nn.RNN):
        # Initialize the weight matrices only, biases keep their defaults
        for name, param in m.named_parameters():
            if "weight" in name:
                nn.init.xavier_uniform_(param)
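
To see how the data flows through the three layers, the sketch below traces tensor shapes on toy sizes. It assumes the RNN is created with `batch_first=True`, so the first dimension is always the batch. The `toy_*` names and sizes are illustrative and independent of the model above.

```python
import torch
from torch import nn

batch, seq_len, embed, hidden = 4, 10, 6, 3  # toy sizes
toy_embedding = nn.Embedding(20, embed)
toy_rnn = nn.RNN(embed, hidden, num_layers=2, batch_first=True)
toy_linear = nn.Linear(hidden * seq_len, 1)

toy_tokens = torch.randint(0, 20, (batch, seq_len))
toy_embedded = toy_embedding(toy_tokens)         # (batch, seq_len, embed)
toy_outputs, toy_hidden = toy_rnn(toy_embedded)  # outputs: (batch, seq_len, hidden)
toy_flat = toy_outputs.reshape(batch, -1)        # (batch, seq_len * hidden)
toy_probs = torch.sigmoid(toy_linear(toy_flat))  # (batch, 1)

for name, t in [("embedded", toy_embedded), ("outputs", toy_outputs),
                ("hidden", toy_hidden), ("flat", toy_flat), ("probs", toy_probs)]:
    print(f"{name:9s}{tuple(t.shape)}")
```

Note that `outputs` holds the top layer's state at every time step, while `hidden` holds the final state of each of the two stacked layers.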


Let's initialize this network. Then, we will need to make the embedding layer use our GloVe word vectors.

Use `model.embedding.weight.data.copy_` to set the pretrained embedding parameters and freeze the weight

In [None]:
# We set the embedding layer's parameters from GloVe
model.embedding.weight.data.copy_(embedding_matrix)

# We won't change/train the embedding layer
model.embedding.weight.requires_grad = False

Build loss and optimizer

In [None]:
# Setting our trainer
trainer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# We will use Binary Cross-entropy loss
# reduction="sum" sums the losses for given output and target
cross_ent_loss = nn.BCELoss(reduction="sum")
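
The imports at the top also bring in `BCEWithLogitsLoss`, which this practical does not use. As a small sketch of the alternative: applying `BCELoss` to sigmoid outputs and applying `BCEWithLogitsLoss` to the raw linear outputs give the same value, and the second form is generally the more numerically stable one. Using it would mean removing the `Sigmoid` from the model, which is an assumption here and not what the code above does.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(6)  # raw linear-layer outputs
targets = torch.tensor([1., 0., 1., 1., 0., 0.])

# Option used in this practical: Sigmoid in the model, then BCELoss
loss_from_probs = nn.BCELoss(reduction="sum")(torch.sigmoid(logits), targets)
# Alternative: no Sigmoid in the model, BCEWithLogitsLoss on the raw outputs
loss_from_logits = nn.BCEWithLogitsLoss(reduction="sum")(logits, targets)

print(loss_from_probs.item(), loss_from_logits.item())
```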

Now build the train / validation loop

In [None]:
# Get the compute device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.apply(init_weights)
model.to(device)

for epoch in range(epochs):
    start = time.time()
    training_loss = 0
    val_loss = 0
    # Training loop, train the network
    for data, target in train_loader:
        trainer.zero_grad()
        data = data.to(device)
        target = target.to(device)
        output = model(data)
        L = cross_ent_loss(output.squeeze(1), target)
        training_loss += L.item()
        L.backward()
        trainer.step()

    # Validate the network, no training (no weight update, no gradient tracking)
    with torch.no_grad():
        for data, target in val_loader:
            val_predictions = model(data.to(device))
            L = cross_ent_loss(val_predictions.squeeze(1), target.to(device))
            val_loss += L.item()

    # Let's take the average losses
    training_loss = training_loss / len(train_label)
    val_loss = val_loss / len(val_label)

    end = time.time()
    print(
        f"Epoch {epoch}. Train_loss {training_loss}. Val_loss {val_loss}. Seconds {end-start}"
    )

### Test our classifier

Display some predictions, then compute the confusion matrix, accuracy, precision, recall, and F1 score on the validation set

In [None]:
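val_predictions = []
# Inference only, so gradients are not needed
with torch.no_grad():
    for data, target in val_loader:
        val_preds = model(data.to(device))
        # Round each probability to the nearest label (0 or 1)
        val_predictions.extend(
            [np.rint(val_pred)[0] for val_pred in val_preds.cpu().numpy()]
        )
print(val_predictions[:10])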
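
The metrics themselves can be computed with the scikit-learn functions imported at the top. The sketch below uses two short hand-written arrays as stand-ins for `val_label` and `val_predictions`, so it runs on its own. The `toy_*` names are assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy stand-ins for val_label and val_predictions
toy_labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
toy_predictions = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(toy_labels, toy_predictions)
acc = accuracy_score(toy_labels, toy_predictions)
print(cm)
print(f"Accuracy: {acc:.2f}")
# Precision, recall, and F1 score per class
print(classification_report(toy_labels, toy_predictions))
```

With the variables of this practical, the same calls would be `confusion_matrix(val_label, val_predictions)`, `accuracy_score(val_label, val_predictions)`, and `classification_report(val_label, val_predictions)`.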

Improvement ideas:

- Change the batch size
- Add more layers
- Use GRU / LSTM layers
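
As a starting point for the last idea, here is a minimal sketch of a GRU variant. It is not part of the original practical: the class name `GRUNet`, the toy sizes, and the choice to feed only the final hidden state to the linear layer (instead of the outputs of all time steps) are assumptions.

```python
import torch
from torch import nn

class GRUNet(nn.Module):  # hypothetical variant, not part of the original practical
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # nn.LSTM takes the same arguments; it returns (outputs, (h_n, c_n))
        self.rnn = nn.GRU(embed_size, hidden_size, num_layers=num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, inputs):
        embeddings = self.embedding(inputs)
        _, hidden = self.rnn(embeddings)  # hidden: (num_layers, batch, hidden_size)
        # Summarize each review by the last layer's final hidden state
        return torch.sigmoid(self.linear(hidden[-1]))

gru_model = GRUNet(vocab_size=50, embed_size=8, hidden_size=4, num_layers=2)
gru_probs = gru_model(torch.randint(0, 50, (3, 12)))  # 3 reviews, 12 tokens each
print(gru_probs.shape)
```

Because the reviews are padded at the end, the final hidden state is computed after the padding tokens. Packing the sequences with `torch.nn.utils.rnn.pack_padded_sequence` is one way to avoid this.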