# Sentiment Analysis Improved

We covered the fundamentals of sentiment analysis in the previous notebook. What do we add in this notebook to improve our results?

- Pack padded sequences
- Bidirectional and Multi-Layer RNN
- Regularization (dropout)

## Data Ingestion

In [1]:
import os
import re
import time
import math
import torch
import torch.nn as nn
from typing import List
from torchtext import data
from torch.optim import Adam
from torchtext.datasets import IMDB
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

In [2]:
# !pip install torchdata
# !pip install -U spacy
# !python -m spacy download en_core_web_sm

In [3]:
train_iterator, test_iterator = IMDB()

## Preprocessing

The preprocessing step is the same as in the previous notebook. You can extend it further; the following techniques are worth exploring:

- Apply a minimum word occurrence threshold when building the vocabulary
- Word stemming and lemmatization
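As a rough illustration of the first suggestion, the sketch below keeps only words that occur at least `min_freq` times and lets everything else fall back to `<UNK>`. The helper name `build_vocab` and the `min_freq` parameter are hypothetical and not part of this notebook; ids are assigned in sorted order only to make the result reproducible.

```python
from collections import Counter


def build_vocab(tokens, min_freq=2):
    """Map every token seen at least `min_freq` times to an integer id."""
    counts = Counter(tokens)
    vocab = {'<PAD>': 0, '<UNK>': 1}
    # Sorted iteration gives the same id assignment on every run
    for word in sorted(counts):
        if counts[word] >= min_freq:
            vocab[word] = len(vocab)
    return vocab


sample_tokens = "the movie was good the plot was thin".split()
small_vocab = build_vocab(sample_tokens, min_freq=2)
print(small_vocab)  # {'<PAD>': 0, '<UNK>': 1, 'the': 2, 'was': 3}
```

Rare words then share the `<UNK>` index, which shrinks the embedding table at the cost of losing information about infrequent words.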

In [4]:
def tokenize(text):
  text = re.sub(r"[-()\"#/@;:<>{}=~|.?,]", " ", text)
  text = re.sub(' +', ' ', text)
  return text.lower().split()


tokens = list()
for _, line in train_iterator:
  tokens += tokenize(line)

In [5]:
vocab = {'<PAD>':0, '<UNK>':1}
for i, word in enumerate(set(tokens), start=len(vocab)):
  vocab[word] = i


def word_to_index(text):
  # Unknown words map to the <UNK> index
  return [vocab.get(word, vocab['<UNK>']) for word in tokenize(text)]

In [6]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

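def collate_fn(batch):
  labels, text, text_len = [], [], []
  for label, line in batch:
    indices = word_to_index(line)
    # BCEWithLogitsLoss expects 0/1 targets; remap here if your
    # torchtext version yields other label values
    labels.append(label)
    text.append(torch.tensor(indices))
    text_len.append(len(indices))  # number of tokens, not characters
  # Pad to the longest sequence in the batch: [seq_len, batch_size]
  text = pad_sequence(text, padding_value=vocab['<PAD>']).to(device)
  labels = torch.tensor(labels).to(device)
  text_len = torch.tensor(text_len).to(device)
  return labels, text, text_len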

## Model Definition

The improved model uses a few more techniques. We have already discussed bidirectional and multi-layer RNNs at length in previous notebooks, so let's take a quick look at regularization.

**Regularization:** Adding techniques such as extra layers and a second direction adds parameters to our model. The more parameters a model has, the more likely it is to overfit: it memorizes the training data, which gives a very low training error but poor generalization to unseen (test) data. Regularization helps us combat this issue. We will use `dropout`, which randomly zeroes a fraction of the activations during training. Although each network with dropped units is a weaker model, dropout works because the predictions of all these weaker models, taken together, behave like an ensemble that gives a better result.
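The behaviour of `nn.Dropout` can be checked in isolation. The following is a small sketch, separate from the notebook's model: in training mode roughly half of the entries are zeroed and the survivors are scaled by `1 / (1 - p)`, while in evaluation mode the layer is the identity. The tensor `x` and the seed are arbitrary choices made for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)  # each entry is either 0 or 1 / (1 - 0.5) = 2

dropout.eval()
eval_out = dropout(x)   # dropout is disabled in eval mode

print(train_out.unique())        # only the values 0 and 2 appear
print(train_out.mean())          # close to 1, so the expected activation is preserved
print(torch.equal(eval_out, x))  # True
```

This is why `model.train()` and `model.eval()` are called before training and evaluation respectively.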

The LSTM returns a final hidden state of shape `[num_layers*num_directions, batch_size, hid_dim]`, ordered as `[forward_0, backward_0, ..., forward_n, backward_n]`, where `n` is the index of the last layer. As we only need the last layer (both forward and backward), we slice the hidden state using `hidden[-2,:,:]` for forward_n and `hidden[-1,:,:]` for backward_n and then concatenate the two. For the same reason, the linear layer is initialized with `hid_dim*2` input features: the last layer contributes one hidden state per direction.
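The ordering of the hidden states can be verified on a toy bidirectional LSTM. This is a sketch with arbitrary small dimensions and no padding, not the notebook's model: the last two rows of `hidden` should match the top layer's forward output at the last time step and its backward output at the first time step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch_size, embed_dim, hid_dim, n_layers = 5, 3, 4, 6, 2
lstm = nn.LSTM(embed_dim, hid_dim, num_layers=n_layers, bidirectional=True)
x = torch.randn(seq_len, batch_size, embed_dim)

output, (hidden, cell) = lstm(x)
print(output.shape)  # torch.Size([5, 3, 12]) -> top layer, both directions
print(hidden.shape)  # torch.Size([4, 3, 6])  -> n_layers * 2 directions

# The forward direction finishes at the last time step,
# the backward direction finishes at the first time step.
forward_last = output[-1, :, :hid_dim]
backward_last = output[0, :, hid_dim:]
print(torch.allclose(hidden[-2], forward_last))   # True
print(torch.allclose(hidden[-1], backward_last))  # True

final = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(final.shape)  # torch.Size([3, 12]) -> input to the linear layer
```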

**Note:** The lengths passed to `pack_padded_sequence` must be a CPU tensor, so we explicitly call `.to('cpu')` on them.
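To see what packing does, the sketch below pads three toy sequences, packs them, and unpacks them again. The token values are made up, and `unsqueeze(-1).float()` is only a stand-in for an embedding lookup so the example stays small.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

seqs = [torch.tensor([7, 8]), torch.tensor([4, 5, 6]), torch.tensor([9])]
lengths = torch.tensor([len(s) for s in seqs])  # CPU tensor: [2, 3, 1]
padded = pad_sequence(seqs, padding_value=0)    # [max_len, batch_size] = [3, 3]

embedded = padded.unsqueeze(-1).float()         # stand-in for embeddings: [3, 3, 1]
packed = pack_padded_sequence(embedded, lengths, enforce_sorted=False)
print(packed.data.shape)  # torch.Size([6, 1]) -> only the 2 + 3 + 1 real time steps

unpacked, unpacked_len = pad_packed_sequence(packed)
print(unpacked_len)                     # tensor([2, 3, 1]), original batch order
print(torch.equal(unpacked, embedded))  # True, padded positions are zeros again
```

Because the RNN only processes the packed time steps, the final hidden state of each sequence corresponds to its last real token rather than to a padding token.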

In [7]:
class SentimentAnalysis(nn.Module):
  def __init__(self, input_dim, embed_dim, hid_dim, output_dim, n_layers, bidirectional, dropout):
    super().__init__()
    self.embedding = nn.Embedding(input_dim, embed_dim)
    self.rnn = nn.LSTM(embed_dim, hid_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
    self.fc = nn.Linear(hid_dim*2, output_dim)
    self.dropout = nn.Dropout(dropout)

  def forward(self, text, text_len): # [seq_len, batch_size]
    embedded = self.dropout(self.embedding(text)) # [seq_len, batch_size, embed_dim]
    packed_embedded = pack_padded_sequence(embedded, text_len.to('cpu'), enforce_sorted=False) # Pack sequence
    packed_output, (hidden, cell) = self.rnn(packed_embedded)
    # Unpack sequence  # Output over padding tokens are zero tensors
    output, output_len = pad_packed_sequence(packed_output)
    hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
    return self.fc(hidden)

In [8]:
def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() # Convert to float for division 
    acc = correct.sum() / len(correct)
    return acc

## Parameter Handling

Here we define the hyperparameters, pass them to our sentiment analysis model, and count its trainable parameters with the `count_parameters` function.

In [9]:
INPUT_DIM = len(vocab)
EMBED_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = 1
N_LAYERS = 2
BATCH_SIZE = 128
BIDIRECTIONAL = True
DROPOUT = 0.5

model = SentimentAnalysis(INPUT_DIM, EMBED_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT)

In [10]:
def count_parameters(model):
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 10,198,441 trainable parameters
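The reported total can be reproduced by hand. The sketch below is a back-of-the-envelope check, assuming the standard `nn.LSTM` layout (four gates, each with input weights, recurrent weights, and two bias vectors). The vocabulary size of 95,674 is not stated in the notebook; it is inferred here from the reported total and will differ if the tokenization changes. The helper `lstm_layer_params` is hypothetical.

```python
def lstm_layer_params(input_size, hid_dim, num_directions=2):
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    per_direction = 4 * hid_dim * (input_size + hid_dim) + 2 * 4 * hid_dim
    return num_directions * per_direction


vocab_size = 95_674  # assumption: inferred from the reported total
embed_dim, hid_dim, output_dim = 100, 128, 1

embedding = vocab_size * embed_dim                # 9,567,400
layer_1 = lstm_layer_params(embed_dim, hid_dim)   # 235,520
layer_2 = lstm_layer_params(2 * hid_dim, hid_dim) # 395,264 (input is both directions)
fc = 2 * hid_dim * output_dim + output_dim        # 257

total = embedding + layer_1 + layer_2 + fc
print(f'{total:,}')  # 10,198,441
```

Under this assumption the embedding table accounts for roughly 94% of the parameters, which is one reason to prune rare words from the vocabulary.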


## Model Training


`train_dataloader` and `test_dataloader` batch the train and test datasets respectively. The `train` function runs one training epoch, the `evaluate` function measures the trained model on a dataset, and `epoch_time` returns the elapsed time of an epoch in minutes and seconds.

In [11]:
train_dataloader = DataLoader(train_iterator, batch_size=BATCH_SIZE, collate_fn=collate_fn, shuffle=True)
test_dataloader = DataLoader(test_iterator, batch_size=BATCH_SIZE, collate_fn=collate_fn, shuffle=True)

In [12]:
model.to(device)
criterion = nn.BCEWithLogitsLoss().to(device)
optimizer = Adam(model.parameters(), lr=1e-3)

In [13]:
def train(model, dataloader, optimizer, criterion):
  epoch_loss = 0
  epoch_accuracy = 0
  batch_idx = 0
  model.train()  # enable dropout
  for labels, text, text_len in dataloader:
    optimizer.zero_grad()
    predictions = model(text, text_len).squeeze(1)
    loss = criterion(predictions, labels.float())
    accuracy = binary_accuracy(predictions, labels.float())
    loss.backward()
    optimizer.step()
    # .item() returns a plain float, so the graph is not kept alive across batches
    epoch_loss += loss.item()
    epoch_accuracy += accuracy.item()
    batch_idx += 1
  return epoch_loss/batch_idx, epoch_accuracy/batch_idx

In [14]:
def evaluate(model, dataloader, criterion):
  epoch_loss = 0
  epoch_accuracy = 0
  batch_idx = 0
  model.eval()  # disable dropout
  with torch.no_grad():
    for labels, text, text_len in dataloader:
      predictions = model(text, text_len).squeeze(1)
      loss = criterion(predictions, labels.float())
      accuracy = binary_accuracy(predictions, labels.float())
      # Accumulate plain floats rather than tensors
      epoch_loss += loss.item()
      epoch_accuracy += accuracy.item()
      batch_idx += 1
  return epoch_loss/batch_idx, epoch_accuracy/batch_idx

In [15]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [16]:
EPOCHS = 10
CLIP = 1

# Create the output directory if it does not exist yet
os.makedirs('./../models', exist_ok=True)

In [None]:
best_train_loss = float('inf')
for epoch in range(EPOCHS):
  start_time = time.time()
  train_loss, train_acc = train(model, train_dataloader, optimizer, criterion)
  end_time = time.time()
  epoch_mins, epoch_secs = epoch_time(start_time, end_time)
  if train_loss < best_train_loss:
      best_train_loss = train_loss
      torch.save(model.state_dict(), './../models/sentimet-basic.pt')
  print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}:{epoch_secs} | Train Accuracy: {train_acc:.3f} | Train Loss: {train_loss:.3f}')

In [None]:
test_loss, test_acc = evaluate(model, test_dataloader, criterion)

# Perplexity is a language-modelling metric, so only accuracy and loss are reported
print(f'Test Accuracy: {test_acc:.3f} | Test Loss: {test_loss:.3f}')

## Inference

Inference is the process of running live data points into a machine learning algorithm (or “ML model”) to calculate an output such as a single numerical score. This process is also referred to as “operationalizing a machine learning model” or “putting a machine learning model into production.”

In [None]:
def predict_sentiment(model, sentence):
  model.eval()  # disable dropout
  indexed_tokens = [vocab.get(word, vocab['<UNK>']) for word in tokenize(sentence)]
  tensors = torch.LongTensor(indexed_tokens).to(device).unsqueeze(1)  # [seq_len, 1]
  tensor_len = torch.LongTensor([len(indexed_tokens)])
  with torch.no_grad():  # no gradients are needed at inference time
    prediction = torch.sigmoid(model(tensors, tensor_len))
  return prediction.item()

In [None]:
# Example of a Negative Review
print(predict_sentiment(model, "The movie was terrible!"))

# Example of a Positive Review
print(predict_sentiment(model, "The movie was amazing!"))

## References

- [IMDB Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/)
- [PyTorch Sentiment Analysis](https://github.com/bentrevett/pytorch-sentiment-analysis)