# Sentiment Analysis Basic

Sentiment analysis is a natural language processing (NLP) technique for identifying whether a piece of text expresses positive, negative, or neutral sentiment. It is widely used in industry to understand customers through their reviews and social media comments.

This notebook presents a basic introduction to sentiment analysis. The results obtained here may not be very good; the purpose of this notebook is to familiarize you with the technique. We will improve the results in later notebooks.

The dataset used here is the popular [IMDB review](http://ai.stanford.edu/~amaas/data/sentiment/) dataset.

## Data Ingestion

In [None]:
import os
import re
import time
import math
import torch
import torch.nn as nn
from torchtext import data
from torch.optim import Adam
from torchtext.datasets import IMDB
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

In [None]:
# !pip install torchdata
# !pip install -U spacy
# !python -m spacy download en_core_web_sm

In [None]:
train_iterator, test_iterator = IMDB()

## Preprocessing

This step includes text cleaning, tokenization, and vocabulary building. I have created a basic text preparation workflow here. If you want to experiment further with text preprocessing, you can try the following techniques:

- Apply a minimum word occurrence threshold when building the vocabulary
- Word stemming and lemmatization
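
As a rough illustration of the first technique, the sketch below keeps only words that occur at least a given number of times and lets everything else fall back to `<UNK>`. The `build_vocab` helper, its `min_freq` parameter, and the toy reviews are hypothetical names introduced for this example; the cleaning rule mirrors the `tokenize` function used in this section.

```python
import re
from collections import Counter


def tokenize(text):
    # Same cleaning rule as the tokenize function in this section.
    text = re.sub(r"[-()\"#/@;:<>{}=~|.?,]", " ", text)
    return text.lower().split()


def build_vocab(texts, min_freq=2):
    # Hypothetical helper: words seen fewer than min_freq times are left out,
    # so they are later encoded as <UNK>.
    counts = Counter(token for text in texts for token in tokenize(text))
    vocab = {'<PAD>': 0, '<UNK>': 1}
    for word, count in sorted(counts.items()):
        if count >= min_freq:
            vocab[word] = len(vocab)
    return vocab


toy_reviews = ["A great movie, great cast.", "A dull movie."]
toy_vocab = build_vocab(toy_reviews, min_freq=2)
print(toy_vocab)
```

Only `a`, `great`, and `movie` occur twice in the toy data, so `cast` and `dull` are dropped. On the full dataset, a threshold like this would shrink the embedding table, which dominates the model's parameter count.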

In [None]:
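def tokenize(text):
  # Replace punctuation with spaces, collapse repeated spaces,
  # then lowercase and split on whitespace.
  text = re.sub(r"[-()\"#/@;:<>{}=~|.?,]", " ", text)
  text = re.sub(' +', ' ', text)
  return text.lower().split()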


tokens = list()
for _, line in train_iterator:
  tokens += tokenize(line)

In [None]:
vocab = {'<PAD>':0, '<UNK>':1}
for i, word in enumerate(set(tokens), start=len(vocab)):
  vocab[word] = i

In [None]:
def word_to_index(text):
  # Words missing from the vocabulary map to the <UNK> index.
  return [vocab.get(word, vocab['<UNK>']) for word in tokenize(text)]
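
To make the encoding step concrete, here is a small sketch with a hand-written vocabulary. The `encode` function and `toy_vocab` are illustrative stand-ins for `word_to_index` and the real vocabulary, not part of the notebook.

```python
import re


def tokenize(text):
    # Same cleaning rule as the tokenize function above.
    text = re.sub(r"[-()\"#/@;:<>{}=~|.?,]", " ", text)
    return text.lower().split()


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {'<PAD>': 0, '<UNK>': 1, 'the': 2, 'movie': 3, 'was': 4, 'great': 5}


def encode(text, vocab):
    # Unknown words fall back to the <UNK> index.
    return [vocab.get(word, vocab['<UNK>']) for word in tokenize(text)]


encoded = encode("The movie was AMAZING.", toy_vocab)
print(encoded)  # [2, 3, 4, 1]
```

The word `amazing` is not in the toy vocabulary, so it is encoded as `1` (`<UNK>`).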

## Defining Model

This is a basic sentiment analysis model built on a simple RNN. The shape of each tensor is given in square brackets in the comment next to it.

In [None]:
class SentimentAnalysis(nn.Module):
  def __init__(self, input_dim, embed_dim, hidden_dim, output_dim):
    super().__init__()
    self.embedding = nn.Embedding(input_dim, embed_dim)
    self.rnn = nn.RNN(embed_dim, hidden_dim)
    self.fc = nn.Linear(hidden_dim, output_dim)
  
  def forward(self, data):  # [seq_len, batch_size]
    embedded = self.embedding(data)  # [seq_len, batch_size, embed_dim]
    output, hidden = self.rnn(embedded)
    # output = [seq_len, batch_size, hidden_dim]
    # hidden = [1, batch_size, hidden_dim]
    logits = self.fc(hidden.squeeze(0))  # [batch_size, output_dim]
    return logits
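
The shapes annotated in the comments can be checked with random input. The sketch below rebuilds the three layers with small, arbitrary sizes chosen only for illustration and passes a dummy batch through them.

```python
import torch
import torch.nn as nn

# Small, arbitrary sizes chosen for illustration only.
seq_len, batch_size = 7, 4
vocab_size, embed_dim, hidden_dim, output_dim = 50, 8, 16, 1

embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim)
fc = nn.Linear(hidden_dim, output_dim)

data = torch.randint(0, vocab_size, (seq_len, batch_size))  # [seq_len, batch_size]
embedded = embedding(data)                # [seq_len, batch_size, embed_dim]
output, hidden = rnn(embedded)            # [seq_len, batch_size, hidden_dim], [1, batch_size, hidden_dim]
logits = fc(hidden.squeeze(0))            # [batch_size, output_dim]

print(embedded.shape, output.shape, hidden.shape, logits.shape)
# For a single-layer, unidirectional RNN the final hidden state
# equals the output at the last time step.
print(torch.allclose(output[-1], hidden.squeeze(0)))
```

Because padding is added at the end of shorter sequences, the final hidden state of a padded review is computed after reading `<PAD>` tokens. That is one reason this basic model may perform poorly.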

`binary_accuracy` takes two arguments: the logits predicted by the model and the true labels. It supports binary classification only.

In [None]:
def binary_accuracy(preds, y):
  rounded_preds = torch.round(torch.sigmoid(preds))  # round predictions to closest integer
  correct = (rounded_preds == y).float()  #convert into float for division 
  acc = correct.sum() / len(correct)
  return acc
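
A short worked example may help here. The logits and labels below are made up for illustration: a logit above 0 has a sigmoid above 0.5, so it rounds to class 1.

```python
import torch


def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))  # logits -> 0.0 or 1.0
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)


# Made-up logits and labels for illustration.
logits = torch.tensor([2.0, -1.0, 0.5, -3.0])   # rounds to [1, 0, 1, 0]
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])
acc = binary_accuracy(logits, labels)
print(acc.item())  # 0.75
```

Three of the four predictions match, giving an accuracy of 0.75. Note that the comparison only makes sense when the labels are encoded as 0 and 1.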

## Training Model

Finally, we train the sentiment analysis model. The `count_parameters` function calculates the number of trainable parameters. The `train` function runs one training epoch, and the `evaluate` function computes the loss and accuracy of the trained model on a dataset. `epoch_time` converts the elapsed time of an epoch into minutes and seconds.

In [None]:
INPUT_DIM = len(vocab)
EMBED_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = 1
BATCH_SIZE = 128

model = SentimentAnalysis(INPUT_DIM, EMBED_DIM, HIDDEN_DIM, OUTPUT_DIM)

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 9,596,969 trainable parameters
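
The reported count can be reproduced by hand. The sketch below adds up the parameters of each layer; `expected_parameters` is a hypothetical helper, and the vocabulary size of 95,674 is not stated in the notebook but inferred by working backwards from the total above.

```python
import torch.nn as nn


def expected_parameters(vocab_size, embed_dim, hidden_dim, output_dim):
    # Hypothetical helper that mirrors the layers of the model.
    embedding = vocab_size * embed_dim
    # nn.RNN: input-to-hidden weights, hidden-to-hidden weights, two bias vectors
    rnn = embed_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim
    fc = hidden_dim * output_dim + output_dim
    return embedding + rnn + fc


VOCAB_SIZE = 95_674  # assumption: inferred from the reported total, not printed in the notebook
total = expected_parameters(VOCAB_SIZE, 100, 128, 1)
print(f'{total:,}')  # 9,596,969

# Cross-check the formula against PyTorch on a small configuration.
layers = nn.ModuleList([nn.Embedding(50, 8), nn.RNN(8, 16), nn.Linear(16, 1)])
counted = sum(p.numel() for p in layers.parameters())
print(counted == expected_parameters(50, 8, 16, 1))  # True
```

Under this assumption the embedding table accounts for 9,567,400 of the parameters, while the RNN and linear layers together contribute only 29,569.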


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

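def collate_fn(batch):
  labels, text = [], []
  for label, line in batch:
    # Assumption: this torchtext version yields labels 1 (negative) and 2 (positive).
    # BCEWithLogitsLoss expects targets in [0, 1], so shift them to 0/1;
    # with raw 1/2 targets the loss turns negative and accuracy stays near 0.5.
    labels.append(label - 1)
    text.append(torch.tensor(word_to_index(line)))
  text = pad_sequence(text, padding_value=vocab['<PAD>'])  # [seq_len, batch_size]
  labels = torch.tensor(labels).to(device)
  text = text.to(device)  # pad_sequence already returns a tensor
  return labels, text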
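
To show what `pad_sequence` does inside `collate_fn`, the sketch below pads two made-up index sequences of different lengths. With the default `batch_first=False`, the result has shape `[seq_len, batch_size]`, which matches the input layout `nn.RNN` expects by default.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two made-up encoded reviews of different lengths.
sequences = [torch.tensor([5, 6, 7]), torch.tensor([8])]
padded = pad_sequence(sequences, padding_value=0)  # 0 plays the role of <PAD>

print(padded.shape)  # torch.Size([3, 2])
print(padded)
```

Each column is one review. The shorter review is filled with the padding index up to the length of the longest review in the batch.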

In [None]:
train_dataloader = DataLoader(train_iterator, batch_size=BATCH_SIZE, collate_fn=collate_fn, shuffle=True)
test_dataloader = DataLoader(test_iterator, batch_size=BATCH_SIZE, collate_fn=collate_fn, shuffle=True)

In [None]:
optimizer = Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

model.to(device)
criterion.to(device)

BCEWithLogitsLoss()

In [None]:
def train(model, dataloader, optimizer, criterion):
  epoch_loss = 0
  epoch_accuracy = 0
  batch_idx = 0
  model.train()
  for labels, text in dataloader:
    optimizer.zero_grad()
    predictions = model(text).squeeze(1)
    loss = criterion(predictions, labels.float())
    accuracy = binary_accuracy(predictions, labels.float())
    loss.backward()
    optimizer.step()
    # Accumulate Python floats so the computation graph of each batch can be freed.
    epoch_loss += loss.item()
    epoch_accuracy += accuracy.item()
    batch_idx += 1
  return epoch_loss/batch_idx, epoch_accuracy/batch_idx

In [None]:
def evaluate(model, dataloader, criterion):
  epoch_loss = 0
  epoch_accuracy = 0
  batch_idx = 0
  model.eval()
  with torch.no_grad():
    for labels, text in dataloader:
      predictions = model(text).squeeze(1)
      loss = criterion(predictions, labels.float())
      accuracy = binary_accuracy(predictions, labels.float())
      epoch_loss += loss.item()
      epoch_accuracy += accuracy.item()
      batch_idx += 1
  return epoch_loss/batch_idx, epoch_accuracy/batch_idx

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
EPOCHS = 10
CLIP = 1

os.makedirs('./../models', exist_ok=True)
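
`CLIP` is defined above but is not used by the `train` function. One plausible use, shown here only as an assumption about the intent, is gradient-norm clipping with `torch.nn.utils.clip_grad_norm_`, called between `loss.backward()` and `optimizer.step()`. The tiny linear layer and inflated loss below are made up to produce large gradients.

```python
import torch
import torch.nn as nn

CLIP = 1

# Made-up layer and loss, scaled up to produce large gradients.
layer = nn.Linear(4, 1)
loss = (layer(torch.ones(2, 4)) * 100).sum()
loss.backward()

# Rescales all gradients in place so their combined norm is at most CLIP;
# returns the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(layer.parameters(), CLIP)
norm_after = torch.sqrt(sum(p.grad.norm() ** 2 for p in layer.parameters()))

print(norm_before.item(), norm_after.item())
```

Simple RNNs are prone to exploding gradients on long sequences, so clipping of this kind is a common safeguard.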

In [None]:
best_train_loss = float('inf')
for epoch in range(EPOCHS):
  start_time = time.time()
  train_loss, train_acc = train(model, train_dataloader, optimizer, criterion)
  end_time = time.time()
  epoch_mins, epoch_secs = epoch_time(start_time, end_time)
  if train_loss < best_train_loss:
    best_train_loss = train_loss
    torch.save(model.state_dict(), './../models/sentimet-basic.pt')
  print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}:{epoch_secs} | Train Accuracy: {train_acc:.3f} | Train Loss: {train_loss:.3f}')


Epoch: 01 | Time: 0:11 | Train Accuracy: 0.493 | Train Loss: -11.910
Epoch: 02 | Time: 0:7 | Train Accuracy: 0.499 | Train Loss: -24.442
Epoch: 03 | Time: 0:7 | Train Accuracy: 0.499 | Train Loss: -36.168
Epoch: 04 | Time: 0:7 | Train Accuracy: 0.499 | Train Loss: -47.604
Epoch: 05 | Time: 0:6 | Train Accuracy: 0.500 | Train Loss: -59.028
Epoch: 06 | Time: 0:8 | Train Accuracy: 0.499 | Train Loss: -70.456
Epoch: 07 | Time: 0:6 | Train Accuracy: 0.499 | Train Loss: -81.699
Epoch: 08 | Time: 0:8 | Train Accuracy: 0.499 | Train Loss: -92.968
Epoch: 09 | Time: 0:9 | Train Accuracy: 0.499 | Train Loss: -104.288
Epoch: 10 | Time: 0:7 | Train Accuracy: 0.499 | Train Loss: -115.571
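
A loss that keeps falling below zero while accuracy stays near 0.5, as in the log above, is what `BCEWithLogitsLoss` produces when targets lie outside [0, 1], for example labels encoded as 1 and 2 rather than 0 and 1. Binary cross-entropy is non-negative for valid targets. The sketch below uses made-up logits to show the effect; that this is the cause here is a likely explanation, not something the notebook confirms.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
logits = torch.tensor([10.0, 10.0])  # made-up, confidently positive logits

# Targets encoded as 1/2: the loss can be pushed arbitrarily far below zero.
loss_raw = criterion(logits, torch.tensor([1.0, 2.0]))
# The same labels shifted to 0/1: the loss is non-negative.
loss_shifted = criterion(logits, torch.tensor([0.0, 1.0]))

print(loss_raw.item(), loss_shifted.item())
```

With targets of 1 and 2, the model can lower the loss indefinitely by making every logit larger. All predictions then round to 1, which matches only half of a balanced dataset.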


In [None]:
test_loss, test_acc = evaluate(model, test_dataloader, criterion)

print(f'Test Accuracy: {test_acc:.3f} | Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f}')


Test Accuracy: 0.500 | Test Loss: -121.015 | Test PPL:   0.000


## References

- [IMDB Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/)