# Sentiment Analysis using Recurrent Neural Networks

Cloud and Machine Learning Project

Ashwin Prakash Nalwade (apn308), Mingxi Chen (mc7805)




## Data Preparation

We set the random seed, define the `Field`s, and create the training, validation, and test splits. We use *packed padded sequences*, which let the RNN process only the non-padded part of each sentence; for the padded positions, the `output` is a zero tensor. To use packed padded sequences, the RNN needs to know the actual length of each sentence. We get this by setting `include_lengths = True` on the text field, which makes `batch.text` a tuple of the padded tensor and the sentence lengths.



In [1]:
import torch
from torchtext import data
from torchtext import datasets

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)
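
To make the idea of packed padded sequences more concrete, here is a minimal plain-Python sketch of what packing does conceptually. It is not the PyTorch implementation: the `pack` helper, the `PAD` value, and the toy batch are all made up for illustration, and the toy batch is laid out one sentence per row rather than `[sent len, batch size]`.

```python
# Plain-Python illustration of "packing" a padded batch.
# PAD, pack() and the toy batch are hypothetical, for illustration only.
PAD = 0

def pack(padded_batch, lengths):
    """Return (flat_tokens, batch_sizes) for sequences sorted by decreasing length.

    flat_tokens lists the real tokens time step by time step;
    batch_sizes[t] is how many sequences are still active at step t.
    """
    assert lengths == sorted(lengths, reverse=True), "sort by length first"
    flat_tokens, batch_sizes = [], []
    for t in range(max(lengths)):
        active = [seq[t] for seq, n in zip(padded_batch, lengths) if t < n]
        flat_tokens.extend(active)
        batch_sizes.append(len(active))
    return flat_tokens, batch_sizes

# Three sentences of lengths 4, 2 and 1, padded to length 4.
batch = [
    [11, 12, 13, 14],
    [21, 22, PAD, PAD],
    [31, PAD, PAD, PAD],
]
lengths = [4, 2, 1]

flat_tokens, batch_sizes = pack(batch, lengths)
print(flat_tokens)   # [11, 21, 31, 12, 22, 13, 14]
print(batch_sizes)   # [3, 2, 1, 1]
```

No padding token reaches the recurrent computation, which is why the sentence lengths have to be known up front.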

Load the IMDB dataset now.


In [2]:
from torchtext import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:08<00:00, 9.85MB/s]


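Split the training data to create a validation set. By default, `split` keeps 70% of the examples for training and 30% for validation.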

In [3]:
import random

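# random.seed(SEED) seeds Python's global RNG and returns None, so split()
# receives random_state=None and falls back to that freshly seeded global state.
train_data, valid_data = train_data.split(random_state = random.seed(SEED))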

Next, we use pre-trained word embeddings, specifically the `glove.6B.100d` vectors. GloVe is the algorithm, developed at Stanford, used to compute the vectors. Here, `100d` means the vectors have 100 dimensions, and `6B` means they were trained on six billion tokens.


**Note**: The GloVe download is large (~862MB), so make sure you have a stable internet connection.


In [4]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)

.vector_cache/glove.6B.zip: 862MB [06:31, 2.20MB/s]                          
 99%|█████████▉| 397660/400000 [00:16<00:00, 26039.11it/s]
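
As a rough sketch of what building a size-limited vocabulary involves, the plain-Python snippet below keeps the most frequent tokens and prepends two special tokens. The `build_vocab` helper and the toy corpus are hypothetical and only mimic the behaviour; they are not torchtext's implementation.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size, specials=("<unk>", "<pad>")):
    """Keep the max_size most frequent tokens, with special tokens placed first."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

# Hypothetical toy corpus: film x4, this x3, is x2, great x1, terrible x1.
toy_corpus = [
    ["this", "film", "is", "great"],
    ["this", "film", "is", "terrible"],
    ["this", "film"],
    ["film"],
]

itos, stoi = build_vocab(toy_corpus, max_size=3)
print(itos)        # ['<unk>', '<pad>', 'film', 'this', 'is']
print(len(itos))   # 5, i.e. max_size + 2 special tokens

# Tokens outside the vocabulary fall back to the <unk> index.
unk = stoi["<unk>"]
indices = [stoi.get(tok, unk) for tok in ["this", "film", "is", "great"]]
print(indices)     # [3, 2, 4, 0]
```

The same arithmetic explains why a `MAX_VOCAB_SIZE` of 25,000 yields a vocabulary of 25,002 entries once `<unk>` and `<pad>` are counted.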

In [5]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)

## Building the Model

### RNN Architecture

We use a type of RNN called a Long Short-Term Memory (LSTM). Standard RNNs suffer from the vanishing gradient problem; LSTMs mitigate it by adding an auxiliary recurrent state called a _cell_, $c$, which can be thought of as the "memory" of the LSTM, together with multiple _gates_ that control the flow of information into and out of that memory. We can therefore view the LSTM as a function of the current input $x_t$, the previous hidden state $h_{t-1}$, and the previous cell state $c_{t-1}$.

Thus,

$$(h_t, c_t) = \text{LSTM}(x_t, h_{t-1}, c_{t-1})$$

The model looks similar to this:

![](https://github.com/ashwinpn/sentiment_analysis/blob/main/sentiment2.png?raw=1)
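
To illustrate how the cell and the gates interact, here is a minimal sketch of a single LSTM step with scalar input and scalar state, written in plain Python. The weights are arbitrary made-up values and `lstm_step` is a hypothetical helper; `nn.LSTM` works on vectors and learns its weights, but the update equations have the same shape.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w):
    """One LSTM step for a scalar input and scalar state.

    w maps each gate name to (input weight, hidden weight, bias).
    """
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x_t + w_h * h_prev + b

    i = sigmoid(pre("input"))     # how much new information to write
    f = sigmoid(pre("forget"))    # how much of the old cell state to keep
    g = math.tanh(pre("cand"))    # candidate cell content
    o = sigmoid(pre("output"))    # how much of the cell to expose
    c_t = f * c_prev + i * g      # updated memory
    h_t = o * math.tanh(c_t)      # updated hidden state
    return h_t, c_t

# Arbitrary illustrative weights, identical for every gate.
weights = {name: (0.5, 0.5, 0.0) for name in ("input", "forget", "cand", "output")}

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, weights)
    print(f"x={x:+.1f}  h={h:+.4f}  c={c:+.4f}")
```

When the forget gate is close to 1 and the input gate close to 0, the cell state is carried forward almost unchanged, which is what lets information (and gradients) survive over many time steps.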

### Bidirectional RNN

In a bidirectional RNN, one RNN processes the input text from left to right (the forward RNN) and a second RNN processes it from right to left (the backward RNN). At time step $t$, the forward RNN is processing $x_t$, while the backward RNN is processing $x_{T-t+1}$, where $T$ is the sentence length. The final hidden states of the two RNNs are concatenated to form the sentence representation.

In the following figure, the forward net is in orange, and the backward net is in green:   

![](https://github.com/ashwinpn/sentiment_analysis/blob/main/sentiment3.png?raw=1)
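
The indexing can be checked with a small plain-Python sketch. It uses a toy scalar RNN (`run_rnn`, a hypothetical helper with made-up weights) rather than an LSTM, since only the direction of traversal matters here.

```python
import math

def run_rnn(inputs, w_x, w_h):
    """Toy scalar RNN: returns the hidden state after each step."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

x = [0.1, 0.4, -0.3, 0.9]   # x_1 ... x_T with T = 4
T = len(x)

# Each direction has its own (here arbitrary) weights.
forward_states = run_rnn(x, w_x=0.8, w_h=0.5)          # reads x_1, x_2, ..., x_T
backward_states = run_rnn(x[::-1], w_x=0.6, w_h=0.4)   # reads x_T, ..., x_1

# At step t (1-based), the backward net reads x_{T-t+1}, i.e. x[T - t] with 0-based indexing.
for t in range(1, T + 1):
    assert x[::-1][t - 1] == x[T - t]

# The sentence representation concatenates the two final hidden states.
final_hidden = [forward_states[-1], backward_states[-1]]
print(final_hidden)
```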

### Multi-layer RNN

In a multi-layer (stacked) RNN, additional RNNs are placed on top of the first one, and each RNN acts as a layer. The output of each RNN is the input to the RNN stacked above it.

![](https://github.com/ashwinpn/sentiment_analysis/blob/main/sentiment4.png?raw=1)
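
Stacking can be sketched in a few lines of plain Python. Again this uses a toy scalar RNN with made-up weights (`rnn_layer` and `stacked_rnn` are hypothetical helpers), not the LSTM used in the model.

```python
import math

def rnn_layer(inputs, w_x, w_h):
    """Toy scalar RNN layer: returns one output per time step."""
    h, outputs = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        outputs.append(h)
    return outputs

def stacked_rnn(inputs, layer_weights):
    """Feed the full output sequence of each layer into the layer above it."""
    sequence = inputs
    final_states = []
    for w_x, w_h in layer_weights:
        sequence = rnn_layer(sequence, w_x, w_h)
        final_states.append(sequence[-1])   # final hidden state of this layer
    return sequence, final_states

x = [0.1, 0.4, -0.3, 0.9]
top_outputs, final_states = stacked_rnn(x, [(0.8, 0.5), (1.2, -0.3)])

print(top_outputs)    # one output per time step, from the top layer
print(final_states)   # one final hidden state per layer
```

The top layer's final hidden state is what a classifier would typically consume, which is why the model indexes the last entries of `hidden`.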

### Regularization

A model with a large number of parameters is more likely to overfit, that is, to fit the
training data too closely, giving high training accuracy and low training error but lower
validation and test accuracy and higher validation and test error. Regularization helps
prevent this. Common methods include L1 (lasso) and L2 (ridge) penalties and their
combination (elastic net), but here we use dropout. Dropout randomly sets the outputs of
neurons in a layer to zero during each training forward pass; it is disabled in evaluation mode.
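
As a rough illustration of dropout, here is a plain-Python sketch of the "inverted dropout" variant: each value is zeroed with probability `p`, and the survivors are rescaled by `1 / (1 - p)` so that the expected activation is unchanged, which is also what `nn.Dropout` does during training. The `dropout` function below is a hypothetical stand-in, not PyTorch's implementation.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)   # evaluation mode: pass everything through
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(1234)
activations = [1.0] * 10_000

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

dropped = sum(1 for v in train_out if v == 0.0) / len(train_out)
mean_train = sum(train_out) / len(train_out)
print(f"fraction dropped: {dropped:.3f}, mean during training: {mean_train:.3f}")
```

Roughly half of the values are dropped, yet the mean stays close to 1.0, so downstream layers see activations on the same scale in training and evaluation.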





In [6]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout)
        
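        # hidden_dim * 2: the forward and backward final states are concatenated
        # (this assumes bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)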
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, text_lengths):
        
        #text = [sent len, batch size]
        
        embedded = self.dropout(self.embedding(text))
        
        #embedded = [sent len, batch size, emb dim]
        
        #pack sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu())
        
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        
        #unpack sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

        #output = [sent len, batch size, hid dim * num directions]
        #output over padding tokens are zero tensors
        
        #hidden = [num layers * num directions, batch size, hid dim]
        #cell = [num layers * num directions, batch size, hid dim]
        
        #concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers
        #and apply dropout
        
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
                
        #hidden = [batch size, hid dim * num directions]
            
        return self.fc(hidden)

In [7]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)

In [8]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 4,810,857 trainable parameters
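
The parameter count can be reproduced by hand. The sketch below assumes the vocabulary has 25,002 entries (25,000 words plus `<unk>` and `<pad>`) and that each LSTM direction stores, for its four gates, input-to-hidden weights, hidden-to-hidden weights, and two bias vectors, as `nn.LSTM` does. The helper name `lstm_direction_params` is made up for this example.

```python
def lstm_direction_params(input_dim, hidden_dim):
    # 4 gates, each with input weights, hidden weights, and two bias vectors
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

vocab_size, emb_dim, hid_dim, out_dim = 25_002, 100, 256, 1

embedding = vocab_size * emb_dim                           # 2,500,200
layer1 = 2 * lstm_direction_params(emb_dim, hid_dim)       # 733,184 (two directions)
layer2 = 2 * lstm_direction_params(2 * hid_dim, hid_dim)   # 1,576,960 (input is 2 * hid_dim)
linear = 2 * hid_dim * out_dim + out_dim                   # 513

total = embedding + layer1 + layer2 + linear
print(f"{total:,}")   # 4,810,857
```

More than half of the parameters sit in the embedding layer, which is initialized from the pre-trained vectors rather than learned from scratch.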


In [9]:
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([25002, 100])


In [10]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 1.0406,  0.2109, -0.6788,  ...,  0.9281,  0.1573,  0.9135],
        [-0.0259,  0.2626, -0.4150,  ...,  0.2496,  1.0473, -0.8566],
        [-0.7507,  0.0280,  0.4090,  ..., -0.0273,  0.3827,  0.3968]])

In [11]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 1.0406,  0.2109, -0.6788,  ...,  0.9281,  0.1573,  0.9135],
        [-0.0259,  0.2626, -0.4150,  ...,  0.2496,  1.0473, -0.8566],
        [-0.7507,  0.0280,  0.4090,  ..., -0.0273,  0.3827,  0.3968]])


## Training the Model


We use the `Adam` optimizer. `Adam` adapts the learning rate for each parameter: parameters that are updated frequently get lower learning rates, and parameters that are updated infrequently get higher learning rates.


In [12]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())
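
To make the per-parameter step size concrete, here is a plain-Python sketch of the Adam update for a single scalar parameter. The default values mirror those of `optim.Adam` (`lr=1e-3`, `betas=(0.9, 0.999)`, `eps=1e-8`); the `adam_step` helper and the toy quadratic objective are assumptions made for illustration.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t is the 1-based step count)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# The first step has size ~lr regardless of the scale of the gradient.
small, _, _ = adam_step(0.0, 0.01, 0.0, 0.0, t=1, lr=0.1)
large, _, _ = adam_step(0.0, 100.0, 0.0, 0.0, t=1, lr=0.1)
print(small, large)   # both close to -0.1

# Toy problem (an assumption for this example): minimise f(w) = (w - 3)^2.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.1)
print(round(w, 3))
```

Dividing by the square root of the running squared gradient is what normalizes the step: a parameter with consistently large gradients does not take proportionally larger steps.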

In [13]:
criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)
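
`nn.BCEWithLogitsLoss` applies a sigmoid and binary cross-entropy in one step, so the model can output raw logits. The plain-Python sketch below shows a numerically stable way to write that combination and checks it against the naive two-step version; the function names and the toy logits are made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit (target is 0.0 or 1.0)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def naive_bce(logit, target):
    """Sigmoid followed by cross-entropy; fine for moderate logits only."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 0.0]

losses = [bce_with_logits(z, y) for z, y in zip(logits, targets)]
mean_loss = sum(losses) / len(losses)   # the loss is averaged over the batch by default
print(f"{mean_loss:.4f}")

# The stable form stays finite where the naive one would hit log(0).
print(bce_with_logits(1000.0, 0.0))   # 1000.0
```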

Accuracy function

In [14]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

Training function

In [15]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        text, text_lengths = batch.text
        
        predictions = model(text, text_lengths).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [16]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text, text_lengths = batch.text
            
            predictions = model(text, text_lengths).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [17]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Training. Adjust the number of epochs as required.

In [18]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 30s
	Train Loss: 0.674 | Train Acc: 57.81%
	 Val. Loss: 0.667 |  Val. Acc: 55.51%
Epoch: 02 | Epoch Time: 0m 30s
	Train Loss: 0.694 | Train Acc: 53.91%
	 Val. Loss: 0.679 |  Val. Acc: 53.99%
Epoch: 03 | Epoch Time: 0m 30s
	Train Loss: 0.653 | Train Acc: 61.87%
	 Val. Loss: 0.583 |  Val. Acc: 69.79%
Epoch: 04 | Epoch Time: 0m 30s
	Train Loss: 0.583 | Train Acc: 69.60%
	 Val. Loss: 0.470 |  Val. Acc: 77.83%
Epoch: 05 | Epoch Time: 0m 30s
	Train Loss: 0.449 | Train Acc: 80.08%
	 Val. Loss: 0.348 |  Val. Acc: 86.17%
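
The checkpointing rule in the loop can be traced with the validation losses from the log above. This small plain-Python sketch replays the `valid_loss < best_valid_loss` test to show at which epochs the model would have been saved.

```python
# Validation losses copied from the training log above.
valid_losses = [0.667, 0.679, 0.583, 0.470, 0.348]

best_valid_loss = float("inf")
saved_epochs = []

for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        saved_epochs.append(epoch)   # torch.save(...) would run here

print(saved_epochs)      # [1, 3, 4, 5]
print(best_valid_loss)   # 0.348
```

Epoch 2 is skipped because its validation loss (0.679) is worse than epoch 1's, so the file on disk always holds the parameters with the lowest validation loss seen so far.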


Test accuracy

In [19]:
model.load_state_dict(torch.load('tut2-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.363 | Test Acc: 85.05%


## Predictions

We can now use the model to predict the sentiment of arbitrary input sentences. It works best on movie reviews, since it was trained on the IMDb dataset. The model must be in evaluation mode for inference.


Negative reviews will give a value close to zero, while positive reviews will give a value close to one.


In [20]:
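import spacy
nlp = spacy.load('en')  # on spaCy 3+, load a named model such as 'en_core_web_sm'

def predict_sentiment(model, sentence):
    model.eval()
    # tokenize and map each token to its vocabulary index (unknown words map to <unk>)
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    # shape [sent len] -> [sent len, 1]: a batch containing a single sentence
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor(length)
    # gradients are not needed at inference time
    with torch.no_grad():
        prediction = torch.sigmoid(model(tensor, length_tensor))
    return prediction.item()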

Example of a negative review

In [21]:
predict_sentiment(model, "This film is terrible")

0.02058650366961956

Example of a positive review

In [22]:
predict_sentiment(model, "This film is great")

0.9379201531410217
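
To turn these scores into labels, one option is to threshold at 0.5, which matches the rounding used in `binary_accuracy`. The `label_sentiment` helper below is hypothetical and is shown only as a sketch; the scores are the two values printed above.

```python
def label_sentiment(score, threshold=0.5):
    """Map a sigmoid score in [0, 1] to a label (hypothetical helper)."""
    return "positive" if score > threshold else "negative"

# Scores copied from the two predictions above.
scores = {
    "This film is terrible": 0.02058650366961956,
    "This film is great": 0.9379201531410217,
}

labels = {sentence: label_sentiment(score) for sentence, score in scores.items()}
for sentence, label in labels.items():
    print(f"{sentence!r}: {label}")
```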