---
> SENTIMENT ANALYSIS OF MOVIE REVIEWS USING LSTMS
---

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


---
> This notebook is based on the code at 
https://github.com/bentrevett/pytorch-sentiment-analysis.
> The necessary changes were made and are noted as comments in the code blocks.
---

--- 
> DATA PREPARATION

---

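>* A Field defines how the data should be processed.
* The TEXT field splits each string into discrete tokens using the 'spacy' tokenizer. With include_lengths = True, each batch also carries the length of every sequence, which the model later uses to pack the padded sequences.
* LABEL is defined by a LabelField, a special subset of the Field class used specifically for handling labels.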

---


In [None]:
import torch
from torchtext import data
from torchtext import datasets

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)

> Using torchtext, we can automatically download the IMDb dataset and split it into the canonical train/test splits using torchtext.datasets objects.

In [None]:
from torchtext import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

aclImdb_v1.tar.gz:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:03<00:00, 22.1MB/s]



---
> The random number generator is seeded in the random_state argument to ensure the same train/validation split each time.
---

In [None]:
import random

train_data, valid_data = train_data.split(random_state = random.seed(SEED))
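
> As a rough illustration of why the seed matters, the sketch below mimics a reproducible split in plain Python. The helper `split_examples` and the toy data are hypothetical and not part of torchtext; the 0.7 ratio mirrors the default `split_ratio` of the legacy `Dataset.split`.

```python
import random

SEED = 1234

def split_examples(examples, split_ratio=0.7, seed=SEED):
    """Shuffle a copy of `examples` with a seeded RNG, then cut it in two."""
    rng = random.Random(seed)          # private RNG, global state is untouched
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * split_ratio)
    return shuffled[:n_train], shuffled[n_train:]

reviews = [f"review_{i}" for i in range(10)]
train_a, valid_a = split_examples(reviews)
train_b, valid_b = split_examples(reviews)

print(len(train_a), len(valid_a))   # sizes of the two parts
print(train_a == train_b)           # same seed gives the same split
```

> Note that `random.seed(SEED)` itself returns `None`; it seeds Python's global generator as a side effect, so the reproducibility of the split in the notebook presumably comes from that global state rather than from the value passed to random_state.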

---
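> The following builds the vocabulary, keeping only the max_size most common tokens, and loads the pretrained 'glove.6B.100d' vectors for them. Tokens in the vocabulary that have no pretrained vector are initialized from a normal distribution via unk_init.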
---

In [None]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)

.vector_cache/glove.6B.zip: 862MB [06:28, 2.22MB/s]                           
100%|█████████▉| 398113/400000 [00:16<00:00, 24449.10it/s]
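
> The vocabulary step can be sketched in plain Python as follows. This is a simplified stand-in, not torchtext's implementation: `build_vocab` and the toy corpus are hypothetical, and tie-breaking between equally frequent tokens may differ from the library. It shows why the vocabulary ends up with max_size + 2 entries (the two extra ones being the `<unk>` and `<pad>` tokens) and how out-of-vocabulary words fall back to `<unk>`.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size, specials=("<unk>", "<pad>")):
    """Return (itos, stoi), keeping only the `max_size` most frequent tokens."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

corpus = [
    ["the", "film", "was", "great"],
    ["the", "film", "was", "boring"],
    ["the", "film", "felt", "long"],
]
itos, stoi = build_vocab(corpus, max_size=3)

# Words outside the vocabulary are mapped to the index of <unk>.
unk_idx = stoi["<unk>"]
encoded = [stoi.get(tok, unk_idx) for tok in ["the", "plot", "was", "great"]]

print(itos)
print(encoded)
```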



---


> Preparing iterators of data

---


>*  BucketIterator is a type of iterator that returns batches in which the examples are of similar length, minimizing the amount of padding per example.
* torch.cuda.is_available() checks whether a GPU is available, so that tensors can be placed on it.

---

In [None]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)
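
> To make the benefit of bucketing concrete, the sketch below counts how many pad tokens are needed with and without grouping examples by length. It is only an approximation of the idea: the helpers `pad_batch` and `count_padding` are hypothetical, and the real BucketIterator also shuffles the data, which is left out here.

```python
def pad_batch(batch, pad_token="<pad>"):
    """Sort a batch longest-first and pad every example to the longest one."""
    batch = sorted(batch, key=len, reverse=True)
    max_len = len(batch[0])
    return [ex + [pad_token] * (max_len - len(ex)) for ex in batch]

def count_padding(examples, batch_size, bucket_by_length):
    """Total number of pad tokens needed to batch `examples`."""
    if bucket_by_length:
        examples = sorted(examples, key=len)   # similar lengths become adjacent
    total = 0
    for start in range(0, len(examples), batch_size):
        padded = pad_batch(examples[start:start + batch_size])
        total += sum(row.count("<pad>") for row in padded)
    return total

# Toy "reviews" with 1, 9, 2, 8, 3 and 7 tokens.
examples = [["tok"] * n for n in (1, 9, 2, 8, 3, 7)]

naive = count_padding(examples, batch_size=2, bucket_by_length=False)
bucketed = count_padding(examples, batch_size=2, bucket_by_length=True)
print(naive, bucketed)
```

> Sorting each batch longest-first, as `pad_batch` does, corresponds to sort_within_batch = True, which the packed sequences used by the model rely on.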



---
> BUILDING THE MODEL

---
>* Within \_\_init\_\_, an embedding layer, our RNN (here an LSTM), and a linear layer are defined. All layers have their parameters initialized to random values, unless explicitly specified.
* The embedding layer transforms a sparse one-hot vector into a dense embedding vector.
* The LSTM layer takes in the dense vector and the previous hidden state $h_{t-1}$, which it uses to calculate the next hidden state, $h_t$.
* The linear layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension. A sigmoid then maps the output to a value between 0 and 1.

> * The forward method is called when samples are fed into the model.
* The input batch is passed through the embedding layer to get embedded, a dense vector representation of our sentences. embedded is a tensor of size [sentence length, batch size, embedding dim].
* embedded is then packed and fed into the LSTM, which returns the output together with the final hidden and cell states.
* The last hidden state is fed to the linear layer to produce a prediction.

---
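
> The recurrence described above can be written out explicitly. The equations below follow the standard LSTM formulation, as given in the PyTorch `nn.LSTM` documentation. The symbols $x_t$ (embedded token at step $t$), $c_t$ (cell state), the weight and bias names, and $\hat{y}$ (predicted probability) are notation introduced here for illustration.

$$
\begin{aligned}
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t) \\
\hat{y} &= \sigma\big(f(h_T)\big) = \sigma(W_{fc}\, h_T + b_{fc})
\end{aligned}
$$

> Here $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication; the last line corresponds to the linear layer followed by the sigmoid for the single-layer case.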


In [None]:
# The model from the original GitHub code was modified to use separate branches for a single LSTM layer and for more than one layer.
# A sigmoid layer was also added to the original model.

import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)

        self.num_layers = n_layers
        
        self.lstm = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout)
        
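        # forward() passes one hidden state (n_layers == 1) or two concatenated
        # hidden states (n_layers > 1) to the linear layer
        self.fc = nn.Linear(hidden_dim * min(n_layers, 2), output_dim)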
        
        self.dropout = nn.Dropout(dropout)

        self.sigmoid = nn.Sigmoid()

        
    def forward(self, text, text_lengths):

        # pack_padded_sequence expects the lengths on the CPU
        text_lengths = text_lengths.cpu()

        if self.num_layers == 1:

            embedded = self.embedding(text)

            # pack sequence
            packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)

            packed_output, (hidden, cell) = self.lstm(packed_embedded)

            # final hidden state of the single layer
            hidden = hidden[-1,:,:]

        else:

            embedded = self.dropout(self.embedding(text))

            # pack sequence
            packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)

            packed_output, (hidden, cell) = self.lstm(packed_embedded)

            # concatenate the last two entries of hidden and apply dropout
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))

        out = self.sigmoid(self.fc(hidden))

        return out

---
>An instance of the LSTM class is created in the following block.
---


In [None]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 1
BIDIRECTIONAL = False
DROPOUT = 0.0
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = LSTM(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)

---
> The function below counts the number of parameters the model has.
---

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,867,049 trainable parameters
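
> The reported total can be reproduced by hand. The sketch below is a back-of-the-envelope check, assuming the vocabulary size of 25,002 printed further down and the standard LSTM layout of four gates with two bias vectors each; `lstm_layer_params` is a hypothetical helper, not a PyTorch function.

```python
def lstm_layer_params(input_dim, hidden_dim):
    """Parameters of one unidirectional LSTM layer (4 gates, 2 bias vectors)."""
    weights = 4 * hidden_dim * (input_dim + hidden_dim)
    biases = 2 * 4 * hidden_dim
    return weights + biases

VOCAB_SIZE = 25_002      # 25,000 tokens plus <unk> and <pad>
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

embedding = VOCAB_SIZE * EMBEDDING_DIM               # embedding matrix
lstm = lstm_layer_params(EMBEDDING_DIM, HIDDEN_DIM)  # single LSTM layer
linear = HIDDEN_DIM * OUTPUT_DIM + OUTPUT_DIM        # weights plus bias

total = embedding + lstm + linear
print(f"{embedding:,} + {lstm:,} + {linear:,} = {total:,}")
```

> Most of the parameters sit in the embedding layer, which is why loading pretrained vectors into it matters for this model.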


---
> The code blocks below load the pretrained weights from the 'glove.6B.100d' word embeddings into the embedding layer of our model.
---

In [None]:
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([25002, 100])


In [None]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.5302, -0.8394,  0.3944,  ..., -0.6926, -0.1440,  0.2929],
        [-0.2146,  0.6712,  0.3821,  ...,  0.4095,  0.7454,  0.0046],
        [-0.3202, -0.1139,  0.4597,  ...,  0.5334, -0.0947,  0.3415]])

In [None]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.5302, -0.8394,  0.3944,  ..., -0.6926, -0.1440,  0.2929],
        [-0.2146,  0.6712,  0.3821,  ...,  0.4095,  0.7454,  0.0046],
        [-0.3202, -0.1139,  0.4597,  ...,  0.5334, -0.0947,  0.3415]])


---
>The optimizer is the algorithm used to update the parameters of the model. The Adam optimizer is used here.
---

In [None]:
# In the original code, SGD was used, here we have used ADAM optimizer.

import torch.optim as optim

optimizer = optim.Adam(model.parameters())

---
> The loss function used is binary cross-entropy (BCE) loss.
> We move the model and the loss function to the device (GPU if available).
---

In [None]:
# In the original code, BCEWithLogitsLoss (sigmoid combined with BCE) was used. Here we are using plain BCELoss, since the model already applies a sigmoid.

criterion = nn.BCELoss()

model = model.to(device)
criterion = criterion.to(device)
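
> The relation between the two loss variants can be checked numerically. The sketch below is written in plain Python rather than PyTorch, and the function names are illustrative only: it compares applying a sigmoid followed by BCE (the approach taken in this notebook) with the fused, numerically stable form computed directly from the logit.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(prob, target):
    """Binary cross-entropy on a probability in (0, 1)."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

def bce_with_logits(logit, target):
    """The same loss computed directly from the logit, in a stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

for logit, target in [(2.0, 1.0), (2.0, 0.0), (-3.0, 0.0)]:
    two_step = bce(sigmoid(logit), target)
    fused = bce_with_logits(logit, target)
    print(f"logit={logit:+.1f} target={target:.0f} "
          f"sigmoid+BCE={two_step:.6f} fused={fused:.6f}")
```

> Both forms give the same value for moderate logits; the fused form is generally preferred for very large or very small logits, where the sigmoid saturates and the logarithm of the probability loses precision.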

---
> The function below calculates the accuracy per batch by rounding each predicted probability to 0 or 1 and comparing it with the label.
---

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    #round predictions to the closest integer
    rounded_preds = torch.round((preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

---
>MODEL TRAINING
---

> * model.train() puts the model in training mode, which enables dropout.
* optimizer.zero_grad() resets the gradients left over from the previous batch.
* batch.text holds the batch of sentences fed to the model, together with their lengths.
* loss.backward() calculates the gradient of the loss with respect to each parameter.
* optimizer.step() updates the parameters using the gradients and the optimizer algorithm.

---


In [None]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        text, text_lengths = batch.text
        
        predictions = model(text, text_lengths).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

---
>* model.eval() puts the model in "evaluation mode", turning off dropout.
* No parameters are updated during evaluation: optimizer.step() is never called, and torch.no_grad() disables gradient tracking.

---

In [None]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text, text_lengths = batch.text
            
            predictions = model(text, text_lengths).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), './drive/My Drive/AML_assignments/AML_2/Models/baseline_1.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 9s
	Train Loss: 0.170 | Train Acc: 94.03%
	 Val. Loss: 0.369 |  Val. Acc: 86.69%
Epoch: 02 | Epoch Time: 0m 9s
	Train Loss: 0.118 | Train Acc: 96.17%
	 Val. Loss: 0.372 |  Val. Acc: 86.62%
Epoch: 03 | Epoch Time: 0m 9s
	Train Loss: 0.081 | Train Acc: 97.58%
	 Val. Loss: 0.441 |  Val. Acc: 87.15%
Epoch: 04 | Epoch Time: 0m 9s
	Train Loss: 0.048 | Train Acc: 98.78%
	 Val. Loss: 0.531 |  Val. Acc: 87.26%
Epoch: 05 | Epoch Time: 0m 9s
	Train Loss: 0.036 | Train Acc: 99.11%
	 Val. Loss: 0.558 |  Val. Acc: 86.71%
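
> In the log above, the training loss keeps falling while the validation loss rises after the first epoch, which suggests overfitting. The small sketch below replays the checkpoint rule from the training loop on the logged validation losses to show which epoch's weights end up on disk; the variable `best_epoch` is added here for illustration only.

```python
# Validation losses copied from the training log above (epochs 1 to 5).
valid_losses = [0.369, 0.372, 0.441, 0.531, 0.558]

best_valid_loss = float("inf")
best_epoch = None

for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:       # same test as in the training loop
        best_valid_loss = valid_loss
        best_epoch = epoch                 # this is where torch.save would run

print(f"checkpoint comes from epoch {best_epoch} "
      f"(val. loss {best_valid_loss:.3f})")
```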


---
>TESTING THE MODEL
---
>We load the model we saved while training and test on reviews that the model hasn't seen before.
---

In [None]:
model.load_state_dict(torch.load('./drive/My Drive/AML_assignments/AML_2/Models/baseline_1.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.407 | Test Acc: 85.03%


In [None]:
import spacy
nlp = spacy.load('en')

def predict_sentiment(model, sentence):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor(length)
    prediction = (model(tensor, length_tensor))
    return prediction.item()

In [None]:
pred = predict_sentiment(model, "Boring script")
print(pred)
if pred >= 0.5:
  print('positive review')
else : 
  print('negative review')

0.007808869704604149
negative review


In [None]:
pred = predict_sentiment(model, "Great acting")
print(pred)
if pred >= 0.5:
  print('positive review')
else : 
  print('negative review')

0.9678888916969299
positive review


In [None]:
review_new =  "The extreme color palette and overdone scenes of gore and nudity feel like a desparate (and unfortunately unsuccessful) attempt to divert attention from the acting, plot, and dialog."
print("Review : ",review_new)
pred = predict_sentiment(model,review_new)
print(pred)
if pred >= 0.5:
  print('positive review')
else : 
  print('negative review')

In [None]:
review_new = "A truly unique movie unlike any I've seen - the offbeat humor is entertaining and original, the story never predictable, and the carefully blended soundtrack makes for a strangely cinematic experience."
print("Review : ",review_new)
pred = predict_sentiment(model, review_new)
print(pred)
if pred >= 0.5:
  print('positive review')
else : 
  print('negative review')

### **Results :**
We trained LSTM models with several different configurations on the IMDB dataset, using pretrained word embeddings in every case. 
The results sheet shared along with this notebook is discussed here.
The results are explained in the following points.

1.   The first result is for a simple single-layer, unidirectional LSTM with 0% dropout, trained with the Adam optimizer and BCE loss. After 5 epochs, the training accuracy is 92.94%, the validation accuracy is 87.35%, and the test accuracy is 76.17%. 
2.   For the second result, the number of LSTM layers is increased to 2 and all other parameters remain the same. We expected the accuracy to increase as the network becomes deeper. The accuracies after 5 epochs are as follows:
Train Acc: 95.66%, Validation Acc: 87.77%, Test Acc: 84.24%. 
3.   In the third run, we used a dropout of 30% with the same two LSTM layers as in the second run. Training accuracy decreased, but validation and test accuracies increased. Dropout trains a different random sub-network at each step, which has an effect similar to ensembling, so validation and test accuracies can be expected to improve. The accuracies are as follows: Train Acc: 90.62%, Val. Acc: 88.72%, Test Acc: 87.51%.
4.   In the fourth run, we increased the dropout to 60%. Compared with the third run, training and validation accuracies increased, which we attribute to the higher dropout reducing bias in the network, while test accuracy decreased slightly. The accuracies are as follows: Train Acc: 92.70%, Val. Acc: 89.23%, Test Acc: 86.87%.
5.   In the fifth run, we used a bidirectional LSTM, which processes each review both from past to future and from future to past. We expected the test results to improve because both past and future context are used, but the reported accuracies are lower than those of runs 2 to 4. The accuracies are as follows: Train Acc: 85.94%, Val. Acc: 84.55%, Test Acc: 81.31%.
6.   In the sixth run, we used a GRU, which has two gates instead of the three gates of an LSTM. GRUs use fewer trainable parameters and therefore use less memory and train and execute faster than LSTMs, whereas LSTMs tend to be more accurate on datasets with longer sequences. Accordingly, the sheet shows that the GRU has fewer parameters; its training accuracy is the highest of all runs and its validation accuracy is among the highest, although its test accuracy is lower than that of runs 2 to 5. The accuracies are as follows: Train Acc: 97.13%, Val. Acc: 89.16%, Test Acc: 78.54%.
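
> For a quick side-by-side view, the reported accuracies can be tabulated together with the gap between training and test accuracy. The snippet below only rearranges the numbers quoted in the list above; the run labels are shorthand chosen here, and the "gap" column is an added convenience rather than a value from the results sheet.

```python
# (run label, train acc, validation acc, test acc), as reported above.
results = [
    ("1-layer LSTM, no dropout",  92.94, 87.35, 76.17),
    ("2-layer LSTM, no dropout",  95.66, 87.77, 84.24),
    ("2-layer LSTM, dropout 0.3", 90.62, 88.72, 87.51),
    ("2-layer LSTM, dropout 0.6", 92.70, 89.23, 86.87),
    ("bidirectional LSTM",        85.94, 84.55, 81.31),
    ("GRU",                       97.13, 89.16, 78.54),
]

print(f"{'run':<28}{'train':>8}{'val':>8}{'test':>8}{'gap':>8}")
for name, train, val, test in results:
    gap = train - test                 # train/test gap in percentage points
    print(f"{name:<28}{train:>8.2f}{val:>8.2f}{test:>8.2f}{gap:>8.2f}")

best = max(results, key=lambda row: row[3])
print("highest test accuracy:", best[0])
```

> A large gap, as in the first and sixth runs, is a sign that the model fits the training set much better than it generalizes.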










