# Advancing Sentiment Analysis
This chapter applies the attention mechanism to the text classification problem. We will construct one network with an attention mechanism and one without it, then compare the performance of the two networks to see which one performs better.

The recurrent network with attention can be represented diagrammatically as follows:

![](figures/Attention_based_classification.png)

Figure. Attention-based bidirectional RNN structure

As shown in the figure, the network is fed one word at each time step. The recurrent neural network (RNN) runs in both the forward and the reverse direction. Here the RNN can be any unit, such as a vanilla RNN, GRU, or LSTM. The hidden states are denoted by $h_1, h_2, h_3, ..., h_{t-1}, h_t$. The arrow over a hidden state shows its direction: $h^{\rightarrow}_1$ is the hidden state moving forward along the sequence and $h^{\leftarrow}_1$ is the hidden state moving backward with respect to the sequence. The hidden state $h_i$ for any token is the concatenation of its forward and backward states, $h_i = [h^{\leftarrow}_i, h^{\rightarrow}_i]$.

In the RNN without the attention mechanism, the final forward state $h^{\rightarrow}_t$ and the final backward state $h^{\leftarrow}_1$ are concatenated to form the final representation. $h^{\rightarrow}_t$ holds the crux of the entire sequence in the forward direction, while $h^{\leftarrow}_1$ holds the crux of the entire sequence in the backward direction. In this method, the importance of any particular token is not considered, and all tokens have equal weight.

In the RNN with attention, a different weight is assigned to each input token in the sequence. This is done by the attention mechanism; as the name implies, different amounts of attention are given to different parts of the sequence. After the importance (weight) of each token is calculated, the weighted sum of the output features, $h_a$, is computed. $h_a$ can be calculated using PyTorch's batch-wise matrix multiplication (BMM) function, `torch.bmm`. We covered how BMM works in Chapter 4, Using RNN for NLP, and we will see how to use it while implementing this network.
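
As a small worked example of the weighted sum, the sketch below applies `torch.bmm` to hand-picked values. The tensors `outputs` and `weights` are made up for illustration; in the real model the weights come from a softmax over attention scores.

```python
import torch

# One sequence (batch size 1) with 3 tokens and 2 output features per token
outputs = torch.tensor([[[1.0, 0.0],
                         [0.0, 1.0],
                         [1.0, 1.0]]])          # [batch, seq_len, hidden]

# Made-up attention weights, one per token, summing to 1
weights = torch.tensor([[0.5, 0.25, 0.25]])     # [batch, seq_len]

# [batch, hidden, seq_len] x [batch, seq_len, 1] -> [batch, hidden, 1]
h_a = torch.bmm(outputs.transpose(1, 2), weights.unsqueeze(2)).squeeze(2)

print(h_a)  # tensor([[0.7500, 0.5000]])
```

Each feature of `h_a` is the sum of that feature over all tokens, scaled by the token weights: `1*0.5 + 0*0.25 + 1*0.25 = 0.75` and `0*0.5 + 1*0.25 + 1*0.25 = 0.5`.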

# Importing Requirements 

In [None]:
import sys
import os
import json
import zipfile  # needed to extract the GloVe archive
import nltk
import torch
import torch.optim as optim
from torch import nn
from torch.autograd import Variable
from torch.nn import functional as F
from torchtext import data
from torchtext import vocab
from torchtext.vocab import Vectors
import spacy
import chakin
from tqdm import tqdm
import pandas as pd
import numpy as np
from tensorboardX import SummaryWriter
nltk.download('popular')

**Ensuring Reproducibility:** To compare two networks, we must initialize the weights of both networks in a reproducible manner. PyTorch facilitates reproducible results by fixing the random seed and setting the cuDNN backend to deterministic mode. It can be done as follows.

In [None]:
SEED = 1234
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
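
To see the effect of seeding, the following sketch builds the same layer twice with the same seed and once with a different seed. The helper `make_layer` and the layer sizes are hypothetical and exist only for this illustration.

```python
import torch
from torch import nn

def make_layer(seed):
    # Re-seeding right before construction fixes the random initial weights
    torch.manual_seed(seed)
    return nn.Linear(4, 3)

layer_a = make_layer(1234)
layer_b = make_layer(1234)
layer_c = make_layer(4321)

# Same seed gives identical initial weights; a different seed does not
same_seed_equal = torch.equal(layer_a.weight, layer_b.weight)
different_seed_equal = torch.equal(layer_a.weight, layer_c.weight)
print(same_seed_equal, different_seed_equal)
```

Note that the seed fixes the sequence of random numbers, so two models only start from identical weights if each is constructed right after the same seed is set.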

# Downloading embedding
Pre-trained embeddings are available and can be easily used in our model. We will be using the GloVe vectors with 100 dimensions (`glove.6B.100d.txt`), matching `embed_size = 100` in the configuration below.

In [None]:
embed_exists = os.path.isfile('../embeddings/glove.6B.zip')
if not embed_exists:
    print("Downloading GloVe embeddings. If the download fails, delete `../embeddings/glove.6B.zip` and rerun.")
    chakin.search(lang='English')
    chakin.download(number=16, save_dir='../embeddings')
    # Extract the archive; the context manager closes the file automatically
    with zipfile.ZipFile("../embeddings/glove.6B.zip", 'r') as zip_ref:
        zip_ref.extractall("../embeddings/")

# Pre-processing
Pre-processing consists of the following steps:
1. Parsing data 
2. Writing the parsed records to JSON files (loaded later as a torchtext dataset)
3. Setting network parameters
4. Tokenizing data 
5. Splitting data
6. Constructing iterator

In [None]:
def parse_label(label):
        '''
        Get the actual labels from label string
        Input:
            label (string) : labels of the form '__label__2'
        Returns:
            label (int) : integer value corresponding to label string
        '''
        return int(label.strip()[-1])

In [None]:
def get_pandas_df(filename, outfile):
    '''
    Convert the raw data file into JSON lines (one record per line)
    The output file is loaded later as a torchtext dataset
    '''
    # Context managers make sure both files are flushed and closed
    with open(filename, 'r') as datafile, open(outfile, "w") as outpointer:
        for each_line in datafile:
            # Each line has the form "<label> , <text>"
            each_line = each_line.split(" , ")
            record = {"text": each_line[1], "label": parse_label(each_line[0])}
            outpointer.write(json.dumps(record) + "\n")
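
To make the conversion concrete, here is a minimal sketch that turns a single line into a JSON record. It assumes lines of the form `__label__<digit> , <text>`; the sample line and the helper name `line_to_record` are made up for illustration.

```python
import json

def line_to_record(line):
    # Assumed format: "<label> , <text>", e.g. "__label__3 , some headline"
    parts = line.split(" , ")
    label = int(parts[0].strip()[-1])   # last character of "__label__3" is the class
    return {"text": parts[1].strip(), "label": label}

sample = "__label__3 , Stocks rally as oil prices fall\n"   # made-up example line
record = line_to_record(sample)
print(json.dumps(record))
# {"text": "Stocks rally as oil prices fall", "label": 3}
```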

In [None]:
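train_file = '../Ch5/data/ag_news.train'
test_file = '../Ch5/data/ag_news.test'
os.makedirs("data", exist_ok=True)  # make sure the output directory exists
get_pandas_df(train_file, "data/train.json")
get_pandas_df(test_file, "data/test.json")  # build the test set from the test file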

In [None]:
class Config(object):
    embed_size = 100
    hidden_layers = 1
    hidden_size = 128
    bidirectional = True
    class_num = 4
    max_epochs = 15
    lr = 0.5
    batch_size = 32
    dropout_keep = 0.2
    max_sen_len = None # Sequence length for RNN
config  = Config()

In [None]:
def tokenize(sentiments):    
    return nltk.tokenize.word_tokenize(sentiments)

def to_categorical(x):
    x = int(x)
    if x == 1:
        return [1,0,0,0]
    if x == 2:
        return [0,1,0,0]
    if x == 3:
        return [0,0,1,0]
    if x == 4:
        return [0,0,0,1]
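
The same one-hot encoding can also be produced with `torch.nn.functional.one_hot`. The sketch below is an alternative for illustration only, not the function used in this chapter; the name `to_one_hot` is hypothetical.

```python
import torch
import torch.nn.functional as F

def to_one_hot(label, num_classes=4):
    # Labels are 1-based (1..4), while F.one_hot expects 0-based class indices
    index = torch.tensor(int(label) - 1)
    return F.one_hot(index, num_classes=num_classes).tolist()

encoded = [to_one_hot(label) for label in (1, 2, 3, 4)]
print(encoded)
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```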
    

In [None]:
# defining data fields
REVIEW = data.Field(sequential=True , tokenize=tokenize, use_vocab = True, lower=True,batch_first=True)
LABEL = data.Field(is_target=True,use_vocab = False, sequential=False, preprocessing = to_categorical)
fields = {'text': ('review', REVIEW), 'label': ('label', LABEL)}

# constructing tabular dataset
train_data , test_data = data.TabularDataset.splits(
                            path = 'data',
                            train = 'train.json',
                            test = 'test.json',
                            format = 'json',
                            fields = fields)


In [None]:
print ([vars(train_data[i]) for i in range (0,1)])

In [None]:
vec = vocab.Vectors(name = 'glove.6B.100d.txt',cache = "../embeddings/glove.6B/")
REVIEW.build_vocab(train_data, test_data, max_size=400000, vectors=vec)

# making iterator
train_iter, test_iter = data.Iterator.splits(
        (train_data, test_data), sort_key=lambda x: len(x.review),
        batch_sizes=(config.batch_size,config.batch_size), device=device)

In [None]:
vocab_size = len(REVIEW.vocab)
vocab_vectors = REVIEW.vocab.vectors

# Attention Model

The main part of this implementation is the model, which uses the attention mechanism for text classification. The implementation is very similar to the attention mechanism implemented in Chapter 4, Using RNN for NLP. The forward function takes the input sentences with shape `[batch size, sequence length]`, because the `REVIEW` field is built with `batch_first=True`. A pre-trained embedding lookup converts this input to shape `[batch size, sequence length, embedding size]`, which is then permuted to `[sequence length, batch size, embedding size]`, the layout the LSTM expects. This input is passed to the LSTM along with the initial hidden and cell states. The LSTM produces an output for every time step along with the final hidden and cell states. The LSTM output and the final hidden state are given to the `attention_net` function, which carries out a batch-wise matrix multiplication between the LSTM output and the hidden state to produce the attention scores `attn_weights`. After a softmax, these weights are used to produce the updated hidden state. A linear transformation is applied to the updated hidden state to convert it into class scores, and a softmax transformation is applied thereafter to normalize the outputs.
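
The flow described above can be traced on random data to check the tensor shapes at each step. This is only a sketch: the sizes are arbitrary, the embedding is randomly initialized rather than loaded from GloVe, and the variable names are chosen for this illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 3, 6, 20
embed_size, hidden_size, class_num = 8, 5, 4

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))  # [batch, seq_len]
embedding = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size)
classifier = nn.Linear(hidden_size, class_num)

embedded = embedding(tokens).permute(1, 0, 2)   # [seq_len, batch, embed]
output, (h_n, c_n) = lstm(embedded)             # output: [seq_len, batch, hidden]
output = output.permute(1, 0, 2)                # [batch, seq_len, hidden]

hidden = h_n.squeeze(0)                         # [batch, hidden]
scores = torch.bmm(output, hidden.unsqueeze(2)).squeeze(2)    # [batch, seq_len]
attn_weights = F.softmax(scores, dim=1)         # one weight per token
context = torch.bmm(output.transpose(1, 2), attn_weights.unsqueeze(2)).squeeze(2)

probs = torch.softmax(classifier(context), dim=1)             # [batch, class_num]
print(attn_weights.shape, context.shape, probs.shape)
```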

In [None]:
class RNNAttentionModel(torch.nn.Module):
    def __init__(self,config_object,vocab_size, weights):
        super(RNNAttentionModel, self).__init__()

        self.batch_size = config_object.batch_size
        self.class_num = config_object.class_num
        self.hidden_size = config_object.hidden_size
        self.vocab_size = vocab_size
        self.embed_size = config_object.embed_size
        self.device = device

        self.word_embeddings = nn.Embedding(self.vocab_size, self.embed_size)
        self.word_embeddings.weight.data.copy_(weights)
        self.word_embeddings.weight.requires_grad = True
        
        self.lstm = nn.LSTM(self.embed_size, self.hidden_size)
        self.label = nn.Linear(self.hidden_size, self.class_num)


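    def attention_net(self, lstm_output, final_state):
        # lstm_output: [batch, seq_len, hidden]; final_state: [1, batch, hidden]
        hidden = final_state.squeeze(0)  # [batch, hidden]
        # Score every time step against the final hidden state: [batch, seq_len]
        attn_weights = torch.bmm(lstm_output, hidden.unsqueeze(2)).squeeze(2)
        soft_attn_weights = F.softmax(attn_weights, dim=1)
        # Weighted sum of the LSTM outputs: [batch, hidden]
        new_hidden_state = torch.bmm(lstm_output.transpose(1, 2), soft_attn_weights.unsqueeze(2)).squeeze(2)

        return new_hidden_state

    def forward(self, input_sentences):
        # input_sentences: [batch, seq_len] (the field uses batch_first=True)
        embedded = self.word_embeddings(input_sentences)  # [batch, seq_len, embed]
        embedded = embedded.permute(1, 0, 2)  # [seq_len, batch, embed]

        # Use the actual batch size so a smaller final batch does not fail
        batch_size = input_sentences.size(0)
        h_0 = torch.zeros(1, batch_size, self.hidden_size, device=embedded.device)
        c_0 = torch.zeros(1, batch_size, self.hidden_size, device=embedded.device)

        output, (final_hidden_state, final_cell_state) = self.lstm(embedded, (h_0, c_0))
        output = output.permute(1, 0, 2)  # [batch, seq_len, hidden]

        attn_output = self.attention_net(output, final_hidden_state)
        logits = self.label(attn_output)
        return torch.softmax(logits, dim=1)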


In [None]:
rnn_attention_model = RNNAttentionModel(config, vocab_size, vocab_vectors)
rnn_attention_model = rnn_attention_model.to(device)

In [None]:
class RNNModel(torch.nn.Module):
    def __init__(self,config_object,vocab_size, weights):
        super(RNNModel, self).__init__()

        self.batch_size = config_object.batch_size
        self.class_num = config_object.class_num
        self.hidden_size = config_object.hidden_size
        self.vocab_size = vocab_size
        self.embed_size = config_object.embed_size
        self.device = device

        self.word_embeddings = nn.Embedding(self.vocab_size, self.embed_size)
        self.word_embeddings.weight.data.copy_(weights)
        self.word_embeddings.weight.requires_grad = True

        self.lstm = nn.LSTM(self.embed_size, self.hidden_size)
        self.label = nn.Linear(self.hidden_size, self.class_num)

    def forward(self, input_sentences):
        # input_sentences: [batch, seq_len] (the field uses batch_first=True)
        embedded = self.word_embeddings(input_sentences)  # [batch, seq_len, embed]
        embedded = embedded.permute(1, 0, 2)  # [seq_len, batch, embed]

        # Use the actual batch size so a smaller final batch does not fail
        batch_size = input_sentences.size(0)
        h_0 = torch.zeros(1, batch_size, self.hidden_size, device=embedded.device)
        c_0 = torch.zeros(1, batch_size, self.hidden_size, device=embedded.device)

        # final_hidden_state.size() = (1, batch_size, hidden_size)
        output, (final_hidden_state, final_cell_state) = self.lstm(embedded, (h_0, c_0))

        logits = self.label(final_hidden_state.squeeze(0))

        return torch.softmax(logits, dim=1)

**Constructing model object**

In [None]:
rnn_model = RNNModel(config, vocab_size, vocab_vectors)
rnn_model = rnn_model.to(device)

**Supporting Function**

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    rounded_preds = torch.argmax(preds, dim=1)
    correct = (rounded_preds == torch.argmax(y, dim=1)).float() #convert into float for division 
    acc = correct.sum()/len(correct)
    return acc
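
Despite its name, the function above computes multi-class accuracy by comparing the arg-max of the predictions with the arg-max of the one-hot targets. The sketch below reproduces that logic on made-up values; `batch_accuracy` is a hypothetical stand-in so the example is self-contained.

```python
import torch

def batch_accuracy(preds, y):
    # Compare the most probable class with the class marked in the one-hot target
    predicted = torch.argmax(preds, dim=1)
    actual = torch.argmax(y, dim=1)
    return (predicted == actual).float().mean()

# Made-up class probabilities for 4 examples and 4 classes
preds = torch.tensor([[0.7, 0.1, 0.1, 0.1],
                      [0.2, 0.5, 0.2, 0.1],
                      [0.1, 0.1, 0.2, 0.6],
                      [0.4, 0.3, 0.2, 0.1]])
targets = torch.tensor([[1, 0, 0, 0],
                        [0, 1, 0, 0],
                        [0, 0, 1, 0],
                        [1, 0, 0, 0]], dtype=torch.float)

acc = batch_accuracy(preds, targets)
print(acc.item())  # 0.75, since the third example is misclassified
```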

In [None]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    for batch in iterator:
        # Keep inputs and targets on the same device as the model
        feature = batch.review.to(device)
        target = batch.label.float().to(device)
        optimizer.zero_grad()
        predictions = model(feature)
        loss = criterion(predictions, target)
        loss.backward()
        optimizer.step()
        acc = binary_accuracy(predictions, target)
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return model, epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
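def test_accuracy_calculator(model, test_iterator):
    epoch_acc = 0
    evaluated_batches = 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for batch in test_iterator:
            # Only full batches are evaluated
            if batch.review.shape[0] == config.batch_size:
                feature, target = batch.review, batch.label
                predictions = model(feature.to(device))
                acc = binary_accuracy(predictions.type(torch.FloatTensor), target.type(torch.FloatTensor))
                epoch_acc += acc.item()
                evaluated_batches += 1
    # Average over the batches that were actually evaluated
    return epoch_acc / max(evaluated_batches, 1)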

**Defining optimizer for a model with and without attention**

In [None]:
rnn_optimizer = torch.optim.SGD(rnn_model.parameters(), lr=0.001, momentum=0.9)
rnn_criterion = nn.MSELoss()
rnn_criterion = rnn_criterion.to(device)
rnn_attention_optimizer = torch.optim.SGD(rnn_attention_model.parameters(), lr=0.01, momentum=0.9)
rnn_attention_criterion = nn.MSELoss()
rnn_attention_criterion = rnn_attention_criterion.to(device)

## Training

In [None]:
epochs  = 50
writer = SummaryWriter()

for i in tqdm(range(epochs)):
    if (i != 0 and i%10 == 0 ):
        # changing learning rate for rnn_model
        for param_group in rnn_optimizer.param_groups:
            param_group['lr'] = param_group['lr']/2
        # changing learning rate for rnn_attention model
        for param_group in rnn_attention_optimizer.param_groups:
            param_group['lr'] = param_group['lr']/2
        
    rnn_model, rnn_epoch_loss, rnn_epoch_acc = train(rnn_model, train_iter, rnn_optimizer, rnn_criterion)
    rnn_attention_model, rnn_attention_epoch_loss, rnn_attention_epoch_acc = train(rnn_attention_model, train_iter, rnn_attention_optimizer, rnn_attention_criterion)

    rnn_test_acc = test_accuracy_calculator(rnn_model, test_iter)
    rnn_attention_test_acc = test_accuracy_calculator(rnn_attention_model, test_iter)
    
    writer.add_scalar('TRAIN_LOSS/rnn_epoch_loss',rnn_epoch_loss, i)
    writer.add_scalar('TRAIN_LOSS/rnn_attention_epoch_loss', rnn_attention_epoch_loss, i)
    
    writer.add_scalar('TRAIN_ACC/rnn_epoch_acc',rnn_epoch_acc, i)
    writer.add_scalar('TRAIN_ACC/rnn_attention_epoch_acc', rnn_attention_epoch_acc, i)
    
    writer.add_scalar('TEST/rnn_test_acc',rnn_test_acc, i)
    writer.add_scalar('TEST/rnn_attention_test_acc', rnn_attention_test_acc, i)

# Examining Results
The training accuracy of the networks with and without attention is compared below. The RNN with attention reaches an accuracy close to 90%, whereas the RNN without attention reaches an accuracy close to 60% in the same number of epochs.

![](figures/Attention_based_classification_train_acc.png)

Figure: Comparing increase in accuracy for model with attention and without attention

The loss of the networks with and without attention is shown below. The network with the attention mechanism achieves a minimum loss below 0.05, whereas the network without it achieves a minimum loss of 0.13 in the same number of epochs.

![](figures/Attention_based_classification_train_loss.png)

Figure: Comparing decrease in loss for model with attention and without attention

A similar trend is observed for test accuracy. The RNN with attention reaches an accuracy close to 87%, whereas the RNN without attention reaches an accuracy close to 65% in the same number of epochs.

![](figures/Attention_based_classification_test.png)

Figure: Comparing increase in test accuracy for model with attention and without attention

The experiments above suggest that the attention mechanism provides better and faster convergence than the same network architecture without it.