
# Assignment 4 
In this assignment you will be working with a character-based LSTM language model, which you will turn into a text classifier for sentiment analysis using **Attention**. For that, you will need to develop the Attention mechanism that aggregates the per-character hidden output vectors into a single vector, which you will use as the input to a final linear classifier.

In [None]:
import torch
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
!git clone https://github.com/kfirbar/course-ml-data.git

Cloning into 'course-ml-data'...
remote: Enumerating objects: 31, done.
remote: Counting objects: 100% (31/31), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 31 (delta 5), reused 8 (delta 0), pack-reused 0
Unpacking objects: 100% (31/31), done.


The data that you will be working with is SST-2, a collection of reviews, each labeled 0 or 1 to reflect the overall sentiment of the author (0 = negative, 1 = positive). In the next piece of code, we load the data and create a dictionary (named Vocab) that assigns a unique ID per character, similar to what we have done in DL Notebook 12. Only sentences of at most `MAX_SEQ_LEN` characters are kept. Finally, each one of *train* and *dev* is a list of tuples, with the first item being the text encoded as character indices, and the second item being the label (0 or 1).

In [None]:
MAX_SEQ_LEN = 50

class Vocab:
    def __init__(self):
        self.char2id = {}
        self.id2char = {}
        self.n_chars = 1
        
    def index_text(self, text):
      indexes = [self.index_char(c) for c in text]
      return indexes
    
    def index_char(self, c):
        if c not in self.char2id:
            self.char2id[c] = self.n_chars
            self.id2char[self.n_chars] = c
            self.n_chars += 1
        return self.char2id[c]
            
            
def load_data(data, vocab):
  data_sequences = []
  for text in data.iterrows():
    if len(text[1]["sentence"]) <= MAX_SEQ_LEN:
      indexes = vocab.index_text(text[1]["sentence"])
      data_sequences.append((indexes, text[1]["label"]))
  return data_sequences

vocab = Vocab()
train = load_data(pd.read_csv('/content/course-ml-data/SST2_train.tsv', sep='\t'), vocab)
dev = load_data(pd.read_csv('/content/course-ml-data/SST2_dev.tsv', sep='\t'), vocab)
print(f'Train size {len(train)}, Dev size {len(dev)}, vocab size {vocab.n_chars}')

Train size 40625, Dev size 119, vocab size 63
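
To see what the encoded data looks like, the sketch below re-creates the indexing idea on a toy sentence. It is a minimal illustration rather than the notebook's code: `encode` and `decode` are hypothetical helpers, and ids start at 1 as in `Vocab`, which leaves 0 unused.

```python
# Minimal sketch of character indexing; encode/decode are hypothetical helpers.
char2id = {}
id2char = {}

def encode(text):
    """Map each character to an integer id, assigning new ids on first sight."""
    ids = []
    for c in text:
        if c not in char2id:
            new_id = len(char2id) + 1   # ids start at 1, as in Vocab
            char2id[c] = new_id
            id2char[new_id] = c
        ids.append(char2id[c])
    return ids

def decode(ids):
    """Invert encode() by looking each id up again."""
    return "".join(id2char[i] for i in ids)

encoded = encode("a good movie")
print(encoded)
print(decode(encoded))
```

Repeated characters share one id, so the vocabulary stays small even for long texts.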


# Task 1
The following RNN architecture takes a single sentence as an input (formatted as a 1D tensor of character ids), and returns a distribution over the labels. In our case the number of labels is 2 (negative, positive). 

I basically copied the same architecture from Notebook 12, where each input character gets an output vector from the LSTM module, and these vectors are used to predict the next character in line. However, here, we are not really interested in predicting the next character, but in aggregating all those output vectors into a single "context" vector, which will be sent to a Linear layer for the final classification step.

Therefore, you are requested to add the relevant code for aggregating the output vectors using the **additive attention** approach, following presentation *DL 14*. Note that some of what you need to add should be parameters, which you need to define under the `__init__` function.
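
For reference, one common way to write additive attention over the LSTM outputs $h_1, \dots, h_T$ is given below. The symbols $W$, $b$, $v$, $e_t$, $\alpha_t$ and $c$ are notation assumed here; the course presentation may use different names or extra terms.

```latex
e_t = v^\top \tanh(W h_t + b), \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}, \qquad
c = \sum_{t=1}^{T} \alpha_t h_t
```

Here $e_t$ is an unnormalized score for character $t$, the weights $\alpha_t$ sum to 1, and $c$ is the context vector passed to the final linear layer.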


In [None]:
class SeqModel(torch.nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, output_size):
        super(SeqModel, self).__init__()
        self.embedding = torch.nn.Embedding(input_size, embedding_size).cuda()
        self.rnn = torch.nn.LSTM(embedding_size, hidden_size).cuda()

        self.out = torch.nn.Linear(hidden_size, output_size)

        # Additive attention parameters: score_t = v . tanh(W h_t + b)
        self.v = torch.nn.Parameter(torch.empty(1, hidden_size).cuda())
        torch.nn.init.uniform_(self.v)
        self.w = torch.nn.Linear(hidden_size, hidden_size)
        self.linear_layer = torch.nn.Linear(47, 1)  # not used in forward()
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, single_sentence):
        # (seq_len,) -> (seq_len, 1, embedding_size); the batch size is 1
        Embedded = self.embedding(single_sentence)
        Embedded = Embedded.view(len(single_sentence), 1, -1)
        out, hidden = self.rnn(Embedded)  # out: (seq_len, 1, hidden_size)

        # One unnormalized score per character: (1, seq_len)
        o = torch.tanh(self.w(out))
        o = torch.squeeze(o, dim=1)
        dot_product = torch.mm(self.v, o.transpose(0,1))

        # Attention weights over the characters; they sum to 1
        Probabilities = self.softmax(dot_product)

        # Weighted sum of the LSTM outputs: (hidden_size,)
        context = out.squeeze(dim=1) * Probabilities.transpose(0,1)
        context = torch.sum(context, dim=0)
        return self.out(context)
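
The pooling step in `forward` can also be traced by hand on a tiny example. The sketch below is a plain-Python illustration using only the standard library, not the model's code: `additive_attention` is a hypothetical helper, and every number in the toy input is made up.

```python
import math

def additive_attention(outputs, W, b, v):
    """Pool per-character vectors into one context vector.

    outputs: list of T vectors (each a list of H floats)
    W: H x H matrix, b: length-H bias, v: length-H scoring vector
    """
    scores = []
    for h in outputs:
        # hidden = tanh(W h + b)
        hidden = [math.tanh(sum(W[i][j] * h[j] for j in range(len(h))) + b[i])
                  for i in range(len(W))]
        # score = v . hidden
        scores.append(sum(v_i * x for v_i, x in zip(v, hidden)))

    # Softmax over the T scores (shifted by the max for numerical stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context = weighted sum of the original output vectors
    H = len(outputs[0])
    context = [sum(w * h[k] for w, h in zip(weights, outputs)) for k in range(H)]
    return weights, context

# Toy example: T = 3 characters, H = 2 hidden units (all values made up)
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
v = [1.0, 1.0]
weights, context = additive_attention(outputs, W, b, v)
print(weights, context)
```

The third vector gets the highest score and therefore the largest weight, and the context vector has the same size as a single LSTM output regardless of the sentence length.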



# Task 2
Once completed, you are requested to write some code for training the model using the following configuration. Make sure to print the training loss every 100 sentences so you can follow the progress. Train your model for 4 epochs, and use cuda + GPU.

In [None]:
import torch
import torch.nn as nn
model = SeqModel(vocab.n_chars, 64, 300, 2).cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

In [None]:
epochs = 4
loss_stats = []
PRINT_EVERY = 100

for epoch in range(epochs):

    epoch_loss = 0
    running_loss = 0.0
    for i, data in enumerate(train, 0):
        Inputs, labels = data
        # Move the character ids and the label to the GPU
        Inputs = torch.LongTensor(Inputs).cuda()
        labels = torch.LongTensor([labels]).cuda()

        optimizer.zero_grad()

        Outputs = model(Inputs)
        loss = criterion(Outputs.view(1,2), labels)
        loss.backward()
        optimizer.step()

        item_loss = loss.item()
        running_loss += item_loss
        epoch_loss += item_loss
        # Note: the report at i == 0 covers a single sentence but is still
        # divided by PRINT_EVERY, so it reads about 100x too small.
        if i % PRINT_EVERY == 0:
            curr_loss = running_loss / PRINT_EVERY
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, curr_loss))
            running_loss = 0.0

    # Average loss per sentence over the whole epoch
    loss_stats.append(epoch_loss / len(train))

print('Finished Training')

[1,     1] loss: 0.007
[1,   101] loss: 0.739
[1,   201] loss: 0.720
[1,   301] loss: 0.711
[1,   401] loss: 0.698
[1,   501] loss: 0.674
[1,   601] loss: 0.692
[1,   701] loss: 0.690
[1,   801] loss: 0.713
[1,   901] loss: 0.692
[1,  1001] loss: 0.680
[1,  1101] loss: 0.670
[1,  1201] loss: 0.709
[1,  1301] loss: 0.699
[1,  1401] loss: 0.713
[1,  1501] loss: 0.704
[1,  1601] loss: 0.679
[1,  1701] loss: 0.670
[1,  1801] loss: 0.702
[1,  1901] loss: 0.680
[1,  2001] loss: 0.735
[1,  2101] loss: 0.745
[1,  2201] loss: 0.718
[1,  2301] loss: 0.701
[1,  2401] loss: 0.704
[1,  2501] loss: 0.675
[1,  2601] loss: 0.703
[1,  2701] loss: 0.724
[1,  2801] loss: 0.697
[1,  2901] loss: 0.691
[1,  3001] loss: 0.657
[1,  3101] loss: 0.717
[1,  3201] loss: 0.707
[1,  3301] loss: 0.699
[1,  3401] loss: 0.656
[1,  3501] loss: 0.690
[1,  3601] loss: 0.634
[1,  3701] loss: 0.692
[1,  3801] loss: 0.668
[1,  3901] loss: 0.663
[1,  4001] loss: 0.702
[1,  4101] loss: 0.715
[1,  4201] loss: 0.644
[1,  4301] 
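
The first line of the log, `loss: 0.007`, is a single sentence's loss divided by 100 rather than a true average, because the report fires at `i == 0`. A minimal sketch of a reporter that only prints full windows follows; `RunningLoss` is a hypothetical helper that is not part of the notebook, and the loss values are made up.

```python
class RunningLoss:
    """Accumulate losses and report the mean once per full window."""

    def __init__(self, window):
        self.window = window
        self.total = 0.0
        self.count = 0

    def update(self, loss):
        """Add one loss; return the window mean when the window is full, else None."""
        self.total += loss
        self.count += 1
        if self.count == self.window:
            mean = self.total / self.count
            self.total, self.count = 0.0, 0
            return mean
        return None

# Usage with made-up loss values and a window of 3
tracker = RunningLoss(window=3)
reports = []
for step, loss in enumerate([0.9, 0.6, 0.3, 0.8, 0.4, 0.6, 0.5], start=1):
    mean = tracker.update(loss)
    if mean is not None:
        reports.append((step, mean))
        print('[%5d] loss: %.3f' % (step, mean))
```

In the training loop, the same effect could be had by testing `(i + 1) % PRINT_EVERY == 0`, so that every printed value averages exactly `PRINT_EVERY` sentences.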

# Task 3
Write some code for evaluating your model on the dev set. Since the data is almost balanced (there are 52 positives in the dev set), let's print accuracy (i.e., the fraction of correctly classified instances).

In [None]:
correct_classifies = 0
with torch.no_grad():  # no gradients are needed for evaluation
  for (Inputs, label) in dev:
    Inputs = torch.LongTensor(Inputs).cuda()
    Output = model(Inputs)  # logits for the two labels
    # Count the sentence as correct when the highest-scoring label matches
    correct_classifies += int(torch.argmax(Output).item() == label)

In [None]:
correct_classifies / len(dev)

0.8487394957983193
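
To put this accuracy in context, it can be compared with a majority-class baseline. The arithmetic below uses only numbers stated above, assuming the 52 positives refer to the 119 dev sentences that were loaded; the variable names are illustrative.

```python
# Numbers taken from the notebook: dev size and number of positive examples
dev_size = 119
n_positive = 52
n_negative = dev_size - n_positive

# A classifier that always predicts the more frequent label
majority_baseline = max(n_positive, n_negative) / dev_size

# The reported accuracy corresponds to this many correct predictions
reported_accuracy = 0.8487394957983193
n_correct = round(reported_accuracy * dev_size)

print(f'majority baseline: {majority_baseline:.3f}')
print(f'correct predictions: {n_correct} of {dev_size}')
```

Always predicting "negative" would be right on 67 of the 119 sentences, so the attention model's 101 correct predictions are well above that baseline.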