<center><h2>ALTeGraD 2022<br>Lab Session 1: HAN</h2><h3>Hierarchical Attention Network Using GRU</h3> 27 / 10 / 2022<br> M. Kamal Eddine, H. Abdine<br><br>


<b>Student name:</b> Waël Doulazmi


</center>
In this lab, you will get familiar with recurrent neural networks (RNNs), self-attention, and the HAN architecture <b>(Yang et al. 2016)</b> using PyTorch. In this architecture, sentence embeddings are first individually produced, and a document embedding is then computed from the sentence embeddings.<br>
<b>The deadline for this lab is November 14, 2022 11:59 PM.</b> More details about the submission and the architecture for this lab can be found in the handout PDF.


### = = = = =  Attention Layer = = = = =
In this section, you will fill the gaps in the code to implement the self-attention layer. This layer will be used later to define the HAN architecture. The basic idea behind attention is that rather than considering the last annotation $h_T$ as a summary of the entire sequence, which is prone to information loss, the annotations at <i>all</i> time steps are used.
The self-attention mechanism computes a weighted sum of the annotations, where the weights are determined by trainable parameters. Refer to <b>section 2.2</b> in the handout for the theoretical part; it will be needed to finish the first task.
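
As a rough illustration of the weighted sum described above, the sketch below pools a batch of annotations with a softmax over the time steps. The tensor sizes and the names `h`, `W`, `u`, `alpha` and `s` are assumptions chosen for this example; section 2.2 of the handout remains the reference for the exact formulation.

```python
import torch

torch.manual_seed(0)

batch, steps, features = 2, 4, 6           # toy sizes, chosen only for illustration
h = torch.randn(batch, steps, features)    # annotations h_t, as an RNN would produce them

W = torch.nn.Linear(features, features)        # trainable projection
u = torch.nn.Linear(features, 1, bias=False)   # trainable context vector

scores = u(torch.tanh(W(h)))               # (batch, steps, 1): one score per time step
alpha = torch.softmax(scores, dim=1)       # normalize the scores over the time axis
s = (alpha * h).sum(dim=1)                 # (batch, features): weighted sum of annotations

print(alpha.squeeze(-1).sum(dim=1))        # the weights of each sequence sum to 1
print(s.shape)
```

Every time step contributes to `s`, with a weight that depends on the trainable parameters rather than on its position in the sequence.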

#### <b>Task 1:</b>

In [1]:
import torch
from torch import nn
from torch.utils.data import DataLoader

class AttentionWithContext(nn.Module):
    """
    Follows the work of Yang et al. [https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf]
    "Hierarchical Attention Networks for Document Classification"
    by using a context vector to assist the attention
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    """
    
    def __init__(self, input_shape, return_coefficients=False, bias=True):
        super(AttentionWithContext, self).__init__()
        self.return_coefficients = return_coefficients

        self.W = nn.Linear(input_shape, input_shape, bias=bias)
        self.tanh = nn.Tanh()
        self.u = nn.Linear(input_shape, 1, bias=False)

        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        self.W.weight.data.uniform_(-initrange, initrange)
        self.W.bias.data.uniform_(-initrange, initrange)
        self.u.weight.data.uniform_(-initrange, initrange)
    
    def generate_square_subsequent_mask(self, sz):
        # do not pass the mask to the next layers
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = (
            mask.float()
            .masked_fill(mask == 0, float("-inf"))
            .masked_fill(mask == 1, float(0.0))
        )
        return mask
    
    def forward(self, x, mask=None):
        uit = self.W(x)  # fill the gap # compute uit = W . x  where x represents ht
        uit = self.tanh(uit)
        ait = self.u(uit)
        a = torch.exp(ait)
        
        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            a = a*mask.double()
        
        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        eps = 1e-9
        a = a / (torch.sum(a, axis=1, keepdim=True) + eps)
        weighted_input = a * x
        if self.return_coefficients:
            return [torch.sum(weighted_input, dim=1), a] ### [attentional vector, coefficients] ### use torch.sum to compute s
        else:
            return  torch.sum(weighted_input, dim=1)
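
The layer above applies the mask after the exponential and then renormalizes. A minimal sketch, assuming a binary mask of shape `(samples, steps, 1)` where 1 marks a real token, suggests that this gives the same weights as the more common approach of filling padded scores with `-inf` before a softmax. The scores and the mask below are made up for illustration.

```python
import torch

torch.manual_seed(0)

scores = torch.randn(2, 5, 1)                          # toy attention scores (samples, steps, 1)
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]]).unsqueeze(-1)   # 1 = real token, 0 = padding

# Variant used in the layer: exponentiate, zero out padding, renormalize.
a = torch.exp(scores) * mask
a = a / (a.sum(dim=1, keepdim=True) + 1e-9)

# Common alternative: send padded scores to -inf before a softmax.
b = torch.softmax(scores.masked_fill(mask == 0, float("-inf")), dim=1)

print(torch.allclose(a, b, atol=1e-6))
print(a[0].squeeze(-1))                                # padded steps receive zero weight
```

The first variant can overflow for very large scores because nothing is subtracted before `torch.exp`. With the small initial weights and the `tanh` used here, the scores stay in a narrow range.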

### = = = = = Parameters = = = = =
In this section, we define the parameters used for training, such as the data path, the embedding dimension <b>d</b>, the GRU layer dimensionality <b>n_units</b>, etc.<br>
The parameter <b>device</b> is used to train the model on a GPU if one is available. If you are using Google Colab, switch your runtime to a GPU runtime to train the model at maximum speed.<br>
<b>Bonus question:</b> What is the purpose of the parameter <i>my_patience</i>?
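
One way to think about a patience parameter is sketched below with plain Python: training stops once the validation accuracy has failed to improve for `patience` consecutive epochs. The helper `epochs_run` and the accuracy history are hypothetical and serve only to illustrate the idea.

```python
def epochs_run(val_accuracies, patience):
    """Return the number of epochs executed before early stopping triggers."""
    best = float("-inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc >= best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale == patience:
                return epoch  # stop here instead of finishing all epochs
    return len(val_accuracies)


history = [0.68, 0.76, 0.78, 0.77, 0.775, 0.80]  # hypothetical validation accuracies
print(epochs_run(history, patience=2))  # 5
print(epochs_run(history, patience=3))  # 6
```

A small patience saves compute but may stop before a late improvement (here, the last epoch); a larger one tolerates short plateaus.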

In [2]:
import sys
import json
import operator
import numpy as np

path_root = ''
path_to_data = path_root + 'data/'

d = 30 # dimensionality of word embeddings
n_units = 50 # RNN layer dimensionality
drop_rate = 0.5 # dropout
mfw_idx = 2 # index of the most frequent words in the dictionary 
            # 0 is for the special padding token
            # 1 is for the special out-of-vocabulary token

padding_idx = 0
oov_idx = 1
batch_size = 64
nb_epochs = 15
my_patience = 2 # for early stopping strategy
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [3]:
print(device)

cuda


### = = = = = Data Loading = = = = =
In this section, we first use <b>wget</b> to download the data, then load it with NumPy in the first cell. In the second cell, we use this data to define our PyTorch data loader. Note that the data is already preprocessed, tokenized and padded.<br><br>
<b>Note: if you are running your notebook on Windows or on MacOS, <i>wget</i> will probably not work if you did not install it manually. In this case, use the provided link to download the data and change the <i>path_to_data</i> in the <i>Parameters</i> section accordingly. Otherwise, you will face no problem on Ubuntu and Google Colab.</b>

#### <b>Task 2.1:</b>

In [4]:
my_docs_array_train = np.load(path_to_data + 'docs_train.npy')
my_docs_array_test = np.load(path_to_data + 'docs_test.npy')

my_labels_array_train = np.load(path_to_data + 'labels_train.npy')
my_labels_array_test = np.load(path_to_data + 'labels_test.npy')

# load dictionary of word indexes (sorted by decreasing frequency across the corpus)
with open(path_to_data + 'word_to_index.json', 'r') as my_file:
    word_to_index = json.load(my_file)

# invert mapping
index_to_word =  {word_to_index[word]: word for word in word_to_index.keys()}  ### fill the gap (use a dict comprehension) ###
input_size = my_docs_array_train.shape
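
The inverted mapping is what later turns integer documents back into text. A small sketch with a hypothetical three-word vocabulary (the real one comes from `word_to_index.json`) shows the inversion and how padding is skipped during decoding.

```python
# Hypothetical toy vocabulary; indices 0 and 1 are reserved for padding and OOV.
toy_word_to_index = {"the": 2, "movie": 3, "great": 4}

# Invert the mapping with a dict comprehension.
toy_index_to_word = {index: word for word, index in toy_word_to_index.items()}
toy_index_to_word[1] = "OOV"  # give the out-of-vocabulary index a printable name

padded_sentence = [2, 3, 1, 4, 0, 0]  # trailing zeros are padding
words = [toy_index_to_word[i] for i in padded_sentence if i in toy_index_to_word]
print(words)  # ['the', 'movie', 'OOV', 'great']
```

Because index 0 is absent from the inverted dictionary, the membership test drops padding without any special case.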

In [5]:
len(index_to_word)

29936

In [6]:
import numpy
import torch
from torch.utils.data import DataLoader, Dataset


class Dataset_(Dataset):
    def __init__(self, x, y):
        self.documents = x
        self.labels = y

    def __len__(self):
        return len(self.documents)

    def __getitem__(self, index):
        document = self.documents[index]
        label = self.labels[index] 
        sample = {
            "document": torch.tensor(document),
            "label": torch.tensor(label),
            }
        return sample


def get_loader(x, y, batch_size=32):
    dataset = Dataset_(x, y)
    data_loader = DataLoader(dataset=dataset,
                            batch_size=batch_size,
                            shuffle=True,
                            pin_memory=True,
                            drop_last=True,
                            )
    return data_loader
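
To see what the loader yields, here is a minimal sketch on a random toy corpus. The array sizes are assumptions, and `TensorDataset` is used instead of the custom `Dataset_` class only to keep the example short.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy corpus: 10 documents, 7 sentences each, 30 word indices per sentence.
docs = np.random.randint(0, 100, size=(10, 7, 30))
labels = np.random.randint(0, 2, size=(10,))

dataset = TensorDataset(torch.from_numpy(docs), torch.from_numpy(labels))
loader = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True)

shapes = [tuple(x.shape) for x, _ in loader]
print(len(loader), shapes)  # 2 full batches; the 2 leftover documents are dropped
```

With `drop_last=True`, any incomplete final batch is discarded. This is convenient for training, but when the same setting is used for evaluation, the reported accuracy covers slightly fewer examples than the full set.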

### = = = = = Defining Architecture = = = = =
In this section, we define the HAN architecture. We start with the <i>AttentionBiGRU</i> module, which defines the sentence encoder (see Figure 3 in the handout). Then, we define the <i>TimeDistributed</i> module, which forwards our input (a batch of documents) to the sentence encoder as a <b>batch of sentences</b>, where each sentence in the document is treated as a time step. This module also reshapes the output into a batch of per-document time step representations. Finally, we define the <b>HAN</b> architecture using <i>TimeDistributed</i>, <i>AttentionWithContext</i> and <i>GRU</i>.

#### <b>Task 2.2:</b>

In [7]:

class AttentionBiGRU(nn.Module):
    def __init__(self, input_shape, n_units, index_to_word, dropout=0):
        super(AttentionBiGRU, self).__init__()
        self.embedding = nn.Embedding(2+len(index_to_word),# fill the gap # vocab size
                                      d, # dimensionality of embedding space
                                      padding_idx=0)
        self.dropout = nn.Dropout(drop_rate)
        self.gru = nn.GRU(input_size=d,  # embedding dimension; input_shape[2] is the sentence length, which only happens to equal d here
                          hidden_size=n_units,
                          num_layers=1,
                          bias=True,
                          batch_first=True,
                          bidirectional=True)
        self.attention = AttentionWithContext(2 * n_units,   # fill the gap # the input shape for the attention layer
                                              return_coefficients=True)


    def forward(self, sent_ints):
        sent_wv = self.embedding(sent_ints)
        sent_wv_dr = self.dropout(sent_wv)
        # We need to reshape the output
        sent_wv_dr_reshape = sent_wv_dr.view(sent_wv_dr.size(0) * sent_wv_dr.size(1), sent_wv_dr.size(2), sent_wv_dr.size(3))
        sent_wa, _ = self.gru(sent_wv_dr_reshape) # fill the gap # RNN layer
        sent_att_vec, word_att_coeffs = self.attention(sent_wa) # fill the gap # attentional vector for the sent
        # Reshape again
        sent_att_vec = sent_att_vec.view(sent_wv_dr.size(0), sent_wv_dr.size(1), sent_att_vec.size(-1))
        sent_att_vec_dr = self.dropout(sent_att_vec)
        # Reshape the output
        
        return sent_att_vec_dr, word_att_coeffs

class TimeDistributed(nn.Module):
    def __init__(self, module, batch_first=False):
        super(TimeDistributed, self).__init__()
        self.module = module
        self.batch_first = batch_first

    def forward(self, x):
        if len(x.size()) <= 2:
            return self.module(x)
        # Squash samples and timesteps into a single axis
        x_reshape = x.contiguous().view(-1, x.size(-1))  # (samples * timesteps, input_size) (224, 30)
        sent_att_vec_dr, word_att_coeffs = self.module(x_reshape)
        # We have to reshape the output
        if self.batch_first:
            sent_att_vec_dr = sent_att_vec_dr.contiguous().view(x.size(0), -1, sent_att_vec_dr.size(-1))  # (samples, timesteps, output_size)
            word_att_coeffs = word_att_coeffs.contiguous().view(x.size(0), -1, word_att_coeffs.size(-1))  # (samples, timesteps, output_size)
        else:
            sent_att_vec_dr = sent_att_vec_dr.view(-1, x.size(1), sent_att_vec_dr.size(-1))  # (timesteps, samples, output_size)
            word_att_coeffs = word_att_coeffs.view(-1, x.size(1), word_att_coeffs.size(-1))  # (timesteps, samples, output_size)
        return sent_att_vec_dr, word_att_coeffs      

class HAN(nn.Module):
    def __init__(self, input_shape, n_units, index_to_word, dropout=0):
        super(HAN, self).__init__()
        self.encoder = AttentionBiGRU(input_shape, n_units, index_to_word, dropout)
        self.timeDistributed = TimeDistributed(self.encoder, True)
        self.dropout = nn.Dropout(drop_rate)
        self.gru = nn.GRU(input_size=2*n_units,# fill the gap # the input shape of GRU layer
                          hidden_size=n_units,
                          num_layers=1,
                          bias=True,
                          batch_first=True,
                          bidirectional=True)
        self.attention = AttentionWithContext(2*n_units, # fill the gap # the input shape of between-sentence attention layer
                                              return_coefficients=True)
        self.lin_out = nn.Linear(2*n_units,   # fill the gap # the input size of the last linear layer
                                 1)
        self.preds = nn.Sigmoid()

    def forward(self, doc_ints):
        sent_att_vecs_dr, word_att_coeffs = self.encoder(doc_ints) # fill the gap # get sentence representation
        doc_sa, _ = self.gru(sent_att_vecs_dr)
        doc_att_vec, sent_att_coeffs = self.attention(doc_sa)
        doc_att_vec_dr = self.dropout(doc_att_vec)
        doc_att_vec_dr = self.lin_out(doc_att_vec_dr)
        return self.preds(doc_att_vec_dr), word_att_coeffs, sent_att_coeffs
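
The two reshapes in the sentence encoder are the core of the hierarchy: sentences are flattened into one large batch, encoded, then regrouped per document. The sketch below uses toy sizes and replaces the GRU and attention with a plain mean over words, purely as a stand-in, to show that the regrouping puts every sentence vector back with its document.

```python
import torch

batch, n_sents, n_words, d = 2, 3, 4, 5    # toy sizes, chosen only for illustration
x = torch.arange(batch * n_sents * n_words * d, dtype=torch.float32)
x = x.view(batch, n_sents, n_words, d)     # embedded documents

flat = x.view(batch * n_sents, n_words, d)  # every sentence becomes one sample
sent_vecs = flat.mean(dim=1)                # stand-in for the sentence encoder: (batch * n_sents, d)
docs = sent_vecs.view(batch, n_sents, d)    # one row of sentence vectors per document

print(flat.shape, docs.shape)
# Sentence 1 of document 1 sits at row 1 * n_sents + 1 of the flattened batch.
print(torch.equal(flat[1 * n_sents + 1], x[1, 1]))
```

The document-level GRU then reads `docs` along its second axis, so each sentence plays the role of a time step.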


### = = = = = Training = = = = =
In this section, we have two code cells. In the first one, we define our evaluation function to compute the training and validation accuracies. In the second one, we define our model, loss and optimizer, and train the model for <i>nb_epochs</i> epochs.<br>
<b>Bonus task:</b> use <a href="https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html" target="_blank">tensorboard</a> to visualize the loss and the validation accuracy during training.

#### <b>Task 2.3:</b>

In [8]:
def evaluate_accuracy(data_loader, verbose=True):
    model.eval()
    total_loss = 0.0
    ncorrect = ntotal = 0
    with torch.no_grad():
        for idx, data in enumerate(data_loader):
            # inference 
            output = model(data["document"].to(device))[0] 
            output = output[:, -1] # only last vector
            # total number of examples
            ntotal +=  output.shape[0]
            # number of correct predictions 
            predictions = torch.round(output)
            label = data['label'].to(device)
            ncorrect += torch.sum(predictions==label) #fill me # number of correct prediction - hint: use torch.sum 
        acc = ncorrect.item() / ntotal
        if verbose:
            print("validation accuracy: {:3.2f}".format(acc*100))
        return acc
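
The counting step can be checked by hand on a few values. In this sketch the probabilities and labels are hypothetical; rounding a sigmoid output amounts to thresholding it at 0.5.

```python
import torch

probs = torch.tensor([0.91, 0.20, 0.65, 0.48])   # hypothetical sigmoid outputs
labels = torch.tensor([1, 0, 0, 0])              # hypothetical ground truth

predictions = torch.round(probs)                 # 1 if the probability exceeds 0.5, else 0
ncorrect = torch.sum(predictions == labels).item()
accuracy = ncorrect / labels.shape[0]
print(accuracy)  # 0.75
```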

In [9]:
from tqdm import tqdm

model = HAN(input_size, n_units, index_to_word).to(device)
model = model.double()
lr = 0.001  # learning rate
criterion = torch.nn.BCELoss() # fill the gap, use Binary cross entropy from torch.nn: https://pytorch.org/docs/stable/nn.html#loss-functions
optimizer = torch.optim.Adam(model.parameters(), lr=lr) #fill me

def train(x_train=my_docs_array_train,
          y_train=my_labels_array_train,
          x_test=my_docs_array_test,
          y_test=my_labels_array_test,
          word_dict=index_to_word,
          batch_size=batch_size):
  
    train_data = get_loader(x_train, y_train, batch_size)
    test_data = get_loader(x_test, y_test, batch_size)

    best_validation_acc = 0.0
    p = 0 # patience

    for epoch in range(1, nb_epochs + 1): 
        losses = []
        accuracies = []
        with tqdm(train_data, unit="batch") as tepoch:
            for idx, data in enumerate(tepoch):
                tepoch.set_description(f"Epoch {epoch}")
                model.train()
                optimizer.zero_grad()
                input = data['document'].to(device)
                label = data['label'].to(device)
                label = label.double()
                output = model.forward(input)[0]
                output = output[:, -1]
                loss = criterion(output, label) # fill the gap # compute the loss
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5) # prevent exploding gradient 
                optimizer.step()

                losses.append(loss.item())
                accuracy = torch.sum(torch.round(output) == label).item() / batch_size
                accuracies.append(accuracy)
                tepoch.set_postfix(loss=sum(losses)/len(losses), accuracy=100. * sum(accuracies)/len(accuracies))

        train_acc = evaluate_accuracy(train_data, False)
        test_acc = evaluate_accuracy(test_data, False)
        print("===> Epoch {} Complete: Avg. Loss: {:.4f}, Validation Accuracy: {:3.2f}%"
              .format(epoch, sum(losses)/len(losses), 100.*test_acc))
        if test_acc >= best_validation_acc:
            best_validation_acc = test_acc
            print("Validation accuracy improved, saving model...")
            torch.save(model.state_dict(), './best_model.pt')
            p = 0
            print()
        else:
            p += 1
            if p==my_patience:
                # only a message is printed: without a `break` after it, the loop still runs for all nb_epochs
                print("Validation accuracy did not improve for {} epochs, stopping training...".format(my_patience))
    print("Loading best checkpoint...")    
    model.load_state_dict(torch.load('./best_model.pt'))
    model.eval()
    print('done.')

train()

Epoch 1: 100%|█████████████████████████████████████████| 390/390 [00:12<00:00, 31.90batch/s, accuracy=58.3, loss=0.666]


===> Epoch 1 Complete: Avg. Loss: 0.6660, Validation Accuracy: 68.11%
Validation accuracy improved, saving model...



Epoch 2: 100%|██████████████████████████████████████████| 390/390 [00:11<00:00, 33.30batch/s, accuracy=70.2, loss=0.57]


===> Epoch 2 Complete: Avg. Loss: 0.5700, Validation Accuracy: 75.97%
Validation accuracy improved, saving model...



Epoch 3: 100%|█████████████████████████████████████████| 390/390 [00:11<00:00, 34.02batch/s, accuracy=74.8, loss=0.511]


===> Epoch 3 Complete: Avg. Loss: 0.5105, Validation Accuracy: 78.48%
Validation accuracy improved, saving model...



Epoch 4: 100%|█████████████████████████████████████████| 390/390 [00:11<00:00, 34.37batch/s, accuracy=78.5, loss=0.461]


===> Epoch 4 Complete: Avg. Loss: 0.4608, Validation Accuracy: 80.91%
Validation accuracy improved, saving model...



Epoch 5: 100%|█████████████████████████████████████████| 390/390 [00:11<00:00, 33.96batch/s, accuracy=80.4, loss=0.428]


===> Epoch 5 Complete: Avg. Loss: 0.4276, Validation Accuracy: 82.01%
Validation accuracy improved, saving model...



Epoch 6: 100%|█████████████████████████████████████████| 390/390 [00:11<00:00, 33.61batch/s, accuracy=82.2, loss=0.396]


===> Epoch 6 Complete: Avg. Loss: 0.3956, Validation Accuracy: 82.73%
Validation accuracy improved, saving model...



Epoch 7: 100%|█████████████████████████████████████████| 390/390 [00:11<00:00, 33.47batch/s, accuracy=83.1, loss=0.379]


===> Epoch 7 Complete: Avg. Loss: 0.3786, Validation Accuracy: 82.92%
Validation accuracy improved, saving model...



Epoch 8: 100%|██████████████████████████████████████████| 390/390 [00:11<00:00, 34.08batch/s, accuracy=84.3, loss=0.36]


===> Epoch 8 Complete: Avg. Loss: 0.3599, Validation Accuracy: 83.65%
Validation accuracy improved, saving model...



Epoch 9: 100%|█████████████████████████████████████████| 390/390 [00:11<00:00, 34.55batch/s, accuracy=85.3, loss=0.344]


===> Epoch 9 Complete: Avg. Loss: 0.3436, Validation Accuracy: 84.05%
Validation accuracy improved, saving model...



Epoch 10: 100%|████████████████████████████████████████| 390/390 [00:11<00:00, 33.44batch/s, accuracy=85.6, loss=0.331]


===> Epoch 10 Complete: Avg. Loss: 0.3314, Validation Accuracy: 82.72%


Epoch 11: 100%|████████████████████████████████████████| 390/390 [00:11<00:00, 34.61batch/s, accuracy=86.8, loss=0.315]


===> Epoch 11 Complete: Avg. Loss: 0.3146, Validation Accuracy: 83.70%
Validation accuracy did not improve for 2 epochs, stopping training...


Epoch 12: 100%|████████████████████████████████████████| 390/390 [00:11<00:00, 34.40batch/s, accuracy=87.3, loss=0.304]


===> Epoch 12 Complete: Avg. Loss: 0.3041, Validation Accuracy: 83.93%


Epoch 13: 100%|████████████████████████████████████████| 390/390 [00:11<00:00, 34.05batch/s, accuracy=87.9, loss=0.292]


===> Epoch 13 Complete: Avg. Loss: 0.2923, Validation Accuracy: 84.48%
Validation accuracy improved, saving model...



Epoch 14: 100%|█████████████████████████████████████████| 390/390 [00:11<00:00, 33.73batch/s, accuracy=88.3, loss=0.28]


===> Epoch 14 Complete: Avg. Loss: 0.2805, Validation Accuracy: 84.78%
Validation accuracy improved, saving model...



Epoch 15: 100%|████████████████████████████████████████| 390/390 [00:11<00:00, 33.80batch/s, accuracy=88.7, loss=0.276]


===> Epoch 15 Complete: Avg. Loss: 0.2757, Validation Accuracy: 84.61%
Loading best checkpoint...
done.


### = = = = = Extraction of Attention Coefficients = = = = =
In this section, we will extract and display the attention coefficients on two levels: sentence level and word level. To do so, we will extract the corresponding weights from our model.
#### <b>Task 3:</b>

In [24]:
# select the last five reviews (only the first of them is decoded and displayed below):
my_review = my_docs_array_test[-5:,:,:]
 
# convert integer review to text:
index_to_word[1] = 'OOV'
my_review_text = [[index_to_word[idx] for idx in sent if idx in index_to_word] for sent in my_review.tolist()[0]]

print(my_review_text)

[['My', 'qualifications', 'for', 'this', 'review', '?'], ['I', 'own', 'all', 'the', 'Alien', 'and', 'Predator', 'movies', '&', 'I', 'have', 'and', 'have', 'read', 'almost', 'all', 'the', 'books', 'I', 'can', 'find', 'that', 'are', 'related', 'to', 'this', 'series', '.'], ['I', 'can', 'safely', 'say', ',', 'this', 'movie', 'is', 'a', 'OOV', '.'], ['Save', 'your', 'money', '&', 'do', "n't", 'waste', 'your', 'time', '.'], ['If', 'you', 'like', 'mindless', 'action', ',', 'mindless', 'gore', ',', 'no', 'plot', 'to', 'speak', 'of', '&', 'like', 'being', 'taken', 'by', 'Hollywood', ',', 'see', 'the', 'movie', '.'], ['If', 'you', 'are', 'a', 'serious', 'Alien', 'series', 'fan', ',', 'send', 'a', 'message', 'to', 'the', 'over', 'stuffed', ',', 'over', 'paid', 'suits', 'in', 'Hollywood', '&', '20th', 'Century', 'Fox', '&', 'do', "n't", 'give'], ['This', 'movie', 'has', 'so', 'many', 'plot', 'holes', 'in', 'it', 'you', 'could', 'OOV', 'pasta', 'through', 'it', '.']]


###   &emsp;&emsp;  = = = = = Attention Over Sentences in the Document = = = = =

In [25]:
my_input = torch.tensor(my_review).to(device)
sent_coeffs = model.forward(my_input)[2]
sent_coeffs = sent_coeffs[0,:,:]

for elt in zip(sent_coeffs[:,0].tolist(),[' '.join(elt) for elt in my_review_text]):
    print(round(elt[0]*100,2),elt[1])

7.8 My qualifications for this review ?
8.09 I own all the Alien and Predator movies & I have and have read almost all the books I can find that are related to this series .
14.0 I can safely say , this movie is a OOV .
35.25 Save your money & do n't waste your time .
7.53 If you like mindless action , mindless gore , no plot to speak of & like being taken by Hollywood , see the movie .
8.41 If you are a serious Alien series fan , send a message to the over stuffed , over paid suits in Hollywood & 20th Century Fox & do n't give
18.92 This movie has so many plot holes in it you could OOV pasta through it .
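
To make the sentence-level weights easier to read, they can be sorted. The sketch below copies the percentages from the output above into a plain list and shortens the sentences by hand, so the labels are only abbreviations.

```python
import operator

# Sentence-level attention weights (in %), copied from the printed output.
weights = [7.8, 8.09, 14.0, 35.25, 7.53, 8.41, 18.92]
labels = ["My qualifications ...", "I own all ...", "I can safely say ...",
          "Save your money ...", "If you like mindless ...",
          "If you are a serious ...", "This movie has so many ..."]

# Rank sentences from most to least attended.
ranked = sorted(zip(weights, labels), key=operator.itemgetter(0), reverse=True)
for weight, label in ranked[:3]:
    print(f"{weight:5.2f}  {label}")

print(round(sum(weights), 2))  # 100.0, the weights form a distribution over sentences
```

In this review, the three highest-weighted sentences are the ones that carry the most explicit negative judgement.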


### &emsp;&emsp; = = = = = Attention Over Words in Each Sentence = = = = =

In [26]:
word_coeffs = model.forward(my_input)[1]

word_coeffs_list = word_coeffs.reshape(-1, my_input.size(-1)).tolist()  # (5 reviews * 7 sentences, 30 words) = (35, 30)

# match text and coefficients:
text_word_coeffs = [list(zip(words,word_coeffs_list[idx][:len(words)])) for idx,words in enumerate(my_review_text)]

for sent in text_word_coeffs:
    [print(elt) for elt in sent]
    print('= = = =')

# sort words by importance within each sentence:
text_word_coeffs_sorted = [sorted(elt,key=operator.itemgetter(1),reverse=True) for elt in text_word_coeffs]

for sent in text_word_coeffs_sorted:
    [print(elt) for elt in sent]
    print('= = = =')

('My', 0.04322487144668034)
('qualifications', 0.05740911467985909)
('for', 0.04221436426902895)
('this', 0.03762763882790029)
('review', 0.049934280279544746)
('?', 0.052032918559249165)
= = = =
('I', 0.03880938749658358)
('own', 0.028227324938628638)
('all', 0.034486170050654776)
('the', 0.027872352404946676)
('Alien', 0.09715767890220363)
('and', 0.03817438862846461)
('Predator', 0.08228158579628778)
('movies', 0.027617303806024354)
('&', 0.025068116198917266)
('I', 0.02423523855417599)
('have', 0.023272346319724493)
('and', 0.02437903014839657)
('have', 0.02206714496967554)
('read', 0.03671260060159055)
('almost', 0.025935296927112387)
('all', 0.02496695920367788)
('the', 0.019916440807830983)
('books', 0.02385582314384015)
('I', 0.024415302339321095)
('can', 0.02359575416356501)
('find', 0.028677525710348924)
('that', 0.03302017472057713)
('are', 0.025687937998548405)
('related', 0.02511677151646435)
('to', 0.02409256398439847)
('this', 0.026458235839321772)
('series', 0.063104984