<center><h2>ALTeGraD 2022<br>Lab Session 1: HAN</h2><h3>Hierarchical Attention Network Using GRU</h3> 27 / 10 / 2022<br> M. Kamal Eddine, H. Abdine<br><br>


<b>Student name:</b> Sicheng MAO


</center>
In this lab, you will get familiar with recurrent neural networks (RNNs), self-attention, and the HAN architecture <b>(Yang et al. 2016)</b> using PyTorch. In this architecture, sentence embeddings are first individually produced, and a document embedding is then computed from the sentence embeddings.<br>
<b>The deadline for this lab is November 14, 2022 11:59 PM.</b> More details about the submission and the architecture for this lab can be found in the handout PDF.


### = = = = =  Attention Layer = = = = =
In this section, you will fill the gaps in the code to implement the self-attention layer, which is used later to define the HAN architecture. The basic idea behind attention is that, rather than taking the last annotation $h_T$ as a summary of the entire sequence, which is prone to information loss, the annotations at <i>all</i> time steps are used.
The self-attention mechanism computes a weighted sum of the annotations, where the weights are determined by trainable parameters. Refer to <b>section 2.2</b> of the handout for the theory; you will need it to complete the first task.
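
To make the weighted sum concrete, here is a minimal sketch in plain Python (lists instead of tensors, so it runs without PyTorch). The helper name `attention_pool` and the toy numbers are assumptions for illustration; in the actual layer the scores are produced by the trainable parameters rather than given by hand.

```python
import math

def attention_pool(annotations, scores):
    """Softmax-normalize the scores and return the weighted sum of annotations.

    annotations: list of T vectors (lists of floats), one per time step
    scores: list of T unnormalized alignment scores (hand-picked here)
    """
    exp_scores = [math.exp(s) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]  # attention coefficients, sum to 1
    dim = len(annotations[0])
    summary = [sum(w * h[k] for w, h in zip(weights, annotations)) for k in range(dim)]
    return summary, weights

# toy annotations for T = 3 time steps, each of dimension 2
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# the last step gets twice the unnormalized weight of the others
scores = [0.0, 0.0, math.log(2.0)]
summary, weights = attention_pool(annotations, scores)
print(weights)
print(summary)
```

Every time step contributes to `summary`, in proportion to its coefficient, instead of only the last one.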

#### <b>Task 1:</b>

In [1]:
import torch
from torch import nn
from torch.utils.data import DataLoader

In [2]:
class AttentionWithContext(nn.Module):
    """
    Follows the work of Yang et al. [https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf]
    "Hierarchical Attention Networks for Document Classification"
    by using a context vector to assist the attention
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    """

    def __init__(self, input_shape, return_coefficients=False, bias=True):
        super(AttentionWithContext, self).__init__()
        self.return_coefficients = return_coefficients

        self.W = nn.Linear(input_shape, input_shape, bias=bias)
        self.tanh = nn.Tanh()
        self.u = nn.Linear(input_shape, 1, bias=False)

        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        self.W.weight.data.uniform_(-initrange, initrange)
        self.W.bias.data.uniform_(-initrange, initrange)
        self.u.weight.data.uniform_(-initrange, initrange)

    def generate_square_subsequent_mask(self, sz):
        # do not pass the mask to the next layers
        # torch.triu returns the upper triangular part of a matrix
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        # () are used for line breaking, alternatively use \ explicitly.
        mask = (
            mask.float()
            .masked_fill(mask == 0, float("-inf"))
            .masked_fill(mask == 1, float(0.0))
        )
        # can be replaced by
        # mask = torch.triu(torch.ones(sz,sz)).transpose(0,1)
        # mask = mask.masked_fill(mask == 0, float("-inf")).masked_fill(mask == 1, float(0.0))
        return mask

    def forward(self, x, mask=None):
        # compute uit = W . x  where x represents ht
        uit = self.W(x)  # fill the gap
        uit = self.tanh(uit)
        ait = self.u(uit)
        a = torch.exp(ait)

        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            a = a*mask.double()

        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        eps = 1e-9
        a = a / (torch.sum(a, axis=1, keepdim=True) + eps)
        weighted_input = a * x  # compute the attentional vector
        if self.return_coefficients:
            return weighted_input.sum(axis=1), a  # [attentional vector, coefficients] ### use torch.sum to compute s
        else:
            return weighted_input.sum(axis=1)  # attentional vector only ###

### = = = = = Parameters = = = = =
In this section, we define the parameters used for training, such as the data path, the embedding dimension <b>d</b>, the GRU layer dimensionality <b>n_units</b>, etc.<br>
The parameter <b>device</b> is used to train the model on a GPU if one is available. If you are using Google Colab, switch to a GPU runtime to train the model at maximum speed.<br>
<b>Bonus question:</b> What is the purpose of the parameter <i>my_patience</i>?
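
As a hint, a patience counter is commonly used for early stopping: training halts once the validation metric has failed to improve for a given number of consecutive epochs. The sketch below is a plain-Python illustration under that assumption; the function name `epochs_run_with_patience` and the accuracy history are hypothetical, not results from this lab.

```python
def epochs_run_with_patience(val_accuracies, patience):
    """Return (epochs actually run, best accuracy) under patience-based early stopping."""
    best = float("-inf")
    stale = 0  # consecutive epochs without improvement
    epochs_run = 0
    for acc in val_accuracies:
        epochs_run += 1
        if acc >= best:
            best = acc
            stale = 0  # improvement: reset the counter
        else:
            stale += 1
            if stale == patience:
                break  # no improvement for `patience` epochs in a row
    return epochs_run, best

# hypothetical validation accuracies, one per epoch
history = [0.70, 0.76, 0.75, 0.78, 0.77, 0.76, 0.80]
epochs_run, best = epochs_run_with_patience(history, patience=2)
print(epochs_run, best)
```

With a patience of 2, this run stops after the sixth epoch and never reaches the seventh, which is the trade-off of a small patience value.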

In [3]:
import sys
import json
import operator
import numpy as np

path_root = ''
path_to_data = path_root + 'data/'

d = 30 # dimensionality of word embeddings
n_units = 50 # RNN layer dimensionality
drop_rate = 0.5 # dropout
mfw_idx = 2 # index of the most frequent words in the dictionary 
            # 0 is for the special padding token
            # 1 is for the special out-of-vocabulary token

padding_idx = 0
oov_idx = 1
batch_size = 64
nb_epochs = 15
my_patience = 2 # for early stopping strategy
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### = = = = = Data Loading = = = = =
In this section, we first use <b>wget</b> to download the data, then load it with NumPy in the first cell. In the second cell, we use this data to define our PyTorch data loader. Note that the data is already preprocessed, tokenized and padded.<br><br>
<b>Note: if you are running your notebook on Windows or macOS, <i>wget</i> will probably not work unless you have installed it manually. In this case, use the provided link to download the data and change <i>path_to_data</i> in the <i>Parameters</i> section accordingly. On Ubuntu and Google Colab, it should work without any problem.</b>

#### <b>Task 2.1:</b>

In [4]:
# !wget -c "https://onedrive.live.com/download?cid=AE69638675180117&resid=AE69638675180117%2199289&authkey=AHgxt3xmgG0Fu5A" -O "data.zip"
# !unzip data.zip

my_docs_array_train = np.load(path_to_data + 'docs_train.npy')
my_docs_array_test = np.load(path_to_data + 'docs_test.npy')

my_labels_array_train = np.load(path_to_data + 'labels_train.npy')
my_labels_array_test = np.load(path_to_data + 'labels_test.npy')

# load dictionary of word indexes (sorted by decreasing frequency across the corpus)
with open(path_to_data + 'word_to_index.json', 'r') as my_file:
    word_to_index = json.load(my_file)

# invert mapping
index_to_word = {index: word for (word, index) in word_to_index.items()} ### fill the gap (use a dict comprehension) ###
input_size = my_docs_array_train.shape

In [5]:
import numpy
from torch.utils.data import Dataset

class Dataset_(Dataset):
    def __init__(self, x, y):
        self.documents = x
        self.labels = y

    def __len__(self):
        return len(self.documents)

    def __getitem__(self, index):
        document = self.documents[index]
        label = self.labels[index] 
        sample = {
            "document": torch.tensor(document),
            "label": torch.tensor(label),
            }
        return sample


def get_loader(x, y, batch_size=32):
    dataset = Dataset_(x, y)
    data_loader = DataLoader(dataset=dataset,
                            batch_size=batch_size,
                            shuffle=True,
                            pin_memory=True,
                            drop_last=True,
                            )
    return data_loader

### = = = = = Defining Architecture = = = = =
In this section, we define the HAN architecture. We start with the <i>AttentionBiGRU</i> module, which defines the sentence encoder (see Figure 3 in the handout). Then, we define the <i>TimeDistributed</i> module, which forwards our input (a batch of documents) to the sentence encoder as a <b>batch of sentences</b>, where each sentence in the document is treated as a time step. This module also reshapes the output into a batch of per-document sequences of sentence representations. Finally, we define the <b>HAN</b> architecture using <i>TimeDistributed</i>, <i>AttentionWithContext</i> and <i>GRU</i>.
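
The flatten-encode-regroup idea behind <i>TimeDistributed</i> can be sketched with nested Python lists instead of tensors. This is only an illustration: `time_distributed` and the stand-in encoder `count_tokens` are hypothetical names, and the real module works on tensors with `view` rather than list slicing.

```python
def time_distributed(encode_sentence, docs):
    """Apply a sentence encoder to every sentence of every document.

    docs: list of documents, each a list of sentences (same count per document),
          each sentence a list of word indices.
    """
    n_docs, n_sents = len(docs), len(docs[0])
    # squash documents and sentences into a single "batch of sentences"
    flat = [sent for doc in docs for sent in doc]
    encoded = [encode_sentence(sent) for sent in flat]
    # regroup the encoded sentences per document
    return [encoded[i * n_sents:(i + 1) * n_sents] for i in range(n_docs)]

# hypothetical stand-in for the sentence encoder: count non-padding tokens
def count_tokens(sent):
    return sum(1 for idx in sent if idx != 0)

# 2 documents, 2 sentences each, 3 word slots per sentence (0 = padding)
docs = [
    [[5, 3, 0], [7, 0, 0]],
    [[2, 2, 2], [0, 0, 0]],
]
out = time_distributed(count_tokens, docs)
print(out)
```

The encoder never sees document boundaries; they are restored only when the outputs are regrouped.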

#### <b>Task 2.2:</b>

In [6]:
nn.GRU

torch.nn.modules.rnn.GRU

In [7]:
class AttentionBiGRU(nn.Module):
    def __init__(self, input_shape, n_units, index_to_word, dropout=0):
        super(AttentionBiGRU, self).__init__()
        self.embedding = nn.Embedding(len(index_to_word)+2,  # vocab size
                                      d,  # dimensionality of embedding space
                                      padding_idx=0)
        self.dropout = nn.Dropout(drop_rate)
        self.gru = nn.GRU(input_size=d,
                          hidden_size=n_units,
                          num_layers=1,
                          bias=True,
                          batch_first=True,
                          bidirectional=True)
        self.attention = AttentionWithContext(2 * n_units,   # the input shape for the attention layer == output shape of GRU
                                              return_coefficients=True)

    def forward(self, sent_ints):
        sent_wv = self.embedding(sent_ints)
        sent_wv_dr = self.dropout(sent_wv)
        sent_wa, _ = self.gru(sent_wv_dr)  # fill the gap # RNN layer
        sent_att_vec, word_att_coeffs = self.attention(sent_wa)  # attentional vector for the sent
        sent_att_vec_dr = self.dropout(sent_att_vec)
        return sent_att_vec_dr, word_att_coeffs


class TimeDistributed(nn.Module): # mimic keras TimeDistributed layer
    def __init__(self, module, batch_first=False):
        super(TimeDistributed, self).__init__()
        self.module = module
        self.batch_first = batch_first

    def forward(self, x):
        if len(x.size()) <= 2:
            return self.module(x)
        # Squash samples and timesteps into a single axis
        x_reshape = x.contiguous().view(-1, x.size(-1))  # (samples * timesteps, input_size) (448, 30)
        sent_att_vec_dr, word_att_coeffs = self.module(x_reshape)
        # We have to reshape the output
        if self.batch_first:
            sent_att_vec_dr = sent_att_vec_dr.contiguous().view(x.size(0), -1, sent_att_vec_dr.size(-1))  # (samples, timesteps, output_size)
            word_att_coeffs = word_att_coeffs.contiguous().view(x.size(0), -1, word_att_coeffs.size(-1))  # (samples, timesteps, output_size)
        else:
            sent_att_vec_dr = sent_att_vec_dr.view(-1, x.size(1), sent_att_vec_dr.size(-1))  # (timesteps, samples, output_size)
            word_att_coeffs = word_att_coeffs.view(-1, x.size(1), word_att_coeffs.size(-1))  # (timesteps, samples, output_size)
        return sent_att_vec_dr, word_att_coeffs


class HAN(nn.Module):
    def __init__(self, input_shape, n_units, index_to_word, dropout=0):
        super(HAN, self).__init__()
        self.encoder = AttentionBiGRU(input_shape, n_units, index_to_word, dropout)
        self.timeDistributed = TimeDistributed(self.encoder, True)
        self.dropout = nn.Dropout(drop_rate)
        self.gru = nn.GRU(input_size=2 * n_units,  # the input shape of GRU layer
                          hidden_size=n_units,
                          num_layers=1,
                          bias=True,
                          batch_first=True,
                          bidirectional=True)
        self.attention = AttentionWithContext(2 * n_units,  # fill the gap # the input shape of between-sentence attention layer
                                              return_coefficients=True)
        self.lin_out = nn.Linear(2 * n_units,  # fill the gap # the input size of the last linear layer
                                 1)
        self.preds = nn.Sigmoid()

    def forward(self, doc_ints):
        sent_att_vecs_dr, word_att_coeffs = self.timeDistributed(doc_ints) # get sentence representation
        doc_sa, _ = self.gru(sent_att_vecs_dr)
        doc_att_vec, sent_att_coeffs = self.attention(doc_sa)
        doc_att_vec_dr = self.dropout(doc_att_vec)
        doc_att_vec_dr = self.lin_out(doc_att_vec_dr)
        return self.preds(doc_att_vec_dr), word_att_coeffs, sent_att_coeffs


### = = = = = Training = = = = =
In this section, we have two code cells. In the first, we define our evaluation function, which computes the training and validation accuracies. In the second, we define our model, loss and optimizer, and train the model for <i>nb_epochs</i> epochs.<br>
<b>Bonus task:</b> use <a href="https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html" target="_blank">tensorboard</a> to visualize the loss and the validation accuracy during training.
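
Since the model ends with a sigmoid, accuracy is obtained by thresholding the output probabilities and comparing them with the labels. The sketch below shows that computation on plain Python lists; `binary_accuracy` and the sample values are hypothetical. It uses a strict `> 0.5` threshold on the assumption that rounding sends an output of exactly 0.5 to 0.

```python
def binary_accuracy(probabilities, labels):
    """Fraction of examples whose thresholded probability matches the label."""
    predictions = [1 if p > 0.5 else 0 for p in probabilities]
    n_correct = sum(1 for pred, y in zip(predictions, labels) if pred == y)
    return n_correct / len(labels)

# hypothetical sigmoid outputs and ground-truth labels for 4 documents
probabilities = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 0, 0]
acc = binary_accuracy(probabilities, labels)
print("accuracy: {:3.2f}".format(acc * 100))
```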

#### <b>Task 2.3:</b>

In [8]:
def evaluate_accuracy(data_loader, verbose=True):
    model.eval()
    total_loss = 0.0
    ncorrect = ntotal = 0
    with torch.no_grad():
        for idx, data in enumerate(data_loader):
            # inference
            output = model(data["document"].to(device))[0]
            output = output[:, -1]  # only last vector
            # total number of examples
            ntotal += output.shape[0]
            # number of correct predictions
            predictions = torch.round(output)
            ncorrect += torch.sum(data['label'].to(device) == predictions)  # number of correct prediction - hint: use torch.sum 
        acc = ncorrect.item() / ntotal
        if verbose:
            print("validation accuracy: {:3.2f}".format(acc*100))
        return acc

In [9]:
from tqdm import tqdm

In [10]:
my_docs_array_train[0].shape

(7, 30)

In [11]:
input_size

(25000, 7, 30)

In [12]:
model = HAN(input_size, n_units, index_to_word).to(device)
model = model.double()
lr = 0.001  # learning rate
criterion = torch.nn.BCELoss()  # fill the gap, use Binary cross entropy from torch.nn: https://pytorch.org/docs/stable/nn.html#loss-functions
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

def train(x_train=my_docs_array_train,
          y_train=my_labels_array_train,
          x_test=my_docs_array_test,
          y_test=my_labels_array_test,
          word_dict=index_to_word,
          batch_size=batch_size):

    train_data = get_loader(x_train, y_train, batch_size)
    test_data = get_loader(x_test, y_test, batch_size)

    best_validation_acc = 0.0
    p = 0  # patience

    for epoch in range(1, nb_epochs + 1):
        losses = []
        accuracies = []
        with tqdm(train_data, unit="batch") as tepoch:
            for idx, data in enumerate(tepoch):
                tepoch.set_description(f"Epoch {epoch}")
                model.train()
                optimizer.zero_grad()
                input = data['document'].to(device)
                # print(input.shape)
                # break
                label = data['label'].to(device)
                label = label.double()
                output = model.forward(input)[0]
                output = output[:, -1]
                loss = criterion(output, label)  # compute the loss
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)  # prevent exploding gradient 
                optimizer.step()

                losses.append(loss.item())
                accuracy = torch.sum(torch.round(output) == label).item() / batch_size
                accuracies.append(accuracy)
                tepoch.set_postfix(loss=sum(losses)/len(losses), accuracy=100. * sum(accuracies)/len(accuracies))

        # train_acc = evaluate_accuracy(train_data, False)
        test_acc = evaluate_accuracy(test_data, False)
        print("===> Epoch {} Complete: Avg. Loss: {:.4f}, Validation Accuracy: {:3.2f}%"
              .format(epoch, sum(losses)/len(losses), 100.*test_acc))
        if test_acc >= best_validation_acc:
            best_validation_acc = test_acc
            print("Validation accuracy improved, saving model...")
            torch.save(model.state_dict(), './best_model.pt')
            p = 0
            print()
        else:
            p += 1
            if p == my_patience:
                print("Validation accuracy did not improve for {} epochs, stopping training...".format(my_patience))
                break  # actually stop training once patience is exhausted
    print("Loading best checkpoint...")    
    model.load_state_dict(torch.load('./best_model.pt'))
    model.eval()
    print('done.')

train()

Epoch 1: 100%|██████████| 390/390 [00:10<00:00, 35.71batch/s, accuracy=60.9, loss=0.649]


===> Epoch 1 Complete: Avg. Loss: 0.6494, Validation Accuracy: 70.36%
Validation accuracy improved, saving model...



Epoch 2: 100%|██████████| 390/390 [00:10<00:00, 36.06batch/s, accuracy=71.2, loss=0.561]


===> Epoch 2 Complete: Avg. Loss: 0.5606, Validation Accuracy: 76.10%
Validation accuracy improved, saving model...



Epoch 3: 100%|██████████| 390/390 [00:10<00:00, 36.30batch/s, accuracy=75.6, loss=0.503]


===> Epoch 3 Complete: Avg. Loss: 0.5030, Validation Accuracy: 77.07%
Validation accuracy improved, saving model...



Epoch 4: 100%|██████████| 390/390 [00:10<00:00, 36.23batch/s, accuracy=78.5, loss=0.46] 


===> Epoch 4 Complete: Avg. Loss: 0.4599, Validation Accuracy: 80.59%
Validation accuracy improved, saving model...



Epoch 5: 100%|██████████| 390/390 [00:10<00:00, 35.95batch/s, accuracy=80.3, loss=0.429]


===> Epoch 5 Complete: Avg. Loss: 0.4287, Validation Accuracy: 81.81%
Validation accuracy improved, saving model...



Epoch 6: 100%|██████████| 390/390 [00:11<00:00, 35.42batch/s, accuracy=82, loss=0.403]  


===> Epoch 6 Complete: Avg. Loss: 0.4030, Validation Accuracy: 82.62%
Validation accuracy improved, saving model...



Epoch 7: 100%|██████████| 390/390 [00:10<00:00, 35.69batch/s, accuracy=83.1, loss=0.381]


===> Epoch 7 Complete: Avg. Loss: 0.3807, Validation Accuracy: 81.73%


Epoch 8: 100%|██████████| 390/390 [00:10<00:00, 35.50batch/s, accuracy=84.1, loss=0.362]


===> Epoch 8 Complete: Avg. Loss: 0.3620, Validation Accuracy: 83.38%
Validation accuracy improved, saving model...



Epoch 9: 100%|██████████| 390/390 [00:11<00:00, 35.30batch/s, accuracy=85.4, loss=0.341]


===> Epoch 9 Complete: Avg. Loss: 0.3412, Validation Accuracy: 83.96%
Validation accuracy improved, saving model...



Epoch 10: 100%|██████████| 390/390 [00:11<00:00, 35.30batch/s, accuracy=86.2, loss=0.322]


===> Epoch 10 Complete: Avg. Loss: 0.3223, Validation Accuracy: 84.32%
Validation accuracy improved, saving model...



Epoch 11: 100%|██████████| 390/390 [00:11<00:00, 35.12batch/s, accuracy=86.5, loss=0.317]


===> Epoch 11 Complete: Avg. Loss: 0.3166, Validation Accuracy: 83.68%


Epoch 12: 100%|██████████| 390/390 [00:11<00:00, 35.12batch/s, accuracy=87.3, loss=0.301]


===> Epoch 12 Complete: Avg. Loss: 0.3013, Validation Accuracy: 84.68%
Validation accuracy improved, saving model...



Epoch 13: 100%|██████████| 390/390 [00:11<00:00, 34.95batch/s, accuracy=87.8, loss=0.289]


===> Epoch 13 Complete: Avg. Loss: 0.2891, Validation Accuracy: 84.79%
Validation accuracy improved, saving model...



Epoch 14: 100%|██████████| 390/390 [00:11<00:00, 34.80batch/s, accuracy=88.3, loss=0.28] 


===> Epoch 14 Complete: Avg. Loss: 0.2798, Validation Accuracy: 84.85%
Validation accuracy improved, saving model...



Epoch 15: 100%|██████████| 390/390 [00:11<00:00, 34.71batch/s, accuracy=89, loss=0.267]  


===> Epoch 15 Complete: Avg. Loss: 0.2674, Validation Accuracy: 84.23%
Loading best checkpoint...
done.


### = = = = = Extraction of Attention Coefficients = = = = =
In this section, we will extract and display the attention coefficients on two levels: sentence level and word level. To do so, we will extract the corresponding weights from our model.
#### <b>Task 3:</b>

In [13]:
# select last review:
my_review = my_docs_array_test[-1:, :, :]
# convert integer review to text:
index_to_word[1] = 'OOV'
my_review_text = [[index_to_word[idx] for idx in sent if idx in index_to_word] for sent in my_review.tolist()[0]]

In [14]:
print(my_review.shape)
print(type(my_review))

(1, 7, 30)
<class 'numpy.ndarray'>


###   &emsp;&emsp;  = = = = = Attention Over Sentences in the Document = = = = =

In [15]:
sent_coeffs = model.forward(torch.Tensor(my_review).int().to(device))[2]  # get sentence attention coeffs by passing the review to the model - (you need to convert the input to a torch tensor)
sent_coeffs = sent_coeffs[0,:,:]

for elt in zip(sent_coeffs[:,0].tolist(),[' '.join(elt) for elt in my_review_text]):
    print(round(elt[0]*100,2),elt[1])

6.26 There 's a sign on The Lost Highway that says : OOV SPOILERS OOV ( but you already knew that , did n't you ? )
7.48 Since there 's a great deal of people that apparently did not get the point of this movie , I 'd like to contribute my interpretation of why the plot
9.92 As others have pointed out , one single viewing of this movie is not sufficient .
17.47 If you have the DVD of MD , you can OOV ' by looking at David Lynch 's 'Top 10 OOV to OOV MD ' ( but only upon second
22.55 ; ) First of all , Mulholland Drive is downright brilliant .
26.68 A masterpiece .
9.65 This is the kind of movie that refuse to leave your head .


In [17]:
import matplotlib.pyplot as plt

### &emsp;&emsp; = = = = = Attention Over Words in Each Sentence = = = = =

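Notice that, due to padding, the word attention coefficients displayed for short sentences sum to less than 1: the attention layer is applied without a mask, so part of the weight goes to padding positions, which are not shown.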
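
The effect of padding on the coefficients can be reproduced with a small plain-Python sketch. The scores below are hypothetical, and the masked variant mirrors the idea of zeroing padded positions after the exponential and before renormalizing; it is an illustration, not the notebook's exact computation.

```python
import math

def softmax(scores, mask=None):
    """Normalize exp(scores); positions with mask value 0 receive zero weight."""
    exps = [math.exp(s) for s in scores]
    if mask is not None:
        exps = [e * m for e, m in zip(exps, mask)]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical scores for a sentence padded to length 5: 3 real words, 2 padding slots
scores = [1.0, 2.0, 0.5, 0.0, 0.0]
n_words = 3

weights = softmax(scores)
shown = sum(weights[:n_words])  # only the real words are displayed

masked_weights = softmax(scores, mask=[1, 1, 1, 0, 0])
shown_masked = sum(masked_weights[:n_words])

print(shown, shown_masked)
```

Without a mask, the displayed coefficients sum to less than 1 because the padding slots keep some weight; with the mask, the real words share the full weight.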

In [16]:
word_coeffs = model.forward(torch.Tensor(my_review).int().to(device))[1]  # get word attention coeffs by passing the review to the model - (you need to convert the input to a torch tensor)

word_coeffs_list = word_coeffs.reshape(7,30).tolist()

# match text and coefficients:
text_word_coeffs = [list(zip(words,word_coeffs_list[idx][:len(words)])) for idx,words in enumerate(my_review_text)]

for sent in text_word_coeffs:
    [print(elt) for elt in sent]
    print('= = = =')

# sort words by importance within each sentence:
text_word_coeffs_sorted = [sorted(elt,key=operator.itemgetter(1),reverse=True) for elt in text_word_coeffs]

for sent in text_word_coeffs_sorted:
    [print(elt) for elt in sent]
    print('= = = =')

('There', 0.03889559125801324)
("'s", 0.023873977798934663)
('a', 0.03315303444564445)
('sign', 0.03902794493102821)
('on', 0.028495436109275706)
('The', 0.02202037034989289)
('Lost', 0.03589686330249601)
('Highway', 0.039496466606913576)
('that', 0.03101618687831968)
('says', 0.03460194084007355)
(':', 0.034469388716209214)
('OOV', 0.026308862940549394)
('SPOILERS', 0.02724132073675315)
('OOV', 0.026433721283606472)
('(', 0.02757673324485127)
('but', 0.044679000904633746)
('you', 0.06402837934581311)
('already', 0.03706032350754792)
('knew', 0.031163507613516155)
('that', 0.027524386895823868)
(',', 0.03400208405639647)
('did', 0.03666231247783293)
("n't", 0.026234464207184383)
('you', 0.03713430137548305)
('?', 0.03857142790185892)
(')', 0.03582578989825407)
= = = =
('Since', 0.03355075266357415)
('there', 0.04106674677512241)
("'s", 0.035149862042709916)
('a', 0.03937334385081633)
('great', 0.0682926958098518)
('deal', 0.05213710603125568)
('of', 0.051550833192459504)
('people', 0.0