# Basic tutorial: Question answering
#### Author: Matteo Caorsi

This short tutorial introduces the basic functionality of the *giotto-deep* API.

The example used throughout this tutorial is question answering (QA).

The main steps of the tutorial are the following:
 1. creation of a dataset
 2. creation of a model
 3. definition of metrics and losses
 4. training of the model
 5. extraction of some features of the network

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np

import torch
from torch import nn

from gdeep.models import FFNet

from gdeep.visualisation import persistence_diagrams_of_activations

from torch.utils.tensorboard import SummaryWriter
from gdeep.data import TorchDataLoader
from gdeep.pipeline import Pipeline

from gtda.diagrams import BettiCurve

from gtda.plotting import plot_betti_surfaces

No TPUs...


# Initialize the tensorboard writer

In order to analyse the results of your models, you need to start TensorBoard.
In a terminal, move into the `/example` folder and run the following command:

```
tensorboard --logdir=runs
```

Then, once the training is done, go [here](http://localhost:6006/) to see all the visualisation results.

In [2]:
writer = SummaryWriter()

# Create your dataset

In [3]:
from torch.utils.data.sampler import SubsetRandomSampler

# the only part of the training set we are interested in
train_indices = list(range(32*10))

dl = TorchDataLoader(name="SQuAD2", convert_to_map_dataset=True)
dl_tr_str, dl_ts_str = dl.build_dataloaders(sampler=SubsetRandomSampler(train_indices), batch_size=1)


Each item of the dataset contains a context and a question whose answer can be found within that context. The correct answers, as well as the character offsets at which they start in the context, are also provided.

In [4]:
datum = next(iter(dl_tr_str))
datum

[('Beyoncé embarked on The Mrs. Carter Show World Tour on April 15 in Belgrade, Serbia; the tour included 132 dates that ran through to March 2014. It became the most successful tour of her career and one of the most-successful tours of all time. In May, Beyoncé\'s cover of Amy Winehouse\'s "Back to Black" with André 3000 on The Great Gatsby soundtrack was released. She was also honorary chair of the 2013 Met Gala. Beyoncé voiced Queen Tara in the 3D CGI animated film, Epic, released by 20th Century Fox on May 24, and recorded an original song for the film, "Rise Up", co-written with Sia.',),
 ("What was the name of Beyoncé's tour that she started on April 15?",),
 [('The Mrs. Carter Show World Tour',)],
 [tensor([20])]]

In [5]:
datum = next(iter(dl_ts_str))
datum

[("When finally Edward the Confessor returned from his father's refuge in 1041, at the invitation of his half-brother Harthacnut, he brought with him a Norman-educated mind. He also brought many Norman counsellors and fighters, some of whom established an English cavalry force. This concept never really took root, but it is a typical example of the attitudes of Edward. He appointed Robert of Jumièges archbishop of Canterbury and made Ralph the Timid earl of Hereford. He invited his brother-in-law Eustace II, Count of Boulogne to his court in 1051, an event which resulted in the greatest of early conflicts between Saxon and Norman and ultimately resulted in the exile of Earl Godwin of Wessex.",),
 ('Who did Edward make archbishop of Canterbury?',),
 [('Robert of Jumièges',), ('Robert of Jumièges',), ('Robert of Jumièges',)],
 [tensor([382]), tensor([382]), tensor([382])]]
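
The last element of each item is the character offset at which the answer starts in the context. As a quick check, the sketch below takes a shortened prefix of the first training context printed above (typed in by hand, not loaded through `TorchDataLoader`) and slices it at offset 20. The variable names are illustrative only.

```python
# A shortened prefix of the first training context shown above.
# "\u00e9" is the single character "é", written explicitly so that
# the character count does not depend on how the file is encoded.
context = ("Beyonc\u00e9 embarked on The Mrs. Carter Show World Tour "
           "on April 15 in Belgrade, Serbia")
answer = "The Mrs. Carter Show World Tour"
answer_start = 20  # the value stored in tensor([20])

# Slice the context from the start offset for the length of the answer.
recovered = context[answer_start:answer_start + len(answer)]
print(recovered)
```

The slice reproduces the answer string, which suggests that the offsets count characters rather than tokens at this stage.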

## Required preprocessing

Neural networks cannot directly deal with strings. We first have to preprocess the dataset in three main steps:
 1. Tokenise each string into its words
 2. Build a vocabulary out of these words
 3. Embed each word into a vector, so that each sentence becomes a list of vectors

The first two steps are performed by `PreprocessTextQA`. The embedding is added directly to the model.

In [6]:
from gdeep.data import PreprocessTextQA

prec = PreprocessTextQA((dl_tr_str, dl_ts_str))

(dl_tr, dl_ts) = prec.build_dataloaders(batch_size=3)


In [7]:
aa = next(iter(dl_tr))
aa

[tensor([[[ 778,   53,  110,  ...,   41,  191, 2567],
          [ 715,  271,   27,  ..., 2567, 2567, 2567]],
 
         [[3792,  281,   14,  ..., 2567, 2567, 2567],
          [  66,  211, 4077,  ..., 2567, 2567, 2567]],
 
         [[ 943,   53,  114,  ..., 2567, 2567, 2567],
          [ 133,  110,  144,  ..., 2567, 2567, 2567]]]),
 tensor([[184, 185],
         [ 46,  49],
         [128, 129]])]
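
To make the tokenisation, vocabulary and padding steps more concrete, here is a minimal plain-Python sketch. It is not how `PreprocessTextQA` is implemented: the helpers `tokenise`, `build_vocabulary` and `numericise_and_pad` are hypothetical, the tokeniser is deliberately naive, and the two sentences are made up for illustration. The padding id is assumed to be the first integer past the vocabulary; the repeated value 2567 in the output above appears to play a similar role.

```python
from collections import Counter


def tokenise(text):
    """Lower-case the text and split it into words and basic punctuation."""
    for mark in ",.?!;:":
        text = text.replace(mark, f" {mark} ")
    return text.lower().split()


def build_vocabulary(token_lists):
    """Map each distinct token to an integer id, most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok: idx for idx, (tok, _) in enumerate(counts.most_common())}


def numericise_and_pad(tokens, vocab, length, pad_id):
    """Replace tokens by their ids and right-pad the list to `length`."""
    ids = [vocab[tok] for tok in tokens]
    return ids + [pad_id] * (length - len(ids))


context = "Edward made Robert archbishop of Canterbury."
question = "Who did Edward make archbishop?"

token_lists = [tokenise(context), tokenise(question)]
vocab = build_vocabulary(token_lists)
pad_id = len(vocab)  # assumption: one id past the vocabulary
max_len = max(len(toks) for toks in token_lists)

# One row for the context and one for the question, both of equal length.
rows = [numericise_and_pad(toks, vocab, max_len, pad_id) for toks in token_lists]
print(rows)
```

Padding both rows to a common length is what allows context and question to be stacked into a single tensor, as in the batch shown above.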

## Define and train your model

The QA model accepts the context and the question as input and returns the probabilities for the initial and final token of the answer in the input context. The output, then, is a pair of logits, that is, the unnormalised scores from which these probabilities are obtained.

In [8]:
from torch.nn import Transformer
from torch.optim import Adam, SparseAdam, SGD
import copy
# my simple transformer model
class QATransformer(nn.Module):

    def __init__(self, src_vocab_size, tgt_vocab_size, embed_dim):
        super(QATransformer, self).__init__()
        self.transformer = Transformer(d_model=embed_dim,
                                       nhead=2,
                                       num_encoder_layers=1,
                                       num_decoder_layers=1,
                                       dim_feedforward=512,
                                       dropout=0.1)
        self.embedding_src = nn.Embedding(src_vocab_size, embed_dim, sparse=True)
        self.embedding_tgt = nn.Embedding(tgt_vocab_size, embed_dim, sparse=True)
        self.generator = nn.Linear(embed_dim, 2)
        
    def forward(self, X):
        # X stacks the context and the question along dimension 1
        src = X[:,0,:]  # token indices of the context
        tgt = X[:,1,:]  # token indices of the question
        src_emb = self.embedding_src(src)
        tgt_emb = self.embedding_tgt(tgt)
        # the transformer output is kept as an attribute of the model
        self.outs = self.transformer(src_emb, tgt_emb)
        # two logits per position: start and end of the answer
        logits = self.generator(self.outs)
        return logits
    
    def __deepcopy__(self, memo):
        """this is needed to make sure that the 
        non-leaf nodes do not
        interfere with copy.deepcopy()
        """
        cls = self.__class__
        result = cls.__new__(cls)
        memo[id(self)] = result
        for k, v in self.__dict__.items():
            setattr(result, k, copy.deepcopy(v, memo))
        return result
    
    def encode(self, src, src_mask):
        """this method is used only at the inference step"""
        return self.transformer.encoder(
                            self.embedding_src(src), src_mask)

    def decode(self, tgt, memory, tgt_mask):
        """this method is used only at the inference step"""
        return self.transformer.decoder(
                          self.embedding_tgt(tgt), memory,
                          tgt_mask)

In [9]:
vocab_size = 5500

src_vocab_size = vocab_size
tgt_vocab_size = vocab_size
emb_size = 64

model = QATransformer(src_vocab_size, tgt_vocab_size, emb_size)
print(model)

QATransformer(
  (transformer): Transformer(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0): TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
          )
          (linear1): Linear(in_features=64, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=512, out_features=64, bias=True)
          (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
      (norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    )
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_pr

## Define the loss function

This loss function is an adapted version of the cross entropy for the transformer architecture.

In [10]:

def loss_fn(output_of_network, label_of_dataloader):
    # the labels are the start and end token indices of the answer
    tgt_out = label_of_dataloader
    logits = output_of_network
    cel = nn.CrossEntropyLoss()
    return cel(logits, tgt_out)
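
To see what the cross entropy computes here, assume the logits have shape (batch, sequence length, 2) and the labels have shape (batch, 2), as the cells above suggest. With inputs of this form, `nn.CrossEntropyLoss` treats the sequence positions as the classes and averages the loss over the start and end columns. The plain-Python sketch below reproduces that arithmetic for a single sample; the function `position_cross_entropy` and the toy numbers are hypothetical and only meant to illustrate the idea.

```python
import math


def position_cross_entropy(logits, target):
    """Cross entropy over positions for one sample.

    logits: one [start_logit, end_logit] pair per sequence position.
    target: [true_start_position, true_end_position].
    """
    losses = []
    for column, true_position in enumerate(target):
        scores = [row[column] for row in logits]
        # log of the softmax normaliser, shifted by the max for stability
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        # negative log-probability assigned to the true position
        losses.append(log_norm - scores[true_position])
    return sum(losses) / len(losses)


# Four positions; the start logit peaks at position 1, the end logit at 2.
logits = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]

good = position_cross_entropy(logits, [1, 2])  # labels match the peaks
bad = position_cross_entropy(logits, [0, 3])   # labels miss the peaks
uniform = position_cross_entropy([[0.0, 0.0]] * 4, [1, 2])
print(good, bad, uniform)
```

When all logits are equal, the loss equals the logarithm of the number of positions, which gives a rough baseline against which training losses can be compared.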


In [11]:
# prepare a pipeline class with the model, dataloaders, loss_fn and tensorboard writer
pipe = Pipeline(model, (dl_tr, dl_ts), loss_fn, writer)

# train the model
pipe.train(SGD, 3, False, {"lr":0.01}, {"batch_size":16})

Epoch 1
-------------------------------
Epoch training loss: 5.798102 	Epoch training accuracy: 0.39%                                                            
Time taken for this epoch: 4.00s
Learning rate value: 0.01000000



Cannot store data in the PR curve



Validation results: 
 Accuracy: 3.12%,                 Avg loss: 5.585218 

Epoch 2
-------------------------------
Epoch training loss: 5.593606 	Epoch training accuracy: 0.39%                                                             
Time taken for this epoch: 4.00s
Learning rate value: 0.01000000
Validation results: 
 Accuracy: 1.56%,                 Avg loss: 5.435014 

Epoch 3
-------------------------------
Epoch training loss: 5.488928 	Epoch training accuracy: 1.17%                                                             
Time taken for this epoch: 4.00s
Learning rate value: 0.01000000
Validation results: 
 Accuracy: 0.00%,                 Avg loss: 5.397848 



(5.397847771644592, 0.0)

## Answering questions!

Here we have a question and its associated context:

In [12]:
bb = next(iter(dl_ts_str))
bb[:2]

[('In the visual arts, the Normans did not have the rich and distinctive traditions of the cultures they conquered. However, in the early 11th century the dukes began a programme of church reform, encouraging the Cluniac reform of monasteries and patronising intellectual pursuits, especially the proliferation of scriptoria and the reconstitution of a compilation of lost illuminated manuscripts. The church was utilised by the dukes as a unifying force for their disparate duchy. The chief monasteries taking part in this "renaissance" of Norman art and scholarship were Mont-Saint-Michel, Fécamp, Jumièges, Bec, Saint-Ouen, Saint-Evroul, and Saint-Wandrille. These centres were in contact with the so-called "Winchester school", which channeled a pure Carolingian artistic tradition to Normandy. In the final decade of the 11th and first of the 12th century, Normandy experienced a golden age of illustrated manuscripts, but it was brief and the major scriptoria of Normandy ceased to function aft


Get the vocabulary and convert the question and the context into token indices, so that both can be fed to the model.

In [13]:
voc = prec.vocabulary
context = prec.tokenizer(bb[0][0])
question = prec.tokenizer(bb[1][0])

# get the indexes in the vocabulary of the tokens
context_idx = torch.tensor(list(map(voc.__getitem__,context)))
question_idx = torch.tensor(list(map(voc.__getitem__,question)))

In [14]:
length_to_pad = aa[0].shape[-1]
pad_fn = lambda item : torch.cat([item, prec.pad_item * torch.ones(length_to_pad - item.shape[0])])

# these tensors are ready to be fed into the model
context_ready_for_model = pad_fn(context_idx)
question_ready_for_model = pad_fn(question_idx)

Put the context and question tensors together and feed them to the model:

In [15]:
inp = torch.stack((context_ready_for_model, question_ready_for_model)).reshape(1,*aa[0].shape[1:]).long()
out = model(inp)

The output is the logits for the start and end tokens of the answer. It is now time to extract them with `torch.argmax`.

In [16]:
answer_idx = torch.argmax(out, dim=1)

try:
    if answer_idx[0][1] > answer_idx[0][0]:
        print("The model proposes: '", context[answer_idx[0][0]:answer_idx[0][1]],"...'")
    else:
        print("The model proposes: '", context[answer_idx[0][0]],"...'")
except IndexError:
    print("The model was not able to find the answer.")
print("The actual answer was: '" + bb[2][0][0]+"'")

The model proposes: ' ['arts', ',', 'the', 'normans'] ...'
The actual answer was: 'dukes'
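
The extraction step can also be written without tensors, which makes the indexing easier to follow. The sketch below is a plain-Python illustration, not part of *giotto-deep*: the helper `extract_span` and the toy logits are hypothetical. Like the cell above, it slices up to but not including the predicted end position, and it treats a start position that falls in the padding as "no answer found".

```python
def extract_span(logits, tokens):
    """Return the tokens between the highest-scoring start and end positions.

    logits: one [start_logit, end_logit] pair per (padded) position.
    tokens: the unpadded context tokens.
    """
    positions = range(len(logits))
    start = max(positions, key=lambda i: logits[i][0])
    end = max(positions, key=lambda i: logits[i][1])
    if start >= len(tokens):
        return None  # the predicted start lies in the padding
    if end > start:
        return tokens[start:end]  # the end position itself is excluded
    return [tokens[start]]


tokens = ["the", "dukes", "began", "a", "programme"]
# Seven positions: five real tokens followed by two padding positions.
logits = [[0.1, 0.0], [3.0, 0.2], [0.5, 2.5], [0.0, 0.0],
          [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

print(extract_span(logits, tokens))
```

To include the token at the end position in the answer, the slice would be `tokens[start:end + 1]` instead.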


# Extract inner data from your models

In [17]:
from gdeep.models import ModelExtractor

me = ModelExtractor(pipe.model, loss_fn)

lista = me.get_layers_param()

for k, item in lista.items():
    print(k,item.shape)


transformer.encoder.layers.0.self_attn.in_proj_weight torch.Size([192, 64])
transformer.encoder.layers.0.self_attn.in_proj_bias torch.Size([192])
transformer.encoder.layers.0.self_attn.out_proj.weight torch.Size([64, 64])
transformer.encoder.layers.0.self_attn.out_proj.bias torch.Size([64])
transformer.encoder.layers.0.linear1.weight torch.Size([512, 64])
transformer.encoder.layers.0.linear1.bias torch.Size([512])
transformer.encoder.layers.0.linear2.weight torch.Size([64, 512])
transformer.encoder.layers.0.linear2.bias torch.Size([64])
transformer.encoder.layers.0.norm1.weight torch.Size([64])
transformer.encoder.layers.0.norm1.bias torch.Size([64])
transformer.encoder.layers.0.norm2.weight torch.Size([64])
transformer.encoder.layers.0.norm2.bias torch.Size([64])
transformer.encoder.norm.weight torch.Size([64])
transformer.encoder.norm.bias torch.Size([64])
transformer.decoder.layers.0.self_attn.in_proj_weight torch.Size([192, 64])
transformer.decoder.layers.0.self_attn.in_proj_bias t

In [18]:
DEVICE = torch.device("cpu")
x = next(iter(dl_tr))[0]
pipe.model.eval()
pipe.model(x.to(DEVICE))

list_activations = me.get_activations(x)
len(list_activations)


30

In [19]:
x = next(iter(dl_tr))[0][0]
if x.dtype is not torch.int64:
    res = me.get_decision_boundary(x, n_epochs=1)
    res.shape

In [20]:
x, target = next(iter(dl_tr))
if x.dtype is torch.float:
    for gradient in me.get_gradients(x, target=target)[1]:
        print(gradient.shape)

# Visualise activations and other topological aspects of your model

In [21]:
from gdeep.visualisation import Visualiser

vs = Visualiser(pipe)

vs.plot_data_model()
#vs.plot_activations(x)
#vs.plot_persistence_diagrams(x)
