# Basic tutorial: Question answering
#### Author: Matteo Caorsi

This short tutorial introduces the basic functionality of the *giotto-deep* API.

The example used throughout is question answering (QA): given a context and a question, find the answer inside the context.

The main steps of the tutorial are the following:
 1. creation of a dataset
 2. creation of a model
 3. definition of metrics and losses
 4. training of the model
 5. extraction of some features of the network

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np

import torch
from torch import nn

from gdeep.models import FFNet

from gdeep.visualisation import  persistence_diagrams_of_activations

from torch.utils.tensorboard import SummaryWriter
from gdeep.data import TorchDataLoader
from gdeep.pipeline import Pipeline

from gtda.diagrams import BettiCurve

from gtda.plotting import plot_betti_surfaces

No TPUs...


# Initialize the tensorboard writer

To analyse the results of your models, you need to start tensorboard.
In a terminal, move into the `/example` folder and run the following command:

```
tensorboard --logdir=runs
```

Once training has finished, go [here](http://localhost:6006/) to see all the visualisation results.

In [2]:
writer = SummaryWriter()

# Create your dataset

In [3]:
from torch.utils.data.sampler import SubsetRandomSampler

# the only part of the training set we are interested in
train_indices = list(range(32*10))

dl = TorchDataLoader(name="SQuAD2", convert_to_map_dataset=True)
dl_tr_str, dl_ts_str = dl.build_dataloaders(sampler=SubsetRandomSampler(train_indices), batch_size=1)


Each item of the dataset contains a context and a question whose answer can be found within that context. The correct answer and the character position at which it starts in the context are also provided.
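
To make this structure concrete, here is a minimal sketch of a SQuAD-style record in plain Python. The field names (`context`, `question`, `answers`, `answer_start`) and the sample values are assumptions made for illustration, not the exact layout returned by `TorchDataLoader`. The point is only that the starting character offset, together with the answer length, locates the answer inside the context.

```python
# Hypothetical SQuAD-style record; field names and values are illustrative only.
record = {
    "context": "Computational complexity theory is a branch of the theory of computation.",
    "question": "What is computational complexity theory a branch of?",
    "answers": ["the theory of computation"],
    "answer_start": [47],  # character offset of the answer inside the context
}


def answer_span(context, answer_text, answer_start):
    """Return the substring of `context` that the character offset points to."""
    return context[answer_start:answer_start + len(answer_text)]


recovered = answer_span(
    record["context"], record["answers"][0], record["answer_start"][0]
)
print(recovered)
```

Character offsets like this one have to be converted to token positions before training, since the model works on tokens rather than characters.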

In [4]:
datum = next(iter(dl_tr_str))
datum

[('The group changed their name to Destiny\'s Child in 1996, based upon a passage in the Book of Isaiah. In 1997, Destiny\'s Child released their major label debut song "Killing Time" on the soundtrack to the 1997 film, Men in Black. The following year, the group released their self-titled debut album, scoring their first major hit "No, No, No". The album established the group as a viable act in the music industry, with moderate sales and winning the group three Soul Train Lady of Soul Awards for Best R&B/Soul Album of the Year, Best R&B/Soul or Rap New Artist, and Best R&B/Soul Single for "No, No, No". The group released their multi-platinum second album The Writing\'s on the Wall in 1999. The record features some of the group\'s most widely known songs such as "Bills, Bills, Bills", the group\'s first number-one single, "Jumpin\' Jumpin\'" and "Say My Name", which became their most successful song at the time, and would remain one of their signature songs. "Say My Name" won the Best 

In [5]:
datum = next(iter(dl_ts_str))
datum

[('Closely related fields in theoretical computer science are analysis of algorithms and computability theory. A key distinction between analysis of algorithms and computational complexity theory is that the former is devoted to analyzing the amount of resources needed by a particular algorithm to solve a problem, whereas the latter asks a more general question about all possible algorithms that could be used to solve the same problem. More precisely, it tries to classify problems that can or cannot be solved with appropriately restricted resources. In turn, imposing restrictions on the available resources is what distinguishes computational complexity from computability theory: the latter theory asks what kind of problems can, in principle, be solved algorithmically.',),
 ('What field of computer science analyzes all possible algorithms in aggregate to determine the resource requirements needed to solve to a given problem?  ',),
 [('computational complexity theory',),
  ('computationa

## Required preprocessing

Neural networks cannot directly deal with strings. We first have to preprocess the dataset in three main steps:
 1. Tokenise each string into its words
 2. Build a vocabulary out of these words
 3. Embed each word into a vector, so that each sentence becomes a list of vectors

The first two steps are performed by the `PreprocessTextQA` class. The embedding is added directly to the model.
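
The first two steps can be sketched in plain Python as follows. This is not how `PreprocessTextQA` is implemented: the whitespace tokenizer, the `<unk>` and `<pad>` special tokens, and all function names below are assumptions chosen to keep the illustration short.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real tokenizer."""
    return text.lower().split()


def build_vocab(texts, specials=("<unk>", "<pad>")):
    """Map each token to an integer index, special tokens first."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    index_to_token = list(specials) + [tok for tok, _ in counts.most_common()]
    return {tok: idx for idx, tok in enumerate(index_to_token)}


def numericalize(text, vocab):
    """Replace each token by its index; unknown tokens map to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokenize(text)]


texts = ["what is a computer", "a computer is a machine"]
vocab = build_vocab(texts)
ids = numericalize("what is a robot", vocab)
print(ids)
```

The word `robot` never appears in `texts`, so it is mapped to the index of `<unk>`. The third step, the embedding, is then a lookup table from these indices to vectors.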

In [6]:
from gdeep.data import PreprocessTextQA

prec = PreprocessTextQA((dl_tr_str, dl_ts_str))

(dl_tr, dl_ts) = prec.build_dataloaders(batch_size=3)


In [7]:
aa = next(iter(dl_tr))
aa

[tensor([[[1828,   90, 4077,  ..., 2567, 2567, 2567],
          [ 133,  568,   25,  ..., 2567, 2567, 2567]],
 
         [[ 518,   41,   28,  ..., 2567, 2567, 2567],
          [ 117,   64,   17,  ..., 2567, 2567, 2567]],
 
         [[ 778,   72,   33,  ..., 2567, 2567, 2567],
          [ 117,  158, 1133,  ..., 2567, 2567, 2567]]]),
 tensor([[103, 104],
         [178, 180],
         [ 17,  20]])]

## Define and train your model

The QA model takes the context and the question as input and returns, for each token position, a score for that position being the first token of the answer and a score for it being the last. The output is thus a pair of logits per position.
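
Logits become probabilities through a softmax taken over the token positions, separately for the start and for the end. The sketch below shows this in plain Python; the logit values and the five-token length are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a 5-token input: one [start, end] pair per position.
logits = [
    [0.1, -1.0],
    [2.0, 0.0],
    [0.3, 0.5],
    [-0.5, 3.0],
    [0.0, 0.2],
]

# Normalise over positions, one column at a time.
start_probs = softmax([row[0] for row in logits])
end_probs = softmax([row[1] for row in logits])

print([round(p, 3) for p in start_probs])
print([round(p, 3) for p in end_probs])
```

Each list sums to one, and the largest start and end probabilities sit at the positions with the largest start and end logits.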

In [8]:
from torch.nn import Transformer
from torch.optim import Adam, SparseAdam, SGD
import copy
# my simple transformer model
class QATransformer(nn.Module):

    def __init__(self, src_vocab_size, tgt_vocab_size, embed_dim):
        super(QATransformer, self).__init__()
        self.transformer = Transformer(d_model=embed_dim,
                                       nhead=2,
                                       num_encoder_layers=1,
                                       num_decoder_layers=1,
                                       dim_feedforward=512,
                                       dropout=0.1)
        self.embedding_src = nn.Embedding(src_vocab_size, embed_dim, sparse=True)
        self.embedding_tgt = nn.Embedding(tgt_vocab_size, embed_dim, sparse=True)
        self.generator = nn.Linear(embed_dim, 2)
        
    def forward(self, X):
        #print(X.shape)
        src = X[:,0,:]
        tgt = X[:,1,:]
        #print(src.shape, tgt.shape)
        src_emb = self.embedding_src(src)
        tgt_emb = self.embedding_tgt(tgt)
        #print(src_emb.shape, tgt_emb.shape)
        self.outs = self.transformer(src_emb, tgt_emb)
        #print(outs.shape)
        logits = self.generator(self.outs)
        #print(logits.shape)
        #out = torch.topk(logits, k=1, dim=2).indices.reshape(-1,44)
        #print(out, out.shape)
        return logits
    
    def __deepcopy__(self, memo):
        """this is needed to make sure that the 
        non-leaf nodes do not
        interfere with copy.deepcopy()
        """
        cls = self.__class__
        result = cls.__new__(cls)
        memo[id(self)] = result
        for k, v in self.__dict__.items():
            setattr(result, k, copy.deepcopy(v, memo))
        return result
    
    def encode(self, src, src_mask):
        """this method is used only at the inference step"""
        return self.transformer.encoder(
                            self.embedding_src(src), src_mask)

    def decode(self, tgt, memory, tgt_mask):
        """this method is used only at the inference step"""
        return self.transformer.decoder(
                          self.embedding_tgt(tgt), memory,
                          tgt_mask)

In [9]:
vocab_size = 5500

src_vocab_size = vocab_size
tgt_vocab_size = vocab_size
emb_size = 64

model = QATransformer(src_vocab_size, tgt_vocab_size, emb_size)
print(model)

QATransformer(
  (transformer): Transformer(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0): TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
          )
          (linear1): Linear(in_features=64, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=512, out_features=64, bias=True)
          (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
      (norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    )
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_pr

## Define the loss function

This loss function is an adapted version of the cross-entropy loss for the transformer architecture.

In [10]:

def loss_fn(output_of_network, label_of_dataloader):
    #print(output_of_network.shape, label_of_dataloader.shape)
    tgt_out = label_of_dataloader
    #print(tgt_out)
    logits = output_of_network
    cel = nn.CrossEntropyLoss()
    return cel(logits, tgt_out)
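
To see what this cross-entropy computes, the sketch below re-implements it in plain Python for a single example. It assumes the logits hold one `[start, end]` pair per token position and the label is a `(start, end)` pair of indices, which appears to match the shapes printed earlier. With that layout the positions play the role of classes, and the start and end losses are averaged. The helper names are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]


def span_cross_entropy(logits, target):
    """Mean cross-entropy of the start and end predictions.

    logits: list of [start_score, end_score], one pair per position.
    target: (start_index, end_index) of the true answer.
    """
    losses = []
    for column, gold in enumerate(target):  # column 0 = start, column 1 = end
        log_probs = log_softmax([row[column] for row in logits])
        losses.append(-log_probs[gold])
    return sum(losses) / len(losses)


# A model that scores all 4 positions equally has loss log(4).
uniform_logits = [[0.0, 0.0] for _ in range(4)]
loss_uniform = span_cross_entropy(uniform_logits, (1, 2))

# A model that favours the right positions has a lower loss.
confident_logits = [[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
loss_confident = span_cross_entropy(confident_logits, (1, 2))

print(loss_uniform, loss_confident)
```

A useful reference point: a model that has learned nothing scores every position equally, so its loss is the logarithm of the sequence length.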


In [11]:
# prepare a pipeline class with the model, dataloaders, loss_fn and tensorboard writer
pipe = Pipeline(model, (dl_tr, dl_ts), loss_fn, writer)

# train the model
pipe.train(SGD, 3, False, {"lr":0.01}, {"batch_size":16})

Epoch 1
-------------------------------
Epoch training loss: 5.765003 	Epoch training accuracy: 1.17%                                                            
Time taken for this epoch: 4.00s
Learning rate value: 0.01000000



Cannot store data in the PR curve



Validation results: 
 Accuracy: 0.00%,                 Avg loss: 5.579422 

Epoch 2
-------------------------------
Epoch training loss: 5.559848 	Epoch training accuracy: 1.17%                                                             
Time taken for this epoch: 4.00s
Learning rate value: 0.01000000
Validation results: 
 Accuracy: 0.00%,                 Avg loss: 5.424694 

Epoch 3
-------------------------------
Epoch training loss: 5.460648 	Epoch training accuracy: 1.17%                                                            
Time taken for this epoch: 4.00s
Learning rate value: 0.01000000
Validation results: 
 Accuracy: 1.56%,                 Avg loss: 5.428532 



(5.428531527519226, 1.5625)

## Answering questions!

Here we have a question and its associated context:

In [12]:
bb = next(iter(dl_ts_str))
bb[:2]

[('Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty, and relating those classes to each other. A computational problem is understood to be a task that is in principle amenable to being solved by a computer, which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps, such as an algorithm.',),
 ('What is a manual application of mathematical steps?',)]


Get the vocabulary and convert the question and the context into token indices, so that both can be fed to the model.

In [13]:
voc = prec.vocabulary
context = prec.tokenizer(bb[0][0])
question = prec.tokenizer(bb[1][0])

# get the indexes in the vocabulary of the tokens
context_idx = torch.tensor(list(map(voc.__getitem__,context)))
question_idx = torch.tensor(list(map(voc.__getitem__,question)))

In [14]:
length_to_pad = aa[0].shape[-1]
pad_fn = lambda item : torch.cat([item, prec.pad_item * torch.ones(length_to_pad - item.shape[0])])

# these tensors are ready to be fed into the model
context_ready_for_model = pad_fn(context_idx)
question_ready_for_model = pad_fn(question_idx)
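
The padding step can be illustrated without tensors. The `pad_fn` lambda above assumes that the item is never longer than the target length. The plain-Python sketch below adds truncation for that case, which is an assumption of this sketch and not something the tutorial does. The pad index `2567` is only illustrative, taken from the padded batch printed earlier.

```python
def pad_to_length(token_ids, length, pad_id):
    """Right-pad (or truncate) a list of token ids to exactly `length` entries."""
    if len(token_ids) >= length:
        return token_ids[:length]
    return token_ids + [pad_id] * (length - len(token_ids))


PAD_ID = 2567  # illustrative value, not read from the preprocessing object

padded = pad_to_length([5, 3, 2], 6, PAD_ID)
truncated = pad_to_length([5, 3, 2, 7, 8], 4, PAD_ID)
print(padded)
print(truncated)
```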

Stack the context and question tensors together and feed them to the model.

In [15]:
inp = torch.stack((context_ready_for_model, question_ready_for_model)).reshape(1,*aa[0].shape[1:]).long()
out = model(inp)

The output contains the logits for the start and end tokens of the answer. It is now time to extract the most likely positions with `torch.argmax`.

In [16]:
answer_idx = torch.argmax(out, dim=1)

try:
    if answer_idx[0][1] > answer_idx[0][0]:
        print("The model proposes: '", context[answer_idx[0][0]:answer_idx[0][1]],"...'")
    else:
        print("The model proposes: '", context[answer_idx[0][0]],"...'")
except IndexError:
    print("The model was not able to find the answer.")
print("The actual answer was: '" + bb[2][0][0]+"'")

The model was not able to find the answer.
The actual answer was: ''
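
Taking the two argmax values independently can yield an end position that comes before the start position. A common alternative is to search for the best pair under the constraint that the start does not exceed the end. The sketch below shows that idea in plain Python. It is not part of giotto-deep; the scores and tokens are made up, and treating the end index as inclusive is an assumption of this sketch.

```python
def best_span(start_scores, end_scores):
    """Return (start, end) maximising start + end score with start <= end."""
    best, best_score = None, float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, len(end_scores)):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


tokens = ["complexity", "theory", "classifies", "computational", "problems"]
start_scores = [0.2, 0.1, 0.0, 2.5, 0.3]
end_scores = [3.0, 0.0, 0.1, 0.4, 2.0]

# Independent argmax gives start=3 and end=0, which is not a valid span.
naive_start = start_scores.index(max(start_scores))
naive_end = end_scores.index(max(end_scores))

start, end = best_span(start_scores, end_scores)
answer = tokens[start:end + 1]  # end index treated as inclusive
print(answer)
```

The double loop is quadratic in the sequence length, which is acceptable for short contexts. Longer inputs usually cap the maximum answer length.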


# Extract inner data from your models

In [17]:
from gdeep.models import ModelExtractor

me = ModelExtractor(pipe.model, loss_fn)

lista = me.get_layers_param()

for k, item in lista.items():
    print(k,item.shape)


transformer.encoder.layers.0.self_attn.in_proj_weight torch.Size([192, 64])
transformer.encoder.layers.0.self_attn.in_proj_bias torch.Size([192])
transformer.encoder.layers.0.self_attn.out_proj.weight torch.Size([64, 64])
transformer.encoder.layers.0.self_attn.out_proj.bias torch.Size([64])
transformer.encoder.layers.0.linear1.weight torch.Size([512, 64])
transformer.encoder.layers.0.linear1.bias torch.Size([512])
transformer.encoder.layers.0.linear2.weight torch.Size([64, 512])
transformer.encoder.layers.0.linear2.bias torch.Size([64])
transformer.encoder.layers.0.norm1.weight torch.Size([64])
transformer.encoder.layers.0.norm1.bias torch.Size([64])
transformer.encoder.layers.0.norm2.weight torch.Size([64])
transformer.encoder.layers.0.norm2.bias torch.Size([64])
transformer.encoder.norm.weight torch.Size([64])
transformer.encoder.norm.bias torch.Size([64])
transformer.decoder.layers.0.self_attn.in_proj_weight torch.Size([192, 64])
transformer.decoder.layers.0.self_attn.in_proj_bias t

In [18]:
DEVICE = torch.device("cpu")
x = next(iter(dl_tr))[0]
pipe.model.eval()
pipe.model(x.to(DEVICE))

list_activations = me.get_activations(x)
len(list_activations)


30

In [19]:
x = next(iter(dl_tr))[0][0]
if x.dtype is not torch.int64:
    res = me.get_decision_boundary(x, n_epochs=1)
    res.shape

In [20]:
x, target = next(iter(dl_tr))
if x.dtype is torch.float:
    for gradient in me.get_gradients(x, target=target)[1]:
        print(gradient.shape)

# Visualise activations and other topological aspects of your model

In [21]:
from gdeep.visualisation import Visualiser

vs = Visualiser(pipe)

vs.plot_data_model()
#vs.plot_activations(x)
#vs.plot_persistence_diagrams(x)
