# Basic tutorial: translation
#### Author: Matteo Caorsi

This short tutorial introduces the basic functionality of the *giotto-deep* API.

The example task is translation: you will build your own transformer model, train it, and use it to translate from German to English!

Here is an example of what a translation task is about:

 - German sentence: 'Ich mag Pizza.'
 - English translation: 'I like pizza.'

## The plan for this tutorial

The main steps of the tutorial are the following:
 1. creation of a dataset
 2. creation of a model
 3. definition of the metrics and losses
 4. training of the model
 5. using the model to translate some sentences
 6. (extra) extract some features of the network for further analysis
 
Let's start by importing the required libraries!

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import copy
import sys

import numpy as np
import torch
from torch import nn
from torch.utils.tensorboard import SummaryWriter
from gtda.diagrams import BettiCurve
from gtda.plotting import plot_betti_surfaces
from torchtext.data.utils import get_tokenizer
from torch.utils.data.sampler import SubsetRandomSampler
from torch.nn import Transformer
from torch.optim import Adam, SparseAdam, SGD


# our special guests!
from gdeep.models import FFNet
from gdeep.visualisation import  persistence_diagrams_of_activations
from gdeep.data.datasets import DatasetBuilder
from gdeep.trainer import Trainer
from gdeep.data import TransformingDataset
from gdeep.data.preprocessors import TokenizerTranslation
from gdeep.data.datasets import DataLoaderBuilder
from gdeep.models import ModelExtractor
from gdeep.visualisation import Visualiser

# Initialize the tensorboard writer

In order to analyse the results of your models, you need to start tensorboard. All the data about the model, the training and the hyperparameters will be stored there.

## How to start tensorboard
In a terminal, move into the `/examples` folder and run the following command:

```
tensorboard --logdir=runs
```

Then go [here](http://localhost:6006/) after the training to see all the visualisation results.

In [None]:
writer = SummaryWriter()

# Create your dataset

We propose to use the [Multi30k](https://github.com/multi30k/dataset) dataset: in the next cell you will see how easy it is to import datasets in giotto-deep (as easy as it is in `torchtext`).

In [None]:
# prepare the builder class
bd = DatasetBuilder(name="Multi30k", convert_to_map_dataset=True)

# build the dataset
ds_tr_str, ds_val_str, ds_ts_str = bd.build()


The dataset contains a list of pairs of sentences: the German sentence and its English translation.

In [None]:
print("Before preprocessing: \n", ds_tr_str[0])

## Required preprocessing

Neural networks cannot directly deal with strings. We first have to preprocess the dataset in three main steps:
 1. Tokenise each string into its words (and possibly normalise each word, for example by removing plurals, interjections, capital letters...)
 2. Build a vocabulary out of these tokens (each word so modified is called a token)
 3. Embed each token into a vector, so that each sentence becomes a list of vectors

The **first two steps** are performed by the `TokenizerTranslation` class. The embedding will be added directly to the model (see below).
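
The first two steps can be sketched without any library. The snippet below is a minimal illustration, not what `TokenizerTranslation` does internally: the whitespace tokeniser, the toy corpus and the `<unk>` entry at index 0 are all assumptions made for the example.

```python
from collections import Counter

# Toy corpus; the sentences are illustrative, not taken from Multi30k.
corpus = ["Ich mag Pizza.", "Ich mag Pasta."]


def tokenise(sentence):
    """Lowercase and split on whitespace, separating the full stop."""
    return sentence.lower().replace(".", " .").split()


# Step 1: tokenise every sentence.
tokenised = [tokenise(s) for s in corpus]

# Step 2: build the vocabulary, most frequent tokens first.
# Index 0 is reserved for unknown tokens (a common convention, assumed here).
counts = Counter(tok for sent in tokenised for tok in sent)
vocab = {"<unk>": 0}
for tok, _ in counts.most_common():
    vocab[tok] = len(vocab)


def numericalise(sentence):
    """Map a sentence to the list of vocabulary indices of its tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenise(sentence)]


print(numericalise("Ich mag Pizza."))
print(numericalise("Ich mag Salat."))  # "salat" is not in the vocabulary
```

Words that never appeared in the corpus fall back to the `<unk>` index, which is why the vocabulary has to be fitted on the training data before anything else is transformed.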

In [None]:

# install german tokenizer
!{sys.executable} -m spacy download de_core_news_sm

# get the german tokenizer
de_tokenizer = get_tokenizer('spacy', language='de_core_news_sm')

# initialise the giotto-deep tokenizer
tokenizer = TokenizerTranslation(tokenizer=de_tokenizer)

# fit the tokenizer to the dataset (note that the vocabularies are built automatically in this case)
tokenizer.fit_to_dataset(ds_tr_str)

# preprocess the dataset
transformed_textds = tokenizer.attach_transform_to_dataset(ds_tr_str)
transformed_textts = tokenizer.attach_transform_to_dataset(ds_val_str)

print("After the preprocessing: \n", transformed_textds[0])

# subsample the training and validation datasets
train_indices = list(range(64 * 2))
val_indices = list(range(64 * 1))

dl_tr, dl_val, _ = DataLoaderBuilder((transformed_textds, transformed_textts)).build(
    (
        {"batch_size": 16, "sampler": SubsetRandomSampler(train_indices)},
        {"batch_size": 16, "sampler": SubsetRandomSampler(val_indices)},
    )
)




As you can see, the data is no longer in string format: each sentence has been transformed into a `torch.Tensor` of type `long`. Each of these numbers is the index of the associated token in the vocabulary.
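
To make the link between indices and vectors concrete, here is a library-free sketch of an embedding as a lookup table. The sizes, the random values and the helper `embed` are hypothetical; in the model below this role is played by `nn.Embedding`, whose rows are learned during training.

```python
import random

rng = random.Random(0)  # fixed seed so the table is reproducible

vocab_size = 6  # hypothetical toy vocabulary size
embed_dim = 4   # hypothetical embedding dimension

# One row of `embed_dim` numbers per vocabulary index.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]


def embed(token_ids):
    """Replace every token index with its row of the table."""
    return [embedding_table[i] for i in token_ids]


sentence_ids = [1, 2, 4, 3]  # a numericalised sentence
vectors = embed(sentence_ids)
print(len(vectors), "tokens, each a vector of length", len(vectors[0]))
```

Every index must be smaller than the number of rows in the table, which is why the model needs to know the vocabulary size in advance.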

## Define your model

The model we play with is a simple transformer model with two embedding layers (one for the German and one for the English sentences) followed by a single transformer layer.

The input ...

In [None]:


# my simple transformer model
class TranslationTransformer(nn.Module):

    def __init__(self, src_vocab_size, tgt_vocab_size, embed_dim):
        super(TranslationTransformer, self).__init__()
        self.transformer = Transformer(d_model=embed_dim,
                                       nhead=2,
                                       num_encoder_layers=1,
                                       num_decoder_layers=1,
                                       dim_feedforward=512,
                                       dropout=0.1)
        self.embedding_src = nn.Embedding(src_vocab_size, embed_dim, sparse=True)
        self.embedding_tgt = nn.Embedding(tgt_vocab_size, embed_dim, sparse=True)
        self.generator = nn.Linear(embed_dim, tgt_vocab_size)
        
    def forward(self, X):
        src = X[:,0,:]
        tgt = X[:,1,:]
        src_emb = self.embedding_src(src)
        tgt_emb = self.embedding_tgt(tgt)
        self.outs = self.transformer(src_emb, tgt_emb)
        logits = self.generator(self.outs)
        return logits
    
    def encode(self, src, src_mask):
        """this method is used only at the inference step.
        This method runs the data through the encoder of the
        transformer"""
        return self.transformer.encoder(
                            self.embedding_src(src), src_mask)

    def decode(self, tgt, memory, tgt_mask):
        """this method is used only at the inference step
        This method runs the data through the decoder of the
        transformer"""
        return self.transformer.decoder(
                          self.embedding_tgt(tgt), memory,
                          tgt_mask)
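
Both `encode` and `decode` accept a mask, and later in this notebook they are called with `None`. For reference, the mask usually passed to a decoder is a causal one, which stops position `i` from attending to later positions. A minimal plain-Python sketch, assuming the additive convention (0 where attention is allowed, `-inf` where it is blocked); the helper name `causal_mask` is hypothetical:

```python
def causal_mask(size):
    """Return a size x size additive mask as nested lists.

    Entry [i][j] is 0.0 when position i may attend to position j (j <= i)
    and -inf when j lies in the future of i.
    """
    return [
        [0.0 if j <= i else float("-inf") for j in range(size)]
        for i in range(size)
    ]


for row in causal_mask(4):
    print(row)
```

With PyTorch the same values would be stored in a tensor before being passed as `tgt_mask`.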
    


We initialise the model as follows:
 - we need to set the maximum vocabulary size to fix the `Embedding` architectures
 - we need to set the embedding dimension
 - initialise the model class with the needed parameters.

In [None]:
vocab_size = 50000  # maximum vocabulary size; to be discussed

src_vocab_size = vocab_size  # alternatively: len(tokenizer.vocabulary)
tgt_vocab_size = vocab_size  # alternatively: len(tokenizer.vocabulary_target)
emb_dim = 64

model = TranslationTransformer(src_vocab_size, tgt_vocab_size, emb_dim)
X = next(iter(dl_tr))  # a datum
#assert model(X[0]).argmax(2).shape == X[1].shape
print("This is our model: \n", model)
print("This is our model: \n", model)

## Define the loss function

This loss function is an adapted version of the Cross Entropy Loss for the transformer architecture we just defined.

In [None]:

def loss_fn(logits, tgt_out):
    cel = nn.CrossEntropyLoss()
    return cel(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
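
The two `reshape` calls flatten the batch and sentence dimensions so that every token position is scored as an independent classification over the vocabulary. The following library-free sketch reproduces that computation on a hypothetical toy batch; it assumes the default mean reduction of `nn.CrossEntropyLoss` and is meant only to show the arithmetic.

```python
import math


def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the targets under softmax(logits)."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[target]
    return total / len(targets)


# Hypothetical toy batch: 2 sentences, 3 tokens each, vocabulary of 4 tokens.
logits = [
    [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]],
    [[0.0, 0.0, 0.0, 2.0], [2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]],
]
targets = [[0, 1, 2], [3, 0, 1]]

# Flatten (batch, length, vocab) -> (batch * length, vocab) and
# (batch, length) -> (batch * length,), mirroring the reshape calls.
flat_logits = [row for sentence in logits for row in sentence]
flat_targets = [t for sentence in targets for t in sentence]

loss = cross_entropy(flat_logits, flat_targets)
print(round(loss, 4))
```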


And here we are: everything is set up and we are ready to train the model.

# Training the model

In giotto-deep everything is done via a `Trainer`: a class that takes care of the training and validation steps, of storing the intermediate results to tensorboard, of using (or not) cross validation... basically everything you need.

In [None]:
# prepare a trainer with the model, dataloaders, loss_fn and tensorboard writer
pipe = Trainer(model, (dl_tr, dl_val), loss_fn, writer)

# train the model
pipe.train(optimizer=SGD, 
           n_epochs=3, 
           cross_validation=False, 
           optimizers_param={"lr":0.01}, 
           dataloaders_param={"batch_size":16})
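
As a rough picture of what one call to `train` involves, the sketch below alternates a training step and a validation step for a few epochs on a one-parameter model. It is a plain-Python illustration under simplifying assumptions (toy data, hand-written gradient, a list instead of tensorboard), not the `Trainer` implementation.

```python
# Toy data following y = 2 * x; the train/validation split is arbitrary.
train_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
val_data = [(4.0, 8.0)]

w = 0.0       # the single trainable parameter
lr = 0.01     # learning rate
n_epochs = 3


def mse(data, w):
    """Mean squared error of the model y = w * x on `data`."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


history = []
for epoch in range(n_epochs):
    # training step: one SGD update per sample
    for x, y in train_data:
        grad = 2 * (w * x - y) * x  # derivative of (w * x - y) ** 2 w.r.t. w
        w -= lr * grad
    # validation step: evaluate only, no parameter update
    val_loss = mse(val_data, w)
    history.append(val_loss)  # a trainer would log this value instead
    print(f"epoch {epoch}: validation loss {val_loss:.4f}")
```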

## Translation!

The model has been trained, and now it is time to translate a sentence. Let's take the following example:

In [None]:
de, en = next(iter(ds_tr_str))
print(de, "\n", en)


Get the vocabulary and numericalise the German sentence:

In [None]:
voc = tokenizer.vocabulary
#sent = str.lower(de).split()
#de_sentence = list(map(voc.__getitem__,sent))
de_sentence = tokenizer((de, en))[0][0]
de_sentence

We need to build a couple of auxiliary functions to help us translate. The logic is to use the transformer encoder and decoder layers directly. Here is a quick reminder of how things work:

<img src="./images/translation_transformer.png" alt="drawing" width="400" class="center"/>

The above drawing is a good representation of our architecture. During training, the German and English sentences are fed in from "below". At inference time there is no need to use the "Outputs": the information from the "Inputs" encoder flows through the transformer up to the probabilistic output on top, which is where we get the translation.

In [None]:

def greedy_decode(model, src, src_mask, max_len, start_symbol):
    """function to generate output sequence using greedy algorithm"""
    memory = model.encode(src, src_mask)
    out = model.decode(src, memory, None)
    prob = model.generator(out)
    greedy_out = torch.max(prob, dim=2).indices
    return greedy_out
    

def translate(model: torch.nn.Module, dl_ts_item):
    """actual function to translate input sentence into target language"""
    model.eval()
    voc = tokenizer.vocabulary_target
    src = dl_ts_item
    num_tokens = src.shape[1]
    src_mask = None
    tgt_tokens_raw = greedy_decode(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=111).flatten()
    tgt_tokens = [token for token in tgt_tokens_raw if token < len(list(voc.vocab))]
    return " ".join(list(map(list(voc.vocab).__getitem__, tgt_tokens)))
    
# translation!
print("German sentence: ", de)
print("English translation: ", translate(pipe.model, de_sentence.reshape(1,-1)))
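
The `greedy_decode` above runs the decoder in a single pass and does not use `max_len` or `start_symbol`. For comparison, here is a library-free sketch of the usual autoregressive form of greedy decoding, where the output grows one token at a time. The scoring function `toy_scores`, the end-of-sentence index and the toy vocabulary are hypothetical stand-ins for a trained decoder.

```python
def greedy_decode_stepwise(next_token_scores, start_symbol, end_symbol, max_len):
    """Generate a sequence token by token, always taking the best-scoring token.

    `next_token_scores(prefix)` is a hypothetical callable returning one
    score per vocabulary index for the token that follows `prefix`.
    """
    output = [start_symbol]
    while len(output) < max_len:
        scores = next_token_scores(output)
        best = max(range(len(scores)), key=scores.__getitem__)
        output.append(best)
        if best == end_symbol:
            break
    return output


# Stand-in for a trained decoder: it prefers the token after the last one.
VOCAB = ["<bos>", "i", "like", "pizza", "<eos>"]


def toy_scores(prefix):
    scores = [0.0] * len(VOCAB)
    scores[min(prefix[-1] + 1, len(VOCAB) - 1)] = 1.0
    return scores


ids = greedy_decode_stepwise(toy_scores, start_symbol=0, end_symbol=4, max_len=10)
print(" ".join(VOCAB[i] for i in ids))
```

In a real model, `next_token_scores` would call the decoder on the tokens generated so far together with the encoder memory, and `max_len` guards against sequences that never emit the end symbol.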

In [None]:
# add data saliency map on translation

If you have just run this notebook, you most likely got a funny, nonsensical answer: keep in mind that you trained a very simple model on a very small subset of the data for only three epochs.

## Challenge

Starting from this simple notebook, do you think you can enlarge the dataset, the model, the training epochs and get a decent translator? Good luck!

# Hugging Face transformers

If you are actually interested in a pretrained transformer that works with just a few lines of code, `giotto-deep` supports Hugging Face transformers.

The next section explains how to run one such transformer.

In [None]:
# from transformers import ...

# take T5small

# finetune on the Multi30k

# go to TB to see the results + screenshot

# Appendix: Extract inner data from your models

With `gdeep` it is pretty straightforward to extract data about the inner workings of your models.

For example you can use the `ModelExtractor` to get:
 - the parameters of each layer
 - the activations (given an input)
 - the gradients, given a target

In [None]:

# initialise the ModelExtractor
me = ModelExtractor(pipe.model, loss_fn)

lista = me.get_layers_param()

for k, item in lista.items():
    print(k,item.shape)


In [None]:
DEVICE = torch.device("cpu")
x = next(iter(dl_tr))[0]
pipe.model.eval()
pipe.model(x.to(DEVICE))

list_activations = me.get_activations(x)
len(list_activations)


In [None]:
x = next(iter(dl_tr))[0][0]
if x.dtype is not torch.int64:
    res = me.get_decision_boundary(x, n_epochs=1)
    res.shape

In [None]:
x, target = next(iter(dl_tr))
if x.dtype is torch.float:
    for gradient in me.get_gradients(x, target=target)[1]:
        print(gradient.shape)

You can also visualise the model graph on tensorboard interactively!

In [None]:

# initialise the Visualiser
vs = Visualiser(pipe)

vs.plot_data_model()

