# Basic tutorial: translation
#### Author: Matteo Caorsi

This short tutorial introduces the basic functionality of the *giotto-deep* API.

The example described in this tutorial is a translation task solved with a sequence-to-sequence transformer model: all the details can be found [here](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf).

## Probabilistic modelling

The basic idea is to predict the target sentence $y$ given the source sentence $x$; more precisely, you want to predict $P(y|x)$. This is very hard to do directly, since one would need a lot of data to obtain a reasonable estimate of this conditional probability.
Hence, one would rather factor the probability using the chain rule and then predict each factor:

$$
P(y_1,\dots,y_N|x) = \prod_{i=1}^{N} P(y_i|y_{<i},x)
$$

In NLP terms, this corresponds to predicting the next token of the output sequence given the already predicted tokens and the input sequence.
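
As a minimal sketch of this factorisation, the snippet below multiplies per-token conditional probabilities (by summing their logarithms) to obtain the probability of a whole target sentence. The function `next_token_probs` and all the numbers in it are made up for illustration; in the tutorial, the trained transformer plays this role.

```python
import math


def next_token_probs(source, prefix):
    """Hypothetical next-token model with made-up probabilities.

    A real model would condition on `source`; this toy table ignores it.
    """
    table = {
        (): {"I": 0.9, "You": 0.1},
        ("I",): {"like": 0.8, "hate": 0.2},
        ("I", "like"): {"pizza": 0.5, "pasta": 0.5},
    }
    return table[tuple(prefix)]


def sequence_log_prob(source, target):
    """Sum of log P(y_i | y_<i, x) over the target tokens."""
    log_p = 0.0
    for i, token in enumerate(target):
        log_p += math.log(next_token_probs(source, target[:i])[token])
    return log_p


source = ["Ich", "mag", "Pizza"]
target = ["I", "like", "pizza"]
prob = math.exp(sequence_log_prob(source, target))
print(f"P(y | x) = {prob:.2f}")  # 0.9 * 0.8 * 0.5 = 0.36
```

Working with sums of logarithms rather than products of probabilities is a common choice, since products of many small numbers quickly underflow.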

## Scope

You will build your own transformer model, train it, and use it to translate German sentences to English!

Here is an example of what a translation task is about:

 - German sentence: 'Ich mag Pizza.'
 - English translation: 'I like pizza.'

## The plan for this tutorial

The main steps of the tutorial are the following:
 1. creation of a dataset
 2. preprocessing of the dataset
 3. creation of a model
 4. definition of the metrics and losses
 5. training of the model
 6. using the model to translate some sentences
 7. (extra) extract some features of the network for interpretability analysis
 
Let's start by importing the required libraries!

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import copy
import sys

import numpy as np
import torch
from torch import nn
from gtda.diagrams import BettiCurve
from gtda.plotting import plot_betti_surfaces
from torchtext.data.utils import get_tokenizer
from torch.utils.data.sampler import SubsetRandomSampler
from torch.nn import Transformer
from torch.optim import Adam, SparseAdam, SGD
import torchtext

# our special guests!
from gdeep.models import FFNet
from gdeep.visualization import persistence_diagrams_of_activations
from gdeep.data.datasets import DatasetBuilder
from gdeep.trainer import Trainer
from gdeep.data import TransformingDataset
from gdeep.data.preprocessors import TokenizerTranslation
from gdeep.data.datasets import DataLoaderBuilder
from gdeep.models import ModelExtractor
from gdeep.analysis.interpretability import Interpreter
from gdeep.visualization import Visualiser
from gdeep.search import GiottoSummaryWriter


# Initialize the tensorboard writer

In order to analyse the results of your models, you need to start tensorboard. All data about the model, the training, the hyperparameters... will be stored there.

## How to start tensorboard
In a terminal, move to the `/examples` folder and run the following command:

```
tensorboard --logdir=runs
```

Then go [here](http://localhost:6006/) after the training step to visualise all the results.

In [None]:
writer = GiottoSummaryWriter()


# Create your dataset

We propose to use the [Multi30k](https://github.com/multi30k/dataset) dataset: in giotto-deep it is very easy to import datasets! It is enough to write:

```python
# initialise the builder with the dataset name
bd = DatasetBuilder(name="Multi30k", convert_to_map_dataset=True)

# build the datasets
ds_tr_str, ds_val_str, ds_ts_str = bd.build()

```

### Disclaimer: the next cell is a temporary fix to a third-party download link issue.

In [None]:
# temporary cell, due to a failure of a third-party link: we are required to download the dataset manually!
# you would expect to simply run:
# bd = DatasetBuilder(name="Multi30k", convert_to_map_dataset=True)

# and then
# ds_tr_str, ds_val_str, ds_ts_str = bd.build()

# instead we are doing this mess:

import os
import gzip
import shutil
from typing import List

from torchdata.datapipes.iter import FileOpener, FileLister, StreamReader
from torch.utils.data import Dataset

from gdeep.utility import DEFAULT_DOWNLOAD_DIR


base_path = os.path.join(DEFAULT_DOWNLOAD_DIR, "Multi30k")
url_de = 'https://github.com/multi30k/dataset/raw/master/data/task1/raw/train.de.gz'
url_en = 'https://github.com/multi30k/dataset/raw/master/data/task1/raw/train.en.gz'
torchtext.utils.download_from_url(url_en, path=os.path.join(base_path, "train.en.gz"))
torchtext.utils.download_from_url(url_de, path=os.path.join(base_path, "train.de.gz"))
url_de = 'https://github.com/multi30k/dataset/raw/master/data/task1/raw/val.de.gz'
url_en = 'https://github.com/multi30k/dataset/raw/master/data/task1/raw/val.en.gz'
torchtext.utils.download_from_url(url_en, path=os.path.join(base_path, "val.en.gz"))
torchtext.utils.download_from_url(url_de, path=os.path.join(base_path, "val.de.gz"))
url_de = 'https://github.com/multi30k/dataset/raw/master/data/task1/raw/test_2016_flickr.de.gz'
url_en = 'https://github.com/multi30k/dataset/raw/master/data/task1/raw/test_2016_flickr.en.gz'
torchtext.utils.download_from_url(url_en, path=os.path.join(base_path, "test.en.gz"))
torchtext.utils.download_from_url(url_de, path=os.path.join(base_path, "test.de.gz"))

# we store the data in the RAM
in_memory_data = []


for item in ["train.de", "train.en", "val.de", "val.en", "test.de", "test.en"]:
    with gzip.open(os.path.join(base_path, item+".gz"), 'rb') as f_in:
        with open(os.path.join(base_path, item), 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

    with open(os.path.join(base_path, item), 'r', encoding="utf8") as f:
        in_memory_data.append(f.read())

assert len(in_memory_data) == 6
assert len(in_memory_data[0].split("\n")) == len(in_memory_data[1].split("\n"))
assert len(in_memory_data[2].split("\n")) == len(in_memory_data[3].split("\n"))

class MyMulti30k(Dataset):
    """Temporary Multi30k Dataset."""

    def __init__(self, data_tuple: List[str]) -> None:
        """
        Args:
            data_tuple:
                two strings, one per language, each holding
                newline-separated sentences
        """
        
        assert len(data_tuple) == 2, "expected 2 lists corresponding to the two languages"
        self.data = list(zip(data_tuple[0].split("\n"), data_tuple[1].split("\n")))


    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
    
ds_tr_str, ds_val_str, ds_ts_str = MyMulti30k(in_memory_data[:2]), MyMulti30k(in_memory_data[2:4]), MyMulti30k(in_memory_data[4:6])

The dataset contains a list of pairs of sentences: the German sentence and its English translation.

In [None]:
print("Before preprocessing: \n", ds_tr_str[0])


## Required preprocessing

Neural networks cannot directly deal with strings. We first have to preprocess the dataset in three main steps:
 1. Tokenise each string into its words (and maybe adjust each word to remove plurals, interjections, capital letters...)
 2. Build a vocabulary out of these tokens (each modified word from step 1 is called a token)
 3. Embed each token into a vector, so that each sentence becomes a list of vectors

The **first two steps** are performed by the `TokenizerTranslation` class. The embedding will be added directly to the model (see below).
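
To make the first two steps concrete, here is a minimal pure-Python sketch of tokenising and building a vocabulary. It is not how `TokenizerTranslation` is implemented: the whitespace tokeniser, the special tokens `<pad>` and `<unk>`, and their indices are assumptions chosen for illustration.

```python
from collections import Counter

sentences = ["Ich mag Pizza .", "Ich mag Pasta ."]

# Step 1: tokenise (here simply lower-case and split on whitespace)
tokenised = [s.lower().split() for s in sentences]

# Step 2: build the vocabulary. Index 0 is reserved for padding and
# index 1 for unknown tokens (an assumed convention).
specials = ["<pad>", "<unk>"]
counts = Counter(tok for sent in tokenised for tok in sent)
itos = specials + sorted(counts)                    # index -> token
stoi = {tok: idx for idx, tok in enumerate(itos)}   # token -> index


def encode(sentence):
    """Map a sentence to the list of vocabulary indices of its tokens."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in sentence.lower().split()]


def decode(indices):
    """Map vocabulary indices back to tokens."""
    return [itos[i] for i in indices]


encoded = encode("Ich mag Pizza .")
print(encoded)                    # [3, 4, 6, 2]
print(decode(encoded))            # ['ich', 'mag', 'pizza', '.']
print(encode("Ich mag Bier ."))   # [3, 4, 1, 2] ("bier" is unknown)
```

The third step, the embedding, then amounts to looking up one learned vector per index, which is what the `Embedding` layers of the model do.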

In [None]:

# install german tokenizer
!{sys.executable} -m spacy download de_core_news_sm

# get the german tokenizer
de_tokenizer = get_tokenizer('spacy', language='de_core_news_sm')

# initialise the giotto-deep tokenizer
tokenizer = TokenizerTranslation(tokenizer=de_tokenizer)

# fit the tokenizer to the dataset (note that the vocabularies will be automatically built in this case)
tokenizer.fit_to_dataset(ds_tr_str)

# preprocess the dataset
transformed_textds = tokenizer.attach_transform_to_dataset(ds_tr_str)
transformed_textts = tokenizer.attach_transform_to_dataset(ds_val_str)

print("After the preprocessing: \n", transformed_textds[0])

# subsample the training and validation datasets
train_indices = list(range(64*2))
test_indices = list(range(64*1))

dl_tr, dl_val, _ = DataLoaderBuilder((transformed_textds, transformed_textts)).build(
    (
        {"batch_size": 16, "sampler": SubsetRandomSampler(train_indices)},
        {"batch_size": 16, "sampler": SubsetRandomSampler(test_indices)},
    )
)


As you can see, now the data is not in string format anymore: rather, the sentences have each been transformed to a `torch.Tensor` of type `long`. Each of these numbers represents the index in the vocabulary of the associated token.

## Define your model

The model we play with is a simple transformer model with two embedding layers (one for the German and one for the English sentences) followed by a single transformer layer.

The input of this model is assumed to be a tensor whose shape matches that of `transformed_textds[0][0]`.

Of course, as in all neural networks, data are fed to the model in batches: hence, the first dimension is the batch size, the second selects the source or the target sentence, and the last one runs over the tokens of the sentence.
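
The layout described above can be sketched with plain nested lists. The token indices are made up, and padding every sentence to a common length with index `0` is an assumption made here so that all sentences in a batch have the same length.

```python
PAD = 0  # assumed padding index

# Hypothetical token indices for two (German, English) sentence pairs
pairs = [
    ([5, 8, 3], [4, 9, 7, 2]),
    ([6, 2], [11, 3]),
]

# Length of the longest sentence in the batch
max_len = max(len(sent) for pair in pairs for sent in pair)


def pad(seq, length):
    """Right-pad a list of indices with PAD up to the given length."""
    return seq + [PAD] * (length - len(seq))


# One entry per pair; each entry holds [source, target]
batch = [[pad(src, max_len), pad(tgt, max_len)] for src, tgt in pairs]

shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(shape)  # (batch size, source/target, tokens) = (2, 2, 4)

# Selecting along the second dimension separates the two languages
sources = [item[0] for item in batch]
targets = [item[1] for item in batch]
```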

In [None]:
# my simple transformer model
class TranslationTransformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, embed_dim):
        super(TranslationTransformer, self).__init__()
        self.transformer = Transformer(
            d_model=embed_dim,
            nhead=2,
            num_encoder_layers=1,
            num_decoder_layers=1,
            dim_feedforward=512,
            dropout=0.1,
        )
        self.embedding_src = nn.Embedding(src_vocab_size, embed_dim, sparse=True)
        self.embedding_tgt = nn.Embedding(tgt_vocab_size, embed_dim, sparse=True)
        self.generator = nn.Linear(embed_dim, tgt_vocab_size)

    def forward(self, src, tgt):
        src_emb = self.embedding_src(src).permute(1, 0, 2)
        tgt_emb = self.embedding_tgt(tgt).permute(1, 0, 2)
        self.outs = self.transformer(src_emb, tgt_emb).permute(1, 0, 2)
        logits = self.generator(self.outs)
        return logits

    def encode(self, src, src_mask):
        """this method is used only at the inference step.
        This method runs the data through the encoder of the
        transformer"""
        return self.transformer.encoder(self.embedding_src(src), src_mask)

    def decode(self, tgt, memory, tgt_mask):
        """this method is used only at the inference step
        This method runs the data through the decoder of the
        transformer"""
        return self.transformer.decoder(self.embedding_tgt(tgt), memory, tgt_mask)


We initialise the model as follows:
 - we need to set the maximum vocabulary size to fix the `Embedding` architectures
 - we need to set the embedding dimension
 - initialise the model class with the needed parameters.

In [None]:
src_vocab_size = len(tokenizer.vocabulary)
tgt_vocab_size = len(tokenizer.vocabulary_target)
emb_dim = 64

model = TranslationTransformer(src_vocab_size, tgt_vocab_size, emb_dim)
X = next(iter(dl_tr))
# a datum
# assert model(X[0]).argmax(2).shape == X[1].shape
print("This is our model: \n", model)


## Define the loss function

This loss function is an adapted version of the Cross Entropy Loss for the transformer architecture we just defined.
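
The adaptation consists mostly of a reshape. The sketch below, using random tensors with made-up sizes, illustrates why: the logits of a sequence model carry one score vector per token, so the batch and sequence axes are merged before applying `nn.CrossEntropyLoss`, which with its default settings averages over all tokens. The manual computation is only there as a cross-check.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 2, 5, 11  # made-up sizes

# One score per vocabulary entry for every token of every sentence
logits = torch.randn(batch_size, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

# Merge the batch and sequence axes: every token becomes one
# classification example over the vocabulary
flat_logits = logits.reshape(-1, vocab_size)  # shape (10, 11)
flat_targets = targets.reshape(-1)            # shape (10,)
loss = nn.CrossEntropyLoss()(flat_logits, flat_targets)

# Cross-check: mean negative log-probability of the correct tokens
log_probs = torch.log_softmax(logits, dim=-1)
manual = -log_probs.gather(-1, targets.unsqueeze(-1)).mean()

print(loss.item(), manual.item())
```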

In [None]:
def loss_fn(logits, tgt_out):
    cel = nn.CrossEntropyLoss()
    return cel(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))


And here we are: everything is set up and we are ready to train the model.

# Training the model

In giotto-deep everything is done via a `Trainer`, meaning a class that takes care of the training and validation steps, storing the intermediate results to tensorboard, using (or not) cross validation, ... basically everything you need.
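
To give an idea of the kind of loop such a class automates, here is a deliberately tiny pure-Python sketch: a one-parameter model fitted with SGD, with a training step and a validation step per epoch. It is not the `Trainer` implementation; the toy data, the learning rate and the `history` list (standing in for the tensorboard log) are all assumptions for illustration.

```python
# Toy data for y = 2 * x, split into training and validation sets
train_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
val_data = [(4.0, 8.0)]


def mse(w, data):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


w = 0.0       # the single model parameter
lr = 0.05     # learning rate
history = []  # stands in for the tensorboard log

for epoch in range(20):
    # training step: one SGD update per sample
    for x, y in train_data:
        grad = 2 * (w * x - y) * x
        w -= lr * grad
    # validation step: evaluate without updating the parameter
    history.append((epoch, mse(w, train_data), mse(w, val_data)))

print("learned w:", w)
print("first and last validation loss:", history[0][2], history[-1][2])
```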

In [None]:
# prepare a pipeline class with the model, dataloaders, loss_fn and tensorboard writer
pipe = Trainer(model, (dl_tr, dl_val), loss_fn, writer)

# train the model
pipe.train(
    optimizer=SGD,
    n_epochs=2,
    cross_validation=False,
    optimizers_param={"lr": 0.01},
    dataloaders_param={"batch_size": 16},
)


## Translation!

The model has been trained, and now it is time to translate a sentence. Let's take the following example:

In [None]:
de, en = next(iter(ds_tr_str))
print(de, "\n", en)



Get the vocabulary and convert the German sentence to token indices:

In [None]:
voc = tokenizer.vocabulary
# sent = str.lower(de).split()
# de_sentence = list(map(voc.__getitem__,sent))
de_sentence, en_sentence = tokenizer((de, en))[0]
de_sentence


We need to build a couple of auxiliary functions to help us translate. The logic is to use the transformer encoder and decoder layers directly. Here is a quick reminder of how things work:

<img src="./images/translation_transformer.png" alt="drawing" width="400" class="center"/>

The drawing above is a good representation of our architecture. During training, the German and English sentences are both fed in from "below". At inference time, the procedure changes. The full input sentence (the German one in our case) is given to the encoder, and the output (the probabilities) is greedily reduced to a single token. That token is then placed in the "Outputs" at the bottom: together with the input sentence, it is used to predict the next word (at the top), which is in turn fed back into the "Outputs". Hence, word after word, you get the translation of the original sentence.
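
The loop described above can be sketched independently of the model. In the snippet below, `next_token_scores` is a hypothetical stand-in for the encoder, decoder and generator: it simply looks words up in a made-up dictionary. The `<bos>` and `<eos>` markers are also assumptions of this sketch. Only the structure of the loop (score, pick the best token, feed it back) is the point.

```python
BOS, EOS = "<bos>", "<eos>"  # assumed start and end markers


def next_token_scores(source, generated):
    """Hypothetical model: score each candidate for the next token.

    It "translates" word by word with a made-up lookup table and
    proposes EOS once the source sentence is exhausted.
    """
    lookup = {"Ich": "I", "mag": "like", "Pizza": "pizza"}
    position = len(generated) - 1  # tokens produced so far, without BOS
    best = lookup[source[position]] if position < len(source) else EOS
    vocab = ["I", "like", "pizza", EOS]
    return {tok: (1.0 if tok == best else 0.0) for tok in vocab}


def greedy_decode(source, max_len):
    """Generate tokens one at a time, always keeping the best-scoring one."""
    generated = [BOS]
    for _ in range(max_len):
        scores = next_token_scores(source, generated)
        next_token = max(scores, key=scores.get)  # the greedy choice
        if next_token == EOS:
            break
        generated.append(next_token)  # fed back in at the next step
    return generated[1:]  # drop the start marker


translation = greedy_decode(["Ich", "mag", "Pizza"], max_len=10)
print(" ".join(translation))  # I like pizza
```

Greedy decoding keeps only the single best token at every step; it is simple and fast, but it can miss translations that a wider search would find.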

In [None]:
from gdeep.utility import DEVICE

pad_item = 0
max_len = len(en_sentence)


def greedy_decode(model, src, src_mask, max_len):
    """function to generate output sequence using greedy algorithm"""

    memory = model.encode(src, src_mask)
    # ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long)
    ys = torch.ones(*src.shape).fill_(pad_item).type(torch.long).to(DEVICE)
    for i in range(max_len - 1):
        # print("before:", ys)
        out = model.decode(ys, memory, None)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        next_word = torch.argmax(prob, dim=1)
        next_word = next_word[i + 1].item()  # the shift
        ys[0, i] = next_word
        # print("after:", ys)
    return ys


def translate(model: torch.nn.Module, dl_ts_item):
    """actual function to translate input sentence into target language"""
    model.eval()
    voc = tokenizer.vocabulary_target
    src = dl_ts_item.to(DEVICE)
    num_tokens = src.shape[1]
    src_mask = None
    tgt_tokens_tensor = greedy_decode(model, src, src_mask, max_len).flatten().to(DEVICE)
    tgt_tokens = tgt_tokens_tensor.tolist()
    # print(tgt_tokens)
    return " ".join(voc.lookup_tokens(tgt_tokens))


# translation!
print("German sentence: ", de)
print("English translation: ", translate(pipe.model, de_sentence.reshape(1, -1)))


## Extra: interpretability

In the next section we show you how to run some simple interpretability tools, like plotting the attention matrices of the transformer.

The first step is to extract the values of the attention activations given an input `x`.
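
As a reminder of what an attention matrix contains, the following sketch computes scaled dot-product attention weights, `softmax(Q K^T / sqrt(d))`, in pure Python. The token vectors are made up, and the learned query and key projections of a real transformer layer are omitted, so this only illustrates the structure of the matrix: one row per query token, with entries that sum to one.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Compute softmax(Q K^T / sqrt(d)) row by row."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights.append(softmax(scores))
    return weights


# Made-up 2-dimensional vectors for three tokens; in self-attention the
# queries and the keys come from the same tokens
tokens = ["ich", "mag", "pizza"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

attn = attention_weights(vectors, vectors)
for tok, row in zip(tokens, attn):
    print(tok, [round(w, 2) for w in row])
```

Each row tells how strongly the corresponding token attends to every other token, which is what the attention plot visualises.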

In [None]:
from gdeep.models import ModelExtractor

# the candidate datum
x = [de_sentence.view(1,-1).to(DEVICE), en_sentence.view(1,-1).to(DEVICE)]  # only one pair of sentences, shaped as a batch of dim = 1

# the model extractor
ex = ModelExtractor(pipe.model, loss_fn)

# getting the names of the layers
layer_names = ex.get_layers_param().keys()

print("Let's extract the activations of the first attention layer: ", next(iter(layer_names)))
self_attention = ex.get_activations(x)[:2]


In [None]:
# let's plot the tensor! First, load the visualizer
vs = Visualiser(pipe)
vs.plot_self_attention(self_attention, tokenizer.tokenizer(en), de_tokenizer(de), figsize=(20, 20));


If you have just run this notebook, you most likely got some funny, nonsensical answer: keep in mind that you trained a very simple model on a very small subset of the data for only two epochs.

## Challenge

Starting from this simple notebook, do you think you can enlarge the dataset, the model, the training epochs and get a decent translator? Good luck!

# (WIP) Huggingface transformers

If you are actually interested in a pretrained transformer that works directly with a few lines of code, `giotto-deep` supports Hugging Face transformers.

The next section explains how to run one such transformer.

In [None]:
# from transformers import ...

# take T5

# finetune on the Multi30k

# go to TB to see the results + screenshot
