# Reimplementing the Transformer to learn more about the Attention mechanism
## Baptiste Amato, Alexis Durocher, Gabriel Hurtado, Alexandre Jouandin, Vincent Marois

Spring 2019 CS 7643 Deep Learning Class Project
Georgia Tech

This webpage is a progress report on our study of the Transformer ([cite](https://arxiv.org/abs/1706.03762)) architecture.

# Abstract / Introduction / Motivation

The `Transformer` model, published in 2017 by Vaswani et al., established state-of-the-art results in translation using an encoder-decoder architecture without recurrence, breaking with previously established models. It relies only on the attention mechanism ([cite](https://arxiv.org/abs/1409.0473)) to compute representations of its inputs and outputs. This alone makes it an interesting model to study. Additionally, several newer models build on its architecture (Universal Transformers ([cite](https://arxiv.org/abs/1807.03819)), OpenAI's GPT-2 ([cite](https://openai.com/blog/better-language-models/)), etc.), which reinforces its value as a robust building block for complex models.

Our approach so far has been to reimplement the `Transformer` architecture, with help from online resources (such as [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html) from Harvard's NLP group), and with an emphasis on object-oriented design.

As of today, we have successfully trained the `Transformer` on the simple algorithmic *copy task* (i.e. learning to copy the inputs to the output). This makes us confident that our implementation of the model is correct and that gradients are flowing correctly throughout its structure.

Future tasks include training on the main translation dataset, multi-GPU support, and visualization of the attention, all subject to time constraints.

A Colab notebook is available [here](https://colab.research.google.com/drive/1QPp8bFpzEgSdZlRmgCOTL60SwUl5gwf2) and shows the training of the model on the *copy task*, along with a correct prediction of an input tensor.

# Teaser figure

Here is an image of the `Transformer` architecture, drawn from [here](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html):

![The Transformer architecture and a zoom on the attention mechanism](img/transformer.png)
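
For reference, the attention block highlighted in the figure computes, following the original paper, `softmax(Q K^T / sqrt(d_k)) V`. Below is a minimal sketch of that formula in plain Python, assuming list-of-lists inputs instead of tensors. The function names and the `mask` convention (`False` hides a position) are illustrative assumptions, not necessarily those of our implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head.

    queries: list of n_q vectors of size d_k
    keys:    list of n_k vectors of size d_k
    values:  list of n_k vectors of size d_v
    mask:    optional n_q x n_k list of booleans; False hides a position
    Returns (outputs, weights).
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in keys
        ]
        if mask is not None:
            # Hidden positions get a very negative score, hence ~0 weight.
            scores = [s if keep else -1e9 for s, keep in zip(scores, mask[i])]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output = [
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(d_v)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights
```

Multi-head attention applies this same operation several times in parallel on learned linear projections of the queries, keys and values, then concatenates the results.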

# Approach

Here are some remarks on our approach:

- We began by reading up on the model. This included the original paper (~1500 citations as of today), the tutorial by Harvard's NLP group mentioned above, and other papers (notably on the attention mechanism).
- Next, we started to implement the main layers of the Transformer, following an object-oriented approach:
    - the `Encoder` & `Decoder` layers,
    - `MultiHeadAttention` & `ScaledDotProductAttention`,
    - `ResidualConnection`, `LayerNormalization`,
    - etc.

- Once these building blocks were implemented and unit-tested, we created a main object (called the `Trainer`) that handles training and evaluation of the model. This object creates the model, the optimizer, and the dataset (discussed below), and trains the network with a loss function (either cross-entropy or KL divergence with label-smoothing regularization). The optimizer is based on Adam ([cite](https://arxiv.org/abs/1412.6980)) with a custom learning-rate update rule.

- In parallel, the dataset ([IWSLT 2016](https://workshop2016.iwslt.org)) has been implemented as a class, which handles:
    - The creation of the vocabulary set (i.e. removing infrequent words, trimming punctuation etc.),
    - The tokenization of each word (mapping word -> index), including special tokens (`start_token="<s>"`, `eos_token="</s>"`, `blank_token="<blank>"`),
    - The creation of the masks hiding the padding for each batch.

- A simple dataset class representing the *copy task* has also been implemented. This class generates random samples of a given size, with values in a given range, and feeds them as both the input and output sequences of the model.

- A simplified model has then been trained on the *copy task* and converges rapidly on a reduced-size dataset.
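
The data side of the steps above can be sketched as follows. This is a minimal illustration in plain Python, not our actual dataset classes: the function names, the choice of index `0` for the `<blank>` token, and the list-of-lists representation are all assumptions made for the example. The `subsequent_mask` helper reflects the decoder masking described in the original paper.

```python
import random

BLANK = 0  # hypothetical padding index standing in for "<blank>"


def make_copy_batch(batch_size, seq_len, vocab_size, rng=random):
    """Generate random sequences used as both source and target.

    Token values are drawn from 1..vocab_size-1 so that index 0 stays
    reserved for padding.
    """
    src = [
        [rng.randint(1, vocab_size - 1) for _ in range(seq_len)]
        for _ in range(batch_size)
    ]
    tgt = [list(seq) for seq in src]  # the target is a copy of the source
    return src, tgt


def padding_mask(batch, pad=BLANK):
    """True where a real token sits, False where the position is padding."""
    return [[token != pad for token in seq] for seq in batch]


def subsequent_mask(size):
    """Lower-triangular mask: position i may only attend to positions <= i."""
    return [[j <= i for j in range(size)] for i in range(size)]
```

For example, `padding_mask([[3, 4, 0]])` marks the last position as hidden, and `subsequent_mask(3)` prevents the decoder from looking at tokens it has not yet produced.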

# Current Status

We now have a model that converges on the *copy task* and produces relatively accurate predictions. You can see an example (running on GPUs) on Colab [here](https://colab.research.google.com/drive/1QPp8bFpzEgSdZlRmgCOTL60SwUl5gwf2).

We are now working towards integrating the full IWSLT dataset into the training procedure. This should not take long. Nonetheless, we expect difficulties in making the model converge on the entire dataset. Although the simple algorithmic task gives us some evidence that the implementation is correct, we plan to make the model overfit a small portion of the dataset to validate it further.

The table below shows the accomplished tasks and the effort distribution per member. All project members contributed equally to this progress report. Please see the section below for the list of future tasks.

|                        Tasks                        |   Duration   |        Team members         |
|-----------------------------------------------------|--------------|-----------------------------|
|   Reading up on `Transformer` and attention         |    1 week    |          all members        |
|   Initial implementation of the main classes        |    1 week    |      Alexis, Alexandre      |
|   GPU support & AWS script setup                    |    1 day     |            Baptiste         |
|   Dataset implementation                            |    1 week    |  Alexis, Alexandre, Gabriel |
|   Optimizer, loss, training loop                    |    1 week    |  Vincent, Gabriel, Baptiste |
|   Convergence on *copy task*                        |    1 week    |      Vincent, Alexandre     |
|   Writing of this progress report                   |    1 day     |          all members        |

# Experimental Plan

Here is a list of tasks we plan to complete before the end of the project:

* [ ]  Integrate the main dataset into the training loop
* [ ]  Integrate the BLEU metric as equivalent of accuracy
* [ ]  Have the model overfit on a small portion of the dataset
* [ ]  Hyper-parameter impact & tuning: starting from the values given in the original paper, we want to:
    - First, ensure the model converges to appropriate performance, i.e. close to the original results.
    - Then, test the impact of certain hyper-parameters of the model (e.g. the number of attention heads or the number of layers in the encoder / decoder).
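
As a reference point for this tuning, the base configuration and the learning-rate schedule reported in the original paper can be written down as below. This is a sketch under assumptions: the dictionary keys and the function name are illustrative, and the schedule shown is the one from the paper, `d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)`, which may differ in detail from the custom rule used in our `Trainer`.

```python
# Base model values reported by Vaswani et al. (2017); key names are illustrative.
BASE_CONFIG = {
    "num_layers": 6,         # encoder and decoder stacks
    "d_model": 512,          # embedding / hidden size
    "d_ff": 2048,            # inner size of the feed-forward layers
    "num_heads": 8,          # attention heads
    "dropout": 0.1,
    "label_smoothing": 0.1,
    "warmup_steps": 4000,
    "adam_betas": (0.9, 0.98),
    "adam_eps": 1e-9,
}


def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given optimizer step, per the original paper.

    The rate grows linearly during warmup, then decays as 1 / sqrt(step).
    """
    step = max(step, 1)  # avoid raising 0 to a negative power
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

With these values the learning rate peaks at step `warmup_steps`, which makes `warmup_steps` and `d_model` natural candidates to revisit when changing the model size.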

The following tasks will be constrained by the time it takes to get the model to converge correctly:
* [ ]  Support multi-GPU training
* [ ]  Add TensorBoard support to monitor training
* [ ]  Build a visualization tool for the attention mechanisms
* [ ]  Implement some features from the original paper that we have so far left out because of their complexity and resource consumption. If time allows, we will incorporate:
    - Beam search, i.e. keeping the top k translation candidates instead of taking the single most likely token at each step,
    - Model averaging: the paper averages the last k checkpoints to create an ensembling effect.


* [ ]  Build an API around the model's prediction ability to integrate it e.g. in a web page
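
To make the beam search item above concrete, here is a minimal sketch of the idea in plain Python. It is not tied to our model: `next_log_probs` is a hypothetical callback that returns the log-probability of each possible next token given a prefix, and `toy_model` is a made-up distribution used only for illustration. The sketch also omits length normalization, which practical implementations usually add.

```python
import math


def beam_search(next_log_probs, start_token, eos_token, beam_size=3, max_len=10):
    """Keep the `beam_size` best partial sequences at each decoding step.

    next_log_probs(prefix) -> {token: log_probability} is a hypothetical
    callback standing in for the decoder. Returns (best_sequence, score).
    """
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for seq, score in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep the best candidates; set aside those that have ended.
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos_token:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])


def toy_model(prefix):
    """Made-up next-token distribution where greedy decoding is suboptimal."""
    table = {
        ("<s>",): {"a": 0.6, "b": 0.4},
        ("<s>", "a"): {"x": 0.5, "y": 0.5},
        ("<s>", "b"): {"z": 0.9, "w": 0.1},
    }
    probs = table.get(tuple(prefix), {"</s>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}
```

On this toy distribution, `beam_size=1` (greedy decoding) commits to `a` first and ends with a sequence of probability 0.30, whereas `beam_size=2` keeps `b` alive and finds `<s> b z </s>` with probability 0.36.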


We expect to focus most of our time and effort on getting the model to converge properly.

# Help & Conclusion

We think that we have a good initial implementation on which to continue our effort towards training the `Transformer` on the IWSLT dataset. We estimate that this part will take 1 to 2 weeks.

While we have limited access to GPUs using Colab (which has worked well so far), we would like to use AWS, specifically for multi-GPU training, as this could speed up our experiments.
Each team member received some AWS credit thanks to another class, but this credit cannot be used for GPU instances.