## Introduction

This is a short write-up explaining the repo, the steps taken and the results achieved. No code is run here; instead, results and content are pasted in from elsewhere in the repo.

It is hard to know how much detail to go into. The other notebooks and the Python function files are also commented, so some things may be over-explained here and others under-explained.

## What the repo contains

This repo started as an exercise in learning PyTorch and understanding the attentional seq2seq architecture in full detail. We set up a toy French-to-English translation task, __implement the attentional seq2seq architecture of [Luong et al. (2015)](https://arxiv.org/pdf/1508.04025)__ and experiment with hyperparameters to find a good combination. The final model is then briefly investigated.

I have borrowed heavily from the [practical-pytorch](https://github.com/spro/practical-pytorch) repo by [spro](https://github.com/spro), in particular, [this notebook](https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation-batched.ipynb). This resource provided the starting point and answered many of the initial questions I had. However, all the code I have borrowed has been double-checked, sometimes fixed, sometimes reworked. And I have integrated this with my own code for training and plotting the learning curves, beam-search decoding, etc.

The translation task has been greatly simplified here for ease of compute, yet the same code should be able to handle much more involved seq2seq learning problems. All the computations here have been done using an Apple M2 Max machine, and the time taken, as reported in the legends of learning curves, is for that machine.

__Final result in a nutshell.__ Training on 5k French-English sentence pairs of length 2 to 5 tokens (inclusive), we reach a cross-entropy score of 1.39:

<img src="results/5e-03_1_32_{'h_size':90,'dropout':0.2,'n_layers':2,'att_method':'general','c':'final_model'}.png" alt="pic" width="300"/>
                  
When we beam-search from this model, we find...

## Sources

[The Luong paper](https://arxiv.org/pdf/1508.04025), one of the pioneering works on attention. We implement the architecture proposed therein.

The [paper](https://arxiv.org/pdf/1506.03099) proposing Scheduled Sampling. There is a set of helpful illustrations at the top of page 4. We implement Scheduled Sampling for the decoder we use.

[practical-pytorch](https://github.com/spro/practical-pytorch) repo. Provided the starting point for this work.

Oxford [Deep Learning for NLP](https://github.com/oxford-cs-deepnlp-2017/lectures) lecture course. Good for understanding the broader research context at the time the Luong paper emerged.
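
To make the Scheduled Sampling idea mentioned above concrete, here is a minimal sketch, assuming a per-step coin flip between the ground-truth token and the model's own previous prediction. The names `choose_decoder_input`, `teacher_forcing_prob` and `linear_decay` are hypothetical and are not taken from this repo; the decoder in model_functions.py may organise this differently, and the linear schedule is only one of several possible decay schedules.

```python
import random


def choose_decoder_input(target_token, predicted_token, teacher_forcing_prob, rng=random):
    """Pick the next decoder input for one time step.

    With probability `teacher_forcing_prob` the ground-truth token is fed in
    (teacher forcing); otherwise the model's own previous prediction is used.
    """
    if rng.random() < teacher_forcing_prob:
        return target_token
    return predicted_token


def linear_decay(step, total_steps, start=1.0, end=0.0):
    """Hypothetical schedule: move linearly from `start` to `end` over training."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * fraction


# Early in training the decoder mostly sees ground-truth tokens,
# later it mostly sees its own predictions.
for step in (0, 500, 1000):
    prob = linear_decay(step, total_steps=1000)
    token = choose_decoder_input("chat", "chien", prob)
    print(step, prob, token)
```

Starting near full teacher forcing and decaying towards the model's own predictions is meant to narrow the gap between training and inference, where ground-truth tokens are not available.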

## Data preparation

The dataset is available here. 

Create a "datasets" folder and place the "eng-fra.txt" file in it. The "dataprep.ipynb" notebook should then be executable end-to-end.

We clean the file and simplify the corpus in terms of maximum length and number of words encountered. The vocabulary for both French and English ends up consisting of about 700 words each.

The dataprep notebook should reproduce exactly upon re-run. These are the files one ends up with after executing the notebook:

<img src="write-up_pics/a.png" alt="pic" width="150"/>

where trial1 and trial2 are just shortened datasets that can be useful for debugging and experiments.
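
The cleaning and simplification steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in dataprep.ipynb: the helper names (`normalize`, `filter_pairs`, `build_vocab`) are hypothetical, accent stripping and counting punctuation as tokens are assumptions, and the notebook may apply different rules.

```python
import re
import unicodedata
from collections import Counter


def normalize(text):
    """Lowercase, strip accents (an assumption) and separate sentence punctuation."""
    text = unicodedata.normalize("NFD", text.lower().strip())
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    text = re.sub(r"([.!?])", r" \1", text)   # put a space before . ! ?
    text = re.sub(r"[^a-z.!?]+", " ", text)   # replace anything else with one space
    return text.strip()


def filter_pairs(pairs, min_len=2, max_len=5):
    """Keep pairs where both sentences have min_len..max_len tokens (inclusive)."""
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if min_len <= n_src <= max_len and min_len <= n_tgt <= max_len:
            kept.append((src, tgt))
    return kept


def build_vocab(sentences, min_count=1):
    """Return the sorted list of tokens that occur at least `min_count` times."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return sorted(tok for tok, c in counts.items() if c >= min_count)


# Made-up example pairs, (French, English), for illustration only.
raw_pairs = [("Va !", "Go."), ("J'ai froid.", "I am cold."), ("Oui", "Yes")]
pairs = filter_pairs([(normalize(fr), normalize(en)) for fr, en in raw_pairs])
print(pairs)
print(build_vocab(fr for fr, _ in pairs))
```

Raising `min_count` is one way to shrink the vocabulary further, at the cost of having to drop or mask sentences that contain rare words.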

## Coding up the model

The model we wish to implement is summarised in [the Luong paper](https://arxiv.org/pdf/1508.04025), page 3, as well as in many blogs.

This has been coded up, in batched form, in model_functions.py. The code is commented throughout.
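
For reference, the attention computation behind the 'dot', 'general' and 'concat' options can be written as follows, in the notation of the paper linked above: $h_t$ is the decoder hidden state at step $t$ and $\bar{h}_s$ is the encoder hidden state at source position $s$. This is a summary of the paper's equations and not a description of how model_functions.py lays out the computation.

$$
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^\top \bar{h}_s & \text{dot} \\
h_t^\top W_a \bar{h}_s & \text{general} \\
v_a^\top \tanh\left(W_a [h_t; \bar{h}_s]\right) & \text{concat}
\end{cases}
$$

$$
a_t(s) = \frac{\exp\left(\mathrm{score}(h_t, \bar{h}_s)\right)}{\sum_{s'} \exp\left(\mathrm{score}(h_t, \bar{h}_{s'})\right)},
\qquad
c_t = \sum_s a_t(s)\, \bar{h}_s
$$

$$
\tilde{h}_t = \tanh\left(W_c [c_t; h_t]\right),
\qquad
p(y_t \mid y_{<t}, x) = \mathrm{softmax}\left(W_s \tilde{h}_t\right)
$$

The 'dot' score has no parameters of its own, 'general' adds a single matrix $W_a$, and 'concat' needs both a matrix and a vector plus a $\tanh$ for every source position.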

## Training and hyperparameter choices

Ok, so we have a model now. What are reasonable hyperparameter choices? Let's spend some time finding good hyperparameters, though we will not optimise for the very best possible.

We start with a guess and see what happens:

<img src="results/5e-03_1_32_{'h_size':30,'dropout':0,'n_layers':2,'att_method':'dot','c':'first'}.png" alt="pic" width="300"/>

Ok, we're making reasonable choices and the learning rate, in particular, looks sound. Let's investigate what happens if we vary the number of layers and the hidden dimension:

<img src="results/atlases/hsize-layers.png" alt="pic" width="500"/>

It's not clear that having 3 layers makes a difference. However, increasing the hidden dimension to 90 looks like a good choice. Let's now experiment with dropout and with changing the attention mechanism. Here we disregard the 'concat' attention mechanism, because it is more computationally costly and also because it does not seem to improve performance significantly. The latter is also reported in [the Luong paper](https://arxiv.org/pdf/1508.04025), page 8: "For *content-based* functions, our implementation *concat* does not yield good performances and more analysis should be done to understand the reason."


Ok, let's have a look:

<img src="results/atlases/att_method-dropout.png" alt="pic" width="500"/>

Going from 'dot' to 'general' does seem to help. And increasing dropout to 0.2 looks sensible.

Finally, let's quickly check if we should increase or decrease the decoder's learning rate, relative to the encoder.



## Investigating the trained model

## Possible next steps

Here are some ideas: 
* Use pre-trained embeddings, instead of training the embedding matrices from scratch.
* Plot and investigate attention patterns (as they do in [this notebook](https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation-batched.ipynb)).
* See what results are obtainable by non-attention seq2seq GRUs. Or try the transformer architecture.
* Implement the architecture extension mentioned in [the Luong paper](https://arxiv.org/pdf/1508.04025), page 5:
  
  <img src="write-up_pics/b.png" alt="pic" width="250"/>
* Find a suitable model on HuggingFace to fine-tune, perhaps a T5.
* ...
