# OpenNMT Tutorial and Starter Code
(modified from the OpenNMT quickstart to work in Colab)

While creating your own models from scratch is common for many tasks, it is often useful to rely on a tool or framework instead. In this exercise we're going to look at one popular NMT tool, OpenNMT, as a way to use beam search, which can be tricky to implement efficiently on your own.

Finally, we'll look at how to configure different models in OpenNMT, including the Transformer, which we'll cover in detail next week.

OpenNMT is similar to other ML frameworks in that it relies on a combination of editable .yaml files and command-line tools to run the training procedure.  
### Make sure you have the *.yml config files from the lab repository.



#### Due to some Colab compatibility issues we will use a different version of torch.

In [1]:
!pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.6.0+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.6.0%2Bcu101-cp37-cp37m-linux_x86_64.whl (708.0MB)
Collecting torchvision==0.7.0+cu101
  Downloading https://download.pytorch.org/whl/cu101/torchvision-0.7.0%2Bcu101-cp37-cp37m-linux_x86_64.whl (5.9MB)
Installing collected packages: torch, torchvision
  Found existing installation: torch 1.7.1+cu101
    Uninstalling torch-1.7.1+cu101:
      Successfully uninstalled torch-1.7.1+cu101
  Found existing installation: torchvision 0.8.2+cu101
    Uninstalling torchvision-0.8.2+cu101:
      Successfully uninstalled torchvision-0.8.2+cu101
Successfully installed torch-1.6.0+cu101 torchvision-0.7.0+cu101


### Next let's get OpenNMT as well as a toy English to German corpus.

In [2]:
!git clone https://github.com/OpenNMT/OpenNMT-py.git
!cd OpenNMT-py; pip install -e .
!wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
!tar xf toy-ende.tar.gz


Cloning into 'OpenNMT-py'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 17074 (delta 0), reused 0 (delta 0), pack-reused 17071
Receiving objects: 100% (17074/17074), 272.99 MiB | 37.78 MiB/s, done.
Resolving deltas: 100% (12317/12317), done.
Obtaining file:///content/OpenNMT-py
Collecting tqdm<5,>=4.51
  Downloading https://files.pythonhosted.org/packages/4e/8c/f1035bd24b0e352ddba7be320abc1603fc4c9976fcda6971ed287be59164/tqdm-4.58.0-py2.py3-none-any.whl (73kB)
Collecting torchtext==0.5.0
  Downloading https://files.pythonhosted.org/packages/79/ef/54b8da26f37787f5c670ae2199329e7dccf195c060b25628d99e587dac51/torchtext-0.5.0-py3-none-any.whl (73kB)
Collecting configargparse<2,>=1.2.3
  Downloading https://files.pythonhosted.org/packages/3f/75

## Processing Vocab

Once we have the corpus and OpenNMT we can build the vocab we'll use. This relies on having a config file with this information laid out.

Let's take a second to look at the config file we'll be using, toy-ende.yml, which you should upload to Colab using the file upload panel on the left.

The important parts for data processing are at the top of the yaml file:

```
# toy-ende.yml

## Where the samples will be written
save_data: toy-ende/run/example
## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt
# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt


# Vocabulary files that were just created
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt


```

We specify where the data is, where to save it, as well as the vocab files corresponding to the corpus.
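To make the vocab step less of a black box, here is a minimal sketch of the kind of counting a vocab builder performs. It assumes plain whitespace tokenization and a "token, tab, count" line layout; the function name and the sample sentences are hypothetical, and this is not OpenNMT's actual implementation.

```python
from collections import Counter


def count_tokens(lines):
    """Count whitespace-separated tokens across an iterable of sentences."""
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    return counter


# Hypothetical two-sentence sample; the real input would be
# toy-ende/src-train.txt (assumption: one sentence per line).
sample = ["the cat sat on the mat", "the dog sat"]
src_counter = count_tokens(sample)

# Most frequent tokens first, one "token<TAB>count" entry per line
# (assumed layout, for illustration only).
vocab_lines = [f"{tok}\t{n}" for tok, n in src_counter.most_common()]
print(len(src_counter), "distinct tokens")
print(vocab_lines[:3])
```

The "Counters src" and "Counters tgt" numbers in the build log below are counts of distinct tokens of this kind, taken over the sampled training lines.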

In [4]:
!onmt_build_vocab -config toy-ende.yml -n_sample 10000


Corpus corpus_1's weight should be given. We default it to 1 for you.
[2021-03-01 21:26:24,710 INFO] Counter vocab from 10000 samples.
[2021-03-01 21:26:24,710 INFO] Build vocab on 10000 transformed examples/corpus.
[2021-03-01 21:26:24,720 INFO] corpus_1's transforms: TransformPipe()
[2021-03-01 21:26:24,721 INFO] Loading ParallelCorpus(toy-ende/src-train.txt, toy-ende/tgt-train.txt, align=None)...
[2021-03-01 21:26:25,022 INFO] Counters src:24995
[2021-03-01 21:26:25,023 INFO] Counters tgt:35816


## Training

Next we will begin training with OpenNMT, again using the same config file. The relevant parts are shown below:

```

# Train on a single GPU
world_size: 1
gpu_ranks: [0]

# Where to save the checkpoints
# Note it won't actually make it to 10,000 steps because of early stopping
save_model: toy-ende/run/model
save_checkpoint_steps: 500
train_steps: 10000
valid_steps: 500
early_stopping: 2


# Checkpoint settings
keep_checkpoint: 3
seed: 531
warmup_steps: 400
report_every: 100

# Model (note these are actually default values, but I've explicitly written them out to show how you can edit them)
decoder_type: rnn
encoder_type: rnn 
enc_layers: 2
dec_layers: 2
enc_rnn_size: 500
dec_rnn_size: 500
dropout: 0.3
global_attention: dot


# Optimizer settings
optim: sgd
learning_rate: 1

```

Here the config file covers two major things: Model checkpointing and Model Hyperparameters.
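The ```valid_steps: 500``` and ```early_stopping: 2``` settings work together: validation runs every 500 steps, and training halts once validation has failed to improve for 2 runs in a row. Below is a minimal sketch of that patience rule, assuming perplexity is the only criterion (OpenNMT's own stopping criteria may differ in detail). The function name and the perplexity values are hypothetical.

```python
def steps_until_stop(valid_ppl, valid_steps=500, patience=2):
    """Return the training step at which a patience-based rule would stop.

    valid_ppl: validation perplexities, one per validation run.
    Returns None if the rule never triggers.
    """
    best = float("inf")
    bad_runs = 0
    for run_index, ppl in enumerate(valid_ppl, start=1):
        if ppl < best:
            # New best validation score: reset the patience counter.
            best = ppl
            bad_runs = 0
        else:
            bad_runs += 1
            if bad_runs >= patience:
                return run_index * valid_steps
    return None


# Hypothetical perplexities: improvement stalls after the 4th validation run.
history = [40.0, 25.0, 18.0, 15.0, 15.5, 16.0, 17.0]
stop_step = steps_until_stop(history)
print(stop_step)
```

This is why the run is not expected to reach the full ```train_steps: 10000```.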

Certain settings apply only to certain models. For instance, you wouldn't (want to) use positional encoding for an RNN-based model, but it is necessary for proper training of Transformers, and we could enable it by adding the line ```position_encoding: true```.

If we wanted to know more about any of these settings, we could take a peek at the OpenNMT [train documentation](https://opennmt.net/OpenNMT-py/options/train.html).

For instance, for the encoder options, it lists the available encoder types:
```
--encoder_type, -encoder_type
Possible choices: rnn, brnn, ggnn, mean, transformer, cnn, transformer_lm

Type of encoder layer to use. Non-RNN layers are experimental. Options are [rnn|brnn|ggnn|mean|transformer|cnn|transformer_lm].

```


Finally, we will train our model with this configuration. (It took about 10 minutes for the small RNN model to train.)

In [25]:
!onmt_train -config toy-ende.yml

[2021-03-02 00:42:56,005 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2021-03-02 00:42:56,005 INFO] Missing transforms field for valid data, set to default: [].
[2021-03-02 00:42:56,005 INFO] Parsed 2 corpora from -data.
[2021-03-02 00:42:56,006 INFO] Get special vocabs from Transforms: {'src': set(), 'tgt': set()}.
[2021-03-02 00:42:56,006 INFO] Loading vocab from text file...
[2021-03-02 00:42:56,006 INFO] Loading src vocabulary from toy-ende/run/example.vocab.src
[2021-03-02 00:42:56,049 INFO] Loaded src vocab has 24995 tokens.
[2021-03-02 00:42:56,059 INFO] Loading tgt vocabulary from toy-ende/run/example.vocab.tgt
[2021-03-02 00:42:56,127 INFO] Loaded tgt vocab has 35816 tokens.
[2021-03-02 00:42:56,141 INFO] Building fields with vocab in counters...
[2021-03-02 00:42:56,207 INFO]  * tgt vocab size: 35820.
[2021-03-02 00:42:56,237 INFO]  * src vocab size: 24997.
[2021-03-02 00:42:56,239 INFO]  * src vocab size = 24997
[2021-03-02 00:42:56,239 INFO]  * tgt

Once our model is saved, we can use it to generate predictions for the validation source sentences. Checkpoints are written using the ```save_model``` prefix from our config file, in this case ```toy-ende/run/model```, so the files are named ```toy-ende/run/model_step_N.pt```. Since we save every 500 training steps and keep only the last three checkpoints, we can choose among three models, for example ```model_step_1000.pt```, ```model_step_1500.pt```, and ```model_step_2000.pt```. In that case early stopping indicates that the best of the three (lowest perplexity/highest accuracy) is step 1000, but let's look at how to pick among them using BLEU:

## Translating

To do so, we need to translate the source sentences, decoding with beam search. Here we've chosen a ```-beam_size``` of 10; in the exercise below you will be asked to try different sizes.
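To show what the beam size controls, here is a toy beam search over a hand-written next-token table. It is a sketch under stated assumptions, not OpenNMT's decoder: the `toy_model` function, its probabilities, and the `</s>` end marker are all made up for illustration, and there is no length normalization or n-gram blocking.

```python
import math


def beam_search(next_log_probs, beam_size, max_len, eos="</s>"):
    """Return hypotheses as (tokens, total_log_prob), best first."""
    beams = [([], 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        # Keep only the beam_size best candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was hit
    return sorted(finished, key=lambda c: c[1], reverse=True)


def toy_model(prefix):
    """Hypothetical next-token log-probabilities given a prefix."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix == ["a"]:
        return {"x": math.log(0.4), "y": math.log(0.3), "</s>": math.log(0.3)}
    if prefix == ["b"]:
        return {"z": math.log(0.9), "</s>": math.log(0.1)}
    return {"</s>": 0.0}  # always end after two tokens


greedy_best = beam_search(toy_model, beam_size=1, max_len=5)[0]
beam_best = beam_search(toy_model, beam_size=2, max_len=5)[0]
print(greedy_best[0], round(math.exp(greedy_best[1]), 2))
print(beam_best[0], round(math.exp(beam_best[1]), 2))
```

With a beam of 1 the search commits to the locally best first token and misses the higher-probability sequence that a beam of 2 finds. A wider beam explores more candidates per step, which is also why larger beam sizes take longer to run.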

Let's first create predictions for our ```_step_2000.pt```, ```_step_2500.pt```, and ```_step_3000.pt``` models. (Note: your model may have stopped at a different point, in which case use its last three checkpoints.)

In [26]:
!onmt_translate -model toy-ende/run/model_step_2000.pt -src toy-ende/src-val.txt -output toy-ende/val_2000.txt -gpu 0 -beam_size 10 -seed 531 -block_ngram 2
!onmt_translate -model toy-ende/run/model_step_2500.pt -src toy-ende/src-val.txt -output toy-ende/val_2500.txt -gpu 0 -beam_size 10 -seed 531 -block_ngram 2
!onmt_translate -model toy-ende/run/model_step_3000.pt -src toy-ende/src-val.txt -output toy-ende/val_3000.txt -gpu 0 -beam_size 10 -seed 531 -block_ngram 2


[2021-03-02 00:55:11,643 INFO] Translating shard 0.
[2021-03-02 00:57:56,340 INFO] PRED AVG SCORE: -1.8684, PRED PPL: 6.4777
[2021-03-02 00:58:00,669 INFO] Translating shard 0.
[2021-03-02 01:00:08,496 INFO] PRED AVG SCORE: -1.7845, PRED PPL: 5.9568
[2021-03-02 01:00:12,873 INFO] Translating shard 0.
[2021-03-02 01:02:52,549 INFO] PRED AVG SCORE: -1.6705, PRED PPL: 5.3148
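The two numbers on each log line are related. Assuming PRED AVG SCORE is the average log-probability per predicted token and PRED PPL is the corresponding perplexity, the second is the exponential of the negated first. A quick check (the function name is illustrative):

```python
import math


def perplexity(avg_log_prob):
    """Perplexity from an average per-token log-probability (natural log)."""
    return math.exp(-avg_log_prob)


# Average scores copied from the translation log above.
for avg in (-1.8684, -1.7845, -1.6705):
    print(f"{avg:.4f} -> {perplexity(avg):.4f}")
```

The results agree with the logged PPL values up to rounding of the printed average score. A lower perplexity means the model assigns higher probability to its own output, which is not the same as a better translation, so we still compare against the references with BLEU.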


[Note: we can now manually inspect the results in toy-ende/val_*.txt]

Finally, let's calculate the BLEU scores of the outputs! We would eventually want to select the model with the highest BLEU (in our case 0.37, with our 2500-step model) and use it on our test set.



In [27]:
!perl  OpenNMT-py/tools/multi-bleu.perl toy-ende/tgt-val.txt < toy-ende/val_2000.txt
!perl  OpenNMT-py/tools/multi-bleu.perl toy-ende/tgt-val.txt < toy-ende/val_2500.txt
!perl  OpenNMT-py/tools/multi-bleu.perl toy-ende/tgt-val.txt < toy-ende/val_3000.txt

Use of uninitialized value in division (/) at OpenNMT-py/tools/multi-bleu.perl line 139, <STDIN> line 3000.
BLEU = 0.00, 18.0/1.0/0.1/0.0 (BP=0.873, ratio=0.880, hyp_len=63094, ref_len=71666)
BLEU = 0.37, 21.5/1.8/0.2/0.0 (BP=0.550, ratio=0.626, hyp_len=44858, ref_len=71666)
BLEU = 0.32, 18.5/1.0/0.1/0.0 (BP=0.866, ratio=0.874, hyp_len=62645, ref_len=71666)
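Each output line reports the BLEU score, the four n-gram precisions (as percentages), and the brevity penalty (BP) computed from the hypothesis and reference lengths. Here is a rough sketch of how these pieces fit together, assuming the standard BLEU definition (BP times the geometric mean of the precisions). The function names are illustrative, and the script's own code remains the reference.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Penalize hypotheses that are shorter than the reference."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def bleu(precisions, bp):
    """BP times the geometric mean of n-gram precisions (fractions in [0, 1])."""
    if any(p <= 0.0 for p in precisions):
        return 0.0  # a zero precision drives the geometric mean to zero
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean)


# Lengths copied from the three output lines above.
for hyp_len in (63094, 44858, 62645):
    print(hyp_len, round(brevity_penalty(hyp_len, 71666), 3))
```

Plugging in the logged lengths reproduces the BP values shown above to three decimals. Note that the 2500-step model has the highest precisions but also the harshest brevity penalty, because its output is much shorter than the reference.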


# Teamwork Exercise 3

We have seen how OpenNMT can be used. Now let's apply it to our Multi30k dataset.

You can run your code here and then download the results to submit on GitHub.

This is a team assignment, so that students can help one another understand the different components of the OpenNMT framework and run them correctly.

*You are provided with a Multi30k.yaml to fill in, be sure to submit this alongside your colab notebook and other files in the repository.*

## T3.1

### Build the vocab for the Multi30k En-Fr dataset

While just having a vocabulary is fine for some cases, using a sub-word tokenization might help capture morphological information better.

To do this, in your config file add ```transforms: [sentencepiece, filtertoolong]``` to both the training and validation corpora.

Give the code you ran to build the vocab as well as the "data" section of your multi30k config file.


In [None]:
# TODO build Multi30k Vocab

```
Include changes you made to the Data saving, Corpus, and Vocab section in the Config HERE
```

## T3.2
Train Model

Fill in the multi30k.yaml config to set up a seq2seq model that has a 3-layer RNN encoder, a 2-layer RNN decoder, and MLP attention, with 20% dropout, using Adam as your optimizer.

Copy and paste the changed parts of the *.yml file below along with the training command you used.

In [None]:
# TODO Train Model

```
Changes to model, and optimizer here.
```

## T3.3

Decoding

Create predictions for the validation set using your saved models and select the one that has the highest BLEU. You should set beam size to 5 for each of these models.

Report the BLEU on this model.

In [None]:
## Code to create predictions and calculate BLEU for models

## T3.4 

Comparing Beam Width

For your BEST model, compare the performance (both BLEU and wall-clock time to run) with the following beam sizes: 5 (done above), 10, 15, and 20.

Give your code and outputs below.

In [None]:
## TODO Beam comparison