# OpenNMT Tutorial and Starter Code
(modified from the OpenNMT quickstart to work in Colab)

While creating your own models from scratch is common for many tasks, it is often useful to rely on a tool or framework instead. In this exercise we're going to look at one popular NMT tool, OpenNMT, as a way to use beam search, which can be tricky to implement efficiently on your own.

Finally, we'll look at how to configure different models for OpenNMT, including the Transformer, which we'll look at in detail next week.

Like other ML frameworks, OpenNMT relies on a combination of editable .yaml files and command-line tools to run the training procedure.
### Make sure you have the toy-ende.yml from the lab repository.



#### Due to some Colab compatibility issues, we will use a different version of torch.

In [1]:
!pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.6.0+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.6.0%2Bcu101-cp37-cp37m-linux_x86_64.whl (708.0MB)
Collecting torchvision==0.7.0+cu101
  Downloading https://download.pytorch.org/whl/cu101/torchvision-0.7.0%2Bcu101-cp37-cp37m-linux_x86_64.whl (5.9MB)
ERROR: torchtext 0.9.0 has requirement torch==1.8.0, but you'll have torch 1.6.0+cu101 which is incompatible.
Installing collected packages: torch, torchvision
  Found existing installation: torch 1.8.0+cu101
    Uninstalling torch-1.8.0+cu101:
      Successfully uninstalled torch-1.8.0+cu101
  Found existing installation: torchvision 0.9.0+cu101
    Uninstalling torchvision-0.9.0+cu101:
      Successfully uninstalled torchvision-0.9.0+cu101
Successfully installed torch-1.6.0+cu101 torchvi

### Next let's get OpenNMT as well as a toy English to German corpus.

In [2]:
!git clone https://github.com/OpenNMT/OpenNMT-py.git
!cd OpenNMT-py; pip install -e .
!wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
!tar xf toy-ende.tar.gz


Cloning into 'OpenNMT-py'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 17082 (delta 1), reused 1 (delta 0), pack-reused 17071
Receiving objects: 100% (17082/17082), 272.99 MiB | 38.04 MiB/s, done.
Resolving deltas: 100% (12318/12318), done.
Obtaining file:///content/OpenNMT-py
Collecting tqdm<5,>=4.51
  Downloading https://files.pythonhosted.org/packages/f8/3e/2730d0effc282960dbff3cf91599ad0d8f3faedc8e75720fdf224b31ab24/tqdm-4.59.0-py2.py3-none-any.whl (74kB)
Collecting torchtext==0.5.0
  Downloading https://files.pythonhosted.org/packages/79/ef/54b8da26f37787f5c670ae2199329e7dccf195c060b25628d99e587dac51/torchtext-0.5.0-py3-none-any.whl (73kB)
Collecting configargparse<2,>=1.2.3
  Downloading https://files.pythonhosted.org/packages/5

## Processing Vocab

Once we have the corpus and OpenNMT, we can build the vocab we'll use. This relies on having a config file with this information laid out.

Let's take a second to look at the config file we'll be using, toy-ende.yml, which you should upload to Colab using the file upload on the left.

The important parts for data processing are at the top of the yaml file:

```
# toy_en_de.yaml

## Where the samples will be written
save_data: toy-ende/run/example
## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt
# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt




```

The config specifies where the corpus data is, where to save the processed samples, and where to write the vocab files corresponding to the corpus.
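Conceptually, building a vocab comes down to counting how often each token occurs on the source and target side. The sketch below illustrates that idea on whitespace-tokenized text. It is only an illustration: `count_tokens` is a hypothetical helper, not part of OpenNMT, and `onmt_build_vocab` additionally handles sampling and transforms.

```python
from collections import Counter


def count_tokens(lines):
    """Count whitespace-separated tokens over an iterable of sentences."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Toy stand-in for a few lines of a source file such as toy-ende/src-train.txt
sample = [
    "the cat sat on the mat",
    "the dog sat",
]
vocab = count_tokens(sample)

# Most frequent tokens first, as (token, count) pairs
print(vocab.most_common(2))
# Number of distinct tokens seen
print(len(vocab))
```

The number of distinct tokens is the kind of quantity a vocab-building step reports for each side of the corpus.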

In [None]:
!onmt_build_vocab -config toy-ende.yml -n_sample 10000


Corpus corpus_1's weight should be given. We default it to 1 for you.
[2021-03-15 22:37:40,227 INFO] Counter vocab from 10000 samples.
[2021-03-15 22:37:40,227 INFO] Build vocab on 10000 transformed examples/corpus.
[2021-03-15 22:37:40,236 INFO] corpus_1's transforms: TransformPipe()
[2021-03-15 22:37:40,237 INFO] Loading ParallelCorpus(toy-ende/src-train.txt, toy-ende/tgt-train.txt, align=None)...
[2021-03-15 22:37:40,538 INFO] Counters src:24995
[2021-03-15 22:37:40,538 INFO] Counters tgt:35816


## Training

Next we will begin training with OpenNMT, again using the same config file. Below are the relevant parts:

```

# Train on a single GPU
world_size: 1
gpu_ranks: [0]

# Where to save the checkpoints
# Note it won't actually make it to 10,000 steps because of early stopping
save_model: toy-ende/run/model
save_checkpoint_steps: 500
train_steps: 10000
valid_steps: 500
early_stopping: 2


# Checkpoint settings
keep_checkpoint: 3
seed: 531
warmup_steps: 400
report_every: 100

# Model (note these are actually default values, but I've explicitly written them out to show how you can edit them)
decoder_type: rnn
encoder_type: rnn 
enc_layers: 2
dec_layers: 2
enc_rnn_size: 500
dec_rnn_size: 500
dropout: 0.3
global_attention : dot


# Optimizer settings
optim: sgd
learning_rate: 1

```

Here the config file covers two major things: model checkpointing and model hyperparameters.

Certain settings are available only for certain models. For instance, you wouldn't (want to) use positional encoding for an RNN-based model; however, it is necessary for proper training of Transformers, and we could include it by adding the line ```position_encoding: true```.
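As background on why this setting matters: a Transformer has no recurrence, so word order has to be injected into the input explicitly. Below is a minimal sketch of a sinusoidal position vector in plain Python, assuming the formulation from the original Transformer paper. `sinusoidal_encoding` is a hypothetical name used for illustration and is not OpenNMT's implementation.

```python
import math


def sinusoidal_encoding(position, dim):
    """Return the sinusoidal position vector of (even) size `dim` for one position."""
    vec = []
    for i in range(0, dim, 2):
        # Each (sin, cos) pair of dimensions shares one wavelength
        angle = position / (10000 ** (i / dim))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec


# Position 0 encodes as alternating 0.0 and 1.0
print(sinusoidal_encoding(0, 4))
# Later positions get distinct vectors that are added to the token embeddings
print(sinusoidal_encoding(1, 4))
```

An RNN reads tokens one at a time, so it gets order information for free, which is why the setting is not needed there.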

If we wanted to know more about any of these settings, we could take a peek at the OpenNMT [train documentation](https://opennmt.net/OpenNMT-py/options/train.html).

For instance for the encoder options, it shows what available models can be used:
```
--encoder_type, -encoder_type
Possible choices: rnn, brnn, ggnn, mean, transformer, cnn, transformer_lm

Type of encoder layer to use. Non-RNN layers are experimental. Options are [rnn|brnn|ggnn|mean|transformer|cnn|transformer_lm].

```


Finally, we will train our model with this configuration. (It took about 10 minutes for the small RNN model to train.)

In [None]:
!onmt_train -config toy-ende.yml

[2021-03-15 22:37:41,310 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2021-03-15 22:37:41,311 INFO] Missing transforms field for valid data, set to default: [].
[2021-03-15 22:37:41,311 INFO] Parsed 2 corpora from -data.
[2021-03-15 22:37:41,311 INFO] Get special vocabs from Transforms: {'src': set(), 'tgt': set()}.
[2021-03-15 22:37:41,311 INFO] Loading vocab from text file...
[2021-03-15 22:37:41,311 INFO] Loading src vocabulary from toy-ende/run/example.vocab.src
[2021-03-15 22:37:41,353 INFO] Loaded src vocab has 24995 tokens.
[2021-03-15 22:37:41,362 INFO] Loading tgt vocabulary from toy-ende/run/example.vocab.tgt
[2021-03-15 22:37:41,434 INFO] Loaded tgt vocab has 35816 tokens.
[2021-03-15 22:37:41,447 INFO] Building fields with vocab in counters...
[2021-03-15 22:37:41,511 INFO]  * tgt vocab size: 35820.
[2021-03-15 22:37:41,540 INFO]  * src vocab size: 24997.
[2021-03-15 22:37:41,542 INFO]  * src vocab size = 24997
[2021-03-15 22:37:41,542 INFO]  * tgt

Once our model is saved, we can use it to actually generate predictions. Our models will be saved under the ```save_model``` setting of our config file, in this case ```toy-ende/run/model_```. Since we are only saving every 500 training steps and keeping the past three checkpoints, we can choose from the available models: ```model_step_1000.pt```, ```model_step_1500.pt```, and ```model_step_2000.pt```. Our early stopping indicates that the best model (lowest perplexity/highest accuracy) of the three is 1000, but let's look at how to pick between the saved checkpoints using BLEU.
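To see which checkpoint files to expect, here is a small sketch. It assumes a checkpoint is written at every multiple of ```save_checkpoint_steps``` and that only the most recent ```keep_checkpoint``` files are retained (with ```keep_checkpoint``` at least 1). `kept_checkpoints` is a hypothetical helper, not an OpenNMT function.

```python
def kept_checkpoints(last_step, save_every, keep):
    """List the checkpoint steps still on disk after training stops at `last_step`."""
    saved = list(range(save_every, last_step + 1, save_every))
    # Only the `keep` most recent checkpoints survive
    return saved[-keep:]


# save_checkpoint_steps: 500 and keep_checkpoint: 3, training stopped at step 2000
for step in kept_checkpoints(2000, 500, 3):
    print(f"toy-ende/run/model_step_{step}.pt")
```

If early stopping ends your run at a different step, pass that step as `last_step` to get the matching file names.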

## Translating

To do so, we need to translate the source sentences, decoding with beam search. In this case we've chosen a ```-beam_size``` of 10; however, the exercise questions will ask you to adjust it to different sizes.
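To make the role of the beam size concrete, here is a minimal beam search sketch over a hand-written toy "model". Everything in it (`TOY_MODEL`, `beam_search`, the tokens and probabilities) is hypothetical and unrelated to OpenNMT's internals; it only shows how keeping more than one partial hypothesis can recover a better translation than greedy decoding.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.9, "y": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
}


def beam_search(beam_size, max_len=5):
    """Return the best (log-probability, tokens) hypothesis found with this beam size."""
    beams = [(0.0, ["<s>"])]
    finished = []
    for _ in range(max_len):
        # Expand every partial hypothesis by every possible next token
        candidates = []
        for score, tokens in beams:
            for tok, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [tok]))
        # Keep only the `beam_size` highest-scoring candidates
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            (finished if cand[1][-1] == "</s>" else beams).append(cand)
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[0])


for k in (1, 2):
    score, tokens = beam_search(k)
    print(k, " ".join(tokens), round(math.exp(score), 2))
```

With a beam of 1 (greedy), the search commits to `a` because it is the most likely first token and ends with a sequence of probability 0.3. With a beam of 2, the `b` hypothesis survives the first step and leads to the sequence `b x` with probability 0.36.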

Let's first create predictions for our ```_step_2000.pt```, ```_step_2500.pt```, and ```_step_3000.pt``` models. (Note: your model may have stopped at a different point, in which case use the appropriate last three checkpoints.)

In [None]:
!onmt_translate -model toy-ende/run/model_step_2000.pt -src toy-ende/src-val.txt -output toy-ende/val_2000.txt -gpu 0 -beam_size 10 -seed 531 -block_ngram 2
!onmt_translate -model toy-ende/run/model_step_2500.pt -src toy-ende/src-val.txt -output toy-ende/val_2500.txt -gpu 0 -beam_size 10 -seed 531 -block_ngram 2
!onmt_translate -model toy-ende/run/model_step_3000.pt -src toy-ende/src-val.txt -output toy-ende/val_3000.txt -gpu 0 -beam_size 10 -seed 531 -block_ngram 2


[2021-03-02 00:55:11,643 INFO] Translating shard 0.
[2021-03-02 00:57:56,340 INFO] PRED AVG SCORE: -1.8684, PRED PPL: 6.4777
[2021-03-02 00:58:00,669 INFO] Translating shard 0.
[2021-03-02 01:00:08,496 INFO] PRED AVG SCORE: -1.7845, PRED PPL: 5.9568
[2021-03-02 01:00:12,873 INFO] Translating shard 0.
[2021-03-02 01:02:52,549 INFO] PRED AVG SCORE: -1.6705, PRED PPL: 5.3148


Note: we can now manually inspect the translations in ```toy-ende/val_*.txt```.

Finally, let's calculate the BLEU scores of the outputs! We would eventually want to select the model with the highest BLEU (in our case 0.37, with our 2500-step model) and use it on our test set.



In [None]:
!perl  OpenNMT-py/tools/multi-bleu.perl toy-ende/tgt-val.txt < toy-ende/val_2000.txt
!perl  OpenNMT-py/tools/multi-bleu.perl toy-ende/tgt-val.txt < toy-ende/val_2500.txt
!perl  OpenNMT-py/tools/multi-bleu.perl toy-ende/tgt-val.txt < toy-ende/val_3000.txt

Use of uninitialized value in division (/) at OpenNMT-py/tools/multi-bleu.perl line 139, <STDIN> line 3000.
BLEU = 0.00, 18.0/1.0/0.1/0.0 (BP=0.873, ratio=0.880, hyp_len=63094, ref_len=71666)
BLEU = 0.37, 21.5/1.8/0.2/0.0 (BP=0.550, ratio=0.626, hyp_len=44858, ref_len=71666)
BLEU = 0.32, 18.5/1.0/0.1/0.0 (BP=0.866, ratio=0.874, hyp_len=62645, ref_len=71666)
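The `BP` field in each line is BLEU's brevity penalty, which shrinks the score when the hypotheses are shorter than the references. As a minimal sketch, assuming the standard definition (1 when the hypothesis is at least as long as the reference, otherwise exp(1 - ref_len / hyp_len)), the values above can be reproduced from the reported lengths. `brevity_penalty` is a hypothetical helper name, not something exported by multi-bleu.perl.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """BLEU brevity penalty for a hypothesis corpus of `hyp_len` tokens."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


# Hypothesis lengths reported above for the 2000, 2500, and 3000 step checkpoints
for hyp_len in (63094, 44858, 62645):
    print(hyp_len, round(brevity_penalty(hyp_len, 71666), 3))
```

This also explains the middle line: the 2500-step model has the highest n-gram precisions, but its output is much shorter than the reference, so its score is multiplied by a penalty of about 0.55.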


# Teamwork Exercise 3

We have seen how OpenNMT can be used; now let's apply it to our Multi30k dataset.

You can run your code in here and then download the results to submit on GitHub.

This is a team assignment, so that students can help one another understand the different components of the OpenNMT framework and run them correctly.

*You are provided with a Multi30k.yaml to fill in; be sure to submit this alongside your Colab notebook and other files in the repository.*

## T3.1

### Build the vocab for the Multi30k En-Fr dataset

While just having a word-level vocabulary is fine for some cases, using subword tokenization might help capture morphological information better.

To do this, in your config file add ```transforms: [sentencepiece, filtertoolong]``` to both the training and validation corpora.

Give the code you ran to build the vocab as well as the "data" section of your multi30k config file.
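To illustrate why subword units can expose morphology, here is a toy greedy longest-match segmenter. This is not the SentencePiece algorithm, which learns its subword inventory from data; `segment` and the hand-picked `SUBWORDS` set are hypothetical and exist only for illustration.

```python
def segment(word, subwords):
    """Greedy longest-match split of `word` into known subwords, else single characters."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining span first and shrink until something matches
        for end in range(len(word), start, -1):
            if word[start:end] in subwords or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces


# Hypothetical hand-picked subword inventory (a real one is learned from the corpus)
SUBWORDS = {"walk", "talk", "ing", "ed", "s"}
for word in ["walking", "talked", "walks"]:
    print(word, segment(word, SUBWORDS))
```

All three words end up sharing stems and suffixes, so a model sees `walk` in both `walking` and `walks` instead of two unrelated vocabulary entries.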


In [6]:
!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
Collecting fr_core_news_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.2.5/fr_core_news_sm-2.2.5.tar.gz (14.7MB)
Building wheels for collected packages: fr-core-news-sm
  Building wheel for fr-core-news-sm (setup.py) ... done
  Created wheel for fr-core-news-sm: filename=fr_core_news_sm-2.2.5-cp37-none-any.whl size=14727027 sha256=33865127c8a003e45bf039fde79d007bf0f9f9a6a7d3cfbdb6850095a94e6aa7
  Stored in directory: /tmp/pip-ephem-wheel-cache-ahrpeoig/wheels/46/1b/e6/29b020e3f9420a24c3f463343afe5136aaaf955dbc9e46dfc5
Successfully built fr-core-news-sm
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('fr_core_n

In [11]:
# Tokenize the raw Multi30k files with spaCy before building the vocab
import fr_core_news_sm
import en_core_web_sm

spacy_fr = fr_core_news_sm.load()
spacy_en = en_core_web_sm.load()


def tokenize_file(path, nlp):
  """Write a space-separated, spaCy-tokenized copy of `path` to `path + '.tokd'`."""
  with open(path, "r", encoding="utf-8") as fp, \
       open(path + ".tokd", "w", encoding="utf-8") as out:
    for line in fp:
      # Strip the trailing newline so it is not tokenized, then write it back explicitly
      tokenized = [tok.text for tok in nlp.tokenizer(line.strip())]
      out.write(" ".join(tokenized) + "\n")


for path in ["train.fr", "val.fr"]:
  tokenize_file(path, spacy_fr)
for path in ["train-1.en", "val.en"]:
  tokenize_file(path, spacy_en)


In [12]:
!onmt_build_vocab -config multi30k_sol.yml -n_sample 10000


Corpus corpus_1's weight should be given. We default it to 1 for you.
[2021-03-16 01:45:38,941 INFO] Counter vocab from 10000 samples.
[2021-03-16 01:45:38,941 INFO] Build vocab on 10000 transformed examples/corpus.
[2021-03-16 01:45:38,947 INFO] corpus_1's transforms: TransformPipe(FilterTooLongTransform(src_seq_length=200, tgt_seq_length=200))
[2021-03-16 01:45:38,947 INFO] Loading ParallelCorpus(multi30k/train.fr.tokd, multi30k/train-1.en.tokd, align=None)...
[2021-03-16 01:45:39,148 INFO] Counters src:6872
[2021-03-16 01:45:39,148 INFO] Counters tgt:6412


```
Include changes you made to the Data saving, Corpus, and Vocab section in the Config HERE

# multi30k.yaml

## TO DO COMPLETE DATA SAVING
save_data: multi30k/run/
## Where the vocab(s) will be written
src_vocab: multi30k/run/vocab.src
tgt_vocab: multi30k/run/vocab.tgt

# Corpus opts:
data:
## TODO COMPLETE CORPUS OPTIONS
## Add sentencepiece and filter long segments
    corpus_1:
        path_src: multi30k/train.fr.tokd
        path_tgt: multi30k/train-1.en.tokd
        transforms: [filtertoolong]
    valid:
        path_src: multi30k/val.fr.tokd
        path_tgt: multi30k/val.en.tokd
        #NOTE NO FILTERTOOLONG --> You don't want to bias your validation score


#TODO Fill in vocab you create (already have this above)
src_vocab: multi30k/run/vocab.src
tgt_vocab: multi30k/run/vocab.tgt

```

## T3.2
Train Model

Fill in the multi30k.yaml config to set up a seq2seq model that has a 3-layer RNN encoder, a 2-layer RNN decoder, MLP attention, and 20% dropout, using Adam as your optimizer.

Copy and paste the changed parts of the *.yml file below along with the training command you used.

In [14]:
# TODO Train Model

!onmt_train -config multi30k_sol.yml

[2021-03-16 01:47:00,107 INFO] Missing transforms field for valid data, set to default: [].
[2021-03-16 01:47:00,107 INFO] Parsed 2 corpora from -data.
[2021-03-16 01:47:00,108 INFO] Get special vocabs from Transforms: {'src': set(), 'tgt': set()}.
[2021-03-16 01:47:00,108 INFO] Loading vocab from text file...
[2021-03-16 01:47:00,108 INFO] Loading src vocabulary from multi30k/run/vocab.src
[2021-03-16 01:47:00,121 INFO] Loaded src vocab has 6872 tokens.
[2021-03-16 01:47:00,123 INFO] Loading tgt vocabulary from multi30k/run/vocab.tgt
[2021-03-16 01:47:00,134 INFO] Loaded tgt vocab has 6412 tokens.
[2021-03-16 01:47:00,136 INFO] Building fields with vocab in counters...
[2021-03-16 01:47:00,143 INFO]  * tgt vocab size: 6416.
[2021-03-16 01:47:00,151 INFO]  * src vocab size: 6874.
[2021-03-16 01:47:00,151 INFO]  * src vocab size = 6874
[2021-03-16 01:47:00,151 INFO]  * tgt vocab size = 6416
[2021-03-16 01:47:00,154 INFO] Building model...
[2021-03-16 01:47:03,252 INFO] NMTModel(
  (enco

```
# Changes to model and optimizer here.
decoder_type: rnn
encoder_type: rnn 
enc_layers: 3
dec_layers: 2
enc_rnn_size: 500
dec_rnn_size: 500
dropout: 0.2
global_attention : mlp


# Optimizer settings
optim: adam
learning_rate: 0.001  # note: when switching to Adam, you should also change the starting learning rate



```

## T3.3

Decoding

Create predictions for the validation set using your saved models and select the one that has the highest BLEU. You should set beam size to 5 for each of these models.

Report the BLEU on this model.

In [16]:
## Code to create predictions and calculate BLEU for models

!onmt_translate -model multi30k/run/model_step_3000.pt -src multi30k/val.fr.tokd -output multi30k/val_3000.txt -gpu 0 -beam_size 5 -seed 531 -block_ngram 2
!onmt_translate -model multi30k/run/model_step_3500.pt -src multi30k/val.fr.tokd -output multi30k/val_3500.txt -gpu 0 -beam_size 5 -seed 531 -block_ngram 2
!onmt_translate -model multi30k/run/model_step_4000.pt -src multi30k/val.fr.tokd -output multi30k/val_4000.txt -gpu 0 -beam_size 5 -seed 531 -block_ngram 2



[2021-03-16 01:53:41,325 INFO] Translating shard 0.
[2021-03-16 01:53:54,390 INFO] PRED AVG SCORE: -0.3489, PRED PPL: 1.4175
[2021-03-16 01:53:58,297 INFO] Translating shard 0.
[2021-03-16 01:54:11,366 INFO] PRED AVG SCORE: -0.3111, PRED PPL: 1.3649
[2021-03-16 01:54:15,266 INFO] Translating shard 0.
[2021-03-16 01:54:27,576 INFO] PRED AVG SCORE: -0.2963, PRED PPL: 1.3449


In [17]:
!perl  OpenNMT-py/tools/multi-bleu.perl multi30k/val.en.tokd < multi30k/val_3000.txt
!perl  OpenNMT-py/tools/multi-bleu.perl multi30k/val.en.tokd < multi30k/val_3500.txt
!perl  OpenNMT-py/tools/multi-bleu.perl multi30k/val.en.tokd < multi30k/val_4000.txt


BLEU = 43.32, 72.5/51.1/37.4/27.4 (BP=0.981, ratio=0.981, hyp_len=13172, ref_len=13426)
BLEU = 43.85, 73.1/51.7/38.0/28.1 (BP=0.978, ratio=0.978, hyp_len=13133, ref_len=13426)
BLEU = 43.09, 73.3/51.8/37.9/27.9 (BP=0.963, ratio=0.963, hyp_len=12933, ref_len=13426)
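Picking the best checkpoint by eye works for three models, but it can also be scripted. The sketch below parses the overall score out of each output line and selects the highest one. The lines are copied from the output above; `parse_bleu` is a hypothetical helper and the regular expression assumes the `BLEU = <score>,` prefix shown there.

```python
import re

# Output lines copied from the cell above, keyed by checkpoint step
bleu_output = {
    3000: "BLEU = 43.32, 72.5/51.1/37.4/27.4 (BP=0.981, ratio=0.981, hyp_len=13172, ref_len=13426)",
    3500: "BLEU = 43.85, 73.1/51.7/38.0/28.1 (BP=0.978, ratio=0.978, hyp_len=13133, ref_len=13426)",
    4000: "BLEU = 43.09, 73.3/51.8/37.9/27.9 (BP=0.963, ratio=0.963, hyp_len=12933, ref_len=13426)",
}


def parse_bleu(line):
    """Extract the overall BLEU score from one line of multi-bleu.perl output."""
    match = re.match(r"BLEU = ([0-9.]+),", line)
    if match is None:
        raise ValueError(f"not a BLEU line: {line!r}")
    return float(match.group(1))


scores = {step: parse_bleu(line) for step, line in bleu_output.items()}
best_step = max(scores, key=scores.get)
print(best_step, scores[best_step])
```

Here the 3500-step checkpoint comes out on top, so it is the one to carry forward.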


## T3.4 

Comparing Beam Width

For your BEST model, compare the performance (both BLEU and wall-clock time to run) with the following beam sizes: 5 (done above), 10, 15, and 20.

Give your code and outputs below.

In [21]:
%%time
# Included for curiosity's sake (beam size 1)
!onmt_translate -model multi30k/run/model_step_3500.pt -src multi30k/val.fr.tokd -output multi30k/beam1.txt -gpu 0 -beam_size 1 -seed 531 -block_ngram 2


[2021-03-16 01:58:49,161 INFO] Translating shard 0.
[2021-03-16 01:58:53,656 INFO] PRED AVG SCORE: -0.7389, PRED PPL: 2.0936
CPU times: user 18.6 ms, sys: 12.2 ms, total: 30.7 ms
Wall time: 8.26 s


In [23]:
%%time

!onmt_translate -model multi30k/run/model_step_3500.pt -src multi30k/val.fr.tokd -output multi30k/beam5.txt -gpu 0 -beam_size 5 -seed 531 -block_ngram 2


[2021-03-16 01:59:35,899 INFO] Translating shard 0.
[2021-03-16 01:59:48,986 INFO] PRED AVG SCORE: -0.3111, PRED PPL: 1.3649
CPU times: user 35.7 ms, sys: 12.2 ms, total: 47.9 ms
Wall time: 17 s


In [18]:
%%time
!onmt_translate -model multi30k/run/model_step_3500.pt -src multi30k/val.fr.tokd -output multi30k/beam10.txt -gpu 0 -beam_size 10 -seed 531 -block_ngram 2



[2021-03-16 01:56:26,498 INFO] Translating shard 0.
[2021-03-16 01:56:49,723 INFO] PRED AVG SCORE: -0.2985, PRED PPL: 1.3478
CPU times: user 49.2 ms, sys: 17.4 ms, total: 66.6 ms
Wall time: 27.1 s


In [19]:
%%time

!onmt_translate -model multi30k/run/model_step_3500.pt -src multi30k/val.fr.tokd -output multi30k/beam15.txt -gpu 0 -beam_size 15 -seed 531 -block_ngram 2


[2021-03-16 01:57:22,825 INFO] Translating shard 0.
[2021-03-16 01:57:57,147 INFO] PRED AVG SCORE: -0.2971, PRED PPL: 1.3459
CPU times: user 67.3 ms, sys: 22.3 ms, total: 89.5 ms
Wall time: 38.3 s


In [20]:
%%time

!onmt_translate -model multi30k/run/model_step_3500.pt -src multi30k/val.fr.tokd -output multi30k/beam20.txt -gpu 0 -beam_size 20 -seed 531 -block_ngram 2


[2021-03-16 01:58:01,617 INFO] Translating shard 0.
[2021-03-16 01:58:45,384 INFO] PRED AVG SCORE: -0.2973, PRED PPL: 1.3462
CPU times: user 84.8 ms, sys: 19 ms, total: 104 ms
Wall time: 47.6 s


In [24]:
!perl  OpenNMT-py/tools/multi-bleu.perl multi30k/val.en.tokd < multi30k/beam1.txt
!perl  OpenNMT-py/tools/multi-bleu.perl multi30k/val.en.tokd < multi30k/beam5.txt
!perl  OpenNMT-py/tools/multi-bleu.perl multi30k/val.en.tokd < multi30k/beam10.txt
!perl  OpenNMT-py/tools/multi-bleu.perl multi30k/val.en.tokd < multi30k/beam15.txt
!perl  OpenNMT-py/tools/multi-bleu.perl multi30k/val.en.tokd < multi30k/beam20.txt


BLEU = 42.09, 69.3/48.7/35.5/26.2 (BP=1.000, ratio=1.035, hyp_len=13891, ref_len=13426)
BLEU = 43.85, 73.1/51.7/38.0/28.1 (BP=0.978, ratio=0.978, hyp_len=13133, ref_len=13426)
BLEU = 43.72, 73.3/52.1/38.2/28.2 (BP=0.971, ratio=0.971, hyp_len=13036, ref_len=13426)
BLEU = 43.82, 73.5/52.2/38.4/28.5 (BP=0.968, ratio=0.968, hyp_len=13000, ref_len=13426)
BLEU = 43.84, 73.7/52.4/38.5/28.5 (BP=0.967, ratio=0.967, hyp_len=12988, ref_len=13426)


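A narrow beam (5) gives the highest BLEU (43.85). It is clearly better than "best path" decoding (beam=1, 42.09) and marginally better than the larger beam sizes (10+, 43.72 to 43.84), which also take longer to run (27.1 s to 47.6 s wall time, versus 17 s for beam 5).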
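The trade-off can be summarized with a few lines of Python. The numbers are copied from the runs above; note that the wall times cover the whole `onmt_translate` command, including model loading, so they are only a rough guide. The variable names are illustrative.

```python
# (beam size, BLEU, wall time in seconds) copied from the runs above
runs = [
    (1, 42.09, 8.26),
    (5, 43.85, 17.0),
    (10, 43.72, 27.1),
    (15, 43.82, 38.3),
    (20, 43.84, 47.6),
]

best_beam, best_bleu, best_time = max(runs, key=lambda r: r[1])
for beam, bleu, seconds in runs:
    # delta is how far this run falls short of the best BLEU;
    # time is relative to the wall time of the best-scoring run
    print(f"beam={beam:>2}  BLEU={bleu:.2f}  delta={best_bleu - bleu:.2f}  "
          f"time={seconds / best_time:.2f}x")
```

Read this way, going from beam 5 to beam 20 nearly triples the run time without improving BLEU on this validation set.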

In [27]:
import csv
for file in ["long_test_eng_fre.tsv","short_test_eng_fre.tsv"]:
  with open(file,"r", encoding='utf-8') as tsv:
    tsv_reader = csv.reader(tsv, delimiter ="\t")
    next(tsv_reader, None) 
    outfile_fr = file + ".fr.tokd"
    outfile_en = file + ".en.tokd"
    with open(outfile_fr, "w", encoding="utf-8") as out_fr:
      with open(outfile_en, "w", encoding="utf-8") as out_en:
        for row in tsv_reader:
          tokenized_en = [tok.text for tok in spacy_en.tokenizer(row[0])]
          tokenized_fr = [tok.text for tok in spacy_fr.tokenizer(row[1])]
          out_fr.write(" ".join(tokenized_fr)+"\n")
          out_en.write(" ".join(tokenized_en)+"\n")

In [28]:
#Part 2 

!onmt_translate -model multi30k/run/model_step_3500.pt -src long_test_eng_fre.tsv.fr.tokd -output long_trans.txt -gpu 0 -beam_size 5 -seed 531 -block_ngram 2
!onmt_translate -model multi30k/run/model_step_3500.pt -src short_test_eng_fre.tsv.fr.tokd -output short_trans.txt -gpu 0 -beam_size 5 -seed 531 -block_ngram 2



[2021-03-16 02:05:11,086 INFO] Translating shard 0.
[2021-03-16 02:05:12,671 INFO] PRED AVG SCORE: -0.4366, PRED PPL: 1.5474
[2021-03-16 02:05:16,495 INFO] Translating shard 0.
[2021-03-16 02:05:17,091 INFO] PRED AVG SCORE: -0.2847, PRED PPL: 1.3294


In [29]:
!perl  OpenNMT-py/tools/multi-bleu.perl long_test_eng_fre.tsv.en.tokd < long_trans.txt
!perl  OpenNMT-py/tools/multi-bleu.perl short_test_eng_fre.tsv.en.tokd < short_trans.txt


BLEU = 36.21, 69.1/45.2/31.7/23.0 (BP=0.933, ratio=0.935, hyp_len=1461, ref_len=1563)
BLEU = 44.60, 70.7/51.3/38.2/28.5 (BP=1.000, ratio=1.055, hyp_len=672, ref_len=637)
