# seq2seq with Fairseq

This notebook uses Fairseq and PyTorch to train a sequence-to-sequence model that transliterates Latin-script text (`la`) into Armenian script (`hy`).

It clones and runs [github.com/deeplanguageclass/fairseq-transliteration/](https://github.com/deeplanguageclass/fairseq-transliteration/).

The data are at [github.com/deeplanguageclass/fairseq-transliteration-data](https://github.com/deeplanguageclass/fairseq-transliteration-data).

The notebook code itself is at [github.com/deeplanguageclass/fairseq-transliteration.ipynb](https://github.com/deeplanguageclass/fairseq-transliteration.ipynb).

Note: you must enable a GPU runtime before running Fairseq:

> *Edit > Notebook settings > Hardware accelerator: GPU*

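Before installing anything, it can help to confirm that the runtime actually exposes a GPU. The following is a minimal sketch using only the standard library; the helper name `gpu_visible` is hypothetical, and it assumes an NVIDIA GPU whose driver ships the `nvidia-smi` tool. Once PyTorch is installed, `torch.cuda.is_available()` is the more direct check.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Heuristic: True if `nvidia-smi -L` runs and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # The tool is not on PATH, so no NVIDIA driver is visible.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one line per device, each starting with "GPU".
    return result.returncode == 0 and b"GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, change the hardware accelerator setting and restart the runtime before continuing.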



## Requirements

In [2]:
%cd /content/
!rm -rf fairseq
!git clone https://github.com/deeplanguageclass/fairseq.git
%cd fairseq
!ls
!pip install -r requirements.txt

/content
Cloning into 'fairseq'...
remote: Counting objects: 2224, done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 2224 (delta 23), reused 34 (delta 23), pack-reused 2189
Receiving objects: 100% (2224/2224), 2.76 MiB | 6.93 MiB/s, done.
Resolving deltas: 100% (1636/1636), done.
/content/fairseq
CONTRIBUTING.md       fairseq.gif		PATENTS		  scripts
distributed_train.py  generate.py		preprocess.py	  setup.py
eval_lm.py	      interactive.py		README.md	  tests
examples	      LICENSE			requirements.txt  train.py
fairseq		      multiprocessing_train.py	score.py
Collecting cffi (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/6d/c0/47db8f624f3e4e2f3f27be03a93379d1ba16a1450a7b1aacfa0366e2c0dd/cffi-1.11.5-cp36-cp36m-manylinux1_x86_64.whl (421kB)
    100% |████████████████████████████████| 430kB 6.1MB/s
Collecting torch (from -r requirements.txt (line 3))
  Downloading https://files.pythonhosted.org/packages/4

In [3]:
!python setup.py build
!python setup.py develop

running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/tests
copying tests/test_label_smoothing.py -> build/lib.linux-x86_64-3.6/tests
copying tests/test_sequence_scorer.py -> build/lib.linux-x86_64-3.6/tests
copying tests/test_convtbc.py -> build/lib.linux-x86_64-3.6/tests
copying tests/test_average_checkpoints.py -> build/lib.linux-x86_64-3.6/tests
copying tests/test_sequence_generator.py -> build/lib.linux-x86_64-3.6/tests
copying tests/test_train.py -> build/lib.linux-x86_64-3.6/tests
copying tests/test_utils.py -> build/lib.linux-x86_64-3.6/tests
copying tests/test_binaries.py -> build/lib.linux-x86_64-3.6/tests
copying tests/test_dictionary.py -> build/lib.linux-x86_64-3.6/tests
copying tests/utils.py -> build/lib.linux-x86_64-3.6/tests
copying tests/test_data_utils.py -> build/lib.linux-x86_64-3.6/tests
copying tests/__init__.py -> build/lib.linux-x86_64-3.6/tests
creating build/lib.linux-x86_64-3.6/fairseq
copying 

## Data pre-processing

In [4]:
%cd examples/translation/
!bash prepare-translit.sh
%cd ../..

/content/fairseq/examples/translation
Cloning Moses github repository (for tokenization scripts)...
Cloning into 'mosesdecoder'...
remote: Counting objects: 147104, done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 147104 (delta 0), reused 2 (delta 0), pack-reused 147098
Receiving objects: 100% (147104/147104), 129.65 MiB | 22.00 MiB/s, done.
Resolving deltas: 100% (113695/113695), done.
Cloning Subword NMT repository (for BPE pre-processing)...
Cloning into 'subword-nmt'...
remote: Counting objects: 462, done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 462 (delta 20), reused 22 (delta 10), pack-reused 420
Receiving objects: 100% (462/462), 208.23 KiB | 3.25 MiB/s, done.
Resolving deltas: 100% (264/264), done.
--2018-08-18 05:23:32--  https://deeplanguageclass.github.io/fairseq-transliteration-data/la-hy.train.tar.gz
Resolving deeplanguageclass.github.io (deeplanguageclass.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110

In [8]:
!python preprocess.py --source-lang la --target-lang hy \
  --trainpref examples/translation/translit_la_hy/train \
  --validpref examples/translation/translit_la_hy/valid \
  --testpref examples/translation/translit_la_hy/test \
  --destdir data-bin/translit_la_hy

Namespace(alignfile=None, destdir='data-bin/translit_la_hy', joined_dictionary=False, nwordssrc=-1, nwordstgt=-1, only_source=False, output_format='binary', padding_factor=8, source_lang='la', srcdict=None, target_lang='hy', testpref='examples/translation/translit_la_hy/test', tgtdict=None, thresholdsrc=0, thresholdtgt=0, trainpref='examples/translation/translit_la_hy/train', validpref='examples/translation/translit_la_hy/valid')
| [la] Dictionary: 1399 types
| [la] examples/translation/translit_la_hy/train.la: 979053 sents, 41160824 tokens, 0.0% replaced by <unk>
| [la] Dictionary: 1399 types
| [la] examples/translation/translit_la_hy/valid.la: 9892 sents, 416323 tokens, 0.000961% replaced by <unk>
| [la] Dictionary: 1399 types
| [la] examples/translation/translit_la_hy/test.la: 10000 sents, 430508 tokens, 0.0446% replaced by <unk>
| [hy] Dictionary: 1479 types
| [hy] examples/translation/translit_la_hy/train.hy: 979053 sents, 41698210 tokens, 0.0% replaced by <unk>
| [hy] Dictionary:
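The `% replaced by <unk>` figures in the log above are easier to judge as absolute counts. This small sketch converts the logged percentages back into approximate token counts; the helper name `replaced_tokens` is hypothetical, and because the log rounds the percentages, the results are estimates.

```python
def replaced_tokens(total_tokens: int, percent_replaced: float) -> int:
    """Estimate how many tokens were mapped to <unk>, from the logged percentage."""
    return round(total_tokens * percent_replaced / 100)


# Figures copied from the preprocess.py log for the Latin-script side.
valid_unk = replaced_tokens(416323, 0.000961)
test_unk = replaced_tokens(430508, 0.0446)

print("valid.la <unk> tokens:", valid_unk)
print("test.la  <unk> tokens:", test_unk)
```

By this estimate only about 4 validation tokens and about 192 test tokens fall outside the source dictionary, which suggests the training vocabulary covers the held-out data well.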

## Training

In [0]:
!mkdir -p checkpoints/fconv
!CUDA_VISIBLE_DEVICES=0 python train.py data-bin/translit_la_hy \
  --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 132 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --lr-scheduler fixed --force-anneal 200 \
  --arch fconv_iwslt_de_en --save-dir checkpoints/fconv \
  --skip-invalid-size-inputs-valid-test --max-epoch 10


Namespace(arch='fconv_iwslt_de_en', clip_norm=0.1, criterion='label_smoothed_cross_entropy', data='data-bin/translit_la_hy', decoder_attention='True', decoder_embed_dim=256, decoder_embed_path=None, decoder_layers='[(256, 3)] * 3', decoder_out_embed_dim=256, device_id=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.2, encoder_embed_dim=256, encoder_embed_path=None, encoder_layers='[(256, 3)] * 4', force_anneal=200, fp16=False, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.25], lr_scheduler='fixed', lr_shrink=0.1, max_epoch=10, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=132, max_update=0, min_loss_scale=0.0001, min_lr=1e-05, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, normalization_constant=0.5, optimizer
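The training command combines `--lr-scheduler fixed` with `--force-anneal 200`, and the logged namespace shows `lr_shrink=0.1`. The sketch below illustrates how such a schedule is commonly understood to behave: the learning rate is held constant until the anneal epoch, then multiplied by the shrink factor once per epoch. The function name `fixed_schedule_lr` and the exact epoch indexing are assumptions for illustration, not Fairseq's actual code.

```python
def fixed_schedule_lr(epoch, base_lr=0.25, force_anneal=200, lr_shrink=0.1):
    """Sketch of a fixed learning-rate schedule with forced annealing.

    Assumption: before `force_anneal` the rate stays at `base_lr`; from that
    epoch on it shrinks by a factor of `lr_shrink` every epoch.
    """
    if force_anneal is None or epoch < force_anneal:
        return base_lr
    return base_lr * lr_shrink ** (epoch + 1 - force_anneal)


# With --max-epoch 10, every epoch falls before the anneal point.
for epoch in (1, 5, 10):
    print(epoch, fixed_schedule_lr(epoch))
```

Under this assumption, the run above never reaches epoch 200, so the learning rate stays at 0.25 for all 10 epochs and `--force-anneal` has no practical effect.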

## Testing

In [0]:
!python generate.py data-bin/translit_la_hy \
  --path checkpoints/fconv/checkpoint_best.pt \
  --batch-size 128 --beam 5 \
  --skip-invalid-size-inputs-valid-test
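To inspect the results programmatically, the output of `generate.py` can be captured and filtered. The sketch below assumes the tab-separated layout that Fairseq versions of this period print, where source, target, and hypothesis lines are prefixed `S-<id>`, `T-<id>`, and `H-<id>`, and hypothesis lines carry a score before the text. The helper `parse_hypotheses` and the sample lines are invented for illustration and are not actual output of this notebook.

```python
from typing import Dict, Iterable, Tuple


def parse_hypotheses(lines: Iterable[str]) -> Dict[int, Tuple[float, str]]:
    """Collect `H-<id>\\t<score>\\t<text>` lines into {id: (score, text)}."""
    hyps: Dict[int, Tuple[float, str]] = {}
    for line in lines:
        if not line.startswith("H-"):
            continue  # skip source, target, and any other log lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # not in the assumed three-column layout
        tag, score, text = parts
        hyps[int(tag[2:])] = (float(score), text)
    return hyps


# Hypothetical sample in the assumed format.
sample = [
    "S-0\tb a r e v",
    "T-0\tբ ա ր և",
    "H-0\t-0.12\tբ ա ր և",
]
print(parse_hypotheses(sample))
```

In a notebook, the command output can be redirected to a file and its lines passed to this helper.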