# End-to-End Neural Machine Translation System

In this notebook we will learn how to train a neural machine translation system from scratch. We will use an encoder-decoder network with a bidirectional LSTM encoder and attention to translate between two of three languages: English, Spanish, and Wixárika, an indigenous language of Mexico. The code in this notebook is adapted from https://opennmt.net/OpenNMT-py/examples/Library.html.

### Deliverable

A 1-page report describing model behavior for at least 2 of the following 3 experiments:


1. Spanish --> English, English --> Spanish
1. English --> Wixárika, Wixárika --> English
1. Spanish --> Wixárika, Wixárika --> Spanish


Which language pair is easier to translate? How might the data influence this behavior? Can you think of creative ways to improve model performance? If you do any type of hyperparameter exploration, make sure to mention what you did (e.g., changing the number of layers in the encoder/decoder, the size of the word embeddings, etc.).

## Let's get started

Create a conda environment following [this notebook](https://github.com/cgpotts/cs224u/blob/master/setup.ipynb), activate the environment, and install the following two packages.

In [None]:
# Neural Machine Translation framework
#!pip install openNMT-py 

# Evaluation metrics
#!pip install sacrebleu

We now begin our NMT journey by importing a few modules and functions that will be necessary.


In [None]:
import yaml
import torch
import torch.nn as nn
from argparse import Namespace
from collections import defaultdict, Counter

In [None]:
import onmt
from onmt.inputters.inputter import _load_vocab, _build_fields_vocab, get_fields, IterOnDevice
from onmt.inputters.corpus import ParallelCorpus
from onmt.inputters.dynamic_iterator import DynamicDatasetIter
from onmt.translate import GNMTGlobalScorer, Translator, TranslationBuilder
from onmt.utils.misc import set_random_seed

Enable logging

In [None]:
# enable logging
from onmt.utils.logging import init_logger, logger
init_logger()

Set the random seed and check whether we have access to a GPU (it's okay if we don't have one; a CPU will be sufficient for this notebook).

In [None]:
is_cuda = torch.cuda.is_available()
set_random_seed(1111, is_cuda)
is_cuda

## Data

An NMT system is trained on a parallel corpus, sometimes called a bitext. In this notebook, we will use a small parallel corpus of Spanish and [Wixárika](https://en.wikipedia.org/wiki/Huichol_language). The Wixárika language is spoken in western Mexico. It is also commonly known as Huichol; however, the Wixáritari people prefer the term Wixárika for their language. The parallel corpus is a collection of Wixárika sentences and their Spanish translations, taken from a descriptive grammar [(Gomez, 1999)](https://arqueologiamexicana.mx/sites/default/files/banco_imagenes/huichol-de-san-andres-cohamiata.pdf). We have also included an English translation of the corpus so that you can experiment and get a feel for the task and the data.

We specify the source and target languages below.

In [None]:
source = 'wixarika'
target =  'spanish'

In [None]:
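# read the parallel files; line i of the source corresponds to line i of the target
with open(f'toy-data/{source}.txt', 'r') as f:
    source_data = f.readlines()
with open(f'toy-data/{target}.txt', 'r') as f:
    target_data = f.readlines()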

Since there are no predetermined data splits for this corpus, we will create our own. We will use 90% for training and 10% for development.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = [i for i in source_data]
y = [i for i in target_data]
X_train, X_val, y_train, y_val = train_test_split( X, y, test_size=0.1, random_state=42)
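
Because every source line must stay paired with its translation, it can help to see what a paired split does. The following is a minimal sketch using only the standard library; `split_parallel` is a hypothetical helper (not part of scikit-learn or OpenNMT-py) and the toy sentences are made up. It only illustrates the idea behind the `train_test_split` call above.

```python
import random

def split_parallel(src_lines, tgt_lines, val_fraction=0.1, seed=42):
    """Shuffle sentence pairs together and split them into train/val sets."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = list(zip(src_lines, tgt_lines))
    # shuffling the pairs (not the two lists separately) keeps them aligned
    random.Random(seed).shuffle(pairs)
    n_val = max(1, round(len(pairs) * val_fraction))
    return pairs[n_val:], pairs[:n_val]

# toy data: the i-th target is the translation of the i-th source
src = [f"src sentence {i}\n" for i in range(20)]
tgt = [f"tgt sentence {i}\n" for i in range(20)]
train_pairs, val_pairs = split_parallel(src, tgt)
print(len(train_pairs), len(val_pairs))
```

A mismatch in line counts between the two files is a common cause of silently misaligned training data, which is why the sketch raises an error in that case.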

In [None]:
datasets = {'src': {'train':X_train,'val':X_val,},
'tgt': {'train':y_train,'val':y_val}}

We write the new data splits to the `toy-data` folder.

In [None]:
for lang in datasets:
    for split in datasets[lang]:
        with open(f'toy-data/{lang}-{split}.txt', 'w') as f:
            f.write("".join(datasets[lang][split]))

In [None]:
!ls toy-data

As for any use case of OpenNMT-py 2.0, we can start by creating a simple YAML configuration with our datasets. This is the easiest way to build the proper opts Namespace that will be used to create the vocabulary(ies).

In [None]:
yaml_config = """
## Where the samples will be written
save_data: toy-data/run/example
## Where the vocab(s) will be written
src_vocab: toy-data/run/example.vocab.src
tgt_vocab: toy-data/run/example.vocab.tgt
# Corpus opts:
data:
    corpus:
        path_src: toy-data/src-train.txt
        path_tgt: toy-data/tgt-train.txt
        transforms: []
        weight: 1
    valid:
        path_src: toy-data/src-val.txt
        path_tgt: toy-data/tgt-val.txt
        transforms: []
"""
config = yaml.safe_load(yaml_config)
with open("toy-data/config.yaml", "w") as f:
    f.write(yaml_config)

In [None]:
from onmt.utils.parse import ArgumentParser
parser = ArgumentParser(description='build_vocab.py')

In [None]:
from onmt.opts import dynamic_prepare_opts
dynamic_prepare_opts(parser, build_vocab_only=True)

In [None]:
base_args = (["-config", "toy-data/config.yaml", "-n_sample", "10000"])
opts, unknown = parser.parse_known_args(base_args)

In [None]:
# delete any existing vocabulary files; otherwise you may run
# into an error in the cell below (-f: no error if a file is missing)
!rm -f toy-data/run/example.vocab.src
!rm -f toy-data/run/example.vocab.tgt


In [None]:
from onmt.bin.build_vocab import build_vocab_main
build_vocab_main(opts)

In [None]:
!ls toy-data/run

We just created our source and target vocabularies, respectively toy-data/run/example.vocab.src and toy-data/run/example.vocab.tgt.
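
To make the vocabulary step less opaque, here is a rough sketch of the underlying idea: count whitespace-separated tokens and list them from most to least frequent. The function name `count_tokens`, the toy corpus, and the tab-separated layout are assumptions for illustration; open the generated files to see the exact format that OpenNMT-py writes.

```python
from collections import Counter

def count_tokens(lines):
    """Count whitespace-separated tokens over an iterable of sentences."""
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    return counter

toy_corpus = ["the cat sleeps", "the dog sleeps", "the cat eats"]
counts = count_tokens(toy_corpus)
# most frequent tokens first, one "token<TAB>count" entry per line
vocab_lines = [f"{tok}\t{n}" for tok, n in counts.most_common()]
print("\n".join(vocab_lines))
```

Since no tokenizer is applied here, punctuation attached to a word would count as part of that word, which inflates the vocabulary on small corpora.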

### Build fields
We can build the fields from the text files that were just created.

In [None]:
src_vocab_path = "toy-data/run/example.vocab.src"
tgt_vocab_path = "toy-data/run/example.vocab.tgt"

In [None]:
# initialize the frequency counter
counters = defaultdict(Counter)
# load source vocab
_src_vocab, _src_vocab_size = _load_vocab(
    src_vocab_path,
    'src',
    counters)
# load target vocab
_tgt_vocab, _tgt_vocab_size = _load_vocab(
    tgt_vocab_path,
    'tgt',
    counters)

Let's take a look at the most frequent tokens in the vocabularies.

In [None]:
# source vocab
_src_vocab[:15]

In [None]:
_src_vocab_size

In [None]:
# target vocab
_tgt_vocab[:15]

In [None]:
_tgt_vocab_size

In [None]:
# initialize fields
src_nfeats, tgt_nfeats = 0, 0 # do not support word features for now
fields = get_fields(
    'text', src_nfeats, tgt_nfeats)

In [None]:
fields

In [None]:
# build fields vocab
share_vocab = False
vocab_size_multiple = 1
src_vocab_size = 2000
tgt_vocab_size = 2000
src_words_min_frequency = 1
tgt_words_min_frequency = 1
vocab_fields = _build_fields_vocab(
    fields, counters, 'text', share_vocab,
    vocab_size_multiple,
    src_vocab_size, src_words_min_frequency,
    tgt_vocab_size, tgt_words_min_frequency)
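
The size cap and minimum-frequency arguments above decide which tokens receive their own id and which fall back to an unknown-word symbol. The sketch below illustrates that idea with the standard library only; `build_stoi` is a hypothetical helper and the special symbols are placeholders, so it should not be read as the actual logic of `_build_fields_vocab`.

```python
from collections import Counter

def build_stoi(counter, max_size, min_freq, specials=("<unk>", "<blank>")):
    """Map tokens to integer ids, keeping at most max_size frequent tokens."""
    stoi = {tok: idx for idx, tok in enumerate(specials)}
    # sort by descending frequency, then alphabetically for a stable order
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    for tok, freq in ranked[:max_size]:
        if freq >= min_freq:
            stoi[tok] = len(stoi)
    return stoi

counter = Counter({"the": 3, "cat": 2, "sleeps": 2, "dog": 1, "eats": 1})
stoi = build_stoi(counter, max_size=3, min_freq=2)
unk = stoi["<unk>"]
# words outside the vocabulary fall back to the unknown-token id
ids = [stoi.get(tok, unk) for tok in "the dog sleeps".split()]
print(stoi, ids)
```

With a corpus as small as this one, lowering the size cap or raising the minimum frequency quickly turns many words into unknowns, which is worth keeping in mind when comparing language pairs.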

An alternative way of creating these fields is to run `onmt_train` without actually training, to just output the necessary files.

### Prepare for training: model and optimizer creation
Let’s get a few fields/vocab related variables to simplify the model creation a bit:

In [None]:
src_text_field = vocab_fields["src"].base_field
src_vocab = src_text_field.vocab
src_padding = src_vocab.stoi[src_text_field.pad_token]

tgt_text_field = vocab_fields['tgt'].base_field
tgt_vocab = tgt_text_field.vocab
tgt_padding = tgt_vocab.stoi[tgt_text_field.pad_token]

### Architecture 

Next we specify the core model itself. Here we will build a small model with an encoder and an attention-based, input-feeding decoder. Both components will be RNNs (more specifically, LSTMs), and the encoder will be bidirectional.

### Core Model

In [None]:
emb_size = 500
rnn_size = 500 
## encoder 
encoder_embeddings = onmt.modules.Embeddings(emb_size, len(src_vocab),
                                             word_padding_idx=src_padding)

encoder = onmt.encoders.RNNEncoder(hidden_size=rnn_size, num_layers=1,
                                   rnn_type="LSTM", bidirectional=True,
                                   embeddings=encoder_embeddings)

## decoder 
decoder_embeddings = onmt.modules.Embeddings(emb_size,
                                             len(tgt_vocab),
                                             word_padding_idx=tgt_padding)

decoder = onmt.decoders.decoder.InputFeedRNNDecoder( hidden_size=rnn_size, 
                                                    num_layers=1, 
                                                    bidirectional_encoder=True, 
                                                    rnn_type="LSTM", 
                                                    embeddings=decoder_embeddings)
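
As a rough illustration of what "attention" means for the decoder, the sketch below computes attention weights and a context vector for a single decoding step in plain Python. It assumes dot-product scoring, the simplest variant; the scoring function used inside `InputFeedRNNDecoder` may differ, and the vectors are made-up toy values.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(query, encoder_states):
    """Return attention weights and the context vector for one decoder step."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    # the context vector is the weighted average of the encoder states
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(len(query))]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per source token
decoder_state = [2.0, 0.0]
weights, context = dot_attention(decoder_state, encoder_states)
print(weights, context)
```

Source positions whose encoder state is most similar to the current decoder state receive the largest weights, so the context vector focuses on the source words that matter for the next target word.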

Move the model to the GPU if one is available, otherwise to the CPU.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = onmt.models.model.NMTModel(encoder, decoder)
model.to(device)

### Word generator & loss module

In [None]:
# Specify the tgt word generator and loss computation module

model.generator = nn.Sequential(
    nn.Linear(rnn_size, len(tgt_vocab)),
    nn.LogSoftmax(dim=-1)).to(device)

loss = onmt.utils.loss.NMTLossCompute(
    criterion=nn.NLLLoss(ignore_index=tgt_padding, reduction="sum"),
    generator=model.generator)
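
The criterion above sums the negative log-likelihood of the gold tokens and skips padding positions. The following plain-Python sketch illustrates that computation under simplifying assumptions (a single sequence, hand-written probabilities); `summed_nll` is a hypothetical helper, not an OpenNMT-py or PyTorch function.

```python
import math

def summed_nll(log_probs, targets, pad_idx):
    """Sum of -log p(target) over positions whose target is not padding."""
    total = 0.0
    for dist, tgt in zip(log_probs, targets):
        if tgt == pad_idx:
            continue  # padded positions contribute nothing to the loss
        total -= dist[tgt]
    return total

PAD = 0
# one row of log-probabilities over a 3-word vocabulary per target position
log_probs = [
    [math.log(0.1), math.log(0.7), math.log(0.2)],
    [math.log(0.2), math.log(0.3), math.log(0.5)],
    [math.log(0.6), math.log(0.2), math.log(0.2)],
]
targets = [1, 2, PAD]
loss = summed_nll(log_probs, targets, PAD)
print(loss)
```

Because the loss is summed rather than averaged, its raw value grows with the number of tokens in a batch, so it is usually normalized before being compared across batches.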

### Optimizer

Now we set up the optimizer. This could be a core torch optim class, or our wrapper which handles learning rate updates and gradient normalization automatically.

In [None]:
lr = 1
torch_optimizer = torch.optim.SGD(model.parameters(), lr=lr)
optim = onmt.utils.optimizers.Optimizer(
    torch_optimizer, learning_rate=lr, max_grad_norm=2)
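
To give an intuition for `max_grad_norm`, the sketch below rescales a gradient vector whose norm exceeds a threshold and then takes one plain SGD step. It is a simplified illustration in pure Python with made-up numbers; `clip_by_global_norm` and `sgd_step` are hypothetical helpers, and the wrapper's actual implementation may differ in detail.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

def sgd_step(params, grads, lr):
    """Plain stochastic gradient descent update."""
    return [p - lr * g for p, g in zip(params, grads)]

params = [0.5, -1.0]
grads = [3.0, 4.0]  # L2 norm is 5.0, above the threshold of 2.0
clipped, norm = clip_by_global_norm(grads, max_norm=2.0)
params = sgd_step(params, clipped, lr=1.0)
print(norm, clipped, params)
```

Clipping keeps the direction of the update but limits its size, which matters here because a learning rate of 1 would otherwise let a single large gradient destabilize training.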

### Create the training and validation data iterators

Now we need to create the dynamic dataset iterator. This is not very ‘library-friendly’ for now because of the way the DynamicDatasetIter constructor is defined. It may evolve in the future.


In [None]:
src_train = "toy-data/src-train.txt"
tgt_train = "toy-data/tgt-train.txt"
src_val = "toy-data/src-val.txt"
tgt_val = "toy-data/tgt-val.txt"

# build the ParallelCorpus
corpus = ParallelCorpus("corpus", src_train, tgt_train)
valid = ParallelCorpus("valid", src_val, tgt_val)

In [None]:
# build the training iterator
train_iter = DynamicDatasetIter(
    corpora={"corpus": corpus},
    corpora_info={"corpus": {"weight": 1}},
    transforms={},
    fields=vocab_fields,
    is_train=True,
    batch_type="tokens",
    batch_size=128,
    batch_size_multiple=1,
    data_type="text")
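
With `batch_type="tokens"`, the batch size is a budget of tokens rather than a number of sentences. The sketch below shows one simple way to group sentences under such a budget, counting the padded size (number of sentences times the longest sentence). `batch_by_tokens` is a hypothetical helper and a simplification; the batching inside `DynamicDatasetIter` is more involved.

```python
def batch_by_tokens(sentences, max_tokens):
    """Group tokenized sentences so each padded batch stays within max_tokens."""
    batches, current, longest = [], [], 0
    for sent in sentences:
        new_longest = max(longest, len(sent))
        # padded cost = number of sentences * length of the longest one
        if current and new_longest * (len(current) + 1) > max_tokens:
            batches.append(current)
            current, new_longest = [], len(sent)
        current.append(sent)
        longest = new_longest
    if current:
        batches.append(current)
    return batches

sentences = [s.split() for s in ["a b", "c d e", "f", "g h i j", "k l"]]
batches = batch_by_tokens(sentences, max_tokens=6)
print([len(b) for b in batches])
```

In this sketch a sentence longer than the budget still ends up in a batch of its own, so the budget is a target rather than a hard limit.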

In [None]:
# run the iteration on GPU 0 if one is available, otherwise on the CPU (-1)
train_iter = iter(IterOnDevice(train_iter, 0 if is_cuda else -1))

In [None]:
# build the validation iterator
valid_iter = DynamicDatasetIter(
    corpora={"valid": valid},
    corpora_info={"valid": {"weight": 1}},
    transforms={},
    fields=vocab_fields,
    is_train=False,
    batch_type="sents",
    batch_size=8,
    batch_size_multiple=1,
    data_type="text")

In [None]:
# use GPU 0 if one is available, otherwise the CPU (-1)
valid_iter = IterOnDevice(valid_iter, 0 if is_cuda else -1)

# Training

We specify some training options.

In [None]:
## play around with val_every
val_every = 50
report_manager = onmt.utils.ReportMgr(
    report_every=val_every, start_time=None, tensorboard_writer=None)

trainer = onmt.Trainer(model=model,
                       train_loss=loss,
                       valid_loss=loss,
                       optim=optim,
                       report_manager=report_manager,
                       dropout=[0.3])

Finally we train.


In [None]:
## play around with train_steps
trainer.train(train_iter=train_iter,
              train_steps=1000,
              valid_iter=valid_iter,
              valid_steps=val_every)
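
When reading the training and validation reports, two quantities are typically of interest: token-level accuracy and perplexity. Assuming perplexity is the exponential of the average per-token negative log-likelihood, the relationship can be sketched as follows; the helper names and the numbers are hypothetical and do not come from an actual run.

```python
import math

def perplexity(total_nll, n_tokens):
    """Exponential of the average negative log-likelihood per token."""
    return math.exp(total_nll / n_tokens)

def accuracy(n_correct, n_tokens):
    """Percentage of target tokens predicted correctly."""
    return 100.0 * n_correct / n_tokens

# hypothetical totals accumulated over one validation pass
print(perplexity(total_nll=230.0, n_tokens=100))
print(accuracy(n_correct=42, n_tokens=100))
```

A perplexity of 1 would mean the model assigns probability 1 to every gold token; a validation perplexity that rises while the training perplexity keeps falling is a sign of overfitting, which is likely on a corpus this small.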

# Translate 
For translation, we can build a “traditional” (as opposed to dynamic) dataset for now.

In [None]:
src_data = {"reader": onmt.inputters.str2reader["text"](), "data": src_val,   "features": {} }
tgt_data = {"reader": onmt.inputters.str2reader["text"](), "data": tgt_val,   "features": {} }


In [None]:
_readers, _data = onmt.inputters.Dataset.config(
    [('src', src_data), ('tgt', tgt_data)])

In [None]:
dataset = onmt.inputters.Dataset(
    vocab_fields, readers=_readers, data=_data,
    sort_key=onmt.inputters.str2sortkey["text"])

In [None]:
data_iter = onmt.inputters.OrderedIterator(
            dataset=dataset,
            device=device,
            batch_size=10,
            train=False,
            sort=False,
            sort_within_batch=True,
            shuffle=False
        )

In [None]:
src_reader = onmt.inputters.str2reader["text"]
tgt_reader = onmt.inputters.str2reader["text"]
scorer = GNMTGlobalScorer(alpha=0.7, 
                          beta=0., 
                          length_penalty="avg", 
                          coverage_penalty="none")
gpu = 0 if torch.cuda.is_available() else -1
translator = Translator(model=model, 
                        fields=vocab_fields, 
                        src_reader=src_reader, 
                        tgt_reader=tgt_reader, 
                        global_scorer=scorer,
                        gpu=gpu)
builder = onmt.translate.TranslationBuilder(data=dataset, 
                                            fields=vocab_fields)
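
The scorer above is configured with an average length penalty. The intuition is that summed log-probabilities favor short outputs, because every extra token adds a negative term. The sketch below shows the effect of dividing by the hypothesis length; `average_score` and the two hypotheses are made up for illustration, and this is not the code path used by `GNMTGlobalScorer`.

```python
def average_score(log_prob_sum, length):
    """Normalize a summed log-probability by the hypothesis length."""
    return log_prob_sum / max(length, 1)

# (tokens, summed log-probability) for two hypothetical beam candidates
hypotheses = [
    (["si"], -1.2),
    (["si", "lo", "vi", "ayer"], -2.0),
]
best_raw = max(hypotheses, key=lambda h: h[1])
best_avg = max(hypotheses, key=lambda h: average_score(h[1], len(h[0])))
print(best_raw[0], best_avg[0])
```

Without normalization the one-word candidate wins; with averaging the longer candidate wins because its per-token score is higher.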


In [None]:
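# split each line of a translation log into its ':'-separated, stripped fields
retrieve = lambda x: [[p.strip() for p in i.split(':')] for i in x.split('\n')]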

Note: translations will be very poor because of the small amount of data, the absence of proper tokenization, and the short training run.

In [None]:
predicted_output = []
for batch in data_iter:
    trans_batch = translator.translate_batch(
        batch=batch, src_vocabs=[src_vocab],
        attn_debug=False)
    translations = builder.from_batch(trans_batch)
    for trans in translations:
        
        a=trans.log(0)
        print(a)
        source_sentence = ((retrieve(a))[1][1])
        prediction = ((retrieve(a))[2][1])
        predicted_output.append(prediction)
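
Splitting the log text on every `:` truncates any sentence that itself contains a colon. A slightly more defensive parser splits each line only at the first `": "`. The sketch below assumes the log consists of lines of the form `LABEL: text`; `parse_log` is a hypothetical helper and `example_log` is a made-up string that mimics that layout, so compare it with the output printed by the loop above before relying on it.

```python
def parse_log(log_text):
    """Map each label (e.g. 'PRED 0') to the text following the first ': '."""
    fields = {}
    for line in log_text.strip().split("\n"):
        label, sep, value = line.partition(": ")
        if sep:  # skip lines without a 'LABEL: text' shape
            fields[label] = value.strip()
    return fields

# made-up log text; the prediction deliberately contains a colon
example_log = "\nSENT 0: ['a', ':', 'b']\nPRED 0: x : y\nPRED SCORE: -0.5\n"
fields = parse_log(example_log)
print(fields["PRED 0"])
```

Looking fields up by label rather than by position also makes the code less sensitive to extra lines in the log.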
   

# Evaluation


We will use two evaluation metrics for machine translation: [BLEU](https://aclanthology.org/P02-1040.pdf) and [CHRF](https://aclanthology.org/W17-4770.pdf). 
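
To build some intuition for character-based metrics, the sketch below computes an F-score over character n-grams for a single hypothesis and reference. It is a deliberately simplified illustration: it uses one n-gram order and a hand-picked `beta`, whereas the `sacrebleu` implementation combines several orders and handles more details. The function names and example sentences are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def char_ngram_fscore(hypothesis, reference, n=2, beta=2.0):
    """F-beta score over character n-grams (beta > 1 weights recall more)."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(char_ngram_fscore("el gato duerme", "el gato duerme"))
print(char_ngram_fscore("el gato", "el gato duerme"))
```

Character-level matching gives partial credit for words that are almost right, which can be informative for a morphologically rich language where exact word matches are rare.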

In [None]:
from sacrebleu import BLEU, CHRF

In [None]:
bleu = BLEU()
chrf = CHRF()

In [None]:
gold = open(tgt_val,'r').readlines()

In [None]:
import pandas as pd

In [None]:
da = [i for i in zip(gold,predicted_output)]

In [None]:
df = pd.DataFrame(da, columns=['gold','prediction'])

In [None]:
df

In [None]:
bleu_ = bleu.corpus_score(predicted_output,[gold])
print(bleu_.score)

In [None]:
# references must be a list of reference streams, as in the BLEU call above
chrf_ = chrf.corpus_score(predicted_output, [gold])
print(chrf_.score)