### Loading modules

In [3]:
import onmt
from onmt.inputters.inputter import _load_vocab, _build_fields_vocab, get_fields, IterOnDevice
from onmt.inputters.corpus import ParallelCorpus
from onmt.inputters.dynamic_iterator import DynamicDatasetIter
from onmt.translate import GNMTGlobalScorer, Translator, TranslationBuilder
from onmt.utils.misc import set_random_seed

In [4]:
import yaml
import torch
import torch.nn as nn
from argparse import Namespace
from collections import defaultdict, Counter

In [5]:
# enable logging
from onmt.utils.logging import init_logger, logger
init_logger()

<RootLogger root (INFO)>

In [6]:
is_cuda = torch.cuda.is_available()
set_random_seed(1111, is_cuda)

The toy dataset used in this notebook is downloaded with: `wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz`
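
For readers without `wget`, the same step can be sketched with the Python standard library. This is a minimal sketch, not part of OpenNMT-py: the helper names `fetch_archive` and `extract_archive` are hypothetical, and the only thing taken from the notebook is the URL.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the wget command above.
TOY_URL = "https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz"


def fetch_archive(url: str, dest: Path) -> Path:
    """Download `url` to `dest` unless the file is already present."""
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_archive(archive: Path, target_dir: Path) -> list:
    """Extract a .tar.gz archive into `target_dir` and return the member names."""
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(target_dir)
    return names
```

Calling `extract_archive(fetch_archive(TOY_URL, Path("toy-ende.tar.gz")), Path("."))` should leave the files where the rest of the notebook expects them, under `toy-ende/`.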

### Build vocabs

In [7]:
yaml_config = """
## Where the samples will be written
save_data: toy-ende/run/example
## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt
# Corpus opts:
data:
    corpus:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
        transforms: []
        weight: 1
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt
        transforms: []
"""
config = yaml.safe_load(yaml_config)
with open("toy-ende/config.yaml", "w") as f:
    f.write(yaml_config)

The cell above writes the `config.yaml` file that is used for building the vocab.
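
Before building the vocab it can help to confirm that every corpus path in the config points at a real file. The sketch below is an illustration only: `missing_corpus_files` is a hypothetical helper, and `example_config` is a hand-written dict that mirrors the `data` section of the YAML, which is the structure `yaml.safe_load` returns.

```python
from pathlib import Path


def missing_corpus_files(config: dict) -> list:
    """Return (corpus, key, path) for every path_src/path_tgt that is not a file."""
    missing = []
    for name, corpus in config.get("data", {}).items():
        for key in ("path_src", "path_tgt"):
            path = corpus.get(key)
            if path is not None and not Path(path).is_file():
                missing.append((name, key, path))
    return missing


# Hand-written copy of the `data` section from the YAML config.
example_config = {
    "data": {
        "corpus": {"path_src": "toy-ende/src-train.txt",
                   "path_tgt": "toy-ende/tgt-train.txt"},
        "valid": {"path_src": "toy-ende/src-val.txt",
                  "path_tgt": "toy-ende/tgt-val.txt"},
    }
}

for name, key, path in missing_corpus_files(example_config):
    print(f"{name}.{key}: {path} not found")
```

If the dataset was extracted correctly, the loop prints nothing.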

In [14]:
from onmt.utils.parse import ArgumentParser
parser = ArgumentParser(description='build_vocab.py')

The cell above creates the argument parser that `build_vocab.py` uses.

**TO DO:** Detail the vocabulary building process used for this translation task: English to German

In [15]:
from onmt.opts import dynamic_prepare_opts
dynamic_prepare_opts(parser, build_vocab_only=True)

In [19]:
base_args = (["-config", "toy-ende/config.yaml", "-n_sample", "10000", "-overwrite", "True"])
opts, unknown = parser.parse_known_args(base_args)

The two cells above create a "wrapper" for the arguments that are parsed for building the vocab.
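
The behaviour of `parse_known_args` can be illustrated with the standard-library `argparse`, on which (as far as we can tell) the OpenNMT parser is ultimately built. This is only a sketch: the two options registered here are simplified stand-ins for the real ones, and `-extra_flag` is a made-up option used to show where unrecognized arguments end up.

```python
import argparse

parser = argparse.ArgumentParser(description="build_vocab.py")
# Simplified stand-ins for two of the real options (assumption, not the onmt definitions).
parser.add_argument("-config", type=str)
parser.add_argument("-n_sample", type=int, default=5000)

# parse_known_args returns the parsed namespace plus a list of leftovers
# instead of exiting with an error on arguments it does not recognize.
opts, unknown = parser.parse_known_args(
    ["-config", "toy-ende/config.yaml", "-n_sample", "10000", "-extra_flag", "1"])

print(opts.config, opts.n_sample)
print(unknown)
```

Recognized options land in `opts`, while anything the parser does not know is returned in `unknown`.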

In [20]:
opts



In [21]:
from onmt.bin.build_vocab import build_vocab_main
build_vocab_main(opts)

[2022-06-15 16:03:28,076 INFO] Parsed 2 corpora from -data.
[2022-06-15 16:03:28,077 INFO] Counter vocab from 10000 samples.
[2022-06-15 16:03:28,077 INFO] Build vocab on 10000 transformed examples/corpus.
[2022-06-15 16:03:28,081 INFO] corpus's transforms: TransformPipe()
[2022-06-15 16:03:28,246 INFO] Counters src:24995
[2022-06-15 16:03:28,246 INFO] Counters tgt:35816


### Build fields
Build the fields from the vocabulary files that were just created.

**Note:** What are `fields`? The following definition is taken from the [inputters documentation](https://opennmt.net/OpenNMT-py/onmt.inputters.html):
> A dict with the structure returned by `onmt.inputters.get_fields()`. Usually that means the dataset side, `"src"` or `"tgt"`. Keys match the keys of items yielded by the readers, while values are lists of `(name, Field)` pairs. An attribute with this name will be created for each `torchtext.data.Example` object and its value will be the result of applying the Field to the data that matches the key. The advantage of having sequences of fields for each piece of raw input is that it allows the dataset to store multiple “views” of each input, which allows for easy implementation of token-level features, mixed word- and character-level models, and so on. (See also `onmt.inputters.TextMultiField`)

In short, a `field` defines how an input should be processed. Having multiple `fields` for one input allows it to be processed at several levels of granularity, all of which can be used as input by the downstream translation model.

In [32]:
src_vocab_path = "toy-ende/run/example.vocab.src"
tgt_vocab_path = "toy-ende/run/example.vocab.tgt"

In [33]:
# initialize the frequency counter
counters = defaultdict(Counter)
# load source vocab
_src_vocab, _src_vocab_size = _load_vocab(
    src_vocab_path,
    'src',
    counters)
# load target vocab
_tgt_vocab, _tgt_vocab_size = _load_vocab(
    tgt_vocab_path,
    'tgt',
    counters)

[2022-06-15 16:22:49,137 INFO] Loading src vocabulary from toy-ende/run/example.vocab.src
[2022-06-15 16:22:49,226 INFO] Loaded src vocab has 24995 tokens.
[2022-06-15 16:22:49,233 INFO] Loading tgt vocabulary from toy-ende/run/example.vocab.tgt
[2022-06-15 16:22:49,274 INFO] Loaded tgt vocab has 35816 tokens.


In [29]:
from pprint import pprint
print(type(_src_vocab))
pprint(_src_vocab[:10])

<class 'list'>
[['the', '12670'],
 [',', '9710'],
 ['.', '9647'],
 ['of', '6634'],
 ['and', '5787'],
 ['to', '5610'],
 ['in', '4072'],
 ['a', '3655'],
 ['is', '3138'],
 ['that', '2286']]
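
The printed list suggests that each line of the vocab file holds a token and its count. Under that assumption (one whitespace-separated `token count` pair per line), the file can be read into a `Counter` with plain Python. `read_vocab_counts` is a hypothetical helper for inspection, not the `_load_vocab` implementation.

```python
from collections import Counter


def read_vocab_counts(path: str) -> Counter:
    """Read 'token<whitespace>count' lines into a Counter (assumed file format)."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the first run of whitespace: token, then count.
            token, count = line.split(None, 1)
            counts[token] += int(count)
    return counts
```

For example, `read_vocab_counts(src_vocab_path).most_common(10)` should list the same tokens as the output above, with the counts as integers rather than strings.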


In [34]:
# initialize fields
src_nfeats, tgt_nfeats = 0, 0 # do not support word features for now
fields = get_fields(
    'text', src_nfeats, tgt_nfeats)

In [35]:
fields

{'src': <onmt.inputters.text_dataset.TextMultiField at 0x7f2ca0d2f970>,
 'tgt': <onmt.inputters.text_dataset.TextMultiField at 0x7f2ca0d2ff40>,
 'indices': <torchtext.data.field.Field at 0x7f2ca0d2f190>}

In [36]:
# build fields vocab
# Vocab sharing is turned off because the source and target sides use different token spaces;
# when it is activated, src_vocab_size and tgt_vocab_size are equal.
share_vocab = False
vocab_size_multiple = 1
src_vocab_size = 30000
tgt_vocab_size = 30000
src_words_min_frequency = 1
tgt_words_min_frequency = 1
vocab_fields = _build_fields_vocab(
    fields, counters, 'text', share_vocab,
    vocab_size_multiple,
    src_vocab_size, src_words_min_frequency,
    tgt_vocab_size, tgt_words_min_frequency)

[2022-06-15 18:55:25,703 INFO]  * tgt vocab size: 30004.
[2022-06-15 18:55:25,728 INFO]  * src vocab size: 24997.
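
The reported sizes are consistent with the size cap applying to regular tokens only, with special tokens added on top: 30000 + 4 = 30004 on the target side and 24995 + 2 = 24997 on the source side (all 24995 source tokens fit under the cap). The sketch below reproduces that arithmetic. It is a simplified illustration, not the torchtext implementation, and the special-token strings are assumptions.

```python
from collections import Counter


def build_vocab(counter, max_size, min_freq, specials):
    """Return (itos, stoi): specials first, then the most frequent tokens up to max_size."""
    itos = list(specials)
    limit = max_size + len(specials)  # the cap does not count the specials
    # Sort by descending frequency, breaking ties alphabetically.
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    for token, freq in ranked:
        if freq < min_freq or len(itos) >= limit:
            break
        itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi


toy = Counter({"the": 5, "of": 3, "cat": 1, "dog": 1})
# "<unk>" and "<blank>" are assumed names for the unknown and padding tokens.
itos, stoi = build_vocab(toy, max_size=3, min_freq=1, specials=["<unk>", "<blank>"])
print(itos)
print(stoi["<blank>"])
```

The `stoi` lookup at the end is the same kind of lookup the notebook later uses to find the padding index.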


**Note:** An alternative way of creating these fields is to run `onmt_train` without actually training, so that it only outputs the necessary files.

### Prepare for training: model and optimizer creation

In [17]:
src_text_field = vocab_fields["src"].base_field
src_vocab = src_text_field.vocab
src_padding = src_vocab.stoi[src_text_field.pad_token]

tgt_text_field = vocab_fields['tgt'].base_field
tgt_vocab = tgt_text_field.vocab
tgt_padding = tgt_vocab.stoi[tgt_text_field.pad_token]

In [18]:
emb_size = 100
rnn_size = 500
# Specify the core model.

encoder_embeddings = onmt.modules.Embeddings(emb_size, len(src_vocab),
                                             word_padding_idx=src_padding)

encoder = onmt.encoders.RNNEncoder(hidden_size=rnn_size, num_layers=1,
                                   rnn_type="LSTM", bidirectional=True,
                                   embeddings=encoder_embeddings)

decoder_embeddings = onmt.modules.Embeddings(emb_size, len(tgt_vocab),
                                             word_padding_idx=tgt_padding)
decoder = onmt.decoders.decoder.InputFeedRNNDecoder(
    hidden_size=rnn_size, num_layers=1, bidirectional_encoder=True, 
    rnn_type="LSTM", embeddings=decoder_embeddings)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = onmt.models.model.NMTModel(encoder, decoder)
model.to(device)

# Specify the tgt word generator and loss computation module
model.generator = nn.Sequential(
    nn.Linear(rnn_size, len(tgt_vocab)),
    nn.LogSoftmax(dim=-1)).to(device)

loss = onmt.utils.loss.NMTLossCompute(
    criterion=nn.NLLLoss(ignore_index=tgt_padding, reduction="sum"),
    generator=model.generator)
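
To make the generator and loss concrete: the generator turns decoder outputs into log-probabilities over the target vocabulary, and the criterion sums the negative log-probability of each reference token while skipping padding positions. The pure-Python sketch below mirrors that computation for small lists. It is an illustration of log-softmax followed by a summed NLL with an ignored index, not the `NMTLossCompute` code.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def summed_nll(log_probs, targets, ignore_index):
    """Sum of -log p(target) over positions whose target is not ignore_index."""
    total = 0.0
    for row, target in zip(log_probs, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing to the loss
        total -= row[target]
    return total


PAD = 1  # hypothetical padding index for this toy example
scores = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, PAD]  # the second position is padding and is skipped
print(summed_nll([log_softmax(row) for row in scores], targets, ignore_index=PAD))
```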

In [19]:
lr = 1
torch_optimizer = torch.optim.SGD(model.parameters(), lr=lr)
optim = onmt.utils.optimizers.Optimizer(
    torch_optimizer, learning_rate=lr, max_grad_norm=2)
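
The `max_grad_norm=2` argument bounds the size of each update. Assuming it works as norm-based gradient clipping (the usual meaning of this option), the idea can be sketched as follows. `clip_by_norm` is a hypothetical helper operating on a flat list of gradient values, and it omits the small epsilon that library implementations typically add.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale grads so that their L2 norm is at most max_norm; return (grads, original norm)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm


clipped, original_norm = clip_by_norm([3.0, 4.0], max_norm=2)
print(original_norm, clipped)
```

Gradients whose norm is already below the threshold are left unchanged, so clipping only affects unusually large steps.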

### Create the training and validation data iterators

The data iterators load the data and apply the pre-training data transformations.

In [20]:
src_train = "toy-ende/src-train.txt"
tgt_train = "toy-ende/tgt-train.txt"
src_val = "toy-ende/src-val.txt"
tgt_val = "toy-ende/tgt-val.txt"

# build the ParallelCorpus
corpus = ParallelCorpus("corpus", src_train, tgt_train)
valid = ParallelCorpus("valid", src_val, tgt_val)

In [21]:
# build the training iterator
train_iter = DynamicDatasetIter(
    corpora={"corpus": corpus},
    corpora_info={"corpus": {"weight": 1}},
    transforms={},
    fields=vocab_fields,
    is_train=True,
    batch_type="tokens",
    batch_size=4096,
    batch_size_multiple=1,
    data_type="text")

In [22]:
# run the iteration on GPU 0 when available, otherwise on CPU (-1 for CPU, N for GPU N)
train_iter = iter(IterOnDevice(train_iter, 0 if torch.cuda.is_available() else -1))

In [23]:
# build the validation iterator
valid_iter = DynamicDatasetIter(
    corpora={"valid": valid},
    corpora_info={"valid": {"weight": 1}},
    transforms={},
    fields=vocab_fields,
    is_train=False,
    batch_type="sents",
    batch_size=8,
    batch_size_multiple=1,
    data_type="text")
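
The two iterators differ in how `batch_size` is interpreted: with `batch_type="tokens"` it is a token budget per batch, while with `batch_type="sents"` it is a number of sentences. The sketch below contrasts the two on tokenized sentences. It is a simplification that counts raw tokens on one side only; the library's own batching may also account for padding and the target side.

```python
def batch_by_sents(examples, batch_size):
    """Fixed number of sentences per batch."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


def batch_by_tokens(examples, max_tokens):
    """Greedy grouping: start a new batch when adding a sentence would exceed max_tokens."""
    batches, current, current_tokens = [], [], 0
    for ex in examples:
        n = len(ex)
        if current and current_tokens + n > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(ex)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


sentences = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"], ["j"]]
print([len(b) for b in batch_by_tokens(sentences, max_tokens=5)])  # [2, 2]
print([len(b) for b in batch_by_sents(sentences, batch_size=3)])   # [3, 1]
```

Token-based batching keeps the amount of work per step roughly constant when sentence lengths vary, which is why it is a common choice for training.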

In [24]:
valid_iter = IterOnDevice(valid_iter, 0 if torch.cuda.is_available() else -1)

### Training

In [25]:
report_manager = onmt.utils.ReportMgr(
    report_every=50, start_time=None, tensorboard_writer=None)

trainer = onmt.Trainer(model=model,
                       train_loss=loss,
                       valid_loss=loss,
                       optim=optim,
                       report_manager=report_manager,
                       dropout=[0.1])

trainer.train(train_iter=train_iter,
              train_steps=1000,
              valid_iter=valid_iter,
              valid_steps=500)

[2022-06-15 07:58:47,711 INFO] Start training loop and validate every 500 steps...
[2022-06-15 07:58:47,712 INFO] corpus's transforms: TransformPipe()
[2022-06-15 07:58:47,713 INFO] Weighted corpora loaded so far:
			* corpus: 1
[2022-06-15 07:58:55,304 INFO] Step 50/ 1000; acc:   7.52; ppl: 8841.41; xent: 9.09; lr: 1.00000; 14836/14801 tok/s;      8 sec
[2022-06-15 07:59:00,088 INFO] Weighted corpora loaded so far:
			* corpus: 2
[2022-06-15 07:59:02,966 INFO] Step 100/ 1000; acc:   9.45; ppl: 1917.17; xent: 7.56; lr: 1.00000; 14760/14661 tok/s;     15 sec
[2022-06-15 07:59:10,697 INFO] Step 150/ 1000; acc:  10.71; ppl: 1375.01; xent: 7.23; lr: 1.00000; 14614/14579 tok/s;     23 sec
[2022-06-15 07:59:15,824 INFO] Weighted corpora loaded so far:
			* corpus: 3
[2022-06-15 07:59:18,641 INFO] Step 200/ 1000; acc:  11.12; ppl: 1126.51; xent: 7.03; lr: 1.00000; 14351/14221 tok/s;     31 sec
[2022-06-15 07:59:26,336 INFO] Step 250/ 1000; acc:  12.63; ppl: 911.46; xent: 6.82; lr: 1.00000; 14

<onmt.utils.statistics.Statistics at 0x7ff7932573d0>
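
In the log above, `ppl` (perplexity) and `xent` (cross-entropy per token) carry the same information: perplexity is the exponential of the cross-entropy. The short check below uses two value pairs copied from the log. `perplexity` is a hypothetical helper written for this illustration.

```python
import math


def perplexity(total_nll, n_tokens):
    """Perplexity is the exponential of the average negative log-likelihood per token."""
    return math.exp(total_nll / n_tokens)


# (xent, ppl) pairs copied from the training log above.
logged = [(9.09, 8841.41), (7.56, 1917.17)]
for xent, ppl in logged:
    # Taking the log of the reported perplexity recovers the reported cross-entropy.
    print(xent, round(math.log(ppl), 2))
```

The small differences when going the other way (for example `math.exp(9.09)` versus 8841.41) come from the cross-entropy being rounded to two decimals in the log.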

### Translate

For translation, we can build a "traditional" (as opposed to dynamic) dataset for now.

In [29]:
src_data = {"reader": onmt.inputters.str2reader["text"](), "data": src_val, "features": {}}
tgt_data = {"reader": onmt.inputters.str2reader["text"](), "data": tgt_val, "features": {}}
_readers, _data = onmt.inputters.Dataset.config([('src', src_data), ('tgt', tgt_data)])

In [30]:
dataset = onmt.inputters.Dataset(
    vocab_fields, readers=_readers, data=_data,
    sort_key=onmt.inputters.str2sortkey["text"])

In [31]:
data_iter = onmt.inputters.OrderedIterator(
            dataset=dataset,
            device=device,
            batch_size=10,
            train=False,
            sort=False,
            sort_within_batch=True,
            shuffle=False
        )

In [32]:
src_reader = onmt.inputters.str2reader["text"]
tgt_reader = onmt.inputters.str2reader["text"]
scorer = GNMTGlobalScorer(alpha=0.7, 
                          beta=0., 
                          length_penalty="avg", 
                          coverage_penalty="none")
gpu = 0 if torch.cuda.is_available() else -1
translator = Translator(model=model, 
                        fields=vocab_fields, 
                        src_reader=src_reader, 
                        tgt_reader=tgt_reader, 
                        global_scorer=scorer,
                        gpu=gpu)
builder = onmt.translate.TranslationBuilder(data=dataset, 
                                            fields=vocab_fields)
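
The scorer settings control how beam hypotheses of different lengths are compared. As we understand it, `length_penalty="avg"` divides the summed log-probability by the hypothesis length, while the `"wu"` option would use the GNMT penalty `((5 + length) / 6) ** alpha`, so `alpha` only matters in the latter case. The sketch below shows both formulas on plain numbers. The function names are hypothetical and this is not the `GNMTGlobalScorer` code.

```python
def average_length_score(token_log_probs):
    """'avg' style: summed log-probability divided by the hypothesis length."""
    return sum(token_log_probs) / len(token_log_probs)


def wu_length_penalty(length, alpha):
    """GNMT length penalty from Wu et al.: ((5 + length) / 6) ** alpha."""
    return ((5 + length) / 6.0) ** alpha


def wu_score(token_log_probs, alpha):
    """'wu' style: summed log-probability divided by the GNMT penalty."""
    return sum(token_log_probs) / wu_length_penalty(len(token_log_probs), alpha)


log_probs = [-1.0, -2.0, -3.0]
print(average_length_score(log_probs))  # -2.0
print(wu_score(log_probs, alpha=0.0))   # -6.0, no length normalization
```

Without any normalization, longer hypotheses accumulate more negative scores and the beam search tends to favor short outputs.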

**Note**: translations will be very poor, because of the very low quantity of data, the absence of proper tokenization, and the brevity of the training.

In [33]:
for batch in data_iter:
    trans_batch = translator.translate_batch(
        batch=batch, src_vocabs=[src_vocab],
        attn_debug=False)
    translations = builder.from_batch(trans_batch)
    for trans in translations:
        print(trans.log(0))
    break


SENT 0: ['Parliament', 'Does', 'Not', 'Support', 'Amendment', 'Freeing', 'Tymoshenko']
PRED 0: Parlament das Parlament auf , die sich in der Lage , die sich in der Lage <unk> .
PRED SCORE: -1.5629


SENT 0: ['Today', ',', 'the', 'Ukraine', 'parliament', 'dismissed', ',', 'within', 'the', 'Code', 'of', 'Criminal', 'Procedure', 'amendment', ',', 'the', 'motion', 'to', 'revoke', 'an', 'article', 'based', 'on', 'which', 'the', 'opposition', 'leader', ',', 'Yulia', 'Tymoshenko', ',', 'was', 'sentenced', '.']
PRED 0: In der Nähe des Hotels in der Nähe des Hotels , die in der Lage , die in der Lage , die in der Lage , in der Lage ist .
PRED SCORE: -1.7963


SENT 0: ['The', 'amendment', 'that', 'would', 'lead', 'to', 'freeing', 'the', 'imprisoned', 'former', 'Prime', 'Minister', 'was', 'revoked', 'during', 'second', 'reading', 'of', 'the', 'proposal', 'for', 'mitigation', 'of', 'sentences', 'for', 'economic', 'offences', '.']
PRED 0: Die Tatsache , die sich in der Lage waren , um die für eine

