<a href="https://colab.research.google.com/github/angomoson/pytorch/blob/main/Copy_of_2_NMT_Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install OpenNMT-py 3.x
!pip3 install OpenNMT-py

Collecting OpenNMT-py
  Downloading OpenNMT_py-3.5.1-py3-none-any.whl (262 kB)
Collecting configargparse (from OpenNMT-py)
  Downloading ConfigArgParse-1.7-py3-none-any.whl (25 kB)
Collecting ctranslate2<5,>=4 (from OpenNMT-py)
  Downloading ctranslate2-4.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.7 MB)
Collecting waitress (from OpenNMT-py)
  Downloading waitress-3.0.0-py3-none-any.whl (56 kB)
Collecting pyonmttok<2,>=1.37 (from OpenNMT-py)
  Downloading pyonmttok-1.37.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)

# Prepare Your Datasets
Please make sure you have completed the [first exercise](https://colab.research.google.com/drive/1rsFPnAQu9-_A6e2Aw9JYK3C8mXx9djsF?usp=sharing).

In [2]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [3]:
# Open the folder where you saved your prepared datasets from the first exercise
# You might need to mount your Google Drive first
%cd drive/MyDrive/nmt/MT-Preparation/
!ls

/content/drive/MyDrive/nmt
MT-Preparation	source.model  source.vocab  target.model  target.vocab


In [8]:
%ls

config.yaml                           filtering/                           README.md
english.en                            french.fr                            requirements.txt
english.en-filtered.en                french.fr-filtered.fr                run/
english.en-filtered.en.subword        french.fr-filtered.fr.subword        subwording/
english.valid.en                      french.valid.fr                      train_dev_split/
english.valid.en-filtered.en          french.valid.fr-filtered.fr          train.log
english.valid.en-filtered.en.subword  french.valid.fr-filtered.fr.subword
extra/                                models/


# Create the Training Configuration File

The following config file matches most of the recommended values for the Transformer model [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762). As the current dataset is small, we reduced the following values:
* `train_steps` - for datasets with a few million sentences, consider a value between 100000 and 200000, or more. Enabling the `early_stopping` option stops training once validation no longer shows a considerable improvement.
* `valid_steps` - 10000 works well when `train_steps` is large enough.
* `warmup_steps` - must be less than `train_steps`. Try 4000 or 8000.

Refer to [OpenNMT-py training parameters](https://opennmt.net/OpenNMT-py/options/train.html) for more details. If you are interested in further explanation of the Transformer model, you can check this article, [Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/).
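
To see why `warmup_steps` must stay below `train_steps`, the sketch below evaluates the Noam schedule from Vaswani et al. (2017) with the values used in this notebook. It assumes that OpenNMT-py multiplies the schedule by `learning_rate` and uses `hidden_size` as the model dimension; the function name `noam_lr` is only illustrative.

```python
def noam_lr(step, learning_rate=2.0, model_dim=512, warmup_steps=1000):
    """Learning rate at `step` under Noam decay (linear warmup, then 1/sqrt(step))."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    scale = model_dim ** -0.5
    return learning_rate * scale * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate rises until `warmup_steps`, peaks there, and decays afterwards.
for step in (100, 500, 1000, 2000, 3000):
    print(f"step {step:>5}: lr = {noam_lr(step):.6f}")
```

If `warmup_steps` were larger than `train_steps`, training would end while the rate is still ramping up and the model would never train at the peak rate.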

In [6]:
# Create the YAML configuration file
# On a regular machine, you can create it manually or with nano
# Note here we are using some smaller values because the dataset is small
# For larger datasets, consider increasing: train_steps, valid_steps, warmup_steps, save_checkpoint_steps, keep_checkpoint

config = '''# config.yaml


## Where the samples will be written
save_data: run

# Training files
data:
    corpus_1:
        path_src: french.fr-filtered.fr.subword
        path_tgt: english.en-filtered.en.subword
        transforms: [filtertoolong]
    valid:
        path_src: french.valid.fr-filtered.fr.subword
        path_tgt: english.valid.en-filtered.en.subword
        transforms: [filtertoolong]

# Vocabulary files, generated by onmt_build_vocab
src_vocab: run/source.vocab
tgt_vocab: run/target.vocab

# Vocabulary size - should be the same as in sentence piece
src_vocab_size: 50000
tgt_vocab_size: 50000

# Filter out source/target longer than n if [filtertoolong] enabled
src_seq_length: 150
tgt_seq_length: 150

# Tokenization options
src_subword_model: source.model
tgt_subword_model: target.model

# Where to save the log file and the output models/checkpoints
log_file: train.log
save_model: models/model.fren

# Stop training if it does not improve after n validations
early_stopping: 4

# Default: 5000 - Save a model checkpoint every n steps
save_checkpoint_steps: 1000

# To save space, limit checkpoints to last n
# keep_checkpoint: 3

seed: 3435

# Default: 100000 - Train the model to max n steps
# Increase to 200000 or more for large datasets
# For fine-tuning, add the extra steps to the original number of steps
train_steps: 3000

# Default: 10000 - Run validation after n steps
valid_steps: 1000

# Default: 4000 - for large datasets, try up to 8000
warmup_steps: 1000
report_every: 100

# Number of GPUs, and IDs of GPUs
world_size: 1
gpu_ranks: [0]

# Batching
bucket_size: 262144
num_workers: 0  # Default: 2, set to 0 when RAM out of memory
batch_type: "tokens"
batch_size: 4096   # Tokens per batch, change when CUDA out of memory
valid_batch_size: 2048
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
# warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
'''

with open("config.yaml", "w+") as config_yaml:
  config_yaml.write(config)
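
YAML parsers typically keep only the last occurrence of a repeated key, so a key typed twice (for example `src_seq_length` where `tgt_seq_length` was intended) can go unnoticed. The helper below is a minimal standard-library sketch for spotting such repeats at the top level of the file; `find_duplicate_top_level_keys` is a hypothetical name, not part of OpenNMT-py.

```python
import re
from collections import Counter

# A top-level key starts in column 0; comments and indented keys do not match.
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w-]*)\s*:")


def find_duplicate_top_level_keys(yaml_text):
    """Return top-level keys that appear more than once, in first-seen order."""
    keys = [m.group(1) for line in yaml_text.splitlines()
            if (m := TOP_LEVEL_KEY.match(line))]
    counts = Counter(keys)
    return [key for key in dict.fromkeys(keys) if counts[key] > 1]


sample = "src_seq_length: 150\nsrc_seq_length: 150\ntrain_steps: 3000\n"
print(find_duplicate_top_level_keys(sample))  # ['src_seq_length']
```

In the notebook, calling `find_duplicate_top_level_keys(config)` on the string written above gives the same check for the real file.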

In [7]:
# [Optional] Check the content of the configuration file
!cat config.yaml

# config.yaml


## Where the samples will be written
save_data: run

# Training files
data:
    corpus_1:
        path_src: french.fr-filtered.fr.subword
        path_tgt: english.en-filtered.en.subword
        transforms: [filtertoolong]
    valid:
        path_src: french.valid.fr-filtered.fr.subword
        path_tgt: english.valid.en-filtered.en.subword
        transforms: [filtertoolong]

# Vocabulary files, generated by onmt_build_vocab
src_vocab: run/source.vocab
tgt_vocab: run/target.vocab

# Vocabulary size - should be the same as in sentence piece
src_vocab_size: 50000
tgt_vocab_size: 50000

# Filter out source/target longer than n if [filtertoolong] enabled
src_seq_length: 150
tgt_seq_length: 150

# Tokenization options
src_subword_model: source.model
tgt_subword_model: target.model

# Where to save the log file and the output models/checkpoints
log_file: train.log
save_model: models/model.fren

# Stop training if it does not improve after n validations
early_stopping: 4

# D

# Build Vocabulary

For large datasets, it is not feasible to use all words/tokens found in the corpus. Instead, a specific set of vocabulary is extracted from the training dataset, usually between 32k and 100k words. This is the main purpose of the vocabulary building step.
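
Conceptually, this step counts how often each token occurs in the subworded training data and keeps the most frequent ones. The sketch below illustrates the idea with the standard library only; it is not how `onmt_build_vocab` is implemented, and the function names are illustrative.

```python
from collections import Counter


def count_tokens(lines):
    """Count whitespace-separated tokens over an iterable of text lines."""
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    return counter


def build_vocab(lines, max_size=50000):
    """Return the `max_size` most frequent tokens with their counts."""
    return count_tokens(lines).most_common(max_size)


# Tiny in-memory example; in practice, pass an open file such as
# open("french.fr-filtered.fr.subword", encoding="utf-8").
print(build_vocab(["a b a", "b a c"], max_size=2))  # [('a', 3), ('b', 2)]
```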

In [9]:
# Find the number of CPUs/cores on the machine
!nproc --all

2


In [10]:
# Build Vocabulary

# -config: path to your config.yaml file
# -n_sample: use -1 to build the vocabulary on all segments in the training dataset
# -num_threads: change it to match the number of CPUs to run it faster

!onmt_build_vocab -config config.yaml -n_sample -1 -num_threads 2

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2024-04-20 13:33:38,113 INFO] Counter vocab from -1 samples.
[2024-04-20 13:33:38,113 INFO] n_sample=-1: Build vocab on full datasets.
[2024-04-20 13:33:39,214 INFO] * Transform statistics for corpus_1(50.00%):
			* FilterTooLongStats(filtered=2)

[2024-04-20 13:33:39,234 INFO] * Transform statistics for corpus_1(50.00%):
			* FilterTooLongStats(filtered=3)

[2024-04-20 13:33:39,267 INFO] Counters src: 15799
[2024-04-20 13:33:39,267 INFO] Counters tgt: 11817
Traceback (most recent call last):
  File "/usr/local/bin/onmt_build_vocab", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/onmt/bin/build_vocab.py", line 283, in main
    build_vocab_main(opts)
  File "/usr/local/lib/python3.10/dist-packages/onmt/bin/build_vocab.py", line 267, in build_vocab_main
    save_counter(src_counter, opts.src_vocab)
  File "/usr/local/lib/python3.10/dist-packages/onmt/bin/build_vocab.py", line 

From the **Runtime menu** > **Change runtime type**, make sure that the "**Hardware accelerator**" is "**GPU**".


In [11]:
# Check if the GPU is active
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-600cd45c-cbc9-cf40-6405-b140b629230b)


In [12]:
# Check if the GPU is visible to PyTorch

import torch

print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

gpu_memory = torch.cuda.mem_get_info(0)
print("Free GPU memory:", gpu_memory[0]/1024**2, "out of:", gpu_memory[1]/1024**2)

True
Tesla T4
Free GPU memory: 14999.0625 out of: 15102.0625


# Training

Now, start training your NMT model! 🎉 🎉 🎉

In [13]:
!rm -rf drive/MyDrive/nmt/models/

In [14]:
# Train the NMT model
!onmt_train -config config.yaml

[2024-04-20 13:34:40,025 INFO] Parsed 2 corpora from -data.
[2024-04-20 13:34:40,025 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2024-04-20 13:34:40,603 INFO] The first 10 tokens of the vocabs are:['<unk>', '<blank>', '<s>', '</s>', "'", ',', '▁de', '.', '▁la', '▁et']
[2024-04-20 13:34:40,603 INFO] The decoder start token is: <s>
[2024-04-20 13:34:40,603 INFO] Building model...
[2024-04-20 13:34:42,032 INFO] Switching model to float32 for amp/apex_amp
[2024-04-20 13:34:42,033 INFO] Non quantized layer compute is fp16
[2024-04-20 13:34:42,359 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(15808, 512, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
       

In [None]:
# For error debugging try:
# !dmesg -T

# Translation

Translation Options:
* `-model` - specify the last model checkpoint name; try testing the quality of multiple checkpoints
* `-src` - the subworded test dataset, source file
* `-output` - give any file name to the new translation output file
* `-gpu` - GPU ID, usually 0 if you have one GPU. If omitted, translation runs on the CPU, which is slower.
* `-min_length` - [optional] to avoid empty translations
* `-verbose` - [optional] if you want to print translations

Refer to [OpenNMT-py translation options](https://opennmt.net/OpenNMT-py/options/translate.html) for more details.
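
To compare several checkpoints, the same command can be generated once per saved model. The sketch below only builds and prints the commands, using the flags listed above; the output file names (`en.translated.step<N>`) and the helper names are assumptions made for illustration.

```python
import glob
import re


def checkpoint_step(path):
    """Extract the step number from a name like models/model.fren_step_1000.pt."""
    match = re.search(r"_step_(\d+)\.pt$", path)
    return int(match.group(1)) if match else -1


def build_translate_cmd(model, src, output, gpu=0, min_length=1):
    """Build the onmt_translate argument list for one checkpoint."""
    return ["onmt_translate", "-model", model, "-src", src, "-output", output,
            "-gpu", str(gpu), "-min_length", str(min_length)]


# Sort numerically so that step 10000 comes after step 2000.
checkpoints = sorted(glob.glob("models/model.fren_step_*.pt"), key=checkpoint_step)
for ckpt in checkpoints:
    step = checkpoint_step(ckpt)
    cmd = build_translate_cmd(ckpt, "french.test.fr-filtered.fr.subword",
                              f"en.translated.step{step}")  # hypothetical output name
    print(" ".join(cmd))
```

Each printed command can be run in a cell with a leading `!`, or `cmd` can be passed to `subprocess.run(cmd, check=True)`.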

In [18]:
%ls

config.yaml                           french.fr-filtered.fr.subword
english.en                            french.test.fr
english.en-filtered.en                french.test.fr-filtered.fr
english.en-filtered.en.subword        french.test.fr-filtered.fr.subword
english.test.en                       french.valid.fr
english.test.en-filtered.en           french.valid.fr-filtered.fr
english.test.en-filtered.en.subword   french.valid.fr-filtered.fr.subword
english.valid.en                      models/
english.valid.en-filtered.en          README.md
english.valid.en-filtered.en.subword  requirements.txt
extra/                                run/
filtering/                            subwording/
french.fr                             train_dev_split/
french.fr-filtered.fr                 train.log


In [19]:
# Translate the "subworded" source file of the test dataset
# Change the model name, if needed.
!onmt_translate -model models/model.fren_step_1000.pt -src french.test.fr-filtered.fr.subword -output en.translated -gpu 0 -min_length 1

[2024-04-20 14:25:57,622 INFO] Loading checkpoint from models/model.fren_step_1000.pt
[2024-04-20 14:26:00,495 INFO] Loading data into the model
[2024-04-20 14:26:04,183 INFO] PRED SCORE: -0.6307, PRED PPL: 1.88 NB SENTENCES: 99
Time w/o python interpreter load/terminate:  6.5700390338897705
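
The two numbers in the log line are related: assuming `PRED SCORE` is the average log-probability per predicted token, the perplexity is its negated exponential. A quick check with the values above:

```python
import math


def perplexity(avg_log_prob):
    """Perplexity corresponding to an average per-token log-probability."""
    return math.exp(-avg_log_prob)


print(round(perplexity(-0.6307), 2))  # 1.88
```

A lower perplexity means the model assigned higher probability to its own output; it does not replace evaluation against a reference translation.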


In [20]:
# Check the first 5 lines of the translation file
!head -n 5 en.translated

▁The ▁Commission ▁and ▁the ▁Committee ▁on ▁Legal ▁Affairs ▁and ▁the ▁Internal ▁Market .
▁The ▁increased ▁awareness ▁of ▁that , ▁Europe , ▁is ▁developing ▁the ▁economic ▁and ▁social ▁bodies ▁as ▁well ▁as ▁the ▁times ▁of ▁the ▁highest ▁food ▁safety ▁regime ▁in ▁the ▁times ▁of ▁the ▁food ▁safety .
▁The ▁accession ▁of ▁the ▁Member ▁States ▁with ▁regard ▁to ▁the ▁European ▁Union , ▁as ▁a ▁member ▁of ▁the ▁countries ▁of ▁the ▁Union ▁to ▁this ▁type ▁of ▁Community ▁initiatives , ▁can ▁be ▁studied ▁it .
▁It ▁has ▁to ▁be ▁remembered ▁that ▁the ▁European ▁continent ▁which ▁are ▁at ▁the ▁heart ▁of ▁a ▁at ▁European ▁level , ▁and ▁not , ▁as ▁presented ▁by ▁the ▁rapporteur , ▁propose ▁a ▁global ▁level .
▁Secondly , ▁it ▁is ▁the ▁case ▁that ▁we ▁have ▁seen ▁a ▁guarantee ▁the ▁necessary ▁requirement ▁in ▁terms ▁of ▁managing ▁national ▁legislation , ▁and ▁in ▁the ▁law ▁of ▁the ▁law , ▁which ▁is ▁supported ▁by ▁the ▁rapporteur ▁mentioned ▁earlier .


In [22]:
%ls

config.yaml                           french.fr-filtered.fr.subword
english.en                            french.test.fr
english.en-filtered.en                french.test.fr-filtered.fr
english.en-filtered.en.subword        french.test.fr-filtered.fr.subword
english.test.en                       french.valid.fr
english.test.en-filtered.en           french.valid.fr-filtered.fr
english.test.en-filtered.en.subword   french.valid.fr-filtered.fr.subword
english.valid.en                      models/
english.valid.en-filtered.en          README.md
english.valid.en-filtered.en.subword  requirements.txt
en.translated                         run/
extra/                                subwording/
filtering/                            train_dev_split/
french.fr                             train.log
french.fr-filtered.fr


In [23]:
# If needed install/update sentencepiece
# !pip3 install --upgrade -q sentencepiece

# Desubword the translation file
!python3 subwording/3-desubword.py ../target.model en.translated

Done desubwording! Output: en.translated.desubword
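
For intuition, desubwording SentencePiece output amounts to removing the spaces between pieces and turning the `▁` marker back into a space. The script above decodes with the SentencePiece model; the sketch below is a model-free approximation that assumes the default `▁` word-boundary marker, and `desubword_line` is an illustrative name.

```python
def desubword_line(line):
    """Join SentencePiece pieces and restore spaces from the U+2581 marker."""
    return "".join(line.split()).replace("\u2581", " ").strip()


pieces = ("\u2581The \u2581Commission \u2581and \u2581the \u2581Committee \u2581on "
          "\u2581Legal \u2581Affairs \u2581and \u2581the \u2581Internal \u2581Market .")
print(desubword_line(pieces))
# The Commission and the Committee on Legal Affairs and the Internal Market.
```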


In [24]:
# Check the first 5 lines of the desubworded translation file
!head -n 5 en.translated.desubword


In [25]:
!python3 subwording/3-desubword.py ../target.model english.test.en-filtered.en.subword

Done desubwording! Output: english.test.en-filtered.en.subword.desubword


In [27]:
# Check the first 5 lines of the desubworded reference
!head -n 5 english.test.en-filtered.en.subword.desubword

As far the Commission and the Legal Affairs Committee are concerned, it comes under competition.
The antibiotics, growth promoters and genetically-modified organisms mixed in with feedingstuffs end up in the human food chain.
In the territory of Member States of the European Union, applicant countries or Member States of EFTA, there may be interrelations between the various Community initiatives.
The real priority for the European continent is to establish an operational level within a European framework and not a global framework as proposed by the rapporteur.
Secondly, it makes sense to issue provisions for genetically-modified feedingstuff additives similar to the regulations in seed marketing legislation, as the rapporteur has said.


# MT Evaluation

There are several MT evaluation metrics, including BLEU, TER, METEOR, COMET, and BERTScore.

Here we use BLEU. Both the translation and the reference must be detokenized/desubworded beforehand.

In [28]:
# Download the BLEU script
!wget https://raw.githubusercontent.com/ymoslem/MT-Evaluation/main/BLEU/compute-bleu.py

--2024-04-20 14:30:26--  https://raw.githubusercontent.com/ymoslem/MT-Evaluation/main/BLEU/compute-bleu.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 957 [text/plain]
Saving to: ‘compute-bleu.py’


2024-04-20 14:30:26 (51.5 MB/s) - ‘compute-bleu.py’ saved [957/957]



In [29]:
# Install sacrebleu
!pip3 install sacrebleu



In [30]:
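# Note: sacrebleu reads the hypothesis from stdin and takes the reference as its argument.
# This call passes them the other way round, so hyp_len and ref_len in the output are swapped.
%cat english.test.en-filtered.en.subword.desubword | sacrebleu en.translated.desubword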

{
 "name": "BLEU",
 "score": 10.5,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.4.2",
 "verbose_score": "39.4/12.9/6.4/3.7 (BP = 1.000 ratio = 1.050 hyp_len = 2767 ref_len = 2635)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.4.2"
}

In [31]:
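# For comparison only: scoring the subworded files is not a valid BLEU evaluation (see the note on desubwording above)
%cat english.test.en-filtered.en.subword | sacrebleu en.translated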

{
 "name": "BLEU",
 "score": 12.8,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.4.2",
 "verbose_score": "40.3/15.3/8.4/5.3 (BP = 1.000 ratio = 1.096 hyp_len = 3087 ref_len = 2817)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.4.2"
}

In [32]:
# Evaluate the translation (without subwording)
!python3 compute-bleu.py english.test.en-filtered.en.subword.desubword en.translated.desubword

Reference 1st sentence: As far the Commission and the Legal Affairs Committee are concerned, it comes under competition.
MTed 1st sentence: The Commission and the Committee on Legal Affairs and the Internal Market.
BLEU:  10.524850519884739
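
The `verbose_score` reported by sacreBLEU earlier shows how the final number is composed: BLEU is the geometric mean of the 1- to 4-gram precisions, multiplied by a brevity penalty (BP). The sketch below recomputes the score from those rounded values; it illustrates the formula only and is not sacreBLEU's implementation.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """1.0 if the hypothesis is at least as long as the reference, else exp(1 - ref/hyp)."""
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)


def bleu_from_precisions(precisions, hyp_len, ref_len):
    """Combine n-gram precisions (in percent) into a BLEU score (in percent)."""
    if min(precisions) <= 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)


# Values copied from the desubworded verbose_score above.
score = bleu_from_precisions([39.4, 12.9, 6.4, 3.7], hyp_len=2767, ref_len=2635)
print(round(score, 1))  # 10.5
```

The small difference from 10.5248 comes from using precisions that were already rounded to one decimal place.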
