<a href="https://colab.research.google.com/github/cw118/domain-adapted-nmt/blob/main/2_fr_domain_adapted_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A two-step approach to NMT

## Part 2: training NMT models

In [1]:
!pip3 install OpenNMT-py

Collecting OpenNMT-py
  Downloading OpenNMT_py-3.4.3-py3-none-any.whl (257 kB)
Collecting configargparse (from OpenNMT-py)
  Downloading ConfigArgParse-1.7-py3-none-any.whl (25 kB)
Collecting ctranslate2<4,>=3.17 (from OpenNMT-py)
  Downloading ctranslate2-3.24.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.8 MB)
Collecting waitress (from OpenNMT-py)
  Downloading waitress-2.1.2-py3-none-any.whl (57 kB)

In [2]:
# mount Google Drive, where the prepared datasets from the text processing step were saved
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd drive/MyDrive/domain-adapted-nmt/nmt-tools

/content/drive/MyDrive/domain-adapted-nmt/nmt-tools


### Train the general/base model

In [None]:
# corpora generated from step 1: https://drive.google.com/drive/folders/1fVe2e2MvT2CCTpSSDrBkykoR-7JKy-w4?usp=sharing
config = '''# config.yaml

# where the samples will be written
save_data: run

# train the general/base model first
data:
    corpus_1:
        path_src: corpora/enfr/en-fr-general.en-filtered.en.subword.train
        path_tgt: corpora/enfr/en-fr-general.fr-filtered.fr.subword.train
        transforms: [filtertoolong]
        weight: 1
    valid:
        path_src: corpora/enfr/en-fr-general.en-filtered.en.subword.dev
        path_tgt: corpora/enfr/en-fr-general.fr-filtered.fr.subword.dev
        transforms: [filtertoolong]

# vocab files generated by onmt_build_vocab
src_vocab: run/source.vocab
tgt_vocab: run/target.vocab

# vocabulary size: should be same as in sentencepiece
src_vocab_size: 50000
tgt_vocab_size: 50000

# Filter out source/target longer than n if [filtertoolong] enabled
src_seq_length: 150
tgt_seq_length: 150

# Tokenization options
src_subword_model: source-general.model
tgt_subword_model: target-general.model

# Where to save the log file and the output models/checkpoints
log_file: train.log
save_model: models/model-base.enfr

# Stop training if it does not improve after n validations
early_stopping: 4

# Default: 5000 - Save a model checkpoint for each n
save_checkpoint_steps: 2000

# To save space, limit checkpoints to last n
# keep_checkpoint: 4

seed: 3435

# For fine-tuning, add up the required steps to the original steps
train_steps: 10000

# Default: 10000 - Run validation after n steps
valid_steps: 2000

# Default: 4000 - for large datasets, try up to 8000
warmup_steps: 2000
report_every: 200

# Number of GPUs, and IDs of GPUs
world_size: 1
gpu_ranks: [0]

# Batching
bucket_size: 262144
num_workers: 0  # Default: 2, set to 0 when RAM out of memory
batch_type: "tokens"
batch_size: 4096   # Tokens per batch, change when CUDA out of memory
valid_batch_size: 2048
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
# warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
'''

with open("config.yaml", "w+") as config_yaml:
  config_yaml.write(config)
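The `decay_method: "noam"` setting means that `learning_rate: 2` is a scale factor rather than the actual step size. The following is a minimal sketch of the schedule from the original Transformer paper, which is my understanding of what OpenNMT-py applies; the function name `noam_lr` is illustrative and not part of the library. It uses the `hidden_size`, `warmup_steps` and `learning_rate` values from the config above.

```python
def noam_lr(step, learning_rate=2.0, hidden_size=512, warmup_steps=2000):
    """Noam schedule: linear warmup, then inverse square-root decay.

    Assumed formula (from the Transformer paper):
    lr = learning_rate * hidden_size^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    scale = hidden_size ** -0.5
    return learning_rate * scale * min(step ** -0.5, step * warmup_steps ** -1.5)


# Effective learning rate at a few points of the 10000-step base run
for step in (200, 1000, 2000, 6000, 10000):
    print(step, f"{noam_lr(step):.6f}")
```

Under these assumptions the rate rises linearly until step 2000, peaks at roughly 0.002, and then decays. Lowering `warmup_steps` for fine-tuning (as the later configs do) moves that peak earlier and makes it higher.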

In [None]:
# if this file is already in your Drive, you don't need to run this again
!wget https://raw.githubusercontent.com/OpenNMT/OpenNMT-py/master/tools/spm_to_vocab.py

In [None]:
# convert sentencepiece-generated vocab to be compatible with OpenNMT-py (cf. https://forum.opennmt.net/t/steps-to-convert-sentencepiece-vocab-to-opennmt-py-vocab/4879)
!cat source-general.vocab | python3 spm_to_vocab.py > source-general.onmt_vocab

In [None]:
!nproc --all

2


In [None]:
# match -num_threads to number of CPUs to increase speed
# -1 for -n_sample to use entire corpus when building vocab
!onmt_build_vocab -config config.yaml -n_sample -1 -num_threads 2

2024-01-28 05:23:01.582726: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-28 05:23:01.582836: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-28 05:23:01.587370: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-28 05:23:01.604501: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2024-01-28 05:23:08,539 INFO] Counter vocab from -1 

In [None]:
# once runtime type is changed to GPU, check that the GPU is active
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-0ed94908-7be9-b88d-005d-e79bdccb6bdf)


In [None]:
# check that the GPU is visible to PyTorch
import torch

print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

gpu_memory = torch.cuda.mem_get_info(0)
print("Free GPU memory:", gpu_memory[0] / 1024**2, "out of", gpu_memory[1] / 1024**2)

True
Tesla T4
Free GPU memory: 14999.0625 out of 15102.0625


In [None]:
# clear the models directory for a fresh start
!rm -rf /content/drive/MyDrive/domain-adapted-nmt/nmt-tools/models

In [None]:
# train NMT model
!onmt_train -config config.yaml

2024-01-28 15:01:32.174207: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-28 15:01:32.174277: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-28 15:01:32.176258: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-28 15:01:32.187529: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-28 15:01:35.130932: I external/local_xla/xla/

In [None]:
# in case Colab is suddenly unable to navigate through directories
import os
path = '/content/drive/MyDrive/domain-adapted-nmt/nmt-tools'
os.chdir(path)
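The translation commands in this notebook hard-code a checkpoint name such as `models/model-base.enfr_step_10000.pt`. Because `early_stopping` is enabled, training could stop before the final step, so the last checkpoint on disk may carry a different step number. The helper below is a hypothetical sketch (not part of OpenNMT-py) that picks the highest-step checkpoint, assuming the `<save_model>_step_<N>.pt` naming seen in this notebook.

```python
import re
from pathlib import Path

# Matches the trailing "_step_<N>.pt" part of a checkpoint file name
STEP_PATTERN = re.compile(r"_step_(\d+)\.pt$")


def latest_checkpoint(paths):
    """Return the path with the highest step number, or None if none match."""
    best, best_step = None, -1
    for p in paths:
        match = STEP_PATTERN.search(str(p))
        if match and int(match.group(1)) > best_step:
            best, best_step = p, int(match.group(1))
    return best


# Illustrative file names only; they follow the pattern used in this notebook
example = [
    "models/model-base.enfr_step_2000.pt",
    "models/model-base.enfr_step_10000.pt",
    "models/model-base.enfr_step_8000.pt",
]
print(latest_checkpoint(example))

# With real files, the list could come from:
# latest_checkpoint(Path("models").glob("model-base.enfr_step_*.pt"))
```

Comparing step numbers as integers matters here: a plain alphabetical sort would rank `step_8000` after `step_10000`.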

In [4]:
# -gpu 0 to use gpu
!onmt_translate -model models/model-base.enfr_step_10000.pt -src corpora/enfr/en-fr-general.en-filtered.en.subword.test -output general-fr.translated -gpu 0 -min_length 1

2024-01-29 14:46:04.447709: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-29 14:46:04.447768: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-29 14:46:04.449309: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-29 14:46:04.462116: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-29 14:46:10.769925: I external/local_

In [5]:
!onmt_translate -model models/model-base.enfr_step_10000.pt -src corpora/enfr/en-fr-technology.en-filtered.en.subword.test -output general-fr-tech.translated -gpu 0 -min_length 1

2024-01-29 14:48:13.037642: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-29 14:48:13.037714: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-29 14:48:13.039130: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-29 14:48:13.046444: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-29 14:48:15.882688: I external/local_

In [6]:
!onmt_translate -model models/model-base.enfr_step_10000.pt -src corpora/enfr/en-fr-medicine.en-filtered.en.subword.test -output general-fr-medicine.translated -gpu 0 -min_length 1


2024-01-29 14:51:14.586189: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-29 14:51:14.586253: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-29 14:51:14.588172: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-29 14:51:14.599555: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-29 14:51:18.347375: I external/local_

In [None]:
!head -n 5 general-fr.translated

▁Il ▁ne ▁faut ▁pas ▁oublier ▁que ▁ 1 ▁ 2 6 0 ▁prisonniers ▁marocains ▁sont ▁toujours ▁détenus ▁dans ▁des ▁prisons ▁du ▁Front ▁POLISARIO , ▁où ▁ils ▁ont ▁été ▁maintenus ▁pendant ▁plus ▁de ▁ 2 5 ▁ans ▁en ▁violation ▁flagrante ▁du ▁droit ▁international ▁humanitaire .
▁Fournir ▁au ▁Comité ▁des ▁exemplaires ▁du ▁texte ▁de ▁la ▁Convention ▁relative ▁aux ▁droits ▁de ▁l ' enfant ▁dans ▁toutes ▁les ▁langues ▁officielles ▁ou ▁dans ▁d ' autres ▁langues ▁ou ▁dialectes , ▁lorsque ▁cela ▁est ▁disponible .
▁Nous ▁réaffirmons ▁que ▁les ▁lois ▁en ▁vigueur ▁dans ▁le ▁Sultanat ▁garantissent ▁la ▁protection ▁des ▁droits ▁de ▁l ' homme , ▁y ▁compris ▁les ▁droits ▁de ▁l ' enfant , ▁en ▁particulier ▁en ▁ce ▁qui ▁concerne ▁l ' implication ▁d ' enfants ▁dans ▁les ▁conflits ▁armés .
▁Conformément ▁à ▁l ' accord ▁auquel ▁le ▁Conseil ▁est ▁parvenu ▁lors ▁de ▁ses ▁consultations ▁préalables , ▁j ' invite ▁les ▁membres ▁du ▁Conseil ▁à ▁poursuivre ▁l ' examen ▁de ▁la ▁question .
▁Décisions ▁et ▁recommandations


In [4]:
!pip3 install --upgrade -q sentencepiece

In [14]:
# desubword translated file
!python3 MT-Preparation/subwording/3-desubword.py target-general.model general-fr.translated

Done desubwording! Output: general-fr.translated.desubword
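To make the desubwording step less opaque, here is a minimal pure-Python sketch of the idea. It is not the code in `3-desubword.py`; it only assumes the standard SentencePiece convention that `▁` (U+2581) marks the start of a word and that the translated file separates pieces with spaces. The function name `desubword` is illustrative.

```python
def desubword(line: str) -> str:
    """Join space-separated SentencePiece pieces back into plain text."""
    pieces = line.split()
    # Glue the pieces together, then turn each word-boundary marker into a space
    return "".join(pieces).replace("\u2581", " ").strip()


print(desubword("▁Décisions ▁et ▁recommandations"))
print(desubword("▁droits ▁de ▁l ' enfant ▁dans ▁toutes ▁les ▁langues"))
```

In practice, decoding with the trained SentencePiece model (as the script above does with `target-general.model`) is the safer route, since it also handles special tokens.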


In [15]:
!python3 MT-Preparation/subwording/3-desubword.py target-general.model general-fr-tech.translated
!python3 MT-Preparation/subwording/3-desubword.py target-general.model general-fr-medicine.translated

Done desubwording! Output: general-fr-tech.translated.desubword
Done desubwording! Output: general-fr-medicine.translated.desubword


In [None]:
!head -n 5 general-fr.translated.desubword

Il ne faut pas oublier que 1 260 prisonniers marocains sont toujours détenus dans des prisons du Front POLISARIO, où ils ont été maintenus pendant plus de 25 ans en violation flagrante du droit international humanitaire.
Fournir au Comité des exemplaires du texte de la Convention relative aux droits de l'enfant dans toutes les langues officielles ou dans d'autres langues ou dialectes, lorsque cela est disponible.
Nous réaffirmons que les lois en vigueur dans le Sultanat garantissent la protection des droits de l'homme, y compris les droits de l'enfant, en particulier en ce qui concerne l'implication d'enfants dans les conflits armés.
Conformément à l'accord auquel le Conseil est parvenu lors de ses consultations préalables, j'invite les membres du Conseil à poursuivre l'examen de la question.
Décisions et recommandations


In [16]:
!python3 MT-Preparation/subwording/3-desubword.py target-general.model corpora/enfr/en-fr-general.fr-filtered.fr.subword.test

Done desubwording! Output: corpora/enfr/en-fr-general.fr-filtered.fr.subword.test.desubword


In [26]:
!python3 MT-Preparation/subwording/3-desubword.py target-general.model corpora/enfr/en-fr-technology.fr-filtered.fr.subword.test
!python3 MT-Preparation/subwording/3-desubword.py target-general.model corpora/enfr/en-fr-medicine.fr-filtered.fr.subword.test

Done desubwording! Output: corpora/enfr/en-fr-technology.fr-filtered.fr.subword.test.desubword
Done desubwording! Output: corpora/enfr/en-fr-medicine.fr-filtered.fr.subword.test.desubword


In [None]:
!head -n 5 corpora/enfr/en-fr-general.fr-filtered.fr.subword.test.desubword

Il ne faut pas oublier que 1 260 détenus marocains sont toujours en captivité dans les geôles du Polisario, et ce depuis plus de 25 ans, en violation flagrante du droit international humanitaire.
Faire parvenir au Comité des exemplaires du texte de la Convention relative aux droits de l'enfant dans toutes les langues officielles de l'État partie ainsi que dans ses autres langues ou dialectes, si elle a été traduite.
Nous réaffirmons que la législation en vigueur dans le Sultanat garantit la protection des droits de l'homme, y compris les droits de l'enfant, en particulier pour ce qui est de l'implication d'enfants dans les conflits armés.
Conformément à l'accord auquel le Conseil est parvenu lors de ses consultations préalables, j'invite à présent les membres du Conseil à poursuivre le débat sur la question dans le cadre de consultations.
Décisions et recommandations


In [None]:
# compute BLEU for a baseline score
# the script only needs to be downloaded once (skip wget if it is already in your Drive);
# sacrebleu has to be reinstalled in each new Colab runtime
!wget https://raw.githubusercontent.com/ymoslem/MT-Evaluation/main/BLEU/compute-bleu.py
!pip3 install sacrebleu

In [18]:
!python3 compute-bleu.py corpora/enfr/en-fr-general.fr-filtered.fr.subword.test.desubword general-fr.translated.desubword

Reference 1st sentence: Il ne faut pas oublier que 1 260 détenus marocains sont toujours en captivité dans les geôles du Polisario, et ce depuis plus de 25 ans, en violation flagrante du droit international humanitaire.
MTed 1st sentence: Il ne faut pas oublier que 1 260 prisonniers marocains sont toujours détenus dans des prisons du Front POLISARIO où ils ont été détenus depuis plus de 25 ans en violation flagrante du droit international humanitaire.
BLEU:  43.191244985557496


In [27]:
!python3 compute-bleu.py corpora/enfr/en-fr-technology.fr-filtered.fr.subword.test.desubword general-fr-tech.translated.desubword
!python3 compute-bleu.py corpora/enfr/en-fr-medicine.fr-filtered.fr.subword.test.desubword general-fr-medicine.translated.desubword

Reference 1st sentence: Crée une nouvelle demande de réunion
MTed 1st sentence: V.▁Adoption d'une nouvelle demande
BLEU:  7.718966299615998
Reference 1st sentence: Cependant, le temps nécessaire à l’ aggravation de la maladie après le traitement était le même dans les deux groupes (environ 10 mois).
MTed 1st sentence: Cependant, le temps pris pour▁lutter contre la maladie après un traitement est le même dans les deux groupes (environ 10 mois).
BLEU:  12.591000835462784


The BLEU scores for "out-of-domain" text (here, non-generic text) are very low: about 7.7 for technology and 12.6 for medicine, compared with 43.2 on the general test set. This is expected, since the baseline model hasn't seen any technology- or medicine-specific inputs.

Later on we can fine-tune this model with more generic corpora to improve it further, but first we'll fine-tune our domain-adapted models for technology and medicine.

### Fine-tuning for technology corpus

In [28]:
tech = """# tech.yaml
# for technology corpus/model fine-tuning
share_vocab: true
src_vocab: run-tech/source.vocab
tgt_vocab: run-tech/target.vocab
src_vocab_size: 50000
tgt_vocab_size: 50000

data:
  # different corpus weighting for mixed fine-tuning approach
  tech_corpus:
    path_src: corpora/enfr/en-fr-technology.en-filtered.en.subword.train
    path_tgt: corpora/enfr/en-fr-technology.fr-filtered.fr.subword.train
    transforms: [filtertoolong]
    weight: 10
  # randomly sample portions from the generic/general data we used to train the baseline model for mixed fine-tuning
  gen_corpus:
    path_src: corpora/enfr/en-fr-general.en-filtered.en.subword.train
    path_tgt: corpora/enfr/en-fr-general.fr-filtered.fr.subword.train
    transforms: [filtertoolong]
    weight: 1
  valid: # validation set
    path_src: corpora/enfr/en-fr-technology.en-filtered.en.subword.dev
    path_tgt: corpora/enfr/en-fr-technology.fr-filtered.fr.subword.dev
    transforms: [filtertoolong]

update_vocab: true
train_from: 'models/model-base.enfr_step_10000.pt' # the base model trained earlier
reset_optim: all

# filtertoolong
src_seq_length: 150
tgt_seq_length: 150

# tokenization
src_subword_model: source-general.model
tgt_subword_model: target-general.model

save_data: run-tech
save_model: models/model-tech.enfr
log_file: train-tech.log
early_stopping: 4

keep_checkpoint: 4
save_checkpoint_steps: 1000
average_decay: 0.0005
seed: 1234
warmup_steps: 1000
report_every: 100

train_steps: 3000
valid_steps: 1000

# Number of GPUs, and IDs of GPUs
world_size: 1
gpu_ranks: [0]

# Batching
bucket_size: 262144
num_workers: 0
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 2048
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
# warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
"""

with open("tech.yaml", "w+") as tech_yaml:
  tech_yaml.write(tech)
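The `weight` values in the `data` section control how the two corpora are interleaved during mixed fine-tuning. As I understand it, OpenNMT-py takes `weight` examples from each corpus in turn, so `10` and `1` give roughly ten technology examples for every general one. The sketch below simulates that idea with toy data; `weighted_mix` is a hypothetical helper, not the library's implementation.

```python
from itertools import cycle, islice


def weighted_mix(corpora, weights):
    """Repeatedly take weights[name] examples from each corpus in turn."""
    # cycle() restarts a corpus when it runs out, so a small corpus is simply reused
    iterators = {name: cycle(examples) for name, examples in corpora.items()}
    while True:
        for name, iterator in iterators.items():
            for _ in range(weights[name]):
                yield name, next(iterator)


# Toy stand-ins for the real training files
corpora = {"tech_corpus": ["t1", "t2", "t3"], "gen_corpus": ["g1", "g2"]}
weights = {"tech_corpus": 10, "gen_corpus": 1}

sample = list(islice(weighted_mix(corpora, weights), 22))
tech_share = sum(1 for name, _ in sample if name == "tech_corpus") / len(sample)
print(f"share of technology examples: {tech_share:.2f}")
```

Keeping a small share of general data in the mix is meant to limit how much the fine-tuned model forgets about generic text while it adapts to the domain.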

In [None]:
!cat source-technology.vocab | python3 spm_to_vocab.py > source-technology.onmt_vocab

In [29]:
!onmt_build_vocab -config tech.yaml -n_sample -1 -num_threads 2

2024-01-29 14:58:18.799477: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-29 14:58:18.799560: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-29 14:58:18.801580: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-29 14:58:18.812923: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-29 14:58:23.142471: I external/local_

In [30]:
!onmt_train -config tech.yaml

2024-01-29 14:59:06.074962: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-29 14:59:06.075019: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-29 14:59:06.076355: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-29 14:59:06.083819: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-29 14:59:08.725939: I external/local_

In [31]:
# translate domain-specific and generic corpora
!onmt_translate -model models/model-tech.enfr_step_3000.pt -src corpora/enfr/en-fr-technology.en-filtered.en.subword.test -output technology-fr.translated -gpu 0 -min_length 1

2024-01-29 15:51:03.240545: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-29 15:51:03.240615: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-29 15:51:03.242652: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-29 15:51:03.253953: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-29 15:51:07.487728: I external/local_

In [32]:
!onmt_translate -model models/model-tech.enfr_step_3000.pt -src corpora/enfr/en-fr-medicine.en-filtered.en.subword.test -output technology-fr-medicine.translated -gpu 0 -min_length 1

2024-01-29 15:51:53.570142: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-29 15:51:53.570205: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-29 15:51:53.572278: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-29 15:51:53.583765: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-29 15:51:57.174776: I external/local_

In [33]:
!onmt_translate -model models/model-tech.enfr_step_3000.pt -src corpora/enfr/en-fr-general.en-filtered.en.subword.test -output technology-fr-general.translated -gpu 0 -min_length 1

2024-01-29 15:53:07.381295: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-29 15:53:07.381355: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-29 15:53:07.382711: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-29 15:53:07.390340: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-29 15:53:10.109405: I external/local_

In [34]:
# desubword translated files
!python3 MT-Preparation/subwording/3-desubword.py target-general.model technology-fr.translated
!python3 MT-Preparation/subwording/3-desubword.py target-general.model technology-fr-medicine.translated
!python3 MT-Preparation/subwording/3-desubword.py target-general.model technology-fr-general.translated

Done desubwording! Output: technology-fr.translated.desubword
Done desubwording! Output: technology-fr-medicine.translated.desubword
Done desubwording! Output: technology-fr-general.translated.desubword


In [35]:
# BLEU scores for technology-specific NMT model
# the filtered test datasets were already desubworded and saved previously when evaluating the baseline model, so we can just reuse them
!python3 compute-bleu.py corpora/enfr/en-fr-technology.fr-filtered.fr.subword.test.desubword technology-fr.translated.desubword
!python3 compute-bleu.py corpora/enfr/en-fr-medicine.fr-filtered.fr.subword.test.desubword technology-fr-medicine.translated.desubword
!python3 compute-bleu.py corpora/enfr/en-fr-general.fr-filtered.fr.subword.test.desubword technology-fr-general.translated.desubword

Reference 1st sentence: Crée une nouvelle demande de réunion
MTed 1st sentence: Crée une nouvelle demande de réunionNew
BLEU:  57.09065291734113
Reference 1st sentence: Cependant, le temps nécessaire à l’ aggravation de la maladie après le traitement était le même dans les deux groupes (environ 10 mois).
MTed 1st sentence: Toutefois, le temps nécessaire pour que la maladie soit pire après le traitement est le même dans les deux groupes (environ 10 mois).
That's 100 lines that end in a tokenized period ('.')
It looks like you forgot to detokenize your test data, which may hurt your score.
If you insist your data is detokenized, or don't care, you can suppress this message with the `force` parameter.
BLEU:  16.21355600694869
Reference 1st sentence: Il ne faut pas oublier que 1 260 détenus marocains sont toujours en captivité dans les geôles du Polisario, et ce depuis plus de 25 ans, en violation flagrante du droit international humanitaire.
MTed 1st sentence: Il ne faut pas oublier que 1

### Fine-tuning for medicine corpus

In [16]:
med = """# med.yaml
# for medicine corpus/model fine-tuning
share_vocab: true
src_vocab: run-med/source.vocab
tgt_vocab: run-med/target.vocab
src_vocab_size: 50000
tgt_vocab_size: 50000

data:
  # different corpus weighting for mixed fine-tuning approach
  med_corpus:
    path_src: corpora/enfr/en-fr-medicine.en-filtered.en.subword.train
    path_tgt: corpora/enfr/en-fr-medicine.fr-filtered.fr.subword.train
    transforms: [filtertoolong]
    weight: 10
  # randomly sample portions from the generic/general data we used to train the baseline model for mixed fine-tuning
  gen_corpus:
    path_src: corpora/enfr/en-fr-general.en-filtered.en.subword.train
    path_tgt: corpora/enfr/en-fr-general.fr-filtered.fr.subword.train
    transforms: [filtertoolong]
    weight: 1
  valid:
    path_src: corpora/enfr/en-fr-medicine.en-filtered.en.subword.dev
    path_tgt: corpora/enfr/en-fr-medicine.fr-filtered.fr.subword.dev
    transforms: [filtertoolong]

update_vocab: true
train_from: 'models/model-base.enfr_step_10000.pt' # the base model trained earlier
reset_optim: all

# filtertoolong
src_seq_length: 150
tgt_seq_length: 150

# tokenization
src_subword_model: source-general.model
tgt_subword_model: target-general.model

save_data: run-med
save_model: models/model-med.enfr
log_file: train-med.log
early_stopping: 4

keep_checkpoint: 4
save_checkpoint_steps: 1000
average_decay: 0.0005
seed: 1234
warmup_steps: 2000
report_every: 200

train_steps: 6000
valid_steps: 2000

# Number of GPUs, and IDs of GPUs
world_size: 1
gpu_ranks: [0]

# Batching
bucket_size: 262144
num_workers: 0
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 2048
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
# warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
"""

with open("med.yaml", "w+") as med_yaml:
  med_yaml.write(med)

In [6]:
!cat source-medicine.vocab | python3 spm_to_vocab.py > source-medicine.onmt_vocab

2024-01-30 14:52:27.024012: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-30 14:52:27.024095: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-30 14:52:27.025995: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-30 14:52:27.036812: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-30 14:52:30.202486: I external/local_

In [None]:
!onmt_build_vocab -config med.yaml -n_sample -1 -num_threads 2

In [18]:
!onmt_train -config med.yaml

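
OpenNMT-py names checkpoints `<save_model>_step_<N>.pt`, which is why the translation cells that follow load `models/model-med.enfr_step_6000.pt` after a run with `train_steps: 6000`. The sketch below builds that path and checks that it exists before translating. The `save_model` prefix is inferred from the model path used in those cells, and the helper name is hypothetical.

```python
from pathlib import Path


def final_checkpoint(save_model="models/model-med.enfr", train_steps=6000):
    """Path of the checkpoint expected once training reaches train_steps."""
    return Path(f"{save_model}_step_{train_steps}.pt")


ckpt = final_checkpoint()
status = "found" if ckpt.exists() else "missing: training may not have finished"
print(f"{ckpt} ({status})")
```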

In [19]:
!onmt_translate -model models/model-med.enfr_step_6000.pt -src corpora/enfr/en-fr-medicine.en-filtered.en.subword.test -output medicine-fr.translated -gpu 0 -min_length 1


In [20]:
!onmt_translate -model models/model-med.enfr_step_6000.pt -src corpora/enfr/en-fr-technology.en-filtered.en.subword.test -output medicine-fr-technology.translated -gpu 0 -min_length 1


In [21]:
!onmt_translate -model models/model-med.enfr_step_6000.pt -src corpora/enfr/en-fr-general.en-filtered.en.subword.test -output medicine-fr-general.translated -gpu 0 -min_length 1

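
The three `onmt_translate` calls above differ only in the test domain and the output file name, so they could also be generated in a loop. A minimal sketch, reusing exactly the paths and flags from those cells; the helper name and the `OUTPUTS` mapping are introduced here for illustration.

```python
import shlex

MODEL = "models/model-med.enfr_step_6000.pt"
# test domain -> output file, matching the names used in the cells above
OUTPUTS = {
    "medicine": "medicine-fr.translated",
    "technology": "medicine-fr-technology.translated",
    "general": "medicine-fr-general.translated",
}


def translate_command(domain, output, model=MODEL):
    """Build the onmt_translate argument list for one test set."""
    src = f"corpora/enfr/en-fr-{domain}.en-filtered.en.subword.test"
    return ["onmt_translate", "-model", model, "-src", src,
            "-output", output, "-gpu", "0", "-min_length", "1"]


for domain, output in OUTPUTS.items():
    cmd = translate_command(domain, output)
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to confirm they match the hand-written ones before launching a GPU job.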

In [26]:
# desubword translated files
!python3 MT-Preparation/subwording/3-desubword.py target-general.model medicine-fr.translated
!python3 MT-Preparation/subwording/3-desubword.py target-general.model medicine-fr-technology.translated
!python3 MT-Preparation/subwording/3-desubword.py target-general.model medicine-fr-general.translated

Done desubwording! Output: medicine-fr.translated.desubword
Done desubwording! Output: medicine-fr-technology.translated.desubword
Done desubwording! Output: medicine-fr-general.translated.desubword
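
The `3-desubword.py` script is not reproduced here; it presumably decodes the pieces with the SentencePiece model passed as its first argument. For ordinary pieces, that decoding amounts to concatenating the pieces and turning the `▁` word-boundary marker back into spaces, as in the sketch below. This is an approximation for illustration only: it does not load a model and does not handle special or unknown pieces.

```python
def desubword(line):
    """Join SentencePiece pieces and turn the '▁' marker back into spaces."""
    pieces = line.split()
    return "".join(pieces).replace("\u2581", " ").strip()


print(desubword("▁Une ▁nouvelle ▁demande ▁de ▁ré union"))
# prints: Une nouvelle demande de réunion
```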


In [27]:
# BLEU scores for the medicine-specific NMT model on the medicine, technology and general test sets
# the filtered reference files were already desubworded and saved when evaluating the baseline model, so they are reused here
!python3 compute-bleu.py corpora/enfr/en-fr-medicine.fr-filtered.fr.subword.test.desubword medicine-fr.translated.desubword
!python3 compute-bleu.py corpora/enfr/en-fr-technology.fr-filtered.fr.subword.test.desubword medicine-fr-technology.translated.desubword
!python3 compute-bleu.py corpora/enfr/en-fr-general.fr-filtered.fr.subword.test.desubword medicine-fr-general.translated.desubword

Reference 1st sentence: Cependant, le temps nécessaire à l’ aggravation de la maladie après le traitement était le même dans les deux groupes (environ 10 mois).
MTed 1st sentence: Cependant, le délai avant aggravation de la maladie était identique dans les deux groupes (environ 10 mois).
BLEU:  54.31552770767254
Reference 1st sentence: Crée une nouvelle demande de réunion
MTed 1st sentence: ⁇  une nouvelle demande de réunion
That's 100 lines that end in a tokenized period ('.')
It looks like you forgot to detokenize your test data, which may hurt your score.
If you insist your data is detokenized, or don't care, you can suppress this message with the `force` parameter.
BLEU:  12.834869950762396
Reference 1st sentence: Il ne faut pas oublier que 1 260 détenus marocains sont toujours en captivité dans les geôles du Polisario, et ce depuis plus de 25 ans, en violation flagrante du droit international humanitaire.
MTed 1st sentence: Il ne faut pas oublier que 1 260 prisonniers marocaines s