<a href="https://colab.research.google.com/github/cw118/domain-adapted-nmt/blob/main/2_domain_adapted_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Domain-adapted NMT

## Training NMT models

In [1]:
!pip3 install OpenNMT-py

Collecting OpenNMT-py
  Downloading OpenNMT_py-3.4.3-py3-none-any.whl (257 kB)
Collecting configargparse (from OpenNMT-py)
  Downloading ConfigArgParse-1.7-py3-none-any.whl (25 kB)
Collecting ctranslate2<4,>=3.17 (from OpenNMT-py)
  Downloading ctranslate2-3.24.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.8 MB)
Collecting waitress (from OpenNMT-py)
  Downloading waitress-2.1.2-py3-none-any.whl (57 kB)
Collecting pyonmttok<2,>=1.35 (from OpenNMT-py)
  Downloading pyonmttok-1.37.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)

In [2]:
# mount Google Drive, where the datasets prepared in the text processing step were saved
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd drive/MyDrive/domain-adapted-nmt/nmt

/content/drive/MyDrive/domain-adapted-nmt/nmt


### Train the general/base model

In [4]:
# corpora generated from step 1: https://drive.google.com/drive/folders/1fVe2e2MvT2CCTpSSDrBkykoR-7JKy-w4?usp=sharing
config = '''# config.yaml


## where the samples will be written
save_data: run

# train the general/base model first
data:
    corpus_1:
        path_src: corpora/en-de-general.en-filtered.en.subword.train
        path_tgt: corpora/en-de-general.de-filtered.de.subword.train
        transforms: [filtertoolong]
        weight: 1
    valid:
        path_src: corpora/en-de-general.en-filtered.en.subword.dev
        path_tgt: corpora/en-de-general.de-filtered.de.subword.dev
        transforms: [filtertoolong]

# vocab files generated by onmt_build_vocab
src_vocab: run/source.vocab
tgt_vocab: run/target.vocab

# vocabulary size: should be same as in sentencepiece
src_vocab_size: 50000
tgt_vocab_size: 50000

# Filter out source/target longer than n if [filtertoolong] enabled
src_seq_length: 150
tgt_seq_length: 150

# Tokenization options
src_subword_model: source.model
tgt_subword_model: target.model

# Where to save the log file and the output models/checkpoints
log_file: train.log
save_model: models/model-base.ende

# Stop training if it does not improve after n validations
early_stopping: 4

# Default: 5000 - Save a model checkpoint every n steps
save_checkpoint_steps: 1000

# To save space, limit checkpoints to last n
# keep_checkpoint: 3

seed: 3435

# For fine-tuning, add up the required steps to the original steps
train_steps: 3000

# Default: 10000 - Run validation after n steps
valid_steps: 1000

# Default: 4000 - for large datasets, try up to 8000
warmup_steps: 1000
report_every: 100

# Number of GPUs, and IDs of GPUs
world_size: 1
gpu_ranks: [0]

# Batching
bucket_size: 262144
num_workers: 0  # Default: 2, set to 0 when RAM out of memory
batch_type: "tokens"
batch_size: 4096   # Tokens per batch, change when CUDA out of memory
valid_batch_size: 2048
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
# warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
'''

with open("config.yaml", "w+") as config_yaml:
  config_yaml.write(config)
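
The batching and optimization settings in this config interact. With `batch_type: "tokens"`, one optimizer update sees roughly `batch_size * accum_count * world_size` tokens, and `decay_method: "noam"` scales `learning_rate` by the model size and a warmup term. The sketch below works through that arithmetic for the values in `config.yaml`. The helper names are illustrative, and the formula is the standard Noam schedule from the Transformer paper, which OpenNMT-py's `noam` option is assumed to follow; check it against the installed version.

```python
def tokens_per_update(batch_size, accum_count, world_size):
    """Approximate number of tokens contributing to one optimizer step."""
    return batch_size * accum_count * world_size


def noam_lr(step, learning_rate, model_size, warmup_steps):
    """Noam schedule: linear warmup, then inverse-square-root decay."""
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return learning_rate * model_size ** -0.5 * scale


# values copied from config.yaml (hidden_size is used as the model size)
print(tokens_per_update(batch_size=4096, accum_count=4, world_size=1))
for step in (100, 500, 1000, 2000, 3000):
    lr = noam_lr(step, learning_rate=2, model_size=512, warmup_steps=1000)
    print(f"step {step:>4}: lr = {lr:.6f}")
```

Under these assumptions the learning rate peaks at `warmup_steps` and decays afterwards, so a run of only 3000 steps spends a third of its updates in warmup.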

In [11]:
# corpora generated from step 1: https://drive.google.com/drive/folders/1fVe2e2MvT2CCTpSSDrBkykoR-7JKy-w4?usp=sharing
testconf = '''# testconf.yaml


## where the samples will be written
save_data: run

# train the general/base model first
data:
    corpus_1:
        path_src: corpora/en-de-general.en-filtered.en.subword.train
        path_tgt: corpora/en-de-general.de-filtered.de.subword.train
        transforms: [filtertoolong]
        weight: 1
    valid:
        path_src: corpora/en-de-general.en-filtered.en.subword.dev
        path_tgt: corpora/en-de-general.de-filtered.de.subword.dev
        transforms: [filtertoolong]

# vocab files generated by onmt_build_vocab
src_vocab: run/source.vocab
tgt_vocab: run/target.vocab

# vocabulary size: should be same as in sentencepiece
src_vocab_size: 50000
tgt_vocab_size: 50000

# Filter out source/target longer than n if [filtertoolong] enabled
src_seq_length: 150
tgt_seq_length: 150

# Tokenization options
src_subword_model: source.model
tgt_subword_model: target.model

# Where to save the log file and the output models/checkpoints
log_file: train.log
save_model: models/model-base.ende

# Stop training if it does not improve after n validations
early_stopping: 4

# Default: 5000 - Save a model checkpoint every n steps
save_checkpoint_steps: 200

# To save space, limit checkpoints to last n
# keep_checkpoint: 3

seed: 3435

# For fine-tuning, add up the required steps to the original steps
train_steps: 1000

# Default: 10000 - Run validation after n steps
valid_steps: 300

# Default: 4000 - for large datasets, try up to 8000
warmup_steps: 500
report_every: 50

# Batching
bucket_size: 262144
num_workers: 0  # Default: 2, set to 0 when RAM out of memory
batch_type: "tokens"
# batch_size: 4096   # Tokens per batch, change when CUDA out of memory
valid_batch_size: 2048
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
# warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
'''

with open("testconf.yaml", "w+") as testconf_yaml:
  testconf_yaml.write(testconf)
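
Because both configs are written as plain strings, a repeated key is easy to miss: parsers such as PyYAML keep the last value when a key appears twice instead of raising an error. The sketch below scans a config string for top-level keys that occur more than once. It is a line-based heuristic rather than a YAML parser, and `find_duplicate_keys` and the sample text are illustrative only.

```python
import re
from collections import Counter


def find_duplicate_keys(yaml_text):
    """Return top-level keys that appear more than once.

    Line-based heuristic: only unindented `key:` lines are considered,
    so nested mappings and comment lines are ignored.
    """
    keys = re.findall(r"^([A-Za-z_][\w-]*)\s*:", yaml_text, flags=re.MULTILINE)
    return sorted(key for key, count in Counter(keys).items() if count > 1)


# hypothetical sample resembling the sequence-length section of a config
sample = """
# Filter out source/target longer than n
src_seq_length: 150
src_seq_length: 150
data:
    corpus_1:
        weight: 1
"""
print(find_duplicate_keys(sample))
```

Calling `find_duplicate_keys(config)` or `find_duplicate_keys(testconf)` before writing the file would flag such a slip early.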

In [5]:
!nproc --all

2


In [6]:
# match -num_threads to number of CPUs to increase speed
# -1 for -n_sample to use entire corpus when building vocab
!onmt_build_vocab -config config.yaml -n_sample -1 -num_threads 2

2024-01-22 15:08:47.678508: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-22 15:08:47.678572: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-22 15:08:47.679946: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-22 15:08:47.687513: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-22 15:08:50.644749: I external/local_xla/xla/

In [None]:
# match -num_threads to number of CPUs to increase speed
# -1 for -n_sample to use entire corpus when building vocab
!onmt_build_vocab -config testconf.yaml -n_sample -1 -num_threads 2

In [7]:
# once runtime type is changed to GPU, check that the GPU is active
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-d5e9e4af-f8a2-85f3-9ecd-b101bbb4aed9)


In [8]:
# check that the GPU is visible to PyTorch
import torch

print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

gpu_memory = torch.cuda.mem_get_info(0)
print("Free GPU memory:", gpu_memory[0] / 1024**2, "out of", gpu_memory[1] / 1024**2)

True
Tesla T4
Free GPU memory: 14999.0625 out of 15102.0625


In [9]:
# clear the models directory (inside the working directory) for a fresh start
!rm -rf /content/drive/MyDrive/domain-adapted-nmt/nmt/models

In [10]:
# train NMT model
!onmt_train -config config.yaml

2024-01-22 15:10:10.964379: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-22 15:10:10.964442: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-22 15:10:10.966215: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-22 15:10:10.976501: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-22 15:10:14.072785: I external/local_xla/xla/

In [None]:
# train NMT model
!onmt_train -config testconf.yaml
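
How `train_steps`, `valid_steps`, `save_checkpoint_steps`, and `early_stopping` combine determines which checkpoint files exist after a run. The sketch below lists the steps at which validation and checkpointing would fire for the values in `config.yaml`. The `_step_N.pt` suffix is taken from the checkpoint name used in the translation cell of this notebook; the helper name is illustrative, and the early-stopping remark assumes that it counts consecutive validations without improvement.

```python
def schedule(train_steps, every):
    """Steps at which an action that runs every `every` steps fires."""
    return list(range(every, train_steps + 1, every))


# values copied from config.yaml
train_steps, valid_steps, save_checkpoint_steps, early_stopping = 3000, 1000, 1000, 4

validations = schedule(train_steps, valid_steps)
checkpoints = [f"models/model-base.ende_step_{s}.pt"
               for s in schedule(train_steps, save_checkpoint_steps)]

print("validation at steps:", validations)
print("checkpoints:", checkpoints)
# early stopping needs `early_stopping` validations without improvement,
# so it cannot trigger when the run has fewer validations than that
print("early stopping can trigger:", len(validations) > early_stopping)
```

With only three validations in the run, `early_stopping: 4` never comes into play; it matters once `train_steps` is raised.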

In [27]:
# in case Colab is suddenly unable to navigate through directories
import os
path = '/content/drive/MyDrive/domain-adapted-nmt/nmt'
os.chdir(path)

In [28]:
# -gpu 0 to use gpu
!onmt_translate -model models/model-base.ende_step_3000.pt -src corpora/en-de-general.en-filtered.en.subword.test -output general-de.translated -gpu 0 -min_length 1

2024-01-22 15:51:45.649895: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-22 15:51:45.649962: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-22 15:51:45.651747: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-22 15:51:45.662215: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-22 15:51:48.001538: I external/local_xla/xla/

In [None]:
# if running on CPU (no -gpu flag); uses the step-1000 checkpoint produced with testconf.yaml
!onmt_translate -model models/model-base.ende_step_1000.pt -src corpora/en-de-general.en-filtered.en.subword.test -output general-de.translated -min_length 1

In [29]:
!head -n 5 general-de.translated

▁Fri ed en , ▁Sicherheit ▁und ▁Wieder ver ein igung ▁über ▁die ▁ko re anisch e ▁Hal bin sel
▁Im ▁An schluss ▁an ▁mehrere ▁Miss ionen ▁in ▁der ▁Reg ion ▁be richtet e ▁ich ▁dem ▁Sicherheits rat ▁am ▁ 1 6 . ▁Ju n i , ▁dass ▁Is ra el ▁die ▁is ra el ischen ▁Tr uppe n ▁aus ▁Lib an on ▁zurück gezogen ▁hatte , ▁die ▁sich ▁aus ▁Lib an on ▁Rück zug s l ini e ▁im ▁Ein k lang ▁mit ▁der ▁Re s ol ut ion ▁ 4 2 5 ▁( 1 9 7 8 ) ▁des ▁Rat es ▁hervor ge ht .
▁J ust i z ▁ist ▁ein ▁wesentlich er ▁Bestandteil ▁des ▁Recht s st a at s .
▁Interna tionale r ▁Ge richt ▁für ▁Ru and a
▁Eine ▁ erheblich e ▁Zu nahme ▁der ▁erforderlich en ▁Ressourcen ▁wird ▁erforderlich ▁sein , ▁um ▁die ▁k ünftige n ▁Her aus for der ungen ▁wirk sam ▁zu ▁bew ältig en .


In [35]:
!pip3 install --upgrade -q sentencepiece

In [41]:
%cd nmt
!ls

/content/drive/MyDrive/domain-adapted-nmt/nmt
config.yaml	       models	       run	     target.model   train.log
corpora		       MT-Preparation  source.model  target.vocab
general-de.translated  nmt	       source.vocab  testconf.yaml


In [42]:
# desubword translated file
!python3 MT-Preparation/subwording/3-desubword.py target.model general-de.translated

Done desubwording! Output: general-de.translated.desubword


In [61]:
!head -n 5 general-de.translated.desubword

Frieden, Sicherheit und Wiedervereinigung über die koreanische Halbinsel
Im Anschluss an mehrere Missionen in der Region berichtete ich dem Sicherheitsrat am 16. Juni, dass Israel die israelischen Truppen aus Libanon zurückgezogen hatte, die sich aus Libanon Rückzugslinie im Einklang mit der Resolution 425 (1978) des Rates hervorgeht.
Justiz ist ein wesentlicher Bestandteil des Rechtsstaats.
Internationaler Gericht für Ruanda
Eine erhebliche Zunahme der erforderlichen Ressourcen wird erforderlich sein, um die künftigen Herausforderungen wirksam zu bewältigen.
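
Desubwording reverses the SentencePiece segmentation: pieces are concatenated, and the `▁` marker, which stands for a preceding space, is turned back into a space. The sketch below reproduces that on the first translated line without loading a model. It is a simplification that assumes every token is an ordinary piece; `3-desubword.py` decodes with the SentencePiece model passed as its first argument, and `desubword_line` is an illustrative name.

```python
def desubword_line(line):
    """Join whitespace-separated SentencePiece pieces back into plain text."""
    pieces = line.split()
    return "".join(pieces).replace("▁", " ").strip()


subworded = ("▁Fri ed en , ▁Sicherheit ▁und ▁Wieder ver ein igung "
             "▁über ▁die ▁ko re anisch e ▁Hal bin sel")
print(desubword_line(subworded))
# Frieden, Sicherheit und Wiedervereinigung über die koreanische Halbinsel
```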


In [84]:
!python3 MT-Preparation/subwording/3-desubword.py target.model corpora/en-de-general.en-filtered.en.subword.test

Done desubwording! Output: corpora/en-de-general.en-filtered.en.subword.test.desubword


In [101]:
# strip leftover '▁' markers: desubwording left some in place, possibly because
# target.model (German) was applied to the English file instead of source.model
import re
with open('corpora/en-de-general.en-filtered.en.subword.test.desubword', 'r', encoding='utf-8') as ref, \
     open('corpora/en-de-general-final.subword.test.desubword', 'w', encoding='utf-8') as f:
  for line in ref:
    f.write(re.sub('▁', ' ', line).lstrip())

In [104]:
!head -n 5 corpora/en-de-general.en-filtered.en.subword.test.desubword

Peace,▁security and reunification on the Korean peninsula
▁Following▁several▁missions to the▁region by my Special Envoy, I▁reported to the▁Security▁Council on 16 June that Israeli▁forces had withdrawn from Lebanon in compliance with▁Council resolution 425 (1978).
Justice is a vital▁component of the▁rule of▁law.
International Tribunal for Rwanda
A▁significant▁increase in▁resources will be▁required to address▁future challenges▁effectively.


In [102]:
!head -n 5 corpora/en-de-general-final.subword.test.desubword

Peace, security and reunification on the Korean peninsula
Following several missions to the region by my Special Envoy, I reported to the Security Council on 16 June that Israeli forces had withdrawn from Lebanon in compliance with Council resolution 425 (1978).
Justice is a vital component of the rule of law.
International Tribunal for Rwanda
A significant increase in resources will be required to address future challenges effectively.


In [None]:
# download the BLEU script and install sacrebleu to compute a baseline score
!wget https://raw.githubusercontent.com/ymoslem/MT-Evaluation/main/BLEU/compute-bleu.py
!pip3 install sacrebleu

In [103]:
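# NOTE: the reference passed here is the desubworded English source, so German output is scored
# against English text; a meaningful baseline needs the desubworded German test file as reference
!python3 compute-bleu.py corpora/en-de-general-final.subword.test.desubword general-de.translated.desubword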

Reference 1st sentence: Peace, security and reunification on the Korean peninsula
MTed 1st sentence: Frieden, Sicherheit und Wiedervereinigung über die koreanische Halbinsel
BLEU:  4.342904968955745
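
BLEU is built on n-gram overlap between hypothesis and reference, so it only says something when both are in the same language. The first sentences printed by the script are English for the reference and German for the MT output, which would explain a score this low. The sketch below computes a simplified clipped unigram precision (one ingredient of BLEU, not the full metric) on those two sentences; `unigram_precision` is an illustrative helper and not part of sacrebleu or `compute-bleu.py`.

```python
from collections import Counter


def unigram_precision(hypothesis, reference):
    """Clipped unigram precision on whitespace tokens (simplified; no BLEU tokenizer)."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    matched = sum(min(count, ref_counts[token]) for token, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0


mt_output = "Frieden, Sicherheit und Wiedervereinigung über die koreanische Halbinsel"
english_source = "Peace, security and reunification on the Korean peninsula"

print(unigram_precision(mt_output, english_source))  # no shared tokens
print(unigram_precision(mt_output, mt_output))       # identical text
```

Printing the first reference and hypothesis lines side by side, as `compute-bleu.py` does, is a cheap way to catch this kind of mismatch before trusting a score.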
