## Installation

In [None]:
%pip install OpenNMT-py sentencepiece sacrebleu torch

In [None]:
# Clone the GitHub repository that contains the scripts and the training data

!git clone https://github.com/diarray-hub/nko.git

# Change into the repository directory
%cd nko

## Data preparation

This first step prepares the data for training the translation model.\
The data we will use is organized as "parallel lines" across several files, each named after its language and intended use (e.g., train.bam, test.fr, and dev.bam). Each file has a parallel counterpart, and each line has its translation on the matching line of that counterpart.\
Before a model can be trained, the data must be "cleaned", because it very likely contains malformed or empty lines, duplicates, and other kinds of noise.\
Finally, the data will be "tokenized": words are encoded into tokens, which are numeric representations of words or, more precisely, of subwords, so a token does not necessarily correspond to a whole word. We will not dwell on this step in this notebook; OpenNMT will handle it for us.

**Make sure the "scripts" and "data" directories are present.**
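
To make the cleaning step concrete, here is a minimal sketch of filtering a parallel corpus. It is not the repository's `scripts/filter.py`, whose exact rules are not shown here; the function name `clean_parallel` and the toy lines are hypothetical. It only assumes that line *i* of the source file is the translation of line *i* of the target file, so a pair is always kept or dropped as a whole.

```python
def clean_parallel(src_lines, tgt_lines):
    """Drop pairs with an empty side and exact duplicate pairs.

    Source and target are filtered together so they stay aligned.
    """
    seen = set()
    kept_src, kept_tgt = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side is empty: the pair is unusable
        if (src, tgt) in seen:
            continue  # exact duplicate of a pair already kept
        seen.add((src, tgt))
        kept_src.append(src)
        kept_tgt.append(tgt)
    return kept_src, kept_tgt


# Placeholder lines invented for illustration (not real corpus data)
fr = ["phrase un", "", "phrase trois", "phrase un", "phrase cinq"]
bam = ["bam-1", "bam-2", "", "bam-1", "bam-5"]

clean_fr, clean_bam = clean_parallel(fr, bam)
print(clean_fr)
print(clean_bam)
```

A real filter would probably also check length ratios or character sets; those rules are left out because they depend on the corpus.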

In [None]:
!wget https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model -P data
#!wget https://s3.amazonaws.com/opennmt-models/nllb-200/nllb-inference.yaml
#!wget https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/dictionary.txt -P data
#!wget https://s3.amazonaws.com/opennmt-models/nllb-200/nllb-200-600M-onmt.pt


In [None]:
# Cleaning task
!python scripts/filter.py data/train.fr data/train.bam
# Train the SentencePiece subword models for French and Bambara
!python scripts/unigram.py data/train.fr.fil.txt data/train.bam.fil.txt
# Move the files created by the scripts into the data folder
!mv *.vocab *.model data

In [None]:
# Subwording the data for inference
!python scripts/subword.py data/source.model data/target.model data/50_test.fr data/50_test.bam
# Rename the newly created files in the "data" folder
!mv data/50_test.sub-src.txt data/50_test.sub.fr && mv data/50_test.sub-trg.txt data/50_test.sub.bam

## Model / Training Configuration (skip this step for inference)

In [None]:
# Try to avoid running out of GPU memory.
# Use %env here: a variable set with !export is lost when its subshell exits.
#%env PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

In [None]:
import os

model_name = "fr2bam"
vocab_size = 50000

training_steps = 5000
valid_steps = training_steps // 5
save_ckpt_freq = training_steps // 10
warmup_steps = training_steps // 10
reporting = 25  # int(training_steps/10)
GPU = 1  # Set to 0 to skip the GPU settings appended below

# Create the model directory if it does not already exist
os.makedirs(model_name, exist_ok=True)

config = f"""

# config.yaml


## Where the samples will be written
save_data: run

# Training files
data:
    corpus_1:
        path_src: data/train.sub.fr
        path_tgt: data/train.sub.bam
        transforms: [sentencepiece] # try to tune the transform method (can find possible choice in the docs of OpenNMT-py)
    valid:
        path_src: data/dev.sub.fr
        path_tgt: data/dev.sub.bam
        transforms: [sentencepiece] # try to tune the transform method (can find possible choice in the docs of OpenNMT-py)

# Vocabulary files, generated by onmt_build_vocab
src_vocab: models/{model_name}/run/source.vocab
tgt_vocab: models/{model_name}/run/target.vocab
share_vocab: true

train_from: models/fr2bam_step_500.pt

# Vocabulary size - should be the same as in sentence piece
src_vocab_size: {vocab_size}
tgt_vocab_size: {vocab_size}

# Tokenization options
src_subword_model: data/flores200_sacrebleu_tokenizer_spm.model
tgt_subword_model: data/flores200_sacrebleu_tokenizer_spm.model

# Where to save the log file and the output models/checkpoints
log_file: train.log
save_model: models/{model_name}

# Stop training if it does not improve after n validations
early_stopping: 3

# Default: 5000 - Save a model checkpoint for each n
save_checkpoint_steps: {save_ckpt_freq}

# To save space, limit checkpoints to last n
keep_checkpoint: 2

seed: 3456

# Default: 100000 - Train the model to max n steps 
# Increase to 200000 or more for large datasets
# For fine-tuning, add up the required steps to the original steps
train_steps: {training_steps}

# Default: 10000 - Run validation after n steps
valid_steps: {valid_steps}

# Default: 4000 - for large datasets, try up to 8000
warmup_steps: {warmup_steps}
report_every: {reporting}

# Batching
num_workers: 2  # Default: 2, set to 0 when RAM out of memory
batch_type: "tokens"
batch_size: 512  # Tokens per batch, change when CUDA out of memory
valid_batch_size: 512
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
reset_optim: all
learning_rate: 0.0015
normalization: "tokens"

# Model
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]

"""

if GPU:
  config += """
world_size: 1
gpu_ranks: [0]
  """

#with open(f"{model_name}/config.yaml", "w") as fp:
#  fp.write(config)
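
The step counts in the configuration are all derived from `training_steps`, and the batching options combine into an effective batch size. The sketch below works through that arithmetic with the values used above. The helper `derive_schedule` is hypothetical, and the tokens-per-update figure is an approximation that assumes each accumulated batch is close to full.

```python
def derive_schedule(training_steps, batch_size=512, accum_count=4, n_gpus=1):
    """Compute the step intervals and the approximate tokens per update."""
    save_every = training_steps // 10
    return {
        "valid_steps": training_steps // 5,
        "save_checkpoint_steps": save_every,
        "warmup_steps": training_steps // 10,
        # Gradients are accumulated over accum_count batches on each GPU
        "tokens_per_update": batch_size * accum_count * n_gpus,
        # Checkpoints written over the run (only the last few are kept)
        "checkpoints_written": training_steps // save_every,
    }


schedule = derive_schedule(5000)
for name, value in schedule.items():
    print(f"{name}: {value}")
```

With `training_steps = 5000` this gives validation every 1000 steps, a checkpoint every 500 steps, and roughly 2048 tokens per optimizer update. If CUDA runs out of memory, lowering `batch_size` while raising `accum_count` keeps that product unchanged.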

## Run inference

In [None]:
import torch
torch.cuda.empty_cache()

In [None]:
!onmt_translate -model fr2bam_step_500.pt -src data/50_test.sub.fr -output fr2bamNllb.pred_500.txt -gpu 0

In [None]:
# Show a preview of reference and predicted subwords 
!head data/50_test.sub.bam fr2bamNllb.pred_500.txt

In [None]:
# Desubword 
!python scripts/desubword.py data/flores200_sacrebleu_tokenizer_spm.model fr2bamNllb.pred_500.txt
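
As a rough picture of what desubwording does, the sketch below rebuilds plain text from space-separated SentencePiece pieces, in which the character `▁` (U+2581) marks the start of a word. This is an assumption about the prediction file's format, not a copy of `scripts/desubword.py`, which may instead decode with the SentencePiece model. The segmentation of the example sentence is invented.

```python
WORD_START = "\u2581"  # the SentencePiece word-boundary marker


def desubword(line):
    """Join subword pieces and turn boundary markers back into spaces."""
    pieces = line.split()
    return "".join(pieces).replace(WORD_START, " ").strip()


# Hypothetical segmentation, for illustration only
example = "\u2581Bon jour \u2581tout \u2581le \u2581monde"
restored = desubword(example)
print(restored)
```

Pieces without the marker are glued to the previous piece, which is how "Bon" and "jour" become one word again.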

In [None]:
# Show a preview of reference and predicted sentences 
!head fr2bamNllb.pred_500.txt.desub.txt data/test.bam
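
`head` prints the two files one after the other, which makes line-by-line comparison awkward. A small sketch like the one below pairs each reference with its prediction instead. The helper `side_by_side` and the sample lines are hypothetical; in the notebook the two lists would come from reading the reference and desubworded prediction files.

```python
from itertools import zip_longest


def side_by_side(references, predictions, limit=10):
    """Pair reference and predicted lines, flagging a missing counterpart."""
    rows = []
    pairs = zip_longest(references, predictions, fillvalue="<missing>")
    for i, (ref, hyp) in enumerate(pairs):
        if i >= limit:
            break
        rows.append(f"{i + 1}\tREF: {ref.strip()}\tHYP: {hyp.strip()}")
    return rows


# Placeholder lines invented for illustration
refs = ["ref line 1", "ref line 2"]
hyps = ["pred line 1"]

preview = side_by_side(refs, hyps)
for row in preview:
    print(row)
```

A `<missing>` entry means the two files have different line counts, which usually indicates that the predictions and references come from different test sets.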