### What is OpenNMT?
OpenNMT is an open-source framework for neural machine translation (NMT).\
For beginners like us, relying on a framework such as OpenNMT is essential: designing a model architecture, along with everything else machine translation involves, can be complex.

**In this competition we will use the PyTorch implementation of OpenNMT (OpenNMT-py).**

### Installing the packages

Running the cell below installs OpenNMT-py (the main framework), PyTorch (which onmt requires), and sentencepiece (used by the tokenizers), along with datasets and sacrebleu.

In [2]:
%pip install datasets OpenNMT-py sentencepiece sacrebleu torch==1.12.1

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Collecting OpenNMT-py
  Downloading OpenNMT_py-3.0.4-py3-none-any.whl (219 kB)
Collecting torch==1.12.1
  Downloading torch-1.12.1-cp37-cp37m-manylinux1_x86_64.whl (776.3 MB)
Collecting pyonmttok<2,>=1.35
  Downloading pyonmttok-1.36.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.9 MB)
Collecting waitress
  Downloading waitress-2.1.2-py3-none-any.whl (57 kB)

### Data preparation

This first step prepares the data for training the translation model.\
The data is organized as parallel lines across several files, each named after its language and intended use (e.g., train.bam, test.fr, and dev.bam). Every file has a parallel counterpart, and each line has its translation on the matching line of that counterpart.\
Before a model can be trained, the data must be cleaned, because it very likely contains malformed or empty lines, duplicates, and other kinds of noise.\
Finally, the data is tokenized: the text is encoded as tokens, which are numeric representations of words or, more precisely, of sub-words, since a token does not necessarily correspond to a whole word. We will not worry about this step in this notebook; OpenNMT handles it for us.
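
To make the idea of sub-word tokens concrete, here is a minimal sketch that splits a word with a greedy longest-match rule over a tiny, made-up vocabulary. This is not the unigram model that SentencePiece trains later in this notebook; the vocabulary, the `segment` helper, and the IDs are all hypothetical and only illustrate that one word can become several tokens.

```python
def segment(word, vocab):
    """Greedy longest-match split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical toy vocabulary; real vocabularies are learned from the corpus.
toy_vocab = {"trad", "uc", "tion", "machine"}
token_ids = {piece: i for i, piece in enumerate(sorted(toy_vocab))}

pieces = segment("traduction", toy_vocab)
print(pieces)                          # ['trad', 'uc', 'tion']
print([token_ids[p] for p in pieces])  # [2, 3, 1]
```

The single word "traduction" ends up as three tokens, each mapped to an integer ID, which is the kind of input the model actually consumes.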

**Make sure the "scripts" and "data" directories are present.**

In [3]:
ls data/  # list the contents of the data directory

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
data/dev.bam         data/target.model  data/train.bam.fil.txt
data/dev.fr          data/target.vocab  data/train.fr
data/dev.sub.bam     data/test.bam      data/train.fr.fil.txt
data/dev.sub.fr      data/test.fr       data/train.sub.bam
data/robotsmali.tsv  data/test.sub.bam  data/train.sub.fr
data/source.model    data/test.sub.fr
data/source.vocab    data/train.bam


In [None]:
# Cleaning task
!python scripts/filter.py data/train.fr data/train.bam
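
The contents of `scripts/filter.py` are not shown in this notebook. As a rough idea of what such a cleaning pass might do, the sketch below drops pairs with an empty side and exact duplicate pairs. The function name `clean_parallel` and the sample strings are hypothetical, and the real script may apply different or additional rules.

```python
def clean_parallel(src_lines, tgt_lines):
    """Drop pairs with an empty side and exact duplicate pairs, keeping order."""
    seen = set()
    cleaned = []
    for src, tgt in zip(src_lines, tgt_lines):
        pair = (src.strip(), tgt.strip())
        if not pair[0] or not pair[1]:
            continue  # one side is empty after stripping whitespace
        if pair in seen:
            continue  # this exact pair was already kept
        seen.add(pair)
        cleaned.append(pair)
    return cleaned


# Placeholder sentences, not taken from the competition data.
pairs = clean_parallel(
    ["source a", "", "source a", "source b"],
    ["target a", "target x", "target a", "   "],
)
print(pairs)  # [('source a', 'target a')]
```

Filtering both sides together matters: removing a line from only one file would shift every later line and break the alignment.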

In [None]:
# Train the SentencePiece subword models for French and Bambara (bam)
!python scripts/unigram.py data/train.fr.fil.txt data/train.bam.fil.txt

In [None]:
# Move the files created by the scripts into the data folder
!mv *.vocab *.model data

In [None]:
# Subwording the train, test and dev sets
!python scripts/subword.py data/source.model data/target.model data/train.fr.fil.txt data/train.bam.fil.txt
!python scripts/subword.py data/source.model data/target.model data/dev.fr data/dev.bam
!python scripts/subword.py data/source.model data/target.model data/test.fr data/test.bam

In [None]:
# Take a look at the first lines of the subworded training files
!head -10 data/train.sub-src.txt data/train.sub-trg.txt

In [None]:
# Rename the newly created files in the "data" folder
!mv data/train.sub-src.txt data/train.sub.fr && mv data/train.sub-trg.txt data/train.sub.bam
!mv data/dev.sub-src.txt data/dev.sub.fr && mv data/dev.sub-trg.txt data/dev.sub.bam
!mv data/test.sub-src.txt data/test.sub.fr && mv data/test.sub-trg.txt data/test.sub.bam
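
The three rename commands above follow one pattern, so they can also be written as a loop. The sketch below is one possible Python equivalent, assuming the subword script writes `<split>.sub-src.txt` and `<split>.sub-trg.txt` into the data folder as the `mv` commands suggest; `rename_subword_outputs` is a hypothetical helper, not part of the notebook's scripts.

```python
from pathlib import Path


def rename_subword_outputs(data_dir, splits=("train", "dev", "test"),
                           src="fr", tgt="bam"):
    """Rename <split>.sub-src.txt / <split>.sub-trg.txt to <split>.sub.<lang>."""
    data_dir = Path(data_dir)
    renamed = []
    for split in splits:
        for suffix, lang in (("src", src), ("trg", tgt)):
            old = data_dir / f"{split}.sub-{suffix}.txt"
            new = data_dir / f"{split}.sub.{lang}"
            if old.exists():  # skip files that are missing or already renamed
                old.rename(new)
                renamed.append(new.name)
    return renamed
```

Calling `rename_subword_outputs("data")` returns the list of new file names, which makes it easy to spot a split that was not produced.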

In [None]:
!wc data/*
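
The `wc` output is mostly useful for confirming that each source file has as many lines as its target counterpart. A minimal sketch of that check in Python follows; `check_parallel` and its arguments are hypothetical names, and the file naming (`<stem>.fr` / `<stem>.bam`) is assumed from the directory listing earlier in this notebook.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines of a UTF-8 text file without loading it into memory."""
    with open(path, encoding="utf-8") as fh:
        return sum(1 for _ in fh)


def check_parallel(data_dir, stems, src="fr", tgt="bam"):
    """Return {stem: (n_src, n_tgt)} for every pair whose line counts differ."""
    mismatches = {}
    for stem in stems:
        n_src = count_lines(Path(data_dir) / f"{stem}.{src}")
        n_tgt = count_lines(Path(data_dir) / f"{stem}.{tgt}")
        if n_src != n_tgt:
            mismatches[stem] = (n_src, n_tgt)
    return mismatches
```

For example, `check_parallel("data", ["train.sub", "dev.sub", "test.sub"])` should return an empty dictionary when the subworded files are still aligned.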

### Model / Training Configuration

The cell below sets the parameters and hyperparameters used to train the model.\
Feel free to change these parameters, retrain, and observe how the results change.

In [5]:
import os

model_name = "fr2bam"
vocab_size = 50000

training_steps = 25000
valid_steps = int(training_steps / 5)
save_ckpt_freq = valid_steps
warmup_steps = int(training_steps / 10)
reporting =  10 # int(training_steps/10)

GPU = 1 # TOGGLE for GPU

# Create the output directory for this model if it does not exist yet
os.makedirs(model_name, exist_ok=True)

config = f"""

# config.yaml


## Where the samples will be written
save_data: run

# Training files
data:
    corpus_1:
        path_src: data/train.sub.fr
        path_tgt: data/train.sub.bam
        transforms: [filtertoolong] # change the transform method
    valid:
        path_src: data/dev.sub.fr
        path_tgt: data/dev.sub.bam
        transforms: [filtertoolong] # change the transform method

# Vocabulary files, generated by onmt_build_vocab
src_vocab: models/{model_name}/run/source.vocab
tgt_vocab: models/{model_name}/run/target.vocab

# Vocabulary size - should be the same as in SentencePiece
src_vocab_size: {vocab_size}
tgt_vocab_size: {vocab_size}

# Filter out source/target longer than n if [filtertoolong] enabled
src_seq_length: 150
tgt_seq_length: 150

# Tokenization options
src_subword_model: source.model
tgt_subword_model: target.model

# Where to save the log file and the output models/checkpoints
log_file: train.log
save_model: models/{model_name}

# Stop training if it does not improve after n validations
early_stopping: 3

# Default: 5000 - Save a model checkpoint for each n
save_checkpoint_steps: {save_ckpt_freq}

# To save space, limit checkpoints to last n
keep_checkpoint: 2

seed: 1234

# Default: 100000 - Train the model to max n steps 
# Increase to 200000 or more for large datasets
# For fine-tuning, add up the required steps to the original steps
train_steps: {training_steps}

# Default: 10000 - Run validation after n steps
valid_steps: {valid_steps}

# Default: 4000 - for large datasets, try up to 8000
warmup_steps: {warmup_steps}
report_every: {reporting}

# Batching
num_workers: 2  # Default: 2, set to 0 when RAM out of memory
batch_type: "tokens"
batch_size: 1024   # Tokens per batch, change when CUDA out of memory
valid_batch_size: 1024
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 1024
word_vec_size: 512 # investigate
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]

"""

if GPU:
  config += """
# Number of GPUs, and IDs of GPUs
world_size: 1
gpu_ranks: [0]
"""

with open(f"{model_name}/config.yaml", "w") as fp:
  fp.write(config)
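
To get a feel for what `learning_rate: 2`, `decay_method: "noam"`, and `warmup_steps` mean together, the sketch below evaluates the Noam schedule (linear warm-up followed by inverse-square-root decay) with the values from this configuration. It assumes the model dimension used in the formula is `hidden_size` (1024 here); `noam_lr` is a hypothetical helper written for illustration, not a function from OpenNMT-py.

```python
def noam_lr(step, learning_rate=2.0, model_dim=1024, warmup_steps=2500):
    """Noam schedule: linear warm-up, then inverse-square-root decay."""
    step = max(step, 1)  # avoid dividing by zero at step 0
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return learning_rate * model_dim ** -0.5 * scale


# warmup_steps = training_steps / 10 = 2500 in the configuration above.
for step in (1, 1250, 2500, 10000, 25000):
    print(step, noam_lr(step))

# With batch_type "tokens", one optimizer update sees up to
# batch_size * accum_count * world_size tokens.
tokens_per_update = 1024 * 4 * 1
print(tokens_per_update)  # 4096
```

Under these assumptions the rate peaks at about 0.00125 when the step count reaches `warmup_steps`, then decays. Changing `training_steps` therefore also moves the peak, because `warmup_steps` is derived from it.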

In [6]:
!nproc --all

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
2


In [7]:
!onmt_build_vocab -config fr2bam/config.yaml -n_sample -1 -num_threads 2

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
[2023-02-15 19:22:58,804 INFO] Counter vocab from -1 samples.
[2023-02-15 19:22:58,804 INFO] n_sample=-1: Build vocab on full datasets.
[2023-02-15 19:23:00,369 INFO] * Transform statistics for corpus_1(50.00%):
			* FilterTooLongStats(filtered=9)

[2023-02-15 19:23:00,414 INFO] * Transform statistics for corpus_1(50.00%):
			* FilterTooLongStats(filtered=5)

[2023-02-15 19:23:00,534 INFO] Counters src:25554
[2023-02-15 19:23:00,534 INFO] Counters tgt:34695
Traceback (most recent call last):
  File "/opt/conda/bin/onmt_build_vocab", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.7/site-packages/onmt/bin/build_vocab.py", line 202, in main
    build_vocab_main(opts)
  File "/opt/conda/lib/python3.7/site-packages/onmt/bin/build_vocab.py", line 186, in build_vocab_main
    save_counter(src_counter, opts.src_vocab)
  File "/opt/conda/lib/python3.7/site-packages/onmt/bin

In [None]:
!onmt_train -config fr2bam/config.yaml -verbose

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
[2023-02-15 19:23:16,689 INFO] Parsed 2 corpora from -data.
[2023-02-15 19:23:16,689 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2023-02-15 19:23:16,903 INFO] Building model...
[2023-02-15 19:23:20,825 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(25560, 512, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): 

### Model Evaluation

In [None]:
!onmt_translate -model fr2bam/models/fr2bam_step_10000.pt -src data/test.sub.fr -output fr2bam/models/pred_10000.txt -gpu -1 -verbose

In [None]:
!python scripts/desubword.py data/target.model fr2bam/models/pred_10000.txt
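
`scripts/desubword.py` is not shown here. Assuming the prediction file contains space-separated SentencePiece pieces, where "▁" (U+2581) marks the start of a word, the reverse operation can be sketched in a few lines. `desubword_line` is a hypothetical helper, and the sample pieces are made up.

```python
def desubword_line(line):
    """Join space-separated SentencePiece pieces back into plain text."""
    # Spaces only separate pieces; "\u2581" marks where a word begins.
    return line.replace(" ", "").replace("\u2581", " ").strip()


print(desubword_line("\u2581trad uc tion \u2581machine"))  # traduction machine
```

Scores should be computed on this restored text rather than on the pieces, so that the hypothesis and the reference are compared in the same form.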

In [None]:
# SacreBLEU evaluation (BLEU and TER scores) against the Bambara reference
bleu = !sacrebleu data/test.bam -i fr2bam/models/pred_10000.txt.desub.txt -m bleu -b -w 4
ter = !sacrebleu data/test.bam -i fr2bam/models/pred_10000.txt.desub.txt -m ter -b -w 4

print(bleu)
print(ter)
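
In a notebook, `bleu` and `ter` are lists of output lines rather than numbers, and in this environment shell output may also include the `/bin/bash` warning seen in earlier cells. A small sketch for pulling out the numeric score follows; `parse_score` is a hypothetical helper and assumes the score is the last line that parses as a number, which is what the `-b` flag is expected to produce.

```python
def parse_score(lines):
    """Return the last line that parses as a float, or None if there is none."""
    for line in reversed(list(lines)):
        try:
            return float(line.strip())
        except ValueError:
            continue  # not a number, e.g. a shell warning
    return None
```

With this helper, `parse_score(bleu)` and `parse_score(ter)` give plain floats that can be logged or compared across checkpoints.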