# Starter notebook

This notebook builds a baseline model that translates Dyula (dyu), the source language, into French (fr), the target language. We train a Transformer model from scratch using JoeyNMT.

NB: This notebook runs in less than **1h** with the resources (GPU, RAM) shown below.

For more details about JoeyNMT see [here](https://github.com/joeynmt)

## Environmental setup

> ⚠ **Important:** Before you start, set runtime type to GPU.

In [None]:
!nvidia-smi

Thu Mar 28 11:28:27 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

<h2>If needed, you can create a secret in Google Colab</h2>

1. Open your Google Colab notebook and click on the `secrets` tab.
2. Create a new secret with the name `EARTHENGINE_TOKEN`.
3. Paste the content from the clipboard into the `Value` input box of the created secret.
4. Toggle the button on the left to allow notebook access to the secret.

![](https://i.imgur.com/Z9R08uU.png)

In [None]:
import torch
torch.__version__

'2.2.1+cu121'

Mount your Google Drive


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%%capture
!pip install joeynmt==2.3.0
!pip install datasets==2.18.0

## Data Preparation

### Download

We download the train, validation, and test subsets of the corpus from the Hugging Face Hub.

In [None]:
from datasets import load_dataset, DatasetDict, Translation
repo_name = "data354/Koumankan_mt_dyu_fr"
dataset = load_dataset(repo_name)
dataset

Downloading readme:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/530k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/102k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/55.8k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8065 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1471 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1393 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['ID', 'translation'],
        num_rows: 8065
    })
    validation: Dataset({
        features: ['ID', 'translation'],
        num_rows: 1471
    })
    test: Dataset({
        features: ['ID', 'translation'],
        num_rows: 1393
    })
})

In [None]:
import re

## Data preprocessing

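# Lowercase the corpora (optional): this helps the model generalize, at the cost of losing proper casing.
# Remove punctuation (optional).

src_lang = 'dyu'
trg_lang = "fr"
# Raw string avoids invalid-escape warnings; the hyphen is escaped so it is matched literally.
chars_to_remove_regex = r'[!"&(),\-./:;=?+\n\[\]]'
def remove_special_characters(text):
    # Lowercase, replace each listed character with a space, then trim the ends.
    text = re.sub(chars_to_remove_regex, ' ', text.lower())
    return text.strip()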

def clean_text(batch):
    # process source text
    batch['translation'][src_lang] = remove_special_characters(batch['translation'][src_lang])
    # process target text
    batch['translation'][trg_lang] = remove_special_characters(batch['translation'][trg_lang])

    return batch


dataset = dataset.map(clean_text)
dataset

Map:   0%|          | 0/8065 [00:00<?, ? examples/s]

Map:   0%|          | 0/1471 [00:00<?, ? examples/s]

Map:   0%|          | 0/1393 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['ID', 'translation'],
        num_rows: 8065
    })
    validation: Dataset({
        features: ['ID', 'translation'],
        num_rows: 1471
    })
    test: Dataset({
        features: ['ID', 'translation'],
        num_rows: 1393
    })
})
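
To see what the cleaning step does to a single sentence, here is a small standalone sketch. It uses the same set of characters as the cleaning cell; the `collapse_spaces` helper and the sample sentence are hypothetical additions for illustration, not part of the original pipeline. Note that replacing punctuation with spaces can leave double spaces inside a sentence, and that the typographic apostrophe (’) is not in the removal list, so it is kept.

```python
import re

# Same set of characters as in the cleaning cell, written as a raw string.
CHARS_TO_REMOVE = r'[!"&(),\-./:;=?+\n\[\]]'

def remove_special_characters(text: str) -> str:
    """Lowercase the text and replace the listed punctuation with spaces."""
    return re.sub(CHARS_TO_REMOVE, " ", text.lower()).strip()

def collapse_spaces(text: str) -> str:
    """Hypothetical extra step: squeeze runs of whitespace into one space."""
    return re.sub(r"\s+", " ", text).strip()

sample = "Trois points d’avance, merci !"  # made-up sentence for illustration
cleaned = remove_special_characters(sample)
print(repr(cleaned))                   # punctuation replaced by spaces
print(repr(collapse_spaces(cleaned)))  # runs of spaces squeezed
```

If the double spaces matter for your tokenizer, a step like `collapse_spaces` could be applied inside `clean_text`; whether it helps the final score is something to verify experimentally.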

Let's inspect the sentences.

In [None]:
dataset["validation"]["translation"][:3]

[{'dyu': 'i tɔgɔ bi cogodɔ', 'fr': 'tu portes un nom de fantaisie'},
 {'dyu': 'puɛn saba fɔlɔ', 'fr': 'trois points d’avance'},
 {'dyu': 'tile bena', 'fr': 'le soleil s’est couché'}]

Save the train, validation, and test subsets to disk.

In [None]:
data_dir = "data/dyu_fr"
dataset.save_to_disk(data_dir)

Saving the dataset (0/1 shards):   0%|          | 0/8065 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1471 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1393 [00:00<?, ? examples/s]

### Vocabulary

We will use the [sentencepiece](https://github.com/google/sentencepiece) library to split words into subwords (BPE) according to their frequency in the training corpus.

The `build_vocab.py` script trains the BPE model and creates a joint vocabulary. It takes the same config file as JoeyNMT.
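
To give an intuition for how frequency-based subword merging works, here is a toy sketch of BPE merge learning in plain Python. It is not how sentencepiece is implemented (sentencepiece works on raw text, uses the `▁` word-boundary marker, and is far more efficient); the function name `learn_bpe` and the tiny word list are hypothetical.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, counted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(vocab))
```

Frequent character sequences end up as single vocabulary entries, while rare words stay split into smaller pieces. The `voc_limit` setting in the config plays a role similar to `num_merges` here, in that it bounds how large the subword inventory can grow.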

In [None]:
from pathlib import Path

# model dir
model_dir = "models/dyu_fr"

# Create the config
config = """
name: "dyu_fr_transformer-sp"
joeynmt_version: "2.3.0"
model_dir: "{model_dir}"
use_cuda: True
fp16: False

data:
    train: "{data_dir}"
    dev: "{data_dir}"
    test: "{data_dir}"
    dataset_type: "huggingface"
    dataset_cfg:
        name: "dyu-fr"
    sample_dev_subset: 1460
    src:
        lang: "dyu"
        max_length: 100
        lowercase: False
        normalize: False
        level: "bpe"
        voc_limit: 4000
        voc_min_freq: 1
        voc_file: "{data_dir}/vocab.txt"
        tokenizer_type: "sentencepiece"
        tokenizer_cfg:
            model_file: "{data_dir}/sp.model"
    trg:
        lang: "fr"
        max_length: 100
        lowercase: False
        normalize: False
        level: "bpe"
        voc_limit: 4000
        voc_min_freq: 1
        voc_file: "{data_dir}/vocab.txt"
        tokenizer_type: "sentencepiece"
        tokenizer_cfg:
            model_file: "{data_dir}/sp.model"
    special_symbols:
        unk_token: "<unk>"
        unk_id: 0
        pad_token: "<pad>"
        pad_id: 1
        bos_token: "<s>"
        bos_id: 2
        eos_token: "</s>"
        eos_id: 3

""".format(data_dir=data_dir, model_dir=model_dir)
with (Path(data_dir) / "config.yaml").open('w') as f:
    f.write(config)

Call the `build_vocab.py` script with the `--joint` flag to build the vocabulary.

In [None]:
!wget https://raw.githubusercontent.com/joeynmt/joeynmt/v2.3/scripts/build_vocab.py
! sudo chmod 777 build_vocab.py

--2024-03-28 11:48:26--  https://raw.githubusercontent.com/joeynmt/joeynmt/v2.3/scripts/build_vocab.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13170 (13K) [text/plain]
Saving to: ‘build_vocab.py’


2024-03-28 11:48:26 (73.7 MB/s) - ‘build_vocab.py’ saved [13170/13170]



In [None]:
!python build_vocab.py {data_dir}/config.yaml --joint

2024-03-28 11:48:31.515312: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-28 11:48:31.515401: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-28 11:48:31.632289: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-28 11:48:31.639749: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Dropping NaN...: 100% 8065/8065 [00:00<00:00, 125570.

The generated vocabulary looks like this:

In [None]:
!head -10 {data_dir}/vocab.txt

<unk>
<pad>
<s>
</s>
▁a
s
▁ka
'
a
▁la


## Model Training

### Configuration

Joey NMT reads model and training hyperparameters from a configuration file. We're generating this now to configure paths in the appropriate places.

The configuration below builds a small Transformer model with embeddings shared between the source and target languages, based on the joint subword vocabulary created above.

In [None]:
config += """
testing:
    #load_model: "{model_dir}/best.ckpt"
    n_best: 1
    beam_size: 5
    beam_alpha: 1.0
    batch_size: 256
    batch_type: "token"
    max_output_length: 100
    eval_metrics: ["bleu"]
    #return_prob: "hyp"
    #return_attention: False
    sacrebleu_cfg:
        tokenize: "13a"

training:
    #load_model: "{model_dir}/latest.ckpt"
    #reset_best_ckpt: False
    #reset_scheduler: False
    #reset_optimizer: False
    #reset_iter_state: False
    random_seed: 42
    optimizer: "adamw"
    normalization: "tokens"
    adam_betas: [0.9, 0.999]
    scheduling: "warmupinversesquareroot"
    learning_rate_warmup: 100
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    loss: "crossentropy"
    batch_size: 512
    batch_type: "token"
    batch_multiplier: 4
    early_stopping_metric: "bleu"
    epochs: 30
    updates: 550
    validation_freq: 30
    logging_freq: 5
    overwrite: True
    shuffle: True
    print_valid_sents: [0, 1, 2, 3]
    keep_best_ckpts: 3

model:
    initializer: "xavier_uniform"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier_uniform"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.0
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.2
        layer_norm: "pre"
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.0
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "pre"

""".format(model_dir=model_dir)
with (Path(data_dir) / "config.yaml").open('w') as f:
    f.write(config)
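
To get a feel for the training settings above, the sketch below works through some arithmetic on them. It assumes that `batch_multiplier` accumulates gradients over that many batches per update, that `updates` caps the total number of updates, and that `warmupinversesquareroot` behaves like a generic linear warmup followed by inverse-square-root decay. The function `warmup_inverse_sqrt` is a hypothetical illustration, and JoeyNMT's own implementation may differ in details.

```python
import math

def warmup_inverse_sqrt(step: int, peak_lr: float, warmup: int, min_lr: float) -> float:
    """Generic warmup + inverse-square-root schedule (assumed, not JoeyNMT's code)."""
    if step < warmup:
        lr = peak_lr * step / warmup             # linear ramp-up to the peak
    else:
        lr = peak_lr * math.sqrt(warmup / step)  # decay proportional to 1/sqrt(step)
    return max(lr, min_lr)

# Values copied from the training section of the config.
PEAK_LR, WARMUP, MIN_LR = 0.0003, 100, 1e-8
BATCH_TOKENS, BATCH_MULTIPLIER = 512, 4
UPDATES, VALIDATION_FREQ = 550, 30

# Upper bound on tokens contributing to one parameter update.
tokens_per_update = BATCH_TOKENS * BATCH_MULTIPLIER
# Number of validation rounds if training runs for all updates.
num_validations = UPDATES // VALIDATION_FREQ

for step in (1, 50, 100, 400, 550):
    print(step, warmup_inverse_sqrt(step, PEAK_LR, WARMUP, MIN_LR))
print(tokens_per_update, num_validations)
```

Under these assumptions, the learning rate peaks at update 100 and has only halved by update 400, so with a budget of 550 updates most of the training happens at a fairly high learning rate. Raising `updates` or `epochs` is one of the first things to try when tuning this baseline.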

### Run training
⏳ The log reports training progress. Look out for the printed example translations and the BLEU evaluation scores to get an impression of the current quality.

In [None]:
!python -m joeynmt train {data_dir}/config.yaml --skip-test

2024-03-28 12:24:14.691991: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-28 12:24:14.692040: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-28 12:24:14.693416: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-28 12:24:14.700426: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-28 12:24:17,620 - INFO - root - Hello! This i

In [None]:
# Point the saved config at the best checkpoint and at the tokenizer/vocabulary files in the model directory
with (Path(model_dir) / "config.yaml").open('r') as f:
    config = f.read()
resume_config = config\
  .replace(f'#load_model: "{model_dir}/best.ckpt"',
           f'load_model: "{model_dir}/best.ckpt"')

resume_config = resume_config\
  .replace(f'model_file: "{data_dir}/sp.model"',
           f'model_file: "{model_dir}/sp.model"')

resume_config = resume_config\
  .replace(f'voc_file: "{data_dir}/vocab.txt"',
           f'voc_file: "{model_dir}/vocab.txt"')

with (Path(model_dir) / "config.yaml").open('w') as f:
    f.write(resume_config)

In [None]:
!cp {data_dir}/vocab.txt  {model_dir}
!cp -R {model_dir} /content/drive/MyDrive/mt-dyu-fr

## Submit baseline model to Zindi

In [None]:
import pandas as pd

In [None]:
test = dataset["test"]
test

Dataset({
    features: ['ID', 'translation'],
    num_rows: 1393
})

In [None]:
# Write one source sentence per line, in test-set order
with open("source.txt", "w") as f:
    f.write("\n".join([sample["dyu"] for sample in test['translation']]))

In [None]:
!python -m joeynmt translate {model_dir}/config.yaml < source.txt > translation.txt

2024-03-28 12:39:09.172185: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-28 12:39:09.172243: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-28 12:39:09.173638: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-28 12:39:09.181272: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-28 12:39:12,678 - INFO - root - Hello! This i

In [None]:
with open("translation.txt") as f:
    sub = pd.DataFrame({
        "ID": test["ID"],
        "Target": f.read().strip().split("\n")}
    )

sub

| | ID | Target |
|---|---|---|
| 0 | ID_17345911362699 | tu es sûr qu'est pas |
| 1 | ID_173626847.3381 | c’est ce qu’est pas |
| 2 | ID_17704632382547 | je suis sûr qu’ai pas |
| 3 | ID_19793499384156 | tu es sûr qu’a |
| 4 | ID_17802727385575 | c’est ce qu’est pas |
| ... | ... | ... |
| 1388 | ID_17319547625075 | je suis sûr qu’est pas |
| 1389 | ID_19774885625614 | c’est ce qu’est pas |
| 1390 | ID_17405744626334 | qu’est vous vous |
| 1391 | ID_19074638626892 | il n'est pas de l'as |
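
Before saving, it can be worth checking that the translations line up with the test IDs. The sketch below assumes `translation.txt` holds exactly one hypothesis per input line; the helper names `read_hypotheses` and `build_submission` and the toy IDs are hypothetical. Unlike `f.read().strip().split("\n")`, it drops only the final line terminator, so an empty translation at the end of the file does not shift the row count.

```python
import csv
import io

def read_hypotheses(text):
    """Split on newlines, keeping empty translations in place."""
    if text.endswith("\n"):
        text = text[:-1]  # drop only the final line terminator
    return text.split("\n")

def build_submission(ids, hypotheses):
    """Pair each ID with its translation, failing loudly on a length mismatch."""
    if len(ids) != len(hypotheses):
        raise ValueError(f"{len(ids)} IDs but {len(hypotheses)} translations")
    return [{"ID": i, "Target": h} for i, h in zip(ids, hypotheses)]

# Toy data standing in for test["ID"] and the contents of translation.txt.
ids = ["ID_a", "ID_b", "ID_c"]   # hypothetical IDs
raw = "bonjour\n\nmerci\n"       # the second translation is empty
rows = build_submission(ids, read_hypotheses(raw))

# Write the rows as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["ID", "Target"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

In the notebook, the same check could be as simple as comparing `len(test["ID"])` with the number of lines read from `translation.txt` before building the DataFrame.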


In [None]:
sub.to_csv("Submission.csv", index=False)
!cp -R Submission.csv /content/drive/MyDrive/mt-dyu-fr

Your submission should get a BLEU score of around 16% on the leaderboard. To improve your model, experiment with the parameters in the config file, and clean your data as thoroughly as possible.

Author : [Data354](https://data354.com/en/)

Thanks to Julia & Co for making [JoeyNMT transformers](https://github.com/joeynmt/joeynmt) so simple to use.
