## LLM M2M100
M2M100  is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation.

The model that can directly translate between the 9,900 directions of 100 languages. To translate into a target language, the target language id is forced as the first generated token. To force the target language id as the first generated token, pass the forced_bos_token_id parameter to the generate method.

In [1]:
!pip install evaluate
!pip install sacrebleu
!pip install evaluate

In [None]:
import numpy as np
import pandas as pd
import evaluate
from datasets import load_dataset
from transformers import M2M100Tokenizer, M2M100ForConditionalGeneration
from transformers import DataCollatorForSeq2Seq
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer



## 1. Choose language to train translation model

__Availiable languages:__ ("ar", "cs"), ("ar", "de"),
    ("cs", "de"),
    ("ar", "en"),
    ("cs", "en"),
    ("de", "en"),
    ("ar", "es"),
    ("cs", "es"),
    ("de", "es"),
    ("en", "es"),
    ("ar", "fr"),
    ("cs", "fr"),
    ("de", "fr"),
    ("en", "fr"),
    ("es", "fr"),
    ("ar", "it"),
    ("cs", "it"),
    ("de", "it"),
    ("en", "it"),
    ("es", "it"),
    ("fr", "it"),
    ("ar", "ja"),
    ("cs", "ja"),
    ("de", "ja"),
    ("en", "ja"),
    ("es", "ja"),
    ("fr", "ja"),
    ("ar", "nl"),
    ("cs", "nl"),
    ("de", "nl"),
    ("en", "nl"),
    ("es", "nl"),
    ("fr", "nl"),
    ("it", "nl"),
    ("ar", "pt"),
    ("cs", "pt"),
    ("de", "pt"),
    ("en", "pt"),
    ("es", "pt"),
    ("fr", "pt"),
    ("it", "pt"),
    ("nl", "pt"),
    ("ar", "ru"),
    ("cs", "ru"),
    ("de", "ru"),
    ("en", "ru"),
    ("es", "ru"),
    ("fr", "ru"),
    ("it", "ru"),
    ("ja", "ru"),
    ("nl", "ru"),
    ("pt", "ru"),
    ("ar", "zh"),
    ("cs", "zh"),
    ("de", "zh"),
    ("en", "zh"),
    ("es", "zh"),
    ("fr", "zh"),
    ("it", "zh"),
    ("ja", "zh"),
    ("nl", "zh"),
    ("pt", "zh"),
    ("ru", "zh"),

In [None]:
news_en = load_dataset("news_commentary", "en-ru")
news_en = news_en["train"].train_test_split(test_size=0.2)

Downloading builder script:   0%|          | 0.00/2.08k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/7.41k [00:00<?, ?B/s]

Downloading and preparing dataset news_commentary/en-ru (download: 23.70 MiB, generated: 79.42 MiB, post-processed: Unknown size, total: 103.12 MiB) to /root/.cache/huggingface/datasets/news_commentary/en-ru/11.0.0/cfab724ce975dc2da51cdae45302389860badc88b74db8570d561ced6004f8b4...


Downloading data:   0%|          | 0.00/24.8M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/190104 [00:00<?, ? examples/s]

Dataset news_commentary downloaded and prepared to /root/.cache/huggingface/datasets/news_commentary/en-ru/11.0.0/cfab724ce975dc2da51cdae45302389860badc88b74db8570d561ced6004f8b4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset news_commentary/fr-ru (download: 21.35 MiB, generated: 72.45 MiB, post-processed: Unknown size, total: 93.80 MiB) to /root/.cache/huggingface/datasets/news_commentary/fr-ru/11.0.0/cfab724ce975dc2da51cdae45302389860badc88b74db8570d561ced6004f8b4...


Downloading data:   0%|          | 0.00/22.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/160740 [00:00<?, ? examples/s]

Dataset news_commentary downloaded and prepared to /root/.cache/huggingface/datasets/news_commentary/fr-ru/11.0.0/cfab724ce975dc2da51cdae45302389860badc88b74db8570d561ced6004f8b4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset news_commentary/ar-ru (download: 27.32 MiB, generated: 100.90 MiB, post-processed: Unknown size, total: 128.22 MiB) to /root/.cache/huggingface/datasets/news_commentary/ar-ru/11.0.0/cfab724ce975dc2da51cdae45302389860badc88b74db8570d561ced6004f8b4...


Downloading data:   0%|          | 0.00/28.6M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/84455 [00:00<?, ? examples/s]

Dataset news_commentary downloaded and prepared to /root/.cache/huggingface/datasets/news_commentary/ar-ru/11.0.0/cfab724ce975dc2da51cdae45302389860badc88b74db8570d561ced6004f8b4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]


## 2. Data preprocessing and tokenization

In [None]:
def preprocess_function(examples):
    target_lang="ru"
    source_lang="en"
    prefix = "translate English to Russian: "
    tokenizer = M2M100Tokenizer.from_pretrained(
                                            "facebook/m2m100_418M", # to use LLM you may choose m2m100_1.2B
                                            src_lang="en",
                                            tgt_lang="ru")

    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

In [None]:
tokenized_news_en = news_en.map(preprocess_function, batched=True)

  0%|          | 0/153 [00:00<?, ?ba/s]

Downloading tokenizer_config.json:   0%|          | 0.00/272 [00:00<?, ?B/s]

Downloading vocab.json:   0%|          | 0.00/3.71M [00:00<?, ?B/s]

Downloading (…)tencepiece.bpe.model:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/1.14k [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/908 [00:00<?, ?B/s]

  0%|          | 0/39 [00:00<?, ?ba/s]

  0%|          | 0/129 [00:00<?, ?ba/s]

  0%|          | 0/33 [00:00<?, ?ba/s]

  0%|          | 0/68 [00:00<?, ?ba/s]

  0%|          | 0/17 [00:00<?, ?ba/s]

In [None]:
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

metric = evaluate.load("sacrebleu")

Downloading builder script:   0%|          | 0.00/8.15k [00:00<?, ?B/s]


## 3. Model initialization

In [None]:
model = M2M100ForConditionalGeneration.from_pretrained('facebook/m2m100_418M')
tokenizer = M2M100Tokenizer.from_pretrained('facebook/m2m100_418M')
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

Downloading pytorch_model.bin:   0%|          | 0.00/1.94G [00:00<?, ?B/s]

Downloading generation_config.json:   0%|          | 0.00/233 [00:00<?, ?B/s]

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_opus_books_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
)



## 4. Model training

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_news_en["train"].select(range(10000)),
    eval_dataset=tokenized_news_en["test"].select(range(1000)),
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,1.4214,1.234021,26.4803,41.297


In [None]:
trainer.save_model('/kaggle/working/my/')

## 5. Test the model

In [None]:
from transformers import pipeline
translator = pipeline("translation", model='/kaggle/working/my/')

In [None]:
text_en = "Hello my friend"
translator(text_en, src_lang='en', tgt_lang='ru')