<a href="https://colab.research.google.com/github/ThatCodeCodingGuy/Fine-tuning-MarianMT-for-English-Vietnamese-Translation/blob/main/finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Installing necessary packages**

In [1]:
!pip install torch
!pip install datasets
!pip install transformers sentencepiece
!pip install sacrebleu 
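
Before moving on, it can help to confirm which versions were actually installed, since the `pip` commands above do not pin any. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the pip commands in this cell
report = installed_versions(
    ["torch", "datasets", "transformers", "sentencepiece", "sacrebleu"]
)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Recording these versions alongside the results makes a later rerun easier to compare.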


# **Loading our dataset through HuggingFace**

In [2]:
import datasets
from datasets import load_dataset

dataset = load_dataset("mt_eng_vietnamese", 'iwslt2015-en-vi')
dataset

Downloading and preparing dataset mt_eng_vietnamese/iwslt2015-en-vi (download: 30.83 MiB, generated: 31.59 MiB, post-processed: Unknown size, total: 62.42 MiB) to /root/.cache/huggingface/datasets/mt_eng_vietnamese/iwslt2015-en-vi/1.0.0/53add551a01e9874588066f89d42925f9fad43db347199dad00f7e4b0c905a71...


Dataset mt_eng_vietnamese downloaded and prepared to /root/.cache/huggingface/datasets/mt_eng_vietnamese/iwslt2015-en-vi/1.0.0/53add551a01e9874588066f89d42925f9fad43db347199dad00f7e4b0c905a71. Subsequent calls will reuse this data.



DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 133318
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1269
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 1269
    })
})
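
Each split has a single `translation` feature, and the preprocessing cell further down indexes it by language code. As a rough illustration of that layout, here is a sketch with one made-up record (the sentences are hypothetical, not taken from the dataset) and a hypothetical helper `split_pairs` that separates source and target sentences.

```python
# Hypothetical record with the same shape as one dataset row:
# a "translation" dict keyed by language code.
example = {"translation": {"en": "Thank you very much.", "vi": "Cảm ơn rất nhiều."}}


def split_pairs(rows, source_lang="en", target_lang="vi"):
    """Return parallel lists of source and target sentences."""
    sources = [row["translation"][source_lang] for row in rows]
    targets = [row["translation"][target_lang] for row in rows]
    return sources, targets


sources, targets = split_pairs([example])
print(sources)
print(targets)
```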

# **Forming our transformer**

In [3]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-vi")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-vi")


In [4]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "vi"


def preprocess_function(examples):
    # Each entry of examples["translation"] is a dict keyed by language code
    inputs = [ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    # Tokenize the targets with the target-side vocabulary
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)
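
To make the effect of `max_length` with `truncation=True` concrete, here is a toy sketch. It assumes a whitespace tokenizer (`toy_tokenize`, a hypothetical stand-in), whereas the real Marian tokenizer works on SentencePiece subwords and special tokens, so actual token counts will differ.

```python
def toy_tokenize(texts, max_length):
    """Split on whitespace and keep at most max_length tokens per text."""
    return [text.split()[:max_length] for text in texts]


batch = ["a short sentence", "a much longer sentence that will be cut off"]
tokens = toy_tokenize(batch, max_length=4)
for row in tokens:
    print(len(row), row)
```

Note that the rows still have different lengths after truncation. No padding happens at this stage; it is deferred to the data collator at batch time.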


In [5]:
# Only 800 examples per split are used because of limited computational resources
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(800))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(800))
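
The shuffle-then-select pattern gives a random but reproducible subset because the seed is fixed. The sketch below mimics the idea with the standard library only. It is an analogy, not the `datasets` implementation, so the indices it picks will not match those chosen by `shuffle(seed=42)`.

```python
import random


def shuffled_subset(rows, size, seed=42):
    """Shuffle row indices with a fixed seed and keep the first `size` rows."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    return [rows[i] for i in indices[:size]]


# 1269 matches the size of the test split shown earlier
rows = list(range(1269))
first = shuffled_subset(rows, 800)
second = shuffled_subset(rows, 800)
print(first == second)  # same seed, same subset
```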

# **Importing packages for the training of the model**

In [6]:
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [7]:
batch_size = 16
model_name = "marianMT"
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-en-vi",  # output directory
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
)
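
With these settings the number of optimizer steps can be estimated by hand. The sketch below is a simplification that assumes a single device and no gradient accumulation, which matches the training log later in this notebook; the function name `optimization_steps` is hypothetical.

```python
import math


def optimization_steps(num_examples, batch_size, epochs=1):
    """Estimate optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


steps = optimization_steps(num_examples=800, batch_size=16, epochs=1)
print(steps)  # 50
```

This agrees with the `Total optimization steps = 50` line that the trainer prints.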

In [8]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
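
The collator pads every batch to the length of its longest sequence, and by default it pads labels with `-100` so that padded positions are ignored by the loss. The following is a plain-Python sketch of that idea, not the library code; `pad_batch` and `TOY_PAD_ID` are hypothetical names, and the token IDs are made up.

```python
TOY_PAD_ID = 0  # hypothetical pad id, not the real Marian pad token id


def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


input_ids = [[5, 6, 7], [8, 9]]
labels = [[11, 12], [13, 14, 15, 16]]

padded_inputs = pad_batch(input_ids, pad_value=TOY_PAD_ID)
padded_labels = pad_batch(labels, pad_value=-100)
print(padded_inputs)
print(padded_labels)
```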

In [9]:
from datasets import load_metric
import numpy as np
metric = load_metric("sacrebleu")

def postprocess_text(preds, labels):
   preds = [pred.strip() for pred in preds]
   labels = [[label.strip()] for label in labels]
   return preds, labels

def compute_metrics(eval_preds):
   preds, labels = eval_preds
   if isinstance(preds, tuple):
       preds = preds[0]
   decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
   # Replacing -100 in the labels since we can't decode them.
   labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
   decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
   # Post-processing
   decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
   result = metric.compute(predictions=decoded_preds, references=decoded_labels)
   prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
   result = {'bleu' : result['score']}
   result["gen_len"] = np.mean(prediction_lens)
   result = {k: round(v, 4) for k, v in result.items()}
   return result
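
Two steps in `compute_metrics` are easy to misread: swapping `-100` back to the pad token before decoding, and counting non-pad tokens to get `gen_len`. The sketch below reproduces both with plain lists instead of NumPy so it runs on its own; the helper names and `TOY_PAD_ID` are hypothetical, and the token IDs are made up.

```python
TOY_PAD_ID = 0  # hypothetical pad id used only for this illustration


def restore_padding(labels, pad_token_id, ignore_index=-100):
    """Replace the ignore index with the pad id so the labels can be decoded."""
    return [
        [pad_token_id if tok == ignore_index else tok for tok in row]
        for row in labels
    ]


def mean_generated_length(preds, pad_token_id):
    """Average number of non-pad tokens per prediction."""
    lengths = [sum(1 for tok in row if tok != pad_token_id) for row in preds]
    return sum(lengths) / len(lengths)


labels = [[11, 12, -100, -100], [13, 14, 15, 16]]
preds = [[21, 22, 23, 0], [24, 25, 0, 0]]

print(restore_padding(labels, TOY_PAD_ID))
print(mean_generated_length(preds, TOY_PAD_ID))  # 2.5
```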


# **Training our model**

In [10]:
trainer = Seq2SeqTrainer(
   model,
   args,
   train_dataset=small_train_dataset,
   eval_dataset=small_eval_dataset,
   data_collator=data_collator,
   tokenizer=tokenizer,
   compute_metrics=compute_metrics
)
trainer.train()

The following columns in the training set  don't have a corresponding argument in `MarianMTModel.forward` and have been ignored: translation.
***** Running training *****
  Num examples = 800
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 50


| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|
| 1 | No log | 1.499966 | 33.9028 | 28.1725 |


The following columns in the evaluation set  don't have a corresponding argument in `MarianMTModel.forward` and have been ignored: translation.
***** Running Evaluation *****
  Num examples = 800
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=50, training_loss=1.9826734924316407, metrics={'train_runtime': 177.7283, 'train_samples_per_second': 4.501, 'train_steps_per_second': 0.281, 'total_flos': 13885617733632.0, 'train_loss': 1.9826734924316407, 'epoch': 1.0})
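
The throughput figures in `TrainOutput` can be checked against the other numbers in the log. This is a small arithmetic sketch using the values reported above (800 examples, 50 steps, 177.7283 seconds).

```python
train_runtime = 177.7283  # seconds, from 'train_runtime'
num_examples = 800        # from 'Num examples'
global_step = 50          # from 'global_step'

samples_per_second = round(num_examples / train_runtime, 3)
steps_per_second = round(global_step / train_runtime, 3)

print(samples_per_second)  # 4.501
print(steps_per_second)    # 0.281
```

Both values match `train_samples_per_second` and `train_steps_per_second` in the output.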

# **Saving our model**

In [11]:
trainer.save_model()

Saving model checkpoint to marianMT-finetuned-en-vi
Configuration saved in marianMT-finetuned-en-vi/config.json
Model weights saved in marianMT-finetuned-en-vi/pytorch_model.bin
tokenizer config file saved in marianMT-finetuned-en-vi/tokenizer_config.json
Special tokens file saved in marianMT-finetuned-en-vi/special_tokens_map.json


# **Loading and using our fine-tuned model**

In [12]:
import os
from transformers import MarianMTModel, MarianTokenizer

# List the files written by trainer.save_model()
for dirname, _, filenames in os.walk('/content/marianMT-finetuned-en-vi'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

src_text = ["Natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can."]
model_name = '/content/marianMT-finetuned-en-vi'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
[tokenizer.decode(t, skip_special_tokens=True) for t in translated]

Didn't find file /content/marianMT-finetuned-en-vi/added_tokens.json. We won't load it.
Didn't find file /content/marianMT-finetuned-en-vi/tokenizer.json. We won't load it.
loading file /content/marianMT-finetuned-en-vi/source.spm
loading file /content/marianMT-finetuned-en-vi/target.spm
loading file /content/marianMT-finetuned-en-vi/vocab.json
loading file /content/marianMT-finetuned-en-vi/tokenizer_config.json
loading file None
loading file /content/marianMT-finetuned-en-vi/special_tokens_map.json
loading file None
loading configuration file /content/marianMT-finetuned-en-vi/config.json
Model config MarianConfig {
  "_name_or_path": "Helsinki-NLP/opus-mt-en-vi",
  "activation_dropout": 0.0,
  "activation_function": "swish",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "MarianMTModel"
  ],
  "attention_dropout": 0.0,
  "bad_words_ids": [
    [
      53684
    ]
  ],
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "

/content/marianMT-finetuned-en-vi/training_args.bin
/content/marianMT-finetuned-en-vi/pytorch_model.bin
/content/marianMT-finetuned-en-vi/tokenizer_config.json
/content/marianMT-finetuned-en-vi/special_tokens_map.json
/content/marianMT-finetuned-en-vi/config.json
/content/marianMT-finetuned-en-vi/target.spm
/content/marianMT-finetuned-en-vi/vocab.json
/content/marianMT-finetuned-en-vi/source.spm
/content/marianMT-finetuned-en-vi/runs/Feb05_07-48-02_52e211f11071/events.out.tfevents.1644047303.52e211f11071.81.0
/content/marianMT-finetuned-en-vi/runs/Feb05_07-48-02_52e211f11071/1644047303.6339252/events.out.tfevents.1644047303.52e211f11071.81.1


All model checkpoint weights were used when initializing MarianMTModel.

All the weights of MarianMTModel were initialized from the model checkpoint at /content/marianMT-finetuned-en-vi.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MarianMTModel for predictions without further training.


['Xử lý ngôn ngữ tự nhiên (NLP) muốn nói đến chi nhánh của khoa học máy tính và cụ thể hơn nữa, chi nhánh của trí tuệ nhân tạo hoặc AI đối xứng với việc cung cấp cho máy tính khả năng hiểu văn bản và ngôn ngữ nói theo cùng một cách mà con người có thể.']
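
The cell above translates a single sentence. For a longer list, one option is to split the sentences into fixed-size chunks and pass each chunk through the same tokenize, generate, and decode calls. The sketch below covers only the chunking; `batched` is a hypothetical helper and the sentences are placeholders.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sentences = [f"Sentence number {i}." for i in range(5)]
batches = list(batched(sentences, batch_size=2))
for chunk in batches:
    # Each chunk could be used in place of src_text in the cell above
    print(len(chunk), chunk)
```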