In [None]:
# Fine-tuning the mBART with new vocabulary of French-English

Today, fine-tuning is one of the main strategies for developing a model (be it an LLM or a specialized model) for a particular domain or language. It makes it possible to adapt a generic model to a particular vocabulary, turn of phrase or syntactic construction. The aim of this lab is twofold :

— we want to see how a translation system can be finetuned.  
— we want to identify the criteria on which fine-tuning can adapt a model  
In this lab, we consider translation between French and English using the mbart-25 model.

# Data

In [1]:
import csv

In [2]:
from google.colab import files

with open("/content/drive/MyDrive/Studying/_Diderot/🎨 MultiLG NLP/Celine270test.csv") as file :
  csv_file = csv.reader(file)
  # for ele in csv_file:
  #   print(ele)

# with open("/content/Celine270test.csv") as file:
#   csv_file = csv.reader(file)

#   for ele in csv_file:
#     print(ele)

In [3]:
import pandas as pd

# file_path = "/content/Celine270test.csv"
file_path = "/content/drive/MyDrive/Studying/_Diderot/🎨 MultiLG NLP/Celine270test.csv"

df = pd.read_csv(file_path, delimiter='\t')

sources = df['source'].tolist()
targets = df['target'].tolist()
merge = list(zip(sources, targets))

sources = [ele.lower() for ele in sources if not ele == '"'] # fr
targets = [ele.lower() for ele in targets if not ele == '"'] # en
# sources

In [4]:
print(sources)
print(targets)

["la sienne, c'est vrai, elle se présentait un peu mieux, mais pas beaucoup.", "moi j'ai fait ça tout de suite très mal.", "lui c'était par un cargo qu'il était arrivé.", "ma façon, c'était pas beaucoup.", "on peut pas tout faire!... moi, c'est l'apéro que je préfère!", 'lui, le père, je l\'apercevais encore quand je passais devant l\'étalage de son magasin, au coin du boulevard poincaré, dans la maison de "chaussures pour pieds sensibles" où il était premier vendeur.', "qui c'est qui le payera?", "ça c'est vrai.", "mes clients, eux, c'étaient des égoïstes, des pauvres, matérialistes tout rétrécis dans leurs sales projets de retraite, par le crachat sanglant et positif.", "les jeunes c'est toujours si pressé d'aller faire l'amour, ça se dépêche tellement de saisir tout ce qu'on leur donne à croire pour s'amuser, qu'ils y regardent pas à deux fois en fait de sensations.", "tout ça c'est des regrets qui ne font pas bouillir la marmite.", "le truc qu'il savait faire avant la guerre robins

vocab

In [5]:
# import nltk
# nltk.download('punkt')
# from nltk.tokenize import word_tokenize

# def make_vocab(sentences):
#   vocabulary = set()
#   for sentence in sentences:
#       words = word_tokenize(sentence)  # NLTK's tokenizer
#       vocabulary.update(words)
#   vocabulary_list = list(vocabulary)
#   vocabulary_list = [voca.lower() for voca in vocabulary_list]
#   return vocabulary_list


In [6]:
# voca_fr = make_vocab(sources)
# voca_en = make_vocab(targets)

# len(voca_fr) #954
# len(voca_en) #812

# print(voca_fr)
# print(voca_en)

# About lab

1. give mabrt modified corpus

> modification meaning :
- taking 100 FR words from celine and swap letter or subtoken, these modified words doesnt exist in voca of FR.celine.
- for its corresponding EN do same modification of swapping
ex. ("maman", "mother") -> (mmaan, mrthoe)

for each of this 100 pairs, put it into 10 sentences -> 1000 sent contiaing new words -> 100 (test) + 900 (train)
ex. qui paiera maman ça ? -> who will pay for mrthoe ?

2. train with the 900

3. test with the 100

finetune with modified corpus



# First, we’ll test the possibility for the model to learn to translate new words.

## 1. Generate a list of 100 French words and translate them into English (e.g. by randomly swapping the letters of existing words, or by concatenating sub-words units). It is essential to ensure that the words generated do not appear in the vocabulary and that they are segmented into several sub-tokens.

In [7]:
!pip install pandas openpyxl



manually choose 100 words

In [8]:

import pandas
tok_src = [s.split() for s in sources]
tok_trg = [t.split() for t in targets]

from itertools import chain
tok_src = list(chain.from_iterable(tok_src ))[:200]
tok_trg = list(chain.from_iterable(tok_trg ))[:200]
pair = [(s,t) for s,t in zip(tok_src, tok_trg)]

df = pd.DataFrame(pair, columns=['French', 'English'])
df.to_excel('output.xlsx', index=False)

In [9]:
file_path = '/content/drive/My Drive/Studying/_Diderot/🎨 MultiLG NLP/voca120.xlsx'

df = pd.read_excel(file_path)
voca_fr = df['french'].tolist()
voca_en = df['english'].tolist()

In [10]:
import random
def shuffle(word):
  chars = list(word)
  random.shuffle(chars)
  return ''.join(chars)

In [11]:
shuffled_voca = {}; count = 0
for f,e in zip(voca_fr, voca_en):
  if shuffle(f) not in voca_fr and shuffle(e) not in voca_en: # chech if new voca presents in original voca by chance
    shuffled_voca[(f,e)] = (shuffle(f),shuffle(e))
    count+=1
    if count == 100 : break
# shuffled_voca # (fr original ,en original):(fr swapped ,en swapped)

## 2. For each pair of words generated, insert them into 10 sentences and their translation. You can thus build up a test corpus of 100 sentences containing the 100 new words and a training corpus of 900 sentences.


add new voca at the end of the sentence

In [12]:
pair = list(zip(sources,targets))
modified_sents_fr = []; modified_sents_en = []

for _, shuffled in shuffled_voca.items():
  sent_pairs = random.sample(pair,10)
  for sent in sent_pairs:
    modified_sents_fr.append(sent[0] + ' '+ shuffled[0])
    modified_sents_en.append(sent[1] + ' '+ shuffled[1])
modified_sents = list(zip(modified_sents_fr, modified_sents_en))
random.shuffle(modified_sents)
print(modified_sents)
# print(modified_sents_fr)
# print(modified_sents_en)

[('"ce qui gênerait c\'est plutôt leur odeur de poussière, qui vous retient par le bout du nez. ,vsurpea', '"what would bother you is their smell of dust, which takes you from the end of your nose. or,op'), ('"les conversations à propos de mariages, moi je n\'ai jamais su comment les orienter, ni comment en sortir. nsad', '"when talking about marriages, i never knew how to guide them or how to get out of them. in'), ("ce qui me manque, tu vois, c'est de pouvoir supporter la boisson. ufaecrfh", 'what i miss, you see, is being able to bear the drink. htea'), ('"elle s\'en irait au grand cimetière d\'à côté d\'abord la tante, où les morts c\'est comme une foule qui attend. tetshcuases', '"she would go to the grand cemetery next to the aunt first, where the dead are like a crowd waiting. ckoss'), ('"c\'est que, des puces j\'en avais, c\'est vrai, moi aussi, attrapé pendant la nuit au-dessus des malades. ouvlati', '"that\'s because i had some of them, it\'s true, me too, caught over sick pe

maybe i can try inserting the swapped word into radonm position of the sent, not at the end all the time.

In [13]:
pair = list(zip(sources,targets))
modified_sents_fr = []; modified_sents_en = []
#FR
for _, shuffled in shuffled_voca.items():
  sent_pairs = random.sample(pair,10)
  for sent in sent_pairs:
    splited_fr = sent[0].split()
    splited_fr.insert(random.randint(0,len(splited_fr)), shuffled[0])
    modified_sents_fr.append( ' '.join(splited_fr) )

#EN
for _, shuffled in shuffled_voca.items():
  sent_pairs = random.sample(pair,10)
  for sent in sent_pairs:
    splited_en = sent[1].split()
    splited_en.insert(random.randint(0,len(splited_en)), shuffled[1])
    modified_sents_en.append( ' '.join(splited_en) )

modified_sents_2 = list(zip(modified_sents_fr, modified_sents_en))
random.shuffle(modified_sents_2)


In [14]:
# modified_sents

## 3. Are the new words correctly translated ?

## mBart

first just give the modifed corpus to pretrained mbart

In [15]:
!pip install transformers
!pip install SentencePiece



In [16]:
from transformers import MBartForConditionalGeneration, MBart50Tokenizer

def mbart_translate(text):
    tokenizer = MBart50Tokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
    tokenizer.src_lang = "fr_XX"
    encoded_fr = tokenizer(text, return_tensors="pt")
    generated_tokens = model.generate(**encoded_fr, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
    translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

    return translation[0]

In [17]:
# # translated = [mbart_translate(fr) for fr,en in modified_sents[:10]]
# for  fr,en in modified_sents[10:40]:
#   print("fr : ", fr)
#   print("en : ",en)
#   print("translate : " , mbart_translate(fr))

In [18]:
# translated

maybe i can try inserting the swapped word into radonm position of the sent, not at the end all the time.

## 4. Fine-tune the model on the training corpus. Are the words translated correctly after this step ?

train mbart with same data set (modified) and make him translate again

In [19]:
! pip install -U accelerate -U transformers transformers[torch]



In [20]:
!pip install sentencepiece datasets sacrebleu



In [21]:
# from transformers import AutoTokenizer
# model_checkpoint ="Helsinki-NLP/opus-mt-en-ro"
# tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50Tokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")


In [22]:
from datasets import Dataset

def create_dataset(data):
  examples = []
  for d in data:
    encodings_fr = tokenizer(d[0], max_length=128, truncation=True)
    example = {"translation" : {'fr': d[0], 'en': d[1]},
              "input_ids" : encodings_fr["input_ids"],
              "attention_mask":encodings_fr["attention_mask"],
              "labels": tokenizer(d[1], max_length=128, truncation=True)["input_ids"]}
    examples.append(example)
  datasets = Dataset.from_list(examples)

  return datasets


In [23]:
train_dt = create_dataset(modified_sents_2[:900])
test_dt = create_dataset(modified_sents_2[900:])
last_dt = create_dataset(last_test)
# train_dt

In [24]:
test_dt

Dataset({
    features: ['translation', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 100
})

In [25]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [26]:
batch_size = 8
model_name = "facebook/mbart-large-50-many-to-many-mmt"
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True
)

In [27]:
import numpy as np
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

from datasets import load_metric
metric = load_metric("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

  metric = load_metric("sacrebleu")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/2.85k [00:00<?, ?B/s]

In [28]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [29]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_dt,
    eval_dataset=test_dt,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [30]:
trainer.train()

Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,No log,2.530201,0.9499,32.4
2,No log,1.486616,1.6558,26.78
3,No log,1.099763,0.2739,21.16
4,No log,0.969757,0.6279,25.22
5,1.573900,0.95028,0.7064,22.97


TrainOutput(global_step=565, training_loss=1.4622186846437708, metrics={'train_runtime': 629.9382, 'train_samples_per_second': 7.144, 'train_steps_per_second': 0.897, 'total_flos': 646058631364608.0, 'train_loss': 1.4622186846437708, 'epoch': 5.0})

In [31]:
test_results = trainer.predict(test_dt)
# Obtain and display the test BLEU score
print("Test Bleu Score: ", test_results.metrics["test_bleu"])

Test Bleu Score:  0.7064


In [32]:
# shuffled_voca = {}; count = 0
# for f,e in zip(voca_fr, voca_en):
#   if shuffle(f) not in voca_fr and shuffle(e) not in voca_en: # chech if new voca presents in original voca by chance
#     shuffled_voca[(f,e)] = (shuffle(f),shuffle(e))
#     count+=1
#     if count == 100 : break
# # shuffled_voca # (fr original ,en original):(fr swapped ,en swapped)

# pair = list(zip(sources,targets))
# modified_sents_fr = []; modified_sents_en = []

# for _, shuffled in shuffled_voca.items():
#   sent_pairs = random.sample(pair,10)
#   for sent in sent_pairs:
#     modified_sents_fr.append(sent[0] + ' '+ shuffled[0])
#     modified_sents_en.append(sent[1] + ' '+ shuffled[1])
# test2 = list(zip(modified_sents_fr, modified_sents_en))
# random.shuffle(test2)
# print(test2)

# test_dt_2 = create_dataset(test2[:50])


In [38]:
pair = list(zip(sources,targets))
modified_sents_fr = []; modified_sents_en = []
#FR
for _, shuffled in shuffled_voca.items():
  sent_pairs = random.sample(pair,10)
  for sent in sent_pairs:
    splited_fr = sent[0].split()
    splited_fr.insert(random.randint(0,len(splited_fr)), shuffled[0])
    modified_sents_fr.append( ' '.join(splited_fr) )

#EN
for _, shuffled in shuffled_voca.items():
  sent_pairs = random.sample(pair,10)
  for sent in sent_pairs:
    splited_en = sent[1].split()
    splited_en.insert(random.randint(0,len(splited_en)), shuffled[1])
    modified_sents_en.append( ' '.join(splited_en) )

last_test = list(zip(modified_sents_fr, modified_sents_en))
random.shuffle(last_test)


In [40]:
last_dt = create_dataset(last_test)

test_results = trainer.predict(last_dt)

In [37]:
test_dt['translation']

[{'en': '"love is misery and nothing else, it is always, that comes to eflhss,i deceive in our mouths, trust, that\'s all.',
  'fr': "ésetgï,so les fortifications, c'est trop voyou..."},
 {'en': '"life is that, a in light that ends in the night.',
  'fr': '"-- bébert, docteur, faut que je vous dise, parce que vous nsad êtes médecin, c\'est un petit saligaud!... il se ""touche""!'},
 {'en': 'he had ckoss arrived by a cargo ship.',
  'fr': '"madelon, c\'était un nom facile à se tetshcuases souvenir.'},
 {'en': "the orem fortifications, it's too foolish...",
  'fr': 'uslp un enfant, c’est précieux et fragile.'},
 {'en': 'even he, the eagle to his joséphine! the fire on the train, it is the s,oph case to say it against and against everything.',
  'fr': 'à moi aussi, snmi,gaa elle me parle.'},
 {'en': "the lstli fortifications, it's too foolish...",
  'fr': 'des yeux de statue, on cenero en avait vu par milliers.'},
 {'en': "above wsayal all, it's the accountant.",
  'fr': '"moi, je savais 

In [36]:
for ele in test_results.predictions:
  translated_text = tokenizer.decode(ele, skip_special_tokens=True)
  print(translated_text)
# test_results.predictions

"but what was rather surprising to me was that he didn't succeed in america either.
"but what was rather surprising to me was that he didn't succeed in america either.
"the truth is, it's an agony that doesn't end.
"if they wanted me, they would just call me in the rules and then it would be 20 francs.
i asked him, these apples, s,oph where he found them.
"but what was rather surprising to me was that he didn't succeed in america either.
"but what was rather surprising to me was that he didn't succeed in america either.
"the truth is, it's an agony that doesn't end.
"but what's certain is that everyone as much as you are, it's nothing but a small furnace you have between your legs and still a wet property!"" that was sent!
"i was very disappointed by everything that had happened that sunday and very tired.
"but what was rather surprising to me was that he didn't succeed in america either.
"the truth is, it's an agony that doesn't end.
"the truth is, it's an agony that doesn't end.
"wel

In [41]:
# Obtain and display the test BLEU score
print("Test Bleu Score: ", test_results.metrics["test_bleu"])

Test Bleu Score:  2.1028
