Code responsible for downloading and unpacking the data

In [1]:
import urllib.request

url = "https://object.pouta.csc.fi/OPUS-MultiParaCrawl/v7.1/moses/hr-pl.txt.zip"
output_path = "opus_pl_hr.zip"

urllib.request.urlretrieve(url, output_path)
print("Download complete")


Download complete


In [2]:
!unzip -o opus_pl_hr.zip

Archive:  opus_pl_hr.zip
  inflating: README                  
  inflating: LICENSE                 
  inflating: MultiParaCrawl.hr-pl.hr  
  inflating: MultiParaCrawl.hr-pl.pl  
  inflating: MultiParaCrawl.hr-pl.xml  


In [23]:
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu

# Read the line-aligned Croatian and Polish sides of the corpus.
with open("MultiParaCrawl.hr-pl.hr", "r", encoding="utf-8") as f:
    hr = f.read().splitlines()
with open("MultiParaCrawl.hr-pl.pl", "r", encoding="utf-8") as f:
    pl = f.read().splitlines()

df = pd.DataFrame({"hr": hr, "pl": pl})
n = 5  # number of sentence pairs to translate and score

In [24]:
# Drop the first five sentence pairs from the DataFrame and from both lists.
df = df[5:]
pl = pl[5:]
hr = hr[5:]

Loading the multilingual model (mBART-50) and its tokenizer, and defining the translation function for it.

In [25]:
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

def translate_multi(sentences, src_lang, tgt_lang):
    """Translate each sentence from src_lang to tgt_lang, one at a time."""
    output = []
    tokenizer.src_lang = src_lang
    for sentence in sentences:
        encoded = tokenizer(sentence, return_tensors="pt")

        # Force the decoder to start with the target-language token.
        generated_tokens = model.generate(
            **encoded,
            forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang]
        )
        output.append(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])

    return output


Back-and-forth translations and evaluations for the Multilingual model

In [26]:
hr_to_pl = translate_multi(hr[:n], "hr_HR", "pl_PL")
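
`translate_multi` calls `model.generate` once per sentence. For larger samples it may be worth grouping sentences into batches first. The following is a minimal sketch of the chunking step only, in plain Python. `chunked` and `translate_in_batches` are hypothetical helpers, and `translate_batch` stands for any function that maps a list of sentences to a list of translations (for example, a variant of `translate_multi` that tokenizes a whole list with `padding=True`). The stand-in translator exists only so the sketch runs without loading a model.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_batches(sentences, translate_batch, batch_size=8):
    """Apply `translate_batch` to each chunk and concatenate the results."""
    output = []
    for batch in chunked(sentences, batch_size):
        output.extend(translate_batch(batch))
    return output


# Stand-in translator (hypothetical): upper-cases text instead of translating.
def fake_translate_batch(batch):
    return [s.upper() for s in batch]


demo = translate_in_batches(["a", "b", "c"], fake_translate_batch, batch_size=2)
print(demo)
```

The order of the output matches the order of the input, so the result can be scored against the references exactly as before.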

In [27]:
total_score = 0
for i in range(n):
  bleu = sentence_bleu([pl[:n][i].split(' ')], hr_to_pl[i].split(' '))
  total_score += bleu
print("BLEU score: ", total_score/n)

BLEU score:  0.1221009811312018


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
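
The warning above follows from how BLEU combines n-gram precisions: the score is a geometric mean, so a single order with no matches drives the whole score to zero or very close to it. NLTK appears to substitute a tiny positive value for a zero precision, which would explain averages on the order of 1e-232 instead of exactly 0. The sketch below is a simplified, self-contained illustration of the collapse and of add-one smoothing as one possible remedy. It is not NLTK's implementation: `ngram_counts`, `ngram_precision` and `simple_bleu` are hypothetical helpers, and the brevity penalty is left out. The two sentences are the Croatian reference and the MarianMT output from the first row of the results table.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order `n` in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(reference, hypothesis, n, smooth=False):
    """Clipped n-gram precision; add-one smoothing if `smooth` is True."""
    hyp = ngram_counts(hypothesis, n)
    ref = ngram_counts(reference, n)
    matches = sum(min(count, ref[gram]) for gram, count in hyp.items())
    total = sum(hyp.values())
    if smooth:
        return (matches + 1) / (total + 1)
    return matches / total if total else 0.0


def simple_bleu(reference, hypothesis, max_n=4, smooth=False):
    """Geometric mean of the 1..max_n precisions (no brevity penalty)."""
    precisions = [
        ngram_precision(reference, hypothesis, n, smooth)
        for n in range(1, max_n + 1)
    ]
    if min(precisions) == 0:
        return 0.0  # one order without matches zeroes the geometric mean
    return math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "Pogrešno je mišljenje da razvoj govora počinje od trenutka prvih riječi.".split(" ")
hypothesis = "Pogrešno je zaključiti da je razvoj govora počeo od početka prvih riječi.".split(" ")

print(simple_bleu(reference, hypothesis))               # 0.0: no matching 3-grams
print(simple_bleu(reference, hypothesis, smooth=True))  # small but non-zero
```

With NLTK itself, the equivalent step would be to pass a `smoothing_function` from `SmoothingFunction()` to `sentence_bleu`, as the warning suggests.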


In [28]:
# Round trip hr -> pl -> hr, scored against the Croatian originals.
pl_back = translate_multi(hr_to_pl, "pl_PL", "hr_HR")
total_score = 0
for i in range(n):
  bleu = sentence_bleu([hr[:n][i].split(' ')], pl_back[i].split(' '))
  total_score += bleu
print("BLEU score: ", total_score/n)

BLEU score:  8.190192597872934e-79


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


In [29]:
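# NOTE: pl_to_hr is Croatian output, yet the loop below scores it against the
# Polish sentences pl[:n]; the Croatian references hr[:n] are the likely intent.
pl_to_hr = translate_multi(pl[:n], "pl_PL", "hr_HR")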
total_score = 0
for i in range(n):
  bleu = sentence_bleu([pl[:n][i].split(' ')], pl_to_hr[i].split(' '))
  total_score += bleu
print("BLEU score: ", total_score/n)

BLEU score:  5.842906353411464e-232


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


In [30]:
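# NOTE: hr_back is the pl -> hr -> pl round trip, so it holds Polish text, yet the
# loop below scores it against the Croatian sentences hr[:n]; pl[:n] is the likely intent.
hr_back = translate_multi(pl_to_hr, "hr_HR", "pl_PL")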
total_score = 0
for i in range(n):
  bleu = sentence_bleu([hr[:n][i].split(' ')], hr_back[i].split(' '))
  total_score += bleu
print("BLEU score: ", total_score/n)

BLEU score:  1.9189006110305264e-232


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


Translations through English

In [31]:
hr_to_en = translate_multi(hr[:n], "hr_HR", "en_XX")
en_to_pl = translate_multi(hr_to_en, "en_XX", "pl_PL")
total_score = 0
for i in range(n):
  bleu = sentence_bleu([pl[:n][i].split(' ')], en_to_pl[i].split(' '))
  total_score += bleu
print("BLEU score: ", total_score/n)

BLEU score:  0.08927291841594476


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
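
Pivoting through English is a composition of two translation steps. The following is a minimal sketch of a generic chain, assuming a `translate_fn(sentences, src_lang, tgt_lang)` callable with the same signature as `translate_multi`. `translate_via` is a hypothetical helper, and the stand-in translator below only tags the text so that the sketch runs without a model.

```python
def translate_via(sentences, languages, translate_fn):
    """Translate through each consecutive pair of language codes in `languages`."""
    if len(languages) < 2:
        raise ValueError("need at least a source and a target language")
    current = list(sentences)
    for src_lang, tgt_lang in zip(languages, languages[1:]):
        current = translate_fn(current, src_lang, tgt_lang)
    return current


# Stand-in translator (hypothetical): records the direction instead of translating.
def fake_translate(sentences, src_lang, tgt_lang):
    return [f"{s} [{src_lang}->{tgt_lang}]" for s in sentences]


pivoted = translate_via(
    ["Rat je tek počeo..."], ["hr_HR", "en_XX", "pl_PL"], fake_translate
)
print(pivoted[0])
```

With the notebook's own function, the pivot cell would read `translate_via(hr[:n], ["hr_HR", "en_XX", "pl_PL"], translate_multi)`.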


Loading the Slavic model (MarianMT, Helsinki-NLP/opus-mt-sla-sla) and its tokenizer, and defining the translation function for it.

In [32]:
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-sla-sla'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
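
def translate(text, trg_lang):
    """Translate a single sentence; trg_lang is a code such as "pol" or "hrv"."""
    # The target language is selected with a >>id<< token prepended to the input.
    text = ">>" + trg_lang + "<< " + text
    translated = model.generate(**tokenizer(text, return_tensors='pt', padding=True))

    translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)

    return translated_text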
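
The multilingual Marian checkpoint selects the output language through a `>>id<<` token at the start of the input, which is what the string concatenation in `translate` builds. The following is a small sketch of that step in isolation, with a guard against unexpected codes. `with_target_token` and the `SUPPORTED` set are hypothetical, and the set lists only the two codes used in this notebook, not everything the model accepts.

```python
SUPPORTED = {"pol", "hrv"}  # assumption: only the codes used in this notebook


def with_target_token(text, trg_lang):
    """Prepend the >>id<< target-language token to `text`."""
    if trg_lang not in SUPPORTED:
        raise ValueError(f"unexpected target language code: {trg_lang!r}")
    return f">>{trg_lang}<< {text}"


print(with_target_token("Rat je tek počeo...", "pol"))
```

Failing early on a mistyped code is a design choice of this sketch; it avoids silently translating into an unintended language.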






Back-and-forth translations and evaluations for the Slavic model

In [33]:
hr_to_pl2 = []
for i in range(n):
  hr_to_pl2.append(translate(hr[i], "pol"))

total_score = 0
for i in range(n):
  bleu = sentence_bleu([pl[:n][i].split(' ')], hr_to_pl2[i].split(' '))
  total_score += bleu
print("BLEU score: ", total_score/n)

BLEU score:  0.25232520155621585


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
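
Splitting on single spaces keeps punctuation attached to words, so `słów.` and `słów` count as different tokens and n-gram overlap can be underestimated. The following is a minimal sketch of a regex tokenizer that separates punctuation, using only the standard library. `simple_tokenize` is a hypothetical helper and is not what produced the scores in this notebook.

```python
import re

# One token per run of word characters, or per single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(sentence, lowercase=True):
    """Split a sentence into word and punctuation tokens."""
    if lowercase:
        sentence = sentence.lower()
    return TOKEN_RE.findall(sentence)


print(simple_tokenize("Wojna dopiero się zaczęła."))
```

The tokens could then be passed to `sentence_bleu` in place of the `.split(' ')` results, for both the reference and the hypothesis.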


In [34]:
pl_to_hr2 = []
for i in range(n):
  pl_to_hr2.append(translate(pl[i], "hrv"))

total_score = 0
for i in range(n):
  bleu = sentence_bleu([hr[:n][i].split(' ')], pl_to_hr2[i].split(' '))
  total_score += bleu
print("BLEU score: ", total_score/n)

BLEU score:  0.22217223648086906


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


In [35]:
# Round trip pl -> hr -> pl, scored against the Polish originals.
hr_back2 = []
for i in range(n):
  hr_back2.append(translate(pl_to_hr2[i], "pol"))

total_score = 0
for i in range(n):
  bleu = sentence_bleu([pl[:n][i].split(' ')], hr_back2[i].split(' '))
  total_score += bleu
print("BLEU score: ", total_score/n)

BLEU score:  0.1481594145679185


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


In [36]:
# Round trip hr -> pl -> hr, scored against the Croatian originals.
pl_back2 = []
for i in range(n):
  pl_back2.append(translate(hr_to_pl2[i], "hrv"))

total_score = 0
for i in range(n):
  bleu = sentence_bleu([hr[:n][i].split(' ')], pl_back2[i].split(' '))
  total_score += bleu
print("BLEU score: ", total_score/n)

BLEU score:  0.07084840876066213


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
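
The evaluation loop is repeated for every translation direction. The following is a sketch of a reusable helper, assuming a `score_fn(references, hypothesis)` callable with the calling convention of `sentence_bleu` (a list of tokenized references, then a tokenized hypothesis). `average_score` is a hypothetical name, and the toy scorer exists only so that the example runs without NLTK.

```python
def average_score(references, hypotheses, score_fn):
    """Average a sentence-level score over paired references and hypotheses."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        raise ValueError("need at least one sentence pair")
    total = 0.0
    for ref, hyp in zip(references, hypotheses):
        # Same whitespace tokenization as in the notebook cells.
        total += score_fn([ref.split(" ")], hyp.split(" "))
    return total / len(references)


def unigram_overlap(reference_lists, hypothesis):
    """Toy scorer: fraction of hypothesis tokens found in the first reference."""
    ref_tokens = set(reference_lists[0])
    return sum(tok in ref_tokens for tok in hypothesis) / len(hypothesis)


result = average_score(["a b c", "d e"], ["a b x", "d e"], unigram_overlap)
print(result)
```

In the notebook this would be called as `average_score(pl[:n], hr_to_pl, sentence_bleu)`, which also makes the choice of reference language explicit at each call site.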


Text results

In [38]:
samples = pd.DataFrame({
    'Original Polish': pl[:5],
    'Reference Croatian': hr[:5],
    'mBART-50 Translation Polish': hr_to_pl[:5],
    'mBART-50 Translation Croatian': pl_to_hr[:5],
    'MarianMT Translation Polish': hr_to_pl2[:5],
    'MarianMT Translation Croatian': pl_to_hr2[:5]
})

display(samples.style.set_properties(**{'text-align': 'left'}))

Index,Original Polish,Reference Croatian,mBART-50 Translation Polish,mBART-50 Translation Croatian,MarianMT Translation Polish,MarianMT Translation Croatian
0,"Błędna jest opinia, że rozwój mowy rozpoczyna się od momentu pierwszych słów.",Pogrešno je mišljenje da razvoj govora počinje od trenutka prvih riječi.,"Błędnie uważamy, że rozwój języka zaczyna się od momentu pierwszych słów.",Błędna je ideja da se razvoj govora počinje od trenutka prve riječi.,"Niewłaściwe jest opinia, że rozwój mowy zaczyna się od momentu pierwszych słów.",Pogrešno je zaključiti da je razvoj govora počeo od početka prvih riječi.
1,"Jednak jeśli jesteś jednym z bezwłosy lub utrata włosów wokół osoby, czynnik potencjalnie nie jest faktycznie największym zmartwieniem swoją dzisiaj.","Ipak, ako ste jedan od dlaka ili gubitak ljudi kose oko, razlog potencijalno zapravo nije vaša najveća briga danas.","Jednak, if you're one of the hair-losers, the reason, potentially, isn't your biggest concern today.","No, if you are one of baldness or hair loss around a person, the factor potentially isn’t actually your biggest concern today.","Mimo to, jeśli jesteś jednym z włosów lub straty ludzi, to powód potencjalnie nie jest twoim największym problemem dzisiaj.","Međutim, ako ste jedan od bezkosa ili gubitak kose oko osobe, potencijalno nije zapravo najveća zabrinutost danas."
2,Wojna dopiero się zaczyna...,Rat je tek počeo...,Wojna dopiero się rozpoczęła...,Wojna tek počinje...,Wojna dopiero się zaczęła.,Rat je tek počeo...
3,"Upadek Turtle prowadzi spadkobierców w przeglądzie woli Westinga, aby pozbyć się prawdziwego znaczenia.",Pada djelovanje Kornjača vodi nasljednike u pregledu Westingove volje kako bi razjasnio svoje pravo značenje.,The dropping of the Koran leads to a review of Westing's will to clarify its true meaning.,Turtle's fall leads heirs to a review of Westing's will to get rid of his true significance.,"Działanie żółwia prowadzi spadkobierców do przeglądu woli Westinga, by wyjaśnić swoje prawdziwe znaczenie.",Opadanje kornjače vodi nasljednike u pregledu volje Westing da se riješi pravog značenja.
4,"Średnia ocena naszych byłych uczniów, na pytanie o ich cały pobyt w szkole w Frankfurt","Prosječna ocjena od naših bivših učenika, kada smo ih pitali za njihov ukupan boravak u školi u Frankfurt","Średnia ocena naszych byłych uczniów, kiedy pytaliśmy o ich uczęszczanie do szkoły w Frankfurt.","Średnja ocena naših bivših studenata, na pitanje o njihovom cijelom stayu u školi u Frankfurtu","Średnia ocena naszych byłych uczniów, kiedy zapytaliśmy ich o ich wspólne pobyt w szkole w Frankfurt","Prosječna procjena naših bivših učenika, na pitanje o njihovom cijelom boravku u školi u Frankfurtu"
