<a href="https://colab.research.google.com/github/francescodisalvo05/polito-deep-nlp/blob/main/Labs/Lab_06__Machine_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**


---


**Teaching Assistant:** Moreno La Quatra

**Practice 6:** Machine Translation

## **Machine Translation**

Machine translation is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.

![](https://www.deepl.com/img/press/desktop_ENIT_2020-01.png)

In this practice you will use data collections provided by [tatoeba](https://tatoeba.org/).


In [1]:
%%capture
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P6/train_it_en.tsv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P6/test_it_en.tsv

### **Question 1: data reading**

Read the data collection and store it into your preferred data structure. It will be used in the subsequent steps. 

Store train and test set into separate data objects.

In [None]:
# Your code here

In [27]:
# read train data
train_it, train_en = [], []
with open('train_it_en.tsv', 'r') as train_file:
  data = train_file.readlines()
  # skip header
  for line in data[1:]:
    fields = line.split("\t")

    # two rows split into 7 fields instead of 5,
    # probably because the text itself contains a tab: skip them
    if len(fields) == 5:
      _, _, curr_it, _, curr_en = fields
      train_it.append(curr_it)
      train_en.append(curr_en.strip())  # remove final \n

In [34]:
# read test data
test_it, test_en = [], []
with open('test_it_en.tsv', 'r') as test_file:
  data = test_file.readlines()
  # skip header
  for line in data[1:]:
    fields = line.split("\t")

    # rows that do not split into exactly 5 fields
    # (probably a tab inside the text) are skipped
    if len(fields) == 5:
      _, _, curr_it, _, curr_en = fields
      test_it.append(curr_it)
      test_en.append(curr_en.strip())  # remove final \n

In [29]:
train_it[0], train_en[0]

('Li aiuteremo domani.', "We'll help them tomorrow.")

In [33]:
test_it[0], test_en[0]

('Non è mai capitato.', 'It never happened.')
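
The two reading cells above differ only in the file name, so the logic can be factored into one function. The following is a minimal sketch under that assumption; `read_pairs` is a hypothetical helper name, not part of the practice material, and it assumes the same 5-column layout (Italian text in the third column, English text in the fifth).

```python
def read_pairs(path):
    """Read aligned (italian, english) sentence lists from a 5-column TSV file."""
    italian, english = [], []
    with open(path, "r", encoding="utf-8") as tsv_file:
        next(tsv_file, None)  # skip the header row
        for line in tsv_file:
            fields = line.rstrip("\n").split("\t")
            # Skip rows that do not have exactly 5 fields
            # (for example when the text itself contains a tab).
            if len(fields) != 5:
                continue
            italian.append(fields[2])
            english.append(fields[4].strip())
    return italian, english
```

With this helper, the two cells reduce to `train_it, train_en = read_pairs('train_it_en.tsv')` and `test_it, test_en = read_pairs('test_it_en.tsv')`.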

### **Question 2: pretrained MT models**

[EasyNMT](https://github.com/UKPLab/EasyNMT) provides a simple wrapper over the HuggingFace transformers library for machine translation. Translate all test sentences from English to Italian and vice versa. Store the translations in both directions.

Note: the choice for the MT model is up to you.

In [None]:
# Your code here

In [None]:
!pip install -U easynmt

In [81]:
from easynmt import EasyNMT

model = EasyNMT('mbart50_m2m')


In [82]:
# random trial

sentences = ['Ciao come stai?', 'Ciao! Io sto bene, tu?']

# Translate a list of Italian sentences to English
print(model.translate(sentences, target_lang='en'))

['Hi. How are you?', "Hi! I'm okay, you?"]


In [84]:
%%time
pred_test_en = model.translate(test_it, target_lang='en') 

CPU times: user 11min 18s, sys: 2.1 s, total: 11min 20s
Wall time: 11min 17s


In [93]:
%%time
pred_test_it = model.translate(test_en, target_lang='it')

CPU times: user 10min 46s, sys: 2.14 s, total: 10min 48s
Wall time: 10min 46s


### **Question 3: BLEU scores**

Evaluate the selected MT model using the [BLEU evaluation metric](https://github.com/mjpost/sacrebleu). Report scores for both translation directions (`EN->IT`, `IT->EN`).

In [87]:
!pip install sacrebleu

Collecting sacrebleu
  Downloading sacrebleu-2.0.0-py3-none-any.whl (90 kB)
[?25l[K     |███▋                            | 10 kB 28.3 MB/s eta 0:00:01[K     |███████▏                        | 20 kB 23.1 MB/s eta 0:00:01[K     |██████████▉                     | 30 kB 16.7 MB/s eta 0:00:01[K     |██████████████▍                 | 40 kB 14.5 MB/s eta 0:00:01[K     |██████████████████              | 51 kB 5.5 MB/s eta 0:00:01[K     |█████████████████████▋          | 61 kB 5.9 MB/s eta 0:00:01[K     |█████████████████████████▎      | 71 kB 5.5 MB/s eta 0:00:01[K     |████████████████████████████▉   | 81 kB 6.0 MB/s eta 0:00:01[K     |████████████████████████████████| 90 kB 4.0 MB/s 
Collecting colorama
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting portalocker
  Downloading portalocker-2.3.2-py2.py3-none-any.whl (15 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.4 portalocker-2.3.2 sacrebleu-2.

In [None]:
# Your code here

In [90]:
from sacrebleu.metrics import BLEU

bleu = BLEU()

In [94]:
# bleu.corpus_score(hyps, refs)
# - hyps : list of hypotheses (the model translations)
# - refs : list of lists of references, one inner list per reference set
#          here we have just one ground truth per sentence
result_it = bleu.corpus_score(pred_test_it, [test_it])
print(result_it)

BLEU = 46.45 73.4/53.7/41.8/33.3 (BP = 0.960 ratio = 0.961 hyp_len = 34738 ref_len = 36166)


In [92]:
result_en = bleu.corpus_score(pred_test_en, [test_en])
print(result_en)

BLEU = 60.22 80.4/64.9/54.5/46.3 (BP = 1.000 ratio = 1.000 hyp_len = 37227 ref_len = 37241)
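
To read the two output lines above: the four slash-separated numbers are the 1- to 4-gram precisions (in percent), `BP` is the brevity penalty, and the BLEU score is the brevity penalty times the geometric mean of the precisions. The sketch below recombines the printed components using the standard BLEU formula; `bleu_from_components` is a hypothetical helper written for illustration, not a sacrebleu function. Since the printed precisions are rounded to one decimal, the results should only match the reported scores approximately.

```python
import math

def bleu_from_components(precisions_pct, hyp_len, ref_len):
    """Recombine n-gram precisions (in percent) and lengths into a corpus BLEU score."""
    # Brevity penalty: 1 when the hypothesis is at least as long as the reference,
    # otherwise exp(1 - ref_len / hyp_len).
    bp = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    # Geometric mean of the precisions, computed in log space.
    log_mean = sum(math.log(p / 100.0) for p in precisions_pct) / len(precisions_pct)
    return 100.0 * bp * math.exp(log_mean)

# Values copied from the two sacrebleu outputs above.
en_it = bleu_from_components([73.4, 53.7, 41.8, 33.3], hyp_len=34738, ref_len=36166)
it_en = bleu_from_components([80.4, 64.9, 54.5, 46.3], hyp_len=37227, ref_len=37241)
print(f"EN->IT: {en_it:.2f}  IT->EN: {it_en:.2f}")
```

The `EN->IT` translations are about 4% shorter than the references, which is why that direction is penalized (`BP = 0.960`) while `IT->EN` is not.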


### **Question 4: finetuning Seq2Seq model (IT->EN)**

Exploit the [Trainer API](https://huggingface.co/transformers/training.html#fine-tuning-in-pytorch-with-the-trainer-api) to finetune and evaluate a [MarianMT](https://arxiv.org/pdf/1804.00344.pdf) sequence to sequence model for machine translation. The documentation for MarianMT is available [here](https://huggingface.co/transformers/model_doc/marian.html).

**Note 1:** select the pre-trained model according to the input-output pair (it-en)

**Note 2:** for the lab practice, please use a sub-set of the training data.
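
One possible way to approach this question is sketched below; it is not a reference solution. The assumptions are: the `Helsinki-NLP/opus-mt-it-en` checkpoint as the pre-trained it-en model, a recent `transformers` release (for the `text_target` tokenizer argument) with `sentencepiece` installed, and illustrative hyperparameters that were not tuned. `sample_pairs` and `finetune_it_en` are hypothetical helper names.

```python
import random

def sample_pairs(src, tgt, n, seed=0):
    """Return a reproducible random subset of at most n aligned sentence pairs."""
    idx = random.Random(seed).sample(range(len(src)), min(n, len(src)))
    return [src[i] for i in idx], [tgt[i] for i in idx]

def finetune_it_en(src, tgt, model_name="Helsinki-NLP/opus-mt-it-en",
                   out_dir="marian-it-en-ft"):
    """Fine-tune a MarianMT model on Italian -> English pairs (sketch)."""
    # Imported lazily so that sample_pairs works without transformers installed.
    from transformers import (DataCollatorForSeq2Seq, MarianMTModel,
                              MarianTokenizer, Seq2SeqTrainer,
                              Seq2SeqTrainingArguments)

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    # Tokenize sources as inputs and targets as labels.
    enc = tokenizer(src, text_target=tgt, truncation=True, max_length=64)
    features = [{k: v[i] for k, v in enc.items()} for i in range(len(src))]

    # Assumed, untuned hyperparameters.
    args = Seq2SeqTrainingArguments(
        output_dir=out_dir,
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=2e-5,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=features,
        # Pads inputs and labels dynamically for each batch.
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()
    return model, tokenizer
```

A possible usage, following Note 2, is `sub_it, sub_en = sample_pairs(train_it, train_en, 5000)` followed by `model, tokenizer = finetune_it_en(sub_it, sub_en)`; the subset size of 5000 is an arbitrary choice.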

In [None]:
# Your code here

### **Question 5: Model evaluation**

Evaluate the fine-tuned model on the test set provided with the practice. Compute and report the BLEU score for the translation model.

In [None]:
# Your code here

### **Question 6: Seq2Seq model implementation (IT->EN) [BONUS]**

Implement a lightweight model for machine translation. It must be trainable on the train set of tatoeba available for the practice.

**NOTICE:** the goal is to create **your own network**, not to finetune an existing one. You can also leverage LSTM layers instead of transformers.

**Note 1:** The choice of the framework (e.g., Keras, Tensorflow, PyTorch) is up to you.

**Note 2:** You must write the architecture and training/evaluation procedures (please do not use out-of-the-box HF models).

In [None]:
# Your code here