<a href="https://colab.research.google.com/github/francescodisalvo05/polito-deep-nlp/blob/main/Labs/Lab_06__Machine_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**


---


**Teaching Assistant:** Moreno La Quatra

**Practice 6:** Machine Translation

## **Machine Translation**

Machine translation is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.

![](https://www.deepl.com/img/press/desktop_ENIT_2020-01.png)

In this practice you will use data collections provided by [Tatoeba](https://tatoeba.org/).


In [1]:
%%capture
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P6/train_it_en.tsv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P6/test_it_en.tsv

### **Question 1: data reading**

Read the data collection and store it in your preferred data structure. It will be used in the subsequent steps.

Store the train and test sets in separate data objects.

In [None]:
# Your code here

In [27]:
# read train data
train_it, train_en = [], []
with open('train_it_en.tsv', 'r', encoding='utf-8') as train_file:
  next(train_file, None)  # skip header
  for line in train_file:
    fields = line.rstrip('\n').split('\t')

    # a well-formed row has 5 tab-separated fields:
    # the Italian sentence is the third, the English one the fifth
    # two rows split into 7 fields (they probably contain a tab in the text)
    # - skip them
    if len(fields) == 5:
      _, _, curr_it, _, curr_en = fields
      train_it.append(curr_it)
      train_en.append(curr_en.strip())

In [34]:
# read test data
test_it, test_en = [], []
with open('test_it_en.tsv', 'r', encoding='utf-8') as test_file:
  next(test_file, None)  # skip header
  for line in test_file:
    fields = line.rstrip('\n').split('\t')

    # same check as for the train set: keep only rows with
    # exactly 5 tab-separated fields and skip the malformed ones
    if len(fields) == 5:
      _, _, curr_it, _, curr_en = fields
      test_it.append(curr_it)
      test_en.append(curr_en.strip())  # remove surrounding whitespace

In [29]:
train_it[0], train_en[0]

('Li aiuteremo domani.', "We'll help them tomorrow.")

In [33]:
test_it[0], test_en[0]

('Non è mai capitato.', 'It never happened.')

### **Question 2: pretrained MT models**

[EasyNMT](https://github.com/UKPLab/EasyNMT) provides a simple wrapper around the HuggingFace transformers library for machine translation. Translate all test sentences from English to Italian and vice versa, and store the translations for both directions.

Note: the choice for the MT model is up to you.
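
A minimal sketch of the two translation passes, assuming the `easynmt` package is installed and that `test_it` and `test_en` are the lists built in Question 1. The helper names `translate_both_directions` and `load_default_model`, and the choice of the `opus-mt` model family, are illustrative assumptions rather than part of the assignment; any object exposing `translate(sentences, source_lang=..., target_lang=...)` would work.

```python
def translate_both_directions(model, test_it, test_en):
    """Translate the test set IT->EN and EN->IT with an EasyNMT-style model.

    `model` is assumed to expose translate(sentences, source_lang, target_lang),
    as EasyNMT models do. Returns one list of translations per direction.
    """
    return {
        "it_en": model.translate(test_it, source_lang="it", target_lang="en"),
        "en_it": model.translate(test_en, source_lang="en", target_lang="it"),
    }


def load_default_model():
    # Imported lazily so the helper above can be used without easynmt installed.
    from easynmt import EasyNMT

    # 'opus-mt' is one of the model families supported by EasyNMT
    # (an arbitrary choice for this sketch).
    return EasyNMT("opus-mt")
```

A possible call is `translations = translate_both_directions(load_default_model(), test_it, test_en)`, after which `translations["it_en"]` and `translations["en_it"]` hold the outputs needed for the evaluation step.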

In [None]:
# Your code here

In [10]:
!pip install -U easynmt


### **Question 3: BLEU scores**

Evaluate the selected MT model using the [BLEU evaluation metric](https://github.com/mjpost/sacrebleu). Report scores for both translation directions (`EN->IT`, `IT->EN`).
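
To make clear what the metric measures, here is a simplified, dependency-free sketch of corpus-level BLEU with a single reference per sentence. It is not sacrebleu's implementation: it assumes plain whitespace tokenization and no smoothing, so its numbers will generally differ from sacrebleu's. The function name `simple_corpus_bleu` is hypothetical. For the scores to report, `sacrebleu.corpus_bleu(hypotheses, [references]).score` is the call to use.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100): whitespace tokens, one reference, no smoothing."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Counter intersection clips each n-gram count to the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, a single order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

Calling the function once with the `IT->EN` outputs against `test_en` and once with the `EN->IT` outputs against `test_it` gives one score per direction. Because no smoothing is applied, very short test sets can score 0 even when many words match.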

In [None]:
!pip install sacrebleu

In [None]:
# Your code here

### **Question 4: finetuning Seq2Seq model (IT->EN)**

Use the [Trainer API](https://huggingface.co/transformers/training.html#fine-tuning-in-pytorch-with-the-trainer-api) to fine-tune and evaluate a [MarianMT](https://arxiv.org/pdf/1804.00344.pdf) sequence-to-sequence model for machine translation. The documentation for MarianMT is available [here](https://huggingface.co/transformers/model_doc/marian.html).

**Note 1:** select the pre-trained model according to the input-output language pair (it-en).

**Note 2:** for the lab practice, please use a subset of the training data.
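
For Note 2, a reproducible way to draw an aligned subset of the parallel data could look like the sketch below. The helper name `sample_parallel_subset`, the default seed, and the subset size in the usage line are assumptions made for illustration. For Note 1, one plausible it-en checkpoint on the HuggingFace Hub is `Helsinki-NLP/opus-mt-it-en`, though the choice remains open.

```python
import random


def sample_parallel_subset(src_sentences, tgt_sentences, n_samples, seed=42):
    """Draw the same random indices from both lists so that pairs stay aligned."""
    if len(src_sentences) != len(tgt_sentences):
        raise ValueError("source and target lists must have the same length")
    # Never ask for more pairs than are available.
    n_samples = min(n_samples, len(src_sentences))
    # A dedicated Random instance keeps the draw reproducible and
    # independent of the global random state.
    rng = random.Random(seed)
    indices = rng.sample(range(len(src_sentences)), n_samples)
    subset_src = [src_sentences[i] for i in indices]
    subset_tgt = [tgt_sentences[i] for i in indices]
    return subset_src, subset_tgt
```

For example, `small_it, small_en = sample_parallel_subset(train_it, train_en, 10000)` would yield the pairs to tokenize and pass to the trainer.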

In [None]:
# Your code here

### **Question 5: Model evaluation**

Evaluate the fine-tuned model on the test set provided with the practice. Compute and report the BLEU score for the translation model.
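
Generating translations for the whole test set in one call may not fit in memory, so a small batching helper can be useful. The sketch below is framework-agnostic: `translate_fn` is a hypothetical callable that maps a list of source sentences to a list of translations. With MarianMT it would typically tokenize the batch, call `model.generate`, and decode with `tokenizer.batch_decode(..., skip_special_tokens=True)`.

```python
def translate_in_batches(translate_fn, sentences, batch_size=32):
    """Apply translate_fn to consecutive slices of `sentences` and
    concatenate the results in the original order."""
    translations = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        translations.extend(translate_fn(batch))
    return translations
```

The resulting list is aligned with `test_it`, so it can be scored directly against `test_en`.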

In [None]:
# Your code here

### **Question 6: Seq2Seq model implementation (IT->EN) [BONUS]**

Implement a lightweight model for machine translation. It must be trainable on the Tatoeba train set available for the practice.

**NOTICE:** the goal is to create **your own network**, not to fine-tune an existing one. You can also leverage LSTM layers instead of transformers.

**Note 1:** The choice of the framework (e.g., Keras, Tensorflow, PyTorch) is up to you.

**Note 2:** You must write the architecture and training/evaluation procedures (please do not use out-of-the-box HF models).
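
Whatever framework is chosen, a network written from scratch needs its own vocabulary and integer encoding of the sentences. The sketch below shows one possible preprocessing step, assuming lowercased whitespace tokenization. The special tokens and the helper names `build_vocab` and `encode` are illustrative choices, not requirements of the assignment.

```python
from collections import Counter

PAD, SOS, EOS, UNK = "<pad>", "<sos>", "<eos>", "<unk>"


def build_vocab(sentences, min_freq=1):
    """Map each token seen at least min_freq times to an integer id.

    Ids 0-3 are reserved for the special tokens, in the order PAD, SOS, EOS, UNK.
    """
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    vocab = {tok: idx for idx, tok in enumerate([PAD, SOS, EOS, UNK])}
    # Sort tokens alphabetically so that ids are deterministic across runs.
    for tok, freq in sorted(counts.items()):
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab, max_len):
    """Convert a sentence to exactly max_len ids: SOS, tokens, EOS, then padding."""
    # Reserve two positions for SOS and EOS before truncating.
    tokens = sentence.lower().split()[:max_len - 2]
    ids = [vocab[SOS]] + [vocab.get(t, vocab[UNK]) for t in tokens] + [vocab[EOS]]
    return ids + [vocab[PAD]] * (max_len - len(ids))
```

Separate vocabularies would normally be built for Italian and English from the training sentences only, so that unseen test words map to `<unk>`.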

In [None]:
# Your code here