# Machine Translation with mBART-50

**Run this notebook on the [GPU Hub](https://gpuhub.labservices.ch/) or [Google Colab](https://colab.research.google.com/) to make use of a GPU for faster inference.**

In this exercise, we use the fine-tuned mBART-50 model to translate sentences from the transcripts of European Parliament debates.

In [None]:
%pip install -q tqdm
%pip install -q torch
%pip install -q sentencepiece
%pip install -q transformers
%pip install -q evaluate

In [None]:
import tqdm
import torch
import evaluate

## Data

The European Parliament has a nice corpus of parallel sentences from its proceedings that is [open-sourced](https://statmt.org/europarl/). We could download the German-English transcript pair and extract the files as shown below. Instead, we will use the shortened files on Ilias.

In [None]:
# !wget -N https://statmt.org/europarl/v7/de-en.tgz
# !tar xzf de-en.tgz

The files have one sentence per line. Each line in one file corresponds to the same line in the other file, i.e. the files are *parallel*. If a line in one file is empty, the sentence on that line of the other file has no corresponding translation (for example, see line 22).

The corpus website recommends removing the pairs where one of the two lines is empty. It also suggests removing lines that contain XML tags (starting with "<").
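
The filtering rules above can be sketched as follows. This is only one possible approach, not the reference solution: `filter_parallel_lines` is a hypothetical helper that works on any two iterables of lines (for example two open file handles), and the sample lines are made up for illustration. It assumes that the first `lines_to_read` pairs are read before filtering, so fewer pairs may be returned.

```python
from itertools import islice


def filter_parallel_lines(lines1, lines2, lines_to_read=200):
    """Return two aligned lists of sentences from two parallel line iterables.

    Hypothetical helper: reads at most `lines_to_read` line pairs, then drops
    pairs where either side is empty or starts with an XML tag ("<").
    """
    kept1, kept2 = [], []
    for line1, line2 in islice(zip(lines1, lines2), lines_to_read):
        line1, line2 = line1.strip(), line2.strip()
        if not line1 or not line2:
            continue  # one side has no translation
        if line1.startswith("<") or line2.startswith("<"):
            continue  # XML markup rather than a sentence
        kept1.append(line1)
        kept2.append(line2)
    return kept1, kept2


# Made-up sample lines, not taken from the corpus files.
en_lines = ["Resumption of the session\n", "<CHAPTER ID=1>\n", "\n", "Thank you.\n"]
de_lines = ["Wiederaufnahme der Sitzungsperiode\n", "<CHAPTER ID=1>\n", "Danke.\n", "Vielen Dank.\n"]
print(filter_parallel_lines(en_lines, de_lines))
```

To use the helper on the exercise files, the two files could be opened with `open(path, encoding="utf-8")` and the handles passed in place of the lists.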

In [None]:
def read_parallel_sentences(path1, path2, lines_to_read=200):
    """Reads the first `lines_to_read` lines of text from two files.
    Removes line pairs where either file has an empty line.
    Removes line pairs where either line starts with an XML tag."""
    # TODO

sents_en, sents_de = read_parallel_sentences('europarl-v7.de-en.en.txt', 'europarl-v7.de-en.de.txt')

## Model

We load the `facebook/mbart-large-50-many-to-many-mmt` model from the [Hugging Face model hub](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt).

In [None]:
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
model.eval()  # put the model into evaluation mode
if torch.cuda.is_available():
    model.to('cuda')  # move the model to GPU

## Translate

We now translate the sentences in both directions. Adapt the example from the documentation on the model hub: set the source language on the tokenizer, then call the model's `generate` method with the target language code to produce the translation.
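
A minimal sketch of that pattern is shown here, as one possible approach rather than the reference solution. `translate_sketch` is a hypothetical variant of `translate` that receives the tokenizer and model as arguments instead of reading the notebook's globals. It looks up the target language id with `convert_tokens_to_ids`, on the assumption that the language codes are ordinary vocabulary tokens; the model card uses `tokenizer.lang_code_to_id` for the same purpose.

```python
def translate_sketch(sentence, from_code, to_code, tokenizer, model):
    """Translate one sentence with an mBART-50 style tokenizer and model.

    Hypothetical variant of `translate` that takes the tokenizer and model
    as arguments instead of reading them from the notebook's globals.
    """
    # Tell the tokenizer which language code to attach to the source text.
    tokenizer.src_lang = from_code
    # Tokenize and move the tensors to the device the model lives on.
    encoded = tokenizer(sentence, return_tensors="pt").to(model.device)
    # Force the decoder to start with the target language code.
    target_id = tokenizer.convert_tokens_to_ids(to_code)
    generated = model.generate(**encoded, forced_bos_token_id=target_id)
    # Decode the single generated sequence back into text.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

With the objects loaded earlier, a call would look like `translate_sketch("Thank you.", "en_XX", "de_DE", tokenizer, model)`. Translating one sentence per call keeps the code simple; batching several sentences would be faster but requires padding.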

In [None]:
def translate(sentence, from_code, to_code):
    """Translates `sentence` from the language `from_code` into the language `to_code`."""
    # TODO

translated_to_de = [translate(sent_en, 'en_XX', 'de_DE') for sent_en in tqdm.tqdm(sents_en)]
translated_to_en = [translate(sent_de, 'de_DE', 'en_XX') for sent_de in tqdm.tqdm(sents_de)]

In [None]:
print(translated_to_en[:5])

## Evaluation

We evaluate the translations against the references with the BLEU score, which is standard in machine translation. We use the [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu) from Hugging Face's [evaluate library](https://github.com/huggingface/evaluate).
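
The scoring step could be sketched as follows, assuming one reference translation per sentence. `bleu_score` and its `metric` parameter are hypothetical names introduced for illustration; by default the function loads the BLEU metric with `evaluate.load("bleu")`, which needs the `evaluate` package and network access on first use.

```python
def bleu_score(predictions, references, metric=None):
    """Corpus-level BLEU for one reference per prediction.

    `metric` is a hypothetical hook for passing an already loaded metric;
    by default the BLEU metric is loaded through `evaluate`.
    """
    if metric is None:
        import evaluate  # imported lazily so the function is easy to test
        metric = evaluate.load("bleu")
    # The metric accepts several references per prediction, so each single
    # reference is wrapped in its own list.
    wrapped = [[ref] for ref in references]
    result = metric.compute(predictions=predictions, references=wrapped)
    return result["bleu"]
```

In this notebook, the two directions would be scored with `bleu_score(translated_to_en, sents_en)` and `bleu_score(translated_to_de, sents_de)`.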

Interestingly, the model is better at translating into English than into German (this holds for other languages as well). A big factor is that more English data was available for the mBART-50 model to learn the structure of the English language. Additionally, since BLEU is based on exact n-gram overlap between words, German is a harder target than English: noun compounds and richer morphology (e.g. verb conjugation) mean that a translation can be acceptable and still share few exact word forms with the reference.
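
The effect of exact matching can be illustrated with clipped n-gram precision, the quantity that BLEU combines over several n-gram orders together with a brevity penalty. The sketch below is a simplified illustration with whitespace tokenization, not the implementation used by `evaluate`, and the sentence pairs are made up: the German hypothesis splits a compound that the reference writes as one word, so fewer tokens match.

```python
from collections import Counter


def ngram_precision(hypothesis, reference, n=1):
    """Clipped n-gram precision of a whitespace-tokenized hypothesis."""
    hyp_tokens, ref_tokens = hypothesis.split(), reference.split()
    hyp_ngrams = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram counts at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[ngram]) for ngram, count in hyp_ngrams.items())
    return matched / total


# Made-up sentence pairs; each hypothesis conveys the meaning of its reference.
print(ngram_precision("the agenda of the session was adopted",
                      "the agenda for the session was adopted"))
print(ngram_precision("die Tagesordnung der Sitzung wurde angenommen",
                      "die Sitzungstagesordnung wurde angenommen"))
```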