If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Uncomment the following cell and run it. We also use the `sacrebleu` and `sentencepiece` libraries; you may need to install these even if you already have 🤗 Transformers!

In [None]:
#! pip install transformers[sentencepiece] datasets
#! pip install sacrebleu sentencepiece
#! pip install huggingface_hub

In [None]:
pip install tensorflow==2.9

In [1]:
pip freeze > requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [154]:
pip show evaluate

Name: evaluate
Version: 0.3.0
Summary: HuggingFace community-driven open-source library of evaluation
Home-page: https://github.com/huggingface/evaluate
Author: HuggingFace Inc.
Author-email: leandro@huggingface.co
License: Apache 2.0
Location: /Users/egoliakova/opt/anaconda3/lib/python3.8/site-packages
Requires: pandas, requests, xxhash, multiprocess, fsspec, dill, numpy, tqdm, huggingface-hub, datasets, packaging, responses
Required-by: 
Note: you may need to restart the kernel to use updated packages.


If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then run the following cell and enter your token:

In [1]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (osxkeychain).
Your token has been saved to /Users/egoliakova/.huggingface/token
Login successful


Then you need to install Git-LFS and set up Git if you haven't already. Uncomment the following instructions and adapt them with your name and email:

In [None]:
# !apt install git-lfs
# !git config --global user.email "you@example.com"
# !git config --global user.name "Your Name"

Make sure your version of Transformers is at least 4.16.0 since some of the functionality we use was introduced in that version:

In [1]:
import transformers

print(transformers.__version__)

4.21.0


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

# Convert CSV-file to a dataset-ready format

The function below converts a specifically formatted CSV into a train/test dataset.
Your CSV should have at least two columns, `en` and `xx`, where `xx` is the code of the target language.

If the CSV file has part-of-speech (PoS) tags for the source and target languages, the expected column names for them are `pos_en` and `pos_xx`.

If the CSV file has word alignment (WA) tags, the expected column name is `wa`.
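
To make the expected layout concrete, here is a minimal sketch that uses only the standard library. The two rows, their tags and the alignment strings are made up for illustration; the column names follow the description above with `fr` as the target language code.

```python
import csv
import io

# Hypothetical two-row CSV in the layout described above (target language code "fr").
csv_text = (
    "en,fr,pos_en,pos_fr,wa\n"
    "Hello world,Bonjour le monde,Hello INTJ world NOUN,Bonjour INTJ le DET monde NOUN,0-0 1-2\n"
    "Thank you,Merci,Thank VERB you PRON,Merci INTJ,0-0\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Each row becomes one record with a nested `translation` dict,
# plus the optional `pos` and `wa` fields.
records = [
    {
        "translation": {"en": row["en"], "fr": row["fr"]},
        "pos": {"en": row["pos_en"], "fr": row["pos_fr"]},
        "wa": row["wa"],
    }
    for row in rows
]

print(records[0]["translation"])
print(records[0]["wa"])
```

Each pair in the `wa` string reads as `source_index-target_index`, so `1-2` links the second English word to the third French word.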

In [5]:
import pandas as pd
from datasets import Dataset


# source_lang: accepted value is 'en'
# target_lang: accepted values are 'fr' or 'zh'
# pos_tags=True: the file has PoS tags for both languages
# wa_tags=True: the file has WA tags
# store=True is meant to create a JSON dump of the file for later use;
# note that the function body below does not use this flag

def csv_to_dataset(filename, source_lang, target_lang, pos_tags=False, wa_tags=False, store=False):
    data = pd.read_csv(filename)
    new_df = pd.DataFrame()
    new_df['translation'] = [{source_lang: x, target_lang: y} for x, y in zip(data[source_lang], data[target_lang])]
    if pos_tags:
        new_df['pos'] = [{source_lang: x, target_lang: y} for x, y in zip(data[f'pos_{source_lang}'], data[f'pos_{target_lang}'])]
    if wa_tags:
        new_df['wa'] = data['wa']
    return Dataset.from_pandas(new_df).train_test_split(test_size=0.2)

In [11]:
from datasets import load_from_disk

loaded_dataset = load_from_disk('dataset_split.hf')

In [15]:
loaded_dataset.remove_columns(['pos', 'wa'])

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 1508
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 378
    })
})

# Fine-tuning a model on a translation task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a translation task. We will use the [WMT dataset](http://www.statmt.org/wmt16/), a machine translation dataset composed of a collection of various sources, including news commentaries and parliament proceedings.

![Widget inference on a translation task](images/translation.png)

We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using Keras.

In [18]:
model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`Helsinki-NLP/opus-mt-en-fr`](https://huggingface.co/Helsinki-NLP/opus-mt-en-fr) checkpoint.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to handle the data and the 🤗 Evaluate library to get the metric we need for evaluation (to compare our model to the benchmark). Datasets can be downloaded with the `datasets` function `load_dataset`, and metrics are loaded with the `evaluate` function `load`. Here the English/French data was already loaded from disk above as `loaded_dataset`, so only the `sacrebleu` metric is loaded in the next cell.

In [19]:
from datasets import load_dataset
from evaluate import load

metric = load("sacrebleu")

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [20]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [23]:
show_random_elements(loaded_dataset["train"])

Unnamed: 0,translation,pos,wa
0,"{'en': 'Christian Matras died on 16 October 1988.', 'fr': 'Christian Matras mourut le 16 octobre 1988.'}","{'en': 'Christian PROPN Matras PROPN died VERB on ADP 16 NUM October PROPN 1988 NUM . PUNCT ', 'fr': 'Christian PROPN Matras PROPN mourut PROPN le DET 16 NUM octobre NOUN 1988 NUM . PUNCT '}",0-0 1-1 2-2 3-3 4-4 5-5 6-6 7-7
1,"{'en': 'Choekyi Gyaltsen 10th Panchen Lama himself declared as cited by an official Chinese review that ""according to Tibetan tradition the confirmation of either the Dalai or Panchen must be mutually recognized.""', 'fr': 'Cependant le 10e panchen-lama lui-même avait fait une déclaration qui fut citée dans une publication officielle chinoise « Selon l'histoire tibétaine la confirmation du dalaï-lama ou du panchen-lama doit être mutuellement reconnue ».'}","{'en': 'Choekyi PROPN Gyaltsen PROPN 10th ADJ Panchen PROPN Lama PROPN himself PRON declared VERB as SCONJ cited VERB by ADP an DET official ADJ Chinese ADJ review NOUN that SCONJ "" PUNCT according VERB to ADP Tibetan ADJ tradition NOUN the DET confirmation NOUN of ADP either CCONJ the DET Dalai PROPN or CCONJ Panchen PROPN must AUX be AUX mutually ADV recognized VERB . PUNCT "" PUNCT ', 'fr': 'Cependant ADV le DET 10e PROPN panchen PROPN - PROPN lama PROPN lui-même PRON avait AUX fait VERB une DET déclaration NOUN qui PRON fut AUX citée VERB dans ADP une DET publication NOUN officielle ADJ chinoise ADJ « ADJ Selon ADP l' DET histoire NOUN tibétaine ADJ la DET confirmation NOUN du ADP dalaï NOUN - NOUN lama NOUN ou CCONJ du ADP panchen PROPN - PROPN lama PROPN doit VERB être AUX mutuellement ADV reconnue VERB » PUNCT . PUNCT '}",0-1 0-2 1-4 2-2 3-3 4-5 5-6 6-8 6-10 7-12 8-13 9-14 10-15 11-17 12-18 13-16 15-19 16-20 17-21 18-23 19-22 20-24 21-25 22-31 24-26 25-27 25-29 26-30 27-32 27-33 28-35 29-36 30-37 31-38 32-40 33-39
2,"{'en': 'The Roland Hayes Committee was formed in 1990 to advocate the induction of Roland Hayes into the Georgia Music Hall of Fame.', 'fr': 'Le Roland Hayes Committee a été formé en 1990 pour défendre l’implication de Roland Hayes dans le Georgia Music Hall of Fame.'}","{'en': 'The DET Roland PROPN Hayes PROPN Committee PROPN was AUX formed VERB in ADP 1990 NUM to PART advocate VERB the DET induction NOUN of ADP Roland PROPN Hayes PROPN into ADP the DET Georgia PROPN Music PROPN Hall PROPN of ADP Fame PROPN . PUNCT ', 'fr': 'Le DET Roland PROPN Hayes PROPN Committee NOUN a AUX été AUX formé VERB en ADP 1990 NUM pour ADP défendre VERB l’ SPACE implication NOUN de ADP Roland PROPN Hayes VERB dans ADP le DET Georgia PROPN Music PROPN Hall PROPN of ADP Fame PROPN . PUNCT '}",0-0 1-1 2-2 3-3 4-4 4-5 5-6 6-7 7-8 8-9 9-10 10-11 11-12 12-13 13-14 14-15 15-16 16-17 17-18 18-19 19-20 20-21 21-22 22-23
3,"{'en': 'This meant that equipment could be maintained modified and regularly updated by the company.', 'fr': 'Cela permet à l'entreprise de maintenir de mettre à jour et de modifier l'équipement à son gré.'}","{'en': 'This PRON meant VERB that DET equipment NOUN could AUX be AUX maintained VERB modified ADJ and CCONJ regularly ADV updated VERB by ADP the DET company NOUN . PUNCT ', 'fr': 'Cela PRON permet VERB à ADP l' DET entreprise NOUN de ADP maintenir VERB de ADP mettre VERB à ADP jour NOUN et CCONJ de ADP modifier VERB l' DET équipement VERB à ADP son DET gré NOUN . PUNCT '}",0-0 1-1 2-2 3-15 4-5 6-6 7-13 8-11 10-13 11-16 12-3 13-4 14-19
4,"{'en': 'The abdomen has a scattering of long pinkish-grey scales and a triple dorsal line.', 'fr': 'L'abdomen a une dispersion de longues écailles gris rosé et une triple ligne dorsale.'}","{'en': 'The DET abdomen NOUN has VERB a DET scattering NOUN of ADP long ADJ pinkish NOUN - PUNCT grey NOUN scales NOUN and CCONJ a DET triple ADJ dorsal NOUN line NOUN . PUNCT ', 'fr': 'L' DET abdomen NOUN a VERB une DET dispersion NOUN de ADP longues ADJ écailles ADJ gris VERB rosé VERB et CCONJ une DET triple ADJ ligne NOUN dorsale ADJ . PUNCT '}",0-0 1-1 2-2 3-3 4-4 5-5 6-6 7-9 9-8 10-7 11-10 12-11 13-12 14-14 15-13 15-14 16-15


## Preprocessing the data

In [24]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

For an mBART tokenizer, we need to set the source and target languages (so the texts are preprocessed properly). The Marian checkpoint picked above does not match the check in the next cell, so that cell changes nothing here. You can check the language codes [here](https://huggingface.co/facebook/mbart-large-cc25) if you are using this notebook with an mBART checkpoint on a different pair of languages.

In [25]:
if "mbart" in model_checkpoint:
    tokenizer.src_lang = "en-XX"
    tokenizer.tgt_lang = "fr-FR"

In [26]:
tokenizer(["Hello, this is a sentence!", "This is another sentence."])

{'input_ids': [[10537, 2, 67, 32, 15, 5776, 145, 0], [160, 32, 1036, 5776, 3, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

Later on, for the word alignment (WA) encoding, we will use these token ids instead of the words themselves to express how words in the two sentences relate to each other.

In [27]:
import spacy

en_pos_sp = spacy.load("en_core_web_sm")
fr_pos_sp = spacy.load('fr_core_news_sm')

If you are using one of the five T5 checkpoints that require a special prefix to put before the inputs, you should adapt the following cell.

In [28]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "translate English to French: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model. Padding is dealt with later on (in a data collator), so examples are padded to the longest length in the batch and not in the whole dataset.
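
The difference between truncating at preprocessing time and padding per batch can be sketched in plain Python. This is a simplified illustration, not what the tokenizer or collator do internally: it ignores special tokens, and the helper names, token ids and `pad_id=0` are made up.

```python
def truncate(ids, max_length):
    """Keep at most `max_length` token ids."""
    return ids[:max_length]


def pad_batch(batch, pad_id):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch]


batch = [[5, 6, 7, 8, 9, 10], [11, 12], [13, 14, 15]]

# Step 1 (preprocessing): cut sequences that are too long.
truncated = [truncate(ids, max_length=4) for ids in batch]

# Step 2 (collation): pad to the longest sequence in the batch.
padded = pad_batch(truncated, pad_id=0)
print(padded)  # [[5, 6, 7, 8], [11, 12, 0, 0], [13, 14, 15, 0]]
```

Padding per batch keeps short batches short, which is why the padding step is left to the collator rather than done once for the whole dataset.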

For PoS tags we will use a separate function that will parse the sentences and extract the PoS information from there.

In [29]:
"""
get_pos_tags receives a tokenized input from the model. The tokenization is a bit different from spacy model,
so to keep the same dimensions of vectors as in the sentence embeddings for each token in a sentence we will:
- decode the token received from the model
- get a Part of Speech id for it from Spacy and return it
"""

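def token_to_pos(token, lang):
    # Decode a single token id back to text and tag it with the spaCy model for `lang`.
    if lang == 'en':
        decoded = list(en_pos_sp(tokenizer.decode(token)))
    elif lang == 'fr':
        decoded = list(fr_pos_sp(tokenizer.decode(token)))
    else:
        raise ValueError(f"Unsupported language: {lang!r}")
    # `.pos` is spaCy's integer PoS id; -1 marks tokens that decode to nothing.
    return decoded[-1].pos if decoded else -1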

def get_pos_tags(tokenized_sent, lang):
    return list(map(lambda x: token_to_pos(x, lang), tokenized_sent))

Word alignment information can be encoded in different ways:
- Name: `trg-ids`. Create a zero vector of the same length as the tokenized input sentence. For each position i that has a connected word in the target sentence, put the token id of that target word at position i of the new vector.
- Name: `sums`. Create a copy of the tokenized input vector. For each i-th word that has a connected word in the target sentence, add the token id of the target word to the value at the i-th position of the new vector.
- Name: `mult`. Same as `sums`, but the values are multiplied instead of added.
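
The three encodings can be tried out on toy data. The sketch below is a simplified stand-in for the notebook's `encode_wa`, not a copy of it: the helper names and token ids are made up, alignment indices are used directly as token positions, and pairs that point outside either sequence are skipped.

```python
def parse_alignment(wa):
    """Turn a string such as "0-0 1-2" into (source, target) index pairs."""
    return [tuple(int(i) for i in pair.split("-")) for pair in wa.split()]


def encode(src_ids, trg_ids, wa, wa_type):
    # trg-ids starts from zeros; sums and mult start from a copy of the input.
    out = [0] * len(src_ids) if wa_type == "trg-ids" else list(src_ids)
    for s, t in parse_alignment(wa):
        if s >= len(src_ids) or t >= len(trg_ids):
            continue  # ignore pairs that point outside either sequence
        if wa_type == "trg-ids":
            out[s] = trg_ids[t]
        elif wa_type == "sums":
            out[s] += trg_ids[t]
        elif wa_type == "mult":
            out[s] *= trg_ids[t]
    return out


src_ids = [10, 20, 30]  # made-up source token ids
trg_ids = [7, 8, 9, 4]  # made-up target token ids
wa = "0-0 1-2"          # source 0 <-> target 0, source 1 <-> target 2

print(encode(src_ids, trg_ids, wa, "trg-ids"))  # [7, 9, 0]
print(encode(src_ids, trg_ids, wa, "sums"))     # [17, 29, 30]
print(encode(src_ids, trg_ids, wa, "mult"))     # [70, 180, 30]
```

Position 2 has no aligned target word, so it stays at 0 for `trg-ids` and keeps its original id for `sums` and `mult`.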

In [30]:
from copy import deepcopy

def encode_wa(tokenized_input, tokenized_target, wa, wa_type):
    wa_dict = {int(src): int(trg) for src, trg in map(lambda x: x.split('-'), wa.split())}
    n = len(tokenized_input)
    m = len(tokenized_target)
    if wa_type == 'trg-ids':
        wa_emb = [0]*n
        for k, v in wa_dict.items():
            if k >= n or v >= m:
                break
            wa_emb[k] = tokenized_target[v]
        return wa_emb
    elif wa_type == 'sums':
        wa_emb = deepcopy(tokenized_input)
        for k, v in wa_dict.items():
            if k >= n or v >= m:
                break
            wa_emb[k] += tokenized_target[v]
        return wa_emb
    
    elif wa_type == 'mult':
        wa_emb = deepcopy(tokenized_input)
        for k, v in wa_dict.items():
            if k >= n or v >= m:
                break
            wa_emb[k] *= tokenized_target[v]
        return wa_emb

In [31]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "fr"
pos_tags=False
wa_type=None

def preprocess_function(dataset):
    global source_lang, target_lang, pos_tags, wa_type
    inputs = [prefix + d[source_lang] for d in dataset["translation"]]
    targets = [d[target_lang] for d in dataset["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    
    if pos_tags:
        model_inputs['pos'] = [get_pos_tags(x, 'en') for x in model_inputs['input_ids']]
        model_inputs['target_pos'] = [get_pos_tags(y, 'fr') for y in model_inputs['labels']]
        
    if wa_type:
        model_inputs['wa'] = [encode_wa(src, trg, wa, wa_type) for src, trg, wa \
                              in zip(model_inputs['input_ids'],  model_inputs['labels'], dataset["wa"])]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key.

To apply this function to all the pairs of sentences in our dataset, we just use the `map` method of the dataset object we loaded earlier. This applies the function to all the elements of all the splits in the dataset, so our training and test data are preprocessed in one single command.

In [32]:
from transformers.keras_callbacks import KerasMetricCallback
import numpy as np


def metric_fn(eval_predictions):
    preds, labels = eval_predictions
    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds
    ]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # We use -100 to mask labels - replace it with the tokenizer pad token when decoding
    # so that no output is emitted for these
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    result["gen_len"] = np.mean(prediction_lens)
    return result

In [33]:
from transformers.keras_callbacks import PushToHubCallback
from tensorflow.keras.callbacks import TensorBoard
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
from transformers import AdamWeightDecay
import tensorflow as tf

## Fine-tuning the model with no extra tags

In [34]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "fr"
pos_tags=False
wa_tags = False
wa_type = None

split_dataset = loaded_dataset.remove_columns(['pos', 'wa'])


no_anno_dataset = split_dataset.map(preprocess_function, batched=True)

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [35]:
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at Helsinki-NLP/opus-mt-en-fr were not used when initializing TFMarianMTModel: ['final_logits_bias']
- This IS expected if you are initializing TFMarianMTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMarianMTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-fr.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


Note that, unlike in our classification example, we don't get a warning about randomly initialized weights. All the layers of the model were initialized from the pretrained checkpoint, and there is no randomly initialized head in this case.

Next we set some parameters like the learning rate and the `batch_size` and customize the weight decay.

The last two lines, commented out here, set up the model id used to push the model to the [Hub](https://huggingface.co/models) at the end of training. Leave them commented out if you didn't follow the installation steps at the top of the notebook; otherwise uncomment them and change the value of `push_to_hub_model_id` to something you prefer.

In [36]:
batch_size = 16
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 1

#model_name = model_checkpoint.split("/")[-1]
#push_to_hub_model_id = f"{model_name}-finetuned-{source_lang}-to-{target_lang}"

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are multi-framework, so make sure you set `return_tensors='tf'` so you get `tf.Tensor` objects back and not something else!
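
Label padding deserves a small illustration. The sketch below assumes the padded label positions are filled with `-100`, the value that `metric_fn` above swaps back to the pad token before decoding. The helper names, token ids and `pad_token_id=1` are made up.

```python
LABEL_PAD = -100  # marks padded label positions so they can be masked out


def pad_labels(label_batch, pad_value=LABEL_PAD):
    """Pad every label sequence to the longest one in the batch."""
    longest = max(len(labels) for labels in label_batch)
    return [labels + [pad_value] * (longest - len(labels)) for labels in label_batch]


def unmask(padded_labels, pad_token_id):
    """Replace the mask value with the real pad token id before decoding."""
    return [
        [pad_token_id if tok == LABEL_PAD else tok for tok in row]
        for row in padded_labels
    ]


labels = [[4, 5, 6], [7]]
padded = pad_labels(labels)
print(padded)                          # [[4, 5, 6], [7, -100, -100]]
print(unmask(padded, pad_token_id=1))  # [[4, 5, 6], [7, 1, 1]]
```

Using a value that is not a valid token id for label padding makes the padded positions easy to tell apart from real targets.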

In [37]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=128)

Next, we convert our datasets to `tf.data.Dataset`, which Keras understands natively. There are two ways to do this - we can use the slightly more low-level [`Dataset.to_tf_dataset()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.to_tf_dataset) method, or we can use [`Model.prepare_tf_dataset()`](https://huggingface.co/docs/transformers/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset). The main difference between these two is that the `Model` method can inspect the model to determine which column names it can use as input, which means you don't need to specify them yourself. Make sure to specify the collator we just created as our `collate_fn`!

We also want to compute `BLEU` metrics, which will require us to generate text from our model. To speed things up, we can compile our generation loop with XLA. This results in a *huge* speedup - up to 100X! The downside of XLA generation, though, is that it doesn't like variable input shapes, because it needs to run a new compilation for each new input shape! To compensate for that, let's use `pad_to_multiple_of` for the dataset we use for text generation. This will reduce the number of unique input shapes a lot, meaning we can get the benefits of XLA generation with only a few compilations.
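
The effect of `pad_to_multiple_of` on the number of distinct input shapes can be checked with a few lines of arithmetic. The batch lengths below are made up; the rounding rule (round each length up to the next multiple) is the only assumption.

```python
def padded_length(length, multiple):
    """Round `length` up to the next multiple of `multiple`."""
    return ((length + multiple - 1) // multiple) * multiple


# Hypothetical longest-sequence lengths of seven evaluation batches.
batch_lengths = [5, 17, 31, 64, 90, 127, 128]

unique_raw = sorted(set(batch_lengths))
unique_32 = sorted({padded_length(n, 32) for n in batch_lengths})
unique_128 = sorted({padded_length(n, 128) for n in batch_lengths})

print(len(unique_raw), unique_raw)  # 7 distinct shapes without rounding
print(len(unique_32), unique_32)    # 4 [32, 64, 96, 128]
print(len(unique_128), unique_128)  # 1 [128]
```

With a multiple of 128 and inputs capped at 128 tokens, every batch ends up with the same sequence length, so a compiled generation function would only see one input shape per batch size.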

In [38]:
train_dataset = model.prepare_tf_dataset(
    no_anno_dataset["train"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator,
)

validation_dataset = model.prepare_tf_dataset(
    no_anno_dataset["test"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=data_collator,
)

generation_dataset = model.prepare_tf_dataset(
    no_anno_dataset["test"],
    batch_size=8,
    shuffle=False,
    collate_fn=generation_data_collator,
)

Now we initialize our loss and optimizer and compile the model. Note that most Transformers models compute loss internally, so we can just leave the loss argument blank to use the internal loss instead. For the optimizer, we can use the `AdamWeightDecay` optimizer in the Transformers library.

In [39]:
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


Now we can train our model. We can also add a few optional callbacks here, which you can remove if they aren't useful to you. In no particular order, these are:
- PushToHubCallback will sync up our model with the Hub - this allows us to resume training from other machines, share the model after training is finished, and even test the model's inference quality midway through training!
- TensorBoard is a built-in Keras callback that logs TensorBoard metrics.
- KerasMetricCallback is a callback for computing advanced metrics. There are a number of common metrics in NLP like ROUGE which are hard to fit into your compiled training loop because they depend on decoding predictions and labels back to strings with the tokenizer, and calling arbitrary Python functions to compute the metric. The KerasMetricCallback will wrap a metric function, outputting metrics as training progresses.

If this is the first time you've seen `KerasMetricCallback`, it's worth explaining what exactly is going on here. The callback takes two main arguments - a `metric_fn` and an `eval_dataset`. It then iterates over the `eval_dataset` and collects the model's outputs for each sample, before passing the `list` of predictions and the associated `list` of labels to the user-defined `metric_fn`. If the `predict_with_generate` argument is `True`, then it will call `model.generate()` for each input sample instead of `model.predict()` - this is useful for metrics that expect generated text from the model, like `ROUGE` and `BLEU`.

This callback allows complex metrics to be computed each epoch that would not function as a standard Keras Metric. Metric values are printed each epoch, and can be used by other callbacks like `TensorBoard` or `EarlyStopping`.

In [40]:
metric_callback = KerasMetricCallback(
    metric_fn=metric_fn, eval_dataset=generation_dataset, predict_with_generate=True, use_xla_generation=True, 
    generate_kwargs={"max_length": 128}
)

With the metric callback ready, now we can specify the other callbacks and fit our model:

In [41]:
tensorboard_callback = TensorBoard(log_dir="./translation_model_save/logs")

"""push_to_hub_callback = PushToHubCallback(
    output_dir="./translation_model_save",
    tokenizer=tokenizer,
    hub_model_id=push_to_hub_model_id,
)
"""


#callbacks = [metric_callback, tensorboard_callback, push_to_hub_callback]
callbacks = [metric_callback, tensorboard_callback]

model.fit(
    train_dataset, validation_data=validation_dataset, epochs=1, callbacks=callbacks
)



<keras.callbacks.History at 0x7fa09cc22b20>

**BLEU** metric after the run with no extra features is **18.7710**. This is our baseline.

## Running the model with PoS features

The preprocessing cell below can run pretty slowly and may take 5-10 minutes.

In [86]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "fr"
# pos_tags needs to be set to True in this cell
pos_tags = True
wa_tags = False
wa_type = None

split_dataset = loaded_dataset.remove_columns(['wa'])

pos_anno_dataset = split_dataset.map(preprocess_function, batched=True)

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [87]:
model_with_pos = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at Helsinki-NLP/opus-mt-en-fr were not used when initializing TFMarianMTModel: ['final_logits_bias']
- This IS expected if you are initializing TFMarianMTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMarianMTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-fr.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


In [88]:
batch_size = 16
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 1

In [89]:
data_collator_pos = DataCollatorForSeq2Seq(tokenizer, model=model_with_pos, return_tensors="tf")

generation_data_collator_pos = DataCollatorForSeq2Seq(tokenizer, model=model_with_pos, return_tensors="tf", pad_to_multiple_of=128)

In [90]:
train_dataset = model_with_pos.prepare_tf_dataset(
    pos_anno_dataset["train"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator_pos,
)

validation_dataset = model_with_pos.prepare_tf_dataset(
    pos_anno_dataset["test"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=data_collator_pos,
)

generation_dataset = model_with_pos.prepare_tf_dataset(
    pos_anno_dataset["test"],
    batch_size=8,
    shuffle=False,
    collate_fn=generation_data_collator_pos,
)

In [91]:
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model_with_pos.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [92]:
metric_callback = KerasMetricCallback(
    metric_fn=metric_fn, eval_dataset=generation_dataset, predict_with_generate=True, use_xla_generation=True, 
    generate_kwargs={"max_length": 128}
)

In [93]:
tensorboard_callback = TensorBoard(log_dir="./translation_model_save/logs")

callbacks = [metric_callback, tensorboard_callback]

model_with_pos.fit(
    train_dataset, validation_data=validation_dataset, epochs=1, callbacks=callbacks
)



<keras.callbacks.History at 0x7f9d10a13a00>

**BLEU** metric after the run with PoS features only is **22.47899**.

## Running the model with WA (vector with target ids)

In [65]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "fr"
pos_tags = False
wa_tags = True
wa_type="trg-ids"


split_dataset = loaded_dataset.remove_columns(['pos'])
wa_trg_id_dataset = split_dataset.map(preprocess_function, batched=True)

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [66]:
model_with_wa_trg_ids = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at Helsinki-NLP/opus-mt-en-fr were not used when initializing TFMarianMTModel: ['final_logits_bias']
- This IS expected if you are initializing TFMarianMTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMarianMTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-fr.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


In [67]:
data_collator_wa_trg_ids = DataCollatorForSeq2Seq(tokenizer, model=model_with_wa_trg_ids, return_tensors="tf")

generation_data_collator_wa_trg_ids = DataCollatorForSeq2Seq(tokenizer, model=model_with_wa_trg_ids, return_tensors="tf", pad_to_multiple_of=128)

In [68]:
train_dataset = model_with_wa_trg_ids.prepare_tf_dataset(
    wa_trg_id_dataset["train"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator_wa_trg_ids,
)

validation_dataset = model_with_wa_trg_ids.prepare_tf_dataset(
    wa_trg_id_dataset["test"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=data_collator_wa_trg_ids,
)

generation_dataset = model_with_wa_trg_ids.prepare_tf_dataset(
    wa_trg_id_dataset["test"],
    batch_size=8,
    shuffle=False,
    collate_fn=generation_data_collator_wa_trg_ids,
)

In [69]:
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model_with_wa_trg_ids.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [70]:
metric_callback = KerasMetricCallback(
    metric_fn=metric_fn, eval_dataset=generation_dataset, predict_with_generate=True, use_xla_generation=True, 
    generate_kwargs={"max_length": 128}
)

In [71]:
tensorboard_callback = TensorBoard(log_dir="./wa_trg_ids/logs")

callbacks = [metric_callback, tensorboard_callback]

model_with_wa_trg_ids.fit(
    train_dataset, validation_data=validation_dataset, epochs=1, callbacks=callbacks
)



<keras.callbacks.History at 0x7f9e37066880>

**BLEU** metric after the run with WA using target ids is **31.5155**.

## Running the model with WA (using sum of token ids)

In [72]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "fr"
pos_tags = False
wa_tags = True
wa_type = "sums"

split_dataset = loaded_dataset.remove_columns(['pos'])
wa_sums_dataset = split_dataset.map(preprocess_function, batched=True)

In [73]:
model_with_wa_sums = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at Helsinki-NLP/opus-mt-en-fr were not used when initializing TFMarianMTModel: ['final_logits_bias']
- This IS expected if you are initializing TFMarianMTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMarianMTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-fr.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


In [74]:
data_collator_wa_sums = DataCollatorForSeq2Seq(tokenizer, model=model_with_wa_sums, return_tensors="tf")

generation_data_collator_wa_sums = DataCollatorForSeq2Seq(tokenizer, model=model_with_wa_sums, return_tensors="tf", pad_to_multiple_of=128)

In [75]:
train_dataset = model_with_wa_sums.prepare_tf_dataset(
    wa_sums_dataset["train"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=generation_data_collator_wa_sums,
)

validation_dataset = model_with_wa_sums.prepare_tf_dataset(
    wa_sums_dataset["test"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=generation_data_collator_wa_sums,
)

generation_dataset = model_with_wa_sums.prepare_tf_dataset(
    wa_sums_dataset["test"],
    batch_size=8,
    shuffle=False,
    collate_fn=generation_data_collator_wa_sums,
)

In [76]:
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model_with_wa_sums.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [77]:
metric_callback = KerasMetricCallback(
    metric_fn=metric_fn, eval_dataset=generation_dataset, predict_with_generate=True, use_xla_generation=True, 
    generate_kwargs={"max_length": 128}
)

In [78]:
# Log to a separate directory so this run does not mix with the target-id run's logs
tensorboard_callback = TensorBoard(log_dir="./wa_sums/logs")

callbacks = [metric_callback, tensorboard_callback]

model_with_wa_sums.fit(
    train_dataset, validation_data=validation_dataset, epochs=1, callbacks=callbacks
)



<keras.callbacks.History at 0x7f9d8493f070>

**BLEU** metric after the run with word alignment (WA) using sums of related words' indices is **23.2457**.

## Running model with WA (using multiplication of token ids)

In [79]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "fr"
pos_tags = False
wa_tags = True
wa_type = "mult"

split_dataset = loaded_dataset.remove_columns(['pos'])
wa_mult_dataset = split_dataset.map(preprocess_function, batched=True)

In [80]:
model_with_wa_mult = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at Helsinki-NLP/opus-mt-en-fr were not used when initializing TFMarianMTModel: ['final_logits_bias']
- This IS expected if you are initializing TFMarianMTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMarianMTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-fr.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


In [81]:
data_collator_wa_mult = DataCollatorForSeq2Seq(tokenizer, model=model_with_wa_mult, return_tensors="tf")

generation_data_collator_wa_mult = DataCollatorForSeq2Seq(tokenizer, model=model_with_wa_mult, return_tensors="tf", pad_to_multiple_of=128)

In [82]:
train_dataset = model_with_wa_mult.prepare_tf_dataset(
    wa_mult_dataset["train"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=generation_data_collator_wa_mult,
)

validation_dataset = model_with_wa_mult.prepare_tf_dataset(
    wa_mult_dataset["test"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=generation_data_collator_wa_mult,
)

generation_dataset = model_with_wa_mult.prepare_tf_dataset(
    wa_mult_dataset["test"],
    batch_size=8,
    shuffle=False,
    collate_fn=generation_data_collator_wa_mult,
)

In [83]:
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model_with_wa_mult.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [84]:
metric_callback = KerasMetricCallback(
    metric_fn=metric_fn, eval_dataset=generation_dataset, predict_with_generate=True, use_xla_generation=True, 
    generate_kwargs={"max_length": 128}
)

In [85]:
tensorboard_callback = TensorBoard(log_dir="./wa_mult/logs")

callbacks = [metric_callback, tensorboard_callback]

model_with_wa_mult.fit(
    train_dataset, validation_data=validation_dataset, epochs=1, callbacks=callbacks
)



<keras.callbacks.History at 0x7f9d89518be0>

**BLEU** metric after the run with word alignment (WA) using products of related words' indices is **26.4307**.

# Results

In [94]:
results_list = [
    ('No annotation', 18.7710),
    ('Part of Speech', 21.8055),
    ('Word alignment: target token ids', 22.7393),
    ('Word alignment: sum of token ids', 20.2693),
    ('Word alignment: multiplication of token ids', 18.0217)
]
# Build the frame directly from the tuples. DataFrame.append returned a new
# object (and was removed in pandas 2.0), so the old call left `results` empty.
results = pd.DataFrame(results_list, columns=['Annotation', 'BLEU'])
results

| | Annotation | BLEU |
|---|---|---|
| 0 | No annotation | 18.771 |
| 1 | Part of Speech | 21.8055 |
| 2 | Word alignment: target token ids | 22.7393 |
| 3 | Word alignment: sum of token ids | 20.2693 |
| 4 | Word alignment: multiplication of token ids | 18.0217 |
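
To make the comparison easier to read, the gain or loss of each annotation scheme relative to the unannotated baseline can be computed directly. The sketch below only reuses the BLEU values from the results table above; the variable names are illustrative.

```python
# BLEU scores copied from the results table above.
bleu = {
    "No annotation": 18.7710,
    "Part of Speech": 21.8055,
    "Word alignment: target token ids": 22.7393,
    "Word alignment: sum of token ids": 20.2693,
    "Word alignment: multiplication of token ids": 18.0217,
}

baseline = bleu["No annotation"]

# Absolute BLEU difference of every configuration against the baseline.
deltas = {name: score - baseline for name, score in bleu.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")

# Configuration with the highest BLEU in the table.
best = max(bleu, key=bleu.get)
print("Best:", best)
```

A positive difference means the annotation scored above the unannotated run in this table; a negative one means it scored below.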


## Translation with the models

Now that the models are trained, let's see how to load one and use it to translate text. Loading it from the Hub first means we can resume from here without rerunning everything above.

Now let's try tokenizing some text and passing it to the model to generate a translation. Don't forget to add the "translate: " string at the start if you're using a `T5` model.

In [116]:
input_text = "I'm not actually a very competent Romanian speaker, but let's try our best."

tokenized = tokenizer([input_text], return_tensors='np')
# In the line below use the variable name of the model you want to test
out = model.generate(**tokenized, max_length=128)
print(out)

tf.Tensor(
[[59513   134   148   157    76   558    53    34 27645 34846   366 13369
    175 12143     5   371  1049     3     0 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513]], shape=(1, 128), dtype=int32)


Well, that's some tokens and a lot of padding! Let's decode those to see what it says, using the `skip_special_tokens` argument to skip those padding tokens:

In [44]:
with tokenizer.as_target_tokenizer():
    print(tokenizer.decode(out[0], skip_special_tokens=True))

En fait je ne suis pas un orateur roumain très compétent mais faisons de notre mieux.
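
To see what `skip_special_tokens` removes, here is a conceptual sketch in plain Python that filters the generated ids by hand. It is not the tokenizer's implementation. It assumes `59513` is the padding id (it fills the tail of the tensor above, and also appears as the first generated id) and `0` is the end-of-sequence id (the value right before the padding run); `strip_special` is a hypothetical helper.

```python
PAD_ID = 59513  # assumed padding id: the value repeated at the end of the tensor
EOS_ID = 0      # assumed end-of-sequence id: the 0 just before the padding run

# The ids printed above: 19 ids up to and including EOS, then padding to 128.
generated = [
    59513, 134, 148, 157, 76, 558, 53, 34, 27645, 34846, 366, 13369,
    175, 12143, 5, 371, 1049, 3, 0,
] + [PAD_ID] * 109


def strip_special(ids, special=(PAD_ID, EOS_ID)):
    """Drop every id that is listed as a special token (illustrative only)."""
    return [i for i in ids if i not in special]


content = strip_special(generated)
print(len(generated), len(content))
```

Only the ids left in `content` correspond to pieces of the translated sentence; everything else is decoder bookkeeping.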
