If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Uncomment the following cell and run it. We also use the `sacrebleu` and `sentencepiece` libraries, so you may need to install these even if you already have 🤗 Transformers!

In [None]:
#! pip install transformers[sentencepiece] datasets
#! pip install sacrebleu sentencepiece evaluate
#! pip install huggingface_hub

In [None]:
pip install tensorflow==2.9

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then run the following cell and input your token:

In [1]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (osxkeychain).
Your token has been saved to /Users/egoliakova/.huggingface/token
Login successful


Then you need to install Git-LFS and set up Git if you haven't already. Uncomment the following instructions and adapt them with your name and email:

In [None]:
# !apt install git-lfs
# !git config --global user.email "you@example.com"
# !git config --global user.name "Your Name"

Make sure your version of Transformers is at least 4.16.0 since some of the functionality we use was introduced in that version:

In [1]:
import transformers

print(transformers.__version__)

4.21.0


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

# Convert CSV-file to a dataset-ready format

The code below works with a specifically formatted csv. Run the cell below to format your CSV accordingly.
Your CSV should have at least 2 columns `en` and `xx` where xx is the code of the target language.

If the CSV file has PoS tags for source and target language, the expected column names for them are:
`pos_en` and `pos_xx`. 

If the CSV file has WA tags, the expected column name is `wa`.
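As a rough illustration of the expected layout, the sketch below parses a tiny in-memory CSV with the standard `csv` module and builds the same kind of `translation`, `pos` and `wa` fields described above. The sample rows and the helper name `rows_to_records` are made up for illustration and are not part of the notebook's pipeline.

```python
import csv
import io

# Hypothetical two-row file with the columns described above (target language: fr).
SAMPLE_CSV = """en,fr,pos_en,pos_fr,wa
Hello world,Bonjour le monde,Hello INTJ world NOUN,Bonjour INTJ le DET monde NOUN,0-0 1-2
Thank you,Merci,Thank VERB you PRON,Merci INTJ,0-0
"""


def rows_to_records(text, source_lang, target_lang):
    """Turn CSV rows into dicts with 'translation', 'pos' and 'wa' keys."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        record = {
            "translation": {source_lang: row[source_lang], target_lang: row[target_lang]}
        }
        # PoS and WA columns are optional, so only copy them when present.
        pos_src, pos_trg = f"pos_{source_lang}", f"pos_{target_lang}"
        if pos_src in row and pos_trg in row:
            record["pos"] = {source_lang: row[pos_src], target_lang: row[pos_trg]}
        if "wa" in row:
            record["wa"] = row["wa"]
        records.append(record)
    return records


records = rows_to_records(SAMPLE_CSV, "en", "fr")
print(records[0]["translation"])
```

Each `wa` entry is a space-separated list of `source-target` index pairs, which is the format the sample output further down also shows.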

In [2]:
import pandas as pd
from datasets import Dataset


# source_lang: accepted value is 'en'
# target_lang: accepted values are 'fr' or 'zh'
# Set pos_tags=True if the file has PoS tags for both languages.
# Set wa_tags=True if the file has WA tags.
# store is meant to request a JSON dump of the file for later use; the function below does not act on it yet.

def csv_to_dataset(filename, source_lang, target_lang, pos_tags=False, wa_tags=False, store=False):
    data = pd.read_csv(filename)
    new_df = pd.DataFrame()
    new_df['translation'] = [{source_lang: x, target_lang: y} for x, y in zip(data[source_lang], data[target_lang])]
    if pos_tags:
        new_df['pos'] = [{source_lang: x, target_lang: y} for x, y in zip(data[f'pos_{source_lang}'], data[f'pos_{target_lang}'])]
    if wa_tags:
        new_df['wa'] = data['wa']
    return new_df

In [3]:
test = csv_to_dataset('en_fr_pos_wa_anno.csv', 'en', 'fr', pos_tags=True, wa_tags=True)

In [4]:
raw_datasets = Dataset.from_pandas(test).train_test_split(test_size=0.2)

# Fine-tuning a model on a translation task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a translation task. We will train on the annotated English-French CSV data prepared above. (The [WMT dataset](http://www.statmt.org/wmt16/), a machine translation dataset composed of a collection of various sources, including news commentaries and parliament proceedings, could be substituted by loading it with 🤗 Datasets.)

![Widget inference on a translation task](images/translation.png)

We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using Keras.

In [5]:
model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`Helsinki-NLP/opus-mt-en-fr`](https://huggingface.co/Helsinki-NLP/opus-mt-en-fr) checkpoint.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to hold the data and the 🤗 Evaluate library to get the metric we need for evaluation (to compare our model to the benchmark). The metric is loaded with the `evaluate` function `load`. The data itself is the English/French CSV converted above, so no `load_dataset` call is needed here.

In [6]:
from datasets import load_dataset
from evaluate import load

metric = load("sacrebleu")

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [7]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [8]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,translation,pos,wa
0,"{'en': 'Sydney rescued 47 out of 225 men from the Italian destroyer and thirty six more escaped on rafts but only six of them were later found alive by Italian submarine Topazio almost 20 days later.', 'fr': 'Le Sydney sauva 47 hommes du destroyer italien et six autres furent retrouvés vivants par un sous-marin italien près de 20 jours plus tard.'}","{'en': 'Sydney PROPN rescued VERB 47 NUM out ADP of ADP 225 NUM men NOUN from ADP the DET Italian ADJ destroyer NOUN and CCONJ thirty NUM six NUM more ADV escaped ADJ on ADP rafts NOUN but CCONJ only ADV six NUM of ADP them PRON were AUX later ADV found VERB alive ADJ by ADP Italian ADJ submarine NOUN Topazio PROPN almost ADV 20 NUM days NOUN later ADV . PUNCT ', 'fr': 'Le DET Sydney PROPN sauva ADJ 47 NUM hommes NOUN du ADP destroyer NOUN italien ADJ et CCONJ six NUM autres ADJ furent AUX retrouvés VERB vivants NOUN par ADP un DET sous-marin NOUN italien ADJ près ADV de ADP 20 NUM jours NOUN plus ADV tard ADV . PUNCT '}",0-1 1-2 2-3 4-19 5-3 6-4 7-5 8-0 9-7 10-6 11-8 13-9 14-10 16-15 18-8 20-9 22-12 23-11 25-12 26-13 27-14 28-17 29-16 30-16 31-18 32-20 33-21 34-22 34-23 35-24
1,"{'en': 'The National Ignition Facility (NIF) is a large laser-based inertial confinement fusion (ICF) research device located at the Lawrence Livermore National Laboratory in Livermore California.', 'fr': 'Le National Ignition Facility ou NIF est un laser de recherche extrêmement énergétique construit au sein du Lawrence Livermore National Laboratory à Livermore (Californie États-Unis).'}","{'en': 'The DET National PROPN Ignition PROPN Facility PROPN ( PUNCT NIF PROPN ) PUNCT is AUX a DET large ADJ laser NOUN - PUNCT based VERB inertial ADJ confinement NOUN fusion NOUN ( PUNCT ICF PROPN ) PUNCT research NOUN device NOUN located VERB at ADP the DET Lawrence PROPN Livermore PROPN National PROPN Laboratory PROPN in ADP Livermore PROPN California PROPN . PUNCT ', 'fr': 'Le DET National NOUN Ignition PROPN Facility PROPN ou CCONJ NIF PROPN est AUX un DET laser NOUN de ADP recherche NOUN extrêmement ADV énergétique ADJ construit ADJ au ADP sein NOUN du ADP Lawrence PROPN Livermore PROPN National PROPN Laboratory PROPN à ADP Livermore PROPN ( PUNCT Californie NOUN États-Unis PROPN ) PUNCT . PUNCT '}",0-0 1-1 2-2 3-3 4-4 5-5 6-26 7-6 8-7 9-11 10-8 11-9 13-12 17-5 19-10 21-13 22-15 23-16 24-17 25-18 26-19 27-20 28-21 29-22 30-24 30-25 31-27
2,"{'en': 'The French king comes to battle them at Cayeux (Cayeux-en-Santerre or Cayeux-sur-Mer).', 'fr': 'Le roi de France se porte à leur rencontre à Cayeux (Cayeux-sur-Mer).'}","{'en': 'The DET French ADJ king NOUN comes VERB to PART battle VERB them PRON at ADP Cayeux PROPN ( PUNCT Cayeux PROPN - PUNCT en PROPN - PUNCT Santerre PROPN or CCONJ Cayeux PROPN - PUNCT sur ADJ - PUNCT Mer PROPN ) PUNCT . PUNCT ', 'fr': 'Le DET roi NOUN de ADP France PROPN se PRON porte VERB à ADP leur DET rencontre NOUN à ADP Cayeux PROPN ( PUNCT Cayeux-sur-Mer NOUN ) PUNCT . PUNCT '}",0-0 1-3 2-1 2-2 3-4 3-5 4-6 5-8 6-7 7-9 8-10 9-11 10-12 11-12 12-12 13-12 14-12 15-11 16-12 17-12 18-12 19-12 20-12 21-13 22-14
3,"{'en': 'They were partially inspired by the actions of the Greenham Common Women's Peace Camp.', 'fr': 'Elles s'inspirent en partie des actions du camp de la femme de Greenham..'}","{'en': 'They PRON were AUX partially ADV inspired VERB by ADP the DET actions NOUN of ADP the DET Greenham PROPN Common PROPN Women PROPN 's PART Peace PROPN Camp PROPN . PUNCT ', 'fr': 'Elles PRON s' PRON inspirent ADV en ADP partie NOUN des DET actions NOUN du ADP camp NOUN de ADP la DET femme NOUN de ADP Greenham PROPN .. PUNCT '}",0-0 1-2 2-1 3-2 5-5 6-6 7-7 8-10 9-13 11-11 14-8 15-14
4,"{'en': 'He married Margaret de Multon Baroness Multon of Gilsland.', 'fr': 'Il épouse avant 1319 Margaret Multon baronne Multon de Gilsland.'}","{'en': 'He PRON married VERB Margaret PROPN de PROPN Multon PROPN Baroness PROPN Multon PROPN of ADP Gilsland PROPN . PUNCT ', 'fr': 'Il PRON épouse VERB avant ADP 1319 NUM Margaret PROPN Multon PROPN baronne ADJ Multon PROPN de ADP Gilsland PROPN . PUNCT '}",0-0 1-1 2-4 3-8 4-5 5-6 6-7 7-8 8-9 9-10


## Preprocessing the data

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

For an mBART tokenizer, we need to set the source and target languages so the texts are preprocessed properly. The Marian checkpoint selected above does not need this, so the cell below leaves its tokenizer unchanged. You can check the language codes [here](https://huggingface.co/facebook/mbart-large-cc25) if you are using this notebook on a different pair of languages.

In [12]:
if "mbart" in model_checkpoint:
    # mBART language codes use an underscore, e.g. en_XX and fr_XX
    tokenizer.src_lang = "en_XX"
    tokenizer.tgt_lang = "fr_XX"

In [13]:
tokenizer(["Hello, this is a sentence!", "This is another sentence."])

{'input_ids': [[10537, 2, 67, 32, 15, 5776, 145, 0], [160, 32, 1036, 5776, 3, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

Later on, for the word alignment (WA) encoding, we will use these token ids rather than the words themselves to express which words in the two sentences are related.

In [14]:
import spacy

en_pos_sp = spacy.load("en_core_web_sm")
fr_pos_sp = spacy.load('fr_core_news_sm')

If you are using one of the five T5 checkpoints that require a special prefix to put before the inputs, you should adapt the following cell.

In [15]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "translate English to French: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model. Padding is dealt with later on (in a data collator), so we pad examples to the longest length in the batch and not in the whole dataset.
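To make the truncation and per-batch padding idea concrete, here is a minimal pure-Python sketch on toy token ids. The helpers `truncate` and `pad_batch` are hypothetical stand-ins; a real tokenizer also keeps special tokens such as the end-of-sequence marker when it truncates, which this sketch ignores.

```python
def truncate(ids, max_length):
    """Keep at most max_length token ids."""
    return ids[:max_length]


def pad_batch(batch, pad_id):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch]


# Toy token ids; a real tokenizer would produce these.
sequences = [[5, 6, 7, 8, 9, 10], [11, 12], [13, 14, 15]]
truncated = [truncate(ids, max_length=4) for ids in sequences]

# Batches of two examples are padded independently of each other.
batches = [truncated[i:i + 2] for i in range(0, len(truncated), 2)]
padded = [pad_batch(batch, pad_id=0) for batch in batches]
print(padded)
```

The second batch holds a single sequence of length 3, so it receives no padding at all, whereas padding against the whole dataset would have stretched it to length 4.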

For PoS tags we will use a separate function that will parse the sentences and extract the PoS information from there.

In [16]:
"""
get_pos_tags receives a tokenized input (a list of token ids) from the tokenizer. This tokenization differs from spaCy's,
so to keep one PoS value per token id (the same length as the input ids) we will:
- decode each token id back to text
- get a part-of-speech id for it from spaCy and return it
"""

def token_to_pos(token, lang):
    if lang == 'en':
        decoded = list(en_pos_sp(tokenizer.decode(token)))
    elif lang == 'fr':
        decoded = list(fr_pos_sp(tokenizer.decode(token)))
    else:
        raise ValueError(f"Unsupported language: {lang!r}")
    # Use the PoS id of the last spaCy token; -1 marks ids that decode to nothing
    return decoded[-1].pos if decoded else -1

def get_pos_tags(tokenized_sent, lang):
    return list(map(lambda x: token_to_pos(x, lang), tokenized_sent))

Word alignment information can be encoded in different ways:
- Name: `trg-ids`. Create a zero-filled vector of the same length as the tokenized input sentence. For each position i that is aligned to a position in the target sentence, put the token id found at the aligned target position into position i of the new vector.
- Name: `sums`. Create a copy of the tokenized input vector. For each position i that is aligned to a position in the target sentence, add the token id found at the aligned target position to the value at position i.
- Name: `mult`. Same as `sums`, but multiplies the values instead of adding them.
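The three encodings can be sketched on toy ids as follows. This is an illustration under stated assumptions: the helper names `parse_alignment` and `encode` are made up, the ids are arbitrary, and alignment indices are treated directly as positions in the id lists.

```python
def parse_alignment(wa):
    """Parse a string such as '0-1 2-0' into {source_index: target_index}."""
    return {int(s): int(t) for s, t in (pair.split("-") for pair in wa.split())}


def encode(source_ids, target_ids, wa, mode):
    if mode not in ("trg-ids", "sums", "mult"):
        raise ValueError(f"Unknown mode: {mode!r}")
    # 'trg-ids' starts from zeros; 'sums' and 'mult' start from a copy of the source ids.
    out = [0] * len(source_ids) if mode == "trg-ids" else list(source_ids)
    for s, t in parse_alignment(wa).items():
        if s >= len(source_ids) or t >= len(target_ids):
            continue  # ignore links that point outside either sequence
        if mode == "trg-ids":
            out[s] = target_ids[t]
        elif mode == "sums":
            out[s] += target_ids[t]
        else:
            out[s] *= target_ids[t]
    return out


source_ids = [10, 20, 30]
target_ids = [7, 8, 9, 5]
wa = "0-1 2-0"  # source 0 links to target 1, source 2 links to target 0

print(encode(source_ids, target_ids, wa, "trg-ids"))  # [8, 0, 7]
print(encode(source_ids, target_ids, wa, "sums"))     # [18, 20, 37]
print(encode(source_ids, target_ids, wa, "mult"))     # [80, 20, 210]
```

Position 1 has no alignment link, so it stays at 0 for `trg-ids` and keeps its original id for `sums` and `mult`.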

In [72]:
from copy import deepcopy

def encode_wa(tokenized_input, tokenized_target, wa, wa_type):
    # Parse "0-1 2-0" into {source_index: target_index}; a repeated source index keeps its last link
    wa_dict = {int(src): int(trg) for src, trg in map(lambda x: x.split('-'), wa.split())}
    n = len(tokenized_input)
    m = len(tokenized_target)
    if wa_type == 'trg-ids':
        wa_emb = [0]*n
        for k, v in wa_dict.items():
            # Skip links that fall outside the (possibly truncated) sequences
            if k >= n or v >= m:
                continue
            wa_emb[k] = tokenized_target[v]
        return wa_emb
    elif wa_type == 'sums':
        wa_emb = deepcopy(tokenized_input)
        for k, v in wa_dict.items():
            if k >= n or v >= m:
                continue
            wa_emb[k] += tokenized_target[v]
        return wa_emb
    elif wa_type == 'mult':
        wa_emb = deepcopy(tokenized_input)
        for k, v in wa_dict.items():
            if k >= n or v >= m:
                continue
            wa_emb[k] *= tokenized_target[v]
        return wa_emb
    raise ValueError(f"Unknown wa_type: {wa_type!r}")

In [125]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "fr"
pos_tags=False
wa_type=None

def preprocess_function(dataset):
    global source_lang, target_lang, pos_tags, wa_type
    inputs = [prefix + d[source_lang] for d in dataset["translation"]]
    targets = [d[target_lang] for d in dataset["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    
    if pos_tags:
        model_inputs['pos'] = [get_pos_tags(x, 'en') for x in model_inputs['input_ids']]
        model_inputs['target_pos'] = [get_pos_tags(y, 'fr') for y in model_inputs['labels']]
        
    if wa_type:
        model_inputs['wa'] = [encode_wa(src, trg, wa, wa_type) for src, trg, wa \
                              in zip(model_inputs['input_ids'],  model_inputs['labels'], dataset["wa"])]
    return model_inputs

This function expects a batch of examples (which is what `map` passes when `batched=True`), so the tokenizer returns a list of lists for each key.

To apply this function to all the pairs of sentences in our dataset, we use the `map` method of the `raw_datasets` object we created earlier. This applies the function to all the elements of all the splits in `raw_datasets`, so our training and test data are preprocessed in a single command.
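The shape of the data that a batched `map` hands to the preprocessing function can be illustrated with a small stand-in. The `batched_map` helper below is hypothetical and much simpler than the 🤗 Datasets implementation (for instance, it returns only the new columns rather than merging them with the existing ones); it only shows that the function receives a dict with one list per column.

```python
def batched_map(columns, fn, batch_size):
    """Apply fn to slices of a column-oriented table and concatenate the results."""
    n_rows = len(next(iter(columns.values())))
    merged = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            merged.setdefault(name, []).extend(values)
    return merged


def count_source_words(batch):
    # The function receives a dict of lists, one list per column.
    return {"n_words": [len(pair["en"].split()) for pair in batch["translation"]]}


table = {
    "translation": [
        {"en": "a b c", "fr": "x y"},
        {"en": "d", "fr": "z"},
        {"en": "e f", "fr": "w"},
    ]
}
result = batched_map(table, count_source_words, batch_size=2)
print(result)
```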

In [20]:
from transformers.keras_callbacks import KerasMetricCallback
import numpy as np


def metric_fn(eval_predictions):
    preds, labels = eval_predictions
    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds
    ]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # We use -100 to mask labels - replace it with the tokenizer pad token when decoding
    # so that no output is emitted for these
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    result["gen_len"] = np.mean(prediction_lens)
    return result

In [22]:
from transformers.keras_callbacks import PushToHubCallback
from tensorflow.keras.callbacks import TensorBoard
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
from transformers import AdamWeightDecay
import tensorflow as tf

## Fine-tuning the model with no extra tags

In [126]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "fr"
pos_tags=False
wa_tags = False
wa_type = None

test = csv_to_dataset('en_fr_pos_wa_anno.csv', 'en', 'fr', pos_tags=pos_tags, wa_tags=wa_tags)
raw_datasets = Dataset.from_pandas(test).train_test_split(test_size=0.2)


no_anno_dataset = raw_datasets.map(preprocess_function, batched=True)

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [90]:
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at Helsinki-NLP/opus-mt-en-fr were not used when initializing TFMarianMTModel: ['final_logits_bias']
- This IS expected if you are initializing TFMarianMTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMarianMTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-fr.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


Note that, unlike in our classification example, we don't get a warning about newly initialized weights: all the layers of the model were initialized from the pretrained checkpoint, so there is no randomly initialized head in this case.

Next we set some parameters like the learning rate and the `batch_size` and customize the weight decay.

The last two lines, commented out here, set everything up so we can push the model to the [Hub](https://huggingface.co/models) at the end of training. Uncomment them only if you followed the installation steps at the top of the notebook; you can then change the value of `push_to_hub_model_id` to something you prefer.

In [91]:
batch_size = 16
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 1

#model_name = model_checkpoint.split("/")[-1]
#push_to_hub_model_id = f"{model_name}-finetuned-{source_lang}-to-{target_lang}"

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are multi-framework, so make sure you set `return_tensors='tf'` so you get `tf.Tensor` objects back and not something else!
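What such a collator does to a batch can be sketched in plain Python. The `collate` function below is a hypothetical stand-in rather than the library's implementation: it pads inputs with the pad id, builds the attention mask, and pads labels with -100, the value this notebook's `metric_fn` later swaps back to the pad token before decoding.

```python
LABEL_PAD_ID = -100  # label positions with this value are treated as "ignore"


def collate(features, pad_id):
    """Pad input_ids with pad_id and labels with LABEL_PAD_ID, per batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * (max_in - n_in))
        # 1 marks a real token, 0 marks padding.
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch


features = [
    {"input_ids": [4, 5, 6], "labels": [9, 8]},
    {"input_ids": [7], "labels": [3, 2, 1, 0]},
]
batch = collate(features, pad_id=0)
print(batch["labels"])
```

Inputs and labels are padded to different lengths because the longest source and the longest target in a batch rarely coincide.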

In [92]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=128)

Next, we convert our datasets to `tf.data.Dataset`, which Keras understands natively. There are two ways to do this - we can use the slightly more low-level [`Dataset.to_tf_dataset()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.to_tf_dataset) method, or we can use [`Model.prepare_tf_dataset()`](https://huggingface.co/docs/transformers/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset). The main difference between these two is that the `Model` method can inspect the model to determine which column names it can use as input, which means you don't need to specify them yourself. Make sure to specify the collator we just created as our `collate_fn`!

We also want to compute `BLEU` metrics, which will require us to generate text from our model. To speed things up, we can compile our generation loop with XLA. This results in a *huge* speedup - up to 100X! The downside of XLA generation, though, is that it doesn't like variable input shapes, because it needs to run a new compilation for each new input shape! To compensate for that, let's use `pad_to_multiple_of` for the dataset we use for text generation. This will reduce the number of unique input shapes a lot, meaning we can get the benefits of XLA generation with only a few compilations.
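The effect of padding to a multiple on the number of distinct input shapes can be checked with a few lines of arithmetic. The batch lengths below are invented for illustration, and `round_up` is a hypothetical helper, not a library function.

```python
def round_up(length, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to length."""
    return ((length + multiple - 1) // multiple) * multiple


# Hypothetical longest-sequence lengths for ten generation batches.
batch_lengths = [17, 23, 23, 41, 64, 97, 120, 128, 130, 200]

shapes_without = set(batch_lengths)
shapes_with = {round_up(n, 128) for n in batch_lengths}

print(len(shapes_without))  # 9 distinct shapes, so up to 9 compilations
print(sorted(shapes_with))  # [128, 256], so at most 2 compilations
```

The trade-off is extra padding in every batch, which is usually a small price compared with recompiling for each new shape.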

In [93]:
train_dataset = model.prepare_tf_dataset(
    no_anno_dataset["train"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator,
)

validation_dataset = model.prepare_tf_dataset(
    no_anno_dataset["test"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=data_collator,
)

generation_dataset = model.prepare_tf_dataset(
    no_anno_dataset["test"],
    batch_size=8,
    shuffle=False,
    collate_fn=generation_data_collator,
)

Now we initialize our loss and optimizer and compile the model. Note that most Transformers models compute loss internally, so we can just leave the loss argument blank to use the internal loss instead. For the optimizer, we can use the `AdamWeightDecay` optimizer in the Transformers library.

In [94]:
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


Now we can train our model. We can also add a few optional callbacks here, which you can remove if they aren't useful to you. In no particular order, these are:
- PushToHubCallback will sync up our model with the Hub - this allows us to resume training from other machines, share the model after training is finished, and even test the model's inference quality midway through training!
- TensorBoard is a built-in Keras callback that logs TensorBoard metrics.
- KerasMetricCallback is a callback for computing advanced metrics. There are a number of common metrics in NLP like ROUGE which are hard to fit into your compiled training loop because they depend on decoding predictions and labels back to strings with the tokenizer, and calling arbitrary Python functions to compute the metric. The KerasMetricCallback will wrap a metric function, outputting metrics as training progresses.

If this is the first time you've seen `KerasMetricCallback`, it's worth explaining what exactly is going on here. The callback takes two main arguments - a `metric_fn` and an `eval_dataset`. It then iterates over the `eval_dataset` and collects the model's outputs for each sample, before passing the `list` of predictions and the associated `list` of labels to the user-defined `metric_fn`. If the `predict_with_generate` argument is `True`, then it will call `model.generate()` for each input sample instead of `model.predict()` - this is useful for metrics that expect generated text from the model, like `ROUGE` and `BLEU`.

This callback allows complex metrics to be computed each epoch that would not function as a standard Keras Metric. Metric values are printed each epoch, and can be used by other callbacks like `TensorBoard` or `EarlyStopping`.
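Conceptually, the evaluation pass described above amounts to the loop below. This is a simplified sketch, not the callback's actual code: `run_metric_callback`, `fake_generate` and `exact_match` are hypothetical names, and the "model" simply reverses its input so the example runs without any framework.

```python
def run_metric_callback(metric_fn, eval_batches, generate):
    """Collect predictions and labels over all batches, then call metric_fn once."""
    all_preds, all_labels = [], []
    for batch in eval_batches:
        all_preds.extend(generate(batch["input_ids"]))
        all_labels.extend(batch["labels"])
    return metric_fn((all_preds, all_labels))


def fake_generate(input_ids):
    # Stand-in for model.generate(): reverse each input sequence.
    return [list(reversed(ids)) for ids in input_ids]


def exact_match(eval_predictions):
    # Same calling convention as metric_fn: a (predictions, labels) pair.
    preds, labels = eval_predictions
    hits = sum(p == l for p, l in zip(preds, labels))
    return {"exact_match": hits / len(labels)}


eval_batches = [
    {"input_ids": [[1, 2], [3, 4]], "labels": [[2, 1], [4, 4]]},
    {"input_ids": [[5, 6]], "labels": [[6, 5]]},
]
metrics = run_metric_callback(exact_match, eval_batches, fake_generate)
print(metrics)
```

The returned dictionary plays the role of the per-epoch metric values that other callbacks can then read.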

In [95]:
metric_callback = KerasMetricCallback(
    metric_fn=metric_fn, eval_dataset=generation_dataset, predict_with_generate=True, use_xla_generation=True, 
    generate_kwargs={"max_length": 128}
)

With the metric callback ready, now we can specify the other callbacks and fit our model:

In [96]:
tensorboard_callback = TensorBoard(log_dir="./translation_model_save/logs")

"""push_to_hub_callback = PushToHubCallback(
    output_dir="./translation_model_save",
    tokenizer=tokenizer,
    hub_model_id=push_to_hub_model_id,
)
"""


#callbacks = [metric_callback, tensorboard_callback, push_to_hub_callback]
callbacks = [metric_callback, tensorboard_callback]

model.fit(
    train_dataset, validation_data=validation_dataset, epochs=num_train_epochs, callbacks=callbacks
)



<keras.callbacks.History at 0x7fc07929e910>

**BLEU** metric after the run with no extra features is **22.7236**. This is our baseline.

## Running the model with PoS features

The preprocessing cell below can run pretty slowly and may take 5-10 minutes.

In [127]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "fr"
# pos_tags must be set to True in this cell
pos_tags = True
wa_tags = False
wa_type = None

test = csv_to_dataset('en_fr_pos_wa_anno.csv', 'en', 'fr', pos_tags=pos_tags, wa_tags=wa_tags)
raw_datasets = Dataset.from_pandas(test).train_test_split(test_size=0.2)

pos_anno_dataset = raw_datasets.map(preprocess_function, batched=True)

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [129]:
model_with_pos = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at Helsinki-NLP/opus-mt-en-fr were not used when initializing TFMarianMTModel: ['final_logits_bias']
- This IS expected if you are initializing TFMarianMTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMarianMTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-fr.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


In [130]:
batch_size = 16
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 1

In [131]:
data_collator_pos = DataCollatorForSeq2Seq(tokenizer, model=model_with_pos, return_tensors="tf")

generation_data_collator_pos = DataCollatorForSeq2Seq(tokenizer, model=model_with_pos, return_tensors="tf", pad_to_multiple_of=128)

In [132]:
train_dataset = model_with_pos.prepare_tf_dataset(
    pos_anno_dataset["train"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator_pos,
)

validation_dataset = model_with_pos.prepare_tf_dataset(
    pos_anno_dataset["test"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=data_collator_pos,
)

generation_dataset = model_with_pos.prepare_tf_dataset(
    pos_anno_dataset["test"],
    batch_size=8,
    shuffle=False,
    collate_fn=generation_data_collator_pos,
)

In [133]:
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model_with_pos.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [134]:
metric_callback = KerasMetricCallback(
    metric_fn=metric_fn, eval_dataset=generation_dataset, predict_with_generate=True, use_xla_generation=True, 
    generate_kwargs={"max_length": 128}
)

In [135]:
tensorboard_callback = TensorBoard(log_dir="./translation_model_save/logs")

callbacks = [metric_callback, tensorboard_callback]

model_with_pos.fit(
    train_dataset, validation_data=validation_dataset, epochs=num_train_epochs, callbacks=callbacks
)



<keras.callbacks.History at 0x7fc07829bf10>

**BLEU** metric after the run with PoS features only is **22.47899**.

## Running the model with WA (vector with target ids)

In [136]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "fr"
pos_tags = False
wa_tags = True
wa_type="trg-ids"

test = csv_to_dataset('en_fr_pos_wa_anno.csv', 'en', 'fr', pos_tags=pos_tags, wa_tags=wa_tags)
raw_datasets = Dataset.from_pandas(test).train_test_split(test_size=0.2)

wa_trg_id_dataset = raw_datasets.map(preprocess_function, batched=True)

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [137]:
model_with_wa_trg_ids = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at Helsinki-NLP/opus-mt-en-fr were not used when initializing TFMarianMTModel: ['final_logits_bias']
- This IS expected if you are initializing TFMarianMTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMarianMTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-fr.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


In [138]:
data_collator_wa_trg_ids = DataCollatorForSeq2Seq(tokenizer, model=model_with_wa_trg_ids, return_tensors="tf")

generation_data_collator_wa_trg_ids = DataCollatorForSeq2Seq(tokenizer, model=model_with_wa_trg_ids, return_tensors="tf", pad_to_multiple_of=128)

In [139]:
train_dataset = model_with_wa_trg_ids.prepare_tf_dataset(
    wa_trg_id_dataset["train"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator_wa_trg_ids,
)

validation_dataset = model_with_wa_trg_ids.prepare_tf_dataset(
    wa_trg_id_dataset["test"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=data_collator_wa_trg_ids,
)

generation_dataset = model_with_wa_trg_ids.prepare_tf_dataset(
    wa_trg_id_dataset["test"],
    batch_size=8,
    shuffle=False,
    collate_fn=generation_data_collator_wa_trg_ids,
)

In [140]:
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model_with_wa_trg_ids.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [141]:
metric_callback = KerasMetricCallback(
    metric_fn=metric_fn, eval_dataset=generation_dataset, predict_with_generate=True, use_xla_generation=True, 
    generate_kwargs={"max_length": 128}
)

In [144]:
tensorboard_callback = TensorBoard(log_dir="./wa_trg_ids/logs")

callbacks = [metric_callback, tensorboard_callback]

model_with_wa_trg_ids.fit(
    train_dataset, validation_data=validation_dataset, epochs=num_train_epochs, callbacks=callbacks
)



<keras.callbacks.History at 0x7fc07904d520>

**BLEU** metric after the run with word alignment (WA) using target indices is **31.5155**.

## Running the model with WA (using sum of token ids)

In [145]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "fr"
pos_tags = False
wa_tags = True
wa_type ="sums"

test = csv_to_dataset('en_fr_pos_wa_anno.csv', 'en', 'fr', pos_tags=pos_tags, wa_tags=wa_tags)
raw_datasets = Dataset.from_pandas(test).train_test_split(test_size=0.2)

wa_sums_dataset = raw_datasets.map(preprocess_function, batched=True)

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [146]:
model_with_wa_sums = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at Helsinki-NLP/opus-mt-en-fr were not used when initializing TFMarianMTModel: ['final_logits_bias']
- This IS expected if you are initializing TFMarianMTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMarianMTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-fr.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


In [147]:
data_collator_wa_sums = DataCollatorForSeq2Seq(tokenizer, model=model_with_wa_sums, return_tensors="tf")

generation_data_collator_wa_sums = DataCollatorForSeq2Seq(tokenizer, model=model_with_wa_sums, return_tensors="tf", pad_to_multiple_of=128)

In [148]:
train_dataset = model_with_wa_sums.prepare_tf_dataset(
    wa_sums_dataset["train"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=generation_data_collator_wa_sums,
)

validation_dataset = model_with_wa_sums.prepare_tf_dataset(
    wa_sums_dataset["test"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=generation_data_collator_wa_sums,
)

generation_dataset = model_with_wa_sums.prepare_tf_dataset(
    wa_sums_dataset["test"],
    batch_size=8,
    shuffle=False,
    collate_fn=generation_data_collator_wa_sums,
)

In [149]:
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model_with_wa_sums.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [150]:
metric_callback = KerasMetricCallback(
    metric_fn=metric_fn, eval_dataset=generation_dataset, predict_with_generate=True, use_xla_generation=True, 
    generate_kwargs={"max_length": 128}
)

In [152]:
# Log to a separate directory so this run does not mix with the target-ids run
tensorboard_callback = TensorBoard(log_dir="./wa_sums/logs")

callbacks = [metric_callback, tensorboard_callback]

model_with_wa_sums.fit(
    train_dataset, validation_data=validation_dataset, epochs=1, callbacks=callbacks
)



<keras.callbacks.History at 0x7fc079287dc0>

The **BLEU** score after the run with word alignment (WA) using sums of related words' indices is **23.2457**.

## Running model with WA (using multiplication of token ids)

In [98]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "fr"
pos_tags = False
wa_tags = True
wa_type = "mult"

test = csv_to_dataset('en_fr_pos_wa_anno.csv', 'en', 'fr', pos_tags=pos_tags, wa_tags=wa_tags)
raw_datasets = Dataset.from_pandas(test).train_test_split(test_size=0.2)

wa_mult_dataset = raw_datasets.map(preprocess_function, batched=True)

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [100]:
model_with_wa_mult = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at Helsinki-NLP/opus-mt-en-fr were not used when initializing TFMarianMTModel: ['final_logits_bias']
- This IS expected if you are initializing TFMarianMTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMarianMTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-fr.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


In [103]:
data_collator_wa_mult = DataCollatorForSeq2Seq(tokenizer, model=model_with_wa_mult, return_tensors="tf")

generation_data_collator_wa_mult = DataCollatorForSeq2Seq(tokenizer, model=model_with_wa_mult, return_tensors="tf", pad_to_multiple_of=128)

In [104]:
train_dataset = model_with_wa_mult.prepare_tf_dataset(
    wa_mult_dataset["train"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=generation_data_collator_wa_mult,
)

validation_dataset = model_with_wa_mult.prepare_tf_dataset(
    wa_mult_dataset["test"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=generation_data_collator_wa_mult,
)

generation_dataset = model_with_wa_mult.prepare_tf_dataset(
    wa_mult_dataset["test"],
    batch_size=8,
    shuffle=False,
    collate_fn=generation_data_collator_wa_mult,
)

In [106]:
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model_with_wa_mult.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [107]:
metric_callback = KerasMetricCallback(
    metric_fn=metric_fn, eval_dataset=generation_dataset, predict_with_generate=True, use_xla_generation=True, 
    generate_kwargs={"max_length": 128}
)

In [108]:
tensorboard_callback = TensorBoard(log_dir="./wa_mult/logs")

callbacks = [metric_callback, tensorboard_callback]

model_with_wa_mult.fit(
    train_dataset, validation_data=validation_dataset, epochs=1, callbacks=callbacks
)



<keras.callbacks.History at 0x7fc078404c10>

The **BLEU** score after the run with word alignment (WA) using products of related words' indices is **26.4307**.

# Results

In [153]:
results_list = [
    ('No annotation', 22.7236),
    ('Part of Speech', 22.4789),
    ('Word alignment: target token ids', 31.5155),
    ('Word alignment: sum of token ids', 23.2457),
    ('Word alignment: multiplication of token ids', 26.4307)
]
# Build the DataFrame directly from the list of tuples. `DataFrame.append`
# returns a new object instead of modifying `results` in place, and it was
# removed in pandas 2.0.
results = pd.DataFrame(results_list, columns=['Annotation', 'BLEU'])
results

|   | Annotation | BLEU |
|---|---|---|
| 0 | No annotation | 22.7236 |
| 1 | Part of Speech | 22.4789 |
| 2 | Word alignment: target token ids | 31.5155 |
| 3 | Word alignment: sum of token ids | 23.2457 |
| 4 | Word alignment: multiplication of token ids | 26.4307 |
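
To make the comparison easier to read, the scores can be expressed as differences from the run without annotation. The sketch below uses only the BLEU values reported above; the variable names are illustrative. Keep in mind that each run trains for a single epoch and calls `train_test_split` again without a fixed seed, so the test sets differ between runs and small differences may be noise.

```python
# BLEU scores copied from the results above.
bleu_scores = {
    "No annotation": 22.7236,
    "Part of Speech": 22.4789,
    "Word alignment: target token ids": 31.5155,
    "Word alignment: sum of token ids": 23.2457,
    "Word alignment: multiplication of token ids": 26.4307,
}

baseline = bleu_scores["No annotation"]

# Difference of each annotated run from the baseline, in BLEU points.
deltas = {
    name: round(score - baseline, 4)
    for name, score in bleu_scores.items()
    if name != "No annotation"
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")

# Name of the annotation scheme with the largest gain over the baseline.
best = max(deltas, key=deltas.get)
print("Largest gain:", best)
```

In these runs, only the part-of-speech annotation scores below the baseline, and the target-token-id alignment shows the largest gain.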


## Translation with the models

Now that we've trained our models, let's see how to use one of them to translate text. If you pushed a model to the Hub, you can load it from there with `TFAutoModelForSeq2SeqLM.from_pretrained`, which lets you resume from this point without rerunning everything above.

Now let's try tokenizing some text and passing it to the model to generate a translation. If you're using a `T5` model, don't forget to add a task prefix such as "translate English to French: " at the start; the Marian checkpoint used in this notebook does not need one.

In [116]:
input_text = "I'm not actually a very competent Romanian speaker, but let's try our best."

tokenized = tokenizer([input_text], return_tensors='np')
# In the line below use the variable name of the model you want to test
out = model.generate(**tokenized, max_length=128)
print(out)

tf.Tensor(
[[59513   134   148   157    76   558    53    34 27645 34846   366 13369
    175 12143     5   371  1049     3     0 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513 59513
  59513 59513 59513 59513 59513 59513 59513 59513]], shape=(1, 128), dtype=int32)


Well, that's some tokens and a lot of padding! Let's decode those to see what it says, using the `skip_special_tokens` argument to skip those padding tokens:

In [44]:
with tokenizer.as_target_tokenizer():
    print(tokenizer.decode(out[0], skip_special_tokens=True))

En fait je ne suis pas un orateur roumain très compétent mais faisons de notre mieux.
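
To see what `skip_special_tokens=True` removes, here is a minimal pure-Python sketch over the ids printed above. It assumes, based on that output, that `59513` is the padding id (also used as the first decoder token) and `0` marks the end of the sequence. The helper `strip_special_ids` is hypothetical and only imitates the filtering; the real tokenizer decides which ids are special from its own configuration.

```python
# Assumptions inferred from the generated output above, not read from the tokenizer.
PAD_ID = 59513  # padding id, also the first token of the generated sequence
EOS_ID = 0      # end-of-sequence id

# The generated ids: start token, 17 content tokens, EOS, then padding up to 128.
generated_ids = [
    59513, 134, 148, 157, 76, 558, 53, 34, 27645, 34846, 366, 13369,
    175, 12143, 5, 371, 1049, 3, 0,
] + [PAD_ID] * 109


def strip_special_ids(ids, special_ids):
    """Return the ids that are not in the set of special ids (hypothetical helper)."""
    return [token_id for token_id in ids if token_id not in special_ids]


content_ids = strip_special_ids(generated_ids, {PAD_ID, EOS_ID})

print(len(generated_ids))  # total length, including padding
print(len(content_ids))    # ids that actually carry the translation
print(content_ids)
```

Only 17 of the 128 generated ids carry text; the rest is the start token, the end-of-sequence token and padding up to `max_length`.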
