If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets by running the following cells. We also use the `sacrebleu` and `sentencepiece` libraries; you may need to install these even if you already have 🤗 Transformers!

In [None]:
! pip install transformers[sentencepiece] datasets
! pip install sacrebleu sentencepiece
! pip install huggingface_hub

In [None]:
! pip install tensorflow==2.9

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!) then uncomment the following cell and input your token:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Then you need to install Git-LFS and set up Git if you haven't already. Uncomment the following instructions and adapt them with your name and email:

In [None]:
# !apt install git-lfs
# !git config --global user.email "you@example.com"
# !git config --global user.name "Your Name"

Make sure your version of Transformers is at least 4.16.0 since some of the functionality we use was introduced in that version:

In [1]:
import transformers

print(transformers.__version__)

4.21.0


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

# Convert CSV-file to a dataset-ready format

The code below works with a specifically formatted CSV file. Your CSV should have at least two columns, `en` and `xx`, where `xx` is the code of the target language. The cell below defines a function that converts such a file into a dataset.

If the CSV file has part-of-speech (PoS) tags for the source and target languages, the expected column names for them are `pos_en` and `pos_xx`.

If the CSV file has word alignment (WA) tags, the expected column name is `wa`.
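To make the expected layout more concrete, here is a minimal sketch that reads a tiny in-memory CSV and turns each row into the nested record shape used in this notebook. The sample sentences and the helper name `rows_to_records` are made up for illustration, and the sketch uses only the standard library rather than pandas.

```python
import csv
import io

# Made-up sample data: two sentence pairs with a word-alignment column.
SAMPLE_CSV = """en,zh,wa
good morning,早上 好,0-1 1-0
thank you,谢谢 你,0-0 1-1
"""


def rows_to_records(csv_text, source_lang, target_lang):
    """Hypothetical helper: map CSV rows to nested translation records."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {
            "translation": {
                source_lang: row[source_lang],
                target_lang: row[target_lang],
            }
        }
        # Optional columns are copied only when the file contains them.
        if "wa" in row:
            record["wa"] = row["wa"]
        records.append(record)
    return records


records = rows_to_records(SAMPLE_CSV, "en", "zh")
print(records[0])
```

Each record keeps the source and target sentence together under `translation`, which is the shape the preprocessing code later in the notebook reads from.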

In [2]:
import pandas as pd
from datasets import Dataset


# source_lang accepted value = 'en'
# target_lang accepted values = 'fr'|'zh'
# Choose pos_tags=True if the file has PoS tags for both languages
# Choose wa_tags=True if the file has WA tags.
# store is meant to create a JSON dump of the file for later use; the function below does not use it

def csv_to_dataset(filename, source_lang, target_lang, pos_tags=False, wa_tags=False, store=False):
    data = pd.read_csv(filename)
    new_df = pd.DataFrame()
    new_df['translation'] = [{source_lang: x, target_lang: y} for x, y in zip(data[source_lang], data[target_lang])]
    if pos_tags:
        new_df['pos'] = [{source_lang: x, target_lang: y} for x, y in zip(data[f'pos_{source_lang}'], data[f'pos_{target_lang}'])]
    if wa_tags:
        new_df['wa'] = data['wa']
    return Dataset.from_pandas(new_df).train_test_split(test_size=0.2)

In [3]:
from datasets import load_from_disk

loaded_dataset = load_from_disk('zh_split_dataset.hf')

# Fine-tuning a model on a translation task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a translation task. We will use the [WMT dataset](http://www.statmt.org/wmt16/), a machine translation dataset composed of a collection of various sources, including news commentaries and parliament proceedings.

![Widget inference on a translation task](images/translation.png)

We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using Keras.

In [4]:
tokenizer_checkpoint = "xlm-roberta-base"
model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
facebook_model = 'facebook/mbart-large-cc25'

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`Helsinki-NLP/opus-mt-en-zh`](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh) checkpoint.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to get the data and the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the `datasets` function `load_dataset` and the `evaluate` function `load`. Here, instead of downloading WMT, we work with the English/Chinese dataset loaded from disk above.

In [5]:
from datasets import load_dataset
from evaluate import load

metric = load("sacrebleu")

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [6]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    return df

In [7]:
show_random_elements(loaded_dataset["train"])

Unnamed: 0,translation,pos,wa
0,{'en': 'two days later on 29 march more fighti...,{'en': 'two NUM days NOUN later ADV on ADP 29 ...,0-0 1-0 2-1 3-3 4-4 4-5 5-2 5-6 6-11 6-12 7-8 ...
1,{'en': 'He also stated that Turkey would not g...,{'en': 'He PRON also ADV stated VERB that SCON...,0-0 1-1 2-2 3-2 4-3 5-4 6-4 7-5 8-6 10-8 12-9 ...
2,{'en': 'he also managed the los nettos network...,{'en': 'he PRON also ADV managed VERB the DET ...,0-0 1-1 2-2 2-3 3-3 4-4 5-5 5-6 6-6
3,{'en': 'to win a war you need to know your ene...,{'en': 'to PART win VERB a DET war NOUN you PR...,0-0 1-1 2-1 3-2 4-3 5-4 6-4 7-5 8-6 8-7 9-8
4,{'en': 'sometimes that result is an equilibriu...,{'en': 'sometimes ADV that DET result NOUN is ...,0-0 1-1 2-2 3-3 4-4 5-5


## Preprocessing the data

In [None]:
!pip install jieba

In [88]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [36]:
import jieba

In [238]:
loaded_dataset['train']['pos'][90]['zh']

'他们 r\n砍掉 v\n高大 a\n的 uj\n树 v\n'

In [35]:
tokenizer.encode('a good side a bad side a past a future')

[10, 4127, 5609, 10, 6494, 5609, 10, 11015, 10, 22690, 2, 250004]

For both English and Chinese, the encodings end with 2s, which shouldn't affect the WA and PoS tags.

In comparison to the English tokenized sentence, the Chinese sentence also includes space encodings (token id 8). We will filter them out later when working with WA.

In [266]:
list(filter(lambda x: x!= 8, tokenizer.encode('Safari 不 支持 MNGJNG')))

[65097, 156, 316, 65033, 0]

We add all the Chinese tokens from our files to the tokenizer so that it won't split them:

In [91]:
tokens = set()

for x in loaded_dataset['train']['translation']:
    zh_sent = x['zh'].rstrip().split()
    tokens.update(zh_sent)

In [102]:
new_tokens = tokens - set(tokenizer.get_vocab().keys())

In [104]:
tokenizer.add_tokens(list(new_tokens))

2010

Later on, for Word Alignment encoding we will be using these token values instead of real words to express relatedness of words in 2 sentences.

In [None]:
#! python -m spacy download zh_core_web_sm

In [38]:
import spacy

en_pos_sp = spacy.load("en_core_web_sm")
zh_pos_sp = spacy.load('zh_core_web_sm')

If you are using one of the five T5 checkpoints that require a special prefix to put before the inputs, you should adapt the following cell.

In [39]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "translate English to Chinese: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This will ensure that an input longer than what the selected model can handle will be truncated to the maximum length accepted by the model. The padding will be dealt with later on (in a data collator) so we pad examples to the longest length in the batch and not the whole dataset.

For PoS tags we will use a separate function that will parse the sentences and extract the PoS information from there.

In [329]:
"""
get_pos_tags receives a tokenized input from the model. The tokenization is a bit different from spacy model,
so to keep the same dimensions of vectors as in the sentence embeddings for each token in a sentence we will:
- decode the token received from the model
- get a Part of Speech id for it from Spacy and return it
"""
spacy_encoding = {
    'ADJ': 0.3,
    'ADP': 0.5,
    'ADV': 0.4,
    # Note: 0.10 is the same float as 0.1, so AUX and NOUN share a code
    'AUX': 0.10,
    'CCONJ': 0.11,
    'DET': 0.6,
    'INTJ': 0.12,
    'NOUN': 0.1,
    'NUM': 0.13,
    'PART': 0.7,
    'PRON': 0.8,
    'PROPN': 0.9,
    'PUNCT': 0.14,
    'SCONJ': 0.15,
    'SPACE': 0.16,
    'SYM': 0.17,
    'VERB': 0.2,
    'X': 0.18,
}

jieba_spacy = {
    'a': 'ADJ',
    'ad': 'ADJ',
    'an': 'ADJ',
    'b': 'ADJ',
    'c': 'CCONJ',
    'd': 'ADV',
    'df': 'ADV',
    'dg': 'ADV',
    'e': 'INTJ',
    'f': 'ADP',
    'g': 'NOUN',
    'h': 'ADV',
    'j': 'PROPN',
    'k': 'PART',
    'm': 'NUM',
    'mg': 'NUM',
    'mq': 'NOUN',
    'n': 'NOUN',
    'ng': 'NOUN',
    'nr': 'PROPN',
    'nrfg': 'PROPN',
    'nrt': 'PROPN',
    'ns': 'PROPN',
    'nt': 'PROPN',
    'nz': 'PROPN',
    'o': 'NOUN',
    'p': 'ADP',
    'q': 'NOUN',
    'r': 'PRON',
    'rg': 'PRON',
    'rr': 'PRON',
    'rz': 'PRON',
    's': 'NOUN',
    't': 'NOUN',
    'u': 'PART',
    'ud': 'PART',
    'ug': 'PART',
    'uj': 'PART',
    'ul': 'PART',
    'uv': 'PART',
    'uz': 'PART',
    'v': 'VERB',
    'vd': 'VERB',
    'vg': 'VERB',
    'vi': 'VERB',
    'vn': 'VERB',
    'vq': 'VERB',
    'w': 'PUNCT',
    'x': 'NOUN',
    'y': 'INTJ',
    'z': 'ADV',
    'zg': 'PROPN'
}

"""
Stop getting the POS on the fly
def token_to_pos(token, lang):
    if lang == 'en':
        decoded = list(en_pos_sp(tokenizer.decode(token)))
    elif lang == 'zh':
        decoded = list(zh_pos_sp(tokenizer.decode(token)))
    return decoded[-1].pos if decoded else -1
"""

def get_pos_tags(tokenized_sent, pos_tags_sent, lang):
    n = len(tokenized_sent)
    pos_embedding = [0] * n
    
    words_pos = pos_tags_sent.rstrip().split('\n')
    pos_tags = list(map(lambda x: x.split()[-1], words_pos))
    m = len(pos_tags)
    
    
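    i = 0
    j = 0
    while (i < n) and (j < m):
        if lang == 'en':
            # Stop once we run out of PoS tags for the English sentence
            # (the original `i < m` check exited on the very first token)
            if i >= m:
                break
            pos_embedding[i] = spacy_encoding.get(pos_tags[i], 100)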
            
        elif lang == 'zh':
            if tokenized_sent[i] == 8:
                spacy_tag = 'SPACE'
            else:
                spacy_tag = jieba_spacy.get(pos_tags[j], None)
                if not spacy_tag:
                    decoded_word = tokenizer.decode(tokenized_sent[i])
                    spacy_tag = zh_pos_sp(decoded_word)[-1].pos_ if decoded_word else 'SPACE'
                j += 1
            pos_embedding[i] = spacy_encoding.get(spacy_tag, 100)
        i += 1
        
    return pos_embedding
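As a small illustration of the parsing step inside `get_pos_tags`, the sketch below takes the Chinese PoS string shown earlier and maps each jieba tag to its numeric code. The two dictionaries are just subsets of `jieba_spacy` and `spacy_encoding` copied from the cell above, and `parse_pos_column` is a hypothetical helper name; alignment with tokenizer ids is left out on purpose.

```python
# Subset of the jieba -> spaCy tag mapping defined above.
JIEBA_TO_SPACY = {"r": "PRON", "v": "VERB", "a": "ADJ", "uj": "PART"}
# Subset of the numeric codes defined above.
SPACY_TO_CODE = {"PRON": 0.8, "VERB": 0.2, "ADJ": 0.3, "PART": 0.7}


def parse_pos_column(pos_tags_sent):
    """Return the tag (last whitespace-separated field) of every non-empty line."""
    return [line.split()[-1] for line in pos_tags_sent.splitlines() if line.strip()]


sample = "他们 r\n砍掉 v\n高大 a\n的 uj\n树 v\n"
jieba_tags = parse_pos_column(sample)
codes = [SPACY_TO_CODE[JIEBA_TO_SPACY[tag]] for tag in jieba_tags]

print(jieba_tags)
print(codes)
```

Skipping empty lines before taking the last field avoids an `IndexError` on strings that contain blank lines.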

Word Alignment information can be encoded in different ways:
- Name: `trg-ids`. Create a vector of the same length as the tokenized input sentence. For each position i in the new vector, find a corresponding word in the original input sentence. Find a connected word from the target sentence and put its tokenized value in the new vector.
- Name: `sums`. Create a copy of the input vector. For i-th word that has a connected word in the target sentence, add its value to the tokenized value to the i-th position of the new vector.
- Name: `mult`. Same as sums but replaces sums with multiplication of values.
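The three encodings listed above can be sketched on toy data as follows. The token ids and the alignment string are made up, `toy_encode_wa` is a hypothetical name, and unlike the notebook's own function this sketch skips out-of-range alignments instead of stopping at the first one.

```python
def toy_encode_wa(input_ids, target_ids, wa, wa_type):
    """Toy version of the three word-alignment encodings."""
    # "0-1 2-0" -> {0: 1, 2: 0}: source position -> target position
    wa_dict = {int(src): int(trg) for src, trg in (p.split("-") for p in wa.split())}
    # trg-ids starts from zeros; sums and mult start from a copy of the input
    out = [0] * len(input_ids) if wa_type == "trg-ids" else list(input_ids)
    for src, trg in wa_dict.items():
        if src >= len(input_ids) or trg >= len(target_ids):
            continue  # ignore alignments that point outside either sentence
        if wa_type == "trg-ids":
            out[src] = target_ids[trg]
        elif wa_type == "sums":
            out[src] += target_ids[trg]
        elif wa_type == "mult":
            out[src] *= target_ids[trg]
    return out


input_ids = [5, 7, 9]   # made-up source token ids
target_ids = [11, 13]   # made-up target token ids
wa = "0-1 2-0"          # source 0 <-> target 1, source 2 <-> target 0

for wa_type in ("trg-ids", "sums", "mult"):
    print(wa_type, toy_encode_wa(input_ids, target_ids, wa, wa_type))
```

Position 1 has no aligned target word, so it stays 0 for `trg-ids` and keeps its original id for `sums` and `mult`.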

In [270]:
if list(zh_pos_sp('')): print('hello')

In [264]:
tokenizer.decode(6)

'.'

In [267]:
from copy import deepcopy


def encode_wa(tokenized_input, tokenized_target, wa, wa_type):
    # Parse the WA string into a dictionary
    wa_dict = {int(src): int(trg) for src, trg in map(lambda x: x.split('-'), wa.split())}
    n = len(tokenized_input)
    # As we have seen above, the Chinese tokenizer adds 8 for spaces, which we remove for WA to work properly
    filtered_target = list(filter(lambda x: x!=8, tokenized_target))
    m = len(filtered_target)
    if wa_type == 'trg-ids':
        wa_emb = [0]*n
        for k, v in wa_dict.items():
            if k >= n or v >= m:
                break
            wa_emb[k] = filtered_target[v]
        return wa_emb
    elif wa_type == 'sums':
        wa_emb = deepcopy(tokenized_input)
        for k, v in wa_dict.items():
            if k >= n or v >= m:
                break
            wa_emb[k] += filtered_target[v]
        return wa_emb
    
    elif wa_type == 'mult':
        wa_emb = deepcopy(tokenized_input)
        for k, v in wa_dict.items():
            if k >= n or v >= m:
                break
            wa_emb[k] *= filtered_target[v]
        return wa_emb

In [245]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "zh"
pos_tags=False
wa_type=None

def preprocess_function(dataset):
    global source_lang, target_lang, pos_tags, wa_type
    inputs = [prefix + d[source_lang] for d in dataset["translation"]]
    targets = [d[target_lang] for d in dataset["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    
    if pos_tags:
        model_inputs['pos'] = [get_pos_tags(x[0], x[1]['en'], 'en') for x in zip(model_inputs['input_ids'], dataset['pos'])]
        model_inputs['target_pos'] = [get_pos_tags(y[0], y[1]['zh'], 'zh') for y in zip(model_inputs['labels'], dataset['pos'])]
        
    if wa_type:
        model_inputs['wa'] = [encode_wa(src, trg, wa, wa_type) for src, trg, wa \
                              in zip(model_inputs['input_ids'],  model_inputs['labels'], dataset["wa"])]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key.

To apply this function to all the pairs of sentences in our dataset, we just use the `map` method of the dataset object we loaded earlier. This applies the function to all the elements of all the splits, so our training and test data are preprocessed in a single command.
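To see what "several examples" means in practice, here is a rough sketch of the shape of data a batched preprocessing function receives and returns: a dict of lists in, a dict of lists out. `toy_tokenize` and `toy_preprocess` are stand-ins invented for this example (the "token ids" are just word lengths), not the real tokenizer.

```python
def toy_tokenize(text):
    """Stand-in tokenizer: one made-up id (the word length) per word."""
    return [len(word) for word in text.split()]


def toy_preprocess(batch):
    """Receive a dict of lists (one entry per example), return a dict of lists."""
    inputs = [pair["en"] for pair in batch["translation"]]
    targets = [pair["zh"] for pair in batch["translation"]]
    return {
        "input_ids": [toy_tokenize(text) for text in inputs],
        "labels": [toy_tokenize(text) for text in targets],
    }


batch = {
    "translation": [
        {"en": "good morning", "zh": "早上 好"},
        {"en": "thank you very much", "zh": "非常 感谢"},
    ]
}
processed = toy_preprocess(batch)
print(processed["input_ids"])
print(processed["labels"])
```

Each output key holds one list per example, which is the "list of lists" layout mentioned above.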

In [43]:
from transformers.keras_callbacks import KerasMetricCallback
import numpy as np


def metric_fn(eval_predictions):
    preds, labels = eval_predictions
    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds
    ]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # We use -100 to mask labels - replace it with the tokenizer pad token when decoding
    # so that no output is emitted for these
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    result["gen_len"] = np.mean(prediction_lens)
    return result

In [48]:
from transformers.keras_callbacks import PushToHubCallback
from tensorflow.keras.callbacks import TensorBoard
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, AutoModelForMaskedLM, AutoModelForSeq2SeqLM
from transformers import AdamWeightDecay
import tensorflow as tf

## Fine-tuning the model with no extra tags

In [144]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "zh"
pos_tags=False
wa_tags = False
wa_type = None

split_dataset = loaded_dataset.remove_columns(['pos', 'wa'])
no_anno_dataset = split_dataset.map(preprocess_function, batched=True)

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [145]:
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
model.resize_token_embeddings(len(tokenizer))

Some layers from the model checkpoint at Helsinki-NLP/opus-mt-en-zh were not used when initializing TFMarianMTModel: ['final_logits_bias']
- This IS expected if you are initializing TFMarianMTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMarianMTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-zh.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


<transformers.modeling_tf_utils.TFSharedEmbeddings at 0x7f98d427a100>

Note that we don't get a warning like in our classification example. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

Next we set some parameters like the learning rate and the `batch_size` and customize the weight decay.

The last two lines (commented out here) set everything up so we can push the model to the [Hub](https://huggingface.co/models) at the end of training. Leave them commented out if you didn't follow the installation steps at the top of the notebook; otherwise, uncomment them and change the value of `push_to_hub_model_id` to something you would prefer.

In [146]:
batch_size = 16
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 1

#model_name = model_checkpoint.split("/")[-1]
#push_to_hub_model_id = f"{model_name}-finetuned-{source_lang}-to-{target_lang}"

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are multi-framework, so make sure you set `return_tensors='tf'` so you get `tf.Tensor` objects back and not something else!

In [147]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=128)
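To illustrate what padding "to the maximum length in the batch" means for both inputs and labels, here is a simplified pure-Python sketch. It is not the collator's actual implementation: `pad_batch` is a hypothetical helper, the pad token id of 0 is an assumption, and labels are padded with -100 so that the padded positions are masked out of the loss.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        in_pad = max_in - len(f["input_ids"])
        lab_pad = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * in_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * in_pad)
        batch["labels"].append(f["labels"] + [label_pad_token_id] * lab_pad)
    return batch


features = [
    {"input_ids": [5, 6, 7], "labels": [9]},
    {"input_ids": [8], "labels": [3, 4]},
]
padded = pad_batch(features)
print(padded)
```

Because the target length is computed per batch, short batches stay short instead of being padded to the longest example in the whole dataset.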

Next, we convert our datasets to `tf.data.Dataset`, which Keras understands natively. There are two ways to do this - we can use the slightly more low-level [`Dataset.to_tf_dataset()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.to_tf_dataset) method, or we can use [`Model.prepare_tf_dataset()`](https://huggingface.co/docs/transformers/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset). The main difference between these two is that the `Model` method can inspect the model to determine which column names it can use as input, which means you don't need to specify them yourself. Make sure to specify the collator we just created as our `collate_fn`!

We also want to compute `BLEU` metrics, which will require us to generate text from our model. To speed things up, we can compile our generation loop with XLA. This results in a *huge* speedup - up to 100X! The downside of XLA generation, though, is that it doesn't like variable input shapes, because it needs to run a new compilation for each new input shape! To compensate for that, let's use `pad_to_multiple_of` for the dataset we use for text generation. This will reduce the number of unique input shapes a lot, meaning we can get the benefits of XLA generation with only a few compilations.
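The effect of `pad_to_multiple_of` on the number of distinct input shapes can be sketched with a little arithmetic. The sequence lengths below are made up, and `padded_length` is a hypothetical helper that only mirrors the rounding idea, not the collator itself.

```python
def padded_length(length, multiple):
    """Round a sequence length up to the next multiple."""
    return ((length + multiple - 1) // multiple) * multiple


raw_lengths = [17, 23, 64, 100, 128]  # made-up longest lengths of five batches

shapes_unpadded = sorted(set(raw_lengths))
shapes_32 = sorted({padded_length(n, 32) for n in raw_lengths})
shapes_128 = sorted({padded_length(n, 128) for n in raw_lengths})

print(len(shapes_unpadded), shapes_unpadded)
print(len(shapes_32), shapes_32)
print(len(shapes_128), shapes_128)
```

With inputs truncated to 128 tokens and a multiple of 128, every batch ends up with the same length, so XLA would only need to compile for a single input shape in this toy setting.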

In [148]:
train_dataset = model.prepare_tf_dataset(
    no_anno_dataset["train"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator,
)

validation_dataset = model.prepare_tf_dataset(
    no_anno_dataset["test"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=data_collator,
)

generation_dataset = model.prepare_tf_dataset(
    no_anno_dataset["test"],
    batch_size=8,
    shuffle=False,
    collate_fn=generation_data_collator,
)

Now we initialize our loss and optimizer and compile the model. Note that most Transformers models compute loss internally, so we can just leave the loss argument blank to use the internal loss instead. For the optimizer, we can use the `AdamWeightDecay` optimizer in the Transformers library.

In [149]:
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


Now we can train our model. We can also add a few optional callbacks here, which you can remove if they aren't useful to you. In no particular order, these are:
- PushToHubCallback will sync up our model with the Hub - this allows us to resume training from other machines, share the model after training is finished, and even test the model's inference quality midway through training!
- TensorBoard is a built-in Keras callback that logs TensorBoard metrics.
- KerasMetricCallback is a callback for computing advanced metrics. There are a number of common metrics in NLP like ROUGE which are hard to fit into your compiled training loop because they depend on decoding predictions and labels back to strings with the tokenizer, and calling arbitrary Python functions to compute the metric. The KerasMetricCallback will wrap a metric function, outputting metrics as training progresses.

If this is the first time you've seen `KerasMetricCallback`, it's worth explaining what exactly is going on here. The callback takes two main arguments - a `metric_fn` and an `eval_dataset`. It then iterates over the `eval_dataset` and collects the model's outputs for each sample, before passing the `list` of predictions and the associated `list` of labels to the user-defined `metric_fn`. If the `predict_with_generate` argument is `True`, then it will call `model.generate()` for each input sample instead of `model.predict()` - this is useful for metrics that expect generated text from the model, like `ROUGE` and `BLEU`.

This callback allows complex metrics to be computed each epoch that would not function as a standard Keras Metric. Metric values are printed each epoch, and can be used by other callbacks like `TensorBoard` or `EarlyStopping`.
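The flow described above (iterate over the evaluation data, collect predictions and labels, then hand both lists to the metric function) can be sketched in plain Python. This is only a simplified stand-in, not the library's implementation: `run_metric_callback`, `echo_model` and `exact_match` are all hypothetical names made up for the example.

```python
def run_metric_callback(eval_batches, predict_fn, metric_fn):
    """Simplified stand-in for what the callback does at the end of an epoch."""
    all_preds, all_labels = [], []
    for batch in eval_batches:
        all_preds.extend(predict_fn(batch["input_ids"]))
        all_labels.extend(batch["labels"])
    # The metric function receives a (predictions, labels) pair
    return metric_fn((all_preds, all_labels))


def exact_match(eval_predictions):
    """Toy metric: fraction of predictions identical to their labels."""
    preds, labels = eval_predictions
    hits = sum(1 for pred, label in zip(preds, labels) if pred == label)
    return {"exact_match": hits / len(labels)}


def echo_model(batch_input_ids):
    """Toy 'model' that simply returns its inputs as predictions."""
    return [list(ids) for ids in batch_input_ids]


eval_batches = [
    {"input_ids": [[1, 2], [3, 4]], "labels": [[1, 2], [0, 0]]},
    {"input_ids": [[5]], "labels": [[5]]},
]
metrics = run_metric_callback(eval_batches, echo_model, exact_match)
print(metrics)
```

The metric function returns a dict of named values, in the same way `metric_fn` above returns `bleu` and `gen_len`.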

In [150]:
metric_callback = KerasMetricCallback(
    metric_fn=metric_fn, eval_dataset=generation_dataset, predict_with_generate=True, use_xla_generation=True, 
    generate_kwargs={"max_length": 128}
)

With the metric callback ready, now we can specify the other callbacks and fit our model:

In [151]:
tensorboard_callback = TensorBoard(log_dir="./translation_model_save/logs")

"""push_to_hub_callback = PushToHubCallback(
    output_dir="./translation_model_save",
    tokenizer=tokenizer,
    hub_model_id=push_to_hub_model_id,
)
"""


#callbacks = [metric_callback, tensorboard_callback, push_to_hub_callback]
callbacks = [metric_callback]

model.fit(
    train_dataset, validation_data=validation_dataset, epochs=1, callbacks=callbacks
)



<keras.callbacks.History at 0x7f98846ab490>

## Running the model with PoS features

This cell can run pretty slowly and may take 5-10 minutes.

In [338]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "zh"
# Pos_tags need to be set to True in the cell
pos_tags = True
wa_tags = False
wa_type = None

split_dataset = loaded_dataset.remove_columns(['wa'])
pos_anno_dataset = split_dataset.map(preprocess_function, batched=True)



In [339]:
model_with_pos = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
model_with_pos.resize_token_embeddings(len(tokenizer))



<transformers.modeling_tf_utils.TFSharedEmbeddings at 0x7f9548686700>

In [340]:
batch_size = 16
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 1

In [341]:
data_collator_pos = DataCollatorForSeq2Seq(tokenizer, model=model_with_pos, return_tensors="tf")

generation_data_collator_pos = DataCollatorForSeq2Seq(tokenizer, model=model_with_pos, return_tensors="tf", pad_to_multiple_of=128)

In [342]:
train_dataset = model_with_pos.prepare_tf_dataset(
    pos_anno_dataset["train"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator_pos,
)

validation_dataset = model_with_pos.prepare_tf_dataset(
    pos_anno_dataset["test"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=data_collator_pos,
)

generation_dataset = model_with_pos.prepare_tf_dataset(
    pos_anno_dataset["test"],
    batch_size=8,
    shuffle=False,
    collate_fn=generation_data_collator_pos,
)

In [343]:
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model_with_pos.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [344]:
metric_callback = KerasMetricCallback(
    metric_fn=metric_fn, eval_dataset=generation_dataset, predict_with_generate=True, use_xla_generation=True, 
    generate_kwargs={"max_length": 128}
)

In [345]:
tensorboard_callback = TensorBoard(log_dir="./translation_model_save/logs")

callbacks = [metric_callback]

model_with_pos.fit(
    train_dataset, validation_data=validation_dataset, epochs=1, callbacks=callbacks
)



<keras.callbacks.History at 0x7f9739edfac0>

## Running model with WA (vector with target ids)

In [314]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "zh"
pos_tags = False
wa_tags = True
wa_type="trg-ids"

split_dataset = loaded_dataset.remove_columns(['pos'])
wa_trg_id_dataset = split_dataset.map(preprocess_function, batched=True)

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [315]:
model_with_wa_trg_ids = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
model_with_wa_trg_ids.resize_token_embeddings(len(tokenizer))



<transformers.modeling_tf_utils.TFSharedEmbeddings at 0x7f963579ccd0>

In [316]:
data_collator_wa_trg_ids = DataCollatorForSeq2Seq(tokenizer, model=model_with_wa_trg_ids, return_tensors="tf")

generation_data_collator_wa_trg_ids = DataCollatorForSeq2Seq(tokenizer, model=model_with_wa_trg_ids, return_tensors="tf", pad_to_multiple_of=128)

In [317]:
train_dataset = model_with_wa_trg_ids.prepare_tf_dataset(
    wa_trg_id_dataset["train"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator_wa_trg_ids,
)

validation_dataset = model_with_wa_trg_ids.prepare_tf_dataset(
    wa_trg_id_dataset["test"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=data_collator_wa_trg_ids,
)

generation_dataset = model_with_wa_trg_ids.prepare_tf_dataset(
    wa_trg_id_dataset["test"],
    batch_size=8,
    shuffle=False,
    collate_fn=generation_data_collator_wa_trg_ids,
)

In [318]:
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model_with_wa_trg_ids.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [319]:
metric_callback = KerasMetricCallback(
    metric_fn=metric_fn, eval_dataset=generation_dataset, predict_with_generate=True, use_xla_generation=True, 
    generate_kwargs={"max_length": 128}
)

In [320]:
tensorboard_callback = TensorBoard(log_dir="./wa_trg_ids/logs")

callbacks = [metric_callback]

model_with_wa_trg_ids.fit(
    train_dataset, validation_data=validation_dataset, epochs=1, callbacks=callbacks
)



<keras.callbacks.History at 0x7f9736943af0>

# Results

In [None]:
results_list = [
    ('No annotation', 21.9916),
    ('Part of Speech', 22.4789),
    ('Word alignment: target token ids', 31.5155)
]
# Build the table directly: DataFrame.append returns a new frame instead of
# modifying `results` in place, and it is no longer available in pandas 2.x
results = pd.DataFrame(results_list, columns=['Annotation', 'BLEU'])
results
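To read the table more easily, we can compute how far each annotated model is from the baseline. The sketch below only does arithmetic on the BLEU scores listed above; the variable names `baseline` and `deltas` are introduced here for illustration.

```python
results_list = [
    ("No annotation", 21.9916),
    ("Part of Speech", 22.4789),
    ("Word alignment: target token ids", 31.5155),
]

baseline = results_list[0][1]
# Absolute BLEU difference of each annotated model with respect to the baseline
deltas = {name: round(score - baseline, 4) for name, score in results_list[1:]}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f} BLEU vs. no annotation")
```

These differences come from a single training epoch per model, so they are best read as a first indication rather than a definitive comparison.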

## Translation with the models

Now we've trained our model, let's see how we could load it and use it to translate text in the future! First, let's load it from the hub. This means we can resume the code from here without needing to rerun everything above every time.

Now let's try tokenizing some text and passing it to the model to generate a translation. Don't forget to add the task prefix (the `prefix` string defined earlier) at the start if you're using a `T5` model.

Well, that's some tokens and a lot of padding! Let's decode those to see what it says, using the `skip_special_tokens` argument to skip those padding tokens:

In [117]:
with tokenizer.as_target_tokenizer():
    print(tokenizer.decode(out[0], skip_special_tokens=True))

我 其实 不是 一个 非常 有 能力 的 罗马尼亚 演讲 但 让我们 尽 我们 的 最大 努力


In [322]:
vital = pd.read_csv("~/wikipedia/wiki/scrapping/ver2/articles_clean_ver2/en_only_fr.csv", sep=";", encoding="iso-8859-1")


In [324]:
vital['sentences']

0     In Chinese painting, abstraction can be traced...
1     While none of his paintings remain, this style...
2     The Chan buddhist painter Liang Kai (??, c. 11...
3     A late Song painter named Yu Jian, adept to Ti...
4     When Turing was 39 years old in 1951, he turne...
5     He was interested in morphogenesis, the develo...
6     He suggested that a system of chemicals reacti...
7     He used systems of partial differential equati...
8     For example, if a catalyst A is required for a...
9     Turing discovered that patterns could be creat...
10    If A and B then diffused through the container...
11    To calculate the extent of this, Turing would ...
12    These calculations gave the right qualitative ...
13    The Russian biochemist Boris Belousov had perf...
14    Belousov was not aware of Turing's paper in th...
15    Although published before the structure and ro...
16    One of the early applications of Turing's pape...
17    Further research in the area suggests that

### Model with no annotation

In [None]:
outputs = []

for input_sentence in vital["sentences"]:
    tokenized_sentence = tokenizer([input_sentence], return_tensors='np')
    out = model.generate(**tokenized_sentence, max_length=128)
    with tokenizer.as_target_tokenizer():
        output_sentence = tokenizer.decode(out[0], skip_special_tokens=True)
        print(output_sentence)
        outputs.append(output_sentence)

In [None]:
vital["no_anno"] = outputs
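The same tokenize, generate and decode loop is repeated for every model in the sections that follow. A minimal sketch of a reusable helper is shown here; the name `translate_all` is hypothetical, and the function simply mirrors the calls used in the cell above (`tokenizer(...)`, `model.generate(...)`, `tokenizer.as_target_tokenizer()` and `tokenizer.decode(...)`) without printing.

```python
def translate_all(model, tokenizer, sentences, max_length=128):
    """Translate each sentence in turn and return the decoded strings in order."""
    translations = []
    for sentence in sentences:
        # Tokenize a batch of one sentence into NumPy tensors.
        encoded = tokenizer([sentence], return_tensors="np")
        generated = model.generate(**encoded, max_length=max_length)
        # Decode with the target-side tokenizer, as in the cells of this notebook.
        with tokenizer.as_target_tokenizer():
            text = tokenizer.decode(generated[0], skip_special_tokens=True)
        translations.append(text)
    return translations
```

With such a helper, each column could be filled in one line, for example `vital["no_anno"] = translate_all(model, tokenizer, vital["sentences"])`.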

### Model with POS tags (jieba)

In [346]:
outputs_pos = []

for input_sentence in vital["sentences"]:
    tokenized_sentence = tokenizer([input_sentence], return_tensors='np')
    out = model_with_pos.generate(**tokenized_sentence, max_length=128)
    with tokenizer.as_target_tokenizer():
        output_sentence = tokenizer.decode(out[0], skip_special_tokens=True)
        print(output_sentence)
        outputs_pos.append(output_sentence)

在 中国 的 画 中, 抽象 可以 追溯 到 汤 时代 的 画 家 王 模 的 风格
虽然 他 的 绘 画 都 没有 留下 了 但 这种 风格 在 一些 宋 时代 的 绘画 中 明显 可见
Chan Buddhist 画家 梁凯 (?, c 1140 * 1210 ) 运用 了 这种 风格 来 绘制 他 的 “ 在 喷 墨 中 的 永恒 ” 其中 牺牲 了 准确 的 表述 来 增强 与 开 明 人 非 理性 的 思想 相联系 的 自 性
一位 已故 的 画 画 人  看向 并 得 于 天 台 佛教 创建 了 一系列 被 喷 的 墨 景 最终 激发 许多 日本 锌 画 家
当 1951 年 Turing 39 岁 时 他 转向 数学 生物学 最后 在 1 月 发表 他 的 杰作 《 产生 的 化学 基础 》
他 对 生物 生物 的 形态 和 形状 的 发展 感兴趣
他 提议 一个 化学 系统 互相 反应 并 在 整个 空间 传播 称为 一个 反应 扩散 系统 可能 代表 了 “ 产生 的 主要 现象 ”
他 使用 部分 差异 等 系统 来 模拟 催 化 化学 反应
例如 如果 某种 化学 反应 需要 一个 催化剂 A 进行 并且 如果 该 反应 产生 了 更多 的 催化剂 A 那么 我们 就 说 该 反应 是 自动 催化 的, 有 积极 的 反馈 可以 以 非 线 差异 公式 来 模拟
Turing 发现 如果 化学 反应 不仅 产生 催化剂 A 并且 还 产生 抑制 剂 B 从而 减缓 A 的 生产
如果 A 和 B 然后 以 不同 的 速度 在 集装箱 中 扩散 那么 你 可以 有 一些 A 统治 的 地区 和 一些 B 统治 的 地区
为 计算 其 程度, 图 灵 需要 一个 强大 的 计算机 但 这些 计算机 在 1951 年 并 没有 如此 自由 的 可用 因此 他 不得不 使用 线 近 点 来 亲 手 解决 这些 方程式
这些 计算 给出 了 正确 的 质量 结果 例如 产生 了 一个 统一 的 混合物 奇怪 地 经常 间隔 固定 的 红色 点
俄罗斯 生物 化学 学家 Boris Belousov 曾 进行 类似 的 实验 但 无法 发表 他 的 论文 因为 当代 的 偏见 任何 这种 东西 都 违反 了 热 动力 的 第二项 法律
Belousov 并

In [347]:
vital["pos_anno"] = outputs_pos

In [348]:
vital

Unnamed: 0,article,sentences,no_anno,pos_anno
0,Abstract art,"In Chinese painting, abstraction can be traced...",在 中国 的 画 中 抽象 可以 追溯 到 唐 时代 的 画 人 王 模 (? ) 他 被 ...,"在 中国 的 画 中, 抽象 可以 追溯 到 汤 时代 的 画 家 王 模 的 风格"
1,Abstract art,"While none of his paintings remain, this style...",虽然 他 的 绘 画 都 没有 留下 但 这种 风格 在 一些 宋 时代 的 绘画 中 明显 可见,虽然 他 的 绘 画 都 没有 留下 了 但 这种 风格 在 一些 宋 时代 的 绘画 中 ...
2,Abstract art,"The Chan buddhist painter Liang Kai (??, c. 11...","Chan buddist 画家 Liang Kai (?, c 1140 * 1210 ) ...","Chan Buddhist 画家 梁凯 (?, c 1140 * 1210 ) 运用 了 这..."
3,Abstract art,"A late Song painter named Yu Jian, adept to Ti...",一位 已故 的 画 画 名为 于 建 的 专家 于 天 台 孟加拉 创建 了 一系列 被 喷...,一位 已故 的 画 画 人 看向 并 得 于 天 台 佛教 创建 了 一系列 被 喷 的 ...
4,Alan Turing,"When Turing was 39 years old in 1951, he turne...",当 1951 年 Turing 年 39 岁 时 他 转向 数学 生物学 最后 在 年 1 ...,当 1951 年 Turing 39 岁 时 他 转向 数学 生物学 最后 在 1 月 发表...
5,Alan Turing,"He was interested in morphogenesis, the develo...",他 对 生物 生物 生物 的 形态 和 形状 的 发展 感兴趣,他 对 生物 生物 的 形态 和 形状 的 发展 感兴趣
6,Alan Turing,He suggested that a system of chemicals reacti...,他 提出 一个 化学 系统 互相 反应 并 扩散 整个 空间 被 称为 一个 反应 扩散 系...,他 提议 一个 化学 系统 互相 反应 并 在 整个 空间 传播 称为 一个 反应 扩散 系...
7,Alan Turing,He used systems of partial differential equati...,他 使用 部分 差异 等 系统 来 模拟 催 化 化学 反应,他 使用 部分 差异 等 系统 来 模拟 催 化 化学 反应
8,Alan Turing,"For example, if a catalyst A is required for a...",例如 如果 某种 化学 反应 需要 一个 催化剂 A 而 如果 该 反应 产生 了 更多 的...,例如 如果 某种 化学 反应 需要 一个 催化剂 A 进行 并且 如果 该 反应 产生 了 ...
9,Alan Turing,Turing discovered that patterns could be creat...,Turing 发现 如果 化学 反应 不仅 产生 了 催化剂 A 并且 也 产生 了 抑制 ...,Turing 发现 如果 化学 反应 不仅 产生 催化剂 A 并且 还 产生 抑制 剂 B ...


### Model with word alignment (WA)

In [351]:
outputs_wa = []

for input_sentence in vital["sentences"]:
    tokenized_sentence = tokenizer([input_sentence], return_tensors='np')
    out = model_with_wa_trg_ids.generate(**tokenized_sentence, max_length=128)
    with tokenizer.as_target_tokenizer():
        output_sentence = tokenizer.decode(out[0], skip_special_tokens=True)
        print(output_sentence)
        outputs_wa.append(output_sentence)

在 中国 的 画 中 的 抽象 可以 追溯 到 唐 的 画 人 Wang Mo (? ) 的 发明 了 被 认为 是 发明 了 被 喷 的 墨 的 风格
虽然 他 的 作品 没有 留下 但 这种 风格 在 一些 宋 时代 的 绘画 中 明显 可见
Chan buddist 画家 Liang Kai (?, c 1140 * 1210 ) 运用 了 这种 风格 来 绘制 他 的 “ 在 喷 墨 中 的 永恒 ” 其中 牺牲 了 准确 的 表述 来 增强 与 开 人 非 理性 的 思想 相联系 的 自 性
一个 已故 的 画 人  beating 于 车站 并 得 于 天 台 佛教 并 创建 了 一系列 的 喷 墨 景观 最终 激励 许多 日本 犹太人 的 画
当 图 灵 在 1951 年 39 岁 时 他 转向 数学 生物学 最后 在 1 年 1 月 12 月 发表 他 的 杰作 《 产生 的 化学 基础 》
他 对 生物 生物 的 形态 和 形状 的 发展 感兴趣
他 建议 一种 化学 系统 相互 反应 并 传播 于 整个 空间 的 系统 被 称为 反应 传播 系统 可以 解释 出 于 摩 的 主要 现象
他 使用 部分 差异 等 系统 来 模拟 催 化 化学 反应
例如 如果 需要 一个 催化剂 A 来 进行 某种 的 化学 反应 如果 该 反应 产生 了 更多 的 催化剂 A 那么 我们 就 说 该 反应 是 自动 催化 的, 有 积极 的 反馈 可以 以 非 线 差异 等 模式 来 模拟
图 灵 发现 如果 化学 反应 不仅 产生 催化剂 A 并且 还 产生 一种 抑制 剂 B 减缓 A 的 生产
如果 A 和 B 然后 以 不同 的 速度 在 容器 中 扩散 那么 你 可以 有 一些 A 统治 的 地区 和 一些 B 统治 的 地区
为 计算 这个 程度 的 程度, 图 灵 需要 一个 强大 的 计算机 但 这些 计算机 在 1951 年 没有 如此 免费 的 使用 因此 他 必须 使用 线 近 来 解决 人工 的 等 式
这些 计算 给出 了 正确 的 质量 结果 并 产生 例如 一种 统一 的 混合物 奇怪 地 经常 间 固定 的 红色 点
俄罗斯 的 生物 化学 学家 Boris Belousov 已经 进行 了 类似 的 实验 但 无法 得到 他 的

In [352]:
vital["wa_anno"] = outputs_wa
vital

Unnamed: 0,article,sentences,no_anno,pos_anno,wa_anno
0,Abstract art,"In Chinese painting, abstraction can be traced...",在 中国 的 画 中 抽象 可以 追溯 到 唐 时代 的 画 人 王 模 (? ) 他 被 ...,"在 中国 的 画 中, 抽象 可以 追溯 到 汤 时代 的 画 家 王 模 的 风格",在 中国 的 画 中 的 抽象 可以 追溯 到 唐 的 画 人 Wang Mo (? ) 的...
1,Abstract art,"While none of his paintings remain, this style...",虽然 他 的 绘 画 都 没有 留下 但 这种 风格 在 一些 宋 时代 的 绘画 中 明显 可见,虽然 他 的 绘 画 都 没有 留下 了 但 这种 风格 在 一些 宋 时代 的 绘画 中 ...,虽然 他 的 作品 没有 留下 但 这种 风格 在 一些 宋 时代 的 绘画 中 明显 可见
2,Abstract art,"The Chan buddhist painter Liang Kai (??, c. 11...","Chan buddist 画家 Liang Kai (?, c 1140 * 1210 ) ...","Chan Buddhist 画家 梁凯 (?, c 1140 * 1210 ) 运用 了 这...","Chan buddist 画家 Liang Kai (?, c 1140 * 1210 ) ..."
3,Abstract art,"A late Song painter named Yu Jian, adept to Ti...",一位 已故 的 画 画 名为 于 建 的 专家 于 天 台 孟加拉 创建 了 一系列 被 喷...,一位 已故 的 画 画 人 看向 并 得 于 天 台 佛教 创建 了 一系列 被 喷 的 ...,一个 已故 的 画 人 beating 于 车站 并 得 于 天 台 佛教 并 创建 了 ...
4,Alan Turing,"When Turing was 39 years old in 1951, he turne...",当 1951 年 Turing 年 39 岁 时 他 转向 数学 生物学 最后 在 年 1 ...,当 1951 年 Turing 39 岁 时 他 转向 数学 生物学 最后 在 1 月 发表...,当 图 灵 在 1951 年 39 岁 时 他 转向 数学 生物学 最后 在 1 年 1 月...
5,Alan Turing,"He was interested in morphogenesis, the develo...",他 对 生物 生物 生物 的 形态 和 形状 的 发展 感兴趣,他 对 生物 生物 的 形态 和 形状 的 发展 感兴趣,他 对 生物 生物 的 形态 和 形状 的 发展 感兴趣
6,Alan Turing,He suggested that a system of chemicals reacti...,他 提出 一个 化学 系统 互相 反应 并 扩散 整个 空间 被 称为 一个 反应 扩散 系...,他 提议 一个 化学 系统 互相 反应 并 在 整个 空间 传播 称为 一个 反应 扩散 系...,他 建议 一种 化学 系统 相互 反应 并 传播 于 整个 空间 的 系统 被 称为 反应 ...
7,Alan Turing,He used systems of partial differential equati...,他 使用 部分 差异 等 系统 来 模拟 催 化 化学 反应,他 使用 部分 差异 等 系统 来 模拟 催 化 化学 反应,他 使用 部分 差异 等 系统 来 模拟 催 化 化学 反应
8,Alan Turing,"For example, if a catalyst A is required for a...",例如 如果 某种 化学 反应 需要 一个 催化剂 A 而 如果 该 反应 产生 了 更多 的...,例如 如果 某种 化学 反应 需要 一个 催化剂 A 进行 并且 如果 该 反应 产生 了 ...,例如 如果 需要 一个 催化剂 A 来 进行 某种 的 化学 反应 如果 该 反应 产生 了...
9,Alan Turing,Turing discovered that patterns could be creat...,Turing 发现 如果 化学 反应 不仅 产生 了 催化剂 A 并且 也 产生 了 抑制 ...,Turing 发现 如果 化学 反应 不仅 产生 催化剂 A 并且 还 产生 抑制 剂 B ...,图 灵 发现 如果 化学 反应 不仅 产生 催化剂 A 并且 还 产生 一种 抑制 剂 B ...


### Model without fine-tuning

In [353]:
outputs_original = []

model_original = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
model_original.resize_token_embeddings(len(tokenizer))

for input_sentence in vital["sentences"]:
    tokenized_sentence = tokenizer([input_sentence], return_tensors='np')
    out = model_original.generate(**tokenized_sentence, max_length=128)
    with tokenizer.as_target_tokenizer():
        output_sentence = tokenizer.decode(out[0], skip_special_tokens=True)
        print(output_sentence)
        outputs_original.append(output_sentence)

Some layers from the model checkpoint at Helsinki-NLP/opus-mt-en-zh were not used when initializing TFMarianMTModel: ['final_logits_bias']
- This IS expected if you are initializing TFMarianMTModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMarianMTModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-zh.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


在中国的绘画中,可以追溯到 唐朝画家王模( - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
{\fn黑体\fs22\bord1\shad0\3aHBE\4aH00\fscx67\fscy66\2cHFFFFFF\3cH808080}虽然他的绘画 {\fn黑体\fs22\bord1\shad0\3aHBE\4aH00\fscx67\fscy66\2cHFFFFFF\3cH808080} {\fn黑体\fs22\bord1\shad0\3aHBE\4aH00\fscx67\fscy66\2cHFFFFFF\3cH808080} {\fn黑体\fs22\bord1\shad0\3aHBE\4aH00\fscx67\fscy66\2cHFFFFFF\3cH808080}但是 {\fn黑体\fs22\bord1\shad0\3aHBE\4aH00\fscx67\fscy66\2
{\fn黑体\fs22\bord1\shad0\3aHBE\4aH00\fscx67\fscy66\2cHFFFFFF\3cH808080} {\fn黑体\fs22\bord1\shad0\3aHBE\4aH00\fscx67\fscy66\2cHFFFFFF\3cH808080} {\fn黑体\fs22\bord1\shad0\3aHBE\4aH00\fscx67\fscy66\2cHFFFFFF\3cH808080} {\fn黑体\fs22\bord1\shad0\3aHBE\4aH00\fscx67\fscy66\2cHFFFFFF\3cH808080} {\fn黑体\fs22\bord1\shad0\3aHBE\4aH00\fscx67\fscy66\2cHFFFFFF\3cH808080} {\
{\fn黑体\fs22\bord1\shad0\3aHBE\4aH00\fscx67\fscy66\2cHFFFFFF\

KeyboardInterrupt: 
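Because the run above was interrupted, `outputs_original` holds fewer entries than there are sentences and cannot be assigned as a column of `vital`. One possible way to keep partial results is sketched below; the helper `collect_with_interrupt` and its `translate_fn` argument are hypothetical and stand in for whatever function translates a single sentence.

```python
def collect_with_interrupt(sentences, translate_fn):
    """Translate sentences one by one; on Ctrl-C keep what was finished.

    Returns a list as long as `sentences`, with None for untranslated rows.
    """
    sentences = list(sentences)
    results = []
    try:
        for sentence in sentences:
            results.append(translate_fn(sentence))
    except KeyboardInterrupt:
        print(f"Interrupted after {len(results)} of {len(sentences)} sentences")
    # Pad with None so the list lines up with the DataFrame rows.
    results.extend([None] * (len(sentences) - len(results)))
    return results
```

The padded list has one entry per row, so it could be stored alongside the other translation columns before the CSV is written.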

In [354]:
vital.to_csv("translations_zh_models.csv", index=False)