### Environment Setup

In [239]:
import sys
python = sys.executable
python

'/Library/Developer/CommandLineTools/usr/bin/python3'

In [241]:
%%capture
! {python} -m pip install -U fsspec tqdm transformers accelerate datasets sacrebleu evaluate sentencepiece sacremoses 

In [242]:
! {python} -m pip freeze | grep -E "transformers|accelerate|datasets|sacrebleu|evaluate|sentencepiece|sacremoses"

accelerate==0.21.0
datasets==2.14.1
evaluate==0.4.0
sacrebleu==2.3.1
sacremoses==0.0.53
sentence-transformers==2.2.2
sentencepiece==0.1.99
transformers==4.31.0


# The Model: Let's pick a pretrained translation model

In [36]:
import csv

import evaluate
from datasets import load_dataset

from transformers import MarianConfig, MarianMTModel, AutoTokenizer, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [38]:
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")

# set special tokens, not sure if it's needed but adding them for sanity...
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# The Data: Let's get some "translation" data

From https://github.com/alvations/kopitiam, we have some data that translates from Singlish coffee names to their standard English counterparts.
<img src="https://blog.seedly.sg/_next/image/?url=https%3A%2F%2Fcdn-blog.seedly.sg%2Fwp-content%2Fuploads%2F2022%2F04%2F13174522%2F141222-How-to-Order-Coffee-Kopi-in-Singapore-Like-Locals-Differences-in-Prices.png&w=3840&q=75" width="700" align="left"/>


Image Source: https://blog.seedly.sg/singapore-coffee-kopi-tea-teh-guide-difference-in-price-how-to-order/

In [9]:
! wget -O kopitiam.tsv https://raw.githubusercontent.com/alvations/kopitiam/master/kopitiam.tsv

--2023-07-28 04:24:28--  https://raw.githubusercontent.com/alvations/kopitiam/master/kopitiam.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8000::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20427 (20K) [text/plain]
Saving to: ‘kopitiam.tsv’


2023-07-28 04:24:28 (7.41 MB/s) - ‘kopitiam.tsv’ saved [20427/20427]



In [10]:
! cut -f1,2 kopitiam.tsv | head -n10

Local Terms	Meaning
Kopi O	Black Coffee with Sugar
Kopi	Black Coffee with Condensed Milk
Kopi C	Black Coffee with Evaporated Milk
Kopi Kosong	Black Coffee without sugar or milk
Kopi Gah Dai	Black Coffee with extra condensed milk
Kopi Siew Dai	Coffee with Condensed Milk but less sugar
Kopi O Siew Dai	Black Coffee with less sugar
Kopi Po	Coffee with Condensed Milk but weaker (they add more water)
Kopi O Po	Black Coffee with Sugar but weaker (they add more water)


In [None]:
from datasets import load_dataset

kopi_dataset = load_dataset(
    "csv", 
    data_files="./kopitiam.tsv", 
    delimiter="\t", encoding="utf8", 
    header=None, names=['SRC', 'TRG'], skiprows=1, index_col=False,
    quoting=csv.QUOTE_NONE, quotechar="",  escapechar="\0",
    split="train"
)

In [30]:
kopi_dataset

Dataset({
    features: ['SRC', 'TRG'],
    num_rows: 168
})

In [32]:
kopi_dataset['SRC'][:10], kopi_dataset['TRG'][:10], 

(['Kopi O',
  'Kopi',
  'Kopi C',
  'Kopi Kosong',
  'Kopi Gah Dai',
  'Kopi Siew Dai',
  'Kopi O Siew Dai',
  'Kopi Po',
  'Kopi O Po',
  'Kopi Gau'],
 ['Black Coffee with Sugar',
  'Black Coffee with Condensed Milk',
  'Black Coffee with Evaporated Milk',
  'Black Coffee without sugar or milk',
  'Black Coffee with extra condensed milk',
  'Coffee with Condensed Milk but less sugar',
  'Black Coffee with less sugar',
  'Coffee with Condensed Milk but weaker (they add more water)',
  'Black Coffee with Sugar but weaker (they add more water)',
  'Strong Coffee with Condensed Milk'])

## What are all these weird arguments in `load_dataset`? 

They are actually the `csv` parsing arguments. They closely follow the options available in the `pandas` library https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html 

> **pandas.read_csv**(filepath_or_buffer, *, sep=_NoDefault.no_default, delimiter=None, header='infer', names=_NoDefault.no_default, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=_NoDefault.no_default, keep_date_col=False, date_parser=_NoDefault.no_default, date_format=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, on_bad_lines='error', delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None, dtype_backend=_NoDefault.no_default)


Let's go through some of them:

 - `delimiter="\t"`: We are reading a tab-separated file, so we set the delimiter as tab `\t`
 - `encoding="utf8"`: It's a unicode file, so we specify it as `utf8`
 - `header=None, names=['SRC', 'TRG'], skiprows=1, index_col=False`
     - `header=None, skiprows=1` doesn't mean that the file has no header row; it means that we skip the header line instead of parsing it, 
     - then we set the column names to 'SRC' and 'TRG' with `names=['SRC', 'TRG']`, implicitly throwing away any columns after the 2nd.
     - `index_col=False` is used because, when fewer names are given than there are columns, the parser would otherwise treat the first column as the index column of the csv file; it isn't one here, so we disable that.
 - `quoting=csv.QUOTE_NONE, quotechar="",  escapechar="\0"`: this is an idiom for reading a tab-separated file without treating quote characters specially, see https://stackoverflow.com/a/51941891/610569 
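
To get a feel for what these options do, here is a minimal sketch that uses only Python's built-in `csv` module. It is not how `datasets` parses the file internally (that goes through `pandas`), and the sample rows, the third column and the quoted word are made up for illustration.

```python
import csv
import io

# A tiny stand-in for kopitiam.tsv; the third column and the quoted word are made up.
tsv_text = (
    'Local Terms\tMeaning\tSource\n'
    'Kopi O\tBlack Coffee with Sugar\thttps://example.com/a\n'
    'Kopi "Gau"\tStrong Coffee with Condensed Milk\thttps://example.com/b\n'
)

def read_pairs(handle):
    """Yield (SRC, TRG) pairs: skip the header row, keep the first two columns."""
    # QUOTE_NONE keeps quote characters as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # similar in spirit to header=None + skiprows=1
    for row in reader:
        yield row[0], row[1]  # similar in spirit to passing only two names

pairs = list(read_pairs(io.StringIO(tsv_text)))
print(pairs)
```

The double quotes around `Gau` survive untouched, which is the point of turning quoting off for a tab-separated file.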

# Then, we convert the string inputs to vocabulary IDs

In [39]:
tokenizer # We initialized this when we picked our model.

MarianTokenizer(name_or_path='Helsinki-NLP/opus-mt-en-de', vocab_size=58101, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=True)

In [44]:
def preprocess_function(batch):
    inputs = tokenizer(batch['SRC'], max_length=10, truncation=True, padding="max_length")
    outputs = tokenizer(batch['TRG'], max_length=10, truncation=True, padding="max_length")

    return {"input_ids": inputs["input_ids"], 
            "labels": outputs.input_ids.copy()}

kopi_dataset = kopi_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/168 [00:00<?, ? examples/s]

In [45]:
kopi_dataset

Dataset({
    features: ['SRC', 'TRG', 'input_ids', 'labels'],
    num_rows: 168
})

In [46]:
print(kopi_dataset[0])

{'SRC': 'Kopi O', 'TRG': 'Black Coffee with Sugar', 'input_ids': [1739, 3175, 470, 0, 58100, 58100, 58100, 58100, 58100, 58100], 'labels': [2410, 16222, 33, 33502, 0, 58100, 58100, 58100, 58100, 58100]}


## What are all these 58100?

They are the pad tokens. And the `0` is the end of sentence (EOS) token. 

Some newer models these days use the EOS token as the pad token, see https://stackoverflow.com/a/76453052/610569 

In [49]:
tokenizer.convert_ids_to_tokens(58100)

'<pad>'
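
To make the layout of those ID lists concrete, here is a rough sketch of what `truncation=True, padding="max_length"` does to a sequence. It is a plain-Python illustration, not the tokenizer's actual implementation; the helper name `pad_to_max_length` is made up, and the EOS ID `0` and pad ID `58100` are taken from the output above.

```python
def pad_to_max_length(token_ids, max_length=10, eos_id=0, pad_id=58100):
    """Truncate, append EOS, then right-pad to exactly max_length IDs."""
    kept = token_ids[: max_length - 1]  # leave room for the EOS token
    with_eos = kept + [eos_id]
    return with_eos + [pad_id] * (max_length - len(with_eos))

# Subword IDs for 'Kopi O' and 'Black Coffee with Sugar', copied from kopi_dataset[0].
src = pad_to_max_length([1739, 3175, 470])
trg = pad_to_max_length([2410, 16222, 33, 33502])
print(src)
print(trg)
```

Both lists come out with exactly 10 entries: the subword IDs, one EOS, then pads on the right (matching `padding_side='right'` in the tokenizer printout).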

## What if my dataset is really huge and I don't want to pre-tokenize the dataset?

Then you can use the `streaming=True` option, see https://huggingface.co/docs/datasets/stream

If you set `streaming=True`, the Huggingface `Trainer` object that eventually reads the dataset works the same way as without streaming, **but you can't easily inspect the data returned by `load_dataset`, nor can you easily check the vocab IDs after applying the map function.**


In [159]:
from datasets import load_dataset

kopi_dataset = load_dataset(
    "csv", 
    data_files="./kopitiam.tsv", 
    delimiter="\t", encoding="utf8", 
    header=None, names=['SRC', 'TRG'], skiprows=1, index_col=False,
    quoting=csv.QUOTE_NONE, quotechar="",  escapechar="\0",
    streaming=True,
    split="train"
)

kopi_dataset = kopi_dataset.map(preprocess_function, batched=True)

In [160]:
kopi_dataset

<datasets.iterable_dataset.IterableDataset at 0x13741d820>

In [161]:
kopi_dataset[0]  # This will throw an error.

NotImplementedError: 

In [163]:
# This is one way you can inspect the dataset.
for row in kopi_dataset:
    print(row)
    break

{'SRC': 'Kopi O', 'TRG': 'Black Coffee with Sugar', 'input_ids': [1739, 3175, 470, 0, 58100, 58100, 58100, 58100, 58100, 58100], 'labels': [2410, 16222, 33, 33502, 0, 58100, 58100, 58100, 58100, 58100]}
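
If you want more than the first row, the same idea generalizes with `itertools.islice`, which works on any lazy iterable. The sketch below uses a made-up generator (`fake_stream`) as a stand-in for the streaming dataset so that it runs without downloading anything; the helper name `peek` is also hypothetical.

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for a streaming dataset: rows are produced lazily."""
    rows = [
        {"SRC": "Kopi O", "TRG": "Black Coffee with Sugar"},
        {"SRC": "Kopi", "TRG": "Black Coffee with Condensed Milk"},
        {"SRC": "Kopi C", "TRG": "Black Coffee with Evaporated Milk"},
    ]
    for row in rows:
        yield row

def peek(iterable, n=2):
    """Return the first n items without indexing into the iterable."""
    return list(islice(iterable, n))

first_rows = peek(fake_stream(), n=2)
for row in first_rows:
    print(row)
```

The `datasets` streaming documentation also describes a `take(n)` method on streaming datasets, which should serve the same purpose.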


  for batch_idx, df in enumerate(csv_file_reader):


### What is this `ParserWarning`?

This is because we dropped the last column of the original `.tsv` file. If you want to avoid the warning, make sure the no. of names you pass to the csv parsing module matches the no. of columns in the data, e.g. `names=['SRC', 'TRG', 'URL']`:

```python
from datasets import load_dataset

kopi_dataset = load_dataset(
    "csv", 
    data_files="./kopitiam.tsv", 
    delimiter="\t", encoding="utf8", 
    header=None, names=['SRC', 'TRG', 'URL'], skiprows=1, index_col=False,
    quoting=csv.QUOTE_NONE, quotechar="",  escapechar="\0",
    streaming=True,
    split="train"
)

kopi_dataset = kopi_dataset.map(preprocess_function, batched=True)
```
    
    

### Let's set aside some data as a validation set

In [190]:
from datasets import load_dataset

kopi_dataset = load_dataset(
    "csv", 
    data_files="./kopitiam.tsv", 
    delimiter="\t", encoding="utf8", 
    header=None, names=['SRC', 'TRG'], skiprows=1, index_col=False,
    quoting=csv.QUOTE_NONE, quotechar="",  escapechar="\0",
    split="train"
)

kopi_dataset = kopi_dataset.map(preprocess_function, batched=True)

train_dataset = kopi_dataset.select(range(150))
validation_dataset = kopi_dataset.select(range(150, len(kopi_dataset)))

Map:   0%|          | 0/168 [00:00<?, ? examples/s]

In [191]:
train_dataset

Dataset({
    features: ['SRC', 'TRG', 'input_ids', 'labels'],
    num_rows: 150
})

In [192]:
validation_dataset

Dataset({
    features: ['SRC', 'TRG', 'input_ids', 'labels'],
    num_rows: 18
})

### That's so "un-scientific", couldn't you do some shuffling before or something?

Actually, there's a better way! Use `train_test_split`, see https://discuss.huggingface.co/t/how-to-split-main-dataset-into-train-dev-test-as-datasetdict/1090?u=alvations

In [194]:
from datasets import load_dataset, DatasetDict

kopi_dataset = load_dataset(
    "csv", 
    data_files="./kopitiam.tsv", 
    delimiter="\t", encoding="utf8", 
    header=None, names=['SRC', 'TRG'], skiprows=1, index_col=False,
    quoting=csv.QUOTE_NONE, quotechar="",  escapechar="\0",
    split="train"
)

kopi_dataset = kopi_dataset.map(preprocess_function, batched=True)

# Split 80-20
train_devtest = kopi_dataset.train_test_split(shuffle=True, seed=42, test_size=0.2)

# From the 20, split 50-50.
dev_test = train_devtest['test'].train_test_split(shuffle=True, seed=42, test_size=0.5)

kopi_dataset = DatasetDict({
    'train': train_devtest['train'],
    'dev': dev_test['train'],
    'test': dev_test['test']
})

In [195]:
kopi_dataset['train']

Dataset({
    features: ['SRC', 'TRG', 'input_ids', 'labels'],
    num_rows: 134
})

In [196]:
kopi_dataset['dev']

Dataset({
    features: ['SRC', 'TRG', 'input_ids', 'labels'],
    num_rows: 17
})
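
Where do 134 and 17 come from? Here is a back-of-the-envelope sketch of the split sizes. It assumes the test size is rounded up and the training split gets the rest; that assumption reproduces the numbers above, but it is not copied from the `datasets` source, and the helper name `split_sizes` is made up.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

# First split: 80-20 over all 168 rows.
n_train, n_devtest = split_sizes(168, 0.2)
# Second split: 50-50 over the held-out rows.
n_dev, n_test = split_sizes(n_devtest, 0.5)

print(n_train, n_dev, n_test)
```

So the "80-20, then 50-50" recipe ends up as roughly 80-10-10 for train/dev/test.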

# The Metric: Let's go with the classic BLEU and ChrF

In [50]:
mt_metrics = evaluate.combine(
    ["bleu", "chrf"], force_prefix=True
)

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    
    predictions = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)

    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    references = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    outputs = mt_metrics.compute(predictions=predictions,
                             references=references)

    return outputs

Downloading builder script:   0%|          | 0.00/5.94k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/9.01k [00:00<?, ?B/s]

## Wait a minute! Why is `evaluate.combine` downloading something?

The `evaluate` package is a pretty neat package. 

The general idea is that a metric is

 - a self-contained Python package created by the metric author
 - that follows the standardized API of the [`evaluate.Metric` module](https://github.com/huggingface/evaluate/blob/main/src/evaluate/module.py) 
 - so that users are abstracted away from the nitty-gritty of different metrics and have a user-friendly way to compute metric scores when using Huggingface's `transformers`

More details on https://aclanthology.org/2022.emnlp-demos.13

## Is there a way to skip downloading the script and define the `evaluate.Metric` locally? 

Yes, you can, and here's an example. We'll write a metric `.py` script that encapsulates whatever library we want to use to compute the score, and then load it with the `evaluate.load` function. 

The `evaluate.load` function loads a single metric and the `evaluate.combine` function is just a quick way to load multiple metrics at the same time and it also helps to combine the dictionaries output that each metric produces into a single dictionary.

In [111]:
%%writefile mybleu.py

import datasets
import evaluate

from sacrebleu.metrics import BLEU

class MyBleu(evaluate.Metric):
    def _info(self):
        """This is required somehow, which is good, see https://aclanthology.org/2022.humeval-1.6/"""
        return evaluate.MetricInfo(
            description='my own bleu',
            citation="heh, maybe https://aclanthology.org/W18-6319/",
            features=[
                datasets.Features(
                    {
                        "predictions": datasets.Value("string", id="sequence"),
                        "references":  datasets.Value("string", id="sequence")
                    }
                )
            ],
        )
    
    def _compute(self, predictions, references):
        """Wrap over the actual BLEU computation"""
        oh_my_bleu = BLEU()
        # Sacrebleu expects multiple references.
        score = oh_my_bleu.corpus_score(predictions, [references])
        return {"bleu": float(score.score), 'bleu_str': score}

Overwriting mybleu.py


In [113]:
hyp = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']
ref = ['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.']

mybleu = evaluate.load('./mybleu.py')
mybleu.compute(predictions=hyp, references=ref)

{'bleu': 45.06750632106114,
 'bleu_str': BLEU = 45.07 70.6/42.9/36.4/37.5 (BP = 1.000 ratio = 1.000 hyp_len = 17 ref_len = 17)}
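
About the "Sacrebleu expects multiple references" comment in `mybleu.py`: `corpus_score` takes a list of reference *streams*, where each stream holds one reference per hypothesis. That is why a single list of references is wrapped as `[references]`. If your data instead stores several references per segment, a small transpose gets it into that shape. The sketch below is plain Python; the helper name `to_reference_streams` and the second set of references are made up, and it assumes every segment has the same number of references.

```python
def to_reference_streams(references_per_segment):
    """Turn [[ref_a, ref_b], ...] (per segment) into [[ref_a, ...], [ref_b, ...]] (per stream)."""
    return [list(stream) for stream in zip(*references_per_segment)]

# One entry per hypothesis; the second reference in each pair is invented for illustration.
refs_per_segment = [
    ['The dog bit the man.', 'The dog had bitten the man.'],
    ['It was not unexpected.', 'No one was surprised.'],
    ['The man bit him first.', 'The man had bitten the dog first.'],
]

streams = to_reference_streams(refs_per_segment)
print(len(streams), len(streams[0]))
```

With a single reference per segment, this reduces to the `[references]` wrapping used in `_compute`.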

### Going back to the metrics we want to use

You'll notice that there's a `force_prefix=True` argument. It prefixes every key with the metric's name, which avoids confusion when different metrics report scores under the same key in the dictionary. The keys matter because they are what you'll tell the `Trainer` object to track when evaluating the model. As the cells below show, the option takes effect in `evaluate.combine`; passing it to `evaluate.load` leaves the keys unchanged.

In [145]:
chrf = evaluate.load('chrf')

chrf.compute(predictions=hyp, references=ref)

{'score': 50.043063606582294, 'char_order': 6, 'word_order': 0, 'beta': 2}

In [146]:
chrf = evaluate.load('chrf', force_prefix=True)

chrf.compute(predictions=hyp, references=ref)

{'score': 50.043063606582294, 'char_order': 6, 'word_order': 0, 'beta': 2}

In [147]:
mt_metrics = evaluate.combine(
    ["bleu", "chrf"], force_prefix=True
)

mt_metrics.compute(predictions=hyp, references=ref)

{'bleu_bleu': 0.45067506321061157,
 'bleu_precisions': [0.7058823529411765,
  0.42857142857142855,
  0.36363636363636365,
  0.375],
 'bleu_brevity_penalty': 1.0,
 'bleu_length_ratio': 1.0,
 'bleu_translation_length': 17,
 'bleu_reference_length': 17,
 'chr_f_score': 50.043063606582294,
 'chr_f_char_order': 6,
 'chr_f_word_order': 0,
 'chr_f_beta': 2}
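
The prefixing idea can be sketched in a few lines of plain Python. This is only an illustration of the behaviour seen in the outputs above, not the actual `evaluate` implementation; the function name `combine_scores` is made up, and the rule "prefix only clashing keys when `force_prefix` is off" is an assumption.

```python
from collections import Counter

def combine_scores(results_by_metric, force_prefix=False):
    """Merge per-metric score dicts into one flat dict, prefixing keys with the metric name."""
    # Count how many metrics report each key, to detect name clashes.
    key_counts = Counter(key for scores in results_by_metric.values() for key in scores)
    merged = {}
    for name, scores in results_by_metric.items():
        for key, value in scores.items():
            clash = key_counts[key] > 1
            merged[f"{name}_{key}" if (force_prefix or clash) else key] = value
    return merged

# Shortened versions of the score dictionaries shown above.
results = {
    "bleu": {"bleu": 0.45067506321061157, "length_ratio": 1.0},
    "chr_f": {"score": 50.043063606582294, "beta": 2},
}

print(combine_scores(results, force_prefix=True))
```

With `force_prefix=True` the keys come out as `bleu_bleu`, `bleu_length_ratio`, `chr_f_score` and `chr_f_beta`, which is the naming pattern you see in the combined output.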

# Let's go back to training an MT model!

We'll need to set some hyperparameters first.

In [149]:
training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps=1,
    save_steps=5,
    eval_steps=5,
    max_steps=10,
    evaluation_strategy="steps",
    predict_with_generate=True,
    report_to=None,
    metric_for_best_model="chr_f_score",
    load_best_model_at_end=True,
    save_total_limit=5
)


## What are all these arguments? 

They are the training arguments that are used to configure the training routine. The only compulsory argument is `output_dir`.

There are apparently a whole lot of "knobs" to turn; remember the English copy of Marian's Bomba https://www.computerhistory.org/timeline/1941/#169ebbe2ad45559efbc6eb357207bb27 =)

Alternatively, you can skip the `Trainer` and roll your own training routine; for reference, see https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/run_translation_no_trainer.py


Meanwhile, here's how you can view all the available `TrainingArguments`:

In [150]:
print(Seq2SeqTrainingArguments.__doc__)


    TrainingArguments is the subset of the arguments we use in our example scripts **which relate to the training loop
    itself**.

    Using [`HfArgumentParser`] we can turn this class into
    [argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
    command line.

    Parameters:
        output_dir (`str`):
            The output directory where the model predictions and checkpoints will be written.
        overwrite_output_dir (`bool`, *optional*, defaults to `False`):
            If `True`, overwrite the content of the output directory. Use this to continue training if `output_dir`
            points to a checkpoint directory.
        do_train (`bool`, *optional*, defaults to `False`):
            Whether to run training or not. This argument is not directly used by [`Trainer`], it's intended to be used
            by your training/evaluation scripts instead. See the [example
            scripts](https://github.com/hugging

## That list is just too long, what are the `TrainingArguments` you've used here?

Sure, here's a quick explanation:

 - `per_device_train_batch_size=4, per_device_eval_batch_size=4` are the batch size options
   - alternatively, you can use `auto_find_batch_size=True` and that'll *"find a batch size that will fit into memory automatically through exponential decay, avoiding CUDA Out-of-Memory errors."*
   
   
 - `logging_steps=1, save_steps=5, eval_steps=5, max_steps=10, evaluation_strategy="steps",`
   - these control how often we log, save and validate, and when we stop training

   - `logging_steps=1` means we print out the training loss at every step
      - You can log less often, e.g. set it to the same value as `eval_steps`,
      - esp. if you are only concerned about the BLEU scores and not the training loss.
      
   - `evaluation_strategy="steps"` means that we check the score of the model on the validation dataset after a fixed no. of steps; the possible strategies are:
      - `"no"`: No evaluation is done during training.
      - `"steps"`: Evaluation is done (and logged) every `eval_steps`.
      - `"epoch"`: Evaluation is done at the end of each epoch.   
     
   - `max_steps=10` means we only take 10 steps when training the model
   - `eval_steps=5` means we check the validation metric scores every 5 steps
   - `save_steps=5` means we save the model every 5 steps, 
      - It makes sense to save the model every time you check the validation scores, 
      - so making `save_steps` and `eval_steps` the same would be good.


 - `predict_with_generate=True`: this is necessary to compute MT metric scores, which require strings as input, so the model has to generate text at evaluation time. For some other tasks, evaluation metrics can use the final layer outputs directly without generating the "labels" / "text". 
 
 
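 - `report_to=None`: This selects which model tracking dashboard integrations the results and logs are reported to. 
   - Supported platforms are `"azure_ml"`, `"comet_ml"`, `"mlflow"`, `"neptune"`, `"tensorboard"`,`"clearml"` and `"wandb"`. Use `"all"` to report to all integrations installed, `"none"` for no integrations.
   - Note that the Python value `None` is not the string `"none"`: in the version used here, `None` falls back to all installed integrations, which is why TensorBoard messages show up in the training output.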
   
   
 - `metric_for_best_model="chr_f_score"`: Remember that `evaluate.combine(...).compute()` returns a dictionary? We need to pick one of its keys as the metric used to select the best model.
 
 
 - `load_best_model_at_end=True`: Whether or not to load the best model found during training at the end of training. When this option is enabled, the best checkpoint will always be saved.
 
 
 - `save_total_limit=5`: This limits the no. of checkpoints kept during training. At most 5 of the latest checkpoints are kept, and if `load_best_model_at_end=True`, that becomes the 4 latest checkpoints + 1 best checkpoint.
   - Tip: Always set this to a fixed number so that the disk doesn't fill up easily.
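
To see how the step-based settings interact, here is some rough arithmetic in plain Python. It assumes a single device and no gradient accumulation, and it plugs in the 134 training rows from the split above together with the `max_steps=200, eval_steps=10` values used for the real training run in this notebook. The helper name `training_schedule` is made up.

```python
import math

def training_schedule(num_examples, batch_size, max_steps, eval_steps):
    """Return (steps per epoch, epochs covered, steps at which evaluation happens)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    epochs = max_steps / steps_per_epoch
    eval_points = list(range(eval_steps, max_steps + 1, eval_steps))
    return steps_per_epoch, epochs, eval_points

steps_per_epoch, epochs, eval_points = training_schedule(
    num_examples=134, batch_size=4, max_steps=200, eval_steps=10
)
print(steps_per_epoch, round(epochs, 2), len(eval_points))
```

Under these assumptions, 200 steps cover a little under 6 passes over the training data, which agrees with the `epoch` value that `trainer.train()` reports at the end of the run.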

In [201]:
training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps=1,
    save_steps=10,
    eval_steps=10,
    max_steps=200,
    evaluation_strategy="steps",
    predict_with_generate=True,
    report_to=None,
    metric_for_best_model="chr_f_score",
    load_best_model_at_end=True,
    save_total_limit=5
)


trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=kopi_dataset['train'].with_format("torch"),
    eval_dataset=kopi_dataset['dev'].with_format("torch"),
    compute_metrics=compute_metrics,
)


In [202]:
trainer.train()

Step,Training Loss,Validation Loss,Bleu Bleu,Bleu Precisions,Bleu Brevity Penalty,Bleu Length Ratio,Bleu Translation Length,Bleu Reference Length,Chr F Score,Chr F Char Order,Chr F Word Order,Chr F Beta
10,3.4275,3.028779,0.133659,"[0.375, 0.19148936170212766, 0.06666666666666667, 0.06666666666666667]",1.0,1.04918,64,61,45.230381,6,0,2
20,2.3358,2.716913,0.0,"[0.27419354838709675, 0.1111111111111111, 0.03571428571428571, 0.0]",1.0,1.016393,62,61,46.951882,6,0,2
30,2.4035,2.486472,0.0,"[0.40384615384615385, 0.2857142857142857, 0.2222222222222222, 0.0]",0.841073,0.852459,52,61,50.145481,6,0,2
40,1.4573,2.441367,0.246252,"[0.39705882352941174, 0.2549019607843137, 0.20588235294117646, 0.17647058823529413]",1.0,1.114754,68,61,54.412557,6,0,2
50,1.1039,2.382429,0.191926,"[0.3582089552238806, 0.2, 0.15151515151515152, 0.125]",1.0,1.098361,67,61,51.997505,6,0,2
60,2.0663,2.265625,0.258303,"[0.4090909090909091, 0.2653061224489796, 0.21875, 0.1875]",1.0,1.081967,66,61,53.881481,6,0,2
70,0.9263,2.24452,0.128396,"[0.3787878787878788, 0.1836734693877551, 0.0625, 0.0625]",1.0,1.081967,66,61,49.949855,6,0,2
80,0.8331,2.305717,0.212624,"[0.43137254901960786, 0.2647058823529412, 0.11764705882352941, 0.3333333333333333]",0.821948,0.836066,51,61,46.064509,6,0,2
90,0.4696,2.201389,0.134017,"[0.4, 0.1875, 0.06451612903225806, 0.06666666666666667]",1.0,1.065574,65,61,47.951283,6,0,2
100,0.6907,2.128309,0.319131,"[0.40625, 0.3191489361702128, 0.3, 0.26666666666666666]",1.0,1.04918,64,61,59.351783,6,0,2


Trainer is attempting to log a value of "[0.375, 0.19148936170212766, 0.06666666666666667, 0.06666666666666667]" of type <class 'list'> for key "eval/bleu_precisions" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.

TrainOutput(global_step=200, training_loss=1.2455991057306528, metrics={'train_runtime': 370.1132, 'train_samples_per_second': 2.162, 'train_steps_per_second': 0.54, 'total_flos': 2092164710400.0, 'train_loss': 1.2455991057306528, 'epoch': 5.88})
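
With `metric_for_best_model="chr_f_score"` and `load_best_model_at_end=True`, the `Trainer` keeps track of the checkpoint with the best ChrF. The sketch below only mimics that selection on the (step, ChrF) pairs copied from the first 100 steps of the log above; it is not the `Trainer`'s own logic, the later steps are not shown in the log, and the helper name `best_step` is made up.

```python
# (step, chr_f_score) pairs copied from the evaluation log above.
eval_log = [
    (10, 45.230381), (20, 46.951882), (30, 50.145481), (40, 54.412557),
    (50, 51.997505), (60, 53.881481), (70, 49.949855), (80, 46.064509),
    (90, 47.951283), (100, 59.351783),
]

def best_step(log, greater_is_better=True):
    """Pick the (step, score) pair with the best score."""
    pick = max if greater_is_better else min
    return pick(log, key=lambda row: row[1])

best = best_step(eval_log)
print(best)
```

Note that ChrF is a "higher is better" metric while the validation loss is "lower is better", so the two can disagree on which checkpoint is best.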

# Now, lets try using the model!

We'll use the `transformers.pipeline` function, which wraps the models and tokenizers from the `transformers` library for some pre-coded tasks.

In [205]:
from transformers import pipeline

translate = pipeline('translation', model=model, tokenizer=tokenizer)

In [212]:
translate("Kopi siew dai")

[{'translation_text': '▁coffee with▁less▁sugar'}]

In [214]:
translate("Kopi kar dai")

[{'translation_text': '▁coffee with more▁sugar'}]

### Somehow the model still hasn't learned the difference between "▁" and " " whitespace...

The easy solution is just to keep training until the BLEU/ChrF scores are higher. But sometimes a quick hack to post-process the translation output works well enough.

# Hmmm, can't we just specify some post-editing functions?

Yes, you can!

First, let's check what type of pipeline object we get when we initialize `pipeline('translation', model=model, tokenizer=tokenizer)`

In [217]:
from transformers import pipeline

translate = pipeline('translation', model=model, tokenizer=tokenizer)

type(translate)

transformers.pipelines.text2text_generation.TranslationPipeline

### Let's trace how `TranslationPipeline` postprocesses the model's output

From [here](https://github.com/huggingface/transformers/blob/v4.21.3/src/transformers/pipelines/text2text_generation.py#L252), we see that the `TranslationPipeline` has no specified `postprocess()` and inherits from the parent class [`Text2TextGenerationPipeline`](https://github.com/huggingface/transformers/blob/v4.21.3/src/transformers/pipelines/text2text_generation.py#L167), which does:


```python
class Text2TextGenerationPipeline(Pipeline):
    ...
    def postprocess(self, model_outputs, return_type=ReturnType.TEXT, clean_up_tokenization_spaces=False):
        records = []
        for output_ids in model_outputs["output_ids"][0]:
            if return_type == ReturnType.TENSORS:
                record = {f"{self.return_name}_token_ids": output_ids}
            elif return_type == ReturnType.TEXT:
                record = {
                    f"{self.return_name}_text": self.tokenizer.decode(
                        output_ids,
                        skip_special_tokens=True,
                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,
                    )
                }
            records.append(record)
        return records
```



### Let's do something simple: replace "▁" with a space and then clean up any spurious whitespace.

In [219]:
# Here's an example.
s = '▁coffee with more▁sugar'

# Replace '▁' with space
s = s.replace('▁', ' ')
# Collapse spurious spaces; split() without arguments also handles repeated spaces.
s = ' '.join(s.split())

print(s)

coffee with more sugar
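
The same two steps can be wrapped into a small reusable function, which makes them easy to test on their own before plugging them into a pipeline. This is a minimal sketch; the function name `postedit` is made up.

```python
def postedit(text):
    """Replace the '▁' marker with a space and collapse repeated whitespace."""
    return " ".join(text.replace("▁", " ").split())

print(postedit("▁coffee with▁less▁sugar"))
print(postedit("▁coffee with more▁sugar"))
```

Using `split()` without arguments means leading, trailing and doubled spaces all disappear in one go.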


In [234]:
from transformers import TranslationPipeline

class TranslateAndPostedit(TranslationPipeline):
    def postprocess(self, model_outputs):
        postedited_outputs = []
        for output_ids in model_outputs["output_ids"][0]:
            # Convert the vocab IDs to a string.
            translation = self.tokenizer.decode(output_ids, skip_special_tokens=True)
            # Post-edit: replace '▁' with a space, then collapse spurious whitespace.
            postedited = ' '.join(translation.replace('▁', ' ').split())
            # Collect one record per output, like the parent class does.
            postedited_outputs.append({'translation_text': postedited})
        return postedited_outputs

    
translate = TranslateAndPostedit(model=model, tokenizer=tokenizer)

In [235]:
translate("Kopi siew dai")

[{'translation_text': 'coffee with less sugar'}]

# Congratulations!

# Now you know the basics of how to fine-tune an OPUS MT model with Huggingface `transformers`!


Here's a summary of the learning points:

- [x] How to load TSV data to Huggingface `datasets`?
- [x] How to stream large data to `load_dataset`?
- [x] How to split a `datasets.Dataset` into train/dev/test?
- [x] How to use `evaluate.load` metrics?
- [x] How to use multiple metrics with `evaluate.combine`?
- [x] How to load custom/local metrics with `evaluate`?
- [x] How to find all the available options for `Seq2SeqTrainingArguments`?
- [x] How to postprocess the model outputs using `Pipeline.postprocess`?


### Q: Wait a minute! What do you mean by "basics"? There's more?!

A: Yes, there will always be more when it comes to training NLP/MT/ML models =) 

# To be continued...