# NLP Course

Please see the [Hugging Face NLP Course page](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt).

## Translation

### Try it out: extra credit work at end of this section

Please see [Translation](https://huggingface.co/learn/nlp-course/chapter7/4?fw=pt#translation) in Chapter 7 (Main NLP Tasks) of the 🤗 NLP Course.

### Preparing the data

> To fine-tune or train a translation model from scratch, we will need a dataset suitable for the task. As mentioned previously, we’ll use the [KDE4](https://huggingface.co/datasets/Helsinki-NLP/kde4) dataset in this section, but you can adapt the code to use your own data quite easily, as long as you have pairs of sentences in the two languages you want to translate from and into. 
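If you bring your own sentence pairs, one plausible way to shape them is to mirror the `id` / `translation` layout that KDE4 uses. The sketch below is plain Python only; the sentence pairs and the `to_records` helper are made up for illustration, and turning the records into a 🤗 `Dataset` is left out.

```python
# Hypothetical sentence pairs; replace with your own parallel data.
pairs = [
    ("Open file", "ファイルを開く"),
    ("Save as...", "名前を付けて保存..."),
    ("Quit", "終了"),
]


def to_records(pairs, src="en", tgt="ja"):
    """Mirror the KDE4 layout: one 'translation' dict per example."""
    return [
        {"id": str(i), "translation": {src: s, tgt: t}}
        for i, (s, t) in enumerate(pairs)
    ]


records = to_records(pairs)
print(records[0])
```

Keeping the same field names means the preprocessing code later in these notes, which reads `examples["translation"]`, should need no changes.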

In [1]:
from datasets import load_dataset

raw_datasets = load_dataset("kde4", lang1="en", lang2="ja")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 131429
    })
})

In [2]:
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 118286
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 13143
    })
})
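The split sizes above are consistent with flooring `0.9 * 131429` for the training set and giving the remainder to the test set. The sketch below checks that arithmetic with a seeded shuffle in plain Python. `split_indices` is a hypothetical helper, and its permutation will not match the one 🤗 Datasets draws for `seed=20`.

```python
import random


def split_indices(n_rows, train_size=0.9, seed=20):
    """Shuffle row indices with a fixed seed, then cut at floor(train_size * n_rows)."""
    indices = list(range(n_rows))
    # Assumption: a plain seeded shuffle, not the permutation datasets would use.
    random.Random(seed).shuffle(indices)
    n_train = int(train_size * n_rows)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(131429)
print(len(train_idx), len(test_idx))
```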

> We can rename the "test" key to "validation" like this:

In [3]:
split_datasets["validation"] = split_datasets.pop("test")
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 118286
    })
    validation: Dataset({
        features: ['id', 'translation'],
        num_rows: 13143
    })
})

In [4]:
split_datasets["train"][99]["translation"]

{'en': 'Specifies whether to use a logarithmic instead of a linear gradient for the Kalzium Mass Gradient feature',
 'ja': 'KalziumMassGradientType クラスに線形グラディエントでなはく対数グラディエントを使うかどうかを指定します。'}

In [5]:
from transformers import pipeline

model_checkpoint = "Helsinki-NLP/opus-mt-en-jap"
#                                           ^^^

translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")

Device set to use cuda:0


[{'translation_text': '持 っ て い る 類 , 細か な 糸 を 生 じ させ る なら ば ,'}]

In [6]:
translator("Please press the Enter key")

[{'translation_text': 'どう か , かぎ を 開 い て ほし い .'}]

In [7]:
split_datasets["train"][172]["translation"]

{'en': "Neighbors' Loved Radio", 'ja': 'ご近所さんのお気に入りラジオ'}

In [8]:
translator(
    "Unable to import %1 using the OFX importer plugin. This file is not the correct format."
)

[{'translation_text': '" わたし たち の 王 クレセテ , すなわち , 今 に 至 る まで 出帆 し た . この 人 が あ る の は , 異な る こと で は な い " と 言 っ て い た .'}]

#### Processing the data

In [9]:
from transformers import AutoTokenizer

#model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, return_tensors="pt")

> As we can see, the output contains the input IDs associated with the English sentence, while the IDs associated with the Japanese one are stored in the labels field.

In [10]:
idx = 13

print(split_datasets["train"][idx])

{'id': '42455', 'translation': {'en': 'Unable to start backend.', 'ja': 'バックエンドを開始できません。'}}


In [11]:
en_sentence = split_datasets["train"][idx]["translation"]["en"]
ja_sentence = split_datasets["train"][idx]["translation"]["ja"]

inputs = tokenizer(en_sentence, text_target=ja_sentence)
inputs

{'input_ids': [11641, 2766, 14, 36214, 749, 1, 4, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1], 'labels': [7155, 2948, 1160, 15702, 5587, 98, 1138, 15796, 222, 336, 2842, 1, 0]}

> If you forget to indicate that you are tokenizing labels, they will be tokenized by the input tokenizer, which in the case of a Marian model is not going to go well at all.

##### NOTE: `as_target_tokenizer` is deprecated and will be removed in v5 of Transformers. You can tokenize your labels by using the `text_target` argument.

In [12]:
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"]))
print(tokenizer.convert_ids_to_tokens(inputs["labels"]))

['▁Un', 'able', '▁to', '▁start', '▁back', '<unk>', '.', '</s>']
['▁バ', 'ッ', 'ク', 'エン', 'ド', 'を', '開', '始', 'で', 'き', 'ません', '<unk>', '</s>']


In [13]:
max_length = 128


def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["ja"] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs

In [14]:
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 118286
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 13143
    })
})

### Fine-tuning the model with the Trainer API

> The actual code using the Trainer will be the same as before, with just one little change: we use a [`Seq2SeqTrainer`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainer) here, which is a subclass of `Trainer` that will allow us to properly deal with the evaluation, using the `generate()` method to predict outputs from the inputs.
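Evaluating with `generate()` differs from the training forward pass because the decoder consumes its own previous predictions instead of the shifted labels. Here is a toy sketch of the simplest decoding strategy, greedy search. The `NEXT_TOKEN` table stands in for a real decoder and is purely illustrative, as is the `greedy_generate` helper.

```python
# Toy next-token table standing in for a real decoder (purely illustrative).
NEXT_TOKEN = {"<s>": "Enter", "Enter": "キー", "キー": "</s>"}


def greedy_generate(start="<s>", eos="</s>", max_length=10):
    """Feed each prediction back in as the next decoder input until EOS."""
    tokens = [start]
    while len(tokens) < max_length:
        next_token = NEXT_TOKEN.get(tokens[-1], eos)
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens[1:]  # drop the start token


print(greedy_generate())
```

The `max_length` cap plays the same role as the `max_length` argument passed during evaluation later in these notes: it stops decoding even if no end-of-sequence token shows up.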

In [15]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

#### Data collation

> This is all done by a [`DataCollatorForSeq2Seq`](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorForSeq2Seq). Like the `DataCollatorWithPadding`, it takes the tokenizer used to preprocess the inputs, but it also takes the model. This is because this data collator will also be responsible for preparing the decoder input IDs, which are shifted versions of the labels with a special token at the beginning. Since this shift is done slightly differently for different architectures, the `DataCollatorForSeq2Seq` needs to know the model object.

In [16]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [17]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])

In [18]:
batch

{'input_ids': tensor([[16465,  2156, 22699, 22784,  3407,  7323,     1, 32478,    31,  2315,
         45185, 19121,    29, 20801,     0],
        [45630,    76, 17084,     4,     4,     4,     0, 46275, 46275, 46275,
         46275, 46275, 46275, 46275, 46275]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]), 'labels': tensor([[    6,     1,     6,     1, 15508,  3983, 27591,  1696,  1852, 41294,
           587,  3443,     1,     0],
        [ 7155, 27591,  1781, 13128,  1060,     4,     4,     4,     0,  -100,
          -100,  -100,  -100,  -100]]), 'decoder_input_ids': tensor([[46275,     6,     1,     6,     1, 15508,  3983, 27591,  1696,  1852,
         41294,   587,  3443,     1],
        [46275,  7155, 27591,  1781, 13128,  1060,     4,     4,     4,     0,
         46275, 46275, 46275, 46275]])}

In [19]:
batch["labels"]

tensor([[    6,     1,     6,     1, 15508,  3983, 27591,  1696,  1852, 41294,
           587,  3443,     1,     0],
        [ 7155, 27591,  1781, 13128,  1060,     4,     4,     4,     0,  -100,
          -100,  -100,  -100,  -100]])

In [20]:
batch["decoder_input_ids"]

tensor([[46275,     6,     1,     6,     1, 15508,  3983, 27591,  1696,  1852,
         41294,   587,  3443,     1],
        [46275,  7155, 27591,  1781, 13128,  1060,     4,     4,     4,     0,
         46275, 46275, 46275, 46275]])

In [21]:
for i in range(1, 3):
    print(tokenized_datasets["train"][i]["labels"])

[6, 1, 6, 1, 15508, 3983, 27591, 1696, 1852, 41294, 587, 3443, 1, 0]
[7155, 27591, 1781, 13128, 1060, 4, 4, 4, 0]
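The outputs above are consistent with two steps: the labels are right-padded with `-100`, then shifted one position to the right with a start token in front to form the decoder inputs. A minimal pure-Python sketch follows. It assumes the pad id `46275` seen in the batch also serves as the decoder start token, and `pad_labels` and `shift_right` are hypothetical helpers, not the collator's actual code.

```python
PAD_ID = 46275  # pad id seen in the batch above; assumed to double as the decoder start token


def pad_labels(rows, ignore_index=-100):
    """Right-pad label rows to the longest row with the ignore index."""
    width = max(len(row) for row in rows)
    return [row + [ignore_index] * (width - len(row)) for row in rows]


def shift_right(rows, start_id=PAD_ID, pad_id=PAD_ID, ignore_index=-100):
    """Prepend the start token, drop the last position, and swap -100 for the pad id."""
    shifted = [[start_id] + row[:-1] for row in rows]
    return [[pad_id if tok == ignore_index else tok for tok in row] for row in shifted]


labels = pad_labels([
    [6, 1, 6, 1, 15508, 3983, 27591, 1696, 1852, 41294, 587, 3443, 1, 0],
    [7155, 27591, 1781, 13128, 1060, 4, 4, 4, 0],
])
decoder_input_ids = shift_right(labels)
print(decoder_input_ids[1])
```

Running this reproduces the `labels` and `decoder_input_ids` rows printed by the collator for these two examples.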


#### Metrics

> One weakness with BLEU is that it expects the text to already be tokenized, which makes it difficult to compare scores between models that use different tokenizers. So instead, the most commonly used metric for benchmarking translation models today is SacreBLEU, which addresses this weakness (and others) by standardizing the tokenization step. 

In [22]:
import evaluate

metric = evaluate.load("sacrebleu")

> The score can go from 0 to 100, and higher is better.

In [23]:
predictions = [
    "This plugin lets you translate web pages between several languages automatically."
]

references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]

metric.compute(predictions=predictions, references=references)

{'score': 46.750469682990165,
 'counts': [11, 6, 4, 3],
 'totals': [12, 11, 10, 9],
 'precisions': [91.66666666666667,
  54.54545454545455,
  40.0,
  33.333333333333336],
 'bp': 0.9200444146293233,
 'sys_len': 12,
 'ref_len': 13}
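To see where the score comes from, we can recompute it from the `counts`, `totals`, `sys_len` and `ref_len` fields above: BLEU is the brevity penalty times the geometric mean of the n-gram precisions. This is a sketch without smoothing, so it only works when every count is non-zero; `bleu_from_counts` is a hypothetical helper, not part of `evaluate` or SacreBLEU.

```python
import math


def bleu_from_counts(counts, totals, sys_len, ref_len):
    """Brevity penalty times the geometric mean of the n-gram precisions, scaled to 0-100."""
    # No smoothing here: a zero count would make math.log() fail.
    precisions = [c / t for c, t in zip(counts, totals)]
    bp = 1.0 if sys_len > ref_len else math.exp(1 - ref_len / sys_len)
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return 100 * bp * math.exp(log_mean)


score = bleu_from_counts([11, 6, 4, 3], [12, 11, 10, 9], sys_len=12, ref_len=13)
print(round(score, 2))  # 46.75
```

The brevity penalty is what punishes predictions shorter than the reference: here it is `exp(1 - 13/12)`, about 0.92, matching the `bp` field.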

In [24]:
predictions = ["This This This This"]

references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]

metric.compute(predictions=predictions, references=references)

{'score': 1.683602693167689,
 'counts': [1, 0, 0, 0],
 'totals': [4, 3, 2, 1],
 'precisions': [25.0, 16.666666666666668, 12.5, 12.5],
 'bp': 0.10539922456186433,
 'sys_len': 4,
 'ref_len': 13}

> To get from the model outputs to texts the metric can use, we will use the [`tokenizer.batch_decode()`](https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.batch_decode) method. We just have to clean up all the `-100`s in the labels (the tokenizer will automatically do the same for the padding token).

In [25]:
import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
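The `-100` replacement matters because `-100` is not a valid token id, so the tokenizer cannot decode it. Below is a pure-Python sketch of the same clean-up on one label row from the collated batch earlier. It assumes pad id `46275` and end-of-sequence id `0`, as seen in the earlier outputs; `clean_labels` and `strip_special` are hypothetical stand-ins for `np.where` and `skip_special_tokens=True`.

```python
PAD_ID = 46275  # pad token id seen in the collated batch earlier (assumed here)
EOS_ID = 0      # id of '</s>' in the tokenized examples earlier


def clean_labels(label_rows, pad_id=PAD_ID):
    """Pure-Python equivalent of np.where(labels != -100, labels, pad_id)."""
    return [[tok if tok != -100 else pad_id for tok in row] for row in label_rows]


def strip_special(row, special_ids=(PAD_ID, EOS_ID)):
    """Rough stand-in for skip_special_tokens=True: drop pad and end-of-sequence ids."""
    return [tok for tok in row if tok not in special_ids]


rows = clean_labels([[7155, 27591, 1781, 13128, 1060, 4, 4, 4, 0, -100, -100]])
print(strip_special(rows[0]))
```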

### Fine-tuning the model

In [26]:
from huggingface_hub import notebook_login

notebook_login()


In [27]:
#from huggingface_hub import Repository, get_full_repo_name
from huggingface_hub import get_full_repo_name

model_name = "marian-finetuned-kde4-en-to-ja-accelerate"

repo_name = get_full_repo_name(model_name)
repo_name

'buruzaemon/marian-finetuned-kde4-en-to-ja-accelerate'

In [28]:
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    model_name,
    eval_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

In [29]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

In [30]:
%%time

trainer.evaluate(max_length=max_length)

That's 100 lines that end in a tokenized period ('.')
It looks like you forgot to detokenize your test data, which may hurt your score.
If you insist your data is detokenized, or don't care, you can suppress this message with the `force` parameter.


CPU times: user 18min 37s, sys: 3.92 s, total: 18min 41s
Wall time: 17min 1s


{'eval_loss': 10.40902042388916,
 'eval_model_preparation_time': 0.0038,
 'eval_bleu': 0.00817271844352432,
 'eval_runtime': 1021.0567,
 'eval_samples_per_second': 12.872,
 'eval_steps_per_second': 0.202}

In [31]:
%%time

trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 4.4484 |
| 1000 | 3.2492 |
| 1500 | 2.9174 |
| 2000 | 2.696  |
| 2500 | 2.5553 |
| 3000 | 2.4453 |
| 3500 | 2.3562 |
| 4000 | 2.2786 |
| 4500 | 2.1758 |
| 5000 | 2.1145 |




CPU times: user 24min 58s, sys: 2min 32s, total: 27min 31s
Wall time: 27min 35s


TrainOutput(global_step=11091, training_loss=2.2918752639848927, metrics={'train_runtime': 1652.9533, 'train_samples_per_second': 214.681, 'train_steps_per_second': 6.71, 'total_flos': 6336352287719424.0, 'train_loss': 2.2918752639848927, 'epoch': 3.0})

----

### A custom training loop

> First we’ll build the [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)s from our datasets, after setting the datasets to the "torch" format so we get PyTorch tensors.

Please also see the API documentation page for [`torch.utils.data`](https://pytorch.org/docs/stable/data.html).

In [32]:
from torch.utils.data import DataLoader

tokenized_datasets.set_format("torch")

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

# No shuffling for evaluation: the order does not affect the corpus-level BLEU score
eval_dataloader = DataLoader(
    tokenized_datasets["validation"],
    collate_fn=data_collator,
    batch_size=8,
)

> Next we reinstantiate our model, to make sure we're not continuing the fine-tuning from before but starting from the pretrained model again.

In [33]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

> Then we will need an optimizer...

##### Question: why use the [`transformers.AdamW`](https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/optimizer_schedules#transformers.AdamW) implementation instead of [`torch.optim.AdamW`](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html#torch.optim.AdamW)?

When attempting to use the `transformers.AdamW` optimizer, we see the following warning:

    FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning

So the recommended optimizer is the one from `torch.optim`.

In [34]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)
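For intuition about what each `optimizer.step()` does, here is a sketch of a single AdamW update on one scalar parameter. The learning rate is the `2e-5` used above; the other hyperparameters are what I believe are the `torch.optim.AdamW` defaults and should be treated as assumptions. `adamw_step` is a hypothetical helper, not PyTorch's implementation.

```python
import math


def adamw_step(param, grad, m, v, t, lr=2e-5, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (t is the 1-based step count)."""
    beta1, beta2 = betas
    param = param * (1 - lr * weight_decay)      # decoupled weight decay
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


p, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(p)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's magnitude.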

> Once we have all those objects, we can send them to the `accelerator.prepare()` method...

##### TODO: check out that [Accelerate documentation](https://huggingface.co/docs/accelerate/index) and go through [that short tutorial](https://huggingface.co/docs/accelerate/basic_tutorials/overview)...

In [35]:
from accelerate import Accelerator

accelerator = Accelerator()

model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

> Now that we have sent our `train_dataloader` to `accelerator.prepare()`, we can use its length to compute the number of training steps. <span style="background-color:#33ffff">Remember we should always do this after preparing the dataloader, as that method will change the length of the `DataLoader`</span>. We use a classic linear schedule from the learning rate to 0.

In [36]:
from transformers import get_scheduler

num_train_epochs = 3

num_update_steps_per_epoch = len(train_dataloader)

num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
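As a sanity check on the cell above, the step count and the shape of the schedule can be worked out by hand. This sketch assumes a single process, so the prepared dataloader keeps its original length, and reuses the 118286 training rows and batch size 8 from earlier. `total_training_steps` and `linear_lr` are hypothetical helpers, not the `get_scheduler` implementation.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Batches per epoch (last partial batch included) times the number of epochs."""
    return num_epochs * math.ceil(num_examples / batch_size)


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Linear warmup to base_lr, then linear decay to 0 at num_training_steps."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)


steps = total_training_steps(118286, batch_size=8, num_epochs=3)
print(steps)  # 44358
print(linear_lr(steps // 2, 2e-5, steps))
```

The 44358 total matches the training progress bar shown in the output of the training loop. With several processes the per-process dataloader would be shorter, which is why the length is read after `accelerator.prepare()`.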

In [37]:
def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels

In [38]:
from huggingface_hub import HfApi

output_dir = model_name

api = HfApi()

In [39]:
%%time

from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=128,
            )
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(generated_tokens)
        labels_gathered = accelerator.gather(labels)

        decoded_preds, decoded_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)

    results = metric.compute()
    print(f"epoch {epoch}, BLEU score: {results['score']:.2f}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        #repo.push_to_hub(
        #    commit_message=f"Training in progress epoch {epoch}", blocking=False
        #)
        future = api.upload_folder( # Upload in the background (non-blocking action)
            repo_id=repo_name,
            folder_path=output_dir,
            run_as_future=True,
            commit_message=f"Training in progress epoch {epoch}"
        )

  0%|          | 0/44358 [00:00<?, ?it/s]

  0%|          | 0/1643 [00:00<?, ?it/s]

epoch 0, BLEU score: 14.03
epoch 1, BLEU score: 17.28
epoch 2, BLEU score: 18.62
CPU times: user 2h 50min 58s, sys: 1min 21s, total: 2h 52min 19s
Wall time: 2h 11min 42s
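The padding step before `gather` in the evaluation loop exists because each process can generate sequences of a different length, and tensors can only be concatenated when their shapes agree. Here is a list-based sketch of that shape logic with two made-up processes. The real Accelerate methods work on tensors across devices, so the helpers below only mimic the idea.

```python
def pad_across_processes(per_process, pad_index):
    """Right-pad every process's rows to the longest sequence length among all processes."""
    width = max(len(row) for rows in per_process for row in rows)
    return [
        [row + [pad_index] * (width - len(row)) for row in rows]
        for rows in per_process
    ]


def gather(per_process):
    """Concatenate the per-process batches along the batch dimension."""
    return [row for rows in per_process for row in rows]


# Two hypothetical processes whose generated sequences have different lengths.
generated = [
    [[46275, 7155, 0]],        # process 0
    [[46275, 6, 1, 3443, 0]],  # process 1
]
padded = pad_across_processes(generated, pad_index=46275)
print(gather(padded))
```

Predictions are padded with the pad token id while labels are padded with `-100`, so that `postprocess` can treat both the same way as in the single-process case.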

#### Using the fine-tuned model

Let's try using our newly fine-tuned translation model.

In [40]:
sample = "Next, press the Enter key."

In [41]:
# Replace this with your own checkpoint
model_checkpoint = repo_name

our_finetuned_translator = pipeline("translation", model=model_checkpoint)
our_finetuned_translator(sample)

Device set to use cuda:0


[{'translation_text': '次のキーを押してキーを押してください'}]

##### Now let's compare with the original model!

Recall:
* source languages: en
* target languages: jap (sic)

In [42]:
the_original_translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-jap")
the_original_translator(sample)

Device set to use cuda:0


[{'translation_text': 'その 次 に , かぎ の 開き を する つもり で あ っ た .'}]