<a href="https://colab.research.google.com/github/gaussalgo/L2L_MLPrague23/blob/main/notebooks/hands_on_improving_ICL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training generative models

This notebook will show you how to fine-tune existing few-shot in-context learners to perform better on the application of your interest.

We will demonstrate how you can fine-tune a model to perform in-context few-shot learning **on a new (target) language** using QA dataset(s) of the target language.


In [1]:
# notebook's requirements
!pip install transformers datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Existing models

* T5 models (https://huggingface.co/t5-large) - pre-trained on a mixture of tasks in text2text format
* FLAN-T5 models (https://huggingface.co/google/flan-t5-large) - T5 models fine-tuned on 1000 additional tasks
* mT5 models (https://huggingface.co/google/mt5-large) - pre-trained on 101 languages, no supervised tasks - needs to be fine-tuned
* Tk models (https://huggingface.co/allenai/tk-instruct-large-def-pos) - fine-tuned T5 on tasks with prompts written as in-context instructions
* mTk models (https://huggingface.co/allenai/mtk-instruct-3b-def-pos) - multilingual version of the Tk model
* MPT (https://huggingface.co/mosaicml/mpt-7b)
* Alpaca (https://github.com/tatsu-lab/stanford_alpaca)

## Generation - refresher

Recall **Causal Language Modeling (CLM)** from earlier, where the model predicts the **following token** from the previous context.

![image.png](https://gcdnb.pbrd.co/images/Bx4h6Lordx0y.png?o=1)  
![image.png](https://gcdnb.pbrd.co/images/rb7bmZS11gtl.png?o=1)
![image.png](https://gcdnb.pbrd.co/images/gXYffjzLIk7n.png?o=1)

[[images source]](https://www.rohanawhad.com/improvements-of-spanbert-over-bert/)

Note that the task of language **generation** is very similar to CLM: we predict the next token conditioned on the input. The only difference is that in generation, the **input also contains the previous outputs** of the model.
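To make the feedback loop concrete, here is a minimal sketch of greedy generation. The "model" is a hypothetical lookup table (`NEXT_TOKEN`) that only looks at the last token, and the names `predict_next_token` and `generate` are made up for this illustration; a real language model conditions on the whole context.

```python
from typing import Dict, List

# Hypothetical toy "model": maps the last token to the most likely next token.
# A real language model would condition on the whole context, not only on the last token.
NEXT_TOKEN: Dict[str, str] = {
    "<bos>": "the",
    "the": "movie",
    "movie": "was",
    "was": "fine",
    "fine": "<eos>",
}


def predict_next_token(context: List[str]) -> str:
    """Return the toy model's prediction for the token following `context`."""
    return NEXT_TOKEN.get(context[-1], "<eos>")


def generate(prompt: List[str], max_new_tokens: int = 10) -> List[str]:
    """Greedy generation: every predicted token is appended to the input of the next step."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(context)
        if next_token == "<eos>":  # the model signals that it is done
            break
        context.append(next_token)  # the output becomes part of the next input
    return context


generated = generate(["<bos>", "the"])
print(generated)
```

The loop stops either when the end-of-sequence token is predicted or when `max_new_tokens` is reached, whichever comes first.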

## Construction of Training pipeline

Many libraries make it easy to train your NN model, at different levels of user complexity - do not get confused by that.

Ordered from the most low-level to the most high-level, for PyTorch models you can find at least these: **Pure PyTorch, PyTorch Lightning, Transformers Trainer, Adaptor (ours)**. We will look at the most low-level one (Pure PyTorch) and the most high-level one (Adaptor), but in your own time, you can also look at the [Sequence Classification tutorial](https://huggingface.co/docs/transformers/tasks/sequence_classification) from HuggingFace.


## ⛵ Low-level Training pipeline - Example

Here, we are going to take a look at the low level of updating a network.
The process can be summarized in the following steps:

1. **We pick our base model** to fine-tune. While all Transformer LMs can perform token classification, not all of them are equally good at it. We'll talk about it a bit more offline.

2. **We construct a training dataset** from our data. Here, we transform texts into valid model inputs/samples (as we've seen in the previous session) and assign true labels to each sample.

3. **We iterate over the samples** in so-called *epochs*. In this step, we get the model predictions for a *batch* of samples. The raw predictions take the form of unnormalized scores (*logits*), which are converted to probabilities (usually log-probabilities, for numerical stability).

4. **We update the model**. Here, we first compare the predicted probabilities with the "true probabilities", where the true category gets a probability of 1 and all other categories get 0. The comparison is done by a so-called *loss* function, which is a special kind of distance measure. Then, we update the weights of the model so that the loss decreases.

5. **We continue as long as the model improves**, which we measure on a held-out dataset, so that we notice when the model merely memorizes our training data (and would then perform badly in the real world).
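The comparison in step 4 can be worked through by hand. The following is a minimal sketch of the cross-entropy loss for a single prediction over three categories, written with the standard library only; the helper names `softmax` and `cross_entropy` and the example scores are chosen for this illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtracting the maximum avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], true_index: int) -> float:
    """Loss against one-hot "true probabilities": -log of the probability of the true category."""
    return -math.log(softmax(logits)[true_index])


confident_and_right = cross_entropy([4.0, 0.0, 0.0], true_index=0)
undecided = cross_entropy([0.0, 0.0, 0.0], true_index=0)
confident_and_wrong = cross_entropy([4.0, 0.0, 0.0], true_index=1)

print(confident_and_right, undecided, confident_and_wrong)
```

The loss is small when the model puts most of the probability on the true category, equals `log(3)` when it is undecided between three categories, and grows when it is confident about a wrong one.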

### 1: Pick our base model

In [2]:
from transformers import GPT2LMHeadModel, AutoTokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({'pad_token': 'pad'})


0

### 2: Constructing training dataset

In [3]:
# 2: We construct training dataset

from datasets import load_dataset

dataset = load_dataset("imdb")



  0%|          | 0/3 [00:00<?, ?it/s]

In [4]:
sample_encoding = tokenizer(dataset["unsupervised"]['text'][10], 
                            padding="longest",  # this assures that all input samples have the same length and can be processed in a batch
                            truncation=True,  # this removes parts of inputs that can not fit into the model input
                            return_tensors="pt"  # this returns the samples as PyTorch tensors, that we do not have to convert ourselves
                            )

#### Constructing training labels

In [5]:
# GPT's input ids, decoded:
"|".join(tokenizer.batch_decode(sample_encoding["input_ids"][0]))

"This| isn|'t| the| worst| comedy| of| all|-|time|,| but| that| is| about| the| best| thing| that| I| can| say| about| this| pathetic| film|.| I| didn|'t| laugh| once|,| or| even| smile| once| during| this| bomb|.| There| was| usually| something| going| on| on|-|screen|,| so| I| didn|'t| get| TO|O| bored|,| but| most| of| the| jokes| here| were| simply| awful|.| The| final| sequence| is| nothing| more| than| a| long| series| of| people| falling| through| doors| and| stumbling| all| over| the| place|.| Needless| to| say|,| it| was| a| fitting| way| to| end| a| movie| that| was| impossible| for| me| to| like|."

In [6]:
# GPT's training labels: the label of each position is the token that follows it, so we simply drop the first input id (i.e. shift the inputs by one position)
labels = sample_encoding["input_ids"][..., 1:]

"|".join(tokenizer.batch_decode(labels[0]))

" isn|'t| the| worst| comedy| of| all|-|time|,| but| that| is| about| the| best| thing| that| I| can| say| about| this| pathetic| film|.| I| didn|'t| laugh| once|,| or| even| smile| once| during| this| bomb|.| There| was| usually| something| going| on| on|-|screen|,| so| I| didn|'t| get| TO|O| bored|,| but| most| of| the| jokes| here| were| simply| awful|.| The| final| sequence| is| nothing| more| than| a| long| series| of| people| falling| through| doors| and| stumbling| all| over| the| place|.| Needless| to| say|,| it| was| a| fitting| way| to| end| a| movie| that| was| impossible| for| me| to| like|."

### Creating next-token prediction inputs from each sample

For the next-token prediction, we actually create multiple samples from each text: There are **many** tokens that we can use as **targets**!

To make it easier for us, we'll repeatedly use the **same input ids**, and only **attend to the previous tokens**, to be used in prediction. We just need to be careful not to un-mask the actually-predicted token.

We can implement this quite easily by constructing a **triangular attention mask** for each input from the batch.
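Before building the mask with tensors, the idea can be sketched on a plain list of tokens. The helper `next_token_samples` and the `<eos>` string are hypothetical names used only for this illustration.

```python
from typing import List, Tuple


def next_token_samples(tokens: List[str], eos_token: str = "<eos>") -> List[Tuple[List[int], List[str], str]]:
    """Create one (attention mask row, visible tokens, target) triple per predicted token."""
    targets = tokens[1:] + [eos_token]  # the target of position i is the token at position i + 1
    samples = []
    for i, target in enumerate(targets):
        mask_row = [1 if j <= i else 0 for j in range(len(tokens))]  # i-th row of the triangle
        visible = tokens[: i + 1]  # the tokens that the mask leaves visible
        samples.append((mask_row, visible, target))
    return samples


samples = next_token_samples(["This", " isn", "'t", " bad"])
for mask_row, visible, target in samples:
    print(mask_row, visible, "->", repr(target))
```

Each row attends to one more token than the previous one, and the token to be predicted is never among the visible ones.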

In [7]:
attended_input_length = sample_encoding["attention_mask"].sum(axis=1)
attended_input_length

tensor([113])

**Input ids**: duplicate inputs by the number of predicted tokens

In [8]:
# duplicate inputs by the number of predicted tokens
input_ids = sample_encoding["input_ids"].expand(attended_input_length, -1)
input_ids

tensor([[1212, 2125,  470,  ...,  284,  588,   13],
        [1212, 2125,  470,  ...,  284,  588,   13],
        [1212, 2125,  470,  ...,  284,  588,   13],
        ...,
        [1212, 2125,  470,  ...,  284,  588,   13],
        [1212, 2125,  470,  ...,  284,  588,   13],
        [1212, 2125,  470,  ...,  284,  588,   13]])

In [9]:
input_ids.shape

torch.Size([113, 113])

**Attention mask**: we create triangles that will mask all future tokens from prediction

In [11]:
import torch 

# this is how we construct tensor triangles
torch.tril(torch.ones(4, 6), diagonal=0)

tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.]])

In [12]:
attention_triangle = torch.tril(torch.ones(attended_input_length, input_ids.shape[1]), diagonal=0)
attention_triangle

tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [1., 1., 0.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        ...,
        [1., 1., 1.,  ..., 1., 0., 0.],
        [1., 1., 1.,  ..., 1., 1., 0.],
        [1., 1., 1.,  ..., 1., 1., 1.]])

In [13]:
attention_triangle.shape

torch.Size([113, 113])

**Labels**: Finally, we spread the pre-computed labels to assign exactly one label id to each new sample

In [14]:
labels = sample_encoding["input_ids"][..., 1:][:attended_input_length]
labels
"|".join(tokenizer.batch_decode(labels[0]))

" isn|'t| the| worst| comedy| of| all|-|time|,| but| that| is| about| the| best| thing| that| I| can| say| about| this| pathetic| film|.| I| didn|'t| laugh| once|,| or| even| smile| once| during| this| bomb|.| There| was| usually| something| going| on| on|-|screen|,| so| I| didn|'t| get| TO|O| bored|,| but| most| of| the| jokes| here| were| simply| awful|.| The| final| sequence| is| nothing| more| than| a| long| series| of| people| falling| through| doors| and| stumbling| all| over| the| place|.| Needless| to| say|,| it| was| a| fitting| way| to| end| a| movie| that| was| impossible| for| me| to| like|."

In [15]:
# After the model is done, we want it to generate a special <EoS> token. This way, we know that the model is done with generation.
model.config.eos_token_id

50256

In [16]:
# Hence, we add the token as the last label
labels = torch.hstack([labels, torch.tensor([[model.config.eos_token_id]])])

In [17]:
labels.shape

torch.Size([1, 113])

In [18]:
sample_encoding["input_ids"].shape

torch.Size([1, 113])

#### Now, we wrap the whole encoding into a method

In [122]:
from typing import Dict
import torch

def construct_causalLM_sample(sample: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Expand the encoding of a single text into one training sample per predicted token."""
    extended_batch = {}

    # number of tokens of the input text == number of predicted tokens
    attended_input_length = sample["input_ids"].shape[-1]

    # the same input ids are repeated for every predicted token
    extended_batch["input_ids"] = sample["input_ids"].expand(attended_input_length, -1)

    # triangular mask: the i-th sample attends only to tokens 0..i
    extended_batch["attention_mask"] = torch.tril(torch.ones(attended_input_length, attended_input_length), diagonal=0)

    # the label of the i-th sample is the token i+1; the last sample is trained to predict <EoS>
    extended_batch["labels"] = torch.hstack([sample["input_ids"][0, 1:], torch.tensor([model.config.eos_token_id])])

    # the prediction of the i-th sample is read from its last attended position, i.e. position i
    extended_batch["labels_position"] = torch.arange(attended_input_length)

    return extended_batch

### 3: Iterate over the samples and 4: Update the model

In [25]:
{k: v.shape for k, v in dataset_batch.items()}

{'input_ids': torch.Size([2, 201]),
 'attention_mask': torch.Size([2, 201]),
 'labels': torch.Size([2])}

In [31]:
model_outputs.logits.shape

torch.Size([2, 201, 50257])

In [137]:
from torch.optim import AdamW
from torch.nn import CrossEntropyLoss

batch_size = 2
max_num_epochs = 5
learning_rate = 2e-6
logging_steps = 1000

# the optimizer will actually update the model weights,
# so that they get better at prediction after every step
optimizer = AdamW(model.parameters(), lr=learning_rate)

loss_fn = CrossEntropyLoss()  # distance function comparing predictions to expected labels

model.train()  # enable training-only behaviour such as dropout
step = 0
running_loss = 0
last_running_loss = float("inf")
stop_training = False

for epoch in range(max_num_epochs):
    for text in dataset["unsupervised"]['text']:
        sample_encoding = tokenizer(text,
                                    truncation=True,  # this removes parts of inputs that can not fit into the model input
                                    return_tensors="pt"  # this returns the samples as PyTorch tensors, that we do not have to convert ourselves
                                    )
        clm_samples = construct_causalLM_sample(sample_encoding)
        for batch_offset in range(0, len(clm_samples["input_ids"]), batch_size):

            # the construction of the training batch that we've seen above
            dataset_batch = {k: clm_samples[k][batch_offset: batch_offset+batch_size]
                             for k in clm_samples.keys()}
            current_batch_size = len(dataset_batch["input_ids"])  # the last batch of a text can be smaller

            # Model prediction (also called forward pass)
            model_logits = model(input_ids=dataset_batch["input_ids"],
                                 attention_mask=dataset_batch["attention_mask"]).logits
            # HuggingFace implementation gives us predictions for all tokens,
            # but we'll update the model only based on the prediction at the last attended position of each sample
            last_attended_position = dataset_batch["attention_mask"].sum(dim=1).long() - 1
            logits_with_labels = model_logits[torch.arange(current_batch_size), last_attended_position]

            # we first compare the predicted probabilities with "true probabilities"
            loss_value = loss_fn(logits_with_labels, dataset_batch["labels"])

            # we compute the errors (gradients) of each model parameter (also called backward pass)
            loss_value.backward()
            running_loss += loss_value.item()

            # 4. We update the model
            optimizer.step()
            optimizer.zero_grad()
            step += 1

            # 5: Evaluation: Check and stop the training if the model no longer improves
            if step % logging_steps == 0:
                print("Current training loss: %s" % running_loss)
                # stop if the loss increased since the last check
                if last_running_loss < running_loss:
                    stop_training = True
                    break
                last_running_loss = running_loss
                running_loss = 0  # restart the log
        if stop_training:
            break
    if stop_training:
        break

KeyboardInterrupt: ignored

In [128]:
model_lprobs[torch.arange(batch_size), dataset_batch["labels_position"]].shape

torch.Size([2, 50257])

In [135]:
dataset_batch["input_ids"].shape, dataset_batch["labels"].shape

(torch.Size([2, 201]), torch.Size([2]))

In [117]:
dataset_batch["input_ids"].size(1)

201

In [121]:
torch.vstack([model_lprobs[:, i] for i in range(dataset_batch["input_ids"].size(1))]).shape

torch.Size([402, 50257])

In [72]:
torch.ones_like(torch.arange(model_lprobs.shape[1])).shape

torch.Size([201])

In [75]:
torch.ones_like(model_lprobs).dtype

torch.float32

In [134]:
dataset_batch["labels"]

tensor([318, 655])

## 🛥 High-level Training pipeline

As mentioned above, many libraries make it easy to train your NN model, at different levels of user complexity. Having looked at the most low-level option (Pure PyTorch), we now turn to the most high-level one (Adaptor). In your own time, you can also look at the [Generation training tutorial]() and the [Example script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-generation/run_generation.py) from HuggingFace.

## Training Generative model using Adaptor

When we train the model for generation, we update it to maximise the chance of correctly predicting the following token.

During inference (i.e. the actual usage of the model), we look only at the model's prediction for this single token. This is also what `transformers.AnyModelForSequenceClassification` does internally.


In [None]:
!pip install sentencepiece protobuf==3.20.0 adaptor==0.2.1  # required for generation

In [None]:
from adaptor.lang_module import LangModule

language_module = LangModule("google/mt5-base")

Next, we choose the objective that we want to fine-tune the model for. The objective will take care of configuring the model correctly. We just give it our desired inputs and outputs.

In [None]:
from adaptor.objectives.seq2seq import Sequence2Sequence

# NOTE: this cell expects `dataset` to be a review dataset with "train" and "validation" splits
# and `comment` / `rating_int` columns; the `imdb` dataset loaded above has a different layout
prompt = "How many stars will this review rate? Options: 1, 2, 3, 4, 5. "

training_objective = Sequence2Sequence(lang_module=language_module,
                                       texts_or_path=[prompt + review for review in dataset["train"]['comment']],
                                       labels_or_path=[str(x) for x in dataset["train"]["rating_int"]],
                                       val_texts_or_path=[prompt + review for review in dataset["validation"]['comment']],
                                       val_labels_or_path=[str(x) for x in dataset["validation"]["rating_int"]],
                                       batch_size=1)

The training process is configured through a possibly large set of arguments. You can read through each of them in the [TrainingArguments documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).
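As a rough illustration of how a few of the arguments used in this notebook interact, the arithmetic below works out the effective batch size. It assumes, as in the HuggingFace Trainer, that one "step" is one optimizer update made after `gradient_accumulation_steps` batches; how Adaptor counts steps in detail is not verified here.

```python
# Values taken from the configuration used in this notebook
batch_size = 1                   # batch size set on the training objective
gradient_accumulation_steps = 8  # batches accumulated before one weight update
eval_steps = 200                 # evaluate after this many updates
warmup_steps = 1000              # updates with a gradually increasing learning rate

# Assumption: one "step" is one optimizer update, as in the HuggingFace Trainer
effective_batch_size = batch_size * gradient_accumulation_steps
samples_between_evaluations = effective_batch_size * eval_steps
samples_during_warmup = effective_batch_size * warmup_steps

print(effective_batch_size, samples_between_evaluations, samples_during_warmup)
```

Under this assumption, the model sees 8 samples per update even though only one sample is held in memory at a time.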

In [None]:
from adaptor.utils import AdaptationArguments, StoppingStrategy

# TODO: maybe set output dir
output_dir = "/content/drive/MyDrive/training_output_dir"

args = AdaptationArguments(output_dir=output_dir,
                           learning_rate=2e-5,
                           warmup_steps=1000,
                           stopping_strategy=StoppingStrategy.FIRST_OBJECTIVE_CONVERGED,
                           do_train=True,
                           do_eval=True,
                           log_level="critical",
                           logging_steps=100,
                           eval_steps=200,
                           evaluation_strategy="steps",
                           save_steps=200,
                           save_total_limit=6,
                           stopping_patience=5,
                           num_train_epochs=20,
                           gradient_accumulation_steps=8)

Adaptor allows training for multiple objectives at once, applied in a chosen Schedule. Multi-objective training is useful in more complex scenarios, but in our case a single objective suffices, so the choice of Schedule does not really matter.

In [None]:
from adaptor.schedules import SequentialSchedule
from adaptor.adapter import Adapter

schedule = SequentialSchedule(objectives=[training_objective], args=args)
# the Adapter wraps the model, the schedule and the training arguments
adapter = Adapter(lang_module=language_module,
                  schedule=schedule,
                  args=args)

Last thing before the training: we need to **persist the weights** of the trained model somewhere. In our case, we create checkpoints that can be directly loaded as any HuggingFace model.

In Google Colab, you can mount your Google Drive to persist the model checkpoints using the following commands. If you run this script elsewhere, you may skip the following steps.

In [None]:
# This will mount your google drive to persist the training model later on. 
# If you do not want to do it, you can skip this command.

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# TODO: before starting the training, check that the folder where the model will be persisted actually exists

!ls $output_dir

In [None]:
# if it does not, create it manually in the menu on the right

# !mkdir $output_dir

In [None]:
# Check that the folder for checkpoints exists: you can continue if this command passes without errors

!ls $output_dir

After all the configuration, we are ready to run the training and wait for the trained model.

Given `stopping_strategy=StoppingStrategy.FIRST_OBJECTIVE_CONVERGED` and `stopping_patience=5`, the training will terminate once the `model_quality_evaluator` (i.e. `SequenceAccuracy`) has not improved over five consecutive evaluations.
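The idea of stopping with patience can be sketched in a few lines. This is not Adaptor's implementation, only a minimal illustration of the logic; the function `stopping_evaluation` and the accuracy values are made up for the example.

```python
from typing import List, Optional


def stopping_evaluation(metric_values: List[float], patience: int) -> Optional[int]:
    """Return the index of the evaluation at which training would stop, or None if it never stops.

    Training stops once the metric (higher is better) has not improved
    over `patience` consecutive evaluations.
    """
    best_value: Optional[float] = None
    evaluations_without_improvement = 0
    for index, value in enumerate(metric_values):
        if best_value is None or value > best_value:
            best_value = value
            evaluations_without_improvement = 0  # an improvement resets the counter
        else:
            evaluations_without_improvement += 1
            if evaluations_without_improvement >= patience:
                return index
    return None


# made-up validation accuracies, one per evaluation
accuracies = [0.50, 0.55, 0.60, 0.59, 0.60, 0.58, 0.60, 0.57]

print(stopping_evaluation(accuracies, patience=1))
print(stopping_evaluation(accuracies, patience=5))
```

A larger patience tolerates noisy evaluations at the cost of a longer training; a patience of 1 stops at the first evaluation that fails to improve on the best result so far.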

In [None]:
adapter.train()