# Joint multitask

The point of this notebook is not to do anything useful, but to show what is possible with relatively little effort using the *transformer_heads* library. In this example, we train four heads on a transformer model while using QLoRA to finetune the transformer block weights:

- The first head is hooked at layer 9 (-4) and predicts the sentiment of IMDb reviews (text classification).
- The second head is hooked at the last layer (-1 or 12) and does causal language modelling on IMDb reviews.
- The third head is hooked at layer 6 (-7) and learns to count the occurrences of each letter of the alphabet in IMDb reviews (text-level regression).
- The final head is also hooked at layer 6 (-7) and predicts, for each token, how many tokens follow before the review ends (token-level regression). This head is a small MLP instead of a linear head.

The four heads and the QLoRA parameters are trained jointly (multi-task learning). The original pretrained lm head is also attached, but kept frozen for comparison.
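
The negative hook indices in the description above count from the end of the layer stack, in the same way as negative Python list indices. A minimal sketch of that arithmetic, assuming the 12 transformer blocks of gpt2 and 1-based layer numbers as used in the text. The helper `resolve_layer_hook` is hypothetical and not part of *transformer_heads*:

```python
def resolve_layer_hook(layer_hook: int, num_layers: int = 12) -> int:
    """Map a negative hook index to a 1-based layer number (illustrative helper)."""
    if layer_hook >= 0:
        # Assumption: non-negative values already are layer numbers.
        return layer_hook
    return num_layers + 1 + layer_hook


for hook in (-1, -4, -7):
    print(f"layer_hook={hook} -> layer {resolve_layer_hook(hook)}")
```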

In [1]:
from transformer_heads import create_headed_qlora, load_lora_with_heads
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    MistralForCausalLM,
    Trainer,
    BitsAndBytesConfig,
    TrainingArguments,
    GPT2Model,
    GPT2LMHeadModel,
)
from transformer_heads.util.helpers import DataCollatorWithPadding, get_model_params
from peft import LoraConfig
from transformer_heads.config import HeadConfig
from transformer_heads.util.model import print_trainable_parameters
from transformer_heads.util.evaluate import (
    evaluate_head_wise,
)
import torch
import pandas as pd

In [None]:
model_path = "gpt2"
train_epochs = 1
eval_epochs = 1
logging_steps = 100
train_batch_size = 4
eval_batch_size = 4

In [2]:
model_params = get_model_params(model_path)
model_class = model_params["model_class"]
hidden_size = model_params["hidden_size"]
vocab_size = model_params["vocab_size"]
print(model_params)

Define the heads. Because the loss functions and the magnitudes of the labels differ between heads, it is important to weight the loss of each head so that all heads contribute similarly to training.
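
To see what the weights do, the sketch below multiplies each head's loss by its `loss_weight`. The losses are copied from the pre-training evaluation output further down in this notebook and the weights from the head configs. It is an assumption that the frozen *lm_head* uses a default weight of 1.0, since its config does not set one:

```python
# Loss weights from the head configs (lm_head: assumed default of 1.0).
loss_weights = {
    "sentiment_head": 2.0,
    "causal_lm": 1.0,
    "alphabet_regression": 0.002,
    "num_tokens_regression": 0.0002,
    "lm_head": 1.0,
}

# Per-head losses copied from the evaluation run before training.
raw_losses = {
    "sentiment_head": 4.195852350467329,
    "causal_lm": 21.761122667106093,
    "alphabet_regression": 4400.525231428207,
    "num_tokens_regression": 35131.852517292,
    "lm_head": 3.6937329708390934,
}

weighted = {name: loss_weights[name] * raw_losses[name] for name in raw_losses}
total = sum(weighted.values())

for name, value in weighted.items():
    print(f"{name:>22}: {value:8.3f}")
print(f"{'total':>22}: {total:8.3f}")
```

Without weighting, the two regression losses would dominate by several orders of magnitude. With the weights, every contribution falls between roughly 3.7 and 21.8, and the sum agrees with the first value of the evaluation tuple to within rounding.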

In [3]:
head_configs = [
    HeadConfig(
        name="sentiment_head",
        layer_hook=-4,
        in_size=hidden_size,
        output_activation="linear",
        pred_for_sequence=True,
        loss_fct="cross_entropy",
        num_outputs=2,
        loss_weight=2.0,
    ),
    HeadConfig(
        name="causal_lm",
        layer_hook=-1,
        in_size=hidden_size,
        output_activation="linear",
        is_causal_lm=True,
        loss_fct="cross_entropy",
        num_outputs=vocab_size,
        is_regression=False,
        output_bias=False,
        loss_weight=1.0,
    ),
    HeadConfig(
        name="alphabet_regression",
        layer_hook=-7,
        in_size=hidden_size,
        output_activation="linear",
        is_causal_lm=False,
        pred_for_sequence=True,
        loss_fct="mse",
        num_outputs=26,  # 26 letters in the alphabet
        is_regression=True,
        loss_weight=0.002,
    ),
    HeadConfig(
        name="num_tokens_regression",
        layer_hook=-7,
        hidden_size=128,  # MLP hidden size
        num_layers=3,  # 2 hidden layers in MLP
        in_size=hidden_size,
        output_activation="linear",
        is_causal_lm=False,
        pred_for_sequence=False,
        loss_fct="mse",
        num_outputs=1,
        is_regression=True,
        loss_weight=0.0002,
    ),
    HeadConfig(
        name="lm_head",  # Let's also keep the original lm head for comparison
        layer_hook=-1,
        in_size=hidden_size,
        output_activation="linear",
        is_causal_lm=True,
        pred_for_sequence=False,
        loss_fct="cross_entropy",
        num_outputs=vocab_size,
        is_regression=False,
        trainable=False,  # Keep it in its pretrained state
    ),
]

In [4]:
dd = load_dataset("imdb")

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_path)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token


def processing_function(examples):
    out = tokenizer(examples["text"], padding=False, truncation=True)
    out["sentiment_head"] = examples["label"]
    out["causal_lm"] = out["lm_head"] = out["input_ids"].copy()
    out["num_tokens_regression"] = [
        list(map(float, range(len(ids) - 1, -1, -1))) for ids in out["input_ids"]
    ]
    out["alphabet_regression"] = [
        [
            float(text.count(x) + text.count(x.upper()))
            for x in "abcdefghijklmnopqrstuvwxyz"
        ]
        for text in examples["text"]
    ]
    return out


for split in dd.keys():
    dd[split] = dd[split].filter(function=lambda example: len(example["text"]) > 10)
    dd[split] = dd[split].shuffle()
    dd[split] = dd[split].map(processing_function, batched=True)

dd.set_format(
    type="torch",
    columns=["input_ids", "attention_mask"] + [x.name for x in head_configs],
)
for split in dd.keys():
    dd[split] = dd[split].remove_columns(["text", "label"])


In [6]:
dd["test"]

Dataset({
    features: ['input_ids', 'attention_mask', 'sentiment_head', 'causal_lm', 'lm_head', 'num_tokens_regression', 'alphabet_regression'],
    num_rows: 25000
})
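
The two regression label columns are easiest to understand on a toy input. The sketch below mirrors the label logic of `processing_function` without a tokenizer. The helper names and the toy token ids are made up for illustration:

```python
import string


def countdown_labels(input_ids):
    """For each token: how many tokens follow before the sequence ends."""
    return [float(n) for n in range(len(input_ids) - 1, -1, -1)]


def letter_counts(text):
    """Case-insensitive count of each of the 26 letters, in alphabetical order."""
    return [
        float(text.count(c) + text.count(c.upper()))
        for c in string.ascii_lowercase
    ]


toy_ids = [101, 7, 42, 9]  # hypothetical token ids
print(countdown_labels(toy_ids))  # one label per token
print(letter_counts("Abba")[:3])  # counts for a, b, c
```

The token-level target has one value per token, which is why that head sets `pred_for_sequence=False`, while the letter counts form a single 26-dimensional target per review.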

Setting *target_modules=None* in the LoRA config makes *create_headed_qlora* create LoRA modules for all linear layers in the transformer.

In [7]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    load_in_8bit=False,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.float32,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules=None,
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)
model = create_headed_qlora(
    base_model_class=model_class,
    model_name=model_path,
    quantization_config=quantization_config,
    lora_config=lora_config,
    head_configs=head_configs,
    fully_trained_heads=True,
    device_map={"": torch.cuda.current_device()},
    gradient_checkpointing=True,
)

Some weights of TransformerWithHeads were not initialized from the model checkpoint at gpt2 and are newly initialized: ['heads.alphabet_regression.lins.0.weight', 'heads.causal_lm.lins.0.weight', 'heads.num_tokens_regression.lins.0.bias', 'heads.num_tokens_regression.lins.0.weight', 'heads.num_tokens_regression.lins.1.bias', 'heads.num_tokens_regression.lins.1.weight', 'heads.num_tokens_regression.lins.2.weight', 'heads.sentiment_head.lins.0.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
print_trainable_parameters(model)

all params: 130143616 || trainable params: 48171136 || trainable%: 37.01382939905404
params by dtype: defaultdict(<class 'int'>, {torch.float32: 87676288, torch.uint8: 42467328})
trainable params by dtype: defaultdict(<class 'int'>, {torch.float32: 48171136})
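
The numbers in this output can be cross-checked with a little arithmetic. The sketch below assumes that 4-bit weights are stored packed, two per `uint8` element, and uses the standard gpt2 dimensions (hidden size 768, 12 blocks) for the linear weights inside the transformer blocks:

```python
all_params = 130_143_616
trainable_params = 48_171_136
params_by_dtype = {"float32": 87_676_288, "uint8": 42_467_328}

# The per-dtype counts add up to the total.
assert sum(params_by_dtype.values()) == all_params

trainable_pct = 100 * trainable_params / all_params
print(f"trainable%: {trainable_pct:.4f}")

# Linear weights per gpt2 block: attention qkv, attention output, two MLP layers.
hidden = 768
per_block = hidden * 3 * hidden + hidden * hidden + 2 * hidden * 4 * hidden
quantized_weights = 12 * per_block

# Assumption: two 4-bit values are packed into each uint8 element.
packed_elements = quantized_weights // 2
print(quantized_weights, packed_elements)
```

Under these assumptions the packed element count equals the reported `uint8` count, so the "all params" figure understates the number of logical weights in the quantized blocks by about 42 million.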


In [9]:
collator = DataCollatorWithPadding(
    feature_name_to_padding_value={
        "input_ids": tokenizer.pad_token_id,
        "attention_mask": 0,
        "causal_lm": -100,
        "lm_head": -100,
        "num_tokens_regression": 0,
    }
)
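
The padding values matter because reviews in a batch have different lengths. A label of -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions of the two language modelling heads do not contribute to the loss. The sketch below shows the idea with a hypothetical right-padding helper in plain Python. It is not the library's implementation:

```python
def pad_sequences(sequences, padding_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [
        list(seq) + [padding_value] * (max_len - len(seq))
        for seq in sequences
    ]


batch_labels = [[11, 12, 13], [21]]  # toy label sequences of unequal length
print(pad_sequences(batch_labels, -100))
print(pad_sequences([[1, 1, 1], [1]], 0))  # attention mask padded with 0
```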

In [10]:
print(
    evaluate_head_wise(
        model, dd["test"], collator, epochs=eval_epochs, batch_size=eval_batch_size
    )
)

Evaluating: 313it [03:54,  1.33it/s]                           

(49.67398147825982, {'sentiment_head': 4.195852350467329, 'causal_lm': 21.761122667106093, 'alphabet_regression': 4400.525231428207, 'num_tokens_regression': 35131.852517292, 'lm_head': 3.6937329708390934})





In [11]:
args = TrainingArguments(
    output_dir="imdb_linear_probe",
    learning_rate=0.0002,
    num_train_epochs=train_epochs,  # To speed things up set to 0.1, set to 1 for better performance
    logging_steps=logging_steps,
    do_eval=False,
    remove_unused_columns=False,
    optim="paged_adamw_32bit",
    gradient_checkpointing=True,
    lr_scheduler_type="constant",
    ddp_find_unused_parameters=False,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=eval_batch_size,
)
trainer = Trainer(
    model,
    args=args,
    train_dataset=dd["train"],
    data_collator=collator,
)
trainer.train()



| Step | Training Loss |
|------|---------------|
| 20   | 43.129        |
| 40   | 33.1802       |
| 60   | 30.1188       |
| 80   | 24.658        |
| 100  | 24.4318       |
| 120  | 25.3398       |
| 140  | 25.5653       |
| 160  | 23.6597       |
| 180  | 18.7199       |
| 200  | 23.6384       |


TrainOutput(global_step=313, training_loss=24.852344452001798, metrics={'train_runtime': 171.9685, 'train_samples_per_second': 14.538, 'train_steps_per_second': 1.82, 'total_flos': 1342270855919616.0, 'train_loss': 24.852344452001798, 'epoch': 0.1})

In [12]:
evals = evaluate_head_wise(
    model, dd["test"], collator, epochs=eval_epochs, batch_size=eval_batch_size
)
print(evals)

Evaluating: 313it [03:56,  1.32it/s]                           

(20.37207105660894, {'sentiment_head': 0.7118918163001917, 'causal_lm': 6.284298394136368, 'alphabet_regression': 2519.2657797746597, 'num_tokens_regression': 16490.400432610968, 'lm_head': 4.327377298834977})
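
To compare this result with the evaluation before training, the relative change per head is more readable than the raw losses. A small sketch with the values copied from the two evaluation outputs in this notebook:

```python
before = {
    "sentiment_head": 4.195852350467329,
    "causal_lm": 21.761122667106093,
    "alphabet_regression": 4400.525231428207,
    "num_tokens_regression": 35131.852517292,
    "lm_head": 3.6937329708390934,
}
after = {
    "sentiment_head": 0.7118918163001917,
    "causal_lm": 6.284298394136368,
    "alphabet_regression": 2519.2657797746597,
    "num_tokens_regression": 16490.400432610968,
    "lm_head": 4.327377298834977,
}

# Relative change of each head's loss in percent (negative means improvement).
change_pct = {
    name: 100 * (after[name] - before[name]) / before[name] for name in before
}
for name, pct in change_pct.items():
    print(f"{name:>22}: {pct:+7.1f}%")
```

All four trained heads improve, while the loss of the frozen *lm_head* goes up. A plausible reading, not verified here, is that the LoRA updates change the hidden states the frozen head was pretrained on.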





# Saving and loading
Now let's see how to save a complicated multi-headed model and then load it again for inference. Saving is easy: just call *save_pretrained* and all trained parameters will be saved correctly. This also works for checkpoints created during training.

In [None]:
model.save_pretrained("qlora_multitask_imdb")
del model

When loading the model, all heads must be attached and initialized correctly, which does not easily work with the plain Hugging Face API. Instead, *transformer_heads* provides the *load_lora_with_heads* function. Note that passing a quantization config is optional here. We could also pass a different quantization config or none at all.

In [None]:
model = load_lora_with_heads(
    model_class,
    "qlora_multitask_imdb",
    quantization_config,
    device_map={"": torch.cuda.current_device()},
)

Let's now find out if the loaded model behaves the same as the saved model:

In [None]:
new_evals = evaluate_head_wise(
    model, dd["test"], collator, epochs=eval_epochs, batch_size=eval_batch_size
)
print(new_evals)
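
Comparing the two printed tuples by eye is error-prone. The sketch below compares them programmatically, assuming that *evaluate_head_wise* returns a `(total_loss, per_head_losses)` tuple as in the outputs above. The helper `evals_match` and its tolerance are hypothetical choices:

```python
import math


def evals_match(a, b, rel_tol=1e-3):
    """Check that two (total, per-head dict) results agree within a tolerance."""
    total_a, heads_a = a
    total_b, heads_b = b
    if heads_a.keys() != heads_b.keys():
        return False
    if not math.isclose(total_a, total_b, rel_tol=rel_tol):
        return False
    return all(
        math.isclose(heads_a[name], heads_b[name], rel_tol=rel_tol)
        for name in heads_a
    )


# Toy values for illustration only, not real evaluation results.
saved = (1.5, {"head_a": 1.0, "head_b": 0.5})
loaded = (1.5001, {"head_a": 1.0001, "head_b": 0.5})
print(evals_match(saved, loaded))
```

In the notebook this would be called as `evals_match(evals, new_evals)`. A small tolerance is used instead of exact equality because small numerical differences between the two runs are possible.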