# Prompt Tuning
In this notebook, we explore prompt tuning, and see how we can easily switch between tasks with minimal fine-tuning.


References:    

 https://huggingface.co/docs/peft/main/en/task_guides/clm-prompt-tuning   
 https://huggingface.co/datasets/ought/raft   
 https://huggingface.co/datasets/takala/financial_phrasebank
 

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, default_data_collator, get_linear_schedule_with_warmup
from peft import get_peft_config, get_peft_model, PromptTuningInit, PromptTuningConfig, TaskType, PeftType
import torch
from datasets import load_dataset
import os
from torch.utils.data import DataLoader
from tqdm import tqdm

### Preprocessing function

**Create a preprocess_function to:**

Tokenize the input text and labels.

* 1) For each example in a batch, append the tokenizer's pad_token_id to the label tokens.  
* 2) Concatenate the input text and label tokens into the model_inputs, and mask the input positions in the labels with -100 so that the loss only covers the label tokens.   
* 3) Create an attention mask that covers the concatenated sequence.   
* 4) Loop through each example in the batch again to left-pad the input ids, labels, and attention mask to max_length, truncate them to max_length, and convert them to PyTorch tensors.
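
The steps above can be traced on a single toy example without a tokenizer. This is a minimal sketch: the token ids, `PAD_ID`, and the helper name `build_example` are made up for illustration and are not part of the notebook.

```python
# Toy stand-ins for real tokenizer output (hypothetical values).
PAD_ID = 3            # assumed pad token id, for illustration only
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips
MAX_LENGTH = 8


def build_example(prompt_ids, target_ids, max_length=MAX_LENGTH):
    # Step 1: append a pad token to the label tokens.
    target_ids = target_ids + [PAD_ID]
    # Step 2: concatenate prompt and label; mask the prompt part of the labels.
    input_ids = prompt_ids + target_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    # Step 3: attention mask over the concatenated sequence.
    attention_mask = [1] * len(input_ids)
    # Step 4: left-pad to max_length, then truncate to max_length.
    pad = max_length - len(input_ids)  # a negative value yields no padding
    input_ids = ([PAD_ID] * pad + input_ids)[:max_length]
    attention_mask = ([0] * pad + attention_mask)[:max_length]
    labels = ([IGNORE_INDEX] * pad + labels)[:max_length]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


example = build_example(prompt_ids=[11, 12, 13], target_ids=[21, 22])
for name, values in example.items():
    print(name, values)
```

Only the label tokens (and the trailing pad token) keep real ids in `labels`; every other position is -100 and is ignored by the loss.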

In [15]:
def preprocess_function(examples):
    
    batch_size = len(examples[text_column])
    inputs = [f"{text_column} : {x} Label : " for x in examples[text_column]]
    targets = [str(x) for x in examples[label_column]]
    model_inputs = tokenizer(inputs)
    labels = tokenizer(targets)
    
    for i in range(batch_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i] + [tokenizer.pad_token_id]
        #print(i, sample_input_ids, label_input_ids)
        
        model_inputs["input_ids"][i] = sample_input_ids + label_input_ids
        labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids
        model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i])
    #print(model_inputs)
    
    
    for i in range(batch_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i]
        model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * (
            max_length - len(sample_input_ids)
        ) + sample_input_ids
        model_inputs["attention_mask"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[
            "attention_mask"
        ][i]
        labels["input_ids"][i] = [-100] * (max_length - len(sample_input_ids)) + label_input_ids
        model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length])
        model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length])
        labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_length])
    model_inputs["labels"] = labels["input_ids"]
    
    return model_inputs

## Step 1: Initialize the model, tokenizer, dataset, and prompt tokens
The PromptTuningConfig specifies the task type, the text used to initialize the prompt embedding, the number of virtual tokens, and the tokenizer to use. PromptTuningConfig is a type of PEFT config: during training, only the prompt embedding weights are updated and the base model stays frozen. The rest of training and inference works just like regular classification with an LLM.
PromptTuningConfig reference: https://huggingface.co/docs/peft/main/en/package_reference/prompt_tuning#peft.PromptTuningConfig


In [3]:
device = "cuda"

model_name_or_path = "bigscience/bloomz-560m"
tokenizer_name_or_path = "bigscience/bloomz-560m"

# ***Important***
peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=8,
    prompt_tuning_init_text="Classify if the tweet is a complaint or not:",
    tokenizer_name_or_path=model_name_or_path,
)


max_length = 64
lr = 3e-2
num_epochs = 20
batch_size = 8

## Data loading, and preprocessing

Load the twitter_complaints subset of the RAFT dataset. This subset contains tweets that are labeled either complaint or no complaint:
https://huggingface.co/datasets/ought/raft

In [10]:
dataset_name = "twitter_complaints"
text_column = "Tweet text"
label_column = "text_label"


dataset = load_dataset("ought/raft", dataset_name)
# Single quotes inside the f-string keep this valid on Python versions before 3.12.
print(f"Number of training samples: {len(dataset['train'])}, Number of test samples {len(dataset['test'])}")
dataset["train"][0]

Number of training samples: 50, Number of test samples 3399


{'Tweet text': '@HMRCcustomers No this is my first job', 'ID': 0, 'Label': 2}

To make the Label column more readable, replace the Label value with the corresponding label text and store them in a text_label column. You can use the map function to apply this change over the entire dataset in one step:

In [6]:
# Label 0 corresponds to class 'Unlabeled', Label 1 to 'complaint', and Label 2 to 'no complaint'.
dataset["train"].features["Label"].names

['Unlabeled', 'complaint', 'no complaint']

In [11]:
# To make the Label column more readable, replace the Label value with the corresponding label text
classes = [k.replace("_", " ") for k in dataset["train"].features["Label"].names]
dataset = dataset.map(
    lambda x: {"text_label": [classes[label] for label in x["Label"]]},
    batched=True,
    num_proc=1,
)
dataset["train"][0]

{'Tweet text': '@HMRCcustomers No this is my first job',
 'ID': 0,
 'Label': 2,
 'text_label': 'no complaint'}

In [12]:
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id


In [13]:
# See the number of tokens used for each class label after being tokenized.
# The max number of tokens is 3, so the model's answers should have at most 3 tokens.
[(tokenizer(class_label)["input_ids"],class_label) for class_label in classes]

[([3074, 4762, 60943], 'Unlabeled'),
 ([16449, 5952], 'complaint'),
 ([1936, 106863], 'no complaint')]

In [16]:
processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)

Running tokenizer on dataset:   0%|          | 0/50 [00:00<?, ? examples/s]

Running tokenizer on dataset:   0%|          | 0/3399 [00:00<?, ? examples/s]

In [17]:
train_dataset = processed_datasets["train"]
eval_dataset = processed_datasets["test"]


train_dataloader = DataLoader(
    train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True
)
eval_dataloader = DataLoader(eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)

## Load the model
Initialize a base model from AutoModelForCausalLM.

In [19]:
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)

### Prompt the model before training with one example (zero-shot)

In [24]:
# test the model before training
inputs = tokenizer(
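    # The instruction is written inline so the f-string does not reuse its own quote character (a syntax error before Python 3.12).
    f"{text_column} :  Categorize the following sentence in 'complaint'or 'no complaint' @nationalgridus I have no water and the bill is current and paid. Can you do something about this? Label : ",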
    return_tensors="pt",
)

model.to(device)

with torch.no_grad():
    inputs = {k: v.to(device) for k, v in inputs.items()}
    outputs = model.generate(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=2, eos_token_id=3
    )
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))



["Tweet text :  Categorize the following sentence in 'complaint'or 'no complaint' @nationalgridus I have no water and the bill is current and paid. Can you do something about this? Label :  No complaintThe present invention relates to a method"]


The correct answer is complaint.

## Train

Pass the model and peft_config to the get_peft_model() function to create a PeftModel. You can print the new PeftModel’s trainable parameters to see how much more efficient it is than training the full parameters of the original model!

In [25]:
# Configuring the model training with peft_config.
model = get_peft_model(model, peft_config)
# Trainable params are the parameters for our prompt embedding layer.
# print_trainable_parameters() prints the summary itself and returns None, so it is not wrapped in print().
model.print_trainable_parameters()

trainable params: 8,192 || all params: 559,222,784 || trainable%: 0.0015
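
The trainable parameter count follows from the shape of the prompt embedding: one vector of the model's hidden size per virtual token. A quick arithmetic check, using the hidden size (1024) from the model summary and the total parameter count from the output above:

```python
num_virtual_tokens = 8        # from peft_config
hidden_size = 1024            # embedding width of bloomz-560m, per the model summary
all_params = 559_222_784      # total reported by print_trainable_parameters()

trainable_params = num_virtual_tokens * hidden_size
trainable_pct = 100 * trainable_params / all_params

print(f"trainable params: {trainable_params:,}")
print(f"trainable%: {trainable_pct:.4f}")
```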


### Visualize model structure and the prompt embedding layer before training

In [28]:
model

PeftModelForCausalLM(
  (base_model): BloomForCausalLM(
    (transformer): BloomModel(
      (word_embeddings): Embedding(250880, 1024)
      (word_embeddings_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (h): ModuleList(
        (0-23): 24 x BloomBlock(
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): BloomAttention(
            (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
            (dense): Linear(in_features=1024, out_features=1024, bias=True)
            (attention_dropout): Dropout(p=0.0, inplace=False)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): BloomMLP(
            (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
            (gelu_impl): BloomGelu()
            (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
          )
        )
      

In [45]:
model.prompt_encoder

ModuleDict(
  (default): PromptEmbedding(
    (embedding): Embedding(8, 1024)
  )
)

In [29]:
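# clone() takes a snapshot; .data alone would share memory with the live weights and change during training.
prompt_embedding_layer_before_ft = model.prompt_encoder["default"].embedding.weight.data.clone()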
print("Embedding tensor values:", prompt_embedding_layer_before_ft)

Embedding tensor values: tensor([[-0.0146, -0.0090, -0.0184,  ..., -0.0424, -0.0203,  0.0109],
        [ 0.0189,  0.0071,  0.0161,  ..., -0.0432, -0.0030,  0.0039],
        [-0.0062, -0.0035,  0.0085,  ..., -0.0430,  0.0081, -0.0003],
        ...,
        [-0.0026,  0.0100, -0.0056,  ..., -0.0434, -0.0068,  0.0047],
        [ 0.0153,  0.0015, -0.0072,  ..., -0.0433, -0.0094, -0.0019],
        [-0.0084,  0.0141, -0.0068,  ..., -0.0432, -0.0139,  0.0233]],
       device='cuda:0')


### Set the optimizer and lr_scheduler

In [30]:
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=(len(train_dataloader) * num_epochs),
)

### Training loop

In [32]:
model = model.to(device)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.detach().float()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
    train_epoch_loss = total_loss / len(train_dataloader)
    train_ppl = torch.exp(train_epoch_loss)
    print(f"{epoch=}: {train_ppl=}  {train_epoch_loss=}")

    model.eval()
    
    # Uncomment the lines below to evaluate after each epoch; leaving them commented out makes the training loop faster.
    
#     eval_loss = 0
#     eval_preds = []
#     for step, batch in enumerate(tqdm(eval_dataloader)):
#         batch = {k: v.to(device) for k, v in batch.items()}
#         with torch.no_grad():
#             outputs = model(**batch)
#         loss = outputs.loss
#         eval_loss += loss.detach().float()
#         eval_preds.extend(
#             tokenizer.batch_decode(torch.argmax(outputs.logits, -1).detach().cpu().numpy(), skip_special_tokens=True)
#         )

#     eval_epoch_loss = eval_loss / len(eval_dataloader)
#     eval_ppl = torch.exp(eval_epoch_loss)
#     print(f"{epoch=}: {eval_ppl=} {eval_epoch_loss=}")

100%|██████████| 7/7 [00:02<00:00,  2.60it/s]
100%|██████████| 425/425 [01:35<00:00,  4.43it/s]


epoch=0: train_ppl=tensor(239.9144, device='cuda:0') train_epoch_loss=tensor(5.4803, device='cuda:0') eval_ppl=tensor(9217.8994, device='cuda:0') eval_epoch_loss=tensor(9.1289, device='cuda:0')


100%|██████████| 7/7 [00:02<00:00,  2.51it/s]
100%|██████████| 425/425 [01:39<00:00,  4.26it/s]


epoch=1: train_ppl=tensor(176.4419, device='cuda:0') train_epoch_loss=tensor(5.1730, device='cuda:0') eval_ppl=tensor(11803.8564, device='cuda:0') eval_epoch_loss=tensor(9.3762, device='cuda:0')


100%|██████████| 7/7 [00:02<00:00,  2.50it/s]
100%|██████████| 425/425 [01:39<00:00,  4.25it/s]


epoch=2: train_ppl=tensor(133.8714, device='cuda:0') train_epoch_loss=tensor(4.8969, device='cuda:0') eval_ppl=tensor(12261.6914, device='cuda:0') eval_epoch_loss=tensor(9.4142, device='cuda:0')


100%|██████████| 7/7 [00:02<00:00,  2.50it/s]
100%|██████████| 425/425 [01:39<00:00,  4.27it/s]


epoch=3: train_ppl=tensor(101.8403, device='cuda:0') train_epoch_loss=tensor(4.6234, device='cuda:0') eval_ppl=tensor(13147.6250, device='cuda:0') eval_epoch_loss=tensor(9.4840, device='cuda:0')


100%|██████████| 7/7 [00:02<00:00,  2.51it/s]
100%|██████████| 425/425 [01:38<00:00,  4.30it/s]


epoch=4: train_ppl=tensor(78.4906, device='cuda:0') train_epoch_loss=tensor(4.3630, device='cuda:0') eval_ppl=tensor(22318.8613, device='cuda:0') eval_epoch_loss=tensor(10.0132, device='cuda:0')


100%|██████████| 7/7 [00:02<00:00,  2.51it/s]
100%|██████████| 425/425 [01:38<00:00,  4.31it/s]


epoch=5: train_ppl=tensor(62.6062, device='cuda:0') train_epoch_loss=tensor(4.1369, device='cuda:0') eval_ppl=tensor(22512.9238, device='cuda:0') eval_epoch_loss=tensor(10.0218, device='cuda:0')


100%|██████████| 7/7 [00:02<00:00,  2.52it/s]
100%|██████████| 425/425 [01:38<00:00,  4.31it/s]


epoch=6: train_ppl=tensor(47.3318, device='cuda:0') train_epoch_loss=tensor(3.8572, device='cuda:0') eval_ppl=tensor(25957.7539, device='cuda:0') eval_epoch_loss=tensor(10.1642, device='cuda:0')


100%|██████████| 7/7 [00:02<00:00,  2.52it/s]
 79%|███████▉  | 336/425 [01:18<00:20,  4.30it/s]


KeyboardInterrupt: 
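
The perplexity values in the log are the exponential of the mean loss for the epoch. A small check against two of the logged training values (small differences come from the rounding of the printed loss):

```python
import math

# (mean training loss, logged training perplexity) copied from the output above
logged = {
    0: (5.4803, 239.9144),
    6: (3.8572, 47.3318),
}

for epoch, (loss, ppl) in logged.items():
    recomputed = math.exp(loss)
    print(f"epoch={epoch}: exp(loss)={recomputed:.2f}, logged ppl={ppl:.2f}")
```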

### Let's test the trained prompt with the same example

In [56]:
inputs = tokenizer(
    f'{text_column} : {"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?"} Label : ',
    return_tensors="pt",
)

In [57]:
model.to(device)

with torch.no_grad():
    inputs = {k: v.to(device) for k, v in inputs.items()}
    outputs = model.generate(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=2, eos_token_id=3
    )
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

['Tweet text : @nationalgridus I have no water and the bill is current and paid. Can you do something about this? Label : complaint']


The correct class is "complaint".

## We can keep the prompt embedding weights to prompt the model for this task

Let's save the prompt embedding weights to be able to use it for the twitter_complaints task ('complaint' vs 'no complaint' prediction).

In [None]:
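# Save the fine-tuned prompt weights.
# clone() keeps an independent copy, so later changes to the layer do not affect the saved weights.

prompt_embedding_weights_twitter_complaints = model.prompt_encoder["default"].embedding.weight.data.clone()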
print("Embedding tensor values:", prompt_embedding_weights_twitter_complaints)


## Task 1: replace the prompt weights with zero, and see the effect on the example.

In [63]:
import torch
# Now replace the prompt embedding layer with zeros and see the effect.
zero_tensor = torch.zeros((8, 1024))
zero_tensor

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

In [64]:
model.prompt_encoder["default"].embedding.weight.data = zero_tensor

# Do inference again.
inputs = tokenizer(
    f'{text_column} : {"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?"} Label : ',
    return_tensors="pt",
)

model.to(device)

with torch.no_grad():
    inputs = {k: v.to(device) for k, v in inputs.items()}
    outputs = model.generate(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=2, eos_token_id=3
    )
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

Now you can see that this small portion of the model actually matters. Without fine-tuning the soft prompt layer, we would need to train the whole model, which would take much longer.

## Task 2:  Try the same model for a new task

You don't need to change the whole model. This is the beauty of prompt optimization.

Steps to follow:
* Load the new dataset, and follow the same steps to preprocess the data.   
* Train a new prompt embedding matrix for the new task just like before.   
* Save the new prompt weights. 
* Switch between the tasks by using their respective prompt weights.
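
One possible way to organize the save and switch steps is a small registry of prompt weights keyed by task name. This is only a sketch under stated assumptions: the `nn.Embedding` below is a stand-in for model.prompt_encoder["default"].embedding, and `saved_prompts`, `save_prompt`, and `load_prompt` are hypothetical names, not part of PEFT.

```python
import torch
from torch import nn

# Stand-in for model.prompt_encoder["default"].embedding (8 virtual tokens, hidden size 1024).
embedding = nn.Embedding(8, 1024)

saved_prompts = {}  # hypothetical registry: task name -> prompt weights


def save_prompt(task):
    # detach().clone() makes an independent copy that later training cannot modify.
    saved_prompts[task] = embedding.weight.detach().clone()


def load_prompt(task):
    # copy_ writes the saved values into the existing parameter in place.
    with torch.no_grad():
        embedding.weight.copy_(saved_prompts[task])


save_prompt("twitter_complaints")
with torch.no_grad():
    embedding.weight.add_(1.0)  # stand-in for training the prompt on a second task
save_prompt("financial_phrasebank")

load_prompt("twitter_complaints")  # switch back to the first task
print(torch.equal(embedding.weight, saved_prompts["twitter_complaints"]))
```

Because only the 8 x 1024 prompt matrix changes between tasks, the frozen base model is shared and switching tasks is just a small tensor copy.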



Don't forget to try the model with one example, before and after the training. You can use the following example.

In [39]:
# test the model before and after the training
test_inputs = tokenizer(
    f'{text_column} : {"The Lithuanian beer market made up 14.41 million liters in January , a rise of 0.8 percent from the year-earlier figure , the Lithuanian Brewers. Association reporting citing the results from its members ."} Label : ',
    return_tensors="pt",
)
# The correct answer for this example is "positive"

## New dataset and task

For this task, train on the sentences_allagree subset of the financial_phrasebank dataset. This dataset contains financial news categorized by sentiment.
https://huggingface.co/datasets/takala/financial_phrasebank

In [33]:
# 1) load the dataset
dataset = load_dataset("financial_phrasebank", "sentences_allagree")
dataset = dataset["train"].train_test_split(test_size=0.1)


classes = dataset["train"].features["label"].names
dataset = dataset.map(
    lambda x: {"text_label": [classes[label] for label in x["label"]]},
    batched=True,
    num_proc=1,
)


text_column = "sentence"
label_column = "text_label"
# Single quotes inside the f-string keep this valid on Python versions before 3.12.
print(f"Number of training samples: {len(dataset['train'])}, Number of test samples {len(dataset['test'])}")
dataset["train"][0]

Map:   0%|          | 0/2037 [00:00<?, ? examples/s]

Map:   0%|          | 0/227 [00:00<?, ? examples/s]

{'sentence': 'Igor and Oleg Yankov , who currently manage Moron and Vitim , will hold onto the 25 % stake for now .',
 'label': 1,
 'text_label': 'neutral'}

In [35]:
print("Labels:", dataset["train"].features["label"].names)

Labels: ['negative', 'neutral', 'positive']


In [None]:
# Initialize the tokenizer, preprocess the data, and create data loaders.

In [None]:
# Test the model with the example before prompt optimization

In [None]:
# Initialize the optimizer 

In [None]:
# Train the model

In [2]:
# Test the model with the example after prompt optimization


In [3]:
# Save the prompt weights


## Task 3: now try switching between tasks with the same model

In [1]:
# switch to the first task


In [None]:
# Is it effective?