**What are we going to do in the notebook?**
We are going to train a model with Prompt Tuning, starting from a single pre-trained model. The active code uses t5-small with the xsum summarization dataset; the Bloom settings (bloomz-560m, bloom-1b1) remain as commented-out alternatives. We will compare the model's output for the same input before and after training.

Additionally, we'll see how to load the trained prompt on top of the copy of the foundational model that is already in memory.

**Loading the PEFT Library**
This library contains the Hugging Face implementation of various Fine-Tuning techniques, including Prompt Tuning.
From the transformers library, we import the necessary classes to instantiate the model and the tokenizer.

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModelForSeq2SeqLM

**Loading the model and the tokenizers.**
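The code below loads t5-small, a small encoder-decoder model that can be trained with the PEFT Library using Prompt Tuning. The commented-out lines select models from the Bloom Family instead, and I encourage you to try more than one model to observe the differences. If you switch to a Bloom model, also switch to AutoModelForCausalLM and TaskType.CAUSAL_LM.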

I'm opting for the smallest one to minimize training time and avoid memory issues in Colab.

In [2]:
model_name = "t5-small"
dataset = "xsum"
#model_name = "bigscience/bloomz-560m"
#model_name="bigscience/bloom-1b1"
NUM_VIRTUAL_TOKENS = 4
NUM_EPOCHS = 6
#TEXT = "I want you to act as a motivational coach. "
TEXT = """Barack Obama has endorsed Vice-President Kamala Harris to be the Democratic presidential nominee, ending days of speculation over whether he would support her.
Former President Obama and ex-First Lady Michelle Obama said in a joint statement that they believe Ms Harris has the "vision, the character, and the strength that this critical moment demands".
Mr Obama was reportedly among more than 100 prominent Democrats Ms Harris spoke to after President Joe Biden announced last Sunday he was dropping out of the race.
In a statement at the time, Mr Obama praised Mr Biden's exit, but stopped short of endorsing Ms Harris.
The US vice-president has already secured the support of a majority of Democratic delegates, setting her on course to become the official nominee at the party convention in August.
The Obamas said in Friday's statement that they could not be "more thrilled to endorse" Ms Harris. They vowed to do "everything we can" to elect her.
"We agree with President Biden," said the couple's statement, "choosing Kamala was one of the best decisions he’s made. She has the resume to prove it."
They cited her record as California’s attorney general, a US senator and then vice-president.
"But Kamala has more than a resume," the statement continued. "She has the vision, the character, and the strength that this critical moment demands.
"There is no doubt in our mind that Kamala Harris has exactly what it takes to win this election and deliver for the American people.
"At a time when the stakes have never been higher, she gives us all reason to hope."
The statement was accompanied by a video of Ms Harris taking a phone call from the Obamas in which they pledge their support.
"Oh my goodness," says the vice-president in the clip. "Michelle, Barack, this means so much to me."""

In [3]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
#foundational_model = AutoModelForCausalLM.from_pretrained(
foundational_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True
)

**Inference with the pre-trained model**
If you want more varied and original generations, uncomment the temperature, top_p, and do_sample parameters in model.generate below.

With the default configuration, the model's responses remain consistent across calls.

In [4]:
#This function returns the outputs that the given model generates for the given inputs.
def get_outputs(model, inputs, max_new_tokens=2048):
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        #temperature=0.2,
        #top_p=0.95,
        #do_sample=True,
        repetition_penalty=1.5, #Avoid repetition.
        early_stopping=True, #Stopping condition for beam search; generation also ends when EOS is produced.
        eos_token_id=tokenizer.eos_token_id
    )
    return outputs
    return outputs
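
To make the effect of do_sample, temperature, and top_p more concrete, here is a minimal sketch of how the next token could be picked from a toy set of logits. It is a simplified illustration in plain Python, not the transformers implementation; the function names (softmax, top_p_filter, pick_token) and the toy logits are hypothetical.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose probability mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept

def pick_token(logits, do_sample=False, temperature=1.0, top_p=1.0, rng=random):
    if not do_sample:
        # Greedy decoding: always the most likely token, so repeated calls agree.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    kept = top_p_filter(probs, top_p)
    # Sample among the kept tokens, weighted by their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
greedy_picks = {pick_token(toy_logits) for _ in range(20)}
rng = random.Random(0)
sampled_picks = {
    pick_token(toy_logits, do_sample=True, temperature=1.0, top_p=0.95, rng=rng)
    for _ in range(200)
}
print("greedy:", greedy_picks)
print("sampled:", sampled_picks)
```

In this toy case greedy decoding always returns the same token, while sampling can return any of the tokens that survive the top_p cut. The least likely token is filtered out and is never sampled.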

The model will receive the news article stored in TEXT as its input. The prompt from the Bloom configuration, "I want you to act as a motivational coach.", remains available as a commented-out alternative.

But first, I'm going to collect some results from the model without Fine-Tuning.

In [5]:
input_prompt = tokenizer(TEXT, return_tensors="pt")
foundational_outputs_prompt = get_outputs(foundational_model, input_prompt)

print(tokenizer.batch_decode(foundational_outputs_prompt, skip_special_tokens=True))



['has endorsed Kamala Harris to be the Democratic presidential nominee. former president Obama and ex-First Lady Michelle Obama said in joint statement that they believe she has the "vision, character"']


The answer is more or less correct. The pre-trained model can already generate accurate and sensible sentences. Let's see whether, after training, the responses stay the same or become more accurate.

**Preparing the Datasets**
The datasets referenced by the Bloom configuration are listed below; the active code loads xsum instead:

https://huggingface.co/datasets/fka/awesome-chatgpt-prompts
https://huggingface.co/datasets/Abirate/english_quotes

In [6]:
import os
#os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [7]:
from datasets import load_dataset

#dataset_prompt = "fka/awesome-chatgpt-prompts"
dataset_prompt = "xsum"

#Load the dataset and keep a small subset to shorten training time.
data_prompt = load_dataset(dataset_prompt)
train_dataset = data_prompt["train"].select(range(5000))
validate_dataset = data_prompt["validation"].select(range(1000))

#Tokenize the summaries of the selected rows only.
#train_dataset = train_dataset.map(lambda samples: tokenizer(samples["prompt"]), batched=True)
train_dataset = train_dataset.map(lambda samples: tokenizer(samples["summary"]), batched=True)
validate_dataset = validate_dataset.map(lambda samples: tokenizer(samples["summary"]), batched=True)

In [8]:
display(train_dataset)

Dataset({
    features: ['document', 'summary', 'id', 'input_ids', 'attention_mask'],
    num_rows: 5000
})

In [9]:
print(train_dataset[:1])



**Fine-Tuning.**
PEFT configurations
API docs: https://huggingface.co/docs/peft/main/en/package_reference/tuners#peft.PromptTuningConfig

The configuration below is used for the model to be trained, and it can be reused if you train more than one.

In [10]:
from peft import get_peft_model, PromptTuningConfig, TaskType, PromptTuningInit

generation_config = PromptTuningConfig(
    #task_type=TaskType.CAUSAL_LM, #For decoder-only models such as Bloom.
    task_type=TaskType.SEQ_2_SEQ_LM, #For encoder-decoder models such as T5.
    prompt_tuning_init=PromptTuningInit.RANDOM,  #The added virtual tokens are initialized with random numbers.
    num_virtual_tokens=NUM_VIRTUAL_TOKENS, #Number of virtual tokens to be added and trained.
    tokenizer_name_or_path=model_name #The pre-trained model.
)
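
To picture what the virtual tokens are, here is a minimal sketch in plain Python: a small block of randomly initialized embeddings is placed in front of the embeddings of the real input, and only that block is trained. This is a conceptual illustration, not PEFT's code; the helper names (random_prompt, prepend_prompt) and the toy HIDDEN_SIZE are hypothetical.

```python
import random

NUM_VIRTUAL_TOKENS = 4   # same value as in the notebook
HIDDEN_SIZE = 8          # toy embedding width, chosen only for readability

def random_prompt(num_tokens, hidden_size, seed=0):
    """Create randomly initialized virtual-token embeddings (the trainable part)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
            for _ in range(num_tokens)]

def prepend_prompt(prompt_embeddings, input_embeddings):
    """Return the sequence the model would see: virtual tokens, then the real tokens."""
    return prompt_embeddings + input_embeddings

virtual_tokens = random_prompt(NUM_VIRTUAL_TOKENS, HIDDEN_SIZE)
# Stand-in for the frozen embeddings of a 5-token input.
input_embeddings = [[0.0] * HIDDEN_SIZE for _ in range(5)]
model_input = prepend_prompt(virtual_tokens, input_embeddings)

print(len(model_input), "positions,", NUM_VIRTUAL_TOKENS * HIDDEN_SIZE, "trainable values")
```

The weights of the pre-trained model are not touched; training only updates the values inside virtual_tokens.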

**Creating two Prompt Tuning Models.**
We will create the prompt tuning model from the pre-trained model and the config. The same two ingredients can be reused to create further models.

In [11]:
peft_model_prompt = get_peft_model(foundational_model, generation_config)
peft_model_prompt.print_trainable_parameters() #Prints the summary itself and returns None.

trainable params: 4,096 || all params: 60,510,720 || trainable%: 0.0068


**That's amazing: did you see the reduction in trainable parameters? We are going to train about 0.0068% of the parameters available.**
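
The numbers reported by print_trainable_parameters can be reproduced by hand. The sketch below assumes a hidden size of 512 for t5-small and assumes that, for a seq2seq task, virtual tokens are allocated for both the encoder and the decoder; both values are assumptions that happen to match the 4,096 trainable parameters shown above.

```python
NUM_VIRTUAL_TOKENS = 4      # from the notebook configuration
HIDDEN_SIZE = 512           # assumption: embedding width of t5-small
NUM_SUBMODULES = 2          # assumption: encoder and decoder each get virtual tokens
ALL_PARAMS = 60_510_720     # total reported by print_trainable_parameters

trainable_params = NUM_VIRTUAL_TOKENS * NUM_SUBMODULES * HIDDEN_SIZE
trainable_percent = 100 * trainable_params / ALL_PARAMS

print(f"trainable params: {trainable_params:,}")
print(f"trainable%: {trainable_percent:.4f}")
```

The percentage comes out at roughly 0.0068, which matches the trainable% value in the cell output.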

Now we are going to create the training arguments. The helper function below can be reused for every training run.

In [12]:
from transformers import TrainingArguments, Seq2SeqTrainingArguments
def create_training_arguments(path, learning_rate=0.0035, epochs=6):
    #training_args = TrainingArguments(
    #    output_dir=path, # Where the model predictions and checkpoints will be written
    #    use_cpu=True, # This is necessary for CPU clusters.
    #    auto_find_batch_size=True, # Find a suitable batch size that will fit into memory automatically
    #    learning_rate= learning_rate, # Higher learning rate than full Fine-Tuning
    #    num_train_epochs=epochs
    #)
    training_args = Seq2SeqTrainingArguments(
        output_dir=path, # Where the model predictions and checkpoints will be written
        use_cpu=True, # This is necessary for CPU clusters.
        auto_find_batch_size=True, # Find a suitable batch size that will fit into memory automatically
        learning_rate=learning_rate, # Higher learning rate than full Fine-Tuning
        num_train_epochs=epochs
    )
    return training_args

In [13]:
import os

working_dir = "./prompt_tuning"

#It is best to store each model in its own folder.
#Create the name of the directory where the model will be stored.
output_directory_prompt =  os.path.join(working_dir, "peft_t5_outputs_prompt")


#Create the directories if they do not exist yet (parent folders included).
os.makedirs(output_directory_prompt, exist_ok=True)


We need to pass the directory where the model will be stored when creating the TrainingArguments.

In [14]:
training_args_prompt = create_training_arguments(output_directory_prompt, 0.003, NUM_EPOCHS)


In [15]:
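from evaluate import load
import numpy as np
import nltk
nltk.download("punkt", quiet=True) #Data needed by nltk.sent_tokenize; newer NLTK releases may ask for "punkt_tab" instead.
metric = load("rouge")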

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True, use_aggregator=True)
    # Scale the scores to percentages
    result = {key: value * 100 for key, value in result.items()}

    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
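
For intuition about what the rouge metric measures, here is a simplified sketch of ROUGE-1 as unigram overlap between a prediction and a reference. It is only an approximation of what the evaluate library computes (no stemming, plain whitespace tokenization), and the function name rouge1_f1 and the two sentences are made up for illustration.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """F1 score of the unigram overlap between prediction and reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps, for each word, the smaller of the two counts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat is on the mat", "the cat sat on the mat")
print(round(score * 100, 4))
```

Five of the six words overlap in both directions, so precision and recall are both 5/6. The score is scaled to a percentage in the same way as in compute_metrics.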


In [16]:
from transformers import DataCollatorForSeq2Seq

# Create a data collator that will pad your inputs and labels
data_collator = DataCollatorForSeq2Seq(tokenizer, model=peft_model_prompt)

# Tokenize the documents as inputs and the summaries as labels
max_input_length = 512
max_target_length = 128
prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = train_dataset.map(preprocess_function, batched=True)
validate_dataset = validate_dataset.map(preprocess_function, batched=True)
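
The preprocessing leaves every example with its own length, so the data collator has to pad each batch. The sketch below shows the idea in plain Python: inputs are padded with the pad token id, while labels are padded with -100, the value that compute_metrics later replaces before decoding. The token ids are made up, the helper pad_batch is hypothetical, and a pad token id of 0 is an assumption for the T5 tokenizer.

```python
PAD_TOKEN_ID = 0      # assumption: pad token id of the T5 tokenizer
LABEL_PAD_ID = -100   # label positions with this value are ignored by the loss

def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

# Made-up token ids for a batch of two examples.
batch_input_ids = [[21, 22, 23, 1], [31, 1]]
batch_labels = [[41, 1], [51, 52, 53, 1]]

padded_inputs = pad_batch(batch_input_ids, PAD_TOKEN_ID)
padded_labels = pad_batch(batch_labels, LABEL_PAD_ID)
# 1 marks a real token, 0 marks padding.
attention_mask = [
    [1] * len(seq) + [0] * (len(padded) - len(seq))
    for seq, padded in zip(batch_input_ids, padded_inputs)
]

print(padded_inputs)
print(padded_labels)
print(attention_mask)
```

Padding per batch rather than to a fixed maximum keeps batches short when the examples in them are short.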


**Train**
We will create the Trainer object for the model to train.

In [17]:
from transformers import Trainer, DataCollatorForLanguageModeling, DataCollatorForSeq2Seq, Seq2SeqTrainer
def create_trainer(model, training_args, train_dataset):
    #trainer = Trainer(
    #    model=model, # We pass in the PEFT version of the foundation model, bloomz-560M
    #    args=training_args, #The args for the training.
    #    train_dataset=train_dataset, #The dataset used to train the model.
    #    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False) # mlm=False indicates not to use masked language modeling
    #)
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model) #Pads inputs and labels for each batch.

    trainer = Seq2SeqTrainer(
        model=model, # We pass in the PEFT version of the foundation model, t5-small
        args=training_args, #The args for the training.
        train_dataset=train_dataset, #The dataset used to train the model.
        eval_dataset=validate_dataset, #Taken from the notebook's global scope.
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics, #ROUGE scores computed on the validation set.
    )
    return trainer

In [18]:
#Train the model.
trainer_prompt = create_trainer(peft_model_prompt, training_args_prompt, train_dataset)
trainer_prompt.train()

  0%|          | 0/3750 [00:00<?, ?it/s]
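
The total of 3750 steps in the progress bar can be checked with a little arithmetic. The sketch assumes a batch size of 8 (the default per-device value that the Trainer starts from) on a single device, with no gradient accumulation; these are assumptions, not values set explicitly in the notebook.

```python
import math

NUM_EXAMPLES = 5000   # rows selected for train_dataset
NUM_EPOCHS = 6        # from the notebook configuration
BATCH_SIZE = 8        # assumption: default per-device batch size, single device

steps_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS

print(steps_per_epoch, total_steps)
```

If auto_find_batch_size had to lower the batch size to fit into memory, the number of steps would grow accordingly.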

**Save models**
We are going to save the model. Only the trained prompt is stored, so it is ready to be used as long as we have the pre-trained model from which it was created in memory.

In [None]:
trainer_prompt.model.save_pretrained(output_directory_prompt)
#trainer_sentences.model.save_pretrained(output_directory_sentences)
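
To get a feeling for why saving only the trained prompt is so cheap, here is a rough size estimate based on the parameter counts printed earlier. It assumes 4 bytes per parameter (float32) and ignores any file-format overhead, so treat the result as an order of magnitude rather than the exact size on disk.

```python
TRAINABLE_PARAMS = 4_096     # trained virtual-token parameters
ALL_PARAMS = 60_510_720      # all parameters of the PEFT model
BYTES_PER_PARAM = 4          # assumption: float32 weights

adapter_bytes = TRAINABLE_PARAMS * BYTES_PER_PARAM
full_model_bytes = ALL_PARAMS * BYTES_PER_PARAM

print(f"trained prompt: {adapter_bytes / 1024:.1f} KiB")
print(f"full model:     {full_model_bytes / 1024**2:.1f} MiB")
```

Under these assumptions the trained prompt takes a few kilobytes, while a full copy of the model would take hundreds of megabytes.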

**Inference**
You can load the model from the path where you saved it and ask it to generate text from the same input we used before.

In [None]:
from peft import PeftModel

loaded_model_prompt = PeftModel.from_pretrained(foundational_model,
                                         output_directory_prompt,
                                         #device_map='auto',
                                         is_trainable=False)

In [None]:
loaded_model_prompt_outputs = get_outputs(loaded_model_prompt, input_prompt)
print(tokenizer.batch_decode(loaded_model_prompt_outputs, skip_special_tokens=True))