<a href="https://colab.research.google.com/github/arquansa/PSTB-exercises/blob/main/Week08/Day2/DC2/W8D2DC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Daily Challenge: How to Fine-Tune LLMs with LoRA

Parameter-Efficient Fine-Tuning (PEFT) methods, like LoRA, address the challenges of fine-tuning large language models (LLMs) by only updating a small subset of the model’s parameters. This approach significantly reduces computational and storage costs, making LLM fine-tuning more accessible. PEFT techniques allow developers to adapt pre-trained models to specific tasks without retraining the entire model, leading to faster development cycles and reduced resource consumption.
You will implement it for this challenge.
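
As a rough illustration of the idea, the sketch below applies a LoRA-style update to a toy linear layer in plain Python: the frozen weight `W` is left untouched, and only two small matrices `A` and `B` of rank `r` would be trained. The sizes, the random values, and the helper names `matvec` and `lora_forward` are assumptions made for this example; they are not part of the PEFT library.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x): frozen weight plus low-rank update."""
    base = matvec(W, x)                # frozen pre-trained path
    update = matvec(B, matvec(A, x))   # low-rank path of rank r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 8, 6, 1, 1     # toy sizes, not the real model's

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable, random init
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B initialised to zero the adapter is a no-op, so training starts
# from exactly the pre-trained behaviour.
starts_unchanged = lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full_params = d_in * d_out             # weights updated by full fine-tuning
lora_params = r * (d_in + d_out)       # weights updated by LoRA
print(starts_unchanged, full_params, lora_params)
```

With these toy sizes, full fine-tuning would update 48 weights while the rank-1 adapter has 14; the gap widens quickly as the layer dimensions grow.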

👩‍🏫 👩🏿‍🏫 What You’ll learn
- How to apply Low-Rank Adaptation (LoRA) to a pre-trained language model.
- How to fine-tune a LoRA-adapted model using the Hugging Face PEFT library.
- How to save and load a fine-tuned LoRA model.
- How to perform inference using a fine-tuned LoRA model.

🛠️ What you will create
A fine-tuned language model that generates text based on a specific dataset of quotes, using LoRA.

Dataset
The “Abirate/english_quotes” dataset, specifically a 10% sample of the training split.

Task
- Install necessary libraries (PEFT, datasets).
- Load a pre-trained language model (bigscience/bloomz-560m) and its tokenizer.
- Load the dataset and preprocess it for the model.
- Configure LoRA using LoraConfig.
- Apply LoRA to the pre-trained model using get_peft_model.
- Set up training arguments using TrainingArguments.
- Initialize and train the model using Trainer.
- Save the fine-tuned LoRA model.
- Load the saved LoRA model for inference using PeftModel.from_pretrained.
- Generate text using the fine-tuned model and the tokenizer.


Hint :

%pip install peft==0.4.0

mkdir cache

!pip install datasets

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name =
tokenizer =
foundation_model =

data =  # Sample 10%
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
train_sample = data.select(range(5))
display(train_sample)

import peft
from peft import LoraConfig, get_peft_model

**Fill in `r=1` and `target_modules`.**
lora_config = LoraConfig(
    r=,
    lora_alpha=, # a scaling factor that adjusts the magnitude of the weight matrix. Usually set to 1
    target_modules=,
    lora_dropout=,
    bias="none", # this specifies if the bias parameter should be trained.
    task_type="CAUSAL_LM"
)

**Add the adapter layers to the foundation model to be trained**
peft_model = get_peft_model(foundation_model, lora_config)
peft_model.print_trainable_parameters()

**Fill out the `Trainer` class.**
import transformers
from transformers import TrainingArguments, Trainer
import os

output_directory = os.path.join("../cache/working", "peft_lab_outputs")
training_args = TrainingArguments(
    report_to="none",
    output_dir=output_directory,
    auto_find_batch_size=,
    learning_rate= 3e-2, # Higher learning rate than full fine-tuning.
    num_train_epochs=,
    use_cpu=True
)

trainer = Trainer(
    model=,
    args=,
    train_dataset=,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer.train()

**Load the PEFT model using pre-defined LoRA configs and foundation model.** We set `is_trainable=False` to avoid further training.

import time

time_now =
peft_model_path = os.path.join(output_directory, f"peft_model_{time_now}")
trainer.model.save_pretrained(peft_model_path)

**Generate output tokens**

inputs = tokenizer("Two things are infinite: ", return_tensors="pt")
outputs = peft_model.generate(
    ...
    )

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

## Install necessary libraries (PEFT, datasets)

In [14]:
!pip install peft==0.4.0 datasets transformers accelerate
!mkdir cache


All necessary dependencies have been installed.

## Load a pre-trained language model (bigscience/bloomz-560m) and its tokenizer

In [None]:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
foundation_model = AutoModelForCausalLM.from_pretrained(
    model_name
)

The pre-trained model and its tokenizer are now ready for fine-tuning with LoRA (Low-Rank Adaptation), a PEFT method.
The objective is to adapt the model by training only a small number of added parameters instead of retraining all of its weights, which considerably reduces computational and storage costs and shortens development cycles.

## Load the dataset and preprocess it for the model

In [6]:
data = load_dataset("Abirate/english_quotes", split="train[:10%]")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
train_sample = data.select(range(5))
print(train_sample)

Dataset({
    features: ['quote', 'author', 'tags', 'input_ids', 'attention_mask'],
    num_rows: 5
})
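
The tokenized dataset contains `input_ids` and `attention_mask` but no `labels` column. For causal language modeling, a data collator such as `DataCollatorForLanguageModeling(tokenizer, mlm=False)` derives the labels from the input IDs when it assembles each batch. The plain-Python sketch below imitates that behavior in a simplified form: it pads on the right and uses made-up token IDs, and the function name `collate_causal_lm` and the `pad_token_id` value are hypothetical. The real collator follows the tokenizer's own padding settings.

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def collate_causal_lm(examples, pad_token_id):
    """Pad token-id lists to a common length and derive labels from the inputs."""
    max_len = max(len(ids) for ids in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; padded positions are masked out of the loss
        batch["labels"].append(ids + [IGNORE_INDEX] * n_pad)
    return batch

# Two made-up "sentences" of different lengths
batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=3)
for name, rows in batch.items():
    print(name, rows)
```

Positions labeled -100 are ignored by the loss, so padding does not contribute to training. With a batch size of 1 no padding is needed, and the labels are simply a copy of the input IDs.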


The dataset is now loaded and tokenized: each example has gained `input_ids` and `attention_mask` columns. The next step is to apply LoRA to the pre-trained language model (bigscience/bloomz-560m).


## Configure LoRA using LoraConfig and apply LoRA to the pre-trained model using get_peft_model

***Because this notebook was run on a CPU rather than a GPU, some code was modified; the changes are marked with comments (#).***

In [7]:
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=1,
    lora_alpha=1, # scaling factor: the low-rank update is multiplied by lora_alpha / r (1 here)
    target_modules=["query_key_value"],
    lora_dropout=0.1,
    bias="none", # this specifies if the bias parameter should be trained.
    task_type="CAUSAL_LM"
  )

peft_model = get_peft_model(foundation_model, lora_config) # Add the adapter layers to the foundation model to be trained
peft_model.print_trainable_parameters()  # prints the summary itself and returns None

trainable params: 98,304 || all params: 559,312,896 || trainable%: 0.0176
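
The trainable-parameter count reported above can be reproduced by hand. The sketch below assumes that bloomz-560m has a hidden size of 1024 and 24 transformer layers, and that each layer's `query_key_value` projection maps 1024 inputs to 3 × 1024 outputs. These figures come from the published model configuration rather than from this notebook, so treat them as assumptions.

```python
# Assumed architecture figures for bigscience/bloomz-560m (not stated in this notebook)
hidden_size = 1024
num_layers = 24
r = 1  # LoRA rank used in LoraConfig

d_in = hidden_size
d_out = 3 * hidden_size  # query, key and value projections fused in one layer

# For each adapted layer, A has shape (r x d_in) and B has shape (d_out x r)
per_layer = r * d_in + d_out * r
trainable = per_layer * num_layers

all_params = 559_312_896  # "all params" as reported by print_trainable_parameters
pct = 100 * trainable / all_params
print(f"{trainable:,} trainable ({pct:.4f}%)")
```

Under these assumptions the arithmetic gives 4,096 adapter weights per layer and 98,304 in total, which matches the reported figure.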


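The LoRA adapters are now attached, and only about 0.0176% of the model's parameters are trainable. The next step is to configure training with TrainingArguments.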

## Set up training arguments using TrainingArguments

***Because training ran on a CPU rather than a GPU, the code was modified; the changes are marked with comments (#).***

In [8]:
import transformers
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
import os

output_directory = os.path.join("../cache/working", "peft_lab_outputs")
training_args = TrainingArguments(
    report_to="none",
    output_dir=output_directory,
    per_device_train_batch_size=1,  # small batch size for CPU training
    # auto_find_batch_size=True,  # disabled: the batch size is fixed above
    # learning_rate=3e-2,  # disabled: the default learning rate is used instead
    num_train_epochs=1,  # single pass over the training data
    logging_steps=100,  # log the training loss every 100 steps
    save_strategy="epoch",  # save a checkpoint at the end of each epoch
    # evaluation_strategy="no",  # no evaluation, to speed up training
    dataloader_num_workers=2,  # moderate parallelism for data loading on CPU
    fp16=False,  # fp16 not supported on CPU
    # use_cpu=True
)

With the training arguments defined, the model can be trained using Trainer.

## Initialize and train the model using Trainer

In [9]:
import time
from peft import PeftModel
from transformers import TrainingArguments, Trainer
import os

time_now = int(time.time())
output_directory = os.path.join("../cache/working", "peft_lab_outputs")
peft_model_path = os.path.join(output_directory, f"peft_model_{time_now}")

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=data,
    data_collator=data_collator,
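    # label_names=["labels"] is a TrainingArguments option, not a Trainer argument; set it there to avoid the label_names warning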
)
trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


| Step | Training Loss |
|------|---------------|
| 100  | 3.6331        |
| 200  | 3.4448        |


TrainOutput(global_step=251, training_loss=3.521116705054781, metrics={'train_runtime': 1163.7712, 'train_samples_per_second': 0.216, 'train_steps_per_second': 0.216, 'total_flos': 15486507048960.0, 'train_loss': 3.521116705054781, 'epoch': 1.0})
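
The figures in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, so that each optimizer step consumes exactly one batch; the variable names are illustrative.

```python
global_step = 251          # from TrainOutput
train_runtime = 1163.7712  # seconds, from TrainOutput
batch_size = 1             # per_device_train_batch_size
epochs = 1                 # num_train_epochs

# One step per batch: total samples processed during training
samples_seen = global_step * batch_size
samples_per_epoch = samples_seen // epochs

# Throughput and wall-clock time derived from the reported runtime
samples_per_second = round(samples_seen / train_runtime, 3)
minutes = round(train_runtime / 60, 1)
print(samples_per_epoch, samples_per_second, minutes)
```

Under these assumptions, the 10% sample holds 251 quotes and the run took a little over 19 minutes on CPU, consistent with the reported `train_samples_per_second` of 0.216.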

## Save the fine-tuned LoRA model

In [10]:
# Save the PEFT model
trainer.model.save_pretrained(peft_model_path)

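The fine-tuned LoRA model is saved. Calling `save_pretrained` on a PEFT model writes only the adapter weights and configuration, not the full base model.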

## Load the saved LoRA model for inference using PeftModel.from_pretrained


The fine-tuned LoRA model is loaded for inference (a warning message appears because the model was already loaded).

In [11]:
# Load the saved PEFT adapter on top of the foundation model
peft_model = PeftModel.from_pretrained(
    foundation_model,
    peft_model_path,
    is_trainable=False,  # load in inference mode to avoid further training
)



## Generate text using the fine-tuned model and the tokenizer

**The model is now fine-tuned and can generate refined text, using the tokenizer.**

In [12]:
# Generate output tokens

inputs = tokenizer("Two things are infinite:", return_tensors="pt")
outputs = peft_model.generate(
    inputs=inputs["input_ids"],
    max_new_tokens=50,
    num_beams=2,
    early_stopping=True
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

['Two things are infinite: time and space']
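
The `num_beams=2` argument asks `generate` to use beam search, which keeps the two highest-scoring partial sequences at each step instead of committing to the single most likely next token. The toy sketch below illustrates the idea with a hand-written next-token table. The table, its probabilities, and the `beam_search` function are hypothetical and unrelated to the model's real distribution, and refinements used by the library (such as length penalties and the `early_stopping` rule) are left out.

```python
import math

# Hypothetical next-token table: last token -> {next token: probability}
NEXT = {
    "<s>": {"the": 0.55, "time": 0.45},
    "the": {"universe": 0.6, "stars": 0.4},
    "time": {"and": 0.95, "</s>": 0.05},
    "and": {"space": 1.0},
    "universe": {"</s>": 1.0},
    "stars": {"</s>": 1.0},
    "space": {"</s>": 1.0},
}

def beam_search(num_beams, max_new_tokens):
    """Return the highest-scoring token sequence found with num_beams beams."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens so far)
    finished = []
    for _ in range(max_new_tokens):
        # Extend every live beam by every possible next token
        candidates = []
        for score, tokens in beams:
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [tok]))
        # Keep only the num_beams best candidates
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:num_beams]:
            if cand[1][-1] == "</s>":
                finished.append(cand)  # sequence is complete
            else:
                beams.append(cand)     # sequence continues
        if not beams:
            break
    best = max(finished + beams, key=lambda c: c[0])
    return best[1]

print(" ".join(beam_search(num_beams=1, max_new_tokens=5)))
print(" ".join(beam_search(num_beams=2, max_new_tokens=5)))
```

In this toy table, greedy decoding (`num_beams=1`) commits to "the" and ends with a less probable sequence, while two beams recover the more probable "time and space".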
