# How to Fine-Tune LLMs with LoRA


Parameter-Efficient Fine-Tuning (PEFT) methods, like LoRA, address the challenges of fine-tuning large language models (LLMs) by training only a small number of parameters while the rest of the model stays frozen. This significantly reduces computational and storage costs, making LLM fine-tuning more accessible.

PEFT techniques allow developers to adapt pre-trained models to specific tasks without retraining the entire model, leading to faster development cycles and reduced resource consumption.




In [1]:
!pip install peft==0.4.0 datasets accelerate -q


In [14]:
import os  # File and directory management (e.g., saving/loading models)
import torch  # Deep learning framework used for model training and inference
import time  # Used for measuring execution time (benchmarking)

# Import Hugging Face Transformers for handling pre-trained language models
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import Trainer, TrainingArguments  # Used later to configure and run training

# Import dataset loading utility from Hugging Face
from datasets import load_dataset

# Import PEFT (Parameter-Efficient Fine-Tuning) modules for LoRA-based training
from peft import LoraConfig  # Defines LoRA configurations (rank, dropout, etc.)
from peft import get_peft_model  # Applies LoRA to a pre-trained model
from peft import PeftModel  # Loads a fine-tuned LoRA model for inference

#### Creating a Cache Directory

In [15]:
# Ensure the "cache" directory exists to store temporary files or model checkpoints
if not os.path.exists("cache"):
    os.makedirs("cache")  # Create the directory if it doesn't exist
    print(" 'cache' directory created!")
else:
    print(" 'cache' directory already exists.")


 'cache' directory already exists.


#### Loading the Pre-trained Model and Tokenizer
This cell initializes the tokenizer and pre-trained language model, which will later be fine-tuned using LoRA.


In [16]:
# Define model name (BigScience's BLOOMZ-560M)
model_name = "bigscience/bloomz-560m"

# Load pre-trained tokenizer for BLOOMZ
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load pre-trained causal language model (optimized for hardware)
foundation_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,  # Use memory-efficient bfloat16 on GPU, float32 on CPU
    device_map="auto"  # Automatically assign model to GPU or CPU
)

# Confirm successful loading
print(" Model and tokenizer loaded successfully!")

 Model and tokenizer loaded successfully!


#### Loading and Preprocessing the Dataset
This cell loads a dataset of English quotes, preprocesses it, and prepares it for fine-tuning.


In [17]:
from datasets import load_dataset

# Load the training split once, shuffle it, and keep 10% for a quick training run
full_dataset = load_dataset("Abirate/english_quotes", split="train")
dataset = full_dataset.shuffle(seed=42).select(range(int(0.1 * len(full_dataset))))  # Select 10% of data

# Tokenize the dataset (convert quotes to token IDs)
data = dataset.map(lambda samples: tokenizer(samples["quote"],
                                             padding="max_length",  # Pad all sequences to the same length
                                             truncation=True,  # Truncate if longer than max_length
                                             max_length=128  # Define max token length
                                             ),
                   batched=True)  # Apply tokenization in batches for efficiency

# Select a small sample (10 samples) for inspection
train_sample = data.select(range(10))

# Print confirmation message
print(f"Dataset loaded and tokenized! Number of samples: {len(data)}")


Dataset loaded and tokenized! Number of samples: 250


In [18]:
print(data)

Dataset({
    features: ['quote', 'author', 'tags', 'input_ids', 'attention_mask'],
    num_rows: 250
})
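
To make the `input_ids` and `attention_mask` columns above more concrete, here is a minimal pure-Python sketch of what `padding="max_length"` combined with `truncation=True` does to a single sequence. The helper `pad_or_truncate`, the pad ID `0`, and the toy token IDs are hypothetical and chosen only for illustration; the real tokenizer uses its own pad token and its own padding side.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int,
                    pad_left: bool = False) -> Dict[str, List[int]]:
    """Illustrative stand-in for padding="max_length" with truncation=True."""
    kept = token_ids[:max_length]        # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)       # number of pad tokens still needed
    padding = [pad_id] * n_pad
    mask_real = [1] * len(kept)          # 1 marks a real token
    mask_pad = [0] * n_pad               # 0 marks a padding position
    if pad_left:
        return {"input_ids": padding + kept, "attention_mask": mask_pad + mask_real}
    return {"input_ids": kept + padding, "attention_mask": mask_real + mask_pad}


# A short sequence gets padded, a long one gets truncated (toy IDs, hypothetical pad_id=0)
short = pad_or_truncate([11, 12, 13], max_length=5, pad_id=0)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)
print(short)
print(long)
```

Every row ends up with exactly `max_length` entries, which is why all 250 examples in the dataset can be stacked into fixed-size batches.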


#### Configuring LoRA for Efficient Fine-Tuning
Now, we define the LoRA (Low-Rank Adaptation) configuration. LoRA enables efficient fine-tuning by freezing the pre-trained weights and training small low-rank matrices added to selected layers, instead of updating the entire network. This significantly reduces computational costs while maintaining strong performance.

In [20]:
# Define LoRA configuration for parameter-efficient fine-tuning
lora_config = LoraConfig(
    r=8,  # LoRA rank: size of the low-rank matrices (smaller rank = fewer trainable parameters)
    lora_alpha=32,  # Scaling factor: Adjusts the impact of LoRA weight updates
    target_modules=["query_key_value"],  # Apply LoRA only to attention layers (query-key-value projection)
    lora_dropout=0.1,  # Regularization: Dropout to prevent overfitting
    bias="none",  # No additional bias training
    task_type="CAUSAL_LM"  # Fine-tuning for autoregressive text generation
)

print(" LoRA configuration set successfully!")

 LoRA configuration set successfully!
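
To illustrate what `r` and `lora_alpha` control, here is a minimal pure-Python sketch of a LoRA-style forward pass for one linear layer: the frozen weight `W` is left untouched, and a low-rank product `B @ A` scaled by `alpha / r` is added on top. The matrices and the helper names (`matvec`, `lora_forward`) are hypothetical toy values, dropout is omitted, and this is a sketch of the idea rather than PEFT's actual implementation.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: List[float],
                 r: int, alpha: float) -> List[float]:
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)                  # frozen path
    low_rank = matvec(B, matvec(A, x))   # trainable low-rank path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, low_rank)]


# Toy layer: 3 inputs, 2 outputs, rank 1 (the notebook uses r=8, lora_alpha=32, so the same scale of 4)
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # frozen weights, shape (2, 3)
A = [[1.0, 1.0, 1.0]]                    # shape (r, 3)
B_zero = [[0.0], [0.0]]                  # shape (2, r); zero means no change to the base model
B_trained = [[0.5], [-0.25]]             # hypothetical values after some training
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_zero, x, r=1, alpha=4.0))
print(lora_forward(W, A, B_trained, x, r=1, alpha=4.0))
```

With `B` at zero the layer behaves exactly like the frozen base layer, so the adapted model starts from the pre-trained behavior and only drifts away as `A` and `B` are updated.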


#### Applying LoRA to the Pre-trained Model

Next, we integrate the LoRA (Low-Rank Adaptation) layers into the pre-trained model to enable efficient fine-tuning. Instead of updating all model parameters, we introduce trainable LoRA layers while keeping the majority of the original model frozen.

In [21]:
# Apply LoRA to the pre-trained model
lora_model = get_peft_model(foundation_model, lora_config)

# Compute the number of trainable and frozen parameters
trainable_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
frozen_params = sum(p.numel() for p in lora_model.parameters() if not p.requires_grad)

# Display the parameter details
print(f"  LoRA model initialized successfully!")
print(f" - Trainable parameters: {trainable_params:,}")
print(f" - Frozen parameters: {frozen_params:,}")
print(f" - Percentage of trainable parameters: {100 * trainable_params / (trainable_params + frozen_params):.4f}%")


  LoRA model initialized successfully!
 - Trainable parameters: 786,432
 - Frozen parameters: 559,214,592
 - Percentage of trainable parameters: 0.1404%
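
The trainable-parameter count above can be reproduced by hand. The sketch below assumes that bloomz-560m has a hidden size of 1024 and 24 transformer layers, and that each `query_key_value` projection maps 1024 inputs to 3 × 1024 outputs; these architecture details are not stated in this notebook, but they are consistent with the printed numbers. The helper `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int, n_layers: int) -> int:
    """Parameters added by LoRA: an (r x d_in) matrix A and a (d_out x r) matrix B per layer."""
    per_layer = r * d_in + d_out * r
    return per_layer * n_layers


hidden_size = 1024   # assumption about bloomz-560m
n_layers = 24        # assumption about bloomz-560m

trainable = lora_param_count(d_in=hidden_size, d_out=3 * hidden_size, r=8, n_layers=n_layers)
frozen = 559_214_592  # taken from the output above

share = 100 * trainable / (trainable + frozen)
print(f"Trainable parameters: {trainable:,}")
print(f"Percentage of trainable parameters: {share:.4f}%")
```

Because the count grows linearly with `r`, doubling the rank to 16 would double the number of trainable parameters under the same assumptions.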


#### Configuring Training Arguments
We define the training parameters using the TrainingArguments class from Hugging Face's transformers library. These arguments control how the fine-tuning process is executed.



In [22]:
# Define outputs directory
output_directory = os.path.join("./cache", "peft_lab_outputs")

# Define training arguments
training_args = TrainingArguments(
    report_to="none",                # Disable reporting to external services like WandB
    output_dir=output_directory,     # Directory to save model checkpoints
    auto_find_batch_size=True,       # Automatically find batch size
    # evaluation_strategy="epoch",     # Evaluate model at the end of each epoch
    save_strategy="epoch",           # Save model at the end of each epoch
    learning_rate=3e-4,              # LoRA often requires a higher learning rate
    # per_device_train_batch_size=8,   # Number of samples per batch (adjust based on GPU memory)
    # per_device_eval_batch_size=8,    # Batch size for evaluation
    # save_total_limit=2,              # Keep only the last 2 checkpoints
    #  weight_decay=0.01,               # Regularization to prevent overfitting
    # logging_dir="./logs",            # Where logs are stored
    # logging_steps=10,                # Log metrics every 10 steps
    num_train_epochs=5              # Number of epochs (adjust based on dataset size)

)

print(" Training arguments set up successfully!")


 Training arguments set up successfully!


#### Initializing and Running the Trainer
We initialize the Trainer using Hugging Face's Trainer class. This step trains the LoRA-enhanced model using the preprocessed dataset.

In [23]:
# Initialize the Trainer
trainer = Trainer(
    model=lora_model,  # LoRA-enhanced model to fine-tune
    args=training_args,  # Training configurations (learning rate, epochs, etc.)
    train_dataset=data,  # Tokenized training dataset
    # eval_dataset=data,   # Evaluation dataset (optional, same as train dataset)
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)  # Batches examples and builds labels from input_ids (mlm=False means causal LM)
)

# Start training
trainer.train()



TrainOutput(global_step=160, training_loss=3.273973846435547, metrics={'train_runtime': 13.9262, 'train_samples_per_second': 89.759, 'train_steps_per_second': 11.489, 'total_flos': 290975907840000.0, 'train_loss': 3.273973846435547, 'epoch': 5.0})
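
The numbers in `TrainOutput` can be sanity-checked with a little arithmetic. The sketch below assumes an effective batch size of 8 (the default per-device batch size, a single device, no gradient accumulation, and no reduction by `auto_find_batch_size`); that value is not printed in this notebook, but it is consistent with the reported 160 steps.

```python
import math

num_samples = 250        # from the dataset output earlier
num_epochs = 5           # from training_args
batch_size = 8           # assumption: default batch size on a single device
train_runtime = 13.9262  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_samples / batch_size)   # last partial batch still counts as a step
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_samples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps: {total_steps}")
print(f"Samples per second: {samples_per_second:.3f}")
print(f"Steps per second: {steps_per_second:.3f}")
```

If the reported `global_step` differs from this estimate in your own run, the most likely cause is a different effective batch size.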

#### Saving the Fine-Tuned LoRA Model
Now that training is complete, we save the fine-tuned LoRA model so that it can be reloaded for inference later.

In [24]:
# Define the directory to save the fine-tuned model
time_now = time.strftime("%Y-%m-%d_%H-%M-%S")  # Generate timestamp
peft_model_path = os.path.join(output_directory, f"peft_model_{time_now}")  # Create unique model path

# Save the fine-tuned LoRA model
trainer.model.save_pretrained(peft_model_path)

# Confirm successful save
print(f" LoRA fine-tuned model saved successfully at: {peft_model_path}")


 LoRA fine-tuned model saved successfully at: ./cache/peft_lab_outputs/peft_model_2025-03-19_07-05-00


#### Loading the Fine-Tuned LoRA Model for Inference
Now that we have saved the fine-tuned LoRA model, we need to load it back for inference (text generation).



In [25]:
# Load the base model (bloomz-560m)
base_model = AutoModelForCausalLM.from_pretrained(model_name)  # Load pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)  # Load tokenizer

# Load the fine-tuned LoRA model
peft_model = PeftModel.from_pretrained(base_model, peft_model_path)

# Confirmation message
print(" LoRA fine-tuned model loaded successfully for inference!")


 LoRA fine-tuned model loaded successfully for inference!


#### Generating Text with the Fine-Tuned LoRA Model

Finally, we will use the loaded fine-tuned LoRA model to generate text based on a given prompt.

In [27]:
# Display 5 random quotes from the dataset
import random

# Select 5 random samples
sample_quotes = data.select(random.sample(range(len(data)), 5))

# Print the quotes
for i, sample in enumerate(sample_quotes["quote"]):
    print(f"{i+1}. {sample}")


1. “Don't spend time beating on a wall, hoping to transform it into a door. ”
2. “The homemaker has the ultimate career. All other careers exist for one purpose only - and that is to support the ultimate career. ”
3. “One, remember to look up at the stars and not down at your feet. Two, never give up work. Work gives you meaning and purpose and life is empty without it. Three, if you are lucky enough to find love, remember it is there and don't throw it away.”
4. “If you want to keep a secret, you must also hide it from yourself.”
5. “I will not let anyone walk through my mind with their dirty feet.”


In [29]:
# Define an input prompt
input_text = "Don't spend time beating on a wall,"

# Tokenize input text (convert to numerical format)
inputs = tokenizer(input_text, return_tensors="pt")

# Generate text using the fine-tuned LoRA model
outputs = peft_model.generate(
    input_ids=inputs["input_ids"],  # Tokenized input prompt
    attention_mask=inputs["attention_mask"],  # Mask to focus on real input
    max_length=100,  # Cap the total length (prompt plus generated tokens)
    num_return_sequences=1,  # Generate 1 sequence
    do_sample=True,  # Enable randomness for diverse responses
    top_k=50,  # Consider top 50 tokens for sampling
    top_p=0.95  # Nucleus sampling: keep top 95% probability mass
)

# Decode generated tokens back into readable text
print("\n Generated Text:")
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))  # Remove special tokens like <eos>



 Generated Text:
["Don't spend time beating on a wall, talking to a man, or even listening to music on the cellar. Don’t waste time going through the woods by walking at a slow pace until you fall over."]
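
To show what `top_k` and `top_p` do during sampling, here is a minimal pure-Python sketch that filters a toy next-token distribution and then samples from what is left. The function `top_k_top_p_filter`, the candidate tokens, and their probabilities are hypothetical; the real `generate` implementation works on logits over the full vocabulary and may differ in edge cases.

```python
import random
from typing import Dict


def top_k_top_p_filter(probs: Dict[str, float], top_k: int, top_p: float) -> Dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p mass."""
    # Top-k: keep only the k most probable tokens and renormalize
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]

    # Top-p (nucleus): keep tokens until their cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they form a valid distribution
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


# Toy distribution over four candidate next tokens (hypothetical values)
next_token_probs = {"door": 0.5, "window": 0.3, "wall": 0.15, "roof": 0.05}
filtered = top_k_top_p_filter(next_token_probs, top_k=3, top_p=0.8)

rng = random.Random(0)  # fixed seed so the draw is repeatable
sampled = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(filtered)
print(sampled)
```

Lower values of `top_k` or `top_p` shrink the candidate set and make completions more predictable, while the notebook's `top_k=50` and `top_p=0.95` leave room for varied output.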


# Conclusion
In this exercise, we successfully applied LoRA fine-tuning to a pre-trained language model (bloomz-560m) using a dataset of English quotes. Our goal was to see how well the fine-tuned model could generate meaningful text completions.

Key Observations:
* LoRA fine-tuned the model at low computational cost: only 786,432 parameters (about 0.14% of the model) were trainable, and the 5-epoch run took about 14 seconds.
* The model learned general sentence patterns from the dataset.
* However, it struggled to accurately complete quotes, often generating text that was coherent but not contextually relevant.
* The limited training data (10% sample) likely reduced the model’s ability to specialize in quote completions.

Takeaways:

* LoRA is a lightweight yet powerful fine-tuning method.
* The choice of training data size and quality significantly impacts model performance.
* Further improvements could be made by refining the dataset, adjusting hyperparameters, and using a more robust base model.
* Overall, this exercise provided a hands-on understanding of how LoRA fine-tuning works in practice.








