# Fine-tuning Llama 3 with QLoRA on Alpaca Dataset

This notebook walks through fine-tuning a Llama 3 model with QLoRA (Quantized Low-Rank Adaptation) on the Alpaca dataset, using the `yahma/alpaca-cleaned` variant.

## Setup and Environment Preparation

### 1. Check GPU Availability

First, let's make sure we have access to a GPU:

In [None]:
!nvidia-smi

Sat May 17 20:47:23 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla P100-PCIE-16GB           Off |   00000000:00:04.0 Off |                    0 |
| N/A   36C    P0             26W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

### 2. Install Required Libraries

In [None]:
!pip install -q transformers peft bitsandbytes accelerate trl datasets scipy tensorboard


In [None]:
!pip install -q huggingface_hub

### 3. Configure Hugging Face Access

In [None]:
from huggingface_hub import login
import os
login()


## Data Preparation and Exploration

### 1. Load and Explore the Dataset

In [None]:
import random
import numpy as np
import torch
import os

def seed_everything(seed):
    """Set seed for all random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

SEED = 123
seed_everything(SEED)
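
As a quick sanity check on what seeding provides, the sketch below uses only Python's standard `random` module (the notebook's `seed_everything` additionally seeds NumPy and PyTorch). The helper name `draw` is hypothetical and exists only for this illustration.

```python
import random

def draw(seed, n=3):
    """Seed the generator, then return n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 999) for _ in range(n)]

first = draw(123)
second = draw(123)

# Re-seeding with the same value replays the same sequence
print(first == second)
```

Seeding does not by itself guarantee bit-identical GPU training runs, since some CUDA kernels are non-deterministic, but it keeps shuffles and splits repeatable.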


In [None]:
# Load the dataset
from datasets import load_dataset

dataset = load_dataset("yahma/alpaca-cleaned", split="train")
print(f"Dataset size: {len(dataset)}")
print("Sample entry:", dataset[6])

Dataset size: 51760
Sample entry: {'output': 'The fraction 4/16 is equivalent to 1/4 because both fractions represent the same value. A fraction can be simplified by dividing both the numerator and the denominator by a common factor. In this case, 4 is a common factor of both the numerator and the denominator of 4/16. When we divide both by 4, we get 4/4 = 1 and 16/4 = 4, so the simplified fraction is 1/4. Alternatively, we can think of this in terms of multiplication. For example, if we multiply the numerator and denominator of the fraction 1/4 by 4, we get (1x4)/(4x4), or 4/16. Since both fractions can be derived from the other through multiplication or division by the same number, they represent the same value and are equivalent.', 'input': '4/16', 'instruction': 'Explain why the following fraction is equivalent to 1/4'}


### 2. Create a Smaller Subset for Testing

In [None]:
subset_ds = dataset.shuffle(seed=SEED).select(range(int(len(dataset) * 0.01)))
print(f"Subset size: {len(subset_ds)}")

Subset size: 517


### 3. Format the Data for Instruction Fine-tuning

In [None]:
# Format the data for instruction fine-tuning
def format_instruction(entry):
    """Format a single entry into an instruction format."""
    instruction = entry.get('instruction', '')
    input_text = entry.get('input', '')

    prompt = 'Below is an instruction that describes a task'
    if input_text:
        prompt += ', paired with an input that provides further context'

    prompt += ".\n\n"
    prompt += "Write a response that appropriately completes the request.\n\n"
    prompt += f"### Instruction:\n{instruction}\n\n"

    if input_text:
        prompt += f"### Input:\n{input_text}\n\n"

    prompt += "### Response:\n"
    return prompt

def format_entry(example):
    """Format an example with prompt and completion."""
    return {
        "prompt": format_instruction(example),
        "completion": example["output"]
    }
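
To make the two prompt variants concrete, here is a minimal standalone sketch of the same template. `build_prompt` is a hypothetical restatement of `format_instruction` written for this illustration, so that the block runs without the notebook state.

```python
def build_prompt(instruction, input_text=""):
    """Standalone restatement of the Alpaca-style template used above."""
    header = "Below is an instruction that describes a task"
    if input_text:
        header += ", paired with an input that provides further context"
    parts = [
        header + ".\n\n",
        "Write a response that appropriately completes the request.\n\n",
        f"### Instruction:\n{instruction}\n\n",
    ]
    if input_text:
        # The Input section is only emitted when the entry carries extra context
        parts.append(f"### Input:\n{input_text}\n\n")
    parts.append("### Response:\n")
    return "".join(parts)

with_input = build_prompt("Explain why the following fraction is equivalent to 1/4", "4/16")
without_input = build_prompt("Write about artificial intelligence.")
print(with_input)
```

The prompt always ends with `### Response:\n`, which is where the completion is appended during training and where generation starts at inference time.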

In [None]:
# Format the dataset
formatted_dataset = subset_ds.map(lambda x: {"formatted": format_entry(x)})

### 4. Split into Train and Validation Sets

In [None]:
# Split into train and validation sets
train_val_dataset = formatted_dataset.train_test_split(test_size=0.01, seed=SEED)
train_dataset = train_val_dataset["train"]
val_dataset = train_val_dataset["test"]

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")

Training samples: 511
Validation samples: 6
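
The subset and split sizes above follow from simple arithmetic. The sketch below reproduces them, assuming the fractional `test_size` is rounded up to a whole number of samples (an assumption that happens to match the counts printed above).

```python
import math

total = 51760                          # size of yahma/alpaca-cleaned reported above
subset = int(total * 0.01)             # 1% subset, truncated like int(len(dataset) * 0.01)
n_val = math.ceil(subset * 0.01)       # assumed: fractional test_size is rounded up
n_train = subset - n_val

print(subset, n_train, n_val)          # 517 511 6
```

With only 6 validation samples the evaluation loss is a rough indicator at best; a larger `test_size` would give a steadier estimate.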


In [None]:
# Display a sample after formatting
sample_formatted = train_dataset[0]["formatted"]
print("\nSample formatted input:")
print(sample_formatted["prompt"])
print("\nSample completion:")
print(sample_formatted["completion"])


Sample formatted input:
Below is an instruction that describes a task.

Write a response that appropriately completes the request.

### Instruction:
Edit the following sentence to make it sound more formal: “we spoke on the phone”

### Response:


Sample completion:
"We had a conversation via telephone."


## Model Preparation

### 1. Load Tokenizer and Define Tokenization Function

In [None]:
# Load tokenizer
from transformers import AutoTokenizer

# Choose your model
model_id = "meta-llama/Meta-Llama-3-8B"  # You can change to "meta-llama/Llama-2-7b-hf" if preferred

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token  # The Llama 3 base tokenizer defines no pad token, so reuse EOS


In [None]:
# Tokenize function
def tokenize_function(example):
    """Tokenize prompt + completion and build labels that skip prompt and padding."""
    formatted = example["formatted"]
    # Append EOS so the model sees an explicit end-of-response token
    full_text = formatted["prompt"] + formatted["completion"] + tokenizer.eos_token

    tokenized = tokenizer(
        full_text,
        truncation=True,
        max_length=2048,
        padding="max_length",
        return_tensors="pt"
    )

    input_ids = tokenized["input_ids"][0]
    attention_mask = tokenized["attention_mask"][0]

    # Labels start as a copy of input_ids
    labels = input_ids.clone()

    # Number of prompt tokens; BOS is prepended here as in full_text,
    # so the count lines up with the start of the sequence (assumes right padding)
    prompt_ids = tokenizer(formatted["prompt"], return_tensors="pt")["input_ids"][0]
    prompt_length = len(prompt_ids)

    # -100 is ignored by the loss: mask the prompt tokens...
    labels[:prompt_length] = -100
    # ...and the padding positions (pad == EOS, so they would otherwise count as targets)
    labels[attention_mask == 0] = -100

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }

### 2. Tokenize Datasets

In [None]:
# Tokenize datasets
tokenized_train_dataset = train_dataset.map(
    tokenize_function,
    remove_columns=train_dataset.column_names
)

tokenized_val_dataset = val_dataset.map(
    tokenize_function,
    remove_columns=val_dataset.column_names
)

print(f"Tokenized train dataset: {len(tokenized_train_dataset)}")
print(f"Tokenized validation dataset: {len(tokenized_val_dataset)}")


Tokenized train dataset: 511
Tokenized validation dataset: 6
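
The label-masking idea behind the tokenization step can be sketched with plain Python lists. This is an illustration under assumptions, not the notebook's exact code: the token ids are made up, padding is assumed to be on the right, and `build_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def build_labels(input_ids, attention_mask, prompt_length):
    """Copy input_ids, masking prompt positions and padding positions."""
    labels = []
    for pos, (token_id, keep) in enumerate(zip(input_ids, attention_mask)):
        if pos < prompt_length or keep == 0:
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token_id)
    return labels

# Toy sequence: 3 prompt tokens, 2 completion tokens, 2 right-padding tokens
input_ids = [1, 15, 27, 42, 2, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 0, 0]

labels = build_labels(input_ids, attention_mask, prompt_length=3)
print(labels)  # [-100, -100, -100, 42, 2, -100, -100]
```

Only the completion tokens contribute to the loss. If padding positions keep their token ids as labels, the loss is averaged over many trivially predictable pad tokens and can look much lower than the model's real performance on the responses.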


## Model and QLoRA Configuration

### 1. Define QLoRA Parameters and BitsAndBytes Configuration

In [None]:
# Define QLoRA parameters
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# QLoRA parameters
lora_r = 8
lora_alpha = 16
lora_dropout = 0.1

# BitsAndBytes parameters
use_4bit = True
bnb_4bit_quant_type = "nf4"

# BitsAndBytes configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_use_double_quant=True,
)
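
A rough back-of-the-envelope estimate shows why 4-bit loading matters on a 16 GB card. The parameter count below is an approximation for an 8B model, and the calculation deliberately ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so treat the result as a lower bound.

```python
N_PARAMS = 8.03e9  # approximate parameter count of an 8B model (assumption)

def weight_gib(n_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

fp16_gib = weight_gib(N_PARAMS, 16)
nf4_gib = weight_gib(N_PARAMS, 4)

print(f"fp16 weights:  ~{fp16_gib:.1f} GiB")
print(f"4-bit weights: ~{nf4_gib:.1f} GiB")
```

In half precision the weights alone would nearly fill the P100's 16 GiB, while the 4-bit version leaves room for activations, LoRA parameters, and the optimizer.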



### 2. Load Base Model with Quantization

In [None]:
# Load base model with quantization
from transformers import AutoModelForCausalLM

device_map = {"": 0}

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map=device_map,
    token=True  # Set to True if the model requires authentication
)

# Prepare model for training
model.config.use_cache = False  # Disable KV cache for training
model.config.pretraining_tp = 1


### 3. Configure LoRA

In [None]:
# LoRA configuration
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        # "k_proj",
        "v_proj",
        # "o_proj",
        # "gate_proj",
        # "up_proj",
        # "down_proj"
    ]
)
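
To get a feel for how small the adapter is, the sketch below counts the LoRA parameters added by this configuration. The layer dimensions are assumptions taken from the published Llama 3 8B architecture (hidden size 4096, 32 layers, 8 key/value heads of dimension 128); adjust them for other models.

```python
HIDDEN = 4096        # hidden size (assumed, Llama 3 8B)
KV_DIM = 8 * 128     # key/value projection width with grouped-query attention (assumed)
LAYERS = 32          # number of decoder layers (assumed)
R, ALPHA = 8, 16     # lora_r and lora_alpha from the configuration above

def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)  # q_proj + v_proj
total = per_layer * LAYERS

print(f"trainable LoRA parameters: {total:,}")
print(f"approx. size in float32:   {total * 4 / 1e6:.1f} MB")
print(f"scaling factor alpha / r:  {ALPHA / R}")
```

Under these assumptions the adapter has about 3.4 million trainable parameters, which in float32 is in line with the roughly 13.6 MB `adapter_model.safetensors` file reported when the model is uploaded. Uncommenting more `target_modules` increases this count accordingly.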

## Training Setup and Execution

### 1. Configure Training Arguments

In [None]:
# Configure training arguments
from transformers import TrainingArguments

output_dir = "./llama3-8b-finetuned"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size = 4
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    weight_decay=0.001,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=25,
    save_steps=500,
    save_total_limit=2,
    fp16=True,
    bf16=False,  # Set to True if using A100 or H100
    report_to="tensorboard",
    logging_dir="./logs",
    seed=SEED,
    data_seed=SEED,
    push_to_hub=False,  # Set to True if you want to upload to HF Hub
)
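
The interplay of batch size, gradient accumulation, warmup, and the cosine schedule can be sketched in a few lines. This is a simplified re-implementation for illustration, not the Trainer's own code: rounding conventions can differ between library versions, and the floor division below is chosen because it reproduces the 127 steps reported in the training output.

```python
import math

n_samples, per_device_bs, grad_accum = 511, 1, 4
epochs, base_lr, warmup_ratio = 1, 2e-4, 0.03

effective_bs = per_device_bs * grad_accum
total_steps = (n_samples // per_device_bs // grad_accum) * epochs  # optimizer steps
warmup_steps = math.ceil(total_steps * warmup_ratio)

def lr_at(step):
    """Linear warmup followed by a half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"effective batch size: {effective_bs}")
print(f"optimizer steps: {total_steps}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:3d}: lr = {lr_at(step):.2e}")
```

With so few steps, the warmup covers only the first handful of updates, and `logging_steps=25` yields just five logged loss values over the whole run.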

### 2. Initialize SFT Trainer and Start Training

In [None]:
# Initialize the SFT Trainer
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    peft_config=peft_config,
    args=training_arguments,
)

# Start training
print("Starting training...")
trainer.train()


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Starting training...


| Step | Training Loss |
|------|---------------|
| 25   | 4.8868        |
| 50   | 0.2157        |
| 75   | 0.1903        |
| 100  | 0.1864        |
| 125  | 0.1858        |


TrainOutput(global_step=127, training_loss=1.1180327210839338, metrics={'train_runtime': 3978.9847, 'train_samples_per_second': 0.128, 'train_steps_per_second': 0.032, 'total_flos': 2.343464713637069e+16, 'train_loss': 1.1180327210839338})
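
The summary numbers in `TrainOutput` can be cross-checked against the logged values. The sketch below assumes that each logged loss is the mean over the preceding 25 steps and that the reported `training_loss` is an average over all optimizer steps; both are assumptions, though they are consistent with the figures above.

```python
train_runtime = 3978.9847          # seconds, from TrainOutput
global_step = 127
n_samples = 511                    # training samples, one epoch
logged_losses = [4.8868, 0.2157, 0.1903, 0.1864, 0.1858]  # one value per 25 steps

samples_per_second = n_samples / train_runtime
steps_per_second = global_step / train_runtime
mean_logged_loss = sum(logged_losses) / len(logged_losses)

print(round(samples_per_second, 3), round(steps_per_second, 3))  # 0.128 0.032
print(f"{train_runtime / 60:.1f} minutes in total")
print(f"mean of logged losses: {mean_logged_loss:.3f}")
```

The mean of the logged losses lands close to the reported 1.118, which suggests that the headline `training_loss` is dominated by the first 25 high-loss steps rather than reflecting the final state of the model.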

## Model Evaluation and Saving

### 1. Evaluate the Model

In [None]:
# Evaluate the model on the validation set
print("Evaluating model...")
eval_results = trainer.evaluate()
print(f"Evaluation results: {eval_results}")

Evaluating model...


Evaluation results: {'eval_loss': 0.2131618708372116, 'eval_runtime': 16.0364, 'eval_samples_per_second': 0.374, 'eval_steps_per_second': 0.374}
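
An evaluation loss is often easier to interpret as perplexity, the exponential of the mean cross-entropy. The sketch below converts the value reported above; it is only meaningful if the loss was averaged over real response tokens, which is an assumption here.

```python
import math

eval_loss = 0.2131618708372116   # from the evaluation results above
perplexity = math.exp(eval_loss)

print(f"perplexity: {perplexity:.3f}")
```

A perplexity this close to 1 is unusually low for free-form text. If padding tokens were included in the loss, they would pull the average down, so the figure should be read with caution, especially with only 6 validation samples.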


### 2. Save the Fine-tuned Model

In [None]:
# Save the fine-tuned model
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Model saved to {output_dir}")

Model saved to ./llama3-8b-finetuned


## Push to Hugging Face Hub

In [None]:
# Set up your model card information
model_name = "llama3-8b-finetuned-alpaca"
repo_name = f"aashu-0/{model_name}"

In [None]:
# Push to hub
from huggingface_hub import HfApi

# Configure model card with training information
model_card = f"""
# {model_name}

This model is a fine-tuned version of [{model_id}](https://huggingface.co/{model_id}) on the Alpaca dataset.

## Training Parameters
- **Training Dataset**: Alpaca
- **Dataset Size**: {len(train_dataset)} samples
- **QLoRA Parameters**: r={lora_r}, alpha={lora_alpha}, dropout={lora_dropout}
- **Learning Rate**: {training_arguments.learning_rate}
- **Batch Size**: {training_arguments.per_device_train_batch_size * training_arguments.gradient_accumulation_steps}
- **Epochs**: {training_arguments.num_train_epochs}

## Evaluation Results
{eval_results}
"""

In [None]:
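# Save model card
with open(f"{output_dir}/README.md", "w") as f:
    f.write(model_card)

# Push to the repo named above (trainer.push_to_hub() targets a repo named after output_dir)
trainer.model.push_to_hub(repo_name)
tokenizer.push_to_hub(repo_name)
HfApi().upload_file(path_or_fileobj=f"{output_dir}/README.md", path_in_repo="README.md", repo_id=repo_name)

print(f"Model pushed to Hugging Face Hub: {repo_name}")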

Repo card metadata block was not found. Setting CardData to empty.


adapter_model.safetensors:   0%|          | 0.00/13.6M [00:00<?, ?B/s]

Model pushed to Hugging Face Hub: aashu-0/llama3-8b-finetuned-alpaca
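
The "Repo card metadata block was not found" message refers to the YAML front matter that Hub model cards can start with. A minimal sketch of prepending such a block is shown below; `with_front_matter` is a hypothetical helper, and the chosen fields and the `qlora` tag are illustrative choices rather than requirements.

```python
def with_front_matter(card_body, base_model, dataset):
    """Prepend a YAML metadata block to a Markdown model card."""
    front_matter = "\n".join([
        "---",
        f"base_model: {base_model}",
        "library_name: peft",
        "datasets:",
        f"- {dataset}",
        "tags:",
        "- qlora",      # illustrative tag
        "---",
    ])
    return front_matter + "\n" + card_body.lstrip("\n")

card = with_front_matter(
    "\n# llama3-8b-finetuned-alpaca\n",
    base_model="meta-llama/Meta-Llama-3-8B",
    dataset="yahma/alpaca-cleaned",
)
print(card)
```

With metadata like this in place, the Hub can link the adapter to its base model and dataset on the model page.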


## Inference Example

In [None]:
# Load the fine-tuned model for inference
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load base model
model_base = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map={"": 0},
    token=True
)

# Load LoRA configuration and model
peft_model_id = output_dir
config = PeftConfig.from_pretrained(peft_model_id)
model = PeftModel.from_pretrained(model_base, peft_model_id)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
tokenizer.pad_token = tokenizer.eos_token

# Create a text generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

Device set to use cuda:0
The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['AriaTextForCausalLM', 'BambaForCausalLM', 'BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'DeepseekV3ForCausalLM', 'DiffLlamaForCausalLM', 'ElectraForCausalLM', 'Emu3ForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FalconMambaForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'Gemma3ForConditionalGeneration', 'Gemma3ForCausalLM', 'GitForCausalLM', 'GlmForCausalLM', 'Glm4ForCausalLM', 'GotOcr2ForConditionalGeneration', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoFo

In [None]:
# Test with a few prompts
test_prompts = [
    "Write about artificial intelligence.",
    "Explain quantum computing to a 10-year-old."
]

for prompt in test_prompts:
    formatted_prompt = format_instruction({"instruction": prompt})
    print(f"\nPrompt: {prompt}")
    result = pipe(formatted_prompt)[0]['generated_text']
    response = result.split("### Response:")[1].strip()
    print(f"Response:\n{response}")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.



Prompt: Write about artificial intelligence.
Response:
Artificial Intelligence (AI) refers to the development of computer systems capable of performing tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and learning. AI has become increasingly prevalent in recent years, with applications ranging from self-driving cars to virtual assistants like Siri and Alexa. While there are concerns surrounding the potential negative impact of AI on society, it also offers many benefits, including improved efficiency, enhanced productivity, and increased personalization. Ultimately, Artificial Intelligence holds great promise for the future, but its proper implementation and regulation must be prioritized to ensure responsible use and ethical outcomes.

Prompt: Explain quantum computing to a 10-year-old.
Response:
Quantum Computing is like using magical powers from science fiction movies. It's super cool and amazing! Quantum Computers use
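
Splitting on `### Response:` and indexing `[1]` raises an `IndexError` if the marker is missing, for example when the prompt is truncated. A slightly more defensive variant is sketched below; `extract_response` is a hypothetical helper, and cutting at the next `### ` heading is a design choice for cases where the model keeps generating new sections.

```python
def extract_response(generated_text, marker="### Response:"):
    """Return the text after the first marker, trimmed at any following '### ' heading."""
    _, found, tail = generated_text.partition(marker)
    if not found:
        # Marker missing: fall back to the full text instead of raising
        return generated_text.strip()
    answer, _, _ = tail.partition("\n### ")
    return answer.strip()

sample = "### Instruction:\nSay hi\n\n### Response:\nHello!\n\n### Instruction:\nSay more"
print(extract_response(sample))          # Hello!
print(extract_response("no marker"))     # no marker
```

Since the base model was trained without a chat template, stray follow-up sections after the answer are plausible, and trimming them keeps the printed response focused on the first reply.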