# QLoRA Fine-tuning with LLaMA 3.2

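This notebook prepares the LLaMA 3.2 3B-Instruct model for QLoRA fine-tuning: it loads the model in 4-bit precision, runs a test generation, and attaches LoRA adapters. Dataset preparation and the training loop are left as next steps.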

## What this does
- Loads LLaMA 3.2 3B-Instruct model with 4-bit quantization
- Sets up QLoRA adapters for parameter-efficient fine-tuning
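- Uses about 2.2 GB of GPU memory after the adapters are attached (measured before any training step)
- Trains only about 24M LoRA parameters. The notebook reports this as 1.33% of 1.8B because packed 4-bit weights count two weights per stored element; measured against the model's roughly 3.2B actual parameters, the share is closer to 0.75%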

## What you need
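- GPU with CUDA support
- Kaggle account to download the original Meta-format checkpoint
- Hugging Face account with access to the gated `meta-llama/Llama-3.2-3B-Instruct` repository, which is the copy `transformers` actually loads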

In [19]:
! pip install --upgrade pip
! pip install -q jupyterlab notebook ipywidgets
! pip install -q huggingface_hub transformers datasets accelerate peft trl safetensors
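# Recent bitsandbytes releases ship Windows wheels. The older unofficial
# bitsandbytes-windows fork predates 4-bit (NF4) support, so it is not a usable fallback for QLoRA.
! pip install -q bitsandbytes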




## Setup and Dependencies

Install the required packages for QLoRA fine-tuning.
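
After installing, it can help to confirm which versions actually ended up in the environment, since 4-bit loading depends on reasonably recent `bitsandbytes`, `transformers`, and `peft` releases. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical, and the package list simply mirrors the install cell above.

```python
from importlib import metadata

# Distribution names as passed to pip in the install cell.
REQUIRED = ["transformers", "datasets", "accelerate", "peft", "trl", "bitsandbytes"]


def installed_versions(packages):
    """Return {package: version string, or None if not installed}.

    Reads installed distribution metadata, so nothing heavy is imported.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for package, version in installed_versions(REQUIRED).items():
    print(f"{package:15s} {version or 'NOT INSTALLED'}")
```

Any package shown as `NOT INSTALLED` should be fixed before moving on to the model loading cell.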

In [None]:
# Import token from config file
from config import HUGGINGFACE_TOKEN
from huggingface_hub import login

login(token=HUGGINGFACE_TOKEN)


## Authentication

Log in to Hugging Face and set up Kaggle credentials.
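
The cells in this notebook import `HUGGINGFACE_TOKEN`, `KAGGLE_USERNAME`, and `KAGGLE_API_KEY` from a local `config` module that is not shown. One possible way to write that module without hard-coding secrets is to read them from environment variables. This is only a sketch under that assumption: the helper names `read_secret` and `load_config` are hypothetical, and the environment variable names are chosen here to match the imported names.

```python
import os


def read_secret(name, env=os.environ):
    """Return a required secret from the environment, failing loudly if it is unset."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the notebook")
    return value


def load_config(env=os.environ):
    """Collect the three credentials the notebook expects from `config`."""
    names = ("HUGGINGFACE_TOKEN", "KAGGLE_USERNAME", "KAGGLE_API_KEY")
    return {name: read_secret(name, env) for name in names}


# In a real config.py the values could then be exposed as module-level names, e.g.:
#   HUGGINGFACE_TOKEN = read_secret("HUGGINGFACE_TOKEN")
#   KAGGLE_USERNAME = read_secret("KAGGLE_USERNAME")
#   KAGGLE_API_KEY = read_secret("KAGGLE_API_KEY")
```

Whatever form `config.py` takes, it should be excluded from version control so the tokens are not committed.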

In [None]:
# KAGGLE AUTHENTICATION & LLAMA 3.2 DOWNLOAD
import os
import json
import kagglehub
from config import KAGGLE_USERNAME, KAGGLE_API_KEY

print("Setting up Kaggle authentication...")

# Import credentials from config
kaggle_credentials = {
    "username": KAGGLE_USERNAME,
    "key": KAGGLE_API_KEY
}

# Create .kaggle directory if it doesn't exist
kaggle_dir = os.path.expanduser("~/.kaggle")
os.makedirs(kaggle_dir, exist_ok=True)

# Write credentials to kaggle.json
kaggle_json_path = os.path.join(kaggle_dir, "kaggle.json")
with open(kaggle_json_path, 'w') as f:
    json.dump(kaggle_credentials, f)

# Set proper permissions (important for security)
if os.name != 'nt':  # Not Windows
    os.chmod(kaggle_json_path, 0o600)

print(f"Kaggle credentials saved to: {kaggle_json_path}")

# Set environment variables for this session
os.environ['KAGGLE_USERNAME'] = kaggle_credentials['username']
os.environ['KAGGLE_KEY'] = kaggle_credentials['key']

print("Downloading LLaMA 3.2 3B-Instruct model from Kaggle...")
print("This will take several minutes (model is ~6GB)...")

try:
    # Download the LLaMA 3.2 model
    model_path = kagglehub.model_download("metaresearch/llama-3.2/pyTorch/3b-instruct")
    
    print("LLaMA 3.2 model downloaded successfully!")
    print(f"Model path: {model_path}")
    
    # List contents of the model directory (os is already imported at the top of the cell)
    if os.path.exists(model_path):
        files = os.listdir(model_path)
        print(f"Model files: {files}")
    
    # Store the path for later use
    llama_model_path = model_path
    
except Exception as e:
    print(f"Error downloading LLaMA model: {e}")
    print("Falling back to GPT-2 model")
    llama_model_path = None

print("\nReady to load LLaMA 3.2 model!")

🔐 Setting up Kaggle authentication...
✅ Kaggle credentials saved to: C:\Users\N I T R O/.kaggle\kaggle.json
🔽 Downloading LLaMA 3.2 3B-Instruct model from Kaggle...
⏳ This will take several minutes (model is ~6GB)...


Downloading 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Downloading from https://www.kaggle.com/api/v1/models/metaresearch/llama-3.2/pyTorch/3b-instruct/1/download/orig_params.json...


100%|██████████| 220/220 [00:00<00:00, 55.6kB/s]


Downloading from https://www.kaggle.com/api/v1/models/metaresearch/llama-3.2/pyTorch/3b-instruct/1/download/tokenizer.model...




Downloading from https://www.kaggle.com/api/v1/models/metaresearch/llama-3.2/pyTorch/3b-instruct/1/download/consolidated.00.pth...




Downloading from https://www.kaggle.com/api/v1/models/metaresearch/llama-3.2/pyTorch/3b-instruct/1/download/params.json...




100%|██████████| 220/220 [00:00<00:00, 55.0kB/s]
100%|██████████| 2.08M/2.08M [00:01<00:00, 1.12MB/s]


✅ LLaMA 3.2 model downloaded successfully!
📁 Model path: C:\Users\N I T R O\.cache\kagglehub\models\metaresearch\llama-3.2\pyTorch\3b-instruct\1
📂 Model files: ['consolidated.00.pth', 'orig_params.json', 'params.json', 'tokenizer.model']

🎯 Ready to load LLaMA 3.2 model!


## Model Download

Download the LLaMA 3.2 model from Kaggle.

In [15]:
# LOAD LLAMA 3.2 MODEL WITH TRANSFORMERS
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import os

# Use the downloaded LLaMA model path
if 'llama_model_path' in globals() and llama_model_path:
    print(f"Loading LLaMA 3.2 from: {llama_model_path}")
    
    # However, we need to use the HuggingFace model ID for transformers
    # The downloaded files are in Meta's format, not HuggingFace format
    print("Note: Downloaded model is in Meta format, using HuggingFace model ID instead")
    model_id = "meta-llama/Llama-3.2-3B-Instruct"
    
    # You'll need to accept the license at: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
    print("Make sure you have access to the HuggingFace model")
    
else:
    print("LLaMA model not downloaded, using GPT-2 as fallback")
    model_id = "gpt2"

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

# Configure 4-bit quantization for efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
) if torch.cuda.is_available() else None

try:
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    
    # Add pad token if needed
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    print("Tokenizer loaded!")
    
    print("Loading model with 4-bit quantization...")
    print("This may take a few minutes...")
    
    if torch.cuda.is_available() and bnb_config:
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="auto",
            quantization_config=bnb_config,
            torch_dtype=torch.float16,
            trust_remote_code=True
        )
    else:
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
            trust_remote_code=True
        )
        if torch.cuda.is_available():
            model = model.to("cuda")
    
    print("Model loaded successfully!")
    print(f"Model: {model_id}")
    print(f"Device: {device}")
    
    if torch.cuda.is_available():
        print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    
    # Test generation
    print("\nTesting LLaMA 3.2 generation...")
    test_prompt = "What is the future of artificial intelligence?"
    
    # Format for Llama instruct model
    if "Instruct" in model_id:
        formatted_prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{test_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    else:
        formatted_prompt = test_prompt
    
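    # The instruct prompt already starts with <|begin_of_text|>, so skip the
    # tokenizer's automatic special tokens to avoid a duplicated BOS token
    inputs = tokenizer.encode(formatted_prompt, return_tensors="pt", add_special_tokens=False)
    if torch.cuda.is_available():
        inputs = inputs.to("cuda")
    
    with torch.no_grad():
        outputs = model.generate(
            inputs,
            max_new_tokens=100,  # Limit generated tokens independently of prompt length
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )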
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Generated response:")
    print(f"{response}")
    print(f"\nLLaMA 3.2 is ready for QLoRA fine-tuning!")
    
except Exception as e:
    print(f"Error loading model: {e}")
    print("You may need to request access to the model first")
    if "gated" in str(e).lower() or "403" in str(e):
        print("Request access at: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct")

🎯 Loading LLaMA 3.2 from: C:\Users\N I T R O\.cache\kagglehub\models\metaresearch\llama-3.2\pyTorch\3b-instruct\1
💡 Note: Downloaded model is in Meta format, using HuggingFace model ID instead
🔐 Make sure you have access to the HuggingFace model
🖥️ Device: cuda
📚 Loading tokenizer...


tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

✅ Tokenizer loaded!
🤖 Loading model with 4-bit quantization...
⏳ This may take a few minutes...


config.json:   0%|          | 0.00/878 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

✅ Model loaded successfully!
📊 Model: meta-llama/Llama-3.2-3B-Instruct
🖥️ Device: cuda
💾 GPU Memory: 2.14 GB

🧪 Testing LLaMA 3.2 generation...
🎯 Generated response:
user

What is the future of artificial intelligence?assistant

The future of artificial intelligence (AI) is a topic of ongoing debate and speculation. While it's difficult to predict exactly what the future will hold, here are some potential trends and developments that may shape the future of AI:

**Short-term (2025-2035)**

1. **Increased adoption in industries**: AI will become more ubiquitous in various industries, such as healthcare, finance, education, and transportation.
2. **Advancements in natural language processing (NLP)**: NLP

🎉 LLaMA 3.2 is ready for QLoRA fine-tuning!
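
The test generation above builds the Llama 3 instruct prompt by hand for a single user turn. When preparing training examples, the same layout has to be produced for whole conversations. The sketch below generalizes the hand-written string to a list of messages. The function name `format_llama3_chat` is hypothetical, and the layout is inferred from the prompt string used in the cell. In practice `tokenizer.apply_chat_template` is the library-provided route, though its output may differ (for example, it may add a default system header).

```python
def format_llama3_chat(messages, add_generation_prompt=True):
    """Render [{"role": ..., "content": ...}, ...] in the Llama 3 instruct layout.

    Mirrors the hand-written prompt in the test cell: each turn is wrapped in
    header markers and terminated with <|eot_id|>.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = format_llama3_chat(
    [{"role": "user", "content": "What is the future of artificial intelligence?"}]
)
print(prompt)
```

For training data, the assistant's reply would be included as a final message and `add_generation_prompt` set to `False`, so each example ends with the completed assistant turn.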


## Model Loading

Load the LLaMA 3.2 model with 4-bit quantization to save memory.
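
To see roughly where the memory saving comes from, the weight footprint can be estimated by hand. The sketch below is back-of-the-envelope arithmetic, not a measurement. The architecture values are assumptions taken from the published Llama 3.2 3B configuration rather than from this notebook, and the estimate ignores quantization constants, buffers, and CUDA overhead. It assumes only the linear projection weights are stored in 4 bits, while the embedding (shared with the output head) and the norm weights stay in fp16.

```python
GIB = 1024 ** 3

# Assumed architecture values (published Llama 3.2 3B config, not read from the model here).
HIDDEN = 3072          # hidden size
INTERMEDIATE = 8192    # MLP intermediate size
NUM_LAYERS = 28        # decoder layers
KV_DIM = 1024          # 8 key/value heads x 128 head dim
VOCAB = 128256         # vocabulary size


def count_parameters():
    """Return (linear projection weights, embedding + norm weights)."""
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q, o and k, v projections
    mlp = 3 * HIDDEN * INTERMEDIATE                         # gate, up, down projections
    linear = NUM_LAYERS * (attention + mlp)
    embedding = VOCAB * HIDDEN                              # assumed tied with the output head
    norms = NUM_LAYERS * 2 * HIDDEN + HIDDEN                # two norms per layer + final norm
    return linear, embedding + norms


linear, other = count_parameters()
fp16_gib = 2 * (linear + other) / GIB            # 2 bytes per weight everywhere
nf4_gib = (0.5 * linear + 2 * other) / GIB       # 4 bits for linear weights, fp16 for the rest

print(f"Total parameters:      {linear + other:,}")
print(f"fp16 weights:          {fp16_gib:.2f} GiB")
print(f"4-bit linear weights:  {nf4_gib:.2f} GiB")
```

Under these assumptions the 4-bit estimate comes out slightly below the 2.14 GB the loading cell measured, which is consistent with the measurement also including quantization constants and other small buffers.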

In [16]:
# QLORA SETUP FOR FINE-TUNING
from peft import LoraConfig, get_peft_model, TaskType
import torch

print("Setting up QLoRA for fine-tuning...")

# QLoRA configuration
lora_config = LoraConfig(
    r=16,                               # Rank of the low-rank update matrices
    lora_alpha=32,                      # Scaling numerator: updates are scaled by lora_alpha / r (here 2.0)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention projections
        "gate_proj", "up_proj", "down_proj"       # MLP projections
    ],
    lora_dropout=0.1,                   # Dropout applied on the LoRA branch
    bias="none",                        # Keep all bias terms frozen
    task_type=TaskType.CAUSAL_LM        # Causal language modeling
)

print("LoRA Configuration:")
print(f"   - Rank (r): {lora_config.r}")
print(f"   - Alpha: {lora_config.lora_alpha}")
print(f"   - Target modules: {lora_config.target_modules}")
print(f"   - Dropout: {lora_config.lora_dropout}")

# Apply LoRA to the model
try:
    print("\nApplying LoRA adapters to the model...")
    
    # Enable gradient checkpointing for memory efficiency
    model.gradient_checkpointing_enable()
    
    # Apply LoRA
    model = get_peft_model(model, lora_config)
    
    # Print trainable parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    
    print("LoRA adapters applied successfully!")
    print(f"Trainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
    print(f"Total parameters: {total_params:,}")
    
    # Memory usage after LoRA
    if torch.cuda.is_available():
        print(f"GPU Memory after LoRA: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    
    print("\nModel is ready for fine-tuning!")
    print("Next steps:")
    print("   1. Prepare your training dataset")
    print("   2. Set up training arguments")
    print("   3. Start fine-tuning with Trainer")
    
except Exception as e:
    print(f"Error applying LoRA: {e}")
    print("Make sure PEFT is properly installed")

🔧 Setting up QLoRA for fine-tuning...
📝 LoRA Configuration:
   - Rank (r): 16
   - Alpha: 32
   - Target modules: {'gate_proj', 'v_proj', 'o_proj', 'down_proj', 'up_proj', 'q_proj', 'k_proj'}
   - Dropout: 0.1

🔗 Applying LoRA adapters to the model...
✅ LoRA adapters applied successfully!
📊 Trainable parameters: 24,313,856 (1.33%)
📊 Total parameters: 1,827,777,536
💾 GPU Memory after LoRA: 2.23 GB

🎯 Model is ready for fine-tuning!
💡 Next steps:
   1. Prepare your training dataset
   2. Set up training arguments
   3. Start fine-tuning with Trainer


## QLoRA Setup

Configure Low-Rank Adaptation for efficient fine-tuning.
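
The trainable parameter count printed by the setup cell can be reproduced by hand, which also explains the reported total. Each adapted projection with `in_features` inputs and `out_features` outputs gains two matrices of shapes `r x in_features` and `out_features x r`, so it adds `r * (in_features + out_features)` trainable weights. The sketch below assumes the projection shapes of the published Llama 3.2 3B configuration (they are not read from the model here) and assumes that 4-bit layers store two weights per element, so a plain `numel()` sum counts them at half size.

```python
RANK = 16
NUM_LAYERS = 28

# (in_features, out_features) for each targeted projection, assumed from the published config.
PROJECTIONS = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}

# Unquantized weights: token embedding (assumed tied with the output head) plus norm weights.
EMBEDDING_AND_NORMS = 128256 * 3072 + (2 * NUM_LAYERS + 1) * 3072

# Each adapter adds an (r x in) matrix and an (out x r) matrix.
lora_params = NUM_LAYERS * sum(RANK * (i + o) for i, o in PROJECTIONS.values())
base_linear = NUM_LAYERS * sum(i * o for i, o in PROJECTIONS.values())

# numel() on packed 4-bit weights sees half as many elements as there are weights.
reported_total = base_linear // 2 + EMBEDDING_AND_NORMS + lora_params
actual_total = base_linear + EMBEDDING_AND_NORMS + lora_params

print(f"LoRA parameters:           {lora_params:,}")
print(f"Total as counted (packed): {reported_total:,}")
print(f"Share of packed count:     {100 * lora_params / reported_total:.2f}%")
print(f"Share of actual weights:   {100 * lora_params / actual_total:.2f}%")
```

Under these assumptions the adapter count and the packed total match the figures printed by the cell, while the share of the model's actual weights being trained is noticeably smaller than the printed percentage.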