[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dz-web3/DS-Tech-2026spring/blob/main/Module8_LLM_Finetuning/LLM_Finetuning.ipynb)

**Click the badge above to open this notebook in Google Colab!**

# Module 8: Fine-Tuning Large Language Models (LLMs)

**Data Science for Business (Technical) — Spring 2026**

---

## Learning Objectives

By the end of this module, you will be able to:

1. **Explain** the difference between pre-training, fine-tuning, and prompting
2. **Understand** when fine-tuning is the right choice vs. other approaches
3. **Perform** hands-on fine-tuning of a Llama model using modern, efficient techniques
4. **Evaluate** the business value and trade-offs of fine-tuning LLMs

---

## Why This Matters for Business

Large Language Models like GPT-4, Claude, and Llama are transforming how businesses operate. But **off-the-shelf models don't always fit your specific needs**. Fine-tuning allows you to:

- 🎯 **Customize** model behavior for your domain (legal, medical, customer service)
- 💰 **Reduce costs** by using smaller, specialized models instead of expensive large ones
- 🔒 **Maintain control** over your data and model behavior
- ⚡ **Improve performance** on specific tasks your business cares about

## 1. Setting Up Google Colab Pro (Free for NYU Students)

### 🎓 NYU Students: Get Free Colab Pro!

Google offers **free Colab Pro subscriptions** for students at U.S. higher education institutions.

**To claim your free subscription:**

1. Go to [colab.research.google.com](https://colab.research.google.com)
2. Click on the gear icon (⚙️) → "Colab Pro"
3. Select "Colab Pro for Education"
4. Verify your student status using your NYU email
5. You'll receive a **1-year free subscription** with more compute resources

### 🖥️ Enabling GPU for This Notebook

Fine-tuning requires a GPU. Here's how to enable it:

1. Go to **Runtime** → **Change runtime type**
2. Set **Hardware accelerator** to **T4 GPU** (or A100 if available with Colab Pro)
3. Click **Save**

Run the cell below to verify that the GPU is enabled:

In [None]:
# Check if GPU is available
import torch
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    print(f"✅ GPU is enabled: {gpu_name}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("❌ GPU is NOT enabled. Please go to Runtime → Change runtime type → Select GPU")

## 2. The LLM Customization Spectrum

Before diving into fine-tuning, let's understand where it fits in the broader landscape of LLM customization:

| Approach | Effort | Data Needed | Use Case |
|----------|--------|-------------|----------|
| **Prompting** | Low | None | Quick tasks, general use |
| **Few-shot Learning** | Low | 5-20 examples | Demonstrate desired format |
| **RAG** (Retrieval) | Medium | Documents | Add knowledge, keep model current |
| **Fine-tuning** | High | 100-10,000+ examples | Change model behavior/style |
| **Pre-training** | Very High | Billions of tokens | Build from scratch (rarely needed) |

### When Should You Fine-Tune?

✅ **Fine-tune when you want to:**
- Change the model's communication style consistently
- Make the model follow specific formats/templates
- Teach domain-specific terminology or behavior
- Improve reliability on repetitive tasks

❌ **Don't fine-tune when you can:**
- Solve the problem with better prompts
- Use RAG to add relevant knowledge
- Use few-shot examples in the prompt

## 3. Understanding Fine-Tuning: The Concept

### Pre-training vs Fine-tuning

**Pre-training** is like giving someone a general education:
- The model learns from massive amounts of text (books, websites, code)
- It learns language patterns, facts, and reasoning
- This is expensive: millions of dollars, weeks of compute time
- Done by companies like Meta (Llama), OpenAI (GPT), Google (Gemini)

**Fine-tuning** is like specialized job training:
- Start with a pre-trained model that already "knows" language
- Train it on your specific examples to learn your style/domain
- Much cheaper: can be done in minutes to hours on a single GPU
- This is what **you** can do!

### LoRA: Efficient Fine-Tuning

Traditional fine-tuning updates **all** model parameters — expensive and slow.

**LoRA (Low-Rank Adaptation)** is a clever technique that:
- Freezes the original model weights
- Adds small "adapter" layers that learn your specific task
- Only trains ~1% of the parameters
- Result: **quality that is often close to full fine-tuning, with far less GPU memory and training time**

Think of it like this: instead of rewriting an entire textbook, you're adding sticky notes with your customizations.
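
To make the "sticky notes" idea concrete, here is a minimal sketch in plain Python (no deep-learning library). It assumes a single frozen weight matrix `W` and two thin trainable matrices `A` and `B`, so the layer computes `W x + (alpha / r) * B (A x)`. The helper names (`matvec`, `lora_forward`) and the tiny 3x3 example are illustrative only; real libraries implement the same idea on GPU tensors.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen path W x plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)                # original weights, never updated
    update = matvec(B, matvec(A, x))   # adapter path, the only trainable part
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Illustrative values: a 3x3 frozen matrix and a rank-1 adapter
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0, 1.0, 1.0]]          # shape r x d_in  = 1 x 3
B = [[0.5], [0.0], [0.0]]      # shape d_out x r = 3 x 1
x = [1.0, 2.0, 3.0]

y = lora_forward(W, A, B, x, alpha=1, r=1)
print(y)  # [4.0, 2.0, 3.0]
print(f"Trainable values: {len(A) * len(A[0]) + len(B) * len(B[0])} instead of {len(W) * len(W[0])}")
```

In the original LoRA formulation, `B` starts at zero, so the adapted model initially behaves exactly like the base model and drifts away from it only as training proceeds. The savings are small in this toy case (6 values instead of 9) but grow quickly with matrix size.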

## 4. Hands-On: Fine-Tuning Llama 3.2

Now let's actually fine-tune a model! We'll use:

- **Model**: Llama 3.2 1B (a capable but small model from Meta)
- **Tool**: Unsloth (a library whose developers report roughly 2x faster fine-tuning with about 70% less memory)
- **Technique**: LoRA with 4-bit quantization
- **Task**: Create a customer service chatbot

### Step 1: Install Dependencies

This will take 2-3 minutes. Go grab a coffee! ☕

In [None]:
%%capture
# Install Unsloth (optimized for Colab)
!pip install unsloth
# Install other dependencies
!pip install --no-deps trl peft accelerate bitsandbytes

In [None]:
# Verify installation
import unsloth
print(f"✅ Unsloth version: {unsloth.__version__}")

### Step 2: Load the Base Model

We'll load Llama 3.2 1B with 4-bit quantization, which stores the weights in roughly 1 GB of GPU memory instead of about 4 GB at full 32-bit precision.

In [None]:
from unsloth import FastLanguageModel
import torch

# Model configuration
max_seq_length = 2048  # Maximum context length
dtype = None  # Auto-detect (float16 for T4)
load_in_4bit = True  # Use 4-bit quantization to save memory

# Load the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print(f"✅ Model loaded successfully!")
print(f"   Model: Llama 3.2 1B Instruct")
print(f"   Parameters: ~1 billion")
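
The memory figures above follow from simple arithmetic: each parameter takes `bits / 8` bytes. The sketch below is a back-of-the-envelope estimate, assuming the rounded "~1 billion" parameter count and counting weights only. The function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 1e9  # rounded figure; the exact count is somewhat higher
for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(num_params, bits):.1f} GB")
```

Real usage is higher than the 4-bit estimate suggests: some layers are typically kept in higher precision, and activations, gradients, and optimizer state need memory during training.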

### Step 3: Add LoRA Adapters

Now we add the LoRA adapters — the small trainable layers that will learn our task.

In [None]:
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank - higher = more capacity but more memory
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # 0 is optimized
    bias="none",
    use_gradient_checkpointing="unsloth",  # Memory optimization
    random_state=42,
)

# Count trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"✅ LoRA adapters added!")
print(f"   Trainable parameters: {trainable_params:,} ({100*trainable_params/total_params:.2f}%)")
print(f"   Total parameters: {total_params:,}")
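
As a sanity check on the count printed above, the number of LoRA parameters can be estimated by hand: each adapted matrix with `d_in` inputs and `d_out` outputs gains `r * (d_in + d_out)` trainable values. The layer dimensions below are assumptions based on the published Llama 3.2 1B configuration, not values read from this notebook; compare them against `model.config` before relying on them.

```python
# Assumed architecture values (check model.config to confirm)
HIDDEN = 2048         # hidden size
INTERMEDIATE = 8192   # MLP intermediate size
KV_DIM = 512          # key/value projection width (grouped-query attention)
NUM_LAYERS = 16
RANK = 16             # matches r=16 in get_peft_model above

# (input features, output features) for each targeted projection
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in shapes.values())
estimated_trainable = per_layer * NUM_LAYERS
print(f"Estimated LoRA parameters: {estimated_trainable:,}")
```

Under these assumptions the estimate is about 11 million parameters, which is in line with the "~1%" figure for a model of roughly 1 billion parameters. The percentage printed by the cell above may come out higher, because 4-bit weights are stored in packed form and can be counted as fewer elements.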

### Step 4: Prepare the Training Data

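We'll use a small, hand-written customer service dataset of 15 examples. That is far fewer than the 100+ examples recommended earlier, so treat the result as a demonstration. In real scenarios, you'd use your own business data.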

In [None]:
# Sample customer service training data
training_data = [
    {"instruction": "What is your return policy?", "response": "Our return policy allows returns within 30 days of purchase with a valid receipt. Items must be in original condition with tags attached."},
    {"instruction": "How do I track my order?", "response": "You can track your order by logging into your account and clicking 'Order History', or use the tracking number from your shipping confirmation email."},
    {"instruction": "Do you offer international shipping?", "response": "Yes, we ship to over 50 countries worldwide. International shipping rates and delivery times vary by destination."},
    {"instruction": "How can I cancel my order?", "response": "To cancel an order, please contact us within 2 hours of placing it. Once an order has been processed for shipping, it cannot be cancelled."},
    {"instruction": "What payment methods do you accept?", "response": "We accept all major credit cards (Visa, Mastercard, Amex), PayPal, Apple Pay, and Google Pay."},
    {"instruction": "How do I reset my password?", "response": "Click 'Forgot Password' on the login page, enter your email, and we'll send you a reset link valid for 24 hours."},
    {"instruction": "Is my personal information secure?", "response": "Yes, we use industry-standard SSL encryption and never share your data with third parties. Your security is our priority."},
    {"instruction": "How do I apply a discount code?", "response": "Enter your discount code in the 'Promo Code' field at checkout and click 'Apply'. The discount will be reflected in your order total."},
    {"instruction": "What are your store hours?", "response": "Our online store is available 24/7. For physical locations, hours vary by store - please check our store locator for specific hours."},
    {"instruction": "How do I contact customer support?", "response": "You can reach us via live chat on our website, email at support@example.com, or call 1-800-EXAMPLE Monday-Friday 9am-6pm EST."},
    {"instruction": "Do you price match?", "response": "Yes, we offer price matching within 14 days of purchase if you find the same item at a lower price from an authorized retailer."},
    {"instruction": "How long does shipping take?", "response": "Standard shipping takes 5-7 business days. Express shipping (2-3 days) and overnight options are also available at checkout."},
    {"instruction": "Can I change my shipping address?", "response": "You can update your shipping address before the order ships by contacting customer support. Once shipped, address changes are not possible."},
    {"instruction": "Do you have a loyalty program?", "response": "Yes! Join our rewards program for free to earn points on every purchase, receive exclusive discounts, and get early access to sales."},
    {"instruction": "What if my item arrives damaged?", "response": "We're sorry to hear that! Please contact us within 48 hours with photos of the damage, and we'll send a replacement or issue a full refund."},
]

print(f"✅ Training data prepared: {len(training_data)} examples")
print(f"\nExample:")
print(f"  Q: {training_data[0]['instruction']}")
print(f"  A: {training_data[0]['response']}")
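
When you swap in your own data, a quick check for empty fields and duplicate questions can save a wasted training run. The following is a minimal sketch that assumes the same `instruction`/`response` dictionary layout as `training_data`; the function `validate_examples` and the `sample` list are hypothetical and exist only for this illustration.

```python
def validate_examples(examples):
    """Return a list of human-readable problems found in the examples."""
    problems = []
    seen = set()
    for i, ex in enumerate(examples):
        question = str(ex.get("instruction", "")).strip()
        answer = str(ex.get("response", "")).strip()
        if not question or not answer:
            problems.append(f"example {i}: empty instruction or response")
        key = question.lower()  # treat case-only differences as duplicates
        if key and key in seen:
            problems.append(f"example {i}: duplicate instruction {question!r}")
        seen.add(key)
    return problems

# Deliberately flawed sample data to show what gets flagged
sample = [
    {"instruction": "What is your return policy?", "response": "30 days with receipt."},
    {"instruction": "what is your return policy?", "response": "30 days."},
    {"instruction": "Do you price match?", "response": ""},
]
for problem in validate_examples(sample):
    print(problem)
```

Calling `validate_examples(training_data)` on the list above should return an empty list. Checks like these catch only mechanical problems; answer accuracy still needs human review.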

In [None]:
from datasets import Dataset

# Format data for training
def format_prompt(example):
    """Format the data into the chat template the model expects"""
    text = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful customer service assistant. Be friendly, professional, and concise.<|eot_id|><|start_header_id|>user<|end_header_id|>

{example['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{example['response']}<|eot_id|>"""
    return {"text": text}

# Create dataset
dataset = Dataset.from_list(training_data)
dataset = dataset.map(format_prompt)

print(f"✅ Dataset formatted!")
print(f"\nSample formatted prompt:")
print(dataset[0]['text'][:500] + "...")

### Step 5: Train the Model! 🚀

This is where the magic happens. Training will take approximately **10-15 minutes** on a T4 GPU.

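Watch the loss decrease: that means the model is fitting the training examples. With only 15 examples and 60 steps, much of that is memorization, so the tests in Step 6 tell you more than the final loss does.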

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

# Training configuration
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,  # Quick training for demo
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
    ),
)

print("🚀 Starting training...")
print("   This will take ~10-15 minutes on a T4 GPU")
print("   Watch the 'loss' value decrease - that means learning is happening!\n")
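
It helps to translate the batch settings above into how often the model sees each example. The arithmetic below is a rough sketch assuming a single GPU and the 15-example dataset from Step 4; the variable names are illustrative.

```python
num_examples = 15                 # len(training_data) in this notebook
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60

# Examples consumed per optimizer update (single GPU assumed)
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
# Total examples processed, divided by dataset size
approx_epochs = max_steps * effective_batch / num_examples

print(f"Effective batch size: {effective_batch}")
print(f"Approximate passes over the data: {approx_epochs:.0f}")
```

By sample count this comes to roughly 30 passes over the same 15 examples, which is enough for the model to memorize them. The trainer's own epoch counter may differ slightly because of how partial batches are handled. On a larger dataset, you would usually pick `max_steps` (or a number of epochs) so that each example is seen only a few times.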

In [None]:
# Run training
trainer_stats = trainer.train()

print(f"\n✅ Training complete!")
print(f"   Total training time: {trainer_stats.metrics['train_runtime']:.1f} seconds")
print(f"   Final loss: {trainer_stats.metrics['train_loss']:.4f}")

### Step 6: Test the Fine-Tuned Model

Let's see if our model learned to be a good customer service assistant!

In [None]:
# Switch to inference mode (faster generation)
FastLanguageModel.for_inference(model)

def ask_customer_service(question):
    """Ask our fine-tuned customer service model a question"""
    prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful customer service assistant. Be friendly, professional, and concise.<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
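    # The prompt already starts with <|begin_of_text|>, so don't let the tokenizer add a second one
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
    )
    # Decode only the newly generated tokens; splitting the full text on the word
    # "assistant" would break whenever the answer itself contains that word
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()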

print("🤖 Customer Service Bot Ready!")
print("="*50)

In [None]:
# Test with paraphrases of questions from the training data
test_questions = [
    "What's your return policy?",
    "How can I track my package?",
    "Do you ship internationally?",
]

print("📝 Testing with training-similar questions:\n")
for q in test_questions:
    print(f"Customer: {q}")
    print(f"Bot: {ask_customer_service(q)}")
    print("-" * 40)

In [None]:
# Test with NEW questions (not in training data)
new_questions = [
    "Can I get a refund if I don't like the product?",
    "What happens if my package is lost?",
    "Do you have a size guide?",
]

print("🆕 Testing with NEW questions (not in training data):\n")
for q in new_questions:
    print(f"Customer: {q}")
    print(f"Bot: {ask_customer_service(q)}")
    print("-" * 40)
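
Reading the outputs by eye works for a handful of questions, but a simple score makes runs easier to compare. One option is token-overlap F1 between the model's answer and a reference answer, sketched below. This is a crude proxy rather than a measure of correctness, and the helper names (`tokens`, `token_f1`) and example strings are illustrative assumptions.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (0.0 to 1.0)."""
    pred, ref = Counter(tokens(prediction)), Counter(tokens(reference))
    overlap = sum((pred & ref).values())  # tokens shared by both answers
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "Standard shipping takes 5-7 business days."
print(f"{token_f1('Shipping takes 5-7 business days.', reference):.2f}")  # 0.92
print(f"{token_f1('We accept PayPal.', reference):.2f}")                  # 0.00
```

In this notebook you could score `ask_customer_service(q)` against the matching `response` in `training_data`. A high score only means the wording overlaps, so invented policies or wrong numbers still need a human check.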

### Step 7: Save the Model (Optional)

To reuse the model later, save the LoRA adapters. Only the adapters are written to disk, so loading them again also requires the base model.

In [None]:
# Save the LoRA adapters (small, ~50MB)
model.save_pretrained("customer_service_lora")
tokenizer.save_pretrained("customer_service_lora")
print("✅ Model saved to 'customer_service_lora' folder")

# Check the size
!du -sh customer_service_lora/

## 5. Business Applications & Decision Framework

### Real-World Use Cases

| Company | Application | Why Fine-Tuning? |
|---------|-------------|------------------|
| **Legal Tech** | Contract analysis | Domain-specific terminology |
| **Healthcare** | Patient communication | Regulatory compliance, tone |
| **E-commerce** | Customer service bots | Brand voice, product knowledge |
| **Finance** | Report generation | Consistent formatting, compliance |
| **Education** | Tutoring assistants | Teaching style, curriculum alignment |

### Cost-Benefit Analysis

**Fine-tuning costs:**
- Compute: ~$1-10 for small models, ~$100-1000 for large models
- Data preparation: Often the largest cost (human time to create/curate examples)
- Iteration: Usually need 2-5 rounds to get it right

**Fine-tuning benefits:**
- Cheaper inference than few-shot prompting, since prompts no longer carry examples (the savings depend on prompt length and pricing)
- More consistent behavior
- Faster response times (no need for long prompts)
- Can use smaller, cheaper models
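
The inference savings come mostly from shorter prompts. The sketch below compares monthly prompt costs for a few-shot setup and a fine-tuned setup. Every number in it (request volume, token counts, price) is a hypothetical placeholder chosen for illustration, not a quote from any provider.

```python
def monthly_prompt_cost(requests, prompt_tokens, price_per_million_tokens):
    """Cost of input tokens only, in dollars per month."""
    return requests * prompt_tokens * price_per_million_tokens / 1_000_000

# Hypothetical workload: 1M requests/month at $1.00 per million input tokens
few_shot = monthly_prompt_cost(1_000_000, 1_500, 1.00)   # prompt includes many examples
fine_tuned = monthly_prompt_cost(1_000_000, 100, 1.00)   # short prompt, behavior is in the weights

print(f"Few-shot prompting: ${few_shot:,.0f}/month")
print(f"Fine-tuned model:   ${fine_tuned:,.0f}/month")
print(f"Ratio: {few_shot / fine_tuned:.0f}x")
```

This ignores output tokens, the one-time training cost, and any price difference between hosting a fine-tuned model and calling a standard one, so run it with your own numbers before drawing conclusions.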

### Decision Framework

```
Start with Prompting
        ↓
Works well enough? → YES → Stop here! 🎉
        ↓ NO
Need external knowledge? → YES → Try RAG first
        ↓ NO
Have 100+ good examples? → NO → Collect more data
        ↓ YES
Fine-tune! → Evaluate → Iterate
```
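
The same flow can be written as a small function, which is handy if you want to apply it consistently across several candidate projects. This is a direct transcription of the diagram above; the function name and argument names are made up for illustration, and the 100-example threshold is the rule of thumb from the diagram.

```python
def recommend_approach(prompting_works: bool,
                       needs_external_knowledge: bool,
                       num_good_examples: int) -> str:
    """Walk the decision framework from top to bottom."""
    if prompting_works:
        return "Stop here: prompting is enough"
    if needs_external_knowledge:
        return "Try RAG first"
    if num_good_examples < 100:
        return "Collect more data"
    return "Fine-tune, then evaluate and iterate"

print(recommend_approach(False, False, 15))    # Collect more data
print(recommend_approach(False, False, 500))   # Fine-tune, then evaluate and iterate
```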

## 6. Summary & Key Takeaways

### What We Learned

1. **Fine-tuning** adapts a pre-trained model to your specific needs
2. **LoRA** makes fine-tuning efficient (train only ~1% of parameters)
3. **Modern tools** (Unsloth) make it accessible on free hardware
4. **Business value** comes from consistency, cost reduction, and customization

### What We Did

- ✅ Loaded Llama 3.2 1B (a 1 billion parameter model)
- ✅ Added LoRA adapters for efficient training
- ✅ Fine-tuned on customer service data
- ✅ Tested the model on new questions
- ✅ Saved the model for future use

### Next Steps for Your Career

1. **Experiment**: Try fine-tuning with your own data
2. **Explore**: Look into Hugging Face and the OpenAI fine-tuning API
3. **Stay current**: This field moves fast — follow AI news!

---

*Questions? Reach out during office hours or on the course forum.*

## 🎯 Exercise (Optional)

Try modifying the training data to create a different kind of assistant:

1. **Tech Support Bot**: Answer questions about common tech issues
2. **Restaurant Bot**: Handle reservations and menu questions
3. **Fitness Coach**: Give workout advice and motivation

Modify the `training_data` list with your own examples and re-run the training!

In [None]:
# Your custom training data here!
# my_training_data = [
#     {"instruction": "...", "response": "..."},
#     ...
# ]