[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dz-web3/DS-Tech-2026spring/blob/main/Module8_LLM_Finetuning/LLM_Finetuning.ipynb)

**Click the badge above to open this notebook in Google Colab!**

# Module 8: Fine-Tuning Large Language Models (LLMs)

**Data Science for Business (Technical), Spring 2026**

---

## Learning Objectives

By the end of this module, you will be able to:

1. **Explain** the difference between pre-training, fine-tuning, and prompting
2. **Understand** when fine-tuning is the right choice vs. other approaches
3. **Perform** hands-on fine-tuning using Hugging Face and LoRA
4. **Visualize** training loss and detect overfitting
5. **Measure** improvement using before/after evaluation on held-out test data

---

## Why This Matters for Business

Large Language Models like GPT-4, Claude, and Qwen are transforming how businesses operate. But **off-the-shelf models don't always fit your specific needs**. Fine-tuning allows you to:

- 🎯 **Customize** model behavior for your domain (legal, medical, customer service)
- 💰 **Reduce costs** by using smaller, specialized models instead of expensive large ones
- 🔐 **Maintain control** over your data and model behavior
- ⚡ **Improve performance** on specific tasks your business cares about

## 1. Setting Up Google Colab Pro (Free for NYU Students)

### 🎓 NYU Students: Get Free Colab Pro!

Google offers **free Colab Pro subscriptions** for students at U.S. higher education institutions.

**To claim your free subscription:**

1. Go to [colab.research.google.com](https://colab.research.google.com)
2. Click on the gear icon (⚙️) → "Colab Pro"
3. Select "Colab Pro for Education"
4. Verify your student status using your NYU email
5. You'll receive a **1-year free subscription** with more compute resources

### 🖥️ Enabling GPU for This Notebook

Fine-tuning requires a GPU. Here's how to enable it:

1. Go to **Runtime** → **Change runtime type**
2. Set **Hardware accelerator** to **T4 GPU** (or A100 if available with Colab Pro)
3. Click **Save**

Run the cell below to verify GPU is enabled:

In [None]:
# Check whether a GPU is available
import torch
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    print(f"✅ GPU is enabled: {gpu_name}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("❌ GPU is NOT enabled. Please go to Runtime → Change runtime type → Select GPU")

## 2. The LLM Customization Spectrum

Before diving into fine-tuning, let's understand where it fits in the broader landscape of LLM customization:

| Approach | Effort | Data Needed | Use Case |
|----------|--------|-------------|----------|
| **Prompting** | Low | None | Quick tasks, general use |
| **Few-shot Learning** | Low | 5-20 examples | Demonstrate desired format |
| **RAG** (Retrieval) | Medium | Documents | Add knowledge, keep model current |
| **Fine-tuning** | High | 100-10,000+ examples | Change model behavior/style |
| **Pre-training** | Very High | Billions of tokens | Build from scratch (rarely needed) |

### When Should You Fine-Tune?

✅ **Fine-tune when you want to:**
- Change the model's communication style consistently
- Make the model follow specific formats/templates
- Teach domain-specific terminology or behavior
- Improve reliability on repetitive tasks

❌ **Don't fine-tune when you can:**
- Solve the problem with better prompts
- Use RAG to add relevant knowledge
- Use few-shot examples in the prompt
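
Few-shot prompting is often the cheapest alternative to try first. As a minimal sketch (the helper name `build_few_shot_messages` and the example texts are illustrative and not part of this notebook's pipeline), worked examples can be placed in the chat messages ahead of the real question:

```python
def build_few_shot_messages(system_prompt, examples, question):
    """Build a chat message list that shows worked examples before the real question."""
    messages = [{"role": "system", "content": system_prompt}]
    for example in examples:
        # Each example becomes one user turn followed by the desired assistant turn
        messages.append({"role": "user", "content": example["instruction"]})
        messages.append({"role": "assistant", "content": example["response"]})
    # The real question goes last, so the model answers it in the demonstrated style
    messages.append({"role": "user", "content": question})
    return messages


examples = [
    {"instruction": "What is your return policy?",
     "response": "Our return policy allows returns within 30 days of purchase with a valid receipt."},
    {"instruction": "How do I track my order?",
     "response": "You can track your order from 'Order History' in your account."},
]
messages = build_few_shot_messages(
    "You are a helpful customer service assistant.", examples, "Do you price match?"
)
for message in messages:
    print(message["role"], "->", message["content"][:40])
```

A list like this can be passed to a chat template in the same way as the two-message list used later in this notebook. If the few-shot version already behaves well enough, fine-tuning may not be needed.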

## 3. Understanding Fine-Tuning: The Concept

### Pre-training vs Fine-tuning

**Pre-training** is like giving someone a general education:
- The model learns from massive amounts of text (books, websites, code)
- It learns language patterns, facts, and reasoning
- This is expensive: millions of dollars, weeks of compute time
- Done by companies like Alibaba (Qwen), Meta (Llama), OpenAI (GPT), Google (Gemini)

**Fine-tuning** is like specialized job training:
- Start with a pre-trained model that already "knows" language
- Train it on your specific examples to learn your style/domain
- Much cheaper: can be done in minutes to hours on a single GPU
- This is what **you** can do!

### LoRA: Efficient Fine-Tuning

Traditional fine-tuning updates **all** model parameters, which is expensive and slow.

**LoRA (Low-Rank Adaptation)** is a clever technique that:
- Freezes the original model weights
- Adds small "adapter" matrices that learn your specific task
- Trains only a small fraction of the parameters (often about 1% or less)
- Result: **quality close to full fine-tuning on many tasks, with far less memory and compute for training**

Think of it like this: instead of rewriting an entire textbook, you're adding sticky notes with your customizations.
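
To make the "sticky notes" idea concrete, here is a minimal sketch in plain Python of how a LoRA layer computes its output: the frozen weight `W` is applied as usual, and a low-rank product `B (A x)`, scaled by `alpha / r`, is added on top. The tiny matrices and the function names are illustrative assumptions; real libraries do this with tensors on the GPU.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x), the LoRA-adapted layer output."""
    base = matvec(W, x)                 # frozen pre-trained path
    low_rank = matvec(B, matvec(A, x))  # trainable adapter path through rank r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, low_rank)]


W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]  # frozen weight, shape 2 x 3
A = [[0.5, -0.5, 1.0]]                  # adapter down-projection, shape r x 3 with r = 1
x = [1.0, 2.0, 3.0]

# B starts at zero, so before training the adapted layer matches the base layer
B_initial = [[0.0], [0.0]]
print(lora_forward(W, A, B_initial, x, alpha=2, r=1))  # [5.0, 11.0]

# After training, B is nonzero and shifts the output without touching W
B_trained = [[1.0], [-1.0]]
print(lora_forward(W, A, B_trained, x, alpha=2, r=1))  # [10.0, 6.0]
```

Only `A` and `B` would receive gradient updates; `W` stays exactly as it was after pre-training.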

## 4. Hands-On: Fine-Tuning Qwen2.5

Now let's actually fine-tune a model! We'll use:

- **Model**: Qwen2.5-0.5B-Instruct (Alibaba's efficient small model)
- **Library**: Hugging Face Transformers + PEFT (Parameter-Efficient Fine-Tuning)
- **Technique**: LoRA adapters
- **Task**: Create a customer service chatbot

### Step 1: Install Dependencies

This will take about 1-2 minutes. ☕

In [None]:
%%capture
# Install Hugging Face libraries
!pip install transformers datasets peft accelerate matplotlib -q

In [None]:
# Verify installation
import transformers
import peft
import matplotlib.pyplot as plt
print(f"✅ Transformers version: {transformers.__version__}")
print(f"✅ PEFT version: {peft.__version__}")
print("✅ Matplotlib ready for visualization")

### Step 2: Load the Base Model (Qwen2.5-0.5B)

We'll load Qwen2.5-0.5B-Instruct, an efficient model from Alibaba that's openly available.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Model configuration - Qwen2.5 is openly available, no approval needed!
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

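# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Keep the tokenizer's own pad token if it has one. Reusing the end-of-sequence token
# as padding would make the data collator mask real end tokens out of the training loss.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"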

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

print("✅ Model loaded: Qwen2.5-0.5B-Instruct")
print(f"   Parameters: ~0.5 billion")

### Step 3: Prepare Training and Test Data

We split our data into:
- **Training data**: Used to fine-tune the model
- **Test data**: Held out to evaluate improvement (model never sees this during training)

In [None]:
# TRAINING DATA - The model will learn from these examples
training_data = [
    {"instruction": "What is your return policy?", "response": "Our return policy allows returns within 30 days of purchase with a valid receipt. Items must be in original condition with tags attached."},
    {"instruction": "How do I track my order?", "response": "You can track your order by logging into your account and clicking 'Order History', or use the tracking number from your shipping confirmation email."},
    {"instruction": "Do you offer international shipping?", "response": "Yes, we ship to over 50 countries worldwide. International shipping rates and delivery times vary by destination."},
    {"instruction": "How can I cancel my order?", "response": "To cancel an order, please contact us within 2 hours of placing it. Once an order has been processed for shipping, it cannot be cancelled."},
    {"instruction": "What payment methods do you accept?", "response": "We accept all major credit cards (Visa, Mastercard, Amex), PayPal, Apple Pay, and Google Pay."},
    {"instruction": "How do I reset my password?", "response": "Click 'Forgot Password' on the login page, enter your email, and we'll send you a reset link valid for 24 hours."},
    {"instruction": "Is my personal information secure?", "response": "Yes, we use industry-standard SSL encryption and never share your data with third parties. Your security is our priority."},
    {"instruction": "How do I apply a discount code?", "response": "Enter your discount code in the 'Promo Code' field at checkout and click 'Apply'. The discount will be reflected in your order total."},
    {"instruction": "What are your store hours?", "response": "Our online store is available 24/7. For physical locations, hours vary by store - please check our store locator for specific hours."},
    {"instruction": "How do I contact customer support?", "response": "You can reach us via live chat on our website, email at support@example.com, or call 1-800-EXAMPLE Monday-Friday 9am-6pm EST."},
]

# TEST DATA - Held out for evaluation (model never sees these during training!)
test_data = [
    {"instruction": "Do you price match?", "response": "Yes, we offer price matching within 14 days of purchase if you find the same item at a lower price from an authorized retailer."},
    {"instruction": "How long does shipping take?", "response": "Standard shipping takes 5-7 business days. Express shipping (2-3 days) and overnight options are also available at checkout."},
    {"instruction": "Can I change my shipping address?", "response": "You can update your shipping address before the order ships by contacting customer support. Once shipped, address changes are not possible."},
    {"instruction": "Do you have a loyalty program?", "response": "Yes! Join our rewards program for free to earn points on every purchase, receive exclusive discounts, and get early access to sales."},
    {"instruction": "What if my item arrives damaged?", "response": "We're sorry to hear that! Please contact us within 48 hours with photos of the damage, and we'll send a replacement or issue a full refund."},
]

print("✅ Data prepared!")
print(f"   Training examples: {len(training_data)} (used for fine-tuning)")
print(f"   Test examples: {len(test_data)} (held out for evaluation)")
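
A held-out test set is only meaningful if none of its questions also appear in the training set. The following is a minimal sketch of such a check; the function names and the two tiny lists are hypothetical stand-ins for `training_data` and `test_data`, and matching is limited to case and whitespace differences.

```python
def normalize(question):
    """Lowercase and collapse whitespace so trivially different copies still match."""
    return " ".join(question.lower().split())


def find_leaked_questions(train_examples, test_examples):
    """Return test questions that also appear in the training examples."""
    train_questions = {normalize(ex["instruction"]) for ex in train_examples}
    return [
        ex["instruction"]
        for ex in test_examples
        if normalize(ex["instruction"]) in train_questions
    ]


# Illustrative data only: the second test question duplicates a training question
train = [{"instruction": "What is your return policy?"},
         {"instruction": "How do I track my order?"}]
test = [{"instruction": "Do you price match?"},
        {"instruction": "how do I track my  order?"}]

leaks = find_leaked_questions(train, test)
print(leaks)  # ['how do I track my  order?']
```

An empty list means no exact duplicates were found. It does not rule out paraphrased questions, which would need a fuzzier comparison.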

## 5. 📊 BEFORE Fine-Tuning: Baseline Evaluation

Before we fine-tune, let's see how the **original model** performs on our **test data**.

In [None]:
def generate_response(model, tokenizer, question, system_prompt="You are a helpful customer service assistant."):
    """Generate a response from the model for a single question."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question}
    ]

    # Render the conversation with the model's chat template, then tokenize it
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Sampling (do_sample=True) means outputs, and therefore scores, vary between runs
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=150,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens so the prompt and role labels are not included
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
    return response[:500]

def evaluate_on_test_data(model, tokenizer, test_data):
    """Evaluate model on test data"""
    results = []
    total_keyword_score = 0
    
    for item in test_data:
        question = item["instruction"]
        expected = item["response"]
        generated = generate_response(model, tokenizer, question)
        
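        # Keywords: words of 4+ characters, lowercased, with surrounding punctuation stripped
        cleaned_words = (word.strip(".,!?;:()'\"").lower() for word in expected.split())
        expected_keywords = set(word for word in cleaned_words if len(word) >= 4)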
        generated_lower = generated.lower()
        
        keywords_found = sum(1 for kw in expected_keywords if kw in generated_lower)
        keyword_score = keywords_found / len(expected_keywords) if expected_keywords else 0
        total_keyword_score += keyword_score
        
        results.append({"question": question, "expected": expected, "generated": generated, "keyword_score": keyword_score})
    
    return results, total_keyword_score / len(test_data)

print("📊 Evaluating BEFORE fine-tuning...\n")
before_results, before_score = evaluate_on_test_data(model, tokenizer, test_data)

print("=" * 60)
print(f"BASELINE SCORE: {before_score:.1%}")
print("=" * 60)
for r in before_results:
    print(f"\n❓ {r['question']}")
    print(f"   Score: {r['keyword_score']:.0%}")
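
The score above is a keyword-overlap measure: the share of longer words from the expected answer that also appear in the generated answer. The sketch below is a simplified, standalone variant (the name `keyword_overlap` and the sample sentences are illustrative, and stripping punctuation is a choice made for this sketch) that shows one limitation worth keeping in mind: a reasonable paraphrase can score low.

```python
def keyword_overlap(expected, generated, min_len=4):
    """Fraction of keywords from `expected` that occur as substrings of `generated`."""
    cleaned = (word.strip(".,!?;:()'\"").lower() for word in expected.split())
    keywords = {word for word in cleaned if len(word) >= min_len}
    if not keywords:
        return 0.0
    text = generated.lower()
    return sum(1 for keyword in keywords if keyword in text) / len(keywords)


expected = "Standard shipping takes 5-7 business days."

# An identical answer matches every keyword
exact = keyword_overlap(expected, "Standard shipping takes 5-7 business days.")
# A paraphrase only matches "shipping" and "takes" out of five keywords
paraphrase = keyword_overlap(expected, "Shipping usually takes about a week.")

print(exact)       # 1.0
print(paraphrase)  # 0.4
```

Because the metric rewards reusing the reference wording, it is best read as a rough signal of whether the model has picked up the company's phrasing, not as a measure of correctness.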

## 6. Add LoRA Adapters and Train

Now let's add LoRA adapters and fine-tune the model.

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)

trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())

print("✅ LoRA adapters added!")
print(f"   Trainable: {trainable_params:,} ({100*trainable_params/total_params:.2f}%)")
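
The trainable-parameter count printed above can be estimated by hand: a rank-`r` adapter on a layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below applies this to the four target modules. The layer sizes are assumptions based on the published Qwen2.5-0.5B configuration (hidden size 896, key/value projection size 128, 24 layers); confirm them with `model.config` before relying on the numbers.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Assumed dimensions, to be verified against model.config
hidden_size = 896
kv_size = 128
num_layers = 24
rank = 16  # matches r=16 in the LoRA configuration

module_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_size),
    "v_proj": (hidden_size, kv_size),
    "o_proj": (hidden_size, hidden_size),
}

per_layer = sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in module_shapes.values())
total = per_layer * num_layers

print(f"Per layer: {per_layer:,}")  # Per layer: 90,112
print(f"Total:     {total:,}")      # Total:     2,162,688
```

Under these assumptions the adapters add roughly 2.2 million parameters, which is well under 1% of a model with about half a billion parameters.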

In [None]:
from datasets import Dataset

def format_and_tokenize(example):
    messages = [
        {"role": "system", "content": "You are a helpful customer service assistant. Be friendly, professional, and concise."},
        {"role": "user", "content": example['instruction']},
        {"role": "assistant", "content": example['response']}
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    tokens = tokenizer(text, truncation=True, max_length=512, padding="max_length", return_tensors=None)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

dataset = Dataset.from_list(training_data)
tokenized_dataset = dataset.map(format_and_tokenize, remove_columns=dataset.column_names)

print(f"✅ Dataset ready: {len(tokenized_dataset)} training examples")
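
Each example is padded to 512 tokens, so most positions are padding. To my understanding, `DataCollatorForLanguageModeling` with `mlm=False` rebuilds `labels` from `input_ids` and sets every position equal to the pad token ID to -100, the value PyTorch's cross-entropy loss ignores by default. The plain-Python sketch below mimics that masking with made-up token IDs to show why the choice of pad token matters.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(input_ids, pad_token_id):
    """Copy input_ids into labels, replacing pad positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in input_ids]


# Hypothetical IDs: 7 marks the end of the assistant turn, 0 is a dedicated pad token
sequence = [101, 102, 103, 7, 0, 0]
labels_separate_pad = mask_padding(sequence, pad_token_id=0)
print(labels_separate_pad)  # [101, 102, 103, 7, -100, -100]

# If padding reuses the end-of-turn token, the real end-of-turn label is masked too
sequence_shared = [101, 102, 103, 7, 7, 7]
labels_shared_pad = mask_padding(sequence_shared, pad_token_id=7)
print(labels_shared_pad)  # [101, 102, 103, -100, -100, -100]
```

In the second case the model receives no training signal for ending its answer, which is why a pad token distinct from the end-of-sequence token is preferable when the tokenizer provides one.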

In [None]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling, TrainerCallback

# Custom callback to track training metrics
class TrainingMetricsCallback(TrainerCallback):
    def __init__(self):
        self.training_loss = []
        self.steps = []
    
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "loss" in logs:
            self.training_loss.append(logs["loss"])
            self.steps.append(state.global_step)

metrics_callback = TrainingMetricsCallback()

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Training configuration - 15 epochs
training_args = TrainingArguments(
    output_dir="./customer_service_qwen",
    num_train_epochs=15,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    logging_steps=5,  # Log every 5 steps
    save_strategy="no",
    report_to="none",
    fp16=True,
    warmup_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
    callbacks=[metrics_callback],
)

print("🚀 Starting training for 15 epochs...")
print(f"   Training examples: {len(tokenized_dataset)}")
print("   Watch the loss decrease!\n")

In [None]:
# Run training
train_result = trainer.train()

print("\n✅ Training complete!")
print(f"   Final training loss: {train_result.training_loss:.4f}")
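
The number of training steps follows directly from the configuration. As a quick sanity check (assuming a single GPU, so the per-device batch size is the whole batch), the arithmetic below reproduces the step count you should see in the progress bar and the number of points that end up in the loss plot.

```python
import math

num_examples = 10                # training examples
per_device_batch_size = 1        # per_device_train_batch_size
gradient_accumulation_steps = 2  # batches accumulated before each weight update
num_epochs = 15
logging_steps = 5

# Batches the data loader yields in one pass over the data
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
# One optimizer step happens after every `gradient_accumulation_steps` batches
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_epochs
logged_points = total_updates // logging_steps

print(f"Effective batch size: {per_device_batch_size * gradient_accumulation_steps}")  # 2
print(f"Optimizer steps per epoch: {updates_per_epoch}")  # 5
print(f"Total optimizer steps: {total_updates}")          # 75
print(f"Logged loss values: {logged_points}")             # 15
```

With only 75 optimizer steps, the 10 warmup steps cover about 13% of training, and each logged loss value summarizes exactly one epoch.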

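## 7. 📈 Visualize Training Loss

Let's plot the training loss to see how the model learned. A steadily falling training loss shows the model is fitting the training examples. On its own it cannot confirm overfitting, which shows up as a gap between performance on the training data and performance on held-out data.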

In [None]:
import matplotlib.pyplot as plt

# Plot training loss
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.plot(metrics_callback.steps, metrics_callback.training_loss, 'b-', linewidth=2, label='Training Loss')
plt.xlabel('Training Steps', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Training Loss Over Time', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)

# Add epoch markers
steps_per_epoch = len(tokenized_dataset) // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps)
if steps_per_epoch > 0:
    for epoch in range(1, int(training_args.num_train_epochs) + 1):
        epoch_step = epoch * steps_per_epoch
        if epoch_step <= max(metrics_callback.steps):
            plt.axvline(x=epoch_step, color='gray', linestyle='--', alpha=0.5)

# Plot smoothed loss (trailing moving average over up to window_size points)
plt.subplot(1, 2, 2)
window_size = min(5, len(metrics_callback.training_loss))
if window_size > 1:
    # The slice must hold at most window_size values so it matches the divisor
    smoothed_loss = [sum(metrics_callback.training_loss[max(0, i - window_size + 1):i + 1]) / min(i + 1, window_size)
                    for i in range(len(metrics_callback.training_loss))]
    plt.plot(metrics_callback.steps, smoothed_loss, 'r-', linewidth=2, label='Smoothed Loss')
else:
    plt.plot(metrics_callback.steps, metrics_callback.training_loss, 'r-', linewidth=2, label='Training Loss')
plt.xlabel('Training Steps', fontsize=12)
plt.ylabel('Loss (Smoothed)', fontsize=12)
plt.title('Smoothed Training Loss', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print analysis
print("\n📊 TRAINING ANALYSIS:")
print("="*50)
if len(metrics_callback.training_loss) >= 2:
    initial_loss = metrics_callback.training_loss[0]
    final_loss = metrics_callback.training_loss[-1]
    print(f"   Initial loss: {initial_loss:.4f}")
    print(f"   Final loss:   {final_loss:.4f}")
    print(f"   Reduction:    {(initial_loss - final_loss) / initial_loss * 100:.1f}%")
    
    # Rough heuristic for overfitting risk: compare average loss in the two halves of training.
    # Training loss alone cannot confirm overfitting; check the held-out test results as well.
    mid_point = len(metrics_callback.training_loss) // 2
    first_half_avg = sum(metrics_callback.training_loss[:mid_point]) / mid_point if mid_point > 0 else 0
    second_half_avg = sum(metrics_callback.training_loss[mid_point:]) / (len(metrics_callback.training_loss) - mid_point)

    if second_half_avg < first_half_avg * 0.5:
        print("\n   ⚠️ Loss is very low - watch for overfitting!")
        print("   Consider: fewer epochs, more regularization, or more data.")
    else:
        print("\n   ✅ No sharp drop in training loss - no obvious sign of memorization.")
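
A more direct way to spot overfitting is to track loss on a validation set after each epoch and look for the point where it stops improving while training loss keeps falling. This notebook does not compute validation loss, so the sketch below uses made-up numbers and a hypothetical helper, `best_epoch_before_overfit`, purely to illustrate the idea.

```python
def best_epoch_before_overfit(val_loss, patience=2):
    """Return the 1-based epoch with the best validation loss once it has failed
    to improve for `patience` consecutive epochs, or None if that never happens."""
    best_index = 0
    for index, loss in enumerate(val_loss):
        if loss < val_loss[best_index]:
            best_index = index
        elif index - best_index >= patience:
            return best_index + 1
    return None


# Illustrative values only: training loss keeps falling, validation loss turns upward
train_loss = [2.1, 1.4, 0.9, 0.5, 0.3, 0.1]
val_loss = [2.0, 1.5, 1.2, 1.3, 1.4, 1.6]

print(best_epoch_before_overfit(val_loss))              # 3
print(best_epoch_before_overfit([2.0, 1.5, 1.2, 1.0]))  # None
```

In this made-up run, epoch 3 is the best stopping point: later epochs lower the training loss but make validation loss worse, which is the usual signature of overfitting.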

## 8. 📊 AFTER Fine-Tuning: Measure Real Improvement

Now let's evaluate on **held-out test data** to see the real gain!

In [None]:
model.eval()

print("📊 Evaluating AFTER fine-tuning on TEST DATA...\n")
after_results, after_score = evaluate_on_test_data(model, tokenizer, test_data)

print("=" * 60)
print("AFTER FINE-TUNING - TEST DATA")
print("=" * 60)
for r in after_results:
    print(f"\n❓ {r['question']}")
    print(f"🤖 {r['generated'][:150]}..." if len(r['generated']) > 150 else f"🤖 {r['generated']}")
    print(f"   Score: {r['keyword_score']:.0%}")

print(f"\n{'='*60}")
print(f"📊 AFTER SCORE: {after_score:.1%}")
print(f"{'='*60}")

In [None]:
# Visualize before/after comparison
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart comparing scores
scores = [before_score * 100, after_score * 100]
labels = ['Before\nFine-Tuning', 'After\nFine-Tuning']
colors = ['#ff6b6b', '#51cf66']

axes[0].bar(labels, scores, color=colors, edgecolor='black', linewidth=1.5)
axes[0].set_ylabel('Score (%)', fontsize=12)
axes[0].set_title('Overall Test Score Comparison', fontsize=14)
axes[0].set_ylim(0, 100)
for i, v in enumerate(scores):
    axes[0].text(i, v + 2, f'{v:.1f}%', ha='center', fontsize=12, fontweight='bold')

# Per-question comparison
questions_short = [f"Q{i+1}" for i in range(len(test_data))]
before_scores = [r['keyword_score'] * 100 for r in before_results]
after_scores = [r['keyword_score'] * 100 for r in after_results]

x = range(len(questions_short))
width = 0.35
axes[1].bar([i - width/2 for i in x], before_scores, width, label='Before', color='#ff6b6b')
axes[1].bar([i + width/2 for i in x], after_scores, width, label='After', color='#51cf66')
axes[1].set_xticks(x)
axes[1].set_xticklabels(questions_short)
axes[1].set_ylabel('Score (%)', fontsize=12)
axes[1].set_title('Per-Question Score Comparison', fontsize=14)
axes[1].legend()
axes[1].set_ylim(0, 100)

plt.tight_layout()
plt.show()

# Print summary
print("\n" + "=" * 60)
print("📈 IMPROVEMENT SUMMARY")
print("=" * 60)
print(f"\n   BEFORE: {before_score:.1%}")
print(f"   AFTER:  {after_score:.1%}")
improvement = after_score - before_score
print("   ───────────────")
if improvement > 0:
    print(f"   📈 GAIN: +{improvement:.1%}")
    print("\n   ✅ Fine-tuning improved performance on unseen test data!")
else:
    # Report the result even when the score did not go up
    print(f"   CHANGE: {improvement:+.1%}")
    print("\n   ⚠️ No improvement on the test data. Scores vary between runs because generation uses sampling.")

## 9. Test with Your Own Questions

In [None]:
new_questions = [
    "Can I get a refund if I don't like the product?",
    "What happens if my package is lost?",
    "Do you have a size guide?",
]

print("🆕 Testing with completely NEW questions:\n")
for q in new_questions:
    response = generate_response(model, tokenizer, q)
    print(f"Customer: {q}")
    print(f"Bot: {response}")
    print("-" * 50)

## 10. Save the Model (Optional)

In [None]:
model.save_pretrained("customer_service_lora")
tokenizer.save_pretrained("customer_service_lora")
print("✅ LoRA adapter and tokenizer saved!")
!du -sh customer_service_lora/
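
Because `model` is a PEFT model, `save_pretrained` writes only the small LoRA adapter weights, not the full base model, which is why the directory is so small. To use the adapter later, the base model has to be loaded again and the adapter attached to it. The sketch below shows one way to do that; the function name `load_finetuned_adapter` is hypothetical, and it assumes the directory name and base model used above.

```python
def load_finetuned_adapter(
    base_model_name="Qwen/Qwen2.5-0.5B-Instruct",
    adapter_dir="customer_service_lora",
):
    """Reload the base model and attach the saved LoRA adapter (sketch)."""
    # Imported here so that defining the function does not require the libraries
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # The tokenizer was saved next to the adapter, so it can be loaded from that directory
    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    base_model = AutoModelForCausalLM.from_pretrained(base_model_name, device_map="auto")
    # Attach the adapter weights on top of the frozen base model
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model.eval()
    return model, tokenizer


# Usage in a fresh session, after the adapter directory exists:
# model, tokenizer = load_finetuned_adapter()
```

Calling the function downloads the base model again if it is not cached, so run it on a machine with network access and enough memory.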

## 11. Summary & Key Takeaways

### What We Learned

1. **Fine-tuning** adapts a pre-trained model to your specific needs
2. **LoRA** makes fine-tuning efficient (train only a small fraction of the parameters, roughly 1% or less)
3. **Training curves** help you monitor learning and detect overfitting
4. **Train/Test split** is essential to measure real improvement

### What We Did

- ✅ Loaded Qwen2.5-0.5B (0.5 billion parameters)
- ✅ Split data: 10 training, 5 test examples
- ✅ Added LoRA adapters
- ✅ Trained for 15 epochs with loss visualization
- ✅ Measured improvement on unseen test data

---

*Questions? Reach out during office hours or on the course forum.*

## 📝 Required Tasks

Complete the following two task notebooks to practice what you've learned:

### Task 1: Sentiment Fine-Tuning
**Notebook**: `Task1_Sentiment_Finetuning.ipynb`

[![Open Task 1 in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dz-web3/DS-Tech-2026spring/blob/main/Module8_LLM_Finetuning/Task1_Sentiment_Finetuning.ipynb)

---

### Task 2: Prompting vs Fine-Tuning
**Notebook**: `Task2_Prompting_vs_Finetuning.ipynb`

[![Open Task 2 in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dz-web3/DS-Tech-2026spring/blob/main/Module8_LLM_Finetuning/Task2_Prompting_vs_Finetuning.ipynb)