# Week 6: LLM Fine-tuning & Optimization

## Learning Objectives
- Understand parameter-efficient fine-tuning methods (LoRA, QLoRA)
- Apply distributed training and model parallelism
- Explore quantization, pruning, and knowledge distillation
- Optimize LLMs for cost, speed, and memory

## Table of Contents
1. [Introduction to LLM Fine-tuning](#introduction)
2. [LoRA and QLoRA](#lora-qlora)
3. [Parameter-Efficient Fine-tuning](#peft)
4. [Distributed Training & Model Parallelism](#distributed-training)
5. [Quantization & Pruning](#quantization-pruning)
6. [Knowledge Distillation](#knowledge-distillation)
7. [Optimization for Production](#optimization)
8. [Hands-on Project](#hands-on-project)

## 1. Introduction to LLM Fine-tuning <a id='introduction'></a>
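Fine-tuning adapts a pre-trained LLM to a specific task or domain by continuing training on a smaller, targeted dataset. Modern techniques aim to do this with fewer trainable parameters, less memory, and lower cost than updating every weight.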

- Why fine-tune? Customization, performance, domain adaptation
- Challenges: compute, data, overfitting, catastrophic forgetting

## 2. LoRA and QLoRA <a id='lora-qlora'></a>
LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) are parameter-efficient fine-tuning methods.

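- LoRA: Freezes the pre-trained weights and injects trainable low-rank matrices into selected transformer layers, so only a small rank-`r` update is learned for each adapted weight matrix
- QLoRA: Keeps the frozen base model in 4-bit precision and trains LoRA adapters on top of it, which cuts memory use further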
- Example: Fine-tuning with LoRA using Hugging Face PEFT

```python
# Example: Fine-tuning with LoRA (Hugging Face PEFT)
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
# r is the adapter rank; lora_alpha scales the low-rank update
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=32, lora_dropout=0.1)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```
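
To see why LoRA is cheap, it helps to count parameters. The sketch below assumes each adapted weight matrix of shape `d_out x d_in` gets two factors, `A` (`r x d_in`) and `B` (`d_out x r`). The dimensions used (hidden size 768, a fused 768 to 2304 attention projection, 12 layers) and the choice to adapt only that projection are assumptions for illustration, so treat the numbers as an estimate rather than what any particular library will report.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix that LoRA leaves frozen (bias ignored)."""
    return d_in * d_out


# Assumed dimensions: a GPT-2-small-like fused query/key/value projection, 12 layers.
d_in, d_out, n_layers, r = 768, 2304, 12, 8

adapter_total = n_layers * lora_params(d_in, d_out, r)
frozen_total = n_layers * full_params(d_in, d_out)

print(f"LoRA parameters:   {adapter_total:,}")
print(f"Frozen parameters: {frozen_total:,}")
print(f"Ratio:             {adapter_total / frozen_total:.2%}")
```

Doubling `r` doubles the adapter size, which is the main knob for trading capacity against memory.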

## 3. Parameter-Efficient Fine-tuning <a id='peft'></a>
PEFT methods reduce the number of trainable parameters, making fine-tuning feasible on smaller hardware.

- Adapters, prefix-tuning, prompt-tuning
- Trade-offs: fewer trainable parameters lower memory and storage cost, but can limit how far the model adapts to the new task
- Example: Prefix-tuning with Hugging Face

```python
# Example: Prefix-tuning (conceptual)
from peft import PrefixTuningConfig, TaskType, get_peft_model

prefix_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=10)
# Wrap a freshly loaded base model (not one already wrapped with LoRA):
# model = get_peft_model(model, prefix_config)
```
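
A rough way to compare these methods is by how many parameters they train. The sketch below assumes prompt-tuning learns one embedding per virtual token, and that prefix-tuning (without a reparameterization network) learns a key and a value vector per virtual token in every layer. The model sizes are assumptions for illustration, and the helper names are hypothetical.

```python
def prompt_tuning_params(num_virtual_tokens: int, hidden_size: int) -> int:
    """One learned embedding vector per virtual token, prepended at the input layer."""
    return num_virtual_tokens * hidden_size


def prefix_tuning_params(num_virtual_tokens: int, hidden_size: int, num_layers: int) -> int:
    """One learned key and one learned value vector per virtual token in every layer."""
    return num_virtual_tokens * num_layers * 2 * hidden_size


# Assumed sizes, roughly those of GPT-2 small.
hidden_size, num_layers, num_virtual_tokens = 768, 12, 10

print("prompt-tuning:", prompt_tuning_params(num_virtual_tokens, hidden_size))
print("prefix-tuning:", prefix_tuning_params(num_virtual_tokens, hidden_size, num_layers))
```

Prefix-tuning touches every layer, so it trains more parameters than prompt-tuning for the same number of virtual tokens.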

## 4. Distributed Training & Model Parallelism <a id='distributed-training'></a>
Large models require distributed training across multiple GPUs or nodes.

- Data parallelism, model parallelism, pipeline parallelism
- Tools: DeepSpeed, Hugging Face Accelerate, PyTorch DDP
- Example: Launching distributed training with Accelerate

```bash
# Example: Launch distributed training (conceptual)
# Options for accelerate go before the script name; anything after train.py is passed to the script
accelerate launch --config_file accelerate_config.yaml train.py
```
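
Data parallelism, the simplest of the strategies listed above, gives every worker a full copy of the model and a different slice of the batch, then averages the gradients. The pure-Python sketch below simulates that averaging for a one-parameter model. The function names and data are hypothetical; real frameworks such as PyTorch DDP perform this step with an all-reduce across GPUs.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_grad(w, xs, ys, num_workers):
    """Split the batch evenly, compute one gradient per worker, then average them."""
    if len(xs) % num_workers != 0:
        raise ValueError("batch size must be divisible by num_workers")
    shard = len(xs) // num_workers
    local_grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    return sum(local_grads) / num_workers  # stands in for the all-reduce step


# Toy batch of 8 examples drawn from y = 3x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]

single = grad_mse(0.5, xs, ys)
parallel = data_parallel_grad(0.5, xs, ys, num_workers=4)
print(single, parallel)
```

With equal-sized shards the averaged gradient matches the single-device gradient up to floating-point rounding, which is why data parallelism leaves the optimization unchanged while spreading the compute.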

## 5. Quantization & Pruning <a id='quantization-pruning'></a>
Quantization reduces model size and inference cost by lowering precision. Pruning removes redundant weights.

- Post-training quantization, quantization-aware training
- Structured and unstructured pruning
- Example: Quantize a model with Hugging Face Optimum

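```python
# Example: Quantize a model (conceptual)
from optimum.intel.openvino import OVModelForCausalLM

# export=True converts the checkpoint to OpenVINO IR; load_in_8bit=True applies 8-bit weight quantization
ov_model = OVModelForCausalLM.from_pretrained("gpt2", export=True, load_in_8bit=True)
```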
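
To make the two ideas concrete, here is a minimal pure-Python sketch of symmetric per-tensor int8 quantization and unstructured magnitude pruning on a short list of weights. The function names and example weights are hypothetical, and production libraries use more refined schemes (per-channel scales, calibration data, structured sparsity).

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero out the `sparsity` fraction with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    drop = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


weights = [0.42, -1.30, 0.07, 0.91, -0.03, 0.58]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("max rounding error:", max_error)
print("50% pruned:", magnitude_prune(weights, 0.5))
```

The rounding error per weight is bounded by half the scale, so tensors with a few large outliers quantize less accurately than tensors with a narrow range.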

## 6. Knowledge Distillation <a id='knowledge-distillation'></a>
Distillation transfers knowledge from a large teacher model to a smaller student model.

- Teacher-student paradigm
- Loss functions: soft targets, hard targets
- Example: Distillation with Hugging Face Trainer

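```python
# Example: Distillation (conceptual)
# transformers.Trainer has no teacher_model argument. A common pattern is to subclass
# Trainer and override compute_loss to add a loss against the frozen teacher's logits.
# from transformers import Trainer, TrainingArguments
# class DistillationTrainer(Trainer): ...  # hypothetical subclass holding the teacher
# trainer = DistillationTrainer(model=student, args=TrainingArguments(...))
```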
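
The soft-target and hard-target losses listed above are usually combined into one objective. The sketch below is a pure-Python version for a single example, assuming the common formulation: a KL divergence between temperature-softened teacher and student distributions (scaled by the squared temperature) blended with ordinary cross-entropy on the true label. The `alpha` weighting, the default values, and the example logits are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """alpha * soft-target loss (KL scaled by T^2) + (1 - alpha) * hard-target cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    soft = (temperature ** 2) * kl
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1 - alpha) * hard


# Hypothetical logits over three classes; the true class is index 0.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
print(distillation_loss(student, teacher, label=0))
```

Setting `alpha=1.0` trains on the teacher's distribution alone, while `alpha=0.0` reduces to standard supervised training.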

## 7. Optimization for Production <a id='optimization'></a>
Optimizing LLMs for production involves balancing speed, cost, and accuracy.

- Model export and compilation (ONNX, TorchScript)
- Batch inference, request batching
- Monitoring and profiling
- Example: Export to TorchScript

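```python
# Example: Export model to TorchScript
import torch
from transformers import AutoModelForCausalLM

# torchscript=True returns tuples (needed for tracing); use_cache=False keeps cached keys/values out of the outputs
model = AutoModelForCausalLM.from_pretrained("gpt2", torchscript=True, use_cache=False)
model.eval()
traced = torch.jit.trace(model, torch.zeros(1, 8, dtype=torch.long))
traced.save("model.pt")
```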
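
Request batching comes down to two steps: grouping pending requests, then padding each group to a common length so it can run as one tensor. The sketch below shows both in pure Python; the helper names, the token IDs, and the pad ID are hypothetical.

```python
def make_batches(requests, max_batch_size):
    """Group pending requests into batches of at most max_batch_size."""
    return [requests[i:i + max_batch_size] for i in range(0, len(requests), max_batch_size)]


def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to the longest length and build an attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)  # 0 marks padding
    return input_ids, attention_mask


# Three hypothetical tokenized requests of different lengths.
pending = [[5, 6, 7], [8], [9, 10]]

for batch in make_batches(pending, max_batch_size=2):
    ids, mask = pad_batch(batch)
    print(ids, mask)
```

Right padding keeps the sketch simple; for generation with decoder-only models, left padding is usually preferred so that new tokens follow real tokens rather than padding.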

## 8. Hands-on Project <a id='hands-on-project'></a>

### Hands-on Project: Fine-tuning and Optimizing a Small LLM

**Goal:** Fine-tune a small LLM (e.g., DistilGPT2, TinyLlama) using LoRA or QLoRA on a custom dataset, optimize it for efficient inference, deploy it, and benchmark its performance.

#### Step 1: Prepare Your Custom Dataset
- Choose a domain (e.g., customer support, finance, healthcare, etc.)
- Format your dataset for language modeling or instruction tuning (CSV, JSON, or Hugging Face Datasets format)
- Split into train/validation sets
- Example: [Hugging Face Datasets Quickstart](https://huggingface.co/docs/datasets/quickstart)

```python
from datasets import load_dataset
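# Example: load a public dataset (replace with your own data for the project)
dataset = load_dataset("yelp_review_full")
train_data = dataset["train"]
# This dataset ships only train/test splits, so the test split stands in for validation here
val_data = dataset["test"]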
```
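
For a custom dataset you will typically format and split the records yourself. The sketch below uses only the standard library, with made-up customer-support records and a hypothetical prompt template; both are placeholders, and the template in particular should match whatever format you plan to use at inference time.

```python
import json
import random


def format_example(instruction: str, response: str) -> str:
    """Join an instruction/response pair into one training string (template is an assumption)."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"


def train_val_split(records, val_fraction=0.1, seed=42):
    """Shuffle deterministically, then hold out the first val_fraction of records."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical records standing in for your own data.
records = [{"instruction": f"Question {i}", "response": f"Answer {i}"} for i in range(20)]
train, val = train_val_split(records, val_fraction=0.2)

# One JSON object per line (JSON Lines), which load_dataset("json", ...) can read.
lines = [json.dumps({"text": format_example(**r)}) for r in train]
print(len(train), len(val))
print(lines[0])
```

Fixing the seed keeps the split reproducible, so before/after comparisons in later steps use the same validation examples.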

#### Step 2: Fine-tune with LoRA or QLoRA
- Use Hugging Face PEFT or similar libraries
- Configure LoRA/QLoRA parameters for your hardware
- Train and save the adapted model
- Example: [PEFT LoRA Quickstart](https://huggingface.co/docs/peft/task_guides/peft_lora)

```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=32, lora_dropout=0.1)
model = get_peft_model(model, lora_config)

# Tokenize your dataset and set up Trainer as usual
# ...
# trainer = Trainer(model=model, ...)
# trainer.train()
```
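
For instruction tuning, a common choice when tokenizing is to compute the loss only on the response tokens. The sketch below assumes the prompt and response have already been converted to token IDs, and masks the prompt positions with `-100`, the label value that PyTorch's cross-entropy loss ignores by default. The helper name and the token IDs are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_example(prompt_ids, response_ids, eos_id, max_length):
    """Concatenate prompt and response; mask prompt positions so only the response is trained on."""
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": [1] * min(len(input_ids), max_length),
        "labels": labels[:max_length],
    }


# Made-up token IDs; in practice these come from tokenizer(...)["input_ids"].
example = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], eos_id=50256, max_length=8)
print(example)
```

For plain language modeling (no prompt/response distinction), the labels are simply a copy of `input_ids`.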

#### Step 3: Apply Quantization or Distillation
- Quantize the model for faster, cheaper inference (e.g., 8-bit, 4-bit)
- Optionally, distill knowledge from a larger teacher model
- Save the optimized model
- Example: [Hugging Face Optimum Quantization](https://huggingface.co/docs/optimum/intel/usage_guides/quantization)

```python
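from optimum.intel.openvino import OVModelForCausalLM
# export=True converts to OpenVINO IR; load_in_8bit=True applies 8-bit weight quantization.
# "distilgpt2" is the base model; substitute the path to your fine-tuned, merged checkpoint.
ov_model = OVModelForCausalLM.from_pretrained("distilgpt2", export=True, load_in_8bit=True)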
```
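
Before choosing a bit width, a back-of-the-envelope estimate of weight storage is useful. The sketch below counts the weights only (no activations, KV cache, or quantization metadata), and the parameter count is an assumed round figure for a DistilGPT2-sized model.

```python
def weight_memory_mb(num_params: int, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


num_params = 82_000_000  # assumed, roughly DistilGPT2-sized

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {weight_memory_mb(num_params, bits):.0f} MB")
```

Measured file sizes will differ somewhat from these estimates, so compare against the saved model on disk when benchmarking.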

#### Step 4: Deploy and Benchmark
- Serve the model using FastAPI, Flask, or another framework
- Containerize with Docker for portability
- Benchmark latency, throughput, and memory usage
- Compare performance before and after optimization
- Example: [Hugging Face Inference Endpoints](https://huggingface.co/inference-endpoints)

```python
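from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Loaded once at startup; replace "distilgpt2" with your optimized model
pipe = pipeline("text-generation", model="distilgpt2")

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 50

@app.post("/generate")
def generate(req: GenerateRequest):
    # Read the prompt from the JSON body rather than a query parameter
    return pipe(req.prompt, max_new_tokens=req.max_new_tokens)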
```
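
For the benchmarking bullet, a small timing harness is enough to compare the model before and after optimization. The sketch below uses only the standard library; the `benchmark` helper is hypothetical, and the stand-in workload should be replaced with a real call such as `lambda: pipe("Hello")`.

```python
import statistics
import time


def benchmark(fn, n_runs=20, warmup=3):
    """Time repeated calls to fn and report latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    mean_ms = statistics.mean(latencies)
    return {
        "mean_ms": mean_ms,
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[p95_index],
        "throughput_rps": 1000 / mean_ms,  # sequential requests per second
    }


# Stand-in workload so the sketch runs without a model.
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

Run the same harness with identical prompts and generation settings for each model variant, and report tail latency (p95) alongside the mean, since averages hide occasional slow requests.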

#### Step 5: Document Your Workflow
- Summarize your process, challenges, and results
- Include code snippets, metrics, and lessons learned
- Suggest improvements for future projects

---

**References & Further Reading:**
- [Hugging Face PEFT Documentation](https://huggingface.co/docs/peft/index)
- [Hugging Face Transformers Fine-tuning Guide](https://huggingface.co/docs/transformers/training)
- [Hugging Face Optimum (Quantization)](https://huggingface.co/docs/optimum/intel/usage_guides/quantization)
- [Deploying with FastAPI](https://fastapi.tiangolo.com/tutorial/first-steps/)
- [Hugging Face Inference Endpoints](https://huggingface.co/inference-endpoints)