# LoRA, Quantization & Inference

In this notebook, you'll make LLM finetuning and inference practical on real hardware using LoRA and quantization.

**What you'll do:**
- Calculate training memory requirements and discover why optimizer states dominate the cost
- Implement absmax and zero-point quantization by hand, tracing every step with real numbers
- Build a LoRA layer from scratch, apply it to GPT-2, and verify the parameter savings
- LoRA-finetune a model using the HuggingFace PEFT library
- Load a quantized model and compare memory usage and output quality vs full precision

**For each exercise, PREDICT the output before running the cell.** Wrong predictions are more valuable than correct ones: they reveal gaps in your mental model.

In [None]:
# Setup: self-contained for Google Colab
!pip install -q transformers datasets accelerate peft bitsandbytes

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForCausalLM

# Reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
if device.type == 'cuda':
    print(f'GPU: {torch.cuda.get_device_name(0)}')
    print(f'Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')

print('\nSetup complete.')

---

## Exercise 1: Memory Calculator (Guided)

You have been finetuning GPT-2 for the last three lessons without worrying about memory. GPT-2 has 124M parameters, so it fits comfortably on any GPU. But the models people actually use are far larger: Llama 2 7B and Mistral 7B are more than 50 times larger, and a 70B model is more than 500 times larger.

Training with AdamW requires storing four things per parameter:
1. **Weights** in bfloat16: the model itself (2 bytes/param)
2. **Gradients** in bfloat16: one per parameter for the backward pass (2 bytes/param)
3. **Adam momentum** in float32: running average of gradients (4 bytes/param)
4. **Adam variance** in float32: running average of squared gradients (4 bytes/param)

That is 12 bytes per parameter. Let's calculate the actual memory cost.

**Before running, predict:**
- For a 7B parameter model in float16, how many GB are the weights alone?
- For full finetuning with AdamW (bf16 weights + bf16 gradients + fp32 optimizer states = 12 bytes/param), roughly how many GB total?
- Which component dominates: weights, gradients, or optimizer states?

In [None]:
def bytes_per_param(dtype):
    """Return bytes per parameter for a given dtype string."""
    mapping = {
        'float32': 4,
        'float16': 2,
        'bfloat16': 2,
        'int8': 1,
        'int4': 0.5,
    }
    return mapping[dtype]


def memory_for_inference(num_params, dtype='float16'):
    """Memory for inference = weights only."""
    return num_params * bytes_per_param(dtype)


def memory_for_training(num_params):
    """Memory for full finetuning with AdamW (bf16 mixed precision).
    
    Stores 12 bytes per parameter:
    - Weights in bfloat16: 2 bytes/param
    - Gradients in bfloat16: 2 bytes/param
    - Adam momentum in float32: 4 bytes/param
    - Adam variance in float32: 4 bytes/param
    
    Total: 2 + 2 + 4 + 4 = 12 bytes/param
    """
    weights_bf16 = num_params * 2        # bfloat16 for forward/backward
    gradients_bf16 = num_params * 2      # bfloat16 gradients
    adam_momentum = num_params * 4       # float32
    adam_variance = num_params * 4       # float32
    return {
        'weights_bf16': weights_bf16,
        'gradients_bf16': gradients_bf16,
        'adam_momentum': adam_momentum,
        'adam_variance': adam_variance,
        'total': weights_bf16 + gradients_bf16 + adam_momentum + adam_variance,
    }


def to_gb(num_bytes):
    """Convert bytes to GB (decimal, 1 GB = 1e9 bytes)."""
    return num_bytes / 1e9


# ---------------------------------------------------------------
# Compare GPT-2 (124M) vs Llama 2 7B
# ---------------------------------------------------------------
models = {
    'GPT-2': 124_000_000,
    'Llama 2 7B': 7_000_000_000,
    'Llama 2 13B': 13_000_000_000,
    'Llama 2 70B': 70_000_000_000,
}

print("=" * 70)
print("INFERENCE MEMORY (weights only)")
print("=" * 70)
for name, n_params in models.items():
    mem_fp32 = to_gb(memory_for_inference(n_params, 'float32'))
    mem_fp16 = to_gb(memory_for_inference(n_params, 'float16'))
    mem_int8 = to_gb(memory_for_inference(n_params, 'int8'))
    mem_int4 = to_gb(memory_for_inference(n_params, 'int4'))
    print(f"\n{name} ({n_params/1e9:.1f}B params):")
    print(f"  float32: {mem_fp32:6.1f} GB")
    print(f"  float16: {mem_fp16:6.1f} GB")
    print(f"  int8:    {mem_int8:6.1f} GB")
    print(f"  int4:    {mem_int4:6.1f} GB")

print("\n" + "=" * 70)
print("TRAINING MEMORY (full finetuning with AdamW, bf16 mixed precision)")
print("=" * 70)
print("Model: bf16 weights + bf16 gradients + fp32 Adam states = 12 bytes/param")
for name, n_params in models.items():
    mem = memory_for_training(n_params)
    print(f"\n{name} ({n_params/1e9:.1f}B params):")
    print(f"  Weights (bf16):        {to_gb(mem['weights_bf16']):6.1f} GB")
    print(f"  Gradients (bf16):      {to_gb(mem['gradients_bf16']):6.1f} GB")
    print(f"  Adam momentum (fp32):  {to_gb(mem['adam_momentum']):6.1f} GB")
    print(f"  Adam variance (fp32):  {to_gb(mem['adam_variance']):6.1f} GB")
    print(f"  ----------------------------------------")
    print(f"  TOTAL:                 {to_gb(mem['total']):6.1f} GB")
    
    # What fraction is optimizer states?
    optimizer_frac = (mem['adam_momentum'] + mem['adam_variance']) / mem['total'] * 100
    print(f"  Optimizer states:      {optimizer_frac:.0f}% of total")

In [None]:
# Visualize: where does training memory go?
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for ax, (name, n_params) in zip(axes, [('GPT-2', 124_000_000), ('Llama 2 7B', 7_000_000_000)]):
    mem = memory_for_training(n_params)
    
    components = ['Weights\n(bf16)', 'Gradients\n(bf16)', 'Adam\nmomentum\n(fp32)', 'Adam\nvariance\n(fp32)']
    values = [
        to_gb(mem['weights_bf16']),
        to_gb(mem['gradients_bf16']),
        to_gb(mem['adam_momentum']),
        to_gb(mem['adam_variance']),
    ]
    colors = ['#60a5fa', '#f87171', '#a78bfa', '#fbbf24']
    
    bars = ax.bar(components, values, color=colors, alpha=0.8, edgecolor='white', linewidth=0.5)
    ax.set_title(f'{name} ({n_params/1e9:.1f}B) \u2014 Training Memory', fontsize=12)
    ax.set_ylabel('Memory (GB)')
    
    # Add value labels on bars
    for bar, val in zip(bars, values):
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.02 * max(values),
                f'{val:.1f}', ha='center', va='bottom', fontsize=9, color='white')
    
    ax.set_ylim(0, max(values) * 1.15)
    ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("The optimizer states (momentum + variance) are the largest cost.")
print("For Llama 2 7B, they alone require 56 GB in float32 \u2014 two-thirds of the 84 GB total.")
print("A consumer RTX 4090 has 24 GB. Full finetuning does NOT fit.")

**What you just computed:** The memory wall that motivates LoRA and quantization. Training a 7B model requires at least **~84 GB**, not because of the weights (only 14 GB in bfloat16), but because of the gradients and especially the optimizer states. Adam stores two float32 tensors the same size as the model, accounting for two-thirds of the total.

This creates two separate problems:
- **Finetuning is too expensive:** Too many trainable parameters means too many gradients and optimizer states. **Solution: LoRA.**
- **Inference is too expensive:** The model itself is too large. **Solution: Quantization.**
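
To make the first problem concrete, here is a rough sketch of how the 12 bytes/param budget changes when only a small subset of parameters is trainable. The function name `training_memory_gb` and the 0.1% trainable fraction are illustrative assumptions, not measurements. Activations and framework overhead are ignored, as in the calculator above.

```python
def training_memory_gb(num_params, trainable_fraction=1.0):
    """Estimate training memory in GB (1 GB = 1e9 bytes), ignoring activations.

    Every parameter stores bf16 weights (2 bytes). Only trainable parameters
    also store bf16 gradients (2 bytes) and fp32 Adam momentum + variance (8 bytes).
    """
    trainable = num_params * trainable_fraction
    weight_bytes = num_params * 2
    grad_and_optimizer_bytes = trainable * (2 + 4 + 4)
    return (weight_bytes + grad_and_optimizer_bytes) / 1e9


n = 7_000_000_000
full = training_memory_gb(n, trainable_fraction=1.0)
# Assumption: 0.1% trainable is a hypothetical adapter size chosen for illustration.
adapter = training_memory_gb(n, trainable_fraction=0.001)
print(f"Full finetuning:        {full:.1f} GB")
print(f"0.1% trainable params:  {adapter:.2f} GB")
```

Under these assumptions, almost all of the remaining cost is the frozen weights themselves, which is the part quantization targets.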

---

## Exercise 2: Quantization by Hand (Guided)

In Scaling & Efficiency, you traded float32 for bfloat16 and lost almost nothing. Quantization pushes this further: map floating-point weights to integers (int8 or int4) for even greater memory savings.

We'll implement two quantization methods step by step and trace the math with real numbers.

### Part A: Absmax Quantization

The simplest method. For a vector of weights:
1. Find the max absolute value
2. Compute scale = max_abs / 127 (for int8, range [-127, 127])
3. Quantize: q = round(w / scale)
4. Dequantize: w_approx = q * scale
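
The four steps can be sketched in plain Python before moving to tensors. This is a minimal illustration, assuming a non-empty list with at least one non-zero weight. The helper name `absmax_roundtrip` and the example values are made up for this sketch and deliberately differ from the vector in the prediction below. It also checks a useful property: because each value is rounded to the nearest grid point, the per-weight error cannot exceed half the scale.

```python
def absmax_roundtrip(weights, qmax=127):
    """Quantize a list of floats to integers in [-qmax, qmax] and reconstruct them."""
    # Step 1-2: scale derived from the largest magnitude (assumes it is non-zero).
    scale = max(abs(w) for w in weights) / qmax
    # Step 3: round to the nearest integer and clamp to the int8 range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    # Step 4: map the integers back to floats.
    reconstructed = [q * scale for q in quantized]
    return quantized, reconstructed, scale


# Illustrative values (not the lesson's example vector).
weights = [0.25, -1.0, 0.6, -0.05]
quantized, reconstructed, scale = absmax_roundtrip(weights)
errors = [abs(w - r) for w, r in zip(weights, reconstructed)]

print("quantized:", quantized)
print("max error:", max(errors), "| half the scale:", scale / 2)
```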

**Before running, predict:** Given weights [-0.8, 0.3, 1.2, -0.5], the max absolute value is 1.2 and the scale is 1.2/127 ≈ 0.0094. What will the quantized int8 values be? What will the reconstruction error look like?

In [None]:
def absmax_quantize(weights):
    """Absmax quantization to int8.
    
    Steps:
    1. Find scale = max(|w|) / 127
    2. Quantize: q = round(w / scale), clamp to [-127, 127]
    3. Store q as int8 + scale as float32
    
    Returns: quantized (int8), scale (float32)
    """
    max_abs = weights.abs().max()
    scale = max_abs / 127.0
    quantized = torch.clamp(torch.round(weights / scale), -127, 127).to(torch.int8)
    return quantized, scale


def absmax_dequantize(quantized, scale):
    """Reconstruct float weights from quantized int8 values."""
    return quantized.float() * scale


# --- Trace with the exact example from the lesson ---
weights = torch.tensor([-0.8, 0.3, 1.2, -0.5])

print("ABSMAX QUANTIZATION WALKTHROUGH")
print("=" * 50)
print(f"Original weights:  {weights.tolist()}")
print(f"Max absolute value: {weights.abs().max().item()}")

q, scale = absmax_quantize(weights)
print(f"Scale factor:      {scale.item():.6f}")
print(f"Quantized (int8):  {q.tolist()}")

w_approx = absmax_dequantize(q, scale)
print(f"Reconstructed:     {[f'{v:.4f}' for v in w_approx.tolist()]}")

error = (weights - w_approx).abs()
print(f"Absolute error:    {[f'{v:.6f}' for v in error.tolist()]}")
print(f"Max error:         {error.max().item():.6f}")

print(f"\nMemory: 4 bytes (int8) + 4 bytes (scale) = 8 bytes")
print(f"vs original: 4 * 4 bytes (float32) = 16 bytes")
print(f"Savings: 2x for this 4-element example.")
print(f"(In practice, the scale factor is shared across a group of 32-128 values,")
print(f" so per-weight overhead shrinks and real savings approach ~4x for int8.)")

### Part B: The Outlier Problem

Absmax works well when weights are distributed evenly. But what happens when there are outliers? One extreme value stretches the scale and wastes precision for the majority of weights.

**Before running, predict:** Given weights [-0.1, 0.05, 0.02, -0.03, 8.5], the max absolute value is 8.5. The scale will be 8.5/127 ≈ 0.067. What happens to the small values near zero when we quantize?

In [None]:
# --- The outlier problem ---
weights_normal = torch.tensor([-0.1, 0.05, 0.02, -0.03, 0.08])
weights_outlier = torch.tensor([-0.1, 0.05, 0.02, -0.03, 8.5])

print("NORMAL DISTRIBUTION (no outliers)")
print("=" * 50)
q_normal, scale_normal = absmax_quantize(weights_normal)
w_approx_normal = absmax_dequantize(q_normal, scale_normal)
error_normal = (weights_normal - w_approx_normal).abs()
print(f"Original:    {weights_normal.tolist()}")
print(f"Scale:       {scale_normal.item():.6f}")
print(f"Quantized:   {q_normal.tolist()}")
print(f"Reconstructed: {[f'{v:.4f}' for v in w_approx_normal.tolist()]}")
print(f"Errors:      {[f'{v:.6f}' for v in error_normal.tolist()]}")

print(f"\nOUTLIER DISTRIBUTION (one extreme value)")
print("=" * 50)
q_outlier, scale_outlier = absmax_quantize(weights_outlier)
w_approx_outlier = absmax_dequantize(q_outlier, scale_outlier)
error_outlier = (weights_outlier - w_approx_outlier).abs()
print(f"Original:    {weights_outlier.tolist()}")
print(f"Scale:       {scale_outlier.item():.6f}")
print(f"Quantized:   {q_outlier.tolist()}")
print(f"Reconstructed: {[f'{v:.4f}' for v in w_approx_outlier.tolist()]}")
print(f"Errors:      {[f'{v:.6f}' for v in error_outlier.tolist()]}")

print(f"\nThe outlier (8.5) hijacks the scale.")
print(f"Small values near zero lose almost all their information.")
print(f"Most of the int8 range [-127, 127] is wasted on values that don't exist.")

In [None]:
# Visualize: quantization error for normal vs outlier distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Large-scale test: random normal weights vs weights with outliers
torch.manual_seed(42)
w_gaussian = torch.randn(1000) * 0.1  # Typical weight distribution
w_with_outliers = w_gaussian.clone()
w_with_outliers[::50] = torch.randn(20) * 5.0  # Add outliers every 50 values

for ax, (name, w) in zip(axes, [('Gaussian (no outliers)', w_gaussian), ('With outliers', w_with_outliers)]):
    q, s = absmax_quantize(w)
    w_approx = absmax_dequantize(q, s)
    errors = (w - w_approx).abs()
    
    ax.hist(errors.numpy(), bins=50, color='#f87171', alpha=0.7, edgecolor='white', linewidth=0.5)
    ax.set_title(f'{name}\nMean error: {errors.mean():.6f}, Max error: {errors.max():.6f}', fontsize=10)
    ax.set_xlabel('Absolute reconstruction error')
    ax.set_ylabel('Count')
    ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("With outliers, the reconstruction error is much larger for the majority of weights.")
print("This motivates zero-point quantization and more sophisticated methods like GPTQ/AWQ.")

### Part C: Zero-Point Quantization

Zero-point quantization shifts the range so the minimum maps to -128 and the maximum maps to 127. This better utilizes the integer range for asymmetric distributions.

**Before running, predict:** If all weights are positive (e.g., [0.1, 0.3, 0.5, 0.8]), absmax wastes the entire negative range of int8 [-127, 0). Zero-point quantization should use the full [-128, 127] range. Will the reconstruction error be smaller?

In [None]:
def zero_point_quantize(weights):
    """Zero-point quantization to int8.
    
    Steps:
    1. Find min and max of weights
    2. Compute scale = (max - min) / 255 (255 steps spanning the 256 int8 levels)
    3. Compute zero_point = round(-min / scale) - 128
    4. Quantize: q = round(w / scale) + zero_point, clamp to [-128, 127]
    
    Returns: quantized (int8), scale (float32), zero_point (int)
    """
    w_min = weights.min()
    w_max = weights.max()
    
    # Scale maps the float range to 256 int8 values
    scale = (w_max - w_min) / 255.0
    
    # Zero point: the integer that float 0.0 maps to (kept as int32, since it can fall outside [-128, 127])
    zero_point = torch.round(-w_min / scale).to(torch.int32) - 128
    
    # Quantize
    quantized = torch.clamp(torch.round(weights / scale) + zero_point, -128, 127).to(torch.int8)
    
    return quantized, scale, zero_point


def zero_point_dequantize(quantized, scale, zero_point):
    """Reconstruct float weights from zero-point quantized values."""
    return (quantized.float() - zero_point.float()) * scale


# Compare absmax vs zero-point on an asymmetric distribution
weights_asym = torch.tensor([0.1, 0.3, 0.5, 0.8, 0.2, 0.6, 0.9, 0.15])

print("ASYMMETRIC DISTRIBUTION (all positive values)")
print("=" * 55)
print(f"Original: {weights_asym.tolist()}")

# Absmax
q_abs, s_abs = absmax_quantize(weights_asym)
w_abs = absmax_dequantize(q_abs, s_abs)
err_abs = (weights_asym - w_abs).abs().mean()

print(f"\nAbsmax:")
print(f"  Quantized: {q_abs.tolist()}")
print(f"  Range used: [{q_abs.min().item()}, {q_abs.max().item()}] out of [-127, 127]")
print(f"  Mean error: {err_abs:.6f}")

# Zero-point
q_zp, s_zp, zp = zero_point_quantize(weights_asym)
w_zp = zero_point_dequantize(q_zp, s_zp, zp)
err_zp = (weights_asym - w_zp).abs().mean()

print(f"\nZero-point:")
print(f"  Quantized: {q_zp.tolist()}")
print(f"  Range used: [{q_zp.min().item()}, {q_zp.max().item()}] out of [-128, 127]")
print(f"  Zero point: {zp.item()} (the integer that float 0.0 maps to; can lie outside the int8 range)")
print(f"  Mean error: {err_zp:.6f}")

print(f"\nZero-point uses the full int8 range, giving {err_abs/err_zp:.1f}x lower error.")
print(f"Cost: one extra integer (zero_point) stored per group.")

In [None]:
# Apply quantization to a REAL model's weights and measure the error
print("QUANTIZING REAL GPT-2 WEIGHTS")
print("=" * 55)

# Load GPT-2
gpt2 = AutoModelForCausalLM.from_pretrained('gpt2')

# Pick a real weight matrix: the first attention query projection
# GPT-2 uses Conv1D, so the weight shape is (in_features, out_features)
real_weights = gpt2.transformer.h[0].attn.c_attn.weight.data[:, :768].clone()
print(f"Weight matrix shape: {real_weights.shape}")
print(f"Total values: {real_weights.numel():,}")
print(f"Min: {real_weights.min():.4f}, Max: {real_weights.max():.4f}")
print(f"Mean: {real_weights.mean():.4f}, Std: {real_weights.std():.4f}")

# Flatten for quantization
w_flat = real_weights.flatten()

# Absmax quantization
q_abs, s_abs = absmax_quantize(w_flat)
w_abs = absmax_dequantize(q_abs, s_abs)
err_abs = (w_flat - w_abs).abs()

# Zero-point quantization  
q_zp, s_zp, zp = zero_point_quantize(w_flat)
w_zp = zero_point_dequantize(q_zp, s_zp, zp)
err_zp = (w_flat - w_zp).abs()

print(f"\nAbsmax:     mean error = {err_abs.mean():.6f}, max error = {err_abs.max():.6f}")
print(f"Zero-point: mean error = {err_zp.mean():.6f}, max error = {err_zp.max():.6f}")
print(f"\nRelative error (absmax):     {(err_abs.mean() / w_flat.abs().mean() * 100):.2f}%")
print(f"Relative error (zero-point): {(err_zp.mean() / w_flat.abs().mean() * 100):.2f}%")

# Visualize weight distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

axes[0].hist(w_flat.numpy(), bins=100, color='#60a5fa', alpha=0.7, edgecolor='white', linewidth=0.3)
axes[0].set_title('GPT-2 Weight Distribution (attention Q projection)', fontsize=10)
axes[0].set_xlabel('Weight value')
axes[0].set_ylabel('Count')
axes[0].grid(alpha=0.3)

axes[1].hist(err_abs.numpy(), bins=100, color='#f87171', alpha=0.6, label='Absmax', edgecolor='white', linewidth=0.3)
axes[1].hist(err_zp.numpy(), bins=100, color='#34d399', alpha=0.6, label='Zero-point', edgecolor='white', linewidth=0.3)
axes[1].set_title('Quantization Reconstruction Error', fontsize=10)
axes[1].set_xlabel('Absolute error')
axes[1].set_ylabel('Count')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\nThe weight distribution is approximately Gaussian (most values near zero).")
print("Quantization error is tiny because the dense region maps well to the int8 grid.")
print("This is why INT8 (and even INT4) quantization works: neural network weights are compressible.")

**What you just implemented:** Both absmax and zero-point quantization from scratch, traced every step with real numbers, and applied them to actual GPT-2 weights.

Key takeaways:
- Absmax is the simplest method: divide by the scale (max absolute value / 127), round to int8. Works well for symmetric, Gaussian-like distributions.
- Outliers break absmax by stretching the scale and wasting precision. Zero-point quantization handles asymmetric distributions better.
- Real neural network weights are approximately Gaussian, which is ideal for quantization. The reconstruction error is tiny.
- This is why GPTQ and AWQ can achieve INT4 with less than 1% perplexity degradation: the weight information is highly compressible.
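
Part A noted that in practice one scale is shared across a group of values rather than the whole tensor. The sketch below illustrates why that helps with outliers, using plain Python lists. The helper names, the 8-element vector, and the group size of 4 are assumptions chosen for illustration (real libraries typically use groups of 32 to 128 values), and this is not the GPTQ or AWQ algorithm.

```python
def absmax_dequantized(values, qmax=127):
    """Absmax-quantize a list of floats and return the reconstructed floats."""
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        return list(values)  # all zeros: nothing to quantize
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def groupwise_dequantized(values, group_size):
    """Quantize each consecutive group of values with its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(absmax_dequantized(values[start:start + group_size]))
    return out


def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


# Seven small weights plus one outlier (hypothetical values).
weights = [-0.1, 0.05, 0.02, -0.03, 0.08, -0.06, 0.04, 8.5]

per_tensor_error = mean_abs_error(weights, absmax_dequantized(weights))
per_group_error = mean_abs_error(weights, groupwise_dequantized(weights, group_size=4))
print(f"one scale for all 8 values: {per_tensor_error:.6f}")
print(f"one scale per 4 values:     {per_group_error:.6f}")
```

The outlier still degrades the group it lands in, but the other group keeps a fine-grained scale. The cost is one extra scale stored per group.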

---

## Exercise 3: LoRA from Scratch (Supported)

Now the other half of the solution. LoRA freezes all base weights and adds tiny trainable low-rank matrices alongside them.

The original forward pass: `h = W @ x` (W is frozen).
The LoRA forward pass: `h = W @ x + (B @ A) @ x * (alpha / r)`.

- **A** is (r x d_in), initialized from a random normal distribution
- **B** is (d_out x r), initialized to **zeros** (so the LoRA update starts at zero and the layer initially matches the base layer)
- **alpha / r** is a scaling factor
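
Algebraically, the bypass is the same as adding a rank-r update to the frozen matrix: `W @ x + (alpha / r) * (B @ A) @ x = (W + (alpha / r) * B @ A) @ x`. The sketch below checks both this identity and the zero-initialization property on toy sizes. It uses plain Python lists so it runs without PyTorch. The dimensions, the seed, and `B_trained` (random values standing in for a trained B) are assumptions for illustration only.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def lora_forward(W, A, B, x, scale):
    """h = W x + scale * B (A x), with scale = alpha / r."""
    base = matvec(W, x)
    bypass = matvec(B, matvec(A, x))
    return [h + scale * d for h, d in zip(base, bypass)]


random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 4  # toy sizes, chosen only for illustration
scale = alpha / r

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen, d_out x d_in
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # d_out x r, zeros
x = [random.gauss(0, 1) for _ in range(d_in)]

# At initialization B = 0, so the bypass adds nothing to the base output.
init_out = lora_forward(W, A, B, x, scale)
init_gap = max(abs(a - b) for a, b in zip(init_out, matvec(W, x)))

# Stand-in for a trained B (random values, not real training).
B_trained = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d_out)]
delta_W = matmul(B_trained, A)  # d_out x d_in, rank at most r
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta_W)]

bypass_out = lora_forward(W, A, B_trained, x, scale)
merged_out = matvec(W_merged, x)
merge_gap = max(abs(a - b) for a, b in zip(bypass_out, merged_out))
print(f"gap at init: {init_gap:.2e} | bypass vs merged weight: {merge_gap:.2e}")
```

Both gaps should be zero up to floating-point rounding. The bypass form is what you train; the merged form shows that the adapted layer is still a single linear map.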

You'll implement the `LoRALinear` class, apply it to GPT-2, and count parameters.

Fill in the TODOs below. Each TODO is 1-3 lines.

<details>
<summary>💡 Solution</summary>

The key insight is that LoRA adds a low-rank bypass alongside the frozen original weight matrix. Because B is initialized to zeros, the LoRA output starts at zero, so the model begins identical to the pretrained model. Training gradually learns the task-specific adaptation through A and B.

```python
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad = False
        if self.base.bias is not None:
            self.base.bias.requires_grad = False
        d_in = base.in_features
        d_out = base.out_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r

    def forward(self, x):
        base_out = self.base(x)
        lora_out = (x @ self.A.T @ self.B.T) * self.scale
        return base_out + lora_out
```

For the parameter count: LoRA rank 8 on a 768x768 matrix adds 768*8 + 8*768 = 12,288 parameters. The full matrix has 589,824. That is about 2%, a 48x reduction.

Common mistake: forgetting to freeze the base weight with `requires_grad = False`. Without this, the base weights would also receive gradients, defeating the purpose of LoRA.

</details>

In [None]:
class LoRALinear(nn.Module):
    """A linear layer with a LoRA low-rank bypass.
    
    Original: h = W @ x
    LoRA:     h = W @ x + (B @ A) @ x * (alpha / r)
    
    W is frozen. Only A and B are trainable.
    B initialized to zeros -> the LoRA bypass starts as a no-op (output equals the base layer).
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        
        # TODO: Freeze the base layer's weights (and bias if present).
        # Set requires_grad = False on base.weight and base.bias.
        # YOUR CODE HERE (2 lines)
        
        d_in = base.in_features
        d_out = base.out_features
        
        # TODO: Create the LoRA parameters A and B.
        # A is (r x d_in), initialized with small random normal values (* 0.01)
        # B is (d_out x r), initialized to zeros
        # Both should be nn.Parameter so they are trainable.
        # YOUR CODE HERE (2 lines)
        
        self.scale = alpha / r
    
    def forward(self, x):
        # The frozen highway: original linear transformation
        base_out = self.base(x)
        
        # TODO: Compute the LoRA bypass.
        # lora_out = (x @ A^T @ B^T) * scale
        # This is the "trainable detour" that starts at zero.
        # YOUR CODE HERE (1 line)
        
        return base_out + lora_out


# --- Test the LoRA layer ---
torch.manual_seed(42)

# Create a base linear layer (simulating a frozen model weight)
base_linear = nn.Linear(768, 768, bias=False)

# Wrap it with LoRA
lora_layer = LoRALinear(base_linear, r=8, alpha=16)

# Test input
x = torch.randn(1, 10, 768)  # batch=1, seq_len=10, d_model=768

# Forward pass
output = lora_layer(x)
print(f"Input shape:  {x.shape}")
print(f"Output shape: {output.shape}")

# Verify the base output is the same (LoRA starts at zero because B=0)
with torch.no_grad():
    base_only = base_linear(x)
    diff = (output - base_only).abs().max().item()
    print(f"\nMax difference from base output: {diff:.10f}")
    print(f"(Should be ~0 because B is initialized to zeros)")

In [None]:
# --- Parameter counting: LoRA vs full finetuning ---

# Check which parameters are trainable
print("PARAMETER ANALYSIS")
print("=" * 50)
for name, param in lora_layer.named_parameters():
    print(f"  {name:15s} | shape: {str(param.shape):15s} | trainable: {param.requires_grad}")

total_params = sum(p.numel() for p in lora_layer.parameters())
trainable_params = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
frozen_params = total_params - trainable_params

print(f"\nTotal parameters:     {total_params:>10,}")
print(f"Trainable (LoRA):     {trainable_params:>10,}")
print(f"Frozen (base):        {frozen_params:>10,}")
print(f"Trainable fraction:   {trainable_params/total_params*100:.2f}%")
print(f"Reduction factor:     {frozen_params/trainable_params:.0f}x fewer trainable params")

In [None]:
# --- Apply LoRA to actual GPT-2 and count parameters ---

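def apply_lora_to_attention(model, r=8, alpha=16):
    """Register LoRA A/B parameters for every attention c_attn projection in GPT-2.
    
    GPT-2 uses Conv1D instead of nn.Linear, so we need to handle this.
    The combined c_attn projects to Q, K, V concatenated.
    For simplicity, we size LoRA for the full c_attn projection.
    
    Note: this only attaches A and B as parameters so they can be counted.
    It does not patch the forward pass, so they do not change the model output.
    """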
    lora_layers = []
    for i, block in enumerate(model.transformer.h):
        # GPT-2's c_attn is a Conv1D(768, 2304) -> projects to Q, K, V
        # We'll wrap it in a LoRA-style bypass
        old_attn = block.attn.c_attn
        d_in = old_attn.weight.shape[0]   # 768
        d_out = old_attn.weight.shape[1]  # 2304
        
        # Freeze the original weights
        old_attn.weight.requires_grad = False
        old_attn.bias.requires_grad = False
        
        # Create LoRA A and B as separate parameters on the block
        A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        B = nn.Parameter(torch.zeros(d_out, r))
        
        # Store them
        block.attn.lora_A = A
        block.attn.lora_B = B
        block.attn.lora_scale = alpha / r
        lora_layers.append((A, B))
    
    return lora_layers


# Load a fresh GPT-2
gpt2_lora = AutoModelForCausalLM.from_pretrained('gpt2')

# Freeze ALL parameters first
for param in gpt2_lora.parameters():
    param.requires_grad = False

# Apply LoRA
lora_pairs = apply_lora_to_attention(gpt2_lora, r=8, alpha=16)

# Count parameters
total = sum(p.numel() for p in gpt2_lora.parameters())
trainable = sum(p.numel() for p in gpt2_lora.parameters() if p.requires_grad)
frozen = total - trainable

print("GPT-2 WITH LoRA (rank=8, attention projections only)")
print("=" * 55)
print(f"Total parameters:     {total:>12,}")
print(f"Frozen (base model):  {frozen:>12,}")
print(f"Trainable (LoRA):     {trainable:>12,}")
print(f"Trainable fraction:   {trainable/total*100:.3f}%")
print(f"\nLoRA parameters per layer: {lora_pairs[0][0].numel() + lora_pairs[0][1].numel():,}")
print(f"Number of layers:          {len(lora_pairs)}")
print(f"Total LoRA parameters:     {trainable:,}")

# Memory comparison: LoRA vs full finetuning
# LoRA: frozen base in bf16 + LoRA params with full optimizer overhead
lora_training_mem = (
    frozen * 2 +          # frozen weights in bf16 (no gradients needed)
    trainable * 2 +       # LoRA weights in bf16
    trainable * 2 +       # LoRA gradients in bf16
    trainable * 4 +       # Adam momentum for LoRA (fp32)
    trainable * 4          # Adam variance for LoRA (fp32)
)
full_training_mem = memory_for_training(total)['total']

print(f"\nMEMORY COMPARISON")
print(f"Full finetuning:  {to_gb(full_training_mem):.1f} GB")
print(f"LoRA finetuning:  {to_gb(lora_training_mem):.2f} GB")
print(f"Savings:          {full_training_mem/lora_training_mem:.1f}x")

**What you just built:** A LoRA layer from scratch, then applied it to GPT-2. The key numbers:

- A single LoRA bypass on a 768x768 matrix with rank 8: **12,288 trainable parameters** vs 589,824 frozen. That is about **2%** of the full matrix.
- Applied across all attention projections in GPT-2: the trainable fraction is a tiny percentage of the total model.
- Because only the LoRA parameters need gradients and optimizer states, the memory cost for training drops dramatically.

The conceptual picture: LoRA is the surgical version of "freeze the backbone." Instead of freezing the backbone and adding a head at the end, you freeze everything and add tiny detours *inside* the backbone. Same philosophy: preserve pretrained knowledge, adapt minimally.
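
The parameter counts above follow from one formula: A contributes `r * d_in` values and B contributes `d_out * r`. A short sketch, assuming the GPT-2 shapes used in this exercise (12 blocks, each with a 768 -> 2304 `c_attn` projection). The helper name `lora_param_count` and the list of ranks are illustrative choices.

```python
def lora_param_count(d_in, d_out, r, num_layers=1):
    """Trainable parameters for A (r x d_in) plus B (d_out x r), per adapted layer."""
    return num_layers * r * (d_in + d_out)


# Single 768x768 matrix at rank 8 (the numbers quoted above).
single = lora_param_count(768, 768, r=8)
print(f"768x768, r=8: {single:,} LoRA params vs {768 * 768:,} frozen")

# GPT-2 small: 12 blocks, each with a 768 -> 2304 c_attn projection.
for rank in (4, 8, 16, 64):
    total_lora = lora_param_count(768, 2304, rank, num_layers=12)
    print(f"r={rank:<3d} -> {total_lora:>9,} trainable LoRA params")
```

The count grows linearly with the rank, so doubling `r` doubles the trainable parameters and their optimizer state while the frozen base stays the same size.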

---

## Exercise 4: LoRA Finetuning with PEFT (Supported)

Now use the HuggingFace PEFT library to do real LoRA finetuning. The conceptual understanding from Exercise 3 makes the library transparent: you know exactly what the `LoraConfig` parameters mean because you implemented them yourself.

You'll LoRA-finetune GPT-2 on a small text corpus (Tiny Shakespeare) and compare trainable parameters, memory usage, and generation quality before/after.

Fill in the TODOs. Each is 1-3 lines.

<details>
<summary>💡 Solution</summary>

The PEFT library wraps the model and handles LoRA injection automatically. The key configuration:

- `r=8`: rank of the low-rank matrices (same as what you implemented in Exercise 3)
- `lora_alpha=16`: the alpha scaling factor (alpha/r = 2.0 in this case)
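- `target_modules`: which layers to add LoRA to. For GPT-2, `c_attn` is the combined Q/K/V projection. `c_proj` is the name of the output projection in both the attention block and the MLP, and PEFT matches module names by suffix, so this setting adapts both.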
- `lora_dropout=0.05`: dropout on the LoRA path for regularization

```python
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn", "c_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
```

The training loop is the exact same heartbeat as always: forward, loss, zero_grad, backward, step. PEFT handles routing gradients only to the LoRA parameters.

</details>

In [None]:
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset

# Load a fresh GPT-2 for PEFT finetuning
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# GPT-2 has no pad token, so set it to eos
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

print(f"Model: {model_name}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")

In [None]:
# TODO: Create a LoraConfig with:
#   r=8 (rank of the low-rank matrices)
#   lora_alpha=16 (scaling factor)
#   target_modules=["c_attn", "c_proj"] (QKV projection, plus the attention and MLP output projections)
#   lora_dropout=0.05
#   bias="none" (don't train biases)
#   task_type=TaskType.CAUSAL_LM
# YOUR CODE HERE (1 block)


# TODO: Wrap the model with LoRA using get_peft_model(model, lora_config)
# YOUR CODE HERE (1 line)


# Print trainable parameter summary
peft_model.print_trainable_parameters()

# Detailed breakdown
print("\nLoRA modules added:")
for name, module in peft_model.named_modules():
    if 'lora' in name.lower() and hasattr(module, 'weight'):
        print(f"  {name}: {module.weight.shape}")

In [None]:
# --- Generate text BEFORE LoRA finetuning (baseline) ---

def generate_text(model, tokenizer, prompt, max_new_tokens=60):
    """Generate text from a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


eval_prompts = [
    "The meaning of life is",
    "Machine learning is a field that",
    "In a galaxy far, far away,",
]

print("BEFORE LoRA finetuning:")
print("=" * 60)
before_responses = {}
for prompt in eval_prompts:
    response = generate_text(peft_model, tokenizer, prompt)
    before_responses[prompt] = response
    print(f"\nPrompt: {prompt}")
    print(f"Response: {response[:200]}")
    print("-" * 40)

In [None]:
# --- Prepare a small finetuning dataset ---
# We'll finetune on a small text corpus to shift the model's style.
# Using a subset of the Tiny Shakespeare dataset for a visible style shift.

from torch.utils.data import Dataset, DataLoader

# Load a small text dataset
raw_dataset = load_dataset("karpathy/tiny_shakespeare", split="train")
text = raw_dataset[0]['text'][:50000]  # First 50K characters

# Tokenize into fixed-length chunks
MAX_LENGTH = 128

encodings = tokenizer(text, return_tensors="pt", truncation=False)
input_ids = encodings["input_ids"][0]

# Create non-overlapping chunks
chunks = [input_ids[i:i + MAX_LENGTH] for i in range(0, len(input_ids) - MAX_LENGTH, MAX_LENGTH)]
print(f"Created {len(chunks)} training chunks of length {MAX_LENGTH}")


class TextChunkDataset(Dataset):
    def __init__(self, chunks):
        self.chunks = chunks
    
    def __len__(self):
        return len(self.chunks)
    
    def __getitem__(self, idx):
        chunk = self.chunks[idx]
        return {"input_ids": chunk, "labels": chunk.clone()}


train_dataset = TextChunkDataset(chunks)
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)

In [None]:
# --- LoRA Finetuning loop ---
# Same heartbeat as always: forward, loss, zero_grad, backward, step.
# The only difference: PEFT routes gradients only to LoRA parameters.

NUM_EPOCHS = 3
LEARNING_RATE = 3e-4  # Higher LR is fine for LoRA (fewer parameters)

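# Only optimize LoRA parameters: PEFT froze the base weights (requires_grad=False),
# so filter them out instead of handing frozen tensors to the optimizer.
trainable_params = [p for p in peft_model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=LEARNING_RATE)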
peft_model.train()

losses = []
step_count = 0

# Track memory if on GPU
if device.type == 'cuda':
    torch.cuda.reset_peak_memory_stats()

print(f"Training for {NUM_EPOCHS} epochs on {len(train_dataset)} chunks...")
print(f"Batch size: 4, Steps per epoch: ~{len(train_dataloader)}")
print()

for epoch in range(NUM_EPOCHS):
    epoch_loss = 0.0
    n_batches = 0
    
    for batch in train_dataloader:
        input_ids = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)
        
        outputs = peft_model(input_ids=input_ids, labels=labels)
        loss = outputs.loss
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        n_batches += 1
        step_count += 1
        losses.append(loss.item())
        
        if step_count % 20 == 0:
            print(f"  Step {step_count:4d} | Loss: {loss.item():.4f}")
    
    avg_loss = epoch_loss / n_batches
    print(f"\nEpoch {epoch + 1}/{NUM_EPOCHS} \u2014 Avg loss: {avg_loss:.4f}")
    print()

if device.type == 'cuda':
    peak_mem = torch.cuda.max_memory_allocated() / 1e9
    print(f"Peak GPU memory during training: {peak_mem:.2f} GB")

print(f"Training complete! {step_count} total steps.")

In [None]:
# Plot training loss
plt.figure(figsize=(10, 4))
plt.plot(losses, linewidth=1, color='#a78bfa', alpha=0.5, label='Per-step loss')

# Smoothed
window = 15
if len(losses) > window:
    # Trailing mean over at most `window` steps (shorter at the start)
    smoothed = []
    for i in range(len(losses)):
        recent = losses[max(0, i - window + 1):i + 1]
        smoothed.append(sum(recent) / len(recent))
    plt.plot(smoothed, linewidth=2, color='#a78bfa', label=f'Smoothed ({window}-step)')

plt.xlabel('Training Step')
plt.ylabel('Cross-Entropy Loss')
plt.title('LoRA Finetuning Loss (PEFT)')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

In [None]:
# --- Compare before vs after LoRA finetuning ---
peft_model.eval()

print("BEFORE vs AFTER LoRA finetuning:")
print("=" * 60)

for prompt in eval_prompts:
    after_response = generate_text(peft_model, tokenizer, prompt)
    
    print(f"\nPrompt: {prompt}")
    print(f"\n  BEFORE (base GPT-2):")
    print(f"  {before_responses[prompt][:200]}")
    print(f"\n  AFTER (LoRA finetuned on Shakespeare):")
    print(f"  {after_response[:200]}")
    print("-" * 60)

print("\nThe LoRA-finetuned model should show a stylistic shift toward Shakespeare.")
print("This was achieved by training only the tiny LoRA parameters \u2014 the base")
print("model weights are completely unchanged.")

**What you just did:** Used the PEFT library to LoRA-finetune GPT-2, achieving a stylistic shift by training a tiny fraction of the total parameters.

The library is not magic: you know exactly what it does because you built `LoRALinear` from scratch in Exercise 3. `LoraConfig(r=8, lora_alpha=16)` creates the same A and B matrices with the same scaling factor. The library just handles the plumbing: injecting LoRA into the right layers, routing gradients, and managing the adapter weights.

Key observation: the learning rate for LoRA finetuning (3e-4) is higher than for full finetuning (5e-5 typically). With fewer trainable parameters, each parameter needs to change more per step.
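
To make the equivalence with `LoRALinear` concrete, the adapter size can be counted by hand. The sketch below assumes LoRA is applied only to GPT-2's fused attention projection `c_attn` (768 inputs, 2304 outputs) in each of the 12 blocks. That target choice is an assumption; if your `LoraConfig` targets other modules, the totals will differ, so compare against `print_trainable_parameters()`.

```python
# Hand count of LoRA parameters for GPT-2 small.
# Assumption: adapters are attached to c_attn only (one per transformer block).

def lora_param_count(d_in, d_out, r):
    """A is (r x d_in) and B is (d_out x r), so an adapter adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

r, lora_alpha = 8, 16
n_layers = 12
d_model = 768
d_qkv = 3 * d_model  # c_attn produces Q, K and V in one projection

per_layer = lora_param_count(d_model, d_qkv, r)
total_lora = per_layer * n_layers
scaling = lora_alpha / r  # multiplier on the low-rank update before it is added to the frozen output

base_params = 124_439_808  # GPT-2 small
print(f"LoRA params per layer: {per_layer:,}")
print(f"LoRA params total:     {total_lora:,}")
print(f"Scaling (alpha / r):   {scaling}")
print(f"Trainable fraction:    {total_lora / (base_params + total_lora):.3%}")
```

Doubling `r` doubles the adapter size, while `lora_alpha / r` keeps the magnitude of the update comparable across ranks.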

---

## Exercise 5: Quantized Inference (Minimal Scaffolding)

The final exercise brings it full circle. You've seen the theory of quantization (Exercise 2) and LoRA (Exercises 3-4). Now load a quantized model and see quantization in practice: compare memory usage and output quality between full-precision and quantized models.

This exercise requires a GPU with bitsandbytes support. If running on CPU, the GPU-only steps (memory measurement, quantized loading, text generation) are skipped and only the parameter-memory estimates are printed.

**Note:** This exercise uses the `generate_text` function defined in Exercise 4. Make sure you have run all previous cells before proceeding.

**Your task:**
- Load GPT-2 in full precision (float32) and in 8-bit quantized mode using bitsandbytes
- Compare memory usage between the two
- Generate text from both and compare output quality
- Reflect on the memory-quality tradeoff

<details>
<summary>💡 Solution</summary>

The key insight: bitsandbytes integrates with HuggingFace `transformers` to load models in lower precision automatically. You build a `BitsAndBytesConfig` with `load_in_8bit=True` (or `load_in_4bit=True`), pass it to `from_pretrained()` as `quantization_config`, and bitsandbytes handles the quantization.

```python
from transformers import BitsAndBytesConfig

# 8-bit quantization config
bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# Load quantized model
model_8bit = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=bnb_config_8bit,
    device_map="auto",
)

# For 4-bit:
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat4 from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

For GPT-2 (124M params), the memory savings are modest because the model is already small. The real impact is on 7B+ models, where float32 requires ~28 GB, float16 requires ~14 GB, and 4-bit fits in ~3.5 GB.

Common mistake: trying to run bitsandbytes on CPU. It requires a CUDA GPU.

</details>

In [None]:
# --- Load full-precision model and measure memory ---

import gc

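# Clean up models from earlier exercises, plus the optimizer (it still holds
# references to the LoRA parameters). Names that were never defined are skipped.
for name in ("gpt2", "gpt2_lora", "model", "peft_model", "optimizer"):
    globals().pop(name, None)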
gc.collect()
if device.type == 'cuda':
    torch.cuda.empty_cache()

# Load full-precision model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

if device.type == 'cuda':
    torch.cuda.reset_peak_memory_stats()

model_fp32 = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

# Calculate model memory from parameter sizes
fp32_param_mem = sum(p.numel() * p.element_size() for p in model_fp32.parameters())

if device.type == 'cuda':
    fp32_gpu_mem = torch.cuda.max_memory_allocated() / 1e6
    print(f"Full-precision (float32):")
    print(f"  Parameter memory: {fp32_param_mem / 1e6:.1f} MB")
    print(f"  GPU memory used:  {fp32_gpu_mem:.1f} MB")
    fp32_gpu = fp32_gpu_mem
    
    # Generate text
    fp32_outputs = {}
    test_prompts = [
        "The future of artificial intelligence is",
        "Once upon a time, in a land far away,",
        "The most important thing about programming is",
    ]
    print("\nFull-precision outputs:")
    for prompt in test_prompts:
        response = generate_text(model_fp32, tokenizer, prompt)
        fp32_outputs[prompt] = response
        print(f"  {prompt}")
        print(f"  -> {response[:150]}")
        print()
        
    print(f"\nNote: For GPT-2 (124M params), the absolute savings are modest.")
    print(f"For a 7B model, float32 = ~28 GB vs int8 = ~7 GB vs int4 = ~3.5 GB.")
    print(f"That's the difference between 'fits on your GPU' and 'does not fit'.")
    
print(f"\nParameter memory breakdown:")
print(f"  float32: {fp32_param_mem / 1e6:.1f} MB ({fp32_param_mem / sum(p.numel() for p in model_fp32.parameters()):.0f} bytes/param)")
print(f"  float16 would be: {fp32_param_mem / 2 / 1e6:.1f} MB")
print(f"  int8 would be:    {fp32_param_mem / 4 / 1e6:.1f} MB")
print(f"  int4 would be:    {fp32_param_mem / 8 / 1e6:.1f} MB")

In [None]:
# --- Load quantized model (requires GPU with bitsandbytes) ---

if device.type == 'cuda':
    from transformers import BitsAndBytesConfig
    
    # Clean up
    del model_fp32
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    
    # Load in 8-bit
    bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)
    model_8bit = AutoModelForCausalLM.from_pretrained(
        "gpt2",
        quantization_config=bnb_config_8bit,
        device_map="auto",
    )
    
    int8_gpu_mem = torch.cuda.max_memory_allocated() / 1e6
    
    print(f"8-bit quantized:")
    print(f"  GPU memory used: {int8_gpu_mem:.1f} MB")
    print(f"  vs full precision: {fp32_gpu:.1f} MB")
    print(f"  Savings: {fp32_gpu / int8_gpu_mem:.1f}x")
    
    # Generate text
    print("\n8-bit outputs:")
    int8_outputs = {}
    for prompt in test_prompts:
        response = generate_text(model_8bit, tokenizer, prompt)
        int8_outputs[prompt] = response
        print(f"  {prompt}")
        print(f"  -> {response[:150]}")
        print()
    
    # Side-by-side comparison
    print("\n" + "=" * 60)
    print("COMPARISON: Full Precision vs 8-bit Quantized")
    print("=" * 60)
    for prompt in test_prompts:
        print(f"\nPrompt: {prompt}")
        print(f"  FP32: {fp32_outputs[prompt][:150]}")
        print(f"  INT8: {int8_outputs[prompt][:150]}")
        match = fp32_outputs[prompt][:100] == int8_outputs[prompt][:100]
        print(f"  First 100 chars match: {match}")
    
    print(f"\nMemory: {fp32_gpu:.0f} MB (FP32) vs {int8_gpu_mem:.0f} MB (INT8)")
    print("Quality: check the matches above; 8-bit output usually stays close to FP32 despite using 1/4 the bits per weight.")
    
    del model_8bit
    gc.collect()
    torch.cuda.empty_cache()
    
    # Try 4-bit if available
    torch.cuda.reset_peak_memory_stats()
    try:
        bnb_config_4bit = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",  # NormalFloat4 from the QLoRA paper
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        model_4bit = AutoModelForCausalLM.from_pretrained(
            "gpt2",
            quantization_config=bnb_config_4bit,
            device_map="auto",
        )
        
        int4_gpu_mem = torch.cuda.max_memory_allocated() / 1e6
        
        print(f"\n\n4-bit quantized (NF4):")
        print(f"  GPU memory used: {int4_gpu_mem:.1f} MB")
        print(f"  vs full precision: {fp32_gpu:.1f} MB")
        print(f"  Savings: {fp32_gpu / int4_gpu_mem:.1f}x")
        
        print("\n4-bit outputs:")
        for prompt in test_prompts:
            response = generate_text(model_4bit, tokenizer, prompt)
            print(f"  {prompt}")
            print(f"  -> {response[:150]}")
            print()
            
        del model_4bit
        gc.collect()
        torch.cuda.empty_cache()
        
    except Exception as e:
        print(f"\n4-bit loading failed: {e}")
        print("This may require a newer GPU or updated bitsandbytes.")

print("\n" + "=" * 60)
print("THE BIG PICTURE")
print("=" * 60)
print("For GPT-2 (124M), the savings are modest because the model is small.")
print("For real-world models, quantization is transformative:")
print("  Llama 2 7B:  float16 = 14 GB  |  int8 = 7 GB   |  int4 = 3.5 GB")
print("  Llama 2 13B: float16 = 26 GB  |  int8 = 13 GB  |  int4 = 6.5 GB")
print("  Llama 2 70B: float16 = 140 GB |  int8 = 70 GB  |  int4 = 35 GB")
print("\nA consumer RTX 4090 (24 GB) can run a 13B model in 4-bit.")
print("With QLoRA, you can FINETUNE a 7B model on that same GPU.")
print("\nThe 'massive GPU' barrier is largely a myth for parameter-efficient methods.")

**What you just observed:** Quantized models produce nearly identical outputs to full-precision models while using significantly less memory. For GPT-2 the absolute numbers are small, but the principle scales:

| Model | float16 | int8 | int4 (NF4) |
|-------|---------|------|------------|
| GPT-2 (124M) | 0.25 GB | 0.12 GB | 0.06 GB |
| Llama 2 7B | 14 GB | 7 GB | 3.5 GB |
| Llama 2 13B | 26 GB | 13 GB | 6.5 GB |
| Llama 2 70B | 140 GB | 70 GB | 35 GB |

The quality is preserved because neural network weights are highly compressible: most cluster near zero in a roughly Gaussian distribution. The information density is much lower than the bit-width suggests.
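
One way to get a feel for this is to round-trip synthetic weights through absmax quantization and measure the error. The sketch below is only an illustration: the weights are random Gaussian samples (standard deviation 0.02 is an assumed stand-in, not values from GPT-2), and the quantizer is the plain per-tensor absmax scheme from Exercise 2, not the block-wise NF4 scheme bitsandbytes uses.

```python
import random

def absmax_quantize(weights, bits=8):
    """Symmetric absmax quantization: the largest |w| maps to the top integer level."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 7 for int4
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def relative_error(weights, bits):
    """Mean absolute round-trip error divided by mean absolute weight."""
    q, scale = absmax_quantize(weights, bits)
    restored = [qi * scale for qi in q]
    mean_err = sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    return mean_err / mean_abs

random.seed(0)
# Synthetic stand-in for one weight matrix (assumption: zero mean, std 0.02)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

err8 = relative_error(weights, 8)
err4 = relative_error(weights, 4)
print(f"int8 relative error: {err8:.2%}")
print(f"int4 relative error: {err4:.2%}")
```

Naive per-tensor int4 loses noticeably more than int8 here, which is part of why practical 4-bit methods quantize in small blocks and place their levels where the weights actually concentrate.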

---

## Key Takeaways

1. **Full finetuning hits a memory wall.** Training a 7B model requires ~84 GB: bf16 weights (14 GB) + bf16 gradients (14 GB) + fp32 Adam optimizer states (56 GB). The optimizer states, not the weights, are the bottleneck.

2. **Quantization maps float weights to integers with minimal quality loss.** Absmax and zero-point quantization are the building blocks. Real methods (GPTQ, AWQ) achieve INT4 with less than 1% degradation. The weights are compressible because they cluster near zero.

3. **LoRA freezes base weights and adds tiny trainable low-rank matrices.** Train 0.1-1% of parameters, match full finetuning quality. The low-rank constraint acts as implicit regularization: it is a feature, not a compromise.

4. **QLoRA combines both: quantized base + LoRA adapters.** Finetune a 7B model with ~4 GB of weight and adapter memory (plus activations) instead of ~84 GB. A consumer GPU can finetune and serve real LLMs.

5. **Finetuning is a refinement, not a revolution.** Weight changes during finetuning are low-rank because you are adjusting, not rewriting. LoRA captures this insight directly.
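
The arithmetic behind takeaways 1 and 4 can be reproduced in a few lines. This is a rough sketch under the same assumptions as Exercise 1 (bf16 weights and gradients, two fp32 Adam moments per parameter); activations, framework overhead, and the LoRA adapters themselves are deliberately left out. The helper name `training_memory_gb` is made up for this example.

```python
def training_memory_gb(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """Per-parameter cost: bf16 weights, bf16 gradients, two fp32 Adam moments (4 + 4 bytes)."""
    gb = 1e9
    return {
        "weights": n_params * weight_bytes / gb,
        "gradients": n_params * grad_bytes / gb,
        "optimizer": n_params * optimizer_bytes / gb,
    }

n_params = 7e9  # a 7B-parameter model
full = training_memory_gb(n_params)
full_total = sum(full.values())

# QLoRA keeps the frozen base in 4-bit (0.5 bytes/param); gradients and
# optimizer states exist only for the small adapters, ignored here.
qlora_base_gb = n_params * 0.5 / 1e9

for part, size in full.items():
    print(f"{part:>10}: {size:5.1f} GB")
print(f"{'total':>10}: {full_total:5.1f} GB (full finetuning)")
print(f"{'4-bit base':>10}: {qlora_base_gb:5.1f} GB (QLoRA, before adapters and activations)")
```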