# Lab 1: Model Architecture Exploration

**Module 2 - Core Implementation**

## Objectives

- Inspect model architecture from GGUF files
- Understand transformer layer structure
- Calculate parameter counts and memory requirements
- Compare different model architectures

## Prerequisites

- Completed Module 1
- Python 3.8+
- llama-cpp-python installed
- At least one GGUF model file

## Setup

In [None]:
# Install dependencies
%pip install llama-cpp-python numpy matplotlib

In [None]:
import sys
sys.path.append('../code')

from architecture_inspector import GGUFReader, extract_architecture, ModelArchitecture
import matplotlib.pyplot as plt
import numpy as np

## Exercise 1: Read Model Metadata

Load and inspect model architecture from a GGUF file.
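
To see what a reader has to parse first, here is a minimal sketch of the fixed-size GGUF header. It assumes a little-endian file in GGUF version 2 or later, where the header holds a 4-byte magic, a `uint32` version, and two `uint64` counts. The function names `parse_gguf_header` and `read_gguf_header` are hypothetical and are not part of the course's `architecture_inspector` module; the byte string at the end is synthetic, not taken from a real model.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 + uint64 + uint64


def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF header (assumes little-endian, version 2 or later)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes for a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic")
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }


def read_gguf_header(path: str) -> dict:
    """Read only the first 24 bytes of a file and parse them."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER_SIZE))


# Synthetic header for demonstration only
fake_header = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
print(parse_gguf_header(fake_header))
```

The metadata key/value pairs that this lab inspects follow immediately after these 24 bytes.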

In [None]:
# Path to your GGUF model
MODEL_PATH = "/path/to/your/model.gguf"  # Update this!

# Read GGUF metadata
reader = GGUFReader(MODEL_PATH)
reader.read()

# Display all metadata keys
print("Available metadata keys:")
for key in sorted(reader.metadata.keys()):
    print(f"  {key}")

In [None]:
# Extract architecture
arch = extract_architecture(reader.metadata)

print(f"Model Name: {arch.name}")
print(f"Architecture: {arch.architecture}")
print(f"\nCore Dimensions:")
print(f"  Layers: {arch.n_layer}")
print(f"  Hidden Size: {arch.n_embd:,}")
print(f"  Vocabulary: {arch.n_vocab:,}")
print(f"  Context Length: {arch.n_ctx_train:,}")
print(f"\nAttention:")
print(f"  Query Heads: {arch.n_head}")
print(f"  KV Heads: {arch.n_head_kv}")
print(f"  Head Dimension: {arch.head_dim}")
print(f"  GQA Ratio: {arch.gqa_ratio}:1")

## Exercise 2: Parameter Count Calculation

Calculate the approximate number of parameters in the model.
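
As a sanity check on the formula, the sketch below counts parameters from plain integers, without a model file. The helper `count_params` is hypothetical, and the dimensions are illustrative values for a LLaMA-style 7B configuration (SwiGLU feed-forward, two norms per layer, one final norm, untied output matrix, no biases); other architectures will need a different breakdown.

```python
def count_params(n_vocab, n_embd, n_layer, n_ff, n_head, n_head_kv,
                 tied_embeddings=False):
    """Approximate parameter count for a LLaMA-style decoder (no biases)."""
    head_dim = n_embd // n_head
    kv_dim = n_head_kv * head_dim              # K/V width; equals n_embd for MHA

    embedding = n_vocab * n_embd
    attention = 2 * n_embd * n_embd + 2 * n_embd * kv_dim  # Q, O + K, V
    ffn = 3 * n_embd * n_ff                    # gate, up, down
    norms = 2 * n_embd                         # two norm weight vectors per layer
    final_norm = n_embd
    output = 0 if tied_embeddings else n_vocab * n_embd

    return embedding + n_layer * (attention + ffn + norms) + final_norm + output


# Illustrative dimensions (assumed, not read from a file)
total = count_params(n_vocab=32000, n_embd=4096, n_layer=32,
                     n_ff=11008, n_head=32, n_head_kv=32)
print(f"{total:,}")  # 6,738,415,616
```

Passing a smaller `n_head_kv` shows how much grouped-query attention shrinks the K and V projections.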

In [None]:
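# Calculate parameter breakdown (approximate: biases and the final norm are ignored)
embedding_params = arch.n_vocab * arch.n_embd
# K and V project to n_head_kv * head_dim, which is smaller than n_embd under GQA/MQA
kv_dim = arch.n_head_kv * arch.head_dim
# Q and O are n_embd x n_embd (assumes n_head * head_dim == n_embd); K and V are n_embd x kv_dim
attention_params_per_layer = 2 * arch.n_embd * arch.n_embd + 2 * arch.n_embd * kv_dim
ffn_params_per_layer = 3 * arch.n_embd * arch.n_ff  # Gate, Up, Down
norm_params_per_layer = 2 * arch.n_embd  # 2 norm weight vectors per layer

layer_params = attention_params_per_layer + ffn_params_per_layer + norm_params_per_layer
total_layer_params = layer_params * arch.n_layer
# Assumes a separate output matrix; models with tied embeddings reuse the embedding weights
output_params = arch.n_vocab * arch.n_embd

total_params = embedding_params + total_layer_params + output_params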

print("Parameter Breakdown:")
print(f"  Embedding:        {embedding_params:>15,} ({embedding_params/total_params*100:>5.1f}%)")
print(f"  Transformer:      {total_layer_params:>15,} ({total_layer_params/total_params*100:>5.1f}%)")
print(f"    - Attention:    {attention_params_per_layer * arch.n_layer:>15,}")
print(f"    - FFN:          {ffn_params_per_layer * arch.n_layer:>15,}")
print(f"    - Layer Norm:   {norm_params_per_layer * arch.n_layer:>15,}")
print(f"  Output:           {output_params:>15,} ({output_params/total_params*100:>5.1f}%)")
print(f"  {'─'*50}")
print(f"  Total:            {total_params:>15,} (~{total_params/1e9:.1f}B)")

In [None]:
# Visualize parameter distribution
fig, ax = plt.subplots(figsize=(10, 6))

components = ['Embedding', 'Attention\n(all layers)', 'FFN\n(all layers)', 'Layer Norms', 'Output']
params = [
    embedding_params,
    attention_params_per_layer * arch.n_layer,
    ffn_params_per_layer * arch.n_layer,
    norm_params_per_layer * arch.n_layer,
    output_params
]

colors = plt.cm.Set3(range(len(components)))
ax.bar(components, [p/1e9 for p in params], color=colors)
ax.set_ylabel('Parameters (Billions)')
ax.set_title(f'{arch.name} Parameter Distribution')
ax.grid(axis='y', alpha=0.3)

for i, param in enumerate(params):
    ax.text(i, param/1e9, f'{param/1e9:.2f}B', ha='center', va='bottom')

plt.tight_layout()
plt.show()

## Exercise 3: Memory Requirements

Calculate memory requirements for different quantization levels and context lengths.
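
Both estimates reduce to short formulas, sketched here with hypothetical helper names. The block layout is an assumption based on the ggml `Q8_0` and `Q4_0` formats, which store one FP16 scale per block of 32 weights (34 and 18 bytes per block). The KV cache formula assumes one K and one V tensor per layer and ignores any runtime padding or allocator overhead. All dimensions below are illustrative.

```python
GIB = 1024 ** 3


def quantized_weight_bytes(n_params: int, block_bytes: int, block_size: int = 32) -> float:
    """Weight storage when every block of `block_size` weights costs `block_bytes`."""
    return n_params * block_bytes / block_size


def kv_cache_bytes(n_layer: int, n_ctx: int, n_head_kv: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """K and V each hold n_ctx * n_head_kv * head_dim elements per layer."""
    return 2 * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_elem


n_params = 7_000_000_000  # hypothetical parameter count
for name, block_bytes in [("Q8_0", 34), ("Q4_0", 18)]:
    size = quantized_weight_bytes(n_params, block_bytes)
    print(f"{name}: {block_bytes / 32:.4f} bytes/param, {size / GIB:.2f} GiB")

# Illustrative 32-layer model with head_dim 128 at 4096 tokens of context, FP16 cache
for n_head_kv in (32, 8):
    size = kv_cache_bytes(n_layer=32, n_ctx=4096, n_head_kv=n_head_kv, head_dim=128)
    print(f"{n_head_kv} KV heads: {size / GIB:.2f} GiB")
```

The cache size is linear in both `n_ctx` and `n_head_kv`, which is why long contexts and the attention variant dominate memory planning once the weights are quantized.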

In [None]:
# Model size for different quantizations (bytes per parameter)
# Q8_0 and Q4_0 store one FP16 scale per block of 32 weights,
# so a block costs 34 and 18 bytes respectively.
quantizations = {
    'FP32': 4,
    'FP16': 2,
    'Q8_0': 34 / 32,
    'Q4_0': 18 / 32,
}

print("Model Size (weights only):")
print(f"{'Quantization':<12} {'Bytes/Param':<12} {'Total Size':<12}")
print("─" * 40)

for quant, bytes_per_param in quantizations.items():
    size_bytes = total_params * bytes_per_param
    size_gib = size_bytes / (1024**3)
    print(f"{quant:<12} {bytes_per_param:<12.4f} {size_gib:.2f} GiB")

In [None]:
# KV cache size for different context lengths
context_lengths = [512, 1024, 2048, 4096, 8192, 16384, 32768]

print("\nKV Cache Size (FP16):")
print(f"{'Context Length':<15} {'Cache Size':<15}")
print("─" * 30)

for n_ctx in context_lengths:
    cache_size = arch.estimate_kv_cache_size(n_ctx, bytes_per_elem=2)
    size_str = arch.format_size(cache_size)
    print(f"{n_ctx:<15,} {size_str:<15}")

In [None]:
# Visualize KV cache growth
fig, ax = plt.subplots(figsize=(10, 6))

cache_sizes_mb = [arch.estimate_kv_cache_size(n_ctx, 2) / (1024**2) for n_ctx in context_lengths]

ax.plot(context_lengths, cache_sizes_mb, marker='o', linewidth=2, markersize=8)
ax.set_xlabel('Context Length (tokens)')
ax.set_ylabel('KV Cache Size (MB)')
ax.set_title(f'{arch.name} KV Cache Memory Growth')
ax.grid(True, alpha=0.3)
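# Log-scaled x-axis spaces the doubling context lengths evenly; the cache itself grows linearly with n_ctx
ax.set_xscale('log')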

# Annotate points
for ctx, size in zip(context_lengths, cache_sizes_mb):
    ax.annotate(f'{size:.0f}MB', (ctx, size), textcoords="offset points", 
                xytext=(0,10), ha='center', fontsize=8)

plt.tight_layout()
plt.show()

## Exercise 4: Attention Mechanism Analysis

Analyze the attention configuration and its effect on KV cache memory.
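
The three variants differ only in how many query heads share one KV head. The sketch below assumes the common convention that consecutive query heads form a group; the function name `kv_head_for_query_head` is hypothetical, and a given implementation may order heads differently.

```python
def kv_head_for_query_head(q_head: int, n_head: int, n_head_kv: int) -> int:
    """Return the KV head index that a query head reads from."""
    if n_head % n_head_kv != 0:
        raise ValueError("n_head must be a multiple of n_head_kv")
    group_size = n_head // n_head_kv   # the GQA ratio
    return q_head // group_size


# 8 query heads sharing 2 KV heads (GQA with ratio 4:1)
print([kv_head_for_query_head(q, 8, 2) for q in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With `n_head_kv == n_head` the mapping is the identity (MHA), and with `n_head_kv == 1` every query head maps to KV head 0 (MQA).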

In [None]:
# Compare MHA vs GQA vs MQA
print("Attention Configuration Analysis:")
print(f"\nCurrent Model: {arch.name}")
print(f"  Query Heads: {arch.n_head}")
print(f"  KV Heads: {arch.n_head_kv}")

if arch.n_head == arch.n_head_kv:
    attention_type = "Multi-Head Attention (MHA)"
    efficiency_note = "Baseline KV cache size (one KV head per query head)"
elif arch.n_head_kv == 1:
    attention_type = "Multi-Query Attention (MQA)"
    efficiency_note = f"KV cache is {arch.n_head}x smaller than with MHA"
else:
    attention_type = "Grouped-Query Attention (GQA)"
    efficiency_note = f"KV cache is {arch.gqa_ratio}x smaller than with MHA"

print(f"\nType: {attention_type}")
print(f"Efficiency: {efficiency_note}")

# Calculate KV cache comparison
n_ctx = 4096
# Leading 2 = K and V tensors; trailing 2 = bytes per FP16 element
mha_cache = 2 * arch.n_layer * n_ctx * arch.n_head * arch.head_dim * 2
current_cache = arch.estimate_kv_cache_size(n_ctx, 2)
savings_ratio = mha_cache / current_cache

print(f"\nKV Cache Comparison @ {n_ctx} context:")
print(f"  If MHA ({arch.n_head} KV heads): {arch.format_size(mha_cache)}")
print(f"  Current ({arch.n_head_kv} KV heads): {arch.format_size(current_cache)}")
print(f"  Savings: {savings_ratio:.1f}x smaller")

## Exercise 5: FLOPs Estimation

Estimate computational requirements (FLOPs) for inference.
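
The estimate in this exercise rests on the rule that multiplying an M×N matrix by an N×K matrix costs about 2·M·N·K FLOPs: one multiply and one add per inner-loop step. The sketch below checks that rule by counting operations in a naive pure-Python matrix multiply. The function name `matmul_with_flop_count` is hypothetical, and the count treats the first accumulation into zero as an add.

```python
def matmul_with_flop_count(a, b):
    """Multiply an M×N matrix by an N×K matrix, counting multiplies and adds."""
    m, n = len(a), len(a[0])
    if len(b) != n:
        raise ValueError("inner dimensions do not match")
    k = len(b[0])

    out = [[0.0] * k for _ in range(m)]
    flops = 0
    for i in range(m):
        for j in range(k):
            acc = 0.0
            for t in range(n):
                acc += a[i][t] * b[t][j]  # one multiply + one add
                flops += 2
            out[i][j] = acc
    return out, flops


a = [[1.0, 2.0, 3.0]]                      # 1×3: a single token's activations
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3×2: a small weight matrix
out, flops = matmul_with_flop_count(a, b)
print(out, flops)  # [[4.0, 5.0]] 12
```

For single-token generation M is 1, so each weight matrix contributes roughly twice its parameter count in FLOPs.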

In [None]:
# FLOPs per token (forward pass)
def estimate_flops_per_token(arch: ModelArchitecture, seq_len: int) -> int:
    """
    Estimate forward-pass FLOPs for generating one token while
    attending over seq_len cached positions.

    For matrix multiply: M×N @ N×K = 2*M*N*K FLOPs
    """
    # K and V project to n_head_kv * head_dim, which is smaller than n_embd under GQA/MQA
    kv_dim = arch.n_head_kv * arch.head_dim

    per_layer = 0

    # Attention: Q projection, then K and V projections
    per_layer += 2 * arch.n_embd * arch.n_embd
    per_layer += 2 * 2 * arch.n_embd * kv_dim

    # Attention scores (Q·K^T) and weighted sum over V; softmax is ignored
    per_layer += 2 * 2 * arch.n_head * seq_len * arch.head_dim

    # Attention output projection
    per_layer += 2 * arch.n_embd * arch.n_embd

    # FFN: gate, up, down projections
    per_layer += 3 * 2 * arch.n_embd * arch.n_ff

    # All layers plus the output projection to vocabulary logits
    return arch.n_layer * per_layer + 2 * arch.n_embd * arch.n_vocab

# Calculate for different sequence lengths
print("FLOPs per Token Estimation:")
print(f"{'Seq Length':<12} {'FLOPs':<20} {'GFLOPs':<12}")
print("─" * 45)

for seq_len in [1, 100, 1000, 2048, 4096]:
    flops = estimate_flops_per_token(arch, seq_len)
    gflops = flops / 1e9
    print(f"{seq_len:<12,} {flops:<20,} {gflops:<12.2f}")

## Challenges

1. Load multiple models and compare their architectures
2. Calculate the theoretical maximum tokens/second for your hardware
3. Estimate the cost of training this model (hint: forward FLOPs per token × training tokens × 3, since the backward pass costs roughly twice the forward pass)
4. Design a custom architecture optimized for your specific use case

## Summary

In this lab, you:
- ✅ Inspected model architecture from GGUF files
- ✅ Calculated parameter counts and memory requirements
- ✅ Analyzed attention configurations (MHA/GQA/MQA)
- ✅ Estimated computational requirements (FLOPs)

## Next Steps

- [Lab 2: Tokenization Deep Dive](lab-02-tokenization-deep-dive.ipynb)
- [Documentation: Model Architecture](../docs/01-model-architecture-deep-dive.md)