# Lab 2: GGUF Format Exploration

**Module**: Module 1 - Foundations  
**Estimated Time**: 45-60 minutes  
**Difficulty**: Beginner to Intermediate  

---

## Learning Objectives

By completing this lab, you will:
- [ ] Understand the GGUF file format structure
- [ ] Use gguf-py to read and inspect model metadata
- [ ] Extract and analyze model architecture information
- [ ] Compare different quantization formats
- [ ] Understand the relationship between quantization and model size
- [ ] Inspect tensor information and data layout

## Prerequisites

- Completed Lab 1 (Setup and First Inference)
- Basic understanding of neural network architectures
- Python programming knowledge
- At least one GGUF model file downloaded

## What You'll Learn

GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp to store and load models efficiently. In this lab, you'll explore the internals of GGUF files, understand quantization schemes, and learn to inspect model architectures programmatically.

---

## Part 1: Setup and Understanding GGUF (10 minutes)

### What is GGUF?

GGUF is a binary file format that contains:
1. **Metadata**: Model information, hyperparameters, tokenizer data
2. **Tensor Information**: Names, shapes, and types of all model tensors
3. **Tensor Data**: The actual model weights (quantized or full precision)

### File Structure

```
┌─────────────────────┐
│  Magic Number       │ 4 bytes: "GGUF" (0x46554747)
├─────────────────────┤
│  Version            │ 4 bytes: uint32
├─────────────────────┤
│  Tensor Count       │ 8 bytes: uint64
├─────────────────────┤
│  Metadata Count     │ 8 bytes: uint64
├─────────────────────┤
│  Metadata KV Pairs  │ Variable size
├─────────────────────┤
│  Tensor Info        │ Variable size
├─────────────────────┤
│  Padding            │ Alignment to 32 bytes
├─────────────────────┤
│  Tensor Data        │ Bulk data (quantized weights)
└─────────────────────┘
```
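
The fixed-size part of this layout (magic, version, tensor count, metadata count) can be read with Python's `struct` module. The following is a minimal sketch, assuming little-endian byte order; `parse_gguf_header` is a hypothetical helper name, and the header bytes are synthetic example values rather than data read from the lab's model.

```python
import struct

# Layout from the diagram above: 4-byte magic, uint32 version,
# uint64 tensor count, uint64 metadata count (little-endian assumed).
HEADER_FORMAT = "<4sIQQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def parse_gguf_header(raw: bytes) -> dict:
    """Parse the fixed-size GGUF header from the start of a byte string."""
    magic, version, tensor_count, kv_count = struct.unpack_from(HEADER_FORMAT, raw, 0)
    if magic != b"GGUF":
        raise ValueError(f"Not a GGUF file: magic={magic!r}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for illustration only (values are made up).
example_header = struct.pack(HEADER_FORMAT, b"GGUF", 3, 201, 23)
header = parse_gguf_header(example_header)
print(header)

# The ASCII bytes "GGUF" read as a little-endian uint32 give the magic number.
magic_as_int = int.from_bytes(b"GGUF", "little")
print(hex(magic_as_int))
```

To try this on a real file, read the first `HEADER_SIZE` bytes with `open(path, "rb").read(HEADER_SIZE)` and pass them to the function.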

In [None]:
# Install gguf library
!pip install gguf -q

In [None]:
import gguf
from pathlib import Path
import json
import numpy as np
from collections import defaultdict

print(f"✓ gguf library version: {gguf.__version__ if hasattr(gguf, '__version__') else 'unknown'}")

In [None]:
# Path to the model from Lab 1
MODEL_PATH = Path("./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")

if not MODEL_PATH.exists():
    print(f"✗ Model not found at {MODEL_PATH}")
    print("Please complete Lab 1 first to download the model.")
else:
    print(f"✓ Model found: {MODEL_PATH}")
    print(f"✓ File size: {MODEL_PATH.stat().st_size / (1024**2):.2f} MB")

---

## Part 2: Reading GGUF Metadata (15 minutes)

Let's start by reading the model's metadata. This includes information about the model architecture, training parameters, and tokenizer configuration.
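
Before using the library, it can help to see what a single metadata key-value pair looks like on disk. The sketch below assumes the layout described in the GGUF specification: a length-prefixed UTF-8 key, a uint32 value-type code, then the value. It handles only two value types (4 = uint32, 8 = string), and the helper names and sample bytes are hypothetical.

```python
import struct

# Value-type codes (only the two handled in this sketch).
GGUF_TYPE_UINT32 = 4
GGUF_TYPE_STRING = 8


def read_string(buf: bytes, offset: int):
    """Read a GGUF string: uint64 length followed by UTF-8 bytes."""
    (length,) = struct.unpack_from("<Q", buf, offset)
    start = offset + 8
    return buf[start:start + length].decode("utf-8"), start + length


def read_kv(buf: bytes, offset: int = 0):
    """Read one key-value pair; return (key, value, next_offset)."""
    key, offset = read_string(buf, offset)
    (value_type,) = struct.unpack_from("<I", buf, offset)
    offset += 4
    if value_type == GGUF_TYPE_UINT32:
        (value,) = struct.unpack_from("<I", buf, offset)
        offset += 4
    elif value_type == GGUF_TYPE_STRING:
        value, offset = read_string(buf, offset)
    else:
        raise NotImplementedError(f"value type {value_type} not handled in this sketch")
    return key, value, offset


def pack_string(text: str) -> bytes:
    """Helper for building the synthetic example bytes."""
    data = text.encode("utf-8")
    return struct.pack("<Q", len(data)) + data


# Two synthetic pairs, built by hand for illustration.
blob = (
    pack_string("general.architecture") + struct.pack("<I", GGUF_TYPE_STRING) + pack_string("llama")
    + pack_string("llama.block_count") + struct.pack("<II", GGUF_TYPE_UINT32, 22)
)

key1, value1, pos = read_kv(blob)
key2, value2, pos = read_kv(blob, pos)
print(key1, "=", value1)
print(key2, "=", value2)
```

In practice `gguf.GGUFReader` does this parsing for every supported value type, including arrays.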

In [None]:
# Load the GGUF file
reader = gguf.GGUFReader(MODEL_PATH)

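# The header values are read from the "GGUF.version" pseudo-field and the
# parsed tensor list, which GGUFReader stores alongside the metadata.
print(f"GGUF Version: {reader.fields['GGUF.version'].parts[0][0]}")
print(f"Tensor Count: {len(reader.tensors)}")
print(f"Metadata Fields: {len(reader.fields)}")
print("\n✓ GGUF file loaded successfully!")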

In [None]:
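# Extract all metadata
def decode_field(field):
    """Decode a ReaderField into a plain Python value.

    field.parts holds the raw NumPy pieces of the key-value pair and
    field.data lists the indexes of the parts that contain the value.
    """
    if field.types[0] == gguf.GGUFValueType.STRING:
        return bytes(field.parts[field.data[0]]).decode("utf-8", errors="replace")
    if field.types[0] == gguf.GGUFValueType.ARRAY:
        if field.types[-1] == gguf.GGUFValueType.STRING:
            # Array of strings (e.g. the tokenizer vocabulary)
            return [bytes(field.parts[i]).decode("utf-8", errors="replace")
                    for i in field.data]
        # Array of numbers
        return [x for i in field.data for x in field.parts[i].tolist()]
    # Scalar value (int, float, or bool)
    return field.parts[field.data[0]].tolist()[0]

def get_all_metadata(reader):
    """Extract all metadata as a dictionary of decoded Python values."""
    return {field.name: decode_field(field) for field in reader.fields.values()}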

metadata = get_all_metadata(reader)
print(f"Extracted {len(metadata)} metadata fields")

In [None]:
# Display key model information
def print_model_info(metadata):
    """Print human-readable model information."""
    print("=== Model Information ===")
    
    # Look for common metadata keys
    interesting_keys = [
        'general.architecture',
        'general.name',
        'general.file_type',
        'general.quantization_version',
        'llama.context_length',
        'llama.embedding_length',
        'llama.block_count',
        'llama.feed_forward_length',
        'llama.attention.head_count',
        'llama.attention.head_count_kv',
        'tokenizer.ggml.model',
        'llama.vocab_size'
    ]
    
    for key in interesting_keys:
        if key in metadata:
            print(f"{key}: {metadata[key]}")

print_model_info(metadata)

### Exercise 2.1: Extract Model Architecture Details

From the metadata, extract and calculate:
1. Total number of parameters (approximate)
2. Number of attention heads
3. Hidden dimension size
4. Number of transformer layers
5. Vocabulary size
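
For item 1, the simplified `12 × n_layers × hidden_dim²` estimate can be compared against a per-tensor count. The sketch below uses a hypothetical 7B-class LLaMA-style configuration, not values read from the lab's model. It assumes a gated feed-forward block with three weight matrices, separate input and output embeddings, and no bias terms. `llama_param_count` is an illustrative helper, not part of gguf-py.

```python
def llama_param_count(n_layers, hidden_dim, ffn_dim, n_heads, n_kv_heads, vocab_size):
    """Count parameters of a LLaMA-style decoder under the stated assumptions."""
    head_dim = hidden_dim // n_heads
    kv_dim = head_dim * n_kv_heads                 # smaller than hidden_dim with grouped-query attention
    attn = 2 * hidden_dim * hidden_dim + 2 * hidden_dim * kv_dim   # Q and output, plus K and V
    ffn = 3 * hidden_dim * ffn_dim                 # gate, up, and down projections
    norms = 2 * hidden_dim                         # two normalization vectors per layer
    per_layer = attn + ffn + norms
    embeddings = 2 * vocab_size * hidden_dim       # token embedding + output projection
    final_norm = hidden_dim
    return n_layers * per_layer + embeddings + final_norm


# Hypothetical configuration for illustration.
config = dict(n_layers=32, hidden_dim=4096, ffn_dim=11008,
              n_heads=32, n_kv_heads=32, vocab_size=32000)

detailed = llama_param_count(**config)
rough = 12 * config["n_layers"] * config["hidden_dim"] ** 2

print(f"Detailed count: {detailed / 1e9:.2f}B")
print(f"Rough estimate: {rough / 1e9:.2f}B")
print(f"Rough / detailed: {rough / detailed:.1%}")
```

The rough formula ignores the embeddings and assumes a particular feed-forward width, so it is best treated as an order-of-magnitude check.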

In [None]:
# TODO: Extract architecture details from metadata
# YOUR CODE HERE

architecture = metadata.get('general.architecture', 'unknown')
n_layers = None  # Extract from metadata
n_heads = None   # Extract from metadata
hidden_dim = None  # Extract from metadata (embedding_length)
vocab_size = None  # Extract from metadata (number of entries in 'tokenizer.ggml.tokens')

print(f"Architecture: {architecture}")
print(f"Layers: {n_layers}")
print(f"Attention Heads: {n_heads}")
print(f"Hidden Dimension: {hidden_dim}")
print(f"Vocabulary Size: {vocab_size}")

# Calculate approximate parameter count
# Formula (simplified): params ≈ 12 × n_layers × hidden_dim²
if n_layers and hidden_dim:
    approx_params = 12 * n_layers * (hidden_dim ** 2)
    print(f"\nApproximate Parameters: {approx_params / 1e9:.2f}B")

In [None]:
# Auto-grading cell - DO NOT MODIFY
def test_architecture_extraction():
    assert n_layers is not None and n_layers > 0, "Number of layers not extracted"
    assert n_heads is not None and n_heads > 0, "Number of heads not extracted"
    assert hidden_dim is not None and hidden_dim > 0, "Hidden dimension not extracted"
    assert vocab_size is not None and vocab_size > 0, "Vocabulary size not extracted"
    print("✓ Architecture details extracted correctly!")
    return True

test_architecture_extraction()

---

## Part 3: Inspecting Tensors (15 minutes)

Now let's look at the actual model tensors. Each tensor has:
- **Name**: Identifies its role (e.g., `token_embd.weight`, `blk.0.attn_q.weight`)
- **Shape**: Dimensions of the tensor
- **Type**: Data type and quantization scheme
- **Offset**: Location in the file
- **Size**: Number of bytes
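
Tensor offsets are aligned, which is why the size and offset of consecutive tensors do not always add up exactly. The sketch below shows the rounding involved, assuming the 32-byte alignment from the file structure diagram and offsets measured from the start of the tensor data section. The tensor sizes are made-up numbers and the helper names are hypothetical.

```python
def align_up(offset: int, alignment: int = 32) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def layout_offsets(sizes, alignment: int = 32):
    """Return the aligned start offset of each tensor, given their byte sizes."""
    offsets = []
    cursor = 0
    for size in sizes:
        cursor = align_up(cursor, alignment)
        offsets.append(cursor)
        cursor += size
    return offsets


# Made-up tensor sizes in bytes, for illustration only.
sizes = [100, 64, 10]
offsets = layout_offsets(sizes)
print(offsets)
```

Here the first tensor occupies bytes 0 to 99, so the second starts at the next multiple of 32 rather than at byte 100.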

In [None]:
# Get all tensors
tensors = list(reader.tensors)

print(f"Total tensors in model: {len(tensors)}")
print("\nFirst 10 tensors:")
for i, tensor in enumerate(tensors[:10]):
    print(f"{i+1}. {tensor.name}")
    print(f"   Shape: {tensor.shape}")
    print(f"   Type: {tensor.tensor_type}")
    print()

In [None]:
# Analyze tensor types
def analyze_tensor_types(tensors):
    """Analyze the distribution of tensor types in the model."""
    type_counts = defaultdict(int)
    type_sizes = defaultdict(int)
    
    for tensor in tensors:
        # Use the short enum name (e.g. "Q4_K") as the dictionary key
        tensor_type = tensor.tensor_type.name
        type_counts[tensor_type] += 1
        
        # Accumulate element counts per type. These are not byte sizes:
        # bytes per element depend on each quantization type's block layout.
        n_elements = int(np.prod(tensor.shape))
        type_sizes[tensor_type] += n_elements
    
    return type_counts, type_sizes

type_counts, type_sizes = analyze_tensor_types(tensors)

print("=== Tensor Type Distribution ===")
for tensor_type, count in sorted(type_counts.items()):
    print(f"{tensor_type}: {count} tensors")

### Understanding Tensor Names

LLaMA model tensor naming follows a pattern:

- `token_embd.weight`: Token embedding matrix
- `blk.{N}.attn_q.weight`: Query weights for attention in layer N
- `blk.{N}.attn_k.weight`: Key weights for attention in layer N
- `blk.{N}.attn_v.weight`: Value weights for attention in layer N
- `blk.{N}.attn_output.weight`: Attention output projection
- `blk.{N}.ffn_up.weight`: Feed-forward "up" projection
- `blk.{N}.ffn_down.weight`: Feed-forward "down" projection
- `output.weight`: Final output projection
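
Because per-layer names share the `blk.{N}.` prefix, a regular expression can split a name into its layer index and the remainder. This is a small sketch; `parse_tensor_name` is a hypothetical helper, and the names passed to it are examples from the list above rather than output from the lab's model.

```python
import re

# "blk.<layer>.<rest>", e.g. "blk.0.attn_q.weight"
LAYER_PATTERN = re.compile(r"^blk\.(\d+)\.(.+)$")


def parse_tensor_name(name: str):
    """Return (layer_index, component); layer_index is None for non-layer tensors."""
    match = LAYER_PATTERN.match(name)
    if match is None:
        return None, name
    return int(match.group(1)), match.group(2)


for example in ["blk.12.attn_q.weight", "token_embd.weight", "output.weight"]:
    print(example, "->", parse_tensor_name(example))
```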

### Exercise 3.1: Group Tensors by Layer

Create a function that groups tensors by their layer number and counts tensors per layer.

In [None]:
def group_tensors_by_layer(tensors):
    """
    Group tensors by layer number.
    
    Returns:
        dict: {layer_num: [tensor_names]}
    """
    # TODO: Implement this function
    # YOUR CODE HERE
    
    layers = defaultdict(list)
    
    for tensor in tensors:
        # Extract layer number from tensor name (e.g., "blk.0.attn_q.weight" -> 0)
        # Hint: Use string parsing or regex
        pass
    
    return dict(layers)

# Test your implementation
layer_groups = group_tensors_by_layer(tensors)
print(f"Found {len(layer_groups)} unique layers")
if layer_groups:
    first_layer = min(layer_groups.keys())
    print(f"\nTensors in layer {first_layer}:")
    for name in layer_groups[first_layer][:5]:
        print(f"  - {name}")

### Exercise 3.2: Calculate Total Model Size

Calculate the total size of the model by summing all tensor data.

In [None]:
def calculate_model_size(tensors):
    """
    Calculate total model size in bytes.
    
    Note: This is approximate as different quantization types
    use different bytes per element.
    """
    # TODO: Implement this function
    # YOUR CODE HERE
    
    # Mapping of tensor types to approximate bytes per element
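    bytes_per_type = {
        'F32': 4,  # Full precision float
        'F16': 2,  # Half precision
        'Q8_0': 1.0625,  # 8-bit quantization: 34 bytes per 32 weights
        'Q4_K': 0.5625,  # 4-bit K-quant: 144 bytes per 256 weights
        'Q5_K': 0.6875,  # 5-bit K-quant: 176 bytes per 256 weights
        'Q6_K': 0.8203125,  # 6-bit K-quant: 210 bytes per 256 weights
    }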
    
    total_bytes = 0
    
    for tensor in tensors:
        # Calculate size based on shape and type
        pass
    
    return total_bytes

# Test your implementation
calculated_size = calculate_model_size(tensors)
actual_size = MODEL_PATH.stat().st_size

print(f"Calculated model size: {calculated_size / (1024**2):.2f} MB")
print(f"Actual file size: {actual_size / (1024**2):.2f} MB")
print(f"Difference: {abs(calculated_size - actual_size) / (1024**2):.2f} MB (header, metadata, tensor info, and padding)")

---

## Part 4: Quantization Comparison (15 minutes)

Different quantization schemes offer different trade-offs between model size and quality.

### Common Quantization Types

- **Q4_0**: 4-bit quantization, basic (4.5 bits per weight)
- **Q4_K_M**: 4-bit K-quant, medium mix (about 4.85 bits per weight on average)
- **Q5_K_M**: 5-bit K-quant, medium mix (about 5.54 bits per weight on average)
- **Q6_K**: 6-bit K-quant (6.56 bits per weight)
- **Q8_0**: 8-bit quantization (8.5 bits per weight)
- **F16**: Half precision float (16 bits per weight)
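
The fractional bits-per-weight figures come from the block layout: each block stores its quantized weights plus scale data, and that overhead is shared across the weights in the block. The sketch below derives the figure for a few single-type formats. The block sizes are hardcoded here for illustration (for example, a Q4_0 block holds 32 weights in 18 bytes: a 2-byte scale plus 16 bytes of 4-bit values).

```python
# (weights per block, bytes per block), hardcoded for this sketch
BLOCK_LAYOUTS = {
    "Q4_0": (32, 18),
    "Q8_0": (32, 34),
    "Q4_K": (256, 144),
    "Q5_K": (256, 176),
    "Q6_K": (256, 210),
}


def bits_per_weight(quant: str) -> float:
    """Average storage cost per weight, including per-block scale data."""
    weights_per_block, bytes_per_block = BLOCK_LAYOUTS[quant]
    return bytes_per_block * 8 / weights_per_block


for name in BLOCK_LAYOUTS:
    print(f"{name}: {bits_per_weight(name):.4f} bits/weight")
```

Note that these are per-tensor-type figures. File types such as Q4_K_M combine several tensor types, so their average lands above the base type's value.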

### K-Quants

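K-quants quantize weights in 256-element super-blocks, and the `_S`/`_M`/`_L` file types mix precisions across tensors:
- Tensors that affect quality most (for example, some attention and feed-forward weights) get a higher-precision type
- The remaining tensors get the lower-precision base type
- Result: Better quality at similar size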
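
The effect of mixing precisions on file size is a weighted average. The sketch below computes the effective bits per weight for a hypothetical split in which 85% of the weights use a 4.5-bit type and 15% use a 6.5625-bit type. The split is made up for illustration; real file types choose the mix per tensor.

```python
def effective_bits_per_weight(mix):
    """Weighted average bits/weight for a list of (n_weights, bits_per_weight) pairs."""
    total_weights = sum(n for n, _ in mix)
    total_bits = sum(n * bits for n, bits in mix)
    return total_bits / total_weights


# Hypothetical split of a 1B-parameter model.
hypothetical_mix = [
    (850_000_000, 4.5),     # bulk of the tensors in a 4-bit K-quant type
    (150_000_000, 6.5625),  # selected tensors kept in a 6-bit K-quant type
]

average = effective_bits_per_weight(hypothetical_mix)
print(f"Effective bits per weight: {average:.3f}")
```

Keeping a small share of tensors at higher precision raises the average only slightly above the base type's cost.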

In [None]:
# Theoretical size comparison for a 7B parameter model
def compare_quantization_sizes(param_count_billions=7.0):
    """
    Compare file sizes for different quantization schemes.
    
    Args:
        param_count_billions: Model size in billions of parameters
    """
    param_count = param_count_billions * 1e9
    
    quantizations = {
        'F32 (Full Precision)': 32,
        'F16 (Half Precision)': 16,
        'Q8_0': 8.5,
        'Q6_K': 6.56,
        'Q5_K_M': 5.54,
        'Q4_K_M': 4.85,
        'Q4_0': 4.5,
    }
    
    print(f"=== Size Comparison for {param_count_billions}B Model ===")
    print(f"{'Quantization':<25} {'Bits/Weight':<12} {'Size (GB)':<10} {'vs F32'}")
    print("="*70)
    
    f32_size = param_count * 32 / 8 / (1024**3)
    
    for quant_name, bits_per_weight in quantizations.items():
        size_gb = param_count * bits_per_weight / 8 / (1024**3)
        ratio = size_gb / f32_size
        print(f"{quant_name:<25} {bits_per_weight:<12.2f} {size_gb:<10.2f} {ratio:.1%}")

compare_quantization_sizes(1.1)  # TinyLlama size
print()
compare_quantization_sizes(7.0)  # LLaMA-2-7B size

### Exercise 4.1: Download and Compare Multiple Quantizations

If time permits, download the same model in different quantizations and compare:
1. File sizes
2. Loading times
3. Inference speeds
4. Output quality (subjective)

Note: This exercise is optional as it requires downloading multiple models.

In [None]:
# Example: Compare two quantizations you have downloaded
# YOUR CODE HERE (optional)


---

## Part 5: Tokenizer Exploration (10 minutes)

The GGUF file also contains tokenizer information. Let's explore the tokenizer vocabulary.

In [None]:
# Extract tokenizer information
def get_tokenizer_info(metadata):
    """Extract tokenizer metadata."""
    tokenizer_info = {}
    
    for key, value in metadata.items():
        if 'tokenizer' in key.lower():
            tokenizer_info[key] = value
    
    return tokenizer_info

tokenizer_info = get_tokenizer_info(metadata)

print("=== Tokenizer Information ===")
for key, value in tokenizer_info.items():
    if isinstance(value, (list, bytes)):
        print(f"{key}: <{type(value).__name__} of length {len(value)}>")
    else:
        print(f"{key}: {value}")

### Exercise 5.1: Vocabulary Analysis

If the tokenizer vocabulary is accessible, analyze:
1. Total vocabulary size
2. Sample tokens
3. Special tokens (e.g., BOS, EOS, PAD)
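
As a starting point, the sketch below summarizes a vocabulary from a metadata dictionary. It assumes the metadata values have already been decoded to plain Python types (a list of strings for `tokenizer.ggml.tokens`, integers for the special token IDs) and runs on a tiny hand-made dictionary rather than the lab's model. `summarize_vocab` is a hypothetical helper.

```python
SPECIAL_TOKEN_KEYS = {
    "BOS": "tokenizer.ggml.bos_token_id",
    "EOS": "tokenizer.ggml.eos_token_id",
    "PAD": "tokenizer.ggml.padding_token_id",
}


def summarize_vocab(metadata: dict) -> dict:
    """Return the vocabulary size and any special tokens present in the metadata."""
    tokens = metadata.get("tokenizer.ggml.tokens", [])
    special = {}
    for label, key in SPECIAL_TOKEN_KEYS.items():
        token_id = metadata.get(key)
        # Skip keys that are absent or point outside the vocabulary.
        if token_id is not None and 0 <= token_id < len(tokens):
            special[label] = (token_id, tokens[token_id])
    return {"vocab_size": len(tokens), "special_tokens": special}


# Hand-made metadata for illustration only.
toy_metadata = {
    "tokenizer.ggml.tokens": ["<unk>", "<s>", "</s>", "the", "a"],
    "tokenizer.ggml.bos_token_id": 1,
    "tokenizer.ggml.eos_token_id": 2,
}

summary = summarize_vocab(toy_metadata)
print(summary)
```

Not every model defines every special token, so the function reports only the ones it finds.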

In [None]:
# TODO: Analyze vocabulary
# YOUR CODE HERE

# This will depend on the specific metadata structure
# Look for keys like 'tokenizer.ggml.tokens'


---

## Validation

Run this cell to validate your lab completion:

In [None]:
def validate_lab():
    """Validate lab completion."""
    checks = []
    
    # Check 1: GGUF file loaded
    checks.append(("GGUF file loaded", reader is not None))
    
    # Check 2: Metadata extracted
    checks.append(("Metadata extracted", len(metadata) > 0))
    
    # Check 3: Architecture details extracted
    checks.append(("Architecture extracted", 
                   n_layers is not None and n_heads is not None))
    
    # Check 4: Tensors analyzed
    checks.append(("Tensors analyzed", len(tensors) > 0))
    
    # Check 5: Type distribution calculated
    checks.append(("Type distribution", len(type_counts) > 0))
    
    # Print results
    print("=== Lab Validation ===")
    all_passed = True
    for check_name, passed in checks:
        status = "✓" if passed else "✗"
        print(f"{status} {check_name}")
        if not passed:
            all_passed = False
    
    print("\n" + "="*50)
    if all_passed:
        print("🎉 Congratulations! You've completed Lab 2!")
        print("\nYou now understand:")
        print("  - GGUF file format structure")
        print("  - Model metadata and architecture")
        print("  - Tensor organization and types")
        print("  - Quantization schemes and trade-offs")
    else:
        print("⚠️  Please complete all exercises before moving on.")
    
    return all_passed

validate_lab()

---

## Extension Challenges

### Challenge 1: GGUF Converter
Write a tool that reads a GGUF file and exports its metadata to JSON for easy inspection.

### Challenge 2: Quantization Recommender
Create a function that recommends the best quantization for a given use case (quality, size, speed).

### Challenge 3: Model Comparator
Build a tool that compares two GGUF models side-by-side showing their architecture differences.

### Challenge 4: Tensor Visualizer
Visualize the distribution of tensor sizes and types in a model using matplotlib.

### Challenge 5: Memory Estimator
Create a function that predicts the RAM required to run a model based on its GGUF metadata.

In [None]:
# Extension Challenge: Your implementation here


---

## Key Takeaways

In this lab, you learned:

1. **GGUF Structure**: Understanding the binary format for storing LLMs
2. **Metadata**: How model information is stored and accessed
3. **Tensors**: Organization and naming conventions for model weights
4. **Quantization**: Different schemes and their size/quality trade-offs
5. **Inspection Tools**: Using gguf-py to programmatically analyze models

### Quantization Quick Reference

| Quantization | Size | Quality | Use Case |
|--------------|------|---------|----------|
| Q4_K_M | Smallest | Good | Limited RAM, speed priority |
| Q5_K_M | Small | Better | Balanced use |
| Q6_K | Medium | Very good | Quality priority |
| Q8_0 | Larger | Excellent | Maximum quality |
| F16 | Largest | Best | Development/research |

### Next Steps

- **Lab 3**: Memory profiling and KV cache optimization
- **Module 2**: Deep dive into quantization algorithms
- **Read**: GGUF specification document

---

**Lab Created By**: Agent 4 (Lab Designer)  
**Last Updated**: 2025-11-18  
**Feedback**: [Submit feedback](../../feedback/)  