# Deconstructing Local LLMs

When we download an LLM for local use, it comes with several essential components that work together to make text generation possible. In this lesson, we'll explore these components, understand what they do, and learn how they fit together.

## Learning Objectives

By the end of this notebook, you will be able to:
- Identify the key files in a local LLM directory
- Understand the purpose of model configuration files
- Examine model weights and architecture
- Explore tokenizer components and how tokenization works
- Understand how text is converted to tokens and back

In [1]:
## Setup: Download and Save a Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import json
import torch
from safetensors import safe_open

# Set the directory where we'll save the model
save_directory = "./downloaded_model"  
model_name = "distilgpt2"

# Check if model already exists locally
if os.path.exists(save_directory) and os.listdir(save_directory):
    print(f"✓ Model already exists in {save_directory}")
    print("  Loading from local directory...")
    tokenizer = AutoTokenizer.from_pretrained(save_directory)
    model = AutoModelForCausalLM.from_pretrained(save_directory)
    print("✓ Model loaded successfully!")
else:
    # Create directory and download model
    os.makedirs(save_directory, exist_ok=True)
    
    print(f"Downloading {model_name} from Hugging Face Hub...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    
    # Save the model to our local directory
    print(f"Saving model to {save_directory}...")
    model.save_pretrained(save_directory)
    tokenizer.save_pretrained(save_directory)
    print("✓ Model and tokenizer saved successfully!")

✓ Model already exists in ./downloaded_model
  Loading from local directory...
✓ Model loaded successfully!


## Exploring the Model Files

Let's see which files were written to `save_directory` when we saved the model and tokenizer:

In [2]:
# List all files in the model directory
files = os.listdir(save_directory)
print("Files in the model directory:\n")
for file in sorted(files):
    # Get file size in MB
    file_path = os.path.join(save_directory, file)
    file_size = os.path.getsize(file_path) / (1024 * 1024)  # Convert to MB
    print(f"  {file:<30} {file_size:>10.2f} MB")

Files in the model directory:

  config.json                          0.00 MB
  generation_config.json               0.00 MB
  merges.txt                           0.44 MB
  model.safetensors                  312.48 MB
  special_tokens_map.json              0.00 MB
  tokenizer.json                       3.39 MB
  tokenizer_config.json                0.00 MB
  vocab.json                           0.76 MB


## Understanding the Key Components

The files we see in the model directory can be grouped into three main categories:

### 1. Model Configuration
- `config.json` - Model architecture and hyperparameters
- `generation_config.json` - Default generation parameters

### 2. Model Weights
- `model.safetensors` or `pytorch_model.bin` - The actual trained parameters

### 3. Tokenizer Components
- `tokenizer_config.json` - Tokenizer settings
- `vocab.json` - Vocabulary mapping tokens to IDs
- `merges.txt` - Byte-Pair Encoding (BPE) merges for subword tokenization
- `tokenizer.json` - Single-file serialization of the whole tokenizer (vocabulary, merges, and processing steps), used by the fast tokenizer
- `special_tokens_map.json` - Defines special tokens like `<|endoftext|>`

Let's examine each of these components in detail.

## 1. Model Configuration (config.json)

The `config.json` file contains essential information about the model architecture and hyperparameters. This tells the framework how to construct the model's neural network layers.

**Key parameters:**
- `model_type` - The architecture family (e.g., `gpt2`, `bert`)
- `vocab_size` - Number of tokens in the vocabulary
- `n_positions` - Maximum sequence length, in tokens, that the model can handle
- `n_embd` - Dimension of embeddings and hidden layers
- `n_layer` - Number of transformer layers/blocks
- `n_head` - Number of attention heads in each layer
- `activation_function` - Non-linearity used (e.g., GELU, ReLU)
- `*_pdrop` - Dropout probabilities for regularization

In [3]:
# Load and examine the config.json file
config_path = os.path.join(save_directory, "config.json")
with open(config_path, "r") as f:
    config = json.load(f)

# Display key configuration parameters
print("Key model configuration parameters:\n")
important_params = [
    "model_type", "vocab_size", "n_positions", "n_embd", "n_layer", "n_head", 
    "activation_function", "resid_pdrop", "embd_pdrop", "attn_pdrop"
]
for param in important_params:
    if param in config:
        print(f"  {param:<25} {config[param]}")

Key model configuration parameters:

  model_type                gpt2
  vocab_size                50257
  n_positions               1024
  n_embd                    768
  n_layer                   6
  n_head                    12
  activation_function       gelu_new
  resid_pdrop               0.1
  embd_pdrop                0.1
  attn_pdrop                0.1
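
These few numbers are enough to predict how many parameters the model has and roughly how large the weights file should be. The sketch below is an estimate under stated assumptions rather than an official formula: it assumes the standard GPT-2 block layout, an MLP hidden size of `4 * n_embd`, an output layer that reuses the token embedding matrix, and 4 bytes per weight (float32). The helper name `gpt2_param_count` is made up for this example.

```python
# Values copied from the config.json output above.
config = {"vocab_size": 50257, "n_positions": 1024, "n_embd": 768, "n_layer": 6}


def gpt2_param_count(cfg):
    """Count parameters for a GPT-2 style model (assumed layout, see the text above)."""
    d = cfg["n_embd"]
    inner = 4 * d  # assumption: MLP hidden size is 4 * n_embd

    embeddings = cfg["vocab_size"] * d + cfg["n_positions"] * d  # token + position tables
    per_block = (
        2 * d                  # ln_1 weight + bias
        + d * 3 * d + 3 * d    # attn.c_attn: fused query/key/value projection
        + d * d + d            # attn.c_proj
        + 2 * d                # ln_2 weight + bias
        + d * inner + inner    # mlp.c_fc
        + inner * d + d        # mlp.c_proj
    )
    final_norm = 2 * d         # ln_f weight + bias
    # Assumption: the output layer shares the token embedding matrix, so it adds nothing.
    return embeddings + cfg["n_layer"] * per_block + final_norm


n_params = gpt2_param_count(config)
size_mb = n_params * 4 / (1024 * 1024)  # assumption: float32, 4 bytes per weight
print(f"Estimated parameters: {n_params:,}")   # 81,912,576
print(f"Estimated size:       {size_mb:.2f} MB")  # 312.47 MB
```

The estimate lands within 0.01 MB of the 312.48 MB reported for `model.safetensors` in the file listing, which suggests the weights are stored as float32; the small remainder is plausibly the file's own header.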


## 2. Model Weights

The model weights are stored in one of these formats:
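- `pytorch_model.bin` - PyTorch's native checkpoint format, which is based on Python's `pickle`
- `model.safetensors` - A newer format that stores raw tensor data behind a JSON header; loading it cannot execute arbitrary code, which is why it is preferred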

These files contain the actual trained parameters of the model - the weights and biases learned during training. Let's examine what's inside:

In [4]:
# Find the weights file
weights_file = None
for file in files:
    if file.endswith(".bin") or file.endswith(".safetensors"):
        weights_file = file
        break

if weights_file:
    print(f"Found weights file: {weights_file}\n")

    if weights_file.endswith(".bin"):
        weights_path = os.path.join(save_directory, weights_file)
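        # Load onto the CPU and restrict unpickling to tensors, since .bin files are pickle-based
        state_dict = torch.load(weights_path, map_location="cpu", weights_only=True)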

        print("Model weight matrices (first 10):\n")
        print(f"{'Layer Name':<50} {'Shape':<20} {'Sample Values'}")
        print("=" * 100)

        for name, tensor in list(state_dict.items())[:10]:
            preview = tensor.flatten()[:3].tolist()
            preview_str = f"[{preview[0]:.4f}, {preview[1]:.4f}, {preview[2]:.4f}, ...]"
            print(f"{name:<50} {str(tensor.shape):<20} {preview_str}")

    elif weights_file.endswith(".safetensors"):
        weights_path = os.path.join(save_directory, weights_file)
        with safe_open(weights_path, framework="pt") as f:
            tensor_names = list(f.keys())[:10]

            print("Model weight matrices (first 10):\n")
            print(f"{'Layer Name':<50} {'Shape':<20} {'Sample Values'}")
            print("=" * 100)

            for name in tensor_names:
                tensor = f.get_tensor(name)
                preview = tensor.flatten()[:3].tolist()
                preview_str = f"[{preview[0]:.4f}, {preview[1]:.4f}, {preview[2]:.4f}, ...]"
                print(f"{name:<50} {str(tensor.shape):<20} {preview_str}")
else:
    print("No weights file found")

Found weights file: model.safetensors

Model weight matrices (first 10):

Layer Name                                         Shape                Sample Values
transformer.h.0.attn.c_attn.bias                   torch.Size([2304])   [0.4693, -0.4959, -0.4158, ...]
transformer.h.0.attn.c_attn.weight                 torch.Size([768, 2304]) [-0.4988, -0.1990, -0.1046, ...]
transformer.h.0.attn.c_proj.bias                   torch.Size([768])    [0.1617, -0.1644, -0.1561, ...]
transformer.h.0.attn.c_proj.weight                 torch.Size([768, 768]) [0.2581, -0.1660, 0.0625, ...]
transformer.h.0.ln_1.bias                          torch.Size([768])    [0.0048, 0.0129, -0.0190, ...]
transformer.h.0.ln_1.weight                        torch.Size([768])    [0.2195, 0.1853, 0.1572, ...]
transformer.h.0.ln_2.bias                          torch.Size([768])    [0.0385, 0.0581, 0.0133, ...]
transformer.h.0.ln_2.weight                        torch.Size([768])    [0.1342, 0.2176, 0.2098, ...]
transforme
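
The tensor names and shapes listed above can also be read without loading any weights, because a safetensors file starts with an 8-byte little-endian length followed by a JSON header describing every tensor. The sketch below reads that header with the standard library only. To stay runnable without the real model it first writes a tiny stand-in file; the tensor name `demo.bias`, its values, and the helper name `read_safetensors_header` are invented for the demonstration.

```python
import json
import os
import struct
import tempfile


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte little-endian length
        return json.loads(f.read(header_len).decode("utf-8"))


# Build a tiny stand-in file (made-up tensor) so the sketch runs on its own.
demo_header = {
    "__metadata__": {"format": "pt"},
    "demo.bias": {"dtype": "F32", "shape": [3], "data_offsets": [0, 12]},
}
header_bytes = json.dumps(demo_header).encode("utf-8")
payload = struct.pack("<3f", 0.5, -0.5, 0.25)  # three float32 values = 12 bytes

demo_path = os.path.join(tempfile.mkdtemp(), "demo.safetensors")
with open(demo_path, "wb") as f:
    f.write(struct.pack("<Q", len(header_bytes)) + header_bytes + payload)

header = read_safetensors_header(demo_path)
for name, info in header.items():
    if name != "__metadata__":  # optional free-form metadata entry
        print(f"{name:<20} {info['dtype']:<5} {info['shape']}")
```

Pointing the same function at `os.path.join(save_directory, "model.safetensors")` should list every tensor with its dtype and shape, which is a cheap way to inspect a large checkpoint.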

## 3. Tokenizer Components

The tokenizer is responsible for converting text into token IDs that the model can process, and vice versa.

### Tokenizer Configuration (tokenizer_config.json)

**Key settings:**
- `model_max_length` - Maximum sequence length the tokenizer will handle
- `bos_token`, `eos_token`, `unk_token` - Special tokens for beginning/end of sequence and unknown tokens

In [5]:
# Examine tokenizer_config.json
tokenizer_config_path = os.path.join(save_directory, "tokenizer_config.json")
if os.path.exists(tokenizer_config_path):
    with open(tokenizer_config_path, "r") as f:
        tokenizer_config = json.load(f)

    print("Tokenizer Configuration:\n")
    for key, value in tokenizer_config.items():
        # Format long values more nicely
        if isinstance(value, dict) and len(str(value)) > 80:
            print(f"  {key}: {type(value).__name__} with {len(value)} entries")
        else:
            print(f"  {key}: {value}")
else:
    print("No tokenizer_config.json found")

Tokenizer Configuration:

  add_prefix_space: False
  added_tokens_decoder: dict with 1 entries
  bos_token: <|endoftext|>
  clean_up_tokenization_spaces: False
  eos_token: <|endoftext|>
  extra_special_tokens: {}
  model_max_length: 1024
  tokenizer_class: GPT2Tokenizer
  unk_token: <|endoftext|>


In [6]:
### Vocabulary (vocab.json)

# Examine vocab.json
vocab_path = os.path.join(save_directory, "vocab.json")
if os.path.exists(vocab_path):
    with open(vocab_path, "r") as f:
        vocab = json.load(f)

    print(f"Vocabulary size: {len(vocab):,} tokens\n")

    # Show the first 20 tokens
    print("Sample tokens (first 20):")
    for token, token_id in sorted(vocab.items(), key=lambda item: item[1])[:20]:
        print(f"  {token_id:5d}: {repr(token)}")

    # Show some interesting tokens
    print("\nSome interesting word tokens:")
    interesting_tokens = ["hello", "world", "programming", "AI", "model"]
    for token in interesting_tokens:
        if token in vocab:
            print(f"  {vocab[token]:5d}: {repr(token)}")

    # Show special tokens
    print("\nSpecial tokens:")
    special_tokens = ["<|endoftext|>", "<|pad|>", "<|mask|>"]
    for token in special_tokens:
        if token in vocab:
            print(f"  {vocab[token]:5d}: {repr(token)}")
else:
    print("No vocab.json found")

Vocabulary size: 50,257 tokens

Sample tokens (first 20):
      0: '!'
      1: '"'
      2: '#'
      3: '$'
      4: '%'
      5: '&'
      6: "'"
      7: '('
      8: ')'
      9: '*'
     10: '+'
     11: ','
     12: '-'
     13: '.'
     14: '/'
     15: '0'
     16: '1'
     17: '2'
     18: '3'
     19: '4'

Some interesting word tokens:
  31373: 'hello'
   6894: 'world'
  20185: 'AI'
  19849: 'model'

Special tokens:
  50256: '<|endoftext|>'
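
The lowest IDs are punctuation and digits because GPT-2's vocabulary starts from the 256 possible byte values, each written as one printable character. The sketch below reproduces that byte-to-character table as it is commonly implemented for GPT-2 style tokenizers; treat it as an illustration rather than the exact code path `transformers` runs. It also shows where the odd-looking `Ġ` in many tokens comes from.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 style)."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Whitespace and control bytes are moved to code points 256 and up.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_to_char = bytes_to_unicode()
print(len(byte_to_char))                 # 256 base symbols
print(byte_to_char[ord(" ")])            # Ġ  (how a space is written in vocab.json)
print(list(byte_to_char.values())[:5])   # ['!', '"', '#', '$', '%']

# Vocabulary size = base bytes + learned merges + special tokens
print(256 + 50_000 + 1)                  # 50257
```

The last line accounts for the full vocabulary: 256 byte symbols, 50,000 learned merges (one per rule in `merges.txt`, not counting its `#version` header line), and the single special token `<|endoftext|>`.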


In [7]:
### BPE Merges (merges.txt)

# Examine merges.txt (BPE merges)
merges_path = os.path.join(save_directory, "merges.txt")
if os.path.exists(merges_path):
    with open(merges_path, "r", encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]

    # The first line is a "#version" header, not a merge rule, so exclude it from the count
    merges = lines[1:] if lines and lines[0].startswith("#version") else lines
    print(f"Number of BPE merges: {len(merges):,}\n")

    # Show the first few lines of the file (header included)
    print("First 10 lines of merges.txt:")
    for line in lines[:10]:
        print(f"  {line.strip()}")

    print("\n" + "=" * 80)
    print("Understanding BPE (Byte-Pair Encoding) merges:")
    print("=" * 80)
    print("• Each line shows two tokens that can be merged into one")
    print("• Merges are applied in order during tokenization")
    print("• This allows the model to handle unknown words by breaking them into subwords")
    print("• The 'Ġ' symbol represents a space character")
else:
    print("No merges.txt found")

Number of BPE merges: 50,000

First 10 lines of merges.txt:
  #version: 0.2
  Ġ t
  Ġ a
  h e
  i n
  r e
  o n
  Ġt he
  e r
  Ġ s

Understanding BPE (Byte-Pair Encoding) merges:
• Each line shows two tokens that can be merged into one
• Merges are applied in order during tokenization
• This allows the model to handle unknown words by breaking them into subwords
• The 'Ġ' symbol represents a space character
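
To make "applied in order" concrete, here is a minimal sketch of how merge rules turn characters into subword tokens. It uses only the nine rules visible in the output above, so its results differ from the real tokenizer, which has the full rule list. The function name `apply_bpe` is made up for this example, and the real implementation adds caching and a pre-tokenization step that are omitted here.

```python
# The nine merge rules visible in the output above, in file order.
merge_rules = [
    ("Ġ", "t"), ("Ġ", "a"), ("h", "e"), ("i", "n"), ("r", "e"),
    ("o", "n"), ("Ġt", "he"), ("e", "r"), ("Ġ", "s"),
]
# Earlier rules take priority, so a rule's rank is its position in the file.
ranks = {pair: rank for rank, pair in enumerate(merge_rules)}


def apply_bpe(word, ranks):
    """Repeatedly merge the adjacent pair with the lowest rank until none applies."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


print(apply_bpe("Ġthe", ranks))    # ['Ġthe']
print(apply_bpe("Ġthere", ranks))  # ['Ġthe', 're']
print(apply_bpe("Ġxyz", ranks))    # ['Ġ', 'x', 'y', 'z']
```

The last call shows the fallback behaviour: a string with no applicable merges is never rejected, it simply stays split into its smallest known pieces.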


## Tokenization in Action

Let's see how the tokenizer works with a real example:

In [8]:
# Reload the tokenizer to ensure we're using the local files
local_tokenizer = AutoTokenizer.from_pretrained(save_directory)

# Define a sample text
sample_text = "The quick brown fox jumps over the lazy dog. This is an example of tokenization in NLP."

# Tokenize the text
tokens = local_tokenizer.tokenize(sample_text)
token_ids = local_tokenizer.encode(sample_text)

# Display the results
print(f"Original text:\n  {sample_text}\n")
print(f"Tokenized into {len(tokens)} tokens:")
print(f"  {tokens}\n")

print(f"Converted to {len(token_ids)} token IDs:")
print(f"  {token_ids}\n")

# Show token to ID mapping
print("Token → ID mapping:")
print("-" * 40)
for token, token_id in zip(tokens, token_ids):
    print(f"  {repr(token):20} → {token_id}")

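# Decode back to text
decoded_text = local_tokenizer.decode(token_ids)
print(f"\nDecoded text:\n  {decoded_text}")
# Verify the round trip instead of assuming it
if decoded_text == sample_text:
    print("\n✓ Round-trip successful! (Original → Tokens → IDs → Text)")
else:
    print("\n✗ Round-trip mismatch: decoded text differs from the original")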

Original text:
  The quick brown fox jumps over the lazy dog. This is an example of tokenization in NLP.

Tokenized into 21 tokens:
  ['The', 'Ġquick', 'Ġbrown', 'Ġfox', 'Ġjumps', 'Ġover', 'Ġthe', 'Ġlazy', 'Ġdog', '.', 'ĠThis', 'Ġis', 'Ġan', 'Ġexample', 'Ġof', 'Ġtoken', 'ization', 'Ġin', 'ĠN', 'LP', '.']

Converted to 21 token IDs:
  [464, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290, 13, 770, 318, 281, 1672, 286, 11241, 1634, 287, 399, 19930, 13]

Token → ID mapping:
----------------------------------------
  'The'                → 464
  'Ġquick'             → 2068
  'Ġbrown'             → 7586
  'Ġfox'               → 21831
  'Ġjumps'             → 18045
  'Ġover'              → 625
  'Ġthe'               → 262
  'Ġlazy'              → 16931
  'Ġdog'               → 3290
  '.'                  → 13
  'ĠThis'              → 770
  'Ġis'                → 318
  'Ġan'                → 281
  'Ġexample'           → 1672
  'Ġof'                → 286
  'Ġtoken'             → 11241
  'izati
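
For plain ASCII text like this sample, decoding is conceptually just joining the token strings and turning each `Ġ` back into a space. The sketch below does that with the token list printed above. It is a shortcut for illustration only: the real decoder maps every character back to its byte, which matters for non-ASCII text. The helper name `decode_ascii_tokens` is made up here.

```python
# Token strings copied from the tokenizer output above.
tokens = ['The', 'Ġquick', 'Ġbrown', 'Ġfox', 'Ġjumps', 'Ġover', 'Ġthe', 'Ġlazy',
          'Ġdog', '.', 'ĠThis', 'Ġis', 'Ġan', 'Ġexample', 'Ġof', 'Ġtoken',
          'ization', 'Ġin', 'ĠN', 'LP', '.']
sample_text = ("The quick brown fox jumps over the lazy dog. "
               "This is an example of tokenization in NLP.")


def decode_ascii_tokens(tokens):
    """Join tokens and turn the 'Ġ' marker back into a space (ASCII-only shortcut)."""
    return "".join(tokens).replace("Ġ", " ")


decoded = decode_ascii_tokens(tokens)
print(decoded == sample_text)  # True
print(f"{len(sample_text)} characters -> {len(tokens)} tokens")
```

Note how "tokenization" and "NLP" each split into two tokens while common words stay whole, which is the subword behaviour the merge rules produce.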

## Summary

When you download a local LLM, you get a complete package with three essential components:

### 1. Model Architecture and Configuration
- **config.json** - Defines the neural network architecture (layers, attention heads, dimensions)
- **generation_config.json** - Default parameters for text generation (for example sampling settings such as temperature and top_p, length limits, and special token IDs)

### 2. Model Weights
- **model.safetensors** or **pytorch_model.bin** - All trained neural network weights (the actual learned parameters)

### 3. Tokenizer Components
- **vocab.json** - Maps text tokens to their corresponding IDs
- **merges.txt** - BPE merge rules that determine how characters combine into subword tokens
- **tokenizer.json** - Single-file serialization combining vocabulary, merge rules, and processing steps for the fast tokenizer
- **tokenizer_config.json** - Settings for tokenizer behavior
- **special_tokens_map.json** - Defines special tokens like `<|endoftext|>`

### Key Takeaways

1. **Configuration files** tell you how the model is structured
2. **Weight files** contain the learned knowledge (and are the largest files)
3. **Tokenizer files** enable text ↔ token ID conversion
4. **All components must work together** for the model to function properly
5. **BPE tokenization** enables handling of unknown words through subword units