# Inside a Small Model: Understanding How LLMs Work
## A Deep Dive with GPT4All

This notebook explores **what happens inside a language model** when you give it a prompt. We'll visualize:
- **Tokenization**: How text is broken into tokens
- **Model Output**: What probabilities the model produces
- **Decoding**: How tokens are selected to generate text

This notebook uses the [GPT4All Python](https://docs.gpt4all.io/gpt4all_python/home.html) package and works with any local small model in `.gguf` format (OLMo, Mistral, Phi, Qwen, Llama, etc.).

**No PyTorch required!** This runs entirely on CPU using GPT4All.

### Resources

- GPT4All [Github Repo](https://github.com/nomic-ai/gpt4all)
- GPT4All [Documentation](https://docs.gpt4all.io/gpt4all_python/home.html)
- Behind the scenes: [llama.cpp](https://github.com/ggml-org/llama.cpp) backend

### Attribution

Notebook developed by Eric Van Dusen, building on work by Greg Merritt and the [ds-modules/ollama-demo](https://github.com/ds-modules/ollama-demo) framework.

## 1. Environment Setup

First, we'll:
1. Install the GPT4All package if needed
2. Set up the path to our model files
3. Load a small model from disk

This notebook assumes you have at least one `.gguf` model file downloaded (see `GPT4All_Download_gguf.ipynb` for download instructions).

In [None]:
# Ensure that your python environment has gpt4all capability
try:
    from gpt4all import GPT4All
except ImportError:
    %pip install gpt4all
    from gpt4all import GPT4All

### 1.1 Set the Model Path

Choose the path based on your environment:
- **Shared Hub** (like DataHub): `/home/jovyan/shared/`
- **Local Jupyter**: Your custom path (e.g., `shared-rw` or full path)

In [None]:
# For shared hub deployment
path = "/home/jovyan/shared/"

# For local deployment (uncomment and modify as needed)
# path = "shared-rw"
# path = "/Users/yourusername/path/to/models"

In [None]:
# Check what models are available
!ls "{path}"

### 1.2 Load a Small Model

We'll load a small model (1-2B parameters, quantized to ~1GB). You can use any of these:
- `qwen2-1_5b-instruct-q4_0.gguf` (Qwen 1.5B)
- `Llama-3.2-1B-Instruct-Q4_0.gguf` (Llama 1B)
- `Phi-3-mini-4k-instruct.Q4_0.gguf` (Phi-3 Mini, 3.8B parameters, so a larger file than the others)
- `OLMo-2-0425-1B-Q4_K_M.gguf` (OLMo 1B)
- Or any other `.gguf` model you've downloaded!

<p style="background-color:#ffe6e6; color:#b30000; padding:6px; border-radius:6px;">
⚠️ Note: If you see a pink error box about CUDA, don't worry—that's normal. The model will run on CPU.
</p>

In [None]:
# Load the model
model = GPT4All(
    model_name="qwen2-1_5b-instruct-q4_0.gguf",  # Change this to use a different model
    model_path=path,
    verbose=True
)

## 2. Understanding Tokenization

Language models don't work with raw text—they work with **tokens**. A token is a chunk of text (could be a word, part of a word, or punctuation).

Let's explore how text gets broken into tokens.

### What is a Token?

- Common words are usually single tokens: `"hello"` → `["hello"]`
- Uncommon words may be split: `"tokenization"` → `["token", "ization"]`
- Spaces and punctuation count too: `"Hello, world!"` → `["Hello", ",", " world", "!"]`

Different models use different tokenizers, so the same text may tokenize differently!

In [None]:
# Let's visualize tokenization
# GPT4All doesn't expose the tokenizer directly, but we can demonstrate the concept

def visualize_text_chunks(text):
    """
    Simple demonstration of how text might be chunked.
    Note: This is illustrative - actual tokenization varies by model.
    """
    print(f"Original text: '{text}'")
    print(f"Length in characters: {len(text)}")
    print("\nRough token visualization (simplified):")
    
    # Rough rule of thumb: English text averages about 4 characters per token
    approx_tokens = max(1, round(len(text) / 4))
    print(f"Approximate token count: {approx_tokens}")
    
    # Show word-level breakdown as a simple approximation
    words = text.split()
    print(f"\nWord breakdown ({len(words)} words):")
    for i, word in enumerate(words, 1):
        print(f"  {i}. '{word}'")

# Try different examples
examples = [
    "Hello, world!",
    "The quick brown fox jumps over the lazy dog.",
    "Antidisestablishmentarianism",
    "AI is transforming society."
]

for example in examples:
    visualize_text_chunks(example)
    print("\n" + "="*60 + "\n")

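The word-level split above hides the subword behavior described earlier. As a minimal sketch (not how any real model's tokenizer works internally), here is a greedy longest-match tokenizer over a tiny hand-written vocabulary. `TOY_VOCAB` and `toy_tokenize` are invented for illustration; real tokenizers learn their vocabulary from data (for example with byte-pair encoding) and will split text differently.

```python
# Toy vocabulary, invented for illustration only. Real tokenizers learn
# tens of thousands of subword pieces from training data.
TOY_VOCAB = {"hello", "token", "ization", "Hello", ",", " world", "!"}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split; unknown characters become one-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

print(toy_tokenize("hello"))          # ['hello']
print(toy_tokenize("tokenization"))   # ['token', 'ization']
print(toy_tokenize("Hello, world!"))  # ['Hello', ',', ' world', '!']
```

Joining the tokens back together always reproduces the original text, which is a property real tokenizers aim for as well.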
### 2.1 Token Limits and Max Tokens

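Every model has a **context window** (the maximum number of tokens it can process at once), and you can set a **max_tokens** parameter to limit output length.

- **Context window**: Total tokens (prompt + response). Often 2048-4096 for small models as loaded here (GPT4All's `n_ctx` argument, 2048 by default at the time of writing).
- **max_tokens**: Maximum tokens in the *response only*.

If the prompt plus the response would exceed the context window, the model cannot attend to all of it: depending on the runtime, the earliest tokens are dropped or generation stops early.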

In [None]:
# Example: Generate text with different token limits
prompt = "Explain what a language model is in simple terms."

print("Short response (20 tokens):")
with model.chat_session():
    response = model.generate(prompt=prompt, max_tokens=20)
    print(response)

print("\n" + "="*60 + "\n")

print("Longer response (100 tokens):")
with model.chat_session():
    response = model.generate(prompt=prompt, max_tokens=100)
    print(response)

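The relationship between the context window and `max_tokens` is simple arithmetic. The sketch below uses a hypothetical helper, `response_budget`, and made-up token counts; it is not part of GPT4All, it only shows how much room is left for a response.

```python
def response_budget(context_window, prompt_tokens, max_tokens):
    """Return how many response tokens fit, capped by max_tokens."""
    remaining = max(context_window - prompt_tokens, 0)
    return min(max_tokens, remaining)

# Short prompt: the max_tokens limit is what stops generation.
print(response_budget(context_window=2048, prompt_tokens=50, max_tokens=100))    # 100
# Long prompt: the context window runs out first.
print(response_budget(context_window=2048, prompt_tokens=2000, max_tokens=100))  # 48
# Prompt already larger than the window: no room at all.
print(response_budget(context_window=2048, prompt_tokens=3000, max_tokens=100))  # 0
```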
## 3. Understanding Model Output: Probabilities

When a language model processes text, it doesn't directly output words. Instead, it outputs **probabilities** for every possible next token.

### How it works:

1. **Input**: The model receives tokens (e.g., "The cat sat")
2. **Processing**: Neural network computes probabilities for what comes next
3. **Output**: A probability distribution over every token in the model's vocabulary (tens of thousands of tokens, and more than 100,000 for some models)
4. **Selection**: One token is chosen based on these probabilities
5. **Repeat**: That token is added, and the process repeats

Let's visualize this concept:

In [None]:
# Simulate what probability distributions might look like

def simulate_next_token_probabilities(prompt, top_n=10):
    """
    Simulates what next-token probabilities might look like.
    Note: This is illustrative - actual model probabilities are computed internally.
    """
    print(f"Prompt: '{prompt}'")
    print(f"\nUp to {top_n} most probable next tokens (simulated):")
    print("-" * 40)
    
    # Common words that might follow different prompts
    if "cat" in prompt.lower():
        candidates = [("sat", 0.35), ("jumped", 0.20), ("ran", 0.15), 
                     ("meowed", 0.12), ("slept", 0.10), ("walked", 0.05),
                     ("played", 0.02), ("ate", 0.01)]
    elif "the" in prompt.lower():
        candidates = [("quick", 0.15), ("cat", 0.14), ("dog", 0.13),
                     ("answer", 0.12), ("question", 0.11), ("world", 0.10),
                     ("sky", 0.09), ("earth", 0.08), ("sun", 0.08)]
    else:
        candidates = [("is", 0.25), ("the", 0.20), ("and", 0.15),
                     ("to", 0.12), ("of", 0.10), ("in", 0.08),
                     ("that", 0.05), ("it", 0.05)]
    
    for i, (token, prob) in enumerate(candidates[:top_n], 1):
        bar = "█" * int(prob * 100)
        # Pad the quoted token as a whole so the closing quote stays next to the word
        print(f"{i:2d}. {token!r:14s} {prob:5.1%} {bar}")
    
    print("\nThe model chooses one token based on these probabilities.")
    print("Higher probability = more likely to be chosen")

# Try different prompts
simulate_next_token_probabilities("The cat")
print("\n" + "="*60 + "\n")
simulate_next_token_probabilities("The")

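The simulated numbers above are hand-picked. In a real model, the final layer produces one raw score (a *logit*) per vocabulary entry, and the softmax function turns those scores into probabilities. Here is a minimal sketch with made-up logits for four tokens; the values are assumptions chosen only to illustrate the math.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for the prompt "The cat".
# A real model produces one score for every token in its vocabulary.
logits = {"sat": 2.0, "jumped": 1.4, "ran": 1.1, "banana": -3.0}
probs = softmax(logits)

for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok!r:10s} {p:6.1%}")
```

Notice that even the unlikely token gets a small, non-zero probability. That is why sampling can occasionally produce surprising words.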
### 3.1 Why Probabilities Matter

The model always produces probabilities, but **how we select from them** strongly shapes what the output looks like:

- **Greedy decoding**: Always pick the highest probability (deterministic)
- **Sampling**: Randomly sample based on probabilities (varied)
- **Temperature**: Controls randomness in sampling

We'll explore these next!

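To make the difference between greedy decoding and sampling concrete, here is a small sketch that reuses the simulated "The cat" probabilities from earlier. The helper names `greedy_pick` and `sample_pick` are made up for this example; the seed is fixed only so the run is repeatable.

```python
import random
from collections import Counter

# Simulated next-token probabilities (same illustrative numbers as above).
probs = {"sat": 0.35, "jumped": 0.20, "ran": 0.15, "meowed": 0.12,
         "slept": 0.10, "walked": 0.05, "played": 0.02, "ate": 0.01}

def greedy_pick(probs):
    """Always return the single most likely token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

rng = random.Random(0)
print("Greedy: ", [greedy_pick(probs) for _ in range(5)])
print("Sampled:", [sample_pick(probs, rng) for _ in range(5)])

# Over many draws, sampling frequencies approach the probabilities.
counts = Counter(sample_pick(probs, rng) for _ in range(10_000))
print("Share of 'sat' in 10,000 samples:", counts["sat"] / 10_000)
```

Greedy decoding returns `'sat'` every time, while sampling picks `'sat'` only about 35% of the time.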
## 4. Understanding Decoding: How Tokens Are Selected

**Decoding** is the process of choosing which token to use next based on the model's probability distribution.

### 4.1 Temperature: Controlling Randomness

**Temperature** affects how "random" or "conservative" the model's choices are:

- **Temperature = 0**: Always pick the highest-probability token ("greedy" decoding)
  - ✅ Deterministic: the same input gives the same output
  - ✅ Conservative: stays on the most likely continuation (which is not guaranteed to be factually correct)
  - ❌ Can be repetitive and bland

- **Temperature = 0.5 to 0.7**: Some randomness, but likely tokens are still strongly favored
  - ✅ Some variation while staying reasonable
  - Good for most applications

- **Temperature = 1.0**: Sample from the model's probabilities as they are
  - ✅ Creative and diverse
  - ❌ Can be less coherent (values above 1.0 flatten the distribution further)

Let's see this in action!

In [None]:
# Same prompt, different temperatures
prompt = "The future of artificial intelligence is"
num_responses = 3

print("=" * 60)
print("TEMPERATURE = 0 (Greedy/Deterministic)")
print("=" * 60)
for i in range(num_responses):
    with model.chat_session():
        response = model.generate(prompt=prompt, max_tokens=30, temp=0.0)
        print(f"\nResponse {i+1}: {response}")

print("\n" + "=" * 60)
print("TEMPERATURE = 0.7 (Balanced)")
print("=" * 60)
for i in range(num_responses):
    with model.chat_session():
        response = model.generate(prompt=prompt, max_tokens=30, temp=0.7)
        print(f"\nResponse {i+1}: {response}")

print("\n" + "=" * 60)
print("TEMPERATURE = 1.0 (Creative/Random)")
print("=" * 60)
for i in range(num_responses):
    with model.chat_session():
        response = model.generate(prompt=prompt, max_tokens=30, temp=1.0)
        print(f"\nResponse {i+1}: {response}")

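Under the hood, temperature is usually applied by dividing the logits by the temperature before the softmax. The sketch below shows that calculation on four made-up logits; `softmax_with_temperature` is a name chosen for this example. Since dividing by zero is undefined, runtimes typically treat a temperature of 0 as a special case meaning greedy decoding, and this sketch simply rejects it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical scores for four tokens
for t in (0.2, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.2f}" for p in probs))
```

Low temperatures push almost all of the probability onto the top token; high temperatures spread it more evenly across the alternatives.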
### 4.2 Top-k and Top-p Sampling

Besides temperature, we can control decoding with:

**Top-k sampling**: Only consider the top k most likely tokens
- `top_k=1`: Greedy (always pick the best)
- `top_k=10`: Choose randomly from the 10 most likely tokens
- `top_k=50`: Choose from top 50 (more variety)

**Top-p sampling** (nucleus sampling): Choose from the smallest set of most likely tokens whose cumulative probability reaches p
- `top_p=0.1`: Very conservative (only very likely tokens)
- `top_p=0.9`: More diverse (include less likely tokens)
- `top_p=1.0`: Consider all tokens

Let's experiment:

In [None]:
# Experimenting with top_k
prompt = "Once upon a time, in a land far away,"

print("Top-k = 1 (Greedy):")
with model.chat_session():
    response = model.generate(prompt=prompt, max_tokens=40, temp=1.0, top_k=1)
    print(response)

print("\n" + "=" * 60 + "\n")

print("Top-k = 40 (More variety):")
with model.chat_session():
    response = model.generate(prompt=prompt, max_tokens=40, temp=1.0, top_k=40)
    print(response)

In [None]:
# Experimenting with top_p
prompt = "The meaning of life is"

print("Top-p = 0.1 (Conservative):")
with model.chat_session():
    response = model.generate(prompt=prompt, max_tokens=40, temp=1.0, top_p=0.1)
    print(response)

print("\n" + "=" * 60 + "\n")

print("Top-p = 0.95 (More diverse):")
with model.chat_session():
    response = model.generate(prompt=prompt, max_tokens=40, temp=1.0, top_p=0.95)
    print(response)

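Both filters can be sketched in a few lines. The code below applies them to the simulated "The cat" probabilities; the function names are made up for this example, and real samplers work on the full vocabulary and may combine the filters in a particular order.

```python
def renormalize(probs):
    """Rescale the kept probabilities so they sum to 1 again."""
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return renormalize(dict(ranked[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for tok, prob in ranked:
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)

probs = {"sat": 0.35, "jumped": 0.20, "ran": 0.15, "meowed": 0.12,
         "slept": 0.10, "walked": 0.05, "played": 0.02, "ate": 0.01}

print(list(top_k_filter(probs, 3)))    # ['sat', 'jumped', 'ran']
print(list(top_p_filter(probs, 0.5)))  # ['sat', 'jumped']
print(list(top_p_filter(probs, 0.1)))  # ['sat']
```

Top-k always keeps a fixed number of tokens, while top-p keeps more tokens when the model is uncertain and fewer when it is confident.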
## 5. Putting It All Together: Generation Step-by-Step

Let's trace through what happens during text generation:

1. **Tokenization**: Your prompt is broken into tokens
2. **Model Processing**: Tokens go through the neural network
3. **Probability Computation**: Model outputs probabilities for next token
4. **Sampling**: A token is selected using temperature/top-k/top-p
5. **Append**: Selected token is added to the sequence
6. **Repeat**: Steps 2-5 repeat until `max_tokens` is reached or a stop condition (such as an end-of-sequence token) occurs

### 5.1 Visualizing Generation

In [None]:
def simulate_generation_steps(prompt, max_tokens=5):
    """
    Simulates and visualizes the step-by-step generation process.
    Note: This is a conceptual demonstration - actual generation happens internally.
    """
    print("STEP-BY-STEP GENERATION SIMULATION")
    print("=" * 60)
    print(f"Initial prompt: '{prompt}'")
    print(f"Max tokens to generate: {max_tokens}")
    print("\n" + "=" * 60 + "\n")
    
    current_text = prompt
    
    # Simulate token-by-token generation
    sample_tokens = ["beautiful", "mysterious", "ancient", "forgotten", "kingdom"]
    
    for step in range(max_tokens):
        print(f"Step {step + 1}:")
        print(f"  Current text: '{current_text}'")
        print(f"  → Model computes probabilities for next token...")
        
        # Simulate token selection
        next_token = sample_tokens[step] if step < len(sample_tokens) else "."
        print(f"  → Selected token: '{next_token}'")
        
        current_text += " " + next_token
        print(f"  Updated text: '{current_text}'")
        print()
    
    print("=" * 60)
    print(f"Final generated text: '{current_text}'")

simulate_generation_steps("Once upon a time, there was a", max_tokens=5)

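The simulation above reads its tokens from a fixed list. A slightly closer sketch replaces that list with a toy "model": a lookup table of next-token probabilities. Everything here (`TOY_MODEL`, the `<eos>` marker, the probabilities) is invented for illustration, and the table looks only at the previous token, whereas a real model conditions on the whole sequence.

```python
import random

# Hypothetical next-token table keyed on the previous token only.
TOY_MODEL = {
    "a": {"kingdom": 0.6, "dragon": 0.4},
    "kingdom": {"of": 0.7, "<eos>": 0.3},
    "of": {"dragons": 0.6, "ice": 0.4},
}

def generate(tokens, max_tokens, rng=None):
    """Append one token per step until max_tokens or an end-of-sequence token."""
    tokens = list(tokens)
    for _ in range(max_tokens):
        # Contexts missing from the table end the sequence.
        probs = TOY_MODEL.get(tokens[-1], {"<eos>": 1.0})
        if rng is None:
            next_token = max(probs, key=probs.get)  # greedy decoding
        else:
            choices = list(probs)
            next_token = rng.choices(choices, weights=[probs[c] for c in choices], k=1)[0]
        if next_token == "<eos>":
            break  # stop condition
        tokens.append(next_token)
    return tokens

prompt = ["Once", "upon", "a", "time,", "there", "was", "a"]
print(" ".join(generate(prompt, max_tokens=5)))  # greedy, stops at <eos>
print(" ".join(generate(prompt, max_tokens=2)))  # greedy, stops at max_tokens
print(" ".join(generate(prompt, max_tokens=5, rng=random.Random(1))))  # sampled
```

The first call ends because the toy model emits its end-of-sequence marker, the second because it hits the `max_tokens` limit: the same two stop conditions listed in step 6.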
### 5.2 Streaming Generation

In real applications, we often want to see tokens as they're generated (like ChatGPT). This is called **streaming**.

GPT4All supports streaming, which lets us see each token as it's produced:

In [None]:
# Streaming example - watch tokens appear one at a time

prompt = "Write a haiku about small language models:"

print(f"Prompt: {prompt}\n")
print("Streaming response:")
print("-" * 40)

with model.chat_session():
    tokens = model.generate(prompt=prompt, max_tokens=50, temp=0.7, streaming=True)
    
    for token in tokens:
        print(token, end='', flush=True)
        
print("\n" + "-" * 40)
print("\nGeneration complete!")

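The loop above prints each token but does not keep the text. A common pattern is to display and collect at the same time. The sketch below uses a stand-in generator, `fake_stream`, so that it runs without a model; with a real model you would pass the iterator returned by `model.generate(..., streaming=True)` instead.

```python
def fake_stream(tokens):
    """Stand-in for a streaming response: yields one token at a time."""
    for token in tokens:
        yield token

def consume(stream):
    """Print tokens as they arrive and return the full text at the end."""
    pieces = []
    for token in stream:
        print(token, end="", flush=True)  # show immediately
        pieces.append(token)              # keep for later use
    print()
    return "".join(pieces)

full_text = consume(fake_stream(["Small", " models", " hum", " softly"]))
print(f"Collected {len(full_text)} characters")
```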
## 6. Controlling Generation: Repetition and Penalties

Sometimes models repeat themselves. We can use **repetition penalty** to discourage this.

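- `repeat_penalty=1.0`: No penalty (GPT4All's `generate()` applies a penalty by default, 1.18 at the time of writing, so pass `1.0` explicitly to turn it off)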
- `repeat_penalty=1.2`: Slightly discourage repetition
- `repeat_penalty=1.5`: Strongly discourage repetition
- `repeat_last_n`: How many recent tokens to check for repetition

In [None]:
# Example: Reducing repetition
prompt = "The cat sat on the mat. The cat"

print("Without repetition penalty:")
with model.chat_session():
    response = model.generate(prompt=prompt, max_tokens=30, temp=0.8, repeat_penalty=1.0)
    print(response)

print("\n" + "=" * 60 + "\n")

print("With repetition penalty (1.3):")
with model.chat_session():
    response = model.generate(prompt=prompt, max_tokens=30, temp=0.8, repeat_penalty=1.3)
    print(response)

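How does the penalty change the model's scores? One common formulation lowers the logit of every token that appeared recently: positive logits are divided by the penalty and negative logits are multiplied by it. The sketch below follows that rule with made-up logits; whether GPT4All's backend uses exactly this formula is an assumption here, and `apply_repeat_penalty` is a name chosen for this example.

```python
def apply_repeat_penalty(logits, recent_tokens, penalty, last_n=64):
    """Lower the score of tokens seen in the last `last_n` positions."""
    window = set(recent_tokens[-last_n:]) if last_n > 0 else set()
    adjusted = {}
    for tok, score in logits.items():
        if tok in window:
            # Move the score toward "less likely" whatever its sign.
            score = score / penalty if score > 0 else score * penalty
        adjusted[tok] = score
    return adjusted

# Hypothetical logits for the next token after "The cat sat on the mat. The".
logits = {"cat": 3.0, "dog": 2.5, "sat": 2.0, "the": -1.0}
recent = ["The", "cat", "sat", "on", "the", "mat"]

for penalty in (1.0, 1.3):
    adjusted = apply_repeat_penalty(logits, recent, penalty)
    best = max(adjusted, key=adjusted.get)
    print(f"penalty={penalty}: best token = {best!r}")
```

With no penalty the repeated token `'cat'` wins; with a penalty of 1.3 its score drops below that of `'dog'`, which has not appeared yet.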
## 7. Comparing Different Models

Different models (OLMo, Mistral, Phi, Qwen, Llama) behave differently even with the same prompt and parameters.

This is because they:
- Were trained on different data
- Have different architectures
- Use different tokenizers
- Have different sizes

### 7.1 Switching Models

To try a different model, just change the `model_name` when loading:

In [None]:
# Example of loading a different model
# Uncomment to try (make sure you have the model downloaded first!)

# model_olmo = GPT4All(
#     model_name="OLMo-2-0425-1B-Q4_K_M.gguf",
#     model_path=path,
#     verbose=True
# )

# model_llama = GPT4All(
#     model_name="Llama-3.2-1B-Instruct-Q4_0.gguf",
#     model_path=path,
#     verbose=True
# )

# model_phi = GPT4All(
#     model_name="Phi-3-mini-4k-instruct.Q4_0.gguf",
#     model_path=path,
#     verbose=True
# )

print("To compare models, load multiple models and run the same prompt on each!")

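One way to organize a comparison is a small helper that runs the same prompt through several generators and collects the replies by name. This is only a sketch: `compare_models` is a hypothetical helper, and the stand-in lambdas exist so the code runs without any model files.

```python
def compare_models(generators, prompt):
    """Run one prompt through several callables and collect replies by name."""
    results = {}
    for name, generate in generators.items():
        results[name] = generate(prompt)
    return results

# Stand-in callables so the sketch runs without loading a model.
fake_generators = {
    "model_a": lambda prompt: prompt.upper(),
    "model_b": lambda prompt: prompt[::-1],
}

for name, reply in compare_models(fake_generators, "hello").items():
    print(f"{name}: {reply}")
```

With real models loaded, each entry could wrap a call such as `lambda p: model.generate(prompt=p, max_tokens=50, temp=0.0)`, using a temperature of 0 so that differences come from the models rather than from sampling.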
## 8. Key Takeaways

### What We Learned:

1. **Tokenization**
   - Text is broken into tokens (words, subwords, characters)
   - Models have token limits (context window)
   - Different models use different tokenizers

2. **Model Output**
   - Models output probability distributions, not words
   - Every token has a probability of being next
   - Generation selects tokens based on these probabilities

3. **Decoding Strategies**
   - **Temperature**: Controls randomness (0 = deterministic, 1+ = creative)
   - **Top-k**: Limit to k most likely tokens
   - **Top-p**: Limit to the smallest set of tokens whose cumulative probability reaches p
   - **Repetition penalty**: Discourage repeating tokens

4. **Generation Process**
   - One token at a time
   - Each token depends on all previous tokens
   - Can be streamed for real-time display

### Practical Applications:

- **Factual answers**: Use low temperature (0-0.3), greedy decoding
- **Creative writing**: Use higher temperature (0.7-1.0), top-p sampling
- **Code generation**: Use low temperature, repetition penalty
- **Brainstorming**: Use high temperature, high top-k

### Remember:

- Models don't "understand"—they predict probable next tokens
- Higher randomness ≠ better quality (just more varied)
- Small models (~1B params) work well for many tasks
- Once a model file is downloaded, GPT4All needs no internet connection or GPU!

## 9. Experiments to Try

Now that you understand how models work internally, try these experiments:

### Experiment 1: Temperature Extremes
Try temperature = 0, 0.5, 1.0, and 2.0 on the same prompt. What happens at 2.0?

In [None]:
# Your experiment here
prompt = "The most important invention in history was"

# Try different temperatures and observe the results!

### Experiment 2: Token Limits
Try max_tokens = 5, 20, 100, 200. How does response quality change?

In [None]:
# Your experiment here
prompt = "Explain quantum computing"

# Try different max_tokens values!

### Experiment 3: Comparing Models
Load different models (OLMo, Llama, Phi, Qwen) and compare their responses to the same prompt.

In [None]:
# Your experiment here
prompt = "What makes a good teacher?"

# Load and compare different models!

### Experiment 4: System Prompts
Use a system prompt to give the model a "personality" or role. How does this affect outputs?

In [None]:
# Your experiment here
system_prompt = "You are a helpful but very concise assistant who answers in haiku."
user_prompt = "What is machine learning?"

# Try different system prompts!

## 10. Further Reading

Want to learn more?

- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/) - Visual explanation of how transformers work
- [GPT4All Documentation](https://docs.gpt4all.io/) - Full API reference
- [Hugging Face Model Hub](https://huggingface.co/models) - Browse thousands of models
- [Understanding GGUF format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) - The file format used to store quantized models

### Related Notebooks in This Repo:

- `GPT4All_SmallLM_Demo.ipynb` - Basic usage and chat sessions
- `GPT4All_Download_gguf.ipynb` - How to download models
- `Gradio_Chatbot_GPT4All.ipynb` - Build a chatbot UI