# Lab 1: Setup and First Inference

**Module**: Module 1 - Foundations  
**Estimated Time**: 30-45 minutes  
**Difficulty**: Beginner  

---

## Learning Objectives

By completing this lab, you will:
- [ ] Install llama.cpp and its Python bindings successfully
- [ ] Download and set up your first language model
- [ ] Execute basic inference and generate text
- [ ] Understand and experiment with key inference parameters
- [ ] Observe the impact of parameters on generation quality and speed

## Prerequisites

- Python 3.8 or higher installed
- Basic Python programming knowledge
- At least 4GB of free disk space
- Internet connection for downloading models

## What You'll Build

In this lab, you'll set up a complete local LLM inference environment and run your first text generation. You'll experiment with different parameters to understand how they affect the model's behavior.

---

## Part 1: Environment Setup (10 minutes)

Let's start by setting up the necessary tools. We'll install the llama-cpp-python bindings, which provide a convenient Python interface to llama.cpp.

In [None]:
# Check Python version
import sys
print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")

# Verify we're running Python 3.8+
assert sys.version_info >= (3, 8), "Python 3.8 or higher is required"
print("✓ Python version check passed!")

In [None]:
# Install llama-cpp-python
# Note: This may take a few minutes as it compiles C++ code
!pip install llama-cpp-python --upgrade

In [None]:
# Verify installation
try:
    import llama_cpp
    print(f"✓ llama-cpp-python version: {llama_cpp.__version__}")
    print("✓ Installation successful!")
except ImportError as e:
    print(f"✗ Installation failed: {e}")
    raise

### Exercise 1.1: Install Additional Dependencies

Install the following helpful packages that we'll use for monitoring and analysis:
- `psutil` for memory monitoring
- `tqdm` for progress bars
- `requests` for downloading models

In [None]:
# TODO: Install psutil, tqdm, and requests
# YOUR CODE HERE


In [None]:
# Auto-grading cell - DO NOT MODIFY
def test_dependencies():
    try:
        import psutil
        import tqdm
        import requests
        print("✓ All dependencies installed successfully!")
        return True
    except ImportError as e:
        print(f"✗ Missing dependency: {e}")
        return False

test_dependencies()

---

## Part 2: Download a Model (10 minutes)

For this lab, we'll use a small quantized model that's perfect for learning. We'll download **TinyLlama-1.1B-Chat** in Q4_K_M quantization (~700MB).

### Understanding Model Naming

Model files commonly follow this pattern (a version tag such as `v1.0` may follow the variant): `{model-name}-{size}-{variant}.{quantization}.gguf`

- **TinyLlama**: The model family
- **1.1B**: Number of parameters (1.1 billion)
- **Chat**: Fine-tuned for chat/instruction following
- **Q4_K_M**: Quantization type (4-bit, K-quant, Medium)
- **.gguf**: File format
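
The naming pattern above can be checked mechanically. The following is a minimal sketch, not part of llama.cpp: `parse_gguf_name` and its regular expression are hypothetical helpers written for this lab, and they assume the size field looks like `1.1b` or `7B`. Real filenames vary, so expect `None` for names that do not fit.

```python
import re

# Hypothetical pattern for names like tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# (assumes the size is a number followed by b/B/m/M)
GGUF_NAME = re.compile(
    r"^(?P<family>.+?)-(?P<size>\d+(?:\.\d+)?[bBmM])"
    r"(?:-(?P<variant>.+))?\.(?P<quant>[^.]+)\.gguf$"
)

def parse_gguf_name(filename):
    """Return the naming components as a dict, or None if the name does not match."""
    match = GGUF_NAME.match(filename)
    return match.groupdict() if match else None

parts = parse_gguf_name("tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")
for key, value in parts.items():
    print(f"{key}: {value}")
```

Note that the version tag (`v1.0`) ends up inside the `variant` field here, since this sketch does not try to separate the two.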

In [None]:
import os
from pathlib import Path
import requests
from tqdm import tqdm

# Create models directory
models_dir = Path("./models")
models_dir.mkdir(exist_ok=True)

# Model information
MODEL_URL = "https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
MODEL_FILE = models_dir / "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"

print(f"Model will be saved to: {MODEL_FILE}")

In [None]:
# Download model with progress bar
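def download_model(url, output_path):
    """Download a model file with a progress bar."""
    if output_path.exists():
        print(f"✓ Model already exists at {output_path}")
        return
    
    print(f"Downloading model from {url}...")
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors instead of saving an error page
    total_size = int(response.headers.get('content-length', 0))
    
    # Write to a temporary file first so an interrupted download
    # is never mistaken for a complete model on the next run
    partial_path = output_path.with_name(output_path.name + ".part")
    with open(partial_path, 'wb') as f, tqdm(
        desc=output_path.name,
        total=total_size,
        unit='iB',
        unit_scale=True,
        unit_divisor=1024,
    ) as pbar:
        for chunk in response.iter_content(chunk_size=8192):
            size = f.write(chunk)
            pbar.update(size)
    
    partial_path.replace(output_path)
    print("✓ Download complete!")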

# Note: For classroom/workshop settings, you may want to:
# 1. Pre-download models to a shared location
# 2. Copy from a local cache
# 3. Use a smaller model if bandwidth is limited

download_model(MODEL_URL, MODEL_FILE)

In [None]:
# Verify the model file
if MODEL_FILE.exists():
    size_mb = MODEL_FILE.stat().st_size / (1024 * 1024)
    print("✓ Model file exists")
    print(f"✓ File size: {size_mb:.2f} MB")
else:
    print("✗ Model file not found!")
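
A size check alone cannot tell a real model from a truncated file or a saved error page. GGUF files begin with the four ASCII bytes `GGUF`, so a quick extra check is possible. The helper below is a sketch for this lab; the name `looks_like_gguf` is hypothetical and it only inspects the header, not the rest of the file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # First four bytes of a GGUF file

def looks_like_gguf(path):
    """Return True if the file at `path` exists and starts with the GGUF magic bytes."""
    path = Path(path)
    if not path.is_file():
        return False
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC

# Usage in this lab (MODEL_FILE comes from the download cell):
# print(looks_like_gguf(MODEL_FILE))
```

If this returns `False` for a file that exists, deleting the file and downloading again is usually the simplest fix.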

---

## Part 3: Load the Model and Run First Inference (10 minutes)

Now for the exciting part - let's load the model and generate some text!

In [None]:
from llama_cpp import Llama
import time

# Load the model
print("Loading model...")
start_time = time.time()

llm = Llama(
    model_path=str(MODEL_FILE),
    n_ctx=2048,  # Context window size
    n_threads=4,  # Number of CPU threads
    verbose=False  # Set to True to see detailed loading info
)

load_time = time.time() - start_time
print(f"✓ Model loaded in {load_time:.2f} seconds")
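
The load cell uses a fixed `n_threads=4`, which may not match your machine. One way to pick a value is sketched below. `suggest_n_threads` is a hypothetical helper, and leaving one CPU free for the rest of the system is an assumption, not a llama.cpp recommendation.

```python
import os

def suggest_n_threads(reserve=1, fallback=4):
    """Suggest a thread count: logical CPUs minus `reserve`, never below 1."""
    logical = os.cpu_count()  # May return None if the count cannot be determined
    if logical is None:
        return fallback
    return max(1, logical - reserve)

print(f"Suggested n_threads: {suggest_n_threads()}")
```

`os.cpu_count()` reports logical CPUs. On machines with hyper-threading, the physical core count (for example from `psutil.cpu_count(logical=False)`) is often a better starting point, so treat the result as something to benchmark rather than a final answer.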

In [None]:
# Your first inference!
prompt = "What is machine learning?"

print(f"Prompt: {prompt}\n")
print("Generating response...\n")

output = llm(
    prompt,
    max_tokens=100,  # Maximum tokens to generate
    temperature=0.7,  # Randomness (0.0 = deterministic, 2.0 = very random)
    top_p=0.9,       # Nucleus sampling
    echo=False       # Don't include prompt in output
)

print("Response:")
print(output['choices'][0]['text'])

### Understanding the Output

The output is a dictionary containing:
- `choices`: List of generated completions (usually just one)
- `usage`: Token usage statistics
- Other metadata

Let's explore the full output structure:

In [None]:
import json

# Pretty print the output structure
print(json.dumps(output, indent=2, default=str))

### Exercise 3.1: Extract Key Metrics

From the output dictionary, extract and display:
1. The generated text
2. The number of tokens generated
3. The finish reason (why generation stopped)
4. The total tokens used (prompt + completion)

In [None]:
# TODO: Extract and print the metrics
# YOUR CODE HERE

generated_text = None  # Extract from output
tokens_generated = None  # Extract from output['usage']
finish_reason = None  # Extract from output['choices'][0]
total_tokens = None  # Extract from output['usage']

print(f"Generated Text: {generated_text}")
print(f"Tokens Generated: {tokens_generated}")
print(f"Finish Reason: {finish_reason}")
print(f"Total Tokens: {total_tokens}")

In [None]:
# Auto-grading cell - DO NOT MODIFY
def test_metrics_extraction():
    assert generated_text is not None and len(generated_text) > 0, "Generated text not extracted"
    assert isinstance(tokens_generated, int) and tokens_generated > 0, "Tokens generated not extracted"
    assert finish_reason is not None, "Finish reason not extracted"
    assert isinstance(total_tokens, int) and total_tokens > tokens_generated, "Total tokens not extracted"
    print("✓ All metrics extracted correctly!")
    return True

test_metrics_extraction()

---

## Part 4: Experiment with Parameters (15 minutes)

Now let's experiment with different parameters to understand their effects. We'll focus on the three most important parameters:

1. **temperature**: Controls randomness (0.0 = deterministic, 2.0 = very random)
2. **top_p**: Nucleus sampling; samples only from the smallest set of most likely tokens whose cumulative probability reaches p
3. **max_tokens**: Maximum length of generated text
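
To make the `top_p` idea concrete, here is a small sketch of nucleus filtering on a toy distribution. The function `top_p_filter` and the token probabilities are made up for illustration; this is not how llama.cpp implements its sampler internally, only the same idea in plain Python.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Toy next-token distribution (illustrative values only)
toy = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "rock": 0.05}
print(top_p_filter(toy, 0.9))  # The unlikely tail ("rock") is dropped
print(top_p_filter(toy, 0.5))  # Only the single most likely token remains
```

Lower `top_p` values shrink the candidate set, which tends to make output more predictable.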

### Understanding Temperature

Temperature affects the probability distribution:
- **Low (0.0-0.3)**: More focused, deterministic, repetitive
- **Medium (0.7-1.0)**: Balanced creativity and coherence
- **High (1.5-2.0)**: More random, creative, potentially incoherent
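
The effect of temperature on the probability distribution can be seen with a few lines of arithmetic. The sketch below divides toy logits by the temperature before applying softmax. The logit values are invented for illustration, and `softmax_with_temperature` is a hypothetical helper rather than a llama-cpp-python function.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by temperature (> 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # Illustrative scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At low temperature nearly all probability lands on the top token; at high temperature the three candidates move closer together, which is why outputs become more varied.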

In [None]:
# Helper function for clean generation
def generate_text(prompt, temperature=0.7, top_p=0.9, max_tokens=50):
    """Generate text with specified parameters."""
    output = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        echo=False
    )
    return output['choices'][0]['text'].strip()

# Test it
test_prompt = "The best thing about AI is"
result = generate_text(test_prompt)
print(f"Prompt: {test_prompt}")
print(f"Response: {result}")

### Exercise 4.1: Temperature Comparison

Generate responses to the same prompt with three different temperatures and observe the differences.

In [None]:
prompt = "Explain quantum computing in one sentence:"
temperatures = [0.1, 0.7, 1.5]

print(f"Prompt: {prompt}\n")

# TODO: Generate responses with different temperatures
# YOUR CODE HERE

for temp in temperatures:
    # Generate response
    response = None  # Call generate_text with appropriate temperature
    print(f"Temperature {temp}:")
    print(f"{response}\n")

### Exercise 4.2: Measuring Generation Speed

Implement a function that measures tokens per second during generation.

In [None]:
def measure_generation_speed(prompt, max_tokens=100):
    """
    Generate text and measure performance.
    
    Returns:
        dict: Contains 'text', 'tokens', 'time_seconds', 'tokens_per_second'
    """
    # TODO: Implement this function
    # YOUR CODE HERE
    
    start_time = time.time()
    # Generate text
    # Measure time
    # Calculate tokens per second
    
    return {
        'text': None,
        'tokens': None,
        'time_seconds': None,
        'tokens_per_second': None
    }

# Test your implementation
result = measure_generation_speed("Write a short poem about AI:", max_tokens=50)
print(f"Generated {result['tokens']} tokens in {result['time_seconds']:.2f} seconds")
print(f"Speed: {result['tokens_per_second']:.2f} tokens/second")
print(f"\nText: {result['text']}")

### Exercise 4.3: Batch Testing

Create a function that tests multiple prompts and compares their generation characteristics.

In [None]:
test_prompts = [
    "What is Python?",
    "Explain neural networks:",
    "The future of AI will",
    "In one sentence, describe how LLMs work:"
]

# TODO: Generate responses for all prompts and create a summary
# YOUR CODE HERE

results = []
for prompt in test_prompts:
    # Generate and measure
    # Store results
    pass

# Print summary
print("\n=== Generation Summary ===")
# Print average tokens/second, total tokens, etc.

---

## Part 5: Advanced Parameter Exploration (Optional)

### Understanding More Parameters

llama.cpp supports many more parameters:

- **repeat_penalty**: Penalize repeated tokens (1.0 = no penalty, >1.0 = penalize)
- **top_k**: Consider only the top K tokens (0 = disabled)
- **stop**: Stop sequences to end generation
- **stream**: Stream tokens as they're generated
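
As a rough illustration of `repeat_penalty` and `top_k`, the sketch below applies both to toy logits. The divide-if-positive, multiply-if-negative rule is a common way to implement a repetition penalty; whether it matches your llama.cpp version exactly is an assumption. The helper names and values are hypothetical.

```python
def apply_repeat_penalty(logits, recent_tokens, penalty):
    """Lower the score of tokens that appeared recently.
    Positive logits are divided by penalty, negative ones multiplied."""
    out = dict(logits)
    for token in set(recent_tokens):
        if token in out:
            out[token] = out[token] / penalty if out[token] > 0 else out[token] * penalty
    return out

def top_k_filter(logits, k):
    """Keep only the k highest-scoring tokens (k <= 0 disables the filter)."""
    if k <= 0:
        return dict(logits)
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

# Toy logits (illustrative values only)
toy = {"the": 3.0, "a": 2.0, "cat": 1.0, "zebra": -1.0}
penalized = apply_repeat_penalty(toy, ["the", "zebra"], penalty=1.5)
print(penalized)
print(top_k_filter(penalized, k=2))
```

With `penalty=1.0` the logits are unchanged, which matches the "1.0 = no penalty" convention in the list above.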

In [None]:
# Example: Using stop sequences
prompt = "Count from 1 to 10:\n1\n2\n3\n"

output = llm(
    prompt,
    max_tokens=50,
    temperature=0.1,
    stop=["\n8"],  # Stop before "8" is generated; the stop text is not returned
    echo=False
)

print(f"Prompt: {prompt}")
print(f"\nGeneration (stops before 8):\n{output['choices'][0]['text']}")

In [None]:
# Example: Streaming generation
prompt = "The three laws of robotics are:"

print(f"Prompt: {prompt}\n")
print("Streaming response:")

stream = llm(
    prompt,
    max_tokens=100,
    temperature=0.7,
    stream=True
)

for chunk in stream:
    text = chunk['choices'][0]['text']
    print(text, end='', flush=True)

print("\n\n✓ Streaming complete!")
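
When streaming, you often want the full text afterwards as well as a sense of how quickly the first piece arrived. The sketch below shows one way to collect that. It runs against `fake_stream`, a stand-in generator that yields chunks in the same shape as the streaming loop above, so it works without a loaded model. Both helper names are hypothetical.

```python
import time

def fake_stream(pieces):
    """Stand-in for a streaming call: yields chunks shaped like the ones above."""
    for piece in pieces:
        yield {"choices": [{"text": piece, "finish_reason": None}]}

def collect_stream(stream):
    """Join streamed text pieces.
    Returns (full_text, chunk_count, seconds_until_first_chunk)."""
    start = time.perf_counter()
    first_chunk_latency = None
    pieces = []
    for chunk in stream:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        pieces.append(chunk["choices"][0]["text"])
    return "".join(pieces), len(pieces), first_chunk_latency

text, n_chunks, latency = collect_stream(
    fake_stream(["A robot ", "may not ", "injure a human."])
)
print(text)
print(f"{n_chunks} chunks, first chunk after {latency:.4f}s")
```

To use it with the real model, pass the object returned by `llm(..., stream=True)` instead of `fake_stream(...)`.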

---

## Validation

Run this cell to validate you've completed all required exercises:

In [None]:
def validate_lab():
    """Validate lab completion."""
    checks = []
    
    # Check 1: Dependencies installed
    try:
        import llama_cpp
        import psutil
        import tqdm
        import requests
        checks.append(("Dependencies installed", True))
    except ImportError:
        checks.append(("Dependencies installed", False))
    
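    # Check 2: Model downloaded
    # (globals().get avoids a NameError if an earlier cell was skipped)
    model_file = globals().get("MODEL_FILE")
    checks.append(("Model downloaded", model_file is not None and model_file.exists()))
    
    # Check 3: Model loaded
    model = globals().get("llm")
    checks.append(("Model loaded", model is not None))
    
    # Check 4: Basic inference works
    try:
        model("Test", max_tokens=5)
        checks.append(("Basic inference", True))
    except Exception:
        checks.append(("Basic inference", False))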
    
    # Print results
    print("=== Lab Validation ===")
    all_passed = True
    for check_name, passed in checks:
        status = "✓" if passed else "✗"
        print(f"{status} {check_name}")
        if not passed:
            all_passed = False
    
    print("\n" + "="*50)
    if all_passed:
        print("🎉 Congratulations! You've completed Lab 1!")
    else:
        print("⚠️  Please complete all exercises before moving on.")
    
    return all_passed

validate_lab()

---

## Extension Challenges

Ready for more? Try these challenges:

### Challenge 1: Parameter Optimizer
Create a function that automatically finds the optimal temperature for a given task by testing multiple values and measuring output quality.

### Challenge 2: Prompt Templates
Implement a prompt template system that formats prompts for different tasks (Q&A, summarization, code generation).

### Challenge 3: Response Caching
Build a simple cache system that stores previous responses to avoid re-generating identical prompts.

### Challenge 4: Multi-Model Comparison
If you have multiple models, create a comparison tool that runs the same prompt through each model and compares the results.

### Challenge 5: Memory Monitor
Use `psutil` to monitor and log memory usage during model loading and inference. Create a visualization of memory consumption over time.

In [None]:
# Extension Challenge: Your implementation here


---

## Key Takeaways

In this lab, you learned:

1. **Setup**: How to install llama.cpp Python bindings and download models
2. **Basic Inference**: Loading models and generating text
3. **Parameters**: Understanding temperature, top_p, and max_tokens
4. **Performance**: Measuring tokens per second and generation speed
5. **Advanced Features**: Streaming generation and stop sequences

### Next Steps

- **Lab 2**: Explore GGUF format internals and quantization
- **Lab 3**: Memory profiling and optimization
- **Module 1 Docs**: Read about model architecture and inference pipelines

### Troubleshooting

**Model loading is slow**: This is normal for the first load. Subsequent loads may be faster due to OS caching.

**Out of memory**: Try a smaller model or reduce the `n_ctx` parameter.

**Installation fails**: Make sure you have C++ build tools installed. Check llama-cpp-python documentation for platform-specific instructions.
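
To see why reducing `n_ctx` helps with the out-of-memory case above, here is a back-of-the-envelope estimate of the key/value cache that grows with the context window. The model shape used here (22 layers, 4 key/value heads, head dimension 64, 16-bit cache values) is an assumption about TinyLlama-1.1B; load the model with `verbose=True` to see the values for your own file. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size: one key and one value vector
    per layer, per KV head, per context position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value

# Assumed TinyLlama-1.1B shape: 22 layers, 4 KV heads, head dimension 64
for n_ctx in (512, 1024, 2048):
    mib = kv_cache_bytes(n_ctx, n_layers=22, n_kv_heads=4, head_dim=64) / (1024 * 1024)
    print(f"n_ctx={n_ctx}: ~{mib:.0f} MiB")
```

The estimate scales linearly with `n_ctx`, so halving the context window halves this part of the memory footprint. The model weights themselves are unaffected.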

---

**Lab Created By**: Agent 4 (Lab Designer)  
**Last Updated**: 2025-11-18  
**Feedback**: [Submit feedback](../../feedback/)  