# Week 1: Foundations of LLM Evaluation & First Principles
### BenchRight LLM Evaluation Master Program (18 Weeks)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

1. Understand *what LLM evaluation is* and why it is the "currency" of model quality
2. Learn the *4 pillars of LLM evaluation*: Quantitative, Qualitative, Safety, Performance
3. Run your first ONNX LLM benchmark in Google Colab
4. Measure latency on multiple prompts
5. Display results in a structured DataFrame

---

## 🧠 Section 1: What is LLM Evaluation? (Feynman Technique)

### Simple Explanation

Imagine you're buying a car. You'd want to know:
- **How fast does it go?** → This is like measuring *throughput* and *latency*
- **How well does it steer?** → This is like checking *correctness* and *reasoning*
- **How safe is it?** → This is like testing for *hallucinations* and *toxicity*
- **How much fuel does it need?** → This is like measuring *memory* and *compute cost*

**LLM evaluation** is the process of systematically answering one core question:

> *"How good is this model: objectively, reproducibly, and safely?"*

If you can't explain your evaluation in simple terms, you don't truly understand it.

---

## 🏛️ Section 2: The 4 Pillars of LLM Evaluation

Every evaluation falls into one of these four categories:

| Pillar | Question | Metrics |
|--------|----------|----------|
| **Quantitative** | What does the model understand? | Perplexity, accuracy, benchmarks (MMLU, HellaSwag) |
| **Qualitative** | How does it behave? | Coherence, LLM-as-judge scoring |
| **Safety** | Is it safe? | TruthfulQA, toxicity checks, hallucination detection |
| **Performance** | Is it usable? | Latency, throughput, memory usage |
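
To make one of the quantitative metrics concrete, here is a minimal sketch of perplexity: the exponential of the average negative log-probability the model assigned to each correct token. The function name `perplexity` and both probability lists are made up for illustration; they are not measurements from tinyGPT.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-probability per token)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities a model gave to the correct next token
# at four positions (illustrative numbers only).
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.25, 0.25, 0.25, 0.25]

print(f"confident model: {perplexity(confident):.2f}")
print(f"uncertain model: {perplexity(uncertain):.2f}")
```

A model that always spreads its belief evenly over four options gets a perplexity of 4, so lower perplexity means the model is less "surprised" by the text.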

---

## 🛠️ Step 1: Setup & Install Dependencies

First, install ONNX Runtime, the `transformers` library for tokenization, and pandas for the results table.

**Why these libraries?**
- `onnxruntime`: Runs ONNX models efficiently on CPU/GPU
- `transformers`: Provides pre-trained tokenizers that convert text to model inputs
- `pandas`: Holds the benchmark results in a DataFrame (used in Step 5)

```python
# Install required packages
!pip install onnxruntime transformers pandas
```

---

## 📦 Step 2: Load tinyGPT ONNX Model

### Feynman Explanation

Think of loading a model like opening a cookbook:
1. The **ONNX file** is the cookbook (contains all the recipes/weights)
2. The **InferenceSession** is you reading and following the recipes
3. The **tokenizer** is like a translator that converts your ingredients (text) into a format the cookbook understands (numbers)

We use ONNX because it is a portable model format: the same exported file can run on many platforms and hardware backends through ONNX Runtime.

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Path to the ONNX model (you should upload your model to /tmp/ in Colab)
model_path = "/tmp/tinygpt.onnx"

# Load the tokenizer (GPT-2 tokenizer works with tinyGPT)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Create an inference session with CPU provider
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

print("✅ Model loaded successfully!")
print(f"Input names: {[inp.name for inp in session.get_inputs()]}")
print(f"Output names: {[out.name for out in session.get_outputs()]}")
```
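
The input names printed above decide what has to be passed to `session.run`. Some ONNX exports expect only `input_ids`, others also want `attention_mask`. The sketch below shows one way to build the feed dictionary from whatever names the graph declares. `build_feed` is a hypothetical helper, and the `encoded` dictionary is a stand-in for real tokenizer output.

```python
def build_feed(input_names, encoded):
    """Select exactly the arrays that the ONNX graph declares as inputs."""
    missing = [name for name in input_names if name not in encoded]
    if missing:
        raise KeyError(f"tokenizer output lacks required model inputs: {missing}")
    return {name: encoded[name] for name in input_names}

# Stand-in values. In the notebook these would come from:
#   input_names = [inp.name for inp in session.get_inputs()]
#   encoded = tokenizer(prompt, return_tensors="np")
encoded = {"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]}

print(build_feed(["input_ids"], encoded))
print(build_feed(["input_ids", "attention_mask"], encoded))
```

Failing early with a clear message is easier to debug than the error ONNX Runtime raises when a required input is absent from the feed.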

---

## 🚀 Step 3: Run First Inference

### Feynman Explanation

Running inference is like asking the model a question:
1. **Input**: Your text prompt (e.g., "Explain artificial intelligence.")
2. **Tokenization**: Convert text to numbers the model understands
3. **Forward pass**: The model processes the numbers through its neural network
4. **Output**: The model returns a score for every possible next token (typically raw logits, which a softmax turns into probabilities)
5. **Decode**: We convert the highest-scoring token IDs back to readable text

```python
# Define a test prompt
prompt = "Explain artificial intelligence."

# Tokenize the input
inputs = tokenizer(prompt, return_tensors="np")

# Run inference (assumes the model's only required input is "input_ids")
outputs = session.run(None, {"input_ids": inputs["input_ids"]})

# Pick the highest-scoring token at every input position
# (assumes outputs[0] has shape (batch, sequence_length, vocab_size))
result_ids = np.argmax(outputs[0], axis=-1)[0]

# Decode back to text
result_text = tokenizer.decode(result_ids)

print(f"📝 Prompt: {prompt}")
print(f"🤖 Model output: {result_text}")
```
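
The cell above runs a single forward pass, so the decoded text is the model's top guess for the next token at each input position, not a generated continuation. Generating text means repeating that step: score, pick a token, append it, run again. Here is a minimal sketch of that loop (greedy decoding) together with the softmax from step 4. `VOCAB` and `toy_next_token_logits` are made-up stand-ins so the example runs without a model file; they are not tinyGPT.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(next_token_logits, token_ids, max_new_tokens, eos_id=None):
    """Repeatedly append the highest-scoring next token."""
    token_ids = list(token_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(token_ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        token_ids.append(next_id)
        if next_id == eos_id:
            break  # stop once the end-of-sequence token is produced
    return token_ids

# Toy stand-in for the model: a 4-token vocabulary in which each token
# strongly prefers the next one in the cycle 0 -> 1 -> 2 -> 3 -> 0.
VOCAB = ["the", "robot", "says", "hello"]

def toy_next_token_logits(token_ids):
    preferred = (token_ids[-1] + 1) % len(VOCAB)
    return [5.0 if i == preferred else 0.0 for i in range(len(VOCAB))]

generated = greedy_decode(toy_next_token_logits, [0], max_new_tokens=3)
print(" ".join(VOCAB[i] for i in generated))
print([round(p, 3) for p in softmax(toy_next_token_logits([0]))])
```

The first line printed is `the robot says hello`. With the real session, `next_token_logits` could wrap `session.run` and return the scores for the last position, for example `outputs[0][0, -1]`, assuming the first output holds logits of shape (batch, sequence_length, vocab_size).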

---

## ⏱️ Step 4: Measure Latency on 3 Prompts

### Feynman Explanation

**Latency** is like measuring how long it takes a chef to prepare a dish:
- Start the timer when you give the order (input)
- Stop the timer when the dish arrives (output)
- The time in between is the latency

Lower latency = faster responses = better user experience.

We'll measure latency for 3 different prompts to understand how input affects performance.

```python
import time

def benchmark(model_session, text, tokenizer_instance):
    """Measure latency (tokenization + one forward pass) for a text prompt."""
    t0 = time.perf_counter()  # monotonic, high-resolution clock for timing
    inputs = tokenizer_instance(text, return_tensors="np")
    _ = model_session.run(None, {"input_ids": inputs["input_ids"]})
    t1 = time.perf_counter()
    return (t1 - t0) * 1000  # Convert seconds to milliseconds

# Define 3 test prompts
prompts = [
    "Explain machine learning.",
    "Summarize the Singapore financial system.",
    "Describe a robot to a child."
]

# Run benchmark for each prompt
results = []
for prompt in prompts:
    latency = benchmark(session, prompt, tokenizer)
    # Run the model again (untimed) to get an output for display
    inputs = tokenizer(prompt, return_tensors="np")
    outputs = session.run(None, {"input_ids": inputs["input_ids"]})
    result_ids = np.argmax(outputs[0], axis=-1)[0]
    output_text = tokenizer.decode(result_ids)
    results.append({
        "Prompt": prompt,
        # Truncate long outputs so the table stays readable
        "Output": (output_text[:100] + "...") if len(output_text) > 100 else output_text,
        "Latency (ms)": round(latency, 2)
    })
    print(f"✅ Processed: '{prompt[:30]}...' - Latency: {latency:.2f} ms")
```
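
A single timed call per prompt is noisy, and the very first call often includes one-off start-up cost. A common remedy is to run a few untimed warm-up calls and then summarize several timed runs. The sketch below shows that pattern; `measure_latency` and its `warmup` and `runs` parameters are hypothetical names, and the workload is a stand-in so the example runs without a model file.

```python
import statistics
import time

def measure_latency(fn, warmup=2, runs=10):
    """Time fn() over several runs after warm-up calls; stats are in ms."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return {
        "runs": runs,
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "stdev_ms": statistics.stdev(samples) if runs > 1 else 0.0,
        "min_ms": min(samples),
        "max_ms": max(samples),
    }

# Stand-in workload. In the notebook this could be:
#   lambda: session.run(None, {"input_ids": inputs["input_ids"]})
stats = measure_latency(lambda: sum(i * i for i in range(50_000)))
for name, value in stats.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

Reporting the median alongside the mean helps because a single slow outlier can shift the mean noticeably when there are only a handful of runs.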

---

## 📊 Step 5: Display Results as DataFrame

### Feynman Explanation

A **DataFrame** is like a spreadsheet:
- Rows represent each experiment (prompt)
- Columns represent measurements (prompt text, output, latency)

Using DataFrames makes it easy to:
- Compare results side-by-side
- Export to CSV/Excel for reports
- Calculate statistics (mean, std, etc.)

```python
import pandas as pd

# Create DataFrame from results
df = pd.DataFrame(results)

# Display the results table
print("\n📊 Benchmark Results:")
print("=" * 80)
display(df)

# Calculate summary statistics
print("\n📈 Summary Statistics:")
print(f"  Mean Latency: {df['Latency (ms)'].mean():.2f} ms")
print(f"  Min Latency:  {df['Latency (ms)'].min():.2f} ms")
print(f"  Max Latency:  {df['Latency (ms)'].max():.2f} ms")
```
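
The other DataFrame conveniences listed above (sorting for comparison, several statistics at once, CSV export) can be sketched like this. The rows in `example_results` are placeholders with made-up latencies, shaped like the `results` list from Step 4; they are not real measurements.

```python
import io

import pandas as pd

# Placeholder rows in the same shape as `results` (numbers are invented).
example_results = [
    {"Prompt": "prompt A", "Output": "...", "Latency (ms)": 12.0},
    {"Prompt": "prompt B", "Output": "...", "Latency (ms)": 18.0},
    {"Prompt": "prompt C", "Output": "...", "Latency (ms)": 15.0},
]
example_df = pd.DataFrame(example_results)

# Several statistics in one call
summary = example_df["Latency (ms)"].agg(["mean", "std", "min", "max"])
print(summary)

# Slowest prompt first, for side-by-side comparison
print(example_df.sort_values("Latency (ms)", ascending=False))

# Export as CSV. An in-memory buffer keeps the example free of side effects;
# a file path works in place of the buffer.
buffer = io.StringIO()
example_df.to_csv(buffer, index=False)
print(buffer.getvalue())
```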

---

## 🤔 Paul-Elder Critical Thinking Questions

Reflect on these questions using the Paul-Elder framework:

### Question 1: CLAIM
**What does measuring latency alone tell us about the model's quality?**
*Consider: Does low latency mean the model is "good"? What's missing?*

### Question 2: EVIDENCE
**If latency varies between prompts, what evidence would explain the difference?**
*Consider: Input length, tokenization, model architecture.*

### Question 3: REASONING
**Why do we use ONNX Runtime instead of running the model natively in PyTorch?**
*Consider: Portability, optimization, deployment scenarios.*

### Question 4: ASSUMPTIONS
**What assumptions are we making when we benchmark on only 3 prompts?**
*Consider: Statistical significance, prompt diversity, real-world usage.*

### Question 5: IMPLICATIONS
**If we deployed this model to production based solely on these latency results, what could go wrong?**
*Consider: Quality, safety, edge cases.*

---

## 🔄 Inversion Thinking Exercise

**Instead of asking:** "Is my model good?"

**Ask:** "How can my model fail?"

### Exercise

List at least 5 ways this evaluation approach could fail or give misleading results:

1. ________________________________________________
2. ________________________________________________
3. ________________________________________________
4. ________________________________________________
5. ________________________________________________

**Hint:** Think about:
- Hallucinations in output
- Cold start vs. warm start latency
- CPU vs. GPU differences
- Memory constraints
- Prompt sensitivity

---

## 📝 Mini-Project

### Task

Evaluate tinyGPT on the 3 prompts provided and measure latency. Create a comprehensive results table.

### Requirements

1. Run inference on each of the 3 prompts
2. Measure and record latency for each
3. Create a results table with columns: Prompt, Output, Latency (ms)
4. Write a brief interpretation of your results
5. Upload your results to `/examples/week01_results.md`

### Submission Format

```markdown
# Week 1 Mini-Project Results

## Results Table
| Prompt | Output | Latency (ms) |
|--------|--------|---------------|
| ... | ... | ... |

## Interpretation
[Your analysis here]
```

---

## ✅ Knowledge Mastery Checklist

Before moving to Week 2, ensure you can check all boxes:

- [ ] I can explain what LLM evaluation is in simple terms
- [ ] I know the 4 pillars of evaluation (Quantitative, Qualitative, Safety, Performance)
- [ ] I successfully ran an ONNX model in Colab
- [ ] I measured and understand latency metrics
- [ ] I can apply Feynman, Paul-Elder, and Inversion thinking to evaluation
- [ ] I completed the mini-project

---

**Week 1 Complete!** 🎉

**Next:** *Week 2: Tokenization & ONNX Runtime Internals*