# Baseline Model Evaluation (CPU, no API)

This notebook evaluates a small open model (FLAN-T5-small) on a physics QA sample to establish a realistic baseline without API keys or GPUs.

In [None]:
# Install dependencies (CPU-friendly). Run once per environment.
# pandas and matplotlib are included because later cells import them.
%pip install --quiet transformers torch sentencepiece pandas matplotlib

## 1) Load or create evaluation dataset

In [None]:
# Create a tiny demo dataset if the expected file is missing.
import json
import pathlib

EVAL_PATH = pathlib.Path("../data/evaluation/physics_qa_dataset.json")
EVAL_PATH.parent.mkdir(parents=True, exist_ok=True)

if not EVAL_PATH.exists():
    demo = {
        "physics_qa_dataset": [
            {"question": "What is the period of a simple pendulum of length 2 m on Earth?", "answer": "2.84 s"},
            {"question": "State Newton's second law.", "answer": "F = m a"},
            {"question": "What are the dimensions of Planck's constant?", "answer": "[M L^2 T^-1]"},
            {"question": "Compute kinetic energy for m=5 kg, v=10 m/s.", "answer": "250 J"},
            {"question": "What is the relation between photon energy and frequency?", "answer": "E = h f"},
        ]
    }
    with open(EVAL_PATH, "w") as f:
        json.dump(demo, f, indent=2)
    print("Demo dataset created at", EVAL_PATH)
else:
    print("Using existing dataset:", EVAL_PATH)

# Load the dataset (either the existing file or the demo just written).
with open(EVAL_PATH, "r") as f:
    data = json.load(f)

questions = data["physics_qa_dataset"]
print(f"Loaded {len(questions)} questions")

## 2) Initialize baseline LLM (FLAN-T5-small, CPU)

In [None]:
from transformers import pipeline

# A text2text model is suitable for short QA-style outputs; it is small and CPU-friendly.
qa = pipeline("text2text-generation", model="google/flan-t5-small")
print("✓ FLAN-T5-small loaded (CPU)")

## 3) Baseline inference function and evaluation loop (N questions)

In [None]:
import pandas as pd


def baseline_answer(question: str) -> str:
    """Call the CPU-friendly model and return the string answer."""
    # Greedy decoding (do_sample=False) gives deterministic output.
    # temperature=0.0 is avoided: it is ignored without sampling and
    # can trigger a warning or error depending on the transformers version.
    out = qa(question, max_new_tokens=64, do_sample=False)
    return out[0]["generated_text"].strip()


N = min(5, len(questions))  # small sample for a quick baseline
rows, correct = [], 0

for q in questions[:N]:
    got = baseline_answer(q["question"])
    exp = q["answer"]
    # Case-insensitive substring match: the expected answer must appear in the output.
    is_correct = exp.lower() in got.lower()
    rows.append({
        "question": q["question"],
        "expected": exp,
        "got": got,
        "correct": is_correct,
    })
    if is_correct:
        correct += 1

df = pd.DataFrame(rows)
acc = correct / max(1, N)  # max(1, N) guards against an empty dataset
print(df.to_string(index=False))
print(f"\nAccuracy: {correct}/{N} = {acc*100:.1f}%")

## 4) Plot accuracy (simple bar chart)

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(5, 3))
plt.bar(["Baseline"], [acc])
plt.ylim(0, 1)
plt.title("Baseline Accuracy (small sample, CPU)")
# Label the bar with the accuracy as a percentage.
plt.text(0, acc + 0.02, f"{acc*100:.1f}%", ha="center")
plt.show()

### Notes

- This is a small real baseline using an open CPU model (no API).
- For larger/stronger baselines you can switch to `flan-t5-base` or an API model if resources are available.
- Matching is a simple substring check; for stricter evaluation you can add normalization or semantic matching.
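
One possible form of the normalization mentioned in the last note is sketched below. It is an illustration only, not part of the notebook's evaluation loop: the helper names `normalize_answer` and `is_match` are hypothetical, and the choice to strip whitespace and brackets is an assumption that happens to suit the demo answers (for example `F = m a` versus `F=ma`).

```python
import re
import unicodedata


def normalize_answer(text: str) -> str:
    """Lowercase, apply Unicode NFKC folding, and drop whitespace and brackets."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Remove whitespace and bracket characters so "F = m a" and "f=ma" compare equal.
    return re.sub(r"[\s\[\]\(\)]+", "", text)


def is_match(expected: str, got: str) -> bool:
    """Substring match on normalized strings (hypothetical helper)."""
    exp = normalize_answer(expected)
    # An empty expected answer never counts as a match.
    return bool(exp) and exp in normalize_answer(got)
```

To try it, `is_correct = is_match(exp, got)` could replace the substring check in the evaluation loop. This is still a substring test, so an expected `250 J` would also match an output of `1250 J`; numeric answers would need a tolerance-based comparison to rule that out.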