A Copier template for an eval harness wrapped around an LLM agent. Pick the framework at scaffold time: DeepEval (local, no signup), Braintrust (managed dashboards), or Langfuse (open-source observability + eval).
- Local-first by default — `eval_framework=deepeval` runs against `tests/eval_cases.yaml` with no external service.
- Same dataset, swappable runners — the YAML case format is identical across frameworks; only the runner changes.
- CI-friendly — `make eval` exits non-zero on regression (see the sketch after this list).
- Provider-agnostic agent — the system under test talks to Anthropic, OpenAI, or Bedrock via env vars.
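The CI gate is just a pass-rate check against `EVAL_PASS_THRESHOLD` (see the env var table below). A minimal sketch of that logic, with illustrative names rather than the template's actual code:

```python
import sys

def gate(passed: int, total: int, threshold: float = 0.8) -> None:
    """Exit non-zero when the pass rate drops below EVAL_PASS_THRESHOLD."""
    pass_rate = passed / total if total else 0.0
    print(f"pass rate: {pass_rate:.0%} (threshold: {threshold:.0%})")
    if pass_rate < threshold:
        sys.exit(1)  # non-zero exit fails the CI job

gate(passed=7, total=10)  # exits 1, since 70% < 80%
```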
- One of DeepEval, Braintrust, Langfuse.
- LLM SDKs: `anthropic`, `openai`, `boto3` (Bedrock).
- pydantic-settings, Typer, Copier, uv, ruff, `pre-commit`, `pytest`.
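With the default `eval_framework=deepeval`, the two LLM-judge metrics map naturally onto DeepEval's built-ins. A minimal sketch, assuming a recent DeepEval release (the generated runner may wire this differently):

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# One test case per YAML entry; the YAML `context` becomes retrieval_context.
case = LLMTestCase(
    input="Tell me about the year 1789 in France",
    actual_output="The French Revolution began in 1789 with the storming of the Bastille.",
    retrieval_context=["The French Revolution began in 1789 with the storming of the Bastille."],
)

evaluate(
    test_cases=[case],
    metrics=[AnswerRelevancyMetric(threshold=0.7), FaithfulnessMetric(threshold=0.7)],
)
```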
```bash
uvx copier copy Template/python_agent_eval_template my_eval \
  --data package_name=my_eval \
  --data project_description="My agent eval" \
  --data github_username=YOU \
  --data eval_framework=deepeval
cd my_eval
cp .env.example .env  # add ANTHROPIC_API_KEY (or OPENAI / AWS creds)
make install
make eval             # run the harness — exits non-zero on regression
```

Eval cases live in `tests/eval_cases.yaml`:

```yaml
- id: greeting
input: "Say hi"
expected_keywords: ["hi", "hello"]
metric: keyword
- id: factual
input: "What is the capital of France?"
expected_keywords: ["Paris"]
metric: keyword
- id: hallucination
input: "Tell me about the year 1789 in France"
context: "The French Revolution began in 1789 with the storming of the Bastille."
metric: faithfulnessThe metric field selects how the case is scored:
- `keyword` — pass if every `expected_keywords` substring appears in the answer (case-insensitive)
- `relevance` — LLM-judge: is the answer relevant to the input?
- `faithfulness` — LLM-judge: is the answer grounded in `context`?
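Concretely, `cases.py` turns each YAML entry into a dataclass, and the `keyword` metric is a plain substring check. A sketch of both, with field names mirroring the YAML keys above (the generated code may differ):

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One entry from tests/eval_cases.yaml."""
    id: str
    input: str
    metric: str = "keyword"
    expected_keywords: list[str] = field(default_factory=list)
    context: str | None = None

def score_keyword(case: EvalCase, answer: str) -> bool:
    """Pass only if every expected keyword appears in the answer, case-insensitively."""
    text = answer.lower()
    return all(kw.lower() in text for kw in case.expected_keywords)
```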
```
your-project/
├── pyproject.toml
├── Makefile
├── .env.example
├── AGENTS.md
├── src/<package>/
│   ├── settings.py
│   ├── cases.py             # YAML → EvalCase dataclasses
│   ├── agent.py             # system under test
│   ├── runner.py            # load → call → score → report
│   ├── eval/__init__.py     # framework factory
│   ├── eval/<framework>.py  # the chosen runner
│   └── cli.py               # typer: eval run, eval list
├── tests/
│   ├── eval_cases.yaml      # the dataset
│   ├── conftest.py
│   ├── test_cases.py
│   └── test_runner.py
├── docker/
├── docs/
└── scripts/
```
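`eval/__init__.py` is the swap point between frameworks. A plausible shape for that factory, assuming each runner module exposes a `run(cases)` entry point (names are illustrative, not the generated code):

```python
# src/<package>/eval/__init__.py -- illustrative sketch
import importlib
from collections.abc import Callable

SUPPORTED = {"deepeval", "braintrust", "langfuse"}

def get_runner(framework: str) -> Callable:
    """Import eval/<framework>.py and return its run() entry point."""
    if framework not in SUPPORTED:
        raise ValueError(f"unknown eval_framework: {framework!r}")
    module = importlib.import_module(f".{framework}", package=__package__)
    return module.run
```

Because only the chosen framework's module is generated at scaffold time, only one branch ever resolves; the check just gives a clear error if the setting and the scaffold disagree.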
| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | (chosen at scaffold) | `anthropic`, `openai`, or `bedrock` |
| `ANTHROPIC_API_KEY` | | Required when `LLM_PROVIDER=anthropic` |
| `OPENAI_API_KEY` | | Required when `LLM_PROVIDER=openai` |
| `MODEL_ID` | `claude-sonnet-4-6` | Anthropic / Bedrock model ID |
| `OPENAI_MODEL_ID` | `gpt-4o-mini` | OpenAI model ID |
| `BRAINTRUST_API_KEY` | | Required when `eval_framework=braintrust` |
| `BRAINTRUST_PROJECT` | `my-eval` | Braintrust project name |
| `LANGFUSE_PUBLIC_KEY` | | Required when `eval_framework=langfuse` |
| `LANGFUSE_SECRET_KEY` | | Required when `eval_framework=langfuse` |
| `LANGFUSE_HOST` | `https://cloud.langfuse.com` | Langfuse server URL |
| `EVAL_PASS_THRESHOLD` | `0.8` | Minimum pass rate before `make eval` exits non-zero |
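Since the template uses pydantic-settings, these variables presumably surface in `settings.py` along these lines (a sketch; field names are inferred from the table above, not copied from the generated project):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    """Reads .env plus process env vars; names match the table above."""
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    llm_provider: str = "anthropic"       # anthropic | openai | bedrock
    anthropic_api_key: str | None = None  # required only for the chosen provider
    openai_api_key: str | None = None
    model_id: str = "claude-sonnet-4-6"   # Anthropic / Bedrock model ID
    openai_model_id: str = "gpt-4o-mini"
    eval_pass_threshold: float = 0.8      # gate for make eval

settings = Settings()
```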
| Command | Description |
|---|---|
| `make install` | Install dependencies and pre-commit hooks |
| `make eval` | Run the eval harness |
| `make test` | Run pytest |
| `make lint` | Lint with Ruff |
| `make typecheck` | Type-check with ty |
| `make format` | Format with Ruff |
| `make docs` | Serve the docs with MkDocs |
| `make clean` | Clean caches |
To pull template updates into an existing project later:

```bash
uvx copier update --defaults
```