A Copier template for an eval harness wrapped around an LLM agent. Pick the framework at scaffold time: DeepEval (local, no signup), Braintrust (managed dashboards), or Langfuse (open-source observability + eval).
- Local-first by default — `eval_framework=deepeval` runs against `tests/eval_cases.yaml` with no external service.
- Same dataset, swappable runners — the YAML case format is identical across frameworks; only the runner changes.
- CI-friendly — `make eval` exits non-zero on regression (see the sketch after this list).
- Provider-agnostic agent — the system under test talks to Anthropic, OpenAI, or Bedrock via env vars.
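The CI gate is just a pass-rate check against `EVAL_PASS_THRESHOLD` (see the env var table below). A minimal sketch of that logic, with illustrative names rather than the template's actual code:

```python
import sys

def gate(passed: int, total: int, threshold: float = 0.8) -> None:
    """Exit non-zero when the pass rate drops below EVAL_PASS_THRESHOLD."""
    pass_rate = passed / total if total else 0.0
    print(f"pass rate: {pass_rate:.0%} (threshold: {threshold:.0%})")
    if pass_rate < threshold:
        sys.exit(1)  # non-zero exit fails the CI job

gate(passed=7, total=10)  # exits 1, since 70% < 80%
```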
- One of DeepEval, Braintrust, Langfuse.
- LLM SDKs: `anthropic`, `openai`, `boto3` (Bedrock).
- pydantic-settings, Typer, Copier, uv, ruff, `pre-commit`, `pytest`.
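With the default `eval_framework=deepeval`, the two LLM-judge metrics map naturally onto DeepEval's built-ins. A minimal sketch, assuming a recent DeepEval release (the generated runner may wire this differently):

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# One test case per YAML entry; the YAML `context` becomes retrieval_context.
case = LLMTestCase(
    input="Tell me about the year 1789 in France",
    actual_output="The French Revolution began in 1789 with the storming of the Bastille.",
    retrieval_context=["The French Revolution began in 1789 with the storming of the Bastille."],
)

evaluate(
    test_cases=[case],
    metrics=[AnswerRelevancyMetric(threshold=0.7), FaithfulnessMetric(threshold=0.7)],
)
```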
```bash
uvx copier copy Template/python_agent_eval_template my_eval \
  --data package_name=my_eval \
  --data project_description="My agent eval" \
  --data github_username=YOU \
  --data eval_framework=deepeval
cd my_eval
cp .env.example .env  # add ANTHROPIC_API_KEY (or OPENAI / AWS creds)
make install
make eval             # run the harness — exits non-zero on regression
```

Eval cases live in `tests/eval_cases.yaml`:

```yaml
- id: greeting
input: "Say hi"
expected_keywords: ["hi", "hello"]
metric: keyword
- id: factual
input: "What is the capital of France?"
expected_keywords: ["Paris"]
metric: keyword
- id: hallucination
input: "Tell me about the year 1789 in France"
context: "The French Revolution began in 1789 with the storming of the Bastille."
metric: faithfulnessThe metric field selects how the case is scored:
- `keyword` — pass if every `expected_keywords` substring appears in the answer (case-insensitive)
- `relevance` — LLM-judge: is the answer relevant to the input?
- `faithfulness` — LLM-judge: is the answer grounded in `context`?
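Concretely, `cases.py` turns each YAML entry into a dataclass, and the `keyword` metric is a plain substring check. A sketch of both, with field names mirroring the YAML keys above (the generated code may differ):

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One entry from tests/eval_cases.yaml."""
    id: str
    input: str
    metric: str = "keyword"
    expected_keywords: list[str] = field(default_factory=list)
    context: str | None = None

def score_keyword(case: EvalCase, answer: str) -> bool:
    """Pass only if every expected keyword appears in the answer, case-insensitively."""
    text = answer.lower()
    return all(kw.lower() in text for kw in case.expected_keywords)
```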
```
your-project/
├── pyproject.toml
├── Makefile
├── .env.example
├── AGENTS.md
├── src/<package>/
│   ├── settings.py
│   ├── cases.py             # YAML → EvalCase dataclasses
│   ├── agent.py             # system under test
│   ├── runner.py            # load → call → score → report
│   ├── eval/__init__.py     # framework factory
│   ├── eval/<framework>.py  # the chosen runner
│   └── cli.py               # typer: eval run, eval list
├── tests/
│   ├── eval_cases.yaml      # the dataset
│   ├── conftest.py
│   ├── test_cases.py
│   └── test_runner.py
├── docker/
├── docs/
└── scripts/
```
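`eval/__init__.py` is the swap point between frameworks. A plausible shape for that factory, assuming each runner module exposes a `run(cases)` entry point (names are illustrative, not the generated code):

```python
# src/<package>/eval/__init__.py -- illustrative sketch
import importlib
from collections.abc import Callable

SUPPORTED = {"deepeval", "braintrust", "langfuse"}

def get_runner(framework: str) -> Callable:
    """Import eval/<framework>.py and return its run() entry point."""
    if framework not in SUPPORTED:
        raise ValueError(f"unknown eval_framework: {framework!r}")
    module = importlib.import_module(f".{framework}", package=__package__)
    return module.run
```

Because only the chosen framework's module is generated at scaffold time, only one branch ever resolves; the check just gives a clear error if the setting and the scaffold disagree.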
| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | (chosen at scaffold) | `anthropic`, `openai`, or `bedrock` |
| `ANTHROPIC_API_KEY` | | Required when `LLM_PROVIDER=anthropic` |
| `OPENAI_API_KEY` | | Required when `LLM_PROVIDER=openai` |
| `MODEL_ID` | `claude-sonnet-4-6` | Anthropic / Bedrock model ID |
| `OPENAI_MODEL_ID` | `gpt-4o-mini` | OpenAI model ID |
| `BRAINTRUST_API_KEY` | | Required when `eval_framework=braintrust` |
| `BRAINTRUST_PROJECT` | `my-eval` | Braintrust project name |
| `LANGFUSE_PUBLIC_KEY` | | Required when `eval_framework=langfuse` |
| `LANGFUSE_SECRET_KEY` | | Required when `eval_framework=langfuse` |
| `LANGFUSE_HOST` | `https://cloud.langfuse.com` | Langfuse server URL |
| `EVAL_PASS_THRESHOLD` | `0.8` | Minimum pass rate before `make eval` exits non-zero |
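Since the template uses pydantic-settings, these variables presumably surface in `settings.py` along these lines (a sketch; field names are inferred from the table above, not copied from the generated project):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    """Reads .env plus process env vars; names match the table above."""
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    llm_provider: str = "anthropic"       # anthropic | openai | bedrock
    anthropic_api_key: str | None = None  # required only for the chosen provider
    openai_api_key: str | None = None
    model_id: str = "claude-sonnet-4-6"   # Anthropic / Bedrock model ID
    openai_model_id: str = "gpt-4o-mini"
    eval_pass_threshold: float = 0.8      # gate for make eval

settings = Settings()
```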
| Command | Description |
|---|---|
| `make install` | Install dependencies and pre-commit hooks |
| `make eval` | Run the eval harness |
| `make test` | Run pytest |
| `make lint` | Lint with Ruff |
| `make typecheck` | Type-check with ty |
| `make format` | Format with Ruff |
| `make docs` | Serve the docs with MkDocs |
| `make clean` | Clean caches |
To pull template updates into an existing project later:

```bash
uvx copier update --defaults
```