Python Agent Eval Template (python_agent_eval_template)

A Copier template for an eval harness wrapped around an LLM agent. Pick the framework at scaffold time: DeepEval (local, no signup), Braintrust (managed dashboards), or Langfuse (open-source observability + eval).

Why this template

  • Local-first by default — eval_framework=deepeval runs against tests/eval_cases.yaml with no external service.
  • Same dataset, swappable runners — the YAML case format is identical across frameworks; only the runner changes (see the interface sketch after this list).
  • CI-friendly — make eval exits non-zero on regression.
  • Provider-agnostic agent — the system under test calls Anthropic, OpenAI, or Bedrock, selected via environment variables.
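"Swappable" here means each eval/<framework>.py module exposes the same entry point over the same parsed cases. A minimal sketch of that shared interface (the names below are assumptions for illustration, not the template's exact code):

from typing import Protocol


class EvalRunner(Protocol):
    # "EvalCase" is the dataclass produced by cases.py (see the project
    # structure below); a string annotation avoids importing it here.
    def run(self, cases: list["EvalCase"]) -> float:
        """Score every case and return the overall pass rate (0.0-1.0)."""
        ...

Because every runner satisfies the same shape, the dataset, CLI, and Makefile targets stay identical no matter which framework was chosen at scaffold time.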


Usage

uvx copier copy Template/python_agent_eval_template my_eval \
  --data package_name=my_eval \
  --data project_description="My agent eval" \
  --data github_username=YOU \
  --data eval_framework=deepeval

Quick start

cd my_eval
cp .env.example .env       # add ANTHROPIC_API_KEY (or OPENAI / AWS creds)
make install
make eval                  # run the harness — exits non-zero on regression

Eval case format (tests/eval_cases.yaml)

- id: greeting
  input: "Say hi"
  expected_keywords: ["hi", "hello"]
  metric: keyword
- id: factual
  input: "What is the capital of France?"
  expected_keywords: ["Paris"]
  metric: keyword
- id: hallucination
  input: "Tell me about the year 1789 in France"
  context: "The French Revolution began in 1789 with the storming of the Bastille."
  metric: faithfulness

The metric field selects how the case is scored:

  • keyword — pass if every expected_keywords substring appears in the answer (case-insensitive)
  • relevance — LLM-judge: is the answer relevant to the input?
  • faithfulness — LLM-judge: is the answer grounded in context?
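As an illustration, the keyword metric reduces to a case-insensitive substring check. A minimal sketch (the function name is assumed; the two LLM-judge metrics delegate to the chosen framework instead):

def score_keyword(answer: str, expected_keywords: list[str]) -> bool:
    """Pass only if every expected keyword appears in the answer."""
    haystack = answer.lower()
    return all(kw.lower() in haystack for kw in expected_keywords)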

Generated project structure

your-project/
├── pyproject.toml
├── Makefile
├── .env.example
├── AGENTS.md
├── src/<package>/
│   ├── settings.py
│   ├── cases.py            # YAML → EvalCase dataclasses
│   ├── agent.py            # system under test
│   ├── runner.py           # load → call → score → report
│   ├── eval/__init__.py    # framework factory
│   ├── eval/<framework>.py # the chosen runner
│   └── cli.py              # typer: eval run, eval list
├── tests/
│   ├── eval_cases.yaml     # the dataset
│   ├── conftest.py
│   ├── test_cases.py
│   └── test_runner.py
├── docker/
├── docs/
└── scripts/
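For orientation, cases.py turns each YAML entry into a dataclass. A rough sketch under assumed names (the field names follow the case format above; the loader details are a guess):

from dataclasses import dataclass, field
from pathlib import Path

import yaml  # PyYAML


@dataclass
class EvalCase:
    id: str
    input: str
    metric: str = "keyword"
    expected_keywords: list[str] = field(default_factory=list)
    context: str | None = None


def load_cases(path: Path = Path("tests/eval_cases.yaml")) -> list[EvalCase]:
    # Each list item in the YAML file becomes one EvalCase.
    return [EvalCase(**entry) for entry in yaml.safe_load(path.read_text())]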

Environment variables

Variable              Default                      Description
LLM_PROVIDER          (chosen at scaffold)         anthropic, openai, or bedrock
ANTHROPIC_API_KEY     -                            Required when LLM_PROVIDER=anthropic
OPENAI_API_KEY        -                            Required when LLM_PROVIDER=openai
MODEL_ID              claude-sonnet-4-6            Anthropic / Bedrock model ID
OPENAI_MODEL_ID       gpt-4o-mini                  OpenAI model ID
BRAINTRUST_API_KEY    -                            Required when eval_framework=braintrust
BRAINTRUST_PROJECT    my-eval                      Braintrust project name
LANGFUSE_PUBLIC_KEY   -                            Required when eval_framework=langfuse
LANGFUSE_SECRET_KEY   -                            Required when eval_framework=langfuse
LANGFUSE_HOST         https://cloud.langfuse.com   Langfuse server URL
EVAL_PASS_THRESHOLD   0.8                          Minimum pass rate; below it, make eval exits non-zero
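settings.py would centralize these reads; a simple sketch using os.environ (the generated project may well use pydantic-settings or similar instead):

import os

# Hypothetical helpers; the defaults mirror the table above.
def _env(name: str, default: str) -> str:
    return os.environ.get(name, default)

LLM_PROVIDER = _env("LLM_PROVIDER", "anthropic")
MODEL_ID = _env("MODEL_ID", "claude-sonnet-4-6")
OPENAI_MODEL_ID = _env("OPENAI_MODEL_ID", "gpt-4o-mini")
EVAL_PASS_THRESHOLD = float(_env("EVAL_PASS_THRESHOLD", "0.8"))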

Makefile commands

Command         Description
make install    Install dependencies and pre-commit hooks
make eval       Run the eval harness
make test       Run the pytest suite
make lint       Lint with Ruff
make typecheck  Type-check with ty
make format     Format with Ruff
make docs       Serve the docs with MkDocs
make clean      Remove caches
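The regression gate behind make eval is an ordinary exit-code check. Conceptually (a sketch, not the template's actual runner):

import sys


def report(passed: int, total: int, threshold: float = 0.8) -> None:
    """Print the pass rate and exit non-zero when it falls below the
    threshold; the non-zero exit is what fails the CI job."""
    rate = passed / total if total else 0.0
    print(f"{passed}/{total} cases passed ({rate:.0%}, threshold {threshold:.0%})")
    sys.exit(0 if rate >= threshold else 1)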

Update an existing project

uvx copier update --defaults
