LLM testing for humans.
Bring LLM evaluation into your existing pytest workflow.
No custom test runners. No new concepts. Just pytest.
```bash
pip install pytest-eval
```

```python
# No imports needed. The ai fixture IS the API.
def test_chatbot(ai):
    response = my_chatbot("What is the capital of France?")
    assert ai.similar(response, "Paris is the capital of France")
```

```bash
pytest -v
```

```
tests/test_chatbot.py::test_chatbot PASSED
  ✓ similar ███████████████ 0.94 ≥0.80

──────────────────────────────────────────────────────
 pytest-eval v0.1.0
──────────────────────────────────────────────────────
 Test           Result   Score                  Cost
──────────────────────────────────────────────────────
 test_chatbot   ✓        ██████████████░ 0.94   $0
──────────────────────────────────────────────────────
 1 tests │ 1 passed │ $0.0000 total
──────────────────────────────────────────────────────
```
That's it. No LLMTestCase objects, no custom runner, no cloud dashboard.
| | DeepEval | pytest-eval |
|---|---|---|
| Basic test | ~15 lines, 4 imports | ~3 lines, 0 imports |
| Test runner | `deepeval test run` | `pytest` |
| Metrics | 50+ to learn | ~10 methods on one fixture |
| Dependencies | 30+ (OpenTelemetry, gRPC, Sentry...) | 4 core |
| Telemetry | Cloud dashboard by default | None. Fully local. |
| Method | What it does | Cost |
|---|---|---|
| `ai.similar(a, b, threshold=0.8)` | Semantic similarity check | Free (local) |
| `ai.similarity_score(a, b)` | Returns similarity float 0–1 | Free (local) |
| `ai.judge(text, criteria)` | LLM evaluates against criteria | $ |
| `ai.grounded(response, context)` | RAG faithfulness check | $ |
| `ai.relevant(response, query)` | Answer relevancy | $ |
| `ai.hallucinated(response, context)` | Detect unsupported claims | $ |
| `ai.toxic(text)` | Toxicity detection | Free |
| `ai.biased(text)` | Bias detection | Free |
| `ai.valid_json(text, schema=None)` | JSON validation + Pydantic parsing | Free |
| `ai.assert_snapshot(value, name)` | Regression testing vs saved baseline | Free (local) |
| `ai.metric(name, text, **kw)` | Run a custom registered metric | Varies |
| `ai.cost` | Cumulative $ for this test | — |
| `ai.latency` | Cumulative seconds for this test | — |
Free methods use local models (sentence-transformers); no API key needed. $ methods call an LLM API (OpenAI by default) and require `OPENAI_API_KEY`.
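The fixture also tracks spend and time as it goes. Below is a minimal sketch using the `ai.cost` and `ai.latency` counters from the table above; `my_chatbot` is a placeholder for your own code and the limits are illustrative, not recommended values:

```python
def test_stays_cheap_and_fast(ai):
    response = my_chatbot("Summarize our refund policy in one sentence.")

    # $ method: calls the configured LLM provider
    assert ai.judge(response, "Mentions refunds and is a single sentence")

    # Cumulative counters for this test so far (illustrative limits)
    assert ai.cost < 0.01      # dollars spent by this test
    assert ai.latency < 10.0   # seconds of evaluation time
```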
```python
def test_capital(ai):
    response = my_chatbot("What is the capital of France?")
    assert ai.similar(response, "Paris is the capital of France")
```

```python
def test_tone(ai):
    response = my_chatbot("I want to cancel my subscription")
    assert ai.judge(response, "Response is polite and offers help")
```

```python
from pydantic import BaseModel

class City(BaseModel):
    name: str
    country: str

def test_structured(ai):
    response = my_llm("Give me Paris info as JSON")
    city = ai.valid_json(response, City)
    assert city.country == "France"
```

```python
def test_rag(ai):
    query = "What is our refund policy?"
    docs = retriever.get_relevant_docs(query)
    response = generator.generate(query, docs)
    assert ai.grounded(response, docs)
    assert ai.relevant(response, query)
    assert not ai.hallucinated(response, docs)
```

```python
def test_regression(ai):
    response = my_chatbot("What are your business hours?")
    ai.assert_snapshot(response, name="business_hours", threshold=0.85)
```

```bash
# First run saves baseline. Next runs compare.
# Update baselines when intentional changes are made:
pytest --snapshot-update
```

```python
import pytest

@pytest.mark.parametrize("model", ["gpt-4o", "claude-sonnet-4-20250514", "llama-3.1-8b"])
def test_accuracy(ai, model):
    response = call_llm(model=model, prompt="What is 2+2?")
    assert ai.similar(response, "4")
```

```python
from pytest_eval import Metric, MetricResult

@Metric.register("brand_voice")
def brand_voice(text: str, **kwargs) -> MetricResult:
    formal = sum(1 for w in ["please", "thank you"] if w in text.lower())
    score = min(formal / 2, 1.0)
    return MetricResult(score=score, passed=score >= kwargs.get("threshold", 0.5))

def test_brand(ai):
    response = my_chatbot("I want to cancel my subscription")  # example input to score
    assert ai.metric("brand_voice", response, threshold=0.7)
```

```toml
[tool.pytest.ini_options]
ai_provider = "openai"
ai_model = "gpt-4o-mini"
ai_embedding_model = "local"
ai_threshold = 0.8
ai_budget = 5.00
ai_snapshot_dir = ".pytest_eval_snapshots"
```

```bash
OPENAI_API_KEY=sk-...
PYTEST_EVAL_PROVIDER=openai
PYTEST_EVAL_MODEL=gpt-4o-mini
PYTEST_EVAL_BUDGET=5.00
```

```bash
pytest --ai-provider=openai     # Provider
pytest --ai-model=gpt-4o        # Model
pytest --ai-threshold=0.9       # Similarity threshold
pytest --ai-budget=2.00         # Cap spending per run
pytest --ai-report=report.json  # JSON report output
pytest --ai-verbose             # Show scores for passing tests
pytest --snapshot-update        # Update snapshot baselines
pytest -m ai                    # Run only @pytest.mark.ai tests
pytest -m "not cost_high"       # Skip expensive tests
```

Precedence: CLI > env vars > pyproject.toml > defaults
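The `ai` and `cost_high` markers used by those `-m` selections are applied in your test files; a hedged sketch, assuming they behave like standard pytest markers (prompt and criteria are illustrative):

```python
import pytest

@pytest.mark.ai          # selected by:  pytest -m ai
@pytest.mark.cost_high   # excluded by:  pytest -m "not cost_high"
def test_detailed_policy_answer(ai):
    response = my_chatbot("Explain our refund policy in detail")
    assert ai.judge(response, "Covers timelines, exceptions, and how to reach support")
```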
pytest-eval supports multiple LLM providers:

```bash
pip install 'pytest-eval[openai]'     # OpenAI (default)
pip install 'pytest-eval[anthropic]'  # Anthropic
pip install 'pytest-eval[litellm]'    # 100+ providers via LiteLLM
pip install 'pytest-eval[safety]'     # Toxicity/bias detection (detoxify)
pip install 'pytest-eval[all]'        # Everything
```

Local embeddings (sentence-transformers) are always included — no API key needed for `similar()`, `similarity_score()`, and `assert_snapshot()`.
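With the `safety` extra installed, the free `ai.toxic()` and `ai.biased()` checks from the table above run locally. A small sketch; `my_chatbot` and the prompt are placeholders:

```python
def test_safety(ai):
    response = my_chatbot("Tell me about your competitor's product")

    # Free local checks (require the [safety] extra)
    assert not ai.toxic(response)
    assert not ai.biased(response)
```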
Every assertion failure explains what happened:

```
AssertionError: Semantic similarity below threshold
  actual:     "The capital of France is Lyon"
  expected:   "The capital of France is Paris"
  similarity: 0.72
  threshold:  0.85
  reason:     Texts differ on the key fact (Lyon vs Paris)
```
pytest-eval renders score bars and a summary table directly in your terminal:

- Per-test metric detail lines (with `-v` or `--ai-verbose`)
- Session summary table with visual score bars
- Cost tracking per test and per session
See CONTRIBUTING.md.
MIT