pytest-eval

LLM testing for humans.


Bring LLM evaluation into your existing pytest workflow.
No custom test runners. No new concepts. Just pytest.


Install

pip install pytest-eval

Quick Start

# No imports needed. The ai fixture IS the API.

def test_chatbot(ai):
    response = my_chatbot("What is the capital of France?")
    assert ai.similar(response, "Paris is the capital of France")

Run it with plain pytest:

pytest -v

tests/test_chatbot.py::test_chatbot PASSED
    ✓ similar       ███████████████  0.94  ≥0.80

  ──────────────────────────────────────────────────────
  pytest-eval                                     v0.1.0
  ──────────────────────────────────────────────────────
    Test           Result Score                    Cost
  ──────────────────────────────────────────────────────
    test_chatbot     ✓   ██████████████░  0.94      $0
  ──────────────────────────────────────────────────────
    1 tests  │  1 passed  │  $0.0000 total
  ──────────────────────────────────────────────────────

That's it. No LLMTestCase objects, no custom runner, no cloud dashboard.

Why pytest-eval?

                  DeepEval                               pytest-eval
  Basic test      ~15 lines, 4 imports                   ~3 lines, 0 imports
  Test runner     deepeval test run                      pytest
  Metrics         50+ to learn                           ~10 methods on one fixture
  Dependencies    30+ (OpenTelemetry, gRPC, Sentry...)   4 core
  Telemetry       Cloud dashboard by default             None. Fully local.

Methods

  Method                              What it does                           Cost
  ai.similar(a, b, threshold=0.8)     Semantic similarity check              Free (local)
  ai.similarity_score(a, b)           Returns similarity float 0–1           Free (local)
  ai.judge(text, criteria)            LLM evaluates against criteria         $
  ai.grounded(response, context)      RAG faithfulness check                 $
  ai.relevant(response, query)        Answer relevancy                       $
  ai.hallucinated(response, context)  Detect unsupported claims              $
  ai.toxic(text)                      Toxicity detection                     Free
  ai.biased(text)                     Bias detection                         Free
  ai.valid_json(text, schema=None)    JSON validation + Pydantic parsing     Free
  ai.assert_snapshot(value, name)     Regression testing vs saved baseline   Free (local)
  ai.metric(name, text, **kw)         Run a custom registered metric         Varies
  ai.cost                             Cumulative $ for this test
  ai.latency                          Cumulative seconds for this test

Free methods use local models (sentence-transformers) and need no API key. $ methods call an LLM API (OpenAI by default) and require OPENAI_API_KEY.
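
For instance, similarity_score gives you the raw number instead of a pass/fail, and the cost and latency properties accumulate as $ methods run. A quick sketch (my_chatbot stands in for your own function under test, and the numeric bounds are illustrative):

def test_score_and_spend(ai):
    response = my_chatbot("What is the capital of France?")

    # similarity_score returns the raw float rather than asserting
    score = ai.similarity_score(response, "Paris is the capital of France")
    assert score > 0.8

    # judge() calls the configured LLM, so cost and latency start accumulating
    assert ai.judge(response, "Answers the question directly")
    assert ai.cost < 0.01     # cumulative $ spent by this test so far (illustrative cap)
    assert ai.latency > 0.0   # cumulative seconds spent on LLM calls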

Examples

Semantic Similarity (free, local)

def test_capital(ai):
    response = my_chatbot("What is the capital of France?")
    assert ai.similar(response, "Paris is the capital of France")

LLM-as-Judge

def test_tone(ai):
    response = my_chatbot("I want to cancel my subscription")
    assert ai.judge(response, "Response is polite and offers help")

Structured Output

from pydantic import BaseModel

class City(BaseModel):
    name: str
    country: str

def test_structured(ai):
    response = my_llm("Give me Paris info as JSON")
    city = ai.valid_json(response, City)
    assert city.country == "France"

RAG Pipeline

def test_rag(ai):
    query = "What is our refund policy?"
    docs = retriever.get_relevant_docs(query)
    response = generator.generate(query, docs)

    assert ai.grounded(response, docs)
    assert ai.relevant(response, query)
    assert not ai.hallucinated(response, docs)

Snapshot Regression

def test_regression(ai):
    response = my_chatbot("What are your business hours?")
    ai.assert_snapshot(response, name="business_hours", threshold=0.85)

First run saves a baseline; subsequent runs compare against it. When an output change is intentional, update the baselines:

pytest --snapshot-update

Multi-Model Comparison

import pytest

@pytest.mark.parametrize("model", ["gpt-4o", "claude-sonnet-4-20250514", "llama-3.1-8b"])
def test_accuracy(ai, model):
    response = call_llm(model=model, prompt="What is 2+2?")
    assert ai.similar(response, "4")
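
Safety Checks (free)

A minimal sketch of the toxicity and bias detectors (assumes the [safety] extra described under Providers is installed, and that both methods return truthy when a problem is detected, mirroring the hallucinated example above; my_chatbot is your own function):

def test_safety(ai):
    response = my_chatbot("Tell me about your competitors")
    # local detoxify-based checks; no API key required
    assert not ai.toxic(response)
    assert not ai.biased(response)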

Custom Metrics

from pytest_eval import Metric, MetricResult

@Metric.register("brand_voice")
def brand_voice(text: str, **kwargs) -> MetricResult:
    formal = sum(1 for w in ["please", "thank you"] if w in text.lower())
    score = min(formal / 2, 1.0)
    return MetricResult(score=score, passed=score >= kwargs.get("threshold", 0.5))

def test_brand(ai):
    response = my_chatbot("I need help with my order")
    assert ai.metric("brand_voice", response, threshold=0.7)

Configuration

pyproject.toml

[tool.pytest.ini_options]
ai_provider = "openai"
ai_model = "gpt-4o-mini"
ai_embedding_model = "local"
ai_threshold = 0.8
ai_budget = 5.00
ai_snapshot_dir = ".pytest_eval_snapshots"
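
ai_threshold above is the session-wide default for similarity checks; a single assertion can still pass an explicit threshold (the argument listed in the Methods table). A quick sketch, assuming the per-call value applies only to that assertion:

def test_strict_match(ai):
    response = my_chatbot("What is the capital of France?")
    # explicit threshold for this one assertion only
    assert ai.similar(response, "Paris is the capital of France", threshold=0.9)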

Environment Variables

OPENAI_API_KEY=sk-...
PYTEST_EVAL_PROVIDER=openai
PYTEST_EVAL_MODEL=gpt-4o-mini
PYTEST_EVAL_BUDGET=5.00

CLI Options

pytest --ai-provider=openai    # Provider
pytest --ai-model=gpt-4o       # Model
pytest --ai-threshold=0.9      # Similarity threshold
pytest --ai-budget=2.00        # Cap spending per run
pytest --ai-report=report.json # JSON report output
pytest --ai-verbose            # Show scores for passing tests
pytest --snapshot-update       # Update snapshot baselines
pytest -m ai                   # Run only @pytest.mark.ai tests
pytest -m "not cost_high"      # Skip expensive tests

Precedence: CLI > env vars > pyproject.toml > defaults
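
The -m ai selector above expects tests to carry the corresponding marker. A minimal sketch (my_chatbot stands in for your own function under test):

import pytest

@pytest.mark.ai
def test_marked_as_ai(ai):
    response = my_chatbot("Hello")
    # selected by `pytest -m ai`, excluded by `pytest -m "not ai"`
    assert ai.similar(response, "Hi there")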

Providers

pytest-eval supports multiple LLM providers:

pip install 'pytest-eval[openai]'     # OpenAI (default)
pip install 'pytest-eval[anthropic]'  # Anthropic
pip install 'pytest-eval[litellm]'    # 100+ providers via LiteLLM
pip install 'pytest-eval[safety]'     # Toxicity/bias detection (detoxify)
pip install 'pytest-eval[all]'        # Everything

Local embeddings (sentence-transformers) are always included — no API key needed for similar(), similarity_score(), and assert_snapshot().
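
For example, to route the $ methods through Anthropic instead of the default OpenAI (a sketch combining the install extra and CLI flags above; the model name is the one from the multi-model example, and your Anthropic credentials are assumed to be available in the environment):

pip install 'pytest-eval[anthropic]'
pytest --ai-provider=anthropic --ai-model=claude-sonnet-4-20250514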

Rich Failure Messages

Every assertion failure explains what happened:

AssertionError: Semantic similarity below threshold
  actual:     "The capital of France is Lyon"
  expected:   "The capital of France is Paris"
  similarity: 0.72
  threshold:  0.85
  reason:     Texts differ on the key fact (Lyon vs Paris)

TUI Output

pytest-eval renders score bars and a summary table directly in your terminal:

  • Per-test metric detail lines (with -v or --ai-verbose)
  • Session summary table with visual score bars
  • Cost tracking per test and per session

Contributing

See CONTRIBUTING.md.

License

MIT
