emartai/evalflow
> evalflow

pytest for LLMs

You changed one prompt.
Summarization improved.
Classification silently broke.
Nobody noticed for 4 days.

evalflow catches this in CI before it ships.

Install

pip install evalflow

Quick Start

evalflow init
evalflow eval

What you get on day one:

  • local prompt and dataset files
  • SQLite-backed run history in .evalflow/
  • CI-friendly exit codes
  • offline cache support for repeatable checks
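
As an illustration, a version-controlled prompt entry might look like this. The file name and field names below are hypothetical, shown only to convey the shape of YAML-versioned prompts; run `evalflow init` to see the actual scaffold:

```yaml
# prompts/summarize.yaml — hypothetical example, not the exact scaffold
id: summarize_short_article
version: 2
template: |
  Summarize the following article in two sentences:
  {article}
```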

Example Output

> evalflow eval

Running 5 test cases against gpt-4o-mini...

✓ summarize_short_article    0.91
✓ classify_sentiment         1.00
✓ extract_entities           0.87
✗ answer_with_context        0.61
✓ rewrite_formal             0.93

Quality Gate: FAIL
Failures: 1
Run ID: 20240315-a3f9c2d81b4e

Why evalflow

Traditional unit tests do not tell you when a prompt tweak quietly degrades a task. evalflow gives you a small local quality gate for prompt, model, and dataset changes.

Use it when you need to:

  • catch regressions before merge
  • compare runs locally
  • keep prompt versions in YAML
  • run the same gate in CI and on a laptop

GitHub Actions Workflow

# .github/workflows/evalflow.yml
name: LLM Quality Gate

on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/**"
      - "**.py"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalflow
      - run: evalflow eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Features

  • pytest-style exit codes: 0=pass, 1=fail, 2=error
  • exact match, embedding, consistency, and LLM judge methods
  • baseline snapshots catch regressions, not just low scores
  • prompt registry keeps prompts versioned in YAML
  • works with OpenAI, Anthropic, Groq, Gemini, and Ollama
  • local SQLite storage, no account needed
  • offline cache for repeated and CI-safe checks
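
Because the exit codes mirror pytest, a CI wrapper can branch on them directly. A minimal sketch — the helper functions below are hypothetical, only the 0/1/2 convention comes from evalflow:

```python
import subprocess

# evalflow's pytest-style exit codes mapped to human-readable statuses.
EXIT_STATUS = {0: "pass", 1: "eval failures", 2: "runner error"}

def interpret_exit(code: int) -> str:
    """Translate an evalflow exit code into a status string."""
    return EXIT_STATUS.get(code, "unknown")

def run_gate() -> int:
    """Run `evalflow eval` and report the gate outcome."""
    result = subprocess.run(["evalflow", "eval"])
    print(f"quality gate: {interpret_exit(result.returncode)}")
    return result.returncode

# In a CI script you would exit with the same code, e.g.:
#     sys.exit(run_gate())
```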

Command Surface

evalflow init
evalflow eval
evalflow doctor
evalflow runs
evalflow compare RUN_A RUN_B
evalflow prompt list

Documentation

Security

  • evalflow reads API keys from environment variables, never config files
  • evalflow.yaml stores env var names, not secret values
  • keep .env and .evalflow/ out of git
  • see docs/dev-doc/security.md for the full security model
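
The env-var indirection can be mirrored in your own tooling: the config names the variable, and the secret itself is resolved only at runtime. A sketch — the function and config key below are hypothetical, not part of evalflow's API:

```python
import os

def resolve_api_key(config: dict) -> str:
    """Look up the API key named in config.

    The config holds only the env var *name* (e.g. "OPENAI_API_KEY"),
    never the secret value itself.
    """
    var_name = config["api_key_env"]
    key = os.environ.get(var_name)
    if key is None:
        raise RuntimeError(f"environment variable {var_name} is not set")
    return key

# Demo only: the config says which variable to read, not the key itself.
os.environ["OPENAI_API_KEY"] = "sk-test-placeholder"
print(resolve_api_key({"api_key_env": "OPENAI_API_KEY"}))
```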

Reporting Security Issues

Please do not open public GitHub issues for security vulnerabilities. Open a private GitHub Security Advisory.

Examples

Development

See CONTRIBUTING.md for local setup, tests, smoke checks, and performance baselines.

License

MIT
