reval correlates your Langfuse eval sessions with your git history and uses a multi-agent LLM pipeline to pinpoint which code changes caused which metric regressions. It produces a report with explanations, evidence, and suggested fixes.
From PyPI:

```bash
pip install reval-cli
```

From source:

```bash
git clone https://github.com/calebevans/reval.git
cd reval
pip install .
```

For development (includes pytest, mypy, ruff, pre-commit):

```bash
pip install ".[dev]"
```

Requires Python 3.10+.
- Generate a starter config:

  ```bash
  reval init
  ```

- Set your Langfuse credentials (or add them to `reval.yaml`):

  ```bash
  export LANGFUSE_BASE_URL="https://cloud.langfuse.com"
  export LANGFUSE_PUBLIC_KEY="pk-..."
  export LANGFUSE_SECRET_KEY="sk-..."
  ```

- Run an analysis against a Langfuse eval session:

  ```bash
  reval analyze --eval-results <session-id>
  ```

- To compare two sessions (current vs. baseline) and correlate regressions with code changes:

  ```bash
  reval analyze \
    --eval-results <current-session-id> \
    --eval-baseline <baseline-session-id> \
    --base main
  ```

reval is configured through a `reval.yaml` file in your project root. Every
field has a sensible default, so the file is optional for simple use cases.
```yaml
langfuse:
  api_url: https://cloud.langfuse.com
  public_key: pk-...
  secret_key: sk-...
  project_id: ""           # auto-detected if omitted
  current_session_id: ""   # or use --eval-results
  baseline_session_id: ""  # or use --eval-baseline
  publish: false           # post results back to Langfuse

metrics:
  - name: answer_relevancy
    threshold: 0.05        # flag if score drops by more than this
  - name: faithfulness
    threshold: 0.05

relevance:
  include_patterns: []     # empty = include all non-ignored files
  ignore_patterns:
    - "**/tests/**"
    - "**/__pycache__/**"
    - "*.md"
    - "*.lock"
  category_mappings:
    prompt:
      - "**/prompts/**"
      - "**/*.prompt"
    model_config:
      - "**/config/model*"
      - "**/*llm_config*"
    retrieval:
      - "**/retrieval/**"
      - "**/rag/**"
    tool_definition:
      - "**/tools/**"
      - "**/functions/**"
    output_parsing:
      - "**/parsers/**"
      - "**/schema*"
    eval_config:
      - "**/eval*"

llm:
  model: openai/gpt-4o     # any LiteLLM model identifier
  temperature: 0.2
  max_tokens: 4096
  context_window: null     # override the model's default context window
  diff_model: null         # use a different model for diff analysis
  eval_model: null         # use a different model for eval analysis
  synthesis_model: null    # use a different model for synthesis

git:
  base: HEAD               # base commit ref
  head: working            # "working" = uncommitted changes
```

`langfuse` - Connection settings for your Langfuse instance. Credentials can
also be set through environment variables (see below). Set `publish: true` to
write analysis results back to Langfuse as comments.
`metrics` - List of metric names and their regression thresholds. A metric is
flagged as regressed when `current_score - baseline_score` falls below
`-threshold`. The threshold defaults to 0.05 if not specified.
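As a worked example with the default threshold (the scores below are illustrative, not from a real run):

```yaml
metrics:
  - name: answer_relevancy
    threshold: 0.05
# current 0.75 vs. baseline 0.82: 0.75 - 0.82 = -0.07, below -0.05 -> flagged
# current 0.78 vs. baseline 0.82: 0.78 - 0.82 = -0.04, above -0.05 -> not flagged
```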
`relevance` - Controls which files from the git diff are included in analysis.
Files matching `ignore_patterns` are excluded. If `include_patterns` is
non-empty, only files matching at least one include pattern (and no ignore
pattern) are kept. The `category_mappings` section maps glob patterns to
semantic categories (`prompt`, `model_config`, `retrieval`, etc.) so the
analysis agents understand the role of each changed file.
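For instance, to restrict analysis to an application package while still excluding tests (the paths below are illustrative, not defaults):

```yaml
relevance:
  include_patterns:
    - "src/**"            # only consider files under src/
  ignore_patterns:
    - "**/tests/**"       # still excluded, even inside src/
  category_mappings:
    prompt:
      - "src/app/prompts/**"   # e.g. src/app/prompts/system.prompt -> prompt
```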
`llm` - Model configuration. The `model` field accepts any
LiteLLM model identifier (e.g.
`openai/gpt-4o`, `anthropic/claude-sonnet-4-20250514`, `vertex_ai/gemini-2.0-flash`).
You can assign a different model to each analysis agent using `diff_model`,
`eval_model`, and `synthesis_model`.
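For example, a split configuration might route the diff pass to a cheaper model while keeping a stronger one for the final report (the model choices below are illustrative):

```yaml
llm:
  model: anthropic/claude-sonnet-4-20250514  # default for all agents
  diff_model: openai/gpt-4o-mini             # cheaper model for diff hypotheses
  synthesis_model: openai/gpt-4o             # stronger model for the final report
```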
`git` - The commit refs to diff. Set `head` to `working` to diff uncommitted
changes against `base`, or set both to commit SHAs or branch names.
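For example, to diff a feature branch against main instead of uncommitted changes (the refs below are illustrative):

```yaml
git:
  base: origin/main            # baseline commit ref
  head: feature/new-retriever  # any branch name or commit SHA
```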
Langfuse credentials can be provided through environment variables instead of
(or in addition to) `reval.yaml`. An environment variable is used whenever the
corresponding config field is left empty.
| Variable | Config equivalent | Description |
|---|---|---|
| `LANGFUSE_BASE_URL` | `langfuse.api_url` | Langfuse API URL |
| `LANGFUSE_PUBLIC_KEY` | `langfuse.public_key` | Langfuse public key |
| `LANGFUSE_SECRET_KEY` | `langfuse.secret_key` | Langfuse secret key |
| `LANGFUSE_PROJECT_ID` | `langfuse.project_id` | Langfuse project ID (auto-detected if omitted) |
Generate a starter `reval.yaml` with interactive prompts.

```bash
reval init [--output PATH]
```

| Option | Default | Description |
|---|---|---|
| `--output` | `reval.yaml` | Path for the generated config file |
Run the analysis pipeline. This is the main command.

```bash
reval analyze [OPTIONS]
```

| Option | Default | Description |
|---|---|---|
| `--eval-results` | | Langfuse session ID for the current eval run (required) |
| `--eval-baseline` | | Langfuse session ID for the baseline run (omit for single-session mode) |
| `--base` | From config or `HEAD` | Base commit ref |
| `--head` | From config or `working` | Head ref (`working` for uncommitted changes) |
| `--config` | `reval.yaml` | Path to config file |
| `--output` | `terminal` | Output format: `terminal`, `json`, or `markdown` |
| `--output-file` | | Write the report to a file instead of stdout |
| `--threshold` | `0.05` | Global regression threshold (overrides per-metric config) |
| `--model` | From config | LLM model to use (overrides config) |
| `--publish` / `--no-publish` | From config | Publish results back to Langfuse |
| `--verbose` | `false` | Show debug information |
Re-render a previously saved JSON report in a different format.

```bash
reval report REPORT_FILE [OPTIONS]
```

| Option | Default | Description |
|---|---|---|
| `--output` | `terminal` | Output format: `terminal`, `json`, or `markdown` |
| `--output-file` | | Write the report to a file instead of stdout |
Example: save a JSON report, then render it as markdown later:

```bash
reval analyze --eval-results sess-123 --output json --output-file report.json
reval report report.json --output markdown
```

Comparison mode is activated when you provide both `--eval-results` and
`--eval-baseline`. reval fetches both sessions from Langfuse, diffs the git
history between `--base` and `--head`, and runs three agents:
- Diff agent examines code changes in isolation and forms hypotheses about their potential eval impact.
- Eval agent investigates each regressed test case by comparing outputs, scores, and evaluator reasoning between current and baseline runs.
- Synthesis agent correlates the diff and eval findings into a final report with explanations and suggested fixes.
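For example, to correlate two past eval runs with the commits that landed between them (session IDs and SHAs below are illustrative):

```bash
reval analyze \
  --eval-results run-after \
  --eval-baseline run-before \
  --base 1a2b3c4 \
  --head 5d6e7f8
```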
Single-session mode is activated when you omit `--eval-baseline`. reval
analyzes a single eval session without a baseline comparison. It loads source
files matching your relevance patterns, runs the eval agent on any test cases
that fall below the threshold, and produces findings about what may be going
wrong.
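For example, to analyze a single session and save the findings as a markdown file:

```bash
reval analyze --eval-results sess-123 --output markdown --output-file findings.md
```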
| Format | Flag | Description |
|---|---|---|
| Terminal | `--output terminal` | Rich tables and panels with color-coded diffs (default) |
| JSON | `--output json` | Machine-readable output; can be re-rendered with `reval report` |
| Markdown | `--output markdown` | Tables and fenced diff blocks, suitable for PRs or documentation |
All formats can be written to a file with `--output-file PATH`.
When `--publish` is passed (or `langfuse.publish` is set to `true` in config),
reval posts its analysis results back to Langfuse:
- A session comment with the full markdown report is added to the current session.
- A trace comment with relevant findings is added to each failed trace.
This makes it easy to review reval's analysis directly in the Langfuse UI alongside your eval results.
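For example, to run a comparison and post the report back to Langfuse (session IDs below are illustrative):

```bash
reval analyze \
  --eval-results sess-123 \
  --eval-baseline sess-100 \
  --publish
```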