Self-improving evaluation loop for LLM workflows.
Full-screen terminal UI. Run workflows, analyze failures, auto-generate rules, regression test. One tool, one loop.
Failures are analyzed, rules are auto-generated, and applied on the next run. No data sent to any AI provider. Fully self-hosted.
run -> analyze -> test -> run (improved) -> ...
npm install -g eddgateRequirements: Node.js 20+, Claude CLI (any subscription) or ANTHROPIC_API_KEY
eddgateThat's it. A full-screen terminal UI launches. Select Run, Analyze, or Test from the menu.
Everything happens inside the TUI: workflow execution with live dashboard, failure analysis, regression testing, MCP server management, plugin import, language switching.
eddgate is for any multi-step AI task where quality matters. If you're copy-pasting into ChatGPT and hoping for the best, eddgate replaces that with a repeatable, verifiable pipeline.
| I want to... | Workflow to use | What I feed in | What I get out |
|---|---|---|---|
| Summarize a long report with proper citations | document-pipeline |
report.pdf or report.md |
Structured summary with [1] citations and reference list |
| Translate technical docs to Korean and verify accuracy | translation |
docs/api-guide.md (English) |
Korean translation + accuracy verification score |
| Review a pull request diff for bugs | code-review |
git diff > changes.txt |
Issue list with severity, line references, fix suggestions |
| Debug a production error | bug-fix |
Error log pasted as text | Root cause analysis + fix proposal + verification |
| Answer questions from company documents | rag-pipeline |
Question text (docs pre-indexed) | Grounded answer with source citations, hallucination score |
| Process customer complaints into structured data | Custom workflow | complaints.csv or email text |
JSON with categories, urgency, recommended actions |
eddgate # launch TUI
# 1. Select "Run" from menu
# 2. Pick "document-pipeline" (or any workflow)
# 3. Choose "Select file" -> pick any .txt or .md file
# 4. Model: Sonnet (default, good enough for most tasks)
# 5. Choose "Dry run" first to see the structure without spending
# 6. When ready, choose "Start run" and watch the live dashboardAfter the run, check the results panel. If quality looks good, enable "Save JSONL trace" on the next run -- that trace file is what powers Analyze, Test, and Monitor.
Day 1: Run with trace enabled -> get baseline results
Day 2: Analyze failures -> auto-generate rules
Day 3: Run again -> rules auto-applied, quality improves
Day 4: Test snapshot -> save this good state
Day N: Edit prompt -> Test diff -> verify no regression -> deploy
All TUI actions are also available as CLI commands for scripting and CI pipelines:
eddgate init # scaffold project
eddgate doctor # check setup
eddgate run example -i input.txt # run workflow
eddgate analyze -d traces # find failure patterns
eddgate test snapshot -d traces # save behavioral baseline
eddgate test diff -d traces # detect regressionseddgate closes the loop: failure analysis feeds directly back into execution improvement. One tool, one cycle.
What are these? Automatic quality checkpoints between each step. If a step produces garbage, the pipeline stops immediately instead of wasting tokens on the next step. You don't need to configure these -- they're defined in the workflow YAML and run automatically.
Two tiers:
- Tier 1: Zod schema checks. Deterministic. 0% false positives. 5ms. Every step.
- Tier 2: LLM-as-judge. Groundedness/relevance. Key transitions only.
When to use: You ran a workflow and the results were bad (failed steps, low quality scores, unexpected outputs). This command reads your trace files and tells you what went wrong, how often, and how to fix it. It can also auto-generate rules that prevent the same failure next time.
eddgate analyze -d traces 105 failures in 2 patterns:
C1 Eval gate failed at "validate_final" (avg score: 0.75, 103 times)
103 occurrences (98%)
Score range: 0.42 - 0.85
Fix: lower threshold or improve prompt specificity
Rule: validate_final_adjusted_threshold.yaml
C2 Rate limit hit at "validate_final" (2 times)
Fix: add delay between steps or reduce maxRetries
eddgate analyze -d traces --generate-rules # auto-create rules
eddgate analyze -d traces --context # context window profilerGenerated rules are auto-loaded on next eddgate run.
When to use: Your pipeline works well and you want to make sure future changes (prompt edits, model switches, config tweaks) don't break it. Think of it like unit tests for your AI pipeline.
eddgate test snapshot -d traces # save baseline
# ... modify prompts ...
eddgate run my-workflow -i input.txt --trace-jsonl traces/new.jsonl
eddgate test diff -d traces # compare against baseline REGRESSIONS (1):
validate_final.evalScore
before: 0.78
after: 0.65
-> REGRESSION
PASS: No regressions. (or) FAIL: Regressions detected.
CI exit code 1 on regression. Plug into GitHub Actions.
When to use: Your workflow costs more than expected, or takes too long. The profiler reads your traces and shows exactly which step is burning the most tokens (= money). Common finding: a validation step retrying 48 times when it should retry 2.
eddgate analyze -d traces --context Per-step breakdown:
Step Calls Input Output Total % of Total
retrieve 1 935 3,655 4,590 15.4% ====
generate_citation 2 6 6,908 6,914 23.2% ======
validate_final 48 63,744 38,400 102,144 34.3% =========
Waste detected:
"validate_final" made 48 calls (2 expected) -- ~100K tokens wasted on retries
Recommendations:
Reduce retries for "validate_final" or lower eval threshold
Run eddgate for the full-screen terminal UI. Everything is accessible from menus -- no commands to memorize.
| Menu | What it does |
|---|---|
| Run | Select workflow, model, effort, thinking mode, input. Configure run options (HTML report, JSONL trace, budget limit, dry run). Live dashboard during execution. Results panel with step table after completion. |
| Analyze | Failure analysis, context profiler, offline eval, A/B prompt test, diff-eval, version-diff. |
| Test | Snapshot, diff, list snapshots, deployment gate (check thresholds). |
| Monitor | Status overview (success rate gauge, metrics table), cost breakdown (bar chart by model, table by step), quality scores (eval averages with distribution bars). All from saved traces. |
| Traces | Browse saved trace files. Select one to view: steps summary (left) + full event timeline (right) with color-coded events, token counts, scores. |
| MCP | Add/remove/list MCP servers without editing YAML. |
| Plugins | View workflows/roles, visualize workflow, debug single step, RAG index/search (Pinecone MCP), import from file. |
| Settings | Default model, language (Korean/English), view config, doctor (health check), init (scaffold project). |
Keyboard: Arrow keys navigate, Enter selects, Esc goes back, q quits, Tab switches panels.
After selecting workflow/model/effort/thinking, a run options menu lets you configure:
| Option | What it does |
|---|---|
| Start run | Proceed with current settings. |
| Save HTML report | Enter path -- generates dark-mode HTML report after execution. |
| Save JSONL trace | Enter path -- records all events during execution. |
| Set budget limit | Enter USD amount -- stops workflow if cost exceeds limit. |
| Dry run | Toggle on to preview workflow structure without executing. |
Options loop back so you can set multiple before starting.
During workflow execution, the TUI shows a live orchestration dashboard:
All TUI actions are also available as commands for scripting and CI pipelines.
| Command | What it does |
|---|---|
eddgate run <workflow> |
Execute a workflow with validation gates. Fails fast on bad output. |
eddgate analyze |
Cluster failure patterns, suggest fixes. --generate-rules creates YAML rules. --context shows token usage. |
eddgate test snapshot |
Save current behavioral baseline from traces. |
eddgate test diff |
Compare against baseline. Exits 1 on regression (CI-friendly). |
eddgate test list |
Show saved snapshots. |
eddgate init |
Create project structure. |
eddgate doctor |
Check Node.js, Claude CLI, config validity, graph integrity. |
eddgate list workflows |
List available workflow YAML files. |
eddgate list roles |
List available role definitions. |
| Flag | What it does |
|---|---|
-i, --input <file> |
Input file or text. If file, contents are read. |
-m, --model <model> |
Override model: sonnet, opus, haiku, claude-opus-4-5, claude-sonnet-4-5 |
-e, --effort <level> |
Effort: low, medium, high, max |
--report <path> |
Generate HTML report (dark mode, collapsible steps, score gauges). |
--trace-jsonl <path> |
Save structured JSONL trace for later analysis. |
--max-budget-usd <n> |
Stop workflow if accumulated cost exceeds this amount. |
--dry-run |
Preview workflow structure without executing. |
--json |
Machine-readable JSON output. |
--quiet |
Errors only. |
| Command | What it does |
|---|---|
eddgate advanced eval <workflow> |
Re-score saved traces using LLM judge. |
eddgate advanced diff-eval <workflow> |
Compare eval scores between git commits. |
eddgate advanced gate |
Deployment gate. Exits 1 if thresholds not met. |
eddgate advanced monitor status |
Success rate, p50/p95 latency, tokens, cost. |
eddgate advanced monitor cost |
Cost breakdown by model and step. |
eddgate advanced monitor quality |
Eval score trends over time. |
eddgate advanced viz <workflow> |
Mermaid diagram or ASCII visualization. |
eddgate advanced step <workflow> <step-id> |
Run a single step in isolation. |
eddgate advanced trace <file> |
View JSONL trace with timeline. |
eddgate advanced mcp <action> |
Manage MCP servers: list, add, remove. |
eddgate advanced version-diff |
Prompt/workflow changes between git commits. |
eddgate advanced rag index |
Chunk documents and upsert to Pinecone via MCP. |
eddgate advanced rag search <query> |
Search Pinecone index, return ranked chunks. |
eddgate advanced ab-test |
Run same workflow with two prompt variants, compare scores. |
eddgate advanced improve |
Auto-suggest prompt fixes from failure analysis. --apply to auto-write. |
eddgate serve |
Start HTTP API server for workflow execution. --port 3000 |
When to use: You have a folder of company documents (PDFs, markdown, text files) and want AI to answer questions based on those documents instead of making things up. First index the documents, then use rag-pipeline workflow to query them.
# Index documents
eddgate advanced rag index -d docs/ --index my-docs
# Search
eddgate advanced rag search "how does auth work?" --index my-docsOr from the TUI: Plugins > RAG index / RAG search.
Built-in rag-pipeline workflow: classify query -> vector search -> grounded generation -> validate groundedness.
steps:
- id: "retrieve_context"
type: "retrieve"
context:
tools: ["mcp:pinecone:search-records"]
- id: "generate_answer"
type: "generate"
evaluation:
type: "groundedness"
threshold: 0.7Requires Pinecone MCP server configured in eddgate.config.yaml.
When to use: You rewrote a prompt and want to know if the new version is actually better, not just different. Runs both versions on the same input and uses a statistical test (Welch's t-test) to determine if the difference is real or just random noise.
eddgate advanced ab-test \
--workflow document-pipeline \
--prompt-a templates/prompts/analyzer.md \
--prompt-b templates/prompts/analyzer.v2.md \
-i input.txt \
-n 3Or from the TUI: Analyze > A/B prompt test.
Output:
Metric Variant A Variant B Delta
──────────────────── ────────────── ────────────── ──────────────
Avg Score 0.742 0.819 +0.077
Avg Tokens 8,450 7,200 -1,250
Avg Cost $0.0234 $0.0198 -0.0036
Avg Time 12.3s 10.8s -1.5s
Winner: Variant B
Score advantage: 0.077
Winner logic: uses Welch's t-test (p < 0.05) to determine statistical significance. Reports p-value and 95% confidence interval. Runs are interleaved (ABABAB) to eliminate ordering bias.
name: "My Pipeline"
config:
defaultModel: "sonnet"
topology: "pipeline"
onValidationFail: "block"
steps:
- id: "analyze"
type: "classify"
context:
identity:
role: "analyzer"
constraints: ["output JSON"]
tools: []
validation:
rules:
- type: "required_fields"
spec: { fields: ["topics"] }
message: "topics required"
- id: "generate"
type: "generate"
dependsOn: ["analyze"]
evaluation:
enabled: true
type: "groundedness"
threshold: 0.7
onFail: "block"Independent steps run in parallel automatically with topology: "parallel":
config:
topology: "parallel" # independent steps run concurrently
steps:
- id: "search_docs"
type: "retrieve"
tools: ["web_search"]
- id: "search_code"
type: "retrieve"
tools: ["file_read"]
- id: "combine"
type: "generate"
dependsOn: ["search_docs", "search_code"] # waits for bothsearch_docs and search_code run simultaneously. combine waits for both. Typical speedup: 30-40% for workflows with independent retrieval steps.
Auto-detects backend:
| Backend | Setup | Cost |
|---|---|---|
| Claude CLI | Any Claude subscription | Included |
| Anthropic API | ANTHROPIC_API_KEY |
Per token |
| Workflow | What to feed in | What you get | Steps |
|---|---|---|---|
document-pipeline |
Any long document (.md, .txt, .pdf text) -- e.g. a 30-page policy doc | Structured summary with [1] citations, organized by topic, with reference list |
8 |
code-review |
A diff file (git diff > changes.txt) or code snippet |
Issue list: severity, line numbers, what's wrong, how to fix | 3 |
bug-fix |
Error log, stack trace, or bug description text | Root cause analysis + proposed fix + verification that the fix works | 4 |
api-design |
Requirements doc or feature description | OpenAPI-style endpoint design + request/response examples + docs | 3 |
translation |
Any text file in source language | Translated text + back-translation accuracy score (catches mistranslations) | 3 |
rag-pipeline |
A question (docs must be indexed first via Plugins > RAG) | Answer grounded in your documents + source citations + hallucination score | 4 |
Don't see what you need? Copy any YAML, edit the steps, and you have a custom workflow. Or import one via Plugins > Import workflow.
Default: 0.7 (industry standard for LLM-as-judge).
| Score | Meaning |
|---|---|
| 0.7+ | Pass -- acceptable quality |
| 0.8+ | Good -- bug-fix and translation use this |
| 0.9+ | Unrealistic for most LLM tasks (judge agreement is ~80-85%) |
| < 0.7 | Fail -- gate blocks, retry or stop |
Configurable per step in workflow YAML. eddgate analyze suggests adjusted thresholds based on observed score ranges.
When to use: You ran a workflow, some steps failed, and you want to know how to fix the prompt -- not just that it failed. This reads your failure patterns and generates concrete prompt edits, then lets you review each one before applying.
# CLI: auto-suggest and apply
eddgate advanced improve -d traces --prompts templates/prompts --apply
# CLI: preview only
eddgate advanced improve -d traces --prompts templates/prompts --dry-runOr from the TUI: Analyze > Auto-improve prompts.
1. Reads traces/*.jsonl -> finds failure clusters (same as Analyze)
2. For each failing step, loads the prompt file (templates/prompts/<role>.md)
3. Sends current prompt + failure patterns to LLM -> gets rewritten prompt
4. Shows you the diff: original (left) vs suggested (right)
5. You choose: Approve / Modify / Skip for each suggestion
6. Approved changes are written to the prompt file immediately
Each suggestion includes confidence level (high/medium/low) and the failure pattern that triggered it.
When to use: You want to trigger eddgate workflows from another system (a web app, a Slack bot, a cron job) without using the CLI directly.
eddgate serve --port 3000| Method | Path | Description |
|---|---|---|
GET |
/health |
Health check (status, version, uptime) |
GET |
/workflows |
List available workflow YAML files |
POST |
/run |
Start a workflow (returns runId immediately) |
GET |
/runs |
List all runs with status |
GET |
/runs/:id |
Get run result (steps, tokens, cost, eval scores) |
# Start server
eddgate serve --port 3000
# From another terminal:
# Start a workflow
curl -X POST http://localhost:3000/run \
-H "Content-Type: application/json" \
-d '{"workflow": "document-pipeline", "input": "Summarize this report..."}'
# -> {"runId": "run-1711...", "status": "running"}
# Check status
curl http://localhost:3000/runs/run-1711...
# -> {"status": "completed", "result": {"totalCost": 0.02, "steps": [...]}}Workflows execute asynchronously -- POST /run returns immediately with a runId, then you poll GET /runs/:id for the result. No external dependencies (uses Node.js built-in http module).
When to use: You don't need to configure anything. This works automatically. Every time you run a workflow, eddgate remembers what happened (which steps failed, what scores were, what errors occurred). Next time you run the same workflow, that knowledge is injected into the AI's system prompt so it can avoid previous mistakes.
Run #1: "validate_final" fails 3 times with "missing citations"
-> saved to .eddgate/memory/
Run #2: System prompt now includes:
"Previous Run Insights (1 run, 0% success rate)
Known issues: validate_final failed 3x: missing citations
Avg quality: validate_final: 0.45 (problematic)"
-> AI knows to add citations this time
Stored in .eddgate/memory/ as JSON files (max 50, auto-pruned):
- Workflow name, status, duration, cost
- Per-step: status, eval score, error message
- Aggregated into: success rate, top issues, avg scores per step
Before each run, a concise summary is built and appended to every agent's system prompt:
## Previous Run Insights (5 runs, 60% success rate)
Known issues from previous runs:
- validate_final: 3 failures (missing required citations)
- generate_draft: 2 failures (output too short)
Average quality scores by step:
- classify_input: 0.85 (good)
- generate_draft: 0.52 (needs attention)
- validate_final: 0.45 (problematic)
This is automatic and non-blocking -- if memory loading fails, the workflow runs normally without it.
# .github/workflows/eddgate-loop.yml
name: eddgate loop
on:
push:
paths: ['templates/prompts/**', 'templates/workflows/**']
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with: { node-version: 22 }
- run: npm ci && npm run build
# Validate workflow graphs
- run: node dist/cli/index.js doctor --ci -w templates/workflows
# Check for regressions
- run: node dist/cli/index.js test diff -d traces
# Deployment gate
- run: node dist/cli/index.js advanced gate --results eval-results.json --rules templates/gate-rules.yamltest diff exits 1 on regression. gate exits 1 on threshold failure. CI blocks the merge.
MIT
