docs(evaluation): revamp evals documentation for new eval system #648

Open
KarthikAvinashFI wants to merge 7 commits into dev from karthikavinash/th-4638-evals-revamp-doc

@KarthikAvinashFI KarthikAvinashFI commented May 8, 2026

Summary

Brings the evaluation docs in line with the post-revamp platform. Replaces the old four-method taxonomy (LLM as Judge / Deterministic / Statistical Metric / LLM as Ranker) with what the UI actually shows today: Agents, LLM-As-A-Judge, and Code. Adds new concept and feature pages for things that were undocumented (composite evals, versioning, ground truth, error localization, test playground, data injection, output types in their new label-based form, MCP connectors). Rewrites the trace and simulation eval guides around the actual Tasks and Create a Simulation flows.

Linear: TH-4638, TH-4934

What changed

Concepts (under evaluation/concepts/)

Rewritten: eval-types, eval-templates, eval-results, judge-models, understanding-evaluation.
New: output-types, data-injection, composite-evals, versioning, mcp-connectors.

Features (under evaluation/features/)

Rewritten: custom, evaluate.
New: test-playground, error-localization, ground-truth, mcp-connectors.
Minor: custom-models (added trace projects to the surfaces list).

Cookbooks (under cookbook/evaluation/)

New: eval-with-mcp-connectors — end-to-end CRM lookup example.

Surface-specific eval guides (outside evaluation/)

observe/features/evals rewritten around the Tasks flow (Basic Info / Evaluations / Filters / Scheduling) and the Historical data / New incoming data run modes.

quickstart/running-evals-in-simulation aligned with the 4-step Create a Simulation wizard (Add simulation details, Choose Scenario(s), Select Evaluations, Summary) and updated mapping fields.

Navigation

src/lib/navigation.ts updated to include the new concept, feature, and cookbook pages in the sidebar.

Removed

eval-groups.mdx and all references. The Groups feature is no longer reachable from the main UI navigation.

Style guide compliance

  • All concept pages start with ## About; no UI walkthrough screenshots in concept pages.
  • All feature pages have one screenshot placeholder per major step.
  • No em-dashes, no marketing language, no bold headings.
  • Internal terms (AgentLoop, Falcon AI Loop, Temporal, Celery, RestrictedPython, nsjail, VLLM, internal class names) do not appear in any doc.

Verification

  • pnpm audit-links — 0 broken nav links, 0 broken content links.
  • Every concrete UI claim cross-checked against the live frontend (EvalCreatePage.jsx, EvalPickerConfigFull.jsx, TestPlayground.jsx, etc.) and backend (model_hub/types.py, evaluations/engine/instance.py, ee/evals/llm/agent_evaluator/).

Test plan

  • pnpm audit-links passes.
  • pnpm build passes.
  • Spot-check each new and rewritten page in pnpm dev.
  • Replace screenshot placeholders before un-drafting.

Aligns the evaluation docs with the post-revamp platform: three eval
types (Agents / LLM-As-A-Judge / Code), three output types
(Pass/fail / Scoring / Choices), composite templates, versioning,
ground truth, error localization, and updated apply flows for
datasets, trace projects (now via Tasks), and simulation.

Concepts (rewritten / new):
- eval-types: 3-type taxonomy matching the create-page tabs
- eval-templates: built-in vs custom, single vs composite, versioning
- eval-results: result formats per output type
- judge-models: Turing models + bring-your-own
- understanding-evaluation: surfaces and how it all fits
- output-types (new): Pass/fail, Scoring (label-based), Choices
- data-injection (new): the six Context options
- composite-evals (new): aggregation functions and child axis
- versioning (new): Set as Default, Restore Version, pinning

Features (rewritten / new):
- custom: full create flow for all 3 types with field reference
- evaluate: dataset apply flow + SDK
- test-playground (new): four source modes, AI generate
- error-localization (new): toggle, run lifecycle, SDK
- ground-truth (new): upload, mapping, embedding statuses

Surface-specific updates:
- observe/features/evals: rewritten around the Tasks page flow
  (Basic Info / Evaluations / Filters / Scheduling)
- quickstart/running-evals-in-simulation: aligned with the
  4-step Create a Simulation wizard

Eval Groups was removed from docs as the feature is no longer
exposed in the main UI navigation.

TH-4638

Adds reference pages for built-in evals that were missing documentation
(deterministic, statistical, and agent-mode templates). Also fixes
Detect Hallucination input requirement.

Adds rows for the freshly-generated reference pages so users can find
them from the Built-in Evals catalog.

Cleaned up auto-generated descriptions and parameter tables across 90 built-in
eval reference pages. Removed truncated description suffixes, replaced
placeholder parameter descriptions (e.g. "The output.") with concrete
type and value-shape information.

@KarthikAvinashFI KarthikAvinashFI marked this pull request as ready for review May 11, 2026 07:33
…k (TH-4934)

- Concept: how connectors plug into Agent-mode evals, what the judge sees, cost and latency.
- Feature: UI walkthrough for attaching connectors to an eval, troubleshooting table.
- Cookbook: end-to-end example using a CRM MCP server to verify support replies.
- Nav: register all three under Evaluation / Cookbooks.
SuhaniNagpal7 pushed a commit that referenced this pull request May 22, 2026
Restructures /docs/evaluation/ to match internal-docs/product-docs-playbook
recommendations and to make every documented eval correspond to a real
UI-visible template.

## New IA

  Evaluation
  ├── Overview (rewritten, absorbs understanding-evaluation.mdx)
  ├── Quickstart (new — /docs/quickstart/evals.mdx, SDK-first)
  ├── Concepts/ (9 retrofitted pages)
  ├── Run evals/ (4 new how-tos split from evaluate.mdx + cicd.mdx moved in)
  ├── Build evals/ (5 pages moved from features/)
  ├── Judge models/ (2 pages moved from features/)
  └── Evaluator catalog/ (new builtin/categories/ with 8 catalog pages)

## Key changes

- Split features/evaluate.mdx into 4 task-shaped pages under run/:
  in-the-ui, python-sdk, typescript-sdk, api. Each uses the canonical
  fi.evals.evaluate() function and ai-evaluation package, replacing the
  stale Evaluator-class pattern.
- Moved 8 feature pages into build/ and judge-models/ via git mv to
  preserve history. Updated cross-section links accordingly.
- Rewrote evaluation/index.mdx as a true overview with Mermaid lifecycle
  diagram, "Where to start" cards, and intent-driven Next Steps.
- Retrofitted 9 concept pages to playbook 03 anatomy: added Mermaid
  diagrams, "What it isn't" boundary sections, and concept-page
  frontmatter (page_type, diataxis, primary_question, direct_answer,
  has_diagram, related_concepts, etc.).
- Added build/custom.mdx UI/SDK tab structure and judge-models pages
  with corrected SDK examples.
- Created 8 evaluator catalog category pages (RAG, Agent, Safety, Text,
  Format, Code, Multimodal, Audio) generated from system_evals YAMLs.
  Each row sorted alphabetically by template name.
- Rewrote builtin/index.mdx as a catalog hub: 8-card category grid +
  A-Z table trimmed to the 129 UI-visible templates (was 152). The 23
  hidden/orphan rows are unlinked from the catalog; their leaf files
  remain on disk for direct-URL access.
- Restored src/components/docs/Mermaid.astro (was in commit 0ad763d
  but missing on PR #648's base) and registered it in the auto-import
  map. Converted ```mermaid fences to <Mermaid code={...} />.
- Cross-section: fixed inbound links from faq, dataset/features,
  simulation/features, cookbooks, redirects.ts to use the new
  build/, run/, judge-models/ paths.
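
The fence-to-component conversion mentioned in the list above could be sketched roughly as follows. This is an illustrative helper, not the script used in the commit: the function name `convertMermaidFences` and the choice to serialize the diagram with `JSON.stringify` are assumptions. Only the `<Mermaid code={...} />` target shape comes from the commit message.

```typescript
// Hypothetical helper (not from the PR): rewrites fenced mermaid blocks in
// MDX source into <Mermaid code={...} /> component calls.

// Built dynamically so this example never contains a literal triple backtick.
const FENCE = "`".repeat(3);

function convertMermaidFences(mdx: string): string {
  // Matches an opening fence tagged "mermaid", the diagram body (lazy),
  // and a closing fence on its own line.
  const pattern = new RegExp(
    `^${FENCE}mermaid[ \\t]*\\r?\\n([\\s\\S]*?)\\r?\\n${FENCE}[ \\t]*$`,
    "gm",
  );
  return mdx.replace(pattern, (_match: string, body: string) => {
    // JSON.stringify produces a valid JS string literal, escaping quotes
    // and newlines so the diagram survives inside a JSX expression.
    return `<Mermaid code={${JSON.stringify(body)}} />`;
  });
}
```

Text outside mermaid fences passes through unchanged, so the helper can be run over every page in the section without first filtering for pages that contain diagrams.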

## Conventions enforced section-wide

- Heading: ## Related concepts on concept pages (playbook 03),
  ## Next steps on everything else.
- Bullet style: - [Link](url): short description.
- Sentence case below H1; ban-list still clear (no powerful, seamless,
  simply, etc.).
- 0 em-dashes across new content.
- 0 unsupported icon names (mapped to the Card component's iconPaths).
- 0 stale "Future AGI" (with space) in the eval section.
- 0 stale "Evaluator class with eval_templates=/inputs=/model_name="
  pattern in the run/, quickstart/, judge-models/ pages.
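
Several of the conventions above are mechanical enough to check with a small script. The sketch below is one possible shape for such a check, under stated assumptions: the names `lintConventions` and `Violation` are hypothetical, and the word list contains only the three ban-list entries named in this commit message (the full list is not given here).

```typescript
// Hypothetical lint sketch (not the repo's actual tooling).
interface Violation {
  line: number; // 1-based line number
  rule: "em-dash" | "banned-word" | "stale-brand";
  match: string;
}

const EM_DASH = "\u2014";
// Only the words named in the commit message; the real ban-list is longer.
const BANNED_WORDS = ["powerful", "seamless", "simply"];
// The spaced spelling that the commit message reports as stale.
const STALE_BRAND = "Future AGI";

function lintConventions(text: string): Violation[] {
  const violations: Violation[] = [];
  text.split(/\r?\n/).forEach((content, index) => {
    const line = index + 1;
    if (content.includes(EM_DASH)) {
      violations.push({ line, rule: "em-dash", match: EM_DASH });
    }
    for (const word of BANNED_WORDS) {
      // Whole-word, case-insensitive match: catches "Simply", but not
      // derived forms such as "seamlessly".
      if (new RegExp(`\\b${word}\\b`, "i").test(content)) {
        violations.push({ line, rule: "banned-word", match: word });
      }
    }
    if (content.includes(STALE_BRAND)) {
      violations.push({ line, rule: "stale-brand", match: STALE_BRAND });
    }
  });
  return violations;
}
```

A production check would probably also skip fenced code blocks and frontmatter, which this sketch does not attempt.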

## Verification

- pnpm build: 714 pages, no errors.
- pnpm audit-links: 0 broken nav, 0 broken content links.
- All 129 A-Z table rows link to an existing leaf page.
- All 13 alias-slug templates (bleu_score → bleu, ASR/STT_accuracy →
  audio-transcription, etc.) linked correctly across category pages.
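
The link checks above amount to resolving each template name to a leaf slug (through an alias where one exists) and confirming that the leaf page is present. A minimal sketch of that check, assuming a hypothetical `resolveLeafSlug` helper and a fallback of using the template name itself as the slug when no alias applies; only the two alias pairs quoted above are included, since the other 11 are not listed:

```typescript
// Hypothetical sketch (not the repo's actual check).
// Alias pairs taken from the commit message; the remaining aliases are unknown.
const ALIAS_SLUGS = new Map<string, string>([
  ["bleu_score", "bleu"],
  ["ASR/STT_accuracy", "audio-transcription"],
]);

// Assumption: a template without an alias uses its own name as the slug.
function resolveLeafSlug(templateName: string): string {
  return ALIAS_SLUGS.get(templateName) ?? templateName;
}

// Returns the template names whose resolved slug has no leaf page.
function findMissingLeaves(
  templateNames: string[],
  existingSlugs: ReadonlySet<string>,
): string[] {
  return templateNames.filter(
    (name) => !existingSlugs.has(resolveLeafSlug(name)),
  );
}
```

An empty result from `findMissingLeaves` over the A-Z table rows would correspond to the "all rows link to an existing leaf page" claim.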

## Out of scope (Tier 2 — follow-up)

- Reference subsection (eval result schema, evaluator input schema,
  score types).
- Troubleshooting subsection (5 symptom-driven pages).
- Stale Evaluator-class pattern in the 153 individual builtin leaf
  pages (authored in PR #648; left for a separate cleanup pass).
- Deletion of the 23 unlinked stale leaves (user said: separate commit).
- Repo-wide "Future AGI" → "FutureAGI" rename (out of scope).