docs(evals): IA reshape + Tier 2 reference and troubleshooting pages by SuhaniNagpal7 · Pull Request #661 · future-agi/docs

SuhaniNagpal7 · 2026-05-22T06:08:48Z

Summary

Reshapes the /docs/evaluation/ section to the playbook IA and adds the Tier 2 reference + troubleshooting subgroups. 92 files touched: 28 added, 26 deleted, 31 modified, 7 renamed.

What changed

1. New `Reference/` subgroup (3 pages)

src/pages/docs/evaluation/reference/result-schema.mdx — eval result fields (output, score, passed, reason, runtime/latency_ms, model, eval_id, eval_name), JSON shapes per output type, async retrieval pattern via Evaluator.evaluate(is_async=True) + evaluator.get_eval_result(eval_id).
src/pages/docs/evaluation/reference/input-schema.mdx — canonical input keys (input, output, context, expected, reference, hypothesis, conversation, etc.), mapping mechanism per surface (Dataset / Trace / Simulation / Test playground / SDK).
src/pages/docs/evaluation/reference/score-types.mdx — strict mapping of pass_fail / percentage / deterministic output types to numeric scores and pass derivation; Code eval return-value rules; composite aggregation functions; pass threshold semantics.
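
As a reading aid for the result fields listed for result-schema.mdx, here is a minimal Python sketch of a typed container for one eval result. Only the field names come from the list above. The field types, the choice of `latency_ms` for the "runtime/latency_ms" entry, the `EvalResult` class, and the sample values are all assumptions for illustration, not the SDK's actual classes or response shape.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class EvalResult:
    """Hypothetical container for the documented result fields (types are assumed)."""

    eval_id: str
    eval_name: str
    output: Any                  # raw output; its shape depends on the output type
    score: Optional[float]       # numeric score, when the eval produces one
    passed: Optional[bool]       # pass/fail verdict, when one is derived
    reason: str                  # free-text explanation of the verdict
    latency_ms: Optional[float]  # assumed name for the "runtime/latency_ms" field
    model: Optional[str]         # judge model used, if any

    @classmethod
    def from_dict(cls, payload: dict) -> "EvalResult":
        # Only the two identifiers are treated as required in this sketch;
        # every other field falls back to None or an empty string.
        return cls(
            eval_id=str(payload["eval_id"]),
            eval_name=str(payload["eval_name"]),
            output=payload.get("output"),
            score=payload.get("score"),
            passed=payload.get("passed"),
            reason=payload.get("reason", ""),
            latency_ms=payload.get("latency_ms"),
            model=payload.get("model"),
        )


# Hypothetical payload, written by hand for this example.
example = EvalResult.from_dict({
    "eval_id": "hypothetical-id-123",
    "eval_name": "is-concise",
    "output": "Passed",
    "score": 1.0,
    "passed": True,
    "reason": "The response is brief and on topic.",
    "latency_ms": 840,
    "model": "hypothetical-judge-model",
})
```

The authoritative field list and JSON shapes are the ones on the result-schema.mdx page itself.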

2. New `Troubleshooting/` subgroup (5 pages)

score-drift.mdx — "Eval scores changed unexpectedly": template version drift, model swap, judge non-determinism, mapping change, dataset edits.
judge-variance.mdx — "Judge output is inconsistent": subjective criteria, temperature, threshold-in-noise-band, ground-truth anchoring, when to switch to Code.
slow-runs.mdx — "Eval run is slow": model choice, sync vs async, Agent tool calls, long context, multimodal, rate limits; per-row latency expectation table.
mapping.mdx — "Dataset fields don't match the eval template": required-key mapping, media auto-detection on url/file/image/audio/video substrings, field-type mismatches.
ci-failures.mdx — "CI eval gate failed": auth, threshold flap, version pin drift, model swap, rate limits, GitHub Actions baseline.
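
The substring-based media auto-detection mentioned for mapping.mdx can be sketched as follows. The five substrings come from the list above. The assumptions are that they are matched against the dataset column name and that the match is case-insensitive; the function names are hypothetical and this is not the product's implementation.

```python
# Substrings named in the mapping.mdx summary.
MEDIA_SUBSTRINGS = ("url", "file", "image", "audio", "video")


def looks_like_media_field(column_name: str) -> bool:
    """Return True if the column name contains any media substring (assumed rule)."""
    name = column_name.lower()
    return any(token in name for token in MEDIA_SUBSTRINGS)


def split_columns(columns):
    """Partition column names into (media-like, plain) lists, preserving order."""
    media = [c for c in columns if looks_like_media_field(c)]
    plain = [c for c in columns if not looks_like_media_field(c)]
    return media, plain


media_cols, plain_cols = split_columns(
    ["input", "image_url", "profile_summary", "expected"]
)
```

Under these assumptions a plain-text column such as `profile_summary` would be classified as media because it contains `file`, which is the kind of mismatch a mapping troubleshooting page would need to cover.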

3. New `Run/` subgroup (replaces `features/cicd.mdx`)

run/in-the-ui.mdx — dashboard flow.
run/python-sdk.mdx — Python SDK using from fi.evals import evaluate.
run/typescript-sdk.mdx — @future-agi/ai-evaluation package.
run/api.mdx — REST endpoint POST /sdk/api/v1/new-eval/ with sync + async (is_async=true) patterns.
run/cicd.mdx — moved from features/cicd.mdx, updated.

4. New `Build/` subgroup (moved from `features/`)

Renamed features/custom.mdx → build/custom.mdx (added UI + SDK tabs).
Renamed features/test-playground.mdx → build/test-playground.mdx.
Renamed features/ground-truth.mdx → build/ground-truth.mdx.
Renamed features/error-localization.mdx → build/error-localization.mdx.
Renamed features/mcp-connectors.mdx → build/mcp-connectors.mdx (uses 5 real product screenshots).

5. New `Judge models/` subgroup

Renamed features/custom-models.mdx → judge-models/custom.mdx.
New judge-models/futureagi.mdx — Turing flash / small / large tiers.

6. New `Evaluator catalog/` (catalog group under `builtin/`)

8 new category pages: categories/rag.mdx, agent.mdx, safety.mdx, text.mdx, format.mdx, code.mdx, multimodal.mdx, audio.mdx. Each lists the evaluators in that category (alphabetical A-Z).
builtin/index.mdx rewritten as a full A-Z catalog table (alphabetical, all 130 visible UI evals).

7. Concept-page rewrites

Rewrote 9 concept pages to add the playbook concept-page frontmatter (page_type, diataxis, concept_family, related_concepts/tasks, schema_type), "About" section, and ## Related concepts bullets:

concepts/eval-types.mdx, eval-templates.mdx, output-types.mdx, judge-models.mdx, eval-results.mdx, composite-evals.mdx, versioning.mdx, data-injection.mdx, mcp-connectors.mdx.
Deleted concepts/understanding-evaluation.mdx (replaced by the section overview).

8. Evaluation section landing

evaluation/index.mdx — rewritten as a playbook-style overview with the eval count fixed to 130+, removed the hero screenshot.

9. Quickstart

New src/pages/docs/quickstart/evals.mdx — first-eval-in-5-minutes flow.

10. Deletions (26 files)

23 stale eval leaf pages removed because they're hidden in the production UI (visible_ui: false in model_hub/system_evals/*.yaml), so the docs page had no in-product entry point:

answer-similarity, api-call, contain-evals, contains-all, contains-any, contains-none, content-moderation, content-safety-violation, custom-code-evaluation, deterministic-evals, ends-with, equals, factual-accuracy, instruction-adherence, is-compliant, is-factually-consistent, json-scheme-validation, length-between, length-greater-than, length-less-than, recall-score, regex, starts-with.

3 other deletions:

concepts/understanding-evaluation.mdx — superseded by section overview.
features/evaluate.mdx — content split between Run/ pages.
features/futureagi-models.mdx — moved to judge-models/futureagi.mdx.

11. Components and infra

src/components/docs/Mermaid.astro — restored from commit 0ad763d7 for diagrams used in concept pages.
src/plugins/vite-docs-transform.mjs — added Mermaid to auto-import map.
src/lib/navigation.ts — added Build/, Run/, Judge models/, Evaluator catalog/, Reference/, Troubleshooting/ subgroups under Evaluation.
src/lib/redirects.ts — added redirects for moved pages (features/cicd → run/cicd, features/custom-models → judge-models/custom, etc.) plus 8 alias slugs (bleu_score → bleu, etc.).

12. Assets

5 new product screenshots under public/images/docs/evaluation/mcp-connectors/{1,2,3,4,5}.png (replacing 5 broken paths in the MCP connectors guide).

13. Cross-link updates

Updated outbound links in 12 files that pointed at moved/deleted pages:

cookbook/decrease-hallucination.mdx, cookbook/evaluation/eval-correction-loop.mdx, cookbook/evaluation/eval-with-mcp-connectors.mdx
dataset/features/experiments.mdx, dataset/features/run-prompt.mdx
8 individual builtin/*.mdx pages (context-adherence, customer-agent-prompt-conformance, detect-hallucination, is-concise, is-email, is-helpful, llm-function-calling, task-completion)
faq.mdx, observe/features/evals.mdx, quickstart/running-evals-in-simulation.mdx, simulation/features/prompt-simulation.mdx

Code accuracy

All Python/JSON snippets in the new Reference and Troubleshooting pages were verified against the canonical SDK surface:

evaluate() function in src/pages/docs/sdk/evals/evaluate.mdx
Evaluator class in src/pages/docs/sdk/evals/cloud-evals.mdx
Async pattern in src/pages/docs/cookbook/quickstart/async-batch-eval.mdx
API response shape in src/pages/docs/evaluation/run/api.mdx

Test plan

npm run dev and walk Evaluation sidebar: Overview, Concepts, Build, Run, Judge models, Evaluator catalog, Reference, Troubleshooting.
Confirm Reference/ and Troubleshooting/ pages render (Mermaid, tables, code blocks).
Confirm 5 MCP connectors screenshots load on evaluation/build/mcp-connectors.
Spot-check redirects: /docs/evaluation/features/cicd → /docs/evaluation/run/cicd.
Spot-check evaluator catalog: A-Z table sort, category pages alphabetical.
Confirm none of the 23 deleted slugs are still linked anywhere.
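
The last check in the test plan can be partly automated. The sketch below scans `.mdx` files for links to the 23 deleted slugs. The slug list is copied from the Deletions section. The link prefix `/docs/evaluation/builtin/` and the helper name `find_stale_links` are assumptions, so adjust the prefix if the deleted leaves were served from a different path.

```python
import re
from pathlib import Path

# The 23 slugs listed under "Deletions".
DELETED_SLUGS = [
    "answer-similarity", "api-call", "contain-evals", "contains-all",
    "contains-any", "contains-none", "content-moderation",
    "content-safety-violation", "custom-code-evaluation",
    "deterministic-evals", "ends-with", "equals", "factual-accuracy",
    "instruction-adherence", "is-compliant", "is-factually-consistent",
    "json-scheme-validation", "length-between", "length-greater-than",
    "length-less-than", "recall-score", "regex", "starts-with",
]


def find_stale_links(docs_root, slugs, prefix="/docs/evaluation/builtin/"):
    """Return (file, line number, slug) for every link to a deleted slug.

    The negative lookahead stops a slug from matching as the prefix of a
    longer, still-valid slug (for example "regex" inside "regex-extra").
    """
    alternatives = "|".join(re.escape(slug) for slug in slugs)
    pattern = re.compile(re.escape(prefix) + "(" + alternatives + r")(?![\w-])")
    hits = []
    for path in sorted(Path(docs_root).rglob("*.mdx")):
        text = path.read_text(encoding="utf-8")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in pattern.finditer(line):
                hits.append((str(path), lineno, match.group(1)))
    return hits
```

Calling `find_stale_links("src/pages/docs", DELETED_SLUGS)` from the repository root should return an empty list if no page still links to a deleted leaf. It complements `pnpm audit-links` rather than replacing it.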

Fix the playbook-flagged P0 issues before the larger IA/concept rewrite:

- Replace stale "70+ built-in templates" with verified "130+" count (129 visible_ui:true templates in model_hub/system_evals/, 153 total including 24 hidden internal templates).
- Rename "Future AGI" -> "FutureAGI" across eval pages to match playbook 18 vocabulary rules (scoped to eval section only; rest of docs repo still uses "Future AGI" per STYLE-GUIDE.md and will be unified later).
- Remove stale {/* SCREENSHOT NEEDED ... */} placeholder from data-injection.mdx (playbook anti-pattern 27).
- Normalize Title Case H2 headings to sentence case per playbook 18 (preserves UI labels: Restore Version, Test vs Save, LLM-As-A-Judge).

pnpm audit-links: 0 broken nav links, 0 broken content links.

Restructures /docs/evaluation/ to match internal-docs/product-docs-playbook recommendations and to make every documented eval correspond to a real UI-visible template.

## New IA

Evaluation
├── Overview (rewritten, absorbs understanding-evaluation.mdx)
├── Quickstart (new: /docs/quickstart/evals.mdx, SDK-first)
├── Concepts/ (9 retrofitted pages)
├── Run evals/ (4 new how-tos split from evaluate.mdx + cicd.mdx moved in)
├── Build evals/ (5 pages moved from features/)
├── Judge models/ (2 pages moved from features/)
└── Evaluator catalog/ (new builtin/categories/ with 8 catalog pages)

## Key changes

- Split features/evaluate.mdx into 4 task-shaped pages under run/: in-the-ui, python-sdk, typescript-sdk, api. Each uses the canonical fi.evals.evaluate() function and ai-evaluation package, replacing the stale Evaluator-class pattern.
- Moved 8 feature pages into build/ and judge-models/ via git mv to preserve history. Updated cross-section links accordingly.
- Rewrote evaluation/index.mdx as a true overview with Mermaid lifecycle diagram, "Where to start" cards, and intent-driven Next Steps.
- Retrofitted 9 concept pages to playbook 03 anatomy: added Mermaid diagrams, "What it isn't" boundary sections, and concept-page frontmatter (page_type, diataxis, primary_question, direct_answer, has_diagram, related_concepts, etc.).
- Added build/custom.mdx UI/SDK tab structure and judge-models pages with corrected SDK examples.
- Created 8 evaluator catalog category pages (RAG, Agent, Safety, Text, Format, Code, Multimodal, Audio) generated from system_evals YAMLs. Each row sorted alphabetically by template name.
- Rewrote builtin/index.mdx as a catalog hub: 8-card category grid + A-Z table trimmed to the 129 UI-visible templates (was 152). The 23 hidden/orphan rows are unlinked from the catalog; their leaf files remain on disk for direct-URL access.
- Restored src/components/docs/Mermaid.astro (was in commit 0ad763d but missing on PR #648's base) and registered it in the auto-import map. Converted ```mermaid fences to <Mermaid code={...} />.
- Cross-section: fixed inbound links from faq, dataset/features, simulation/features, cookbooks, redirects.ts to use the new build/, run/, judge-models/ paths.

## Conventions enforced section-wide

- Heading: ## Related concepts on concept pages (playbook 03), ## Next steps on everything else.
- Bullet style: - [Link](url): short description.
- Sentence case below H1; ban-list still clear (no powerful, seamless, simply, etc.).
- 0 em-dashes across new content.
- 0 unsupported icon names (mapped to the Card component's iconPaths).
- 0 stale "Future AGI" (with space) in the eval section.
- 0 stale "Evaluator class with eval_templates=/inputs=/model_name=" pattern in the run/, quickstart/, judge-models/ pages.

## Verification

- pnpm build: 714 pages, no errors.
- pnpm audit-links: 0 broken nav, 0 broken content links.
- All 129 A-Z table rows link to an existing leaf page.
- All 13 alias-slug templates (bleu_score → bleu, ASR/STT_accuracy → audio-transcription, etc.) linked correctly across category pages.

## Out of scope (Tier 2, follow-up)

- Reference subsection (eval result schema, evaluator input schema, score types).
- Troubleshooting subsection (5 symptom-driven pages).
- Stale Evaluator-class pattern in the 153 individual builtin leaf pages (PR #648 authored, separate cleanup pass).
- Deletion of the 23 unlinked stale leaves (user said: separate commit).
- Repo-wide "Future AGI" → "FutureAGI" rename (out of scope).

… screenshots

Two follow-ups to the IA reshape commit:

- Delete 23 leaf pages whose source YAMLs have visible_ui:false (i.e. templates not shown in the actual eval picker UI). All inbound links were redirected in the previous commit, so removing the pages doesn't introduce dead links. Reduces orphan-page count from 161 to 138 and brings the documented eval set into parity with what users actually see in the dashboard. Removed: answer-similarity, api-call, contain-evals, contains-all, contains-any, contains-none, content-moderation, content-safety-violation, custom-code-evaluation, deterministic-evals, ends-with, equals, factual-accuracy, instruction-adherence, is-compliant, is-factually-consistent, json-scheme-validation, length-between, length-greater-than, length-less-than, recall-score, regex, starts-with
- Replace the 5 broken /screenshot/product/evaluation/mcp-connectors/N.png references on build/mcp-connectors.mdx with real captures saved at /images/docs/evaluation/mcp-connectors/N.png. Drops the 5 stale <Note>Image placeholder</Note> blocks.

Verification:

- pnpm build: 693 pages, 0 errors.
- pnpm audit-links: 0 broken nav, 0 broken content links.

3 reference pages (eval result schema, evaluator input schema, score types) and 5 troubleshooting pages (score drift, judge variance, slow runs, dataset mapping, CI gate failures). Code snippets verified against the canonical fi.evals API surface.

entelligence-ai-pr-reviews · 2026-05-22T06:08:51Z

Automatic Review Skipped

Too many files for automatic review.

If you would still like a review, you can trigger one manually by commenting:

@entelligence review

…enshots

Restructures the Imagine page to match internal-docs/product-docs-playbook feature-deep-dive template and updates all 4 screenshots to dark theme.

Page changes

- Add When to use / When not to use sections.
- Add How it works internally with a Mermaid data-flow diagram (build time -> canvas -> saved view -> live re-bind vs. dynamic re-run).
- Add Troubleshooting table (6 Symptom/Cause/Fix rows) for the common fragile paths: Save View disabled, prose instead of widgets, stuck skeleton loader, empty widgets on a new trace, rate limit, wrong field.
- Replace product-noun Next Steps cards with intent-driven Related links.
- Expand frontmatter to the master spec (page_type, products, feature_status, audience, difficulty, owner, reviewers, last_tested, last_screenshotted, schema_type, seo, geo, canonical, related).
- Reframe "dashboard"/"layout" as "view"/"collection" and replace "LLM" with "agent" where the surface is Falcon's agent (addresses reviewer comments from upstream PR #641).
- Correct dynamic-analysis cache key to (saved_view, widget, trace_id) and saved-view scope to project-or-workspace per the SavedView and ImagineAnalysis models in core-backend.
- Fix two broken internal links: /docs/observe/dashboards -> /docs/observe/features/dashboard, and /docs/observe/tracing -> /docs/observe.

Screenshots

- 10/11/12/13.png recaptured in dark theme; 12.png Save View button repainted (white background + black text) to compensate for a UI contrast bug.

Verified against codebase

- 17 widget types (VALID_WIDGET_TYPES in render_widget.py, WIDGET_REGISTRY in imagine/widgets/index.js).
- 12-column grid, Temporal timeouts (30s/90s/10s) and retry counts, 10/60s chat rate limit, 45s dynamic-analysis endpoint, skeleton-loader text "Falcon is analyzing this trace...".
- Suggested prompt chip wording matches SuggestedPrompts.jsx exactly (product copy is what it is for now; treating chip text as a separate product-copy change rather than a doc edit).

Depends on

- PR #661 (docs(evals): IA reshape...) for src/components/docs/Mermaid.astro. This PR's Mermaid import will only resolve once #661 lands.

nik13

Needs more work. I'll take a call on whether to merge it or pick it up in the docs revamp. On hold for now.

Suhani Nagpal added 4 commits May 21, 2026 16:38

SuhaniNagpal7 requested a review from nik13 May 22, 2026 06:11

SuhaniNagpal7 force-pushed the docs/evals-playbook-revamp branch from 2c329cf to 6cb487c Compare May 22, 2026 06:14

nik13 reviewed May 27, 2026

SuhaniNagpal7 wants to merge 4 commits into karthikavinash/th-4638-evals-revamp-doc from docs/evals-playbook-revamp

2 participants
