docs(evals): IA reshape + Tier 2 reference and troubleshooting pages #661

Open
SuhaniNagpal7 wants to merge 4 commits into karthikavinash/th-4638-evals-revamp-doc from docs/evals-playbook-revamp

Contributor

@SuhaniNagpal7 SuhaniNagpal7 commented May 22, 2026

Summary

Reshapes the /docs/evaluation/ section to the playbook IA and adds the Tier 2 reference + troubleshooting subgroups. 92 files touched: 28 added, 26 deleted, 31 modified, 7 renamed.

What changed

1. New Reference/ subgroup (3 pages)

  • src/pages/docs/evaluation/reference/result-schema.mdx — eval result fields (output, score, passed, reason, runtime/latency_ms, model, eval_id, eval_name), JSON shapes per output type, async retrieval pattern via Evaluator.evaluate(is_async=True) + evaluator.get_eval_result(eval_id).
  • src/pages/docs/evaluation/reference/input-schema.mdx — canonical input keys (input, output, context, expected, reference, hypothesis, conversation, etc.), mapping mechanism per surface (Dataset / Trace / Simulation / Test playground / SDK).
  • src/pages/docs/evaluation/reference/score-types.mdx — strict mapping of pass_fail / percentage / deterministic output types to numeric scores and pass derivation; Code eval return-value rules; composite aggregation functions; pass threshold semantics.
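To make the idea of "output type to numeric score to pass derivation" concrete, here is a minimal sketch. It is not the documented contract: the function name, the "Passed"/"Failed" strings, the 0.5 default threshold, and the inclusive comparison are all assumptions for illustration, and the deterministic type is left out because its rules are not stated here. The authoritative rules are on score-types.mdx.

```python
from dataclasses import dataclass


@dataclass
class ScoredResult:
    score: float
    passed: bool


def derive_score(output_type: str, output, threshold: float = 0.5) -> ScoredResult:
    """Map a raw eval output to (score, passed) under hypothetical rules."""
    if output_type == "pass_fail":
        # Assumption: the raw output is the string "Passed" or "Failed".
        score = 1.0 if str(output).strip().lower() == "passed" else 0.0
    elif output_type == "percentage":
        # Assumption: the raw output is already a fraction in [0, 1].
        score = float(output)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"percentage score out of range: {score}")
    else:
        raise ValueError(f"unsupported output type in this sketch: {output_type}")
    # Assumption: a score equal to the threshold counts as a pass.
    return ScoredResult(score=score, passed=score >= threshold)
```

Whether the threshold comparison is inclusive is exactly the kind of detail that decides if a borderline row passes, so it is worth checking against the reference page rather than this sketch.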

2. New Troubleshooting/ subgroup (5 pages)

  • score-drift.mdx — "Eval scores changed unexpectedly": template version drift, model swap, judge non-determinism, mapping change, dataset edits.
  • judge-variance.mdx — "Judge output is inconsistent": subjective criteria, temperature, threshold-in-noise-band, ground-truth anchoring, when to switch to Code.
  • slow-runs.mdx — "Eval run is slow": model choice, sync vs async, Agent tool calls, long context, multimodal, rate limits; per-row latency expectation table.
  • mapping.mdx — "Dataset fields don't match the eval template": required-key mapping, media auto-detection on url/file/image/audio/video substrings, field-type mismatches.
  • ci-failures.mdx — "CI eval gate failed": auth, threshold flap, version pin drift, model swap, rate limits, GitHub Actions baseline.
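The media auto-detection mentioned for mapping.mdx can be pictured as a plain substring test. The sketch below assumes the check runs against the dataset column name and is case-insensitive; both are assumptions, and the function names are hypothetical. Only the five substrings come from the description above.

```python
# Substrings listed in the PR description for media auto-detection.
MEDIA_SUBSTRINGS = ("url", "file", "image", "audio", "video")


def looks_like_media_field(column_name: str) -> bool:
    """Return True if the column name contains any media substring.

    Assumption: matching is case-insensitive and applies to the column name.
    """
    name = column_name.lower()
    return any(token in name for token in MEDIA_SUBSTRINGS)


def media_columns(columns):
    """Return the columns that would be treated as media, in input order."""
    return [column for column in columns if looks_like_media_field(column)]
```

Under these assumptions a column such as profile_notes also matches, because "profile" contains "file". That kind of false positive is one plausible source of the field-type mismatches the page covers.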

3. New Run/ subgroup (replaces features/cicd.mdx)

  • run/in-the-ui.mdx — dashboard flow.
  • run/python-sdk.mdx — Python SDK using from fi.evals import evaluate.
  • run/typescript-sdk.mdx: @future-agi/ai-evaluation package.
  • run/api.mdx — REST endpoint POST /sdk/api/v1/new-eval/ with sync + async (is_async=true) patterns.
  • run/cicd.mdx — moved from features/cicd.mdx, updated.
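For the REST page, the following sketch only builds the request object and does not send it. The endpoint path and the is_async flag come from the description above; the base URL, the payload keys, the headers, and the choice to carry is_async in the JSON body (rather than the query string) are assumptions, so check run/api.mdx for the real shape.

```python
import json
import urllib.request

EVAL_PATH = "/sdk/api/v1/new-eval/"  # path as given in the PR description


def build_eval_request(base_url, payload, is_async=False, headers=None):
    """Build (but do not send) a POST request for the eval endpoint.

    Assumptions: is_async travels in the JSON body, and any auth headers
    are supplied by the caller via `headers`.
    """
    body = dict(payload)  # copy so the caller's dict is not mutated
    body["is_async"] = is_async
    request_headers = {"Content-Type": "application/json"}
    request_headers.update(headers or {})
    return urllib.request.Request(
        url=base_url.rstrip("/") + EVAL_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers=request_headers,
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would send it. With is_async=True, the response would be expected to carry an identifier to poll later rather than a finished result.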

4. New Build/ subgroup (moved from features/)

  • Renamed features/custom.mdx → build/custom.mdx (added UI + SDK tabs).
  • Renamed features/test-playground.mdx → build/test-playground.mdx.
  • Renamed features/ground-truth.mdx → build/ground-truth.mdx.
  • Renamed features/error-localization.mdx → build/error-localization.mdx.
  • Renamed features/mcp-connectors.mdx → build/mcp-connectors.mdx (uses 5 real product screenshots).

5. New Judge models/ subgroup

  • Renamed features/custom-models.mdx → judge-models/custom.mdx.
  • New judge-models/futureagi.mdx — Turing flash / small / large tiers.

6. New Evaluator catalog/ (catalog group under builtin/)

  • 8 new category pages: categories/rag.mdx, agent.mdx, safety.mdx, text.mdx, format.mdx, code.mdx, multimodal.mdx, audio.mdx. Each lists the evaluators in that category (alphabetical A-Z).
  • builtin/index.mdx rewritten as a full A-Z catalog table (alphabetical, all 130 visible UI evals).
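The A-Z ordering of the catalog table and category pages is easy to check mechanically. This is a minimal sketch that assumes case-insensitive ordering (the PR does not say how case is handled) and takes the row names as a plain list; the function name is hypothetical.

```python
def first_unsorted_pair(names):
    """Return the first adjacent (previous, current) pair that is out of order.

    Assumption: ordering is case-insensitive. Returns None if the list is sorted.
    """
    keys = [name.casefold() for name in names]
    for prev_key, cur_key, prev_name, cur_name in zip(keys, keys[1:], names, names[1:]):
        if prev_key > cur_key:
            return (prev_name, cur_name)
    return None
```

Returning the offending pair, rather than a boolean, points a reviewer straight at the row that needs to move.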

7. Concept-page rewrites

Rewrote 9 concept pages to add the playbook concept-page frontmatter (page_type, diataxis, concept_family, related_concepts/tasks, schema_type), "About" section, and ## Related concepts bullets:

  • concepts/eval-types.mdx, eval-templates.mdx, output-types.mdx, judge-models.mdx, eval-results.mdx, composite-evals.mdx, versioning.mdx, data-injection.mdx, mcp-connectors.mdx.
  • Deleted concepts/understanding-evaluation.mdx (replaced by the section overview).

8. Evaluation section landing

  • evaluation/index.mdx — rewritten as a playbook-style overview with the eval count fixed to 130+, removed the hero screenshot.

9. Quickstart

  • New src/pages/docs/quickstart/evals.mdx — first-eval-in-5-minutes flow.

10. Deletions (26 files)

23 stale eval leaf pages removed because they're hidden in the production UI (visible_ui: false in model_hub/system_evals/*.yaml), so the docs page had no in-product entry point:

  • answer-similarity, api-call, contain-evals, contains-all, contains-any, contains-none, content-moderation, content-safety-violation, custom-code-evaluation, deterministic-evals, ends-with, equals, factual-accuracy, instruction-adherence, is-compliant, is-factually-consistent, json-scheme-validation, length-between, length-greater-than, length-less-than, recall-score, regex, starts-with.
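A list like the one above could be reproduced by scanning the template YAMLs for the hidden flag. The sketch below uses only the standard library and a line-level regular expression instead of a YAML parser. It assumes visible_ui appears as a simple top-level key and that the file stem identifies the template; neither is confirmed here, and alias slugs mean a stem may not equal the docs slug.

```python
import re
from pathlib import Path

# Matches a line such as "visible_ui: false", optionally followed by a comment.
HIDDEN_FLAG = re.compile(
    r"^\s*visible_ui\s*:\s*false\s*(#.*)?$", re.IGNORECASE | re.MULTILINE
)


def hidden_templates(yaml_dir):
    """Return sorted stems of *.yaml files that declare visible_ui: false."""
    hidden = []
    for path in sorted(Path(yaml_dir).glob("*.yaml")):
        if HIDDEN_FLAG.search(path.read_text(encoding="utf-8")):
            hidden.append(path.stem)
    return hidden
```

Running it against model_hub/system_evals/ and comparing the output with the deleted slugs would show whether any hidden template still has a page, or any visible one lost its page.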

3 other deletions:

  • concepts/understanding-evaluation.mdx — superseded by section overview.
  • features/evaluate.mdx — content split between Run/ pages.
  • features/futureagi-models.mdx — moved to judge-models/futureagi.mdx.

11. Components and infra

  • src/components/docs/Mermaid.astro — restored from commit 0ad763d7 for diagrams used in concept pages.
  • src/plugins/vite-docs-transform.mjs — added Mermaid to auto-import map.
  • src/lib/navigation.ts — added Build/, Run/, Judge models/, Evaluator catalog/, Reference/, Troubleshooting/ subgroups under Evaluation.
  • src/lib/redirects.ts — added redirects for moved pages (features/cicd → run/cicd, features/custom-models → judge-models/custom, etc.) plus 8 alias slugs (bleu_score → bleu, etc.).

12. Assets

  • 5 new product screenshots under public/images/docs/evaluation/mcp-connectors/{1,2,3,4,5}.png (replacing 5 broken paths in the MCP connectors guide).

13. Cross-link updates

Updated outbound links in 12 files that pointed at moved/deleted pages:

  • cookbook/decrease-hallucination.mdx, cookbook/evaluation/eval-correction-loop.mdx, cookbook/evaluation/eval-with-mcp-connectors.mdx
  • dataset/features/experiments.mdx, dataset/features/run-prompt.mdx
  • 8 individual builtin/*.mdx pages (context-adherence, customer-agent-prompt-conformance, detect-hallucination, is-concise, is-email, is-helpful, llm-function-calling, task-completion)
  • faq.mdx, observe/features/evals.mdx, quickstart/running-evals-in-simulation.mdx, simulation/features/prompt-simulation.mdx

Code accuracy

All Python/JSON snippets in the new Reference and Troubleshooting pages were verified against the canonical SDK surface:

  • evaluate() function in src/pages/docs/sdk/evals/evaluate.mdx
  • Evaluator class in src/pages/docs/sdk/evals/cloud-evals.mdx
  • Async pattern in src/pages/docs/cookbook/quickstart/async-batch-eval.mdx
  • API response shape in src/pages/docs/evaluation/run/api.mdx

Test plan

  • npm run dev and walk Evaluation sidebar: Overview, Concepts, Build, Run, Judge models, Evaluator catalog, Reference, Troubleshooting.
  • Confirm Reference/ and Troubleshooting/ pages render (Mermaid, tables, code blocks).
  • Confirm 5 MCP connectors screenshots load on evaluation/build/mcp-connectors.
  • Spot-check redirects: /docs/evaluation/features/cicd → /docs/evaluation/run/cicd.
  • Spot-check evaluator catalog: A-Z table sort, category pages alphabetical.
  • Confirm none of the 23 deleted slugs are still linked anywhere.
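The last item in the test plan can be scripted. This sketch walks a docs tree and reports any line that still links to a deleted slug. The URL prefix /docs/evaluation/builtin/ is an assumption about where the deleted leaf pages lived, and the function name is hypothetical; pass a different prefix if the pages were served elsewhere.

```python
import re
from pathlib import Path


def find_stale_links(docs_root, deleted_slugs, prefix="/docs/evaluation/builtin/"):
    """Return (file, line_number, slug) for each link to a deleted page.

    Assumption: deleted pages were served at prefix + slug.
    """
    # Longest slugs first so a short slug never shadows a longer one.
    ordered = sorted(deleted_slugs, key=len, reverse=True)
    slug_group = "|".join(re.escape(slug) for slug in ordered)
    # The lookahead stops "regex" from matching inside "regex-match".
    pattern = re.compile(re.escape(prefix) + rf"({slug_group})(?![\w-])")
    hits = []
    for path in sorted(Path(docs_root).rglob("*.mdx")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            for match in pattern.finditer(line):
                hits.append((str(path), lineno, match.group(1)))
    return hits
```

An empty result over src/pages/docs would support the "no dead links" claim; redirects.ts is not covered because only .mdx files are scanned.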

Suhani Nagpal added 4 commits May 21, 2026 16:38
Fix the playbook-flagged P0 issues before the larger IA/concept rewrite:

- Replace stale "70+ built-in templates" with verified "130+" count
  (129 visible_ui:true templates in model_hub/system_evals/, 153 total
  including 24 hidden internal templates).
- Rename "Future AGI" -> "FutureAGI" across eval pages to match playbook
  18 vocabulary rules (scoped to eval section only; rest of docs repo
  still uses "Future AGI" per STYLE-GUIDE.md and will be unified later).
- Remove stale {/* SCREENSHOT NEEDED ... */} placeholder from
  data-injection.mdx (playbook anti-pattern 27).
- Normalize Title Case H2 headings to sentence case per playbook 18
  (preserves UI labels: Restore Version, Test vs Save, LLM-As-A-Judge).

pnpm audit-links: 0 broken nav links, 0 broken content links.

Restructures /docs/evaluation/ to match internal-docs/product-docs-playbook
recommendations and to make every documented eval correspond to a real
UI-visible template.

## New IA

  Evaluation
  ├── Overview (rewritten, absorbs understanding-evaluation.mdx)
  ├── Quickstart (new — /docs/quickstart/evals.mdx, SDK-first)
  ├── Concepts/ (9 retrofitted pages)
  ├── Run evals/ (4 new how-tos split from evaluate.mdx + cicd.mdx moved in)
  ├── Build evals/ (5 pages moved from features/)
  ├── Judge models/ (2 pages moved from features/)
  └── Evaluator catalog/ (new builtin/categories/ with 8 catalog pages)

## Key changes

- Split features/evaluate.mdx into 4 task-shaped pages under run/:
  in-the-ui, python-sdk, typescript-sdk, api. Each uses the canonical
  fi.evals.evaluate() function and ai-evaluation package, replacing the
  stale Evaluator-class pattern.
- Moved 8 feature pages into build/ and judge-models/ via git mv to
  preserve history. Updated cross-section links accordingly.
- Rewrote evaluation/index.mdx as a true overview with Mermaid lifecycle
  diagram, "Where to start" cards, and intent-driven Next Steps.
- Retrofitted 9 concept pages to playbook 03 anatomy: added Mermaid
  diagrams, "What it isn't" boundary sections, and concept-page
  frontmatter (page_type, diataxis, primary_question, direct_answer,
  has_diagram, related_concepts, etc.).
- Added build/custom.mdx UI/SDK tab structure and judge-models pages
  with corrected SDK examples.
- Created 8 evaluator catalog category pages (RAG, Agent, Safety, Text,
  Format, Code, Multimodal, Audio) generated from system_evals YAMLs.
  Each row sorted alphabetically by template name.
- Rewrote builtin/index.mdx as a catalog hub: 8-card category grid +
  A-Z table trimmed to the 129 UI-visible templates (was 152). The 23
  hidden/orphan rows are unlinked from the catalog; their leaf files
  remain on disk for direct-URL access.
- Restored src/components/docs/Mermaid.astro (was in commit 0ad763d
  but missing on PR #648's base) and registered it in the auto-import
  map. Converted ```mermaid fences to <Mermaid code={...} />.
- Cross-section: fixed inbound links from faq, dataset/features,
  simulation/features, cookbooks, redirects.ts to use the new
  build/, run/, judge-models/ paths.

## Conventions enforced section-wide

- Heading: ## Related concepts on concept pages (playbook 03),
  ## Next steps on everything else.
- Bullet style: - [Link](url): short description.
- Sentence case below H1; ban-list still clear (no powerful, seamless,
  simply, etc.).
- 0 em-dashes across new content.
- 0 unsupported icon names (mapped to the Card component's iconPaths).
- 0 stale "Future AGI" (with space) in the eval section.
- 0 stale "Evaluator class with eval_templates=/inputs=/model_name="
  pattern in the run/, quickstart/, judge-models/ pages.
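Several of the conventions above are mechanical enough to lint. The sketch below checks one line of text for the three banned words the commit names, for em-dashes, and for the spaced brand name. The full ban-list is not given here, so BANNED_WORDS is deliberately incomplete, and the function name is hypothetical.

```python
import re

# Only the three examples named in the commit message; the real list is longer.
BANNED_WORDS = ("powerful", "seamless", "simply")
EM_DASH = "\u2014"
STALE_BRAND = "Future AGI"


def lint_line(line):
    """Return a list of convention problems found in one line of text."""
    problems = []
    for word in BANNED_WORDS:
        # Whole-word match, so "seamlessly" is not flagged by this sketch.
        if re.search(rf"\b{word}\b", line, re.IGNORECASE):
            problems.append(f"banned word: {word}")
    if EM_DASH in line:
        problems.append("em-dash")
    if STALE_BRAND in line:
        problems.append('stale brand name "Future AGI"')
    return problems
```

Scoping matters when wiring this into CI: the commit limits the brand rename to the eval section, so the brand check should only run on files under that path.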

## Verification

- pnpm build: 714 pages, no errors.
- pnpm audit-links: 0 broken nav, 0 broken content links.
- All 129 A-Z table rows link to an existing leaf page.
- All 13 alias-slug templates (bleu_score → bleu, ASR/STT_accuracy →
  audio-transcription, etc.) linked correctly across category pages.

## Out of scope (Tier 2 — follow-up)

- Reference subsection (eval result schema, evaluator input schema,
  score types).
- Troubleshooting subsection (5 symptom-driven pages).
- Stale Evaluator-class pattern in the 153 individual builtin leaf
  pages (PR #648 authored, separate cleanup pass).
- Deletion of the 23 unlinked stale leaves (user said: separate commit).
- Repo-wide "Future AGI" → "FutureAGI" rename (out of scope).
… screenshots

Two follow-ups to the IA reshape commit:

- Delete 23 leaf pages whose source YAMLs have visible_ui:false (i.e.
  templates not shown in the actual eval picker UI). All inbound links
  were redirected in the previous commit, so removing the pages
  doesn't introduce dead links. Reduces orphan-page count from 161 to
  138 and brings the documented eval set into parity with what users
  actually see in the dashboard.

  Removed:
    answer-similarity, api-call, contain-evals, contains-all,
    contains-any, contains-none, content-moderation,
    content-safety-violation, custom-code-evaluation, deterministic-evals,
    ends-with, equals, factual-accuracy, instruction-adherence,
    is-compliant, is-factually-consistent, json-scheme-validation,
    length-between, length-greater-than, length-less-than, recall-score,
    regex, starts-with

- Replace the 5 broken /screenshot/product/evaluation/mcp-connectors/N.png
  references on build/mcp-connectors.mdx with real captures saved at
  /images/docs/evaluation/mcp-connectors/N.png. Drops the 5 stale
  <Note>Image placeholder</Note> blocks.

Verification:
- pnpm build: 693 pages, 0 errors.
- pnpm audit-links: 0 broken nav, 0 broken content links.

Adds 3 reference pages (eval result schema, evaluator input schema, score
types) and 5 troubleshooting pages (score drift, judge variance, slow
runs, dataset mapping, CI gate failures). Code snippets verified
against the canonical fi.evals API surface.
@entelligence-ai-pr-reviews

Automatic Review Skipped

Too many files for automatic review.

If you would still like a review, you can trigger one manually by commenting:

@entelligence review

@SuhaniNagpal7 SuhaniNagpal7 requested a review from nik13 May 22, 2026 06:11
@SuhaniNagpal7 SuhaniNagpal7 force-pushed the docs/evals-playbook-revamp branch from 2c329cf to 6cb487c Compare May 22, 2026 06:14
SuhaniNagpal7 pushed a commit that referenced this pull request May 25, 2026
…enshots

Restructures the Imagine page to match internal-docs/product-docs-playbook
feature-deep-dive template and updates all 4 screenshots to dark theme.

Page changes
- Add When to use / When not to use sections.
- Add How it works internally with a Mermaid data-flow diagram (build
  time -> canvas -> saved view -> live re-bind vs. dynamic re-run).
- Add Troubleshooting table (6 Symptom/Cause/Fix rows) for the common
  fragile paths: Save View disabled, prose instead of widgets, stuck
  skeleton loader, empty widgets on a new trace, rate limit, wrong field.
- Replace product-noun Next Steps cards with intent-driven Related links.
- Expand frontmatter to the master spec (page_type, products,
  feature_status, audience, difficulty, owner, reviewers, last_tested,
  last_screenshotted, schema_type, seo, geo, canonical, related).
- Reframe "dashboard"/"layout" as "view"/"collection" and replace "LLM"
  with "agent" where the surface is Falcon's agent (addresses reviewer
  comments from upstream PR #641).
- Correct dynamic-analysis cache key to (saved_view, widget, trace_id)
  and saved-view scope to project-or-workspace per the SavedView and
  ImagineAnalysis models in core-backend.
- Fix two broken internal links: /docs/observe/dashboards ->
  /docs/observe/features/dashboard, and /docs/observe/tracing ->
  /docs/observe.

Screenshots
- 10/11/12/13.png recaptured in dark theme; 12.png Save View button
  repainted (white background + black text) to compensate for a UI
  contrast bug.

Verified against codebase
- 17 widget types (VALID_WIDGET_TYPES in render_widget.py,
  WIDGET_REGISTRY in imagine/widgets/index.js).
- 12-column grid, Temporal timeouts (30s/90s/10s) and retry counts,
  10/60s chat rate limit, 45s dynamic-analysis endpoint, skeleton-loader
  text "Falcon is analyzing this trace...".
- Suggested prompt chip wording matches SuggestedPrompts.jsx exactly
  (product copy is what it is for now; treating chip text as a separate
  product-copy change rather than a doc edit).

Depends on
- PR #661 (docs(evals): IA reshape...) for src/components/docs/Mermaid.astro.
  This PR's Mermaid import will only resolve once #661 lands.
Contributor

@nik13 nik13 left a comment

Needs more work. I'll take a call on whether to merge it or pick it up in the docs revamp. On hold for now.
