docs(evals): IA reshape + Tier 2 reference and troubleshooting pages#661
Open
SuhaniNagpal7 wants to merge 4 commits into
Conversation
added 4 commits
May 21, 2026 16:38
Fix the playbook-flagged P0 issues before the larger IA/concept rewrite:
- Replace stale "70+ built-in templates" with verified "130+" count
(129 visible_ui:true templates in model_hub/system_evals/, 153 total
including 24 hidden internal templates).
- Rename "Future AGI" -> "FutureAGI" across eval pages to match playbook
18 vocabulary rules (scoped to eval section only; rest of docs repo
still uses "Future AGI" per STYLE-GUIDE.md and will be unified later).
- Remove stale {/* SCREENSHOT NEEDED ... */} placeholder from
data-injection.mdx (playbook anti-pattern 27).
- Normalize Title Case H2 headings to sentence case per playbook 18
(preserves UI labels: Restore Version, Test vs Save, LLM-As-A-Judge).
pnpm audit-links: 0 broken nav links, 0 broken content links.
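The 129 and 153 counts above depend on the visible_ui flag in the template YAMLs. A minimal sketch of how such a count could be reproduced, assuming each template is a single .yaml file in model_hub/system_evals/ with a top-level visible_ui line. The helper name `count_templates` and the line-based parsing are illustrative assumptions, not the project's tooling:

```python
import re
from pathlib import Path

# Assumption: the flag appears as a top-level "visible_ui: true|false" line.
_FLAG = re.compile(r"^visible_ui:\s*(true|false)\s*$", re.IGNORECASE | re.MULTILINE)


def count_templates(directory):
    """Return (visible, hidden, total) counts for *.yaml files in directory.

    Files without a visible_ui line are counted as hidden (an assumption).
    """
    visible = hidden = 0
    for path in sorted(Path(directory).glob("*.yaml")):
        match = _FLAG.search(path.read_text(encoding="utf-8"))
        if match and match.group(1).lower() == "true":
            visible += 1
        else:
            hidden += 1
    return visible, hidden, visible + hidden
```

Calling `count_templates("model_hub/system_evals")` from the repo root would return the three counts as a tuple.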
Restructures /docs/evaluation/ to match internal-docs/product-docs-playbook recommendations and to make every documented eval correspond to a real UI-visible template.

## New IA

Evaluation
├── Overview (rewritten, absorbs understanding-evaluation.mdx)
├── Quickstart (new: /docs/quickstart/evals.mdx, SDK-first)
├── Concepts/ (9 retrofitted pages)
├── Run evals/ (4 new how-tos split from evaluate.mdx + cicd.mdx moved in)
├── Build evals/ (5 pages moved from features/)
├── Judge models/ (2 pages moved from features/)
└── Evaluator catalog/ (new builtin/categories/ with 8 catalog pages)

## Key changes

- Split features/evaluate.mdx into 4 task-shaped pages under run/: in-the-ui, python-sdk, typescript-sdk, api. Each uses the canonical fi.evals.evaluate() function and ai-evaluation package, replacing the stale Evaluator-class pattern.
- Moved 8 feature pages into build/ and judge-models/ via git mv to preserve history. Updated cross-section links accordingly.
- Rewrote evaluation/index.mdx as a true overview with Mermaid lifecycle diagram, "Where to start" cards, and intent-driven Next Steps.
- Retrofitted 9 concept pages to playbook 03 anatomy: added Mermaid diagrams, "What it isn't" boundary sections, and concept-page frontmatter (page_type, diataxis, primary_question, direct_answer, has_diagram, related_concepts, etc.).
- Added build/custom.mdx UI/SDK tab structure and judge-models pages with corrected SDK examples.
- Created 8 evaluator catalog category pages (RAG, Agent, Safety, Text, Format, Code, Multimodal, Audio) generated from system_evals YAMLs. Each row sorted alphabetically by template name.
- Rewrote builtin/index.mdx as a catalog hub: 8-card category grid + A-Z table trimmed to the 129 UI-visible templates (was 152). The 23 hidden/orphan rows are unlinked from the catalog; their leaf files remain on disk for direct-URL access.
- Restored src/components/docs/Mermaid.astro (was in commit 0ad763d but missing on PR #648's base) and registered it in the auto-import map. Converted ```mermaid fences to <Mermaid code={...} />.
- Cross-section: fixed inbound links from faq, dataset/features, simulation/features, cookbooks, redirects.ts to use the new build/, run/, judge-models/ paths.

## Conventions enforced section-wide

- Heading: ## Related concepts on concept pages (playbook 03), ## Next steps on everything else.
- Bullet style: - [Link](url): short description.
- Sentence case below H1; ban-list still clear (no powerful, seamless, simply, etc.).
- 0 em-dashes across new content.
- 0 unsupported icon names (mapped to the Card component's iconPaths).
- 0 stale "Future AGI" (with space) in the eval section.
- 0 stale "Evaluator class with eval_templates=/inputs=/model_name=" pattern in the run/, quickstart/, judge-models/ pages.

## Verification

- pnpm build: 714 pages, no errors.
- pnpm audit-links: 0 broken nav, 0 broken content links.
- All 129 A-Z table rows link to an existing leaf page.
- All 13 alias-slug templates (bleu_score → bleu, ASR/STT_accuracy → audio-transcription, etc.) linked correctly across category pages.

## Out of scope (Tier 2, follow-up)

- Reference subsection (eval result schema, evaluator input schema, score types).
- Troubleshooting subsection (5 symptom-driven pages).
- Stale Evaluator-class pattern in the 153 individual builtin leaf pages (PR #648 authored, separate cleanup pass).
- Deletion of the 23 unlinked stale leaves (user said: separate commit).
- Repo-wide "Future AGI" → "FutureAGI" rename (out of scope).
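The check that every A-Z table row links to an existing leaf page, including alias slugs, can be sketched roughly as follows. This is an illustration only: it assumes leaf pages are `<slug>.mdx` files in one directory and that unaliased template names map to slugs by swapping underscores for hyphens. `resolve_slug`, `missing_leaves`, and the one-entry alias map are hypothetical, not the repo's audit script.

```python
from pathlib import Path

# Hypothetical alias map; only bleu_score -> bleu is taken from the commit message.
ALIASES = {"bleu_score": "bleu"}


def resolve_slug(template_name, aliases=ALIASES):
    """Map a template name to its leaf-page slug, applying aliases first.

    Assumption: unaliased names become slugs by replacing "_" with "-".
    """
    return aliases.get(template_name, template_name.replace("_", "-"))


def missing_leaves(template_names, leaf_dir, aliases=ALIASES):
    """Return template names whose resolved leaf page does not exist on disk."""
    leaf_dir = Path(leaf_dir)
    return [
        name
        for name in template_names
        if not (leaf_dir / f"{resolve_slug(name, aliases)}.mdx").is_file()
    ]
```

An empty list from `missing_leaves` would correspond to the "all rows link to an existing leaf page" result reported above.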
… screenshots
Two follow-ups to the IA reshape commit:
- Delete 23 leaf pages whose source YAMLs have visible_ui:false (i.e.
templates not shown in the actual eval picker UI). All inbound links
were redirected in the previous commit, so removing the pages
doesn't introduce dead links. Reduces orphan-page count from 161 to
138 and brings the documented eval set into parity with what users
actually see in the dashboard.
Removed:
answer-similarity, api-call, contain-evals, contains-all,
contains-any, contains-none, content-moderation,
content-safety-violation, custom-code-evaluation, deterministic-evals,
ends-with, equals, factual-accuracy, instruction-adherence,
is-compliant, is-factually-consistent, json-scheme-validation,
length-between, length-greater-than, length-less-than, recall-score,
regex, starts-with
- Replace the 5 broken /screenshot/product/evaluation/mcp-connectors/N.png
references on build/mcp-connectors.mdx with real captures saved at
/images/docs/evaluation/mcp-connectors/N.png. Drops the 5 stale
<Note>Image placeholder</Note> blocks.
Verification:
- pnpm build: 693 pages, 0 errors.
- pnpm audit-links: 0 broken nav, 0 broken content links.
3 reference pages (eval result schema, evaluator input schema, score types) and 5 troubleshooting pages (score drift, judge variance, slow runs, dataset mapping, CI gate failures). Code snippets verified against the canonical fi.evals API surface.
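One of the troubleshooting topics listed here, judge variance, covers a pass threshold that sits inside the judge's noise band. As a rough illustration (not taken from the new pages), that condition can be checked by scoring the same row several times and comparing the gap between the mean score and the threshold to the observed spread. The function name `threshold_in_noise_band` and the factor `k` are assumptions for this sketch.

```python
from statistics import mean, stdev


def threshold_in_noise_band(scores, threshold, k=2.0):
    """Return True when the pass threshold lies within k sample standard
    deviations of the mean of repeated scores for the same input.

    scores: repeated judge scores for one row (at least two values).
    """
    if len(scores) < 2:
        raise ValueError("need at least two repeated scores")
    spread = stdev(scores)  # sample standard deviation of the repeated runs
    return abs(mean(scores) - threshold) <= k * spread
```

When the function returns True, pass/fail flips between runs are plausible from noise alone, which is the situation where a deterministic Code eval or a wider margin may be worth considering.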
Automatic Review Skipped: Too many files for automatic review. If you would still like a review, you can trigger one manually by commenting:
2c329cf to
6cb487c
SuhaniNagpal7
pushed a commit
that referenced
this pull request
May 25, 2026
…enshots

Restructures the Imagine page to match internal-docs/product-docs-playbook feature-deep-dive template and updates all 4 screenshots to dark theme.

Page changes
- Add When to use / When not to use sections.
- Add How it works internally with a Mermaid data-flow diagram (build time -> canvas -> saved view -> live re-bind vs. dynamic re-run).
- Add Troubleshooting table (6 Symptom/Cause/Fix rows) for the common fragile paths: Save View disabled, prose instead of widgets, stuck skeleton loader, empty widgets on a new trace, rate limit, wrong field.
- Replace product-noun Next Steps cards with intent-driven Related links.
- Expand frontmatter to the master spec (page_type, products, feature_status, audience, difficulty, owner, reviewers, last_tested, last_screenshotted, schema_type, seo, geo, canonical, related).
- Reframe "dashboard"/"layout" as "view"/"collection" and replace "LLM" with "agent" where the surface is Falcon's agent (addresses reviewer comments from upstream PR #641).
- Correct dynamic-analysis cache key to (saved_view, widget, trace_id) and saved-view scope to project-or-workspace per the SavedView and ImagineAnalysis models in core-backend.
- Fix two broken internal links: /docs/observe/dashboards -> /docs/observe/features/dashboard, and /docs/observe/tracing -> /docs/observe.

Screenshots
- 10/11/12/13.png recaptured in dark theme; 12.png Save View button repainted (white background + black text) to compensate for a UI contrast bug.

Verified against codebase
- 17 widget types (VALID_WIDGET_TYPES in render_widget.py, WIDGET_REGISTRY in imagine/widgets/index.js).
- 12-column grid, Temporal timeouts (30s/90s/10s) and retry counts, 10/60s chat rate limit, 45s dynamic-analysis endpoint, skeleton-loader text "Falcon is analyzing this trace...".
- Suggested prompt chip wording matches SuggestedPrompts.jsx exactly (product copy is what it is for now; treating chip text as a separate product-copy change rather than a doc edit).

Depends on
- PR #661 (docs(evals): IA reshape...) for src/components/docs/Mermaid.astro. This PR's Mermaid import will only resolve once #661 lands.
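The corrected dynamic-analysis cache key, (saved_view, widget, trace_id), can be illustrated with a small in-memory cache. This is a sketch only: the class name `AnalysisCache` and the `compute` callback are hypothetical, and the ImagineAnalysis model in core-backend may store results differently.

```python
class AnalysisCache:
    """In-memory cache keyed by (saved_view, widget, trace_id)."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, saved_view, widget, trace_id, compute):
        """Return the cached analysis for this key, computing it on a miss."""
        key = (saved_view, widget, trace_id)
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]
```

Because trace_id is part of the key, opening the same saved view and widget on a new trace produces a cache miss and a fresh analysis instead of reusing the result from another trace.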
nik13
reviewed
May 27, 2026
Contributor
nik13
left a comment
Needs more work. I'll decide whether to merge it or pick it up in the docs revamp. On hold for now.
## Summary

Reshapes the `/docs/evaluation/` section to the playbook IA and adds the Tier 2 reference + troubleshooting subgroups. 92 files touched: 28 added, 26 deleted, 31 modified, 7 renamed.

## What changed

### 1. New `Reference/` subgroup (3 pages)

- `src/pages/docs/evaluation/reference/result-schema.mdx`: eval result fields (`output`, `score`, `passed`, `reason`, `runtime`/`latency_ms`, `model`, `eval_id`, `eval_name`), JSON shapes per output type, async retrieval pattern via `Evaluator.evaluate(is_async=True)` + `evaluator.get_eval_result(eval_id)`.
- `src/pages/docs/evaluation/reference/input-schema.mdx`: canonical input keys (`input`, `output`, `context`, `expected`, `reference`, `hypothesis`, `conversation`, etc.), mapping mechanism per surface (Dataset / Trace / Simulation / Test playground / SDK).
- `src/pages/docs/evaluation/reference/score-types.mdx`: strict mapping of `pass_fail` / `percentage` / `deterministic` output types to numeric scores and pass derivation; Code eval return-value rules; composite aggregation functions; pass threshold semantics.

### 2. New `Troubleshooting/` subgroup (5 pages)

- `score-drift.mdx` ("Eval scores changed unexpectedly"): template version drift, model swap, judge non-determinism, mapping change, dataset edits.
- `judge-variance.mdx` ("Judge output is inconsistent"): subjective criteria, temperature, threshold-in-noise-band, ground-truth anchoring, when to switch to Code.
- `slow-runs.mdx` ("Eval run is slow"): model choice, sync vs async, Agent tool calls, long context, multimodal, rate limits; per-row latency expectation table.
- `mapping.mdx` ("Dataset fields don't match the eval template"): required-key mapping, media auto-detection on `url`/`file`/`image`/`audio`/`video` substrings, field-type mismatches.
- `ci-failures.mdx` ("CI eval gate failed"): auth, threshold flap, version pin drift, model swap, rate limits, GitHub Actions baseline.

### 3. New `Run/` subgroup (replaces `features/cicd.mdx`)

- `run/in-the-ui.mdx`: dashboard flow.
- `run/python-sdk.mdx`: Python SDK using `from fi.evals import evaluate`.
- `run/typescript-sdk.mdx`: `@future-agi/ai-evaluation` package.
- `run/api.mdx`: REST endpoint `POST /sdk/api/v1/new-eval/` with sync + async (`is_async=true`) patterns.
- `run/cicd.mdx`: moved from `features/cicd.mdx`, updated.

### 4. New `Build/` subgroup (moved from `features/`)

- `features/custom.mdx` → `build/custom.mdx` (added UI + SDK tabs).
- `features/test-playground.mdx` → `build/test-playground.mdx`.
- `features/ground-truth.mdx` → `build/ground-truth.mdx`.
- `features/error-localization.mdx` → `build/error-localization.mdx`.
- `features/mcp-connectors.mdx` → `build/mcp-connectors.mdx` (uses 5 real product screenshots).

### 5. New `Judge models/` subgroup

- `features/custom-models.mdx` → `judge-models/custom.mdx`.
- `judge-models/futureagi.mdx`: Turing flash / small / large tiers.

### 6. New `Evaluator catalog/` (catalog group under `builtin/`)

- `categories/rag.mdx`, `agent.mdx`, `safety.mdx`, `text.mdx`, `format.mdx`, `code.mdx`, `multimodal.mdx`, `audio.mdx`. Each lists the evaluators in that category (alphabetical A-Z).
- `builtin/index.mdx` rewritten as a full A-Z catalog table (alphabetical, all 130 visible UI evals).

### 7. Concept-page rewrites

Rewrote 9 concept pages to add the playbook concept-page frontmatter (page_type, diataxis, concept_family, related_concepts/tasks, schema_type), "About" section, and `## Related concepts` bullets:

- `concepts/eval-types.mdx`, `eval-templates.mdx`, `output-types.mdx`, `judge-models.mdx`, `eval-results.mdx`, `composite-evals.mdx`, `versioning.mdx`, `data-injection.mdx`, `mcp-connectors.mdx`.
- `concepts/understanding-evaluation.mdx` (replaced by the section overview).

### 8. Evaluation section landing

- `evaluation/index.mdx`: rewritten as a playbook-style overview with the eval count fixed to 130+, removed the hero screenshot.

### 9. Quickstart

- `src/pages/docs/quickstart/evals.mdx`: first-eval-in-5-minutes flow.

### 10. Deletions (26 files)

23 stale eval leaf pages removed because they're hidden in the production UI (`visible_ui: false` in `model_hub/system_evals/*.yaml`), so the docs page had no in-product entry point: `answer-similarity`, `api-call`, `contain-evals`, `contains-all`, `contains-any`, `contains-none`, `content-moderation`, `content-safety-violation`, `custom-code-evaluation`, `deterministic-evals`, `ends-with`, `equals`, `factual-accuracy`, `instruction-adherence`, `is-compliant`, `is-factually-consistent`, `json-scheme-validation`, `length-between`, `length-greater-than`, `length-less-than`, `recall-score`, `regex`, `starts-with`.

3 other deletions:

- `concepts/understanding-evaluation.mdx`: superseded by section overview.
- `features/evaluate.mdx`: content split between Run/ pages.
- `features/futureagi-models.mdx`: moved to `judge-models/futureagi.mdx`.

### 11. Components and infra

- `src/components/docs/Mermaid.astro`: restored from commit `0ad763d7` for diagrams used in concept pages.
- `src/plugins/vite-docs-transform.mjs`: added Mermaid to auto-import map.
- `src/lib/navigation.ts`: added Build/, Run/, Judge models/, Evaluator catalog/, Reference/, Troubleshooting/ subgroups under Evaluation.
- `src/lib/redirects.ts`: added redirects for moved pages (`features/cicd` → `run/cicd`, `features/custom-models` → `judge-models/custom`, etc.) plus 8 alias slugs (`bleu_score` → `bleu`, etc.).

### 12. Assets

- `public/images/docs/evaluation/mcp-connectors/{1,2,3,4,5}.png` (replacing 5 broken paths in the MCP connectors guide).

### 13. Cross-link updates

Updated outbound links in 12 files that pointed at moved/deleted pages:

- `cookbook/decrease-hallucination.mdx`, `cookbook/evaluation/eval-correction-loop.mdx`, `cookbook/evaluation/eval-with-mcp-connectors.mdx`
- `dataset/features/experiments.mdx`, `dataset/features/run-prompt.mdx`
- `builtin/*.mdx` pages (context-adherence, customer-agent-prompt-conformance, detect-hallucination, is-concise, is-email, is-helpful, llm-function-calling, task-completion)
- `faq.mdx`, `observe/features/evals.mdx`, `quickstart/running-evals-in-simulation.mdx`, `simulation/features/prompt-simulation.mdx`

## Code accuracy

All Python/JSON snippets in the new Reference and Troubleshooting pages were verified against the canonical SDK surface:

- `evaluate()` function in `src/pages/docs/sdk/evals/evaluate.mdx`
- `Evaluator` class in `src/pages/docs/sdk/evals/cloud-evals.mdx`
- `src/pages/docs/cookbook/quickstart/async-batch-eval.mdx`
- `src/pages/docs/evaluation/run/api.mdx`

## Test plan

- `npm run dev` and walk Evaluation sidebar: Overview, Concepts, Build, Run, Judge models, Evaluator catalog, Reference, Troubleshooting.
- `evaluation/build/mcp-connectors`.
- `/docs/evaluation/features/cicd` → `/docs/evaluation/run/cicd`.
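
The redirect item in the test plan could also be checked without a browser. A minimal sketch, assuming redirects are available as a flat source-to-target mapping; the dict below is a hand-written stand-in for `src/lib/redirects.ts`, and `resolve_redirect` is a hypothetical helper rather than part of the repo:

```python
# Hand-written stand-in for the redirect table; only this entry is from the PR.
REDIRECTS = {
    "/docs/evaluation/features/cicd": "/docs/evaluation/run/cicd",
}


def resolve_redirect(path, redirects=REDIRECTS, max_hops=10):
    """Follow redirects from path until a non-redirected path is reached.

    Raises ValueError on a redirect loop or when max_hops is exceeded.
    """
    seen = {path}
    for _ in range(max_hops):
        target = redirects.get(path)
        if target is None:
            return path  # no further redirect: this is the final destination
        if target in seen:
            raise ValueError(f"redirect loop at {target}")
        seen.add(target)
        path = target
    raise ValueError("too many redirect hops")
```

Following chains rather than single hops also surfaces loops or long chains that could appear as more moved pages are added to the table.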