docs(emergent): add forge-observability section covering 5-utility telemetry API

jddunn · jddunn · commit 45ef25ec0f4c · 2026-04-22T17:45:53.000-07:00
diff --git a/docs/architecture/EMERGENT_CAPABILITIES.md b/docs/architecture/EMERGENT_CAPABILITIES.md
@@ -315,6 +315,110 @@ session ──(5+ uses, >0.8 confidence, panel approved)──→ agent ──(h
 | **Agent** | Persisted for the creating agent | Survives restarts | 5+ uses, confidence > 0.8, two-reviewer panel |
 | **Shared** | All agents in the runtime | Permanent until demoted | Human approval required (HITL gate) |
 
+## Forge Observability
+
+The forge pipeline ships with a five-utility observability layer under [`@framers/agentos/emergent`](/api/modules#emergent) so any consumer can see live forge health without re-implementing the instrumentation. Each utility is standalone, pure, and composes with whatever telemetry the host already has.
+
+```
+ forge_tool invocation
+        │
+        ▼
+ ┌────────────────────────────────────────┐
+ │  wrapForgeTool                         │
+ │  · JSON-parse stringified schemas      │
+ │  · normalize mode synonyms             │
+ │  · backstop required fields            │
+ │  · scope-tag every attempt             │
+ └────────────┬───────────────────────────┘
+              │
+              ▼
+ ┌────────────────────────────────────────┐
+ │  inferSchemaFromTestCases              │
+ │  · synthesize inputSchema.properties   │
+ │    from testCase inputs when missing   │
+ │  · same for outputSchema               │
+ └────────────┬───────────────────────────┘
+              │
+              ▼
+ ┌────────────────────────────────────────┐
+ │  validateForgeShape (pre-judge)        │
+ │  · empty schema properties → reject    │
+ │  · <2 testCases → reject               │
+ │  · empty-input testCases → reject      │
+ │  rejection short-circuits the judge    │
+ └────────────┬───────────────────────────┘
+              │
+              ▼
+     EmergentJudge (unchanged)
+              │
+              ▼
+ ┌────────────────────────────────────────┐
+ │  capture callback (one per attempt)    │
+ │    ForgeStatsAggregator.recordAttempt  │
+ │      · uniqueNames / uniqueApproved    │
+ │      · uniqueTerminalRejections        │
+ │      · classifyForgeRejection          │
+ │        → rejectionReasons histogram    │
+ └────────────┬───────────────────────────┘
+              │
+              ▼
+      snapshot()  →  host telemetry
+```
+
+### API surface
+
+| Utility | Kind | Purpose |
+|---|---|---|
+| [`wrapForgeTool`](/api/functions/wrapForgeTool) | wrapper (`ForgeToolMetaTool → ITool`) | Normalizes messy LLM forge args, runs pre-judge shape check, captures every attempt to the caller's sink regardless of outcome. Takes an optional `scope` label and `log` event callback so consumers can group attempts (e.g., `dept: 'medical'`) and render lifecycle events to stdout / pm2 / structured logs without the wrapper owning any console dependency. |
+| [`validateForgeShape`](/api/functions/validateForgeShape) | pure function (`ForgeShapeRequest → string[]`) | Catches the three failure modes that dominate cheap-tier rejections before the judge LLM runs: empty schema properties, fewer than 2 testCases, empty-input testCases. Every shape-check rejection saves one judge invocation plus the sandbox round-trip that would have followed it. |
+| [`inferSchemaFromTestCases`](/api/functions/inferSchemaFromTestCases) | pure function (in-place mutation) | Synthesizes `inputSchema.properties` / `outputSchema.properties` from concrete testCase values when the LLM forgot to declare them. Rescues the "examples without formalization" failure mode without relaxing schema discipline. Unions fields across every testCase so a single incomplete case does not narrow the inferred schema. |
+| [`classifyForgeRejection`](/api/functions/classifyForgeRejection) | pure function (`string → ForgeRejectionCategory`) | Bins rejection-reason text into six categories: `schema_extra_field`, `shape_check`, `syntax_error`, `parse_error`, `judge_correctness`, `other`. Order matters: `schema_extra_field` wins over `judge_correctness` because it is the more specific and more actionable signal. A growing `other` bucket is the signal to read raw reasons and extend the pattern set. |
+| [`ForgeStatsAggregator`](/api/classes/ForgeStatsAggregator) | class | Per-run rollup: `attempts`, `approved`, `rejected`, `approvedConfidenceSum`, `uniqueNames`, `uniqueApproved`, `uniqueTerminalRejections`, and the `rejectionReasons` histogram. `uniqueApproved` vs `uniqueTerminalRejections` is the real quality signal: unique-tool approval rate, not attempt-level approval rate. Shape pinned — extend by adding fields, never rename existing ones. |
+
+### Composed wiring
+
+```typescript
+import {
+  EmergentCapabilityEngine, ForgeToolMetaTool,
+  wrapForgeTool, ForgeStatsAggregator,
+} from '@framers/agentos/emergent';
+
+const engine = new EmergentCapabilityEngine({ /* ... */ });
+const forgeTool = new ForgeToolMetaTool(engine);
+
+const stats = new ForgeStatsAggregator();
+
+const wrapped = wrapForgeTool({
+  raw: forgeTool,
+  agentId: 'agent-1',
+  sessionId: 'session-1',
+  scope: 'medical',  // optional; propagated onto every CapturedForge
+  capture: record => stats.recordAttempt(
+    record.approved, record.confidence, record.name, record.errorReason,
+  ),
+  log: event => {
+    // event: { kind: 'start' | 'approved' | 'rejected' | 'error', toolName, ... }
+    // Optional; omit for quiet mode.
+  },
+});
+
+// Expose `wrapped` to the agent. After the run:
+const snapshot = stats.snapshot();
+// → { attempts, approved, rejected, uniqueApproved, uniqueTerminalRejections, rejectionReasons, ... }
+```
+
+### Interpreting the histogram
+
+- Dominant `schema_extra_field` bucket — the LLM declares strict output schemas then returns extra fields. Mitigation: tighten the forge-guidance prompt or fix the sandbox's schema discipline.
+- Dominant `shape_check` bucket — the LLM keeps producing well-intentioned requests that the pre-judge validator rejects (empty properties, too few testCases). Usually fixable with a better system prompt that shows a worked forge example.
+- Dominant `judge_correctness` bucket — tool code has real logic bugs the judge catches (division, threshold inversions, unbounded outputs). Investigate the specific forges.
+- Non-zero `syntax_error` — LLM is emitting TypeScript syntax in a JavaScript sandbox, or single-line `if`/`for` without braces. Prompt fix.
+- `uniqueApproved / uniqueNames` near 1.0 — retry loop recovers well. Near 0 — LLM gets stuck on the same name across retries.
+
+### Reference consumer: paracosm
+
+Paracosm threads these utilities end-to-end through its SSE + cost telemetry surface. Every forge attempt shows up as a `forge_attempt` SSE event, is folded into the run's `_cost.forgeStats` payload on every subsequent event, lands in the run artifact's `finalCost().forgeStats`, and is aggregated across the last 100 runs at `/retry-stats.forges`. See [`apps/paracosm/src/runtime/emergent-setup.ts`](https://github.com/framersai/paracosm/blob/master/src/runtime/emergent-setup.ts) and [`cost-tracker.ts`](https://github.com/framersai/paracosm/blob/master/src/runtime/cost-tracker.ts) for the integration pattern.
+
 ## End-to-End Example: Agent Conversation
 
 ```