Commit 45ef25e

docs(emergent): add forge-observability section covering 5-utility telemetry API
1 parent de9f61a commit 45ef25e

1 file changed

Lines changed: 104 additions & 0 deletions

docs/architecture/EMERGENT_CAPABILITIES.md

@@ -315,6 +315,110 @@ session ──(5+ uses, >0.8 confidence, panel approved)──→ agent ──(h
| **Agent** | Persisted for the creating agent | Survives restarts | 5+ uses, confidence > 0.8, two-reviewer panel |
| **Shared** | All agents in the runtime | Permanent until demoted | Human approval required (HITL gate) |

## Forge Observability

The forge pipeline ships with a five-utility observability layer under [`@framers/agentos/emergent`](/api/modules#emergent) so any consumer can see live forge health without re-implementing the instrumentation. Each utility is standalone, pure, and composes with whatever telemetry the host already has.

```
forge_tool invocation
             │
             ▼
┌────────────────────────────────────────┐
│  wrapForgeTool                         │
│   · JSON-parse stringified schemas     │
│   · normalize mode synonyms            │
│   · backstop required fields           │
│   · scope-tag every attempt            │
└────────────┬───────────────────────────┘
             │
             ▼
┌────────────────────────────────────────┐
│  inferSchemaFromTestCases              │
│   · synthesize inputSchema.properties  │
│     from testCase inputs when missing  │
│   · same for outputSchema              │
└────────────┬───────────────────────────┘
             │
             ▼
┌────────────────────────────────────────┐
│  validateForgeShape (pre-judge)        │
│   · empty schema properties → reject   │
│   · <2 testCases → reject              │
│   · empty-input testCases → reject     │
│  rejection short-circuits the judge    │
└────────────┬───────────────────────────┘
             │
             ▼
   EmergentJudge (unchanged)
             │
             ▼
┌────────────────────────────────────────┐
│  capture callback (one per attempt)    │
│  ForgeStatsAggregator.recordAttempt    │
│   · uniqueNames / uniqueApproved       │
│   · uniqueTerminalRejections           │
│   · classifyForgeRejection             │
│     → rejectionReasons histogram       │
└────────────┬───────────────────────────┘
             │
             ▼
   snapshot() → host telemetry
```
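
The pre-judge stage in the diagram can be illustrated with a minimal sketch. This is not the library's implementation: `SketchForgeRequest`, `SketchTestCase`, and `sketchValidateShape` are hypothetical names, and the field layout is an assumption based only on the three rejection rules listed above.

```typescript
// Hypothetical request shape; the real ForgeShapeRequest may differ.
interface SketchTestCase {
  input: Record<string, unknown>;
  expectedOutput?: unknown;
}

interface SketchForgeRequest {
  inputSchema?: { properties?: Record<string, unknown> };
  outputSchema?: { properties?: Record<string, unknown> };
  testCases?: SketchTestCase[];
}

// Returns a list of problems; an empty list means the request may go on to the judge.
function sketchValidateShape(req: SketchForgeRequest): string[] {
  const errors: string[] = [];
  const isEmpty = (obj?: Record<string, unknown>): boolean =>
    obj === undefined || Object.keys(obj).length === 0;

  // Rule 1: schemas must declare at least one property.
  if (isEmpty(req.inputSchema?.properties)) errors.push('inputSchema.properties is empty');
  if (isEmpty(req.outputSchema?.properties)) errors.push('outputSchema.properties is empty');

  // Rule 2: at least two test cases.
  const cases = req.testCases ?? [];
  if (cases.length < 2) errors.push(`need at least 2 testCases, got ${cases.length}`);

  // Rule 3: every test case must carry a non-empty input.
  cases.forEach((tc, i) => {
    if (isEmpty(tc.input)) errors.push(`testCases[${i}].input is empty`);
  });
  return errors;
}

const problems = sketchValidateShape({
  inputSchema: { properties: {} },
  outputSchema: { properties: { bmi: { type: 'number' } } },
  testCases: [{ input: { weightKg: 70, heightM: 1.75 } }],
});
console.log(problems);
```

A non-empty result would short-circuit the judge call, which is where the cost saving described in this section comes from.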

### API surface

| Utility | Kind | Purpose |
|---|---|---|
| [`wrapForgeTool`](/api/functions/wrapForgeTool) | wrapper (`ForgeToolMetaTool → ITool`) | Normalizes messy LLM forge args, runs the pre-judge shape check, and captures every attempt to the caller's sink regardless of outcome. Takes an optional `scope` label and `log` event callback so consumers can group attempts (e.g., `dept: 'medical'`) and render lifecycle events to stdout / pm2 / structured logs without the wrapper owning any console dependency. |
| [`validateForgeShape`](/api/functions/validateForgeShape) | pure function (`ForgeShapeRequest → string[]`) | Catches the three failure modes that dominate cheap-tier rejections before the judge LLM runs: empty schema properties, fewer than 2 testCases, and empty-input testCases. Every shape-check rejection saves one judge invocation plus the sandbox round-trip that would have followed it. |
| [`inferSchemaFromTestCases`](/api/functions/inferSchemaFromTestCases) | function (mutates the request in place) | Synthesizes `inputSchema.properties` / `outputSchema.properties` from concrete testCase values when the LLM forgot to declare them. Rescues the "examples without formalization" failure mode without relaxing schema discipline. Unions fields across every testCase so a single incomplete case does not narrow the inferred schema. |
| [`classifyForgeRejection`](/api/functions/classifyForgeRejection) | pure function (`string → ForgeRejectionCategory`) | Bins rejection-reason text into six categories: `schema_extra_field`, `shape_check`, `syntax_error`, `parse_error`, `judge_correctness`, `other`. Order matters: `schema_extra_field` wins over `judge_correctness` because it is the more specific and more actionable signal. A growing `other` bucket is the signal to read raw reasons and extend the pattern set. |
| [`ForgeStatsAggregator`](/api/classes/ForgeStatsAggregator) | class | Per-run rollup: `attempts`, `approved`, `rejected`, `approvedConfidenceSum`, `uniqueNames`, `uniqueApproved`, `uniqueTerminalRejections`, and the `rejectionReasons` histogram. `uniqueApproved` vs `uniqueTerminalRejections` is the real quality signal: unique-tool approval rate, not attempt-level approval rate. The snapshot shape is pinned: extend it by adding fields, never by renaming existing ones. |
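
To make the "union across every testCase" behavior concrete, here is a small sketch of schema inference. It is an illustration under stated assumptions, not the code behind `inferSchemaFromTestCases`: the names `jsonTypeOf`, `sketchInferProperties`, and `sketchFillMissing` are hypothetical, and the "first observed type wins" rule is a simplification chosen for the example.

```typescript
type JsonType = 'string' | 'number' | 'boolean' | 'object' | 'array' | 'null';

// Maps a runtime value to a JSON-Schema-style type name.
function jsonTypeOf(value: unknown): JsonType {
  if (value === null) return 'null';
  if (Array.isArray(value)) return 'array';
  switch (typeof value) {
    case 'string': return 'string';
    case 'number': return 'number';
    case 'boolean': return 'boolean';
    default: return 'object'; // fallback for anything else in this sketch
  }
}

// Unions field names across all samples so one sparse sample cannot narrow the result.
function sketchInferProperties(
  samples: Array<Record<string, unknown>>,
): Record<string, { type: JsonType }> {
  const properties: Record<string, { type: JsonType }> = {};
  for (const sample of samples) {
    for (const [key, value] of Object.entries(sample)) {
      // First observed type wins; later samples only add fields that are still missing.
      if (!(key in properties)) properties[key] = { type: jsonTypeOf(value) };
    }
  }
  return properties;
}

// Fills `schema.properties` in place, but only when nothing was declared.
function sketchFillMissing(
  schema: { properties?: Record<string, unknown> },
  samples: Array<Record<string, unknown>>,
): void {
  const declared = schema.properties ? Object.keys(schema.properties).length : 0;
  if (declared === 0) schema.properties = sketchInferProperties(samples);
}

const inferred = sketchInferProperties([
  { weightKg: 70 },
  { weightKg: 82, heightM: 1.8, unit: 'metric' },
]);
console.log(Object.keys(inferred));

const declaredSchema = { properties: { weightKg: { type: 'number' } } };
sketchFillMissing(declaredSchema, [{ somethingElse: true }]);
```

In this sketch an already-declared schema is left untouched, which mirrors the "when missing" qualifier in the table.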

### Composed wiring

```typescript
import {
  EmergentCapabilityEngine, ForgeToolMetaTool,
  wrapForgeTool, ForgeStatsAggregator,
} from '@framers/agentos/emergent';

const engine = new EmergentCapabilityEngine({ /* ... */ });
const forgeTool = new ForgeToolMetaTool(engine);

const stats = new ForgeStatsAggregator();

const wrapped = wrapForgeTool({
  raw: forgeTool,
  agentId: 'agent-1',
  sessionId: 'session-1',
  scope: 'medical', // optional; propagated onto every CapturedForge
  capture: record => stats.recordAttempt(
    record.approved, record.confidence, record.name, record.errorReason,
  ),
  log: event => {
    // event: { kind: 'start' | 'approved' | 'rejected' | 'error', toolName, ... }
    // Optional; omit for quiet mode.
  },
});

// Expose `wrapped` to the agent. After the run:
const snapshot = stats.snapshot();
// → { attempts, approved, rejected, uniqueApproved, uniqueTerminalRejections, rejectionReasons, ... }
```
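
The difference between attempt-level and unique-tool approval rates is easier to see with numbers. The sketch below is a stand-in, not `ForgeStatsAggregator` itself: `SketchForgeStats` is a hypothetical class, and it assumes that a name counts as a terminal rejection when none of its attempts was ever approved, which the source does not spell out.

```typescript
class SketchForgeStats {
  private attempts = 0;
  private approved = 0;
  private approvedConfidenceSum = 0;
  private readonly names = new Set<string>();
  private readonly approvedNames = new Set<string>();

  recordAttempt(approved: boolean, confidence: number, name: string): void {
    this.attempts += 1;
    this.names.add(name);
    if (approved) {
      this.approved += 1;
      this.approvedConfidenceSum += confidence;
      this.approvedNames.add(name);
    }
  }

  snapshot() {
    const uniqueNames = this.names.size;
    const uniqueApproved = this.approvedNames.size;
    return {
      attempts: this.attempts,
      approved: this.approved,
      rejected: this.attempts - this.approved,
      approvedConfidenceSum: this.approvedConfidenceSum,
      uniqueNames,
      uniqueApproved,
      // Assumption: a name is terminally rejected if no attempt for it was approved.
      uniqueTerminalRejections: uniqueNames - uniqueApproved,
    };
  }
}

const sketchStats = new SketchForgeStats();
sketchStats.recordAttempt(false, 0, 'bmi_calculator');  // first try rejected
sketchStats.recordAttempt(true, 0.9, 'bmi_calculator'); // retry approved
sketchStats.recordAttempt(false, 0, 'dose_checker');    // never recovered

const snap = sketchStats.snapshot();
const attemptRate = snap.approved / snap.attempts;         // 1 of 3 attempts
const uniqueRate = snap.uniqueApproved / snap.uniqueNames; // 1 of 2 tools
console.log({ attemptRate, uniqueRate });
```

Here one of three attempts was approved, yet one of two distinct tools ended up usable, so the attempt-level rate understates how well the retry loop recovered.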

### Interpreting the histogram

- Dominant `schema_extra_field` bucket: the LLM declares strict output schemas, then returns extra fields. Mitigation: tighten the forge-guidance prompt or fix the sandbox's schema discipline.
- Dominant `shape_check` bucket: the LLM keeps producing well-intentioned requests that the pre-judge validator rejects (empty properties, too few testCases). Usually fixable with a system prompt that shows a worked forge example.
- Dominant `judge_correctness` bucket: the tool code has real logic bugs that the judge catches (division, threshold inversions, unbounded outputs). Investigate the specific forges.
- Non-zero `syntax_error`: the LLM is emitting TypeScript syntax in a JavaScript sandbox, or single-line `if`/`for` without braces. Fix this in the prompt.
- `uniqueApproved / uniqueNames` near 1.0: the retry loop recovers well. Near 0: the LLM gets stuck on the same name across retries.
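
As a rough illustration of how ordered classification feeds the histogram, the sketch below bins reason strings with a first-match-wins rule list. The six category names come from this document; everything else is assumed: `SKETCH_RULES`, `sketchClassify`, `sketchHistogram`, and `dominantBucket` are hypothetical, and the regular expressions are illustrative guesses rather than the patterns `classifyForgeRejection` uses.

```typescript
type SketchCategory =
  | 'schema_extra_field' | 'shape_check' | 'syntax_error'
  | 'parse_error' | 'judge_correctness' | 'other';

// Ordered rules: the first matching pattern wins, so specific signals go first.
const SKETCH_RULES: Array<[SketchCategory, RegExp]> = [
  ['schema_extra_field', /additional propert|extra field/i],
  ['shape_check', /shape check|testCases|empty.*properties/i],
  ['syntax_error', /syntaxerror|unexpected token/i],
  ['parse_error', /json.*parse|failed to parse/i],
  ['judge_correctness', /incorrect|wrong|logic|does not match/i],
];

function sketchClassify(reason: string): SketchCategory {
  for (const [category, pattern] of SKETCH_RULES) {
    if (pattern.test(reason)) return category;
  }
  return 'other'; // a growing `other` count means the rule list needs extending
}

function sketchHistogram(reasons: string[]): Record<SketchCategory, number> {
  const histogram: Record<SketchCategory, number> = {
    schema_extra_field: 0, shape_check: 0, syntax_error: 0,
    parse_error: 0, judge_correctness: 0, other: 0,
  };
  for (const reason of reasons) histogram[sketchClassify(reason)] += 1;
  return histogram;
}

// Returns the bucket with the highest count (earliest bucket wins ties).
function dominantBucket(h: Record<SketchCategory, number>): SketchCategory {
  return (Object.keys(h) as SketchCategory[])
    .reduce((best, key) => (h[key] > h[best] ? key : best));
}

const sampleHistogram = sketchHistogram([
  'Output is incorrect: returned extra field "debug" not in schema',
  'SyntaxError: Unexpected token :',
  'sandbox timed out',
  'Output has extra field "units"',
]);
console.log(sampleHistogram, dominantBucket(sampleHistogram));
```

The first sample reason matches both the extra-field and the correctness patterns; rule order assigns it to `schema_extra_field`, the more actionable bucket.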

### Reference consumer: paracosm

Paracosm threads these utilities end-to-end through its SSE + cost telemetry surface. Every forge attempt shows up as a `forge_attempt` SSE event, is folded into the run's `_cost.forgeStats` payload on every subsequent event, lands in the run artifact's `finalCost().forgeStats`, and is aggregated across the last 100 runs at `/retry-stats.forges`. See [`apps/paracosm/src/runtime/emergent-setup.ts`](https://github.com/framersai/paracosm/blob/master/src/runtime/emergent-setup.ts) and [`cost-tracker.ts`](https://github.com/framersai/paracosm/blob/master/src/runtime/cost-tracker.ts) for the integration pattern.

## End-to-End Example: Agent Conversation

```
