| **Agent** | Persisted for the creating agent | Survives restarts | 5+ uses, confidence > 0.8, two-reviewer panel |
| **Shared** | All agents in the runtime | Permanent until demoted | Human approval required (HITL gate) |

## Forge Observability

The forge pipeline ships with a five-utility observability layer under [`@framers/agentos/emergent`](/api/modules#emergent) so any consumer can see live forge health without re-implementing the instrumentation. Each utility is standalone, pure, and composes with whatever telemetry the host already has.

```
   forge_tool invocation
             │
             ▼
┌────────────────────────────────────────┐
│ wrapForgeTool                          │
│  · JSON-parse stringified schemas      │
│  · normalize mode synonyms             │
│  · backstop required fields            │
│  · scope-tag every attempt             │
└────────────┬───────────────────────────┘
             │
             ▼
┌────────────────────────────────────────┐
│ inferSchemaFromTestCases               │
│  · synthesize inputSchema.properties   │
│    from testCase inputs when missing   │
│  · same for outputSchema               │
└────────────┬───────────────────────────┘
             │
             ▼
┌────────────────────────────────────────┐
│ validateForgeShape (pre-judge)         │
│  · empty schema properties → reject    │
│  · <2 testCases → reject               │
│  · empty-input testCases → reject      │
│  rejection short-circuits the judge    │
└────────────┬───────────────────────────┘
             │
             ▼
 EmergentJudge (unchanged)
             │
             ▼
┌────────────────────────────────────────┐
│ capture callback (one per attempt)     │
│ ForgeStatsAggregator.recordAttempt     │
│  · uniqueNames / uniqueApproved        │
│  · uniqueTerminalRejections            │
│  · classifyForgeRejection              │
│    → rejectionReasons histogram        │
└────────────┬───────────────────────────┘
             │
             ▼
snapshot() → host telemetry
```
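The two middle stages of the diagram (schema inference, then the pre-judge shape check) can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: every name prefixed with `Sketch`/`sketch`, the `expectedOutput` field, and the exact error strings are assumptions made for illustration, and the real `ForgeShapeRequest` may be shaped differently. It also assumes that "empty schema properties" applies to both the input and the output schema.

```typescript
// Hypothetical local types; the real ForgeShapeRequest may have a different shape.
type SketchProps = Record<string, { type: string }>;
interface SketchSchema { type?: string; properties?: SketchProps }
interface SketchTestCase {
  input: Record<string, unknown>;
  expectedOutput: Record<string, unknown>; // assumed field name
}
interface SketchForgeRequest {
  name: string;
  inputSchema: SketchSchema;
  outputSchema: SketchSchema;
  testCases: SketchTestCase[];
}

function jsonTypeOf(value: unknown): string {
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  return typeof value;
}

// Union the keys seen across every sample so one sparse case does not narrow the result.
function unionProps(samples: Array<Record<string, unknown>>): SketchProps {
  const props: SketchProps = {};
  for (const sample of samples) {
    for (const [key, value] of Object.entries(sample)) {
      if (!(key in props)) props[key] = { type: jsonTypeOf(value) };
    }
  }
  return props;
}

const isEmpty = (schema: SketchSchema): boolean =>
  Object.keys(schema.properties ?? {}).length === 0;

// Fills in missing properties in place; schemas that already declare properties are left untouched.
function sketchInferSchema(req: SketchForgeRequest): void {
  if (isEmpty(req.inputSchema)) {
    req.inputSchema = { type: "object", properties: unionProps(req.testCases.map((t) => t.input)) };
  }
  if (isEmpty(req.outputSchema)) {
    req.outputSchema = { type: "object", properties: unionProps(req.testCases.map((t) => t.expectedOutput)) };
  }
}

// Returns one message per violated rule; an empty array means the request may go to the judge.
function sketchValidateShape(req: SketchForgeRequest): string[] {
  const errors: string[] = [];
  if (isEmpty(req.inputSchema) || isEmpty(req.outputSchema)) errors.push("empty schema properties");
  if (req.testCases.length < 2) errors.push("fewer than 2 testCases");
  if (req.testCases.some((t) => Object.keys(t.input).length === 0)) errors.push("empty-input testCase");
  return errors;
}

// A request with examples but no declared schemas is rescued by inference.
const rescued: SketchForgeRequest = {
  name: "bmi",
  inputSchema: {},
  outputSchema: {},
  testCases: [
    { input: { weightKg: 70 }, expectedOutput: { bmi: 22.9 } },
    { input: { weightKg: 80, heightM: 1.8 }, expectedOutput: { bmi: 24.7 } },
  ],
};
sketchInferSchema(rescued);
const rescuedErrors = sketchValidateShape(rescued);

// The same request with a single test case is rejected before any judge call.
const tooThin: SketchForgeRequest = { ...rescued, name: "thin", testCases: rescued.testCases.slice(0, 1) };
const tooThinErrors = sketchValidateShape(tooThin);
```

Running inference before validation is what lets the shape check stay strict: the validator never has to guess at intent, it only sees whatever schema exists after the rescue step.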
### API surface

| Utility | Kind | Purpose |
|---|---|---|
| [`wrapForgeTool`](/api/functions/wrapForgeTool) | wrapper (`ForgeToolMetaTool → ITool`) | Normalizes messy LLM forge args, runs pre-judge shape check, captures every attempt to the caller's sink regardless of outcome. Takes an optional `scope` label and `log` event callback so consumers can group attempts (e.g., `dept: 'medical'`) and render lifecycle events to stdout / pm2 / structured logs without the wrapper owning any console dependency. |
| [`validateForgeShape`](/api/functions/validateForgeShape) | pure function (`ForgeShapeRequest → string[]`) | Catches the three failure modes that dominate cheap-tier rejections before the judge LLM runs: empty schema properties, fewer than 2 testCases, empty-input testCases. Every shape-check rejection saves one judge invocation plus the sandbox round-trip that would have followed it. |
| [`inferSchemaFromTestCases`](/api/functions/inferSchemaFromTestCases) | pure function (in-place mutation) | Synthesizes `inputSchema.properties` / `outputSchema.properties` from concrete testCase values when the LLM forgot to declare them. Rescues the "examples without formalization" failure mode without relaxing schema discipline. Unions fields across every testCase so a single incomplete case does not narrow the inferred schema. |
| [`classifyForgeRejection`](/api/functions/classifyForgeRejection) | pure function (`string → ForgeRejectionCategory`) | Bins rejection-reason text into six categories: `schema_extra_field`, `shape_check`, `syntax_error`, `parse_error`, `judge_correctness`, `other`. Order matters: `schema_extra_field` wins over `judge_correctness` because it is the more specific and more actionable signal. A growing `other` bucket is the signal to read raw reasons and extend the pattern set. |
| [`ForgeStatsAggregator`](/api/classes/ForgeStatsAggregator) | class | Per-run rollup: `attempts`, `approved`, `rejected`, `approvedConfidenceSum`, `uniqueNames`, `uniqueApproved`, `uniqueTerminalRejections`, and the `rejectionReasons` histogram. `uniqueApproved` vs `uniqueTerminalRejections` is the real quality signal: unique-tool approval rate, not attempt-level approval rate. The shape is pinned: extend by adding fields, never rename existing ones. |
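To make the "order matters" rule and the unique-versus-attempt distinction concrete, here is a small sketch of an ordered classifier feeding a per-run rollup. It is an illustration only, not the library's code: the regular expressions, the `SketchAttempt` shape, the numeric type of the `unique*` fields, and the reading of "terminal rejection" as "a name that was never approved in this run" are all assumptions. Only the six category names and the rollup field names come from the table above.

```typescript
type SketchCategory =
  | "schema_extra_field" | "shape_check" | "syntax_error"
  | "parse_error" | "judge_correctness" | "other";

// First match wins, so the more specific patterns are listed first (patterns are assumed).
const SKETCH_PATTERNS: Array<[SketchCategory, RegExp]> = [
  ["schema_extra_field", /\b(extra|additional|undeclared) (field|propert)/i],
  ["shape_check", /shape[- ]check/i],
  ["syntax_error", /syntaxerror|unexpected token/i],
  ["parse_error", /parse|invalid json/i],
  ["judge_correctness", /incorrect|wrong|correctness/i],
];

function sketchClassify(reason: string): SketchCategory {
  for (const [category, pattern] of SKETCH_PATTERNS) {
    if (pattern.test(reason)) return category;
  }
  return "other"; // a growing "other" count means the pattern list needs extending
}

// Hypothetical attempt record handed to the capture callback.
interface SketchAttempt { name: string; approved: boolean; confidence?: number; reason?: string }

class SketchForgeStats {
  private attempts = 0;
  private approved = 0;
  private approvedConfidenceSum = 0;
  private readonly names = new Set<string>();
  private readonly approvedNames = new Set<string>();
  private readonly rejectionReasons: Partial<Record<SketchCategory, number>> = {};

  recordAttempt(attempt: SketchAttempt): void {
    this.attempts += 1;
    this.names.add(attempt.name);
    if (attempt.approved) {
      this.approved += 1;
      this.approvedConfidenceSum += attempt.confidence ?? 0;
      this.approvedNames.add(attempt.name);
    } else {
      const category = sketchClassify(attempt.reason ?? "");
      this.rejectionReasons[category] = (this.rejectionReasons[category] ?? 0) + 1;
    }
  }

  snapshot() {
    return {
      attempts: this.attempts,
      approved: this.approved,
      rejected: this.attempts - this.approved,
      approvedConfidenceSum: this.approvedConfidenceSum,
      uniqueNames: this.names.size,
      uniqueApproved: this.approvedNames.size,
      // Assumption: "terminal" means the name never reached approval during this run.
      uniqueTerminalRejections: this.names.size - this.approvedNames.size,
      rejectionReasons: { ...this.rejectionReasons },
    };
  }
}

const stats = new SketchForgeStats();
stats.recordAttempt({ name: "bmi", approved: false, reason: "output has extra field 'debug'; result judged incorrect" });
stats.recordAttempt({ name: "bmi", approved: true, confidence: 0.9 });
stats.recordAttempt({ name: "dose", approved: false, reason: "SyntaxError: Unexpected token ':'" });
const snap = stats.snapshot();
```

In this run the attempt-level approval rate is 1 of 3, while the unique-tool rate is 1 of 2, because the retry on `bmi` eventually succeeded. The first rejection mentions both an extra field and an incorrect result, and lands in `schema_extra_field` because that pattern is checked first.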
- Dominant `schema_extra_field` bucket: the LLM declares strict output schemas, then returns extra fields. Mitigation: tighten the forge-guidance prompt or fix the sandbox's schema discipline.
- Dominant `shape_check` bucket: the LLM keeps producing well-intentioned requests that the pre-judge validator rejects (empty properties, too few testCases). Usually fixable with a better system prompt that shows a worked forge example.
- Dominant `judge_correctness` bucket: the tool code has real logic bugs that the judge catches (division, threshold inversions, unbounded outputs). Investigate the specific forges.
- Non-zero `syntax_error`: the LLM is emitting TypeScript syntax in a JavaScript sandbox, or single-line `if`/`for` without braces. Fix this in the prompt.
- `uniqueApproved / uniqueNames` near 1.0: the retry loop recovers well. Near 0: the LLM gets stuck on the same name across retries.
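The signals in the list above can be read off a snapshot mechanically. The following sketch assumes a snapshot object in which `uniqueNames` and `uniqueApproved` are counts and `rejectionReasons` maps a category name to a count; the `SketchSnapshot` type and `sketchDiagnose` helper are hypothetical and not part of the documented API.

```typescript
// Hypothetical subset of a stats snapshot; field types are assumed.
interface SketchSnapshot {
  uniqueNames: number;
  uniqueApproved: number;
  rejectionReasons: Record<string, number>;
}

interface SketchDiagnosis {
  uniqueApprovalRate: number;    // near 1.0: retries recover; near 0: stuck on the same names
  dominantBucket: string | null; // the category with the most rejections, if any
}

function sketchDiagnose(snapshot: SketchSnapshot): SketchDiagnosis {
  const uniqueApprovalRate =
    snapshot.uniqueNames === 0 ? 0 : snapshot.uniqueApproved / snapshot.uniqueNames;

  let dominantBucket: string | null = null;
  let highest = 0;
  for (const [bucket, count] of Object.entries(snapshot.rejectionReasons)) {
    if (count > highest) {
      highest = count;
      dominantBucket = bucket;
    }
  }
  return { uniqueApprovalRate, dominantBucket };
}

const diagnosis = sketchDiagnose({
  uniqueNames: 4,
  uniqueApproved: 3,
  rejectionReasons: { shape_check: 5, syntax_error: 1 },
});
// diagnosis.uniqueApprovalRate is 0.75 and diagnosis.dominantBucket is "shape_check"
```

A host could log this pair per run and alert when the rate drops or the dominant bucket changes, which maps directly onto the mitigations listed above.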
### Reference consumer: paracosm

Paracosm threads these utilities end-to-end through its SSE + cost telemetry surface. Every forge attempt shows up as a `forge_attempt` SSE event, is folded into the run's `_cost.forgeStats` payload on every subsequent event, lands in the run artifact's `finalCost().forgeStats`, and is aggregated across the last 100 runs at `/retry-stats.forges`. See [`apps/paracosm/src/runtime/emergent-setup.ts`](https://github.com/framersai/paracosm/blob/master/src/runtime/emergent-setup.ts) and [`cost-tracker.ts`](https://github.com/framersai/paracosm/blob/master/src/runtime/cost-tracker.ts) for the integration pattern.