feat: [ENG-2241] AutoHarness V2 graceful-degradation tests #500
danhdoan merged 1 commit into proj/autoharness-v2
Claude finished @danhdoan's task in 3m 27s
PR Review: ENG-2241 Graceful Degradation Tests
Summary
Score: 4 / 5
This is a well-executed pure-coverage PR. The three-invariant-per-test structure (return shape, namespace presence, sandbox health) is the right model for end-to-end degradation testing, and every scenario from the A4 brutal-review item is represented with appropriate assertions. The test-file-level doc block accurately maps to the implementation. A few minor issues to address:
Correctness
Non-null assertions on
Style
What's correct and well done
Phase 3 Task 3.4 — closes brutal-review item A4 with 7 unit tests
covering every harness-failure scenario from the execution plan.
Each test exercises the full pipeline end-to-end: real
`HarnessStore` (in-memory-backed) + real `HarnessModuleBuilder` +
real `SandboxService.loadHarness` + `executeCode` for the "sandbox
stays healthy" proof.
Build-time failures (harness NOT loaded):
1. Syntax error at module load → reason: 'syntax'
2. Throw in meta() → reason: 'meta-threw'
Per-invocation failures (harness loads; wrapper normalizes the
throw):
3. Throw in curate() → "curate() failed: user error"
4. Infinite loop in curate() → vm timeout, ~5s
5. Infinite recursion in curate() → stack overflow, wrapper throws
7. Never-resolving Promise → Promise.race timer, ~5s
Non-failure (legal behavior):
6. Returns undefined → resolves to undefined; no warn
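The per-invocation cases above all rely on a wrapper that turns any failure into one normalized error. A minimal sketch of that idea follows, assuming a `Promise.race` deadline and a `name() failed: ...` message format inferred from this description. The signature of `wrapInvocation`, the `Curate` type, and the timeout text are illustrative assumptions, not the actual `HarnessModuleBuilder` code.

```typescript
// Hypothetical sketch: names, signature, and message text are assumptions.
type Curate = (ctx: unknown) => unknown | Promise<unknown>;

function wrapInvocation(
  name: string,
  fn: Curate,
  timeoutMs: number,
): (ctx: unknown) => Promise<unknown> {
  return async (ctx) => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    // Rejects once the deadline passes; covers a never-resolving Promise (case 7).
    const deadline = new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(new Error(`exceeded ${timeoutMs}ms`)),
        timeoutMs,
      );
    });
    try {
      // Promise.resolve().then(...) turns a synchronous throw into a rejection.
      const call = Promise.resolve().then(() => fn(ctx));
      // A legal `undefined` result passes through untouched (case 6).
      return await Promise.race([call, deadline]);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      throw new Error(`${name}() failed: ${message}`);
    } finally {
      // Always clear the timer so a fast call does not leave it pending.
      clearTimeout(timer);
    }
  };
}
```

A timer raced this way cannot interrupt a synchronous `while (true) {}` loop, because the timer callback never gets a turn on the event loop. That is why case 4 depends on the vm timeout rather than on `Promise.race`.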
Each test asserts three invariants:
1. `loadHarness` returns the right `HarnessLoadResult` variant
(`{loaded: false, reason}` for 1-2; `{loaded: true}` for 3-7).
2. For build-time failures: sandbox context has no `harness.*`
namespace (verified by `typeof harness !== 'undefined'`
evaluating to `false`).
For per-invocation failures: harness IS loaded, direct
`module.curate(ctx)` throws (or resolves to undefined for case 6).
3. Sandbox continues executing unrelated code correctly after the
harness failure — verified via `executeCode('2 + 2', 's1')`
returning 4. If the harness failure corrupted sandbox state,
this assertion would surface it.
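The first invariant works because `loadHarness` reports failure through its return value instead of throwing. A minimal sketch of how such a result could be typed and narrowed follows, with shapes inferred from the description above. `HarnessModule`, `HarnessLoadFailureReason`, and `describeLoad` are illustrative names, and the real types may carry more fields or more reasons.

```typescript
// Hypothetical shapes inferred from the commit message; the real types may differ.
// Only the two failure reasons named in this PR are listed.
type HarnessLoadFailureReason = "syntax" | "meta-threw";

interface HarnessModule {
  // Optional, which is why the tests call `result.module.curate!(ctx)`.
  curate?: (ctx: unknown) => Promise<unknown>;
}

type HarnessLoadResult =
  | { loaded: false; reason: HarnessLoadFailureReason }
  | { loaded: true; module: HarnessModule };

// Narrowing on `loaded` gives type-safe access to `reason` or `module`.
function describeLoad(result: HarnessLoadResult): string {
  if (!result.loaded) {
    return `not loaded (${result.reason})`;
  }
  return result.module.curate
    ? "loaded with curate()"
    : "loaded without curate()";
}
```

Because the union is discriminated on `loaded`, a test that checks `result.loaded` first can read `result.reason` or `result.module` without casts.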
Cases 3-7 call `result.module.curate!(ctx)` directly rather than
routing through the sandbox's `harness.curate()` injection.
Reason: invoking an async harness function from REPL-executed
sandbox code requires an IIFE that doesn't cleanly surface the
thrown error (the REPL doesn't support top-level await in a way
that propagates rejections for test assertions). The sandbox's
`harness.curate` is a thin closure over `module.curate`
(`() => module.curate(buildCtx())`); calling the module method
directly tests the same code path. Task 3.3's behavioral test
already verifies the sandbox injection wiring; this file's
contribution is the failure-mode semantics.
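The equivalence argument above can be illustrated with a small sketch. It assumes the injected namespace is a pure pass-through, as described. The `Ctx` and `HarnessModule` shapes and the body of `buildHarnessNamespace` are hypothetical stand-ins, not the code in `SandboxService`.

```typescript
// All shapes below are hypothetical stand-ins for the pieces the text describes.
interface Ctx {
  sessionId: string;
}

interface HarnessModule {
  curate: (ctx: Ctx) => Promise<unknown>;
}

function buildHarnessNamespace(module: HarnessModule, buildCtx: () => Ctx) {
  return {
    // Pure pass-through: no logging, retry, or state tracking.
    curate: () => module.curate(buildCtx()),
  };
}

// Record every context the module sees, so both call routes can be compared.
const seen: Ctx[] = [];
const demoModule: HarnessModule = {
  curate: (ctx) => {
    seen.push(ctx);
    return Promise.resolve(undefined);
  },
};

const harness = buildHarnessNamespace(demoModule, () => ({ sessionId: "s1" }));
void harness.curate();                        // route used by sandbox code
void demoModule.curate({ sessionId: "s1" });  // route used by these tests
```

Under that assumption both routes reach the same function with an equivalent context, so failure semantics observed on the direct call carry over to the injected one.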
Task doc specified "stub `IHarnessStore`"; shipped with real
`HarnessStore` backed by `FileKeyStorage({inMemory: true})`. The
store is a pass-through in these tests (save version → getLatest
returns it); stubbing adds ceremony without coverage benefit. If
stub-strict is preferred, swap to `InMemoryHarnessStore` from
`test/helpers/` in a one-line change.
Claude finished @danhdoan's task in 4m 4s
PR Review: ENG-2241 Graceful Degradation Tests
Score: 5 / 5
All four substantive issues from the prior review have been resolved. This is a clean, well-scoped coverage PR.
What was fixed since the last review
Remaining / new observations
One nit remains (low priority, inline).
One doc inaccuracy (inline): The file-level JSDoc says "≈ 15-20s" but the actual runtime is ~10s (only tests 4 and 7 wait ~5s; all others are fast).
What's correct and well done
Summary
- `loadHarness` never throws regardless of what the harness code does. If any scenario here regressed, a bug in the evaluator would silently crash user sessions.
- `test/unit/agent/harness/graceful-degradation.test.ts`: 7 scenarios, one per A4 case, each exercising the full `SandboxService.loadHarness` → builder → sandbox pipeline. Three-invariant assertion per test: return shape, namespace presence/per-call throw, sandbox-stays-healthy.
- No changes to `HarnessModuleBuilder`, `SandboxService`, or `HarnessStore`; the PR only exercises them. This is a pure coverage PR.
Type of change
Scope (select all touched areas)
Linked issues
Root cause (bug fixes only, otherwise write N/A)
Test plan
- `test/unit/agent/harness/graceful-degradation.test.ts`
- `1. Syntax error at module load` → `{loaded: false, reason: 'syntax'}` + no `harness` in sandbox + `2 + 2` returns 4
- `2. Throw in meta()` → `{loaded: false, reason: 'meta-threw'}` + no `harness` in sandbox + `2 + 2` returns 4
- `3. Throw in curate()` → `loaded: true`; `module.curate(ctx)` throws with `/curate\(\) failed/`; sandbox healthy
- `4. Infinite loop in curate()` → `loaded: true`; V8 vm timeout (~5003ms); sandbox healthy
- `5. Infinite recursion in curate()` → `loaded: true`; stack overflow caught; sandbox healthy
- `7. Never-resolving Promise from curate()` → `loaded: true`; Promise.race timer (~5003ms) with `/exceeded/` message; sandbox healthy
- `6. Returns undefined from curate()` → `loaded: true`; `module.curate(ctx)` resolves to `undefined`; sandbox healthy. Pinned explicitly as NOT a warning case so a future "warn on undefined returns" drift breaks this test.
User-visible changes
None. Pure test coverage addition. No consumer of the tested paths changes behavior.
Evidence
Before this PR, the test file didn't exist. After: all 7 pass. Full suite: 6705 passing / 0 failing.
Checklist
- `npm test`: 7 new tests; full suite 6705 passing / 0 failing
- `npm run lint`: 0 errors, 226 pre-existing warnings
- `npm run typecheck`: exit=0
- `npm run build`: exit=0
- `feat: [ENG-2241] ...`
- `features/autoharness-v2/tasks/phase_3/task_04-graceful-degradation.md` (research repo) drove the scope; the stub-vs-real-store deviation + per-invocation-direct-call design choice flagged below for post-merge task-doc tightening
- `main`: targets `proj/autoharness-v2`, not `main`
Risks and mitigations
Risk: Test-file runtime is ~10s because two tests wait ~5s each for vm / Promise.race timeouts to fire. Under CI load the vm timeout could drift past 5.1s.
Mitigation: `.timeout(8000)` budget (vm timeout + 3s headroom). V8's `vm.Script.runInContext` timeout is a hard wall-clock limit, not a soft signal. Real flakes would point at CI CPU exhaustion rather than test brittleness. If the runtime becomes painful, extract the two timeout tests to a dedicated "slow" file that runs on a nightly cadence; correctness signals stay in the PR gate.
Risk: The per-invocation failure tests (3-7) call `result.module.curate!(ctx)` directly rather than through `service.executeCode('harness.curate(...)', 's1')`. If the sandbox's `harness.curate` wrapper in `buildHarnessNamespace` ever diverges from a naked `module.curate(ctx)` call, this test file wouldn't catch the divergence.
Mitigation: Task 3.3's behavioral test (`harness.*` visible inside sandbox code) verifies the injection-wrapper contract directly. If a regression made the wrapper add behavior (logging, retry, state-tracking) beyond a pass-through call, that test would catch it. This file's job is degradation semantics, not injection plumbing.
Risk: Task doc prescribes "stub `IHarnessStore`"; shipped with real `HarnessStore` backed by `FileKeyStorage({inMemory: true})`.
Mitigation: If stub-strict is preferred, swap to `InMemoryHarnessStore` (from `test/helpers/`, shipped by Phat in Phase 1 Task 1.1); this is a one-line change.
Notes for reviewers
All 7 titles lead with the verbatim scenario name from the execution plan, then append the expected outcome. Reading the titles in isolation tells you exactly what's under test: `"1. Syntax error at module load → {loaded:false, reason:syntax}"`. If the outcome clause is ever wrong, a failing test report points directly at the contract drift.

The sandbox-healthy invariant is load-bearing. Every test ends with `expectSandboxHealthy()`, a one-liner that runs `2 + 2` through the sandbox and asserts `4`. Cheap proof that the harness failure didn't corrupt sandbox state (e.g., by leaking a broken VM context or mangling session maps). If any test's sandbox-healthy assertion ever fails, the evaluator or the sandbox injection has a real bug; don't retry, debug.

Case 6 (`returns undefined`) is explicitly pinned as NOT a failure. The test comment documents this so a future "defensive warning on undefined return" drift would break the test and force an explicit decision. Templates can legally return undefined; the LLM-side caller handles that. The module builder must not emit a warning or wrap the result.

Cases 4 and 5 cover distinct failure mechanisms: case 4 (`while(true){}`) is caught by V8's vm wall-clock timeout on `vm.Script.runInContext`; case 5 (`function go(){go()}; go()`) is a stack overflow caught by V8 as a native error. Same normalized outcome (`curate() failed: ...`) but the underlying mechanisms are orthogonal, so both need to be wrapped by the same error-normalization path in `HarnessModuleBuilder.wrapInvocation`.

Task 3.5 (isolation integration test) will replace this file's unit-level setup with a real file-backed sandbox + attack fixtures. That PR exercises orthogonal concerns (cross-context isolation); this one proves per-scenario degradation. Both are in scope for Phase 3's ship gate.
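The difference between cases 4 and 5 can be reproduced outside the harness with Node's `vm` module. The sketch below assumes a fresh context per run and a short timeout so it finishes quickly. `runGuarded` and `messageOf` are hypothetical helpers, and the `curate() failed:` prefix only mimics the normalized message described in this PR.

```typescript
import * as vm from "node:vm";

// Realm-agnostic message extraction: an error created inside a vm context
// may not be `instanceof Error` in the host realm, so read `.message` directly.
function messageOf(err: unknown): string {
  if (typeof err === "object" && err !== null && "message" in err) {
    return String((err as { message: unknown }).message);
  }
  return String(err);
}

// Hypothetical helper: runs source in a fresh context under a wall-clock limit
// and normalizes any failure into a single message shape.
function runGuarded(source: string, timeoutMs: number): string {
  const context = vm.createContext({});
  try {
    new vm.Script(source).runInContext(context, { timeout: timeoutMs });
    return "ok";
  } catch (err) {
    return `curate() failed: ${messageOf(err)}`;
  }
}

// Case 4 analogue: only the vm timeout can stop a synchronous infinite loop.
const loop = runGuarded("while (true) {}", 50);

// Case 5 analogue: unbounded recursion overflows the stack well before the timeout.
const recursion = runGuarded("function go() { go(); } go();", 1000);

console.log(loop);
console.log(recursion);
```

Both calls return a string with the same prefix even though one failure comes from the timeout and the other from a native stack-overflow error, which is the property the two tests pin down.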
Related
- `test/unit/agent/harness/graceful-degradation.test.ts`
- `src/agent/infra/harness/harness-module-builder.ts` (ENG-2239)
- `src/agent/infra/harness/harness-store.ts` (ENG-2227 + ENG-2228)
- `src/agent/infra/sandbox/sandbox-service.ts` `.loadHarness()` (ENG-2240)
- `features/autoharness-v2/tasks/phase_3/task_04-graceful-degradation.md` (research repo)