chore(appkit): production hardening — shutdown re-entrancy + SSE idle keep-alive #381

Draft
ditadi wants to merge 1 commit into stack/taskflow/durable-task-demo from stack/taskflow/production-hardening

Conversation

ditadi (Contributor) commented May 12, 2026

🥞 Stacked PR

Use this link to review incremental changes.


  • SIGTERM/SIGINT re-entrancy guard in ServerPlugin._gracefulShutdown.
    In interactive dev the OS sends SIGINT and the supervisor follows up
    with SIGTERM, so two _shutdownCoreServices invocations race. Observed
    failure modes are a Lakebase pool double-close, an OTLP batcher
    double-flush, and "process.exit called twice". A flag now turns the
    second signal into a no-op that is logged at debug level.

  • Wall-clock idle keep-alive in the executeTask SSE bridge, firing
    every 25 s (IDLE_KEEPALIVE_INTERVAL_MS). The engine emits heartbeat
    events only while the executor is actively running, so a task waiting
    on a slow downstream call can stay quiet long enough for an
    AWS/GCP/Cloudflare load balancer to drop the idle socket (typical
    60 s timeout). The new interval is a belt-and-braces layer on top of
    the engine path. It is unref()'d so it doesn't keep the process alive
    on its own, and it is cleared in the bridge's outer finally so that
    no error or early-exit cleanup path leaks the timer.

  • CLAUDE.md: new "TaskFlow Core Service" section documenting the
    mental model, public API on this.taskflow, the executeTask shortcut,
    the when-to-use-which-execution-method table, the recovery pattern,
    and OBO/autoRecover guardrails. Adds vendored @databricks/taskflow
    to Key Dependencies (sha256-pinned via VENDOR.json) and a sixth
    "durable by default for long ops" design principle. Updates Graceful
    Shutdown bullets to reflect the new re-entrancy behaviour and the
    bridge drain on shutdown.

  • .claude/references/taskflow.md (new, 404 lines): canonical
    NEVER/MUST/SHOULD reference for TaskFlow usage from plugins. Covers
    when-to-use, task registration, idempotency keys, handler signature
    and ctx.emit, recovery patterns (agent loop, staged pipeline, saga),
    OBO/asUser interaction, shutdown semantics, and conflict semantics
    on duplicate submits. CLAUDE.md links to it as the authoritative
    source for the rules its TaskFlow Core Service section summarises.
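
For reviewers who want the shape of the re-entrancy guard from the first bullet without opening the diff, here is a minimal sketch. It is not the code in this PR: `createGracefulShutdown`, `ShutdownStep`, and `logDebug` are hypothetical names, and the steps stand in for whatever ServerPlugin._gracefulShutdown actually runs. The point it illustrates is that the flag is set synchronously, before the first await, so a second signal that arrives mid-shutdown returns early.

```typescript
// Hypothetical stand-in for one shutdown action (close a pool, flush a batcher, ...).
type ShutdownStep = () => Promise<void>;

// Returns a shutdown function that runs `steps` at most once, no matter how
// many signals invoke it. All names here are assumptions made for illustration.
function createGracefulShutdown(
  steps: ShutdownStep[],
  logDebug: (message: string) => void = () => {},
): (signal: string) => Promise<boolean> {
  let shuttingDown = false;

  return async (signal) => {
    if (shuttingDown) {
      // Second signal (for example SIGTERM after SIGINT): log and bail out.
      logDebug(`ignoring ${signal}: shutdown already in progress`);
      return false;
    }
    // Set the flag before the first await so a concurrent call cannot pass the check.
    shuttingDown = true;

    for (const step of steps) {
      await step();
    }
    return true;
  };
}
```

Both signals can then share the one function, for example `process.on("SIGINT", () => void shutdown("SIGINT"))` and the same for SIGTERM, with `process.exit` called only by the invocation that returned `true`.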

Tests:

  • src/plugins/server/tests/server.test.ts: re-entrancy guard test
    fires two _gracefulShutdown calls in parallel and asserts
    shutdownCoreServices / abortActiveOperations / server.close each
    run exactly once.
  • src/taskflow/tests/execute-task.test.ts (new): two focused tests
    for the wall-clock keep-alive. One asserts that the comment frame
    fires in each IDLE_KEEPALIVE_INTERVAL_MS window while the engine is
    silent; the other asserts that the interval is cleared once the
    bridge exits cleanly, so a later advanceTimersByTime adds no new
    frames.
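
The behaviour these keep-alive tests pin down can be sketched as follows. This is an illustration under assumptions, not the bridge in this PR: `bridgeEvents`, `SseSink`, the `intervalMs` parameter, and the `: keep-alive` comment text are hypothetical, and only IDLE_KEEPALIVE_INTERVAL_MS and its 25 s value come from the description above. It assumes a Node.js runtime, where `setInterval` returns a timer object with `unref()`.

```typescript
// 25 s, as stated in the PR description.
const IDLE_KEEPALIVE_INTERVAL_MS = 25_000;

// Hypothetical minimal view of the HTTP response the bridge writes to.
interface SseSink {
  write(chunk: string): void;
}

// Forwards engine events as SSE data frames and, independently of the engine,
// writes an SSE comment frame on a wall-clock interval so the socket never
// looks idle to an intermediary.
async function bridgeEvents(
  events: AsyncIterable<string>,
  sink: SseSink,
  intervalMs: number = IDLE_KEEPALIVE_INTERVAL_MS,
): Promise<void> {
  const keepAlive = setInterval(() => {
    // A line starting with ":" is an SSE comment; clients ignore it.
    sink.write(": keep-alive\n\n");
  }, intervalMs);
  // Do not let this timer alone keep the Node.js process alive.
  keepAlive.unref();

  try {
    for await (const data of events) {
      sink.write(`data: ${data}\n\n`);
    }
  } finally {
    // Runs on normal completion, thrown errors, and early exit alike.
    clearInterval(keepAlive);
  }
}
```

Putting `clearInterval` in the outer `finally` is what the second test relies on: once the bridge has returned, advancing the clock cannot produce another frame.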

Deliberately out of scope:

  • Register-time OBO+autoRecover hard-error in TaskflowService.task():
    OBO-ness is a property of the caller of executeTask (the active
    UserContext scope), not the registration, so a register-time check
    is structurally not available without adding a new isOboOnly flag
    to TaskDefinition. The runtime first-call warning from PR 4
    (oboAutoRecoverWarned set, deduped per (plugin, task)) covers the
    misconfiguration at the boundary where it materialises.
  • VENDOR.json source-commit pin: already shipped in PR 3.
  • Internal review notes (taskflow-review-findings.md,
    taskflow-review-plan.md) intentionally not committed.
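
For context on the first out-of-scope item, a warn-once check deduped per (plugin, task) could look roughly like the sketch below. Only the `oboAutoRecoverWarned` set name comes from the description above; `warnOboAutoRecoverOnce`, its parameters, and the message text are hypothetical, and the actual PR 4 implementation may differ.

```typescript
// Remembers which (plugin, task) pairs have already produced the warning.
const oboAutoRecoverWarned = new Set<string>();

// Emits the warning the first time a given (plugin, task) pair is seen and
// returns whether it warned. Hypothetical helper, for illustration only.
function warnOboAutoRecoverOnce(
  plugin: string,
  task: string,
  warn: (message: string) => void = console.warn,
): boolean {
  // JSON-encode the pair so ("a", "b/c") and ("a/b", "c") cannot collide.
  const key = JSON.stringify([plugin, task]);
  if (oboAutoRecoverWarned.has(key)) {
    return false;
  }
  oboAutoRecoverWarned.add(key);
  warn(`[taskflow] ${plugin}/${task}: autoRecover task executed in an OBO user scope`);
  return true;
}
```

A caller would invoke this from the execute path only when it detects both conditions at runtime, which is the boundary the bullet above refers to.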

Validation: pnpm -r typecheck ✓, pnpm build ✓, pnpm exec biome check
(touched files) ✓, pnpm exec knip ✓, pnpm test ✓ (126 files / 2307
tests; +1 file / +3 tests vs PR 6).

Signed-off-by: ditadi <victordperd@gmail.com>

ditadi force-pushed the stack/taskflow/production-hardening branch from ec0b3f7 to fad4231 on May 12, 2026 17:25
ditadi force-pushed the stack/taskflow/durable-task-demo branch from fe71e3a to c02bcd9 on May 12, 2026 17:25