chore(appkit): production hardening — shutdown re-entrancy + SSE idle keep-alive #381
Draft
ditadi wants to merge 1 commit into
This was referenced May 12, 2026
🥞 Stacked PR
Use this link to review incremental changes.
* SIGTERM/SIGINT re-entrancy guard in ServerPlugin._gracefulShutdown.
  In interactive dev the OS sends SIGINT and the supervisor follows up
  with SIGTERM, so two _shutdownCoreServices invocations race. Observed
  failure modes are Lakebase pool double-close, OTLP batcher
  double-flush, and "process.exit called twice". A flag now guards
  against the second signal, which is logged at debug.
* Wall-clock idle keep-alive in the executeTask SSE bridge (25 s,
  IDLE_KEEPALIVE_INTERVAL_MS). The engine emits heartbeat events only
  while the executor is actively running; a task waiting on a slow
  downstream call can stay quiet long enough for an AWS/GCP/Cloudflare
  load balancer to drop the idle socket (typical 60 s timeout).
  The bridge adds a belt-and-braces wall-clock interval on top of the
  engine path. It is unref()'d so it doesn't keep the process alive on
  its own, and it is cleared from the bridge's outer finally so that no
  error or early-exit cleanup path leaks the timer.
* CLAUDE.md: new "TaskFlow Core Service" section documenting the
  mental model, public API on this.taskflow, the executeTask shortcut,
  the when-to-use-which-execution-method table, the recovery pattern,
  and OBO/autoRecover guardrails. Adds vendored @databricks/taskflow
  to Key Dependencies (sha256-pinned via VENDOR.json) and a sixth
  "durable by default for long ops" design principle. Updates Graceful
  Shutdown bullets to reflect the new re-entrancy behaviour and the
  bridge drain on shutdown.
* .claude/references/taskflow.md (new, 404 lines): canonical
  NEVER/MUST/SHOULD reference for TaskFlow usage from plugins. Covers
  when-to-use, task registration, idempotency keys, handler signature
  and ctx.emit, recovery patterns (agent loop, staged pipeline, saga),
  OBO/asUser interaction, shutdown semantics, and conflict semantics
  on duplicate submits. CLAUDE.md links to it as the authoritative
  source for the rules its TaskFlow Core Service section summarises.
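For reviewers who want the shape of the re-entrancy guard without opening the diff, here is a minimal sketch. It is not the code in ServerPlugin: the class name ShutdownGuard, its constructor arguments, and the debug callback are all hypothetical, and the cleanup steps only stand in for _shutdownCoreServices and the other shutdown work. What it illustrates is that the flag is set synchronously, before the first await, so a second signal arriving mid-shutdown falls through to the debug log instead of starting a second cleanup.

```typescript
// Hypothetical sketch of a shutdown re-entrancy guard; all names are illustrative.
type CleanupStep = () => Promise<void> | void;

class ShutdownGuard {
  private shuttingDown = false;
  private readonly steps: CleanupStep[];
  private readonly debug: (message: string) => void;

  constructor(steps: CleanupStep[], debug: (message: string) => void = () => {}) {
    this.steps = steps;
    this.debug = debug;
  }

  /** Resolves to true if this call ran the cleanup, false if it was a repeat signal. */
  async run(signal: string): Promise<boolean> {
    if (this.shuttingDown) {
      this.debug(`ignoring ${signal}: shutdown already in progress`);
      return false;
    }
    // Flip the flag before the first await so a racing second call sees it.
    this.shuttingDown = true;
    for (const step of this.steps) {
      await step();
    }
    return true;
  }
}
```

In a Node process both signal handlers would call the same guard instance, for example from process.on("SIGINT", ...) and process.on("SIGTERM", ...), so whichever signal arrives second becomes a logged no-op.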
Tests:
- src/plugins/server/tests/server.test.ts: re-entrancy guard test
  fires two _gracefulShutdown calls in parallel and asserts
  shutdownCoreServices / abortActiveOperations / server.close each
  run exactly once.
- src/taskflow/tests/execute-task.test.ts (new): two focused tests
  for the wall-clock keep-alive. One asserts the comment frame fires
  per IDLE_KEEPALIVE_INTERVAL_MS window while the engine is silent;
  the other asserts the interval is cleared once the bridge exits
  cleanly, so a later advanceTimersByTime adds no new frames.
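The keep-alive behaviour these tests exercise can be sketched as below. This is an illustration under stated assumptions, not the executeTask bridge itself: only IDLE_KEEPALIVE_INTERVAL_MS and its 25 s value come from this PR, while IntervalApi, nodeIntervals, startIdleKeepAlive, bridgeTask, and the ": keep-alive" frame text are hypothetical. The timer surface is injected so that a test can drive it with a fake clock rather than waiting on real time.

```typescript
// IDLE_KEEPALIVE_INTERVAL_MS (25 s) is from the PR description; every other name is hypothetical.
const IDLE_KEEPALIVE_INTERVAL_MS = 25_000;

/** Minimal timer surface so a test can substitute a fake clock. */
interface IntervalApi {
  set(callback: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

const nodeIntervals: IntervalApi = {
  set: (callback, ms) => {
    const handle = setInterval(callback, ms);
    // Node timer handles expose unref(); the optional call keeps this safe elsewhere.
    (handle as unknown as { unref?: () => void }).unref?.();
    return handle;
  },
  clear: (handle) => {
    clearInterval(handle as Parameters<typeof clearInterval>[0]);
  },
};

/** Starts writing SSE comment frames on a wall-clock interval; returns a stop function. */
function startIdleKeepAlive(
  write: (frame: string) => void,
  intervals: IntervalApi = nodeIntervals,
  intervalMs: number = IDLE_KEEPALIVE_INTERVAL_MS,
): () => void {
  // A line starting with ":" is an SSE comment, which clients ignore.
  const handle = intervals.set(() => write(": keep-alive\n\n"), intervalMs);
  return () => intervals.clear(handle);
}

/** Runs the task while the keep-alive is active and always stops it afterwards. */
async function bridgeTask(
  write: (frame: string) => void,
  runTask: () => Promise<void>,
  intervals: IntervalApi = nodeIntervals,
): Promise<void> {
  const stopKeepAlive = startIdleKeepAlive(write, intervals);
  try {
    await runTask();
  } finally {
    stopKeepAlive(); // success, error, and early exit all land here
  }
}
```

Putting the stop call in the outermost finally is what makes the second test meaningful: once the bridge has returned, no later clock advance should produce another frame.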
Deliberately out of scope:
- Register-time OBO+autoRecover hard-error in TaskflowService.task():
  OBO-ness is a property of the caller of executeTask (the active
  UserContext scope), not the registration, so a register-time check
  is structurally not available without adding a new isOboOnly flag
  to TaskDefinition. The runtime first-call warning from PR 4
  (oboAutoRecoverWarned set, deduped per (plugin, task)) covers the
  misconfiguration at the boundary where it materialises.
- VENDOR.json source-commit pin: already shipped in PR 3.
- Internal review notes (taskflow-review-findings.md,
  taskflow-review-plan.md) intentionally not committed.
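To make the first out-of-scope item concrete, a call-time warning deduped per (plugin, task) might look roughly like this. Only the oboAutoRecoverWarned set name is taken from this description; the TaskCall shape, the warnOboAutoRecoverOnce function, and the warning text are hypothetical, and the real check from PR 4 may differ. The sketch shows why the check sits at the call site: whether a call is OBO is known only from the caller's scope, which is passed in here as a plain boolean.

```typescript
// oboAutoRecoverWarned is the set named in the PR description; the rest is hypothetical.
const oboAutoRecoverWarned = new Set<string>();

interface TaskCall {
  plugin: string;
  task: string;
  isOboScope: boolean; // true when the caller is inside a user (OBO) context
  autoRecover: boolean;
}

/** Warns the first time an OBO caller hits an autoRecover task; returns true if it warned. */
function warnOboAutoRecoverOnce(call: TaskCall, warn: (message: string) => void): boolean {
  if (!call.isOboScope || !call.autoRecover) return false;
  // JSON-encode the pair so ("a", "b:c") and ("a:b", "c") cannot collide.
  const key = JSON.stringify([call.plugin, call.task]);
  if (oboAutoRecoverWarned.has(key)) return false;
  oboAutoRecoverWarned.add(key);
  warn(`task ${call.task} in plugin ${call.plugin} ran in an OBO scope with autoRecover enabled`);
  return true;
}
```

Because the key is built from the pair, a second task in the same plugin still gets its own first-call warning, while repeated calls to the same task stay quiet.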
Validation: pnpm -r typecheck ✓, pnpm build ✓, pnpm exec biome check
(touched files) ✓, pnpm exec knip ✓, pnpm test ✓ (126 files / 2307
tests; +1 file / +3 tests vs PR 6).
Signed-off-by: ditadi <victordperd@gmail.com>