fix(gateway): harden readiness lifecycle and secret validation #655

Merged: marcusrbrown merged 1 commit into main from fix/gateway-reliability-cohort (May 20, 2026)

@marcusrbrown
Collaborator

Three reliability improvements that came out of PR #649's review cycle. They share the same scope (gateway lifecycle and config validation) and ship as one PR.

Readiness flag now re-arms across the full Discord reconnect lifecycle. Previously the flag was written once on clientReady and persisted for the rest of the process. After a permanent disconnect — rate-limit ban, gateway revocation, network partition past discord.js's retry budget — the healthcheck stayed green forever. discord.js v14 also only emits clientReady once per session; reconnects emit shardResume (session resumed) or shardReady (new session) instead. The fix listens on all four events: clientReady, shardReady, and shardResume write the flag; shardDisconnect clears it. Each log line includes an origin field so operators can tell which event re-armed the healthcheck.
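
A minimal sketch of the four-event wiring described above, under stated assumptions: `setupReadinessFlagSketch`, `MemoryFlag`, `TinyEmitter`, and the log entry shape are hypothetical stand-ins for illustration, not the PR's actual code. A real discord.js client would take the place of the in-memory emitter, and the real flag is presumably backed by something the healthcheck can read.

```typescript
// Hypothetical stand-ins for illustration; not the PR's actual types.
type ReadinessEvent = 'clientReady' | 'shardReady' | 'shardResume' | 'shardDisconnect'
type Listener = (...args: unknown[]) => void

interface ReadinessClientLike {
  on: (event: ReadinessEvent, listener: Listener) => unknown
}

type Log = (entry: {message: string; origin: ReadinessEvent}) => void

// In-memory flag; stands in for whatever the healthcheck actually inspects.
class MemoryFlag {
  private armed = false
  write(): void {
    this.armed = true
  }
  clear(): void {
    this.armed = false
  }
  isSet(): boolean {
    return this.armed
  }
}

function setupReadinessFlagSketch(client: ReadinessClientLike, flag: MemoryFlag, log: Log): void {
  // Any of these three events means the gateway session is usable again.
  const arming: ReadinessEvent[] = ['clientReady', 'shardReady', 'shardResume']
  for (const origin of arming) {
    client.on(origin, () => {
      flag.write()
      log({message: 'readiness flag written', origin})
    })
  }
  // A disconnect clears the flag so the healthcheck turns red.
  client.on('shardDisconnect', () => {
    flag.clear()
    log({message: 'readiness flag cleared', origin: 'shardDisconnect'})
  })
}

// Tiny emitter used only to exercise the wiring without discord.js.
class TinyEmitter implements ReadinessClientLike {
  private readonly listeners = new Map<ReadinessEvent, Listener[]>()
  on(event: ReadinessEvent, listener: Listener): this {
    const list = this.listeners.get(event) ?? []
    list.push(listener)
    this.listeners.set(event, list)
    return this
  }
  emit(event: ReadinessEvent, ...args: unknown[]): void {
    for (const listener of this.listeners.get(event) ?? []) listener(...args)
  }
}
```

Registering every handler with `on` rather than `once` is what lets the flag re-arm on each reconnect; the `origin` field mirrors the log field mentioned in the description.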

Errors during shardDisconnect cleanup now log and continue instead of rethrowing. Crashing the gateway during the very disconnect we're trying to report is strictly worse than leaving a stale flag for one cycle.
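
The log-and-continue behavior can be sketched as a small wrapper. The name `clearFlagSafely` and the log entry shape are assumptions for illustration; the PR's actual handler may be structured differently.

```typescript
// Hypothetical log entry shape; the PR's logger may differ.
type LogError = (entry: {message: string; origin: string; error: unknown}) => void

// Runs the cleanup step and reports failure through the logger instead of rethrowing.
// Returns true when the flag was cleared, false when cleanup failed.
function clearFlagSafely(clearFlag: () => void, logError: LogError): boolean {
  try {
    clearFlag()
    return true
  } catch (error) {
    // A stale flag for one cycle is preferable to crashing during the disconnect.
    logError({message: 'failed to clear readiness flag', origin: 'shardDisconnect', error})
    return false
  }
}
```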

readOptionalSecret rejects values containing line-breaking characters. AWS credentials copy-pasted from a wrapped terminal can carry mid-value newlines that pass trimEnd but break S3 request signing later with an opaque AWS signature error. The check applies symmetrically to file-read and env-var paths and covers \r, \n, U+0085 (NEL), U+2028 (LS), and U+2029 (PS). Errors point operators at the file path or env var name.
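
A minimal sketch of the line-break check, assuming a helper named `assertSingleLine` (hypothetical; the PR's `readOptionalSecret` internals and error wording are not shown here). The character class covers the five code points listed above.

```typescript
// Matches CR, LF, NEL (U+0085), LS (U+2028), and PS (U+2029).
const LINE_BREAK = /[\r\n\u0085\u2028\u2029]/

// `source` is the file path or env var name, so the error tells the operator where to look.
function assertSingleLine(value: string, source: string): string {
  if (LINE_BREAK.test(value)) {
    throw new Error(
      `Secret from ${source} contains a line-breaking character; re-copy the value as a single line.`,
    )
  }
  return value
}

// Same check for both paths: trim trailing whitespace first, then validate what remains.
function validateSecret(raw: string, source: string): string {
  return assertSingleLine(raw.trimEnd(), source)
}
```

`trimEnd` strips a trailing newline, so only a line break that survives trimming (typically one in the middle of the value) is rejected.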

main.ts exposes makeGatewayProgram as an injectable factory so startup ordering is observable from a unit test. A reversal of setupReadinessFlag and client.login would silently reintroduce the stale-flag bug; the new ordering test fails fast if anyone moves the wiring.
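
A sketch of how an injectable factory makes the ordering observable. `GatewayDeps`, `makeGatewayProgramSketch`, and `recordingDeps` are hypothetical names for illustration; the real `makeGatewayProgram` signature is not shown in this PR description.

```typescript
// Hypothetical dependency shape; the real factory may take different arguments.
interface GatewayDeps {
  setupReadinessFlag: () => void
  login: () => Promise<unknown>
}

function makeGatewayProgramSketch(deps: GatewayDeps): () => Promise<void> {
  return async () => {
    // Readiness wiring runs before login, which is the order the PR's test pins.
    deps.setupReadinessFlag()
    await deps.login()
  }
}

// Test double that records the order in which the two steps are invoked.
function recordingDeps(): {deps: GatewayDeps; calls: string[]} {
  const calls: string[] = []
  const deps: GatewayDeps = {
    setupReadinessFlag: () => {
      calls.push('setupReadinessFlag')
    },
    login: async () => {
      calls.push('login')
    },
  }
  return {deps, calls}
}
```

A unit test can run the program with `recordingDeps()` and assert that `calls` is `['setupReadinessFlag', 'login']`; swapping the two lines in the factory makes that assertion fail.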

Gateway tests grow from 87 to 145 across the three areas.

Owner

@fro-bot fro-bot left a comment

Verdict: CONDITIONAL

Can merge after addressing the one blocking issue. The overall design is sound and test coverage is thorough.


Blocking issues

readiness.ts: Verify ReadinessClient interface was updated from once to on

The base-branch file (pre-merge) still declares once, but the new setupReadinessFlag calls client.on(...) for four events. If the interface was not updated, TypeScript rejects the build. The diff shows the interface being replaced with the correct on signature — if CI is green this is already resolved.

Expected post-merge interface:

export interface ReadinessClient {
  on: (
    event: 'clientReady' | 'shardReady' | 'shardResume' | 'shardDisconnect',
    listener: (...args: unknown[]) => void,
  ) => this
}

Non-blocking concerns

  1. shardDisconnect fires on transient reconnects. discord.js emits shardDisconnect on every reconnect cycle, including short network blips. The healthcheck flips red during each transient disconnect, which can trigger container restarts if the orchestrator threshold is tight. Deliberate trade-off per the PR description — operators should tune the liveness-probe failureThreshold accordingly.

  2. No cold-start test for shardReady/shardResume without a prior clientReady. All transition tests begin with fireClientReady(). A cold start where shardReady fires first (possible in multi-shard setups) is untested.

  3. makeGatewayProgram is exported as "for unit testing only" but is part of the compiled package surface. The JSDoc comment is present; consider whether a test-helper barrel or naming convention makes intent clearer long-term. Minor.

  4. Ordering-test nullish fallbacks are dead code. The toBeDefined() assertions that precede expect(setupOrder ?? Infinity).toBeLessThan(loginOrder ?? -Infinity) already guarantee both values are defined, so the ?? fallbacks are unreachable. No bug, just slightly confusing.
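
One way to drop the unreachable fallbacks from item 4 is an explicit guard that throws when a call-order value is missing. This is a sketch only; `requireOrder` and `isSetupBeforeLogin` are hypothetical helpers, not code from the PR.

```typescript
// Throws with a descriptive message instead of silently substituting a fallback.
function requireOrder(order: number | undefined, label: string): number {
  if (order === undefined) {
    throw new Error(`${label} was never called`)
  }
  return order
}

// Compares two recorded call-order indexes once both are known to exist.
function isSetupBeforeLogin(setupOrder: number | undefined, loginOrder: number | undefined): boolean {
  return requireOrder(setupOrder, 'setupReadinessFlag') < requireOrder(loginOrder, 'client.login')
}
```

With the guard in place, a missing call fails the test with a named error rather than a comparison against Infinity.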


Missing tests

  • readOptionalSecret U+0085 (NEL) via env-var path. The file-read path is tested for U+2028; the env-var branch is tested only for \n and \r. The same regex applies to both paths, so the risk is very low.
  • shardReady cold start (no prior clientReady): should write the flag — not exercised.

Risk assessment: LOW

  • Readiness flag is best-effort healthcheck. Worst failure mode (stale green) is explicitly addressed and covered by test.
  • Secret validation is pure input validation with no network side-effects. The \^E in the diff viewer is a rendering artifact; the actual merged file correctly uses \u0085.
  • makeGatewayProgram is a mechanical extraction with no behavioral change to the production path.
  • Test count grows 87 -> 145, proportionate to the change surface.

Run Summary

| Field | Value |
| --- | --- |
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26151330388 |
| Cache | hit |
| Session | ses_1bb7514ebffeA5W338c67JvSXM |

Three reliability improvements that landed from PR #649's review cycle:

- readiness flag now re-arms on Discord reconnect cycles. Previously the
  flag was written once via clientReady and persisted for the lifetime of
  the process, so a permanent disconnect (rate-limit ban, gateway
  revocation, network partition past discord.js's retry budget) left the
  healthcheck green forever. Now clientReady uses on (not once) so
  reconnects re-write the flag, and shardDisconnect clears it so the
  healthcheck goes red during outages.

- readOptionalSecret rejects values containing embedded newlines. AWS
  credentials copy-pasted from a wrapped terminal can land with mid-value
  newlines that pass trimEnd but break S3 request signing later with an
  opaque AWS error. The check applies symmetrically to file-read and
  env-var paths and throws with a clear, actionable message pointing to
  the file path or env var name.

- main.ts now exposes makeGatewayProgram as an injectable factory so the
  startup ordering between setupReadinessFlag and client.login is
  observable from a unit test. A reversal of that order would silently
  reintroduce the stale-flag bug; the new test fails fast if anyone moves
  the wiring.

@marcusrbrown force-pushed the fix/gateway-reliability-cohort branch from 8022654 to 2c65f73 on May 20, 2026 08:45

Owner

@fro-bot fro-bot left a comment

Verdict: PASS

All issues from the previous review cycle have been addressed. The blocking interface change is confirmed, and the two missing test gaps (U+0085 NEL via env-var path and cold-start shardReady/shardResume) are now covered.

Blocking issues

None.

Non-blocking concerns

  1. shardDisconnect fires on transient reconnects — unchanged from prior review. Deliberate trade-off documented in the PR. Operators need to tune liveness-probe failureThreshold to avoid container churn on brief network blips. Low operational risk at this scale.

  2. makeGatewayProgram is exported as "test-only" — JSDoc note is present; the export is part of the compiled surface. Acceptable as-is.
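
To reason about the failureThreshold trade-off in item 1, a rough worst-case calculation can help. The sketch below assumes a Kubernetes-style liveness probe that runs every `periodSeconds` and restarts the container after `failureThreshold` consecutive failures; that orchestrator model, and the function names, are assumptions not stated in this PR. It ignores probe timeouts and the initial delay.

```typescript
// Worst case: a probe lands exactly at the start of the outage, then one per period.
function worstCaseFailedProbes(outageSeconds: number, periodSeconds: number): number {
  return Math.floor(outageSeconds / periodSeconds) + 1
}

// True when an outage of this length could reach the restart threshold.
function mayTriggerRestart(
  outageSeconds: number,
  periodSeconds: number,
  failureThreshold: number,
): boolean {
  return worstCaseFailedProbes(outageSeconds, periodSeconds) >= failureThreshold
}
```

Under these assumptions, a 25-second reconnect with a 10-second probe period can fail up to three consecutive probes, so a threshold of 3 may restart the container while a threshold of 4 would not.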

Missing tests

None — the updated commit adds:

  • U+0085 (NEL) via env-var path (config.test.ts:248)
  • Cold-start shardReady without prior clientReady (readiness.test.ts)
  • Cold-start shardResume without prior clientReady (readiness.test.ts)
  • The ordering-test nullish fallback was also replaced with an explicit early-throw guard

Risk assessment (LOW): likelihood of regression, security exposure, or blast radius

  • ReadinessClient interface correctly updated to on — TypeScript compilation confirmed by diff.
  • Secret validation regex uses \u0085 correctly (diff viewer \^E artifact is a JSON encoding quirk, not a code bug — verified against the actual file).
  • makeGatewayProgram is a mechanical extraction; production code path is unchanged.
  • Test count: 87 → 158 (up from the originally stated 145 — three additional cold-start and NEL tests were added in this push).

Run Summary

| Field | Value |
| --- | --- |
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26151699167 |
| Cache | hit |
| Session | ses_1bb7514ebffeA5W338c67JvSXM |

@marcusrbrown merged commit e0f0f2a into main on May 20, 2026 (10 checks passed)
@marcusrbrown deleted the fix/gateway-reliability-cohort branch on May 20, 2026 08:51
@fro-bot mentioned this pull request on May 20, 2026 (46 tasks)