fix(gateway): harden readiness lifecycle and secret validation#655
Conversation
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: CONDITIONAL
Can merge after addressing the one blocking issue. The overall design is sound and test coverage is thorough.
Blocking issues
readiness.ts: Verify ReadinessClient interface was updated from once to on
The base-branch file (pre-merge) still declares once, but the new setupReadinessFlag calls client.on(...) for four events. If the interface was not updated, TypeScript rejects the build. The diff shows the interface being replaced with the correct on signature — if CI is green this is already resolved.
Expected post-merge interface:
export interface ReadinessClient {
on: (
event: 'clientReady' | 'shardReady' | 'shardResume' | 'shardDisconnect',
listener: (...args: unknown[]) => void,
) => this
}Non-blocking concerns
-
shardDisconnectfires on transient reconnects. discord.js emitsshardDisconnecton every reconnect cycle, including short network blips. The healthcheck flips red during each transient disconnect, which can trigger container restarts if the orchestrator threshold is tight. Deliberate trade-off per the PR description — operators should tune the liveness-probefailureThresholdaccordingly. -
No cold-start test for
shardReady/shardResumewithout a priorclientReady. All transition tests begin withfireClientReady(). A cold start whereshardReadyfires first (possible in multi-shard setups) is untested. -
makeGatewayProgramis exported as "for unit testing only" but is part of the compiled package surface. The JSDoc comment is present; consider whether a test-helper barrel or naming convention makes intent clearer long-term. Minor. -
Ordering-test nullish fallbacks are dead code. The
toBeDefined()assertions aboveexpect(setupOrder ?? Infinity).toBeLessThan(loginOrder ?? -Infinity)guarantee the values are numbers; the??guards are unreachable. No bug, just slightly confusing.
Missing tests
readOptionalSecretU+0085 (NEL) via env-var path. File-read path is tested for U+2028; env-var branch only tested for\nand\r. Regex is symmetric so risk is very low.shardReadycold start (no priorclientReady): should write the flag — not exercised.
Risk assessment: LOW
- Readiness flag is best-effort healthcheck. Worst failure mode (stale green) is explicitly addressed and covered by test.
- Secret validation is pure input validation with no network side-effects. The
\^Ein the diff viewer is a rendering artifact; the actual merged file correctly uses\u0085. makeGatewayProgramis a mechanical extraction with no behavioral change to the production path.- Test count grows 87 -> 145, proportionate to the change surface.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26151330388 |
| Cache | hit |
| Session | ses_1bb7514ebffeA5W338c67JvSXM |
Three reliability improvements that landed from PR #649's review cycle: - readiness flag now re-arms on Discord reconnect cycles. Previously the flag was written once via clientReady and persisted for the lifetime of the process, so a permanent disconnect (rate-limit ban, gateway revocation, network partition past discord.js's retry budget) left the healthcheck green forever. Now clientReady uses on (not once) so reconnects re-write the flag, and shardDisconnect clears it so the healthcheck goes red during outages. - readOptionalSecret rejects values containing embedded newlines. AWS credentials copy-pasted from a wrapped terminal can land with mid-value newlines that pass trimEnd but break S3 request signing later with an opaque AWS error. The check applies symmetrically to file-read and env-var paths and throws with a clear, actionable message pointing to the file path or env var name. - main.ts now exposes makeGatewayProgram as an injectable factory so the startup ordering between setupReadinessFlag and client.login is observable from a unit test. A reversal of that order would silently reintroduce the stale-flag bug; the new test fails fast if anyone moves the wiring.
8022654 to
2c65f73
Compare
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: PASS
All issues from the previous review cycle have been addressed. The blocking interface change is confirmed, and the two missing test gaps (U+0085 NEL via env-var path and cold-start shardReady/shardResume) are now covered.
Blocking issues
None.
Non-blocking concerns
-
shardDisconnectfires on transient reconnects — unchanged from prior review. Deliberate trade-off documented in the PR. Operators need to tune liveness-probefailureThresholdto avoid container churn on brief network blips. Low operational risk at this scale. -
makeGatewayProgramis exported as "test-only" — JSDoc note is present; the export is part of the compiled surface. Acceptable as-is.
Missing tests
None — the updated commit adds:
- U+0085 (NEL) via env-var path (
config.test.ts:248) - Cold-start
shardReadywithout priorclientReady(readiness.test.ts) - Cold-start
shardResumewithout priorclientReady(readiness.test.ts) - The ordering-test nullish fallback was also replaced with an explicit early-throw guard
Risk assessment (LOW): likelihood of regression, security exposure, or blast radius
ReadinessClientinterface correctly updated toon— TypeScript compilation confirmed by diff.- Secret validation regex uses
\u0085correctly (diff viewer\^Eartifact is a JSON encoding quirk, not a code bug — verified against the actual file). makeGatewayProgramis a mechanical extraction; production code path is unchanged.- Test count: 87 → 158 (up from the originally stated 145 — three additional cold-start and NEL tests were added in this push).
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26151699167 |
| Cache | hit |
| Session | ses_1bb7514ebffeA5W338c67JvSXM |
Three reliability improvements that came out of PR #649's review cycle. They share the same scope (gateway lifecycle and config validation) and ship as one PR.
Readiness flag now re-arms across the full Discord reconnect lifecycle. Previously the flag was written once on
clientReadyand persisted for the rest of the process. After a permanent disconnect — rate-limit ban, gateway revocation, network partition past discord.js's retry budget — the healthcheck stayed green forever. discord.js v14 also only emitsclientReadyonce per session; reconnects emitshardResume(session resumed) orshardReady(new session) instead. The fix listens on all four events:clientReady,shardReady, andshardResumewrite the flag;shardDisconnectclears it. Each log line includes anoriginfield so operators can tell which event re-armed the healthcheck.Errors during
shardDisconnectcleanup now log and continue instead of rethrowing. Crashing the gateway during the very disconnect we're trying to report is strictly worse than leaving a stale flag for one cycle.readOptionalSecretrejects values containing line-breaking characters. AWS credentials copy-pasted from a wrapped terminal can carry mid-value newlines that passtrimEndbut break S3 request signing later with an opaque AWS signature error. The check applies symmetrically to file-read and env-var paths and covers\r,\n, U+0085 (NEL), U+2028 (LS), and U+2029 (PS). Errors point operators at the file path or env var name.main.tsexposesmakeGatewayProgramas an injectable factory so startup ordering is observable from a unit test. A reversal ofsetupReadinessFlagandclient.loginwould silently reintroduce the stale-flag bug; the new ordering test fails fast if anyone moves the wiring.Gateway tests grow from 87 to 145 across the three areas.