Skip to content

fix(redis): attach client error handler so a dropped connection doesn't crash the process#999

Open
JohnMcLear wants to merge 1 commit into
mainfrom
fix/redis-client-error-handler
Open

fix(redis): attach client error handler so a dropped connection doesn't crash the process#999
JohnMcLear wants to merge 1 commit into
mainfrom
fix/redis-client-error-handler

Conversation

@JohnMcLear
Copy link
Copy Markdown
Member

Problem

The redis analogue of ether/etherpad#7878 (postgres, PR #998). node-redis re-emits connection/socket errors — an idle drop, server restart, failover, or a proxy/LB/firewall closing the connection — on the client as an EventEmitter 'error'. With no listener, Node treats it as an uncaught exception and terminates the host process, and node-redis additionally won't start reconnecting. The official node-redis error-handling docs require attaching this listener; redis_db.ts never did, so any of those events crashed Etherpad.

Fix

Attach client.on('error', …) in init() before connect() (so connect-time failures are covered too): log the error and let node-redis transparently reconnect.

Test (real, not a wiring assertion)

test/redis/connection-drop.spec.ts starts a redis container, warms the client, then kills its connection server-side with CLIENT KILL TYPE normal — exactly what a failover or a proxy idle-timeout does — and asserts:

  1. the handler caught the error (proves the fix ran), and
  2. the client recovered (a subsequent round-trip works).

Verified locally that this fails without the handler (node-redis can't reconnect, so the recovery query hangs to a timeout) and passes with it. redis is already in the container test matrix, so this runs in CI under pnpm run test redis (the new spec uses a dynamic mapped port, so it doesn't collide with the existing fixed-6379 redis spec).

Verification

  • pnpm run lint / format:check / ts-check / build — clean
  • vitest run test/redis/connection-drop.spec.ts — passes with fix, fails (timeout) without

Context

Part of a per-driver sweep prompted by #998. Each affected backend gets its own PR (no combined PR). postgres = #998; mysql to follow; mongodb/cassandra/elasticsearch/couch and the local stores are not affected; rethink/surrealdb still under review.

🤖 Generated with Claude Code

…'t crash the process

node-redis re-emits connection/socket errors (idle drop, server restart,
failover, a proxy closing the connection) on the client as an
EventEmitter 'error'. With no listener Node treats it as uncaught and
terminates the host process — and node-redis also won't begin
reconnecting. The redis driver never attached one, so any of those
events crashed Etherpad. This is the redis analogue of the postgres bug
in ether/etherpad#7878; the official node-redis docs require this
listener.

- Attach `client.on('error', …)` before connect() (so connect-time
  failures are covered too): log the error and let node-redis reconnect.
- Add test/redis/connection-drop.spec.ts: warm the client, kill its
  connection server-side with CLIENT KILL (what a failover / proxy idle
  timeout does), and assert the handler caught it and the client
  recovered. Verified locally that this FAILS without the handler (the
  client cannot reconnect and the recovery query hangs to a timeout) and
  PASSES with it. redis is already in the container test matrix, so this
  runs in CI under `pnpm run test redis`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented Jun 1, 2026

Review Summary by Qodo

(Agentic_describe updated until commit e8d9ce7)

Fix redis client error handling to prevent process crashes

🐞 Bug fix

Grey Divider

Walkthroughs

Description
• Attach error handler to redis client before connect to prevent process crash
• Log connection errors and allow transparent reconnection on socket failures
• Add integration test verifying recovery from server-side connection kills
Diagram
flowchart LR
  A["Redis Connection Error"] -->|"Without Handler"| B["Process Crash"]
  A -->|"With Handler"| C["Error Logged"]
  C --> D["Client Reconnects"]
  D --> E["Service Continues"]

Loading

Grey Divider

File Changes

1. databases/redis_db.ts 🐞 Bug fix +11/-0

Add redis client error handler for reconnection

• Attach client.on('error', ...) listener before connect() call
• Log connection/socket errors with stack trace for debugging
• Allows node-redis to transparently reconnect on idle drops, restarts, or failovers
• Prevents uncaught exception from terminating the host process

databases/redis_db.ts


2. test/redis/connection-drop.spec.ts 🧪 Tests +74/-0

Add redis connection drop recovery test

• New integration test using testcontainers to spawn redis dynamically
• Warms client connection, kills it server-side with CLIENT KILL TYPE normal
• Asserts error handler logged the connection drop event
• Verifies client successfully recovered and resumed operations

test/redis/connection-drop.spec.ts


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented Jun 1, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0)

Grey Divider


Remediation recommended

1. Flaky reconnect wait 🐞 Bug ☼ Reliability
Description
connection-drop.spec.ts assumes the redis client will emit the error and fully reconnect within 1
second, then immediately asserts and performs a round-trip. This hard-coded timing can
intermittently fail under slower CI or load even when the reconnect behavior is correct.
Code

test/redis/connection-drop.spec.ts[R61-69]

Evidence
The test uses a fixed 1-second sleep and then immediately asserts on loggedErrors and performs a
set/get round-trip, which is a timing assumption rather than waiting for a condition to become
true.

test/redis/connection-drop.spec.ts[61-69]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`test/redis/connection-drop.spec.ts` waits a fixed 1s after `CLIENT KILL` and then immediately asserts the error was logged and performs commands. This creates a race: the reconnect may legitimately take longer than 1 second, causing test flakiness.

### Issue Context
The test is validating reconnect behavior after a dropped connection. Reconnect timing is inherently variable, so the test should wait on an observed condition (error logged and/or successful round-trip) with a bounded timeout rather than sleeping a constant duration.

### Fix Focus Areas
- test/redis/connection-drop.spec.ts[61-69]

### Suggested change
Replace the `setTimeout(1000)` with a bounded poll/retry, for example:
- Use `await expect.poll(() => loggedErrors.some(...), { timeout: 10000 }).toBe(true)` to wait until the error handler actually ran.
- Then retry `db.get/set` (or poll a `ping`/round-trip) until it succeeds, again with a timeout.

This keeps the test deterministic (waits for the condition) while still bounded (won’t hang indefinitely).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

@qodo-free-for-open-source-projects
Copy link
Copy Markdown

qodo-free-for-open-source-projects Bot commented Jun 1, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0)

Grey Divider


Remediation recommended

1. Admin client cleanup missing 🐞 Bug ☼ Reliability
Description
The test creates an admin Redis client but does not close it in a finally block, so a failure during
connect/CLIENT KILL/quit can leave an open socket that may hang the test runner or interfere with
cleanup. Only db.close() is guaranteed to run in the outer finally.
Code

test/redis/connection-drop.spec.ts[R56-60]

Evidence
The admin client is created and used, but its cleanup is not in a finally; the only finally block
closes db, not admin.

test/redis/connection-drop.spec.ts[50-72]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`test/redis/connection-drop.spec.ts` creates an `admin` Redis client to execute `CLIENT KILL`, but it is not protected by `try/finally`. If `connect()` or `sendCommand()` throws, `admin.quit()` will be skipped and the socket can remain open.

## Issue Context
This can cause intermittent CI hangs/open-handle warnings and makes failures harder to diagnose.

## Fix Focus Areas
- test/redis/connection-drop.spec.ts[54-60]

## Suggested change
Wrap the admin client lifecycle in its own `try/finally`, ensuring it is always closed/disconnected, e.g.:
- create `admin`
- `await admin.connect()`
- `try { await admin.sendCommand(...) } finally { await admin.quit().catch(() => admin.disconnect?.()) }`
(Use whatever shutdown method is appropriate for the redis client in this codebase.)

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Advisory comments

2. Fixed reconnect timing assumption 🐞 Bug ☼ Reliability
Description
The test waits a fixed 1s after killing the connection and then immediately asserts recovery with a
set/get, which can be timing-sensitive if reconnection takes longer under load. This can cause
intermittent test failures even though the production fix is correct.
Code

test/redis/connection-drop.spec.ts[R61-69]

Evidence
A fixed 1-second delay is used to wait for reconnection, followed by immediate assertions without
any retry/polling.

test/redis/connection-drop.spec.ts[61-69]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The test uses `setTimeout(..., 1000)` as a hard-coded delay to allow the error to fire and the client to reconnect, then performs a round-trip. Fixed sleeps are inherently race-prone.

## Issue Context
Even with a generous delay, CI variability can occasionally exceed the assumed window. A polling/retry approach makes the test deterministic: it passes as soon as recovery happens and fails only after a real timeout.

## Fix Focus Areas
- test/redis/connection-drop.spec.ts[61-69]

## Suggested change
Replace the fixed sleep with a bounded poll (up to the test timeout) that waits for either:
- the error log to appear, and/or
- a successful redis round-trip (e.g., `db.get()` or `db.set()` succeeds).

Example approach:
- Loop until deadline (e.g., 10–30s): try `await db.get("dropkey")` / `await db.set(...)`; break on success; otherwise `await setTimeout(200)` and retry.
- Then assert `loggedErrors` contains the expected message and perform the final set/get assertions.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant