Skip to content

Connection error handling#379

Open
tobyhede wants to merge 7 commits intomainfrom
connection-error-handling
Open

Connection error handling#379
tobyhede wants to merge 7 commits intomainfrom
connection-error-handling

Conversation

@tobyhede
Copy link
Contributor

@tobyhede tobyhede commented Mar 10, 2026

Acknowledgment

By submitting this pull request, I confirm that CipherStash can use, modify, copy, and redistribute this contribution, under the terms of CipherStash's choice.

Summary by CodeRabbit

  • New Features

    • Connection timeouts now default to 120 seconds with improved error messaging.
    • Enhanced connection resilience to handle slow queries and concurrent client connections more effectively.
  • Documentation

    • Updated development guide with Docker build and testing workflow clarifications.
    • Added macOS troubleshooting guidance for file descriptor limits.

@coderabbitai
Copy link

coderabbitai bot commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This pull request implements connection timeout handling for the proxy, including configuration defaults (120 seconds), error formatting with millisecond precision, appropriate error responses to clients, and comprehensive integration tests validating connection resilience under concurrent and load scenarios. Documentation and Docker setup are updated to clarify testing workflows.

Changes

Cohort / File(s) Summary
Connection Timeout Configuration
packages/cipherstash-proxy/src/config/database.rs, packages/cipherstash-proxy-integration/src/common.rs
Added default connection timeout constant (120 seconds), implemented timeout conversion logic where None defaults to 120s, Some(0) disables timeout, and custom values map to Duration. Integration test helper sets 10-second PostgreSQL client timeout.
Timeout Error Handling
packages/cipherstash-proxy/src/error.rs, packages/cipherstash-proxy/src/postgresql/error_handler.rs, packages/cipherstash-proxy/src/postgresql/messages/error_response.rs
Updated ConnectionTimeout error display to format in milliseconds, added mapping in error handler to return connection_timeout error response with SQLSTATE 57P05, and modified ErrorResponse::connection_timeout() to accept and use a custom message string.
Handler Timeout Logic
packages/cipherstash-proxy/src/postgresql/handler.rs
Added per-call timeout detection for startup and protocol reads with early error response sends, introduced separate timeout_sender for concurrent task timeout handling, and added send_timeout_error helper for best-effort timeout error transmission.
Connection Resilience Tests
packages/cipherstash-proxy-integration/src/connection_resilience.rs, packages/cipherstash-proxy-integration/src/lib.rs
New test module with four async integration tests validating: concurrent query isolation with time bounds, post-disconnect reconnection, concurrent load handling with fast query completion, and backend advisory lock behavior with proxy blocking semantics.
Docker & Build Configuration
mise.toml, tests/docker-compose.yml, DEVELOPMENT.md
Set pull_policy: never for proxy services to enforce local image use, added Docker image tag cipherstash/proxy:latest in build task, and documented required testing workflow with troubleshooting guidance for pull-related errors.
Development Setup
CLAUDE.md
Added macOS-specific troubleshooting guidance for file descriptor limit errors, instructing users to raise via ulimit -n 10240 and persist in shell profile.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • freshtonic
  • auxesis

Poem

🐰 A timeout's tale, now properly told—
With milliseconds gleaming in bold!
Tests dance with locks and concurrent cheer,
While Docker images build crystal clear.
The proxy now knows when to let go,
And sends errors swift as a rabbit's throw! 🐇

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Connection error handling' accurately reflects the main focus of this PR, which introduces comprehensive connection timeout handling and error responses throughout the proxy codebase.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch connection-error-handling

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/cipherstash-proxy-integration/src/connection_resilience.rs`:
- Around line 145-168: The readiness Notify (a_ready/a_ready_tx) can fire before
client_b is actually blocked, making the test flaky; replace the notify_one() +
fixed sleep logic with a deterministic wait that polls Postgres (e.g. pg_locks
or pg_stat_activity) on the direct connection until the proxy connection is
observed waiting for the advisory lock. Concretely: stop using
a_ready/a_ready_tx as the sole signal; after spawning b_handle (which uses
connect_with_tls(PROXY) and executes b_lock_query), poll from the direct
Postgres client (the direct connection used earlier) by running a query against
pg_locks/pg_stat_activity and check for client_b’s PID/status showing it is
waiting for the advisory lock, and only then proceed to measure connection C;
ensure the polling loop has a sensible timeout and fails the test if the wait
state never appears.

In `@packages/cipherstash-proxy/src/postgresql/handler.rs`:
- Around line 262-276: The ConnectionTimeout path must emit the ErrorResponse
not only after tokio::try_join! but also for early returns before the stream
split (the timeout occurrences at the earlier startup/auth branches referenced
in the review). Add a small helper (e.g.,
emit_connection_timeout(timeout_sender, err)) that builds
ErrorResponse::connection_timeout(err.to_string()), attempts
BytesMut::try_from(...), sends via timeout_sender, and does a
tokio::task::yield_now().await; then call that helper in the pre-split timeout
return sites (the branches that currently return Error::ConnectionTimeout around
the startup/password-exchange logic) instead of returning immediately, or move
the timeout-emission into a centralized early-return path so all
Error::ConnectionTimeouts use the same emission logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d8410d5b-183a-44fe-9ca1-63f8f6ea7dda

📥 Commits

Reviewing files that changed from the base of the PR and between 4dff16e and 806eb59.

📒 Files selected for processing (11)
  • DEVELOPMENT.md
  • mise.toml
  • packages/cipherstash-proxy-integration/src/common.rs
  • packages/cipherstash-proxy-integration/src/connection_resilience.rs
  • packages/cipherstash-proxy-integration/src/lib.rs
  • packages/cipherstash-proxy/src/config/database.rs
  • packages/cipherstash-proxy/src/error.rs
  • packages/cipherstash-proxy/src/postgresql/error_handler.rs
  • packages/cipherstash-proxy/src/postgresql/handler.rs
  • packages/cipherstash-proxy/src/postgresql/messages/error_response.rs
  • tests/docker-compose.yml

@tobyhede tobyhede force-pushed the connection-error-handling branch from 468f4f2 to eec033f Compare March 10, 2026 03:19
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
packages/cipherstash-proxy-integration/src/connection_resilience.rs (2)

16-18: ⚠️ Potential issue | 🟠 Major

Tie the advisory-lock poll to connection B, not just the lock ID.

Right now poll_query only proves that some backend is waiting on ADVISORY_LOCK_ID. Because the key is hard-coded at Line 18 and client_b is created inside the task, another concurrent suite can satisfy this poll before this test's session is actually blocked. Create client_b before tokio::spawn, capture its backend PID, and filter pg_locks by that PID (or generate a per-test lock key) so the readiness gate stays specific to this test.

In PostgreSQL, can `pg_locks` be filtered by backend `pid` to identify the session waiting on a specific advisory lock, and how are `pg_advisory_lock(bigint)` keys represented in `pg_locks`?

Also applies to: 141-158

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/cipherstash-proxy-integration/src/connection_resilience.rs` around
lines 16 - 18, The readiness poll currently checks pg_locks only by the
hard-coded ADVISORY_LOCK_ID (const ADVISORY_LOCK_ID) which can match a different
test's session; create the client used to acquire the lock (client_b) before
calling tokio::spawn so you can obtain its backend PID, then update poll_query
to filter pg_locks by that PID (or alternatively generate a unique per-test
ADVISORY_LOCK_ID) so the poll verifies that the specific client_b session is
blocked on pg_advisory_lock rather than any session using the same numeric key;
adjust the code paths that call poll_query (and the spawned task that runs the
lock) to use the captured PID or unique key accordingly.

32-33: ⚠️ Potential issue | 🟠 Major

Replace these sleep-based readiness gates with deterministic checks.

Line 33, Line 66, and Line 93 are still using fixed delays to guess when the previous state change has happened. That means these tests can pass before the slow workload is actually active, or only fail because teardown/startup took longer on CI. Please gate these assertions on an observable condition instead, similar to the pg_locks polling used later in this file, or remove the grace period entirely in the disconnect test so it exercises the immediate reconnect path.

Also applies to: 65-66, 92-93

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/cipherstash-proxy-integration/src/connection_resilience.rs` around
lines 32 - 33, The fixed-duration
tokio::time::sleep(Duration::from_millis(200)).await calls are nondeterministic
readiness gates; replace them with deterministic polling of an observable
condition (e.g., query the same pg_locks check used later in this file or poll
the proxy/connection state) so the test waits until the expected lock/worker
state is actually present before proceeding; alternatively, for the disconnect
test remove the grace period entirely and assert the immediate reconnect path by
checking the connection became active/inactive via the observable (use the same
pg_locks or connection-status query code path already in this file rather than
sleeping).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@packages/cipherstash-proxy-integration/src/connection_resilience.rs`:
- Around line 16-18: The readiness poll currently checks pg_locks only by the
hard-coded ADVISORY_LOCK_ID (const ADVISORY_LOCK_ID) which can match a different
test's session; create the client used to acquire the lock (client_b) before
calling tokio::spawn so you can obtain its backend PID, then update poll_query
to filter pg_locks by that PID (or alternatively generate a unique per-test
ADVISORY_LOCK_ID) so the poll verifies that the specific client_b session is
blocked on pg_advisory_lock rather than any session using the same numeric key;
adjust the code paths that call poll_query (and the spawned task that runs the
lock) to use the captured PID or unique key accordingly.
- Around line 32-33: The fixed-duration
tokio::time::sleep(Duration::from_millis(200)).await calls are nondeterministic
readiness gates; replace them with deterministic polling of an observable
condition (e.g., query the same pg_locks check used later in this file or poll
the proxy/connection state) so the test waits until the expected lock/worker
state is actually present before proceeding; alternatively, for the disconnect
test remove the grace period entirely and assert the immediate reconnect path by
checking the connection became active/inactive via the observable (use the same
pg_locks or connection-status query code path already in this file rather than
sleeping).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0b6eefc7-81f1-422b-bca9-856c9391b822

📥 Commits

Reviewing files that changed from the base of the PR and between 806eb59 and 5e936ad.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • packages/cipherstash-proxy-integration/src/connection_resilience.rs
  • packages/cipherstash-proxy/Cargo.toml
  • packages/cipherstash-proxy/src/config/database.rs
  • packages/cipherstash-proxy/src/postgresql/error_handler.rs
  • packages/cipherstash-proxy/src/postgresql/handler.rs

Copy link
Contributor

@freshtonic freshtonic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but the tests are failing. I've added a comment with my hypothesis as to why.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
CLAUDE.md (1)

59-63: Consider enhancing the note with additional context.

The current guidance is helpful and actionable. To make it more comprehensive, consider these optional enhancements:

  1. Explain the context: Mention that this is particularly relevant when running integration tests with many concurrent connections
  2. Shell inclusivity: Reference both Zsh (.zshrc) and Bash (.bash_profile or .bashrc) users
  3. macOS system limits: Note that on some macOS versions, the system-wide hard limit might also need adjustment (e.g., via launchctl limit maxfiles)
📝 Optional enhanced version
-> **macOS Note:** If you hit file descriptor limits during development (e.g. "Too many open files"), you may need to increase the limit:
+> **macOS Note:** If you hit file descriptor limits during development (e.g., "Too many open files" when running integration tests with many concurrent connections), you may need to increase the limit:
 > ```bash
 > ulimit -n 10240
 > ```
-> To make this persistent, add it to your shell profile (e.g. `~/.zshrc`).
+> To make this persistent, add it to your shell profile (e.g., `~/.zshrc` for Zsh or `~/.bash_profile` for Bash). On some macOS versions, you may also need to adjust system-wide limits.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CLAUDE.md` around lines 59 - 63, Update the macOS note around the "ulimit -n
10240" guidance: add a brief sentence that this is most relevant when running
integration tests or workloads with many concurrent connections, mention
persistence instructions for both Zsh and Bash (e.g., "~/.zshrc" and
"~/.bash_profile" or "~/.bashrc"), and add a short note that some macOS versions
may also require adjusting system-wide hard limits (e.g., via "launchctl limit
maxfiles") so users know to check both per-shell and system limits.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@CLAUDE.md`:
- Around line 59-63: Update the macOS note around the "ulimit -n 10240"
guidance: add a brief sentence that this is most relevant when running
integration tests or workloads with many concurrent connections, mention
persistence instructions for both Zsh and Bash (e.g., "~/.zshrc" and
"~/.bash_profile" or "~/.bashrc"), and add a short note that some macOS versions
may also require adjusting system-wide hard limits (e.g., via "launchctl limit
maxfiles") so users know to check both per-shell and system limits.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5dd5496e-25bb-475a-8079-3318c0e80ee3

📥 Commits

Reviewing files that changed from the base of the PR and between fc8544d and 7e8331d.

📒 Files selected for processing (1)
  • CLAUDE.md

…r reporting

- Add configurable connection_timeout (default 120s, 0 to disable) to DatabaseConfig
- Fix ConnectionTimeout error message to display milliseconds (was showing seconds)
- Map ConnectionTimeout to PostgreSQL error code 57P05 instead of generic 58000
- Send ErrorResponse to client on timeout via channel writer
- Add pull_policy: never to docker-compose proxy services to ensure local builds
- Add connect_timeout to integration test client config
- Add unit tests for config defaults, error message format, and error code mapping
- Add integration tests for connection isolation under load
- Document local build requirement in DEVELOPMENT.md
- Pass actual timeout duration in ErrorResponse message to client
- Document rationale for using 57P05 over 08006 error code
- Document yield_now() as best-effort flush with known limitation
- Document advisory lock test race condition timing margin
- Replace magic number 12345 with named ADVISORY_LOCK_ID constant
metrics 0.24.1 has a lifetime issue (rust-lang/rust#141402) that causes
compilation failures on recent Rust versions.
Replace Notify + fixed sleep in advisory lock test with pg_locks polling
for deterministic synchronization. Send ErrorResponse to clients for
pre-split connection timeouts instead of silent disconnect.
client_a was hardcoded to PG_PORT (5532) while the proxy may connect to
a different PostgreSQL instance (e.g. postgres-tls:5617 in TLS CI). This
meant the advisory lock was held on a different server than client_b was
targeting, so the lock never blocked and the test was flaky in CI.

Use get_database_port() to ensure client_a connects to the same
PostgreSQL instance the proxy uses.
@tobyhede tobyhede force-pushed the connection-error-handling branch from 61f92bb to 5ebf5a8 Compare March 25, 2026 06:04
@tobyhede tobyhede requested a review from freshtonic March 25, 2026 06:05
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
packages/cipherstash-proxy/src/config/database.rs (1)

78-86: Consider documenting this behavioral change.

This introduces a significant change: previously, omitting connection_timeout in config meant no timeout (unbounded reads), but now it defaults to 120 seconds. Based on the relevant code snippets, this affects protocol::read_message, startup::read_message, and backend reads across the codebase.

The implementation is correct, but this silent behavior change could surprise existing deployments. Consider:

  1. Adding a note in release/changelog
  2. Documenting connection_timeout = 0 as the explicit opt-out in config documentation
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/cipherstash-proxy/src/config/database.rs` around lines 78 - 86, The
change to connection_timeout() now treats None as a 120s default via
DEFAULT_CONNECTION_TIMEOUT_MS which alters prior behavior where omitting
connection_timeout meant no timeout; update documentation and release notes to
call this out and instruct users to set connection_timeout = 0 in their config
to explicitly opt out of timeouts, and add a short note in the changelog and any
config docs that reference protocol::read_message, startup::read_message, and
backend read behavior to prevent surprises.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@packages/cipherstash-proxy/src/config/database.rs`:
- Around line 78-86: The change to connection_timeout() now treats None as a
120s default via DEFAULT_CONNECTION_TIMEOUT_MS which alters prior behavior where
omitting connection_timeout meant no timeout; update documentation and release
notes to call this out and instruct users to set connection_timeout = 0 in their
config to explicitly opt out of timeouts, and add a short note in the changelog
and any config docs that reference protocol::read_message,
startup::read_message, and backend read behavior to prevent surprises.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ece25c30-5cab-4b82-8b53-e991b1f5aa65

📥 Commits

Reviewing files that changed from the base of the PR and between 7e8331d and 61f92bb.

📒 Files selected for processing (12)
  • CLAUDE.md
  • DEVELOPMENT.md
  • mise.toml
  • packages/cipherstash-proxy-integration/src/common.rs
  • packages/cipherstash-proxy-integration/src/connection_resilience.rs
  • packages/cipherstash-proxy-integration/src/lib.rs
  • packages/cipherstash-proxy/src/config/database.rs
  • packages/cipherstash-proxy/src/error.rs
  • packages/cipherstash-proxy/src/postgresql/error_handler.rs
  • packages/cipherstash-proxy/src/postgresql/handler.rs
  • packages/cipherstash-proxy/src/postgresql/messages/error_response.rs
  • tests/docker-compose.yml
✅ Files skipped from review due to trivial changes (4)
  • packages/cipherstash-proxy-integration/src/lib.rs
  • DEVELOPMENT.md
  • packages/cipherstash-proxy-integration/src/connection_resilience.rs
  • packages/cipherstash-proxy/src/postgresql/handler.rs
🚧 Files skipped from review as they are similar to previous changes (4)
  • CLAUDE.md
  • tests/docker-compose.yml
  • packages/cipherstash-proxy/src/error.rs
  • packages/cipherstash-proxy-integration/src/common.rs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants