Allow bound tunnel fallback during endpoint health lag #236

Merged
jhaynie merged 4 commits into main from fix/gravity-bound-tunnel-fallback
Apr 27, 2026
Conversation

@jhaynie
Member

@jhaynie jhaynie commented Apr 27, 2026

Summary

  • allow response traffic to use an already-bound healthy tunnel stream even when endpoint health has gone stale
  • add a regression test for the reconnect window where all endpoints look unhealthy but the bound tunnel is still alive

Problem

During reconnect, refreshEndpointHealth() can mark every endpoint unhealthy before the data path has actually failed. In that window, Hadron can still receive Ion's /_health request over an existing tunnel, but selectStreamForPacket() fails with "no healthy gravity endpoints" when sending the response, because the selector's health view is stale.

Fix

If selector health yields no endpoint, or exhausts healthy endpoints, selectStreamForPacket() now checks the existing flow binding. If the bound endpoint still has a healthy tunnel stream, it uses that stream instead of failing.
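
A minimal sketch of the control flow described above, not the actual code in gravity/grpc_client.go. The `endpoint` and `client` types, the `Healthy`/`TunnelHealthy` fields, and the flow-key map are hypothetical simplifications introduced only for illustration; only the name selectStreamForPacket comes from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// endpoint is a hypothetical stand-in for a gravity endpoint.
type endpoint struct {
	URL           string
	Healthy       bool // the selector's view of endpoint health (may lag)
	TunnelHealthy bool // whether a live tunnel stream still exists
}

// client is a hypothetical stand-in for the gRPC client.
type client struct {
	endpoints []*endpoint
	bindings  map[string]*endpoint // flow key -> bound endpoint
}

var errNoHealthyEndpoints = errors.New("no healthy gravity endpoints")

// selectStreamForPacket returns the URL of the endpoint whose tunnel stream
// should carry a packet belonging to the given flow.
func (c *client) selectStreamForPacket(flow string) (string, error) {
	// Normal path: use any endpoint the selector considers healthy.
	for _, ep := range c.endpoints {
		if ep.Healthy && ep.TunnelHealthy {
			return ep.URL, nil
		}
	}
	// Fallback: selector health yielded nothing, so check the flow binding.
	// Reuse the bound endpoint only if its tunnel stream is still alive.
	if ep, ok := c.bindings[flow]; ok && ep != nil && ep.TunnelHealthy {
		return ep.URL, nil
	}
	return "", errNoHealthyEndpoints
}

func main() {
	// Reconnect window: endpoint health is stale, but the tunnel is alive.
	ep := &endpoint{URL: "grpc://gravity-1", Healthy: false, TunnelHealthy: true}
	c := &client{
		endpoints: []*endpoint{ep},
		bindings:  map[string]*endpoint{"flow-a": ep},
	}

	url, err := c.selectStreamForPacket("flow-a") // bound flow: falls back
	fmt.Println(url, err)

	_, err = c.selectStreamForPacket("flow-b") // unbound flow: still fails
	fmt.Println(err)
}
```

In this sketch the normal selection path is unchanged; the binding is consulted only after that path has produced nothing, so flows without an existing binding still fail as before.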

Testing

  • go test ./gravity

Summary by CodeRabbit

  • New Tests

    • Added selector integration tests verifying bound-tunnel behavior when endpoint health is stale and that using the bound tunnel refreshes flow bindings.
  • Bug Fixes

    • Enhanced stream selection to reuse an existing bound tunnel for active flows when endpoint health appears stale, preventing unnecessary failovers and preserving connectivity during transitions.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 553d8618-22f4-4bd2-916a-e286128fad71

📥 Commits

Reviewing files that changed from the base of the PR and between e7a7f2c and ff1f900.

📒 Files selected for processing (2)
  • gravity/endpoint_independence_test.go
  • gravity/grpc_client.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • gravity/endpoint_independence_test.go
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (go)
🔇 Additional comments (2)
gravity/grpc_client.go (2)

5311-5348: Bound-flow fallback is wired in at the right failure points.

Falling back to the existing flow binding both when selector.Select returns nil and after healthy-endpoint attempts are exhausted matches the reconnect-window failure mode without changing the normal selection path.


5381-5405: The helper now preserves active bindings correctly.

selectBoundTunnelFallback revalidates the binding, reuses the bound endpoint only when a healthy tunnel stream still exists, and refreshes LastUsed under selector.mu before returning.
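
The three steps named in this comment (revalidate the binding, require a healthy tunnel stream, refresh LastUsed under the lock) might look roughly like the sketch below. It is an illustration under assumptions, not the merged helper: the `binding` and `selector` types are simplified stand-ins, the `streamHealthy` callback is a hypothetical replacement for the selectStreamForEndpoint call, and `demoTTL`/`demoStart` exist only for the demo.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// binding is a hypothetical, simplified flow binding.
type binding struct {
	EndpointURL string
	LastUsed    time.Time
}

// selector is a hypothetical stand-in; mu guards bindings.
type selector struct {
	mu       sync.RWMutex
	ttl      time.Duration
	bindings map[string]*binding
}

// boundTunnelFallback returns the bound endpoint URL for flow when the binding
// exists, is within ttl, and streamHealthy reports a live tunnel stream.
// On success it refreshes LastUsed so an active flow does not expire.
func (s *selector) boundTunnelFallback(flow string, now time.Time, streamHealthy func(url string) bool) (string, bool) {
	// Step 1: revalidate the binding under the read lock.
	s.mu.RLock()
	b := s.bindings[flow]
	var url string
	var fresh bool
	if b != nil {
		url = b.EndpointURL
		fresh = now.Sub(b.LastUsed) <= s.ttl
	}
	s.mu.RUnlock()

	// Step 2: require a healthy tunnel stream for the bound endpoint.
	if url == "" || !fresh || !streamHealthy(url) {
		return "", false
	}

	// Step 3: refresh LastUsed in place under the write lock, but only if
	// the same binding is still registered for this flow.
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur := s.bindings[flow]; cur == b {
		cur.LastUsed = now
	}
	return url, true
}

const demoTTL = time.Minute

var demoStart = time.Unix(1_700_000_000, 0)

func main() {
	s := &selector{ttl: demoTTL, bindings: map[string]*binding{
		"flow-a": {EndpointURL: "grpc://gravity-1", LastUsed: demoStart},
	}}
	alive := func(string) bool { return true }

	now := demoStart.Add(30 * time.Second)
	url, ok := s.boundTunnelFallback("flow-a", now, alive)
	fmt.Println(url, ok, s.bindings["flow-a"].LastUsed.Equal(now))
}
```

The health probe runs outside the lock here so that a slow check does not block other flows; whether the real helper does the same is not stated in this review.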


Walkthrough

Adds a bound-tunnel fallback to stream selection: when endpoint selection fails or endpoints are stale, the client may reuse an existing flow binding’s endpoint and refresh its LastUsed TTL. Two new tests verify using the bound tunnel when health is stale and that the binding TTL is refreshed.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Bound tunnel fallback implementation (gravity/grpc_client.go) | Adds unexported selectBoundTunnelFallback used by selectStreamForPacket to reuse an existing flow binding when selector selection is unavailable or endpoints are stale. Validates binding exists, has a non-nil endpoint, is within selector.ttl via binding.LastUsed, and that selectStreamForEndpoint returns a healthy stream; refreshes binding.LastUsed on success and logs debug messages. |
| Bound tunnel fallback tests (gravity/endpoint_independence_test.go) | Adds two integration tests: TestSelectStreamForPacket_UsesBoundTunnelWhenEndpointHealthIsStale (ensures bound traffic still uses the bound endpoint’s healthy tunnel stream when endpoint and connection health are stale) and TestSelectStreamForPacket_BoundTunnelFallbackRefreshesBindingTTL (verifies the fallback updates the binding’s LastUsed timestamp). |
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@gravity/grpc_client.go`:
- Around line 5316-5318: The warning logged inside selectBoundTunnelFallback is
too chatty on the hot path; change the g.logger.Warn call in
selectBoundTunnelFallback (and the analogous call at the other occurrence) to
either g.logger.Debug or implement per-endpoint/flow rate-limiting/once-only
emission (e.g., track a map keyed by fallbackURL or endpoint ID and only log the
first transition or use a token-bucket / time-window check) so the message is
not emitted for every packet while endpoint health is stale.
- Around line 5386-5400: The bound-tunnel fallback path doesn't refresh
binding.LastUsed so active flows can TTL out; after successfully obtaining a
stream from selectStreamForEndpoint(payload, binding.Endpoint.URL) set
binding.LastUsed = time.Now() under the selector.mu write lock (use
selector.mu.Lock()/Unlock()) so the binding's timestamp is updated in
selector.bindings before returning the stream and URL; ensure you update the
existing binding (pointer) in place and keep the successful return behavior
otherwise.
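
One way to realize the time-window option suggested in the first finding is a small per-key limiter, sketched below. Everything here is hypothetical: `logLimiter`, `newLogLimiter`, `allow`, and the demo constants are illustrative names that do not appear in the PR, and the merged change may simply have lowered the log level to debug instead.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// logLimiter (hypothetical) allows at most one log line per key per window.
type logLimiter struct {
	mu     sync.Mutex
	window time.Duration
	last   map[string]time.Time // key -> time of the last emitted log line
}

func newLogLimiter(window time.Duration) *logLimiter {
	return &logLimiter{window: window, last: make(map[string]time.Time)}
}

// allow reports whether a log line for key may be emitted at time now.
// It records now as the last emission time when it returns true.
func (l *logLimiter) allow(key string, now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if t, ok := l.last[key]; ok && now.Sub(t) < l.window {
		return false
	}
	l.last[key] = now
	return true
}

const demoWindow = 10 * time.Second

var demoStart = time.Unix(1_700_000_000, 0)

func main() {
	lim := newLogLimiter(demoWindow)
	const fallbackURL = "grpc://gravity-1" // illustrative endpoint URL

	// Simulate three packets, 4 seconds apart, hitting the fallback path.
	for i := 0; i < 3; i++ {
		now := demoStart.Add(time.Duration(i) * 4 * time.Second)
		if lim.allow(fallbackURL, now) {
			fmt.Println("warn: using bound tunnel fallback for", fallbackURL)
		} else {
			fmt.Println("suppressed repeat warning for", fallbackURL)
		}
	}
}
```

Keying by endpoint URL keeps the warning visible once per stale-health episode per endpoint without logging for every packet. The map in this sketch is never pruned, so a long-lived process with many distinct keys would need eviction.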
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d4f6ab16-44da-43d8-8108-b2687967cce0

📥 Commits

Reviewing files that changed from the base of the PR and between 72768d9 and e7a7f2c.

📒 Files selected for processing (2)
  • gravity/endpoint_independence_test.go
  • gravity/grpc_client.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
gravity/endpoint_independence_test.go (1)

1664-1699: Strong regression test for stale-health bound-tunnel fallback.

This scenario is well targeted: it forces stale endpoint health while keeping the bound tunnel alive, then verifies selectStreamForPacket still selects the bound endpoint instead of returning a “no healthy endpoints” error.

@jhaynie jhaynie merged commit 6638d0e into main Apr 27, 2026
5 checks passed
@jhaynie jhaynie deleted the fix/gravity-bound-tunnel-fallback branch April 27, 2026 23:19