Allow bound tunnel fallback during endpoint health lag#236
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
🔇 Additional comments (2)
📝 WalkthroughWalkthroughAdds a bound-tunnel fallback to stream selection: when endpoint selection fails or endpoints are stale, the client may reuse an existing flow binding’s endpoint and refresh its LastUsed TTL. Two new tests verify using the bound tunnel when health is stale and that the binding TTL is refreshed. Changes
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@gravity/grpc_client.go`:
- Around line 5316-5318: The warning logged inside selectBoundTunnelFallback is
too chatty on the hot path; change the g.logger.Warn call in
selectBoundTunnelFallback (and the analogous call at the other occurrence) to
either g.logger.Debug or implement per-endpoint/flow rate-limiting/once-only
emission (e.g., track a map keyed by fallbackURL or endpoint ID and only log the
first transition or use a token-bucket / time-window check) so the message is
not emitted for every packet while endpoint health is stale.
- Around line 5386-5400: The bound-tunnel fallback path doesn't refresh
binding.LastUsed so active flows can TTL out; after successfully obtaining a
stream from selectStreamForEndpoint(payload, binding.Endpoint.URL) set
binding.LastUsed = time.Now() under the selector.mu write lock (use
selector.mu.Lock()/Unlock()) so the binding's timestamp is updated in
selector.bindings before returning the stream and URL; ensure you update the
existing binding (pointer) in place and keep the successful return behavior
otherwise.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d4f6ab16-44da-43d8-8108-b2687967cce0
📒 Files selected for processing (2)
gravity/endpoint_independence_test.gogravity/grpc_client.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build
- GitHub Check: Analyze (go)
🔇 Additional comments (1)
gravity/endpoint_independence_test.go (1)
1664-1699: Strong regression test for stale-health bound-tunnel fallback.This scenario is well targeted: it forces stale endpoint health while keeping the bound tunnel alive, then verifies
selectStreamForPacketstill selects the bound endpoint instead of returning a “no healthy endpoints” error.
Summary
Problem
During reconnect,
refreshEndpointHealth()can mark every endpoint unhealthy before the data path has actually failed. In that window, Hadron can receive Ion's/_healthrequest over an existing tunnel, butselectStreamForPacket()returnsno healthy gravity endpointsfor the response because selector health is stale.Fix
If selector health yields no endpoint, or exhausts healthy endpoints,
selectStreamForPacket()now checks the existing flow binding. If the bound endpoint still has a healthy tunnel stream, it uses that stream instead of failing.Testing
go test ./gravitySummary by CodeRabbit
New Tests
Bug Fixes