Require healthy tunnel streams for endpoint routing#240
Conversation
|
Caution Review failedPull request was closed or merged during review 📝 WalkthroughWalkthroughEndpoint health logic now requires per-connection health, a non-nil control stream, and at least one healthy tunnel stream per connection. Tests were updated to reflect this requirement, and one test added a 20ms timing delay before sending connection IDs. Changes
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Review rate limit: 3/5 reviews remaining, refill in 20 minutes and 44 seconds. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
gravity/grpc_client.go (1)
1823-1847:⚠️ Potential issue | 🟠 MajorRequire a live control stream in this health gate.
Line 1846 currently treats
connectionHealth[connIndex]as the control-plane signal, but elsewhere that flag is derived fromconn.GetState()whilehandleControlStream()can nilcontrolStreams[streamIndex]independently. That means an endpoint can still flip back to healthy after its control stream has died, as long as the transport stays READY and one tunnel is still marked healthy.Possible fix
connectionHealth := make([]bool, len(g.streamManager.connectionHealth)) g.streamManager.healthMu.RLock() copy(connectionHealth, g.streamManager.connectionHealth) g.streamManager.healthMu.RUnlock() + controlHealthy := make([]bool, len(connectionHealth)) + g.streamManager.controlMu.RLock() + for i, stream := range g.streamManager.controlStreams { + if i < len(controlHealthy) && stream != nil { + controlHealthy[i] = true + } + } + g.streamManager.controlMu.RUnlock() + healthyTunnelByConn := make([]bool, len(connectionHealth)) g.streamManager.tunnelMu.RLock() for _, streamInfo := range g.streamManager.tunnelStreams { if streamInfo == nil || !streamInfo.isHealthy { continue @@ - if connIndex >= 0 && connIndex < len(connectionHealth) && connectionHealth[connIndex] && healthyTunnelByConn[connIndex] { + if connIndex >= 0 && + connIndex < len(connectionHealth) && + controlHealthy[connIndex] && + connectionHealth[connIndex] && + healthyTunnelByConn[connIndex] { healthy = true break } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@gravity/grpc_client.go` around lines 1823 - 1847, When deciding per-endpoint health, also require that the control stream for that connection is present/alive: while iterating connections (the loop that checks connectionHealth[connIndex] and healthyTunnelByConn[connIndex]) also acquire the stream manager's control lock and verify g.streamManager.controlStreams[connIndex] != nil (or controlStreams[connIndex].isHealthy if such flag exists) before treating the endpoint as healthy; update the health gate that sets "healthy" to include this control-stream check so an endpoint cannot be considered healthy if its control stream has been nulled by handleControlStream().
🧹 Nitpick comments (1)
gravity/control_stream_resilience_test.go (1)
238-242: Prefer blocking channel sync over fixed sleep in this test.
time.Sleep(20 * time.Millisecond)makes the test timing-dependent. Sinceg.connectionIDChanis unbuffered, blocking sends already provide deterministic synchronization withestablishTunnelStreams().Proposed simplification
- go func() { - time.Sleep(20 * time.Millisecond) - g.connectionIDChan <- "machine-1" - g.connectionIDChan <- "machine-1" - }() + go func() { + g.connectionIDChan <- "machine-1" + g.connectionIDChan <- "machine-1" + }()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@gravity/control_stream_resilience_test.go` around lines 238 - 242, The test uses a fixed sleep before sending on g.connectionIDChan which is unnecessary and timing-dependent; remove the time.Sleep(20 * time.Millisecond) line in the goroutine so the unbuffered sends to g.connectionIDChan block and synchronize deterministically with establishTunnelStreams(), leaving the two sends (g.connectionIDChan <- "machine-1") intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@gravity/grpc_client.go`:
- Around line 1823-1847: When deciding per-endpoint health, also require that
the control stream for that connection is present/alive: while iterating
connections (the loop that checks connectionHealth[connIndex] and
healthyTunnelByConn[connIndex]) also acquire the stream manager's control lock
and verify g.streamManager.controlStreams[connIndex] != nil (or
controlStreams[connIndex].isHealthy if such flag exists) before treating the
endpoint as healthy; update the health gate that sets "healthy" to include this
control-stream check so an endpoint cannot be considered healthy if its control
stream has been nulled by handleControlStream().
---
Nitpick comments:
In `@gravity/control_stream_resilience_test.go`:
- Around line 238-242: The test uses a fixed sleep before sending on
g.connectionIDChan which is unnecessary and timing-dependent; remove the
time.Sleep(20 * time.Millisecond) line in the goroutine so the unbuffered sends
to g.connectionIDChan block and synchronize deterministically with
establishTunnelStreams(), leaving the two sends (g.connectionIDChan <-
"machine-1") intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a9b0c404-353d-4164-aca2-7339eea200f3
📒 Files selected for processing (4)
gravity/control_stream_resilience_test.gogravity/endpoint_independence_test.gogravity/grpc_client.gogravity/hardening_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build
- GitHub Check: Analyze (go)
🔇 Additional comments (3)
gravity/hardening_test.go (3)
1480-1515: Good regression alignment with dual health derivation.This update correctly validates that endpoint health depends on both control-plane health and healthy tunnel-stream presence, matching
refreshEndpointHealthbehavior.
1519-1557: Nice negative-case coverage for all-unhealthy control connections.Including healthy tunnel entries while keeping all control connections unhealthy is a strong guard that tunnel presence alone cannot mark endpoints healthy.
1560-1594: Great targeted test for the new tunnel requirement.
TestRefreshEndpointHealth_RequiresHealthyTunnelStreamcleanly captures the key rule: healthy control stream without at least one healthy tunnel stream must remain unroutable.
Summary
Validation
no healthy tunnel streamsreconnect loop observedSummary by CodeRabbit
Bug Fixes
Tests