Require healthy tunnel streams for endpoint routing #240

Merged
jhaynie merged 2 commits into main from fix/endpoint-healthy-tunnel-streams
Apr 29, 2026

@jhaynie jhaynie commented Apr 29, 2026

Summary

  • require a healthy tunnel stream before an endpoint is considered healthy for packet routing
  • remove the transient early-healthy reconnect window so endpoints are only routable after refreshEndpointHealth confirms control plus tunnel health
  • add regression coverage for endpoint health derivation and the reconnect path
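
A minimal sketch of the reconnect behavior described above, under stated assumptions: the `endpoint` and `signals` types and the `refresh` and `reconnect` functions are hypothetical stand-ins, not the actual code in gravity/grpc_client.go. It only illustrates the idea that the reconnect path no longer writes the healthy flag itself and leaves that decision to a single refresh step.

```go
package main

import "fmt"

// endpoint is a hypothetical stand-in for the real endpoint type.
type endpoint struct {
	healthy bool
}

// signals groups the three inputs the PR says the health gate needs.
// The field names are assumptions made for this sketch.
type signals struct {
	connReady     bool // transport-level connection health
	controlStream bool // control stream is present (non-nil)
	healthyTunnel bool // at least one healthy tunnel stream
}

// refresh is the only function that writes endpoint.healthy in this sketch.
func refresh(e *endpoint, s signals) {
	e.healthy = s.connReady && s.controlStream && s.healthyTunnel
}

// reconnect models a reconnect that restores the connection and the control
// stream while tunnel streams are still down. It never sets e.healthy
// directly; it defers to refresh.
func reconnect(e *endpoint, s *signals) {
	s.connReady = true
	s.controlStream = true
	refresh(e, *s)
}

func main() {
	e := &endpoint{}
	s := &signals{}

	reconnect(e, s)
	fmt.Println("routable after reconnect:", e.healthy)

	// Once a tunnel stream becomes healthy, the next refresh flips the flag.
	s.healthyTunnel = true
	refresh(e, *s)
	fmt.Println("routable after tunnel is up:", e.healthy)
}
```

With a single writer for the flag, there is no window in which the endpoint is marked routable before a tunnel stream exists.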

Validation

  • go test ./gravity
  • local multi-endpoint Hadron validation against 3 explicit local Ion endpoints
    • clean startup on all 3 endpoints
    • clean recovery after restarting one Ion
    • no "no healthy tunnel streams" reconnect loop observed

Summary by CodeRabbit

  • Bug Fixes

    • Endpoint health logic refined: endpoints are considered healthy only when connection health, an active control stream, and at least one healthy tunnel stream are present.
  • Tests

    • Updated and added tests to validate the enhanced endpoint health evaluation and control-stream resilience; timing in resilience test adjusted to ensure stable behavior.

coderabbitai Bot commented Apr 29, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

Endpoint health logic now requires per-connection health, a non-nil control stream, and at least one healthy tunnel stream per connection. Tests were updated to reflect this requirement, and one test added a 20ms timing delay before sending connection IDs.
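
The three-part rule can be sketched per connection as follows. This is an illustrative approximation, not the merged implementation: the `tunnelStream` and `controlStream` types and the `routableConnections` function are hypothetical, and the real code guards these slices with locks that are omitted here.

```go
package main

import "fmt"

// tunnelStream is a hypothetical, simplified view of a tunnel stream entry.
type tunnelStream struct {
	connIndex int  // index of the connection this stream belongs to
	isHealthy bool // whether the stream is currently usable
}

// controlStream is a hypothetical placeholder; only nil-ness matters here.
type controlStream struct{}

// routableConnections reports, per connection index, whether all three
// conditions hold: connection health, a non-nil control stream, and at
// least one healthy tunnel stream mapped to that connection.
func routableConnections(connectionHealth []bool, controlStreams []*controlStream, tunnels []*tunnelStream) []bool {
	healthyTunnelByConn := make([]bool, len(connectionHealth))
	for _, t := range tunnels {
		if t == nil || !t.isHealthy {
			continue
		}
		if t.connIndex >= 0 && t.connIndex < len(healthyTunnelByConn) {
			healthyTunnelByConn[t.connIndex] = true
		}
	}

	routable := make([]bool, len(connectionHealth))
	for i := range connectionHealth {
		hasControl := i < len(controlStreams) && controlStreams[i] != nil
		routable[i] = connectionHealth[i] && hasControl && healthyTunnelByConn[i]
	}
	return routable
}

func main() {
	got := routableConnections(
		[]bool{true, true, false},     // connection 2 is unhealthy
		[]*controlStream{{}, nil, {}}, // connection 1 lost its control stream
		[]*tunnelStream{
			{connIndex: 0, isHealthy: true},
			{connIndex: 1, isHealthy: true},
			{connIndex: 2, isHealthy: true},
			nil, // nil entries are skipped
		},
	)
	fmt.Println(got) // [true false false]
}
```

Only connection 0 satisfies all three conditions; a healthy tunnel stream alone is not enough for connections 1 and 2.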

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Health Evaluation Logic (gravity/grpc_client.go) | refreshEndpointHealth changed to mark endpoints healthy only when connectionHealth[connIndex] is true, a non-nil control stream exists, and at least one healthy tunnel stream maps to that connection. reconnectSingleEndpoint no longer directly sets endpoint.healthy or lastHeartbeat; those are determined by refreshEndpointHealth. |
| Health Evaluation Tests (gravity/hardening_test.go, gravity/endpoint_independence_test.go) | Test expectations, seeding, and documentation updated so endpoint health assertions require both connection/control-stream readiness and at least one healthy tunnel stream. New tests added to cover combinations of healthy/unhealthy tunnel and control streams. |
| Stream Resilience Timing (gravity/control_stream_resilience_test.go) | TestEstablishTunnelStreams_SkipsNilClients now waits 20ms after starting a goroutine before pushing machine-1 into g.connectionIDChan twice; the time package was imported. |
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
gravity/grpc_client.go (1)

1823-1847: ⚠️ Potential issue | 🟠 Major

Require a live control stream in this health gate.

Line 1846 currently treats connectionHealth[connIndex] as the control-plane signal, but elsewhere that flag is derived from conn.GetState() while handleControlStream() can nil controlStreams[streamIndex] independently. That means an endpoint can still flip back to healthy after its control stream has died, as long as the transport stays READY and one tunnel is still marked healthy.

Possible fix
 	connectionHealth := make([]bool, len(g.streamManager.connectionHealth))
 	g.streamManager.healthMu.RLock()
 	copy(connectionHealth, g.streamManager.connectionHealth)
 	g.streamManager.healthMu.RUnlock()
 
+	controlHealthy := make([]bool, len(connectionHealth))
+	g.streamManager.controlMu.RLock()
+	for i, stream := range g.streamManager.controlStreams {
+		if i < len(controlHealthy) && stream != nil {
+			controlHealthy[i] = true
+		}
+	}
+	g.streamManager.controlMu.RUnlock()
+
 	healthyTunnelByConn := make([]bool, len(connectionHealth))
 	g.streamManager.tunnelMu.RLock()
 	for _, streamInfo := range g.streamManager.tunnelStreams {
 		if streamInfo == nil || !streamInfo.isHealthy {
 			continue
@@
-			if connIndex >= 0 && connIndex < len(connectionHealth) && connectionHealth[connIndex] && healthyTunnelByConn[connIndex] {
+			if connIndex >= 0 &&
+				connIndex < len(connectionHealth) &&
+				controlHealthy[connIndex] &&
+				connectionHealth[connIndex] &&
+				healthyTunnelByConn[connIndex] {
 				healthy = true
 				break
 			}
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gravity/grpc_client.go` around lines 1823 - 1847, When deciding per-endpoint
health, also require that the control stream for that connection is
present/alive: while iterating connections (the loop that checks
connectionHealth[connIndex] and healthyTunnelByConn[connIndex]) also acquire the
stream manager's control lock and verify
g.streamManager.controlStreams[connIndex] != nil (or
controlStreams[connIndex].isHealthy if such flag exists) before treating the
endpoint as healthy; update the health gate that sets "healthy" to include this
control-stream check so an endpoint cannot be considered healthy if its control
stream has been nulled by handleControlStream().
🧹 Nitpick comments (1)
gravity/control_stream_resilience_test.go (1)

238-242: Prefer blocking channel sync over fixed sleep in this test.

time.Sleep(20 * time.Millisecond) makes the test timing-dependent. Since g.connectionIDChan is unbuffered, blocking sends already provide deterministic synchronization with establishTunnelStreams().

Proposed simplification
-	go func() {
-		time.Sleep(20 * time.Millisecond)
-		g.connectionIDChan <- "machine-1"
-		g.connectionIDChan <- "machine-1"
-	}()
+	go func() {
+		g.connectionIDChan <- "machine-1"
+		g.connectionIDChan <- "machine-1"
+	}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gravity/control_stream_resilience_test.go` around lines 238 - 242, The test
uses a fixed sleep before sending on g.connectionIDChan which is unnecessary and
timing-dependent; remove the time.Sleep(20 * time.Millisecond) line in the
goroutine so the unbuffered sends to g.connectionIDChan block and synchronize
deterministically with establishTunnelStreams(), leaving the two sends
(g.connectionIDChan <- "machine-1") intact.
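
The synchronization argument in this nitpick can be illustrated with a small standalone sketch. It assumes, as the comment states, that the channel is unbuffered; the `collect`, `sendTwice`, and `runHandoff` functions are hypothetical and do not appear in the test file. Whether the sleep in the real test served some other purpose is not something this sketch can show.

```go
package main

import (
	"fmt"
	"sync"
)

// collect receives exactly n IDs from the channel and returns them in order.
func collect(ids <-chan string, n int) []string {
	out := make([]string, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, <-ids)
	}
	return out
}

// sendTwice sends the same ID twice. On an unbuffered channel each send
// blocks until a receiver is ready, so no sleep is needed for ordering.
func sendTwice(ids chan<- string, id string) {
	ids <- id
	ids <- id
}

// runHandoff wires a sender goroutine to a receiver over an unbuffered
// channel and returns what the receiver observed.
func runHandoff() []string {
	ids := make(chan string) // unbuffered: send and receive rendezvous

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		sendTwice(ids, "machine-1")
	}()

	got := collect(ids, 2)
	wg.Wait()
	return got
}

func main() {
	fmt.Println(runHandoff()) // [machine-1 machine-1]
}
```

Because each send completes only when the matching receive happens, the sender cannot run ahead of the receiver regardless of goroutine scheduling.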
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a9b0c404-353d-4164-aca2-7339eea200f3

📥 Commits

Reviewing files that changed from the base of the PR and between 2eaf46a and 80bb17a.

📒 Files selected for processing (4)
  • gravity/control_stream_resilience_test.go
  • gravity/endpoint_independence_test.go
  • gravity/grpc_client.go
  • gravity/hardening_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: Analyze (go)
🔇 Additional comments (3)
gravity/hardening_test.go (3)

1480-1515: Good regression alignment with dual health derivation.

This update correctly validates that endpoint health depends on both control-plane health and healthy tunnel-stream presence, matching refreshEndpointHealth behavior.


1519-1557: Nice negative-case coverage for all-unhealthy control connections.

Including healthy tunnel entries while keeping all control connections unhealthy is a strong guard that tunnel presence alone cannot mark endpoints healthy.


1560-1594: Great targeted test for the new tunnel requirement.

TestRefreshEndpointHealth_RequiresHealthyTunnelStream cleanly captures the key rule: healthy control stream without at least one healthy tunnel stream must remain unroutable.
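
The cases these three tests cover can be summarized as a truth table over the three health inputs. The sketch below is not taken from gravity/hardening_test.go; `routable` and `countRoutable` are hypothetical names, and the predicate is the simplified rule stated in the walkthrough.

```go
package main

import "fmt"

// routable is the simplified rule: all three signals must be true.
func routable(connHealthy, controlPresent, tunnelHealthy bool) bool {
	return connHealthy && controlPresent && tunnelHealthy
}

// countRoutable enumerates all eight input combinations and counts how
// many of them yield a routable endpoint.
func countRoutable() int {
	n := 0
	for _, conn := range []bool{false, true} {
		for _, control := range []bool{false, true} {
			for _, tunnel := range []bool{false, true} {
				if routable(conn, control, tunnel) {
					n++
				}
			}
		}
	}
	return n
}

func main() {
	fmt.Println("conn control tunnel -> routable")
	for _, conn := range []bool{false, true} {
		for _, control := range []bool{false, true} {
			for _, tunnel := range []bool{false, true} {
				fmt.Printf("%-5t %-7t %-6t -> %t\n", conn, control, tunnel, routable(conn, control, tunnel))
			}
		}
	}
	fmt.Println("routable combinations:", countRoutable())
}
```

Exactly one of the eight combinations is routable, which is why both the negative case (healthy tunnels, unhealthy control) and the new case (healthy control, no healthy tunnel) are worth asserting separately.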

@jhaynie jhaynie merged commit 1f70f02 into main Apr 29, 2026
4 of 5 checks passed
@jhaynie jhaynie deleted the fix/endpoint-healthy-tunnel-streams branch April 29, 2026 12:47