
Add tunnel-ready gravity endpoint callback #241

Merged
jhaynie merged 6 commits into main from fix/gravity-reconnect-discovered-endpoints-clean
Apr 29, 2026

Conversation

@jhaynie jhaynie commented Apr 29, 2026

Summary

  • add a new callback for gravity clients
  • fire it only after endpoint health is re-derived with a healthy control stream and at least one healthy tunnel stream
  • add source-level tests for the tunnel-ready endpoint set

Validation

  • ok github.com/agentuity/go-common/gravity (cached)
  • ok github.com/agentuity/go-common/gravity (cached)

Summary by CodeRabbit

  • New Features

    • Added an "endpoint tunnel-ready" callback invoked when tunnel streams are established and at least one healthy tunnel exists, indicating an endpoint is eligible for routing.
  • Documentation

    • Clarified that the existing "endpoint ready" signal refers to control-plane readiness (not necessarily routing-ready).
  • Tests

    • Added tests for detecting tunnel-ready endpoint health states and adjusted test setup for consistency.


coderabbitai Bot commented Apr 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f9c6a660-f4eb-4360-b409-221f262bf634

📥 Commits

Reviewing files that changed from the base of the PR and between 89e540c and 2ce8727.

📒 Files selected for processing (1)
  • gravity/grpc_client.go
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build
  • GitHub Check: Analyze (actions)
  • GitHub Check: Analyze (go)
🔇 Additional comments (7)
gravity/grpc_client.go (7)

284-286: LGTM!

The new onEndpointTunnelReady field is consistent with the existing onEndpointReady pattern and appropriately placed alongside it.


528-530: LGTM!

Config wiring follows the existing pattern for OnEndpointReady.


755-760: LGTM!

The integration correctly sequences the health refresh before firing callbacks, and places the callback invocation after g.connected = true but before background goroutines start—appropriate since the endpoint is ready at this point.


1057-1060: LGTM!

The multi-endpoint integration correctly fires callbacks after g.connected = true is set and health is refreshed, consistent with the single-endpoint path.


2828-2844: LGTM!

fireOnEndpointTunnelReady correctly follows the established pattern from fireOnEndpointReady: async invocation with panic recovery to prevent callback failures from crashing the client.
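The async-dispatch-with-recovery pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from gravity/grpc_client.go: the `client` type, the `fireTunnelReady` method name, and the `sync.WaitGroup` (used only so the demo can wait for the goroutines) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// client is a hypothetical stand-in for the gravity client.
// Only the callback field matters for this sketch.
type client struct {
	onEndpointTunnelReady func(endpointIndex int)
	wg                    sync.WaitGroup // demo-only: lets main wait for callbacks
}

// fireTunnelReady runs the user callback on its own goroutine and recovers
// from any panic so a faulty callback cannot crash the caller.
func (c *client) fireTunnelReady(endpointIndex int) {
	cb := c.onEndpointTunnelReady
	if cb == nil {
		return // no callback configured
	}
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		defer func() {
			if r := recover(); r != nil {
				fmt.Printf("tunnel-ready callback panicked for endpoint %d: %v\n", endpointIndex, r)
			}
		}()
		cb(endpointIndex)
	}()
}

func main() {
	c := &client{onEndpointTunnelReady: func(i int) {
		if i == 1 {
			panic("boom") // simulate a misbehaving consumer callback
		}
		fmt.Println("endpoint tunnel-ready:", i)
	}}
	c.fireTunnelReady(0)
	c.fireTunnelReady(1)
	c.wg.Wait()
	fmt.Println("client still running")
}
```

The recovery `defer` is registered inside the goroutine because `recover` only intercepts panics raised on the goroutine that calls it.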


2846-2899: Previous review concerns appear to be addressed.

The implementation now correctly handles both modes:

  1. Multi-endpoint mode (lines 2848-2861): Iterates g.endpoints under endpointsMu and returns indices of healthy endpoints.

  2. Legacy single-endpoint mode (lines 2862-2898): Falls back when g.endpoints is empty, checking for any connection with a healthy control stream and at least one healthy tunnel stream.

  3. Race condition fix: The healthMu.RLock() is now acquired at line 2867 before sizing/copying connectionHealth, addressing the previous concern about concurrent appends.

The lock ordering (endpointsMu → healthMu → controlMu → tunnelMu) is consistent with the rest of the codebase.
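As a rough sketch of the two modes described above, the selection logic might look like the following. The `client` and `endpoint` types are simplified stand-ins, and `legacyReady` is a hypothetical hook that replaces the real single-endpoint check (healthy control stream plus at least one healthy tunnel stream); none of these names are confirmed by the PR beyond `endpoints`, `endpointsMu`, and `healthy`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// endpoint is a simplified stand-in; the real type carries more state.
type endpoint struct {
	healthy atomic.Bool
}

// client is a hypothetical reduction of the gravity client.
type client struct {
	endpointsMu sync.RWMutex
	endpoints   []*endpoint
	// legacyReady is an assumed hook standing in for the single-endpoint
	// check of control-stream and tunnel-stream health.
	legacyReady func() bool
}

// tunnelReadyIndices returns the indices of endpoints eligible for routing.
func (c *client) tunnelReadyIndices() []int {
	c.endpointsMu.RLock()
	defer c.endpointsMu.RUnlock()

	// Legacy single-endpoint mode: no endpoint slice, so report index 0
	// when the fallback check passes.
	if len(c.endpoints) == 0 {
		if c.legacyReady != nil && c.legacyReady() {
			return []int{0}
		}
		return nil
	}

	// Multi-endpoint mode: collect every non-nil healthy endpoint.
	ready := make([]int, 0, len(c.endpoints))
	for i, ep := range c.endpoints {
		if ep != nil && ep.healthy.Load() {
			ready = append(ready, i)
		}
	}
	return ready
}

func main() {
	unhealthy, healthy := &endpoint{}, &endpoint{}
	healthy.healthy.Store(true)

	multi := &client{endpoints: []*endpoint{unhealthy, nil, healthy}}
	fmt.Println("multi-endpoint:", multi.tunnelReadyIndices())

	single := &client{legacyReady: func() bool { return true }}
	fmt.Println("single-endpoint:", single.tunnelReadyIndices())
}
```

In this sketch the fallback hook runs while the read lock on `endpointsMu` is held, which keeps `endpointsMu` first in the acquisition order.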


4561-4568: LGTM!

The reconnection path correctly:

  1. Rebuilds endpoint stream indices
  2. Refreshes endpoint health (re-deriving the health state)
  3. Only fires the tunnel-ready callback for the specific endpoint that just reconnected

This aligns with the PR objective of firing the callback "only after endpoint health is re-derived."


📝 Walkthrough

Walkthrough

Adds GravityConfig.OnEndpointTunnelReady and emits it asynchronously when endpoints become tunnel-ready (eligible for routing) during startup and reconnection. Implements tunnel-ready selection logic in the client and updates tests to validate tunnel-ready index selection and related health handling.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Client implementation: gravity/grpc_client.go | Adds OnEndpointTunnelReady wiring into the client; implements tunnelReadyEndpointIndices() selection logic; rebuilds endpoint indices and refreshes endpoint health on start/reconnect; adds fireOnEndpointTunnelReady to invoke the callback asynchronously with panic recovery; updates Start, startMultiEndpoint, and reconnectSingleEndpoint flows to emit the callback for eligible endpoints. |
| Config types & docs: gravity/types.go | Adds public OnEndpointTunnelReady func(endpointIndex int) to GravityConfig; clarifies OnEndpointReady docs to indicate it denotes control-plane readiness only (distinct from tunnel/traffic readiness). |
| Tests: gravity/hardening_test.go | Adds tests for tunnelReadyEndpointIndices() behavior (mixed tunnel/control/connection health and single-endpoint fallback); updates RequiresHealthyTunnelStream test to initialize endpoint healthy flags; minor comment/spacing adjustments. |
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
gravity/hardening_test.go (1)

1563-1565: Strengthen the new test by setting explicit pre-refresh health state.

Set both endpoints to healthy before refreshEndpointHealth() so the test verifies an actual transition to unhealthy for endpoint 0, not just default zero-value behavior.

Suggested tweak
 	ep0 := &GravityEndpoint{URL: "grpc://10.0.0.1:443"}
+	ep0.healthy.Store(true)
 	ep1 := &GravityEndpoint{URL: "grpc://10.0.0.2:443"}
+	ep1.healthy.Store(true)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@gravity/hardening_test.go` around lines 1563 - 1565, Test currently relies on
zero-value health, so set explicit healthy state on both endpoints before
calling refreshEndpointHealth(): locate the test's ep0 and ep1 GravityEndpoint
instances and set their health field/property (e.g., Healthy = true or the
appropriate health flag on GravityEndpoint) so refreshEndpointHealth() can
actually flip ep0 to unhealthy and the assertion verifies a real transition;
ensure you set the same initial healthy state on both ep0 and ep1 prior to
invoking refreshEndpointHealth().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a66992b5-4158-4dda-ac24-2adee959edf4

📥 Commits

Reviewing files that changed from the base of the PR and between 1f70f02 and 80bb17a.

📒 Files selected for processing (4)
  • gravity/control_stream_resilience_test.go
  • gravity/endpoint_independence_test.go
  • gravity/grpc_client.go
  • gravity/hardening_test.go
📜 Review details
🔇 Additional comments (4)
gravity/grpc_client.go (1)

1823-1846: LGTM — Correctly gates endpoint health on tunnel stream availability.

The implementation properly derives endpoint health from two conditions:

  1. The gRPC connection is healthy (connectionHealth[connIndex])
  2. At least one healthy tunnel stream exists for that connection (healthyTunnelByConn[connIndex])

The snapshot-under-lock approach minimizes lock contention while maintaining consistency. The bounds check on line 1846 is sufficient since healthyTunnelByConn is created with len(connectionHealth), guaranteeing equal lengths.
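The conjunction of the two conditions above can be illustrated with a small sketch. It assumes, purely for illustration, that endpoint i maps to connection i and that healthy tunnel streams are reported as a list of connection indices; the function name `deriveEndpointHealth` is hypothetical, and the real code works on snapshots taken under its own locks.

```go
package main

import "fmt"

// deriveEndpointHealth marks endpoint i healthy only when connection i is
// healthy AND at least one healthy tunnel stream exists on that connection.
// healthyTunnelConns holds the connection index of each healthy tunnel stream.
func deriveEndpointHealth(connectionHealth []bool, healthyTunnelConns []int) []bool {
	// Sized from connectionHealth, so both slices have equal length and a
	// single bounds check against one of them is enough.
	healthyTunnelByConn := make([]bool, len(connectionHealth))
	for _, connIndex := range healthyTunnelConns {
		if connIndex >= 0 && connIndex < len(healthyTunnelByConn) {
			healthyTunnelByConn[connIndex] = true
		}
	}

	health := make([]bool, len(connectionHealth))
	for i := range connectionHealth {
		health[i] = connectionHealth[i] && healthyTunnelByConn[i]
	}
	return health
}

func main() {
	// Connection 0: healthy but no tunnel. Connection 1: healthy with a tunnel.
	// Connection 2: unhealthy even though a tunnel stream is reported.
	connectionHealth := []bool{true, true, false}
	healthyTunnelConns := []int{1, 2}
	fmt.Println(deriveEndpointHealth(connectionHealth, healthyTunnelConns))
	// prints: [false true false]
}
```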

gravity/endpoint_independence_test.go (1)

2268-2271: Comment update is accurate and helpful.

The revised note correctly reflects the new endpoint-health derivation contract and should reduce future test confusion.

gravity/control_stream_resilience_test.go (1)

7-7: Timing fix is reasonable for the handshake sequencing race.

This change aligns the test with the drain/hello ordering and makes the tunnel-establishment assertions meaningful.

Also applies to: 239-239

gravity/hardening_test.go (1)

1480-1482: Great coverage update for the new health derivation rules.

These tests now validate the connection-health + healthy-tunnel conjunction and close a key behavior gap.

Also applies to: 1501-1506, 1519-1520, 1543-1549, 1560-1594


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@gravity/grpc_client.go`:
- Around line 2846-2860: The tunnelReadyEndpointIndices function only inspects
g.endpoints, so in single-URL mode where g.endpoints is empty it never reports
readiness; update tunnelReadyEndpointIndices to also check the single-endpoint
field g.endpoint (in addition to g.endpoints) and, when that endpoint is non-nil
and its healthy.Load() returns true, include it in the returned ready indices so
the startup/reconnect logic (and OnEndpointTunnelReady callbacks) see a ready
endpoint; make this change inside tunnelReadyEndpointIndices while preserving
the existing endpointsMu lock and the ready slice semantics so callers of
tunnelReadyEndpointIndices (and code that triggers OnEndpointTunnelReady)
receive a readiness entry for the single-endpoint client.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 62a258d4-0e26-4f9e-860c-208003859c32

📥 Commits

Reviewing files that changed from the base of the PR and between 80bb17a and cb5f939.

📒 Files selected for processing (3)
  • gravity/grpc_client.go
  • gravity/hardening_test.go
  • gravity/types.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • gravity/hardening_test.go
📜 Review details
🔇 Additional comments (2)
gravity/types.go (1)

89-101: Good callback-stage separation in config API.

The split between OnEndpointReady (control-plane readiness) and OnEndpointTunnelReady (routing readiness) is clear and improves integration safety for consumers.
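From a consumer's point of view, the two-stage split might be used along these lines. This is a hedged sketch: `Config` is a local stand-in for GravityConfig, the `OnEndpointReady` signature is assumed to match the documented `OnEndpointTunnelReady func(endpointIndex int)`, and `routingTable` is a hypothetical consumer-side structure.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Config is a local stand-in for GravityConfig with only the two callbacks.
// The OnEndpointReady signature is an assumption for this sketch.
type Config struct {
	OnEndpointReady       func(endpointIndex int) // control-plane readiness
	OnEndpointTunnelReady func(endpointIndex int) // routing readiness
}

// routingTable is a hypothetical consumer structure. It is mutex-guarded
// because the client invokes callbacks asynchronously.
type routingTable struct {
	mu       sync.Mutex
	routable map[int]bool
}

func newRoutingTable() *routingTable {
	return &routingTable{routable: make(map[int]bool)}
}

// markRoutable records that traffic may be sent to the endpoint.
func (r *routingTable) markRoutable(endpointIndex int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.routable[endpointIndex] = true
}

// indices returns the routable endpoint indices in ascending order.
func (r *routingTable) indices() []int {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make([]int, 0, len(r.routable))
	for i := range r.routable {
		out = append(out, i)
	}
	sort.Ints(out)
	return out
}

// newConfig logs control-plane readiness but only routes on tunnel readiness.
func newConfig(rt *routingTable) Config {
	return Config{
		OnEndpointReady: func(i int) {
			fmt.Printf("endpoint %d: control plane ready, not yet routable\n", i)
		},
		OnEndpointTunnelReady: rt.markRoutable,
	}
}

func main() {
	rt := newRoutingTable()
	cfg := newConfig(rt)
	cfg.OnEndpointReady(0)
	cfg.OnEndpointReady(1)
	cfg.OnEndpointTunnelReady(1)
	fmt.Println("routable endpoints:", rt.indices())
}
```

Keeping routing decisions in the tunnel-ready handler avoids sending traffic to an endpoint whose control stream is up but whose tunnel streams are not.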

gravity/grpc_client.go (1)

2828-2844: Callback invocation wrapper looks robust.

Async dispatch plus panic recovery around OnEndpointTunnelReady is a solid safety pattern for user-provided callbacks.

Comment thread gravity/grpc_client.go
Comment on lines +2846 to +2860
func (g *GravityClient) tunnelReadyEndpointIndices() []int {
	g.endpointsMu.RLock()
	defer g.endpointsMu.RUnlock()

	ready := make([]int, 0, len(g.endpoints))
	for i, ep := range g.endpoints {
		if ep == nil {
			continue
		}
		if ep.healthy.Load() {
			ready = append(ready, i)
		}
	}
	return ready
}

⚠️ Potential issue | 🟠 Major

Single-endpoint mode currently suppresses OnEndpointTunnelReady.

Lines 2847-2859 only inspect g.endpoints. In the single-URL path, g.endpoints is empty, so the startup callbacks at line 757 and the reconnection callback checks never see a ready endpoint. As a result, OnEndpointTunnelReady does not fire for single-endpoint clients.

💡 Proposed fix
 func (g *GravityClient) tunnelReadyEndpointIndices() []int {
 	g.endpointsMu.RLock()
+	if len(g.endpoints) == 0 {
+		g.endpointsMu.RUnlock()
+		// Single-endpoint mode fallback: endpoint index 0 is tunnel-ready
+		// when the client has a healthy control stream and at least one
+		// healthy tunnel stream.
+		if g.hasHealthyEndpoint() {
+			return []int{0}
+		}
+		return nil
+	}
 	defer g.endpointsMu.RUnlock()
 
 	ready := make([]int, 0, len(g.endpoints))
 	for i, ep := range g.endpoints {
 		if ep == nil {
 			continue
 		}
 		if ep.healthy.Load() {
 			ready = append(ready, i)
 		}
 	}
 	return ready
 }


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@gravity/grpc_client.go`:
- Around line 2867-2870: The code reads len(g.streamManager.connectionHealth)
before acquiring g.streamManager.healthMu, which can race with concurrent
appends; fix by acquiring the read lock (g.streamManager.healthMu.RLock())
before computing the length and allocating connectionHealth and keep the lock
held while calling copy(connectionHealth, g.streamManager.connectionHealth),
then RUnlock(); i.e., move the RLock() to precede the make(...) and perform
make/len/copy inside the locked region for g.streamManager.connectionHealth.
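The requested ordering can be sketched as below: the read lock is taken before the length is read, and it stays held through the copy. The `streamManager` type here is a reduced stand-in with only the two fields named in the comment, and the method names `snapshotConnectionHealth` and `addConnection` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// streamManager is a reduced stand-in holding only the fields discussed.
type streamManager struct {
	healthMu         sync.RWMutex
	connectionHealth []bool
}

// snapshotConnectionHealth takes the read lock BEFORE reading len(), so a
// concurrent append cannot change the slice between sizing and copying.
func (s *streamManager) snapshotConnectionHealth() []bool {
	s.healthMu.RLock()
	defer s.healthMu.RUnlock()

	snapshot := make([]bool, len(s.connectionHealth))
	copy(snapshot, s.connectionHealth)
	return snapshot
}

// addConnection appends under the write lock, mirroring the concurrent
// appends the review comment is concerned about.
func (s *streamManager) addConnection(healthy bool) {
	s.healthMu.Lock()
	defer s.healthMu.Unlock()
	s.connectionHealth = append(s.connectionHealth, healthy)
}

func main() {
	sm := &streamManager{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(healthy bool) {
			defer wg.Done()
			sm.addConnection(healthy)
			_ = sm.snapshotConnectionHealth() // safe to call concurrently
		}(i%2 == 0)
	}
	wg.Wait()
	fmt.Println("connections tracked:", len(sm.snapshotConnectionHealth()))
}
```

Callers can then inspect the returned copy without holding `healthMu`, which keeps the locked region short.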

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b74facc7-fc2d-493e-ba01-9fce385dbb86

📥 Commits

Reviewing files that changed from the base of the PR and between cb5f939 and 89e540c.

📒 Files selected for processing (2)
  • gravity/grpc_client.go
  • gravity/hardening_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • gravity/hardening_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (2)
gravity/grpc_client.go (2)

2828-2844: Panic-safe async callback wrapper looks good.

fireOnEndpointTunnelReady mirrors the existing callback safety pattern and prevents callback panics from destabilizing the client.


755-759: Good callback emission points for startup and endpoint reconnection.

These hooks correctly gate OnEndpointTunnelReady behind derived readiness and cover both initial connect and reconnect paths.

Also applies to: 1057-1059, 4563-4568

Comment thread gravity/grpc_client.go Outdated
@jhaynie jhaynie merged commit dff43ad into main Apr 29, 2026
5 checks passed
@jhaynie jhaynie deleted the fix/gravity-reconnect-discovered-endpoints-clean branch April 29, 2026 16:21