fix: handle ResourceSyncResponse in gravity client #234
Merged
Conversation
Add restart_count field (uint32, field 14) to the HostInfo protobuf message. This allows hadron to report its systemd restart counter to ion during the session hello, enabling crash-loop detection in the debug machines endpoint.

- gravity_session.proto: added restart_count = 14
- gravity_session.pb.go: regenerated
- grpc_client.go: populate RestartCount from GravityConfig
- types.go: added RestartCount to GravityConfig
When an ion restarts (tableflip), the gRPC connection transitions to the IDLE state. In this state, gRPC does not attempt to reconnect until a new RPC call is issued or Connect() is explicitly called. The health monitor detected IDLE connections (logging 'connection N is unhealthy, state: IDLE') but never attempted recovery, causing hadrons to permanently lose tunnel connectivity to restarted ions.

This caused a 48-minute full outage in us-central: all 3 gravity endpoints went IDLE after an ion rollout, hadrons had 0 tunnel streams, and no deployments could be provisioned or serve traffic. A manual hadron restart was required to recover.

Fix: when performHealthCheck detects an IDLE connection, call conn.Connect() to force gRPC to transition from IDLE → CONNECTING → READY (or TRANSIENT_FAILURE if the server is unreachable, which has its own retry logic). Track consecutive idle counts for observability.
Two fixes from PR review:

1. addEndpoint grows connectionHealth when adding endpoints dynamically but did not grow connectionIdleCount, leaving it shorter than connectionHealth. This would cause idle recovery tracking to silently skip dynamically-added endpoints. Mirror the growth logic.

2. When a connection stays IDLE for >3 consecutive health checks, escalate beyond just conn.Connect(): trigger a full endpoint reconnection via handleEndpointDisconnection, which tears down and rebuilds all streams for that endpoint. This handles cases where the underlying gRPC connection is fundamentally broken and Connect() alone cannot recover it.
… escalation

- TestAddEndpoint_GrowsConnectionIdleCount: verifies connectionIdleCount is grown alongside connectionHealth when addEndpoint adds a new endpoint dynamically (prevents idle recovery from silently skipping dynamically-added endpoints)
- TestPerformHealthCheck_PersistentIDLEEscalates: verifies that when connectionIdleCount exceeds the escalation threshold (>3), performHealthCheck triggers handleEndpointDisconnection without panicking (pre-sets the count to 3, runs one health check, verifies count=4 and no crash from the async escalation)
…ion, lifecycle tests

P1 — Connection state transitions:
- SHUTDOWN connection: no recovery attempt, counter reset
- Nil connection: marked unhealthy without panic
- IDLE → CONNECTING transition: idle counter resets to 0

P1 — Endpoint disconnection cascade:
- Single endpoint IDLE while others healthy: targeted recovery
- All endpoints IDLE simultaneously: all get recovery (rolling restart)
- Invalid endpoint index: no panic
- Already reconnecting guard: duplicate disconnection skipped

P2 — Stream + connection interaction:
- Streams on IDLE connection: not recovered (connReady=false)
- Stream recovery blocked until connection reaches READY
- refreshEndpointHealth: derives from connectionHealth correctly
- All connections unhealthy: all endpoints marked unhealthy

P3 — Timing & lifecycle:
- Context cancellation: health monitor goroutine exits promptly
- Empty connections: no panic
- Mismatched array lengths: documents current panic behavior (connectionHealth shorter than connections — needs bounds fix)

Also found a real bug: performHealthCheck panics when the connectionHealth array is shorter than the connections array. Documented as TestPerformHealthCheck_MismatchedArrayLengths_Panics.
performHealthCheck iterated over g.connections but indexed into g.streamManager.connectionHealth without bounds checking. If connectionHealth was shorter than connections (e.g., during dynamic endpoint addition, before the arrays are grown), this caused an index out of range panic.

Fix: break the loop when i >= len(connectionHealth). Connections beyond the health array are skipped; they'll be picked up once the arrays are grown by addEndpoint.

Updated TestPerformHealthCheck_MismatchedArrayLengths_NoPanic to verify the fix (it was previously documenting the panic).
…eIDLE

The test read connectionHealth[0] immediately after performHealthCheck() returned, but the write happens in a background goroutine spawned by handleEndpointDisconnection. Wait for the async write with a polling loop under the healthMu lock.
…iliation

Add ResourceSyncRequest/Response proto messages and a SendResourceSync() client method. This replaces per-resource RouteDeploymentRequest calls during reconnection with a single atomic message containing all active resources. Gravity compares the list against its cached state and unprovisions any stale entries, preventing VIP collisions from gossip contamination.
Add a handler for the ResourceSyncResponse message type, which was previously logged as unhandled. This completes the ResourceSync flow, where ion sends a response after processing the sync request.
Summary
Add handler for the ResourceSyncResponse message type in the gravity client that was being logged as unhandled.

Changes

- Added handleResourceSyncResponse function in grpc_client.go
- Handle SessionMessage_ResourceSyncResponse in processSessionMessage

Why
The ResourceSync flow sends a request from hadron to ion, and ion sends back a response. Without this handler, the response was logged as "unhandled session message type".
This completes the ResourceSync message flow:
- hadron sends ResourceSyncRequest to all ions
- each ion sends ResourceSyncResponse back