fix(gravity): keep discovery enabled for DNS-derived direct IPs #246

Merged
jhaynie merged 1 commit into main from fix/gravity-discovery-direct-ip-origin
May 5, 2026

@jhaynie jhaynie (Member) commented May 5, 2026

Summary

Fixes the Gravity peer discovery regression introduced by 32ce707.

That change correctly tried to prevent peer discovery when an operator explicitly configures Gravity with direct-IP URLs, but it also disabled discovery on Hadron's production path, where DNS names are resolved into per-ion direct-IP URLs before the Gravity client is constructed.

This PR keeps the explicit/static direct-IP safeguard while allowing DNS-derived direct-IP URL sets to continue peer discovery.

Changes

  • Add shouldDisablePeerDiscoveryForResolvedURLs.
  • Disable peer discovery for:
    • explicit primary direct-IP Gravity URL
    • static direct-IP GravityURLs with no resolver
    • static direct-IP GravityURLs whose resolver returns a different set
  • Keep peer discovery enabled when DiscoveryResolveFunc returns the same direct-IP URL set, which indicates the URLs were DNS-derived.
  • Add regression tests for explicit IPs, static direct-IP lists, DNS-derived direct-IP lists, and adding a DNS-discovered direct-IP endpoint below MaxGravityPeers.
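The disable rules listed above can be summarized as a small predicate. The following is a minimal Go sketch, not the merged implementation: isDirectIPURL, allDirectIPs, sameURLSet and shouldDisablePeerDiscovery are hypothetical stand-ins for allGravityURLsAreDirectIPs, sameGravityURLSet and shouldDisablePeerDiscoveryForResolvedURLs, and treating a non-empty direct-IP primaryURL as the "explicit primary" signal is an assumption.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// isDirectIPURL reports whether the URL's host is a literal IP address.
// Hypothetical helper, not the repository's code.
func isDirectIPURL(raw string) bool {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil || u.Host == "" {
		return false
	}
	return net.ParseIP(u.Hostname()) != nil
}

// allDirectIPs is true only for a non-empty list of direct-IP URLs.
func allDirectIPs(urls []string) bool {
	if len(urls) == 0 {
		return false
	}
	for _, u := range urls {
		if !isDirectIPURL(u) {
			return false
		}
	}
	return true
}

// sameURLSet compares two URL lists as sets: whitespace is trimmed,
// empty entries and duplicates are ignored, and order does not matter.
func sameURLSet(a, b []string) bool {
	toSet := func(in []string) map[string]struct{} {
		s := make(map[string]struct{}, len(in))
		for _, v := range in {
			if v = strings.TrimSpace(v); v != "" {
				s[v] = struct{}{}
			}
		}
		return s
	}
	sa, sb := toSet(a), toSet(b)
	if len(sa) != len(sb) {
		return false
	}
	for k := range sa {
		if _, ok := sb[k]; !ok {
			return false
		}
	}
	return true
}

// shouldDisablePeerDiscovery sketches the rules from the Changes list.
func shouldDisablePeerDiscovery(urls []string, primaryURL string, resolve func() []string) bool {
	if !allDirectIPs(urls) {
		return false // hostnames present: discovery stays enabled
	}
	if primaryURL != "" && isDirectIPURL(primaryURL) {
		return true // explicit primary direct-IP override (assumed signal)
	}
	if resolve == nil {
		return true // static direct-IP list with no resolver
	}
	// Same set means the direct IPs were DNS-derived: keep discovery enabled.
	return !sameURLSet(urls, resolve())
}

func main() {
	ips := []string{"https://10.0.0.1:443", "https://10.0.0.2:443"}
	dns := func() []string { return []string{"https://10.0.0.2:443", "https://10.0.0.1:443"} }
	other := func() []string { return []string{"https://10.0.0.9:443"} }

	fmt.Println(shouldDisablePeerDiscovery(ips, "", dns))                     // false
	fmt.Println(shouldDisablePeerDiscovery(ips, "", nil))                     // true
	fmt.Println(shouldDisablePeerDiscovery(ips, "", other))                   // true
	fmt.Println(shouldDisablePeerDiscovery(ips, "https://10.0.0.1:443", dns)) // true
}
```

Checking the resolver last means the potentially slow resolver call only happens when every URL is already a direct IP and no cheaper static rule has decided the outcome.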

Validation

go test ./gravity
go test ./...
git diff --check

Summary by CodeRabbit

  • Bug Fixes

    • Peer discovery now stays enabled unless the configured primary is a direct IP, discovery resolution is unavailable, or resolved direct-IP results disagree with discovery; discovery is no longer disabled merely because resolved URLs are direct IPs.
  • Tests

    • Expanded unit tests covering explicit, static, and DNS-discovered endpoint scenarios to validate discovery start/stop behavior and resolver usage.

@coderabbitai coderabbitai Bot (Contributor) commented May 5, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Replaces the previous “all URLs are direct IPs” check with a new predicate, shouldDisablePeerDiscoveryForResolvedURLs, which disables peer discovery only for explicit direct-IP primary overrides, when the resolver is absent, or when the normalized resolved direct‑IP set differs from resolver output. Adds sameGravityURLSet, shouldResolveForPeerDiscoveryDisableCheck, and tests.

Changes

Peer Discovery Disable Decision Logic

| Layer | File(s) | Summary |
|---|---|---|
| Data / Helpers | gravity/grpc_client.go | Added sameGravityURLSet(a,b []string) bool, shouldResolveForPeerDiscoveryDisableCheck(urls []string, primaryURL string, hasResolver bool) bool, and shouldDisablePeerDiscoveryForResolvedURLs(urls []string, primaryURL string, discoveryResolveFunc func() []string) bool. |
| Core Integration | gravity/grpc_client.go | startMultiEndpoint now reads primaryURL and discoveryResolveFunc under g.mu, optionally resolves discovery URLs, and sets g.peerDiscoveryDisabled by calling shouldDisablePeerDiscoveryForResolvedURLs(...) instead of using allGravityURLsAreDirectIPs(urls). |
| Resolver Behavior | gravity/grpc_client.go | resolvePeerDiscoveryURLs now returns (nil, false) immediately when g.discoveryResolveFunc is nil to avoid nil resolver use. |
| Tests / Validation | gravity/peer_discovery_test.go | Added TestShouldDisablePeerDiscoveryForResolvedURLs and TestShouldResolveForPeerDiscoveryDisableCheck; replaced and split startup tests into TestStartPeerDiscovery_DisabledForExplicitDirectIPGravityURLs, TestStartPeerDiscovery_DisabledForStaticDirectIPGravityURLs, and TestStartPeerDiscovery_EnabledForDNSDiscoveredDirectIPGravityURLs; added TestCheckPeerDiscovery_DNSDiscoveredDirectIPAddsEndpointBelowCapacity. |
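The Core Integration and Resolver Behavior rows describe two small patterns: snapshotting configuration under the mutex before doing slower work, and refusing to call a nil resolver. A minimal sketch, assuming a hypothetical gravityClient type that holds only the fields named in the table (the real GravityClient has more state and its methods may differ):

```go
package main

import (
	"fmt"
	"sync"
)

// gravityClient is a hypothetical stand-in for GravityClient with only the
// fields named in the walkthrough.
type gravityClient struct {
	mu                    sync.Mutex
	url                   string
	discoveryResolveFunc  func() []string
	peerDiscoveryDisabled bool
}

// resolvePeerDiscoveryURLs copies the resolver under the lock and returns
// (nil, false) instead of calling a nil func.
func (g *gravityClient) resolvePeerDiscoveryURLs() ([]string, bool) {
	g.mu.Lock()
	resolve := g.discoveryResolveFunc
	g.mu.Unlock()
	if resolve == nil {
		return nil, false // no resolver configured
	}
	return resolve(), true
}

// snapshot copies the fields the disable check needs while holding g.mu,
// so the check itself (which may call the resolver) runs without the lock.
func (g *gravityClient) snapshot() (primaryURL string, resolve func() []string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.url, g.discoveryResolveFunc
}

func main() {
	g := &gravityClient{url: "https://gravity.example:443"}
	urls, ok := g.resolvePeerDiscoveryURLs()
	fmt.Println(urls == nil, ok) // true false

	g.discoveryResolveFunc = func() []string { return []string{"https://10.0.0.1:443"} }
	primary, resolve := g.snapshot()
	fmt.Println(primary, resolve != nil) // https://gravity.example:443 true
}
```

Releasing the lock before invoking the resolver keeps a slow DNS lookup from blocking other goroutines that need g.mu.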
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@jhaynie jhaynie force-pushed the fix/gravity-discovery-direct-ip-origin branch from 5b5356b to 7bff694 on May 5, 2026 20:04

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@gravity/grpc_client.go`:
- Around line 1252-1259: The startup path currently calls
g.discoveryResolveFunc() directly (in the return !sameGravityURLSet(urls,
g.discoveryResolveFunc()) line) which can block; instead invoke the resolver
through the existing timeout-wrapping path used by startMultiEndpoint so the
call cannot hang indefinitely—replace the direct g.discoveryResolveFunc() call
with the timeout-protected resolver invocation (the same helper/wrapper used by
startMultiEndpoint) and pass its result into sameGravityURLSet(urls, ...) so
discovery check uses the timed resolver.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a29c8d4c-efc9-412a-99c0-0b231fa12682

📥 Commits

Reviewing files that changed from the base of the PR and between cf968e8 and 5b5356b.

📒 Files selected for processing (2)
  • gravity/grpc_client.go
  • gravity/peer_discovery_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: Analyze (go)
🔇 Additional comments (2)
gravity/grpc_client.go (1)

1215-1239: Set comparison helper is solid

sameGravityURLSet correctly normalizes trim/dedupe and ignores order, which fits the DNS-derived direct-IP equivalence check.

gravity/peer_discovery_test.go (1)

253-316: Great regression coverage for direct-IP origin semantics

These cases collectively validate the intended distinction between explicit/static direct IP configs and DNS-derived direct-IP sets, including startup gating and below-capacity add behavior.

Also applies to: 509-540, 972-1043
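The below-capacity case exercised by TestCheckPeerDiscovery_DNSDiscoveredDirectIPAddsEndpointBelowCapacity can be illustrated with a simple guard. This is a hypothetical sketch: maybeAddEndpoint and maxGravityPeers (set to 3 here) are assumed stand-ins for the real add path and MaxGravityPeers, whose values and exact behavior are not shown in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// maxGravityPeers is a hypothetical cap standing in for MaxGravityPeers.
const maxGravityPeers = 3

// maybeAddEndpoint appends candidate only when discovery is enabled, the
// candidate is new, and the pool is still below limit. It reports whether
// the endpoint was added.
func maybeAddEndpoint(endpoints []string, candidate string, discoveryDisabled bool, limit int) ([]string, bool) {
	candidate = strings.TrimSpace(candidate)
	if discoveryDisabled || candidate == "" || len(endpoints) >= limit {
		return endpoints, false
	}
	for _, e := range endpoints {
		if e == candidate {
			return endpoints, false // already connected
		}
	}
	return append(endpoints, candidate), true
}

func main() {
	eps := []string{"https://10.0.0.1:443"}

	// DNS-derived direct IP, discovery enabled, room in the pool.
	eps, added := maybeAddEndpoint(eps, "https://10.0.0.2:443", false, maxGravityPeers)
	fmt.Println(added, len(eps)) // true 2

	// The same candidate again is ignored.
	eps, added = maybeAddEndpoint(eps, "https://10.0.0.2:443", false, maxGravityPeers)
	fmt.Println(added, len(eps)) // false 2

	// With discovery disabled (e.g. static direct-IP config), nothing is added.
	_, added = maybeAddEndpoint(eps, "https://10.0.0.3:443", true, maxGravityPeers)
	fmt.Println(added) // false
}
```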

Comment thread gravity/grpc_client.go Outdated
Comment on lines +1252 to +1259
	if g.discoveryResolveFunc == nil {
		return true
	}

	// Hadron may resolve DNS before constructing the Gravity client, which
	// means DNS-derived peers arrive here as direct IP URLs. Keep peer discovery
	// enabled when the resolver confirms this URL set came from discovery.
	return !sameGravityURLSet(urls, g.discoveryResolveFunc())

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bound resolver execution in startup path

Line 1259 calls discoveryResolveFunc() directly during startup. If the resolver blocks (DNS stall/network hang), startMultiEndpoint() can hang indefinitely before the client is usable. Please use the existing timeout path instead of a direct call.

Suggested fix
 func (g *GravityClient) shouldDisablePeerDiscoveryForResolvedURLs(urls []string) bool {
 	if !allGravityURLsAreDirectIPs(urls) {
 		return false
 	}
@@
 	if g.discoveryResolveFunc == nil {
 		return true
 	}
@@
-	return !sameGravityURLSet(urls, g.discoveryResolveFunc())
+	resolved, ok := g.resolvePeerDiscoveryURLs()
+	if !ok {
+		// Resolver timed out/canceled; avoid blocking startup.
+		// Conservative choice: keep discovery disabled when origin cannot be confirmed.
+		return true
+	}
+	return !sameGravityURLSet(urls, resolved)
 }

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
gravity/grpc_client.go (1)

844-848: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bound resolver execution in startup discovery gating.

Line 1261 directly invokes discoveryResolveFunc(). If resolver execution stalls, startMultiEndpoint() can block during startup. Use the existing timeout-bounded resolver path and handle timeout/cancel conservatively.

Suggested fix
-	primaryURL := g.url
-	discoveryResolveFunc := g.discoveryResolveFunc
+	primaryURL := g.url
 	g.mu.Unlock()
-	g.peerDiscoveryDisabled = shouldDisablePeerDiscoveryForResolvedURLs(urls, primaryURL, discoveryResolveFunc)
+	g.peerDiscoveryDisabled = g.shouldDisablePeerDiscoveryForResolvedURLs(urls, primaryURL)
-func shouldDisablePeerDiscoveryForResolvedURLs(urls []string, primaryURL string, discoveryResolveFunc func() []string) bool {
+func (g *GravityClient) shouldDisablePeerDiscoveryForResolvedURLs(urls []string, primaryURL string) bool {
 	if !allGravityURLsAreDirectIPs(urls) {
 		return false
 	}
@@
-	if discoveryResolveFunc == nil {
+	if g.discoveryResolveFunc == nil {
 		return true
 	}
@@
-	return !sameGravityURLSet(urls, discoveryResolveFunc())
+	resolved, ok := g.resolvePeerDiscoveryURLs()
+	if !ok {
+		return true
+	}
+	return !sameGravityURLSet(urls, resolved)
 }

Also applies to: 1243-1262

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@gravity/grpc_client.go` around lines 844 - 848, The code currently calls
discoveryResolveFunc() directly while holding no timeout, which can block
startMultiEndpoint() during startup; instead, use the same timeout-bounded
resolver helper already used elsewhere (the bounded resolve/cancel path) to
invoke discoveryResolveFunc with a context deadline and handle
timeout/cancellation conservatively; update the call site around
startMultiEndpoint()/shouldDisablePeerDiscoveryForResolvedURLs to call the
timeout-wrapped resolver, treat timeouts as non-fatal (assume discovery disabled
or retry later), and ensure g.peerDiscoveryDisabled is set based on the
bounded-result or explicit timeout error rather than a potentially blocking
direct call to discoveryResolveFunc.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cd7a96a7-b488-4c79-afad-fe721245563a

📥 Commits

Reviewing files that changed from the base of the PR and between 5b5356b and 7bff694.

📒 Files selected for processing (2)
  • gravity/grpc_client.go
  • gravity/peer_discovery_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • gravity/peer_discovery_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (1)
gravity/grpc_client.go (1)

1217-1241: URL-set comparison helper is a good fit.

sameGravityURLSet correctly ignores order, duplicates, and surrounding whitespace, which is exactly what this discovery-origin check needs.

@jhaynie jhaynie force-pushed the fix/gravity-discovery-direct-ip-origin branch from 7bff694 to a945d54 on May 5, 2026 20:10
@jhaynie jhaynie merged commit c1b94cd into main May 5, 2026
4 of 5 checks passed
@jhaynie jhaynie deleted the fix/gravity-discovery-direct-ip-origin branch May 5, 2026 20:17