Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(share/discovery)!: revamp Discovery #2117

Merged
merged 59 commits into from
May 3, 2023
Merged

Conversation

distractedm1nd
Copy link
Member

@distractedm1nd distractedm1nd commented Apr 21, 2023

Closes #2107

There was a discrepancy between the amount of peers inside of the peer manager full node pool, and the limitedSet inside of discovery. It was possible for the node to get in a state where it had no more full nodes to sample from, because the peers inside the limitedSet were blocking on the connection, and never being handed off to the OnPeersUpdate callback.

To fix this, we (@Wondertan, @walldiss and I):

  • Set a context deadline for the call to Connect with a peer. Previously, it would never timeout. This is because RoutedHost will block until the context is canceled, even though there is a peer dial timeout.
  • Only add peers to the limitedSet after we have successfully connected. This turns the previous peerLimit into a "soft" limit. This prevents in-progress connections from clogging spots in the limitedSet, but also allows for more peers to end up in the limited set than the set limit (we don't throw away already connected peers).
  • Removed the timer reset upon a peer disconnecting. This peer disconnection happens often, causing a significant delay in the next call to FindPeers.
  • Added logs, cleaned up various comments
  • Backoff
    • Adds a RemoveBackoff method to the backoffConnector, to remove cancelled connections from the cache
    • Adds Backoff method
  • Cleans up limited set
  • Added tests for Discovery
  • Handle the case where discovered peer is already connected
  • Introduces importable Discovery params
  • Revisits constans

I have tested this locally, and it works well with various peer limits. The first few calls to findPeers cause the highest peer count in the limited set (roughly about double), but bad connections are trimmed relatively quickly, and within a few cycles, we land with a pretty stable peer set where there are only 1-2 peer changes every interval, if any.

NOTE: Config breaking!

@distractedm1nd distractedm1nd added bug Something isn't working area:shares Shares and samples labels Apr 21, 2023
@distractedm1nd distractedm1nd marked this pull request as ready for review April 21, 2023 03:25
share/availability/discovery/discovery.go Outdated Show resolved Hide resolved
share/availability/discovery/discovery.go Outdated Show resolved Hide resolved
share/availability/discovery/discovery.go Outdated Show resolved Hide resolved
@Wondertan Wondertan added the kind:fix Attached to bug-fixing PRs label Apr 21, 2023
nodebuilder/share/config.go Outdated Show resolved Hide resolved
share/availability/discovery/discovery.go Outdated Show resolved Hide resolved
renaynay
renaynay previously approved these changes Apr 21, 2023
Copy link
Member

@renaynay renaynay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

findCancel and cancelFind - lets pick one

share/availability/discovery/discovery.go Outdated Show resolved Hide resolved
share/availability/discovery/discovery.go Outdated Show resolved Hide resolved
share/availability/discovery/set.go Outdated Show resolved Hide resolved
share/availability/discovery/discovery.go Outdated Show resolved Hide resolved
@Wondertan
Copy link
Member

I will also write tests over the weekends and maybe refactor if enough inspiration

share/availability/discovery/discovery.go Outdated Show resolved Hide resolved
share/availability/discovery/discovery.go Outdated Show resolved Hide resolved
share/availability/discovery/discovery.go Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

Codecov Report

Merging #2117 (f837321) into main (c4e994b) will decrease coverage by 0.43%.
The diff coverage is 47.00%.

@@            Coverage Diff             @@
##             main    #2117      +/-   ##
==========================================
- Coverage   55.80%   55.37%   -0.43%     
==========================================
  Files         209      212       +3     
  Lines       13289    13643     +354     
==========================================
+ Hits         7416     7555     +139     
- Misses       5131     5327     +196     
- Partials      742      761      +19     
Impacted Files Coverage Δ
api/gateway/availability.go 0.00% <0.00%> (ø)
api/gateway/header.go 0.00% <0.00%> (ø)
api/gateway/share.go 8.91% <0.00%> (-0.77%) ⬇️
das/metrics.go 45.53% <0.00%> (ø)
share/p2p/peers/metrics.go 32.48% <32.48%> (ø)
share/p2p/shrexnd/client.go 68.33% <33.33%> (+5.83%) ⬆️
share/availability/discovery/discovery.go 53.17% <43.93%> (-3.98%) ⬇️
share/eds/store.go 67.80% <50.00%> (-0.78%) ⬇️
share/availability/discovery/backoff.go 79.24% <55.55%> (-5.20%) ⬇️
share/getters/shrex.go 85.57% <57.14%> (-2.18%) ⬇️
... and 11 more

... and 5 files with indirect coverage changes

@walldiss walldiss requested a review from vgonkivs May 3, 2023 05:25
renaynay
renaynay previously approved these changes May 3, 2023
renaynay
renaynay previously approved these changes May 3, 2023
renaynay
renaynay previously approved these changes May 3, 2023
Copy link
Member

@renaynay renaynay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only one comment re panic

share/availability/discovery/discovery.go Show resolved Hide resolved
Copy link
Member

@walldiss walldiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow 58 commits! Discovery is so much cleaner now!

@Wondertan Wondertan enabled auto-merge (squash) May 3, 2023 09:13
@distractedm1nd distractedm1nd removed the request for review from vgonkivs May 3, 2023 09:25
@distractedm1nd
Copy link
Member Author

Blocked because @vgonkivs has previously requested changes

@Wondertan Wondertan merged commit 1329f8b into main May 3, 2023
16 of 17 checks passed
@Wondertan Wondertan deleted the discovery-dial-timeout branch May 3, 2023 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:shares Shares and samples bug Something isn't working kind:break! Attached to breaking PRs kind:fix Attached to bug-fixing PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug(share/discovery): amount of peers in discovery shrinks over time
6 participants