Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug(share/discovery): amount of peers in discovery shrinks over time #2107

Closed
walldiss opened this issue Apr 20, 2023 · 0 comments · Fixed by #2117
Closed

bug(share/discovery): amount of peers in discovery shrinks over time #2107

walldiss opened this issue Apr 20, 2023 · 0 comments · Fixed by #2117
Assignees
Labels
area:shares Shares and samples bug Something isn't working

Comments

@walldiss
Copy link
Member

walldiss commented Apr 20, 2023

Problem

It was observed in multiple runs that amount of peers connection established by discovery is shrinking over time and could even reach only 1 peer. Currently shrex uses only peers maintained by discovery for historical syncing. Low amount of peers available from discovery means all blocksync requests are hitting few (or 1) peers and resulting in very high load on those peers. Need to investigate the root cause.

@walldiss walldiss added bug Something isn't working area:shares Shares and samples labels Apr 20, 2023
@walldiss walldiss self-assigned this Apr 20, 2023
Wondertan added a commit that referenced this issue May 3, 2023
Closes #2107 

There was a discrepancy between the amount of peers inside of the peer manager full node pool, and the limitedSet inside of discovery. It was possible for the node to get in a state where it had no more full nodes to sample from, because the peers inside the limitedSet were blocking on the connection, and never being handed off to the OnPeersUpdate callback. 

To fix this, we (@Wondertan, @walldiss  and I):
* Set a context deadline for the call to Connect with a peer. Previously, it would never timeout. This is because RoutedHost will block until the context is canceled, even though there is a peer dial timeout.
* Only add peers to the limitedSet after we have successfully connected. This turns the previous peerLimit into a "soft" limit. This prevents in-progress connections from clogging spots in the limitedSet, but also allows for more peers to end up in the limited set than the set limit (we don't throw away already connected peers).
* Removed the timer reset upon a peer disconnecting. This peer disconnection happens often, causing a significant delay in the next call to FindPeers.
* Added logs, cleaned up various comments
* Backoff
	* Adds a RemoveBackoff method to the backoffConnector, to remove cancelled connections from the cache
	* Adds Backoff method
* Cleans up limited set
* Added tests for Discovery
* Handle the case where discovered peer is already connected
* Introduces importable Discovery params
* Revisits constans

Co-authored-by: Wondertan <hlibwondertan@gmail.com>
Co-authored-by: Vlad <vlad@celestia.org>
Co-authored-by: Vlad <13818348+walldiss@users.noreply.github.com>
distractedm1nd added a commit that referenced this issue May 22, 2023
## Problem
#2107
## Overview
Could hotfix the problem by allowing peer manager to save nodes
discovered through shrexSub to be used in full nodes pool.
The solution would also benefit shrex performance by allowing
distribution of request to more peers.

---------

Co-authored-by: Ryan <ryan@celestia.org>
Co-authored-by: Hlib Kanunnikov <hlibwondertan@gmail.com>
vgonkivs pushed a commit to vgonkivs/celestia-node that referenced this issue May 22, 2023
…tiaorg#2105)

## Problem
celestiaorg#2107
## Overview
Could hotfix the problem by allowing peer manager to save nodes
discovered through shrexSub to be used in full nodes pool.
The solution would also benefit shrex performance by allowing
distribution of request to more peers.

---------

Co-authored-by: Ryan <ryan@celestia.org>
Co-authored-by: Hlib Kanunnikov <hlibwondertan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:shares Shares and samples bug Something isn't working
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants