
Propagate unreliable peers upwards from the trie accumulator to the block synchronizer #3594

Merged · 7 commits into casper-network:dev · Feb 2, 2023

Conversation

@alsrdn (Contributor) commented on Jan 20, 2023

The trie accumulator uses a snapshot of the peer list provided by the block synchronizer through the global state synchronizer in order to get the required tries.

Because the block synchronizer got no feedback about which peers didn't have the global state data, it could re-issue the global state sync request with the same peer set as the previous failed attempt. This could delay getting the global state data.

This PR adds:

  • the ability to propagate the unreliable peers (those that didn't have the requested data) upwards to the global state synchronizer and then to the block synchronizer. This gives the block synchronizer feedback that the originally provided peer list should be changed if another global state sync request is made.
  • the block synchronizer now demotes the unreliable peers received from the global state synchronizer (see the sketch below).
  • a test to verify that the peer list is changed based on the feedback from the global state sync components.
  • a slight change to how PeerQuality is handled when demoting and promoting peers; a Reliable peer is no longer set to Unknown when demoted, and an Unreliable peer is no longer set to Unknown when promoted.

These changes reduce the risk that the global state synchronizer gets stuck asking the same peers all over again.
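A minimal sketch of that feedback loop. The types and the `demote_peer` helper below are simplified stand-ins, not the PR's actual API; the real components communicate through reactor events rather than a direct call:

```rust
// Sketch only: minimal stand-ins for the node's types, to show the shape of
// the feedback loop added by this PR.
type NodeId = u64;

struct Response {
    unreliable_peers: Vec<NodeId>,
}

struct Error {
    unreliable_peers: Vec<NodeId>,
}

struct PeerList; // per-peer quality bookkeeping elided

impl PeerList {
    fn demote_peer(&mut self, _peer: &NodeId) {
        // In the real component this lowers the peer's quality so it is less
        // likely to be picked for the next global state sync attempt.
    }
}

fn handle_global_state_sync_result(peer_list: &mut PeerList, result: Result<Response, Error>) {
    // Both the success and the failure case now report which peers failed to
    // provide the requested tries, so the block synchronizer can demote them
    // before assembling the peer set for a retry.
    let unreliable_peers = match result {
        Ok(response) => response.unreliable_peers,
        Err(error) => error.unreliable_peers,
    };
    for peer in unreliable_peers {
        peer_list.demote_peer(&peer);
    }
}
```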

Further potential improvements that can be made:

  • Promote peers that successfully provided global state data; currently we only demote peers.
  • Use a reputation-based peer list (similar to the one used in the block synchronizer) in the trie accumulator rather than going through peers one by one.

Fixes: #3563

Signed-off-by: Alexandru Sardan <alexandru@casperlabs.io>
The only time a peer has `Unknown` quality is when it was just added to
the peer list and we haven't heard anything from it.
If a peer that has `Unknown` quality is promoted we now make it
`Reliable` since it means that an item we required was found on that
node.
Also if a peer with `Unknown` quality is demoted, we make it
`Unreliable`.

Signed-off-by: Alexandru Sardan <alexandru@casperlabs.io>
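A minimal sketch of those quality transitions. The commit only pins down the `Unknown` cases; the transitions marked as assumptions below (demoting `Reliable`, promoting `Unreliable`) are illustrative, since the PR only states they no longer fall back to `Unknown`:

```rust
// Sketch of the peer-quality state machine described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PeerQuality {
    Unknown,
    Reliable,
    Unreliable,
}

impl PeerQuality {
    /// Called when a peer served an item we required.
    fn promote(self) -> Self {
        match self {
            // A peer we knew nothing about served the item, so mark it reliable.
            PeerQuality::Unknown => PeerQuality::Reliable,
            PeerQuality::Reliable => PeerQuality::Reliable,
            // Assumption: an unreliable peer that serves an item becomes
            // reliable rather than being reset to Unknown.
            PeerQuality::Unreliable => PeerQuality::Reliable,
        }
    }

    /// Called when a peer failed to serve an item we required.
    fn demote(self) -> Self {
        match self {
            // A peer we knew nothing about failed, so mark it unreliable.
            PeerQuality::Unknown => PeerQuality::Unreliable,
            PeerQuality::Unreliable => PeerQuality::Unreliable,
            // Assumption: a reliable peer that fails becomes unreliable rather
            // than being reset to Unknown.
            PeerQuality::Reliable => PeerQuality::Unreliable,
        }
    }
}
```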
The trie accumulator uses a snapshot of the peer list provided by the
block synchronizer through the global state synchronizer in order to get
the required tries.
Propagate the unreliable peers (that didn't have the requested data)
upwards to the global state synchronizer and then to the block
synchronizer. This gives feedback that the original peer list provided
should be changed if another global sync request is made.
In the block synchronizer, demote the unreliable peers received from the
global state synchronizer.
Also add a test to verify that the peer list is changed based on the
feedback from the global state sync components.

Signed-off-by: Alexandru Sardan <alexandru@casperlabs.io>
@alsrdn marked this pull request as ready for review on January 21, 2023
@alsrdn self-assigned this on Jan 21, 2023
@alsrdn (Contributor, Author) commented on Jan 21, 2023:

bors try

casperlabs-bors-ng bot added a commit that referenced this pull request Jan 21, 2023
@casperlabs-bors-ng (bot):

try

Build succeeded.

Resolved review threads:
  • node/src/components/block_synchronizer.rs
  • node/src/components/block_synchronizer/peer_list.rs (outdated)
  • node/src/components/block_synchronizer/peer_list.rs (outdated)
Comment on lines +94 to +95
#[tokio::test]
async fn global_state_sync_wont_stall_with_bad_peers() {
A contributor commented:
Nice test! 👍

Fix a bug that was pointed out in code review related to the number of
peers returned when getting qualified peers.
Also add some unit tests for the peer list.

Signed-off-by: Alexandru Sardan <alexandru@casperlabs.io>
@rafal-ch (Contributor) left a comment:

LGTM 👍

@Fraser999 (Collaborator) left a comment:

One minor change requested, otherwise LGTM.

TrieAccumulator(TrieAccumulatorError),
#[error("ContractRuntime failed to put a trie into global state: {0}")]
PutTrie(engine_state::Error),
#[error("trie accumulator encountered an error while fetching a trie; unreliable peers {0:?}")]
A collaborator commented:
I think we should avoid mixing debug output into display output. There's utils::DisplayIter which could probably help here.

Same comment for the PutTrie variant below.
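For context, the suggestion is to render the peer list through `Display` rather than `Debug`. A stand-alone sketch of a `DisplayIter`-style wrapper (the node's actual helper lives in its `utils` module; the name and shape below are illustrative only):

```rust
use std::fmt::{self, Display, Formatter};

/// Illustrative stand-in for a `DisplayIter`-like helper: joins the items of a
/// cloneable iterator with ", " using their `Display` impls, so a list of
/// peers can be formatted without falling back to `{:?}`.
struct DisplayList<I>(I);

impl<I> Display for DisplayList<I>
where
    I: IntoIterator + Clone,
    I::Item: Display,
{
    fn fmt(&self, formatter: &mut Formatter<'_>) -> fmt::Result {
        let mut first = true;
        for item in self.0.clone() {
            if !first {
                write!(formatter, ", ")?;
            }
            write!(formatter, "{}", item)?;
            first = false;
        }
        Ok(())
    }
}
```

The error variants above could then interpolate their peer list through such a wrapper instead of using `{0:?}`.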

The author replied:

Fixed in 18b802d

Comment on lines +44 to +66
#[derive(Debug, Clone)]
pub(crate) struct Response {
    hash: Digest,
    unreliable_peers: Vec<NodeId>,
}

impl Response {
    pub(crate) fn new(hash: Digest, unreliable_peers: Vec<NodeId>) -> Self {
        Self {
            hash,
            unreliable_peers,
        }
    }

    pub(crate) fn hash(&self) -> &Digest {
        &self.hash
    }

    pub(crate) fn unreliable_peers(self) -> Vec<NodeId> {
        self.unreliable_peers
    }
}

A collaborator commented:
You could use a two-state enum for Success / Failure and flatten the Error into a field on the failure variant; places that return Result<Response, Error> could then be simplified to just return the enum variant (etc.).

The author replied:
Yes, we could do that, though I'm more partial to having separate types that are clearly delimited. What I think we can do is implement a trait for both Response and Error that extracts the unreliable peers, which would make the code looking at the response type a bit nicer. WDYT?
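A minimal sketch of that idea, reusing the `Response` type above; the trait name and the comment about `Error` are illustrative, not the final API:

```rust
// Illustrative only: a common accessor so callers holding either a `Response`
// or the global state synchronizer's `Error` can extract the peers to demote
// without matching on which of the two they received.
pub(crate) trait UnreliablePeers {
    fn into_unreliable_peers(self) -> Vec<NodeId>;
}

impl UnreliablePeers for Response {
    fn into_unreliable_peers(self) -> Vec<NodeId> {
        self.unreliable_peers
    }
}

// A matching impl would go on the synchronizer's `Error`, returning the peers
// recorded by the variants that carry them and an empty list otherwise.
```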

    assert!(peer_list.is_peer_reliable(&test_peer));
}

#[test]
A collaborator commented:
might as well add the unreliable_peer_remains_unreliable_if_demoted test case

The author replied:
Added in baa3461

A collaborator left a review:
a nit and a suggestion to consider, but LGTM, gj

@EdHastingsCasperAssociation (Collaborator) commented:

bors try

casperlabs-bors-ng bot added a commit that referenced this pull request Jan 30, 2023
@casperlabs-bors-ng (bot):

try

Build succeeded.

Signed-off-by: Alexandru Sardan <alexandru@casperlabs.io>
@alsrdn (Contributor, Author) commented on Feb 2, 2023:

bors try

casperlabs-bors-ng bot added a commit that referenced this pull request Feb 2, 2023
@casperlabs-bors-ng (bot):

try

Build succeeded.

@alsrdn (Contributor, Author) commented on Feb 2, 2023:

bors merge

@casperlabs-bors-ng (bot):

Build succeeded.

@casperlabs-bors-ng (bot) merged commit 749cab9 into casper-network:dev on Feb 2, 2023.
Development

Successfully merging this pull request may close these issues.

Long Running Network Joining node in catchup/sync leap for long