state-sync: don't request trie nodes to peers known not to have them #1097
Conversation
p2p/state.py (Outdated)
```diff
@@ -71,7 +71,8 @@ def __init__(self,
         self.root_hash = root_hash
         self.scheduler = StateSync(root_hash, account_db)
         self._handler = PeerRequestHandler(self.chaindb, self.logger, self.cancel_token)
-        self._peers_with_pending_requests: Dict[ETHPeer, float] = {}
+        self._requested_nodes: Dict[ETHPeer, Tuple[float, List[Hash32]]] = {}
+        self._peer_missing_nodes: Dict[ETHPeer, List[Hash32]] = collections.defaultdict(list)
```
Looks like the way we use this we should use a `set` instead of a `list` to store the missing nodes.
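For illustration, a minimal sketch of the set-based bookkeeping (the names and the `Hash32` stand-in are assumptions, not the actual py-evm code):

```python
import collections
from typing import DefaultDict, Set

Hash32 = bytes  # stand-in for eth_typing.Hash32

# A set makes repeated "peer X is missing node Y" reports idempotent and
# keeps membership checks O(1) instead of scanning a list.
peer_missing_nodes: DefaultDict[str, Set[Hash32]] = collections.defaultdict(set)

peer_missing_nodes["peer-a"].add(b"\x01" * 32)
peer_missing_nodes["peer-a"].add(b"\x01" * 32)  # duplicate, absorbed by the set
```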
p2p/state.py (Outdated)
```python
async def get_peer_for_request(self, node_keys: Set[Hash32]) -> ETHPeer:
    """Return an idle peer that may have any of the trie nodes in node_keys."""
    async for peer in self.peer_pool:
        peer = cast(ETHPeer, peer)
```
This is more of a side-bar question. At some point we'll be connected to multiple types of peers: light, full, etc. Will we need to build out an API for only getting back peers which match the capabilities we need (i.e. if we are connected to both normal peers and light peers, we won't want to iterate over all of the peers, but rather, only the `ETHPeer` ones)?
Do we actually need that, or can we instead have a separate `PeerPool` instance for every peer type, with separate services that subscribe only to the pool they need?
👍
This seems like a good idea, since we're likely to also have different policies on how we treat different peers and we won't want to conflate those things.
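A rough sketch of the separate-pool idea; all names here (`TypedPeerPool`, the stub peer classes) are hypothetical, not the real p2p API:

```python
from typing import List, Type


class Peer:
    """Stub base class; the real ETHPeer / LESPeer live in p2p.peer."""


class ETHPeer(Peer):
    pass


class LESPeer(Peer):
    pass


class TypedPeerPool:
    """One pool per peer class, so a service subscribed to the ETHPeer pool
    never needs to filter or cast peers itself, and per-type policies
    (scoring, disconnection, etc.) stay separate."""

    def __init__(self, peer_class: Type[Peer]) -> None:
        self.peer_class = peer_class
        self.connected: List[Peer] = []

    def add(self, peer: Peer) -> None:
        # Reject peers of the wrong type at the boundary.
        if not isinstance(peer, self.peer_class):
            raise TypeError(
                f"{type(peer).__name__} does not belong in a "
                f"{self.peer_class.__name__} pool")
        self.connected.append(peer)


eth_pool = TypedPeerPool(ETHPeer)
eth_pool.add(ETHPeer())
```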
p2p/state.py (Outdated)
```diff
@@ -71,7 +71,8 @@ def __init__(self,
         self.root_hash = root_hash
         self.scheduler = StateSync(root_hash, account_db)
         self._handler = PeerRequestHandler(self.chaindb, self.logger, self.cancel_token)
-        self._peers_with_pending_requests: Dict[ETHPeer, float] = {}
+        self._requested_nodes: Dict[ETHPeer, Tuple[float, List[Hash32]]] = {}
```
Maybe `_active_requests` would better capture what this is storing?
Sounds good to me
p2p/state.py (Outdated)
```python
# XXX: This is a quick workaround for
# https://github.com/ethereum/py-evm/issues/1074, which will be replaced soon
# with a proper fix.
await self.wait(asyncio.sleep(0))
```
Maybe this should be shortcutted as `self.sleep(0)` on the `Service` class, since this is likely a pattern we'll use regularly.
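Something like the following could work; the `Service` base class here is a minimal stand-in, since the real one races awaitables against a cancel token:

```python
import asyncio


class Service:
    """Minimal stand-in for the project's cancellable service base class."""

    async def wait(self, awaitable):
        # The real Service.wait() races the awaitable against its cancel
        # token; this stand-in just awaits it.
        return await awaitable

    async def sleep(self, seconds: float) -> None:
        """Shortcut for the recurring `await self.wait(asyncio.sleep(...))` pattern."""
        await self.wait(asyncio.sleep(seconds))


asyncio.run(Service().sleep(0))
```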
p2p/state.py (Outdated)
```python
# This is probably a batch that we retried after a timeout and ended up receiving
# more than once, so ignore but log as an INFO just in case.
self.logger.info(
    "Got %d NodeData entries from %s that were not expected, ignoring them",
```
IIRC the go-ethereum client tracks duplicates. Probably something we could start tracking, keeping this at the `DEBUG` level. Then we still have a strong-ish signal about whether we're receiving a lot of them, but it won't drown out the console.
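A possible shape for that tracking, with hypothetical names (not the change that was actually filed):

```python
import logging

logger = logging.getLogger("state-sync")


class DuplicateCounter:
    """Hypothetical tracker: count duplicate NodeData entries and log at
    DEBUG, so the aggregate stays visible without flooding the console."""

    def __init__(self) -> None:
        self.total = 0

    def record(self, count: int, peer: str) -> None:
        self.total += count
        logger.debug(
            "Got %d unexpected NodeData entries from %s (%d duplicates so far)",
            count, peer, self.total)


counter = DuplicateCounter()
counter.record(3, "peer-a")
counter.record(2, "peer-b")
```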
Agreed. Filed #1098 for it, although in my (limited) testing, we're getting a lot fewer timeouts with these changes.
p2p/state.py (Outdated)
```python
self.logger.debug(
    "No idle peers have any of the trie nodes we want, sleeping a bit")
await self.wait(asyncio.sleep(0.2))
continue
```
If none of the peers we are connected to have a node we're looking for, this loop will never terminate.
Right... Maybe for now we could have another exception to signal that none of our peers have any of the nodes we're about to request. Then we could do one or more of the following:
- If PeerPool is full, disconnect from one or more peers to make room for others that might have the nodes
- Increase sleep delay
- Request another batch of nodes
What do you think? Any other ideas?
I'm fine punting on the long term solution for this (but let's document it with another issue). I think option 3 is our best and simplest solution for now.
Turns out it would take a big redesign of the whole `StateDownloader` in order to request a different batch, as we don't have something like a back-burner area to put those node keys that no peers seem to have. I guess we'll have to go with options 1 and 2 when none of our peers have the nodes. I'll give it some more thought, though; maybe I'll come up with something.
I went ahead and implemented the above so now when no peers have the nodes in a batch, they're added to a list of missing node keys and retried after a while.
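A standalone sketch of that retry behaviour (hypothetical names, not the actual implementation):

```python
import time
from typing import List, Tuple

Hash32 = bytes  # stand-in for eth_typing.Hash32


class MissingNodeRetrier:
    """Node keys that no connected peer has go into a holding list and are
    re-scheduled once their retry delay has elapsed."""

    def __init__(self, retry_delay: float = 20.0) -> None:
        self.retry_delay = retry_delay
        self._missing: List[Tuple[float, Hash32]] = []  # (time added, key)

    def add_missing(self, node_keys: List[Hash32]) -> None:
        now = time.time()
        self._missing.extend((now, key) for key in node_keys)

    def due_for_retry(self, now: float) -> List[Hash32]:
        """Return (and drop) the keys whose retry delay has elapsed."""
        due = [key for when, key in self._missing
               if now - when >= self.retry_delay]
        self._missing = [(when, key) for when, key in self._missing
                         if now - when < self.retry_delay]
        return due
```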
p2p/state.py (Outdated)
```python
self.logger.debug(
    "Timed out waiting for %d nodes from %s", len(node_keys), peer)
timed_out.extend(node_keys)
self._requested_nodes.pop(peer)
```
Can we re-structure this in a non-mutative manner?

```python
timed_out = cytoolz.valfilter(
    lambda v: time.time() - v[0] > self._reply_timeout, self._requested_nodes)
self._requested_nodes = cytoolz.dissoc(self._requested_nodes, *timed_out.keys())
timed_out_nodes = cytoolz.concat(node_keys for _, node_keys in timed_out.values())
```
I'd rather avoid doing that now, as it's tricky to exercise this code path and the non-mutative approach is not as clear as the current one. I'm sure I'd end up introducing bugs and we wouldn't uncover them easily. If only we had proper unit tests for those things... I really should take the time to write those
Maybe this would be easier to test if we isolated this functionality into something stand-alone:

```python
class StateSyncTracker:
    ...
```

It could house all of the statefulness of state syncing, which would allow us to test its APIs without needing to spin up the full state syncer?
(let me know if it's unclear what I mean by this and I can try to flesh out an example a bit further)
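For example, something along these lines (a sketch only; the method names and the `Hash32` stand-in are assumptions):

```python
import time
from typing import Dict, List, Optional, Tuple

Hash32 = bytes  # stand-in for eth_typing.Hash32


class StateSyncTracker:
    """Hypothetical stand-alone holder for state-sync bookkeeping, so the
    request/timeout logic can be unit tested without a full StateDownloader."""

    def __init__(self, reply_timeout: float = 10.0) -> None:
        self.reply_timeout = reply_timeout
        # peer -> (request time, node keys requested from that peer)
        self.active_requests: Dict[str, Tuple[float, List[Hash32]]] = {}

    def record_request(self, peer: str, node_keys: List[Hash32]) -> None:
        self.active_requests[peer] = (time.time(), node_keys)

    def pop_timed_out(self, now: Optional[float] = None) -> List[Hash32]:
        """Drop every request older than the timeout and return its node keys."""
        now = time.time() if now is None else now
        expired = [peer for peer, (when, _) in self.active_requests.items()
                   if now - when > self.reply_timeout]
        keys: List[Hash32] = []
        for peer in expired:
            _, node_keys = self.active_requests.pop(peer)
            keys.extend(node_keys)
        return keys
```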
Yeah, it'll definitely need some refactoring in order to be unit tested
I can take a stab at it, but I'm done for the day and will be off tomorrow, so this will have to wait until next week. I'll at least try to fix the issue above with the infinite loop when no peers have the nodes we want
I've done this refactoring and added tests, together with the change I mentioned above to retry missing nodes
p2p/state.py (Outdated)
```diff
@@ -201,7 +230,7 @@ def msg_queue_maxsize(self) -> int:
         now = time.time()
         sleep_duration = (oldest_request_time + self._reply_timeout) - now
         try:
-            await self.wait_first(asyncio.sleep(sleep_duration))
+            await self.wait(asyncio.sleep(sleep_duration))
```
More places where `self.sleep` would be more concise.
p2p/state.py (Outdated)
```diff
@@ -238,18 +267,19 @@ def msg_queue_maxsize(self) -> int:

     async def _periodically_report_progress(self) -> None:
         while self.is_running:
+            requested_nodes = sum(
+                len(node_keys) for _, node_keys in self._requested_nodes.values())
             self.logger.info("====== State sync progress ========")
```
Whatcha think about going ahead and re-structuring this output to be a single line similar to geth's?

```
processed: %11d tnps: %5s committed: %11d scheduled: %7d timeout: %10d duplicates: %11d
```
Sure, but what's `tnps`, and why the different formatting lengths for every item?
`tnps` -> tree-nodes-per-second; the numbers are the fixed-width spacing so that each line is consistently easy to read (I tried to pick widths appropriate for the expected scale of the values we expect to see in those places; feel free to adjust to your best judgement).
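A sketch of that single-line report using the suggested widths (the metric names and helper are assumptions, not the merged code):

```python
# Fixed-width fields keep successive log lines column-aligned and scannable.
LINE = ("processed: %11d tnps: %5s committed: %11d "
        "scheduled: %7d timeout: %10d duplicates: %11d")


def format_progress(processed: int, tnps: str, committed: int,
                    scheduled: int, timeouts: int, duplicates: int) -> str:
    # %5s for tnps: it's a rate rendered as a string (e.g. "850.0"),
    # padded for alignment like the numeric fields.
    return LINE % (processed, tnps, committed, scheduled, timeouts, duplicates)


print(format_progress(1234567, "850.0", 1200000, 45000, 12, 340))
```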
Force-pushed from bbf15a6 to 6e704e5
```python
# XXX: This is a quick workaround for
# https://github.com/ethereum/py-evm/issues/1074, which will be replaced soon
# with a proper fix.
await self.sleep(0)
```
Should this be `self.sleep(0)`?
hmm, what do you mean? that's what it is
No clue, I must have looked at this and read it as `asyncio.sleep`. Disregard.
It will be replaced shortly with a proper fix, but should allow us to validate some assumptions and make sure the actual fix will ultimately address the issue. Issue ethereum#1074
Keep track of which nodes we requested from every peer so that we know which ones they're missing when they reply with just a subset of what was requested. That way we can avoid re-requesting those missing nodes from peers that don't have them. #1073 Also, when a reply from a peer contains only a subset of what was requested, we immediately re-request the missing ones. Closes: ethereum#1077
Now if none of our peers have any of the nodes we want, those nodes are added to a list of missing node keys and retried after a while.
Force-pushed from 6e704e5 to a94ef03
Keep track of which nodes we requested from every peer so that we know which
ones they're missing when they reply with just a subset of what was
requested. That way we can avoid re-requesting those missing nodes from peers
that don't have them. #1073
Also, when a reply from a peer contains only a subset of what was
requested, we immediately re-request the missing ones. Closes: #1077
And a quick workaround to avoid losing peers during state sync (#1074)