
state-sync: don't request trie nodes to peers known not to have them #1097

Merged: 8 commits on Jul 30, 2018

Conversation

gsalgado (Collaborator)

Keep track of which nodes we requested from every peer so that we know which
ones they're missing when they reply with just a subset of what was
requested. That way we can avoid re-requesting those missing nodes from peers
that don't have them. #1073

Also, when a reply from a peer contains only a subset of what was
requested, we immediately re-request the missing ones. Closes: #1077

And a quick workaround to avoid losing peers during state sync (#1074)

p2p/state.py Outdated
@@ -71,7 +71,8 @@ def __init__(self,
         self.root_hash = root_hash
         self.scheduler = StateSync(root_hash, account_db)
         self._handler = PeerRequestHandler(self.chaindb, self.logger, self.cancel_token)
-        self._peers_with_pending_requests: Dict[ETHPeer, float] = {}
+        self._requested_nodes: Dict[ETHPeer, Tuple[float, List[Hash32]]] = {}
+        self._peer_missing_nodes: Dict[ETHPeer, List[Hash32]] = collections.defaultdict(list)
Member

It looks like, given the way we use this, we should use a set instead of a list to store the missing nodes.
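A minimal sketch of the set-based variant (same names as the diff above; the membership helper at the end is hypothetical, just to show why a set pays off):

self._peer_missing_nodes: Dict[ETHPeer, Set[Hash32]] = collections.defaultdict(set)

def _peer_is_missing_all(self, peer: ETHPeer, node_keys: Set[Hash32]) -> bool:
    # With a set, this membership check avoids scanning a list and repeated
    # keys are deduplicated automatically.
    return node_keys.issubset(self._peer_missing_nodes[peer])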

p2p/state.py Outdated
async def get_peer_for_request(self, node_keys: Set[Hash32]) -> ETHPeer:
    """Return an idle peer that may have any of the trie nodes in node_keys."""
    async for peer in self.peer_pool:
        peer = cast(ETHPeer, peer)
Member

This is more of a side-bar question. At some point we'll be connected to multiple types of peers (light, full, etc.). Will we need to build out an API for only getting back peers which match the capabilities we need? (i.e. if we are connected to both normal peers and light peers, we won't want to iterate over all of the peers, but rather only the ETHPeer ones).

Collaborator Author

Do we actually need that, or can we instead have a separate PeerPool instance for every peer type, with separate services that subscribe only to the pool they need?

Member

👍

This seems like a good idea, since we're likely to also have different policies on how we treat different peers and we won't want to conflate those things.
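For illustration, a rough sketch of the pool-per-peer-type idea; the argument lists and wiring below are assumptions, not the existing API:

# Hypothetical wiring (constructor arguments elided): one PeerPool per peer
# class, so each service only ever sees peers that speak the protocol it needs.
eth_peer_pool = PeerPool(ETHPeer)   # full peers, can serve GetNodeData
les_peer_pool = PeerPool(LESPeer)   # light peers

# A service such as the state syncer then subscribes only to the pool it
# needs, and `async for peer in self.peer_pool` requires no type filtering.
state_downloader = StateDownloader(eth_peer_pool)  # other args omitted

This also keeps per-peer-type policies (scoring, disconnect rules, etc.) from being conflated in a single pool.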

p2p/state.py Outdated
@@ -71,7 +71,8 @@ def __init__(self,
         self.root_hash = root_hash
         self.scheduler = StateSync(root_hash, account_db)
         self._handler = PeerRequestHandler(self.chaindb, self.logger, self.cancel_token)
-        self._peers_with_pending_requests: Dict[ETHPeer, float] = {}
+        self._requested_nodes: Dict[ETHPeer, Tuple[float, List[Hash32]]] = {}
Member

Maybe _active_requests would better capture what this is storing?

Collaborator Author

Sounds good to me

p2p/state.py Outdated
# XXX: This is a quick workaround for
# https://github.com/ethereum/py-evm/issues/1074, which will be replaced soon
# with a proper fix.
await self.wait(asyncio.sleep(0))
Member

Maybe this should be shortcut to self.sleep(0) on the Service class, since this is likely a pattern we'll use regularly.
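A minimal sketch of that shortcut, assuming it sits on the existing base service class next to wait():

async def sleep(self, seconds: float) -> None:
    """Sleep for `seconds`, returning early if the service's cancel token is triggered."""
    await self.wait(asyncio.sleep(seconds))

Call sites like the workaround above then become `await self.sleep(0)`.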

p2p/state.py Outdated
# This is probably a batch that we retried after a timeout and ended up receiving
# more than once, so ignore but log as an INFO just in case.
self.logger.info(
"Got %d NodeData entries from %s that were not expected, ignoring them",
Member

IIRC the go-ethereum client tracks duplicates. That's probably something we could start tracking ourselves, and keep this at the DEBUG level. Then we'd still have a strong-ish signal about whether we're receiving a lot of them, but it won't drown out the console.
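A sketch of what that tracking might look like; the counter and the `unexpected` variable are illustrative names, not existing code:

# In __init__:
self._total_duplicate_nodes = 0

# In the NodeData handler, for entries we no longer expect from this peer:
self._total_duplicate_nodes += len(unexpected)
self.logger.debug(
    "Got %d unexpected NodeData entries from %s (%d duplicates so far)",
    len(unexpected), peer, self._total_duplicate_nodes)

The running total could then be surfaced in the periodic progress report instead of logging each occurrence at INFO.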

Collaborator Author

Agreed. Filed #1098 for it, although in my (limited) testing we're getting far fewer timeouts with these changes.

p2p/state.py Outdated
self.logger.debug(
"No idle peers have any of the trie nodes we want, sleeping a bit")
await self.wait(asyncio.sleep(0.2))
continue
Member

If none of the peers we are connected to have a node we're looking for, this loop will never terminate.

Collaborator Author

Right... Maybe for now we could have another exception to signal that none of our peers have any of the nodes we're about to request. Then we could do one or more of the following:

  1. If PeerPool is full, disconnect from one or more peers to make room for others that might have the nodes
  2. Increase sleep delay
  3. Request another batch of nodes

What do you think? Any other ideas?

Member

I'm fine punting on the long-term solution for this (but let's document it with another issue). I think option 3 is our best and simplest solution for now.

Collaborator Author

Turns out it would take a big redesign of the whole StateDownloader to request a different batch, as we don't have something like a back-burner area to put the node keys that no peers seem to have. I guess we'll have to go with options 1 and 2 when none of our peers have the nodes. I'll give it some more thought, though; maybe I'll come up with something.

Collaborator Author

I went ahead and implemented the above, so now when no peers have the nodes in a batch, those keys are added to a list of missing node keys and retried after a while.
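Roughly, the retry path described here could look like the following; the attribute, the helper method and the delay are illustrative, not necessarily the exact merged code:

# In __init__: node keys that no currently-connected peer seems to have.
self._nodes_no_peer_has: List[Hash32] = []

async def _periodically_retry_missing(self) -> None:
    # Every so often, move the stashed keys back into the request path so
    # they get another chance (e.g. after new peers have connected).
    while self.is_running:
        await self.sleep(10)  # arbitrary delay for the sketch
        retry_batch, self._nodes_no_peer_has = self._nodes_no_peer_has, []
        if retry_batch:
            await self.request_nodes(retry_batch)  # hypothetical re-request helper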

p2p/state.py Outdated
self.logger.debug(
"Timed out waiting for %d nodes from %s", len(node_keys), peer)
timed_out.extend(node_keys)
self._requested_nodes.pop(peer)
Member

Can we re-structure this in a non-mutative manner?

timed_out = cytoolz.valfilter(
    lambda v: time.time() - v[0] > self._reply_timeout, self._requested_nodes)
self._requested_nodes = cytoolz.dissoc(self._requested_nodes, *timed_out.keys())
timed_out_nodes = list(cytoolz.concat(node_keys for _, node_keys in timed_out.values()))

Collaborator Author

I'd rather avoid doing that now, as it's tricky to exercise this code path and the non-mutative approach is not as clear as the current one. I'm sure I'd end up introducing bugs and we wouldn't uncover them easily. If only we had proper unit tests for those things... I really should take the time to write those

Member

Maybe this would be easier to test if we isolated this functionality into something stand-alone:

class StateSyncTracker:
   ...

It could house all of the statefulness of state syncing, which would allow us to test its APIs without needing to spin up the full state syncer?
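Something along these lines, purely as a sketch (class and method names are illustrative; ETHPeer and Hash32 come from the existing codebase):

import collections
import time
from typing import Dict, List, Set, Tuple


class StateSyncTracker:
    """Bookkeeping for in-flight node requests, testable without a running syncer."""

    def __init__(self, reply_timeout: float) -> None:
        self.reply_timeout = reply_timeout
        self.active_requests: Dict[ETHPeer, Tuple[float, List[Hash32]]] = {}
        self.missing_nodes: Dict[ETHPeer, Set[Hash32]] = collections.defaultdict(set)

    def track_request(self, peer: ETHPeer, node_keys: List[Hash32]) -> None:
        self.active_requests[peer] = (time.time(), node_keys)

    def get_timed_out(self) -> List[Hash32]:
        # Pop every request older than reply_timeout and return its node keys
        # so the caller can re-schedule them.
        now = time.time()
        expired = [
            peer for peer, (req_time, _) in self.active_requests.items()
            if now - req_time > self.reply_timeout
        ]
        timed_out: List[Hash32] = []
        for peer in expired:
            _, node_keys = self.active_requests.pop(peer)
            timed_out.extend(node_keys)
        return timed_out

Unit tests could then exercise the timeout and missing-node logic by instantiating the tracker directly with fake peers.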

Member

(let me know if it's unclear what I mean by this and I can try to flesh out an example a bit further)

Collaborator Author

Yeah, it'll definitely need some refactoring in order to be unit tested

Collaborator Author

I can take a stab at it, but I'm done for the day and will be off tomorrow, so this will have to wait until next week. I'll at least try to fix the issue above with the infinite loop when no peers have the nodes we want

Collaborator Author

I've done this refactoring and added tests, together with the change I mentioned above to retry missing nodes

p2p/state.py Outdated
@@ -201,7 +230,7 @@ def msg_queue_maxsize(self) -> int:
         now = time.time()
         sleep_duration = (oldest_request_time + self._reply_timeout) - now
         try:
-            await self.wait_first(asyncio.sleep(sleep_duration))
+            await self.wait(asyncio.sleep(sleep_duration))
Member

More places where self.sleep would be more concise

p2p/state.py Outdated
@@ -238,18 +267,19 @@ def msg_queue_maxsize(self) -> int:

    async def _periodically_report_progress(self) -> None:
        while self.is_running:
            requested_nodes = sum(
                len(node_keys) for _, node_keys in self._requested_nodes.values())
            self.logger.info("====== State sync progress ========")
Member

Whatcha think about going ahead and re-structuring this output to be a single line similar to geth's?

processed: %11d  tnps: %5s committed: %11d scheduled: %7d timeout: %10d duplicates: %11d

Collaborator Author

Sure, but what's tnps and why the different formatting lengths for every item?

Member

tnps -> tree-nodes-per-second. The numbers are the fixed-width spacing so that each line is consistently easy to read (I tried to pick widths appropriate for the expected scale of the values in those places; feel free to adjust to your best judgement).
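For example, a possible single-line report using that format; every variable name below is an assumption about what the downloader already tracks:

self.logger.info(
    "processed: %11d  tnps: %5s  committed: %11d  scheduled: %7d  "
    "timeout: %10d  duplicates: %11d",
    total_processed, "%.0f" % nodes_per_second, committed_count,
    scheduled_count, timeout_count, duplicate_count)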

@gsalgado force-pushed the state-sync-improvements branch 5 times, most recently from bbf15a6 to 6e704e5 on July 30, 2018 at 12:41
# XXX: This is a quick workaround for
# https://github.com/ethereum/py-evm/issues/1074, which will be replaced soon
# with a proper fix.
await self.sleep(0)
Member

should this be self.sleep(0)?

Collaborator Author

hmm, what do you mean? that's what it is

Member

No clue, I must have looked at this and read it as asyncio.sleep. Disregard.

@pipermerriam (Member)

:shipit:

It will be replaced shortly with a proper fix, but should allow us to
validate some assumptions and make sure the actual fix will ultimately
address the issue

Issue ethereum#1074
Now if none of our peers have any of the nodes we want, those nodes are
added to a list of missing node keys and retried after a while.