state-sync: don't request trie nodes to peers known not to have them #1097
Conversation
p2p/state.py (Outdated)
```diff
@@ -71,7 +71,8 @@ def __init__(self,
         self.root_hash = root_hash
         self.scheduler = StateSync(root_hash, account_db)
         self._handler = PeerRequestHandler(self.chaindb, self.logger, self.cancel_token)
-        self._peers_with_pending_requests: Dict[ETHPeer, float] = {}
+        self._requested_nodes: Dict[ETHPeer, Tuple[float, List[Hash32]]] = {}
+        self._peer_missing_nodes: Dict[ETHPeer, List[Hash32]] = collections.defaultdict(list)
```
Looks like the way we use this we should use a `set` instead of a `list` to store the missing nodes.
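For illustration, a minimal sketch of the set-based bookkeeping (the names and the `Hash32` stand-in are assumptions, not the actual py-evm code):

```python
import collections
from typing import DefaultDict, Set

Hash32 = bytes  # stand-in for eth_typing.Hash32

# A set makes repeated "peer X is missing node Y" reports idempotent and
# keeps membership checks O(1) instead of scanning a list.
peer_missing_nodes: DefaultDict[str, Set[Hash32]] = collections.defaultdict(set)

peer_missing_nodes["peer-a"].add(b"\x01" * 32)
peer_missing_nodes["peer-a"].add(b"\x01" * 32)  # duplicate, absorbed by the set
```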
p2p/state.py (Outdated)
```python
async def get_peer_for_request(self, node_keys: Set[Hash32]) -> ETHPeer:
    """Return an idle peer that may have any of the trie nodes in node_keys."""
    async for peer in self.peer_pool:
        peer = cast(ETHPeer, peer)
```
This is more of a side-bar question. At some point we'll be connected to multiple types of peers: light, full, etc. Will we need to build out an API for only getting back peers which match the capabilities we need (i.e. if we are connected to both normal peers and light peers, we won't want to iterate over all of the peers, but rather, only the `ETHPeer` ones)?
Do we actually need that, or can we instead have a separate `PeerPool` instance for every peer type, with separate services that subscribe only to the pool they need?
👍
This seems like a good idea, since we're likely to also have different policies on how we treat different peers and we won't want to conflate those things.
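A rough sketch of the separate-pool idea; all names here (`TypedPeerPool`, the stub peer classes) are hypothetical, not the real p2p API:

```python
from typing import List, Type


class Peer:
    """Stub base class; the real ETHPeer / LESPeer live in p2p.peer."""


class ETHPeer(Peer):
    pass


class LESPeer(Peer):
    pass


class TypedPeerPool:
    """One pool per peer class, so a service subscribed to the ETHPeer pool
    never needs to filter or cast peers itself, and per-type policies
    (scoring, disconnection, etc.) stay separate."""

    def __init__(self, peer_class: Type[Peer]) -> None:
        self.peer_class = peer_class
        self.connected: List[Peer] = []

    def add(self, peer: Peer) -> None:
        # Reject peers of the wrong type at the boundary.
        if not isinstance(peer, self.peer_class):
            raise TypeError(
                f"{type(peer).__name__} does not belong in a "
                f"{self.peer_class.__name__} pool")
        self.connected.append(peer)


eth_pool = TypedPeerPool(ETHPeer)
eth_pool.add(ETHPeer())
```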
p2p/state.py (Outdated)
```diff
@@ -71,7 +71,8 @@ def __init__(self,
         self.root_hash = root_hash
         self.scheduler = StateSync(root_hash, account_db)
         self._handler = PeerRequestHandler(self.chaindb, self.logger, self.cancel_token)
-        self._peers_with_pending_requests: Dict[ETHPeer, float] = {}
+        self._requested_nodes: Dict[ETHPeer, Tuple[float, List[Hash32]]] = {}
```
Maybe `_active_requests` would better capture what this is storing?
Sounds good to me
p2p/state.py (Outdated)
```python
# XXX: This is a quick workaround for
# https://github.com/ethereum/py-evm/issues/1074, which will be replaced soon
# with a proper fix.
await self.wait(asyncio.sleep(0))
```
Maybe this should be shortcutted as `self.sleep(0)` on the `Service` class, since this is likely a pattern we'll use regularly.
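Something like the following could work; the `Service` base class here is a minimal stand-in, since the real one races awaitables against a cancel token:

```python
import asyncio


class Service:
    """Minimal stand-in for the project's cancellable service base class."""

    async def wait(self, awaitable):
        # The real Service.wait() races the awaitable against its cancel
        # token; this stand-in just awaits it.
        return await awaitable

    async def sleep(self, seconds: float) -> None:
        """Shortcut for the recurring `await self.wait(asyncio.sleep(...))` pattern."""
        await self.wait(asyncio.sleep(seconds))


asyncio.run(Service().sleep(0))
```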
p2p/state.py (Outdated)
```python
# This is probably a batch that we retried after a timeout and ended up receiving
# more than once, so ignore but log as an INFO just in case.
self.logger.info(
    "Got %d NodeData entries from %s that were not expected, ignoring them",
```
IIRC the go-ethereum client tracks duplicates. Probably something we could start tracking, keeping this at the `DEBUG` level. Then we still have a strong-ish signal about whether we're receiving a lot of them, but it won't drown out the console.
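A possible shape for that tracking, with hypothetical names (not the change that was actually filed):

```python
import logging

logger = logging.getLogger("state-sync")


class DuplicateCounter:
    """Hypothetical tracker: count duplicate NodeData entries and log at
    DEBUG, so the aggregate stays visible without flooding the console."""

    def __init__(self) -> None:
        self.total = 0

    def record(self, count: int, peer: str) -> None:
        self.total += count
        logger.debug(
            "Got %d unexpected NodeData entries from %s (%d duplicates so far)",
            count, peer, self.total)


counter = DuplicateCounter()
counter.record(3, "peer-a")
counter.record(2, "peer-b")
```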
Agreed. Filed #1098 for it, although in my (limited) testing, we're getting a lot fewer timeouts with these changes.
p2p/state.py (Outdated)
```python
self.logger.debug(
    "No idle peers have any of the trie nodes we want, sleeping a bit")
await self.wait(asyncio.sleep(0.2))
continue
```
If none of the peers we are connected to have a node we're looking for, this loop will never terminate.
Right... Maybe for now we could have another exception to signal that none of our peers have any of the nodes we're about to request. Then we could do one or more of the following:
- If PeerPool is full, disconnect from one or more peers to make room for others that might have the nodes
- Increase sleep delay
- Request another batch of nodes
What do you think? Any other ideas?
I'm fine punting on the long term solution for this (but let's document it with another issue). I think option 3 is our best and simplest solution for now.
Turns out it would take a big redesign of the whole `StateDownloader` in order to request a different batch, as we don't have something like a back-burner area to put those node keys that no peers seem to have. I guess we'll have to go with options 1 and 2 when none of our peers have the nodes. I'll give it some more thought, though; maybe I'll come up with something.
I went ahead and implemented the above so now when no peers have the nodes in a batch, they're added to a list of missing node keys and retried after a while.
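A standalone sketch of that retry behaviour (hypothetical names, not the actual implementation):

```python
import time
from typing import List, Tuple

Hash32 = bytes  # stand-in for eth_typing.Hash32


class MissingNodeRetrier:
    """Node keys that no connected peer has go into a holding list and are
    re-scheduled once their retry delay has elapsed."""

    def __init__(self, retry_delay: float = 20.0) -> None:
        self.retry_delay = retry_delay
        self._missing: List[Tuple[float, Hash32]] = []  # (time added, key)

    def add_missing(self, node_keys: List[Hash32]) -> None:
        now = time.time()
        self._missing.extend((now, key) for key in node_keys)

    def due_for_retry(self, now: float) -> List[Hash32]:
        """Return (and drop) the keys whose retry delay has elapsed."""
        due = [key for when, key in self._missing
               if now - when >= self.retry_delay]
        self._missing = [(when, key) for when, key in self._missing
                         if now - when < self.retry_delay]
        return due
```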
p2p/state.py (Outdated)
```python
self.logger.debug(
    "Timed out waiting for %d nodes from %s", len(node_keys), peer)
timed_out.extend(node_keys)
self._requested_nodes.pop(peer)
```
Can we re-structure this in a non-mutative manner?

```python
timed_out = cytoolz.valfilter(
    lambda v: time.time() - v[0] > self._reply_timeout, self._requested_nodes)
self._requested_nodes = cytoolz.dissoc(self._requested_nodes, *timed_out.keys())
timed_out_nodes = cytoolz.concat(node_keys for _, node_keys in timed_out.values())
```
I'd rather avoid doing that now, as it's tricky to exercise this code path and the non-mutative approach is not as clear as the current one. I'm sure I'd end up introducing bugs and we wouldn't uncover them easily. If only we had proper unit tests for those things... I really should take the time to write those
Maybe this would be easier to test if we isolated this functionality into something stand-alone:

```python
class StateSyncTracker:
    ...
```

It could house all of the statefulness of state syncing, which would allow us to test its APIs without needing to spin up the full state syncer?
(let me know if it's unclear what I mean by this and I can try to flesh out an example a bit further)
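For example, something along these lines (a sketch only; the method names and the `Hash32` stand-in are assumptions):

```python
import time
from typing import Dict, List, Optional, Tuple

Hash32 = bytes  # stand-in for eth_typing.Hash32


class StateSyncTracker:
    """Hypothetical stand-alone holder for state-sync bookkeeping, so the
    request/timeout logic can be unit tested without a full StateDownloader."""

    def __init__(self, reply_timeout: float = 10.0) -> None:
        self.reply_timeout = reply_timeout
        # peer -> (request time, node keys requested from that peer)
        self.active_requests: Dict[str, Tuple[float, List[Hash32]]] = {}

    def record_request(self, peer: str, node_keys: List[Hash32]) -> None:
        self.active_requests[peer] = (time.time(), node_keys)

    def pop_timed_out(self, now: Optional[float] = None) -> List[Hash32]:
        """Drop every request older than the timeout and return its node keys."""
        now = time.time() if now is None else now
        expired = [peer for peer, (when, _) in self.active_requests.items()
                   if now - when > self.reply_timeout]
        keys: List[Hash32] = []
        for peer in expired:
            _, node_keys = self.active_requests.pop(peer)
            keys.extend(node_keys)
        return keys
```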
Yeah, it'll definitely need some refactoring in order to be unit tested
I can take a stab at it, but I'm done for the day and will be off tomorrow, so this will have to wait until next week. I'll at least try to fix the issue above with the infinite loop when no peers have the nodes we want
I've done this refactoring and added tests, together with the change I mentioned above to retry missing nodes
p2p/state.py (Outdated)
```diff
@@ -201,7 +230,7 @@ def msg_queue_maxsize(self) -> int:
         now = time.time()
         sleep_duration = (oldest_request_time + self._reply_timeout) - now
         try:
-            await self.wait_first(asyncio.sleep(sleep_duration))
+            await self.wait(asyncio.sleep(sleep_duration))
```
More places where `self.sleep` would be more concise.
p2p/state.py (Outdated)
```diff
@@ -238,18 +267,19 @@ def msg_queue_maxsize(self) -> int:

     async def _periodically_report_progress(self) -> None:
         while self.is_running:
+            requested_nodes = sum(
+                len(node_keys) for _, node_keys in self._requested_nodes.values())
             self.logger.info("====== State sync progress ========")
```
Whatcha think about going ahead and re-structuring this output to be a single line similar to geth's?

```
processed: %11d tnps: %5s committed: %11d scheduled: %7d timeout: %10d duplicates: %11d
```
Sure, but what's `tnps`, and why the different formatting lengths for every item?
`tnps` -> tree-nodes-per-second; the numbers are the fixed-width spacing so that each line is consistently easy to read (I tried to pick widths appropriate for the expected scale of the values we expect to see in those places; feel free to adjust to your best judgement).
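A sketch of that single-line report using the suggested widths (the metric names and helper are assumptions, not the merged code):

```python
# Fixed-width fields keep successive log lines column-aligned and scannable.
LINE = ("processed: %11d tnps: %5s committed: %11d "
        "scheduled: %7d timeout: %10d duplicates: %11d")


def format_progress(processed: int, tnps: str, committed: int,
                    scheduled: int, timeouts: int, duplicates: int) -> str:
    # %5s for tnps: it's a rate rendered as a string (e.g. "850.0"),
    # padded for alignment like the numeric fields.
    return LINE % (processed, tnps, committed, scheduled, timeouts, duplicates)


print(format_progress(1234567, "850.0", 1200000, 45000, 12, 340))
```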
Force-pushed from bbf15a6 to 6e704e5
```python
# XXX: This is a quick workaround for
# https://github.com/ethereum/py-evm/issues/1074, which will be replaced soon
# with a proper fix.
await self.sleep(0)
```
Should this be `self.sleep(0)`?
hmm, what do you mean? that's what it is
No clue, I must have looked at this and read it as `asyncio.sleep`. Disregard.
It will be replaced shortly with a proper fix, but should allow us to validate some assumptions and make sure the actual fix will ultimately address the issue. Issue ethereum#1074
Keep track of which nodes we requested from every peer so that we know which ones they're missing when they reply with just a subset of what was requested. That way we can avoid re-requesting those missing nodes from peers that don't have them. #1073 Also, when a reply from a peer contains only a subset of what was requested, we immediately re-request the missing ones. Closes: ethereum#1077
Now if none of our peers have any of the nodes we want, those nodes are added to a list of missing node keys and retried after a while.
Force-pushed from 6e704e5 to a94ef03
Keep track of which nodes we requested from every peer so that we know which
ones they're missing when they reply with just a subset of what was
requested. That way we can avoid re-requesting those missing nodes from peers
that don't have them. #1073
Also, when a reply from a peer contains only a subset of what was
requested, we immediately re-request the missing ones. Closes: #1077
And a quick workaround to avoid losing peers during state sync (#1074)