
Download all data in queues (instead of batches) while fast & regular syncing #1226

Merged: 5 commits merged into ethereum:master on Sep 10, 2018

Conversation

@carver (Contributor) commented Aug 30, 2018

What was wrong?

Requests for block bodies and receipts were batched together, so a single slow peer could dramatically slow down the whole process.

How was it fixed?

Roughly: put everything into queues so the downloads can run independently, and one peer isn't left waiting on another to finish.

Numbers vary widely, but I've seen as high as 1500 blocks imported in 5 s.
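The core idea can be sketched with a toy asyncio example (all names here are hypothetical, not trinity's actual API): two "peers" of very different speeds drain a shared task queue, and the fast peer keeps making progress instead of waiting on the slow one to finish a batch.

```python
import asyncio

async def drain(name, queue, results, delay):
    # Hypothetical peer worker: each peer pulls from the shared queue at
    # its own pace, so a slow peer only delays its own items.
    while True:
        try:
            item = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        await asyncio.sleep(delay)  # simulate this peer's network latency
        results.append((name, item))

async def main():
    queue = asyncio.Queue()
    for block_number in range(6):
        queue.put_nowait(block_number)
    results: list = []
    # Fast and slow peers run independently instead of being batched
    # together, so the fast peer is never stuck behind the slow one.
    await asyncio.gather(
        drain("fast-peer", queue, results, 0.001),
        drain("slow-peer", queue, results, 0.1),
    )
    return results

results = asyncio.run(main())
print(len(results))  # 6: every queued block was downloaded
```

In this sketch the fast peer ends up handling most of the items, which mirrors the intended behavior of the queue-based sync.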

TODO:

  • hunt down an occasional RuntimeError: Cannot restart a service that has already been cancelled
  • manually test
  • overnight manual test
  • make code more readable
  • linting
  • clean commit history

Cute Animal Picture

put a cute animal picture link inside the parentheses

header_stats = stats['BlockHeaders']
assert header_stats['count'] == idx
assert header_stats['items'] == idx
assert header_stats['timeouts'] == 0
Member:

You might rebase this on top of #1225 so that we don't conflict here and also the stats available in that branch are easier to get at and probably better.

@carver (Contributor, Author) commented Aug 30, 2018

You might rebase this on top of #1225

Yeah, I just noticed that conflict.

also the stats available in that branch are easier to get at and probably better.

At first glance, I agree. I'll review it now.

@carver (Contributor, Author) commented Sep 1, 2018

Ok, getting closer. I think most of what's left (before removing the WIP tag) is linting and maybe some minor refactors for readability. Plus whatever the full CI run digs up.

@carver (Contributor, Author) commented Sep 4, 2018

I'm in the process of pulling most of these pieces out into other PRs. I'll rebase this after the others are merged:

@carver (Contributor, Author) commented Sep 7, 2018

I rewrote most of trinity/sync/full/chain.py to fit the new model. So much has changed that it's not really worth looking at the diff on that file, just read the whole new file at: https://github.com/carver/py-evm/blob/block-data-queues/trinity/sync/full/chain.py

In my tests, we seem to be able to download bodies and receipts so fast that the new bottleneck is downloading the headers. You may see this yourself if you find something like this in the debug logs:

(in progress, queued, max size) of headers, bodies, receipts: [(0, 0, 1536), (320, 320, 512), (576, 576, 1024)]

That means that there are no new headers from which to start downloading bodies and receipts. All of the current download tasks (320 and 576, for bodies and receipts respectively) currently have outstanding requests to peers.
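A small helper (hypothetical, not part of this PR) makes the reading of that log line concrete: headers are the bottleneck when the header queue is completely idle while the downstream body and receipt queues still have spare capacity.

```python
def header_starved(stats):
    # stats: [(in_progress, queued, max_size), ...] for headers, bodies,
    # receipts, matching the debug log format shown above. This check and
    # its name are illustrative assumptions, not trinity code.
    (h_prog, h_queued, _), (b_prog, b_queued, b_max), (r_prog, r_queued, r_max) = stats
    headers_idle = h_prog == 0 and h_queued == 0
    downstream_has_room = b_queued < b_max and r_queued < r_max
    return headers_idle and downstream_has_room

# The exact stats from the log line above: header download is starved.
print(header_starved([(0, 0, 1536), (320, 320, 512), (576, 576, 1024)]))  # True
```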

This problem is similar to #1218 -- we probably have a peer with low header throughput, and we are stuck syncing from them until we give up, time out, or catch up. Some changes to make header sync look even more like body/receipt downloads might help (always picking the peer with the best throughput on every new request).

Edit: or do what gsalgado said is happening in geth: fetch every 128th header from your "lead" peer, and fill in the rest from other peers.
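The geth-style skeleton approach mentioned above can be sketched like this (a hypothetical illustration, not geth's or trinity's actual code): the lead peer supplies every 128th header as an anchor, and the gaps between anchors become independent fill tasks for the other peers.

```python
def skeleton_fill_ranges(start, end, stride=128):
    # Hypothetical sketch of skeleton header sync: the "lead" peer
    # supplies every `stride`-th header (the anchors); the gaps between
    # anchors are independent fill tasks parceled out to other peers.
    anchors = list(range(start, end + 1, stride))
    gaps = [
        (lo + 1, min(lo + stride - 1, end))
        for lo in anchors
        if lo + 1 <= end
    ]
    return anchors, gaps

anchors, gaps = skeleton_fill_ranges(0, 384)
print(anchors)  # [0, 128, 256, 384]
print(gaps)     # [(1, 127), (129, 255), (257, 383)]
```

Because each gap can be verified against its bounding anchors, a slow fill peer only stalls its own range rather than the whole header pipeline.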

@carver carver changed the title [WIP] Block data queues Download all data in parallel while fast & regular syncing Sep 7, 2018
@carver carver changed the title Download all data in parallel while fast & regular syncing Download all data in queues (instead of batches) while fast & regular syncing Sep 7, 2018
@carver carver requested a review from gsalgado September 7, 2018 23:51
@pipermerriam (Member) left a comment:

Giving this approval as I think the approach is generally sound and rather than futz with it in theory I'd rather get it merged so we can run it and shake out whatever problems may be hiding that way.

trinity/sync/full/chain.py
self._block_body_tasks = TaskQueue(MAX_BODIES_FETCH * 4, attrgetter('block_number'))

# TODO move this to BaseService if it gets broader usage and/or stays stable for long enough
def delayed_run_func(self, func: Callable[[], None], delay: float) -> None:
Member:

This seems like it should be using loop.call_later() or call_at
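For reference, `loop.call_later()` schedules a plain callback after a delay without spawning a coroutine, which is what the reviewer is suggesting instead of a hand-rolled `delayed_run_func`. A minimal self-contained sketch:

```python
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    results = []
    # call_later(delay, callback, *args) runs the callback on the loop
    # after `delay` seconds; no wrapper coroutine or lambda is needed
    # because extra positional args are passed through.
    loop.call_later(0.01, results.append, "delayed")
    await asyncio.sleep(0.05)
    return results

out = asyncio.run(main())
print(out)  # ['delayed']
```

`call_at()` is the absolute-time variant; both return a `TimerHandle` that can be used to cancel the pending callback.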

headers = tuple(concat(all_missing_headers))
self._mark_body_download_complete(batch_id, completed_headers + trivial_headers)
except BaseP2PError:
self._block_body_tasks.complete(batch_id, trivial_headers)
Member:
Could this go in a finally statement so that it doesn't have to be present in all except blocks?
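The shape the reviewer is suggesting looks roughly like this (all names here are hypothetical stand-ins, not the PR's actual helpers): the completion bookkeeping runs once in `finally` instead of being repeated in every `except` block.

```python
def complete_batch(completed, batch_id, download, mark_complete):
    # Illustrative shape of the suggestion: run the completion
    # bookkeeping in `finally` once, so every exit path -- success or
    # any exception -- marks the batch complete without duplication.
    try:
        download()
    except Exception:
        pass  # each error path can still log or re-raise as needed
    finally:
        mark_complete(batch_id)
        completed.append(batch_id)

def failing_download():
    raise ValueError("peer returned a bad body")

done: list = []
complete_batch(done, 7, failing_download, lambda batch_id: None)
print(done)  # [7]: batch marked complete even though the download failed
```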

@carver (Contributor, Author):

Maybe, let's see if I can work it out (right now, it's being called separately in _mark_body_download_complete())

self._mark_body_download_complete(batch_id, completed_headers + trivial_headers)
except BaseP2PError:
self._block_body_tasks.complete(batch_id, trivial_headers)
self.logger.debug("Problem downloading body from peer, dropping...", exc_info=True)
Member:
This blanket catch of any p2p exception stands out to me as potentially problematic. What do you think about adding an INFO level statement so that we can zero in on being more explicit about what we catch?


self.logger.debug("Got block bodies batch for %d headers", len(all_headers))
return block_bodies_by_key
def _mark_body_download_complete(
Member:
Is this passthrough function just for readability?

@carver (Contributor, Author):
It's an unfortunate side-effect of the two classes being built with inheritance. They have to register that the download is complete in a different OrderedTaskPreparation object, with a different prerequisite.

self._block_body_tasks.complete(batch_id, trivial_headers)
self.logger.debug("Problem downloading body from peer, dropping...", exc_info=True)
except Exception:
self._block_body_tasks.complete(batch_id, trivial_headers)
Member:
This could/should use self._mark_body_download_complete?

trinity/sync/full/chain.py
self.run_task(self._launch_prerequisite_tasks())
self.run_task(self._assign_receipt_download_to_peers())
self.run_task(self._assign_body_download_to_peers())
self.run_task(self._persist_ready_blocks())
Member:
So in theory, most of these background tasks are required for the service to be operational/functional. Using run_task allows for them to exit at which point the client will keep running but sync will stop working. Thoughts on how we can address this? Something like BaseService.run_daemon_task()?
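One possible shape for the suggested `run_daemon_task()` (a minimal sketch, not trinity's actual `BaseService`): wrap the essential task so that if it ever finishes, the whole service is cancelled rather than left running with sync silently broken.

```python
import asyncio

class MiniService:
    # Minimal stand-in for a service class; `cancel` here just flips a
    # flag, where a real service would tear down its other tasks too.
    def __init__(self):
        self.cancelled = False

    def cancel(self):
        self.cancelled = True

    def run_daemon_task(self, coro):
        async def _daemon_wrapper():
            try:
                await coro
            finally:
                # An essential background task exited: stop the service
                # instead of limping along with sync stopped.
                self.cancel()
        return asyncio.ensure_future(_daemon_wrapper())

async def main():
    service = MiniService()
    async def short_lived_daemon():
        await asyncio.sleep(0)
    await service.run_daemon_task(short_lived_daemon())
    return service.cancelled

service_cancelled = asyncio.run(main())
print(service_cancelled)  # True: the daemon exiting cancelled the service
```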

vm_class = self.chain.get_vm_class(header)
block_class = vm_class.get_block_class()
# We don't need to use our block transactions here because persist_block() doesn't do
# anything with them as it expects them to have been persisted already.
Member:
I don't think this comment is accurate:

py-evm/eth/db/chain.py

Lines 233 to 243 in de4809f

for header in new_canonical_headers:
    if header.hash == block.hash:
        # Most of the time this is called to persist a block whose parent is the current
        # head, so we optimize for that and read the tx hashes from the block itself. This
        # is specially important during a fast sync.
        tx_hashes = [tx.hash for tx in block.transactions]
    else:
        tx_hashes = self.get_block_transaction_hashes(header)
    for index, transaction_hash in enumerate(tx_hashes):
        self._add_transaction_to_canonical_chain(transaction_hash, header, index)

By not including the transactions, we are failing to populate the canonical transaction hash to block lookups right?

@carver (Contributor, Author):

Hm, I haven't looked deeply (it was already there), but at first glance I think you're right.

@carver carver merged commit 6432b67 into ethereum:master Sep 10, 2018
@carver carver deleted the block-data-queues branch September 10, 2018 22:41
@@ -142,6 +142,25 @@ def run_task(self, awaitable: Awaitable[Any]) -> None:
self.logger.trace("Task %s finished with no errors", awaitable)
self._tasks.add(asyncio.ensure_future(_run_task_wrapper()))

def run_daemon_task(self, awaitable: Awaitable[Any]) -> None:
Member:
I was looking at this and the run_task API, and I think we should update them as follows.

def run_task(self, fn, *args):
    @functools.wraps(fn)
    async def _run_task_wrapper():
        ...
    ...

By doing it this way we'll get intelligible stacktraces when tasks are not properly cleaned up which should make finding the offending code a lot easier.
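Filled in, the suggestion might look like this (a sketch under the stated assumptions, not trinity's final implementation): because `run_task` now receives the coroutine *function* plus its args, `functools.wraps` can copy that function's name onto the wrapper, so leaked-task reports name the offending code.

```python
import asyncio
import functools

def run_task(fn, *args):
    # Sketch of the suggested API: take the coroutine function and args
    # rather than an already-created awaitable, so wraps() can preserve
    # the original name for intelligible stack traces.
    @functools.wraps(fn)
    async def _run_task_wrapper():
        await fn(*args)
    return asyncio.ensure_future(_run_task_wrapper())

async def fetch_bodies(count):
    # Hypothetical task standing in for a real sync coroutine.
    await asyncio.sleep(0)
    return count

async def main():
    task = run_task(fetch_bodies, 3)
    await task
    # The wrapper's coroutine carries the wrapped function's name.
    return task.get_coro().__name__

wrapped_name = asyncio.run(main())
print(wrapped_name)  # fetch_bodies
```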

@@ -238,7 +235,10 @@ def delayed_run_func(self, func: Callable[[], None], delay: float) -> None:
# peer returned no results, wait a while before trying again
delay = self.EMPTY_PEER_RESPONSE_PENALTY
self.logger.debug("Pausing %s for %.1fs, for sending 0 block bodies", peer, delay)
self.delayed_run_func(lambda: self._body_peers.put_nowait(peer), delay)
loop = self.get_event_loop()
loop.call_later(delay, partial(self._body_peers.put_nowait, peer))
Member:
I think that we need to wrap call_later in a BaseService.call_later that retains the return value, which I believe is the only way we can issue a cancellation when the service shuts down.
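`loop.call_later()` returns a `TimerHandle` with a `cancel()` method, so a wrapper only needs to keep those handles around. A minimal sketch of the idea (hypothetical names, not trinity's `BaseService`):

```python
import asyncio

class MiniService:
    # Sketch: retain each TimerHandle from call_later so that pending
    # callbacks can be cancelled when the service is torn down.
    def __init__(self):
        self._timers = []

    def call_later(self, delay, callback, *args):
        handle = asyncio.get_running_loop().call_later(delay, callback, *args)
        self._timers.append(handle)
        return handle

    def cancel(self):
        for handle in self._timers:
            handle.cancel()

async def main():
    service = MiniService()
    fired = []
    service.call_later(0.01, fired.append, "requeued-peer")
    service.cancel()  # shutting down: the pending callback never fires
    await asyncio.sleep(0.05)
    return fired

fired = asyncio.run(main())
print(fired)  # []: the scheduled callback was cancelled before it ran
```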
