New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move HexaryTrieSync from py-trie here, and make it async #1124
Conversation
01db307
to
a8c96a3
Compare
@pipermerriam I think this may have slipped under your radar? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename trinity/sync/full/trie.py
to be hexary_trie
since we kind of know there is going to be a BinaryTrie
making it's way into here at some point.
for key, value in contents.items(): | ||
assert dest_trie[key] == value | ||
|
||
asyncio.get_event_loop().run_until_complete(_test_trie_sync()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we instead use the event_loop
fixture provided by pytest? I know these are supposed to be equivalent during runtime, but it seems slightly more correct to use the fixture version of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to hear your thoughts on the proposed SyncResponse
pattern, or any alternate ideas you might have.
trinity/sync/full/trie.py
Outdated
is_raw: bool = False) -> None: | ||
"""Schedule a request for the node with the given key.""" | ||
if node_key in self._existing_nodes: | ||
self.logger.debug("Node %s already exists in db" % encode_hex(node_key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logging statements should not do string formatting. Can you change to self.logger.debug("the %s message", encode_hex(nodekey))
trinity/sync/full/trie.py
Outdated
return | ||
if await self.db.coro_exists(node_key): | ||
self._existing_nodes.add(node_key) | ||
self.logger.debug("Node %s already exists in db" % encode_hex(node_key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, drop the string formatting and just provide the hex encoded node_key
as a positional argument.
trinity/sync/full/trie.py
Outdated
:rtype: A two-tuple with one list containing the children that reference other nodes and | ||
another containing the leaf children. | ||
""" | ||
node = decode_node(request.data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This stands out as a potential source of trouble. request.data
can be None
in cases where the SyncRequest
has not been processed. I trust that this code path is only called after the data
attribute has been populated, but the implicitness of this approach makes me think we should look for an alternate pattern, maybe something like a SyncResponse
object which takes the request and the data as an argument and exposes response.data
which is always guaranteed to be populated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_children() is called precisely to find the children of the node we're processing, and process()
does set request.data
immediately before calling it. I guess if I inlined this method into process()
it wouldn't raise a red flag, and we wouldn't make process()
any more complex or less readable as this is a two-line method that just decodes the request data and passes it to _get_children()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 if it gets rid of the mutation of the request
object and thus the implicit availability of the request.data
attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to completely get rid of the mutation of request.data
I'll have to get rid of the check in process()
that raises SyncRequestAlreadyProcessed
when request.data is not None, but that is useless anyway as it is caught and ignored in StateDownloader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, spoke too soon. I thought I'd be able to have process()
pass the data
to commit()
but that doesn't work because we don't always commit at the end of process()
-- we only commit once all the child nodes have been retrieved and committed. IOW, we need to keep (somewhere) the data received for every request, until that request is eventually committed. I could get rid of request.data
, but then I'd have to store the data for every request in an instance variable of HexaryTrieSync
. The SyncResponse
you suggest would also have to be stored somewhere, and since it's disjoint from the actual request I'm not sure it gives us any guarantees really -- we could still try to commit a request for which there's no SyncResponse
in the same way we can try to commit one which has .data
set to None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you can't find a way to do it that seems good, my fallback request would be to add some comments to document this deficiency. Given that if request.data
is None
the two methods that access it will very likely fail loudly I'm ok with it staying in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've added a docstring to commit()
, and the other method is gone as there was no point in having it anyway
trinity/sync/full/trie.py
Outdated
await self.commit(request) | ||
continue | ||
|
||
references, leaves = self.get_children(request) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to my other comment about request.data
being potentially problematic. At this call-site we'd instead pass in a response
object so that we have a guarantee that the data
attribute is present.
trinity/sync/full/trie.py
Outdated
if request.dependencies == 0: | ||
await self.commit(request) | ||
|
||
async def commit(self, request: SyncRequest) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function would also change to take a SyncResponse
object.
f0af0bd
to
4014ff9
Compare
@pipermerriam I think I've addressed all your comments. wanna have another look? |
trinity/sync/full/hexary_trie.py
Outdated
# Requests get added to both self.queue and self.requests; the former is used to keep | ||
# track which requests should be sent next, and the latter is used to avoid scheduling a | ||
# request for a given node multiple times. | ||
self.logger.debug("Scheduling retrieval of %s", encode_hex(request.node_key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be TRACE
level logging?
trinity/sync/full/hexary_trie.py
Outdated
is_raw: bool = False) -> None: | ||
"""Schedule a request for the node with the given key.""" | ||
if node_key in self._existing_nodes: | ||
self.logger.debug("Node %s already exists in db", encode_hex(node_key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this code path is likely to be encountered a lot we should move this to TRACE
level logging. Same with the one in the next if
block.
trinity/sync/full/hexary_trie.py
Outdated
if request is None: | ||
# This may happen if we resend a request for a node after waiting too long, | ||
# and then eventually get two responses with it. | ||
self.logger.debug( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trace
encode_hex(node_key)) | ||
return | ||
|
||
if request.data is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've got a solution that'll work. Lets remove the data
property from the SyncRequest
object and just use a boolean property like is_done
. We can set is_done
here in this method, and just pass the raw data
value into self.commit
. It looks like there are no other places where request.data
is accessed. This has the added benefit of reducing the memory overhead of this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't always commit at the end of process()
, so we need to keep the data for a given request until we have committed all of its children; see my previous comment: #1124 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, missed that. Ok, unless you see a clean way around this.
# A cache of node hashes we know to exist in our DB, used to avoid querying the DB | ||
# unnecessarily as that's the main bottleneck when dealing with a large DB like for | ||
# ethereum's mainnet/ropsten. | ||
self._existing_nodes: Set[Hash32] = set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This worries me a bit as it has O(n)
storage cost. For a full state trie sync I believe that means storing roughly 180 million hashes which is 5gb of memory. Do you think we can get by with an LRU cache or something on the existence check and not actually store all of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a deja-vu feeling here so did a bit of digging and found ethereum/py-trie#43 (review)
I've opened #1154 for it
Closes: #1074