
Move HexaryTrieSync from py-trie here, and make it async #1124

Merged
2 commits merged on Aug 7, 2018

Conversation

gsalgado (Collaborator) commented Jul 31, 2018

Closes: #1074

@gsalgado gsalgado force-pushed the issue-1074 branch 3 times, most recently from 01db307 to a8c96a3 Compare August 2, 2018 08:02
@gsalgado gsalgado changed the title wip: async trie-sync Move HexaryTrieSync from py-trie here, and make it async Aug 2, 2018
gsalgado (Collaborator, Author) commented Aug 6, 2018

@pipermerriam I think this may have slipped under your radar?

pipermerriam (Member) left a comment:

Maybe rename trinity/sync/full/trie.py to hexary_trie, since we pretty much know a BinaryTrie will be making its way in here at some point.

for key, value in contents.items():
assert dest_trie[key] == value

asyncio.get_event_loop().run_until_complete(_test_trie_sync())
pipermerriam (Member):

Should we instead use the event_loop fixture provided by pytest? I know these are supposed to be equivalent during runtime, but it seems slightly more correct to use the fixture version of this.

pipermerriam (Member) left a comment:

I'd like to hear your thoughts on the proposed SyncResponse pattern, or any alternate ideas you might have.

is_raw: bool = False) -> None:
"""Schedule a request for the node with the given key."""
if node_key in self._existing_nodes:
self.logger.debug("Node %s already exists in db" % encode_hex(node_key))
pipermerriam (Member):

Logging statements should not do eager string formatting. Can you change this to self.logger.debug("the %s message", encode_hex(node_key))?

return
if await self.db.coro_exists(node_key):
self._existing_nodes.add(node_key)
self.logger.debug("Node %s already exists in db" % encode_hex(node_key))
pipermerriam (Member):

Same here: drop the string formatting and just provide the hex-encoded node_key as a positional argument.
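The difference between the two styles can be sketched as follows (logger name and key value are illustrative):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("trie_sync")

node_key_hex = "0xabc123"  # stand-in for encode_hex(node_key)

# Eager: the full message string is built with % even though DEBUG is
# disabled here, so the formatting work is wasted.
logger.debug("Node %s already exists in db" % node_key_hex)

# Lazy: the format string and argument are stored on the LogRecord, and
# interpolation only happens if the record is actually emitted.
logger.debug("Node %s already exists in db", node_key_hex)
```

For hot code paths like a state sync loop, the lazy form avoids hex-encoding and string building on every call when the level is disabled.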

:rtype: A two-tuple with one list containing the children that reference other nodes and
another containing the leaf children.
"""
node = decode_node(request.data)
pipermerriam (Member):

This stands out as a potential source of trouble. request.data can be None in cases where the SyncRequest has not been processed. I trust that this code path is only called after the data attribute has been populated, but the implicitness of this approach makes me think we should look for an alternate pattern, maybe something like a SyncResponse object which takes the request and the data as an argument and exposes response.data which is always guaranteed to be populated.
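One way the suggested pattern could look, as a rough sketch (the SyncRequest here is reduced to a stub, and field names beyond those discussed in the thread are made up):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyncRequest:
    # Stub of the real SyncRequest; only the fields relevant here.
    node_key: bytes
    data: Optional[bytes] = None


@dataclass(frozen=True)
class SyncResponse:
    request: SyncRequest
    data: bytes

    def __post_init__(self) -> None:
        # A response cannot be constructed without its payload, so any code
        # that receives a SyncResponse never has to check .data for None.
        if self.data is None:
            raise ValueError("SyncResponse must carry the retrieved node data")
```

The guarantee is enforced at construction time rather than at each use site, which is the "alternate pattern" being proposed.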

gsalgado (Collaborator, Author):

get_children() is called precisely to find the children of the node we're processing, and process() sets request.data immediately before calling it. If I inlined this method into process() it wouldn't raise a red flag, and it wouldn't make process() any more complex or less readable, as this is a two-line method that just decodes the request data and passes it to _get_children().

pipermerriam (Member):

👍 if it gets rid of the mutation of the request object and thus the implicit availability of the request.data attribute.

gsalgado (Collaborator, Author):

In order to completely get rid of the mutation of request.data, I'll have to drop the check in process() that raises SyncRequestAlreadyProcessed when request.data is not None. But that check is useless anyway, as the exception is caught and ignored in StateDownloader.

gsalgado (Collaborator, Author):

Hmm, spoke too soon. I thought I'd be able to have process() pass the data to commit(), but that doesn't work because we don't always commit at the end of process() -- we only commit once all the child nodes have been retrieved and committed. In other words, we need to keep (somewhere) the data received for every request until that request is eventually committed. I could get rid of request.data, but then I'd have to store the data for every request in an instance variable of HexaryTrieSync. The SyncResponse you suggest would also have to be stored somewhere, and since it's disjoint from the actual request I'm not sure it really gives us any guarantees -- we could still try to commit a request for which there's no SyncResponse, in the same way we can try to commit one whose .data is set to None.

pipermerriam (Member):

If you can't find a way to do it that seems good, my fallback request would be to add some comments documenting this deficiency. Given that the two methods that access request.data will very likely fail loudly if it is None, I'm OK with it staying in.

gsalgado (Collaborator, Author):

OK, I've added a docstring to commit(), and the other method is gone, as there was no point in having it anyway.

await self.commit(request)
continue

references, leaves = self.get_children(request)
pipermerriam (Member):

Related to my other comment about request.data being potentially problematic. At this call-site we'd instead pass in a response object so that we have a guarantee that the data attribute is present.

if request.dependencies == 0:
await self.commit(request)

async def commit(self, request: SyncRequest) -> None:
pipermerriam (Member):

This function would also change to take a SyncResponse object.

@gsalgado gsalgado force-pushed the issue-1074 branch 3 times, most recently from f0af0bd to 4014ff9 Compare August 6, 2018 18:09
gsalgado (Collaborator, Author) commented Aug 7, 2018

@pipermerriam I think I've addressed all your comments. Wanna have another look?

# Requests get added to both self.queue and self.requests; the former is used to keep
# track which requests should be sent next, and the latter is used to avoid scheduling a
# request for a given node multiple times.
self.logger.debug("Scheduling retrieval of %s", encode_hex(request.node_key))
pipermerriam (Member):

This should probably be TRACE level logging?
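Python's stdlib logging has no built-in TRACE level, so one would typically register a custom level below DEBUG. A minimal sketch (the level number 5 and the helper name are common conventions, not part of the PR):

```python
import logging

TRACE = 5  # conventionally below DEBUG (10); not a stdlib level
logging.addLevelName(TRACE, "TRACE")


def trace(self: logging.Logger, msg: str, *args, **kwargs) -> None:
    # Mirrors Logger.debug()/.info(): only do work if the level is enabled.
    if self.isEnabledFor(TRACE):
        self._log(TRACE, msg, args, **kwargs)


logging.Logger.trace = trace  # type: ignore[attr-defined]

logger = logging.getLogger("trie_sync")
logger.setLevel(TRACE)
logger.trace("Scheduling retrieval of %s", "0xabc123")
```

With that in place, chatty per-node messages can be demoted to logger.trace(...) and stay silent at the default DEBUG/INFO levels.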

is_raw: bool = False) -> None:
"""Schedule a request for the node with the given key."""
if node_key in self._existing_nodes:
self.logger.debug("Node %s already exists in db", encode_hex(node_key))
pipermerriam (Member):

If this code path is likely to be encountered a lot we should move this to TRACE level logging. Same with the one in the next if block.

if request is None:
# This may happen if we resend a request for a node after waiting too long,
# and then eventually get two responses with it.
self.logger.debug(
pipermerriam (Member):

TRACE level here too.

encode_hex(node_key))
return

if request.data is not None:
pipermerriam (Member):

I think I've got a solution that'll work. Lets remove the data property from the SyncRequest object and just use a boolean property like is_done. We can set is_done here in this method, and just pass the raw data value into self.commit. It looks like there are no other places where request.data is accessed. This has the added benefit of reducing the memory overhead of this class.

gsalgado (Collaborator, Author):

We don't always commit at the end of process(), so we need to keep the data for a given request until we have committed all of its children; see my previous comment: #1124 (comment)

pipermerriam (Member):

Ah, missed that. Ok, :shipit: unless you see a clean way around this.

# A cache of node hashes we know to exist in our DB, used to avoid querying the DB
# unnecessarily as that's the main bottleneck when dealing with a large DB like for
# ethereum's mainnet/ropsten.
self._existing_nodes: Set[Hash32] = set()
pipermerriam (Member):

This worries me a bit as it has O(n) storage cost. For a full state trie sync I believe that means storing roughly 180 million hashes, which is about 5 GB of memory. Do you think we can get by with an LRU cache or something on the existence check, and not actually store all of them?
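A bounded LRU cache for the existence check might be sketched like this (class name and default maxsize are made up for illustration; the trade-off is constant memory in exchange for occasional redundant DB lookups on evicted keys):

```python
from collections import OrderedDict


class LRUSet:
    """A bounded set-like cache that evicts the least recently used entry."""

    def __init__(self, maxsize: int = 100_000) -> None:
        self.maxsize = maxsize
        self._items: "OrderedDict[bytes, None]" = OrderedDict()

    def add(self, key: bytes) -> None:
        self._items[key] = None
        self._items.move_to_end(key)
        if len(self._items) > self.maxsize:
            self._items.popitem(last=False)  # drop the least recently used key

    def __contains__(self, key: bytes) -> bool:
        if key in self._items:
            self._items.move_to_end(key)  # a hit counts as "recently used"
            return True
        return False
```

Swapping this in for the unbounded set would keep self._existing_nodes at a fixed size; a miss on an evicted key just falls through to the coro_exists() DB check again.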

gsalgado (Collaborator, Author):

I had a déjà vu feeling here, so I did a bit of digging and found ethereum/py-trie#43 (review).

I've opened #1154 for it.
