core, eth, trie: bloom filter for trie node dedup during fast sync #19489
State sync (part of fast sync) works by taking the root of a state trie, downloading it from remote peers, deduplicating all children based on the local database and recursing into any missing subtries. The deduplication simply checks the local database to see whether the node is already known and skips downloading it if so. Unfortunately this check becomes exceedingly expensive as the database grows.
This PR adds a bloom filter to the state sync, so that instead of checking on disk whether we already have each and every node locally, we first consult a gigantic bloom filter. If the node being deduplicated is not in the bloom, it surely is not on disk either, so we can save an expensive database lookup. If the bloom says it "might" be present, we must still verify against the database.
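To make the lookup order concrete, here is a minimal sketch of a bloom-gated existence check. It is not the code from this PR: the `bloom`, `store` and `hashOf` names are hypothetical, a map stands in for the on-disk database, and the filter is a toy implementation rather than the library the PR pulls in. It only assumes that keys are 32-byte hashes, so slices of the key itself can serve as the filter's hash values.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// bloom is a toy Bloom filter over 32-byte node hashes (hypothetical type,
// not the library used in the PR).
type bloom struct {
	bits []uint64
	m    uint64 // number of bits in the filter
}

func newBloom(m uint64) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m}
}

// positions derives four bit indices from a key. The keys are already
// cryptographic hashes, so 8-byte slices of them act as hash values.
func (b *bloom) positions(hash [32]byte) [4]uint64 {
	var p [4]uint64
	for i := range p {
		p[i] = binary.BigEndian.Uint64(hash[i*8:]) % b.m
	}
	return p
}

func (b *bloom) Add(hash [32]byte) {
	for _, p := range b.positions(hash) {
		b.bits[p/64] |= uint64(1) << (p % 64)
	}
}

// MayContain returns false only if the key was definitely never added.
func (b *bloom) MayContain(hash [32]byte) bool {
	for _, p := range b.positions(hash) {
		if b.bits[p/64]&(uint64(1)<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

// store pairs a map (standing in for the on-disk database) with the filter.
type store struct {
	db      map[[32]byte][]byte
	filter  *bloom
	lookups int // number of simulated database reads
}

func newStore(bloomBits uint64) *store {
	return &store{db: make(map[[32]byte][]byte), filter: newBloom(bloomBits)}
}

// Put writes a node and records its hash in the filter.
func (s *store) Put(hash [32]byte, blob []byte) {
	s.db[hash] = blob
	s.filter.Add(hash)
}

// Has consults the filter first and touches the database only on "maybe".
func (s *store) Has(hash [32]byte) bool {
	if !s.filter.MayContain(hash) {
		return false // definitely absent, no database read needed
	}
	s.lookups++
	_, ok := s.db[hash]
	return ok
}

// hashOf is a helper that produces a 32-byte key for the demo.
func hashOf(s string) [32]byte { return sha256.Sum256([]byte(s)) }

func main() {
	s := newStore(1 << 20)
	s.Put(hashOf("known"), []byte{0x01})

	fmt.Println(s.Has(hashOf("known")))   // filter says maybe, database confirms
	fmt.Println(s.Has(hashOf("unknown"))) // filter rules it out
	fmt.Println("database reads:", s.lookups)
}
```

The saving comes from the second call: a node that was never written is rejected by the filter alone, barring a false positive, so the database is read only for nodes that may actually exist.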
The bloom however can only return correct answers (i.e. a trie node surely was not yet downloaded) if we ensure that all trie nodes on disk have been injected into the filter too. This is the case for a fresh node with no database (i.e. no nodes whatsoever), but it does not hold true if a node is restarted mid-sync. For those instances, the PR starts by scanning the local database and injecting any previously downloaded trie node hashes into the bloom. As long as the initialization is running, the bloom always returns "maybe", delegating fast sync to reach down into the database instead of relying on the bloom.
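The "always maybe until initialized" behaviour can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `gatedFilter` and its methods are hypothetical names, a mutex-protected map stands in for the actual bloom filter, and the existing database contents are passed in as a plain slice rather than read through a database iterator.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// gatedFilter wraps a membership filter with a "ready" flag. A map-backed
// exact set stands in for the real Bloom filter to keep the sketch short.
type gatedFilter struct {
	mu     sync.RWMutex
	seen   map[string]struct{}
	inited uint32 // set to 1 atomically once the database scan has finished
}

func newGatedFilter() *gatedFilter {
	return &gatedFilter{seen: make(map[string]struct{})}
}

// Add records a key. It is safe to call while the initial scan is running,
// so nodes downloaded during initialization are not lost.
func (f *gatedFilter) Add(key string) {
	f.mu.Lock()
	f.seen[key] = struct{}{}
	f.mu.Unlock()
}

// Init injects every key already on disk, then marks the filter as usable.
func (f *gatedFilter) Init(existing []string) {
	for _, k := range existing {
		f.Add(k)
	}
	atomic.StoreUint32(&f.inited, 1)
}

// MayContain answers "maybe" (true) until Init has completed, so callers
// fall back to the database instead of trusting an incomplete filter.
func (f *gatedFilter) MayContain(key string) bool {
	if atomic.LoadUint32(&f.inited) == 0 {
		return true
	}
	f.mu.RLock()
	_, ok := f.seen[key]
	f.mu.RUnlock()
	return ok
}

func main() {
	f := newGatedFilter()
	fmt.Println(f.MayContain("missing")) // true: scan not finished, answer is "maybe"

	f.Init([]string{"a", "b"}) // keys found in the local database after a restart
	fmt.Println(f.MayContain("missing")) // false: filter can now rule keys out
	fmt.Println(f.MayContain("a"))       // true
}
```

In a real sync `Init` would run in a background goroutine while downloads continue, which is why the flag is read and written atomically and `Add` takes a lock.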
The code is not ideal. We need to clean it up a bit, possibly with regard to thread safety too. We might also consider reimplementing the bloom instead of pulling in the library, as I'm not a fan of it. The PR is mostly up for benchmarking, discussions and reviews (note, it's not bad, just not final).
On my machine (a VM with NAT)
Basically, it currently starts iterating the state lazily once we get some state from a peer. So on my machine it's