core/bloombits, eth/filter: transformed bloom bitmap based log search #14631
Further parts of the code may be moved into separate PRs to make the review process easier (suggestions are welcome).
This PR optimizes log searching by creating a data structure (BloomBits) that makes it cheaper to retrieve the bloom filter data relevant to a specific filter. When searching a long section of the block history, we check three specific bits of each bloom filter per address/topic. To do that, we currently read/retrieve a roughly 500 byte block header for each block. The implemented structure optimizes this by a "bitwise 90 degree rotation" of the bloom filters. Blocks are grouped into sections (SectionSize is 4096 blocks at the moment), and BloomBits[bitIdx][sectionIdx] is a 4096 bit (512 byte) long bit vector that contains a single bit of each bloom filter from the block range [sectionIdx*SectionSize ... (sectionIdx+1)*SectionSize-1]. (Since bloom filters are usually sparse, a simple data compression makes this structure even more efficient, especially for ODR retrieval.) By reading and binary AND-ing three BloomBits sections, we can filter for an address/topic in 4096 blocks at once ("1" bits in the binary AND result mean bloom matches).
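The rotation and the three-way AND can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: the names (`rotate`, `matchSection`, `setBit`, `hasBit`) are hypothetical, the bit numbering inside each byte (most significant bit first) is an assumption, and compression is left out. Only the sizes come from the description above (4096 blocks per section, 512 byte vectors) plus the 2048 bit header bloom.

```go
package main

import "fmt"

const (
	sectionSize = 4096 // blocks per section, as stated in the description
	bloomBits   = 2048 // bits in a block header bloom filter
)

type bloom [bloomBits / 8]byte       // one block's bloom filter
type bitVector [sectionSize / 8]byte // one bloom bit for every block of a section

// setBit and hasBit use most-significant-bit-first numbering (an assumption).
func setBit(v []byte, i int)      { v[i/8] |= byte(0x80) >> uint(i%8) }
func hasBit(v []byte, i int) bool { return v[i/8]&(byte(0x80)>>uint(i%8)) != 0 }

// rotate performs the "90 degree rotation" for one section:
// out[bitIdx] has bit blk set iff the bloom of block blk has bit bitIdx set.
func rotate(blooms *[sectionSize]bloom) *[bloomBits]bitVector {
	out := new([bloomBits]bitVector)
	for blk := 0; blk < sectionSize; blk++ {
		for bit := 0; bit < bloomBits; bit++ {
			if hasBit(blooms[blk][:], bit) {
				setBit(out[bit][:], blk)
			}
		}
	}
	return out
}

// matchSection ANDs three rotated vectors and returns the offsets (within
// the section) of the blocks whose bloom filter has all three bits set.
func matchSection(a, b, c *bitVector) []int {
	var hits []int
	for i := range a {
		and := a[i] & b[i] & c[i]
		for j := 0; and != 0; j++ {
			if and&0x80 != 0 {
				hits = append(hits, i*8+j)
			}
			and <<= 1
		}
	}
	return hits
}

func main() {
	blooms := new([sectionSize]bloom)
	for _, blk := range []int{5, 4000} { // blocks with all three bits set
		for _, bit := range []int{10, 20, 30} {
			setBit(blooms[blk][:], bit)
		}
	}
	setBit(blooms[7][:], 10) // block 7 has only two of the three bits
	setBit(blooms[7][:], 20)

	rot := rotate(blooms)
	fmt.Println(matchSection(&rot[10], &rot[20], &rot[30])) // prints [5 4000]
}
```

In this sketch a filter for one address or topic reads three 512 byte vectors per section instead of 4096 block headers.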
Implementation and design rationale of the matcher logic
The matcher was designed with the needs of both full and light nodes in mind. A simpler architecture would probably be satisfactory for full nodes (where the bit vectors are available in the local database), but the network retrieval bottleneck of light clients justifies a more sophisticated algorithm that tries to minimize the amount of retrieved data and return results as soon as possible. The current implementation is a pipelined structure based on input and output channels (receiving section indexes and sending potential matches). The matcher is built from sub-matchers, one for the addresses and one for each topic group. Since we are only interested in matches that every sub-matcher signals as positive, they are daisy-chained so that subsequent sub-matchers only retrieve and match the bit vectors of sections where the previous matchers have found a potential match. The "1" bits of the output of the last sub-matcher are returned as bloom filter matches.
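A rough sketch of the daisy-chaining idea, assuming each sub-matcher can be modelled as a function that returns one already-combined 512 byte vector per section. The names `partialMatch`, `fetchFunc`, `subMatch` and `pipeline` are hypothetical and do not correspond to the PR's types. The real matcher also has to combine three bit vectors per address/topic and OR the alternatives within a group, which is hidden behind `fetchFunc` here.

```go
package main

import "fmt"

const sectionBytes = 512 // 4096 blocks per section, one bit per block

// partialMatch carries the blocks of a section that are still candidates.
type partialMatch struct {
	section uint64
	bitset  []byte // "1" bit = potential match so far
}

// fetchFunc stands in for one sub-matcher's retrieval step (hypothetical).
type fetchFunc func(section uint64) []byte

// subMatch ANDs incoming candidates with this stage's vector and forwards
// only the sections that still contain at least one "1" bit.
func subMatch(in <-chan partialMatch, fetch fetchFunc) <-chan partialMatch {
	out := make(chan partialMatch)
	go func() {
		defer close(out)
		for pm := range in {
			vec := fetch(pm.section) // only called for surviving sections
			next := make([]byte, sectionBytes)
			nonZero := false
			for i := range next {
				next[i] = pm.bitset[i] & vec[i]
				nonZero = nonZero || next[i] != 0
			}
			if nonZero {
				out <- partialMatch{section: pm.section, bitset: next}
			}
		}
	}()
	return out
}

// pipeline feeds section indexes into a chain of sub-matchers.
func pipeline(sections []uint64, stages ...fetchFunc) <-chan partialMatch {
	src := make(chan partialMatch)
	go func() {
		defer close(src)
		for _, s := range sections {
			all := make([]byte, sectionBytes)
			for i := range all {
				all[i] = 0xff // every block starts as a candidate
			}
			src <- partialMatch{section: s, bitset: all}
		}
	}()
	var out <-chan partialMatch = src
	for _, f := range stages {
		out = subMatch(out, f)
	}
	return out
}

func main() {
	// Toy data: the address stage matches block 0 of sections 1 and 3,
	// the topic stage matches block 0 of section 3 only.
	addr := func(s uint64) []byte {
		v := make([]byte, sectionBytes)
		if s == 1 || s == 3 {
			v[0] = 0x80
		}
		return v
	}
	topic := func(s uint64) []byte {
		v := make([]byte, sectionBytes)
		if s == 3 {
			v[0] = 0x80
		}
		return v
	}
	for pm := range pipeline([]uint64{0, 1, 2, 3}, addr, topic) {
		fmt.Println("potential match in section", pm.section)
	}
}
```

In the toy run the topic stage is asked for sections 1 and 3 only, which is the saving that matters when every fetch is a network round trip.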
Light clients retrieve the bit vectors with merkle proofs, which makes it much more efficient to retrieve batches of vectors (whose merkle proofs share most of their trie nodes) in a single request. It is also preferable to prioritize requests by section index (regardless of bit index) so that matches are found and returned as soon as possible (and in sequential order). Prioritizing and batching are handled by a common request distributor that receives individual bit index/section index requests from fetchers and keeps an ordered list of section indexes to be requested, grouped by bit index. It does not call any retrieval backend function itself; instead, it is called by a "server" process (see serveMatcher in filter.go). NextRequest returns the next batch to be requested, and retrieved vectors are handed back through the Deliver function. With this design the bloombits package only has to implement the matching logic, while the caller retains full control over the resources (CPU/disk/network bandwidth) assigned to this task.
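The distributor's contract can be sketched as below. Only the names NextRequest and Deliver come from the description; their signatures, the `distributor` type, the `Request` and `Result` helpers and the tie-breaking rule are assumptions made for illustration, and the blocking/wake-up logic that real fetchers would need is omitted.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

type vecKey struct {
	bit     uint
	section uint64
}

// distributor queues (bit index, section index) requests and hands them out
// in batches; it never touches a retrieval backend itself.
type distributor struct {
	mu      sync.Mutex
	pending map[uint][]uint64 // bit index -> sorted, non-empty section list
	results map[vecKey][]byte
}

func newDistributor() *distributor {
	return &distributor{pending: map[uint][]uint64{}, results: map[vecKey][]byte{}}
}

// Request is what a fetcher would call; duplicates are ignored.
func (d *distributor) Request(bit uint, section uint64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	list := d.pending[bit]
	i := sort.Search(len(list), func(i int) bool { return list[i] >= section })
	if i < len(list) && list[i] == section {
		return
	}
	list = append(list, 0)
	copy(list[i+1:], list[i:])
	list[i] = section
	d.pending[bit] = list
}

// NextRequest picks the bit index whose lowest pending section is smallest
// (ties go to the lower bit index) and returns up to maxBatch of its sections.
func (d *distributor) NextRequest(maxBatch int) (bit uint, sections []uint64, ok bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if maxBatch < 1 {
		maxBatch = 1
	}
	for b, list := range d.pending {
		if !ok || list[0] < d.pending[bit][0] || (list[0] == d.pending[bit][0] && b < bit) {
			bit, ok = b, true
		}
	}
	if !ok {
		return 0, nil, false
	}
	list := d.pending[bit]
	n := len(list)
	if n > maxBatch {
		n = maxBatch
	}
	sections = append([]uint64(nil), list[:n]...)
	if n == len(list) {
		delete(d.pending, bit)
	} else {
		d.pending[bit] = list[n:]
	}
	return bit, sections, true
}

// Deliver stores vectors retrieved by the caller, one per requested section.
func (d *distributor) Deliver(bit uint, sections []uint64, vectors [][]byte) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for i, s := range sections {
		d.results[vecKey{bit, s}] = vectors[i]
	}
}

// Result returns a delivered vector, if any.
func (d *distributor) Result(bit uint, section uint64) ([]byte, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	v, ok := d.results[vecKey{bit, section}]
	return v, ok
}

func main() {
	d := newDistributor()
	d.Request(7, 4)
	d.Request(3, 9)
	d.Request(7, 2)
	// The "server" loop owns all I/O; here retrieval is faked with empty vectors.
	for bit, secs, ok := d.NextRequest(8); ok; bit, secs, ok = d.NextRequest(8) {
		fmt.Println("retrieve bit", bit, "sections", secs)
		d.Deliver(bit, secs, make([][]byte, len(secs)))
	}
}
```

Because the loop in `main` decides when to call NextRequest and how large each batch is, the caller stays in charge of how much disk or network work is done at a time.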
I don't think I fully understand the operation of the various goroutines here. Could we schedule a call to discuss and review interactively?
Generally looks good, but I'm concerned the code to retrieve data from disk is massively overengineered.