Add support for recursive graph bisection. #12489
Conversation
Recursive graph bisection is an extremely effective algorithm to reorder doc IDs in a way that improves both storage and query efficiency by clustering similar documents together. It usually performs better than other techniques that try to achieve a similar goal such as sorting the index in natural order (e.g. by URL) or by a min-hash, though it comes at a higher index-time cost. The [original paper](https://arxiv.org/pdf/1602.08820.pdf) is good but I found this [follow-up paper](http://engineering.nyu.edu/~suel/papers/bp-ecir19.pdf) to describe the algorithm in more practical ways.
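For intuition, the move-gain computation at the heart of each bisection step can be sketched in plain Java. This is a toy model using the papers' `deg * log2(n / (deg + 1))` cost estimate for a term's posting-list gaps; the class, method names, and data layout are illustrative, not the patch's actual code:

```java
import java.util.HashMap;
import java.util.Map;

public class BisectionGain {
  // Cost estimate for a term occurring in `deg` docs of a partition of size `n`:
  // deg * log2(n / (deg + 1)), the gap-cost model from the BP papers.
  static double cost(int n, int deg) {
    if (deg <= 0) {
      return 0;
    }
    return deg * (Math.log((double) n / (deg + 1)) / Math.log(2));
  }

  // Gain of moving one document (its terms given as term IDs) from the left
  // partition to the right one: positive when the move shrinks the estimated
  // total cost of encoding doc-ID gaps.
  static double moveGain(int[] docTerms, int nLeft, int nRight,
                         Map<Integer, Integer> degLeft, Map<Integer, Integer> degRight) {
    double gain = 0;
    for (int t : docTerms) {
      int dl = degLeft.getOrDefault(t, 0);
      int dr = degRight.getOrDefault(t, 0);
      double before = cost(nLeft, dl) + cost(nRight, dr);
      double after = cost(nLeft, dl - 1) + cost(nRight, dr + 1);
      gain += before - after;
    }
    return gain;
  }

  public static void main(String[] args) {
    // A doc on the left whose only term is already dense on the right:
    // moving it toward the dense side should have positive gain.
    Map<Integer, Integer> degLeft = new HashMap<>(Map.of(7, 1));
    Map<Integer, Integer> degRight = new HashMap<>(Map.of(7, 90));
    double g = moveGain(new int[] {7}, 100, 100, degLeft, degRight);
    System.out.println(g > 0); // prints "true"
  }
}
```

In the real algorithm this gain is computed for every document on both sides, documents are sorted by gain, and the best pairs are swapped until convergence.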
I'm opening this draft in case someone wants to take a look. I only checked the output on very small indices for now. I also ran it on larger indices, such as a 1.8M-docs wikimedium10m segment, to see how long it takes (4 minutes on my 24-core machine), but I haven't checked whether the result makes sense yet. It's probably full of bugs!
I think it's starting to look better now. I worked on some inefficiencies and applied some of the optimizations suggested by Mackenzie et al. in "Tradeoff Options for Bipartite Graph Partitioning":
With the suggested defaults of minDocFreq=4,096 and minPartitionSize=32, I'm getting the following performance numbers on wikimedium10m (10M docs):
Then comparing query performance, I'm getting interesting results. I had to disable verification of scores and counts because of the reordering. A quick manual check suggests that results are valid. I can guess why some queries like conjunctions are faster, but I'm not sure for
I ran the benchmark multiple times to see if the slowdown on
First, the reordering works pretty well, as there are 11 contiguous ranges of 100k doc IDs that don't have a single occurrence of But I suspect that it is also the source of the slowdown with the disjunction:
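A quick way to quantify this kind of clustering is to count how many fixed-size doc-ID ranges contain no occurrence of a term at all. A self-contained sketch of such a diagnostic (class and method names are hypothetical, not the benchmark's code):

```java
import java.util.HashSet;
import java.util.Set;

public class EmptyRanges {
  // Count how many aligned ranges of `rangeSize` doc IDs in [0, maxDoc)
  // contain no doc from the (sorted) posting list of a term.
  static int emptyRanges(int[] postings, int maxDoc, int rangeSize) {
    Set<Integer> occupied = new HashSet<>();
    for (int doc : postings) {
      occupied.add(doc / rangeSize);
    }
    int numRanges = (maxDoc + rangeSize - 1) / rangeSize;
    return numRanges - occupied.size();
  }

  public static void main(String[] args) {
    // A well-clustered term: all 1,000 occurrences packed into docs 200k..300k.
    int[] clustered = new int[1000];
    for (int i = 0; i < 1000; i++) {
      clustered[i] = 200_000 + i * 100;
    }
    System.out.println(emptyRanges(clustered, 1_000_000, 100_000)); // prints "9"
  }
}
```

Long empty ranges are what make impact-style skipping cheap for conjunctions, while a disjunction still has to visit the dense ranges of every clause.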
I opened #12526 for a potential solution to this problem.
@jpountz did you measure any change to index size with the reordered docids?
I did. My wikimedium file is sorted by title, which already gives some compression compared to random ordering. Disappointingly, recursive graph bisection only improved compression of postings (doc) by 1.5%. It significantly hurts stored fields though, I suspect it's because the
This made me doubt at first whether the algorithm was correctly implemented, but the query speedups and postings distributions suggest it is not completely wrong. I should run on wikibigall too.
Wikibigall. Less space is spent on doc values this time since I did not enable indexing of facets. There is a more significant size reduction of postings this time (-10.5%). This is not misaligned with the reproducibility paper, which observed size reductions of 18% with partitioned Elias-Fano and 5% with SVByte on the Wikipedia dataset. I would expect PFor to be somewhere in between, as it's better able to take advantage of small gaps between docs than SVByte, but less so than partitioned Elias-Fano.
Benchmarks still show slowdowns on phrase queries and speedups on conjunctions, though it's less spectacular than on wikimedium10m.
Thanks @jpountz -- these are fascinating results! I wonder why stored fields index size wasn't hurt nearly as much for It's interesting that
This is because wikimedium uses chunks of articles as documents, and every chunk has the title of the Wikipedia article, so there are often ten or more adjacent docs that have the same title. This is a best case for stored fields compression, as only the first title is actually stored and other occurrences of the same title are replaced with a reference to the first occurrence. With reordering, these duplicate titles are no longer in the same block, so it goes back to just deduplicating bits of title strings instead of entire titles. wikibig doesn't have this best-case scenario for stored fields compression. Ordering only helps a bit because articles are in title order, so there are more duplicate strings in a block of stored fields (shared prefixes) compared to the reordered index.
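This effect is easy to reproduce with any LZ-style compressor: when documents are compressed in fixed-size blocks, a duplicate title costs almost nothing if it lands in the same block as its first occurrence. A self-contained sketch using `java.util.zip.Deflater` as a stand-in for Lucene's stored-fields compression (the titles, block size, and class names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.Deflater;

public class StoredFieldsDedup {
  // Size of `s` after DEFLATE compression.
  static int deflatedSize(String s) {
    Deflater d = new Deflater();
    byte[] in = s.getBytes();
    d.setInput(in);
    d.finish();
    byte[] buf = new byte[in.length + 64];
    int n = 0;
    while (!d.finished()) {
      n += d.deflate(buf);
    }
    d.end();
    return n;
  }

  // Total size when docs are compressed in independent blocks of `blockSize`,
  // mimicking block-wise stored-fields compression.
  static int totalSize(List<String> docs, int blockSize) {
    int total = 0;
    for (int i = 0; i < docs.size(); i += blockSize) {
      StringBuilder block = new StringBuilder();
      for (int j = i; j < Math.min(i + blockSize, docs.size()); j++) {
        block.append(docs.get(j));
      }
      total += deflatedSize(block.toString());
    }
    return total;
  }

  // Title-sorted order keeps duplicate titles in the same block; a reordering
  // that interleaves articles splits them across blocks.
  static boolean sortedBeatsShuffled() {
    String a = "History of mathematics in medieval Islamic civilization\n";
    String b = "List of sovereign states and dependent territories in Europe\n";
    List<String> sorted = new ArrayList<>(Collections.nCopies(8, a));
    sorted.addAll(Collections.nCopies(8, b));
    List<String> shuffled = new ArrayList<>();
    for (int i = 0; i < 8; i++) {
      shuffled.add(a);
      shuffled.add(b);
    }
    return totalSize(sorted, 2) < totalSize(shuffled, 2);
  }

  public static void main(String[] args) {
    System.out.println(sortedBeatsShuffled()); // prints "true"
  }
}
```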
Regarding positions, the reproducibility paper noted that the algorithm helped term frequencies a bit, though not as much as docs. It doesn't say anything about positions, though I suspect that if it tends to group together docs that have the same freq for the same term, then gaps in positions also tend to be more regular.
I just found a bug that in practice only made BP run one iteration per level, fixing it makes performance better (wikibigall):
Space savings are also bigger on postings:
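For reference, the overall shape of the algorithm, and where the per-level iteration count comes in, looks roughly like this. This is a toy skeleton: `swapBestPairs` here is a simple numeric stand-in for the real gain-based swapping, and none of the names come from the patch:

```java
import java.util.ArrayList;
import java.util.List;

public class RecursiveBisection {
  // Split docs in two halves, run up to `maxIters` swap iterations at this
  // level (the bug discussed above effectively capped this at one), then
  // recurse on both halves until partitions reach `minPartitionSize`.
  static List<Integer> reorder(List<Integer> docs, int minPartitionSize, int maxIters) {
    if (docs.size() <= minPartitionSize) {
      return docs;
    }
    int mid = docs.size() / 2;
    List<Integer> left = new ArrayList<>(docs.subList(0, mid));
    List<Integer> right = new ArrayList<>(docs.subList(mid, docs.size()));
    for (int iter = 0; iter < maxIters; iter++) {
      if (!swapBestPairs(left, right)) {
        break; // converged early at this level
      }
    }
    reorder(left, minPartitionSize, maxIters);
    reorder(right, minPartitionSize, maxIters);
    for (int i = 0; i < left.size(); i++) {
      docs.set(i, left.get(i));
    }
    for (int i = 0; i < right.size(); i++) {
      docs.set(mid + i, right.get(i));
    }
    return docs;
  }

  // Toy stand-in for gain-based swapping: move smaller IDs left so that
  // "similar" (numerically close) IDs cluster. The real algorithm swaps the
  // highest-gain pairs instead.
  static boolean swapBestPairs(List<Integer> left, List<Integer> right) {
    boolean moved = false;
    for (int i = 0; i < left.size() && i < right.size(); i++) {
      if (left.get(i) > right.get(i)) {
        Integer tmp = left.get(i);
        left.set(i, right.get(i));
        right.set(i, tmp);
        moved = true;
      }
    }
    return moved;
  }

  public static void main(String[] args) {
    List<Integer> docs = new ArrayList<>(List.of(5, 1, 4, 0, 3, 7, 2, 6));
    // Smaller IDs end up clustered in the first half.
    System.out.println(reorder(docs, 1, 20));
  }
}
```

Running only one iteration per level leaves many profitable swaps on the table, which is consistent with the fix improving both query performance and postings size.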
Since it's fairly unintrusive to other functionality, I felt free to merge.
Using the default ThreadFactory for fork-join pools runs tests without permissions when the security manager is enabled.
See test failures here: https://jenkins.thetaphi.de/job/Lucene-MMAPv2-Windows/794/
```java
public void testSingleTermWithForkJoinPool() throws IOException {
    int concurrency = TestUtil.nextInt(random(), 1, 8);
    ForkJoinPool pool = new ForkJoinPool(concurrency);
```
The default implementation of ForkJoinPool executes tasks without any permissions. This causes the test to fail if an FS-based directory implementation is used:
To fix this, use a thread factory that does not remove all permissions.
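One way to do that with plain JDK APIs is to pass a custom `ForkJoinWorkerThreadFactory` whose workers are ordinary `ForkJoinWorkerThread` subclasses; unlike the default factory's "innocuous" permission-less threads, these inherit the creating thread's context. The class and method names below are illustrative, not the test's actual code:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinWorkerThread;
import java.util.stream.IntStream;

public class PermissionedPool {
  // A plain worker subclass: it keeps the creating thread's permissions,
  // unlike the default factory's innocuous workers under a security manager.
  static final class Worker extends ForkJoinWorkerThread {
    Worker(ForkJoinPool pool) {
      super(pool);
    }
  }

  // Build a pool that uses the permission-preserving factory.
  public static ForkJoinPool newPool(int parallelism) {
    return new ForkJoinPool(parallelism, Worker::new, null, false);
  }

  public static void main(String[] args) {
    ForkJoinPool pool = newPool(2);
    int sum = pool.submit(() -> IntStream.rangeClosed(1, 10).sum()).join();
    System.out.println(sum); // prints "55"
    pool.shutdown();
  }
}
```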