introduce new scorch index builder #1282
Conversation
The purpose of the index builder is to better support use cases where one would like to index a set of data, not search it during the build, ensure the resulting file is in an efficient, optimized state, and then support opening this index read-only to serve queries.

Useful implementation details:

- Only an Index() method is offered, meaning you do not control the batch size. The builder will group data into batches using a configurable 'batchSize' parameter (default 1000).
- All of these batches are persisted as segments into a temp location. You can control this using 'buildPathPrefix', or it will default to a system temp location.
- You must call Close() after indexing all data. This will flush the last batch and begin merging segments. The builder will attempt to merge several segments at once; this is configurable with the 'mergeMax' setting, which defaults to 10. Merging continues until there is only 1 segment remaining.
- At this point, the final segment is moved to the final location (as specified by the original builder constructor), the root.bolt is created pointing to the segment, and the remaining bleve index wrapping is completed.
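The workflow described above might look like the following sketch in Go. The option names (`batchSize`, `mergeMax`, `buildPathPrefix`) come from the PR text; the `NewBuilder` entry point and exact signatures are assumptions about the final API, so verify them against the merged code:

```go
package main

import (
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	// Builder-specific options named in the PR description above.
	config := map[string]interface{}{
		"batchSize":       1000,       // documents grouped per internal batch
		"mergeMax":        10,         // max segments merged at once
		"buildPathPrefix": "/var/tmp", // where temp segments are written
	}

	builder, err := bleve.NewBuilder("example.bleve", bleve.NewIndexMapping(), config)
	if err != nil {
		log.Fatal(err)
	}

	// Only Index() is offered; the builder controls batching internally.
	docs := map[string]interface{}{
		"a": map[string]interface{}{"body": "hello world"},
		"b": map[string]interface{}{"body": "goodbye world"},
	}
	for id, doc := range docs {
		if err := builder.Index(id, doc); err != nil {
			log.Fatal(err)
		}
	}

	// Close flushes the last batch, merges down to a single segment,
	// and completes the final index at example.bleve.
	if err := builder.Close(); err != nil {
		log.Fatal(err)
	}
}
```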
Previously we used incorrect logic to check whether there was a final batch that needed to be flushed; this adds a new test to catch that case and adjusts the logic.
We've had great experience with this offline builder at large scale, with real-world data (multiple billions of documents across thousands of shards). Very fast, with predictable, reliable output size.
Does this co-exist with the existing function that allows both search and index operations in one opened index?
Also required some additional refactoring.
@fi0 yes, it coexists in the sense that we are not removing or changing the existing API. However, at the time you decide to create the index, you have to choose which method you want to use. You can use the IndexBuilder to build an initial index, then close it and re-open it with the traditional API (read-only or read-write).
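Reopening with the traditional API might look like the sketch below. The `read_only` runtime option is an assumption about scorch's open-time configuration, not something specified in this PR, so treat it as illustrative:

```go
package main

import (
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	// After builder.Close(), reopen the finished index to serve queries.
	idx, err := bleve.OpenUsing("example.bleve", map[string]interface{}{
		"read_only": true, // assumed scorch runtime option
	})
	if err != nil {
		log.Fatal(err)
	}
	defer idx.Close()

	// Serve searches against the read-only index.
	req := bleve.NewSearchRequest(bleve.NewMatchQuery("hello"))
	res, err := idx.Search(req)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("matched %d documents", res.Total)
}
```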
I just merged in the changes from master that have taken place since this was initially developed. There were a few places I had to make minor changes to use the newer segment plugin interface, but it all should be straightforward to follow. Requesting re-review from the team.
Ping @sreekanth-cb @abhinavdangeti @steveyen: since the initial review was never completed, GitHub won't let me request re-review.
Several of the places this is called are doing a best-effort cleanup, and so still ignore this error, as they're already returning a different error.
I don't know @sreekanth-cb, I'm stuck. Logs show I close the segment, it has a ref count of 1, and close returns nil. Windows claims the file is still used by another process when we go to clean up. From debugging Windows issues like this in the past, I suspect it might be related to the rename operation we do earlier, but I don't see the problem. Any ideas?
@mschoch, perhaps the problem is with the checkIndex method in the UT.
The indexReader isn't closed?
Move both closes to defer to improve reliability.
Looking forward to this being merged so we can finally be aligned with your upstream.
I just gave it a try on an index with 4 million documents and a batch size of 10k. Compared to my old approach of indexing batches, the new builder:
Is that the expected result?
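The "old approach of indexing batches" with the 10k batch size mentioned above presumably looks something like this sketch using bleve's traditional batch API (assumed code, not the commenter's actual program):

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	// Traditional read-write index, as opposed to the new IndexBuilder.
	idx, err := bleve.New("old-style.bleve", bleve.NewIndexMapping())
	if err != nil {
		log.Fatal(err)
	}
	defer idx.Close()

	const batchSize = 10000 // the 10k batch size mentioned above
	batch := idx.NewBatch()
	for i := 0; i < 4000000; i++ {
		if err := batch.Index(fmt.Sprintf("doc-%d", i),
			map[string]interface{}{"n": i}); err != nil {
			log.Fatal(err)
		}
		// Flush each full batch manually; the caller controls batching here.
		if batch.Size() >= batchSize {
			if err := idx.Batch(batch); err != nil {
				log.Fatal(err)
			}
			batch.Reset()
		}
	}
	// Flush the final partial batch.
	if batch.Size() > 0 {
		if err := idx.Batch(batch); err != nil {
			log.Fatal(err)
		}
	}
}
```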
@tylerkovacs so the most likely explanation is that this is because zap version 11 is still the default (for traditional new indexes as well as when using the index builder). Zap version 11 has an issue where considerable space is wasted for low-frequency terms. By forcing everything into a single segment, the degree to which a term is low-frequency tends to be exaggerated, making the wasted space more pronounced. Zap version 12 will become the default in the next major release of bleve. Until then, you can force use of version 12 by passing the following additional configuration when creating a new index, or creating a new index builder:
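The configuration being referred to is likely along these lines. This is a sketch: `forceSegmentType` and `forceSegmentVersion` are scorch config keys to the best of my knowledge, but the exact key names were elided from the comment, so verify them against your bleve version:

```go
import (
	"github.com/blevesearch/bleve"
	"github.com/blevesearch/bleve/index/scorch"
)

// Assumed scorch config keys for pinning the segment format.
config := map[string]interface{}{
	"forceSegmentType":    "zap",
	"forceSegmentVersion": 12,
}

// When creating a traditional new index:
idx, err := bleve.NewUsing("example.bleve", bleve.NewIndexMapping(),
	scorch.Name, scorch.Name, config)

// Or when creating a new index builder:
builder, err := bleve.NewBuilder("example.bleve", bleve.NewIndexMapping(), config)
```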
Can you try again with these settings? I hope this addresses the space issue. In previous testing for a client, zap v12 with a single segment resulted in considerable space savings; that was actually how this PR came to be in the first place. As for the time, there is another config option for the builder. Let me know how these settings work, and we'll try to work through it.
Thanks for the quick reply @mschoch. I didn't mention it, but I am using zap version 12 on both branches. Here's how I'm calling
I even added some logging statements to
I noticed that disk utilization increases dramatically during the call to
While
Could it have something to do with my
@tylerkovacs I wouldn't expect your mapping to matter. With zap v12 + IndexBuilder, no indexes should have gotten larger, so it seems like a bug. I've opened #1397