
introduce new scorch index builder #1282

Merged
merged 17 commits into blevesearch:master from scorch-index-builder on May 12, 2020

Conversation

mschoch
Member

@mschoch mschoch commented Aug 20, 2019

The purpose of the index builder is to better support
use cases where one wants to index a set of data
without searching it during the build, ensure the resulting
file is in an efficient, optimized state, and then
open the index read-only to serve queries.

Useful implementation details:

Only an Index() method is offered, meaning you do not control
the batch size. The builder groups data into batches
using a configurable 'batchSize' parameter (default 1000).

All of these batches are persisted as segments into a temp
location. You can control this using 'buildPathPrefix', or
it will default to a system temp location.

You must call Close() after indexing all data. This will
flush the last batch, and begin merging segments. The
builder will attempt to merge several segments at once. This
is configurable with the 'mergeMax' setting, which defaults
to 10. Merging continues until there is only 1 segment
remaining.

At this point, the final segment is moved to the final
location (as specified by the original builder constructor).
The root.bolt is created, pointing to the segment, and
the remaining bleve index wrapping is completed.

previously we used incorrect logic to determine whether there was
a final batch that needed to be flushed; this adds a new
test to catch that case and adjusts the logic
@fproulx-dfuse

fproulx-dfuse commented Aug 28, 2019

We've had a great experience with this offline builder at large scale, on real-world data (multiple billions of documents across thousands of shards). Very fast and predictable, with reliable output size.

@fi0

fi0 commented Aug 31, 2019

Does this coexist with the existing functionality that allows both search and index operations on one opened index?

Also required some additional refactoring.
@mschoch
Member Author

mschoch commented Apr 14, 2020

@fi0 yes it coexists in the sense that we are not removing or changing the existing API. However, at the time you decide to create the index, you have to choose which method you want to use. You can use the IndexBuilder to build an initial index, then close it and re-open with the traditional API (read-only or read-write).
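The build-then-reopen flow described above can be sketched with the bleve API (the path, mapping, and document here are placeholders, and error handling is abbreviated; this is an illustration, not a tested program):

```go
package main

import (
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	indexMapping := bleve.NewIndexMapping()

	// Phase 1: offline build. Only Index() and Close() are
	// available on the builder; no searching during the build.
	builder, err := bleve.NewBuilder("/data/myindex", indexMapping, nil)
	if err != nil {
		log.Fatal(err)
	}
	if err := builder.Index("doc1", map[string]interface{}{"name": "hello"}); err != nil {
		log.Fatal(err)
	}
	// Close flushes the last batch, merges segments, and finalizes.
	if err := builder.Close(); err != nil {
		log.Fatal(err)
	}

	// Phase 2: reopen with the traditional API to serve queries.
	idx, err := bleve.Open("/data/myindex")
	if err != nil {
		log.Fatal(err)
	}
	defer idx.Close()
}
```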

@mschoch
Member Author

mschoch commented Apr 14, 2020

I just merged in the changes from master that have taken place since this was initially developed. There were a few places where I had to make minor changes to use the newer segment plugin interface, but it should all be straightforward to follow. Requesting re-review from the team.

@mschoch
Member Author

mschoch commented Apr 14, 2020

Ping @sreekanth-cb @abhinavdangeti @steveyen; since the initial review was never completed, it won't let me request re-review.

@mschoch
Member Author

mschoch commented Apr 29, 2020

I don't know, @sreekanth-cb, I'm stuck. The logs show I close the segment, it has a ref count of 1, and close returns nil, yet Windows claims the file is still in use by another process when we go to clean up.

From debugging Windows issues like this in the past, I suspect it might be related to the rename operation we do earlier, but I don't see the problem. Any ideas?

Contributor

@sreekanth-cb sreekanth-cb left a comment


@mschoch, perhaps the problem is with the checkIndex method in the unit test: the indexReader isn't closed?

move both closes to defer to improve reliability
@fproulx-dfuse

Looking forward to this being merged so we can finally align with upstream.
We've been on this experimental branch for months.

@mschoch mschoch merged commit c5a1089 into blevesearch:master May 12, 2020
@mschoch mschoch deleted the scorch-index-builder branch May 12, 2020 23:59
@tylerkovacs
Contributor

I just gave it a try on an index with 4 million documents and a batch size of 10k. Compared to my old approach of indexing batches, the new builder:

  • Took 60% longer (which seems to be due to the additional latency of merging segments during the call to Close()).
  • Creates an index that takes 4x more disk space. A compressed tarball of the index is about 20% bigger.

Is that the expected result?

@mschoch
Member Author

mschoch commented May 13, 2020

@tylerkovacs the most likely explanation is that zap version 11 is still the default (for traditional new indexes as well as when using the index builder). Zap version 11 has an issue where considerable space is wasted for low-frequency terms. By forcing everything into a single segment, the degree to which a term is low-frequency tends to be exaggerated, making the wasted space more pronounced.

Zap version 12 will become the default in the next major release of bleve. Until then, you can force use of version 12 by passing the following additional configuration when creating a new index, or creating a new index builder:

"forceSegmentType": "zap",
"forceSegmentVersion": 12

Can you try again with these settings? I hope that this addresses the space issue. In previous testing for a client, zap v12 with a single segment resulted in considerable space savings. That was actually how this PR came to be in the first place.
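Passed at creation time, the override above might look like this (a sketch against the bleve v1 API; passing these overrides through the kvconfig argument of NewUsing is my assumption here, and the function name is illustrative):

```go
// newZapV12Index creates a traditional scorch index while forcing
// the zap v12 segment format (sketch; config keys as given above).
func newZapV12Index(path string, m mapping.IndexMapping) (bleve.Index, error) {
	config := map[string]interface{}{
		"forceSegmentType":    "zap",
		"forceSegmentVersion": 12,
	}
	return bleve.NewUsing(path, m, scorch.Name, scorch.Name, config)
}
```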

As for the time, there is another builder config option, 'batchSize', which you could set to 10000 to match your previous batch size (the default is 1000). As you noted, the build time may always be somewhat longer because of the forced merging, but the benefit should be reduced disk usage and faster searches.

Let me know how these settings work, and we'll try to work through it.

@tylerkovacs
Contributor

Thanks for the quick reply @mschoch.

I didn't mention it but I am using Zap version 12 on both branches. Here's how I'm calling NewBuilder:

bleve.NewBuilder(a.indexShardPath(i), a.domainIndexMapping, map[string]interface{}{
  "forceSegmentType":    "zap",
  "forceSegmentVersion": 12,
  "batchSize":           10_000,
  "buildPathPrefix":     a.tempIndexShardPath(i),
})

I even added some logging statements to configForceSegmentTypeVersion to verify that zap segment type/version config was recognized.

I noticed that disk utilization increases dramatically during the call to Close(). Immediately before calling Close(), my index is 27GB. As an aside, that's almost twice as big as my final index size when not using Builder.

While Close() is running, it balloons to 70GB before shrinking to 58GB when it completes. So index size is more than doubling as a result of the call to Close().

Could it have something to do with my IndexMapping or the data that I'm adding to the index? My IndexMapping has a single DocumentMapping with 126 FieldMappings of various types.

@mschoch
Member Author

mschoch commented May 14, 2020

@tylerkovacs I wouldn't expect your mapping to matter. With zap v12 + IndexBuilder no indexes should have gotten larger, so it seems like a bug. I've opened #1397
