introduce new scorch index builder #1282
Conversation
The purpose of the index builder is to better support use cases where one would like to index a set of data, not search it during the build, ensure the resulting file is in an efficient, optimized state, and then support opening this index read-only to serve queries.

Useful implementation details:

- Only an Index() method is offered, meaning you do not control the batch size. The builder will group data into batches using a configurable 'batchSize' parameter (default 1000).
- All of these batches are persisted as segments into a temp location. You can control this using 'buildPathPrefix', or it will default to a system temp location.
- You must call Close() after indexing all data. This will flush the last batch and begin merging segments. The builder will attempt to merge several segments at once; this is configurable with the 'mergeMax' setting, which defaults to 10. Merging continues until there is only 1 segment remaining.
- At this point, the final segment is moved to the final location (as specified by the original builder constructor), the root.bolt is created pointing to the segment, and the remaining bleve index wrapping is completed.
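The workflow described above might look like the following sketch in Go. The option names (`batchSize`, `mergeMax`, `buildPathPrefix`) come from the PR text; the `NewBuilder` entry point and exact signatures are assumptions about the final API, so verify them against the merged code:

```go
package main

import (
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	// Builder-specific options named in the PR description above.
	config := map[string]interface{}{
		"batchSize":       1000,       // documents grouped per internal batch
		"mergeMax":        10,         // max segments merged at once
		"buildPathPrefix": "/var/tmp", // where temp segments are written
	}

	builder, err := bleve.NewBuilder("example.bleve", bleve.NewIndexMapping(), config)
	if err != nil {
		log.Fatal(err)
	}

	// Only Index() is offered; the builder controls batching internally.
	docs := map[string]interface{}{
		"a": map[string]interface{}{"body": "hello world"},
		"b": map[string]interface{}{"body": "goodbye world"},
	}
	for id, doc := range docs {
		if err := builder.Index(id, doc); err != nil {
			log.Fatal(err)
		}
	}

	// Close flushes the last batch, merges down to a single segment,
	// and completes the final index at example.bleve.
	if err := builder.Close(); err != nil {
		log.Fatal(err)
	}
}
```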
Previously we used incorrect logic to check whether there was a final batch that needed to be flushed; this adds a new test to catch that case and adjusts the logic.
We've had great experience with this offline builder at large scale, with real-world data (multiple billions of documents across thousands of shards). Very fast, with predictable, reliable output size.
Does this co-exist with the existing function that allows both search and index operations in one opened index?
Also required some additional refactoring.
@fi0 yes, it coexists in the sense that we are not removing or changing the existing API. However, at the time you decide to create the index, you have to choose which method you want to use. You can use the IndexBuilder to build an initial index, then close it and re-open it with the traditional API (read-only or read-write).
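Reopening with the traditional API might look like the sketch below. The `read_only` runtime option is an assumption about scorch's open-time configuration, not something specified in this PR, so treat it as illustrative:

```go
package main

import (
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	// After builder.Close(), reopen the finished index to serve queries.
	idx, err := bleve.OpenUsing("example.bleve", map[string]interface{}{
		"read_only": true, // assumed scorch runtime option
	})
	if err != nil {
		log.Fatal(err)
	}
	defer idx.Close()

	// Serve searches against the read-only index.
	req := bleve.NewSearchRequest(bleve.NewMatchQuery("hello"))
	res, err := idx.Search(req)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("matched %d documents", res.Total)
}
```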
I just merged in the changes from master that have taken place since this was initially developed. There were a few places I had to make minor changes to use the newer segment plugin interface, but it all should be straightforward to follow. Requesting re-review from the team.
Ping @sreekanth-cb @abhinavdangeti @steveyen: since the initial review was never completed, GitHub won't let me request re-review.
Several of the places this is called are doing a best-effort cleanup, and so still ignore this error, as they're already returning a different error.
I don't know @sreekanth-cb, I'm stuck. Logs show I close the segment, it has a ref count of 1, and close returns nil. Windows claims the file is still used by another process when we go to clean up. From debugging Windows issues like this in the past, I suspect it might be related to the rename operation we do earlier, but I don't see the problem. Any ideas?
@mschoch, perhaps the problem is with the checkIndex method in the UT.
The indexReader isn't closed?
Move both closes to defer to improve reliability.
Looking forward to this being merged so we can finally be aligned with your upstream.
I just gave it a try on an index with 4 million documents and a batch size of 10k. Compared to my old approach of indexing batches, the new builder:
Is that the expected result?
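The "old approach of indexing batches" with the 10k batch size mentioned above presumably looks something like this sketch using bleve's traditional batch API (assumed code, not the commenter's actual program):

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	// Traditional read-write index, as opposed to the new IndexBuilder.
	idx, err := bleve.New("old-style.bleve", bleve.NewIndexMapping())
	if err != nil {
		log.Fatal(err)
	}
	defer idx.Close()

	const batchSize = 10000 // the 10k batch size mentioned above
	batch := idx.NewBatch()
	for i := 0; i < 4000000; i++ {
		if err := batch.Index(fmt.Sprintf("doc-%d", i),
			map[string]interface{}{"n": i}); err != nil {
			log.Fatal(err)
		}
		// Flush each full batch manually; the caller controls batching here.
		if batch.Size() >= batchSize {
			if err := idx.Batch(batch); err != nil {
				log.Fatal(err)
			}
			batch.Reset()
		}
	}
	// Flush the final partial batch.
	if batch.Size() > 0 {
		if err := idx.Batch(batch); err != nil {
			log.Fatal(err)
		}
	}
}
```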
@tylerkovacs so the most likely explanation is that this is because zap version 11 is still the default (for traditional new indexes as well as when using the index builder). Zap version 11 has an issue where considerable space is wasted for low-frequency terms. By forcing everything into a single segment, the degree to which a term is low-frequency tends to be exaggerated, making the wasted space more pronounced. Zap version 12 will become the default in the next major release of bleve. Until then, you can force use of version 12 by passing the following additional configuration when creating a new index, or creating a new index builder:
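The configuration being referred to is likely along these lines. This is a sketch: `forceSegmentType` and `forceSegmentVersion` are scorch config keys to the best of my knowledge, but the exact key names were elided from the comment, so verify them against your bleve version:

```go
import (
	"github.com/blevesearch/bleve"
	"github.com/blevesearch/bleve/index/scorch"
)

// Assumed scorch config keys for pinning the segment format.
config := map[string]interface{}{
	"forceSegmentType":    "zap",
	"forceSegmentVersion": 12,
}

// When creating a traditional new index:
idx, err := bleve.NewUsing("example.bleve", bleve.NewIndexMapping(),
	scorch.Name, scorch.Name, config)

// Or when creating a new index builder:
builder, err := bleve.NewBuilder("example.bleve", bleve.NewIndexMapping(), config)
```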
Can you try again with these settings? I hope this addresses the space issue. In previous testing for a client, zap v12 with a single segment resulted in considerable space savings; that was actually how this PR came to be in the first place. As for the time, there is another config option for the builder. Let me know how these settings work, and we'll try to work through it.
Thanks for the quick reply @mschoch. I didn't mention it, but I am using zap version 12 on both branches. Here's how I'm calling
I even added some logging statements to
I noticed that disk utilization increases dramatically during the call to
While
Could it have something to do with my
@tylerkovacs I wouldn't expect your mapping to matter. With zap v12 + IndexBuilder, no indexes should have gotten larger, so it seems like a bug. I've opened #1397