Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reuse Lucene's TermsEnum for faster _uid/version lookup during indexing #6298

Closed
wants to merge 3 commits into from

Conversation

Projects
None yet
4 participants
@mikemccand
Copy link
Contributor

mikemccand commented May 23, 2014

The TermsEnums used for lookup have highish cost to init, so if we
reuse them we may be able to stop using bloom filters. I ran some bulk
update performance tests, showing that turning off blooms and reusing
the enums gets close to the same performance as master (using blooms
and not reusing the enums).

Closes #6212

Core: reuse Lucene's TermsEnum for faster _uid/version lookup during …
…indexing

The TermsEnums used for lookup have highish cost to init, so if we
reuse them we may be able to stop using bloom filters.  I ran some bulk
update performance tests, showing that turning off blooms and reusing
the enums gets close to the same performance as master (using blooms
and not reusing the enums).

Closes #6212
List<AtomicReaderContext> leaves = new ArrayList<>(r.leaves());

// nocommit but ES goes backwards today... is that really best? backwards is not necessarily reverse time order (TMP merges out of
// order)

This comment has been minimized.

Copy link
@nik9000

nik9000 May 23, 2014

Contributor

It looks like backwards in this context means smallest to largest?

This comment has been minimized.

Copy link
@mikemccand

mikemccand May 23, 2014

Author Contributor

Roughly, yes, it goes smallest to largest, but with the default TieredMergePolicy, the segment sizes can vary a lot over time (it doesn't care about / preserve segment order in the index).

This comment has been minimized.

Copy link
@jpountz

jpountz May 30, 2014

Contributor

I think the idea was also that it is frequent to have a lots of documents that are pretty static and a few ones that are frequently updated, and these frequently updated documents would more likely be in the last segments?

This comment has been minimized.

Copy link
@mikemccand

mikemccand May 30, 2014

Author Contributor

OK, I'll switch it to go backwards again.

@mikemccand

This comment has been minimized.

Copy link
Contributor Author

mikemccand commented May 28, 2014

bulkindexing

I ran a bulk indexing perf test comparing master vs this patch, updating 10M small log-entry type docs ("index" command, passing _id), with random UUIDs (%10d, worst case for terms dict), using MMapDir, and the results look promising: reusing the TermsEnum gets back much of the performance that bloom filters buy us today.

However, this is a small test (10M docs), the index was fully hot...

@@ -67,7 +67,9 @@
for (String luceneName : PostingsFormat.availablePostingsFormats()) {
buildInPostingFormatsX.put(luceneName, new PreBuiltPostingsFormatProvider.Factory(PostingsFormat.forName(luceneName)));
}
final Elasticsearch090PostingsFormat defaultFormat = new Elasticsearch090PostingsFormat();
// nocommit can we disable bloom by default

This comment has been minimized.

Copy link
@jpountz

jpountz May 30, 2014

Contributor

It makes sense to me given the minor speedup they bring. But maybe as a separate change?

This comment has been minimized.

Copy link
@mikemccand

mikemccand May 30, 2014

Author Contributor

+1 for separate issue, I'll open it.

I want to test a larger index, and a cold index first. Separately I'm going to test switching to the new IDVPostingsFormat.

This comment has been minimized.

Copy link
@mikemccand

mikemccand May 30, 2014

Author Contributor

I opened #6349

@mikemccand

This comment has been minimized.

Copy link
Contributor Author

mikemccand commented May 30, 2014

OK I went back to searching segments in reverse order, and downgraded the other nocommits to TODOs. I think this is ready; I'll commit soon.

@@ -67,7 +67,8 @@
for (String luceneName : PostingsFormat.availablePostingsFormats()) {
buildInPostingFormatsX.put(luceneName, new PreBuiltPostingsFormatProvider.Factory(PostingsFormat.forName(luceneName)));
}
final Elasticsearch090PostingsFormat defaultFormat = new Elasticsearch090PostingsFormat();
final PostingsFormat defaultFormat = new Elasticsearch090PostingsFormat();
//final PostingsFormat defaultFormat = PostingsFormat.forName("Lucene41");

This comment has been minimized.

Copy link
@jpountz

jpountz May 30, 2014

Contributor

Can you just remove this commented line before pushing?

This comment has been minimized.

Copy link
@mikemccand

mikemccand May 30, 2014

Author Contributor

OK will do!

@mikemccand

This comment has been minimized.

Copy link
Contributor Author

mikemccand commented May 31, 2014

Committed with #6212

@mikemccand mikemccand closed this May 31, 2014

@clintongormley clintongormley changed the title Core: reuse Lucene's TermsEnum for faster _uid/version lookup during indexing Indexing: Reuse Lucene's TermsEnum for faster _uid/version lookup during indexing Jul 16, 2014

mikemccand added a commit to mikemccand/elasticsearch that referenced this pull request Jul 22, 2014

Core: disable loading of bloom filters by default
This commit changes the default for index.codec.bloom.load to false,
because bloom filters can use a sizable amount of RAM on indices with
many tiny documents, and now only gives smallish index-time
performance gains for apps that update (not just append) documents,
since we've separately improved performance for ID lookups with
elastic#6298.

Closes elastic#6349

mikemccand added a commit that referenced this pull request Jul 23, 2014

Core: don't load bloom filters by default
This change just changes the default for index.codec.bloom.load to
false: with recent performance improvements to ID lookup, such as
#6298, bloom filters don't give much of a performance gain anymore,
and they can consume non-trivial RAM when there are many tiny
documents.

For now, we still index the bloom filters, so if a given app wants
them back, it can just update the index.codec.bloom.load to true.

Closes #6959

mikemccand added a commit that referenced this pull request Jul 23, 2014

Core: don't load bloom filters by default
This change just changes the default for index.codec.bloom.load to
false: with recent performance improvements to ID lookup, such as
#6298, bloom filters don't give much of a performance gain anymore,
and they can consume non-trivial RAM when there are many tiny
documents.

For now, we still index the bloom filters, so if a given app wants
them back, it can just update the index.codec.bloom.load to true.

Closes #6959

@clintongormley clintongormley changed the title Indexing: Reuse Lucene's TermsEnum for faster _uid/version lookup during indexing Reuse Lucene's TermsEnum for faster _uid/version lookup during indexing Jun 7, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.