
Turn off bloom filters on _uid by default #6349

Closed
mikemccand opened this Issue May 30, 2014 · 4 comments

mikemccand commented May 30, 2014

In #6298, reusing Lucene's `TermsEnum`s recovers much of the performance that bloom filters buy us, at least in an initial test; I think we can turn them off by default and save some RAM.

We can simply change the default `index.codec.bloom.load` to `false`; that way we still index bloom filters, and if a user has trouble, they can turn them back on.
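For reference, a sketch of how an app could flip the setting back on for one index, assuming the dynamic index settings API of that era; the index name and host here are placeholders, not part of the proposal:

```shell
# Hypothetical example: re-enable loading of the _uid bloom filters on a
# single index via the update-settings API (my_index and localhost are
# placeholders).
curl -XPUT 'http://localhost:9200/my_index/_settings' -d '{
  "index.codec.bloom.load": true
}'
```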

I plan to run some more tests to see if the results hold up; I'll post here.

jpountz commented May 30, 2014

Would it be possible to count the average number of disk seeks per version lookup in both cases? I think this would give a pretty good idea of how performance would compare for an index that is cold or much larger than RAM.

mikemccand commented May 30, 2014

I did exactly that when working on the UUID blog post. Random IDs (which ES uses for auto-generated IDs) are the worst case: they cause a seek per segment once the index is large enough. Predictable IDs cause much less seeking.
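A toy illustration of why ID pattern matters (this is not a measurement of Lucene's terms index, just a sketch of the locality argument): consecutive predictable IDs share long prefixes, so lookups keep hitting the same terms-dict blocks, while random UUIDs scatter across the whole term space.

```python
# Compare prefix sharing between sequential IDs and random UUIDs.
# Long shared prefixes mean lookups revisit the same terms-dict blocks.
import uuid

def common_prefix_len(a, b):
    """Length of the common leading prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Ten consecutive predictable IDs vs. ten random UUIDs.
sequential = ["%016x" % i for i in range(1000, 1010)]
random_ids = [uuid.uuid4().hex for _ in range(10)]

seq_prefix = min(common_prefix_len(sequential[0], s) for s in sequential[1:])
rand_prefix = min(common_prefix_len(random_ids[0], r) for r in random_ids[1:])

# Sequential IDs share almost their entire prefix; random ones share
# essentially nothing.
print(seq_prefix, rand_prefix)
```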

I didn't test seek counts with bloom filters enabled; I can do that. The count should be much lower, though I suspect that even in a cold case (overall index bigger than free RAM), the OS would keep the `_uid` terms dict blocks warmish in the update case, as long as ongoing indexing is fast enough, because ES does a lookup per indexed doc, especially if the lookups are biased towards recently indexed docs.
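To make the seek-count point concrete, here is a toy sketch (not Lucene's actual FuzzySet implementation, and the sizes are arbitrary) of how a per-segment bloom filter avoids seeks: a negative bloom check proves the segment cannot contain the ID, so the terms-dict seek for that segment is skipped entirely.

```python
# Toy bloom filter guarding per-segment ID lookups; a negative check
# skips the (simulated) terms-dict seek for that segment.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # big-int used as a bitset

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.sha256(("%d:%s" % (i, key)).encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits & (1 << p) for p in self._positions(key))

# Simulate 5 segments of 100 doc IDs each, one bloom filter per segment.
segments = [{"id-%d-%d" % (s, d) for d in range(100)} for s in range(5)]
blooms = []
for seg in segments:
    b = BloomFilter()
    for _id in seg:
        b.add(_id)
    blooms.append(b)

def lookup(doc_id):
    """Return (found, number_of_seeks) for an ID lookup across segments."""
    seeks = 0
    for seg, bloom in zip(segments, blooms):
        if not bloom.might_contain(doc_id):
            continue        # bloom says definitely absent: no seek needed
        seeks += 1          # would seek into this segment's terms dict
        if doc_id in seg:
            return True, seeks
    return False, seeks

found, seeks = lookup("id-3-42")          # exists in segment 3
missing_found, missing_seeks = lookup("no-such-id")
print(found, seeks, missing_found, missing_seeks)
```

For a missing ID, the seek count is just the number of bloom false positives, which is why blooms help most when lookups usually miss.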

In the append-only case I think none of this matters much, because we are never doing a lookup by ID.

mikemccand commented Jul 21, 2014

I ran another test here, indexing 50M small docs (lines from web access logs). I pass my own IDs (so ES must do the ID lookup), and 25% of the time the ID already exists, so the doc is replaced.

It was a worst-case test: I used fully random UUIDs, the docs are tiny, and I left the terms index at its default settings (i.e., did not let it use the ~10 bits of RAM per UUID that blooms got to use); indexing performance was ~10% slower. This used to be much, much worse before #6298 ...

I suspect apps that do pass their own IDs and update docs are "typically" indexing larger docs than the common "append only tiny docs" case, so that 10% would shrink because more time is spent actually indexing.

Net/net, I think we should disable blooms today: the added RAM usage at search time is dangerous and not worth the minor indexing gains. We could do this in a low-risk way, just by changing the default `index.codec.bloom.load` to `false`. This way the bloom filters are still computed at indexing time, but not loaded at search time. Apps that "need them" can just flip that boolean back to `true`.

Or we can stop computing them at indexing time too; this means apps that want them back would have to re-index.

mikemccand added a commit to mikemccand/elasticsearch that referenced this issue Jul 22, 2014

Core: disable loading of bloom filters by default
This commit changes the default for index.codec.bloom.load to false,
because bloom filters can use a sizable amount of RAM on indices with
many tiny documents, and now only gives smallish index-time
performance gains for apps that update (not just append) documents,
since we've separately improved performance for ID lookups with
elastic#6298.

Closes elastic#6349
mikemccand commented Jul 23, 2014

Closed via #6959
