Turn off bloom filters on _uid by default #6349
In #6298, reusing Lucene's TermsEnums gets back much of the performance that bloom filters buy us, at least in an initial test; I think we can turn them off by default and save some RAM.
We can simply change the default index.codec.bloom.load to false, so we still index bloom filters, and then if a user has trouble, they can turn them back on.
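For reference, the low-risk option amounts to a single dynamic index-settings change (a sketch, assuming the standard update-settings endpoint; `my_index` is a placeholder):

```
PUT /my_index/_settings
{
  "index.codec.bloom.load": false
}
```

Because the bloom filters are still written at index time, flipping the setting back to `true` restores them without a reindex.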
I plan to run some more tests to see if the results hold up; I'll post here.
I did exactly that when working on the UUID blog post. Random IDs (which ES is using for auto-id) are the worst case: they cause a seek per segment, once the index is large enough. Predictable IDs give much less seeking.
I didn't test the seek count with bloom filters enabled; I can do that. It should be much lower. Even in a cold case (overall index bigger than free RAM), I suspect the OS would keep the _uid terms-dict blocks warmish in the update case, as long as ongoing indexing is fast enough, because ES does a lookup per indexed doc. Especially if the lookups are biased towards recently indexed docs.
In the append-only case I think none of this matters much, because we are never doing a lookup by ID.
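The seek-per-segment point can be made concrete with a toy model (a sketch; the segment counts, ID formats, and range-check heuristic are made up for illustration, not Lucene's actual terms-dict logic). Random IDs fall inside every segment's [min, max] term range, so a lookup must descend into every segment; monotonic IDs match at most one:

```python
import uuid

def segments_touched(segment_ranges, term):
    """Count segments whose [min, max] term range could contain `term`;
    each such segment costs a terms-dict seek, the rest are skipped cheaply."""
    return sum(1 for lo, hi in segment_ranges if lo <= term <= hi)

def build_segments(make_id, num_segments=10, docs_per_segment=1000):
    segs = []
    for _ in range(num_segments):
        ids = sorted(make_id() for _ in range(docs_per_segment))
        segs.append((ids[0], ids[-1]))  # keep only the segment's term range
    return segs

# Random UUIDs: every segment's range spans nearly the whole keyspace.
random_segs = build_segments(lambda: uuid.uuid4().hex)

# Predictable (monotonic) IDs: each segment covers a disjoint range.
counter = iter(range(1, 10**9))
seq_segs = build_segments(lambda: "%020d" % next(counter))

print(segments_touched(random_segs, uuid.uuid4().hex))  # almost always all 10
print(segments_touched(seq_segs, "%020d" % 5000))       # exactly 1
```

This is why predictable IDs give much less seeking: once the index has many segments, a random-ID lookup pays the per-segment cost on every one of them.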
I ran another test here, indexing 50M small docs (lines from web access logs). I pass my own IDs (so ES must do the ID lookup), and 25% of the time the ID already exists, so the doc is replaced.
It was a worst-case test: I used fully random UUIDs, the docs are tiny, and I left the terms index at its default settings (i.e., I did not let it use the ~10 bits of RAM per UUID that blooms got to use); indexing performance was ~10% slower. This used to be much, much worse before #6298 ...
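The ID scheme in that test can be sketched like this (hypothetical reconstruction; `id_stream` and its parameters are my names, not the benchmark's code):

```python
import random
import uuid
from itertools import islice

def id_stream(replace_fraction=0.25, seed=42):
    """Yield doc IDs; ~replace_fraction of them repeat an earlier ID,
    forcing ES to find and replace the existing doc."""
    rng = random.Random(seed)
    seen = []
    while True:
        if seen and rng.random() < replace_fraction:
            yield rng.choice(seen)     # existing ID: lookup finds it, doc is replaced
        else:
            new_id = uuid.uuid4().hex  # fully random UUID: worst case for seeks
            seen.append(new_id)
            yield new_id

ids = list(islice(id_stream(), 100_000))
repeats = len(ids) - len(set(ids))     # every repeated yield reuses a seen ID
print(repeats / len(ids))              # close to 0.25
```

The key property is that every indexed doc triggers an ID lookup, and a quarter of those lookups actually hit an existing doc.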
I suspect the apps that do pass their own ID and update docs are "typically" indexing larger docs than the common "append only tiny docs" case, and so that 10% would be lower because more time is spent actually indexing.
Net/net I think we should disable blooms today: I think the added RAM usage at search time is dangerous and not worth the minor indexing gains. We could do this in a low-risk way, just by changing the default index.codec.bloom.load to false. This way the bloom filters are still computed at indexing time, but not loaded at search time. Apps that "need them" can just flip that boolean to true.
Or we can stop computing them at indexing time too; this means apps that want them back would have to re-index.