Save one utf8 conversion in KeywordFieldMapper. #19867
LGTM
If a `keyword` field is both indexed and doc-valued, then we convert the input string to UTF-8 bytes twice: once for indexing/storing, and once for doc values. This commit changes `keyword` fields to compute the UTF-8 representation up front and then feed both the inverted index and doc values with it. Rather than adding version-based backward-compatibility logic, I broke the `keyword` field (values are now indexed/stored as a binary field rather than a string), which is fine since we are still on alpha releases for 5.0.
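The idea in the description can be sketched without Lucene. This is a minimal illustration, not the code from this PR: `BytesSink`, `RecordingSink`, `toUtf8`, and `indexKeyword` are hypothetical names standing in for the inverted index and doc-values consumers, and the counter exists only to make the single conversion visible.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class KeywordEncodeOnce {
    /** Hypothetical stand-in for a consumer of term bytes (inverted index or doc values). */
    public interface BytesSink {
        void add(String field, byte[] utf8);
    }

    /** Test double that simply remembers the byte arrays it was given. */
    public static final class RecordingSink implements BytesSink {
        public final List<byte[]> values = new ArrayList<>();

        @Override
        public void add(String field, byte[] utf8) {
            values.add(utf8);
        }
    }

    /** Counts how many String -> UTF-8 conversions were performed. */
    public static int conversions = 0;

    static byte[] toUtf8(String value) {
        conversions++;
        return value.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Encodes the value once and hands the same bytes to every enabled sink.
     * Assumption: sinks treat the array as read-only, since it is shared.
     */
    public static void indexKeyword(String field, String value,
                                    boolean indexed, boolean hasDocValues,
                                    BytesSink invertedIndex, BytesSink docValues) {
        if (!indexed && !hasDocValues) {
            return; // nothing consumes the bytes, so skip the conversion entirely
        }
        byte[] utf8 = toUtf8(value); // the only conversion
        if (indexed) {
            invertedIndex.add(field, utf8);
        }
        if (hasDocValues) {
            docValues.add(field, utf8);
        }
    }

    public static void main(String[] args) {
        RecordingSink index = new RecordingSink();
        RecordingSink docValues = new RecordingSink();
        indexKeyword("city", "Zürich", true, true, index, docValues);
        System.out.println("conversions = " + conversions);
    }
}
```

The trade-off is that both consumers now see raw bytes rather than a string, which is why the on-disk representation of the field changes.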
d020966 to c44679d
Unfortunately, this change caused a large drop in indexing performance for the geonames documents: https://benchmarks-old.elastic.co/index.html#indexing

What's very strange is, I can easily reproduce the big indexing throughput loss on my 72 core box, before and after this change, but on 8 core Haswell and Skylake boxes, there is no measurable change in indexing performance. I can't explain that.

The change looks quite innocuous, but when I stare at Lucene's sources for how it handles the

Maybe we are confusing hotspot by having two hot implementations for

Maybe we should back this out until we understand what's happening? I think the motivation for the change is very good, but I can't explain why it's giving such a big drop on a high core count box...
Thanks for catching the slowdown. +1 to back it out. I'll do it tomorrow unless someone is able to do it before.
I'm sorry to be the bearer of bad news here: I think it's ridiculous we cannot make a clear improvement in ES for this reason. I hope we can get to the bottom of it. It ought to be safe to index binary terms directly in Lucene. Anyway, @rmuir has been helping me try to isolate the cause for the slowdown here, and at least we have made some progress: if I run the benchmark with
So, happily, the change ("after") is a bit faster, indexing all geonames docs, if we force only C1 (essentially
This reverts commit c44679d.

Conflicts:
core/src/main/java/org/elasticsearch/index/mapper/BaseGeoPointFieldMapper.java
core/src/main/java/org/elasticsearch/index/mapper/GeoPointFieldMapperLegacy.java
core/src/test/java/org/elasticsearch/index/mapper/GeoPointFieldMapperTests.java
Honestly, I don't get it. Maybe it's a compiler bug, or perhaps related to lock elision or something else on your machine (though we tried disabling EA). I think even if this part of IndexWriter is compiled inefficiently, it shouldn't create such a massive performance regression for the app as a whole, so something truly crazy is going on.
I also tested JDK-9 EA: still a big slowdown. And I upgraded my kernel from 4.6.x to 4.7.0: still a big slowdown. It's so weird that single socket boxes, at least the two 8 core boxes I've tested on, don't show any slowdown.
OK @rmuir and I finally found the smoking gun (using JFR, but running with
But this is not committable to Lucene ... so we need to figure out how to properly fix it...
I just pushed this change back in.
Thank you @mikemccand and @rmuir for looking into it!