Save one utf8 conversion in KeywordFieldMapper. #19867

jpountz · 2016-08-08T14:39:35Z

If a keyword field is both indexed and doc-valued, then we will convert the
input string to utf8 bytes twice: once for indexing/storing, and once for doc
values. This commit changes keyword fields to compute the utf8 representation
up-front and then feed both the inverted index and doc values with it.
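The idea in the paragraph above can be sketched with plain JDK classes. This is a minimal illustration, not the actual `KeywordFieldMapper` code: the class `KeywordEncodingSketch`, its two lists (standing in for the inverted index and doc values), and the `conversions` counter are all hypothetical names introduced here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a field mapper; not Elasticsearch code.
final class KeywordEncodingSketch {
    final List<byte[]> invertedIndexTerms = new ArrayList<>(); // stand-in for the inverted index
    final List<byte[]> docValues = new ArrayList<>();          // stand-in for doc values
    int conversions = 0;                                       // counts String -> UTF-8 conversions

    private byte[] toUtf8(String value) {
        conversions++;
        return value.getBytes(StandardCharsets.UTF_8);
    }

    // Before: each consumer triggers its own conversion of the same string.
    void addConvertingTwice(String value) {
        invertedIndexTerms.add(toUtf8(value));
        docValues.add(toUtf8(value));
    }

    // After: convert once up-front and hand the same bytes to both consumers.
    void addConvertingOnce(String value) {
        byte[] utf8 = toUtf8(value);
        invertedIndexTerms.add(utf8);
        docValues.add(utf8);
    }

    public static void main(String[] args) {
        KeywordEncodingSketch before = new KeywordEncodingSketch();
        before.addConvertingTwice("geonames");
        KeywordEncodingSketch after = new KeywordEncodingSketch();
        after.addConvertingOnce("geonames");
        System.out.println("conversions before: " + before.conversions
                + ", after: " + after.conversions);
    }
}
```

Sharing one byte array is only safe if neither consumer mutates it, which this sketch assumes.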

Rather than adding version-based backward-compatibility logic, I broke
compatibility for keyword fields (they are now indexed/stored as binary fields
rather than strings), which is fine since we are still on alpha releases for 5.0.

rjernst · 2016-08-08T15:46:02Z

LGTM

mikemccand · 2016-08-17T14:08:45Z

Unfortunately, this change caused a large drop in indexing performance for the geonames documents: https://benchmarks-old.elastic.co/index.html#indexing

What's very strange is that I can easily reproduce the big indexing throughput loss on my 72 core box by comparing runs before and after this change, but on 8 core Haswell and Skylake boxes there is no measurable change in indexing performance. I can't explain that.

The change looks quite innocuous, but when I stare at Lucene's sources for how it handles the String vs BytesRef cases, there are some interesting differences: e.g. the BinaryTokenStream does not have an OffsetAttribute, and BytesTermAttribute.clear does not reuse the byte[] buffer while CharTermAttributeImpl.clear does reuse its char[] termBuffer.
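To make the buffer-reuse difference concrete, here is a simplified sketch of the two `clear()` styles. It is not Lucene's code: `ReusableTermBuffer` and `NonReusingTermBuffer` are hypothetical classes that only mimic the behavior described above, and the `allocations` counter is added purely for illustration.

```java
// Mimics a clear() that keeps its backing array (the char[] termBuffer style).
final class ReusableTermBuffer {
    private char[] buffer = new char[16];
    private int length;
    int allocations = 1; // the initial array counts as one allocation

    void set(String term) {
        if (term.length() > buffer.length) {
            // Grow only when the current array is too small.
            buffer = new char[Math.max(term.length(), buffer.length * 2)];
            allocations++;
        }
        term.getChars(0, term.length(), buffer, 0);
        length = term.length();
    }

    void clear() {
        length = 0; // the array itself is kept for the next term
    }

    String get() {
        return new String(buffer, 0, length);
    }
}

// Mimics a clear() that simply drops its reference to the bytes.
final class NonReusingTermBuffer {
    private byte[] bytes;

    void set(byte[] term) {
        bytes = term; // holds the caller's array, no copy
    }

    void clear() {
        bytes = null; // nothing is retained across terms
    }

    byte[] get() {
        return bytes;
    }
}
```

In the second style, whoever produces the bytes decides whether an array is reused between terms; the holder itself never does.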

Maybe we are confusing hotspot by having two hot implementations for termAtt.getBytesRef called from TermsHashPerField.add?

Maybe we should back this out until we understand what's happening? I think the motivation for the change is very good, but I can't explain why it's giving such a big drop on a high core count box...

jpountz · 2016-08-17T15:13:16Z

Thanks for catching the slow down. +1 to back it out. I'll do it tomorrow unless someone is able to do it before.

mikemccand · 2016-08-17T19:43:25Z

I'm sorry to be the bearer of bad news here: I think it's ridiculous we cannot make a clear improvement in ES for this reason. I hope we can get to the bottom of it. It ought to be safe to index binary terms directly in Lucene.

Anyway, @rmuir has been helping me try to isolate the cause for the slowdown here, and at least we have made some progress: if I run the benchmark with -XX:TieredStopAtLevel=1 (disables Hotspot's C2 compilation "efforts") then:

after: Indexer: 8647880 docs: 320.33 sec [26997.1 dps, 8.7 MB/sec]
before: Indexer: 8647880 docs: 343.35 sec [25186.9 dps, 8.1 MB/sec]

So, happily, the change ("after") is a bit faster, indexing all geonames docs, if we force only C1 (essentially -client as @rmuir summarizes it)...
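For reference, the size of that gain can be recomputed from the two log lines. The sketch below uses only the document counts and wall-clock times quoted above; the class and method names are made up for this example. Because the logged seconds are rounded, the recomputed docs/sec values differ slightly from the logged ones, and the relative gain comes out at roughly 7%.

```java
final class ThroughputComparison {
    // Documents indexed per second.
    static double docsPerSecond(long docs, double seconds) {
        return docs / seconds;
    }

    // Relative throughput gain of "after" over "before", in percent.
    static double relativeGainPercent(double after, double before) {
        return (after / before - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        long docs = 8_647_880L;                        // from the log lines
        double after = docsPerSecond(docs, 320.33);    // run with the change
        double before = docsPerSecond(docs, 343.35);   // run without the change
        System.out.printf("after: %.1f dps, before: %.1f dps, gain: %.1f%%%n",
                after, before, relativeGainPercent(after, before));
    }
}
```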

This reverts commit c44679d.

Conflicts:
core/src/main/java/org/elasticsearch/index/mapper/BaseGeoPointFieldMapper.java
core/src/main/java/org/elasticsearch/index/mapper/GeoPointFieldMapperLegacy.java
core/src/test/java/org/elasticsearch/index/mapper/GeoPointFieldMapperTests.java

rmuir · 2016-08-18T12:52:47Z

Honestly, I don't get it. Maybe it's a compiler bug, or perhaps it's related to lock elision or something else on your machine (though we tried disabling EA).

Even if this part of IndexWriter is compiled inefficiently, it shouldn't create such a massive performance regression for the app as a whole, so something truly crazy is going on.

mikemccand · 2016-08-18T17:56:56Z

I also tested JDK-9 EA: still a big slowdown.

And I upgraded my kernel from 4.6.x to 4.7.0: still a big slowdown.

It's so weird that single socket boxes, at least the two 8 core boxes I've tested on, don't show any slowdown.

mikemccand · 2016-08-18T20:11:46Z

OK @rmuir and I finally found the smoking gun (using JFR, but running with -XX:-Inline so you see all methods). With this tiny change to Lucene 6.1.0, this change has the same performance as before on my 2 socket (72 core) box:

mike@beast2:/l/61/lucene/core$ git diff
diff --git a/lucene/core/src/java/org/apache/lucene/analysis/TokenStream.java b/lucene/core/src/java/org/apache/lucene/analysis/TokenStream.java
index 6a78e1c..1f89f06 100644
--- a/lucene/core/src/java/org/apache/lucene/analysis/TokenStream.java
+++ b/lucene/core/src/java/org/apache/lucene/analysis/TokenStream.java
@@ -177,10 +177,10 @@ public abstract class TokenStream extends AttributeSource implements Closeable {
    */
   public void end() throws IOException {
     clearAttributes(); // LUCENE-3849: don't consume dirty atts
-    PositionIncrementAttribute posIncAtt = getAttribute(PositionIncrementAttribute.class);
-    if (posIncAtt != null) {
-      posIncAtt.setPositionIncrement(0);
-    }
+    //PositionIncrementAttribute posIncAtt = getAttribute(PositionIncrementAttribute.class);
+    //if (posIncAtt != null) {
+    //posIncAtt.setPositionIncrement(0);
+    //}
   }

   /**

But this is not committable to Lucene ... so we need to figure out how to properly fix it...
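As an illustration of the general shape of the problem, the sketch below contrasts a per-call attribute lookup inside `end()` with a lookup done once and cached in a field. It is a plain-JDK toy, not Lucene code, and not necessarily how LUCENE-7419 addresses the issue: `AttributeLookupSketch`, its `PositionIncrement` interface, and the `lookups` counter are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class AttributeLookupSketch {
    // Hypothetical stand-in for a position-increment attribute.
    interface PositionIncrement {
        void setPositionIncrement(int value);
        int getPositionIncrement();
    }

    static final class PositionIncrementImpl implements PositionIncrement {
        private int value = 1;
        public void setPositionIncrement(int value) { this.value = value; }
        public int getPositionIncrement() { return value; }
    }

    private final Map<Class<?>, Object> attributes = new HashMap<>();
    private final PositionIncrement cachedPosInc; // resolved once, may be null
    int lookups; // counts calls to getAttribute

    AttributeLookupSketch(boolean withPositionIncrement) {
        if (withPositionIncrement) {
            attributes.put(PositionIncrement.class, new PositionIncrementImpl());
        }
        cachedPosInc = getAttribute(PositionIncrement.class);
    }

    <T> T getAttribute(Class<T> type) {
        lookups++;
        return type.cast(attributes.get(type));
    }

    // Mirrors the shape of the original end(): one lookup on every call.
    void endWithLookup() {
        PositionIncrement posInc = getAttribute(PositionIncrement.class);
        if (posInc != null) {
            posInc.setPositionIncrement(0);
        }
    }

    // Alternative shape: reuse the reference resolved in the constructor.
    void endWithCachedField() {
        if (cachedPosInc != null) {
            cachedPosInc.setPositionIncrement(0);
        }
    }
}
```

Caching in the constructor assumes the attribute set does not change afterwards, which a real token stream may not guarantee.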

rmuir · 2016-08-18T21:32:14Z

https://issues.apache.org/jira/browse/LUCENE-7419

This reverts commit d805266.

jpountz · 2016-08-25T11:50:41Z

I just pushed this change back in.

jpountz · 2016-08-25T11:51:04Z

Thank you @mikemccand and @rmuir for looking into it!

jpountz added the >enhancement, :Search Foundations/Mapping and v5.0.0-alpha5 labels Aug 8, 2016

jpountz force-pushed the enhancement/save_utf8_conversion_keyword branch from d020966 to c44679d Compare August 9, 2016 08:06

jpountz merged commit c44679d into elastic:master Aug 9, 2016

jpountz deleted the enhancement/save_utf8_conversion_keyword branch August 9, 2016 08:07

rjernst mentioned this pull request Aug 9, 2016

Build Failure: org.elasticsearch.index.mapper.geo.GeoPointFieldMapperTests.testGeoHashSearchWithPrefix #19895

Closed

clintongormley mentioned this pull request Aug 11, 2016

Indexing keywords should not utf8-encode twice #19782

Closed

jpountz added a commit that referenced this pull request Aug 25, 2016

Revert "Revert "Save one utf8 conversion in KeywordFieldMapper. #19867""

b521638

This reverts commit d805266.

clintongormley added v5.0.0-beta1 and removed v5.0.0-alpha5 labels Aug 25, 2016
