Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save one utf8 conversion in KeywordFieldMapper. #19867

Merged

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Aug 8, 2016

If a keyword field is both indexed and doc-valued, then we will convert the
input string to utf8 bytes twice: once for indexing/storing, and once for doc
values. This commit changes keyword fields to compute the utf8 representation
up-front and then feed both the inverted index and doc values with it.

Rather than adding version-based bw compat logic, I broke the keyword field
(they are now indexed/stored as a binary field rather than string), which is
fine since we are still on alpha releases for 5.0.

@jpountz jpountz added >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types v5.0.0-alpha5 labels Aug 8, 2016
@rjernst
Copy link
Member

rjernst commented Aug 8, 2016

LGTM

If a `keyword` field is both indexed and doc-valued, then we will convert the
input string to utf8 bytes twice: once for indexing/storing, and once for doc
values. This commit changes `keyword` fields to compute the utf8 representation
up-front and then feed both the inverted index and doc values with it.

Rather than adding version-based bw compat logic, I broke the `keyword` field
(they are now indexed/stored as a binary field rather than string), which is
fine since we are still on alpha releases for 5.0.
@jpountz jpountz force-pushed the enhancement/save_utf8_conversion_keyword branch from d020966 to c44679d Compare August 9, 2016 08:06
@jpountz jpountz merged commit c44679d into elastic:master Aug 9, 2016
@jpountz jpountz deleted the enhancement/save_utf8_conversion_keyword branch August 9, 2016 08:07
@mikemccand
Copy link
Contributor

Unfortunately, this change caused a large drop in indexing performance for the geonames documents: https://benchmarks-old.elastic.co/index.html#indexing

What's very strange is, I can easily reproduce the big indexing throughput loss on my 72 core box, before and after this change, but on 8 core Haswell and Skylake boxes, there is no measurable change in indexing performance. I can't explain that.

The change looks quite innocuous, but when I stare at Lucene's sources for how it handles the String vs BytesRef cases, there are some interesting differences, e.g. the BinaryTokenStream does not have an OffsetAttribute, the BytesTermAttribute.clear does not reuse the byte[] buffer while CharTermAttributeImpl.clear does reuse its char[] termBuffer.

Maybe we are confusing hotspot by having two hot implementations for termAtt.getBytesRef called from TermsHashPerField.add?

Maybe we should back this out until we understand what's happening? I think the motivation for the change is very good, but I can't explain why it's giving such a big drop on a high core count box...

@jpountz
Copy link
Contributor Author

jpountz commented Aug 17, 2016

Thanks for catching the slow down. +1 to back it out. I'll do it tomorrow unless someone is able to do it before.

@mikemccand
Copy link
Contributor

I'm sorry to be the bearer of bad news here: I think it's ridiculous we cannot make a clear improvement in ES for this reason. I hope we can get to the bottom of it. It ought to be safe to index binary terms directly in Lucene.

Anyway, @rmuir has been helping me try to isolate the cause for the slowdown here, and at least we have made some progress: if I run the benchmark with -XX:TieredStopAtLevel=1 (disables Hotspot's C2 compilation "efforts") then:

after: Indexer: 8647880 docs: 320.33 sec [26997.1 dps, 8.7 MB/sec]
before: Indexer: 8647880 docs: 343.35 sec [25186.9 dps, 8.1 MB/sec]

So, happily, the change ("after") is a bit faster, indexing all geonames docs, if we force only C1 (essentially -client as @rmuir summarizes it)...

jpountz added a commit that referenced this pull request Aug 18, 2016
This reverts commit c44679d.

Conflicts:
	core/src/main/java/org/elasticsearch/index/mapper/BaseGeoPointFieldMapper.java
	core/src/main/java/org/elasticsearch/index/mapper/GeoPointFieldMapperLegacy.java
	core/src/test/java/org/elasticsearch/index/mapper/GeoPointFieldMapperTests.java
@rmuir
Copy link
Contributor

rmuir commented Aug 18, 2016

Honestly, i don't get it, maybe its a compiler bug, or perhaps related to lock elision or something else on your machine (though, we tried disabling EA).

Because i think even if this part of indexwriter is compiled inefficiently, it shouldn't create such a massive performance regression for the app as a whole, so something truly crazy is going on.

@mikemccand
Copy link
Contributor

I also tested JDK-9 EA: still a big slowdown.

And I upgraded my kernel from 4.6.x to 4.7.0: still a big slowdown.

It's so weird that single socket boxes, at least the two 8 core boxes I've tested on, don't show any slowdown.

@mikemccand
Copy link
Contributor

OK @rmuir and I finally found the smoking gun (using JFR, but running with -XX:-Inline so you see all methods). With this tiny change to Lucene 6.1.0, this change has the same performance as before on my 2 socket (72 core) box:

mike@beast2:/l/61/lucene/core$ git diff
diff --git a/lucene/core/src/java/org/apache/lucene/analysis/TokenStream.java b/lucene/core/src/java/org/apache/lucene/analysis/TokenStream.java
index 6a78e1c..1f89f06 100644
--- a/lucene/core/src/java/org/apache/lucene/analysis/TokenStream.java
+++ b/lucene/core/src/java/org/apache/lucene/analysis/TokenStream.java
@@ -177,10 +177,10 @@ public abstract class TokenStream extends AttributeSource implements Closeable {
    */
   public void end() throws IOException {
     clearAttributes(); // LUCENE-3849: don't consume dirty atts
-    PositionIncrementAttribute posIncAtt = getAttribute(PositionIncrementAttribute.class);
-    if (posIncAtt != null) {
-      posIncAtt.setPositionIncrement(0);
-    }
+    //PositionIncrementAttribute posIncAtt = getAttribute(PositionIncrementAttribute.class);
+    //if (posIncAtt != null) {
+    //posIncAtt.setPositionIncrement(0);
+    //}
   }

   /**

But this is not committable to Lucene ... so we need to figure out how to properly fix it...

@rmuir
Copy link
Contributor

rmuir commented Aug 18, 2016

jpountz added a commit that referenced this pull request Aug 25, 2016
@jpountz
Copy link
Contributor Author

jpountz commented Aug 25, 2016

I just pushed this change back in.

@jpountz
Copy link
Contributor Author

jpountz commented Aug 25, 2016

Thank you @mikemccand and @rmuir for looking into it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types v5.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants