Switch fielddata to use Lucene doc values APIs. #6908

jpountz · 2014-07-17T15:36:23Z

This commit removes BytesValues/LongValues/DoubleValues/... and tries to use
Lucene's APIs such as NumericDocValues or RandomAccessOrds instead whenever
possible.

The next step would be to take advantage of the fact that APIs are the same in
Lucene and Elasticsearch in order to remove our custom comparators and use
Lucene's.

There are a few side-effects to this change:

- GeoDistanceComparator has been removed, DoubleValuesComparator is used instead
  on top of dynamically computed values (was easier than migrating
  GeoDistanceComparator).
- SortedNumericDocValues doesn't guarantee uniqueness so long/double terms
  aggregators have been updated to make sure a document cannot fall twice in
  the same bucket.
- Sorting by maximum value of a field or running a `max` aggregation is
  potentially significantly faster thanks to the random-access API.
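
The last two side-effects can be illustrated with a minimal, Lucene-free sketch. It assumes a document's values are available as a sorted `long[]`, as a stand-in for a per-document sorted numeric API with random access. The class and method names are hypothetical and are not taken from this change.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not code from this pull request. */
public class SortedValuesSketch {

    /**
     * Returns the distinct values of one document. Because the input is
     * sorted, duplicates are adjacent, so comparing each value with the
     * previous one is enough to keep a document from falling twice in the
     * same bucket.
     */
    public static List<Long> distinctBuckets(long[] sortedValues) {
        List<Long> buckets = new ArrayList<>();
        for (int i = 0; i < sortedValues.length; i++) {
            if (i == 0 || sortedValues[i] != sortedValues[i - 1]) {
                buckets.add(sortedValues[i]);
            }
        }
        return buckets;
    }

    /**
     * With random access to sorted values, the maximum is the last value,
     * so the other values never need to be visited.
     */
    public static long max(long[] sortedValues) {
        if (sortedValues.length == 0) {
            throw new IllegalArgumentException("document has no values");
        }
        return sortedValues[sortedValues.length - 1];
    }

    public static void main(String[] args) {
        long[] docValues = {3, 3, 7, 9, 9};
        System.out.println(distinctBuckets(docValues)); // prints [3, 7, 9]
        System.out.println(max(docValues));             // prints 9
    }
}
```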

Our aggs and p/c aggregations benchmarks don't report differences with this
change on uninverted field data. However the fact that doc values don't need
to be wrapped anymore seems to help a lot. For example
TermsAggregationSearchBenchmark reports ~30% faster terms aggregations on doc
values on string fields with this change, which are now only ~18% slower than
uninverted field data although stored on disk.

rmuir · 2014-07-18T15:21:57Z

...in/java/org/elasticsearch/search/aggregations/metrics/cardinality/CardinalityAggregator.java

+            @Override
+            public long valueAt(int index) {
+                return MurmurHash3.hash(java.lang.Double.doubleToLongBits(values.valueAt(index)));
+            }


I don't know how this is used, but I think it can be a little misleading that this class extends SortedNumericDocValues, yet the values are not actually sorted?
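
To make the concern concrete, here is a small self-contained sketch showing that hashing values that are sorted does not, in general, yield sorted hashes. The `mix` function is a MurmurHash3-style 64-bit finalizer used as a stand-in; whether it matches the `MurmurHash3.hash` call in the diff above is an assumption, and all names here are hypothetical.

```java
/** Hypothetical illustration only; not code from this pull request. */
public class HashOrderSketch {

    /** MurmurHash3-style 64-bit finalizer, used here as a stand-in hash. */
    public static long mix(long h) {
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    /** Returns true if the array is in non-decreasing order. */
    public static boolean isSorted(long[] values) {
        for (int i = 1; i < values.length; i++) {
            if (values[i - 1] > values[i]) {
                return false;
            }
        }
        return true;
    }

    /** Hashes each value at its original index, like a valueAt(index) wrapper would. */
    public static long[] hashAll(double[] sortedValues) {
        long[] hashes = new long[sortedValues.length];
        for (int i = 0; i < sortedValues.length; i++) {
            hashes[i] = mix(Double.doubleToLongBits(sortedValues[i]));
        }
        return hashes;
    }

    public static void main(String[] args) {
        double[] values = new double[1000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i; // sorted input
        }
        System.out.println("input sorted:  " + true);
        System.out.println("hashes sorted: " + isSorted(hashAll(values)));
    }
}
```

A consumer that relies on the sorted contract (for example, taking the last value as the maximum) would get wrong answers from such a wrapper, which is why a dedicated type without that contract is the safer choice.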

rmuir · 2014-07-18T15:23:32Z

overall looks good, i took a pass thru and added comments (it seems like you already responded/addressed most of them). Thanks for doing this!

jpountz · 2014-07-21T10:28:00Z

@rmuir I pushed new commits to address your comments.

rmuir · 2014-07-21T13:22:58Z

src/main/java/org/elasticsearch/search/aggregations/bucket/terms/StringTermsAggregator.java

        }
    }

    // TODO: use terms enum
    /** Returns an iterator over the field data terms. */
-    private static Iterator<BytesRef> terms(final BytesValues.WithOrdinals bytesValues, boolean reverse) {
+    private static Iterator<BytesRef> terms(final RandomAccessOrds bytesValues, boolean reverse) {


Can we open a follow-up issue for this? It should use the TermsEnum (at least when moving forwards), and I also don't understand why it makes deep copies on next().

Good catch! I think I can remove this code actually!

rmuir · 2014-07-21T13:29:07Z

+1 Looks good, i added a comment to StringTermsAggregator.java to ask for a followup as it can have performance implications.

s1monw · 2014-07-21T13:51:37Z

src/main/java/org/elasticsearch/common/geo/GeoDistance.java

+    public static SortedNumericDoubleValues distanceValues(final FixedSourceDistance distance, final MultiGeoPointValues geoPointValues) {
+        final GeoPointValues singleValues = FieldData.unwrapSingleton(geoPointValues);
+        if (singleValues != null) {
+            final Bits docsWithField = FieldData.unwrapSingletonBits(geoPointValues);


If docsWithField == null, should we maybe set the variable to MatchAllBits?

For instance, Lucene comparators do it on purpose since the JVM needs to do the null check anyway. So I think it's fine to do the same here?
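
The two options discussed here can be sketched side by side. This is a minimal illustration under stated assumptions: the `Bits` interface below is a stand-in declared locally, and the class and method names are hypothetical rather than taken from the diff.

```java
/** Hypothetical illustration only; not code from this pull request. */
public class DocsWithFieldSketch {

    /** Minimal local stand-in for a per-document bit set. */
    public interface Bits {
        boolean get(int index);
    }

    /** Option A: keep null and treat it as "every document has a value". */
    public static boolean hasValue(Bits docsWithField, int doc) {
        // The explicit null check short-circuits, so get() is skipped entirely.
        return docsWithField == null || docsWithField.get(doc);
    }

    /** Option B: replace null with an explicit match-all instance up front. */
    public static Bits orMatchAll(Bits docsWithField) {
        return docsWithField != null ? docsWithField : index -> true;
    }

    public static void main(String[] args) {
        Bits onlyEvenDocs = index -> index % 2 == 0;

        System.out.println(hasValue(null, 5));                  // prints true
        System.out.println(hasValue(onlyEvenDocs, 5));          // prints false
        System.out.println(orMatchAll(null).get(5));            // prints true
        System.out.println(orMatchAll(onlyEvenDocs).get(4));    // prints true
    }
}
```

Both variants give the same answers; option A avoids an interface call per document when every document has a value, at the cost of a null check at each use site.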

s1monw · 2014-07-21T15:31:01Z

I left a bunch of comments, mostly cosmetic. Looks good though.

jpountz · 2014-07-21T16:27:00Z

@rmuir @s1monw Pushed a new commit addressing your comments.

s1monw · 2014-07-22T12:23:07Z

LGTM

…lues for a string field. This regression was introduced in elastic#6908: the conversion from RandomAccessOrds to SortedBinaryDocValues goes through Strings while both impls actually work on BytesRef, so the SortedBinaryDocValues instance could directly return the BytesRefs returned by the RandomAccessOrds. Close elastic#9306
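
The cost described in that commit message can be sketched without Lucene. This assumes terms are held as UTF-8 byte arrays (a stand-in for `BytesRef`); the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration only; not code from the referenced fix. */
public class Utf8RoundTripSketch {

    /** Slow path: decode UTF-8 bytes into a String, then encode them again. */
    public static byte[] viaString(byte[] utf8) {
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    /** Fast path: both sides work on bytes, so hand them over unchanged. */
    public static byte[] direct(byte[] utf8) {
        return utf8;
    }

    public static void main(String[] args) {
        byte[] term = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // For valid UTF-8 both paths yield the same bytes; the slow path
        // just pays for a decode, an allocation and an encode per value.
        System.out.println(Arrays.equals(viaString(term), direct(term))); // prints true
    }
}
```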

jpountz added v1.4.0 labels Jul 17, 2014

jpountz added 4 commits July 18, 2014 11:28

Add warning about the slowness of the FieldData.toString methods.

47ad575

Use the optimized term lookup in the string comparator.

03a3e4a

Minor changes.

3528cc2

rmuir reviewed Jul 18, 2014

jpountz added 2 commits July 21, 2014 12:14

Have an internal MurmurHash3Values class instead of reusing SortedNumericDocValues since order is not guaranteed.

2094bca

Remove TODO.

14f82a1

Add assert.

4809047

rmuir reviewed Jul 21, 2014

s1monw reviewed Jul 21, 2014

jpountz added 2 commits July 21, 2014 18:15

Address comments.

7084de0

Remove BytesRefSorter.

7c2c14c

jpountz closed this in 3c142e5 Jul 22, 2014

jpountz removed the review label Jul 22, 2014

jpountz deleted the enhancement/lucene_doc_values_api branch July 22, 2014 13:17

jpountz changed the title ~~Fielddata: Switch to Lucene DV APIs.~~ Fielddata: Switch to Lucene doc values APIs. Jul 22, 2014

jpountz added a commit that referenced this pull request Jul 23, 2014

Fielddata: Fix the ordinals impl for sparse fields.

7651115

Caused by #6908

jpountz added a commit that referenced this pull request Jul 23, 2014

Fielddata: Fix the ordinals impl for sparse fields.

56d12b4

Caused by #6908

clintongormley changed the title ~~Fielddata: Switch to Lucene doc values APIs.~~ Internal: Switch fielddata to use Lucene doc values APIs. Sep 8, 2014

jpountz mentioned this pull request Feb 4, 2015

performance degredation between 0.90.5 and 1.4.1 #9306

Closed

jpountz mentioned this pull request Feb 4, 2015

Avoid unnecessary utf8 conversion when creating ScriptDocValues for a string field. #9557

Merged

clintongormley added the :Core/Infra/Core Core issues without another label label Jun 7, 2015

clintongormley changed the title ~~Internal: Switch fielddata to use Lucene doc values APIs.~~ Switch fielddata to use Lucene doc values APIs. Jun 7, 2015

clintongormley added :Fielddata and removed :Cache :Core/Infra/Core Core issues without another label :Search/Mapping Index mappings, including merging and defining field types :Search/Search Search-related issues that do not fall into other categories labels Jun 7, 2015

clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Fielddata labels Feb 14, 2018
