-
Notifications
You must be signed in to change notification settings - Fork 24.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch fielddata to use Lucene doc values APIs. #6908
Conversation
This commits removes BytesValues/LongValues/DoubleValues/... and tries to use Lucene's APIs such as NumericDocValues or RandomAccessOrds instead whenever possible. The next step would be to take advantage of the fact that APIs are the same in Lucene and Elasticsearch in order to remove our custom comparators and use Lucene's. There are a few side-effects to this change: - GeoDistanceComparator has been removed, DoubleValuesComparator is used instead on top of dynamically computed values (was easier than migrating GeoDistanceComparator). - SortedNumericDocValues doesn't guarantee uniqueness so long/double terms aggregators have been updated to make sure a document cannot fall twice in the same bucket. - Sorting by maximum value of a field or running a `max` aggregation is potentially significantly faster thanks to the random-access API. Our aggs and p/c aggregations benchmarks don't report differences with this change on uninverted field data. However the fact that doc values don't need to be wrapped anymore seems to help a lot. For example TermsAggregationSearchBenchmark reports ~30% faster terms aggregations on doc values on string fields with this change, which are now only ~18% slower than uninverted field data although stored on disk.
@Override | ||
public long valueAt(int index) { | ||
return MurmurHash3.hash(java.lang.Double.doubleToLongBits(values.valueAt(index))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how this is used, but i think it can be a little misleading that this class extends SortedNumericDocValues, yet the values are not actually sorted?
overall looks good, i took a pass thru and added comments (it seems like you already responded/addressed most of them). Thanks for doing this! |
…ericDocValues since order is not guaranteed.
@rmuir I pushed new commits to address your comments. |
} | ||
} | ||
|
||
// TODO: use terms enum | ||
/** Returns an iterator over the field data terms. */ | ||
private static Iterator<BytesRef> terms(final BytesValues.WithOrdinals bytesValues, boolean reverse) { | ||
private static Iterator<BytesRef> terms(final RandomAccessOrds bytesValues, boolean reverse) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we open a followup issue for this? it should use the termsenum (at least when moving forwards), and also i dont understand why it makes deep copies on next()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I think I can remove this code actually!
+1 Looks good, i added a comment to StringTermsAggregator.java to ask for a followup as it can have performance implications. |
public static SortedNumericDoubleValues distanceValues(final FixedSourceDistance distance, final MultiGeoPointValues geoPointValues) { | ||
final GeoPointValues singleValues = FieldData.unwrapSingleton(geoPointValues); | ||
if (singleValues != null) { | ||
final Bits docsWithField = FieldData.unwrapSingletonBits(geoPointValues); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if docsWithField == null
should we maybe wrap set the variable to MatchAllBits
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For instance, Lucene comparators do it on purpose since the JVM needs to do the null check anyway. So I think it's fine to do the same here?
I left a bunch of comments mostly cosmetic looks good though |
LGTM |
This commits removes BytesValues/LongValues/DoubleValues/... and tries to use Lucene's APIs such as NumericDocValues or RandomAccessOrds instead whenever possible. The next step would be to take advantage of the fact that APIs are the same in Lucene and Elasticsearch in order to remove our custom comparators and use Lucene's. There are a few side-effects to this change: - GeoDistanceComparator has been removed, DoubleValuesComparator is used instead on top of dynamically computed values (was easier than migrating GeoDistanceComparator). - SortedNumericDocValues doesn't guarantee uniqueness so long/double terms aggregators have been updated to make sure a document cannot fall twice in the same bucket. - Sorting by maximum value of a field or running a `max` aggregation is potentially significantly faster thanks to the random-access API. Our aggs and p/c aggregations benchmarks don't report differences with this change on uninverted field data. However the fact that doc values don't need to be wrapped anymore seems to help a lot. For example TermsAggregationSearchBenchmark reports ~30% faster terms aggregations on doc values on string fields with this change, which are now only ~18% slower than uninverted field data although stored on disk. Close #6908
…lues for a string field. This regression was introduced in elastic#6908: the conversion from RandomAccessOrds to SortedBinaryDocValues goes through Strings while both impls actually work on BytesRef, so the SortedBinaryDocValues instance could directly return the BytesRefs returned by the RandomAccessOrds. Close elastic#9306
…lues for a string field. This regression was introduced in #6908: the conversion from RandomAccessOrds to SortedBinaryDocValues goes through Strings while both impls actually work on BytesRef, so the SortedBinaryDocValues instance could directly return the BytesRefs returned by the RandomAccessOrds. Close #9306
…lues for a string field. This regression was introduced in #6908: the conversion from RandomAccessOrds to SortedBinaryDocValues goes through Strings while both impls actually work on BytesRef, so the SortedBinaryDocValues instance could directly return the BytesRefs returned by the RandomAccessOrds. Close #9306
…lues for a string field. This regression was introduced in elastic#6908: the conversion from RandomAccessOrds to SortedBinaryDocValues goes through Strings while both impls actually work on BytesRef, so the SortedBinaryDocValues instance could directly return the BytesRefs returned by the RandomAccessOrds. Close elastic#9306
This commits removes BytesValues/LongValues/DoubleValues/... and tries to use
Lucene's APIs such as NumericDocValues or RandomAccessOrds instead whenever
possible.
The next step would be to take advantage of the fact that APIs are the same in
Lucene and Elasticsearch in order to remove our custom comparators and use
Lucene's.
There are a few side-effects to this change:
on top of dynamically computed values (was easier than migrating
GeoDistanceComparator).
aggregators have been updated to make sure a document cannot fall twice in
the same bucket.
max
aggregation ispotentially significantly faster thanks to the random-access API.
Our aggs and p/c aggregations benchmarks don't report differences with this
change on uninverted field data. However the fact that doc values don't need
to be wrapped anymore seems to help a lot. For example
TermsAggregationSearchBenchmark reports ~30% faster terms aggregations on doc
values on string fields with this change, which are now only ~18% slower than
uninverted field data although stored on disk.