Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch fielddata to use Lucene doc values APIs. #6908

Closed
wants to merge 9 commits into from

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Jul 17, 2014

This commits removes BytesValues/LongValues/DoubleValues/... and tries to use
Lucene's APIs such as NumericDocValues or RandomAccessOrds instead whenever
possible.

The next step would be to take advantage of the fact that APIs are the same in
Lucene and Elasticsearch in order to remove our custom comparators and use
Lucene's.

There are a few side-effects to this change:

  • GeoDistanceComparator has been removed, DoubleValuesComparator is used instead
    on top of dynamically computed values (was easier than migrating
    GeoDistanceComparator).
  • SortedNumericDocValues doesn't guarantee uniqueness so long/double terms
    aggregators have been updated to make sure a document cannot fall twice in
    the same bucket.
  • Sorting by maximum value of a field or running a max aggregation is
    potentially significantly faster thanks to the random-access API.

Our aggs and p/c aggregations benchmarks don't report differences with this
change on uninverted field data. However the fact that doc values don't need
to be wrapped anymore seems to help a lot. For example
TermsAggregationSearchBenchmark reports ~30% faster terms aggregations on doc
values on string fields with this change, which are now only ~18% slower than
uninverted field data although stored on disk.

This commits removes BytesValues/LongValues/DoubleValues/... and tries to use
Lucene's APIs such as NumericDocValues or RandomAccessOrds instead whenever
possible.

The next step would be to take advantage of the fact that APIs are the same in
Lucene and Elasticsearch in order to remove our custom comparators and use
Lucene's.

There are a few side-effects to this change:
 - GeoDistanceComparator has been removed, DoubleValuesComparator is used instead
   on top of dynamically computed values (was easier than migrating
   GeoDistanceComparator).
 - SortedNumericDocValues doesn't guarantee uniqueness so long/double terms
   aggregators have been updated to make sure a document cannot fall twice in
   the same bucket.
 - Sorting by maximum value of a field or running a `max` aggregation is
   potentially significantly faster thanks to the random-access API.

Our aggs and p/c aggregations benchmarks don't report differences with this
change on uninverted field data. However the fact that doc values don't need
to be wrapped anymore seems to help a lot. For example
TermsAggregationSearchBenchmark reports ~30% faster terms aggregations on doc
values on string fields with this change, which are now only ~18% slower than
uninverted field data although stored on disk.
@Override
public long valueAt(int index) {
return MurmurHash3.hash(java.lang.Double.doubleToLongBits(values.valueAt(index)));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how this is used, but i think it can be a little misleading that this class extends SortedNumericDocValues, yet the values are not actually sorted?

@rmuir
Copy link
Contributor

rmuir commented Jul 18, 2014

overall looks good, i took a pass thru and added comments (it seems like you already responded/addressed most of them). Thanks for doing this!

@jpountz
Copy link
Contributor Author

jpountz commented Jul 21, 2014

@rmuir I pushed new commits to address your comments.

}
}

// TODO: use terms enum
/** Returns an iterator over the field data terms. */
private static Iterator<BytesRef> terms(final BytesValues.WithOrdinals bytesValues, boolean reverse) {
private static Iterator<BytesRef> terms(final RandomAccessOrds bytesValues, boolean reverse) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we open a followup issue for this? it should use the termsenum (at least when moving forwards), and also i dont understand why it makes deep copies on next()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I think I can remove this code actually!

@rmuir
Copy link
Contributor

rmuir commented Jul 21, 2014

+1 Looks good, i added a comment to StringTermsAggregator.java to ask for a followup as it can have performance implications.

public static SortedNumericDoubleValues distanceValues(final FixedSourceDistance distance, final MultiGeoPointValues geoPointValues) {
final GeoPointValues singleValues = FieldData.unwrapSingleton(geoPointValues);
if (singleValues != null) {
final Bits docsWithField = FieldData.unwrapSingletonBits(geoPointValues);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if docsWithField == null should we maybe wrap set the variable to MatchAllBits?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance, Lucene comparators do it on purpose since the JVM needs to do the null check anyway. So I think it's fine to do the same here?

@s1monw
Copy link
Contributor

s1monw commented Jul 21, 2014

I left a bunch of comments mostly cosmetic looks good though

@jpountz
Copy link
Contributor Author

jpountz commented Jul 21, 2014

@rmuir @s1monw Pushed a new commit addressing your comments.

@s1monw
Copy link
Contributor

s1monw commented Jul 22, 2014

LGTM

jpountz added a commit that referenced this pull request Jul 22, 2014
This commits removes BytesValues/LongValues/DoubleValues/... and tries to use
Lucene's APIs such as NumericDocValues or RandomAccessOrds instead whenever
possible.

The next step would be to take advantage of the fact that APIs are the same in
Lucene and Elasticsearch in order to remove our custom comparators and use
Lucene's.

There are a few side-effects to this change:
 - GeoDistanceComparator has been removed, DoubleValuesComparator is used instead
   on top of dynamically computed values (was easier than migrating
   GeoDistanceComparator).
 - SortedNumericDocValues doesn't guarantee uniqueness so long/double terms
   aggregators have been updated to make sure a document cannot fall twice in
   the same bucket.
 - Sorting by maximum value of a field or running a `max` aggregation is
   potentially significantly faster thanks to the random-access API.

Our aggs and p/c aggregations benchmarks don't report differences with this
change on uninverted field data. However the fact that doc values don't need
to be wrapped anymore seems to help a lot. For example
TermsAggregationSearchBenchmark reports ~30% faster terms aggregations on doc
values on string fields with this change, which are now only ~18% slower than
uninverted field data although stored on disk.

Close #6908
@jpountz jpountz closed this in 3c142e5 Jul 22, 2014
@jpountz jpountz removed the review label Jul 22, 2014
@jpountz jpountz deleted the enhancement/lucene_doc_values_api branch July 22, 2014 13:17
@jpountz jpountz changed the title Fielddata: Switch to Lucene DV APIs. Fielddata: Switch to Lucene doc values APIs. Jul 22, 2014
jpountz added a commit that referenced this pull request Jul 23, 2014
jpountz added a commit that referenced this pull request Jul 23, 2014
@clintongormley clintongormley changed the title Fielddata: Switch to Lucene doc values APIs. Internal: Switch fielddata to use Lucene doc values APIs. Sep 8, 2014
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Feb 4, 2015
…lues for a string field.

This regression was introduced in elastic#6908: the conversion from RandomAccessOrds
to SortedBinaryDocValues goes through Strings while both impls actually work
on BytesRef, so the SortedBinaryDocValues instance could directly return the
BytesRefs returned by the RandomAccessOrds.

Close elastic#9306
jpountz added a commit that referenced this pull request Feb 4, 2015
…lues for a string field.

This regression was introduced in #6908: the conversion from RandomAccessOrds
to SortedBinaryDocValues goes through Strings while both impls actually work
on BytesRef, so the SortedBinaryDocValues instance could directly return the
BytesRefs returned by the RandomAccessOrds.

Close #9306
jpountz added a commit that referenced this pull request Feb 4, 2015
…lues for a string field.

This regression was introduced in #6908: the conversion from RandomAccessOrds
to SortedBinaryDocValues goes through Strings while both impls actually work
on BytesRef, so the SortedBinaryDocValues instance could directly return the
BytesRefs returned by the RandomAccessOrds.

Close #9306
@clintongormley clintongormley added the :Core/Infra/Core Core issues without another label label Jun 7, 2015
@clintongormley clintongormley changed the title Internal: Switch fielddata to use Lucene doc values APIs. Switch fielddata to use Lucene doc values APIs. Jun 7, 2015
@clintongormley clintongormley added :Fielddata and removed :Cache :Core/Infra/Core Core issues without another label :Search/Mapping Index mappings, including merging and defining field types :Search/Search Search-related issues that do not fall into other categories labels Jun 7, 2015
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
…lues for a string field.

This regression was introduced in elastic#6908: the conversion from RandomAccessOrds
to SortedBinaryDocValues goes through Strings while both impls actually work
on BytesRef, so the SortedBinaryDocValues instance could directly return the
BytesRefs returned by the RandomAccessOrds.

Close elastic#9306
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Fielddata labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories v1.4.0.Beta1 v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants