Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use BINARY doc values instead of SORTED_SET doc values to store numeric data #4518

Closed

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Dec 19, 2013

Although SORTED_SET doc values make things like terms aggregations very fast
thanks to the use of ordinals, ordinals are usually not that useful on numeric
data. We are more interested in the values themselves in order to be able to
compute sums, averages, etc. on these values. However, SORTED_SET is quite slow
at accessing values, so BINARY doc values are better suited at storing numeric
data.

It is only allowed to have a single BINARY doc values field instance per field
name per document, which makes it quite challenging to use for multi-valued
fields since all values need to be buffered in memory and converted to a single
field instance in the end. In order to do so easily, all mappers (not only the
root mappers) now have a preParse and a postParse phase, which are called before
and after all fields have been visited for a single document. In the case of the
numeric field mappers, parse now takes care to buffer values and postParse takes
these values, sorts them and deduplicates them before encoding them into a
BINARY doc values field.

floats and doubles are encoded without compression with little-endian byte order
(so that it may be optimizable through sun.misc.Unsafe in the future given that
most computers nowadays use the little-endian byte order) and byte, short, int,
and long are encoded using vLong encoding: they first encode the minimum value
using zig-zag encoding (so that negative values become positive) and then deltas
between successive values.

I ran TermsAggregationSearchBenchmark to get an idea of the impact of this
change and results are promising:

Task Before this change After this change Difference
terms_agg_l_dv 235 145 38% faster
terms_agg_lm_dv 1705 714 58% faster

For reference, terms_agg_l_dv is an aggregation on a single-valued long field
stored in doc values while terms_agg_lm_dv is an aggregation on a multi-valued
long field (10 values per document) stored in doc values.

Close #3993



/** Utility methods to do byte-level encoding. These methods are biased towards little-endian byte order because it is the most
* common byte order and reading several bytes at once may be optimizable in the future with the help of sun.mist.Unsafe. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

w00t

@jpountz
Copy link
Contributor Author

jpountz commented Dec 20, 2013

Simon and I discussed about a better way to do the buffering and I'll explore a different approach that would use the Document object instead of the field mapper to do the buffering.

…ic data.

Although SORTED_SET doc values make things like terms aggregations very fast
thanks to the use of ordinals, ordinals are usually not that useful on numeric
data. We are more interested in the values themselves in order to be able to
compute sums, averages, etc. on these values. However, SORTED_SET is quite slow
at accessing values, so BINARY doc values are better suited at storing numeric
data.

floats and doubles are encoded without compression with little-endian byte order
(so that it may be optimizable through sun.misc.Unsafe in the future given that
most computers nowadays use the little-endian byte order) and byte, short, int,
and long are encoded using vLong encoding: they first encode the minimum value
using zig-zag encoding (so that negative values become positive) and then deltas
between successive values.

Close elastic#3993
@jpountz
Copy link
Contributor Author

jpountz commented Dec 23, 2013

Here is a new approach that does the buffering at the document level (see ParseContext.Document). Since many field mappers were just reverted to go back to what they are in master, I squashed the commits to make it easier to review.


/**
*
*/
public class ParseContext {

/** Fork of {@link org.apache.lucene.document.Document} with additional functionality. */
public static class Document implements Iterable<IndexableField> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool stuff - I wonder if we should only add fields to the multimap that are explicitly added like via add(IndexableField field, String key) for the most of the field this is not needed at all though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it as well but on the other hand I wasn't unhappy that Document.get/getField/getBinaryField/... performs in constant time instead of linear time as in Lucene?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure though it adds a lot of additional objects per document possibly completely useless. Where do we call these methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly in tests, but also in some field mappers. However when it happens in mappers, it is mostly for meta fields (_uid, ...) which are among the first fields so it should be OK.

@s1monw
Copy link
Contributor

s1monw commented Dec 23, 2013

I left some minor comments but this looks awesome. +1 in general I think the next iter goes in though!

@jpountz
Copy link
Contributor Author

jpountz commented Dec 23, 2013

New commit pushed:

  • Behavior of NaN is now consistent with uninverted field data, appearing at most once and always at the last position (compares greater than POSITIVE_INFINITY).
  • Sorting and deduplication utility methods moved to a utility class.


@Override
public Iterator<IndexableField> iterator() {
return fieldList.iterator();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't make sense to return unmodifiableList in getFields but return a mutable iterator here, I guess it's ok to get a mutable list or we should maybe make both unmodifiable but I don't think we should add yet another object wrapper here WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I had forgotten that iterators also have a remove method. The unmodifiable wrapper was mainly useful for me to understand what consumers do with it. I'll remove the wrappers.

@@ -1 +1 @@
Subproject commit 2f5f78f24d8fbacf69c83ab7545654c83965e846
Subproject commit b3ab72486fae1b5c5a5397356a3e113bf72eb6d5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I guess this one will be updated once you rebase :)

@s1monw
Copy link
Contributor

s1monw commented Dec 25, 2013

did another round and left some minor comments that can be handled once its pushed. +1 LGTM

@jpountz
Copy link
Contributor Author

jpountz commented Dec 26, 2013

Thanks for the reviews, Simon. Much appreciated!

@jpountz jpountz closed this Dec 26, 2013
@jpountz jpountz deleted the enhancement/numeric_doc_values branch December 26, 2013 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Numeric field mappers should encode doc values in binary doc values fields (?)
2 participants