-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use BINARY doc values instead of SORTED_SET doc values to store numeric data #4518
Conversation
|
||
|
||
/** Utility methods to do byte-level encoding. These methods are biased towards little-endian byte order because it is the most | ||
* common byte order and reading several bytes at once may be optimizable in the future with the help of sun.mist.Unsafe. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
w00t
Simon and I discussed about a better way to do the buffering and I'll explore a different approach that would use the Document object instead of the field mapper to do the buffering. |
…ic data. Although SORTED_SET doc values make things like terms aggregations very fast thanks to the use of ordinals, ordinals are usually not that useful on numeric data. We are more interested in the values themselves in order to be able to compute sums, averages, etc. on these values. However, SORTED_SET is quite slow at accessing values, so BINARY doc values are better suited at storing numeric data. floats and doubles are encoded without compression with little-endian byte order (so that it may be optimizable through sun.misc.Unsafe in the future given that most computers nowadays use the little-endian byte order) and byte, short, int, and long are encoded using vLong encoding: they first encode the minimum value using zig-zag encoding (so that negative values become positive) and then deltas between successive values. Close elastic#3993
Here is a new approach that does the buffering at the document level (see ParseContext.Document). Since many field mappers were just reverted to go back to what they are in master, I squashed the commits to make it easier to review. |
|
||
/** | ||
* | ||
*/ | ||
public class ParseContext { | ||
|
||
/** Fork of {@link org.apache.lucene.document.Document} with additional functionality. */ | ||
public static class Document implements Iterable<IndexableField> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool stuff - I wonder if we should only add fields to the multimap that are explicitly added like via add(IndexableField field, String key)
for the most of the field this is not needed at all though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about it as well but on the other hand I wasn't unhappy that Document.get/getField/getBinaryField/... performs in constant time instead of linear time as in Lucene?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure though it adds a lot of additional objects per document possibly completely useless. Where do we call these methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly in tests, but also in some field mappers. However when it happens in mappers, it is mostly for meta fields (_uid, ...) which are among the first fields so it should be OK.
I left some minor comments but this looks awesome. +1 in general I think the next iter goes in though! |
New commit pushed:
|
|
||
@Override | ||
public Iterator<IndexableField> iterator() { | ||
return fieldList.iterator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't make sense to return unmodifiableList
in getFields
but return a mutable iterator here, I guess it's ok to get a mutable list or we should maybe make both unmodifiable but I don't think we should add yet another object wrapper here WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I had forgotten that iterators also have a remove method. The unmodifiable wrapper was mainly useful for me to understand what consumers do with it. I'll remove the wrappers.
@@ -1 +1 @@ | |||
Subproject commit 2f5f78f24d8fbacf69c83ab7545654c83965e846 | |||
Subproject commit b3ab72486fae1b5c5a5397356a3e113bf72eb6d5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm I guess this one will be updated once you rebase :)
did another round and left some minor comments that can be handled once its pushed. +1 LGTM |
Thanks for the reviews, Simon. Much appreciated! |
Although SORTED_SET doc values make things like terms aggregations very fast
thanks to the use of ordinals, ordinals are usually not that useful on numeric
data. We are more interested in the values themselves in order to be able to
compute sums, averages, etc. on these values. However, SORTED_SET is quite slow
at accessing values, so BINARY doc values are better suited at storing numeric
data.
It is only allowed to have a single BINARY doc values field instance per field
name per document, which makes it quite challenging to use for multi-valued
fields since all values need to be buffered in memory and converted to a single
field instance in the end. In order to do so easily, all mappers (not only the
root mappers) now have a preParse and a postParse phase, which are called before
and after all fields have been visited for a single document. In the case of the
numeric field mappers, parse now takes care to buffer values and postParse takes
these values, sorts them and deduplicates them before encoding them into a
BINARY doc values field.
floats and doubles are encoded without compression with little-endian byte order
(so that it may be optimizable through sun.misc.Unsafe in the future given that
most computers nowadays use the little-endian byte order) and byte, short, int,
and long are encoded using vLong encoding: they first encode the minimum value
using zig-zag encoding (so that negative values become positive) and then deltas
between successive values.
I ran
TermsAggregationSearchBenchmark
to get an idea of the impact of thischange and results are promising:
For reference,
terms_agg_l_dv
is an aggregation on a single-valued long fieldstored in doc values while
terms_agg_lm_dv
is an aggregation on a multi-valuedlong field (10 values per document) stored in doc values.
Close #3993