Field data should support more than 2B ordinals per segment #3189

Closed
jpountz opened this Issue Jun 15, 2013 · 1 comment

@jpountz
Contributor

jpountz commented Jun 15, 2013

Field data currently uses integers to store ordinals, although a Lucene index can have more than 2B unique values. We should use longs in the APIs instead and fix the implementations to actually support more than 2B unique values.
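
As a rough illustration of the kind of API change this implies (hypothetical interface and method names, not the actual field data API), ordinal-returning methods would move from int to long:

    interface SegmentOrdinals {
        // Number of unique values in the segment; may now exceed Integer.MAX_VALUE.
        long getNumOrds();

        // Ordinal of the value of the given document (was: int getOrd(int docId)).
        long getOrd(int docId);
    }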

@ghost ghost assigned jpountz Jun 15, 2013

@jpountz
Contributor

jpountz commented Jul 9, 2013

I opened a pull request (#3306) that tries to fix this issue. In addition, it brings some nice memory improvements. Here is the memory usage reported by LongFieldDataBenchmark for the whole field data instance (ordinals + data) without and with this commit (you can also have a look at #3220 to see how much memory was used before we started using the PackedInts API to store values):

                               Before      After
SINGLE_VALUED_DENSE_ENUM     488.3 KB   488.3 KB
SINGLE_VALUED_DENSE_DATE       4.3 MB     4.3 MB
MULTI_VALUED_DATE             10.5 MB     5.9 MB
MULTI_VALUED_ENUM              7.8 MB     1.2 MB 
SINGLE_VALUED_SPARSE_RANDOM    3.5 MB     1.5 MB
MULTI_VALUED_SPARSE_RANDOM     7.7 MB     3.4 MB
MULTI_VALUED_DENSE_RANDOM     23.7 MB    17.8 MB

Nothing changes for the single-valued case (as expected) but there are some nice savings for the multi-valued case, especially when the values don't require much space.

I also ran TermsFacetSearchBenchmark to see how this impacts faceting; here are the results:

Before:
                     name      took    millis
                  terms_s      6.1s        30
              terms_map_s     20.7s       103
                  terms_l     13.8s        69
              terms_map_l       14s        70
                 terms_sm       22s       110
             terms_map_sm      3.3m      1009
                 terms_lm      1.3m       391
             terms_map_lm      1.3m       390
          terms_stats_s_l     31.9s       159
         terms_stats_s_lm        1m       322
         terms_stats_sm_l      4.3m      1319
After:
                  terms_s      5.4s        27
              terms_map_s     20.7s       103
                  terms_l     12.7s        63
              terms_map_l     12.7s        63
                 terms_sm     40.1s       200
             terms_map_sm      3.3m      1015
                 terms_lm      1.6m       486
             terms_map_lm      1.6m       486
          terms_stats_s_l     28.8s       144
         terms_stats_s_lm      1.3m       415
         terms_stats_sm_l      4.3m      1300

In some cases, faceting is slower. I ran the benchmark under a profiler, and MonotonicAppendingLongBuffer and AppendingLongBuffer, which are used to store the ordinals, were among the hottest spots. Since they are also the reason for these memory savings, maybe that is not too bad?
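
To make the trade-off concrete, here is a minimal sketch of the multi-valued ordinals layout that these buffers hold, written with plain arrays (the pull request keeps the same two sequences but stores them in packed form, which is what saves memory and what showed up in the profiler):

    class MultiOrdinals {
        // endOffsets[doc] = index past the last ordinal of doc; monotonically
        // increasing, which is why a monotonic packed buffer compresses it well.
        private final long[] endOffsets;
        // Ordinals of all documents, concatenated in document order.
        private final long[] ords;

        MultiOrdinals(long[] endOffsets, long[] ords) {
            this.endOffsets = endOffsets;
            this.ords = ords;
        }

        // Copies the ordinals of the given document into a fresh array.
        long[] ordsForDoc(int doc) {
            long start = doc == 0 ? 0 : endOffsets[doc - 1];
            long end = endOffsets[doc];
            long[] result = new long[(int) (end - start)];
            for (int i = 0; i < result.length; i++) {
                result[i] = ords[(int) (start + i)];
            }
            return result;
        }
    }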

@jpountz jpountz closed this in 12d9268 Jul 19, 2013

jpountz added a commit that referenced this issue Jul 19, 2013

Make field data able to support more than 2B ordinals per segment.
Although segments are limited to 2B documents, there is no limit on the number
of unique values that a segment may store. This commit replaces 'int' with
'long' every time a number is used to represent an ordinal and modifies the
data-structures used to store ordinals so that they can actually support more
than 2B ordinals per segment.

This commit also improves memory usage of the multi-ordinals data-structures
and the transient memory usage which is required to build them (OrdinalsBuilder)
by using Lucene's PackedInts data-structures. In the end, loading the ordinals
mapping from disk may be a little slower, field-data-based features such as
faceting may be slightly slower or faster depending on whether being nicer to
the CPU caches balances the overhead of the additional abstraction or not, and
memory usage should be better in all cases, especially when the size of the
ordinals mapping is not negligible compared to the size of the values (numeric
data for example).

Close #3189
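
For illustration, a short sketch of how a packed structure replaces a plain long[] for per-document ordinals (hypothetical counts; this is not the actual OrdinalsBuilder code and assumes Lucene's PackedInts.bitsRequired/getMutable API):

    import org.apache.lucene.util.packed.PackedInts;

    public class PackedOrdinalsSketch {
        public static void main(String[] args) {
            long maxOrd = 5_000_000_000L;   // more than 2B unique values per segment
            int docCount = 1_000_000;       // hypothetical segment size

            // Each slot needs only bitsRequired(maxOrd) bits instead of a full 64-bit long.
            int bitsPerValue = PackedInts.bitsRequired(maxOrd);
            PackedInts.Mutable ords = PackedInts.getMutable(docCount, bitsPerValue, PackedInts.COMPACT);

            ords.set(0, 4_999_999_999L);    // ordinals are addressed as longs
            System.out.println(ords.get(0));
        }
    }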

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015

Make field data able to support more than 2B ordinals per segment.