
Added an AppendingDeltaPackedLongBuffer-based storage format to single value field data #5706

Closed

Conversation

@bleskes
Member

bleskes commented Apr 7, 2014

The AppendingDeltaPackedLongBuffer uses delta compression in a paged fashion. For data that is roughly monotonic, this results in a reduced memory footprint.
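A minimal sketch of why delta encoding helps roughly monotonic data (the class and helper here are hypothetical, not the Lucene implementation): storing each value relative to the page minimum needs far fewer bits per value than storing the raw values.

```java
// Sketch: bits needed per value for raw timestamps vs. page-relative deltas.
// Hypothetical helper, not the actual AppendingDeltaPackedLongBuffer.
public class DeltaSketch {
    // Number of bits required to represent maxValue (at least 1).
    static int bitsRequired(long maxValue) {
        return maxValue == 0 ? 1 : 64 - Long.numberOfLeadingZeros(maxValue);
    }

    public static void main(String[] args) {
        // Simulated, roughly monotonic timestamps (~1s apart).
        long[] timestamps = new long[1024];
        long t = 1_396_857_600_000L; // assumed epoch-millis base
        for (int i = 0; i < timestamps.length; i++) {
            t += 1000 + (i % 7);
            timestamps[i] = t;
        }
        long maxRaw = 0, min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (long v : timestamps) {
            maxRaw = Math.max(maxRaw, v);
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        // Raw storage must cover the absolute magnitude; delta storage
        // only needs to cover the spread within the page.
        System.out.println("raw bits/value:   " + bitsRequired(maxRaw));
        System.out.println("delta bits/value: " + bitsRequired(max - min));
    }
}
```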

By default, we use the storage format expected to use the least memory. You can force a choice via the new field data setting `memory_storage_hint`, which can be set to `ORDINALS`, `PACKED`, or `PAGED`.
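As a hedged illustration (the field name is invented, and the exact placement and value casing of the setting in the mapping are assumptions based on how field data settings are typically configured), forcing the paged format might look like:

```json
{
  "properties": {
    "timestamp": {
      "type": "date",
      "fielddata": {
        "memory_storage_hint": "paged"
      }
    }
  }
}
```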

Running some benchmarks on simulated time-based data shows a 25-30% reduction in memory usage with a very small performance overhead (the current implementation uses `PACKED` as the memory format with an acceptable overhead ratio of 0.5):


------------------ SUMMARY -------------------------------
docs: 5000000
match percentage: 0.1
memory format hint: PACKED
acceptable_overhead_ratio: 0.5
field data: 19mb
                     name      took    millis
                   hist_l     16.9s        33
------------------ SUMMARY -------------------------------

------------------ SUMMARY -------------------------------
docs: 5000000
match percentage: 0.1
memory format hint: PAGED
acceptable_overhead_ratio: 0.5
field data: 14.6mb
                     name      took    millis
                   hist_l     18.2s        36
------------------ SUMMARY -------------------------------

------------------ SUMMARY -------------------------------
docs: 5000000
match percentage: 0.1
memory format hint: PACKED
acceptable_overhead_ratio: 0.0
field data: 16mb
                     name      took    millis
                   hist_l     17.4s        34
------------------ SUMMARY -------------------------------

------------------ SUMMARY -------------------------------
docs: 5000000
match percentage: 0.1
memory format hint: PAGED
acceptable_overhead_ratio: 0.0
field data: 10.8mb
                     name      took    millis
                   hist_l       21s        42
------------------ SUMMARY -------------------------------

Added an AppendingDeltaPackedLongBuffer-based storage format to single value field data

if (s != null) {
    return "always".equals(s) ? MemoryStorageFormat.ORDINALS : null;
}
return MemoryStorageFormat.fromString(fieldDataType.getSettings().get(SETTING_MEMORY_STORAGE_HINT));

@jpountz

jpountz Apr 8, 2014

Contributor

It would be nice to be able to get the default value from the settings in order to be able to randomize it in our integration tests.

@bleskes

bleskes Apr 8, 2014

Author Member

Not sure I follow? The default value is `null`, which means the code is allowed to decide based on memory size.

@jpountz

jpountz Apr 9, 2014

Contributor

For example, you can look at how the `cache.recycler.page.type` setting is set in `TestCluster.getRandomNodeSettings`: it is either left unset, to make sure things work fine with default settings, or set to a random value, to make sure all recycler types get tested by our integration tests. I was thinking that doing something similar here would help ensure our integration tests pass with any of these formats.

@bleskes

bleskes Apr 11, 2014

Author Member

I see - the problem is that I don't have easy access to node-level settings from that part of the code, and I can't see how to easily add that access.

for (int i = 0; i < reader.maxDoc(); i++) {
    final long ord = ordinals.getOrd(i);
    if (ord == Ordinals.MISSING_ORDINAL) {
        dpValues.add(missingValue);

@jpountz

jpountz Apr 8, 2014

Contributor

I think this might kill compression. Maybe it should add the previous value instead (to make sure it doesn't increase the number of bits required), and then the produced Long and Double values would need to check a bit set to see whether the document has a value, in order to know whether to return the value from this array or the default value?
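The suggested approach can be sketched as follows (class and method names are hypothetical, and a plain long array stands in for the packed buffer): docs without a value store the previous value as a placeholder, so per-page deltas stay small, and a docs-with-value bit set decides at read time whether to return the stored value or the missing value.

```java
import java.util.BitSet;

// Sketch of the suggestion: repeat the previous value for missing docs so
// compression is unaffected, and consult a docs-with-value bit set at read
// time to decide between the stored value and the missing value.
public class SparseLongValues {
    private final long[] values;        // stand-in for the packed buffer
    private final BitSet docsWithValue; // set bit => doc really has a value
    private final long missingValue;

    SparseLongValues(long[] values, BitSet docsWithValue, long missingValue) {
        this.values = values;
        this.docsWithValue = docsWithValue;
        this.missingValue = missingValue;
    }

    public long get(int docId) {
        return docsWithValue.get(docId) ? values[docId] : missingValue;
    }

    // Build: docs without a value reuse the previous stored value so the
    // per-page delta range (and hence bits per value) does not grow.
    public static SparseLongValues build(Long[] perDoc, long missingValue) {
        long[] stored = new long[perDoc.length];
        BitSet bits = new BitSet(perDoc.length);
        long previous = 0;
        for (int i = 0; i < perDoc.length; i++) {
            if (perDoc[i] != null) {
                previous = perDoc[i];
                bits.set(i);
            }
            stored[i] = previous; // placeholder is never returned for missing docs
        }
        return new SparseLongValues(stored, bits, missingValue);
    }
}
```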

@bleskes

bleskes Apr 8, 2014

Author Member

I originally decided not to worry about it and keep things simple (see the comment at the memory calculation, where it is taken into account), as my main use case was timestamps. Nevertheless, reusing the docs-with-values bit set is a smart, simple solution. Will do.

Improved support for missing values by relying on the docsWithValues bitset, which allows choosing a better placeholder

also made page size configurable
@bleskes

Member Author

bleskes commented Apr 11, 2014

@jpountz I pushed another commit. Thx for the feedback

@jpountz

Contributor

jpountz commented Apr 11, 2014

Thanks Boaz, this looks great. +1 to merge

@bleskes bleskes added v2.0.0 labels Apr 11, 2014

@bleskes bleskes closed this in 1d1ca3b Apr 11, 2014

bleskes added a commit that referenced this pull request Apr 11, 2014

Added an AppendingDeltaPackedLongBuffer-based storage format to single value field data

The AppendingDeltaPackedLongBuffer uses delta compression in a paged fashion. For data that is roughly monotonic, this results in a reduced memory footprint.

By default, we use the storage format expected to use the least memory. You can force a choice via the new field data setting `memory_storage_hint`, which can be set to `ORDINALS`, `PACKED`, or `PAGED`.

Closes #5706

bleskes added a commit that referenced this pull request Apr 11, 2014

PR #5706 introduced a bug in the sparse array-backed field data
When we load sparse single valued data, we automatically assign a missing value to represent a document that has none. We try to find a value that will not increase the number of bits needed to represent the data. If that missing value happens to be 0, we do not properly initialize the value array.

This commit solves the problem and also cleans up the code further, to make spotting such issues easier in the future.


@bleskes bleskes deleted the bleskes:exp/single_paged_compressed_fielddata branch May 19, 2014
