[SYSTEMDS-2883] CLA Specialized pointer using Offsetlist #1202
Closed
Baunsgaard wants to merge 4 commits into apache:master from
This commit also makes the cocoding part of execution more efficient by not analyzing groups that are known to contain fewer than x distinct tuples. It also restructures the Iterators as abstract classes instead of interfaces.
[SYSTEMDS-2938] CLA Partitioning Bin-Packing
[SYSTEMDS-2939] CLA BruteForce Cocode

This commit reintroduces the bin-packing partitioning and the brute-force cocode. This combination was previously the default setting and still produces the best compression ratios of the available methods. While this method is good when compression ratios are important, the COST-based model still produces compressions with more CoCoding, resulting in better runtime performance. Future additions will try to bridge this gap. Compressing Covtype again achieves the 32x compression ratio stated in the 2018 CLA paper.

Additionally:
- Progress on and stabilization of the size estimates for colGroups.
- Ignore the number-of-runs calculation for the sample if RLE is disabled.
- Sample-based tests of estimated sizes.
- Added JavaDoc and removed some minor allocation steps in the single-threaded paths of compression.
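For readers unfamiliar with the partitioning step, here is a minimal sketch of first-fit-decreasing bin packing over columns. It is not the code in this commit: the class name, the per-column `weights` (think of an estimated cardinality ratio) and the `capacity` threshold are all assumptions made for illustration. The idea is that columns are packed into bins first, so that the brute-force cocode search only has to consider combinations inside each bin.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not the SystemDS implementation.
final class ColumnBinPackingSketch {

    // Packs column indexes into bins so that the summed weight of each bin
    // stays within capacity. Columns are visited from heaviest to lightest and
    // placed in the first bin that still has room (first-fit decreasing).
    static List<List<Integer>> firstFitDecreasing(double[] weights, double capacity) {
        Integer[] order = new Integer[weights.length];
        for (int i = 0; i < order.length; i++)
            order[i] = i;
        // Sort column indexes by descending weight; the sort is stable.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -weights[i]));

        List<List<Integer>> bins = new ArrayList<>();
        List<Double> load = new ArrayList<>();
        for (int col : order) {
            int target = -1;
            for (int b = 0; b < bins.size(); b++) {
                if (load.get(b) + weights[col] <= capacity) {
                    target = b;
                    break;
                }
            }
            if (target < 0) {
                // No existing bin fits, so open a new one. A column heavier
                // than the capacity therefore ends up alone in its own bin.
                bins.add(new ArrayList<>());
                load.add(0.0);
                target = bins.size() - 1;
            }
            bins.get(target).add(col);
            load.set(target, load.get(target) + weights[col]);
        }
        return bins;
    }
}
```

With weights `{0.5, 0.25, 0.5, 0.125}` and a capacity of `1.0`, columns 0 and 2 share one bin and columns 1 and 3 share another. A brute-force cocode pass would then only compare candidates within each of those two bins.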
This commit isolates the Dictionary into a sub-package and splits ColGroupValue by introducing a new superclass called ColGroupCompressed. This makes it easier to add new types of compression techniques that need neither the dictionary abstraction nor the counts of contained items. One such instance is delta encoding.
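To illustrate why a dictionary-free superclass helps, here is a rough sketch with made-up class names. It does not mirror the actual SystemDS hierarchy. It only shows a common base type that assumes nothing about dictionaries, with one dictionary-backed subclass and one delta-encoded subclass that stores neither a dictionary nor counts.

```java
// Hypothetical names throughout; not the SystemDS class hierarchy.
abstract class CompressedGroupSketch {
    abstract int numRows();

    abstract double get(int row);
}

// Dictionary-backed group: each row stores a code that indexes a table of
// distinct values.
final class DictGroupSketch extends CompressedGroupSketch {
    private final double[] dict;
    private final int[] codes;

    DictGroupSketch(double[] dict, int[] codes) {
        this.dict = dict;
        this.codes = codes;
    }

    int numRows() {
        return codes.length;
    }

    double get(int row) {
        return dict[codes[row]];
    }
}

// Delta-encoded group: only a start value and row-to-row differences are
// kept, so there is no dictionary and no per-value count to maintain.
final class DeltaGroupSketch extends CompressedGroupSketch {
    private final double first;
    private final double[] deltas;

    DeltaGroupSketch(double[] values) {
        if (values.length == 0)
            throw new IllegalArgumentException("need at least one value");
        first = values[0];
        deltas = new double[values.length - 1];
        for (int i = 1; i < values.length; i++)
            deltas[i - 1] = values[i] - values[i - 1];
    }

    int numRows() {
        return deltas.length + 1;
    }

    // Reconstructs a value by summing the deltas up to the requested row.
    double get(int row) {
        double v = first;
        for (int i = 0; i < row; i++)
            v += deltas[i];
        return v;
    }
}
```

Random access in the delta group is linear in the row index here. A real implementation would more likely expose sequential iteration, which is cheap for this layout.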
merged
This commit contains an offset list encoding interface, with byte and char implementations, that allows for efficient offset lists for SDC column groups.
Future work includes encoding with smaller bit widths such as 2, 4, and 6 bits.
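As a rough illustration of the idea, here is a minimal sketch of a byte-based offset list. It is not the implementation in this PR: the class name, the choice to store the first index separately, and the use of a zero byte as a "skip 255 rows" marker are assumptions made for the example. The point is that sorted row indexes can be stored as one-byte gaps instead of four-byte integers.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch, not the SystemDS implementation.
final class ByteOffsetSketch {
    private static final int MAX_DELTA = 255;

    private final int first;     // first row index, stored as a plain int
    private final byte[] deltas; // unsigned one-byte gaps; 0 means "skip 255"

    // Expects strictly increasing row indexes.
    ByteOffsetSketch(int[] sortedIndexes) {
        if (sortedIndexes.length == 0)
            throw new IllegalArgumentException("need at least one index");
        first = sortedIndexes[0];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = first;
        for (int i = 1; i < sortedIndexes.length; i++) {
            int delta = sortedIndexes[i] - prev;
            if (delta <= 0)
                throw new IllegalArgumentException("indexes must be strictly increasing");
            // Gaps larger than one byte are split with zero-valued skip markers.
            while (delta > MAX_DELTA) {
                out.write(0);
                delta -= MAX_DELTA;
            }
            out.write(delta); // now in 1..255, stored as an unsigned byte
            prev = sortedIndexes[i];
        }
        deltas = out.toByteArray();
    }

    // Rebuilds the original row indexes from the encoded gaps.
    int[] decode() {
        int[] result = new int[deltas.length + 1]; // upper bound on the count
        int n = 0;
        int pos = first;
        result[n++] = pos;
        for (byte b : deltas) {
            int d = b & 0xFF;
            if (d == 0) {
                pos += MAX_DELTA; // skip marker: advance without emitting
            } else {
                pos += d;
                result[n++] = pos;
            }
        }
        return Arrays.copyOf(result, n);
    }

    int encodedBytes() {
        return deltas.length;
    }
}
```

For the indexes `{3, 4, 260, 1000}` this sketch stores six delta bytes plus the first index, and `decode()` returns the original array. A char-based variant would follow the same pattern with a 16-bit maximum gap, trading larger entries for fewer skip markers on sparse data.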