[SYSTEMDS-2883] CLA Specialized pointer using Offsetlist #1202

Closed
Baunsgaard wants to merge 4 commits into apache:master from Baunsgaard:CompSampleOpt

Conversation

@Baunsgaard
Contributor

This commit contains an offset list encoding interface, with byte and char implementations, that allows for efficient offset lists for SDC column groups.

Future work includes encoding with non-standard bit widths such as 2, 4, and 6 bits.
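
To illustrate the idea of a byte-based offset list, here is a minimal sketch. It is not the code from this PR: the class name `ByteOffsetSketch`, its methods, and the convention that a `0` byte means "advance 255 rows without emitting an offset" are all assumptions made for the example. The sketch stores the first offset in full and every following offset as a one-byte gap to its predecessor. A char-based variant would presumably work the same way with 16-bit gaps.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch (not the SystemDS classes): sorted offsets stored as one-byte gaps. */
final class ByteOffsetSketch {
    // Largest gap a single unsigned byte can represent.
    private static final int MAX_DELTA = 255;

    private final int first;     // first offset, stored in full
    private final byte[] deltas; // gaps to the previous offset; 0 is a filler meaning "skip 255"
    private final int size;      // number of offsets encoded

    ByteOffsetSketch(int[] sortedOffsets) {
        if (sortedOffsets.length == 0)
            throw new IllegalArgumentException("need at least one offset");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        first = sortedOffsets[0];
        for (int i = 1; i < sortedOffsets.length; i++) {
            int gap = sortedOffsets[i] - sortedOffsets[i - 1];
            if (gap <= 0)
                throw new IllegalArgumentException("offsets must be strictly increasing");
            // Gaps that do not fit in one byte are bridged with filler bytes.
            while (gap > MAX_DELTA) {
                out.write(0);
                gap -= MAX_DELTA;
            }
            out.write(gap); // gap is now in 1..255
        }
        deltas = out.toByteArray();
        size = sortedOffsets.length;
    }

    /** Rebuilds the original offsets by summing the gaps. */
    int[] decode() {
        int[] result = new int[size];
        int pos = first;
        int n = 0;
        result[n++] = pos;
        for (byte b : deltas) {
            int d = b & 0xFF; // read the byte as unsigned
            if (d == 0) {
                pos += MAX_DELTA; // filler: move forward, emit nothing
            } else {
                pos += d;
                result[n++] = pos;
            }
        }
        return result;
    }

    /** Number of bytes used for the gaps (excluding the first offset). */
    int encodedBytes() {
        return deltas.length;
    }
}
```

With this layout, dense offset lists cost about one byte per entry instead of four, at the price of extra filler bytes when consecutive offsets are far apart.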

@Baunsgaard Baunsgaard force-pushed the CompSampleOpt branch 3 times, most recently from 4618a4b to d6aa553 Compare March 23, 2021 19:00
@Baunsgaard Baunsgaard changed the title from "SYSTEMDS-???] CLA Offsetlist encoding interface" to "[SYSTEMDS-2883] CLA Specialized pointer using Offsetlist" Mar 24, 2021
@Baunsgaard Baunsgaard force-pushed the CompSampleOpt branch 2 times, most recently from d84f629 to a0db894 Compare April 20, 2021 08:23
@Baunsgaard Baunsgaard force-pushed the CompSampleOpt branch 4 times, most recently from d58a1e4 to b5e2b52 Compare April 22, 2021 16:23
This commit also makes the cocoding part of execution more efficient by
not analyzing groups that are known to contain fewer than x
distinct tuples.

Restructure the Iterators as abstract classes instead of interfaces
[SYSTEMDS-2938] CLA Partitioning Bin-Packing
[SYSTEMDS-2939] CLA BruteForce Cocode

This commit reintroduces the bin packing and brute force Cocode.
This combination was the default setting previously, and it still produces
the best compression ratios of the available methods.
While this method is good if compression ratios are important,
the COST based model still produces compressions with more CoCoding,
resulting in better runtime performance. Future additions will try to
bridge this gap.
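
As a rough illustration of the partitioning step, the sketch below packs columns into bins with a first-fit-decreasing heuristic, so that a brute-force search for cocoding candidates only has to look inside each bin. This is not the implementation in this PR: the class name `CocodeBinPackingSketch`, the use of an estimated distinct count as the column weight, and the capacity value are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: first-fit-decreasing bin packing of columns by weight. */
final class CocodeBinPackingSketch {

    /**
     * @param weights  assumed per-column weight, e.g. an estimated number of distinct values
     * @param capacity assumed maximum total weight allowed in one bin
     * @return bins of column indexes; each bin would be searched for cocoding separately
     */
    static List<List<Integer>> pack(double[] weights, double capacity) {
        // Visit columns from heaviest to lightest.
        Integer[] order = new Integer[weights.length];
        for (int i = 0; i < order.length; i++)
            order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -weights[i]));

        List<List<Integer>> bins = new ArrayList<>();
        List<Double> load = new ArrayList<>();
        for (int col : order) {
            // Find the first bin that still has room for this column.
            int target = -1;
            for (int b = 0; b < bins.size(); b++) {
                if (load.get(b) + weights[col] <= capacity) {
                    target = b;
                    break;
                }
            }
            // No bin fits (or the column alone exceeds capacity): open a new bin.
            if (target < 0) {
                bins.add(new ArrayList<>());
                load.add(0.0);
                target = bins.size() - 1;
            }
            bins.get(target).add(col);
            load.set(target, load.get(target) + weights[col]);
        }
        return bins;
    }
}
```

For example, `pack(new double[] {50, 20, 40, 30, 10}, 60)` groups the five columns into three bins. Keeping bins small is what makes an exhaustive comparison of column combinations inside each bin affordable.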

Compressing Covtype again achieves the 32x compression ratio as stated in
the 2018 CLA Paper.

Additionally:

- Progress on and stabilization of the size estimates of colGroups.
- Ignore the number-of-runs calculation for the sample if RLE is disabled.
- Sample-based tests of estimated sizes.
- Added Javadoc and removed some minor allocation steps in the
  single-threaded paths of compression.
This commit isolates the Dictionary in its own subpackage and
splits part of ColGroupValue out into a new superclass called
ColGroupCompressed. This makes it easier to add new
types of compression techniques that need neither the
dictionary abstraction nor the counts of contained items.
One such instance is delta encoding.
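
For context on why delta encoding needs neither a dictionary nor item counts, here is a minimal sketch of the technique on a plain array. It is not a SystemDS column group; the class name `DeltaEncodingSketch` and both methods are hypothetical and only show the general idea of storing differences between consecutive values.

```java
/** Hypothetical sketch of delta encoding on a plain array of values. */
final class DeltaEncodingSketch {

    /** Stores the first value as-is and every later value as the difference to its predecessor. */
    static long[] encode(long[] values) {
        long[] deltas = new long[values.length];
        long prev = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - prev;
            prev = values[i];
        }
        return deltas;
    }

    /** Restores the original values with a running sum over the differences. */
    static long[] decode(long[] deltas) {
        long[] values = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }
}
```

Slowly changing columns turn into many small differences, which can then be stored compactly without looking up any distinct-value table.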
@Baunsgaard
Contributor Author

merged

@Baunsgaard Baunsgaard closed this Apr 23, 2021
@Baunsgaard Baunsgaard deleted the CompSampleOpt branch June 8, 2021 09:22