Diff'ing before shuffling: even better "compression" for non-normal data? #55

arnehilmann · 2014-07-26T19:15:19Z

Just an idea:
shuffling is effective when it yields long runs of identical values.
For data that is not normally distributed (e.g. text, images, ...), computing the difference between consecutive values before shuffling might yield smaller values, thus increasing the chance of long runs of zeros afterwards.
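To make the idea concrete, here is a minimal sketch in plain Python (standard library only). It is not Blosc's API or its shuffle implementation: `delta_encode`, `delta_decode` and `byte_shuffle` are hypothetical helpers, zlib stands in for the real codec, and the input is a synthetic slowly increasing series of 32-bit integers chosen for illustration.

```python
import random
import struct
import zlib

MASK32 = 0xFFFFFFFF

def delta_encode(values):
    """Replace each value by its difference from the previous one (mod 2**32)."""
    out, prev = [], 0
    for v in values:
        out.append((v - prev) & MASK32)
        prev = v
    return out

def delta_decode(deltas):
    """Inverse of delta_encode: running sum mod 2**32."""
    out, acc = [], 0
    for d in deltas:
        acc = (acc + d) & MASK32
        out.append(acc)
    return out

def byte_shuffle(values, itemsize=4):
    """Group byte 0 of every item together, then byte 1, and so on."""
    raw = struct.pack("<%dI" % len(values), *values)
    return b"".join(raw[i::itemsize] for i in range(itemsize))

# Synthetic data (an assumption): a counter that grows by 0..3 per step.
rng = random.Random(0)
values, current = [], 100000
for _ in range(10000):
    current += rng.randint(0, 3)
    values.append(current)

plain_size = len(zlib.compress(byte_shuffle(values)))
diffed_size = len(zlib.compress(byte_shuffle(delta_encode(values))))
print("shuffle only:", plain_size, "bytes; diff + shuffle:", diffed_size, "bytes")
```

On this kind of input the deltas all fit in the lowest byte, so the three upper byte planes become runs of zeros after shuffling. Data without correlation between neighbouring values would not be expected to benefit.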

FrancescAlted · 2014-07-28T12:16:27Z

Yes, that is a nice idea. It will probably only work with integers, as differencing can change the precision of floating-point values, but it is worth exploring. We still have room for at least four different pre-conditioners in Blosc, and what you are suggesting may be a good candidate. Would you like to create a PR?

aparamon · 2018-05-06T17:30:02Z

In fact, in many cases diff'ing a series of IEEE 754 floats reinterpreted as integers is nearly as effective as calculating the precise floating-point differences, and it is lossless. This works if the typical delta is smaller than the typical magnitude (so the exponent changes rarely), e.g. for measurements of temperature in K, masses, and similar non-negative quantities.
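A rough sketch of what this looks like, again in plain Python. The helper names are hypothetical and the temperature series is synthetic (an assumption for illustration, not measured data). Each double is reinterpreted as a 64-bit unsigned integer via `struct`, and differences are taken modulo 2**64, so the round trip is bit-exact.

```python
import math
import random
import struct

MASK64 = (1 << 64) - 1

def float_bits(x):
    """Reinterpret an IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_float(b):
    """Reinterpret a 64-bit unsigned integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def diff_floats_as_ints(values):
    """Integer difference of consecutive bit patterns (mod 2**64)."""
    out, prev = [], 0
    for x in values:
        b = float_bits(x)
        out.append((b - prev) & MASK64)
        prev = b
    return out

def undiff(deltas):
    """Inverse of diff_floats_as_ints: running sum, then back to doubles."""
    out, acc = [], 0
    for d in deltas:
        acc = (acc + d) & MASK64
        out.append(bits_float(acc))
    return out

def to_signed(d):
    """Interpret a 64-bit pattern as a two's-complement signed integer."""
    return d - (1 << 64) if d >> 63 else d

# Synthetic temperatures in K: all values stay within one binade [256, 512),
# so the exponent bits never change between neighbours.
rng = random.Random(0)
temps = [293.15 + 5.0 * math.sin(i / 200.0) + rng.uniform(-0.01, 0.01)
         for i in range(5000)]

deltas = diff_floats_as_ints(temps)
restored = undiff(deltas)
largest = max(abs(to_signed(d)) for d in deltas[1:])
print("bit-exact round trip:", restored == temps)
print("largest |delta| needs", largest.bit_length(), "bits out of 64")
```

Because neighbouring values share sign and exponent, each delta is just the mantissa difference, so its top bytes are all 0x00 or all 0xFF. Those are the byte (or bit) planes a subsequent shuffle turns into long constant runs. When the exponent does change between neighbours, the delta is large for that step but still decodes exactly.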

In my experiments, this + bitshuffle did provide better compression compared to plain bitshuffle, and was quite fast.
