Diff'ing before shuffling: even better "compression" for non-normal data? #55

Open
arnehilmann opened this issue Jul 26, 2014 · 2 comments

Comments

@arnehilmann

Just an idea:
shuffling is efficient when it yields long blocks of identical values.
Assuming non-normally distributed data (e.g. text, images, ...), calculating the difference between consecutive values before shuffling might lead to smaller values, thus increasing the chance of long blocks of zeros afterwards.
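
A minimal sketch of the idea, for illustration only. It does not use Blosc: `zlib` stands in for the compressor, and `delta_encode`, `delta_decode`, `byte_shuffle` and the synthetic random-walk data are hypothetical names and inputs chosen for this example. It diffs a slowly growing series of 32-bit unsigned integers (with wraparound, so the transform is reversible), byte-shuffles both variants, and compares compressed sizes.

```python
import random
import struct
import zlib

BITS = 32                      # assumed element width for this sketch
MASK = (1 << BITS) - 1


def delta_encode(values):
    """Replace each value by its difference from the previous one (mod 2**BITS)."""
    out, prev = [], 0
    for v in values:
        out.append((v - prev) & MASK)
        prev = v
    return out


def delta_decode(deltas):
    """Invert delta_encode with a running sum (mod 2**BITS)."""
    out, acc = [], 0
    for d in deltas:
        acc = (acc + d) & MASK
        out.append(acc)
    return out


def byte_shuffle(values):
    """Group the k-th byte of every element together, as a byte shuffle does."""
    raw = struct.pack("<%dI" % len(values), *values)
    itemsize = BITS // 8
    return b"".join(raw[k::itemsize] for k in range(itemsize))


# Synthetic data: a counter that grows by a small random step each sample.
rng = random.Random(0)
data, level = [], 1_000_000
for _ in range(4096):
    level += rng.randrange(8)
    data.append(level)

plain_size = len(zlib.compress(byte_shuffle(data)))
diffed_size = len(zlib.compress(byte_shuffle(delta_encode(data))))
print("shuffle only:", plain_size, "bytes")
print("diff + shuffle:", diffed_size, "bytes")
```

On this kind of input the deltas fit in the lowest byte, so after shuffling the other three byte planes are almost entirely zero. Data whose consecutive values are unrelated would not benefit in the same way.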

@FrancescAlted
Member

Yes, that is a nice idea. It will probably only work with integers, since differencing can change the precision of floating-point values, but it is worth exploring. There is still room for at least four different pre-conditioners in Blosc, and what you are suggesting may be a good candidate. Would you like to create a PR?

@aparamon

aparamon commented May 6, 2018

In fact, in many cases diff'ing a series of IEEE 754 floats as integers is nearly as efficient as calculating the precise floating-point differences. This works when the typical delta is smaller than the typical magnitude (so the exponent rarely changes), e.g. for temperature measurements in K, masses, and similar non-negative quantities.

In my experiments, this + bitshuffle provided better compression than plain bitshuffle, and was quite fast.
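
A rough sketch of diff'ing floats as integers, not the code used in the experiments above. The helper names (`float_bits`, `bits_float`, `int_delta_encode`, `int_delta_decode`) and the temperature-like series are assumptions made for this example. Each float64 is reinterpreted as a 64-bit unsigned integer and the differences are taken modulo 2**64, so decoding restores the original bit patterns exactly.

```python
import struct

MASK = (1 << 64) - 1


def float_bits(x):
    """Reinterpret an IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_float(b):
    """Reinterpret a 64-bit unsigned integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def int_delta_encode(floats):
    """Diff the raw bit patterns as integers (mod 2**64)."""
    out, prev = [], 0
    for x in floats:
        b = float_bits(x)
        out.append((b - prev) & MASK)
        prev = b
    return out


def int_delta_decode(deltas):
    """Running sum of the integer deltas, reinterpreted back as doubles."""
    out, acc = [], 0
    for d in deltas:
        acc = (acc + d) & MASK
        out.append(bits_float(acc))
    return out


# Temperature-like series in K: small steps relative to the magnitude,
# so every sample shares the same exponent.
temps = [273.15 + 0.01 * i for i in range(1000)]
deltas = int_delta_encode(temps)

# The round trip is bit-exact, unlike a floating-point subtract/add.
assert int_delta_decode(deltas) == temps
# After the first element, the upper three bytes of every delta are zero.
assert all(d < 2**40 for d in deltas[1:])
```

The first delta still carries the full bit pattern of the first sample. When the series crosses a power of two the exponent field changes, but the integer difference stays well defined and the round trip remains exact.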
