Very bad compression on short inputs 1-127 bytes long #267

Closed
dumblob opened this issue Feb 9, 2022 · 5 comments

Comments

@dumblob

dumblob commented Feb 9, 2022

I wonder whether some heuristic could be employed to drastically reduce the size of the output blosc produces for short inputs.

If I use blosc for things like 64-bit numbers or strings like "a" or "my string", I always end up with output much bigger than the input.

I need to store data in a DB as separate items, and there might be lots of such small items at some point. According to my measurements, this degrades DB performance drastically (by one or more orders of magnitude).

>>> import pickle
>>> import blosc as b  # assumed: b is the python-blosc module
>>> len(b.compress(b'a'))
17
>>> len(pickle.dumps(b'a'))
16
>>> len(b.compress(bytes(str(1), 'ascii')))
17
>>> len(pickle.dumps(1))
5
>>> len(b.compress(b'a' * 127))
143
>>> len(b.compress(b'a' * 128))
35

Every non-boxing scheme I came up with as a workaround is flawed, so it seems I'll need to get my hands dirty with memoryview() and do some custom boxing, unless you have a better idea of how to approach this issue.
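
The sizes reported above are consistent with a fixed 16-byte overhead on inputs shorter than 128 bytes (17 = 1 + 16, 143 = 127 + 16). One way the custom boxing could look is a one-byte tag in front of every item: short or incompressible items are stored verbatim, and only the rest are compressed. The following is a minimal sketch under stated assumptions, not a blosc feature: the names `box`, `unbox`, and `MIN_COMPRESS_LEN` are hypothetical, the 128-byte cutoff is only inferred from the numbers above, and `zlib` from the standard library stands in for blosc so the example runs without extra dependencies.

```python
import zlib

RAW = b"\x00"     # tag: payload is stored verbatim
PACKED = b"\x01"  # tag: payload is compressed

# Hypothetical cutoff, chosen to match where the reported sizes start to shrink.
MIN_COMPRESS_LEN = 128


def box(data: bytes) -> bytes:
    """Prefix data with a 1-byte tag, compressing only when that saves space."""
    if len(data) >= MIN_COMPRESS_LEN:
        packed = zlib.compress(data)  # stand-in for blosc.compress (assumption)
        if len(packed) < len(data):
            return PACKED + packed
    return RAW + data


def unbox(blob: bytes) -> bytes:
    """Invert box() by dispatching on the tag byte."""
    tag, payload = blob[:1], blob[1:]
    if tag == RAW:
        return payload
    if tag == PACKED:
        return zlib.decompress(payload)
    raise ValueError(f"unknown tag {tag!r}")
```

With this layout the worst case costs one extra byte per item, so `box(b'a')` is 2 bytes long. The slicing in `unbox` copies the payload; wrapping `blob` in a `memoryview` first would probably avoid that copy for large items.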

@dumblob
Author

dumblob commented Feb 11, 2022

Any ideas? Does blosc2 do much better on this front (I mean the Python bindings, which I haven't tried yet)?

@FrancescAlted
Member

Yeah, this kind of wild variation in compression ratio for small buffers is expected. Blosc is actually meant for compressing large datasets, so optimizing for such small buffers is a very low priority. Blosc2 is even more slanted towards large data, so the same should apply.

@dumblob
Author

dumblob commented Feb 11, 2022

Ah, thanks. That clarifies it a lot.

Any concrete suggestions on how to approach compression of short inputs?

@FrancescAlted
Member

Sorry, but no ideas. You will have to do your own research.

@dumblob
Author

dumblob commented Dec 7, 2022

OK, thanks anyway. The most important thing for me to know is that Blosc2 does not plan to tackle this type of issue.
