Pcodec support #49

Open
mwlon opened this issue May 8, 2024 · 3 comments

@mwlon

mwlon commented May 8, 2024

I'm excited that Nimble has such flexible encodings/compressions! It shouldn't be too hard to add Pcodec, which generally gets a much better compression ratio on numerical data than the traditional dictionary/RLE/.../LZ approach. Compression and decompression speeds could benefit too. This seems important, especially for an ML-focused columnar format.

@pedroerp
Contributor

Hi @mwlon, I'm just reading about Pcodec and it does seem like something that would be interesting to try out. Is this something you would like to do? We can help ensure that Nimble has the right extensibility APIs for you to add it, and we would be interested in experimental results.

Cc: @helfman @Yuhta

@mwlon
Author

mwlon commented May 11, 2024

I'm looking at the repo more now, but I don't see a spec doc. Does Nimble have a concept equivalent to Parquet's fine-grained "data pages"? If not, does it plan to have finer-grained pages in the future? This might affect a Pcodec implementation.

@Yuhta
Contributor

Yuhta commented May 13, 2024

@mwlon Nimble has the same concept as a Parquet page. It is called a "chunk", and chunks live inside a single stream.
