Can kerchunk return references where the final chunk is smaller? #436

Open
TomNicholas opened this issue Mar 18, 2024 · 3 comments

@TomNicholas

TomNicholas commented Mar 18, 2024

Basically I want to check some assumptions I'm making about concatenation of kerchunk references in VirtualiZarr (for this issue).

When I scan a single file (e.g. a netCDF file) with kerchunk, I would like to know, for the returned reference dict:

a) are there files where I will get back more than one chunk entry per variable for one file,
b) if so are there files where instead of all those chunks being the same size, one of them could be smaller (like the final chunk can be in the zarr model)?

In the zarr model (b) is allowed, but that doesn't mean that kerchunk ever actually does it.

If (b) never happens, that's nice, because then I can take the shape key in each kerchunk reference dict as describing every chunk of that variable in that file, with no exceptions, and I won't risk unknowingly concatenating arrays into variable-length chunks.
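To make (a) and (b) concrete, here is a minimal sketch of the check I have in mind. It assumes the references are a flat dict in which `"<var>/.zarray"` holds zarr v2 metadata as a JSON string and every other `"<var>/..."` key that does not start with a dot is a chunk entry. The helper name `chunk_grid_summary`, the file name `file.nc`, and the offsets and lengths are all made up for illustration, and the sketch assumes no axis has length zero.

```python
import json
import math


def chunk_grid_summary(refs, var):
    """Summarise the chunk grid of one variable in a kerchunk-style reference dict."""
    meta = json.loads(refs[f"{var}/.zarray"])
    shape, chunks = meta["shape"], meta["chunks"]

    # Number of chunks along each axis, rounding up for a partial final chunk.
    grid = [math.ceil(s / c) for s, c in zip(shape, chunks)]

    # Valid extent of the last chunk along each axis (assumes every axis length > 0).
    last_extent = [s - (n - 1) * c for s, c, n in zip(shape, chunks, grid)]

    # Chunk keys look like "temp/0"; metadata keys look like "temp/.zarray".
    n_refs = sum(
        1
        for key in refs
        if key.startswith(f"{var}/") and not key.split("/", 1)[1].startswith(".")
    )

    return {
        "grid": grid,
        "last_extent": last_extent,
        "n_refs": n_refs,
        "has_partial_chunk": any(e != c for e, c in zip(last_extent, chunks)),
    }


# Hypothetical references: a length-10 float64 variable stored in chunks of 4.
refs = {
    "temp/.zarray": json.dumps({"shape": [10], "chunks": [4], "dtype": "<f8"}),
    "temp/0": ["file.nc", 1000, 32],
    "temp/1": ["file.nc", 1032, 32],
    "temp/2": ["file.nc", 1064, 32],
}

summary = chunk_grid_summary(refs, "temp")
print(summary)
```

In this made-up case there are three chunk entries for one variable, which is case (a), and the last one only covers 2 of its 4 elements, which is case (b).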

@martindurant
Member

a) are there files where I will get back more than one chunk entry per variable for one file

Yes, HDF5 variables are often chunked similarly to zarr variables, though often with smaller chunks. In netCDF3 (which I don't think you meant), there is only limited chunking along the append axis.

b) if so are there files where instead of all those chunks being the same size, one of them could be smaller (like the final chunk can be in the zarr model)?

Yes, you can have a shape that is not an exact multiple of the chunk size, so the last chunk is incomplete. I don't remember right now whether HDF5 stores a full chunk in that case (like zarr does); I suppose it does. It should be easy to test!

Not being able to virtually concatenate variables from files because of the last-chunk issue is one of the drivers for ZEP003.
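As a rough illustration of that last-chunk issue, the sketch below checks whether concatenating several files along one axis would still give a regular chunk grid. It assumes every file uses the same chunk size along the concatenation axis; the function name `concat_keeps_regular_chunks` is hypothetical and not part of kerchunk.

```python
def concat_keeps_regular_chunks(lengths, chunk):
    """Return True if concatenation along one axis keeps a regular chunk grid.

    lengths: length of each file's array along the concatenation axis, in order.
    chunk:   chunk size along that axis, assumed identical in every file.
    """
    # Every file except the last must end exactly on a chunk boundary;
    # only the final file may contribute a partial chunk.
    return all(n % chunk == 0 for n in lengths[:-1])


# Only the last file has a partial chunk, so the result is still regular.
print(concat_keeps_regular_chunks([8, 8, 6], 4))

# The first file ends with a partial chunk in the middle of the result.
print(concat_keeps_regular_chunks([10, 10], 4))
```

The second case is the one that a regular chunk grid cannot represent, since the short chunk ends up in the interior of the concatenated array.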

@TomNicholas
Author

Thanks @martindurant! Does the kerchunk test suite have any examples of reading references from such files?

@martindurant
Member

I am not sure... :|
