reproducible Zarr reading error #2484
Can you tar/zip simple.zarr and post it here?
Yes; see attached. Thanks, Jamie
I have just confirmed that I get the same bad result when creating the zarr file with xarray: ncdump returns bad data for 4.8.1 and hangs for 4.9.0. I am using xarray 0.20.1, zarr 2.8.1, and code that fills the array with 42.0.
The zarr file is attached. It is correct when opened within Python with xarray. For reference, if I create a netCDF file using xarray.to_netcdf(), I also get the correct result.
That file is compressed with Blosc. So where did you install the Blosc filter?
These are both stock Conda installs; the Blosc filter is what ships with them. There is no HDF5_PLUGIN_PATH specified. I think the Blosc filter comes from the numcodecs Python package, which is version 0.10.2. If you use Conda to create an environment with numpy, xarray, and zarr, you should be able to replicate my test setup. I am happy to poke further if you give me some information on how.
In order to read that file, you will need to find out where the C version of the Blosc library is installed.
@DennisHeimbigner, I am afraid I don't understand. I can find where numcodecs installs blosc. The version of c-blosc it uses is 1.21.0, from https://github.com/Blosc/c-blosc/tree/98aad98d0fee2e4acccec78fbb19fb51e1107bf9 ; I can see where it has been linked into the CPython shared library ./blosc.cpython-310-x86_64-linux-gnu.so

I have no problem reading these zarr files, as I created them. However, I want to make zarr files available to a larger audience, in particular biologists who use R and can use the R netCDF package. I cannot, if I want my work to be widely used, ask them to roll a custom netCDF4 library or to switch to python. The Zarr files I am producing are the most default kind of zarr file that can be made, from fresh installs of all the relevant libraries. If netCDF is going to be compatible with zarr, it might want to support them.

For my parochial interests, are there compression options within zarr that would play better with netCDF? Do I have to convert my zarr files to netCDF/HDF? There are a number of reasons I prefer zarr, but if I must I must. Is it to be expected that netCDF/HDF will support the most generic zarr as produced by the primary framework that makes zarr, the python zarr library? I don't mean this rhetorically; I need to decide the best way to serve my data.

Jamie
I should note that NCZarr is intended to be compatible with the Zarr Version 2 Specification. It is possible that the python zarr implementation differs. There is a description of NCZarr compressor (aka filter) support here: https://raw.githubusercontent.com/Unidata/netcdf-c/main/docs/filters.md Look at the section entitled "NCZarr Filter Support". It attempts to describe the differences between how python handles its compressors and how NCZarr handles its compressors.
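For concreteness, the per-array metadata that records the compressor in a Zarr v2 store is the `.zarray` JSON file, which both python-zarr and NCZarr must interpret. A representative fragment (values here are illustrative, not taken from the attached file) looks like this:

```json
{
    "zarr_format": 2,
    "shape": [128, 128],
    "chunks": [64, 64],
    "dtype": "<f8",
    "compressor": {
        "id": "blosc",
        "cname": "lz4",
        "clevel": 5,
        "shuffle": 1
    },
    "fill_value": 0.0,
    "order": "C",
    "filters": null
}
```

The `"id": "blosc"` entry is what a reader must map to an actual compressor implementation: numcodecs in Python, or a compiled HDF5-style filter plugin in the C library.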
Is there a contact in the R community with whom I can discuss this?
The standard in R is the ncdf4 package; the contact information in the R community is given in: https://cran.r-project.org/web/packages/ncdf4/index.html

The zarr that I am producing follows the standard Zarr 2 specification as produced by the python zarr package; see https://zarr.readthedocs.io/en/stable/release.html I just want to re-iterate that the programs I have given you, and the files I have uploaded, are the most generic Zarr output from the system that is used to produce the vast, vast majority of the zarr seen in the wild.

Your project has great potential to be a bridge between the python zarr community and communities who do not use python. It would be a real shame if it did not work with generic Zarr output. Note this is the same issue, I think, as in #2449 and #2234. I am not sure that netCDF "has Zarr" until this is sorted out.

I am happy to help in any way in my capacity with this. I will experiment with uncompressed Zarr, but this is not a practical solution for most data sets.

Jamie
The intention is for netCDF to support generic Zarr output, so this should readily be fixable. @DennisHeimbigner It's already been shown here that the Zarr data in question works fine through Python but fails with ncdump.
We seem to be talking at cross purposes. I can read that file perfectly well with ncdump and with the C-language plugins properly installed. I must remind JamiePringle that the compressor implementations are language dependent. So NCZarr (C language) cannot use the NumCodec compressors (Python language). The missing information is how netcdf-c is being installed (conda, pypi, etc.?).
I just sent a message to the maintainer of ncdf4 to see if he can provide insight into how ncdf4 handles plugins. I will pass along any useful response.
@JamiePringle Can you share the commands you used to build and install netcdf-c 4.9.0? Also, to be clear, you're not setting any HDF5_PLUGIN_PATH environment variable?
@dopplershift I am using the ncdump commands provided by Macports (4.9.0 on the Mac) and the Conda install on both Mac and linux machines. I am not setting any HDF5_PLUGIN_PATH. I have compiled netCDF before, but I want to test the system that my data consumers are most likely to use. I am happy to compile netCDF myself, but it does not really solve my larger problem.

@DennisHeimbigner: you say "I must remind JamiePringle that the compressor implementations are language dependent. So NCZarr (C language) cannot use the NumCodec compressors (Python language)." I am not sure what to do with that. Does that mean that I cannot expect the stock netCDF from unidata to consume the zarr produced by the stock python package that produces zarr? In what sense is the data format interoperable? Does it mean I can't expect python produced zarr to be readable by netCDF? Also, if you look above, I link to the c-library that numcodecs uses to do the blosc compression.

My goal is to produce data that can be easily consumed and distributed by scientists who use R and python and whatever.
@DennisHeimbigner Did you install Blosc yourself, or from a package manager?

If I'm understanding the main thrust of the issue here (forgive me for jumping in late, I spent much of the last few days chasing down the Julia issue referenced over in #2486), it boils down to making sure folks who are using the netCDF-C library (and software dependent upon it) know they need to use compatible compressors, and how to configure them.

@JamiePringle this is going to be the case where arriving at the solution is harder than actually adopting it (as is so often the case). This is complicated by the fact that this is newer functionality, and we do not maintain the installs provided by package managers.

I'm going to replicate Dennis' success on my own M1 MacOS machine, once I know where the Blosc install he's using successfully is coming from. I'll document that here, and take a look at our documentation to see how we can highlight this in a more visible manner.
@WardF I am not sure if you were asking me, but the netCDF installs I was using are those provided either by Conda, or by macports (Conda installs a full set of netCDF libraries and binaries when it installs a package that requires netcdf). In all cases I used stock installs, and was careful to look in the paths so that I was not mixing and matching libraries and executables. It is worth noting that I get the same results in my linux and macOS work. If it would be useful, I can use ldd to figure out what libraries are being linked to. Thanks for looking into this!
Hi @JamiePringle I was primarily asking Dennis, but it would be great if you were able to report the dependencies on your MacOS machine. I believe you'll need to use otool -L rather than ldd on MacOS. That said, the compressor libraries are pulled in at runtime and won't necessarily show up as links to the libraries; that's what the HDF5_PLUGIN_PATH mechanism is for.
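For the Linux side of that dependency report, one way to inspect the link-time dependencies is something like the following sketch (the path resolution and grep pattern are assumptions; results vary by install):

```shell
# Show which shared libraries the installed ncdump links against (Linux).
# On macOS, the analogous tool is `otool -L`.
ldd "$(command -v ncdump)" | grep -Ei 'hdf5|blosc|netcdf'
```

As noted above, filter plugins loaded at runtime via HDF5_PLUGIN_PATH will not appear in this output, so an empty grep for blosc here does not by itself prove the filter is unavailable.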
@WardF Will do this evening. Note that I had the exact same issues on my ubuntu 20.04 machine, even including the exact values of the output from ncdump...
@DennisHeimbigner @WardF @dopplershift I would be happy to post to either the zarr developers or track down the maintainer of the zarr conda package to nudge them as well towards doing what needs to be done. The existing packages do read their own blosc compressed data (it is their default). These installs do not include a plugins directory for netCDF/HDF, as far as I can tell. Or is there a helpful way I could be a matchmaker between your two groups?

My ultimate goal is that the netCDF obtained from the common package managers reads the zarr produced by xarray and zarr as obtained from the most common package managers. Both the zarr and xarray groups suggest Conda as the best way for regular users to install their packages.

Jamie
Thanks. If you can figure out to whom I should talk, please tell me.
@DennisHeimbigner Well in regards to fixing up conda, we can start by fixing the build of 4.9.0. See this open PR that you're tagged on: conda-forge/libnetcdf-feedstock#140
This should have been fixed with PR #2448
@DennisHeimbigner I'll be happy to, but can you let me know if you installed blosc from a package manager of some sort, or if you compiled from source? Thanks :)
@DennisHeimbigner I'm working to verify, but at some point someone else needs to learn the conda-forge stuff so I'm not in the critical path. It's already critical to so much of our community, and it may be even more important to reliably deploy these complicated plugin-based setups.
As I recall, I installed c-blosc on Ubuntu 21 using apt to get package libblosc-dev
Just an update on this; using the
Using a version of
(Note: the results in my previous message were generated with the current developer snapshot of netcdf-c. I'm able to replicate @JamiePringle's results with
(quoting for my own reference)
@DennisHeimbigner Can you talk more about how you were able to read the file on Ubuntu? It seems pretty straightforward: set HDF5_PLUGIN_PATH to the appropriate location?
What is the "appropriate location" you used for HDF5_PLUGIN_PATH?
@DennisHeimbigner On my system?
Remember you need to point to the directory that contains the file
That file was not installed as part of
As a reminder, note that HDF5 and NCZarr do not invoke the
Thanks, Dennis, I've gotten everything working. Let me think about this.
So @JamiePringle, I'll be updating our documentation to clarify a few things, but essentially, what's going on here is this:

From scratch, the steps to get this to work (on my system) were as follows:
Once built and installed, I set the environmental variable HDF5_PLUGIN_PATH. The reason this works is that the variable points NCZarr at the directory containing the compiled filter plugins.
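The concrete commands were not preserved in this thread; the following is a hedged sketch of a from-source setup of this kind, where the plugin directory path is an assumption about a typical install rather than a transcript of the actual steps (the `--with-plugin-dir` configure option exists in netcdf-c 4.9.0 and later):

```shell
# Build netcdf-c with filter plugins installed to a known directory
# (the path here is an assumed example).
./configure --with-plugin-dir=/usr/local/hdf5/lib/plugin
make && make install

# At runtime, tell the library where to find the compiled filter plugins,
# then read the Blosc-compressed Zarr store:
export HDF5_PLUGIN_PATH=/usr/local/hdf5/lib/plugin
ncdump 'file://simple.zarr#mode=nczarr,zarr'
```

With the plugin path set, ncdump can locate the C Blosc filter and decompress the chunks that the Python numcodecs Blosc wrote.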
I'll also be updating nc-config to have the option to report plugins supported.
Thank you for this; it is very useful. I will test and make sure uncompressed zarr is readable by the stock CRAN and conda builds. But uncompressed files would be a significant issue for distributing data.

Do you have any plans to reach out to the folks who package the netCDF used in conda, CRAN, and the like? I wonder if there is something you can do in the building instructions to highlight the utility of including blosc and the like in the build, as is done with HDF?

I could always show up in the conda maintainers' github pages and raise the issues. Do you have a website that describes the best practices in building netCDF with blosc and other filters?

Thanks to everyone who worked this issue!

Jamie
Thank you
@JamiePringle No problem; to answer your question posed above, we are involved in the conda feedstock pipeline, but I am not certain who I'd have to speak with for CRAN & the like. Clarifying the documentation is a great idea and a great start, however. Thanks again!
This is my first attempt reading zarr from netCDF, and I have found this simple, reproducible error on both my Apple Silicon and Ubuntu 20.04 machines. It exists in both 4.9.0 and 4.8.1 in different ways. I create a simple little pure zarr data set as follows with zarr 2.11.3 or zarr 2.8.1:
When I read this back in with zarr in python, as expected I get an array filled with 42.0. When I use the command
ncdump 'file://simple.zarr#mode=nczarr,zarr'
with 4.8.1 on ubuntu 20.04 or on the Apple silicon, I get bad data. When I try to read it with 4.9.0 on the apple silicon with the same command, it hangs and never returns...
ultimately, my goal is to make my zarr files available to R users via netCDF, but figuring this out seems a logical first step. I hope this bug report is helpful, and that I am not doing something stupid.
Jamie
p.s. for 4.9.0,
nc-config --all
gives:
and on ubuntu netcdf 4.8.1 gives: