reproducible Zarr reading error #2484

Closed
JamiePringle opened this issue Aug 21, 2022 · 41 comments · Fixed by #2524
@JamiePringle

This is my first attempt at reading Zarr with netCDF, and I have found this simple, reproducible error on both my Apple Silicon and Ubuntu 20.04 machines. It appears in both 4.9.0 and 4.8.1, in different ways. I create a simple, pure-Zarr data set as follows with zarr 2.11.3 or zarr 2.8.1:

import zarr

# create a minimal Zarr store containing a single 10x10 float32 array filled with 42
store = zarr.DirectoryStore('simple.zarr')
root = zarr.group(store=store)
z = root.create('z', shape=(10, 10), dtype='f', overwrite=True)
z[:] = 42

When I read this back with zarr in Python, I get, as expected, an array filled with 42.0. When I use the command ncdump 'file://simple.zarr#mode=nczarr,zarr' with 4.8.1 on Ubuntu 20.04 or on Apple Silicon, I get

netcdf simple {
dimensions:
        .zdim_10 = 10 ;
variables:
        float z(.zdim_10, .zdim_10) ;
data:

 z =
  2.080671e-36, 5.605194e-43, 5.605194e-43, 6.305843e-44, 2.802597e-44, 
    2.942727e-44, 9.187894e-41, 3.087947e-38, 39.82812, 1.36231e+10,
  48.5647, 9.24857e-44, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ;
}

When I try to read it with 4.9.0 on Apple Silicon with the same command, it hangs and never returns...

Ultimately, my goal is to make my Zarr files available to R users via netCDF, but figuring this out seems a logical first step. I hope this bug report is helpful, and that I am not doing something stupid.

Jamie

P.S. For 4.9.0, nc-config --all gives

This netCDF 4.9.0 has been built with the following features: 

  --cc            -> /usr/bin/clang
  --cflags        -> -I/opt/local/include
  --libs          -> -L/opt/local/lib -lnetcdf
  --static        -> -lhdf5_hl -lhdf5 -lz -ldl -lm -lzstd -lbz2 -lcurl -lxml2

  --has-c++       -> no
  --cxx           -> 

  --has-c++4      -> no
  --cxx4          -> 

  --has-fortran   -> no
  --has-dap       -> yes
  --has-dap2      -> yes
  --has-dap4      -> yes
  --has-nc2       -> yes
  --has-nc4       -> yes
  --has-hdf5      -> yes
  --has-hdf4      -> no
  --has-logging   -> no
  --has-pnetcdf   -> no
  --has-szlib     -> no
  --has-cdf5      -> no
  --has-parallel4 -> no
  --has-parallel  -> no
  --has-nczarr    -> yes

  --prefix        -> /opt/local
  --includedir    -> /opt/local/include
  --libdir        -> /opt/local/lib
  --version       -> netCDF 4.9.0

and on Ubuntu, netCDF 4.8.1 gives

This netCDF 4.8.1 has been built with the following features: 

  --cc            -> x86_64-conda-linux-gnu-cc
  --cflags        -> -I/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/include
  --libs          -> -L/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib -lnetcdf
  --static        -> -lmfhdf -ldf -lhdf5_hl -lhdf5 -lm -lcurl -lzip

  --has-c++       -> no
  --cxx           -> 

  --has-c++4      -> no
  --cxx4          -> 

  --has-fortran   -> yes
  --fc            -> /home/conda/feedstock_root/build_artifacts/netcdf-fortran_1642696590650/_build_env/bin/x86_64-conda-linux-gnu-gfortran
  --fflags        -> -I/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/include -I/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/include
  --flibs         -> -L/home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib -lnetcdff -lnetcdf -lnetcdf -lnetcdff_c
  --has-f90       -> TRUE
  --has-f03       -> yes

  --has-dap       -> yes
  --has-dap2      -> yes
  --has-dap4      -> yes
  --has-nc2       -> yes
  --has-nc4       -> yes
  --has-hdf5      -> yes
  --has-hdf4      -> yes
  --has-logging   -> no
  --has-pnetcdf   -> no
  --has-szlib     -> no
  --has-cdf5      -> yes
  --has-parallel4 -> no
  --has-parallel  -> no
  --has-nczarr    -> yes

  --prefix        -> /home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022
  --includedir    -> /home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/include
  --libdir        -> /home/pringle/anaconda3/envs/py3_parcels_mpi_bleedingApr2022/lib
  --version       -> netCDF 4.8.1
@DennisHeimbigner
Collaborator

Can you tar/zip simple.zarr and post it here?

@JamiePringle
Author

JamiePringle commented Aug 22, 2022

Yes; see attached. Thanks, Jamie

simple.zarr.zip

@JamiePringle
Author

I have just confirmed that I get the same bad result when creating the Zarr file with xarray: ncdump returns bad data for 4.8.1 and hangs for 4.9.0. I am using xarray 0.20.1, zarr 2.8.1, and the following code, which fills the array with 42.0:

import numpy as np
import zarr
import xarray as xr


#=========================
#pure zarr

store = zarr.DirectoryStore('simple.zarr')
root = zarr.group(store=store)
z = root.create('z', shape=(10, 10), dtype='f', overwrite=True)
z[:] = 42

print('done zarr')

#=========================
#xarray, based on useful https://towardsdatascience.com/how-to-create-xarray-datasets-cf1859c95921

# create data
nsize=10
data=np.zeros((nsize,nsize),dtype=np.float32)
data[:,:] = 42.

# create coords
rows = np.arange(nsize)
cols = np.arange(nsize)

# put data into a dataset
ds = xr.Dataset(
    data_vars=dict(
        z=(["x", "y"], data)
    ),
    coords=dict(
        lon=(["x"], rows),
        lat=(["y"], cols),
    ),
    attrs=dict(description="coords with vectors"),
)

#and write
ds.to_zarr('simple_xarray.zarr')
print('done xarray zarr')

Running ncdump 'file://simple_xarray.zarr#mode=nczarr,zarr' produces the entirely incorrect output below (note also the corrupted coordinate values):

netcdf simple_xarray {
dimensions:
	y = 10 ;
	x = 10 ;
variables:
	int64 lat(y) ;
	int64 lon(x) ;
	float z(x, y) ;
		z:coordinates = "lat lon" ;

// global attributes:
		:description = "coords with vectors" ;
data:

 lat = 343734944002, 412316860496, 0, 1, 2, 3, 4, 5, 6, 7 ;

 lon = 343734944002, 412316860496, 0, 1, 2, 3, 4, 5, 6, 7 ;

 z =
  2.080671e-36, 5.605194e-43, 5.605194e-43, 6.305843e-44, 2.802597e-44, 
    2.942727e-44, 9.187894e-41, 3.087947e-38, 39.82812, 1.36231e+10,
  48.5647, 9.24857e-44, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ;
}

The Zarr file is attached. It is correct when opened within Python with xarray. For reference, if I create a netCDF file using xarray.to_netcdf(), I also get the correct result:

netcdf simple_xarray {
dimensions:
	x = 10 ;
	y = 10 ;
variables:
	float z(x, y) ;
		z:_FillValue = NaNf ;
		z:coordinates = "lon lat" ;
	int64 lon(x) ;
	int64 lat(y) ;

// global attributes:
		:description = "coords with vectors" ;
data:

 z =
  42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
  42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
  42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
  42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
  42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
  42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
  42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
  42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
  42, 42, 42, 42, 42, 42, 42, 42, 42, 42,
  42, 42, 42, 42, 42, 42, 42, 42, 42, 42 ;

 lon = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ;

 lat = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ;
}

simple_xarray.zarr.zip

@DennisHeimbigner
Collaborator

That file is compressed with Blosc. So where did you install your plugins, and what do you have the HDF5_PLUGIN_PATH environment variable set to?

@JamiePringle
Author

JamiePringle commented Aug 23, 2022

These are both stock Conda installs; the Blosc filter is what ships with them. There is no HDF5_PLUGIN_PATH specified. I think the Blosc filter comes from the numcodecs Python package, which is version 0.10.2. If you use Conda to create an environment with numpy, xarray, and zarr, you should be able to replicate my test setup. If you have Conda installed, you can do conda create -n numpy xarray zarr netcdf4 to reproduce my environment.

I am happy to poke further if you give me some information on how to do so.
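
One low-effort check (offered here as a sketch, not something from the original report): the compressor actually recorded in the store can be read straight from the Zarr v2 metadata.

import json

# Inspect the per-array metadata that zarr/xarray wrote; with default settings
# this reports a Blosc compressor (cname "lz4"), which is what NCZarr must match.
with open('simple_xarray.zarr/z/.zarray') as f:
    meta = json.load(f)
print(meta['compressor'])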

@DennisHeimbigner
Collaborator

In order to read that file, you will need to find out where the C version of blosc is installed. The Python version in numcodecs will not work with C code.
You might try searching for a file whose name begins with the string "lib__nch5blosc".

@JamiePringle
Author

JamiePringle commented Aug 24, 2022

@DennisHeimbigner, I am afraid I don't understand. I can find where numcodecs installs blosc. The version of c-blosc it uses is 1.21.0, from https://github.com/Blosc/c-blosc/tree/98aad98d0fee2e4acccec78fbb19fb51e1107bf9; I can see where it has been linked into the CPython shared library ./blosc.cpython-310-x86_64-linux-gnu.so.
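
(For reference, a quick way to locate that bundled extension from Python; this is just a sketch, not part of the original report:)

import numcodecs
import numcodecs.blosc

# numcodecs.blosc is the compiled CPython extension module that wraps c-blosc;
# its __file__ attribute gives the path of the shared library on disk.
print(numcodecs.__version__)
print(numcodecs.blosc.__file__)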

I have no problem reading these Zarr files, since I created them. However, I want to make Zarr files available to a larger audience -- in particular biologists who use R and can use the R netCDF package. If I want my work to be widely used, I cannot ask them to build a custom netCDF4 library or switch to Python.

The Zarr files I am producing are the most vanilla kind of Zarr file that can be made, from fresh installs of all the relevant libraries. If netCDF is going to be compatible with Zarr, it should probably support them.

For my own parochial interests: are there compression options within Zarr that would play better with netCDF? Do I have to convert my Zarr files to netCDF/HDF? There are a number of reasons I prefer Zarr, but if I must, I must.

Is it to be expected that netCDF/HDF will support the most generic Zarr as produced by the primary framework that makes Zarr, the Python zarr library?

I don't mean this rhetorically; I need to decide the best way to serve my data.

Jamie

@DennisHeimbigner
Collaborator

I should note that NCZarr is intended to be compatible with the Zarr Version 2 Specification. It is possible that the python zarr implementation differs
from that spec in some details, but I have not observed it.

There is a description of NCZarr compressor (aka filter) support here: https://raw.githubusercontent.com/Unidata/netcdf-c/main/docs/filters.md

Look at the section entitled "# NCZarr Filter Support {#filters_nczarr}"

This attempts to describe the differences between how python handles its compressors and how NCZarr handles its compressors.

@DennisHeimbigner
Collaborator

Is there a contact in the R community with whom I can discuss
the use of compressors in NCZarr in the context of R?

@JamiePringle
Author

JamiePringle commented Aug 24, 2022

The standard in R is the ncdf4 package; the contact information for the R community is given at https://cran.r-project.org/web/packages/ncdf4/index.html

The Zarr I am producing follows the standard Zarr v2 specification as produced by the Python zarr package -- see https://zarr.readthedocs.io/en/stable/release.html

I just want to reiterate that the programs I have given you, and the files I have uploaded, are the most generic Zarr output from the system used to produce the vast majority of the Zarr seen in the wild. Your project has great potential to be a bridge between the Python Zarr community and communities that do not use Python.

It would be a real shame if it did not work with generic Zarr output.

Note that this is, I think, the same issue as #2449 and #2234. I am not sure that netCDF "has Zarr" until this is sorted out.

I am happy to help in any way in my capacity with this.

I will experiment with uncompressed Zarr, but this is not a practical solution for most data sets.
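
For what it's worth, writing the uncompressed variant only requires passing compressor=None; a minimal sketch using the zarr 2.x and xarray APIs (file names here are hypothetical, not from the thread):

import numpy as np
import xarray as xr
import zarr

# Pure zarr: compressor=None writes raw chunks, so no filter plugin is needed to read them.
store = zarr.DirectoryStore('simple_uncompressed.zarr')
root = zarr.group(store=store)
z = root.create('z', shape=(10, 10), dtype='f', compressor=None, overwrite=True)
z[:] = 42

# xarray: the same effect, via per-variable encoding when writing to zarr.
ds = xr.Dataset({'z': (('x', 'y'), np.full((10, 10), 42, dtype=np.float32))})
ds.to_zarr('simple_xarray_uncompressed.zarr', encoding={'z': {'compressor': None}})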

Jamie

@dopplershift
Member

It would be a real shame if it did not work with generic Zarr output.

The intention is for netCDF to support generic Zarr output, so this should readily be fixable.

@DennisHeimbigner It's already been shown here that the Zarr data in question works fine through Python but fails with ncdump, so there's nothing R-specific at play here. My guess is there is a build-time or run-time issue getting the right copy of the blosc filter library linked. We need robust documentation and testing of these features, and we need to work to support these builds in more ecosystems (e.g. conda-forge) so that we can head off community problems like this.

@DennisHeimbigner
Collaborator

We seem to be talking at cross purposes. I can read that file perfectly well with ncdump and with the C-language plugins properly installed.

I must remind JamiePringle that the compressor implementations are language dependent, so NCZarr (C language) cannot use the numcodecs compressors (Python language).

The missing information is how netcdf-c is being installed (conda, PyPI, etc.?).
And secondarily, does the installer properly handle the libraries in the netcdf-c plugins directory? That was the reason for my asking about the R community. Perhaps someone there can enlighten us about this.

@DennisHeimbigner
Collaborator

I just sent a message to the maintainer of ncdf4 to see if he can provide insight into how ncdf4 handles plugins. I will pass along any useful response.

@dopplershift
Member

@JamiePringle Can you share the commands you used to build and install netcdf-c 4.9.0? Also, to be clear, you're not setting any HDF5_PLUGIN_PATH when running ncdump?

@JamiePringle
Author

JamiePringle commented Aug 24, 2022

@dopplershift I am using the ncdump commands provided by MacPorts (4.9.0 on the Mac) and by the Conda installs on both Mac and Linux machines. I am not setting any HDF5_PLUGIN_PATH. I have compiled netCDF before, but I want to test the system that my data consumers are most likely to use.

I am happy to compile netCDF myself -- but it does not really solve my larger problem.

@DennisHeimbigner: you say "I must remind JamiePringle that the compressor implementations are language dependent, so NCZarr (C language) cannot use the numcodecs compressors (Python language)." I am not sure what to do with that. Does it mean that I cannot expect the stock netCDF from Unidata to consume the Zarr produced by the stock Python package that produces Zarr? In what sense is the data format interoperable? Does it mean I can't expect Python-produced Zarr to be readable by netCDF?

Also, if you look above, I linked to the C library that numcodecs uses to do the Blosc compression.

My goal is to produce data that can be easily consumed and distributed by scientists who use R, Python, or whatever else.

@WardF self-assigned this Aug 25, 2022
@WardF added this to the 4.9.1 milestone Aug 25, 2022
@WardF
Member

WardF commented Aug 25, 2022

@DennisHeimbigner Did you install Blosc manually, or are you using something provided by a package manager? I think the main issue we're having here stems from growing pains associated with relatively new, powerful functionality in the core library. As more people start using this functionality, we'll be able to refine our documentation and support procedures.

If I'm understanding the main thrust of the issue here (forgive me for jumping in late; I spent much of the last few days chasing down the Julia issue referenced over in #2486), it boils down to making sure folks who use the netCDF-C library (and software dependent upon it) know that they need compatible compressors, and how to configure them. We've documented HDF5_PLUGIN_PATH and its use, but we are also in the position that libnetcdf might be the first exposure people have to the HDF5 plugin functionality. So our documentation, while thorough, can always be refined based on the feedback we get.

@JamiePringle this is going to be a case where arriving at the solution is harder than actually adopting it (as is so often true). It is complicated by the fact that this is newer functionality, and we do not maintain the installs provided by package managers like MacPorts, Homebrew, etc. As a result, options required for out-of-the-box advanced functionality may not be set.

I'm going to replicate Dennis' success on my own M1 macOS machine once I know where the Blosc install he's using successfully is coming from. I'll document that here, and take a look at our documentation to see how we can highlight this in a more visible manner.

@JamiePringle
Author

@WardF I am not sure if you were asking me, but the netCDF installs I was using are those provided either by Conda or by MacPorts (Conda installs a full set of netCDF libraries and binaries when it installs a package that requires netCDF). In all cases I used stock installs and was careful to check the paths so that I was not mixing and matching libraries and executables. It is worth noting that I get the same results in my Linux and macOS work.

If it would be useful, I can use ldd to figure out what libraries are being linked to.

Thanks for looking into this!
Jamie

@WardF
Member

WardF commented Aug 25, 2022

Hi @JamiePringle, I was primarily asking Dennis, but it would be great if you were able to report the dependencies on your macOS machine. I believe you'll need to use otool -L on macOS instead of ldd.

That said, the compressor libraries are pulled in at runtime and won't necessarily show up as linked libraries; that's what the HDF5_PLUGIN_PATH mechanism/variable is meant to specify. But once I can recreate Dennis' success opening the file on my own machine, I can provide more specific instructions on how to do that in a non-Windows/non-Linux environment :).

@JamiePringle
Author

@WardF Will do this evening. Note that I had the exact same issues on my Ubuntu 20.04 machine, even down to the exact values in the output from ncdump...

@DennisHeimbigner
Collaborator

Also, if you look above, I link to the c-library that numcodecs uses to do the blosc compression.
The key thing to understand is that NCZarr uses the existing HDF5 compression mechanism, which is, unfortunately, a bit complicated. We have attempted to simplify things.
The important thing to note is that HDF5 and NCZarr do not invoke a compressor such as c-blosc directly. Rather, the compression code is wrapped in code that translates the HDF5/NCZarr API to the c-blosc compressor code. If NCZarr (and HDF5) cannot find those wrappers, then it cannot apply compression to the dataset.
Apparently the conda netcdf-c packager does not (yet) install these wrappers, so you will need to take some extra steps to do that installation. The first step is to find those wrappers. If you can find the build directory where libnetcdf was built, there should be a directory there called "plugins". See if you can find it and send us a listing of its contents.

@JamiePringle
Author

@DennisHeimbigner @WardF @dopplershift I would be happy to post to the zarr developers, or to track down the maintainer of the zarr conda package, to nudge them towards doing what needs to be done. The existing packages do read their own Blosc-compressed data (it is their default).

These installs do not include a plugins directory for netCDF/HDF, as far as I can tell.

Or is there a helpful way I could be a matchmaker between your two groups? My ultimate goal is that the netCDF obtained from the common package managers reads the Zarr produced by xarray and zarr as installed from the most common package managers. Both the zarr and xarray groups suggest Conda as the best way for regular users to install their packages.

Jamie

@DennisHeimbigner
Collaborator

Thanks. If you can figure out to whom I should talk, please tell me.
[Ward, would it be possible to tar up a Mac copy of the plugins and send it to Jamie so that he can move forward?]

@dopplershift
Member

@DennisHeimbigner Well, in regard to fixing up conda, we can start by fixing the build of 4.9.0. See this open PR that you're tagged on: conda-forge/libnetcdf-feedstock#140

@DennisHeimbigner
Collaborator

This should have been fixed with PR #2448, which apparently has been merged to master.
Are you in a position to verify the fix?

@WardF
Member

WardF commented Aug 26, 2022

@DennisHeimbigner I'll be happy to, but can you let me know if you installed Blosc from a package manager of some sort, or if you compiled it from source? Thanks :)

@dopplershift
Member

@DennisHeimbigner I'm working to verify, but at some point someone else needs to learn the conda-forge stuff so that I'm not in the critical path. It's already critical to so much of our community, and it may become even more important for reliably deploying these complicated plugin-based setups.

@DennisHeimbigner
Collaborator

As I recall, I installed c-blosc on Ubuntu 21 using apt to get the package libblosc-dev.

@WardF
Member

WardF commented Sep 1, 2022

Just an update on this: using the libblosc.1.12.1.dylib provided by Homebrew, I get an EXC_BAD_ACCESS error (see below).

(lldb) run "file://simple.zarr#mode=nczarr,file"
Process 84492 launched: '/Users/wfisher/environments/local-ncmain-1.12.2-plugins/bin/ncdump' (arm64)
warning: (arm64) /opt/homebrew/Cellar/libxau/1.0.10/lib/libXau.6.dylib address 0x0000000100288000 maps to more than one section: libXau.6.dylib.__DATA and libtheoraenc.1.dylib.__TEXT
Process 84492 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x104b6f850)
    frame #0: 0x000000019fcfac48 libsystem_malloc.dylib`free + 128
libsystem_malloc.dylib`free:
->  0x19fcfac48 <+128>: ldr    x8, [x20, #0x10]
    0x19fcfac4c <+132>: mov    x0, x20
    0x19fcfac50 <+136>: mov    x1, x19
    0x19fcfac54 <+140>: mov    x17, #0x91a2
Target 0: (ncdump) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x104b6f850)
  * frame #0: 0x000000019fcfac48 libsystem_malloc.dylib`free + 128
    frame #1: 0x00000001004c1408 libnetcdf.19.dylib`ncpsharedlibfree + 108
    frame #2: 0x00000001004c0f14 libnetcdf.19.dylib`NCZ_load_plugin + 1224
    frame #3: 0x00000001004c0758 libnetcdf.19.dylib`NCZ_load_plugin_dir + 496
    frame #4: 0x00000001004be204 libnetcdf.19.dylib`NCZ_load_all_plugins + 632
    frame #5: 0x00000001004bd8f8 libnetcdf.19.dylib`NCZ_filter_initialize + 116
    frame #6: 0x00000001004afe84 libnetcdf.19.dylib`define_vars + 4164
    frame #7: 0x00000001004ac0a8 libnetcdf.19.dylib`define_grp + 748
    frame #8: 0x00000001004abd4c libnetcdf.19.dylib`ncz_read_file + 84
    frame #9: 0x00000001004a9920 libnetcdf.19.dylib`ncz_open_file + 544
    frame #10: 0x00000001004a96a8 libnetcdf.19.dylib`NCZ_open + 364
    frame #11: 0x00000001003aa840 libnetcdf.19.dylib`NC_open + 1312
    frame #12: 0x00000001003aa314 libnetcdf.19.dylib`nc_open + 56
    frame #13: 0x0000000100002be4 ncdump`main + 1600
    frame #14: 0x000000010002d08c dyld`start + 520
(lldb) 

Using a version of libblosc.21.1.dylib compiled manually, I get an "undefined filter" error.

@WardF
Member

WardF commented Sep 1, 2022

(Note: the results in my previous message were generated with the current developer snapshot of netcdf-c. I'm also able to replicate @JamiePringle's results with v4.8.1, as reported at the top of this thread.)

These are both stock Conda installs; the Blosc filter is what ships with them. There is no HDF5_PLUGIN_PATH specified. I think the Blosc filter comes from the numcodecs Python package, which is version 0.10.2. If you use Conda to create an environment with numpy, xarray, and zarr, you should be able to replicate my test setup. If you have Conda installed, you can do conda create -n numpy xarray zarr netcdf4 to reproduce my environment.

I am happy to poke further if you give me some information how.

(quoting for my own reference)

@WardF
Member

WardF commented Sep 1, 2022

@DennisHeimbigner Can you talk more about how you were able to read the file on Ubuntu? It seems pretty straightforward: set HDF5_PLUGIN_PATH to the appropriate location after installing libblosc-dev via the Ubuntu package manager, then run ncdump "file://simple.zarr#mode=nczarr,file". But I am not having much luck; on my Ubuntu machine I follow these steps and get a core dump when invoking ncdump. I'm sure I'm overlooking something simple, but it's probably quicker for me to clarify with you. Thanks!

@DennisHeimbigner
Collaborator

What is the "appropriate location" you used for HDF5_PLUGIN_PATH?

@WardF
Member

WardF commented Sep 1, 2022

@DennisHeimbigner On my system? /usr/lib/aarch64-linux-gnu/

@DennisHeimbigner
Collaborator

Remember, you need to point to the directory that contains the file lib__nch5blosc.so.

@WardF
Member

WardF commented Sep 1, 2022

That file was not installed as part of libblosc-dev; I knew this would be a straightforward issue, thanks @DennisHeimbigner. How is the file lib__nch5blosc.so created?

@DennisHeimbigner
Collaborator

As a reminder, note that HDF5 and NCZarr do not invoke a compressor such as libblosc.21.1.dylib directly. Rather, the compression code is wrapped in code that translates the HDF5/NCZarr API to the libblosc.21.1.dylib compressor code. If NCZarr (and HDF5) cannot find those wrappers, then it cannot apply compression to the dataset.
I assume that the --enable-filters option is enabled. Then these wrappers are created in the netcdf-c/plugins directory, so you can point HDF5_PLUGIN_PATH to that directory if you want. If you want to install them in some other place, use the --with-plugin-dir="dir" option to install them in directory "dir". You can alternatively specify --with-plugin-dir=yes and the wrappers will be installed in a standard location, in which case you do not need to set HDF5_PLUGIN_PATH.

@WardF
Member

WardF commented Sep 6, 2022

Thanks, Dennis, I've gotten everything working. Let me think about this.

@WardF
Member

WardF commented Sep 7, 2022

So @JamiePringle, I'll be updating our documentation to clarify a few things, but essentially, what's going on here is this:

The libnetcdf.so library cannot talk to the blosc library directly; it requires an "interface" library, which acts as a go-between. This interface library is built as part of the netCDF build if blosc is detected at configure/build time.

From scratch, the steps to get this to work (on my system) were as follows; they assume libhdf5 was installed (although that is not strictly necessary).

  1. Install blosc, blosc development headers.
  2. Configure netCDF with --enable-plugins and --with-plugin-dir=$HOME/netcdf-plugins
  3. Ensure blosc is specified in the generated libnetcdf.settings file.
  4. Run make, make install.

Once built and installed, I set the environment variable HDF5_PLUGIN_PATH=$HOME/netcdf-plugins. Once this is done, I can run ncdump and access the files.

The reason this works is because:

  1. NetCDF builds the interface library.
  2. ncdump knows where to find the interface library because HDF5_PLUGIN_PATH is set.
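
For Python users following along, here is a hypothetical end-to-end check of the same setup via netCDF4-python, assuming the plugins were installed to $HOME/netcdf-plugins as above and that the underlying libnetcdf has NCZarr enabled; the plugin path must be set before the file is opened:

import os

# Point HDF5/NCZarr at the directory containing the lib__nch5blosc wrapper.
os.environ['HDF5_PLUGIN_PATH'] = os.path.expanduser('~/netcdf-plugins')

import netCDF4

ds = netCDF4.Dataset('file://simple.zarr#mode=nczarr,zarr')
print(ds['z'][:])  # should now print the array of 42s
ds.close()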

@WardF
Member

WardF commented Sep 7, 2022

I'll also be updating nc-config to have an option to report which plugins are supported.

@JamiePringle
Author

JamiePringle commented Sep 8, 2022 via email

@JamiePringle
Author

Thank you

@WardF
Member

WardF commented Nov 9, 2022

@JamiePringle No problem; to answer your question posed above, we are involved in the conda feedstock pipeline, but I am not certain whom I'd have to speak with for CRAN and the like. Clarifying the documentation is a great idea and a great start, however. Thanks again!
