Investigate where netcdf-c breaks thread safety #43
Comments
Constraints from Rust should also be taken into account when investigating (multiple readers, no writers), which could allow relaxing of the locking.
There is currently no locking (in A newer version of
Caching of variables and attributes requires a separate lock when reading/writing variables/groups.
Continuing from #42, which seems to overlap with this one. HDF5 does support single-writer / multiple-readers (e.g. http://docs.h5py.org/en/stable/swmr.html), so maybe this is something that could be supported for newer HDF5-based netcdf files. I saw somewhere that using hdf5 to write to a netcdf file, and especially to add new variables, is likely to make the file unreadable by netcdf. But reading should be fine.
Multiple readers should be supported out of the box, but some initialisation might be required on first access of a variable. I think we can have a lock surrounding opening/closing of files and smaller locks on variable access.
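A minimal sketch of that layout, assuming a made-up `File` wrapper: one global lock around open/close and a smaller per-file lock around variable access. The type and method names are illustrative, not the crate's actual API, and the netcdf-c calls are left as comments.

```rust
use std::sync::Mutex;

// Global lock held while opening/closing, since netcdf-c touches shared
// global state there (const Mutex::new in a static needs Rust 1.63+).
static OPEN_CLOSE_LOCK: Mutex<()> = Mutex::new(());

pub struct File {
    ncid: i32,         // handle that netcdf-c's nc_open would return
    access: Mutex<()>, // smaller lock guarding variable access within this file
}

impl File {
    pub fn open(path: &str) -> File {
        let _guard = OPEN_CLOSE_LOCK.lock().unwrap();
        // ... call nc_open(path, ...) here and keep the returned ncid ...
        let _ = path;
        File { ncid: 0, access: Mutex::new(()) }
    }

    pub fn read_variable(&self, name: &str) -> Vec<f64> {
        let _guard = self.access.lock().unwrap();
        // ... call nc_inq_varid / nc_get_var_double here ...
        let _ = name;
        Vec::new()
    }
}
```

With this split, two threads working on different files would only contend on open/close, not on every read.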
@gauteh Do you have a suitable benchmark we could use for testing parallel performance?
I can try to make something self-sufficient, but I don't think I will manage it until after New Year. Are we talking about concurrent access from threads or from processes? I am a little bit confused about the difference between this issue and the parallel read issue.
And to the same file struct? Or one struct per thread to the same netcdf file?
> Are we talking about concurrent access from threads or from processes?

I believe we are mostly interested in threads.

> the difference between this issue and the parallel read issue

The parallel access requires an MPI communicator, and parallel instances of the same program. This is not exposed at all in the current form of the library.
Ok, I don't have anything using MPI at the moment, but I can create some benchmarks for multi-thread concurrency.
I am not really sure whether
Tried making a small example now using the latest master. I'm not able to pass a
I also noticed you are working on some big changes in #51, which is probably good to get in before resolving this?
Yeah, I've pretty much entirely redone how this crate parses the netcdf file, which will remove the
Great!
Some observations from using a non-threadsafe HDF5 library: segfaults and aborts, even when working on unrelated netCDF files. HDF5 should be considered highly unsafe for multithreading.
I guess that settles the discussion on HDF5. I asked the developers of Hyrax (an official OPeNDAP server) about their HDF5 interface. It seems to be more thread-safe, but I am not sure if this refers to the HDF5 reading library or possibly an interface in between. If HDF5 can somehow be made thread-safe with respect to at least reading, that would put the performance much higher than netcdf libraries in other languages (depending on the use case, of course).
Maybe using https://github.com/aldanor/hdf5-rust could help? Note that I haven't used that crate, but the readme says it "provides thread-safe Rust bindings and high-level wrappers for the HDF5 library API."
It also relies on the official HDF5 library, but provides thread safety through a global lock. This means that a single process can only use the (unsafe) HDF5 functions sequentially.
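Roughly, the pattern is the one sketched below: every call into the C library funnels through one process-wide mutex (the names are illustrative, not hdf5-rust's internals), so two threads reading two unrelated files still run one at a time.

```rust
use std::sync::Mutex;

// One process-wide lock; all unsafe HDF5 / netcdf-c calls go through it.
static LIBRARY_LOCK: Mutex<()> = Mutex::new(());

fn with_library_lock<T>(f: impl FnOnce() -> T) -> T {
    let _guard = LIBRARY_LOCK.lock().unwrap();
    f()
}

fn main() {
    // Even these two independent "reads" cannot overlap in time.
    let a = with_library_lock(|| 1);
    let b = with_library_lock(|| 2);
    println!("{}", a + b);
}
```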
We have https://crates.io/crates/hdf5file, but quite a lot of work is necessary to get this up to spec, and to gain performance relative to a linked thread-safe
For CDF-1, 2 and 5 we could realistically create a safe dispatcher; I already have a CDF parser written in
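A rough sketch of what such a dispatcher could look like: classic CDF-1/2/5 files (which start with the magic bytes "CDF") go to a pure-Rust parser, and everything else falls back to the C library behind its lock. The types here are placeholders, not an existing implementation.

```rust
enum Backend {
    Classic(ClassicReader), // pure-Rust CDF-1/2/5 parser, freely shareable between threads
    Hdf5(LockedCFile),      // netCDF-4/HDF5 file, routed through the (locked) C library
}

struct ClassicReader; // placeholder for the hand-written CDF parser
struct LockedCFile;   // placeholder for a C-library handle plus its lock

fn open(path: &str) -> std::io::Result<Backend> {
    // Classic files begin with "CDF" followed by a version byte (1, 2 or 5);
    // netCDF-4 files carry the HDF5 signature instead. Reading the whole file
    // just to inspect the magic bytes is only for brevity here.
    let bytes = std::fs::read(path)?;
    Ok(if bytes.starts_with(b"CDF") {
        Backend::Classic(ClassicReader)
    } else {
        Backend::Hdf5(LockedCFile)
    })
}

fn main() {
    match open("test.nc") { // hypothetical file name
        Ok(Backend::Classic(_)) => println!("classic format, pure-Rust path"),
        Ok(Backend::Hdf5(_)) => println!("HDF5-based, C-library path"),
        Err(e) => println!("could not read file: {}", e),
    }
}
```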
Just to demonstrate the crash in netcdf (I have yet to reproduce this in raw hdf): https://gist.github.com/magnusuMET/28a7991db0fcb5392b56573837aa7289
The global mutex is in no way ideal when reading/writing from multiple threads; it could be split into several locks (global/per-file), or replaced by RwLocks (see the sketch below).
This requires investigating where netcdf does something thread-unsafe, and limiting the locking to that part. We should also investigate where HDF5 might be problematic.
One could also integrate bindings to https://github.com/Parallel-NetCDF/PnetCDF, but this limits the supported formats to CLASSIC.
Unidata/netcdf-c#1373
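To illustrate the "split into several locks or RwLocks" idea from the description above: a small global lock could remain for open/close, while each open file carries its own RwLock so that readers of the same file proceed concurrently and writers get exclusive access. This is only a sketch of the proposal (and it assumes concurrent read-only netcdf-c calls are actually safe, which the comments above put in doubt).

```rust
use std::sync::RwLock;

struct SharedFile {
    ncid: i32,        // handle from the C library
    lock: RwLock<()>, // per-file lock: many readers or one writer
}

impl SharedFile {
    fn read(&self, _var: &str) {
        let _guard = self.lock.read().unwrap(); // readers do not block each other
        // ... read-only netcdf-c calls would go here ...
    }

    fn write(&self, _var: &str) {
        let _guard = self.lock.write().unwrap(); // writers are exclusive
        // ... mutating netcdf-c calls would go here ...
    }
}

fn main() {
    let f = SharedFile { ncid: 0, lock: RwLock::new(()) };
    f.read("temperature"); // hypothetical variable name
    f.write("temperature");
    let _ = f.ncid;
}
```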