Segfault in H5FD__free_cls with netcdf 4.9.1 #2617

Closed
opoplawski opened this issue Feb 13, 2023 · 24 comments · Fixed by #2827

@opoplawski
Contributor

I'm testing updating Fedora to netcdf 4.9.1 and I'm seeing a new failure when running the tests for the octave-netcdf package. This might be tricky to track down, but I'm not seeing the failure with the current netcdf 4.9.0 package. HDF5 is remaining constant at 1.12.1.

The segfault occurs when exiting:

(gdb) bt
#0  0x00007fffcee93eb0 in ?? ()
#1  0x00007ffff5522b63 in H5FD__free_cls (cls=0x555555f09810) at ../../src/H5FD.c:188
#2  0x00007ffff557e57a in H5I__mark_node (key=0x0, _udata=<synthetic pointer>, _info=0x5555555a6c60) at ../../src/H5Iint.c:393
#3  H5I_clear_type (type=<optimized out>, force=false, app_ref=<optimized out>) at ../../src/H5Iint.c:339
#4  0x00007ffff5522af8 in H5FD_term_package () at ../../src/H5FD.c:147
#5  0x00007ffff5466e54 in H5_term_library () at ../../src/H5.c:377
#6  0x00007ffff546771d in H5_term_library () at ../../src/H5.c:460
#7  0x00007ffff546772d in H5close () at ../../src/H5.c:989
#8  0x00007ffff79077dd in octave::load_save_system::~load_save_system (this=<optimized out>, this=<optimized out>) at libinterp/corefcn/load-save.cc:274
#9  0x00007ffff78f75b5 in octave::interpreter::~interpreter (this=<optimized out>, this=<optimized out>) at libinterp/corefcn/interpreter.cc:661
#10 0x00007ffff703ae47 in std::default_delete<octave::interpreter>::operator() (this=<optimized out>, __ptr=0x5555555c62f0) at /usr/include/c++/12/bits/unique_ptr.h:95
#11 std::default_delete<octave::interpreter>::operator() (__ptr=0x5555555c62f0, this=<optimized out>) at /usr/include/c++/12/bits/unique_ptr.h:89
#12 std::unique_ptr<octave::interpreter, std::default_delete<octave::interpreter> >::~unique_ptr (this=<optimized out>, this=<optimized out>) at /usr/include/c++/12/bits/unique_ptr.h:396
#13 octave::application::~application (this=<optimized out>, this=<optimized out>) at libinterp/octave.cc:296
#14 0x0000555555556590 in octave::cli_application::~cli_application (this=<optimized out>, this=<optimized out>) at libinterp/octave.h:377
#15 main (argc=<optimized out>, argv=<optimized out>) at src/main-cli.cc:122
(gdb) up
#1  0x00007ffff5522b63 in H5FD__free_cls (cls=0x555555f09810) at ../../src/H5FD.c:188
188         if (cls->terminate && cls->terminate() < 0)
(gdb) list
183
184         /* If the file driver has a terminate callback, call it to give the file
185          * driver a chance to free singletons or other resources which will become
186          * invalid once the class structure is freed.
187          */
188         if (cls->terminate && cls->terminate() < 0)
189             HGOTO_ERROR(H5E_VFL, H5E_CANTCLOSEOBJ, FAIL, "virtual file driver '%s' did not terminate cleanly",
190                         cls->name)
191
192         H5MM_xfree(cls);
(gdb) print cls
$1 = (H5FD_class_t *) 0x555555f09810
(gdb) print *cls
$2 = {name = 0x7fffcef4c0d0 <error: Cannot access memory at address 0x7fffcef4c0d0>, maxaddr = 9223372036854775807, fc_degree = H5F_CLOSE_WEAK, terminate = 0x7fffcee93eb0, sb_size = 0x0, sb_encode = 0x0, sb_decode = 0x0, 
  fapl_size = 0, fapl_get = 0x0, fapl_copy = 0x0, fapl_free = 0x0, dxpl_size = 0, dxpl_copy = 0x0, dxpl_free = 0x0, open = 0x7fffcee97280, close = 0x7fffcee97230, cmp = 0x7fffcee94170, query = 0x7fffcee93ed0, get_type_map = 0x0, 
  alloc = 0x7fffcee93f40, free = 0x0, get_eoa = 0x7fffcee93f70, set_eoa = 0x7fffcee93fa0, get_eof = 0x7fffcee93fd0, get_handle = 0x7fffcee94060, read = 0x7fffcee97610, write = 0x7fffcee940f0, flush = 0x7fffcee94000, truncate = 0x0, 
  lock = 0x7fffcee94020, unlock = 0x7fffcee94040, fl_map = {H5FD_MEM_SUPER, H5FD_MEM_SUPER, H5FD_MEM_SUPER, H5FD_MEM_DRAW, H5FD_MEM_DRAW, H5FD_MEM_SUPER, H5FD_MEM_SUPER}}
(gdb) print *cls->terminate
Cannot access memory at address 0x7fffcee93eb0
(gdb) print cls->terminate
$3 = (herr_t (*)(void)) 0x7fffcee93eb0

(gdb) up
#3  H5I_clear_type (type=<optimized out>, force=false, app_ref=<optimized out>) at ../../src/H5Iint.c:339
339                 if (H5I__mark_node((void *)item, NULL, (void *)&udata) < 0)
(gdb) print *item
$6 = {id = 576460752303423489, count = 1, app_count = 1, object = 0x555555e33f50, marked = false, hh = {tbl = 0x55555558ebf0, prev = 0x55555558eb90, next = 0x0, hh_prev = 0x0, hh_next = 0x0, key = 0x5555555a6c60, keylen = 8, 
    hashv = 3217010591}}

valgrind:

==166== Jump to the invalid address stated on the next line
==166==    at 0x32D48EB0: ???
==166==    by 0x6D5F579: UnknownInlinedFun (H5Iint.c:393)
==166==    by 0x6D5F579: H5I_clear_type (H5Iint.c:339)
==166==    by 0x6D03AF7: H5FD_term_package (H5FD.c:147)
==166==    by 0x6C47E53: H5_term_library.part.0 (H5.c:377)
==166==    by 0x6C4872C: H5close (H5.c:989)
==166==    by 0x55597DC: octave::load_save_system::~load_save_system() (load-save.cc:274)
==166==    by 0x55495B4: octave::interpreter::~interpreter() (interpreter.cc:661)
==166==    by 0x4C8CE46: UnknownInlinedFun (unique_ptr.h:95)
==166==    by 0x4C8CE46: UnknownInlinedFun (unique_ptr.h:89)
==166==    by 0x4C8CE46: UnknownInlinedFun (unique_ptr.h:396)
==166==    by 0x4C8CE46: octave::application::~application() (octave.cc:296)
==166==    by 0x10A58F: UnknownInlinedFun (octave.h:377)
==166==    by 0x10A58F: main (main-cli.cc:122)
==166==  Address 0x32d48eb0 is not stack'd, malloc'd or (recently) free'd
==166== 
fatal: caught signal Segmentation fault -- stopping myself...
==166== 
==166== Process terminating with default action of signal 11 (SIGSEGV)
==166==    at 0x6A8EB94: __pthread_kill_implementation (in /usr/lib64/libc.so.6)
==166==    by 0x6A3DAED: raise (in /usr/lib64/libc.so.6)
==166==    by 0x6A3DB9F: ??? (in /usr/lib64/libc.so.6)
==166==    by 0x32D48EAF: ???
==166==    by 0x6D5F579: UnknownInlinedFun (H5Iint.c:393)
==166==    by 0x6D5F579: H5I_clear_type (H5Iint.c:339)
==166==    by 0x6D03AF7: H5FD_term_package (H5FD.c:147)
==166==    by 0x6C47E53: H5_term_library.part.0 (H5.c:377)
==166==    by 0x6C4872C: H5close (H5.c:989)
==166==    by 0x55597DC: octave::load_save_system::~load_save_system() (load-save.cc:274)
==166==    by 0x55495B4: octave::interpreter::~interpreter() (interpreter.cc:661)
==166==    by 0x4C8CE46: UnknownInlinedFun (unique_ptr.h:95)
==166==    by 0x4C8CE46: UnknownInlinedFun (unique_ptr.h:89)
==166==    by 0x4C8CE46: UnknownInlinedFun (unique_ptr.h:396)
==166==    by 0x4C8CE46: octave::application::~application() (octave.cc:296)
==166==    by 0x10A58F: UnknownInlinedFun (octave.h:377)
==166==    by 0x10A58F: main (main-cli.cc:122)

There are quite a lot of iterations through the loop in H5I_clear_type before the segfault. I have no idea what other information would be useful for tracking this down.

@edwardhartnett
Contributor

You should try with the newly released hdf5-1.14.0. It fixes a lot of bugs; perhaps this is one of them.

@WardF
Member

WardF commented Feb 13, 2023

Interesting; is this happening on particular hardware? I'm trying to get our big-endian machine back online to try to track down the other issue you've reported, @opoplawski, (#1338) and if this is also on specific hardware, let me know. Thanks!

@opoplawski
Contributor Author

Seems to happen on all - and definitely on x86_64.

@WardF
Member

WardF commented Feb 13, 2023

Thanks! That will make the issue easier to track down. Looking at the traces you provided, I'm trying to pin down at what point a function in libnetcdf is being called. I'm grabbing the latest octave release; I assume this test is failing as part of the regular suite of tests, so I will see if I can replicate it locally.

@WardF WardF self-assigned this Feb 13, 2023
@WardF WardF added this to the 4.9.2 milestone Feb 13, 2023
@opoplawski
Contributor Author

Presumably netcdf calls are made during the tests, but at the point of the crash I think we're just closing down the HDF5 library.

I'm afraid this is multiple levels deep - it's not the tests from octave, but from the octave-netcdf package https://gnu-octave.github.io/packages/netcdf/

Later I'll try to see if it's sensitive to updating hdf5.

@edwardhartnett
Contributor

Could it be that both netcdf-c and octave are shutting down the HDF5 library? The netcdf-c one succeeds, and then the octave attempt fails?

@opoplawski
Contributor Author

I set a breakpoint on H5close(), but I only see it called once, leading to the segfault.

@opoplawski opoplawski changed the title Segfault in H5FD__free_cls with netcdf 4.91. Segfault in H5FD__free_cls with netcdf 4.9.1 Feb 14, 2023
@opoplawski
Contributor Author

I've at least stripped it down to a simple reproducer in octave with the netcdf package installed:

octave -H -q --no-window-system --no-site-file --eval "pkg('load','netcdf');import_netcdf;ncid = netcdf.create('test-netcdf.nc','NC_CLOBBER'); netcdf.close(ncid);"

So it really does seem to be something to do with library tear down in the octave environment as we aren't doing much else. It also crashed without the netcdf.close(ncid) call, but then you don't get a valid netcdf file. Since octave is built with HDF5 support, it calls H5close() on exit. But it still seems like the HDF5 state must be getting corrupted somehow for it to segfault.

I've also just discovered that I've still been building netcdf with -DH5_USE_110_API, though I suspect that probably hasn't been needed for ages. I'm doing some rebuilds without that and with hdf5 1.12.2 and 1.14.0 and will report those results when done. octave isn't built with any H5_*_API defines, so it would be good to make that consistent.

@DennisHeimbigner
Collaborator

Any chance of running your program with valgrind?

@opoplawski
Contributor Author

The valgrind output is the same as before - see the initial comment.

@opoplawski
Contributor Author

Dropping -DH5_USE_110_API didn't have any effect.

@edwardhartnett
Contributor

Some time ago I changed the netcdf-c code so that it does not depend on H5_USE_110_API or any other macro redefinition schemes that they use at HDF5. Simpler to just call the desired functions directly, and not worry about the redefines of their APIs (a misguided approach, IMO).

Is there a way of telling if the HDF5 library has been closed down?

@opoplawski
Contributor Author

Still seeing this with netcdf 4.9.2

@WardF
Member

WardF commented Apr 17, 2023

Investigating. Thanks!

@WardF WardF modified the milestones: 4.9.2, 4.9.3 May 16, 2023
@dasergatskov

The same (or very similar?) bug has just been reported for Debian (and reproduced on Ubuntu 23.10), which are using older netcdf and hdf libraries:
https://savannah.gnu.org/bugs/?64999

On my Ubuntu 23.10 (when compiled with -Og -ggdb3):

Thread 1 "octave-gui" received signal SIGSEGV, Segmentation fault.
0x00007fffe01301b0 in ?? ()
(gdb) thread apply all bt
<...deleted...>
Thread 1 (Thread 0x7fffeb8bf880 (LWP 343770) "octave-gui"):
#0  0x00007fffe01301b0 in ??? ()
#1  0x00007ffff38fefec in H5FD__free_cls (cls=0x7fffc8119a30) at ../../../src/H5FD.c:188
#2  0x00007ffff395c288 in H5I__mark_node (key=0x0, _udata=<synthetic pointer>, _info=0x5555555c0550) at ../../../src/H5Iint.c:340
#3  H5I_clear_type (type=<optimized out>, force=false, app_ref=<optimized out>) at ../../../src/H5Iint.c:286
#4  0x00007ffff38fef90 in H5FD_term_package () at ../../../src/H5FD.c:147
#5  0x00007ffff3845f1c in H5_term_library () at ../../../src/H5.c:338
#6  0x00007ffff38463e1 in H5close () at ../../../src/H5.c:1006
#7  0x00007ffff75182e1 in octave::load_save_system::~load_save_system (this=this@entry=0x7fffc80022c0, __in_chrg=<optimized out>) at ../libinterp/corefcn/load-save.cc:271
#8  0x00007ffff74f4b9c in octave::interpreter::~interpreter (this=this@entry=0x7fffc8001470, __in_chrg=<optimized out>) at ../libinterp/corefcn/interpreter.cc:635
#9  0x00007ffff6d6fccb in std::default_delete<octave::interpreter>::operator() (this=this@entry=0x7fffffffddf8, __ptr=0x7fffc8001470) at /usr/include/c++/13/bits/unique_ptr.h:99
#10 0x00007ffff6d6fcf1 in std::unique_ptr<octave::interpreter, std::default_delete<octave::interpreter> >::~unique_ptr (this=0x7fffffffddf8, __in_chrg=<optimized out>) at /usr/include/c++/13/tuple:125
#11 0x00007ffff6d6d34c in octave::application::~application (this=this@entry=0x7fffffffdbf0, __in_chrg=<optimized out>) at ../libinterp/octave.cc:298
#12 0x00005555555566fa in octave::qt_application::~qt_application (this=0x7fffffffdbf0, __in_chrg=<optimized out>) at ../libgui/src/qt-application.h:61
#13 main (argc=1, argv=<optimized out>) at ../src/main-gui.cc:164
(gdb)

This is with
libnetcdf-dev/mantic,now 1:4.9.2-2ubuntu1 amd64
libhdf5-103-1/mantic,now 1.10.8+repack1-1ubuntu1 amd64

Dmitri.

@lostbard
Contributor

Not sure how much help it is, but I notice that netcdf built with -DENABLE_BYTERANGE=OFF does not seem to have the issue.

@DennisHeimbigner
Collaborator

Is there a test program that produces this error but does not use octave?

@lostbard
Contributor

lostbard commented Dec 12, 2023

The issue is that octave loads and unloads HDF5, but when netcdf is unloaded (before octave's HDF5 unload) it has not unregistered the new class it registered, and that class is then no longer in memory.

Adding a finalize call for the HTTP class fixes the crash.
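
A minimal sketch of that failure mode, assuming a toy registry rather than the real HDF5 or netcdf code. Every name here (driver_class, registry_register, plugin_finalize, run_scenario, and so on) is hypothetical, and the owner_loaded flag only simulates a shared library being unmapped so the example stays safe to run. A stale entry at close time stands in for the jump to an invalid address seen in the backtrace.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DRIVERS 8

/* Hypothetical stand-in for a driver class; only the terminate callback matters here. */
typedef struct {
    const char *name;
    int (*terminate)(void);
    const int *owner_loaded; /* 1 while the library that owns terminate() is still mapped */
} driver_class;

/* Toy registry owned by the long-lived library (the role HDF5 plays in this thread). */
static const driver_class *registry[MAX_DRIVERS];

static int registry_register(const driver_class *cls) {
    for (int i = 0; i < MAX_DRIVERS; i++) {
        if (registry[i] == NULL) {
            registry[i] = cls;
            return i;
        }
    }
    return -1; /* registry full */
}

static void registry_unregister(int id) {
    assert(id >= 0 && id < MAX_DRIVERS);
    registry[id] = NULL;
}

/* Tear down the registry. Returns how many entries pointed into an unloaded
 * owner; in a real process each of those would be a call through a dangling
 * function pointer. */
static int registry_close(void) {
    int stale = 0;
    for (int i = 0; i < MAX_DRIVERS; i++) {
        const driver_class *cls = registry[i];
        if (cls == NULL)
            continue;
        if (!*cls->owner_loaded)
            stale++;                 /* simulated crash site */
        else if (cls->terminate)
            (void)cls->terminate();  /* safe: owner is still loaded */
        registry[i] = NULL;
    }
    return stale;
}

/* Hypothetical plugin (the role of the library that registers a driver). */
static int plugin_loaded = 0;
static int plugin_id = -1;

static int plugin_terminate(void) { return 0; }

static const driver_class plugin_cls = {"sketch-driver", plugin_terminate, &plugin_loaded};

static void plugin_init(void) {
    plugin_loaded = 1;
    plugin_id = registry_register(&plugin_cls);
}

/* The fix under discussion: unregister whatever was registered. */
static void plugin_finalize(void) {
    if (plugin_id >= 0) {
        registry_unregister(plugin_id);
        plugin_id = -1;
    }
}

static void plugin_unload(int call_finalize) {
    if (call_finalize)
        plugin_finalize();
    plugin_loaded = 0; /* simulates the plugin's code being unmapped */
}

/* Load the plugin, unload it, then close the registry; returns stale entries. */
static int run_scenario(int call_finalize) {
    plugin_init();
    plugin_unload(call_finalize);
    return registry_close();
}
```

In this sketch, run_scenario(0) leaves one stale entry behind at close time, while run_scenario(1) leaves none, because the finalize step removes the entry before the owner goes away.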

@DennisHeimbigner
Collaborator

Something about this bothers me. It seems to me the problem is caused by octave, and it should
be their responsibility to unload hdf5 and netcdf in the correct order.

@DennisHeimbigner
Collaborator

Also, I have a feeling that this is going to be an ongoing issue because there is at least one other
H5FD plugin (H5FDcore). Also, what about filter plugins?

@lostbard
Contributor

lostbard commented Dec 13, 2023

> Something about this bothers me. It seems to me the problem is caused by octave, and it should be their responsibility to unload hdf5 and netcdf in the correct order.

I can make octave unload netcdf before closing; however, since netcdf isn't cleaning up after itself on unload (i.e. unregistering the things it registered), the issue isn't fixed. When octave closes hdf5 later on, hdf5 still has references to the netcdf class, which is no longer in memory.

@lostbard
Contributor

H5FDcore is loaded within the hdf5 library, isn't it? It is therefore owned by hdf5, so it will be there for as long as the hdf5 library is loaded.

@lostbard
Contributor

lostbard commented Dec 14, 2023

For hdf5filter.c, it looks like the code already makes an unregister call within nc4_global_filter_action.

@opoplawski
Contributor Author

Thank you for fixing this! Looks good here.
