
memory leak in hdf5 writer (?) #385

Closed
xiaoqiangwang opened this issue Mar 6, 2019 · 37 comments

@xiaoqiangwang
Contributor

The writer is in Single mode, without filters, no lazy open and no SWMR. I see the memory usage ramp up constantly while writing HDF5 files, in steps of ~10-60 KB.

I am not sure whether this leak comes from libhdf5 or NDFileHDF5.

@MarkRivers
Member

Do you see a similar leak if you test with the NDFileNetCDF or TIFF drivers? Can you send a screen shot of your detector and HDF5 screens so we can try to reproduce the problem?

@xiaoqiangwang
Contributor Author

xiaoqiangwang commented Mar 6, 2019

The memory usage stays constant for NDFileNetCDF and NDFileTIFF. The data comes from a simDetector IOC running at 1Hz.

Here is the NDFileHDF5 panel when it runs on Windows.
[screenshot: NDFileHDF5 plugin screen]

@MarkRivers
Member

Is this issue still occurring?

@MarkRivers
Member

I just answered my own question. Using mode=Single and AutoSave=Yes I see significant memory consumption when saving HDF5 files, but not when saving TIFF or netCDF.

@MarkRivers
Member

I just ran the simDetector under valgrind looking for memory leaks when saving HDF5 files in Single mode. When it exits I see the following, which looks like it could be the problem.

==109347== 6,048 bytes in 378 blocks are definitely lost in loss record 12,126 of 12,702
==109347==    at 0x4C2B9B5: calloc (vg_replace_malloc.c:711)
==109347==    by 0x653C0E: NDFileHDF5AttributeDataset::configureDims(int) (NDFileHDF5AttributeDataset.cpp:231)
==109347==    by 0x6540E9: NDFileHDF5AttributeDataset::createDataset(int) (NDFileHDF5AttributeDataset.cpp:73)
==109347==    by 0x64C35D: NDFileHDF5::createAttributeDataset(NDArray*) (NDFileHDF5.cpp:2735)
==109347==    by 0x64E248: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:282)
==109347==    by 0x66BB1C: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==109347==    by 0x66C6D3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==109347==    by 0x66CFE7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==109347==    by 0x6687C5: NDPluginDriver::processTask() (NDPluginDriver.cpp:524)
==109347==    by 0x12A3FC8: epicsThreadCallEntryPoint (epicsThread.cpp:83)
==109347==    by 0x12A9A2B: start_routine (osdThread.c:403)
==109347==    by 0x538DE24: start_thread (in /usr/lib64/libpthread-2.17.so)
==109347==
==109347== 6,048 bytes in 378 blocks are definitely lost in loss record 12,127 of 12,702
==109347==    at 0x4C2B9B5: calloc (vg_replace_malloc.c:711)
==109347==    by 0x653C27: NDFileHDF5AttributeDataset::configureDims(int) (NDFileHDF5AttributeDataset.cpp:232)
==109347==    by 0x6540E9: NDFileHDF5AttributeDataset::createDataset(int) (NDFileHDF5AttributeDataset.cpp:73)
==109347==    by 0x64C35D: NDFileHDF5::createAttributeDataset(NDArray*) (NDFileHDF5.cpp:2735)
==109347==    by 0x64E248: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:282)
==109347==    by 0x66BB1C: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==109347==    by 0x66C6D3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==109347==    by 0x66CFE7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==109347==    by 0x6687C5: NDPluginDriver::processTask() (NDPluginDriver.cpp:524)
==109347==    by 0x12A3FC8: epicsThreadCallEntryPoint (epicsThread.cpp:83)
==109347==    by 0x12A9A2B: start_routine (osdThread.c:403)
==109347==    by 0x538DE24: start_thread (in /usr/lib64/libpthread-2.17.so)
==109347==
==109347== 6,048 bytes in 378 blocks are definitely lost in loss record 12,128 of 12,702
==109347==    at 0x4C2B9B5: calloc (vg_replace_malloc.c:711)
==109347==    by 0x653C3D: NDFileHDF5AttributeDataset::configureDims(int) (NDFileHDF5AttributeDataset.cpp:233)
==109347==    by 0x6540E9: NDFileHDF5AttributeDataset::createDataset(int) (NDFileHDF5AttributeDataset.cpp:73)
==109347==    by 0x64C35D: NDFileHDF5::createAttributeDataset(NDArray*) (NDFileHDF5.cpp:2735)
==109347==    by 0x64E248: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:282)
==109347==    by 0x66BB1C: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==109347==    by 0x66C6D3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==109347==    by 0x66CFE7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==109347==    by 0x6687C5: NDPluginDriver::processTask() (NDPluginDriver.cpp:524)
==109347==    by 0x12A3FC8: epicsThreadCallEntryPoint (epicsThread.cpp:83)
==109347==    by 0x12A9A2B: start_routine (osdThread.c:403)
==109347==    by 0x538DE24: start_thread (in /usr/lib64/libpthread-2.17.so)
==109347==
==109347== 6,048 bytes in 378 blocks are definitely lost in loss record 12,129 of 12,702
==109347==    at 0x4C2B9B5: calloc (vg_replace_malloc.c:711)
==109347==    by 0x653C53: NDFileHDF5AttributeDataset::configureDims(int) (NDFileHDF5AttributeDataset.cpp:234)
==109347==    by 0x6540E9: NDFileHDF5AttributeDataset::createDataset(int) (NDFileHDF5AttributeDataset.cpp:73)
==109347==    by 0x64C35D: NDFileHDF5::createAttributeDataset(NDArray*) (NDFileHDF5.cpp:2735)
==109347==    by 0x64E248: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:282)
==109347==    by 0x66BB1C: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==109347==    by 0x66C6D3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==109347==    by 0x66CFE7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==109347==    by 0x6687C5: NDPluginDriver::processTask() (NDPluginDriver.cpp:524)
==109347==    by 0x12A3FC8: epicsThreadCallEntryPoint (epicsThread.cpp:83)
==109347==    by 0x12A9A2B: start_routine (osdThread.c:403)
==109347==    by 0x538DE24: start_thread (in /usr/lib64/libpthread-2.17.so)
==109347==
==109347== 6,048 bytes in 378 blocks are definitely lost in loss record 12,130 of 12,702
==109347==    at 0x4C2B9B5: calloc (vg_replace_malloc.c:711)
==109347==    by 0x653C69: NDFileHDF5AttributeDataset::configureDims(int) (NDFileHDF5AttributeDataset.cpp:235)
==109347==    by 0x6540E9: NDFileHDF5AttributeDataset::createDataset(int) (NDFileHDF5AttributeDataset.cpp:73)
==109347==    by 0x64C35D: NDFileHDF5::createAttributeDataset(NDArray*) (NDFileHDF5.cpp:2735)
==109347==    by 0x64E248: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:282)
==109347==    by 0x66BB1C: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==109347==    by 0x66C6D3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==109347==    by 0x66CFE7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==109347==    by 0x6687C5: NDPluginDriver::processTask() (NDPluginDriver.cpp:524)
==109347==    by 0x12A3FC8: epicsThreadCallEntryPoint (epicsThread.cpp:83)
==109347==    by 0x12A9A2B: start_routine (osdThread.c:403)
==109347==    by 0x538DE24: start_thread (in /usr/lib64/libpthread-2.17.so)
==109347==

@MarkRivers
Member

I have found and fixed 3 definite memory leaks in the NDFileHDF5* files. However, it still is leaking memory, about 100 kB per file saved I believe. I'll continue to work on it today.

@MarkRivers
Member

The following shows the output of ps -u under the following setup:
Master branch
ImageMode=Multiple
NumImages=1000
HDF plugin Single mode, AutoSave=Yes

This is for the master branch where there are definite memory leaks. The first entry is before acquisition was started at all. The subsequent entries are after running each acquisition of 1000 files. The VSZ (virtual memory in kB) increases by 131072 or 65536 kB each time (ignoring the first entry before acquisition started). This is suspicious, since it is exactly a multiple of 64 MB. Is this meaningful, or is it just the quantum of memory allocation that Linux reports?

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
epics     71306  2.3  0.0 5798796 53580 pts/32  Sl+  11:12   0:00 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     71306 16.6  0.2 5943256 175272 pts/32 Sl+  11:12   0:16 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     71306 17.8  0.4 6074328 286200 pts/32 Sl+  11:12   0:32 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     71306 18.7  0.6 6205400 397132 pts/32 Sl+  11:12   0:46 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     71306 19.7  0.7 6270936 507772 pts/32 Sl+  11:12   1:02 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     71306 20.5  0.9 6402008 618700 pts/32 Sl+  11:12   1:18 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux

This is running the fix_hdf5_leaks branch, where I fixed the obvious memory leaks in NDFileHDF5* that were uncovered by valgrind. Again the VSZ increases by 131072 or 65536 kB for each 1000 files. So there appears to be no change from the master branch.

corvette:~/scratch>ps -u -q 73742
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
epics     73742  3.3  0.0 5799732 53556 pts/32  Sl+  11:19   0:00 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 17.4  0.2 5944192 171936 pts/32 Sl+  11:19   0:15 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 20.4  0.4 6075264 278892 pts/32 Sl+  11:19   0:30 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 19.6  0.5 6140800 386100 pts/32 Sl+  11:19   0:46 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 19.6  0.7 6271872 493320 pts/32 Sl+  11:19   1:01 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 20.0  0.9 6402944 600528 pts/32 Sl+  11:19   1:16 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux

This is the fix_hdf5_leaks branch, but collecting 100 files at a time rather than 1000. It is interesting: most of the time the VSZ does not increase at all, but every 5 or 6 runs it increases by 65536 kB.

corvette:~/scratch>ps -u -q 73742
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
epics     73742 17.5  0.9 6402944 600528 pts/32 Sl+  11:19   1:17 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 17.5  0.9 6402944 611352 pts/32 Sl+  11:19   1:18 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 17.4  0.9 6402944 621912 pts/32 Sl+  11:19   1:20 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 17.4  0.9 6402944 632736 pts/32 Sl+  11:19   1:22 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 17.3  0.9 6402944 643560 pts/32 Sl+  11:19   1:23 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 17.2  0.9 6468480 654136 pts/32 Sl+  11:19   1:25 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 16.7  1.0 6468480 664960 pts/32 Sl+  11:19   1:26 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 16.7  1.0 6468480 675520 pts/32 Sl+  11:19   1:28 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 16.5  1.0 6468480 686344 pts/32 Sl+  11:19   1:30 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 16.4  1.0 6468480 696904 pts/32 Sl+  11:19   1:31 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 16.4  1.0 6468480 707728 pts/32 Sl+  11:19   1:33 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     73742 16.3  1.0 6534016 718564 pts/32 Sl+  11:19   1:35 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux

@MarkRivers
Member

This is the same measurement as above, but using the NDFileNetCDF plugin, rather than NDFileHDF5.

It was done for 1000 files per measurement.

Ignoring the first value (when no buffers had yet been allocated), the memory only increased once, from 5876644 to 5880748, which is 4104 kB, or about 4 MB. This is completely expected, because the number of NDArrays in the NDArrayPool jumped from 6 to 8 at this time, and each array is 2 MB because they are 1024 x 1024 x 16-bit. So the conclusion is that the netCDF plugin does not have the memory leak of about 70 kB per file that the HDF5 file plugin does.

corvette:~/scratch>ps -u -q 81853
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
epics     81853  2.1  0.0 5798796 53552 pts/32  Sl+  11:55   0:00 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     81853 16.2  0.0 5876644 60384 pts/32  Sl+  11:55   0:16 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     81853 19.9  0.0 5876644 60384 pts/32  Sl+  11:55   0:32 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     81853 21.5  0.0 5880748 64608 pts/32  Sl+  11:55   0:48 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     81853 21.5  0.0 5880748 64608 pts/32  Sl+  11:55   1:02 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     81853 22.4  0.0 5880748 64608 pts/32  Sl+  11:55   1:17 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     81853 19.7  0.0 5880748 64608 pts/32  Sl+  11:55   1:34 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     81853 20.4  0.0 5880748 64608 pts/32  Sl+  11:55   1:49 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux

@MarkRivers
Member

I discussed this with @anjohnson . It is clear that ps can report memory allocations smaller than 64MB because it reported 4MB with the netCDF plugin. The fact that the HDF5 memory leak comes in quanta of 64MB is an indication that some other memory allocator is being used, perhaps the one in the hdf5 library, or the one in the xml2 library. This allocator seems to be allocating in chunks of 64 MB. Does anyone have any idea whether hdf5 or xml2 do this?

My fixes to the memory leaks in NDFileHDF* so far (#387) do not eliminate the major memory leak.

For my test with the master branch, ignoring the first entry, the memory leak was (6402008-5943256)/4000 = 114.7 kB/file. For my test on the fix_hdf5_leaks branch the leak was (6402944-5944192)/4000 = 114.7 kB/file, the same. So it seems that my fixes must be in the noise compared to the major leak.

@MarkRivers
Member

On both the master and the fix_hdf5_leaks branch valgrind immediately reports the following when the first HDF5 file is written:

epics> auto_settings.sav: 2927 of 2927 PV's connected
==100782== Thread 26 FileHDF1_Plugin:
==100782== Invalid read of size 8
==100782==    at 0x4C2E0C0: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1022)
==100782==    by 0xF67950: H5Pset_fill_value (H5Pdcpl.c:3217)
==100782==    by 0x653D86: NDFileHDF5AttributeDataset::createHDF5Dataset() (NDFileHDF5AttributeDataset.cpp:104)
==100782==    by 0x64C4A5: NDFileHDF5::createAttributeDataset(NDArray*) (NDFileHDF5.cpp:2738)
==100782==    by 0x64E388: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:282)
==100782==    by 0x66BDFC: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==100782==    by 0x66C9B3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==100782==    by 0x66D2C7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==100782==    by 0x668AA5: NDPluginDriver::processTask() (NDPluginDriver.cpp:524)
==100782==    by 0x12A42A8: epicsThreadCallEntryPoint (epicsThread.cpp:83)
==100782==    by 0x12A9D0B: start_routine (osdThread.c:403)
==100782==    by 0x538DE24: start_thread (in /usr/lib64/libpthread-2.17.so)
==100782==  Address 0x1c87f278 is 0 bytes after a block of size 8 alloc'd
==100782==    at 0x4C2B9B5: calloc (vg_replace_malloc.c:711)
==100782==    by 0x653CCC: NDFileHDF5AttributeDataset::NDFileHDF5AttributeDataset(long, std::string const&, NDAttrDataType_t) (NDFileHDF5AttributeDataset.cpp:42)
==100782==    by 0x64C41E: NDFileHDF5::createAttributeDataset(NDArray*) (NDFileHDF5.cpp:2725)
==100782==    by 0x64E388: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:282)
==100782==    by 0x66BDFC: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==100782==    by 0x66C9B3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==100782==    by 0x66D2C7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==100782==    by 0x668AA5: NDPluginDriver::processTask() (NDPluginDriver.cpp:524)
==100782==    by 0x12A42A8: epicsThreadCallEntryPoint (epicsThread.cpp:83)
==100782==    by 0x12A9D0B: start_routine (osdThread.c:403)
==100782==    by 0x538DE24: start_thread (in /usr/lib64/libpthread-2.17.so)
==100782==    by 0x6511BAC: clone (in /usr/lib64/libc-2.17.so)
==100782==
==100782== Invalid read of size 8
==100782==    at 0x4C2E0CE: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1022)
==100782==    by 0xF67950: H5Pset_fill_value (H5Pdcpl.c:3217)
==100782==    by 0x653D86: NDFileHDF5AttributeDataset::createHDF5Dataset() (NDFileHDF5AttributeDataset.cpp:104)
==100782==    by 0x64C4A5: NDFileHDF5::createAttributeDataset(NDArray*) (NDFileHDF5.cpp:2738)
==100782==    by 0x64E388: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:282)
==100782==    by 0x66BDFC: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==100782==    by 0x66C9B3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==100782==    by 0x66D2C7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==100782==    by 0x668AA5: NDPluginDriver::processTask() (NDPluginDriver.cpp:524)
==100782==    by 0x12A42A8: epicsThreadCallEntryPoint (epicsThread.cpp:83)
==100782==    by 0x12A9D0B: start_routine (osdThread.c:403)
==100782==    by 0x538DE24: start_thread (in /usr/lib64/libpthread-2.17.so)
==100782==  Address 0x1c87f280 is 8 bytes after a block of size 8 alloc'd
==100782==    at 0x4C2B9B5: calloc (vg_replace_malloc.c:711)
==100782==    by 0x653CCC: NDFileHDF5AttributeDataset::NDFileHDF5AttributeDataset(long, std::string const&, NDAttrDataType_t) (NDFileHDF5AttributeDataset.cpp:42)
==100782==    by 0x64C41E: NDFileHDF5::createAttributeDataset(NDArray*) (NDFileHDF5.cpp:2725)
==100782==    by 0x64E388: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:282)
==100782==    by 0x66BDFC: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==100782==    by 0x66C9B3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==100782==    by 0x66D2C7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==100782==    by 0x668AA5: NDPluginDriver::processTask() (NDPluginDriver.cpp:524)
==100782==    by 0x12A42A8: epicsThreadCallEntryPoint (epicsThread.cpp:83)
==100782==    by 0x12A9D0B: start_routine (osdThread.c:403)
==100782==    by 0x538DE24: start_thread (in /usr/lib64/libpthread-2.17.so)
==100782==    by 0x6511BAC: clone (in /usr/lib64/libc-2.17.so)
==100782==
==100782== Syscall param write(buf) points to uninitialised byte(s)
==100782==    at 0x539479D: ??? (in /usr/lib64/libpthread-2.17.so)
==100782==    by 0x10CCA9D: H5FD_sec2_write (H5FDsec2.c:802)
==100782==    by 0xEC5269: H5FD_write (H5FDint.c:307)
==100782==    by 0xEACCE3: H5F__accum_write (H5Faccum.c:652)
==100782==    by 0xF922B0: H5PB_write (H5PB.c:1040)
==100782==    by 0xEB518C: H5F_block_write (H5Fio.c:194)
==100782==    by 0xE643BE: H5C__flush_single_entry (H5C.c:6285)
==100782==    by 0xE658EF: H5C_flush_ring (H5C.c:6026)
==100782==    by 0xE658EF: H5C_flush_cache (H5C.c:1152)
==100782==    by 0xE4348B: H5AC_flush (H5AC.c:864)
==100782==    by 0xEAEB65: H5F__flush_phase2 (H5Fint.c:1571)
==100782==    by 0xEB0C74: H5F__dest (H5Fint.c:883)
==100782==    by 0xEB16F0: H5F_try_close (H5Fint.c:1881)
==100782==  Address 0x207676b0 is 2,464 bytes inside a block of size 16,448 alloc'd
==100782==    at 0x4C29C23: malloc (vg_replace_malloc.c:299)
==100782==    by 0xF236B8: H5MM_malloc (H5MM.c:294)
==100782==    by 0xEC811E: H5FL_malloc (H5FL.c:243)
==100782==    by 0xEC811E: H5FL_blk_malloc (H5FL.c:921)
==100782==    by 0xEC8444: H5FL_blk_realloc (H5FL.c:1142)
==100782==    by 0xEAB558: H5F__accum_adjust (H5Faccum.c:393)
==100782==    by 0xEACD11: H5F__accum_write (H5Faccum.c:486)
==100782==    by 0xF922B0: H5PB_write (H5PB.c:1040)
==100782==    by 0xEB518C: H5F_block_write (H5Fio.c:194)
==100782==    by 0xE643BE: H5C__flush_single_entry (H5C.c:6285)
==100782==    by 0xE658EF: H5C_flush_ring (H5C.c:6026)
==100782==    by 0xE658EF: H5C_flush_cache (H5C.c:1152)
==100782==    by 0xE4348B: H5AC_flush (H5AC.c:864)
==100782==    by 0xEAEB65: H5F__flush_phase2 (H5Fint.c:1571)
==100782==

It seems like we need to fix these problems.

These are the main leaks in the NDFileHDF5 plugin on the fix_hdf5_leaks branch. They all seem to be XML-related.

==100782== 164,951 (79,920 direct, 85,031 indirect) bytes in 999 blocks are definitely lost in loss record 8,118 of 8,173
==100782==    at 0x4C2A243: operator new(unsigned long) (vg_replace_malloc.c:334)
==100782==    by 0x65B0DF: hdf5::Group::new_hardlink(std::string const&) (NDFileHDF5Layout.cpp:398)
==100782==    by 0x655D86: hdf5::LayoutXML::new_hardlink() (NDFileHDF5LayoutXML.cpp:626)
==100782==    by 0x656764: hdf5::LayoutXML::process_node() (NDFileHDF5LayoutXML.cpp:288)
==100782==    by 0x656937: hdf5::LayoutXML::load_xml() (NDFileHDF5LayoutXML.cpp:120)
==100782==    by 0x64DC81: NDFileHDF5::createFileLayout(NDArray*) (NDFileHDF5.cpp:3719)
==100782==    by 0x64E24A: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:263)
==100782==    by 0x66BDFC: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==100782==    by 0x66C9B3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==100782==    by 0x66D2C7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==100782==    by 0x668AA5: NDPluginDriver::processTask() (NDPluginDriver.cpp:524)
==100782==    by 0x12A42A8: epicsThreadCallEntryPoint (epicsThread.cpp:83)

 71,991 bytes in 7,999 blocks are definitely lost in loss record 8,072 of 8,173
==100782==    at 0x4C29C23: malloc (vg_replace_malloc.c:299)
==100782==    by 0xE0B3D3: xmlStrndup__internal_alias (xmlstring.c:45)
==100782==    by 0xE0B3D3: xmlStrdup (xmlstring.c:71)
==100782==    by 0xD868F0: xmlTextReaderGetAttribute (xmlreader.c:2417)
==100782==    by 0x654BDB: hdf5::LayoutXML::process_attribute_xml_attribute(hdf5::Attribute&) (NDFileHDF5LayoutXML.cpp:368)
==100782==    by 0x655BB0: hdf5::LayoutXML::new_attribute() (NDFileHDF5LayoutXML.cpp:587)
==100782==    by 0x656862: hdf5::LayoutXML::process_node() (NDFileHDF5LayoutXML.cpp:284)
==100782==    by 0x656937: hdf5::LayoutXML::load_xml() (NDFileHDF5LayoutXML.cpp:120)
==100782==    by 0x64DC81: NDFileHDF5::createFileLayout(NDArray*) (NDFileHDF5.cpp:3719)
==100782==    by 0x64E24A: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:263)
==100782==    by 0x66BDFC: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==100782==    by 0x66C9B3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==100782==    by 0x66D2C7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==100782==

 71,000 bytes in 8,000 blocks are definitely lost in loss record 8,069 of 8,173
==100782==    at 0x4C29C23: malloc (vg_replace_malloc.c:299)
==100782==    by 0xE0B3D3: xmlStrndup__internal_alias (xmlstring.c:45)
==100782==    by 0xE0B3D3: xmlStrdup (xmlstring.c:71)
==100782==    by 0xD868F0: xmlTextReaderGetAttribute (xmlreader.c:2417)
==100782==    by 0x654DF5: hdf5::LayoutXML::process_attribute_xml_attribute(hdf5::Attribute&) (NDFileHDF5LayoutXML.cpp:401)
==100782==    by 0x655BB0: hdf5::LayoutXML::new_attribute() (NDFileHDF5LayoutXML.cpp:587)
==100782==    by 0x656862: hdf5::LayoutXML::process_node() (NDFileHDF5LayoutXML.cpp:284)
==100782==    by 0x656937: hdf5::LayoutXML::load_xml() (NDFileHDF5LayoutXML.cpp:120)
==100782==    by 0x64DC81: NDFileHDF5::createFileLayout(NDArray*) (NDFileHDF5.cpp:3719)
==100782==    by 0x64E24A: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:263)
==100782==    by 0x66BDFC: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==100782==    by 0x66C9B3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==100782==    by 0x66D2C7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)

==100782== 69,991 bytes in 7,999 blocks are definitely lost in loss record 8,068 of 8,173
==100782==    at 0x4C29C23: malloc (vg_replace_malloc.c:299)
==100782==    by 0xE0B3D3: xmlStrndup__internal_alias (xmlstring.c:45)
==100782==    by 0xE0B3D3: xmlStrdup (xmlstring.c:71)
==100782==    by 0xD868F0: xmlTextReaderGetAttribute (xmlreader.c:2417)
==100782==    by 0x655B79: hdf5::LayoutXML::new_attribute() (NDFileHDF5LayoutXML.cpp:581)
==100782==    by 0x656862: hdf5::LayoutXML::process_node() (NDFileHDF5LayoutXML.cpp:284)
==100782==    by 0x656937: hdf5::LayoutXML::load_xml() (NDFileHDF5LayoutXML.cpp:120)
==100782==    by 0x64DC81: NDFileHDF5::createFileLayout(NDArray*) (NDFileHDF5.cpp:3719)
==100782==    by 0x64E24A: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:263)
==100782==    by 0x66BDFC: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==100782==    by 0x66C9B3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==100782==    by 0x66D2C7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==100782==    by 0x668AA5: NDPluginDriver::processTask() (NDPluginDriver.cpp:524)

==100782== 52,985 bytes in 7,997 blocks are definitely lost in loss record 8,044 of 8,173
==100782==    at 0x4C29C23: malloc (vg_replace_malloc.c:299)
==100782==    by 0xE0B3D3: xmlStrndup__internal_alias (xmlstring.c:45)
==100782==    by 0xE0B3D3: xmlStrdup (xmlstring.c:71)
==100782==    by 0xD868F0: xmlTextReaderGetAttribute (xmlreader.c:2417)
==100782==    by 0x654E57: hdf5::LayoutXML::process_attribute_xml_attribute(hdf5::Attribute&) (NDFileHDF5LayoutXML.cpp:404)
==100782==    by 0x655BB0: hdf5::LayoutXML::new_attribute() (NDFileHDF5LayoutXML.cpp:587)
==100782==    by 0x656862: hdf5::LayoutXML::process_node() (NDFileHDF5LayoutXML.cpp:284)
==100782==    by 0x656937: hdf5::LayoutXML::load_xml() (NDFileHDF5LayoutXML.cpp:120)
==100782==    by 0x64DC81: NDFileHDF5::createFileLayout(NDArray*) (NDFileHDF5.cpp:3719)
==100782==    by 0x64E24A: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:263)
==100782==    by 0x66BDFC: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==100782==    by 0x66C9B3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==100782==    by 0x66D2C7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)

==100782== 32,000 bytes in 1,000 blocks are definitely lost in loss record 7,984 of 8,173
==100782==    at 0x4C29C23: malloc (vg_replace_malloc.c:299)
==100782==    by 0xE0B3D3: xmlStrndup__internal_alias (xmlstring.c:45)
==100782==    by 0xE0B3D3: xmlStrdup (xmlstring.c:71)
==100782==    by 0xD868F0: xmlTextReaderGetAttribute (xmlreader.c:2417)
==100782==    by 0x655D51: hdf5::LayoutXML::new_hardlink() (NDFileHDF5LayoutXML.cpp:620)
==100782==    by 0x656764: hdf5::LayoutXML::process_node() (NDFileHDF5LayoutXML.cpp:288)
==100782==    by 0x656937: hdf5::LayoutXML::load_xml() (NDFileHDF5LayoutXML.cpp:120)
==100782==    by 0x64DC81: NDFileHDF5::createFileLayout(NDArray*) (NDFileHDF5.cpp:3719)
==100782==    by 0x64E24A: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:263)
==100782==    by 0x66BDFC: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==100782==    by 0x66C9B3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==100782==    by 0x66D2C7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==100782==    by 0x668AA5: NDPluginDriver::processTask() (NDPluginDriver.cpp:524)

@MarkRivers
Member

I did a test first collecting in Single mode where VSZ increases. Then I changed to Stream mode where it does not increase. So the memory leak appears to be per file, not per array.

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
epics    118808  2.8  0.0 5799732 55592 pts/32  Sl+  14:52   0:00 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics    118808 19.9  0.2 5944192 173756 pts/32 Sl+  14:52   0:16 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics    118808 21.0  0.4 6077316 283080 pts/32 Sl+  14:52   0:30 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics    118808 21.7  0.5 6142852 390056 pts/32 Sl+  14:52   0:46 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux

Now I switched to Stream mode, 1000 arrays per file. Note the VSZ does not increase.

epics    118808 17.9  0.6 6208388 395852 pts/32 Sl+  14:52   0:53 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics    118808 16.5  0.6 6208388 394692 pts/32 Sl+  14:52   0:59 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics    118808 15.7  0.6 6208388 396244 pts/32 Sl+  14:52   1:06 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
corvette:~/scratch>

@MarkRivers
Member

I have found and fixed the memory leak responsible for this above:

==100782== 164,951 (79,920 direct, 85,031 indirect) bytes in 999 blocks are definitely lost in loss record 8,118 of 8,173
==100782==    at 0x4C2A243: operator new(unsigned long) (vg_replace_malloc.c:334)
==100782==    by 0x65B0DF: hdf5::Group::new_hardlink(std::string const&) (NDFileHDF5Layout.cpp:398)
==100782==    by 0x655D86: hdf5::LayoutXML::new_hardlink() (NDFileHDF5LayoutXML.cpp:626)
==100782==    by 0x656764: hdf5::LayoutXML::process_node() (NDFileHDF5LayoutXML.cpp:288)
==100782==    by 0x656937: hdf5::LayoutXML::load_xml() (NDFileHDF5LayoutXML.cpp:120)
==100782==    by 0x64DC81: NDFileHDF5::createFileLayout(NDArray*) (NDFileHDF5.cpp:3719)
==100782==    by 0x64E24A: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:263)
==100782==    by 0x66BDFC: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==100782==    by 0x66C9B3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==100782==    by 0x66D2C7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==100782==    by 0x668AA5: NDPluginDriver::processTask() (NDPluginDriver.cpp:524)
==100782==    by 0x12A42A8: epicsThreadCallEntryPoint (epicsThread.cpp:83)

With this fix:

corvette:ADCore/ADApp/pluginSrc>git diff NDFileHDF5.cpp
diff --git a/ADApp/pluginSrc/NDFileHDF5.cpp b/ADApp/pluginSrc/NDFileHDF5.cpp
index 6f44744..380a2dd 100644
--- a/ADApp/pluginSrc/NDFileHDF5.cpp
+++ b/ADApp/pluginSrc/NDFileHDF5.cpp
@@ -718,6 +718,7 @@ asynStatus NDFileHDF5::createHardLinks(hdf5::Group* root)
         asynPrint(this->pasynUserSelf, ASYN_TRACE_ERROR, "%s::%s error creating hard link from: %s to %s\n",
                   driverName, functionName, targetName.c_str(), linkName.c_str());
       }
+      delete it_hardlinks->second;
     }

     hdf5::Group::MapGroups_t::const_iterator it_group;

However, this also does not significantly reduce the memory leak.

@MarkRivers
Member

I found and fixed the problem responsible for this valgrind error:

epics> auto_settings.sav: 2927 of 2927 PV's connected
==100782== Thread 26 FileHDF1_Plugin:
==100782== Invalid read of size 8
==100782==    at 0x4C2E0C0: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1022)
==100782==    by 0xF67950: H5Pset_fill_value (H5Pdcpl.c:3217)
==100782==    by 0x653D86: NDFileHDF5AttributeDataset::createHDF5Dataset() (NDFileHDF5AttributeDataset.cpp:104)
==100782==    by 0x64C4A5: NDFileHDF5::createAttributeDataset(NDArray*) (NDFileHDF5.cpp:2738)
==100782==    by 0x64E388: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:282)
==100782==    by 0x66BDFC: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==100782==    by 0x66C9B3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==100782==    by 0x66D2C7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==100782==    by 0x668AA5: NDPluginDriver::processTask() (NDPluginDriver.cpp:524)
==100782==    by 0x12A42A8: epicsThreadCallEntryPoint (epicsThread.cpp:83)
==100782==    by 0x12A9D0B: start_routine (osdThread.c:403)
==100782==    by 0x538DE24: start_thread (in /usr/lib64/libpthread-2.17.so)
==100782==  Address 0x1c87f278 is 0 bytes after a block of size 8 alloc'd
==100782==    at 0x4C2B9B5: calloc (vg_replace_malloc.c:711)
==100782==    by 0x653CCC: NDFileHDF5AttributeDataset::NDFileHDF5AttributeDataset(long, std::string const&, NDAttrDataType_t) (NDFileHDF5AttributeDataset.cpp:42)
==100782==    by 0x64C41E: NDFileHDF5::createAttributeDataset(NDArray*) (NDFileHDF5.cpp:2725)
==100782==    by 0x64E388: NDFileHDF5::openFile(char const*, int, NDArray*) (NDFileHDF5.cpp:282)
==100782==    by 0x66BDFC: NDPluginFile::openFileBase(int, NDArray*) (NDPluginFile.cpp:73)
==100782==    by 0x66C9B3: NDPluginFile::writeFileBase() (NDPluginFile.cpp:254)
==100782==    by 0x66D2C7: NDPluginFile::processCallbacks(NDArray*) (NDPluginFile.cpp:756)
==100782==    by 0x668AA5: NDPluginDriver::processTask() (NDPluginDriver.cpp:524)
==100782==    by 0x12A42A8: epicsThreadCallEntryPoint (epicsThread.cpp:83)
==100782==    by 0x12A9D0B: start_routine (osdThread.c:403)
==100782==    by 0x538DE24: start_thread (in /usr/lib64/libpthread-2.17.so)
==100782==    by 0x6511BAC: clone (in /usr/lib64/libc-2.17.so)

I added debugging in H5Pset_fill_value (H5Pdcpl.c:3217) and found that the buffer allocated for ptrFillValue must be MAX_ATTRIBUTE_STRING_SIZE bytes, so I increased it to that with this:

diff --git a/ADApp/pluginSrc/NDFileHDF5AttributeDataset.cpp b/ADApp/pluginSrc/NDFileHDF5AttributeDataset.cpp
index 2b4968a..8c01ce3 100644
--- a/ADApp/pluginSrc/NDFileHDF5AttributeDataset.cpp
+++ b/ADApp/pluginSrc/NDFileHDF5AttributeDataset.cpp
@@ -39,7 +39,7 @@ NDFileHDF5AttributeDataset::NDFileHDF5AttributeDataset(hid_t file, const std::st
 {
   //printf("Constructor called for %s\n", name.c_str());
   // Allocate enough memory for the fill value to accept any data type
-  ptrFillValue_ = (void*)calloc(8, sizeof(char));
+  ptrFillValue_ = (void*)calloc(MAX_ATTRIBUTE_STRING_SIZE, sizeof(char));
 }

valgrind no longer reports an error.

@MarkRivers
Member

I found many places in NDFileHDF5LayoutXML.cpp where xmlFree was not being called to deallocate the string returned by xmlTextReaderGetAttribute(). I have fixed these. valgrind no longer reports these problems. However, it has not changed the memory leak, which is still leaking 64MB to 128MB for each 1000 files saved.
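For reference, the fix pattern is simply to release each string returned by xmlTextReaderGetAttribute() with xmlFree() once its contents have been copied. A minimal sketch, assuming an open xmlTextReaderPtr named reader and an attribute named "source" (names here are illustrative, not the actual plugin code):

  // libxml2 allocates the string returned by xmlTextReaderGetAttribute();
  // the caller owns it and must release it with xmlFree().
  xmlChar *attr_str = xmlTextReaderGetAttribute(reader, (const xmlChar *)"source");
  if (attr_str != NULL) {
    std::string source((const char *)attr_str);  // copy the value before freeing
    xmlFree(attr_str);                           // without this the string leaks on every file open
  }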

It is beginning to look like the problem may be in the hdf5 library itself? Or is there an hdf5 allocator we are calling and not freeing that valgrind is not picking up?

@MarkRivers
Member

I think this is a fairly serious issue. It is leaking >100 kB of memory per file written. If saving many small files it will eventually consume a huge amount of virtual memory. Can someone at DLS take a look at this?

@MarkRivers
Member

This is the configuration I am using to test the HDF5 memory leak

[three screenshots: simDetector and NDFileHDF5 plugin configuration screens]

I have now merged #387 and am testing with that fix.

This is the growth in VSZ as 1000 individual HDF5 files are saved.

This is the result when there is no Attributes file as shown in the screen above.


corvette:~/devel/areaDetector/ADSimDetector>ps -u -q 16713
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
epics     16713  3.9  0.2 5803752 147372 pts/32 Sl+  07:59   0:29 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     16713  4.9  0.2 5803752 189084 pts/32 Sl+  07:59   0:39 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     16713  5.7  0.3 5869288 230844 pts/32 Sl+  07:59   0:50 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     16713  6.5  0.4 5934824 272604 pts/32 Sl+  07:59   1:00 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     16713  7.2  0.4 5938928 318540 pts/32 Sl+  07:59   1:10 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     16713  7.7  0.5 6004464 360300 pts/32 Sl+  07:59   1:21 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     16713  8.2  0.6 6070000 402052 pts/32 Sl+  07:59   1:31 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     16713  8.7  0.6 6070000 443764 pts/32 Sl+  07:59   1:41 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux

So I saved 1000 files 8 times and the memory increased from 5803752 to 6070000 kB. That is 33.3 kB per file.

I then repeated the measurement, but this time I used the simDetectorAttributes.xml file that comes with ADSimDetector. This saves additional attributes to the HDF5 file.

corvette:~/devel/areaDetector/ADSimDetector>ps -u -q 20393
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
epics     20393 10.1  0.4 6269812 277780 pts/32 Sl+  08:19   0:33 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     20393 12.6  0.5 6335348 384500 pts/32 Sl+  08:19   0:49 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     20393 14.4  0.7 6466420 491004 pts/32 Sl+  08:19   1:05 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     20393 15.8  0.9 6597492 597508 pts/32 Sl+  08:19   1:21 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     20393 16.7  1.0 6663028 703960 pts/32 Sl+  08:19   1:35 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     20393 17.6  1.2 6794100 810472 pts/32 Sl+  08:19   1:51 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     20393 18.3  1.3 6925172 917240 pts/32 Sl+  08:19   2:07 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     20393 11.0  1.5 6990708 1023696 pts/32 Sl+ 08:19   2:28 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux

I again saved 8000 files, and this time the memory increased from 6269812 to 6990708 kB. That is 90.1 kB per file. So it appears that the memory leak is related to the number of attributes.

Note that even when no attributes file is used the HDF5 plugin still saves attributes like UniqueId, TimeStamp, etc. So it is possible that the memory leak is entirely due to some problem with the attributes.

I ran valgrind while collecting 1000 files. It is much slower, simDetector can only do about 7 frames/s and HDF5 plugin can only do about 3 files/s.

valgrind --leak-check=yes --log-file=valgrind_output.txt ../../bin/linux-x86_64/simDetectorApp st.cmd.linux

This is the valgrind output file.
valgrind_output.txt

There are quite a few memory loss entries for NDFileHDF5. However, they are small, and I don't think they are causing 90kB loss per file. The leaks I fixed the other day were much larger, and they had no measurable affect on the memory leak.

@MarkRivers
Member

MarkRivers commented Mar 26, 2019

I did another test. This time I changed the HDF5 plugin to have StorePerform=No and StoreAttr=No. This was the memory usage as I saved 1000 files 11 times.

corvette:~/devel/areaDetector/ADSimDetector>ps -u -q 52513
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
epics     52513 10.5  0.1 6011220 76544 pts/32  Sl+  10:51   0:07 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     52513 14.9  0.1 6011220 91064 pts/32  Sl+  10:51   0:15 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     52513 16.7  0.1 6011220 105848 pts/32 Sl+  10:51   0:21 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     52513 17.8  0.1 6011220 120368 pts/32 Sl+  10:51   0:27 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     52513 18.8  0.2 6076756 134900 pts/32 Sl+  10:51   0:34 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     52513 19.4  0.2 6084964 157868 pts/32 Sl+  10:51   0:42 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     52513 19.9  0.2 6084964 172388 pts/32 Sl+  10:51   0:49 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     52513 20.3  0.2 6084964 186908 pts/32 Sl+  10:51   0:56 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     52513 20.1  0.3 6150500 201428 pts/32 Sl+  10:51   1:02 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     52513 20.3  0.3 6150500 216212 pts/32 Sl+  10:51   1:09 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     52513 20.4  0.3 6150500 230732 pts/32 Sl+  10:51   1:16 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux
epics     52513 20.4  0.3 6150500 245252 pts/32 Sl+  10:51   1:23 ../../bin/linux-x86_64/simDetectorApp st.cmd.linux

The memory loss was 6150500-6011220 = 139 MB. Of this 8MB was for NDArray buffer allocation because the HDF5 file plugin used 4 more queue elements than normal, because I was running at 50Hz rather than 20Hz. It thus lost 131 MB/11000 files = 12 kB/file. This is significantly less than previous tests which ranged from 33 to 90 kB/file.

Repeated tests with the netCDF plugin show no memory leak. In 11000 files it did not leak 1 kB. The memory leaks with the HDF5 plugin come in quanta of 64MB, suggesting it is some specialized memory allocator, and not just malloc() or new()? Does HDF5 itself have an allocator that allocates in chunks of 64 MB?

@hinxx
Contributor

hinxx commented Mar 27, 2019

Looking at the setup images, I guess one needs to set auto save to Yes in order to get the files created.

I ran the same test on my box and it resulted in exactly 65536 bytes in VSZ increase between two consecutive runs of saving 1000 images.

At a quick glance, the code in the HDF5 plugin treats attributes a bit differently than what I'm used to in other plugins; it maintains an additional list of attributes so it does not have to copy the array when only the attributes are of interest.

I have not tried other combinations of flags, but from what I'm reading above it would make sense to look into how attributes are being saved. I'll try to take out the code that does attribute saving, thereby establishing a baseline with no more leaks (albeit no attributes either). This should give a hint as to whether we are looking in the right direction.

@ulrikpedersen
Member

I think this is a fairly serious issue. It is leaking >100 kB of memory per file written. If saving many small files it will eventually consume a huge amount of virtual memory. Can someone at DLS take a look at this?

Hi @MarkRivers. Apologies, I completely missed this thread (@GDYendell pointed it out to me today). This is serious and we will look into it. We are just near the end of a very busy shutdown so can start work on this one after Wednesday next week.

@MarkRivers
Member

Looking at the setup images, I guess one needs to set auto save to Yes in order to get the files created.

Yes, sorry I had not yet turned on that flag when I made the screen shot, but it does need to be set to Yes.

I ran the same test on my box and it resulted in exactly 65536 bytes in VSZ increase between two consecutive runs of saving 1000 images.

I think the units of VSZ are kB, so it was actually a 64 MB increase for 1000 images. When I use simDetectorAttributes.xml I see either 64 MB or 128 MB for 1000 images. This is suspicious. What system is allocating memory in 64 MB quanta? I don't think calloc()/malloc() do that; I see VSZ increase by 2 MB when a new NDArray is allocated. Does the C++ new operator allocate in large chunks, or is this something being done in the HDF5 library memory allocator?

At a quick glance, the code in the HDF5 plugin treats attributes a bit differently than what I'm used to in other plugins; it maintains an additional list of attributes so it does not have to copy the array when only the attributes are of interest.

Actually the netCDF, TIFF and Nexus file plugins all do this as well. This is a search for the string pFileAttributes in all plugin source files:

corvette:ADCore/ADApp/pluginSrc>grep pFileAttributes *.cpp
NDFileHDF5.cpp:  this->pFileAttributes->clear();
NDFileHDF5.cpp:  this->getAttributes(this->pFileAttributes);
NDFileHDF5.cpp:  pArray->pAttributeList->copy(this->pFileAttributes);
NDFileHDF5.cpp:            driverName, functionName, this->pFileAttributes->count(), pArray->pAttributeList->count());
NDFileHDF5.cpp:                    ndAttr = this->pFileAttributes->find(hdfAttrNode->getName().c_str());
NDFileHDF5.cpp:            ndAttr = this->pFileAttributes->find(uniqueIDName);
NDFileHDF5.cpp:        ndAttr = this->pFileAttributes->find(attr.source.get_src_def().c_str());
NDFileHDF5.cpp:    status = (asynStatus)this->getAttributes(this->pFileAttributes);
NDFileHDF5.cpp:    status = (asynStatus)pArray->pAttributeList->copy(this->pFileAttributes);
NDFileHDF5.cpp:  this->pFileAttributes = new NDAttributeList;
NDFileHDF5.cpp:            driverName, functionName, extraDims, this->pFileAttributes->count());
NDFileHDF5.cpp:  ndAttr = this->pFileAttributes->next(ndAttr); // get the first NDAttribute
NDFileHDF5.cpp:    ndAttr = this->pFileAttributes->next(ndAttr);
NDFileHDF5.cpp:      ndAttr = this->pFileAttributes->find(hdfAttrNode->getName().c_str());
NDFileHDF5.cpp:  ndAttr = this->pFileAttributes->find(uniqueIDName);
NDFileHDF5.cpp:  this->pFileAttributes->add("NDArrayUniqueId",
NDFileHDF5.cpp:  this->pFileAttributes->add("NDArrayTimeStamp",
NDFileHDF5.cpp:  this->pFileAttributes->add("NDArrayEpicsTSSec",
NDFileHDF5.cpp:  this->pFileAttributes->add("NDArrayEpicsTSnSec",
NDFileNetCDF.cpp:    this->pFileAttributes->clear();
NDFileNetCDF.cpp:    this->getAttributes(this->pFileAttributes);
NDFileNetCDF.cpp:    pArray->pAttributeList->copy(this->pFileAttributes);
NDFileNetCDF.cpp:    numAttributes = this->pFileAttributes->count();
NDFileNetCDF.cpp:    pAttribute = this->pFileAttributes->next(NULL);
NDFileNetCDF.cpp:        pAttribute = this->pFileAttributes->next(pAttribute);
NDFileNetCDF.cpp:    this->getAttributes(this->pFileAttributes);
NDFileNetCDF.cpp:    pArray->pAttributeList->copy(this->pFileAttributes);
NDFileNetCDF.cpp:    pAttribute = this->pFileAttributes->next(NULL);
NDFileNetCDF.cpp:        pAttribute = this->pFileAttributes->next(pAttribute);
NDFileNetCDF.cpp:    this->pFileAttributes = new NDAttributeList;
NDFileNexus.cpp:  this->pFileAttributes->clear();
NDFileNexus.cpp:  this->getAttributes(this->pFileAttributes);
NDFileNexus.cpp:  pArray->pAttributeList->copy(this->pFileAttributes);
NDFileNexus.cpp:  this->getAttributes(this->pFileAttributes);
NDFileNexus.cpp:  pArray->pAttributeList->copy(this->pFileAttributes);
NDFileNexus.cpp:      pAttr = this->pFileAttributes->find(nodeSource);
NDFileNexus.cpp:      pAttr = this->pFileAttributes->find(nodeSource);
NDFileNexus.cpp:  this->pFileAttributes = new NDAttributeList;
NDFileTIFF.cpp:    this->pFileAttributes->clear();
NDFileTIFF.cpp:    this->getAttributes(this->pFileAttributes);
NDFileTIFF.cpp:    pArray->pAttributeList->copy(this->pFileAttributes);
NDFileTIFF.cpp:    pAttribute = this->pFileAttributes->find("Model");
NDFileTIFF.cpp:    pAttribute = this->pFileAttributes->find("Manufacturer");
NDFileTIFF.cpp:    pAttribute = this->pFileAttributes->find("TIFFImageDescription");
NDFileTIFF.cpp:    numAttributes_ = this->pFileAttributes->count();
NDFileTIFF.cpp:        "%s:%s this->pFileAttributes->count(): %d\n",
NDFileTIFF.cpp:    pAttribute = this->pFileAttributes->next(NULL);
NDFileTIFF.cpp:        pAttribute = this->pFileAttributes->next(pAttribute);
NDFileTIFF.cpp:    this->pFileAttributes = new NDAttributeList;

So the TIFF and netCDF plugins do similar things to the HDF5 plugin with the attribute list. But the TIFF and netCDF file plugins do not have a memory leak.

@ulrikpedersen
Member

The HDF5 library could leak memory if hdf5 "objects" are not closed properly after use. Normally I would consider those objects "handles" and don't expect the leaks in that case to be more than a few bytes - but with 1000s of iterations that can of course build up...

There are a few tricks to check whether objects have been left open - and also to force-close any open objects on file close. See these short articles from the HDF Group: https://confluence.hdfgroup.org/x/qsLoAg https://confluence.hdfgroup.org/x/qMLoAg

So in order to determine if the leak is due to "open objects" in the HDF5 lib we could try:

  • Open/create the file with the file access property fclose_degree set to "strong" - use function H5Pset_fclose_degree... But wait - we're already doing that here :(

  • I notice there is some code in the HDF5 plugin already to check if all objects for groups, datasets and hdf5-attributes have been closed before closing the file. If open objects are detected you should get ASYN_TRACE_FLOW messages such as Closing file not totally clean. ..... We could modify one of those H5Fget_obj_count calls to request H5F_OBJ_ALL in case we've missed an object type.

  • We could also try a test by closing the entire HDF5 library and freeing all memory with a call to H5close after closing a file and before creating a new one. This is brute-force and could only help to identify whether the leak is in the HDF5 library - but not where.
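For reference, a minimal sketch of what those calls look like, assuming fapl and file stand for the plugin's file-access property list and file handle (illustrative only, not the actual plugin code):

  // 1) Strong close semantics: closing the file also closes any objects
  //    (groups, datasets, attributes) still open in it.
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fclose_degree(fapl, H5F_CLOSE_STRONG);

  // 2) Just before closing, count every object type still open in the file.
  ssize_t nopen = H5Fget_obj_count(file, H5F_OBJ_ALL);

  // 3) Brute force: after closing the file, shut the whole library down,
  //    releasing all of its internal free lists and global state.
  H5Fclose(file);
  H5close();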

I also found this article about false valgrind errors in HDF5 - but the numbers look so small compared to the 64MB number that I doubt this is the case...

@MarkRivers
Member

I also found this article about false valgrind errors in HDF5 - but the numbers look so small compared to the 64MB number that I doubt this is the case...

Our primary diagnostic is not from valgrind, it is from watching VSZ grow with ps or top.

@hinxx
Contributor

hinxx commented Mar 28, 2019

I have added a call to H5Fget_obj_count() on line 1660 in NDFileHDF5.cpp:

  obj_count = (int)H5Fget_obj_count(this->file, H5F_OBJ_ALL);
  if (obj_count > 0){
    fprintf(stderr,
              "%s::%s Closing file not totally clean.  All remaining=%d\n",
              driverName, functionName, obj_count);
  }

I also changed other trace calls to plain fprintf(stderr, ...) calls, which is why there is so much output. Here is what I got right away after starting the same test:

NDFileHDF5::closeFile Closing file not totally clean.  All remaining=1
NDFileHDF5::closeFile file closed! runtime=0.010 s overall acquisition performance=0.00 Mbit/s
NDFileHDF5::closeFile closing HDF cparms 720575940379282709
NDFileHDF5::closeFile closing HDF datatype 216172782113788601
NDFileHDF5::closeFile closing groups
NDFileHDF5::closeFile Closing file not totally clean.  All remaining=1
NDFileHDF5::closeFile file closed! runtime=0.009 s overall acquisition performance=0.00 Mbit/s
NDFileHDF5::closeFile closing HDF cparms 720575940379282735
NDFileHDF5::closeFile closing HDF datatype 216172782113788636
NDFileHDF5::closeFile closing groups
NDFileHDF5::closeFile Closing file not totally clean.  All remaining=1
NDFileHDF5::closeFile file closed! runtime=0.007 s overall acquisition performance=0.00 Mbit/s
NDFileHDF5::closeFile closing HDF cparms 720575940379282761
NDFileHDF5::closeFile closing HDF datatype 216172782113788671

Seems like we are not being tidy enough when closing the file.

On another note, I did try to get HDF5 compiled with the valgrind-friendly memory checks enabled that @ulrikpedersen pointed out, but I cannot figure out what Makefile/g++ switch needs to be set, in adsupport-R1-4/supportApp/hdf5Src/Makefile I guess.

@ulrikpedersen
Member

I cannot figure out what Makefile/g++ switch needs to be set in adsupport-R1-4/supportApp/hdf5Src/Makefile

I think that should be added around Makefile line 15

Add the flag to USR_CFLAGS_Linux +=

@ulrikpedersen
Member

NDFileHDF5::closeFile Closing file not totally clean. All remaining=1

Hmmm... Pretty close to being clean... Could that single remaining object be the file object itself?

Would be interesting to add the call to H5close after closing the file and see whether or not the memory leak remains.

@hinxx
Contributor

hinxx commented Mar 28, 2019

Add the flag to USR_CFLAGS_Linux +=

I think --enable-using-memchecker is a configure switch and not a Makefile/g++ flag. Adding it to the USR_CFLAGS:

    USR_CFLAGS += --enable-using-memchecker

raises this error for me: cc1: error: unknown pass using-memchecker specified in -fenable.

I was after the equivalent for the Makefile, but it is not clear to me how the USINGMEMCHECKER=yes (found in the Makefile generated by ./configure on the stock hdf5 sources) comes into play at compile time. Should I just use -D USINGMEMCHECKER=yes or -D USINGMEMCHECKER? I could look at the valgrind differences to determine whether it made any impact, but that is kind of tedious.

@hinxx
Contributor

hinxx commented Mar 28, 2019

Would be interesting to add the call to H5close after closing the file and see whether or not the memory leak remain.

My hunch came from the comment from @MarkRivers that he sees far less memory leakage when using Stream mode vs. Single mode.

I'm running the test with H5close() right after:

  // Close the HDF file
  H5Fclose(this->file);
  this->file = 0;
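  H5close();            // added for this test: shuts down the HDF5 library, releasing all of its internally allocated memory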

... and the first test run looks promising already:

$ bash test-hdf5-memlkeak.sh 
Old : SIM1:cam1:AcquireTime          0.05
New : SIM1:cam1:AcquireTime          0.05
Old : SIM1:HDF1:FilePath \003
New : SIM1:HDF1:FilePath /tmp
Old : SIM1:HDF1:FileName \003
New : SIM1:HDF1:FileName test_hdf5
Old : SIM1:HDF1:AutoIncrement        No
New : SIM1:HDF1:AutoIncrement        Yes
Old : SIM1:HDF1:FileTemplate \003
New : SIM1:HDF1:FileTemplate %s%s_%d.h5
Old : SIM1:HDF1:AutoSave             No
New : SIM1:HDF1:AutoSave             Yes
Old : SIM1:HDF1:NumCapture           1
New : SIM1:HDF1:NumCapture           1000
Old : SIM1:HDF1:EnableCallbacks      Disable
New : SIM1:HDF1:EnableCallbacks      Enable
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
hinxx    25618 16.1  0.3 4460384 57540 pts/6   Sl+  10:29   0:11 /opt/bde/R3.15.5/artifacts/simapp-master/bin/linux-x86_64/iocApp st.cmd
hinxx    25618 16.7  0.3 4460384 60708 pts/6   Sl+  10:29   0:24 /opt/bde/R3.15.5/artifacts/simapp-master/bin/linux-x86_64/iocApp st.cmd
hinxx    25618 16.8  0.3 4460384 63876 pts/6   Sl+  10:29   0:36 /opt/bde/R3.15.5/artifacts/simapp-master/bin/linux-x86_64/iocApp st.cmd
hinxx    25618 17.1  0.4 4460384 67044 pts/6   Sl+  10:29   0:48 /opt/bde/R3.15.5/artifacts/simapp-master/bin/linux-x86_64/iocApp st.cmd
hinxx    25618 17.0  0.4 4460384 70216 pts/6   Sl+  10:29   1:00 /opt/bde/R3.15.5/artifacts/simapp-master/bin/linux-x86_64/iocApp st.cmd
hinxx    25618 17.1  0.4 4460384 73384 pts/6   Sl+  10:29   1:12 /opt/bde/R3.15.5/artifacts/simapp-master/bin/linux-x86_64/iocApp st.cmd
hinxx    25618 17.2  0.4 4460384 76552 pts/6   Sl+  10:29   1:25 /opt/bde/R3.15.5/artifacts/simapp-master/bin/linux-x86_64/iocApp st.cmd
hinxx    25618 17.3  0.4 4460384 79720 pts/6   Sl+  10:29   1:37 /opt/bde/R3.15.5/artifacts/simapp-master/bin/linux-x86_64/iocApp st.cmd
hinxx    25618 17.3  0.5 4460384 82624 pts/6   Sl+  10:29   1:50 /opt/bde/R3.15.5/artifacts/simapp-master/bin/linux-x86_64/iocApp st.cmd
hinxx    25618 17.4  0.5 4460384 85796 pts/6   Sl+  10:29   2:02 /opt/bde/R3.15.5/artifacts/simapp-master/bin/linux-x86_64/iocApp st.cmd

Memory leak: 4460384 − 4460384 = 0

@hinxx
Contributor

hinxx commented Mar 28, 2019

This is the patch to the ADCore master I used for the above results.

@ulrikpedersen
Member

I think the --enable-using-memchecker is configure switch and not Makefile/g++

Yes, of course you're right 👍 I guess you can't use HDF5 from ADSupport if you want to re-configure it? (for Linux we build our own HDF5 and don't use ADSupport)

Memory leak: 4460384 − 4460384 = 0

Cool! Then we know it's in the HDF5 library and even have a way to force it to be cleared. I would really like to work out exactly where we are leaking though 🤓

This is the patch to the ADCore master I used for the above results.

Could you just push your changed code to a new branch on your ADCore fork on https://github.com/hinxx/ADCore?

@hinxx
Contributor

hinxx commented Mar 28, 2019

for Linux we build our own HDF5 and don't use ADSupport

Ah, I see. I've gotten used to having it all from the areaDetector code base ;).

I would really like to work out exactly where we are leaking though

+1 on that. Also, trying to have these kinds of bugs noticed early (CI + valgrind?) might be a good investment for the future. I still think having a small and clean report (< 300 kB for a start) of potential leaks to look at would be nice to have.
For now it looks OK, but the stability is fragile IMO.

Could you just push your changed code to a new branch on your ADCore fork on https://github.com/hinxx/ADCore?

Of course. I'll keep it minimal.

@ulrikpedersen
Member

Also, trying to have these kinds of bugs noticed early (CI + valgrind?) might be a good investment for the future. I still think having a small and clean report (< 300 kB for a start) of potential leaks to look at would be nice to have.

I would like to see that in CI as well. We could add it to the current Travis CI - but that only builds and runs the unit tests, which are not likely to give the same valgrind result as running the simDetector IOC... Chances are the unittest framework will add its own memory leaks as well ;-)

@hinxx
Contributor

hinxx commented Mar 28, 2019

I understand it is not as simple a task as running unit tests; in addition, we need to exercise the IOC as well.

Chances are the unittest framework will add its own memory leaks as well ;-)

True that, nothing is perfect.
One attempt could be looking at the diff of the valgrind outputs (stable vs. new code) after a known test has been run on the CI box. The diff would hopefully point out possible areaDetector memory leaks and abstract away the others. That would give a crude indication that new code is not fit for import.
There are probably better CI practices for doing this, though.

@hinxx
Contributor

hinxx commented Mar 28, 2019

I found the flag that ADSupport can use to enable the memory checker (not tested):

USR_CFLAGS_Linux += -D H5_USING_MEMCHECKER

I'll play around with this as time permits, hopefully something good comes out of it.

@MarkRivers
Member

Closed via #390

@MarkRivers
Member

I just noticed a very interesting performance problem. I am working on the master branch, prior to testing #397 pull request.

I first ran with the master branch code that has the H5close() call. The performance was excellent, it could save single HDF5 files at the full 100 frames/s rate I was running the simDetector. It also does not have a memory leak, as we know.

I then commented out the call to H5close(), just to confirm that would re-introduce the memory leak. To my surprise, not only did it re-introduce the memory leak, but it also caused a huge performance hit. I was saving 1000 files in Single mode each time I ran the simDetector in Multiple mode with NumImages = 1000. I noticed that the files/s rate the HDF5 plugin could achieve kept getting smaller the more files I saved. For the first few files it was doing 100 files/s, but then it rapidly slowed down. I then restarted the IOC and did a systematic test. The following is the HDF5 plugin RunTime PV after saving each 1000 files. It tells how long it took to save the last file in that batch of 1000.

Files saved RunTime
1000 29 ms
2000 47 ms
3000 69 ms
4000 88 ms

So it started out doing 100 files/s, but after 1000 files it had already slowed down to about 33 files/s. After 4000 files it had slowed down to about 11.5 files/s. In further tests it slowed down to below 5 files/s. Clearly the problem is not just a memory leak; it is causing a major speed reduction as well.

This problem goes away by uncommenting the call to H5close().

@MarkRivers
Member

I just downloaded @hinxx branch from his pull request. I merged the current master into that branch and tested it.

The memory leak is gone. So is the bad performance described above. So now without calling H5close() both problems are solved. Great work!

I will remove the "HK" comments and merge into master.

@MarkRivers
Member

Closing via #397
