memory leak in hdf5 writer (?) #385
Comments
Do you see a similar leak if you test with the NDFileNetCDF or TIFF drivers? Can you send a screen shot of your detector and HDF5 screens so we can try to reproduce the problem? |
Is this issue still occurring? |
I just answered my own question. Using mode=Single and Autosave=Yes, I see significant memory consumption when saving HDF5 files, but not when saving TIFF or netCDF. |
I just ran the simDetector under valgrind looking for memory leaks when saving HDF5 files in Single mode. When it exits I see the following, which looks like it could be the problem.
|
I have found and fixed 3 definite memory leaks in the NDFileHDF5* files. However, it is still leaking memory, about 100 kB per file saved, I believe. I'll continue to work on it today. |
The following shows the output of ps -u under the following setup: This is for the master branch, where there are definite memory leaks. The first entry is from before any acquisition was started. The subsequent entries are after running each acquisition of 1000 files. The VSZ (virtual memory in kB) increases by 131072 or 65536 each time (ignoring the first entry before acquisition started). This is suspicious, since it is exactly a multiple of 64 MB. Is this meaningful, or is it just the quantum of memory allocation that Linux reports?
This is running the fix_hdf5_leaks branch, where I fixed the obvious memory leaks in NDFileHDF5* which were uncovered by valgrind. Again the VSZ increases by 131072 or 65536 kB for each 1000 files. So there appears to be no change from the master branch.
This is the fix_hdf5_leaks branch, but collecting 100 files at a time rather than 1000. It is interesting: most of the time the VSZ does not increase at all, but every 5 or 6 runs it increases by 65536 kB.
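As a cross-check on the ps -u numbers, a small stand-alone helper (my own sketch, not code from this thread) can read the same VmSize/VSZ figure from /proc on Linux, which makes it easy to log memory growth from inside a test program between file-save runs:

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Read the VmSize (VSZ) of the current process from /proc/self/status,
// an in-process alternative to polling `ps -u` between acquisitions (Linux only).
static long vmSizeKb(void)
{
  std::ifstream status("/proc/self/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.compare(0, 7, "VmSize:") == 0)
      return std::stol(line.substr(7));  // the value is reported in kB
  }
  return -1;
}

int main(void)
{
  std::cout << "VSZ = " << vmSizeKb() << " kB" << std::endl;
  return 0;
}
```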
|
This is the same measurement as above, but using the NDFileNetCDF plugin, rather than NDFileHDF5. It was done for 1000 files per measurement. Ignoring the first value (when no buffers had yet been allocated), the memory only increased once, from
|
I discussed this with @anjohnson . It is clear that ps can report memory allocations smaller than 64 MB, because it reported 4 MB with the netCDF plugin. The fact that the HDF5 memory leak comes in quanta of 64 MB is an indication that some other memory allocator is being used, perhaps the one in the hdf5 library, or the one in the xml2 library. This allocator seems to be allocating in chunks of 64 MB. Does anyone have any idea whether hdf5 or xml2 do this? My fixes to the memory leaks in NDFileHDF* so far (#387) do not eliminate the major memory leak. For my test with the master branch, ignoring the first entry, the memory leak was (6402008 - 5943256)/4000 = 114.7 kB/file. For my test on the fix_hdf5_leaks branch the leak was (6402944 - 5944192)/4000 = 114.7 kB/file, the same. So it seems that my fixes must be in the noise compared to the major leak. |
On both the master and the fix_hdf5_leaks branch valgrind immediately reports the following when the first HDF5 file is written:
It seems like we need to fix these problems. These are the main leaks in the NDFileHDF5 plugin on the fix_hdf5_leaks branch. They seem to all be XML related.
|
I did a test first collecting in Single mode where VSZ increases. Then I changed to Stream mode where it does not increase. So the memory leak appears to be per file, not per array.
Now I switched to Stream mode, 1000 arrays per file. Note the VSZ does not increase.
|
I have found and fixed the memory leak responsible for this above:
With this fix:
However, this also does not significantly reduce the memory leak. |
I found and fixed the problem responsible for this valgrind error:
I added debugging in H5Pset_fill_value (H5Pdcpl.c:3217) and found that the memory allocated for ptrFillValue must be at least MAX_ATTRIBUTE_STRING_SIZE bytes, so I increased it to that with this:
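A minimal stand-alone sketch of that idea (my reconstruction, not the actual ADCore patch; the function name and the constant value here are assumptions):

```cpp
#include <hdf5.h>
#include <stdlib.h>
#include <string.h>

// Assumed stand-in for ADCore's MAX_ATTRIBUTE_STRING_SIZE.
#define MAX_ATTRIBUTE_STRING_SIZE 256

// Build a dataset-creation property list with a string fill value.
// H5Pset_fill_value() copies H5Tget_size(dtype) bytes from the buffer, so the
// buffer must be the full MAX_ATTRIBUTE_STRING_SIZE, not just strlen()+1,
// otherwise valgrind reports an invalid read inside H5Pset_fill_value.
static hid_t createStringDcpl(const char *fillString)
{
  hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
  hid_t dtype = H5Tcopy(H5T_C_S1);
  H5Tset_size(dtype, MAX_ATTRIBUTE_STRING_SIZE);

  void *ptrFillValue = calloc(MAX_ATTRIBUTE_STRING_SIZE, sizeof(char));
  strncpy((char *)ptrFillValue, fillString, MAX_ATTRIBUTE_STRING_SIZE - 1);
  H5Pset_fill_value(dcpl, dtype, ptrFillValue);

  free(ptrFillValue);
  H5Tclose(dtype);
  return dcpl;
}
```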
valgrind no longer reports an error. |
I found many places in NDFileHDF5LayoutXML.cpp where xmlFree was not being called to deallocate the string returned by xmlTextReaderGetAttribute(). I have fixed these, and valgrind no longer reports these problems. However, it has not changed the memory leak, which is still 64 MB to 128 MB for each 1000 files saved. It is beginning to look like the problem may be in the hdf5 library itself. Or is there an hdf5 allocator we are calling and not freeing that valgrind is not picking up? |
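For reference, this is the libxml2 usage pattern in question, sketched as a small helper (my illustration, not the ADCore code): xmlTextReaderGetAttribute() returns a newly allocated string that the caller must release with xmlFree(), otherwise every attribute read from the layout XML leaks.

```cpp
#include <libxml/xmlreader.h>
#include <string>

// Fetch an attribute from the current xmlTextReader node and return it as a
// std::string, freeing the libxml2 allocation so it does not leak.
static std::string getAttribute(xmlTextReaderPtr reader, const char *name)
{
  std::string value;
  xmlChar *attr = xmlTextReaderGetAttribute(reader, (const xmlChar *)name);
  if (attr != NULL) {
    value = (const char *)attr;  // copy into a string we own
    xmlFree(attr);               // release the libxml2 allocation
  }
  return value;
}
```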
I think this is a fairly serious issue. It is leaking >100 kB of memory per file written. If saving many small files it will eventually consume a huge amount of virtual memory. Can someone at DLS take a look at this? |
This is the configuration I am using to test the HDF5 memory leak. I have now merged #387 and am testing with that fix. This is the growth in VSZ as 1000 individual HDF5 files are saved. This is the result when there is no Attributes file, as shown in the screen above.
So I saved 1000 files 8 times and the memory increased from 5803752 to 6070000 kB. That is 33.3 kB per file. I then repeated the measurement, but this time I used the simDetectorAttributes.xml file that comes with ADSimDetector. This saves additional attributes to the HDF5 file.
I again saved 8000 files, and this time the memory increased from 6269812 to 6990708 kB. That is 90.1 kB per file. So it appears that the memory leak is related to the number of attributes. Note that even when no attributes file is used the HDF5 plugin still saves attributes like UniqueId, TimeStamp, etc. So it is possible that the memory leak is entirely due to some problem with the attributes. I ran valgrind while collecting 1000 files. It is much slower: the simDetector can only do about 7 frames/s and the HDF5 plugin only about 3 files/s.
This is the valgrind output file. There are quite a few memory-loss entries for NDFileHDF5. However, they are small, and I don't think they are causing the 90 kB loss per file. The leaks I fixed the other day were much larger, and they had no measurable effect on the memory leak. |
I did another test. This time I changed the HDF5 plugin to have StorePerform=No and StoreAttr=No. This was the memory usage as I saved 1000 files 11 times.
The memory loss was 6150500 - 6011220 = 139 MB. Of this, 8 MB was for NDArray buffer allocation, because the HDF5 file plugin used 4 more queue elements than normal (I was running at 50 Hz rather than 20 Hz). It thus lost 131 MB/11000 files = 12 kB/file. This is significantly less than previous tests, which ranged from 33 to 90 kB/file. Repeated tests with the netCDF plugin show no memory leak: in 11000 files it did not leak even 1 kB. The memory leaks with the HDF5 plugin come in quanta of 64 MB, which suggests some specialized memory allocator, not just malloc() or C++ new. Does HDF5 itself have an allocator that allocates in chunks of 64 MB? |
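HDF5 does in fact maintain internal free lists that hold on to memory between calls, and the library exposes calls to limit and reclaim them. A small diagnostic sketch (my own suggestion, not from this thread; the limit values are arbitrary) that could help test whether those free lists are responsible:

```cpp
#include <hdf5.h>

// Constrain and reclaim HDF5's internal free-list memory. Calling
// H5garbage_collect() after each file is closed would show whether the
// 64 MB growth quanta come from the HDF5 free lists or from somewhere else.
static void limitHdf5InternalMemory(void)
{
  // Cap the regular, array and block free lists (global and per-list limits,
  // in bytes); the values here are illustrative only.
  H5set_free_list_limits(1048576, 65536, 1048576, 65536, 1048576, 65536);

  // Return currently unused free-list memory to the system.
  H5garbage_collect();
}
```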
Looking at the setup images, I guess one needs to set AutoSave to Yes in order to get the files created. I ran the same test on my box and it resulted in exactly 65536 bytes of VSZ increase between two consecutive runs of saving 1000 images. At a quick glance, the code in the HDF5 plugin treats attributes a bit differently than what I'm used to in other plugins; it maintains an additional list of attributes so that the array does not have to be copied when only the attributes are of interest. I have not tried other combinations of flags, but from what I'm reading above it would make sense to look into how attributes are being saved. I'll try to take out the code that does attribute saving, thereby establishing a baseline with no more leaks, albeit no attributes either. This should give a hint as to whether we are looking in the right direction. |
Hi @MarkRivers. Apologies, I completely missed this thread (@GDYendell pointed it out to me today). This is serious and we will look into it. We are just near the end of a very busy shutdown, so we can start work on this one after Wednesday next week. |
Yes, sorry I had not yet turned on that flag when I made the screen shot, but it does need to be set to Yes.
I think the units of VSZ are kB, so it was actually a 64MB increase for 1000 images. When I use simDetectorAttributes.xml I see either 64MB or 128MB for 1000 images. This is suspicious. What system is allocating memory in 64MB quanta? I don't think alloc()/malloc() do that, I see VSZ increase by 2MB when a new NDArray is allocated. Does the C++ new operator allocate in large chunks, or is this something being done in the HDF5 library memory allocator?
Actually the netCDF, TIFF and Nexus file plugins all do this as well. This is a search for the string pFileAttributes in all plugin source files:
So the TIFF and netCDF plugins do similar things to the HDF5 plugin with the attribute list. But the TIFF and netCDF file plugins do not have a memory leak. |
The HDF5 library could leak memory if hdf5 "objects" are not closed properly after use. Normally I would consider those objects "handles" and don't expect the leaks in that case to be more than a few bytes - but with 1000s of iterations that can of course build up... There are a few tricks to check whether objects have been left open - and also to force-close any open objects on file close. See these short articles from the HDF Group: https://confluence.hdfgroup.org/x/qsLoAg https://confluence.hdfgroup.org/x/qMLoAg So in order to determine if the leak is due to "open objects" in the HDF5 lib we could try:
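(As an illustration of the force-close trick those articles describe - this is my sketch, not the list that originally followed this comment - the file access property list can be told to strongly close every object still open when the file is closed:)

```cpp
#include <hdf5.h>

// Open an HDF5 file with a "strong" close degree: when H5Fclose() is called,
// the library force-closes any dataset, group, attribute or datatype handles
// that were accidentally left open against the file.
static hid_t openFileWithStrongClose(const char *path)
{
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fclose_degree(fapl, H5F_CLOSE_STRONG);
  hid_t file = H5Fopen(path, H5F_ACC_RDWR, fapl);
  H5Pclose(fapl);
  return file;
}
```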
I also found this article about false valgrind errors in HDF5 - but the numbers look so small compared to the 64MB number that I doubt this is the case...
Our primary diagnostic is not from valgrind, it is from watching VSZ grow with ps or top. |
I have added a call to H5Fget_obj_count() on line 1660 in NDFileHDF5.cpp:
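(The actual edit is not shown above; a sketch of how H5Fget_obj_count() is typically used for this kind of check, with assumed names, is:)

```cpp
#include <hdf5.h>
#include <stdio.h>

// Report how many HDF5 identifiers are still open against a file just before
// it is closed; anything beyond the file handle itself points at objects the
// plugin forgot to close.
static void reportOpenObjects(hid_t fileId)
{
  ssize_t nTotal  = H5Fget_obj_count(fileId, H5F_OBJ_ALL);
  ssize_t nDsets  = H5Fget_obj_count(fileId, H5F_OBJ_DATASET);
  ssize_t nGroups = H5Fget_obj_count(fileId, H5F_OBJ_GROUP);
  ssize_t nAttrs  = H5Fget_obj_count(fileId, H5F_OBJ_ATTR);
  fprintf(stderr, "open objects: total=%zd datasets=%zd groups=%zd attrs=%zd\n",
          nTotal, nDsets, nGroups, nAttrs);
}
```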
I also changed the other trace calls to plain fprintf(stderr, ...) calls, which is why there is so much output. Here is what I got right away after starting the same test:
Seems like we are not being tidy enough when closing the file. On another note, I did try to get HDF5 compiled with the valgrind-friendly memory checks enabled that @ulrikpedersen pointed out, but I cannot figure out which Makefile/g++ switch needs to be set, in adsupport-R1-4/supportApp/hdf5Src/Makefile I guess. |
I think that should be added around Makefile line 15. Add the flag to
Hmmm... Pretty close to being clean... Could that single remaining object be the file object itself? It would be interesting to add a call to H5close after closing the file and see whether or not the memory leak remains. |
I think the
raises this error. I was after the equivalent for the Makefile, but it is not clear to me how the
My hunch came from the comment from @MarkRivers that he sees far less memory leakage using Stream mode vs. Single mode. I'm running the test with
... and the first test run looks promising already:
Memory leak: 4460384 − 4460384 = 0 |
This is the patch to the ADCore master I used for the above results. |
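(The patch itself is not reproduced above. A minimal sketch of the shape of the workaround being discussed - my assumption, not the actual ADCore change - is to close the HDF5 library after the data file has been closed, which releases the library's internal state; HDF5 re-initializes itself automatically on the next H5* call:)

```cpp
#include <hdf5.h>

// Close the data file and then shut down the HDF5 library so that any memory
// it is holding internally (free lists, caches, open-identifier tables) is
// released back to the system.
static void closeFileAndLibrary(hid_t fileId)
{
  H5Fclose(fileId);  // close the file written by the plugin
  H5close();         // tear down the HDF5 library and free its internal state
}
```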
Yes, of course you're right 👍 I guess you can't use HDF5 from ADSupport if you want to re-configure it? (for Linux we build our own HDF5 and don't use ADSupport)
Cool! Then we know it's in the HDF5 library and even have a way to force it to be cleared. I would really like to work out exactly where we are leaking though 🤓
Could you just push your changed code to a new branch on your ADCore fork on https://github.com/hinxx/ADCore? |
Aaa, I see... I've gotten used to having it all come from the areaDetector code base ;). |
+1 on that. Also, trying to have these kinds of bugs noticed early (CI + valgrind?) might be a good investment for the future. I still think having a small and clean report (< 300 kB for a start) of potential leaks to look at would be nice to have.
Of course. I'll keep it minimal. |
I would like to see that in CI as well. We could add it to the current Travis CI - but that only builds and runs the unit tests, which are not likely to give the same valgrind result as running the simDetector IOC... Chances are the unit-test framework will add its own memory leaks as well ;-) |
I understand it is not as simple a task as running unit tests... In addition, we need to exercise the IOC as
True that, nothing is perfect. |
I found the flag that ADSupport can use for enabling the memory checker (not tested):
I'll play around with this as time permits, hopefully something good comes out of it. |
Closed via #390 |
I just noticed a very interesting performance problem. I am working on the master branch, prior to testing pull request #397. I first ran with the master branch code that has the H5close() call. The performance was excellent: it could save single HDF5 files at the full 100 frames/s rate at which I was running the simDetector. It also does not have a memory leak, as we know. I then commented out the call to H5close(), just to confirm that this would re-introduce the memory leak. To my surprise, not only did it re-introduce the memory leak, it also caused a huge performance hit. I was saving 1000 files in Single mode each time I ran the simDetector in Multiple mode with NumImages = 1000. I noticed that the frames/s the HDF5 plugin was able to achieve kept getting smaller the more files I saved. For the first few files it was doing 100 files/s, but then it rapidly slowed down. I then restarted the IOC and did a systematic test. The following is the HDF5 plugin RunTime PV after saving each 1000 files. It tells how long it took to save the last file in that batch of 1000.
So it started at 100 files/s, but after 1000 files it had already slowed down to 33 files/s. After 4000 files it had slowed to about 11.5 files/s, and in further tests it slowed to below 5 files/s. Clearly the problem is not just a memory leak; it is causing a major speed reduction as well. This problem goes away when the call to H5close() is uncommented. |
I just downloaded @hinxx's branch from his pull request. I merged the current master into that branch and tested it. The memory leak is gone, and so is the bad performance described above. So now, without calling H5close(), both problems are solved. Great work! I will remove the "HK" comments and merge into master. |
Closing via #397 |
The writer is in Single mode, without filters, no lazy open and no SWMR. I see the memory usage ramp up steadily while writing HDF5 files, in steps of ~10-60 kB.
I am not sure whether this leak comes from libhdf5 or NDFileHDF5.