In dials.integrate track down and prevent (ideally) cause of exit code = -9 #659
Multiple possible solutions for this - using threaded integration in preference to multiprocessing is one, better prediction of memory usage would be a second. Really needs a test case that provokes the problem, however. |
Possible solution 1 is in the works #833 |
This is still one of the most commonly-reported bugs, and in my eyes the most serious. We struggle to process "normal" data on a "normal" computer. For example, I have a dataset from a CCP4 workshop that is nothing particularly special: P3 cell with a=84.5 Å, c=107.0 Å, P6M detector, 360° at 0.1°/image and a moderately large mosaicity of 0.5°. I can't process this beyond 2.0 Å on my laptop with 16 GB RAM, even using just a single process! Ronan ran it on one of the CCP4 servers to 1.18 Å with nproc=1 and noted it required 31 GB of memory and took 7.5 hours to complete. XDS has no such problem. We are keeping all integrated data in RAM for no good reason. Once a spot is integrated, it is integrated. Dump it somewhere. Or at least dump its shoebox |
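The "dump it somewhere" suggestion above could be sketched as a spill-to-disk loop: integrate each reflection, write its shoebox out, and drop the in-memory copy immediately, so only one shoebox is ever resident. This is a minimal illustration only - `integrate_streaming`, the dict-based reflections, and the pickle spill file are all hypothetical, not the DIALS API:

```python
import os
import pickle
import tempfile

def integrate_streaming(reflections, integrate_one, spill_path):
    """Integrate reflections one at a time, spilling each finished
    shoebox to disk so only one is ever held in RAM.
    `reflections`: list of dicts with a 'shoebox' entry (stand-in data).
    `integrate_one`: any callable that reduces a shoebox to an intensity."""
    results = []
    with open(spill_path, "wb") as spill:
        for refl in reflections:
            intensity = integrate_one(refl["shoebox"])
            # Dump the shoebox, then pop it so the RAM copy is freed now,
            # not at the end of the whole integration run.
            pickle.dump(refl.pop("shoebox"), spill)
            results.append(intensity)
    return results

refls = [{"shoebox": [1, 2, 3]}, {"shoebox": [4, 5]}]
path = os.path.join(tempfile.gettempdir(), "shoeboxes.pkl")
print(integrate_streaming(refls, sum, path))  # [6, 9]
```

The key point is that peak memory is set by one shoebox plus the integrated values, rather than by the whole dataset.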
Excellent to see some diagnostic here, keeping all shoeboxes around is unnecessary and a very poor idea
Do you have pointers where the issues are centred?
|
I don't have any such pointers, just coming at it from a user perspective at the moment. Tried to process data, it failed. I did notice it failed during integration though, so after profile modelling. It integrated a couple of hundred images then died. I no longer have the logs for that run here though. |
If the offending images are somewhere public would help a lot! 👍
|
I can't do that for this case. I think a 3600 image fine-sliced dataset of thaumatin would do though. We can set the mosaicity to 0.5° manually |
Reproduced: /dls/i04/data/2020/cm26459-1/20200220/TestInsulin/ins_10/ins_10_1_master.h5
...
|
This was on my MacBook Pro with 16 GB RAM |
Thanks, ideal test case 👍 |
It bombed during integration with
finally failing here:
|
Disabling checks in the debugging-integrate-memory-use branch to see where this actually fails, rather than computing something and saying no
Oh, not very helpful
no useful stack trace 🤔 last console output
|
What are the options for inspecting memory use at a breakpoint in processing? I suspect it's shoeboxes that tip the balance, but it would be good to know for sure. |
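For pure-Python allocations, the stdlib `tracemalloc` module can answer exactly this at a breakpoint: start tracing, pause where you suspect the balance tips, and take a snapshot grouped by source line. A minimal sketch (the `bytearray` list is just a stand-in for shoebox data; note it will not see allocations made directly by C++ extensions):

```python
import tracemalloc

tracemalloc.start()

# Stand-in for shoebox data: ~1 MB of Python-level allocations.
big = [bytearray(1024) for _ in range(1000)]

# At a breakpoint, this shows which source lines own the most memory.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1024:.0f} KiB, peak: {peak / 1024:.0f} KiB")
```

Because boost.python shoeboxes are allocated on the C++ side, this would mostly confirm what is *not* Python's fault - which is still useful for narrowing things down.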
A memory profiler might help - valgrind? VTune also has a memory profiler that looks promising, started running things before I left for leave but didn’t get to dig in. |
Another option could be https://docs.python.org/3/library/faulthandler.html - just taking a look at this now, may try hacking in on my branch above |
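The `faulthandler` hack mentioned above is a few lines: enable it to get Python tracebacks on fatal signals (which is exactly what a SIGKILL-adjacent crash otherwise eats), and optionally register a user signal so a long-running job can be poked from outside with `kill -USR1 <pid>`:

```python
import faulthandler
import signal
import sys

# Dump Python tracebacks for all threads if the process dies on a
# fatal signal (SIGSEGV, SIGABRT, SIGBUS, ...).
faulthandler.enable()

# On-demand dump without stopping the process: kill -USR1 <pid>
# prints where every thread currently is.
if hasattr(signal, "SIGUSR1"):
    faulthandler.register(signal.SIGUSR1)

# One-off dump right now, e.g. at a suspicious point in processing.
faulthandler.dump_traceback(file=sys.stderr)
```

Caveat: exit code -9 means SIGKILL (usually the OOM killer), which cannot be caught, so `faulthandler` helps most with the crashes that die on a catchable signal or where you can dump state before the kernel steps in.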
Useful 🙂 |
Would guppy help, or would it be crippled by boost.python? |
Digging a bit here, I think the solution will be in the processor: pass the option not to save the shoeboxes down the stack, so we don't allocate / fill them outside the minimum scope where they are needed. Looking more at this now. |
Thanks for looking at it. I was wondering about moving shoeboxes out of the reflection table and into a map structure, leaving only keys into the map in the reflection table. Shoeboxes can then be consumed and deleted after use. But I wasn't 100% sure it is the shoeboxes at fault (though, say >95%) |
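The map-structure idea above can be sketched in a few lines: the reflection table holds only integer keys, shoeboxes live in a side store, and consuming a shoebox deletes it immediately. `ShoeboxStore` and the dict-of-lists table are hypothetical names for illustration, not DIALS types:

```python
class ShoeboxStore:
    """Side store for shoeboxes: the reflection table keeps only keys,
    and pop-on-consume frees each shoebox as soon as it has been used."""

    def __init__(self):
        self._boxes = {}
        self._next_key = 0

    def put(self, shoebox):
        key = self._next_key
        self._next_key += 1
        self._boxes[key] = shoebox
        return key

    def consume(self, key):
        # pop() both returns the shoebox and drops the store's reference,
        # so the memory is reclaimable immediately after integration.
        return self._boxes.pop(key)

store = ShoeboxStore()
table = [{"sbox_key": store.put([i] * 3)} for i in range(3)]
intensities = [sum(store.consume(r["sbox_key"])) for r in table]
print(intensities)        # [0, 3, 6]
print(len(store._boxes))  # 0 - nothing retained after use
```

The design win is that shoebox lifetime is decoupled from reflection-table lifetime: the table can live for the whole run while each shoebox lives only from fill to integrate.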
suggested in #659 specifically for debugging, but we should enable this whenever possible.
Looking at maximum resident size in memory it climbs fairly slowly then jumps as we go into the actual integration phase - this is I think where things go bad usually. This was "sensible" processing on a small molecule example so probably not representative - waiting on the smoking gnu above to finish running... |
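The maximum-resident-size tracking used above can be reproduced from the stdlib `resource` module (POSIX only; not available on Windows). One wrinkle worth a comment: `ru_maxrss` is reported in KiB on Linux but in bytes on macOS. A small sketch, with a 50 MB blob standing in for shoebox growth:

```python
import resource
import sys

def max_rss_mb():
    """Peak resident set size of this process, in MB.
    ru_maxrss is KiB on Linux but bytes on macOS."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return rss / (1024 * 1024)
    return rss / 1024

before = max_rss_mb()
blob = b"x" * (50 * 1024 * 1024)  # 50 MB actually written, so pages are touched
after = max_rss_mb()
print(f"peak RSS grew by ~{after - before:.0f} MB")
```

Logging this at phase boundaries (spot finding, profile modelling, integration) would show the jump into the integration phase directly.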
OK, troublesome task above (with slightly silly mosaic parameters) has finished and illustrates the issue well - From the output -
This shoebox memory estimate is sound, and the problem... |
Coming up for some air... it looks like the shoeboxes are kinda deallocated at the C++ level - there are still references to |
General synopsis at 13:15 hrs: looks like the shoeboxes are being freed in the C++ code - if we are not writing them out - and deleting the "shoebox" column from the reflection table makes very little difference, as this is just 6 numbers or so and a couple of pointers per reflection. |
Though... the memory estimate above seems spot on which makes me like 🤔 I checked the reference counts and I think the shoeboxes are being freed... before being unlinked from the shoebox structure the |
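Checking reference counts as described above can be done from the Python side with `sys.getrefcount`. The only trap is that the call itself holds a temporary reference, so values read one high; comparing counts before and after dropping a reference sidesteps that. A toy version, with a `bytearray` standing in for a shoebox buffer:

```python
import sys

shoebox = bytearray(10)          # stand-in for a shoebox pixel buffer
table_entry = {"shoebox": shoebox}

# sys.getrefcount counts its own argument, so every reading is one high;
# differences between readings are still exact.
before = sys.getrefcount(shoebox)  # name + dict entry + the call itself
del table_entry["shoebox"]
after = sys.getrefcount(shoebox)   # the dict no longer pins the buffer
print(before - after)  # 1: exactly one reference was released
```

For boost.python-wrapped objects this only counts Python-side references; a C++ container can keep the underlying data alive with no Python reference at all, which matches the "freed but still referenced" picture above.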
Testing with small molecule set with deliberately large profile model parameters - the memory does not go up constantly, so something is being freed somewhere. I am coming around to the idea that what is causing us problems is the steep rise in memory use between 0 and 50 frames at the start of actual integration. I am not sure we can avoid this whilst keeping the pixel data in memory. However: this is not a matter of us doing something dumb, this is a real problem. |
Looking at another thread - are our processing / profile parameters sensible? Compare what XDS reports with what DIALS uses, considering that in theory they have the same underlying model -
perform integration with both - manually setting the profile parameters to "simulate" the XDS job and write a little jiffy to get the differences for matched reflections. If our profile parameters were too large we would see simply added noise, and the average intensity would be the same. If the XDS parameters are too small the integrated intensity should be systematically lower. Script:
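The "little jiffy" could look something like the following: match reflections between the two jobs on Miller index and compare intensities. Everything here is a toy - the dicts and numbers are invented to illustrate the expected signature of the effect (smaller profile parameters clip the peak tails, so intensities come out systematically low):

```python
from statistics import mean

def intensity_ratio(job_a, job_b):
    """job_a / job_b map Miller index -> integrated intensity.
    Returns mean(I_a / I_b) over reflections matched in both jobs."""
    common = job_a.keys() & job_b.keys()
    return mean(job_a[h] / job_b[h] for h in common)

# Invented numbers: the 'xds_like' job uses smaller profile parameters,
# missing the tails of each peak, so its intensities are a few % low.
dials_like = {(1, 0, 0): 100.0, (0, 1, 0): 220.0, (0, 0, 1): 55.0}
xds_like = {(1, 0, 0): 92.0, (0, 1, 0): 205.0, (0, 0, 1): 50.0}

print(f"mean I_dials / I_xds = {intensity_ratio(dials_like, xds_like):.3f}")
```

A ratio consistently above 1 is the systematic reduction described in the conclusion below; pure noise from over-large profiles would leave the mean ratio at ~1.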
Result: Yes, there is a systematic difference in favour of the DIALS profile parameters being ~ right (at least, using smaller profile parameters makes for a systematic reduction in intensities which I interpret as missing the tails of the peak) |
|
Just for the record: In DIALS 2.1.3 (which is the version included in CCP4 7.1) a related issue shows up as
and this should be fixed in DIALS 2.2.3 via #1221 |
#659 In the integrator, calculate the maximum memory needed. If this exceeds the available memory, then split the reflection table into random subsets and process by performing multiple passes over the imagesets. Applies only to regular integrators (not threaded) and will only take effect in situations where processing currently exits.
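The split-into-passes logic described above can be sketched as: estimate the total shoebox memory, divide by what is available to get a pass count, and partition the reflection indices into random subsets, one per pass over the imagesets. All names and numbers here are illustrative, not the actual integrator code:

```python
import math
import random

def plan_passes(n_reflections, bytes_per_reflection, available_bytes, seed=0):
    """If predicted shoebox memory exceeds what is available, split the
    reflection indices into random subsets, one per pass over the images.
    Returns a list of index lists (one pass if everything fits)."""
    needed = n_reflections * bytes_per_reflection
    n_passes = max(1, math.ceil(needed / available_bytes))
    indices = list(range(n_reflections))
    random.Random(seed).shuffle(indices)  # random subsets, reproducibly
    chunk = math.ceil(n_reflections / n_passes)
    return [sorted(indices[i:i + chunk])
            for i in range(0, n_reflections, chunk)]

# 1000 reflections at ~3 MB each vs 1 GB available -> 3 passes.
passes = plan_passes(1000, bytes_per_reflection=3_000_000,
                     available_bytes=1_000_000_000)
print(len(passes))  # 3
```

The trade-off is exactly the one stated above: each extra pass re-reads the imagesets, so this only kicks in where processing would otherwise be killed.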
This issue has been automatically marked as stale because it has not had recent activity. The label will be removed automatically if any activity occurs. Thank you for your contributions. |
100% not at all stale - still an issue 😞 |
Believed to be memory allocation in the second step of integration, i.e. once the reference profiles are constructed: when it gets to integration it falls over. This is probably the most widely reported real bug.