Avoid integrate memory limit #1392
Conversation
Addresses #659. In the integrator, calculate the maximum memory needed; if this exceeds the available memory, split the reflection table into random subsets and process them by performing multiple passes over the imagesets. This applies only to the regular integrators (not the threaded one) and will only take effect in situations where processing would currently fail.
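In outline, the approach is something like the following minimal sketch. The helper names (estimate_max_memory, integrate_pass, random_subset_indices) and the psutil memory query are assumptions for illustration only; the actual integrator code is structured differently.

```python
import math
import random

import psutil  # assumption: psutil-style memory query; the real code uses DIALS machinery


def random_subset_indices(n_rows, n_subsets, seed=0):
    """Randomly partition range(n_rows) into n_subsets roughly equal groups."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    return [order[i::n_subsets] for i in range(n_subsets)]


def integrate_within_memory_limit(reflections, estimate_max_memory, integrate_pass):
    """Integrate, splitting the reflection table if memory would be exceeded.

    ``reflections`` is assumed to support len() and select(indices), as a
    DIALS reflection table does; ``estimate_max_memory`` and
    ``integrate_pass`` are hypothetical stand-ins for integrator internals.
    """
    required = estimate_max_memory(reflections)
    available = psutil.virtual_memory().available
    if required <= available:
        # Enough memory: the current single-pass behaviour is unchanged.
        return [integrate_pass(reflections)]

    # Smallest number of subsets such that each should fit in memory; each
    # subset costs one further pass over the imagesets.
    n_subsets = math.ceil(required / available)
    return [
        integrate_pass(reflections.select(indices))
        for indices in random_subset_indices(len(reflections), n_subsets)
    ]
```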
Codecov Report
@@            Coverage Diff             @@
##             main    #1392      +/-   ##
==========================================
- Coverage   66.61%   64.68%     -1.94%
==========================================
  Files         615      618         +3
  Lines       68843    69747       +904
  Branches     9585     9577         -8
==========================================
- Hits        45858    45114       -744
- Misses      21054    22845      +1791
+ Partials     1931     1788       -143
Before looking at the code: did you provide a mechanism to force this splitting behaviour? It seems that would be very useful for evaluating the impact on the data. I'm completely behind this as a stop-gap on the road to a proper solution.
There is no mechanism to force this. There should be no impact on the data, which should be exactly the same.
👍 @jbeilstenedmands thank you
I like this approach 👍 I would like to test and provide a review, which I can do next week.
This is now ready for review. A few extra things to note: I used a random split to divide the data, as a quick way to approximately halve the amount of memory needed per image during processing. This could in theory be done more systematically, but I didn't feel that was necessary.
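For reference, the kind of random split meant here can be sketched with the cctbx flex API; this is an illustration only, not the code in this PR.

```python
from dials.array_family import flex


def random_split(reflections, n_subsets):
    """Partition a reflection table into n_subsets random, disjoint subsets.

    A random selection samples reflections roughly uniformly across the
    images, so per-image shoebox memory falls by about 1/n_subsets.
    """
    n = len(reflections)
    order = flex.random_permutation(n)
    return [
        reflections.select(order.select(flex.size_t(list(range(i, n, n_subsets)))))
        for i in range(n_subsets)
    ]
```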
just a few minor comments.
Co-authored-by: Markus Gerstel <2102431+Anthchirp@users.noreply.github.com>
Sorry for the review spam. It looks like GitHub doesn't like rewrite suggestions spanning 40+ lines: it gave me no indication that it had actually created those comments, which is how we ended up with five of them.
To aid reviewing changes like this we now have a command-line tool which will match integration results and verify that they are identical (not yet polished). It returns nothing => all good. I have verified that the changes above make literally no difference to the integrated data in one case; will now review.
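The tool itself isn't shown in this thread, but the idea is roughly the following hypothetical script; the sort key and column names are illustrative, and the real tool is different.

```python
import sys

from dials.array_family import flex


def compare_integrated(path_a, path_b, col="intensity.sum.value"):
    """Print nothing if the integrated values match, otherwise report."""
    a = flex.reflection_table.from_file(path_a)
    b = flex.reflection_table.from_file(path_b)
    if len(a) != len(b):
        print(f"row counts differ: {len(a)} vs {len(b)}")
        return
    # Row order can differ when the table was split, so sort both tables
    # on a shared key before comparing element-wise.
    a.sort("miller_index")
    b.sort("miller_index")
    mismatches = (a[col] == b[col]).count(False)
    if mismatches:
        print(f"{mismatches} values differ in {col}")


if __name__ == "__main__":
    compare_integrated(sys.argv[1], sys.argv[2])
```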
I tried testing as above with the beta lactamase data, but it tripped up - actually during the first stage.
Looks like a memory allocation failure.
Thanks for the report @dagewa. Perhaps there is enough memory to process the table, but not also to hold everything else in memory as well. I believe integration has a max_memory_usage parameter ("The maximum percentage of available memory to use for allocating shoebox arrays"), so maybe I need to account for this here also.
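That is, the check would compare against a scaled limit rather than the raw available memory, along these lines (a sketch; the psutil query and the 0.75 default are assumptions for illustration, not the DIALS defaults):

```python
import math

import psutil  # assumption: psutil-style memory query


def n_subsets_needed(required_bytes, max_memory_usage=0.75):
    """Subsets needed so each fits within the fraction of available memory
    the integrator may use for shoebox arrays (the max_memory_usage
    parameter); the 0.75 default here is illustrative."""
    limit = psutil.virtual_memory().available * max_memory_usage
    return max(1, math.ceil(required_bytes / limit))
```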
Maybe it is worth sampling a range of memory limits to check the splitting behaviour.
Is more conservative and accounts for the max_memory_usage PHIL parameter
This PR attempts to provide a workaround for the longstanding dials.integrate memory crash issues (#659).
When there is not enough memory to process the data, the reflection table is simply split into subsets that fit within the memory limit, and each subset is processed individually for the integration step (profile modelling is still done on all reflections). This only affects cases where the processing would currently fail, at the 'cost' of multiple reads of the raw data.
I have been testing on the beta lactamase dataset initially, and will now look to perform further tests; others are welcome to test too.
Set a large sigma_m to force higher memory requirements.
Set the memory limit just above what is needed, so that splitting is not triggered (i.e. current behaviour): ulimit -v 1500000
Set the memory limit below what is needed, so that splitting is triggered: ulimit -v 1200000
Output datasets are identical.