Use a subset of reflections for sigma_m calculation in integrate #942
Conversation
Reviewing now - first rerunning some tests inside of
Seems to save a serious amount of time for the "normal" data set: wall-clock time went from 344 s to 285 s. Subsequent analysis suggests we should use the same subset of reflections for sigma_b as well, and also perform the zeta filtering upfront rather than just for sigma_m. Will test.
Filter reflections at the top, use a consistent set of reflections for both sigma_b and sigma_m, and as a courtesy report the number of reflections used. Also separated out the still case for clarity, though this means some slight code duplication.
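The idea of applying the zeta filter once, up front, so that both the sigma_b and sigma_m estimates see exactly the same reflections, can be sketched as below. This is a minimal illustration in plain NumPy, not the actual DIALS implementation (which works on flex arrays); the function name and the `min_zeta` cutoff value are assumptions for illustration only.

```python
import numpy as np

def filter_for_profile_modelling(zeta, min_zeta=0.05):
    """Apply the zeta filter once, up front.

    Returns a boolean mask selecting the reflections to be used for
    BOTH the sigma_b and sigma_m estimates, so the two fits share a
    consistent set.  `min_zeta` is a hypothetical cutoff, not the
    value used in DIALS.
    """
    keep = np.abs(zeta) >= min_zeta
    # As a courtesy, report how many reflections survive the filter.
    print(f"Using {keep.sum()} of {zeta.size} reflections for profile modelling")
    return keep

# Toy usage: two of these four reflections pass the cutoff.
zeta = np.array([0.0, 0.1, -0.2, 0.01])
mask = filter_for_profile_modelling(zeta)
print(mask.sum())  # 2
```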
Implemented the filter of reflections further up the stack, which saves another ~5 s. Now running the full test.
Overall processing results with xia2 - before changes in this PR:
after:
Wall-clock time for the integration changed from 344 s to 280 s.
Having now made commits to the branch, I am no longer qualified to comment on the pull request. Will ask others to provide input.
For inspection: Original: Latest:
Currently, in dials.integrate, all reflections are used to calculate sigma_m with a simplex minimiser, which takes a significant time for large datasets.
As this problem is hugely overdetermined, it should be possible to use a subset of reflections and obtain a result that is just as accurate. As dials.refine already selects a well-sampled subset of the data, it is proposed to use this subset if available, with at least 10000 reflections used in all cases.
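The selection strategy described above can be sketched as follows: prefer the subset that dials.refine already selected, and top it up to the 10000-reflection floor with a random draw if needed. This is a plain-NumPy illustration under assumed names (`select_profile_subset`, a boolean `used_in_refinement` flag array), not the actual DIALS code.

```python
import numpy as np

MIN_REFLECTIONS = 10000  # proposed floor on the subset size

def select_profile_subset(n_total, used_in_refinement, rng=None):
    """Pick indices of reflections for the sigma_m fit.

    Prefer the well-sampled subset dials.refine already chose
    (given here as a boolean array of length `n_total`); if that
    falls below the floor, top it up with a random draw from the
    remaining reflections.
    """
    rng = rng or np.random.default_rng(0)
    idx = np.flatnonzero(used_in_refinement)
    if idx.size >= MIN_REFLECTIONS:
        return idx
    rest = np.flatnonzero(~used_in_refinement)
    need = min(MIN_REFLECTIONS - idx.size, rest.size)
    extra = rng.choice(rest, size=need, replace=False)
    return np.sort(np.concatenate([idx, extra]))

# Toy usage: 50000 reflections, 4000 flagged by refinement,
# so the subset is topped up to the 10000 floor.
flags = np.zeros(50000, dtype=bool)
flags[:4000] = True
subset = select_profile_subset(50000, flags)
print(subset.size)  # 10000
```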
I have tested this on the beta-lactamase example used in the DIALS tutorial; on my MacBook the runtime for this calculation goes down from 42.9 s to 18.9 s out of an initial runtime of 415 s, a noticeable saving of around 5%. In this case, these are the differences in integration:
Master:
This branch:
So less than 1% difference in sigma_m.
Master:
This branch:
Then going on to scale:
Master:
This branch:
So it looks pretty similar overall.
It would be useful to also test on bigger datasets to see the difference.