Change blocking in integration for significant performance improvement #1396
Conversation
Currently, blocks always overlap by half the block size, leading to unnecessary double reading of the data. Instead, use larger blocks with the same overlaps.
Codecov Report
@@            Coverage Diff             @@
##           master    #1396      +/-   ##
==========================================
+ Coverage   65.66%   65.69%    +0.03%
==========================================
  Files         614      614
  Lines       69474    69494       +20
  Branches     9506     9510        +4
==========================================
+ Hits        45621    45657       +36
+ Misses      22017    22003       -14
+ Partials     1836     1834        -2
I might have misunderstood this, but isn't the fairly fine-grained blocking a feature, such that you get a different profile model as the scan evolves in φ? If we make big blocks, does that mean we are losing information about how the profile evolves along the scan?
@dagewa I think the profile models are still locally determined based on the xyz distance from a given spot to the surrounding reference profiles, and I do not think this would change. FWIW if you set
What I mean is that the reference profiles are determined in blocks in φ, right? Profile fitting is then done with xyz distance to reference profiles. Maybe this PR does touch the reference profile creation, in which case I withdraw the point. I'm just checking that there isn't some reason to believe we are losing explanatory power in the model here.
As with the parallel PR to split the reflection list if we have too many reflections, we should do a side-by-side comparison of the integration results to show they are identical. Your concerns are clearly valid and a core part of the PR process 🙂 I'll see if I can get the comparison script written.
Thanks for the comment @dagewa, I'll check regarding the profile model formation.
Thanks, I'll tag @jmp1985 here for comment. I think it might be a reasonable ultimate goal to remove the blockwise determination of profile models for a global scan-varying model. This would be in 3D to replace the blocks in detector X, Y and the blocks in φ. I think there was some work before on scan-varying profile models within each X, Y block, but with @jbeilstenedmands's work on smoothers in higher dimensions it should be possible to create a 3D model. |
The profile modelling determination method is unchanged here. The profile model 'blocking'/gridding is determined by a separate set of parameters in the profile.gaussian_rs scope:
This is unchanged between master and this PR, and the
Sounds good; it seems any profile modelling changes we aim for in the long run would be orthogonal to this then. 👍
I reckon this is ready for review now.
Having looked at the threaded integrator code, it seems that this idea basically exists within the threaded integrator if njobs > 1.
which is the same as my blocking in this case, offset by 1. I'll see if the existing code in the threaded integrator can be reused for this work. |
It seems that adopting code from the threaded integrator is not a good idea at this time. The main issue is that the threaded integrator does not currently work with multi-sweep data, and getting it to do so would require a significant amount of new C++ bookkeeping code. There are a few other details that seem to cause slightly different behaviour even for single-imageset datasets, which also require further investigation. For the stability of standard processing, I think it's best to keep this PR as is and review it in its current state.
The performance enhancements are compelling, especially to get this at no cost to processing quality (in fact, to slightly improve processing quality). 👍 I made a couple of suggestions only.
Currently, dials.integrate splits the data into blocks, with overlaps between the blocks chosen so that not too many reflections are split across block boundaries (the default threshold is 0.95, i.e. at least 95% of reflections are processed as full, unsplit reflections, with the overlap size depending on the mosaicity). However, with the current block determination routine, each block overlaps the previous one by half its size (e.g. 0-20, 10-30, 20-40, ...), so every image is read twice per pass.

A better approach is to use a number of large blocks equal to the number of processes, so that the overlaps are small as a fraction of the dataset. For example, if the image range is 0-400 and we're using nproc=4, use 4 blocks with image ranges 0-105, 95-205, 195-305, 295-400. This does not affect the memory required, as shoeboxes are properly deallocated as each block reads through its images, but the read time can be reduced by around a factor of 2 (the actual time spent integrating the reflections remains approximately the same). As the data are read in both the profile modelling and integration steps, this reduces the number of full data reads from about 4 to about 2 (the exact figure depends on how much overlap there is, which in turn depends on the mosaicity and dataset size).
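The large-block scheme described above can be sketched as follows. This is a minimal illustration, not the actual dials implementation: compute_blocks and its fixed overlap parameter are hypothetical, and in the real code the overlap size is derived from the mosaicity.

```python
# Hypothetical sketch of the new blocking scheme: split an image range
# into nproc blocks whose overlaps are a fixed small margin, rather
# than half a block.

def compute_blocks(first, last, nproc, overlap):
    """Split [first, last] into nproc blocks, extending each interior
    boundary by overlap // 2 images so neighbours share a small overlap."""
    n_images = last - first
    base = n_images / nproc
    blocks = []
    for i in range(nproc):
        start = first + round(i * base)
        end = first + round((i + 1) * base)
        if i > 0:
            start -= overlap // 2  # extend into the previous block
        if i < nproc - 1:
            end += overlap // 2    # extend into the next block
        blocks.append((start, end))
    return blocks

print(compute_blocks(0, 400, 4, 10))
# [(0, 105), (95, 205), (195, 305), (295, 400)]
```

With overlap=10 this reproduces the 0-105, 95-205, 195-305, 295-400 example from the description; the total overlap (30 images) is a small fraction of the 400-image dataset, versus half of every block in the old scheme.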
In addition, a much lower percentage of the overall reflections are split, so more reflections are integrated as full reflections, which should result in better intensity estimates; the logs show improved cc_spearman sum/prf and cc_pearson sum/prf.
As an example of the performance benefits, we're looking at removing up to ~35% of the runtime:
Beta-lactamase runtimes, MacBook nproc=4: branch 187s vs master 290s (35% saved).
Insulin dataset (see below) runtimes, linux diamond-ws357 nproc=8: branch 403s vs master 638s (37% saved).
The improvement for a given dataset of course depends on the ratio of read time to process time, but the read time usually seems to be a large fraction.
Todo: verify handling of multi-imageset/multi-experiment datasets, add more tests.
Example output from logs:
beta-lactamase, master
Automatically determined 239 blocks; 1-6, 4-9, 7-12, ..., 715-720.
beta-lactamase, PR branch
Automatically determined 4 blocks; 1-183, 181-363, 361-543, 541-720.
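As a back-of-the-envelope check of the read savings, the block ranges from the beta-lactamase log excerpts above can be used to count how many images each scheme touches per pass. This is an illustrative sketch, not dials code; images_read is a hypothetical helper, and the master block list is reconstructed from the 1-6, 4-9, 7-12, ..., 715-720 pattern in the log.

```python
# Illustrative only: total images read per pass under each blocking
# scheme, using the block ranges reported in the logs above.

def images_read(blocks):
    # inclusive image ranges, so each block spans end - start + 1 images
    return sum(end - start + 1 for start, end in blocks)

# master: 239 blocks of 6 images, stepping by 3, over images 1-720
master_blocks = [(1 + 3 * i, 6 + 3 * i) for i in range(239)]
assert master_blocks[0] == (1, 6) and master_blocks[-1] == (715, 720)

# PR branch: 4 large blocks with small overlaps
branch_blocks = [(1, 183), (181, 363), (361, 543), (541, 720)]

print(images_read(master_blocks))  # 1434, nearly 2x the 720 images
print(images_read(branch_blocks))  # 729, only 9 images more than 720
```

So per pass the half-overlapping blocks read essentially every image twice, while the large blocks re-read only the 9 overlap images; over the two data-reading passes (profile modelling and integration) this is the drop from ~4 reads to ~2 described above.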
Insulin dataset /dls/i04/data/2020/cm26459-1/20200220/TestInsulin/ins_10/ins_10_1_master.h5 (dataset used in testing #659 ).
Insulin master
Automatically determined 89 blocks; 1-40, 21-60, 41-80, ..., 1761-1800
Insulin PR branch
Automatically determined 8 blocks; 1-243, 223-246, ..., 1562-1800.