Change blocking in integration for significant performance improvement #1396

Merged
merged 8 commits into from
Oct 14, 2020

jbeilstenedmands
Contributor

Currently, dials.integrate splits the data into blocks, with the overlaps between blocks chosen so that not too many reflections are split across blocks (the default threshold is 0.95, i.e. at least 95% of the data will be processed as fulls, with the overlap size depending on the mosaicity). However, because the current block determination routine makes each block overlap half of the previous one (e.g. 0-20, 10-30, 20-40, ...), every image is effectively read twice.
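
To make the "read twice" point concrete, here is a minimal sketch of half-overlapping blocks. It is not the dials.integrate routine: `half_overlap_blocks` and `read_multiple` are hypothetical helpers, and the 0-based half-open frame ranges are an assumption made for the illustration. The block sizes (6 and 40) and dataset lengths (720 and 1800 images) are taken from the master-branch logs quoted further down.

```python
def half_overlap_blocks(n_frames, block_size):
    """Return 0-based half-open (start, stop) blocks, each shifted by half a block.

    Hypothetical helper for illustration only, not the DIALS implementation.
    """
    step = max(1, block_size // 2)
    blocks = [(s, s + block_size) for s in range(0, n_frames - block_size + 1, step)]
    # Cover any trailing frames the regular stride did not reach.
    if not blocks or blocks[-1][1] < n_frames:
        blocks.append((max(0, n_frames - block_size), n_frames))
    return blocks


def read_multiple(blocks, n_frames):
    """Total frames read across all blocks, as a multiple of the dataset size."""
    return sum(stop - start for start, stop in blocks) / n_frames


for n_frames, block_size in [(720, 6), (1800, 40)]:
    blocks = half_overlap_blocks(n_frames, block_size)
    print(len(blocks), round(read_multiple(blocks, n_frames), 2))
# prints "239 1.99" then "89 1.98"
```

The block counts (239 and 89) agree with the "Automatically determined ... blocks" lines in the master logs, and in both cases close to twice the dataset is read.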

A better approach is to use as many blocks as there are processes, so that each block is large and the overlaps are small as a fraction of the dataset. For example, if the image range is 0-400 and we're using nproc=4, use 4 blocks with image ranges 0-105, 95-205, 195-305, 295-400. This does not affect the memory needed, as shoeboxes are properly deallocated as each block reads through its images; however, the read time can be reduced by around a factor of 2 (the actual process time to integrate the reflections remains approximately the same). As the data are read in both the profile modelling and integration steps, this reduces the number of data reads from 4 to around 2 (the exact figure depends on how much overlap there is, which in turn depends on mosaicity and dataset size).
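
The worked example above can be sketched as follows. This is only an illustration under stated assumptions: `large_blocks` is a hypothetical helper, the overlap of 10 images is inferred from the example ranges, and splitting the overlap symmetrically around each block boundary is an assumption that reproduces those ranges (the blocks in the PR-branch logs below are laid out slightly differently).

```python
def large_blocks(start, stop, nproc, overlap):
    """Split [start, stop) into nproc blocks sharing `overlap` frames at each join.

    Hypothetical helper for illustration only, not the DIALS implementation.
    """
    n_frames = stop - start
    half = overlap // 2
    # Evenly spaced block boundaries, e.g. 0, 100, 200, 300, 400.
    edges = [start + (i * n_frames) // nproc for i in range(nproc + 1)]
    # Widen each block by half the overlap on both sides, clamped to the scan.
    return [
        (max(start, lo - half), min(stop, hi + half))
        for lo, hi in zip(edges[:-1], edges[1:])
    ]


blocks = large_blocks(0, 400, nproc=4, overlap=10)
print(blocks)  # [(0, 105), (95, 205), (195, 305), (295, 400)]

frames_read = sum(hi - lo for lo, hi in blocks)
print(frames_read / 400)  # 1.075
```

With this layout the images read per pass amount to about 1.08 times the dataset, rather than roughly 2 times with half-overlapping blocks.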

In addition, a much lower percentage of the overall reflections are split, so more reflections are integrated as full reflections which should result in better intensity estimates - the logs show improved cc_spearman sum/prf and cc_pearson sum/prf.

As an example of the performance benefits, we're looking at taking off around 35-37% of the runtime:
Beta-lactamase runtimes, MacBook, nproc=4: branch 187s vs master 290s (35% saved).
Insulin dataset (see below) runtimes, Linux diamond-ws357, nproc=8: branch 403s vs master 638s (37% saved).
The improvement for a dataset of course depends on the read time vs process time but usually the read time seems a large fraction.
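
For reference, the quoted savings follow directly from the runtimes. A minimal check, where `fraction_saved` is a hypothetical helper name:

```python
def fraction_saved(master_s, branch_s):
    """Fraction of the master runtime removed by the branch."""
    return (master_s - branch_s) / master_s


# (master, branch) runtimes in seconds, as quoted above.
runs = {
    "beta-lactamase, nproc=4": (290, 187),
    "insulin, nproc=8": (638, 403),
}

for name, (master_s, branch_s) in runs.items():
    print(f"{name}: {fraction_saved(master_s, branch_s):.1%} saved")
# beta-lactamase, nproc=4: 35.5% saved
# insulin, nproc=8: 36.8% saved
```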

Todo: verify handling of multi-imageset/multi-experiment datasets, add more tests.

Example output from logs:
beta-lactamase, master
Automatically determined 239 blocks; 1-6, 4-9, 7-12, ..., 715-720.

        +---------------------------------------+-----------+----------+--------+
        | Item                                  |   Overall |      Low |   High |
        |---------------------------------------+-----------+----------+--------|
        | dmin                                  |      1.21 |     3.28 |   1.21 |
        | dmax                                  |     69.31 |    69.31 |   1.23 |
        | number fully recorded                 | 362348    | 25986    | 382    |
        | number partially recorded             |  17684    |  1514    |   3    |
        | number with invalid background pixels |  89360    |  5055    | 353    |
        | number with invalid foreground pixels |  52845    |  3567    | 165    |
        | number with overloaded pixels         |      0    |     0    |   0    |
        | number in powder rings                |      0    |     0    |   0    |
        | number processed with summation       | 323345    | 23569    | 220    |
        | number processed with profile fitting | 317255    | 23303    | 134    |
        | number failed in background modelling |   2980    |   866    |   0    |
        | number failed in summation            |  52845    |  3567    | 165    |
        | number failed in profile fitting      |  58935    |  3833    | 251    |
        | ibg                                   |      0.88 |     3.61 |   0.08 |
        | i/sigi (summation)                    |      7.09 |    41.46 |   0.31 |
        | i/sigi (profile fitting)              |      7.33 |    43.18 |   0.24 |
        | cc prf                                |      0.72 |     0.86 |   0.47 |
        | cc_pearson sum/prf                    |      0.96 |     0.96 |   0.59 |
        | cc_spearman sum/prf                   |      0.95 |     0.98 |   0.6  |
        +---------------------------------------+-----------+----------+--------+
        
        Timing information for integration
        +-------------------+----------------+
        | Read time         | 211.61 seconds |
        | Extract time      | 4.23 seconds   |
        | Pre-process time  | 0.51 seconds   |
        | Process time      | 298.98 seconds |
        | Post-process time | 0.00 seconds   |
        | Total time        | 544.48 seconds |
        +-------------------+----------------+

beta-lactamase, PR branch
Automatically determined 4 blocks; 1-183, 181-363, 361-543, 541-720.

        +---------------------------------------+-----------+----------+--------+
        | Item                                  |   Overall |      Low |   High |
        |---------------------------------------+-----------+----------+--------|
        | dmin                                  |      1.21 |     3.28 |   1.21 |
        | dmax                                  |     69.31 |    69.31 |   1.23 |
        | number fully recorded                 | 366066    | 26283    | 382    |
        | number partially recorded             |   2636    |   238    |   3    |
        | number with invalid background pixels |  87699    |  4892    | 353    |
        | number with invalid foreground pixels |  51670    |  3428    | 165    |
        | number with overloaded pixels         |      0    |     0    |   0    |
        | number in powder rings                |      0    |     0    |   0    |
        | number processed with summation       | 316604    | 23056    | 220    |
        | number processed with profile fitting | 312145    | 22918    | 134    |
        | number failed in background modelling |   2782    |   832    |   0    |
        | number failed in summation            |  51670    |  3428    | 165    |
        | number failed in profile fitting      |  56129    |  3566    | 251    |
        | ibg                                   |      0.88 |     3.63 |   0.08 |
        | i/sigi (summation)                    |      7.19 |    42.12 |   0.31 |
        | i/sigi (profile fitting)              |      7.31 |    42.9  |   0.24 |
        | cc prf                                |      0.72 |     0.86 |   0.47 |
        | cc_pearson sum/prf                    |      1    |     1    |   0.59 |
        | cc_spearman sum/prf                   |      0.96 |     1    |   0.6  |
        +---------------------------------------+-----------+----------+--------+
        
        Timing information for integration
        +-------------------+----------------+
        | Read time         | 88.84 seconds  |
        | Extract time      | 2.46 seconds   |
        | Pre-process time  | 0.47 seconds   |
        | Process time      | 268.74 seconds |
        | Post-process time | 0.00 seconds   |
        | Total time        | 382.76 seconds |
        +-------------------+----------------+

Insulin dataset /dls/i04/data/2020/cm26459-1/20200220/TestInsulin/ins_10/ins_10_1_master.h5 (dataset used in testing #659 ).
Insulin master
Automatically determined 89 blocks; 1-40, 21-60, 41-80, ..., 1761-1800

        +---------------------------------------+-----------+----------+--------+
        | Item                                  |   Overall |      Low |   High |
        |---------------------------------------+-----------+----------+--------|
        | dmin                                  |      1.48 |     4.01 |   1.48 |
        | dmax                                  |     55.64 |    55.64 |   1.5  |
        | number fully recorded                 | 177230    | 15108    |  75    |
        | number partially recorded             |  17249    |  1951    |   2    |
        | number with invalid background pixels |  54285    |  3980    |  52    |
        | number with invalid foreground pixels |  23788    |  1931    |  27    |
        | number with overloaded pixels         |      1    |     1    |   0    |
        | number in powder rings                |      0    |     0    |   0    |
        | number processed with summation       | 168544    | 14954    |  50    |
        | number processed with profile fitting | 160610    | 14363    |  25    |
        | number failed in background modelling |    622    |   157    |   0    |
        | number failed in summation            |  23788    |  1931    |  27    |
        | number failed in profile fitting      |  31722    |  2522    |  52    |
        | ibg                                   |      0.08 |     0.12 |   0.02 |
        | i/sigi (summation)                    |     15.77 |    58.75 |   0.55 |
        | i/sigi (profile fitting)              |     16.93 |    63.57 |   0.33 |
        | cc prf                                |      0.71 |     0.88 |   0.39 |
        | cc_pearson sum/prf                    |      0.96 |     0.96 |   0.75 |
        | cc_spearman sum/prf                   |      0.98 |     0.97 |   0.72 |
        +---------------------------------------+-----------+----------+--------+
        
        Timing information for integration
        +-------------------+-----------------+
        | Read time         | 1764.10 seconds |
        | Extract time      | 27.65 seconds   |
        | Pre-process time  | 0.22 seconds    |
        | Process time      | 413.53 seconds  |
        | Post-process time | 0.00 seconds    |
        | Total time        | 2640.35 seconds |
        +-------------------+-----------------+

Insulin PR branch
Automatically determined 8 blocks; 1-243, 223-246, ..., 1562-1800.

        +---------------------------------------+-----------+----------+--------+
        | Item                                  |   Overall |      Low |   High |
        |---------------------------------------+-----------+----------+--------|
        | dmin                                  |      1.48 |     4.01 |   1.48 |
        | dmax                                  |     55.64 |    55.64 |   1.5  |
        | number fully recorded                 | 180324    | 15486    |  75    |
        | number partially recorded             |   4086    |   393    |   2    |
        | number with invalid background pixels |  52736    |  3728    |  52    |
        | number with invalid foreground pixels |  23123    |  1781    |  27    |
        | number with overloaded pixels         |      1    |     1    |   0    |
        | number in powder rings                |      0    |     0    |   0    |
        | number processed with summation       | 160982    | 14066    |  50    |
        | number processed with profile fitting | 156617    | 13889    |  25    |
        | number failed in background modelling |    576    |   119    |   0    |
        | number failed in summation            |  23123    |  1781    |  27    |
        | number failed in profile fitting      |  27488    |  1958    |  52    |
        | ibg                                   |      0.08 |     0.12 |   0.02 |
        | i/sigi (summation)                    |     16.24 |    61.47 |   0.55 |
        | i/sigi (profile fitting)              |     16.73 |    62.94 |   0.33 |
        | cc prf                                |      0.71 |     0.89 |   0.39 |
        | cc_pearson sum/prf                    |      0.99 |     0.99 |   0.75 |
        | cc_spearman sum/prf                   |      0.99 |     1    |   0.72 |
        +---------------------------------------+-----------+----------+--------+
        
        Timing information for integration
        +-------------------+-----------------+
        | Read time         | 983.63 seconds  |
        | Extract time      | 22.52 seconds   |
        | Pre-process time  | 0.22 seconds    |
        | Process time      | 410.37 seconds  |
        | Post-process time | 0.00 seconds    |
        | Total time        | 1840.23 seconds |
        +-------------------+-----------------+
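
The numbers in the four log excerpts can be compared with a short script. This is a sketch only: the values are copied from the "number fully recorded", "number partially recorded" and "Read time" rows above, and expressing partials as a fraction of fulls plus partials is an assumption made for this comparison.

```python
# (fully recorded, partially recorded, read time in seconds), copied from the logs.
logs = {
    ("beta-lactamase", "master"): (362348, 17684, 211.61),
    ("beta-lactamase", "branch"): (366066, 2636, 88.84),
    ("insulin", "master"): (177230, 17249, 1764.10),
    ("insulin", "branch"): (180324, 4086, 983.63),
}


def partial_fraction(full, partial):
    """Share of recorded reflections that were only partially recorded."""
    return partial / (full + partial)


for dataset in ("beta-lactamase", "insulin"):
    m_full, m_part, m_read = logs[(dataset, "master")]
    b_full, b_part, b_read = logs[(dataset, "branch")]
    print(
        f"{dataset}: partials {partial_fraction(m_full, m_part):.1%} -> "
        f"{partial_fraction(b_full, b_part):.1%}, "
        f"read time reduced {m_read / b_read:.2f}x"
    )
```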

Currently, blocks always overlap by half the block size, leading
to unnecessary double reading of the data. Instead, use larger
blocks with the same overlaps.
@codecov

codecov bot commented Sep 7, 2020

Codecov Report

Merging #1396 into master will increase coverage by 0.03%.
The diff coverage is 94.73%.

@@            Coverage Diff             @@
##           master    #1396      +/-   ##
==========================================
+ Coverage   65.66%   65.69%   +0.03%     
==========================================
  Files         614      614              
  Lines       69474    69494      +20     
  Branches     9506     9510       +4     
==========================================
+ Hits        45621    45657      +36     
+ Misses      22017    22003      -14     
+ Partials     1836     1834       -2     

@dagewa
Member

dagewa commented Sep 8, 2020

I might have misunderstood this, but isn't the fairly fine-grained blocking a feature, such that you get a different profile model as the scan evolves in φ? If we make big blocks does that mean we are losing information about how the profile evolves along the scan?

@graeme-winter
Contributor

@dagewa I think the profile models are still locally determined based on the xyz distance from a given spot to the surrounding reference profiles, and I do not think this would change. FWIW if you set nproc=1 you get all the data integrated in a single block, ISTR

@dagewa
Member

dagewa commented Sep 8, 2020

What I mean is that the reference profiles are determined in blocks in φ, right? Profile fitting is then done with xyz distance to reference profiles. Maybe this PR doesn't touch the reference profile creation, in which case I'll stand down. I'm just checking that there isn't some reason to believe that we are losing explanatory power in the model here.

@graeme-winter
Contributor

As with the parallel PR to split the reflection list if we have too many reflections, we should do a side-by-side comparison of the integration results to show they are identical. Your concerns are clearly valid and a core part of the PR process 🙂

I'll see if I can get the comparison script written.

@jbeilstenedmands
Contributor Author

Thanks for the comment @dagewa, I'll check with regard to the profile model formation.

@dagewa
Member

dagewa commented Sep 8, 2020

Thanks, I'll tag @jmp1985 here for comment. I think it might be a reasonable ultimate goal to remove the blockwise determination of profile models for a global scan-varying model. This would be in 3D to replace the blocks in detector X, Y and the blocks in φ. I think there was some work before on scan-varying profile models within each X, Y block, but with @jbeilstenedmands's work on smoothers in higher dimensions it should be possible to create a 3D model.

@jbeilstenedmands
Contributor Author

The profile modelling determination method is unchanged here. The profile model 'blocking'/gridding is determined by a separate set of parameters in the profile.gaussian_rs scope:

    fitting {
      scan_step = 5
        .help = "Space between profiles in degrees"
        .type = float(allow_none=True)
      grid_size = 5
        .help = "The size of the profile grid."
        .type = int(allow_none=True)
      threshold = 0.02
        .help = "The threshold to use in reference profile"
        .type = float(allow_none=True)
      grid_method = single *regular_grid circular_grid spherical_grid
        .help = "Select the profile grid method"
        .type = choice
      fit_method = *reciprocal_space detector_space
        .help = "The fitting method"
        .type = choice
      detector_space {
        deconvolution = False
          .help = "Do deconvolution in detector space"
          .type = bool
      }
    }

This is unchanged between master and this PR, and the Summary of profile model table in the log output confirms that the same gridding is used in both cases.
dials.integrate.master.log
dials.integrate.branch.log

@dagewa
Member

dagewa commented Sep 8, 2020

Sounds good, it sounds like any profile modelling changes we aim for in the long run would be orthogonal to this then. 👍

@jbeilstenedmands jbeilstenedmands marked this pull request as ready for review September 9, 2020 15:09
@jbeilstenedmands
Contributor Author

I reckon this is ready for review now.
The only thing I wasn't sure about is the mp.njobs parameter. If you have nproc=4 and njobs=2, I assume you want to split each dataset into 8 blocks (rather than just splitting based on nproc), as you effectively have 8 processors/workers?

@jbeilstenedmands
Contributor Author

Having looked at the threaded integrator code, it seems that this idea basically exists within the threaded integrator if njobs > 1.
e.g. for integrator=3d_threaded, njobs=4, the following blocking is made:

+-----+--------------+------------+--------------+------------+-----------------+
|   # |   Frame From |   Frame To |   Angle From |   Angle To |   # Reflections |
|-----+--------------+------------+--------------+------------+-----------------|
|   0 |            0 |        182 |            0 |         91 |           25790 |
|   1 |          180 |        362 |           90 |        181 |           23535 |
|   2 |          360 |        542 |          180 |        271 |           23756 |
|   3 |          540 |        720 |          270 |        360 |           23354 |
+-----+--------------+------------+--------------+------------+-----------------+

which is the same as my blocking in this case, offset by 1.
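
A quick way to see the offset is to subtract the two sets of frame ranges. The values below are copied from the threaded-integrator table above and from the PR-branch beta-lactamase log earlier in this thread; nothing here calls DIALS itself.

```python
# Frame ranges as printed in the two outputs.
threaded = [(0, 182), (180, 362), (360, 542), (540, 720)]   # integrator=3d_threaded, njobs=4
pr_branch = [(1, 183), (181, 363), (361, 543), (541, 720)]  # this PR, nproc=4

# Difference (PR minus threaded) in the start and end frame of each block.
offsets = [(p0 - t0, p1 - t1) for (t0, t1), (p0, p1) in zip(threaded, pr_branch)]
print(offsets)  # [(1, 1), (1, 1), (1, 1), (1, 0)]
```

Every start frame differs by 1, as does every end frame except the last, where both outputs stop at the final image, 720.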

I'll see if the existing code in the threaded integrator can be reused for this work.

@jbeilstenedmands
Contributor Author

jbeilstenedmands commented Sep 17, 2020

It seems that adopting some code from the threaded integrator is not a good idea at this time. The main issue is that the threaded integrator does not currently work with multi-sweep data, and getting it to do that requires adding a significant amount of new C++ bookkeeping code. There are a few other details which seem to cause slightly different behaviour even for single-imageset datasets, which also require further investigation.

For stability of standard processing I think it's best to keep this PR as is and review in its current state.
I'll look at fixing the threaded integrator separately and then see if the code can be combined in future.

Member

@dagewa dagewa left a comment

The performance enhancements are compelling, especially to get this at no cost to processing quality (in fact, to slightly improve processing quality). 👍 I made a couple of suggestions only.

@jbeilstenedmands jbeilstenedmands merged commit 472ebb5 into master Oct 14, 2020
@jbeilstenedmands jbeilstenedmands deleted the improve_block_size branch October 14, 2020 09:59