Change blocking in integration for significant performance improvement #1396

jbeilstenedmands · 2020-09-07T17:57:47Z

Currently, dials.integrate splits the data into blocks, with the overlap between blocks chosen so that not too many reflections are split across blocks (the default threshold is 0.95, i.e. at least 95% of the data will be processed as fulls, with the overlap size depending on the mosaicity). However, the current block determination routine makes each block overlap half of the previous block (e.g. 0-20, 10-30, 20-40, ...), so the data must effectively be read twice.

A better approach is to use large blocks, with the number of blocks equal to the number of processes, so that the overlaps are small as a fraction of the dataset. For example, if the image range above is 0-400 and we're using nproc=4, use 4 blocks with image ranges 0-105, 95-205, 195-305, 295-400. This does not affect the memory needed, as shoeboxes are properly deallocated as each block reads through its images; however, the read time can be reduced by around a factor of 2 (the actual process time to integrate the reflections remains approximately the same). As the data are read in both the profile modelling and integration steps, this reduces the number of data reads from 4 to around 2 (the exact figure depends on how much overlap there is, which in turn depends on mosaicity and dataset size).
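The two blocking schemes can be compared with a minimal sketch. This is not the code in algorithms/integration/processor.py: the function names, the fixed 10-image overlap and the equal-width split are assumptions chosen only to reproduce the 0-400, nproc=4 example above.

```python
def half_overlap_blocks(start, stop, block_size):
    """Old-style blocks: each block starts half a block after the previous one."""
    step = block_size // 2
    blocks = []
    lo = start
    while lo + block_size < stop:
        blocks.append((lo, lo + block_size))
        lo += step
    blocks.append((lo, stop))  # final block ends at the end of the scan
    return blocks


def per_process_blocks(start, stop, nproc, overlap):
    """One block per process, widened by half the overlap at interior edges.

    Hypothetical helper: 'overlap' is passed in directly here, whereas in
    DIALS it depends on the mosaicity.
    """
    width = (stop - start) / nproc
    half = overlap // 2
    blocks = []
    for i in range(nproc):
        lo = start + round(i * width)
        hi = start + round((i + 1) * width)
        # Clamp so the first and last blocks do not run off the scan.
        blocks.append((max(start, lo - half), min(stop, hi + half)))
    return blocks


def images_read(blocks):
    """Total number of image reads, counting overlapped images once per block."""
    return sum(hi - lo for lo, hi in blocks)


old = half_overlap_blocks(0, 400, 20)
new = per_process_blocks(0, 400, 4, 10)
print(old[:3], "...", old[-1])
print(new)
print(images_read(old) / 400, images_read(new) / 400)
```

Under these assumptions the half-overlap scheme reads each image about 1.95 times on average, against about 1.08 times with one block per process.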

In addition, a much lower percentage of the overall reflections are split, so more reflections are integrated as full reflections, which should result in better intensity estimates; the logs show improved cc_spearman sum/prf and cc_pearson sum/prf.

As an example of the performance benefits, we're looking at taking off up to ~35% of the runtime.
Beta-lactamase runtimes, macbook nproc=4, branch 187s vs master 290s. (35% saved)
Insulin dataset (see below) runtimes, linux diamond-ws357 nproc=8, branch 403s vs 638s master (37% saved)
The improvement for a given dataset of course depends on the ratio of read time to process time, but the read time usually seems to be a large fraction of the total.
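The dependence on read time can be made concrete with a rough model, assuming the runtime is dominated by read and process time and that only the read time changes. The helper names and the 70% read fraction are illustrative assumptions, not measurements; the 290 s, 187 s, 638 s and 403 s figures are the runtimes quoted above.

```python
def fraction_saved(master_seconds, branch_seconds):
    """Fraction of the master runtime removed on the branch."""
    return (master_seconds - branch_seconds) / master_seconds


def expected_saving(read_fraction, read_speedup):
    """Predicted runtime saving if only the read time gets faster.

    read_fraction: share of the original runtime spent reading (assumed).
    read_speedup: factor by which the read time is reduced.
    """
    return read_fraction * (1.0 - 1.0 / read_speedup)


print(f"beta-lactamase: {fraction_saved(290, 187):.1%} saved")
print(f"insulin:        {fraction_saved(638, 403):.1%} saved")
# Hypothetical case: 70% of the runtime is reading and the read time halves.
print(f"model:          {expected_saving(0.7, 2.0):.1%} saved")
```

In this model a halved read time only approaches a 50% saving when reading accounts for almost all of the runtime.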

Todo: verify handling of multi-imageset/multi-experiment datasets, add more tests.

Example output from logs:
beta-lactamase, master
Automatically determined 239 blocks; 1-6, 4-9, 7-12, ..., 715-720.

        +---------------------------------------+-----------+----------+--------+
        | Item                                  |   Overall |      Low |   High |
        |---------------------------------------+-----------+----------+--------|
        | dmin                                  |      1.21 |     3.28 |   1.21 |
        | dmax                                  |     69.31 |    69.31 |   1.23 |
        | number fully recorded                 | 362348    | 25986    | 382    |
        | number partially recorded             |  17684    |  1514    |   3    |
        | number with invalid background pixels |  89360    |  5055    | 353    |
        | number with invalid foreground pixels |  52845    |  3567    | 165    |
        | number with overloaded pixels         |      0    |     0    |   0    |
        | number in powder rings                |      0    |     0    |   0    |
        | number processed with summation       | 323345    | 23569    | 220    |
        | number processed with profile fitting | 317255    | 23303    | 134    |
        | number failed in background modelling |   2980    |   866    |   0    |
        | number failed in summation            |  52845    |  3567    | 165    |
        | number failed in profile fitting      |  58935    |  3833    | 251    |
        | ibg                                   |      0.88 |     3.61 |   0.08 |
        | i/sigi (summation)                    |      7.09 |    41.46 |   0.31 |
        | i/sigi (profile fitting)              |      7.33 |    43.18 |   0.24 |
        | cc prf                                |      0.72 |     0.86 |   0.47 |
        | cc_pearson sum/prf                    |      0.96 |     0.96 |   0.59 |
        | cc_spearman sum/prf                   |      0.95 |     0.98 |   0.6  |
        +---------------------------------------+-----------+----------+--------+
        
        Timing information for integration
        +-------------------+----------------+
        | Read time         | 211.61 seconds |
        | Extract time      | 4.23 seconds   |
        | Pre-process time  | 0.51 seconds   |
        | Process time      | 298.98 seconds |
        | Post-process time | 0.00 seconds   |
        | Total time        | 544.48 seconds |
        +-------------------+----------------+

beta-lactamase, PR branch
Automatically determined 4 blocks; 1-183, 181-363, 361-543, 541-720.

        +---------------------------------------+-----------+----------+--------+
        | Item                                  |   Overall |      Low |   High |
        |---------------------------------------+-----------+----------+--------|
        | dmin                                  |      1.21 |     3.28 |   1.21 |
        | dmax                                  |     69.31 |    69.31 |   1.23 |
        | number fully recorded                 | 366066    | 26283    | 382    |
        | number partially recorded             |   2636    |   238    |   3    |
        | number with invalid background pixels |  87699    |  4892    | 353    |
        | number with invalid foreground pixels |  51670    |  3428    | 165    |
        | number with overloaded pixels         |      0    |     0    |   0    |
        | number in powder rings                |      0    |     0    |   0    |
        | number processed with summation       | 316604    | 23056    | 220    |
        | number processed with profile fitting | 312145    | 22918    | 134    |
        | number failed in background modelling |   2782    |   832    |   0    |
        | number failed in summation            |  51670    |  3428    | 165    |
        | number failed in profile fitting      |  56129    |  3566    | 251    |
        | ibg                                   |      0.88 |     3.63 |   0.08 |
        | i/sigi (summation)                    |      7.19 |    42.12 |   0.31 |
        | i/sigi (profile fitting)              |      7.31 |    42.9  |   0.24 |
        | cc prf                                |      0.72 |     0.86 |   0.47 |
        | cc_pearson sum/prf                    |      1    |     1    |   0.59 |
        | cc_spearman sum/prf                   |      0.96 |     1    |   0.6  |
        +---------------------------------------+-----------+----------+--------+
        
        Timing information for integration
        +-------------------+----------------+
        | Read time         | 88.84 seconds  |
        | Extract time      | 2.46 seconds   |
        | Pre-process time  | 0.47 seconds   |
        | Process time      | 268.74 seconds |
        | Post-process time | 0.00 seconds   |
        | Total time        | 382.76 seconds |
        +-------------------+----------------+

Insulin dataset /dls/i04/data/2020/cm26459-1/20200220/TestInsulin/ins_10/ins_10_1_master.h5 (dataset used in testing #659).
Insulin master
Automatically determined 89 blocks; 1-40, 21-60, 41-80, ..., 1761-1800

        +---------------------------------------+-----------+----------+--------+
        | Item                                  |   Overall |      Low |   High |
        |---------------------------------------+-----------+----------+--------|
        | dmin                                  |      1.48 |     4.01 |   1.48 |
        | dmax                                  |     55.64 |    55.64 |   1.5  |
        | number fully recorded                 | 177230    | 15108    |  75    |
        | number partially recorded             |  17249    |  1951    |   2    |
        | number with invalid background pixels |  54285    |  3980    |  52    |
        | number with invalid foreground pixels |  23788    |  1931    |  27    |
        | number with overloaded pixels         |      1    |     1    |   0    |
        | number in powder rings                |      0    |     0    |   0    |
        | number processed with summation       | 168544    | 14954    |  50    |
        | number processed with profile fitting | 160610    | 14363    |  25    |
        | number failed in background modelling |    622    |   157    |   0    |
        | number failed in summation            |  23788    |  1931    |  27    |
        | number failed in profile fitting      |  31722    |  2522    |  52    |
        | ibg                                   |      0.08 |     0.12 |   0.02 |
        | i/sigi (summation)                    |     15.77 |    58.75 |   0.55 |
        | i/sigi (profile fitting)              |     16.93 |    63.57 |   0.33 |
        | cc prf                                |      0.71 |     0.88 |   0.39 |
        | cc_pearson sum/prf                    |      0.96 |     0.96 |   0.75 |
        | cc_spearman sum/prf                   |      0.98 |     0.97 |   0.72 |
        +---------------------------------------+-----------+----------+--------+
        
        Timing information for integration
        +-------------------+-----------------+
        | Read time         | 1764.10 seconds |
        | Extract time      | 27.65 seconds   |
        | Pre-process time  | 0.22 seconds    |
        | Process time      | 413.53 seconds  |
        | Post-process time | 0.00 seconds    |
        | Total time        | 2640.35 seconds |
        +-------------------+-----------------+

Insulin PR branch
Automatically determined 8 blocks; 1-243, 223-246, ..., 1562-1800.

        +---------------------------------------+-----------+----------+--------+
        | Item                                  |   Overall |      Low |   High |
        |---------------------------------------+-----------+----------+--------|
        | dmin                                  |      1.48 |     4.01 |   1.48 |
        | dmax                                  |     55.64 |    55.64 |   1.5  |
        | number fully recorded                 | 180324    | 15486    |  75    |
        | number partially recorded             |   4086    |   393    |   2    |
        | number with invalid background pixels |  52736    |  3728    |  52    |
        | number with invalid foreground pixels |  23123    |  1781    |  27    |
        | number with overloaded pixels         |      1    |     1    |   0    |
        | number in powder rings                |      0    |     0    |   0    |
        | number processed with summation       | 160982    | 14066    |  50    |
        | number processed with profile fitting | 156617    | 13889    |  25    |
        | number failed in background modelling |    576    |   119    |   0    |
        | number failed in summation            |  23123    |  1781    |  27    |
        | number failed in profile fitting      |  27488    |  1958    |  52    |
        | ibg                                   |      0.08 |     0.12 |   0.02 |
        | i/sigi (summation)                    |     16.24 |    61.47 |   0.55 |
        | i/sigi (profile fitting)              |     16.73 |    62.94 |   0.33 |
        | cc prf                                |      0.71 |     0.89 |   0.39 |
        | cc_pearson sum/prf                    |      0.99 |     0.99 |   0.75 |
        | cc_spearman sum/prf                   |      0.99 |     1    |   0.72 |
        +---------------------------------------+-----------+----------+--------+
        
        Timing information for integration
        +-------------------+-----------------+
        | Read time         | 983.63 seconds  |
        | Extract time      | 22.52 seconds   |
        | Pre-process time  | 0.22 seconds    |
        | Process time      | 410.37 seconds  |
        | Post-process time | 0.00 seconds    |
        | Total time        | 1840.23 seconds |
        +-------------------+-----------------+
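The partially recorded fractions and read-time ratios implied by the four log excerpts above can be tallied with a short script. The counts and times are copied from the Overall column and the timing tables; the dictionary layout and helper name are only for illustration.

```python
# Values copied from the log tables above (Overall column and read time).
runs = {
    "beta-lactamase master": {"full": 362348, "partial": 17684, "read_s": 211.61},
    "beta-lactamase branch": {"full": 366066, "partial": 2636, "read_s": 88.84},
    "insulin master": {"full": 177230, "partial": 17249, "read_s": 1764.10},
    "insulin branch": {"full": 180324, "partial": 4086, "read_s": 983.63},
}


def partial_fraction(run):
    """Share of recorded reflections that were only partially recorded."""
    return run["partial"] / (run["full"] + run["partial"])


for name, run in runs.items():
    print(f"{name}: {partial_fraction(run):.2%} partial, read {run['read_s']:.2f} s")

for dataset in ("beta-lactamase", "insulin"):
    master = runs[f"{dataset} master"]["read_s"]
    branch = runs[f"{dataset} branch"]["read_s"]
    print(f"{dataset}: read time reduced by a factor of {master / branch:.2f}")
```

With these figures the partial fraction falls from roughly 4.7% to 0.7% for beta-lactamase and from roughly 8.9% to 2.2% for insulin, while the read time drops by factors of about 2.4 and 1.8 respectively.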

Currently, blocks always overlap by half the block size, leading to unnecessary double reading of the data. Instead, use larger blocks with the same overlaps.

codecov · 2020-09-07T18:26:25Z

Codecov Report

Merging #1396 into master will increase coverage by 0.03%.
The diff coverage is 94.73%.

@@            Coverage Diff             @@
##           master    #1396      +/-   ##
==========================================
+ Coverage   65.66%   65.69%   +0.03%     
==========================================
  Files         614      614              
  Lines       69474    69494      +20     
  Branches     9506     9510       +4     
==========================================
+ Hits        45621    45657      +36     
+ Misses      22017    22003      -14     
+ Partials     1836     1834       -2

dagewa · 2020-09-08T07:39:29Z

I might have misunderstood this, but isn't the fairly fine-grained blocking a feature, such that you get a different profile model as the scan evolves in φ? If we make big blocks does that mean we are losing information about how the profile evolves along the scan?

graeme-winter · 2020-09-08T07:42:51Z

@dagewa I think the profile models are still locally determined based on the xyz distance from a given spot to the surrounding reference profiles, and I do not think this would change. FWIW if you set nproc=1 you get all the data integrated in a single block, ISTR

dagewa · 2020-09-08T07:47:07Z

What I mean is that the reference profiles are determined in blocks in φ, right? Profile fitting is then done with xyz distance to reference profiles. Maybe this PR doesn't touch the reference profile creation, in which case I stand down. I'm just checking that there isn't some reason to believe that we are losing explanatory power in the model here.

graeme-winter · 2020-09-08T07:50:48Z

As with the parallel PR to split the reflection list if we have too many reflections, we should do a side-by-side comparison of the integration results to show they are identical. Your concerns are clearly valid and a core part of the PR process 🙂

I'll see if I can get the comparison script written.

jbeilstenedmands · 2020-09-08T07:56:49Z

Thanks for the comment @dagewa, I'll check with regards to the profile model formation.

dagewa · 2020-09-08T08:34:19Z

Thanks, I'll tag @jmp1985 here for comment. I think it might be a reasonable ultimate goal to remove the blockwise determination of profile models for a global scan-varying model. This would be in 3D to replace the blocks in detector X, Y and the blocks in φ. I think there was some work before on scan-varying profile models within each X, Y block, but with @jbeilstenedmands's work on smoothers in higher dimensions it should be possible to create a 3D model.

jbeilstenedmands · 2020-09-08T09:57:15Z

The profile modelling determination method is unchanged here. The profile model 'blocking'/gridding is determined by a separate set of parameters in the profile.gaussian_rs scope:

    fitting {
      scan_step = 5
        .help = "Space between profiles in degrees"
        .type = float(allow_none=True)
      grid_size = 5
        .help = "The size of the profile grid."
        .type = int(allow_none=True)
      threshold = 0.02
        .help = "The threshold to use in reference profile"
        .type = float(allow_none=True)
      grid_method = single *regular_grid circular_grid spherical_grid
        .help = "Select the profile grid method"
        .type = choice
      fit_method = *reciprocal_space detector_space
        .help = "The fitting method"
        .type = choice
      detector_space {
        deconvolution = False
          .help = "Do deconvolution in detector space"
          .type = bool
      }

This is unchanged between master and this PR, and the Summary of profile model table in the log output confirms that the same gridding is used in both cases.
dials.integrate.master.log
dials.integrate.branch.log

dagewa · 2020-09-08T10:16:25Z

Sounds good, it sounds like any profile modelling changes we aim for in the long run would be orthogonal to this then. 👍

jbeilstenedmands · 2020-09-09T15:11:09Z

I reckon this is ready for review now.
The only thing I wasn't sure about is the mp.njobs parameter. If you have nproc=4 and njobs=2, I assume you want to split each dataset into 8 blocks (rather than just splitting based on nproc), as you effectively have 8 processors/workers?

jbeilstenedmands · 2020-09-14T12:27:15Z

Having looked at the threaded integrator code, it seems that this idea basically exists within the threaded integrator if njobs > 1.
e.g. for integrator=3d_threaded, njobs=4, the following blocking is made:

+-----+--------------+------------+--------------+------------+-----------------+
|   # |   Frame From |   Frame To |   Angle From |   Angle To |   # Reflections |
|-----+--------------+------------+--------------+------------+-----------------|
|   0 |            0 |        182 |            0 |         91 |           25790 |
|   1 |          180 |        362 |           90 |        181 |           23535 |
|   2 |          360 |        542 |          180 |        271 |           23756 |
|   3 |          540 |        720 |          270 |        360 |           23354 |
+-----+--------------+------------+--------------+------------+-----------------+

which is the same as my blocking in this case, offset by 1.

I'll see if the existing code in the threaded integrator can be reused for this work.

jbeilstenedmands · 2020-09-17T10:06:07Z

It seems that adopting some code from the threaded integrator is not a good idea at this time. The main issue is that the threaded integrator does not currently work with multi-sweep data, and getting it to do so requires adding a significant amount of new C++ bookkeeping code. There are a few other details which seem to cause slightly different behaviour even for single-imageset datasets, and these also require further investigation.

For stability of standard processing I think it's best to keep this PR as is and review in its current state.
I'll look at fixing the threaded integrator separately and then see if the code can be combined in future.

dagewa

The performance enhancements are compelling, especially to get this at no cost to processing quality (in fact, to slightly improve processing quality). 👍 I made a couple of suggestions only.

jbeilstenedmands added 2 commits September 7, 2020 16:17

Change blocking in integration.

5acfe1b

Currently, blocks always overlap by half the block size, leading to unnecessary double reading of the data. Instead, use larger blocks with the same overlaps.

Update test

c62bbdf

jbeilstenedmands added 3 commits September 8, 2020 15:53

Account for multiple imagesets, use of mp.njobs parameter

81ee548

Add tests and news

8e31018

Refactor code to allow per-imageset calculation of block size

a976486

jbeilstenedmands marked this pull request as ready for review September 9, 2020 15:09

jbeilstenedmands requested review from graeme-winter and dagewa September 9, 2020 15:11

Anthchirp reviewed Sep 9, 2020

Adopt review suggestions

58abafb

dagewa approved these changes Sep 17, 2020

Use string format method

81854f8

github-actions bot added the PR: merge conflicts label Sep 18, 2020

Merge branch 'master' into improve_block_size

96b7ca8

github-actions bot removed the PR: merge conflicts label Oct 6, 2020

jbeilstenedmands merged commit 472ebb5 into master Oct 14, 2020

jbeilstenedmands deleted the improve_block_size branch October 14, 2020 09:59
