Use a subset of reflections for sigma_m calculation in integrate #942

jbeilstenedmands · 2019-09-18T15:01:06Z

Currently, dials.integrate uses all reflections to calculate sigma_m with a simplex minimiser, which takes a significant amount of time for large datasets.
As this problem is hugely overdetermined, it should be possible to use a subset of reflections and obtain a result that is just as accurate. As dials.refine already selects a well-sampled subset of the data, this PR uses that subset if it is available, and uses at least 10000 reflections.
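The selection rule described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `choose_sigma_m_subset`, the plain index lists, and the choice to top up a small refinement subset with randomly chosen reflections are all assumptions made for the example.

```python
import random

MIN_REFLECTIONS = 10000  # minimum subset size quoted in the description above


def choose_sigma_m_subset(n_total, used_in_refinement,
                          min_size=MIN_REFLECTIONS, seed=0):
    """Return sorted indices of the reflections to use for the sigma_m fit.

    n_total            -- total number of reflections (hypothetical input)
    used_in_refinement -- indices selected by refinement; may be empty
    """
    selected = set(used_in_refinement)
    # Never ask for more reflections than actually exist.
    shortfall = min(min_size, n_total) - len(selected)
    if shortfall > 0:
        # Assumption: top up with random reflections not already selected.
        remaining = [i for i in range(n_total) if i not in selected]
        rng = random.Random(seed)
        selected.update(rng.sample(remaining, shortfall))
    return sorted(selected)


# 5000 reflections came from refinement, so 5000 more are added.
subset = choose_sigma_m_subset(50000, range(0, 50000, 10))
print(len(subset))
```

A fixed seed keeps the subset reproducible between runs, which makes before/after comparisons like the ones below easier to interpret.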

I have tested this on the beta-lactamase example used in the dials tutorial. On my MacBook the runtime for this calculation goes down from 42.9s to 18.9s, out of an initial total runtime of 415s, so a noticeable saving of around 6%. In this case, these are the differences in integration:
Master:

Calculating E.S.D Reflecting Range.
 sigma b: 0.044158 degrees
 sigma m: 0.057864 degrees

This branch:

Calculating E.S.D Reflecting Range.
 sigma b: 0.044158 degrees
 sigma m: 0.057451 degrees

So there is less than 1% difference in sigma m (about 0.7%).
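For reference, the two headline figures can be reproduced from the numbers quoted above. This is only a worked-arithmetic sketch; the variable names are made up for the example.

```python
# Figures copied from the description above.
sigma_m_time_before = 42.9  # s, sigma_m calculation using all reflections
sigma_m_time_after = 18.9   # s, sigma_m calculation using the subset
total_runtime = 415.0       # s, whole integration run before the change

saved = sigma_m_time_before - sigma_m_time_after
fraction_saved = saved / total_runtime

sigma_m_master = 0.057864   # degrees
sigma_m_branch = 0.057451   # degrees
relative_change = abs(sigma_m_branch - sigma_m_master) / sigma_m_master

print(f"time saved: {saved:.1f} s ({100 * fraction_saved:.1f}% of the run)")
print(f"sigma m change: {100 * relative_change:.2f}%")
```

This gives a saving of 24.0 s, about 5.8% of the run, and a relative change in sigma m of about 0.71%.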

Master:

 --------------------------------------------------------------
 Item                                  | Overall | Low   | High
 --------------------------------------------------------------
 dmin                                  | 1.21    | 3.28  | 1.21
 dmax                                  | 69.30   | 69.30 | 1.23
 number fully recorded                 | 353295  | 25254 | 379
 number partially recorded             | 34182   | 2858  | 3
 number with invalid background pixels | 97394   | 4859  | 371
 number with invalid foreground pixels | 54864   | 3097  | 186
 number with overloaded pixels         | 0       | 0     | 0
 number in powder rings                | 0       | 0     | 0
 number processed with summation       | 326603  | 23950 | 196
 number processed with profile fitting | 320872  | 23730 | 106
 number failed in background modelling | 0       | 0     | 0
 number failed in summation            | 55020   | 3097  | 186
 number failed in profile fitting      | 60751   | 3317  | 276
 ibg                                   | 0.00    | 0.00  | 0.00
 i/sigi (summation)                    | 6.98    | 40.91 | 0.28
 i/sigi (profile fitting)              | 7.41    | 43.47 | 0.22
 cc prf                                | 0.00    | 0.00  | 0.00
 cc_pearson sum/prf                    | 0.96    | 0.95  | 0.62
 cc_spearman sum/prf                   | 0.94    | 0.98  | 0.61
 --------------------------------------------------------------

This branch:

 --------------------------------------------------------------
 Item                                  | Overall | Low   | High
 --------------------------------------------------------------
 dmin                                  | 1.21    | 3.28  | 1.21
 dmax                                  | 69.30   | 69.30 | 1.23
 number fully recorded                 | 353482  | 25268 | 379
 number partially recorded             | 33721   | 2824  | 3
 number with invalid background pixels | 97349   | 4855  | 371
 number with invalid foreground pixels | 54839   | 3094  | 186
 number with overloaded pixels         | 0       | 0     | 0
 number in powder rings                | 0       | 0     | 0
 number processed with summation       | 326386  | 23936 | 196
 number processed with profile fitting | 320675  | 23715 | 106
 number failed in background modelling | 0       | 0     | 0
 number failed in summation            | 54988   | 3094  | 186
 number failed in profile fitting      | 60699   | 3315  | 276
 ibg                                   | 0.00    | 0.00  | 0.00
 i/sigi (summation)                    | 6.99    | 40.93 | 0.28
 i/sigi (profile fitting)              | 7.41    | 43.48 | 0.22
 cc prf                                | 0.00    | 0.00  | 0.00
 cc_pearson sum/prf                    | 0.95    | 0.95  | 0.62
 cc_spearman sum/prf                   | 0.94    | 0.98  | 0.61
 --------------------------------------------------------------

Then going on to scale:

Master:

Resolution: 69.19 - 1.40
Observations: 276236
Unique reflections: 41116
Redundancy: 6.7
Completeness: 94.08%
Mean intensity: 86.7
Mean I/sigma(I): 12.0
R-merge: 0.066
R-meas:  0.072
R-pim:   0.028


Statistics by resolution bin:
 d_max  d_min   #obs  #uniq   mult.  %comp       <I>  <I/sI>    r_mrg   r_meas    r_pim   cc1/2   cc_ano
 69.27   3.80  14922   2234    6.68  99.02     588.9    41.5    0.041    0.045    0.017   0.997*   0.235*
  3.80   3.02  14677   2171    6.76  97.97     403.9    36.3    0.044    0.048    0.018   0.998*   0.256*
  3.02   2.63  14289   2137    6.69  97.49     168.5    26.7    0.057    0.061    0.023   0.998*   0.279*
  2.63   2.39  14905   2115    7.05  96.93     113.1    23.0    0.067    0.072    0.027   0.996*   0.286*
  2.39   2.22  14402   2136    6.74  96.61      88.8    19.1    0.074    0.081    0.031   0.996*   0.208*
  2.22   2.09  14570   2091    6.97  95.52      71.8    16.8    0.082    0.088    0.033   0.996*   0.238*
  2.09   1.99  13705   2072    6.61  95.44      53.6    13.6    0.093    0.101    0.039   0.995*   0.212*
  1.99   1.90  14689   2091    7.02  95.39      39.0    11.2    0.115    0.124    0.046   0.994*   0.098*
  1.90   1.83  13897   2052    6.77  94.17      26.9     8.7    0.139    0.151    0.057   0.991*   0.177*
  1.83   1.76  13783   2041    6.75  93.97      19.4     6.8    0.167    0.181    0.069   0.986*   0.151*
  1.76   1.71  13517   2045    6.61  93.81      14.2     5.4    0.201    0.219    0.084   0.978*   0.113*
  1.71   1.66  13788   2037    6.77  93.14      11.4     4.4    0.235    0.254    0.097   0.979*   0.091*
  1.66   1.62  14054   1999    7.03  92.50       9.8     3.9    0.265    0.286    0.108   0.976*   0.116*
  1.62   1.58  13842   2032    6.81  92.45       8.2     3.3    0.304    0.329    0.126   0.965*   0.094*
  1.58   1.54  13417   2005    6.69  91.80       6.9     2.9    0.335    0.364    0.140   0.955*   0.075*
  1.54   1.51  13015   1993    6.53  92.01       5.7     2.4    0.385    0.418    0.162   0.956*   0.051
  1.51   1.48  13426   2005    6.70  91.76       5.0     2.1    0.442    0.479    0.184   0.938*   0.037
  1.48   1.45  13220   1935    6.83  90.21       3.9     1.7    0.540    0.584    0.222   0.908*   0.082*
  1.45   1.42  13127   1998    6.57  90.82       3.2     1.4    0.621    0.674    0.260   0.899*   0.039
  1.42   1.40  10991   1927    5.70  90.22       3.1     1.2    0.644    0.709    0.293   0.879*   0.008
 69.19   1.40 276236  41116    6.72  94.08      86.7    12.0    0.066    0.072    0.028   0.998*   0.127*

This branch:

Resolution: 69.19 - 1.40
Observations: 276194
Unique reflections: 41116
Redundancy: 6.7
Completeness: 94.08%
Mean intensity: 86.6
Mean I/sigma(I): 12.0
R-merge: 0.066
R-meas:  0.072
R-pim:   0.028


Statistics by resolution bin:
 d_max  d_min   #obs  #uniq   mult.  %comp       <I>  <I/sI>    r_mrg   r_meas    r_pim   cc1/2   cc_ano
 69.27   3.80  14921   2234    6.68  99.02     588.7    41.6    0.041    0.045    0.017   0.997*   0.180*
  3.80   3.02  14674   2171    6.76  97.97     403.8    36.4    0.044    0.048    0.018   0.998*   0.262*
  3.02   2.63  14285   2137    6.68  97.49     168.4    26.7    0.057    0.061    0.023   0.998*   0.266*
  2.63   2.39  14904   2115    7.05  96.93     113.1    23.0    0.067    0.072    0.027   0.996*   0.282*
  2.39   2.22  14405   2136    6.74  96.61      88.8    19.1    0.074    0.080    0.031   0.996*   0.225*
  2.22   2.09  14570   2091    6.97  95.52      71.8    16.8    0.082    0.088    0.033   0.996*   0.239*
  2.09   1.99  13701   2072    6.61  95.44      53.6    13.6    0.093    0.101    0.039   0.995*   0.219*
  1.99   1.90  14685   2091    7.02  95.39      39.0    11.2    0.115    0.124    0.046   0.994*   0.108*
  1.90   1.83  13895   2052    6.77  94.17      26.9     8.7    0.139    0.151    0.057   0.991*   0.191*
  1.83   1.76  13778   2041    6.75  93.97      19.4     6.8    0.167    0.181    0.069   0.986*   0.125*
  1.76   1.71  13520   2045    6.61  93.81      14.2     5.4    0.201    0.219    0.084   0.978*   0.112*
  1.71   1.66  13784   2037    6.77  93.14      11.3     4.4    0.235    0.254    0.097   0.979*   0.098*
  1.66   1.62  14051   1999    7.03  92.50       9.8     3.9    0.265    0.286    0.108   0.976*   0.118*
  1.62   1.58  13837   2032    6.81  92.45       8.1     3.3    0.304    0.329    0.126   0.965*   0.091*
  1.58   1.54  13415   2005    6.69  91.80       6.8     2.9    0.335    0.364    0.140   0.955*   0.071*
  1.54   1.51  13012   1993    6.53  92.01       5.7     2.4    0.385    0.418    0.162   0.956*   0.055*
  1.51   1.48  13425   2005    6.70  91.76       5.0     2.1    0.442    0.479    0.184   0.939*   0.035
  1.48   1.45  13217   1935    6.83  90.21       3.9     1.7    0.540    0.584    0.222   0.906*   0.081*
  1.45   1.42  13127   1998    6.57  90.82       3.2     1.4    0.621    0.674    0.260   0.899*   0.038
  1.42   1.40  10988   1927    5.70  90.22       3.1     1.2    0.644    0.709    0.293   0.872*   0.008
 69.19   1.40 276194  41116    6.72  94.08      86.6    12.0    0.066    0.072    0.028   0.998*   0.151*

So looks pretty similar overall.

Would be useful to also test on bigger datasets to see the difference.

graeme-winter · 2019-09-19T11:15:09Z

Reviewing now - first rerunning some tests inside of xia2 to get direct comparison

graeme-winter · 2019-09-19T12:42:42Z

Seems to save a serious amount of time for the "normal" data set: was 344s, now 285s.

Subsequent analysis suggests we should use the same subset of reflections for sigma_b as well, and perform the zeta filtering up front rather than just for sigma_m. Will test.

Filter reflections at the top, use a consistent set of reflections for both sigma_b and sigma_m and, as a courtesy, report the number of reflections used. Also separated out the still case for clarity, though this means slight code duplication.
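The structure described here (filter once, then reuse the same set for both estimates) might look something like the sketch below. It is illustrative only and does not use the DIALS API: `filter_by_zeta`, `estimate_profile_widths`, the dict-based reflections, the stand-in estimators and the `MIN_ZETA` value of 0.05 are all assumptions for the example.

```python
MIN_ZETA = 0.05  # assumed threshold, for illustration only


def filter_by_zeta(reflections, min_zeta=MIN_ZETA):
    """Keep reflections whose |zeta| is at least min_zeta."""
    return [r for r in reflections if abs(r["zeta"]) >= min_zeta]


def estimate_profile_widths(reflections, estimate_sigma_b, estimate_sigma_m):
    """Filter once at the top, then give the same subset to both estimators."""
    subset = filter_by_zeta(reflections)
    # Courtesy report of how many reflections are actually used.
    print(f"Using {len(subset)} of {len(reflections)} reflections")
    return estimate_sigma_b(subset), estimate_sigma_m(subset), len(subset)


# Stand-in estimators that only record how many reflections they were given.
reflections = [{"zeta": z} for z in (0.9, -0.8, 0.01, -0.02, 0.5)]
sigma_b, sigma_m, n_used = estimate_profile_widths(reflections, len, len)
```

Doing the filtering in one place means the two estimates cannot silently drift apart by working from different reflection lists, and the filter itself is only paid for once.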

graeme-winter · 2019-09-19T13:11:02Z

Implemented the filter of reflections further up the stack, saves another ~ 5s. Now running full test.

graeme-winter · 2019-09-19T13:23:26Z

Overall processing results with xia2 - before changes in this PR:

For AUTOMATIC/DEFAULT/NATIVE                 Overall    Low     High
High resolution limit                           1.37    3.72    1.37
Low resolution limit                           54.10   54.14    1.39
Completeness                                  100.0   100.0   100.0
Multiplicity                                    8.4     7.9     7.4
I/sigma                                        16.4    70.0     0.8
Rmerge(I)                                     0.058   0.025   1.005
Rmerge(I+/-)                                  0.055   0.022   0.936
Rmeas(I)                                      0.062   0.026   1.080
Rmeas(I+/-)                                   0.062   0.025   1.088
Rpim(I)                                       0.021   0.009   0.392
Rpim(I+/-)                                    0.029   0.012   0.547
CC half                                       0.999   0.999   0.779
Wilson B factor                              13.816
Anomalous completeness                        100.0    99.9   100.0
Anomalous multiplicity                          4.5     4.7     3.9
Anomalous correlation                         0.068   0.137  -0.028
Anomalous slope                               0.564
Total observations                           464523   24216   20090
Total unique                                  55000    3064    2697
Assuming spacegroup: P 41 21 2
Unit cell (with estimated std devs):
57.97635(6) 57.97635(6) 150.4096(2)

after:

For AUTOMATIC/DEFAULT/NATIVE                 Overall    Low     High
High resolution limit                           1.37    3.72    1.37
Low resolution limit                           54.10   54.14    1.39
Completeness                                  100.0   100.0   100.0
Multiplicity                                    8.4     7.9     7.5
I/sigma                                        16.4    69.9     0.8
Rmerge(I)                                     0.059   0.025   1.003
Rmerge(I+/-)                                  0.055   0.022   0.935
Rmeas(I)                                      0.062   0.027   1.078
Rmeas(I+/-)                                   0.062   0.025   1.086
Rpim(I)                                       0.021   0.009   0.391
Rpim(I+/-)                                    0.029   0.012   0.546
CC half                                       0.999   0.999   0.785
Wilson B factor                              13.721
Anomalous completeness                        100.0   100.0   100.0
Anomalous multiplicity                          4.5     4.7     3.9
Anomalous correlation                         0.068   0.189  -0.044
Anomalous slope                               0.575
Total observations                           464659   24223   20109
Total unique                                  55000    3064    2697
Assuming spacegroup: P 41 21 2
Unit cell (with estimated std devs):
57.97638(6) 57.97638(6) 150.4095(2)

wall clock time changed from 344 to 280s for the integration.

graeme-winter · 2019-09-19T13:25:33Z

Having now made commits to the branch I am no longer qualified to comment on the pull request. Will ask for others to provide input.

graeme-winter · 2019-09-19T13:29:29Z

For inspection:

Original:
/dls/mx-scratch/dials/performance-week/normal/review-942

Latest:
/dls/mx-scratch/dials/performance-week/normal/review-942-extra

rjgildea · 2019-09-19T14:19:53Z

Merging stats look pretty much indistinguishable:

$ xia2.compare_merging_stats /dls/mx-scratch/dials/performance-week/normal/review-942{,-extra}/DataFiles/AUTOMATIC_DEFAULT_scaled_unmerged.mtz plot_labels="old new"

Use a subset of reflections for sigma_m calculation in integrate

8055851

graeme-winter self-requested a review September 18, 2019 15:40

For subset of reflections

81bc2d5

Filter reflections at the top, use consistent set of reflections for both sigma_b and sigma_m, as a courtesy report the number of reflections used. Also separated out the still case for clarity though means slight code duplication occurs.

Add news fragment

8220f39

graeme-winter requested a review from jmp1985 September 19, 2019 13:25

Anthchirp merged commit 5f18ca9 into master Sep 19, 2019

Anthchirp deleted the sigma_m_calc branch October 22, 2019 09:53
