Use a subset of reflections for sigma_m calculation in integrate #942

Merged (3 commits) on Sep 19, 2019

jbeilstenedmands

Currently, in dials.integrate, all reflections are used to calculate sigma_m using a simplex minimiser, which takes significant time for large datasets.
As this problem is hugely overdetermined, it should be possible to use a subset of reflections and obtain a result that is just as accurate. As dials.refine already selects a well-sampled subset of the data, it is proposed to use this subset if available, while always using at least 10000 reflections.
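
To make the proposed selection concrete, here is a minimal sketch in plain Python. It is not the code in this PR: the function name `choose_sigma_m_subset`, the boolean list standing in for the "used in refinement" flags, and the fallback to all reflections when the refinement subset is smaller than the minimum are all assumptions made for illustration.

```python
def choose_sigma_m_subset(used_in_refinement, min_reflections=10000):
    """Return indices of the reflections to use for the sigma_m estimate.

    used_in_refinement: list of bools, one per reflection (hypothetical
    stand-in for the flags set by refinement).
    Assumed policy: take the refinement subset if it holds at least
    min_reflections entries, otherwise fall back to every reflection.
    """
    flagged = [i for i, used in enumerate(used_in_refinement) if used]
    if len(flagged) >= min_reflections:
        return flagged
    # Too few flagged reflections (or none): use the whole set instead.
    return list(range(len(used_in_refinement)))


if __name__ == "__main__":
    flags = [True] * 3 + [False] * 7
    print(choose_sigma_m_subset(flags, min_reflections=2))
    print(choose_sigma_m_subset(flags, min_reflections=5))
```

The fallback keeps behaviour identical to master whenever no refinement subset is available, which is one way to read "if available" above.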

I have tested this on the beta-lactamase example used in the dials tutorial. On my MacBook, the runtime for this calculation goes down from 42.9 s to 18.9 s, out of an initial total runtime of 415 s, so a noticeable saving of roughly 6%. In this case, these are the differences in integration:
Master:

Calculating E.S.D Reflecting Range.
 sigma b: 0.044158 degrees
 sigma m: 0.057864 degrees

This branch:

Calculating E.S.D Reflecting Range.
 sigma b: 0.044158 degrees
 sigma m: 0.057451 degrees

So there is less than 1% difference in sigma m.
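
For reference, the two headline numbers can be checked with a few lines of arithmetic. This sketch only uses the values quoted in this comment; the variable names are illustrative.

```python
# Values copied from the comment above.
sigma_m_master = 0.057864  # degrees
sigma_m_branch = 0.057451  # degrees

# Relative change in sigma_m between master and this branch.
rel_diff_pct = 100.0 * abs(sigma_m_master - sigma_m_branch) / sigma_m_master

# Time spent in the sigma_m calculation before and after, and total runtime.
t_before, t_after, t_total = 42.9, 18.9, 415.0  # seconds
saving_s = t_before - t_after
saving_pct = 100.0 * saving_s / t_total

print(f"sigma_m relative difference: {rel_diff_pct:.2f}%")
print(f"runtime saving: {saving_s:.1f} s ({saving_pct:.1f}% of total)")
```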

Master:

 --------------------------------------------------------------
 Item                                  | Overall | Low   | High
 --------------------------------------------------------------
 dmin                                  | 1.21    | 3.28  | 1.21
 dmax                                  | 69.30   | 69.30 | 1.23
 number fully recorded                 | 353295  | 25254 | 379
 number partially recorded             | 34182   | 2858  | 3
 number with invalid background pixels | 97394   | 4859  | 371
 number with invalid foreground pixels | 54864   | 3097  | 186
 number with overloaded pixels         | 0       | 0     | 0
 number in powder rings                | 0       | 0     | 0
 number processed with summation       | 326603  | 23950 | 196
 number processed with profile fitting | 320872  | 23730 | 106
 number failed in background modelling | 0       | 0     | 0
 number failed in summation            | 55020   | 3097  | 186
 number failed in profile fitting      | 60751   | 3317  | 276
 ibg                                   | 0.00    | 0.00  | 0.00
 i/sigi (summation)                    | 6.98    | 40.91 | 0.28
 i/sigi (profile fitting)              | 7.41    | 43.47 | 0.22
 cc prf                                | 0.00    | 0.00  | 0.00
 cc_pearson sum/prf                    | 0.96    | 0.95  | 0.62
 cc_spearman sum/prf                   | 0.94    | 0.98  | 0.61
 --------------------------------------------------------------

This branch:

 --------------------------------------------------------------
 Item                                  | Overall | Low   | High
 --------------------------------------------------------------
 dmin                                  | 1.21    | 3.28  | 1.21
 dmax                                  | 69.30   | 69.30 | 1.23
 number fully recorded                 | 353482  | 25268 | 379
 number partially recorded             | 33721   | 2824  | 3
 number with invalid background pixels | 97349   | 4855  | 371
 number with invalid foreground pixels | 54839   | 3094  | 186
 number with overloaded pixels         | 0       | 0     | 0
 number in powder rings                | 0       | 0     | 0
 number processed with summation       | 326386  | 23936 | 196
 number processed with profile fitting | 320675  | 23715 | 106
 number failed in background modelling | 0       | 0     | 0
 number failed in summation            | 54988   | 3094  | 186
 number failed in profile fitting      | 60699   | 3315  | 276
 ibg                                   | 0.00    | 0.00  | 0.00
 i/sigi (summation)                    | 6.99    | 40.93 | 0.28
 i/sigi (profile fitting)              | 7.41    | 43.48 | 0.22
 cc prf                                | 0.00    | 0.00  | 0.00
 cc_pearson sum/prf                    | 0.95    | 0.95  | 0.62
 cc_spearman sum/prf                   | 0.94    | 0.98  | 0.61
 --------------------------------------------------------------

Then going on to scale:

Master:

Resolution: 69.19 - 1.40
Observations: 276236
Unique reflections: 41116
Redundancy: 6.7
Completeness: 94.08%
Mean intensity: 86.7
Mean I/sigma(I): 12.0
R-merge: 0.066
R-meas:  0.072
R-pim:   0.028


Statistics by resolution bin:
 d_max  d_min   #obs  #uniq   mult.  %comp       <I>  <I/sI>    r_mrg   r_meas    r_pim   cc1/2   cc_ano
 69.27   3.80  14922   2234    6.68  99.02     588.9    41.5    0.041    0.045    0.017   0.997*   0.235*
  3.80   3.02  14677   2171    6.76  97.97     403.9    36.3    0.044    0.048    0.018   0.998*   0.256*
  3.02   2.63  14289   2137    6.69  97.49     168.5    26.7    0.057    0.061    0.023   0.998*   0.279*
  2.63   2.39  14905   2115    7.05  96.93     113.1    23.0    0.067    0.072    0.027   0.996*   0.286*
  2.39   2.22  14402   2136    6.74  96.61      88.8    19.1    0.074    0.081    0.031   0.996*   0.208*
  2.22   2.09  14570   2091    6.97  95.52      71.8    16.8    0.082    0.088    0.033   0.996*   0.238*
  2.09   1.99  13705   2072    6.61  95.44      53.6    13.6    0.093    0.101    0.039   0.995*   0.212*
  1.99   1.90  14689   2091    7.02  95.39      39.0    11.2    0.115    0.124    0.046   0.994*   0.098*
  1.90   1.83  13897   2052    6.77  94.17      26.9     8.7    0.139    0.151    0.057   0.991*   0.177*
  1.83   1.76  13783   2041    6.75  93.97      19.4     6.8    0.167    0.181    0.069   0.986*   0.151*
  1.76   1.71  13517   2045    6.61  93.81      14.2     5.4    0.201    0.219    0.084   0.978*   0.113*
  1.71   1.66  13788   2037    6.77  93.14      11.4     4.4    0.235    0.254    0.097   0.979*   0.091*
  1.66   1.62  14054   1999    7.03  92.50       9.8     3.9    0.265    0.286    0.108   0.976*   0.116*
  1.62   1.58  13842   2032    6.81  92.45       8.2     3.3    0.304    0.329    0.126   0.965*   0.094*
  1.58   1.54  13417   2005    6.69  91.80       6.9     2.9    0.335    0.364    0.140   0.955*   0.075*
  1.54   1.51  13015   1993    6.53  92.01       5.7     2.4    0.385    0.418    0.162   0.956*   0.051
  1.51   1.48  13426   2005    6.70  91.76       5.0     2.1    0.442    0.479    0.184   0.938*   0.037
  1.48   1.45  13220   1935    6.83  90.21       3.9     1.7    0.540    0.584    0.222   0.908*   0.082*
  1.45   1.42  13127   1998    6.57  90.82       3.2     1.4    0.621    0.674    0.260   0.899*   0.039
  1.42   1.40  10991   1927    5.70  90.22       3.1     1.2    0.644    0.709    0.293   0.879*   0.008
 69.19   1.40 276236  41116    6.72  94.08      86.7    12.0    0.066    0.072    0.028   0.998*   0.127*

This branch:

Resolution: 69.19 - 1.40
Observations: 276194
Unique reflections: 41116
Redundancy: 6.7
Completeness: 94.08%
Mean intensity: 86.6
Mean I/sigma(I): 12.0
R-merge: 0.066
R-meas:  0.072
R-pim:   0.028


Statistics by resolution bin:
 d_max  d_min   #obs  #uniq   mult.  %comp       <I>  <I/sI>    r_mrg   r_meas    r_pim   cc1/2   cc_ano
 69.27   3.80  14921   2234    6.68  99.02     588.7    41.6    0.041    0.045    0.017   0.997*   0.180*
  3.80   3.02  14674   2171    6.76  97.97     403.8    36.4    0.044    0.048    0.018   0.998*   0.262*
  3.02   2.63  14285   2137    6.68  97.49     168.4    26.7    0.057    0.061    0.023   0.998*   0.266*
  2.63   2.39  14904   2115    7.05  96.93     113.1    23.0    0.067    0.072    0.027   0.996*   0.282*
  2.39   2.22  14405   2136    6.74  96.61      88.8    19.1    0.074    0.080    0.031   0.996*   0.225*
  2.22   2.09  14570   2091    6.97  95.52      71.8    16.8    0.082    0.088    0.033   0.996*   0.239*
  2.09   1.99  13701   2072    6.61  95.44      53.6    13.6    0.093    0.101    0.039   0.995*   0.219*
  1.99   1.90  14685   2091    7.02  95.39      39.0    11.2    0.115    0.124    0.046   0.994*   0.108*
  1.90   1.83  13895   2052    6.77  94.17      26.9     8.7    0.139    0.151    0.057   0.991*   0.191*
  1.83   1.76  13778   2041    6.75  93.97      19.4     6.8    0.167    0.181    0.069   0.986*   0.125*
  1.76   1.71  13520   2045    6.61  93.81      14.2     5.4    0.201    0.219    0.084   0.978*   0.112*
  1.71   1.66  13784   2037    6.77  93.14      11.3     4.4    0.235    0.254    0.097   0.979*   0.098*
  1.66   1.62  14051   1999    7.03  92.50       9.8     3.9    0.265    0.286    0.108   0.976*   0.118*
  1.62   1.58  13837   2032    6.81  92.45       8.1     3.3    0.304    0.329    0.126   0.965*   0.091*
  1.58   1.54  13415   2005    6.69  91.80       6.8     2.9    0.335    0.364    0.140   0.955*   0.071*
  1.54   1.51  13012   1993    6.53  92.01       5.7     2.4    0.385    0.418    0.162   0.956*   0.055*
  1.51   1.48  13425   2005    6.70  91.76       5.0     2.1    0.442    0.479    0.184   0.939*   0.035
  1.48   1.45  13217   1935    6.83  90.21       3.9     1.7    0.540    0.584    0.222   0.906*   0.081*
  1.45   1.42  13127   1998    6.57  90.82       3.2     1.4    0.621    0.674    0.260   0.899*   0.038
  1.42   1.40  10988   1927    5.70  90.22       3.1     1.2    0.644    0.709    0.293   0.872*   0.008
 69.19   1.40 276194  41116    6.72  94.08      86.6    12.0    0.066    0.072    0.028   0.998*   0.151*

So the results look pretty similar overall.

It would be useful to also test on bigger datasets to see the difference.

@graeme-winter

Reviewing now - first rerunning some tests inside of xia2 to get direct comparison

@graeme-winter

Seems to save a serious amount of time for the "normal" data set: was 344 s, now 285 s.

Subsequent analysis suggests we should use the same subset of reflections for sigma_b as well, and perform the zeta filtering upfront rather than only for sigma_m. Will test.

Filter reflections at the top, use a consistent set of reflections for
both sigma_b and sigma_m and, as a courtesy, report the number of
reflections used.

Also separated out the still case for clarity, though this means slight
code duplication.
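
The idea in this commit message can be sketched as follows. This is an illustration only, not the code that was committed: the function name `select_for_sigma_estimation`, the plain list of zeta values, and the `min_zeta=0.05` threshold are assumptions, and the two estimators are deliberately left out.

```python
def select_for_sigma_estimation(zetas, min_zeta=0.05):
    """Filter once, up front, and return the indices that pass.

    zetas: one zeta value per reflection (hypothetical plain list).
    Reflections with |zeta| below min_zeta are dropped. The threshold
    value is an assumption for this sketch.
    """
    selected = [i for i, z in enumerate(zetas) if abs(z) >= min_zeta]
    # Report how many reflections will be used, as a courtesy to the user.
    print(f"Using {len(selected)} of {len(zetas)} reflections for sigma_b and sigma_m")
    return selected


if __name__ == "__main__":
    subset = select_for_sigma_estimation([0.5, -0.01, -0.3, 0.04, 0.05])
    # The same `subset` would then be handed to both the sigma_b and the
    # sigma_m estimators, so the two values come from identical reflections.
    print(subset)
```

Filtering once at the top means neither estimator needs to repeat the zeta test, and the reported count applies to both.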
@graeme-winter

Implemented the filter of reflections further up the stack, which saves another ~5 s. Now running the full test.

@graeme-winter

Overall processing results with xia2, before the changes in this PR:

For AUTOMATIC/DEFAULT/NATIVE                 Overall    Low     High
High resolution limit                           1.37    3.72    1.37
Low resolution limit                           54.10   54.14    1.39
Completeness                                  100.0   100.0   100.0
Multiplicity                                    8.4     7.9     7.4
I/sigma                                        16.4    70.0     0.8
Rmerge(I)                                     0.058   0.025   1.005
Rmerge(I+/-)                                  0.055   0.022   0.936
Rmeas(I)                                      0.062   0.026   1.080
Rmeas(I+/-)                                   0.062   0.025   1.088
Rpim(I)                                       0.021   0.009   0.392
Rpim(I+/-)                                    0.029   0.012   0.547
CC half                                       0.999   0.999   0.779
Wilson B factor                              13.816
Anomalous completeness                        100.0    99.9   100.0
Anomalous multiplicity                          4.5     4.7     3.9
Anomalous correlation                         0.068   0.137  -0.028
Anomalous slope                               0.564
Total observations                           464523   24216   20090
Total unique                                  55000    3064    2697
Assuming spacegroup: P 41 21 2
Unit cell (with estimated std devs):
57.97635(6) 57.97635(6) 150.4096(2)

After:

For AUTOMATIC/DEFAULT/NATIVE                 Overall    Low     High
High resolution limit                           1.37    3.72    1.37
Low resolution limit                           54.10   54.14    1.39
Completeness                                  100.0   100.0   100.0
Multiplicity                                    8.4     7.9     7.5
I/sigma                                        16.4    69.9     0.8
Rmerge(I)                                     0.059   0.025   1.003
Rmerge(I+/-)                                  0.055   0.022   0.935
Rmeas(I)                                      0.062   0.027   1.078
Rmeas(I+/-)                                   0.062   0.025   1.086
Rpim(I)                                       0.021   0.009   0.391
Rpim(I+/-)                                    0.029   0.012   0.546
CC half                                       0.999   0.999   0.785
Wilson B factor                              13.721
Anomalous completeness                        100.0   100.0   100.0
Anomalous multiplicity                          4.5     4.7     3.9
Anomalous correlation                         0.068   0.189  -0.044
Anomalous slope                               0.575
Total observations                           464659   24223   20109
Total unique                                  55000    3064    2697
Assuming spacegroup: P 41 21 2
Unit cell (with estimated std devs):
57.97638(6) 57.97638(6) 150.4095(2)

Wall clock time for the integration changed from 344 s to 280 s.

@graeme-winter

Having now made commits to the branch I am no longer qualified to comment on the pull request. Will ask for others to provide input.

@graeme-winter

For inspection:

Original:
/dls/mx-scratch/dials/performance-week/normal/review-942

Latest:
/dls/mx-scratch/dials/performance-week/normal/review-942-extra

@rjgildea

Merging stats look pretty much indistinguishable:

$ xia2.compare_merging_stats /dls/mx-scratch/dials/performance-week/normal/review-942{,-extra}/DataFiles/AUTOMATIC_DEFAULT_scaled_unmerged.mtz plot_labels="old new"

[Plots comparing "old" and "new": cc_one_half, i_over_sigma_mean, r_merge]

@Anthchirp Anthchirp merged commit 5f18ca9 into master Sep 19, 2019
@Anthchirp Anthchirp deleted the sigma_m_calc branch October 22, 2019 09:53