Determine scaling error model using anomalous groups. #1332

jbeilstenedmands · 2020-07-09T20:10:00Z

While investigating anomalous signal in DIALS processing (#1215), I realised that anomalous pairs should be separated for error model determination in scaling.

Consider a low-resolution reflection with a large anomalous difference, measured many times. The I+ measurements should be normally distributed around I+, and the I- measurements around I-. If the separation is large, it is clearly wrong to assume that they are all normally distributed around Imean, which is the current assumption underlying the error model minimisation. Doing so should overinflate the errors (larger error model parameters) and hence reduce metrics such as dI/s(dI). I therefore believe that anomalous pairs should always be separated for error modelling, regardless of whether they are separated for scaling model determination.
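The argument above can be illustrated with a toy simulation. This is only a sketch, not the DIALS implementation: the intensities, the per-measurement sigma and the helper `normalised_deviation_std` are all hypothetical, chosen to show how the spread of normalised deviations grows when I+ and I- are pooled around Imean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values, chosen only for illustration.
i_plus, i_minus = 1100.0, 900.0   # true I+ and I- of one reflection
sigma = 30.0                      # true per-measurement error
n = 5000                          # measurements of each Friedel mate

obs_plus = rng.normal(i_plus, sigma, n)
obs_minus = rng.normal(i_minus, sigma, n)


def normalised_deviation_std(groups):
    """Std of (I - <I>_group) / sigma, pooled over the given groups."""
    deviations = [(g - g.mean()) / sigma for g in groups]
    return float(np.std(np.concatenate(deviations)))


# Anomalous pairs merged: one group, deviations taken about Imean.
merged = normalised_deviation_std([np.concatenate([obs_plus, obs_minus])])
# Anomalous pairs separated: deviations taken about I+ and I- individually.
separated = normalised_deviation_std([obs_plus, obs_minus])

print(f"merged: {merged:.2f}, separated: {separated:.2f}")
```

With the pairs separated, the normalised deviations have a spread close to 1, as expected for correctly estimated errors. With the pairs merged, the spread is much larger, and an error model fitted to it would have to inflate the sigmas to compensate.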

I have tested this change on a selection of test datasets. In summary, the change in refined error model parameters leads to an overall increase in I/sigma, dI/s(dI) and anomalous slope, with some datasets affected more than others. The effect on anomalous correlation seems variable.

Beta-lactamase

	Master	PR
Anomalous correlation	0.312	0.358
Anomalous slope	0.805	1.097
dI/s(dI)	0.956	1.265
Error model a, b	0.975, 0.035	0.711, 0.038

x4-wide

	Master	PR
Anomalous correlation	0.320	0.280
Anomalous slope	0.163	0.280
dI/s(dI)	0.293	0.418
Error model a, b	4.306, 0.003	2.484, 0.003

Insulin (dataset 5, issue #1215)

	Master	PR
Anomalous correlation	0.173	0.161
Anomalous slope	1.371	1.712
dI/s(dI)	1.150	1.366
Error model a, b	1.149, 0.051	0.853, 0.056

Thaumatin (dataset 10, issue #1215), less affected

	Master	PR
Anomalous correlation	0.436	0.443
Anomalous slope	0.283	0.301
dI/s(dI)	0.477	0.498
Error model a, b	2.465, 0.014	2.309, 0.015

Weak thermolysin, less affected

	Master	PR
Anomalous correlation	0.367	0.363
Anomalous slope	0.532	0.552
dI/s(dI)	0.606	0.631
Error model a, b	1.065, 0.035	1.033, 0.036
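To see how the a, b values in the tables above feed through to I/sigma, here is a minimal sketch. It assumes the two-parameter form sigma'^2 = a^2 (sigma^2 + (b I)^2); that form, the function name `adjusted_sigma` and the example observation are assumptions for illustration, not taken from this PR.

```python
import math


def adjusted_sigma(intensity, sigma, a, b):
    """sigma' under the assumed form sigma'^2 = a^2 * (sigma^2 + (b * I)^2)."""
    return a * math.sqrt(sigma**2 + (b * intensity) ** 2)


# Hypothetical single observation, for illustration only.
intensity, sigma = 100.0, 5.0

# Error model a, b for Beta-lactamase, taken from the table above.
master = adjusted_sigma(intensity, sigma, a=0.975, b=0.035)
pr = adjusted_sigma(intensity, sigma, a=0.711, b=0.038)

print(f"I/sigma' master: {intensity / master:.2f}")
print(f"I/sigma' PR:     {intensity / pr:.2f}")
```

Under this assumed form, the smaller `a` refined on the PR branch scales every sigma down, which is consistent with the higher I/sigma and dI/s(dI) reported for the same data.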

codecov · 2020-07-09T20:25:45Z

Codecov Report

Merging #1332 into master will increase coverage by 0.02%.
The diff coverage is 96.87%.

@@            Coverage Diff             @@
##           master    #1332      +/-   ##
==========================================
+ Coverage   64.19%   64.21%   +0.02%     
==========================================
  Files         617      617              
  Lines       69784    69832      +48     
  Branches     9557     9566       +9     
==========================================
+ Hits        44797    44846      +49     
+ Misses      23218    23217       -1     
  Partials     1769     1769

Impacted Files	Coverage Δ
algorithms/scaling/test_scale.py	`100.00% <ø> (ø)`
algorithms/scaling/scaler.py	`90.24% <96.77%> (+0.10%)`	⬆️
command_line/refine_error_model.py	`74.48% <100.00%> (ø)`
test/command_line/test_export.py	`98.05% <0.00%> (+1.05%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

graeme-winter · 2020-07-10T09:02:15Z

@jbeilstenedmands this looks very likely to be exceedingly relevant. It is No. 2 on my review queue this morning :-) thank you

jbeilstenedmands · 2020-07-10T09:04:59Z

Thanks. This is not going to solve the anomalous issue, but I think it is a step on the right path.

mevol · 2020-07-13T09:53:58Z

Yes, I think this moves the right way. Your explanation makes sense, even to me :-)

graeme-winter · 2020-07-16T14:07:21Z

LIC strong #1 - easy, big anomalous signal

Processing following main sequence, scaling with dials on master:

Statistics by resolution bin:
 d_max  d_min   #obs  #uniq   mult.  %comp       <I>  <I/sI>    r_mrg   r_meas    r_pim   cc1/2   cc_ano
122.06   3.36  67363   6474   10.41 100.00     448.9    50.9    0.047    0.049    0.015   0.998*   0.375*
  3.36   2.67  67272   6203   10.85 100.00     157.8    33.9    0.061    0.064    0.019   0.998*   0.440*
  2.67   2.33  66447   6124   10.85 100.00      84.3    23.6    0.080    0.084    0.025   0.997*   0.429*
  2.33   2.12  65262   6082   10.73 100.00      59.4    17.5    0.097    0.102    0.031   0.996*   0.350*
  2.12   1.97  66229   6059   10.93 100.00      38.1    12.4    0.123    0.130    0.039   0.995*   0.293*
  1.97   1.85  63794   6061   10.53 100.00      21.3     7.7    0.178    0.187    0.057   0.989*   0.203*
  1.85   1.76  62262   6027   10.33 100.00      12.1     4.7    0.253    0.266    0.082   0.976*   0.148*
  1.76   1.68  64911   6018   10.79 100.00       8.4     3.4    0.330    0.347    0.105   0.958*   0.116*
  1.68   1.62  61483   6004   10.24 100.00       6.0     2.4    0.424    0.446    0.138   0.930*   0.099*
  1.62   1.56  61212   6007   10.19 100.00       4.5     1.8    0.559    0.589    0.183   0.874*   0.053*
  1.56   1.51  63038   6015   10.48 100.00       3.4     1.4    0.711    0.747    0.229   0.841*   0.041*
  1.51   1.47  59623   5984    9.96 100.00       2.5     1.0    0.952    1.004    0.315   0.715*  -0.004
  1.47   1.43  49066   5965    8.23  99.88       1.9     0.6    1.230    1.313    0.451   0.550*   0.021
  1.43   1.40  34937   5669    6.16  94.99       1.6     0.5    1.448    1.581    0.613   0.376*  -0.009
  1.40   1.36  25789   5291    4.87  88.37       1.4     0.3    1.675    1.871    0.800   0.277*  -0.021
  1.36   1.33  18886   4679    4.04  78.31       1.1     0.2    2.027    2.313    1.073   0.185*   0.030
  1.33   1.31  13325   4138    3.22  68.91       0.8     0.2    2.655    3.126    1.595   0.077*   0.002
  1.31   1.28   8216   3367    2.44  56.48       0.8     0.1    2.875    3.557    2.046   0.049*  -0.024
  1.28   1.26   3925   2357    1.67  39.64       0.6     0.1    3.170    4.203    2.726   0.006  -0.154
  1.26   1.24    766    680    1.13  11.42       0.4     0.0   40.771   57.659   40.771  -0.074   0.000
121.47   1.24 923806 105204    8.78  87.11      51.2     9.6    0.085    0.090    0.028   0.999*   0.346*

The same, on the branch

Statistics by resolution bin:
 d_max  d_min   #obs  #uniq   mult.  %comp       <I>  <I/sI>    r_mrg   r_meas    r_pim   cc1/2   cc_ano
122.06   3.36  67157   6474   10.37 100.00     449.3    60.3    0.046    0.048    0.015   0.998*   0.440*
  3.36   2.67  67240   6203   10.84 100.00     157.9    41.7    0.061    0.064    0.019   0.998*   0.439*
  2.67   2.33  66441   6124   10.85 100.00      84.3    29.6    0.080    0.084    0.025   0.997*   0.430*
  2.33   2.12  65251   6082   10.73 100.00      59.5    22.2    0.097    0.102    0.031   0.996*   0.353*
  2.12   1.97  66224   6059   10.93 100.00      38.1    15.9    0.123    0.129    0.039   0.994*   0.293*
  1.97   1.85  63783   6061   10.52 100.00      21.4     9.9    0.177    0.186    0.057   0.989*   0.200*
  1.85   1.76  62253   6027   10.33 100.00      12.1     6.1    0.252    0.265    0.082   0.976*   0.155*
  1.76   1.68  64903   6018   10.78 100.00       8.4     4.4    0.329    0.345    0.104   0.960*   0.095*
  1.68   1.62  61478   6004   10.24 100.00       6.1     3.1    0.423    0.445    0.138   0.932*   0.104*
  1.62   1.56  61198   6007   10.19 100.00       4.5     2.3    0.556    0.586    0.182   0.880*   0.045*
  1.56   1.51  63028   6015   10.48 100.00       3.4     1.8    0.707    0.744    0.228   0.849*   0.042*
  1.51   1.47  59617   5984    9.96 100.00       2.5     1.3    0.949    1.000    0.314   0.728*  -0.005
  1.47   1.43  49065   5965    8.23  99.88       1.9     0.8    1.229    1.313    0.450   0.552*   0.021
  1.43   1.40  34937   5669    6.16  94.99       1.6     0.6    1.448    1.581    0.613   0.376*  -0.009
  1.40   1.36  25789   5291    4.87  88.37       1.4     0.4    1.675    1.871    0.800   0.277*  -0.021
  1.36   1.33  18886   4679    4.04  78.31       1.1     0.3    2.027    2.313    1.073   0.185*   0.030
  1.33   1.31  13325   4138    3.22  68.91       0.8     0.2    2.655    3.125    1.595   0.077*   0.002
  1.31   1.28   8216   3367    2.44  56.48       0.8     0.2    2.875    3.556    2.045   0.049*  -0.024
  1.28   1.26   3925   2357    1.67  39.64       0.6     0.1    3.169    4.201    2.725   0.006  -0.154
  1.26   1.24    766    680    1.13  11.42       0.4     0.1   41.413   58.568   41.413  -0.074   0.000
121.47   1.24 923482 105204    8.78  87.11      51.2    11.9    0.084    0.089    0.028   0.999*   0.349*

Enough of an improvement that I'm keen to have this in master, looks good. The tests all pass so 👍

Will look at the diffs now

graeme-winter

Looks sensible, and makes a measurable improvement, thank you.

I do think we should have some program documentation which explains how this works though - it's important for the user to be able to understand this without reading the code.

Determine scaling error model using anomalous groups.

4de992d

jbeilstenedmands requested a review from graeme-winter July 9, 2020 20:10

Add newsfragment

ecc57d4

graeme-winter approved these changes Jul 16, 2020

jbeilstenedmands merged commit 62b0ec3 into master Jul 16, 2020

Anthchirp deleted the anomalous_error_model branch October 30, 2020 10:04
