Determine scaling error model using anomalous groups. #1332

Merged: 2 commits, Jul 16, 2020

jbeilstenedmands
Contributor

While investigating anomalous signal in DIALS processing (#1215), I realised that anomalous pairs should be separated for error model determination in scaling.

Consider a low-resolution reflection with a large anomalous difference, measured many times. The I+ observations should be normally distributed around I+, and the I- observations around I-. If the separation is large, it is clearly wrong to assume that they are all normally distributed around Imean, which is the current assumption underlying the error model minimisation. Doing so should inflate the errors (larger error model parameters) and hence reduce metrics such as dI/s(dI). I therefore believe that anomalous data should always be separated for error modelling, regardless of whether anomalous data are separated for scaling model determination.
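
As a rough illustration of the argument above, here is a minimal simulation sketch. The intensities, sigma, and sample size are hypothetical values chosen for illustration, and the helper names are not part of DIALS. It compares the spread of normalised deviations when all observations are referred to a single Imean with the spread when I+ and I- are referred to their own group means.

```python
import random
import statistics


def normalised_deviations(observations, sigma, mean):
    """Return (I - mean) / sigma for each observation."""
    return [(x - mean) / sigma for x in observations]


def simulate(i_plus=1100.0, i_minus=900.0, sigma=20.0, n=2000, seed=0):
    """Compare merged and anomalous-separated treatment of one Friedel pair.

    All numbers are hypothetical; the true sigma is the same in both cases.
    """
    rng = random.Random(seed)
    plus = [rng.gauss(i_plus, sigma) for _ in range(n)]
    minus = [rng.gauss(i_minus, sigma) for _ in range(n)]

    # Merged: every observation is compared with a single Imean.
    i_mean = statistics.fmean(plus + minus)
    merged = normalised_deviations(plus + minus, sigma, i_mean)

    # Separated: I+ and I- are each compared with their own group mean.
    separated = normalised_deviations(
        plus, sigma, statistics.fmean(plus)
    ) + normalised_deviations(minus, sigma, statistics.fmean(minus))

    return statistics.pstdev(merged), statistics.pstdev(separated)


merged_sd, separated_sd = simulate()
print(f"spread around Imean:       {merged_sd:.2f}")
print(f"spread around group means: {separated_sd:.2f}")
```

With these inputs the separated spread should be close to 1, as expected for correctly estimated sigmas, while the merged spread should be close to sqrt(1 + 5^2), about 5.1. An error model fitted to the merged deviations would have to inflate the sigmas to absorb that extra spread.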

I have tested this change on a selection of test datasets. In summary, the change in refined error model parameters leads to an overall increase in I/sigma, dI/s(dI) and anomalous slope, with some datasets affected more than others. The effect on anomalous correlation appears variable.

Beta-lactamase

Metric                  Master          PR
Anomalous correlation   0.312           0.358
Anomalous slope         0.805           1.097
dI/s(dI)                0.956           1.265
Error model a, b        0.975, 0.035    0.711, 0.038

x4-wide

Metric                  Master          PR
Anomalous correlation   0.320           0.280
Anomalous slope         0.163           0.280
dI/s(dI)                0.293           0.418
Error model a, b        4.306, 0.003    2.484, 0.003

Insulin (dataset 5, issue #1215)

Metric                  Master          PR
Anomalous correlation   0.173           0.161
Anomalous slope         1.371           1.712
dI/s(dI)                1.150           1.366
Error model a, b        1.149, 0.051    0.853, 0.056

Thaumatin (dataset 10, issue #1215): less affected

Metric                  Master          PR
Anomalous correlation   0.436           0.443
Anomalous slope         0.283           0.301
dI/s(dI)                0.477           0.498
Error model a, b        2.465, 0.014    2.309, 0.015

Weak thermolysin: less affected

Metric                  Master          PR
Anomalous correlation   0.367           0.363
Anomalous slope         0.532           0.552
dI/s(dI)                0.606           0.631
Error model a, b        1.065, 0.035    1.033, 0.036
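
To show how smaller error model parameters translate into higher I/sigma, here is a minimal sketch. It assumes the two-parameter form sigma'^2 = a^2 (sigma^2 + (b I)^2), which is not stated in this PR; the function name and the reflection's intensity and sigma are hypothetical, and only the a, b pairs are taken from the Beta-lactamase table above.

```python
import math


def adjusted_sigma(intensity, sigma, a, b):
    """Assumed error model form: sigma'^2 = a^2 * (sigma^2 + (b * I)^2)."""
    return a * math.sqrt(sigma**2 + (b * intensity) ** 2)


# Hypothetical reflection, for illustration only.
intensity, sigma = 100.0, 5.0

# Error model parameters from the Beta-lactamase table.
sigma_master = adjusted_sigma(intensity, sigma, a=0.975, b=0.035)
sigma_pr = adjusted_sigma(intensity, sigma, a=0.711, b=0.038)

print(f"master: sigma' = {sigma_master:.2f}, I/sigma' = {intensity / sigma_master:.1f}")
print(f"PR:     sigma' = {sigma_pr:.2f}, I/sigma' = {intensity / sigma_pr:.1f}")
```

Under this assumed form, the drop in a outweighs the slight rise in b for this reflection, so the adjusted sigma shrinks and I/sigma rises.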

@codecov

codecov bot commented Jul 9, 2020

Codecov Report

Merging #1332 into master will increase coverage by 0.02%.
The diff coverage is 96.87%.

@@            Coverage Diff             @@
##           master    #1332      +/-   ##
==========================================
+ Coverage   64.19%   64.21%   +0.02%     
==========================================
  Files         617      617              
  Lines       69784    69832      +48     
  Branches     9557     9566       +9     
==========================================
+ Hits        44797    44846      +49     
+ Misses      23218    23217       -1     
  Partials     1769     1769              
Impacted Files                       Coverage Δ
algorithms/scaling/test_scale.py     100.00% <ø> (ø)
algorithms/scaling/scaler.py         90.24% <96.77%> (+0.10%) ⬆️
command_line/refine_error_model.py   74.48% <100.00%> (ø)
test/command_line/test_export.py     98.05% <0.00%> (+1.05%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f68b77f...ecc57d4.

@graeme-winter
Contributor

@jbeilstenedmands this looks very likely to be exceedingly relevant. It is No. 2 on my review queue this morning :-) Thank you.

@jbeilstenedmands
Contributor Author

Thanks. This is not going to solve the anomalous issue, but I think it is a step on the right path.

@mevol

mevol commented Jul 13, 2020

Yes, I think this moves the right way. Your explanation makes sense, even to me :-)

@graeme-winter
Contributor

LIC strong #1 - easy, big anomalous signal

Processing following main sequence, scaling with dials on master:

Statistics by resolution bin:
 d_max  d_min   #obs  #uniq   mult.  %comp       <I>  <I/sI>    r_mrg   r_meas    r_pim   cc1/2   cc_ano
122.06   3.36  67363   6474   10.41 100.00     448.9    50.9    0.047    0.049    0.015   0.998*   0.375*
  3.36   2.67  67272   6203   10.85 100.00     157.8    33.9    0.061    0.064    0.019   0.998*   0.440*
  2.67   2.33  66447   6124   10.85 100.00      84.3    23.6    0.080    0.084    0.025   0.997*   0.429*
  2.33   2.12  65262   6082   10.73 100.00      59.4    17.5    0.097    0.102    0.031   0.996*   0.350*
  2.12   1.97  66229   6059   10.93 100.00      38.1    12.4    0.123    0.130    0.039   0.995*   0.293*
  1.97   1.85  63794   6061   10.53 100.00      21.3     7.7    0.178    0.187    0.057   0.989*   0.203*
  1.85   1.76  62262   6027   10.33 100.00      12.1     4.7    0.253    0.266    0.082   0.976*   0.148*
  1.76   1.68  64911   6018   10.79 100.00       8.4     3.4    0.330    0.347    0.105   0.958*   0.116*
  1.68   1.62  61483   6004   10.24 100.00       6.0     2.4    0.424    0.446    0.138   0.930*   0.099*
  1.62   1.56  61212   6007   10.19 100.00       4.5     1.8    0.559    0.589    0.183   0.874*   0.053*
  1.56   1.51  63038   6015   10.48 100.00       3.4     1.4    0.711    0.747    0.229   0.841*   0.041*
  1.51   1.47  59623   5984    9.96 100.00       2.5     1.0    0.952    1.004    0.315   0.715*  -0.004
  1.47   1.43  49066   5965    8.23  99.88       1.9     0.6    1.230    1.313    0.451   0.550*   0.021
  1.43   1.40  34937   5669    6.16  94.99       1.6     0.5    1.448    1.581    0.613   0.376*  -0.009
  1.40   1.36  25789   5291    4.87  88.37       1.4     0.3    1.675    1.871    0.800   0.277*  -0.021
  1.36   1.33  18886   4679    4.04  78.31       1.1     0.2    2.027    2.313    1.073   0.185*   0.030
  1.33   1.31  13325   4138    3.22  68.91       0.8     0.2    2.655    3.126    1.595   0.077*   0.002
  1.31   1.28   8216   3367    2.44  56.48       0.8     0.1    2.875    3.557    2.046   0.049*  -0.024
  1.28   1.26   3925   2357    1.67  39.64       0.6     0.1    3.170    4.203    2.726   0.006  -0.154
  1.26   1.24    766    680    1.13  11.42       0.4     0.0   40.771   57.659   40.771  -0.074   0.000
121.47   1.24 923806 105204    8.78  87.11      51.2     9.6    0.085    0.090    0.028   0.999*   0.346*

The same, on the branch:

Statistics by resolution bin:
 d_max  d_min   #obs  #uniq   mult.  %comp       <I>  <I/sI>    r_mrg   r_meas    r_pim   cc1/2   cc_ano
122.06   3.36  67157   6474   10.37 100.00     449.3    60.3    0.046    0.048    0.015   0.998*   0.440*
  3.36   2.67  67240   6203   10.84 100.00     157.9    41.7    0.061    0.064    0.019   0.998*   0.439*
  2.67   2.33  66441   6124   10.85 100.00      84.3    29.6    0.080    0.084    0.025   0.997*   0.430*
  2.33   2.12  65251   6082   10.73 100.00      59.5    22.2    0.097    0.102    0.031   0.996*   0.353*
  2.12   1.97  66224   6059   10.93 100.00      38.1    15.9    0.123    0.129    0.039   0.994*   0.293*
  1.97   1.85  63783   6061   10.52 100.00      21.4     9.9    0.177    0.186    0.057   0.989*   0.200*
  1.85   1.76  62253   6027   10.33 100.00      12.1     6.1    0.252    0.265    0.082   0.976*   0.155*
  1.76   1.68  64903   6018   10.78 100.00       8.4     4.4    0.329    0.345    0.104   0.960*   0.095*
  1.68   1.62  61478   6004   10.24 100.00       6.1     3.1    0.423    0.445    0.138   0.932*   0.104*
  1.62   1.56  61198   6007   10.19 100.00       4.5     2.3    0.556    0.586    0.182   0.880*   0.045*
  1.56   1.51  63028   6015   10.48 100.00       3.4     1.8    0.707    0.744    0.228   0.849*   0.042*
  1.51   1.47  59617   5984    9.96 100.00       2.5     1.3    0.949    1.000    0.314   0.728*  -0.005
  1.47   1.43  49065   5965    8.23  99.88       1.9     0.8    1.229    1.313    0.450   0.552*   0.021
  1.43   1.40  34937   5669    6.16  94.99       1.6     0.6    1.448    1.581    0.613   0.376*  -0.009
  1.40   1.36  25789   5291    4.87  88.37       1.4     0.4    1.675    1.871    0.800   0.277*  -0.021
  1.36   1.33  18886   4679    4.04  78.31       1.1     0.3    2.027    2.313    1.073   0.185*   0.030
  1.33   1.31  13325   4138    3.22  68.91       0.8     0.2    2.655    3.125    1.595   0.077*   0.002
  1.31   1.28   8216   3367    2.44  56.48       0.8     0.2    2.875    3.556    2.045   0.049*  -0.024
  1.28   1.26   3925   2357    1.67  39.64       0.6     0.1    3.169    4.201    2.725   0.006  -0.154
  1.26   1.24    766    680    1.13  11.42       0.4     0.1   41.413   58.568   41.413  -0.074   0.000
121.47   1.24 923482 105204    8.78  87.11      51.2    11.9    0.084    0.089    0.028   0.999*   0.349*
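
To put a number on the change between the two tables, the following sketch computes the ratio of <I/sI> on the branch to that on master. The values are copied from the tables above; only the first twelve resolution bins are used, since the weaker bins are reported to too few significant figures for a meaningful ratio. The variable names are illustrative.

```python
# <I/sI> per resolution bin, copied from the two tables (first twelve bins).
master = [50.9, 33.9, 23.6, 17.5, 12.4, 7.7, 4.7, 3.4, 2.4, 1.8, 1.4, 1.0]
branch = [60.3, 41.7, 29.6, 22.2, 15.9, 9.9, 6.1, 4.4, 3.1, 2.3, 1.8, 1.3]

# Overall <I/sI> from the final row of each table.
overall_master, overall_branch = 9.6, 11.9

ratios = [b / m for m, b in zip(master, branch)]
overall_ratio = overall_branch / overall_master

for m, b, r in zip(master, branch, ratios):
    print(f"{m:6.1f} -> {b:6.1f}  (x{r:.2f})")
print(f"overall: x{overall_ratio:.2f}")
```

The per-bin ratios all fall between roughly 1.2 and 1.3, while r_meas and cc1/2 are essentially unchanged, consistent with the change acting on the sigmas rather than on the intensities.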

Enough of an improvement that I'm keen to have this in master; looks good. The tests all pass, so 👍

Will look at the diffs now

@graeme-winter graeme-winter left a comment

Looks sensible, and makes a measurable improvement, thank you.

I do think we should have some program documentation which explains how this works though - it's important for the user to be able to understand this without reading the code.

@jbeilstenedmands jbeilstenedmands merged commit 62b0ec3 into master Jul 16, 2020
@Anthchirp Anthchirp deleted the anomalous_error_model branch October 30, 2020 10:04