
Separate independent blocks of experiments in refinement #2336

Merged

Conversation

Member

@dagewa dagewa commented Feb 6, 2023

For #2235.

This changeset identifies subsets of the experiment list that are independent (i.e. they don't share any models with other subsets) and performs refinement separately for each of these groups. This is put under the control of the user of dials.refine via the separate_independent_sets parameter (True by default).

It turned out to be much easier to do this at a high level, before any Refiners are made, than to attempt a precise analysis of the structure of the refinement problem. For that reason, this is not attempted at all if any constraints or restraints are present, as these may link the models in a way that is not clear from the experiment list alone.
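The grouping described here is essentially a connected-components problem: two experiments belong in the same block whenever they share a model. A minimal sketch of the idea (hypothetical code, not the DIALS implementation; the `independent_groups` helper and the dict-of-models representation of an experiment are invented for illustration):

```python
# Hypothetical sketch: partition experiments into independent refinement blocks.
# Each "experiment" here is a dict of model references; real DIALS experiments
# hold beam, detector, goniometer and crystal objects instead.

def independent_groups(experiments):
    """Union-find over experiments, linking any two that share a model."""
    parent = list(range(len(experiments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    seen = {}  # id(model) -> first experiment index that used it
    for i, expt in enumerate(experiments):
        for model in expt.values():
            if model is None:
                continue
            key = id(model)  # shared models compare by object identity
            if key in seen:
                union(seen[key], i)
            else:
                seen[key] = i

    groups = {}
    for i in range(len(experiments)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

beam_a, beam_b = object(), object()
det_a, det_b = object(), object()
expts = [
    {"beam": beam_a, "detector": det_a},
    {"beam": beam_a, "detector": det_b},      # shares beam_a with expt 0
    {"beam": beam_b, "detector": object()},   # fully independent
]
print(independent_groups(expts))  # [[0, 1], [2]]
```

With this view, "strictly independent" simply means an experiment's component sits alone in the graph; any shared beam or detector collapses everything it touches into one block.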

Note that refinement results are not the same if refinement is done in separate groups compared to all at once. There could be many reasons for this:

  • Outlier rejection may differ
  • The number of steps to convergence may differ between each group and with the global job
  • Refinement engines may take a different path to convergence depending on the parameters available for refinement. For example, the Levenberg Marquardt algorithm uses a damping parameter, updates to which depend on the overall sum of squared residuals
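The last point can be made concrete with a toy example. In the sketch below (illustrative only; `lm_fit` and its damping heuristic are invented stand-ins, not DIALS code or a full Levenberg-Marquardt implementation), two slope parameters touch disjoint data, yet a shared residual-driven damping term makes the joint trajectory differ slightly from two separate runs:

```python
# Toy illustration: a shared damping parameter, updated from the *overall*
# sum of squared residuals, couples otherwise-independent parameters.

def lm_fit(datasets, n_steps=3, lam=1.0):
    """Fit y = a*x per dataset with one shared, residual-driven damping term."""
    params = [0.0] * len(datasets)
    for _ in range(n_steps):
        def ssq(ps):
            return sum(
                (y - a * x) ** 2
                for a, (xs, ys) in zip(ps, datasets)
                for x, y in zip(xs, ys)
            )
        before = ssq(params)
        params = [
            a + sum(x * (y - a * x) for x, y in zip(xs, ys))
                / (sum(x * x for x in xs) + lam)  # damped Gauss-Newton step
            for a, (xs, ys) in zip(params, datasets)
        ]
        # crude damping update: shrink lam in proportion to the residual drop
        lam = max(1e-9, lam * ssq(params) / before)
    return params

data1 = ([1.0, 2.0, 3.0], [2.0, 4.1, 5.9])  # slope ~2
data2 = ([1.0, 2.0, 3.0], [3.0, 5.0, 8.0])  # slope ~2.6

joint = lm_fit([data1, data2])
separate = [lm_fit([data1])[0], lm_fit([data2])[0]]
# joint and separate agree closely but are not bit-identical, because the
# damping trajectory of the joint run blends the residuals of both datasets
```

The per-parameter steps are identical mathematics in both runs; only the damping history differs, which is exactly why block-separated refinement cannot be expected to reproduce the joint result digit for digit.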

Currently a draft, to assess the impact of the change. @phyy-nx, I'm interested to hear whether this will affect any of your use cases. Note, if anything is shared between all the models (such as a beam or detector) then this will have no effect: groups of experiments must be strictly independent from other groups for the splitting to trigger. Also, as this occurs at a high level (inside the run_dials_refine function of the command line program) it does not affect other usages of Refiners, such as indexing.

I have not yet added any new tests or attempted to perform the refinement of separate groups in parallel. That could be interesting, but could be expensive in terms of memory usage.

Still to do:
- Report on which ids are in each block (table in log file)
- Rejoin results after refining each block
- Report on overall RMSDs by experiment after rejoining
- Tests
Commit note: also, return None for refiner and history if refinement has been done in disjoint blocks. In such a case, refuse to write out certain types of debugging/analysis files.
@dagewa dagewa linked an issue Feb 6, 2023 that may be closed by this pull request

codecov bot commented Feb 6, 2023

Codecov Report

Merging #2336 (21d0ea7) into main (f117929) will increase coverage by 4.36%.
The diff coverage is 46.71%.

❗ Current head 21d0ea7 differs from pull request most recent head fda43ac. Consider uploading reports for the commit fda43ac to get more accurate results

@@            Coverage Diff             @@
##             main    #2336      +/-   ##
==========================================
+ Coverage   78.62%   82.99%   +4.36%     
==========================================
  Files         603      593      -10     
  Lines       73644    68676    -4968     
  Branches    10005     9246     -759     
==========================================
- Hits        57903    56998     -905     
+ Misses      13612     9539    -4073     
- Partials     2129     2139      +10     

Member

phyy-nx commented Feb 6, 2023

Very interesting. I believe Brewster 2018, Figure 6 is exactly this use case. The blue curve is normal indexing with a slightly wrong model, the green is refining a detector model per image and the red is joint refinement with a single detector model. Both red and green go to the right location, but green has a wide distribution due to independence of each sample. This was with what would have been the old default of separate_independent_sets=False from this PR.

I deposited the expt files needed to make those plots but apparently I didn't deposit the refl files, so repeating that exactly isn't easy. I'll see what I can do to reproduce it on this branch.

Member Author

dagewa commented Feb 7, 2023

To reassure that differences in refinement are not due to a bug introduced by this change we can compare output if we use the basic Gauss–Newton engine, with no outlier rejection, and restricting the number of steps to make sure both blocks take the same number of steps (in fact, in this case, they do anyway):

dials.import $(dials.data get -q insulin)/insulin_1_0{01..22}.img
dials.find_spots imported.expt
dials.index imported.expt strong.refl \
  output.experiments=first.expt output.reflections=first.refl
dials.import $(dials.data get -q insulin)/insulin_1_0{23..45}.img
dials.find_spots imported.expt
dials.index imported.expt strong.refl \
  output.experiments=second.expt output.reflections=second.refl
dials.combine_experiments first.{expt,refl} second.{expt,refl}
dials.refine combined.{expt,refl} scan_varying=false \
  outlier.algorithm=null engine=GaussNewton max_iterations=2 \
  output.log=first.refined.log
dials.refine combined.{expt,refl} scan_varying=false \
  outlier.algorithm=null engine=GaussNewton max_iterations=2 \
  separate_independent_sets=False output.log=second.refined.log

In both cases the final RMSD table is

RMSDs by experiment:
+-------+--------+----------+----------+------------+
|   Exp |   Nref |   RMSD_X |   RMSD_Y |     RMSD_Z |
|    id |        |     (px) |     (px) |   (images) |
|-------+--------+----------+----------+------------|
|     0 |  11002 |  0.24532 |  0.35418 |    0.25377 |
|     1 |  11012 |  0.26789 |  0.38878 |    0.24988 |
+-------+--------+----------+----------+------------+

Member

phyy-nx commented Feb 7, 2023

Ok, here's what I did. I re-indexed run 29 of the thermolysin data in Brewster 2018, but with Z changed by 1 mm. I then grabbed 1000 images and did hierarchy level 0 dials.refine on the main branch, with 1 or N detector models. The result looks like Figure 6 from the paper:

[image: unit cell distributions, analogous to Brewster 2018 Figure 6]

Blue is before refinement, orange is refinement with N detector models and green is refinement with 1 detector model. Orange is wider than green as expected.

Next I checked out this branch and refined with 1 detector model using the default separate_independent_sets=True, and with N detector models with separate_independent_sets=False. I verified I got identical unit cell distributions as on main. Note, the latter case took 4 minutes and 5 seconds.

Finally I refined with N detector models using separate_independent_sets=True. Here, however, I hit a problem. The program ran for a while and then my interactive node at LCLS became unresponsive and I was kicked off. I moved the data to the 64-core dials.lbl.gov and tried again. The program ran for 22 minutes before it crashed with just the message `Killed`. I was watching top, and memory slowly increased until it hit at least 40 GB of usage before the process stopped.

Not sure where to go from here. I've been curious about this since we wrote that paper, because refining independently should give the same result as refining together if the models are independent, yet it doesn't, likely for the reasons you gave above @dagewa. So I hope there is a solution to this memory problem. Regardless, I've put the files in a Google Drive and shared them with you if it helps.

Member Author

dagewa commented Feb 8, 2023

Thanks @phyy-nx, that's interesting. I'm not sure why the memory usage is so high this way, but I'll try to profile it.

Member Author

dagewa commented Feb 9, 2023

I can see the memory problem. It's because of copy.deepcopy(params). I wanted that so that each refinement job got its own copy of the refinement PHIL parameters, because some of the Auto parameters would be modified by the Refiner. But guess what?

In [5]: params.input.reflections
Out[5]: [FilenameDataWrapper(filename='L785_pr2336/filtered.refl', data=<dials_array_family_flex_ext.reflection_table object at 0x7f0a9d560180>)]

Each copy of params also copies the entire reflection table (oh and the experiments too). So your job was duplicating a 30 MB reflection table 1000 times even before it got into refinement 😱
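A minimal reproduction of this pitfall, using invented `Params`/`FilenameDataWrapper` stand-ins rather than the real PHIL objects, plus one possible workaround (detach the heavy payload before the deepcopy so every per-block copy shares the same underlying table):

```python
# Sketch of the deepcopy trap: copying a parameter object also deep-copies
# any large data hanging off it. All names here are illustrative stand-ins.
import copy

class FilenameDataWrapper:
    def __init__(self, filename, data):
        self.filename = filename
        self.data = data  # stands in for a large reflection table

class Params:
    def __init__(self, wrappers):
        self.input_reflections = wrappers

big_table = list(range(100_000))  # pretend this is a 30 MB reflection table
params = Params([FilenameDataWrapper("filtered.refl", big_table)])

naive = copy.deepcopy(params)
assert naive.input_reflections[0].data is not big_table  # full copy was made!

def cheap_copy(p):
    """Deepcopy the params but share (not copy) the attached data."""
    saved = [(w, w.data) for w in p.input_reflections]
    try:
        for w, _ in saved:
            w.data = None          # detach heavy payload
        clone = copy.deepcopy(p)   # cheap: copies only the lightweight shell
    finally:
        for w, data in saved:
            w.data = data          # restore the original object
    for w, (_, data) in zip(clone.input_reflections, saved):
        w.data = data              # clone shares the same table
    return clone

clone = cheap_copy(params)
assert clone.input_reflections[0].data is big_table  # shared, not copied
```

Another route with the same effect is to seed `copy.deepcopy`'s `memo` dict with the heavy objects so they are treated as already copied.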

Member Author

dagewa commented Feb 9, 2023

I can work around that, but I've realised there is another problem. In your refine.phil you have outlier.separate_experiments=False. This is supposed to join the reflections from all experiments together for outlier rejection purposes. But the splitting is done before outlier rejection, so as a consequence this parameter is effectively ignored.

I guess it would be better to do the outlier rejection step prior to splitting, so it is really only the refinement that is done in independent sets. However, this is going to be a somewhat more significant change. At the moment, outlier rejection is done as part of the construction of a Refiner, not as a separate step.
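The intended order of operations can be sketched as follows (the MAD-based rejection below is a hypothetical illustration, not one of the actual DIALS outlier algorithms):

```python
# Sketch: reject outliers once on the pooled reflections, *then* split by
# experiment for independent refinement, so a global cut sees all experiments.
import statistics

def global_outlier_mask(residuals, nsigma=3.0):
    """Keep values within nsigma robust-sigmas of the pooled median."""
    med = statistics.median(residuals)
    mad = statistics.median(abs(r - med) for r in residuals)
    sigma = 1.4826 * mad  # MAD -> sigma for a normal distribution
    return [abs(r - med) <= nsigma * sigma for r in residuals]

# (experiment_id, residual) pairs pooled across all experiments
reflections = [(0, 0.1), (0, -0.2), (0, 5.0), (1, 0.15), (1, -0.1), (1, 0.05)]

mask = global_outlier_mask([r for _, r in reflections])
kept = [(eid, r) for (eid, r), ok in zip(reflections, mask) if ok]

# only now split into per-experiment blocks for refinement
blocks = {}
for eid, r in kept:
    blocks.setdefault(eid, []).append(r)
print(blocks)  # the 5.0 outlier in experiment 0 was rejected globally
```

Doing the cut per block instead would compute a separate median and spread for each experiment, which is exactly the behaviour outlier.separate_experiments=False is meant to avoid.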

Member Author

dagewa commented Feb 10, 2023

Hi @phyy-nx, with the change in 1e4763d this job:

time dials.refine L785_pr2336/filtered.{expt,refl} L785_pr2336/refine.phil outlier.algorithm=None

completes in 1m25s on my laptop and uses very little memory. So, that's as it should be. But it's still not the job you want to be able to run. I had to set outlier.algorithm=None because otherwise the SauterPoon outlier algorithm removes all reflections from the first experiment and refinement fails.

Separating the outlier rejection from the rest of the construction of the Refiner is not as bad as I first thought. I think I'll be able to do that quite easily.

Member

phyy-nx commented Feb 10, 2023

Ok, good to hear!

Commit note: split Refiner construction into two phases:

1. _build_reflection_manager_and_predictor
2. _build_refiner

This makes it possible to interrupt building the refiner after outlier rejection is done. Add a method, reflections_after_outlier_rejection, which does this.
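A hedged sketch of what this two-phase construction might look like (the class and method names echo the commit message, but the bodies are simplified stand-ins, not the real refiner code):

```python
# Illustrative two-phase builder: phase one performs outlier rejection,
# phase two builds the refiner, and callers may stop in between.

class RefinerFactory:
    def __init__(self, reflections):
        self._reflections = reflections
        self._manager = None

    def _build_reflection_manager_and_predictor(self):
        # stand-in outlier rejection: drop residuals above a fixed cut
        self._manager = [r for r in self._reflections if abs(r) < 1.0]
        return self._manager

    def reflections_after_outlier_rejection(self):
        """Run only phase one and hand back the surviving reflections."""
        if self._manager is None:
            self._build_reflection_manager_and_predictor()
        return self._manager

    def _build_refiner(self):
        # phase two reuses phase one's result rather than redoing it
        if self._manager is None:
            self._build_reflection_manager_and_predictor()
        return {"reflections": self._manager}  # stand-in Refiner object

factory = RefinerFactory([0.1, -0.3, 2.5, 0.2])
survivors = factory.reflections_after_outlier_rejection()
print(survivors)  # [0.1, -0.3, 0.2]
```

The design choice is that outlier rejection becomes an interruptible step: the caller can run it globally, repartition the survivors into independent blocks, and only then complete refiner construction per block.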
Member Author

dagewa commented Mar 1, 2023

With those last changes, dials.refine will respect outlier.separate_experiments=False and do one round of global outlier rejection before getting stuck into refining each experiment independently. I ran this job:

time dials.refine L785_pr2336/filtered.{expt,refl} L785_pr2336/refine.phil output.log=branch.log output.experiments=branch.expt output.reflections=branch.refl

and the same on main. The resulting tables of RMSDs by experiment differ only by insignificant digits.

@dagewa dagewa marked this pull request as ready for review March 1, 2023 16:56
Member Author

dagewa commented May 3, 2023

Hi @graeme-winter, I don't want to merge this without a review / testing from you, as this came from your use case in #2235

@graeme-winter
Contributor

> Hi @graeme-winter, I don't want to merge this without a review / testing from you, as this came from your use case in #2235

Noted, gimme a mo

@graeme-winter
Contributor

OK, ran a quick test against a 10-experiment lysozyme data set. On this branch:


      117.09 real       132.30 user        12.40 sys
          1348579328  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             5327929  page reclaims
                1911  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   5  signals received
                5601  voluntary context switches
               54666  involuntary context switches
        798101697148  instructions retired
        351156830466  cycles elapsed
          1116266496  peak memory footprint

Meanwhile, on main:

      151.22 real       200.34 user        13.05 sys
          1531678720  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             5964217  page reclaims
                 723  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   5  signals received
                7458  voluntary context switches
              154298  involuntary context switches
       1138406642559  instructions retired
        480421803314  cycles elapsed
          1347272704  peak memory footprint

Seems to have substantially reduced the wall clock time.

@graeme-winter
Contributor

Modest but probably useful reduction in memory footprint

>>> main = 1531678720
>>> branch = 1348579328
>>> main - branch
183099392
>>> _ / (1024.**2)
174.6171875
>>> main / (1024.**2)
1460.72265625
>>> branch / (1024.**2)
1286.10546875
>>> 1 - (branch / main)
0.11954164382462662

@graeme-winter
Contributor

Output (diff of the RMSD tables between the two runs):

< RMSDs by experiment:
< +-------+--------+----------+----------+------------+
< |   Exp |   Nref |   RMSD_X |   RMSD_Y |     RMSD_Z |
< |    id |        |     (px) |     (px) |   (images) |
< |-------+--------+----------+----------+------------|
< |     0 |   7676 |  0.13675 |  0.12385 |   0.048474 |
< |     1 |   8082 |  0.14012 |  0.12605 |   0.04502  |
< |     2 |   7039 |  0.14684 |  0.12941 |   0.050837 |
< |     3 |   6868 |  0.14581 |  0.12924 |   0.056064 |
< |     4 |   7950 |  0.14091 |  0.12468 |   0.047592 |
< |     5 |   6974 |  0.14146 |  0.12772 |   0.048678 |
< |     6 |   8174 |  0.13517 |  0.12453 |   0.043553 |
< |     8 |   8207 |  0.143   |  0.13285 |   0.052857 |
---
> |     0 |   7676 |  0.13675 |  0.12384 |   0.048474 |
> |     1 |   8083 |  0.14014 |  0.12608 |   0.045023 |
> |     2 |   7039 |  0.14694 |  0.12942 |   0.0508   |
> |     3 |   6867 |  0.14588 |  0.1291  |   0.056095 |
> |     4 |   7950 |  0.14086 |  0.12467 |   0.047603 |
> |     5 |   6972 |  0.14163 |  0.12754 |   0.048595 |
> |     6 |   8174 |  0.13507 |  0.12463 |   0.043555 |
> |     8 |   8206 |  0.143   |  0.13283 |   0.052848 |

I consider these results to be equivalent

@graeme-winter left a comment (Contributor)

Change set delivers on promise of reducing resource requirements without meaningfully affecting the quality of output -> win

Would ask: in the loop-over-data-set part of refinement, please can we print the experiment ID being refined, as a progress report for the impatient user?

Otherwise looks excellent, thank you

Member Author

dagewa commented May 3, 2023

Thank you. Looks like a useful set of suggestions / comments, which I will work through soon.

Member Author

dagewa commented May 4, 2023

I've addressed all comments now, so just waiting for the green ticks.

@dagewa dagewa merged commit 9918059 into main May 5, 2023
16 checks passed
@dagewa dagewa deleted the 2235-dialsrefine-possible-to-block-diagonalise-the-matrix branch May 5, 2023 08:13
Successfully merging this pull request closes: dials.refine: possible to block-diagonalise the matrix? (#2235)