Sparse scan-varying refinement #2022

Merged (37 commits into main), Mar 25, 2022
Conversation

@dagewa (Member) commented Mar 3, 2022

This PR enables better use of sparse storage during gradient calculation for scan-varying parameterisations, avoiding the storage of a large number of explicit zero elements. For a 3600-image, 360° test dataset this reduces the overall memory requirement of a dials.refine scan_varying=true run to 93% of that required on the main branch. The saving should improve for larger scans (i.e. multi-turn data collections) as the sparseness of the Jacobian increases. As well as the memory reduction, the total wall clock time for the run is reduced to 77% of that on the main branch (see #1800).

Fixes #1800.
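
As background to the claim above, an illustrative sketch (plain NumPy/SciPy, not DIALS code; the reflection count, sample count and smoothing window below are made-up numbers) of why the scan-varying Jacobian is so sparse: each sample of a scan-varying parameter only has non-zero gradients for reflections recorded on images within its smoothing interval, so each Jacobian column is dominated by structural zeroes that sparse column storage never has to hold.

import numpy as np
from scipy import sparse

n_refl = 100_000                 # one Jacobian row per reflection residual (made up)
n_images = 3600                  # a 3600-image, 360 degree scan
n_samples = 37                   # hypothetical smoother samples for one parameter
window = n_images // (n_samples - 1)

rng = np.random.default_rng(0)
image_of_refl = rng.integers(0, n_images, n_refl)

cols = []
for s in range(n_samples):
    col = np.zeros(n_refl)
    # only reflections on images near this sample's position get a gradient
    near = np.abs(image_of_refl - s * window) < window
    col[near] = rng.normal(size=int(near.sum()))
    cols.append(col)

J_dense = np.column_stack(cols)
J_sparse = sparse.csc_matrix(J_dense)

dense_mb = J_dense.nbytes / 1e6
sparse_mb = (J_sparse.data.nbytes + J_sparse.indices.nbytes + J_sparse.indptr.nbytes) / 1e6
print(f"dense: {dense_mb:.1f} MB, sparse: {sparse_mb:.1f} MB")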

dagewa added 30 commits July 27, 2021 10:38
This is a hangover from before refactoring in 0f6b2e4
Add tests for scalar versions, add tests for mat3 * mat3.
- parts()
- dot(other)
- rotate_around_origin(direction, angle)
This turned out to be quite complicated. The intersection and
intersection_i_seqs methods did not quite do the right thing
because they require sorted input arrays, whereas this has to
preserve the original (possibly unsorted) order. I ended up with
a loop in Python to do what I want, but this should be converted
to C++ for speed (see the selection sketches after this commit list).
Just calculate as required - it's cheap enough
This currently fails in downstream calculations, so keep the dense version
around for now, to remove later once this is fixed.
With changes in _grads_model_loop, the sparse calculations are seen
to give the same results as the dense versions. Changes still need
to be made to _grads_detector_loop.
Use a map to look up values. Still slow and needs to move to C++
This will record the normal matrix at each step in the refinement
history. In addition, leave a stub for recording the Jacobian
structure in the history as well. Also tidy some of the code.
…al Python version

At the moment, including a test of total execution time. The Python version is 2 times
*faster*.
results. Unfortunately this test fails when the inputs have random
order!
version. Use this to find corner cases where the results are not
the same.
The NumPy method returns sorted values, which means that it only
works for selection when the input arrays are themselves sorted.
This is often the case but cannot be guaranteed, so the method
is not suitable for use in SparseFlex.select. Added a test to
demonstrate the failure of this method when the input arrays are
unsorted (illustrated after this commit list).
For reasons I don't understand, using the unordered_map a little
differently makes a dramatic difference to the execution speed.
The C++ version is now the fastest: about 8 times faster than the
pure Python version and 4 times faster than the NumPy version.

Added a test that demonstrates execution speed.
Creating a SparseFlex with zero-length data would leave the data
type unset because of the early exit from `extend`. This change
avoids that bug, but it would be better to always set the data
type on construction (sketched after this commit list).
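
Two of the commits above lend themselves to small illustrations. Both sketches below are plain Python/NumPy with made-up names and data; they are not the DIALS implementation.

First, the selection problem: np.intersect1d returns the common values sorted, so the order of the requested indices is lost, whereas a key-to-value lookup walked in request order preserves it (the "loop in Python" that the commits note was later moved to C++ for speed).

import numpy as np

stored = np.array([7, 2, 9])     # keys of stored elements (unsorted)
requested = np.array([9, 7])     # selection order that must be preserved

common, idx_stored, idx_req = np.intersect1d(stored, requested, return_indices=True)
print(common)        # [7 9] -- sorted, so the requested order [9, 7] is lost

def select_preserving_order(keys, values, requested):
    """Return the stored value for each requested key, in request order."""
    lookup = dict(zip(keys, values))
    return [lookup[k] for k in requested if k in lookup]

print(select_preserving_order([7, 2, 9], ["g7", "g2", "g9"], [9, 7, 100]))
# ['g9', 'g7'] -- request order kept; the absent key 100 is skipped

Second, the zero-length bug pattern from the last commit, as a hypothetical stand-in class (assumed names, not the real SparseFlex):

class SparseContainer:
    """Hypothetical stand-in for SparseFlex, showing the bug pattern only."""

    def __init__(self, dtype=None):
        # setting the element type here, on construction, avoids the bug below
        self._dtype = dtype
        self._data = []

    def extend(self, data):
        if len(data) == 0:
            # early exit: an empty container never learns its element type
            # unless the type was already set on construction
            return
        if self._dtype is None:
            self._dtype = type(data[0])
        self._data.extend(data)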
@dagewa (Member Author) commented Mar 3, 2022

Tagged @graeme-winter for review, and testing with multi-turn datasets

@graeme-winter (Contributor)

Noted, wilco, just processing a ... 10-turn data set right now

@graeme-winter (Contributor)

Right, this is going to be a process...

        Command being timed: "dials.refine ../indexed.expt ../indexed.refl"
        User time (seconds): 6713.07
        System time (seconds): 118.45
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:53:55
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 38208560
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 83111430
        Voluntary context switches: 8595
        Involuntary context switches: 2728
        Swaps: 0
        File system inputs: 0
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

timing for 1st run, on main, in P1

@graeme-winter (Contributor)

Not P1, I23

        Command being timed: "dials.refine ../i23.expt ../i23.refl"
        User time (seconds): 1894.68
        System time (seconds): 56.56
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 32:34.73
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 20295136
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 36415813
        Voluntary context switches: 8592
        Involuntary context switches: 955
        Swaps: 0
        File system inputs: 0
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@graeme-winter (Contributor)

Branch: i23

        Command being timed: "dials.refine ../i23.expt ../i23.refl"
        User time (seconds): 1542.23
        System time (seconds): 28.70
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 26:21.68
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 20295900
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 832
        Minor (reclaiming a frame) page faults: 20870183
        Voluntary context switches: 12435
        Involuntary context switches: 643
        Swaps: 0
        File system inputs: 429744
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@graeme-winter (Contributor)

P1 on the branch

        Command being timed: "dials.refine ../indexed.expt ../indexed.refl"
        User time (seconds): 5928.33
        System time (seconds): 66.51
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:39:59
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 38212152
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 41046323
        Voluntary context switches: 8619
        Involuntary context switches: 2055
        Swaps: 0
        File system inputs: 0
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@dagewa (Member Author) commented Mar 10, 2022

A reduction in wall clock time is good to see, but it is surprising that there is essentially no reduction in memory requirements. With the 360° test dataset I saw a reduction to 93% of the memory requirement: not a large reduction, but measurable. I thought this would improve with more parameters, but maybe the memory is going somewhere else. I'm looking at the 10-turn datasets now...

@dagewa (Member Author) commented Mar 10, 2022

Here's some analysis of the Jacobian structure on this branch for the 10-turn data set, starting with a single-step scan-varying refinement run. First for the I23 case:

dials.refine scan_varying=True max_iterations=1\
  track_jacobian_structure=True history=history_I23.json\
  ../i23.expt ../i23.refl 

Then extract the Jacobian structure and plot:

import json
from matplotlib import pyplot as plt

# Load the refinement history written by track_jacobian_structure=True
# (matches history=history_I23.json in the command above)
with open("history_I23.json") as f:
    history = json.load(f)

# Jacobian structure recorded at the final refinement step
jacobian_structure = history["data"]["jacobian_structure"][-1]
iparam = list(range(len(jacobian_structure)))

# Per-parameter row counts: elements never stored (structural zeroes),
# stored elements with value zero (explicit zeroes), and true non-zeroes
structural_zeroes = [param["structural_zeroes"] for param in jacobian_structure]
all_zeroes = [param["all_zeroes"] for param in jacobian_structure]
all_rows = [param["nrows"] for param in jacobian_structure]
explicit_zeroes = [a - b for a, b in zip(all_zeroes, structural_zeroes)]
non_zeroes = [a - b for a, b in zip(all_rows, all_zeroes)]

# Stacked bar chart of the Jacobian column structure, one bar per parameter
fig, ax = plt.subplots()
ax.bar(iparam, structural_zeroes, label="Structural zeroes", width=1.0)
ax.bar(iparam, explicit_zeroes, label="Explicit zeroes", bottom=structural_zeroes, width=1.0)
ax.bar(iparam, non_zeroes, label="Non-zeroes", bottom=all_zeroes, width=1.0)
ax.legend(loc="lower right")
ax.set_ylabel("Rows")
ax.set_xlabel("Parameters")
ax.set_title("Jacobian structure")

plt.savefig("structure_I23.png", dpi=300)

[Figure structure_I23.png: stacked bars per parameter showing structural zeroes, explicit zeroes and non-zeroes]
This indicates that the Jacobian truly is sparse: almost every row of each column is a structural zero (an element never stored), with no explicit zeroes (stored elements whose value happens to be zero) visible.

@dagewa (Member Author) commented Mar 10, 2022

Likewise for P1
[Figure structure_P1.png: the same Jacobian structure plot for the P1 data set]
In this case there is a very small number of explicit zeroes, which occur where a crystal parameter cannot be determined for particular reflections (e.g. the length of a unit cell vector that lies parallel to the beam at some orientation).

@dagewa (Member Author) commented Mar 10, 2022

This branch does not change the default refinement engine, only the sparseness of the Jacobian. One thing that seemed worth exploring is whether setting engine=SparseLevMar makes a difference, as this would use not only a sparse Jacobian, but a sparse normal matrix as well.
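
For context, a generic sketch of the machinery involved (textbook Levenberg-Marquardt, not the DIALS engines; the matrix sizes and density are arbitrary): the step solves the normal equations (J^T J + lambda I) dx = -J^T r, and the normal matrix J^T J can be far denser than a sparse J, which limits what sparse normal-matrix storage can save.

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(1)
J = sparse.random(2000, 300, density=0.02, format="csr", random_state=1)
r = rng.normal(size=2000)        # residuals
lam = 1e-3                       # Levenberg-Marquardt damping

JtJ = (J.T @ J).tocsr()          # normal matrix: much denser than J itself
A = (JtJ + lam * sparse.identity(300)).tocsc()
step = spsolve(A, -(J.T @ r))

print(f"J density:   {J.nnz / np.prod(J.shape):.3f}")
print(f"JtJ density: {JtJ.nnz / np.prod(JtJ.shape):.3f}")
print(f"step norm:   {np.linalg.norm(step):.3f}")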

engine=LevMar (default)

Command being timed: "dials.refine scan_varying=True track_jacobian_structure=True ../i23.expt ../i23.refl max_iterations=1"
	User time (seconds): 115.05
	System time (seconds): 3.99
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:59.51
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 6948076
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 4550607
	Voluntary context switches: 66
	Involuntary context switches: 628
	Swaps: 0
	File system inputs: 2008
	File system outputs: 1423256
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

engine=SparseLevMar

Command being timed: "dials.refine scan_varying=True track_jacobian_structure=True ../i23.expt ../i23.refl max_iterations=1 engine=SparseLevMar"
	User time (seconds): 110.34
	System time (seconds): 6.96
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:57.84
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 10517444
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 8899464
	Voluntary context switches: 47
	Involuntary context switches: 903
	Swaps: 0
	File system inputs: 0
	File system outputs: 1423272
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

No, that didn't help! Peak memory requirements for SparseLevMar are 1.5 times higher than for LevMar (for this job).

@dagewa (Member Author) commented Mar 10, 2022

The purpose of this branch is to properly account for sparseness during gradient calculations for scan-varying refinement. The plots indicate that has been achieved. However, it appears this only translates to a modest improvement in overall performance of dials.refine. Further reduction in the memory footprint would have to be explored outside the gradient calculations. I don't think that should occur on this branch though.

@dagewa (Member Author) commented Mar 10, 2022

Ok, something strange is going on here. I tried a full scan-varying run on the branch vs main, and I get a very significant difference in run time and memory requirements: the branch takes about a third of the time and uses about a third of the memory of main, at least according to mprof. I'll follow up with /usr/bin/time -v runs next.

Command: dials.python -m mprof run /home/fcx32934/sw/cctbx/modules/dials/command_line/refine.py ../i23.expt ../i23.refl scan_varying=true

[Figure: mprof memory profile, branch]

[Figure: mprof memory profile, main]

@dagewa (Member Author) commented Mar 10, 2022

branch

	Command being timed: "dials.refine ../i23.expt ../i23.refl scan_varying=true log=branch.log"
	User time (seconds): 356.22
	System time (seconds): 4.47
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 6:01.25
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 6992504
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 4970387
	Voluntary context switches: 201
	Involuntary context switches: 7777
	Swaps: 0
	File system inputs: 64
	File system outputs: 1423488
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

main

	Command being timed: "dials.refine ../i23.expt ../i23.refl scan_varying=true log=main.log"
	User time (seconds): 1486.89
	System time (seconds): 74.38
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 26:02.26
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 18652552
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 105010952
	Voluntary context switches: 293
	Involuntary context switches: 8726
	Swaps: 0
	File system inputs: 152
	File system outputs: 1423208
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

The branch is more than four times faster and uses almost a third of the memory.

One difference from @graeme-winter's runs is that I set scan_varying=True rather than leaving it as the default Auto. Perhaps running the static macrocycle first sends the subsequent scan-varying run down the wrong code path, so it misses the benefit of the branch. I'll try that next.

@dagewa (Member Author) commented Mar 10, 2022

Re correctness, output logs from the two runs are essentially identical:
[Screenshot from 2022-03-10: the two refinement logs side by side]

@dagewa (Member Author) commented Mar 10, 2022

Yep, something stupid is happening when scan_varying=auto. This appears to enter the wrong code path and then not do the calculation sparsely 🤦‍♂️

/usr/bin/time -v dials.refine ../i23.expt ../i23.refl log=branch_sv_auto.log
        Command being timed: "dials.refine ../i23.expt ../i23.refl log=branch_sv_auto.log"
	User time (seconds): 1312.60
	System time (seconds): 37.22
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 22:30.87
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 19922788
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 48544572
	Voluntary context switches: 112
	Involuntary context switches: 13444
	Swaps: 0
	File system inputs: 0
	File system outputs: 1423248
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

…scan-varying macrocycle. Enables a big improvement in performance for default dials.refine jobs.
@dagewa (Member Author) commented Mar 10, 2022

🙌 for code review. I had been testing with scan_varying=True set all the time, and did not spot that a default dials.refine job, with a static macrocycle followed by a scan-varying one, did not benefit. Simple fix, and now all is good (I think):

	Command being timed: "dials.refine ../i23.expt ../i23.refl"
	User time (seconds): 449.05
	System time (seconds): 5.50
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 7:35.15
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 8288348
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 6541580
	Voluntary context switches: 149
	Involuntary context switches: 4470
	Swaps: 0
	File system inputs: 0
	File system outputs: 1423312
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
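
For the record, the class of pitfall fixed here can be shown in miniature (purely hypothetical code, not the actual DIALS fix): if the choice of gradient path is made once, up front, a default job whose first macrocycle is static never switches to the sparse path for the scan-varying macrocycle that follows.

def run_job(macrocycles, decide_per_cycle):
    # decided once, up front, from the first macrocycle only
    use_sparse = macrocycles[0] == "scan-varying"
    for cycle in macrocycles:
        if decide_per_cycle:
            use_sparse = cycle == "scan-varying"   # re-decided per macrocycle
        print(f"{cycle}: {'sparse' if use_sparse else 'dense'} gradients")

default_job = ["static", "scan-varying"]
run_job(default_job, decide_per_cycle=False)  # scan-varying cycle stays dense
run_job(default_job, decide_per_cycle=True)   # scan-varying cycle goes sparse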

@graeme-winter (Contributor)

Sounds like a successful review process. I will update and re-run to verify

@graeme-winter (Contributor)

> Sounds like a successful review process. I will update and re-run to verify

Man needs to review dictionary under "S for shortly"

@dagewa (Member Author) commented Mar 15, 2022

"real soon now" 😆

@dagewa (Member Author) commented Mar 22, 2022

@graeme-winter Are you still planning to run the branch against one of your test cases, or is my test sufficient?

@graeme-winter (Contributor)

	Command being timed: "dials.refine ../indexed.expt ../indexed.refl"
	User time (seconds): 1215.07
	System time (seconds): 9.49
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 20:27.06
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 9025736
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 13874259
	Voluntary context switches: 8701
	Involuntary context switches: 1059
	Swaps: 0
	File system inputs: 888
	File system outputs: 8
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

@dagewa (Member Author) commented Mar 23, 2022

Thanks for checking! That looks like more than a fourfold reduction in memory usage (38.2 GB → 9.0 GB) and a wall clock time over five times shorter (1:53:55 → 20:27).

Not sure why the "PR macos python38" check is hanging on amber?

@codecov (bot) commented Mar 24, 2022

Codecov Report

Merging #2022 (f90ca5f) into main (5aefc6a) will decrease coverage by 0.13%.
The diff coverage is 57.50%.

❗ Current head f90ca5f differs from pull request most recent head 974a0e2. Consider uploading reports for the commit 974a0e2 to get more accurate results

@@            Coverage Diff             @@
##             main    #2022      +/-   ##
==========================================
- Coverage   68.58%   68.44%   -0.14%     
==========================================
  Files         649      654       +5     
  Lines       75364    76396    +1032     
  Branches    10793    10914     +121     
==========================================
+ Hits        51686    52290     +604     
- Misses      21665    22066     +401     
- Partials     2013     2040      +27     

@dagewa merged commit 5ffaab2 into main on Mar 25, 2022
@dagewa deleted the sparse-scan-varying branch on March 25, 2022 at 13:45
Successfully merging this pull request may close these issues: Scan-varying refinement with the SparseLevMar engine is unhelpful