Sparse scan-varying refinement #2022

Merged (37 commits into main), Mar 25, 2022
Conversation

@dagewa (Member) commented Mar 3, 2022

This PR enables better use of sparse storage during gradient calculation for scan-varying parameterisations, avoiding the storage of a large number of explicit zero elements. For a 3600-image, 360° test dataset this reduces the overall memory requirement of a dials.refine scan_varying=true run to 93% of that required on the main branch. The saving should improve for larger scans (i.e. multi-turn data collections) as the sparseness of the Jacobian increases. As well as the memory reduction, the total wall clock time for the run is reduced to 77% of that on the main branch (see #1800).

Fixes #1800.
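
As background to the claim above, an illustrative sketch (plain NumPy/SciPy, not DIALS code; the reflection count, sample count and smoothing window below are made-up numbers) of why the scan-varying Jacobian is so sparse: each sample of a scan-varying parameter only has non-zero gradients for reflections recorded on images within its smoothing interval, so each Jacobian column is dominated by structural zeroes that sparse column storage never has to hold.

import numpy as np
from scipy import sparse

n_refl = 100_000                 # one Jacobian row per reflection residual (made up)
n_images = 3600                  # a 3600-image, 360 degree scan
n_samples = 37                   # hypothetical smoother samples for one parameter
window = n_images // (n_samples - 1)

rng = np.random.default_rng(0)
image_of_refl = rng.integers(0, n_images, n_refl)

cols = []
for s in range(n_samples):
    col = np.zeros(n_refl)
    # only reflections on images near this sample's position get a gradient
    near = np.abs(image_of_refl - s * window) < window
    col[near] = rng.normal(size=int(near.sum()))
    cols.append(col)

J_dense = np.column_stack(cols)
J_sparse = sparse.csc_matrix(J_dense)

dense_mb = J_dense.nbytes / 1e6
sparse_mb = (J_sparse.data.nbytes + J_sparse.indices.nbytes + J_sparse.indptr.nbytes) / 1e6
print(f"dense: {dense_mb:.1f} MB, sparse: {sparse_mb:.1f} MB")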

dagewa added 30 commits July 27, 2021 10:38
This is a hangover from before refactoring in 0f6b2e4
Add tests for scalar versions, add tests for mat3 * mat3.
- parts()
- dot(other)
- rotate_around_origin(direction, angle)
This turned out to be quite complicated. The intersection and
intersection_i_seqs methods did not quite do the right thing
because they require sorted input arrays, whereas this has to
preserve the original (possibly unsorted) order. I ended up with
a loop in Python to do what I want, but this should be converted
to C++ for speed (see the selection sketches after this commit list).
Just calculate as required - it's cheap enough
This currently fails in downstream calculations, so keep the dense version
around for now, to remove later once this is fixed.
With changes in _grads_model_loop, the sparse calculations are seen
to give the same results as the dense versions. Changes still need
to be made to _grads_detector_loop.
Use a map to look up values. Still slow and needs to move to C++
This will record the normal matrix at each step in the refinement
history. In addition, leave a stub for recording the Jacobian
structure in the history as well. Also tidy some of the code.
…al Python version

At the moment, including a test of total execution time. The Python version is 2 times
*faster*.
results. Unfortunately this test fails when the inputs have random
order!
version. Use this to find corner cases where the results are not
the same.
The NumPy method returns sorted values, which means that it only
works for selection when the input arrays are themselves sorted.
This is often the case but cannot be guaranteed, so the method
is not suitable for use in SparseFlex.select. Added a test to
demonstrate the failure of this method when the input arrays are
unsorted (illustrated after this commit list).
For reasons I don't understand, using the unordered_map a little
differently makes a dramatic difference to the execution speed.
The C++ version is now the fastest: about 8 times faster than the
pure Python version and 4 times faster than the NumPy version.

Added a test that demonstrates execution speed.
Creating a SparseFlex with zero-length data would leave the data
type unset because of the early exit from `extend`. This change
avoids that bug, but it would be better to always set the data
type on construction (sketched after this commit list).
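
Two of the commits above lend themselves to small illustrations. Both sketches below are plain Python/NumPy with made-up names and data; they are not the DIALS implementation.

First, the selection problem: np.intersect1d returns the common values sorted, so the order of the requested indices is lost, whereas a key-to-value lookup walked in request order preserves it (the "loop in Python" that the commits note was later moved to C++ for speed).

import numpy as np

stored = np.array([7, 2, 9])     # keys of stored elements (unsorted)
requested = np.array([9, 7])     # selection order that must be preserved

common, idx_stored, idx_req = np.intersect1d(stored, requested, return_indices=True)
print(common)        # [7 9] -- sorted, so the requested order [9, 7] is lost

def select_preserving_order(keys, values, requested):
    """Return the stored value for each requested key, in request order."""
    lookup = dict(zip(keys, values))
    return [lookup[k] for k in requested if k in lookup]

print(select_preserving_order([7, 2, 9], ["g7", "g2", "g9"], [9, 7, 100]))
# ['g9', 'g7'] -- request order kept; the absent key 100 is skipped

Second, the zero-length bug pattern from the last commit, as a hypothetical stand-in class (assumed names, not the real SparseFlex):

class SparseContainer:
    """Hypothetical stand-in for SparseFlex, showing the bug pattern only."""

    def __init__(self, dtype=None):
        # setting the element type here, on construction, avoids the bug below
        self._dtype = dtype
        self._data = []

    def extend(self, data):
        if len(data) == 0:
            # early exit: an empty container never learns its element type
            # unless the type was already set on construction
            return
        if self._dtype is None:
            self._dtype = type(data[0])
        self._data.extend(data)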
@dagewa (Member Author) commented Mar 3, 2022

Tagged @graeme-winter for review, and testing with multi-turn datasets

@graeme-winter (Contributor)

Noted, wilco, just processing a ... 10-turn data set right now

@graeme-winter (Contributor)

Right, this is going to be a process...

        Command being timed: "dials.refine ../indexed.expt ../indexed.refl"
        User time (seconds): 6713.07
        System time (seconds): 118.45
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:53:55
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 38208560
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 83111430
        Voluntary context switches: 8595
        Involuntary context switches: 2728
        Swaps: 0
        File system inputs: 0
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

timing for 1st run, on main, in P1

@graeme-winter (Contributor)

Not P1, I23

        Command being timed: "dials.refine ../i23.expt ../i23.refl"
        User time (seconds): 1894.68
        System time (seconds): 56.56
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 32:34.73
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 20295136
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 36415813
        Voluntary context switches: 8592
        Involuntary context switches: 955
        Swaps: 0
        File system inputs: 0
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@graeme-winter (Contributor)

Branch: i23

        Command being timed: "dials.refine ../i23.expt ../i23.refl"
        User time (seconds): 1542.23
        System time (seconds): 28.70
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 26:21.68
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 20295900
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 832
        Minor (reclaiming a frame) page faults: 20870183
        Voluntary context switches: 12435
        Involuntary context switches: 643
        Swaps: 0
        File system inputs: 429744
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@graeme-winter (Contributor)

P1 on the branch

        Command being timed: "dials.refine ../indexed.expt ../indexed.refl"
        User time (seconds): 5928.33
        System time (seconds): 66.51
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:39:59
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 38212152
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 41046323
        Voluntary context switches: 8619
        Involuntary context switches: 2055
        Swaps: 0
        File system inputs: 0
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@dagewa (Member Author) commented Mar 10, 2022

A reduction in wall clock time is good to see, but it is surprising that there is essentially no reduction in memory requirements. With the 360° test dataset I saw a reduction to 93% of the memory requirement: not a large reduction, but measurable. I thought this would improve with more parameters, but maybe the memory is going somewhere else. I'm looking at the 10-turn datasets now...

@dagewa (Member Author) commented Mar 10, 2022

Here's some analysis of the Jacobian structure on this branch for the 10-turn data set, starting with a single-step scan-varying refinement run. First for the I23 case:

dials.refine scan_varying=True max_iterations=1\
  track_jacobian_structure=True history=history_I23.json\
  ../i23.expt ../i23.refl 

Then extract the Jacobian structure and plot:

import json
from matplotlib import pyplot as plt

# Load the refinement history written by track_jacobian_structure=True
# (matches history=history_I23.json in the command above)
with open("history_I23.json") as f:
    history = json.load(f)

# Jacobian structure recorded at the final refinement step
jacobian_structure = history["data"]["jacobian_structure"][-1]
iparam = list(range(len(jacobian_structure)))

# Per-parameter row counts: elements never stored (structural zeroes),
# stored elements with value zero (explicit zeroes), and true non-zeroes
structural_zeroes = [param["structural_zeroes"] for param in jacobian_structure]
all_zeroes = [param["all_zeroes"] for param in jacobian_structure]
all_rows = [param["nrows"] for param in jacobian_structure]
explicit_zeroes = [a - b for a, b in zip(all_zeroes, structural_zeroes)]
non_zeroes = [a - b for a, b in zip(all_rows, all_zeroes)]

# Stacked bar chart of the Jacobian column structure, one bar per parameter
fig, ax = plt.subplots()
ax.bar(iparam, structural_zeroes, label="Structural zeroes", width=1.0)
ax.bar(iparam, explicit_zeroes, label="Explicit zeroes", bottom=structural_zeroes, width=1.0)
ax.bar(iparam, non_zeroes, label="Non-zeroes", bottom=all_zeroes, width=1.0)
ax.legend(loc="lower right")
ax.set_ylabel("Rows")
ax.set_xlabel("Parameters")
ax.set_title("Jacobian structure")

plt.savefig("structure_I23.png", dpi=300)

[Figure structure_I23.png: stacked bars per parameter showing structural zeroes, explicit zeroes and non-zeroes]
This indicates that the Jacobian truly is sparse: almost every row of each column is a structural zero (an element never stored), with no explicit zeroes (stored elements whose value happens to be zero) visible.

@dagewa (Member Author) commented Mar 10, 2022

Likewise for P1
[Figure structure_P1.png: the same Jacobian structure plot for the P1 data set]
In this case there is a very small number of explicit zeroes, which occur where a crystal parameter cannot be determined for particular reflections (e.g. the length of a unit cell vector that lies parallel to the beam at some orientation).

@dagewa (Member Author) commented Mar 10, 2022

This branch does not change the default refinement engine, only the sparseness of the Jacobian. One thing that seemed worth exploring is whether setting engine=SparseLevMar makes a difference, as this would use not only a sparse Jacobian, but a sparse normal matrix as well.
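
For context, a generic sketch of the machinery involved (textbook Levenberg-Marquardt, not the DIALS engines; the matrix sizes and density are arbitrary): the step solves the normal equations (J^T J + lambda I) dx = -J^T r, and the normal matrix J^T J can be far denser than a sparse J, which limits what sparse normal-matrix storage can save.

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(1)
J = sparse.random(2000, 300, density=0.02, format="csr", random_state=1)
r = rng.normal(size=2000)        # residuals
lam = 1e-3                       # Levenberg-Marquardt damping

JtJ = (J.T @ J).tocsr()          # normal matrix: much denser than J itself
A = (JtJ + lam * sparse.identity(300)).tocsc()
step = spsolve(A, -(J.T @ r))

print(f"J density:   {J.nnz / np.prod(J.shape):.3f}")
print(f"JtJ density: {JtJ.nnz / np.prod(JtJ.shape):.3f}")
print(f"step norm:   {np.linalg.norm(step):.3f}")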

engine=LevMar (default)

Command being timed: "dials.refine scan_varying=True track_jacobian_structure=True ../i23.expt ../i23.refl max_iterations=1"
	User time (seconds): 115.05
	System time (seconds): 3.99
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:59.51
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 6948076
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 4550607
	Voluntary context switches: 66
	Involuntary context switches: 628
	Swaps: 0
	File system inputs: 2008
	File system outputs: 1423256
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

engine=SparseLevMar

Command being timed: "dials.refine scan_varying=True track_jacobian_structure=True ../i23.expt ../i23.refl max_iterations=1 engine=SparseLevMar"
	User time (seconds): 110.34
	System time (seconds): 6.96
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:57.84
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 10517444
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 8899464
	Voluntary context switches: 47
	Involuntary context switches: 903
	Swaps: 0
	File system inputs: 0
	File system outputs: 1423272
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

No, that didn't help! Peak memory requirements for SparseLevMar are 1.5 times higher than for LevMar (for this job).

@dagewa (Member Author) commented Mar 10, 2022

The purpose of this branch is to properly account for sparseness during gradient calculations for scan-varying refinement. The plots indicate that has been achieved. However, it appears this only translates to a modest improvement in overall performance of dials.refine. Further reduction in the memory footprint would have to be explored outside the gradient calculations. I don't think that should occur on this branch though.

@dagewa (Member Author) commented Mar 10, 2022

Ok, something strange is going on here. I tried a full scan-varying run on the branch vs main, and I get a very significant difference in run time and memory requirements: the branch takes about a third of the time and uses about a third of the memory of main, at least according to mprof. I'll follow up with /usr/bin/time -v runs next.

Command: dials.python -m mprof run /home/fcx32934/sw/cctbx/modules/dials/command_line/refine.py ../i23.expt ../i23.refl scan_varying=true

[Figure: mprof memory profile, branch]

[Figure: mprof memory profile, main]

@dagewa (Member Author) commented Mar 10, 2022

branch

	Command being timed: "dials.refine ../i23.expt ../i23.refl scan_varying=true log=branch.log"
	User time (seconds): 356.22
	System time (seconds): 4.47
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 6:01.25
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 6992504
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 4970387
	Voluntary context switches: 201
	Involuntary context switches: 7777
	Swaps: 0
	File system inputs: 64
	File system outputs: 1423488
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

main

	Command being timed: "dials.refine ../i23.expt ../i23.refl scan_varying=true log=main.log"
	User time (seconds): 1486.89
	System time (seconds): 74.38
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 26:02.26
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 18652552
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 105010952
	Voluntary context switches: 293
	Involuntary context switches: 8726
	Swaps: 0
	File system inputs: 152
	File system outputs: 1423208
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

The branch is more than four times faster and uses almost a third of the memory.

One difference from @graeme-winter's runs is that I set scan_varying=True rather than leaving it as the default Auto. Perhaps running the static macrocycle first sends the subsequent scan-varying run down the wrong code path, so it misses the benefit of the branch. I'll try that next.

@dagewa (Member Author) commented Mar 10, 2022

Re correctness, output logs from the two runs are essentially identical:
[Screenshot from 2022-03-10: the two refinement logs side by side]

@dagewa (Member Author) commented Mar 10, 2022

Yep, something stupid is happening when scan_varying=auto. This appears to enter the wrong code path and then not do the calculation sparsely 🤦‍♂️

/usr/bin/time -v dials.refine ../i23.expt ../i23.refl log=branch_sv_auto.log
        Command being timed: "dials.refine ../i23.expt ../i23.refl log=branch_sv_auto.log"
	User time (seconds): 1312.60
	System time (seconds): 37.22
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 22:30.87
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 19922788
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 48544572
	Voluntary context switches: 112
	Involuntary context switches: 13444
	Swaps: 0
	File system inputs: 0
	File system outputs: 1423248
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

…scan-varying macrocycle. Enables a big improvement in performance for default dials.refine jobs.
@dagewa (Member Author) commented Mar 10, 2022

🙌 for code review. I had been testing with scan_varying=True set all the time, and did not spot that a default dials.refine job, with a static macrocycle followed by a scan-varying one, did not benefit. Simple fix, and now all is good (I think):

	Command being timed: "dials.refine ../i23.expt ../i23.refl"
	User time (seconds): 449.05
	System time (seconds): 5.50
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 7:35.15
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 8288348
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 6541580
	Voluntary context switches: 149
	Involuntary context switches: 4470
	Swaps: 0
	File system inputs: 0
	File system outputs: 1423312
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
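
For the record, the class of pitfall fixed here can be shown in miniature (purely hypothetical code, not the actual DIALS fix): if the choice of gradient path is made once, up front, a default job whose first macrocycle is static never switches to the sparse path for the scan-varying macrocycle that follows.

def run_job(macrocycles, decide_per_cycle):
    # decided once, up front, from the first macrocycle only
    use_sparse = macrocycles[0] == "scan-varying"
    for cycle in macrocycles:
        if decide_per_cycle:
            use_sparse = cycle == "scan-varying"   # re-decided per macrocycle
        print(f"{cycle}: {'sparse' if use_sparse else 'dense'} gradients")

default_job = ["static", "scan-varying"]
run_job(default_job, decide_per_cycle=False)  # scan-varying cycle stays dense
run_job(default_job, decide_per_cycle=True)   # scan-varying cycle goes sparse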

@graeme-winter (Contributor)

Sounds like a successful review process. I will update and re-run to verify

@graeme-winter (Contributor)

> Sounds like a successful review process. I will update and re-run to verify

Man needs to review dictionary under "S for shortly"

@dagewa (Member Author) commented Mar 15, 2022

"real soon now" 😆

@dagewa (Member Author) commented Mar 22, 2022

@graeme-winter Are you still planning to run the branch against one of your test cases, or is my test sufficient?

@graeme-winter (Contributor)

	Command being timed: "dials.refine ../indexed.expt ../indexed.refl"
	User time (seconds): 1215.07
	System time (seconds): 9.49
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 20:27.06
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 9025736
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 13874259
	Voluntary context switches: 8701
	Involuntary context switches: 1059
	Swaps: 0
	File system inputs: 888
	File system outputs: 8
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

@dagewa (Member Author) commented Mar 23, 2022

Thanks for checking! That looks like more than a fourfold reduction in memory usage (38.2 GB → 9.0 GB) and a wall clock time over five times shorter (1:53:55 → 20:27).

Not sure why the "PR macos python38" check is hanging on amber?

@codecov (bot) commented Mar 24, 2022

Codecov Report

Merging #2022 (f90ca5f) into main (5aefc6a) will decrease coverage by 0.13%.
The diff coverage is 57.50%.

❗ Current head f90ca5f differs from pull request most recent head 974a0e2. Consider uploading reports for the commit 974a0e2 to get more accurate results

@@            Coverage Diff             @@
##             main    #2022      +/-   ##
==========================================
- Coverage   68.58%   68.44%   -0.14%     
==========================================
  Files         649      654       +5     
  Lines       75364    76396    +1032     
  Branches    10793    10914     +121     
==========================================
+ Hits        51686    52290     +604     
- Misses      21665    22066     +401     
- Partials     2013     2040      +27     

@dagewa merged commit 5ffaab2 into main on Mar 25, 2022
@dagewa deleted the sparse-scan-varying branch on March 25, 2022 at 13:45
Successfully merging this pull request may close these issues: Scan-varying refinement with the SparseLevMar engine is unhelpful