New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redo the ImageSetData class to use less memory #438
Conversation
Note, as part of working on this changeset I created a combined experiment list with 10 experiments drawn from a single SwissFEL data file with over a thousand images. Normally, the combined experiment list would have 10 imagesets in it and be very slow to load. Compared to the single image case with takes 20 seconds, this takes 2 minutes to load in the image viewer. I then made a version with 10 experiments and a single imageset, where that imageset had single_file_indices set to an array of 10 (disjoint) image indices. Indeed, it only took 20 seconds to load. It's definitely where we need to go for processing stills. |
Alternative version on branch |
Looking into these test failures. Here's a simple snippet. (must run
@phyy-nx, just checking, can you confirm this is supposed to still work? |
This change alters ImageSetData to keep its models in an array of len(indices) instead of len(max(indices)). Concurrently, whenever the imageset dereferences a model, it needs to not dereference the indices array. Retain dereferencing the indices array when reading the data.
Needed to make sliced image sets when the list of models held is only as long as the number of indices.
Codecov Report
@@ Coverage Diff @@
## main #438 +/- ##
==========================================
+ Coverage 66.20% 66.22% +0.01%
==========================================
Files 185 185
Lines 16792 16813 +21
Branches 2375 2382 +7
==========================================
+ Hits 11117 11134 +17
- Misses 5110 5111 +1
- Partials 565 568 +3 |
I've done some testing of memory usage by loading one of three expt files which contain 1, ~25 and ~400 single-image imagesets and comparing heap usage before/after. Main branch: This branch: On the main branch, for each imageset, four large-ish chunks are allocated on the heap. It looks like this:
Reading down the heap addresses in order, the sizes of the 4 large objects grow up to ~100 kB, then reset to a few kB, then grow again. I can't see specifically what is there, but it becomes quite a lot. This is no longer observed with this PR. Would welcome any other suggestions for assessing the memory usage, but I hope it's clear enough that the improvement here is pretty big. dxtbx and xfel_regression tests all pass. @graeme-winter, any suggestion for who could review this? |
The four large objects are these ones allocated in the ImageSetData constructor, I believe @phyy-nx already appreciated this through an unusual technique of "understanding the code" :)
|
Note that the large chunks allocated on the heap are each 16 bytes x the length of the Reader that was used to construct the ImageSetData. Thus a chunk of size 78592 corresponds to an |
This was agreed to be merged at the 2021-12-09 DIALS core meeting. However, I'm reluctant to merge this in right before a release, so shall merge this once 3.8 is out. |
Since this is quite a significant change, I realised @dwpaley that you weren't in the AUTHORS list in dxtbx, so I added it. Please let me know if this isn't okay. Otherwise, I intend to merge this tomorrow morning. |
Not a problem of course :) Thanks! |
When image files are numbered from 0, the resulting ImageSequence has batch_offset -1. This situation was handled incorrectly after #438; this fixes the bug and adds a test. Closes dials/dials#2011.
Two specific fixes: - In _determine_max_memory_needed, check if the scan is a still, in which case its bbox z coordinates should be 0, 1 - In dials.import, in the manual geometry updater, fix convert_sequences_to_stills. This was likely fallout from cctbx/dxtbx#438, where the imagset indices were redone to use less memory. Fixes #2127
* Fix raster scans for dials.stills_process Two specific fixes: - In _determine_max_memory_needed, check if the scan is a still, in which case its bbox z coordinates should be 0, 1 - In dials.import, in the manual geometry updater, fix convert_sequences_to_stills. This was likely fallout from cctbx/dxtbx#438, where the imageset indices were redone to use less memory. Fixes #2127 Co-authored-by: James Beilsten-Edmands <jbeilstenedmands@gmail.com>
This change alters ImageSetData to keep its models in an array of len(indices) instead of len(max(indices)). Concurrently, whenever the imageset dereferences a model, it needs to not dereference the indices array. Retain dereferencing the indices array when reading the data.
Mostly untested. Needs review and likely further work. Work in progress.
Closes #437