Remove unneeded slicing and imageset creation in spotfinding, indexing, and integration #1705

phyy-nx · 2021-05-14T19:32:23Z

…g, and integration Fixes #1704

rjgildea · 2021-05-14T21:05:20Z

algorithms/indexing/indexer.py

+                    if isinstance(experiment.imageset, ImageSequence):
+                        experiment.imageset = ImageSet(
+                            experiment.imageset.data(), experiment.imageset.indices()
+                        )


It isn't clear to me why indexing is even creating imagesets in the first place... it appears to only be doing this in the stills indexer, so presumably you might have a better idea why this is the case?

Right, so in dials.stills_process, imageset_from_anyset is called on all images right at the start. So by the time the indexing is ran, there's no way to have an ImageSequence.

However, if dials.index is ran, with indexing.stills.indexer=stills, then it's possible an ImageSequence makes it this far, even though the user wishes it to be treated as a still. Therefore, the ImageSequence is reset as an ImageSet, and things like the StillsReflectionPredictor are properly picked.

In the dials.stills_process case, before this changeset, ImageSets were always recreated, which invalidated the cache and caused a data read later on. With this changeset, they are never recreated, because of the imageset_from_anyset call.

A data read wouldn't be triggered in the dials.index case because the data is never read in that program, so the cache doesn't matter.

I see. It seems a bit odd that refinement should care at all about the imageset - it should only depend on the models and input parameters. Perhaps a better (at least longer term) solution would be for refinement not to depend on the imageset type when choosing which predictor to use, then constructing a new imageset in indexing shouldn't be necessary at all.

Totally agree. My googling indicates that picking the StillsReflectionPredictor is the only place where ImageSequence is tested for explicitly, and I think it should instead be choosing the predictor based on if the scan is a set of stills:

Instead of

if isinstance(experiment.imageset, ImageSequence):

It should be

if experiment.scan and not experiment.scan.is_still():

Like it is elsewhere.

But that seemed like a bigger change than this fix warranted? Maybe not? If folks like that better I can add it to this PR.

TBH I think this is not a question of "like or dislike" - I think very shortly this will become necessary for sane processing of grid scan data beyond simply spot finding. As such, I'd probably suggest we put together a second PR where we actively eliminate

if isinstance(experiment.imageset, ImageSequence):

type checks from the code base - this would be work which was started back in the load before heat death days but never (yet) completed.

On a practical note @phyy-nx if you could put a comment in there as to why this check is in place would probably help for the medium term.

rjgildea · 2021-05-14T21:07:33Z

algorithms/integration/processor.py

@@ -499,7 +499,8 @@ def __call__(self):

            # I am 99% sure this is implied by all the code above
            assert (frame1 - frame0) <= len(imageset)
-            imageset = imageset[frame0:frame1]
+            if len(imageset) > 1:
+                imageset = imageset[frame0:frame1]


It seems slightly concerning that this apparently results in an extra image read - is this the case for rotation data, or just still imagesets?

I didn't profile dials.integrate, but I don't think it has the same problem. I believe profile fitting is done from the strong shoeboxes, and data reading then only happens during integration itself.

In dials.stills_process, the same imageset is handed from program to program and so it cannot be reset. In the dials.integrate it is invoked fresh and used once. I believe :)

Profile fitting is not just done from the strong shoeboxes - for rotation data they are only used to determine the profile parameters - all reflections are then gathered to generate the reference profiles.

We should probably check whether the rest is called for dials.integrate on rotation data as a separate exercise and raise an issue if so.

rjgildea · 2021-05-14T21:08:07Z

algorithms/spot_finding/finder.py

+            if len(imageset) == 1:
+                r, h = extract_spots(imageset)
+            else:
+                r, h = extract_spots(imageset[j0:j1])


As above - it would probably be instructive to understand why this results in an extra image read.

See above, but in this case it's because of the extra radial_average step we do in the XFEL GUI. That causes a data read, so when spotfinding happens and the imageset is sliced, the cache is lost. In this changeset, the same imageset is used at each step, preventing the cache loss.

phyy-nx · 2021-05-17T20:21:33Z

Thanks for the great questions @rjgildea. Happy to clarify further or even provide the cProfile plots as needed.

graeme-winter · 2021-05-18T05:37:30Z

Questions raised above suggest we should probably have a better collective insight into what is going on 🙂

graeme-winter

There are questions raised in here which are bigger in scope than the PR so seems unfair to ask them to be addressed in here.

Overall, I think the changes are limited in their scope & hopefully we will be revisiting the slightly questionable bits (checking if image sets are sequences) so I am not too worried.

graeme-winter · 2021-05-18T05:38:43Z

algorithms/indexing/indexer.py

+                    if isinstance(experiment.imageset, ImageSequence):
+                        experiment.imageset = ImageSet(
+                            experiment.imageset.data(), experiment.imageset.indices()
+                        )


On a practical note @phyy-nx if you could put a comment in there as to why this check is in place would probably help for the medium term.

graeme-winter · 2021-05-18T05:39:44Z

newsfragments/1705.bugfix

@@ -0,0 +1 @@
+Speed up processing still images by reducing the number of file reads per image


dials.stills_process: improve processing performance by preventing re-reading of image data

?

phyy-nx · 2021-05-18T15:55:36Z

Any other comments? I'll do the other ImageSequence PR after this goes through.

phyy-nx · 2021-05-19T19:26:06Z

Updates:

I was wrong about there only being one instance of checking for ImageSequence. It's done all over the place. Not worth chasing down at the moment or near future, IMO.
I profiled dials.integrate using 540 cbfs (the thaumatin 540 dataset). I ran dials.import, dials.find_spots, dials.index, dials.refine, all using default arguments. I then ran this:

libtbx.python -m cProfile -o integrate_540_nproc1.prof `libtbx.find_in_repositories dials`/command_line/integrate.py refined.* nproc=1

If you run without nproc=1, parallel map isn't profiled in such a way that you can see what's going on with get_raw_data. However, with nproc=1, it's clear that get_raw_data is called 1080 times, twice per image.

Folks interested in a second issue on this?

phyy-nx · 2021-05-19T21:38:37Z

Find spots is fine, 540 calls to get_raw_data:

libtbx.python -m cProfile -o find_spots_540_nproc1.prof `libtbx.find_in_repositories dials`/command_line/find_spots.py imported.expt nproc=1

phyy-nx · 2021-05-19T21:58:54Z

More on dials.integrate. As I mentioned, get_raw_data is called twice per image, but that's because the reflection profile modeler reads each image in turn, and then the integrator reads each image in turn. Since only one image at a time is cached, there's no way to avoid 2 data reads as currently implemented.

Unless the profile modeler is creating shoeboxes that are then discarded, and then recreated. In which case that does seem like a waste. The second read shouldn't be needed.

(for stills it doesn't matter since we don't use the profile modeler)

(moving on from this, FYI :)

graeme-winter · 2021-05-20T04:46:15Z

More on dials.integrate. As I mentioned, get_raw_data is called twice per image, but that's because the reflection profile modeler reads each image in turn, and then the integrator reads each image in turn. Since only one image at a time is cached, there's no way to avoid 2 data reads as currently implemented.

yup:

#1705 (comment)

Unless the profile modeler is creating shoeboxes that are then discarded, and then recreated. In which case that does seem like a waste. The second read shouldn't be needed.

too much memory - already this is very heavy on memory just storing the boxes for the "live" block

(for stills it doesn't matter since we don't use the profile modeler)

(moving on from this, FYI :)

yup, fair

Remove unneeded slicing and imageset creation in spotfinding, indexin…

bdda9c6

…g, and integration Fixes #1704

phyy-nx requested a review from graeme-winter May 14, 2021 19:32

Rename newsfragments/XXX.bugfix to newsfragments/1705.bugfix

ee92b0f

rjgildea reviewed May 14, 2021

View reviewed changes

graeme-winter approved these changes May 18, 2021

View reviewed changes

Update news and add comment

5cee1bd

phyy-nx merged commit a8050e2 into main May 19, 2021

phyy-nx deleted the avoid_slicing branch May 19, 2021 18:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove unneeded slicing and imageset creation in spotfinding, indexing, and integration #1705

Remove unneeded slicing and imageset creation in spotfinding, indexing, and integration #1705

phyy-nx commented May 14, 2021

rjgildea May 14, 2021

phyy-nx May 17, 2021

rjgildea May 17, 2021

phyy-nx May 17, 2021

graeme-winter May 18, 2021

graeme-winter May 18, 2021

rjgildea May 14, 2021

phyy-nx May 17, 2021

graeme-winter May 18, 2021

rjgildea May 14, 2021

phyy-nx May 17, 2021

phyy-nx commented May 17, 2021

graeme-winter commented May 18, 2021

graeme-winter left a comment

graeme-winter May 18, 2021

graeme-winter May 18, 2021

phyy-nx commented May 18, 2021

phyy-nx commented May 19, 2021

phyy-nx commented May 19, 2021

phyy-nx commented May 19, 2021

graeme-winter commented May 20, 2021

		@@ -0,0 +1 @@
		Speed up processing still images by reducing the number of file reads per image

Remove unneeded slicing and imageset creation in spotfinding, indexing, and integration #1705

Remove unneeded slicing and imageset creation in spotfinding, indexing, and integration #1705

Conversation

phyy-nx commented May 14, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

phyy-nx commented May 17, 2021

graeme-winter commented May 18, 2021

graeme-winter left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

phyy-nx commented May 18, 2021

phyy-nx commented May 19, 2021

phyy-nx commented May 19, 2021

phyy-nx commented May 19, 2021

graeme-winter commented May 20, 2021