
Improve spotfinding #758

Merged: 12 commits (improve_spotfinding into master), Jun 7, 2019

Conversation

@jmp1985 (Contributor) commented Apr 17, 2019

This pull request implements an extended dispersion threshold algorithm for spot finding, to try to address #752.

The algorithm does the following steps:

  1. Compute the dispersion image
  2. Compute the dispersion mask as normal
  3. Erode the dispersion mask by half the kernel size
  4. Expand the kernel slightly
  5. Compute the mean pixel value excluding the dispersion mask
  6. Compute the strong mask as normal

The rationale for the algorithm is that the dispersion mask is computed by testing the null hypothesis that all the pixels in the local area are drawn from a Poisson distribution. If this hypothesis is rejected then the pixel is marked as True in the dispersion mask.
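The Poisson test above can be sketched in a few lines of NumPy. This is a simplified, hedged stand-in for the DIALS C++ implementation: the function name `dispersion_mask`, the sample-variance estimator, and the exact threshold formula are illustrative assumptions, not the production code.

```python
import numpy as np

def dispersion_mask(image, kernel=3, n_sigma=6.0):
    """Flag pixels whose local index of dispersion (variance / mean)
    is inconsistent with the Poisson null hypothesis (var ~= mean)."""
    # All (kernel x kernel) windows around each interior pixel.
    win = np.lib.stride_tricks.sliding_window_view(image, (kernel, kernel))
    local_mean = win.mean(axis=(-1, -2))
    local_var = win.var(axis=(-1, -2), ddof=1)
    n = kernel * kernel
    # Under the Poisson null, var/mean is ~1 with a standard deviation
    # of roughly sqrt(2 / (n - 1)); reject beyond n_sigma of that.
    threshold = 1.0 + n_sigma * np.sqrt(2.0 / (n - 1))
    with np.errstate(divide="ignore", invalid="ignore"):
        disp = np.where(local_mean > 0, local_var / local_mean, 0.0)
    mask = np.zeros(image.shape, dtype=bool)
    pad = kernel // 2
    mask[pad:pad + disp.shape[0], pad:pad + disp.shape[1]] = disp > threshold
    return mask
```

A pixel sitting on a flat background passes the test; any pixel whose local window contains a strong spike fails it and gets marked, which is exactly why the mask spreads out by half the kernel size around each spot.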

If a strong pixel is at the edge of the local area then the pixel at the centre of the box is marked True. This means that the dispersion mask tends to contain the strong pixels plus a band of pixels half the kernel size wide around them. Therefore, for true strong spots we can safely erode the dispersion mask by half the kernel size. This results in a mask which fits the spot almost exactly.
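The erosion step can be illustrated with a small NumPy sketch. `erode` is a hypothetical helper using a brute-force square structuring element; the real implementation computes a distance metric in two passes instead, as noted in the pass counts below.

```python
import numpy as np

def erode(mask, radius):
    """Binary erosion with a (2*radius+1) x (2*radius+1) square
    structuring element: a pixel survives only if every pixel in
    its neighbourhood was set in the input mask."""
    size = 2 * radius + 1
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    win = np.lib.stride_tricks.sliding_window_view(padded, (size, size))
    return win.all(axis=(-1, -2))
```

For a 3x3 kernel, radius = 3 // 2 = 1, so the band of masked background pixels around each spot is peeled away while the spot core survives.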

Finally, we want to select only the strong pixels (sometimes the eroded mask still incorporates weak pixels). The problem we had before is that we were using a mean calculated including the strong pixels; this was inevitable if we wanted to do the whole thing with a single summed area table. However, we can recompute the mean excluding the dispersion-masked pixels, using a slightly expanded kernel to ensure we have plenty of pixels to choose from. This actually gives a pretty good estimate of the background, from which we can then threshold the strong pixels as done previously.
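The masked mean can be computed with two summed area tables: one over the pixel values with flagged pixels zeroed, and one over the per-pixel keep counts. Dividing the two window sums gives the local background mean with flagged pixels excluded. This is a hedged sketch: `masked_local_mean`, its signature, and the edge clamping are assumptions made for illustration, not the DIALS API.

```python
import numpy as np

def masked_local_mean(image, exclude, kernel=5):
    """Local mean of `image` over a kernel x kernel window,
    ignoring pixels flagged True in `exclude`."""
    keep = (~exclude).astype(np.float64)
    vals = image * keep  # flagged pixels contribute zero

    def sat(a):
        # Summed area table with a zero border: s[i, j] = sum(a[:i, :j]).
        p = np.pad(a, ((1, 0), (1, 0)))
        return p.cumsum(axis=0).cumsum(axis=1)

    s_vals, s_keep = sat(vals), sat(keep)
    r = kernel // 2
    h, w = image.shape
    # Window bounds for every pixel, clamped at the image edges.
    y0 = np.clip(np.arange(h) - r, 0, h)
    y1 = np.clip(np.arange(h) + r + 1, 0, h)
    x0 = np.clip(np.arange(w) - r, 0, w)
    x1 = np.clip(np.arange(w) + r + 1, 0, w)

    def window_sum(s):
        # Standard four-corner summed-area-table lookup.
        return s[y1][:, x1] - s[y0][:, x1] - s[y1][:, x0] + s[y0][:, x0]

    count = window_sum(s_keep)
    total = window_sum(s_vals)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(count > 0, total / count, 0.0)
```

Each table is built in one pass over the image, which is why excluding the masked pixels only costs one extra summed area table rather than a per-pixel rescan.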

The end result is that we get much better segmentation of the strong spots. However, the algorithm is a bit slower (currently a lot slower, since I have only implemented the debug version; once we are happy that the algorithm is an improvement I will implement a more optimised version). Even once optimised, I think it will still require more passes over the image data:

  1. One pass to compute the initial summed area table.
  2. One pass to compute the dispersion mask.
  3. Two passes to compute the distance metric for the erosion.
  4. One pass to compute the eroded dispersion mask.
  5. One pass to compute the second summed area table.
  6. One pass to compute the final mask.

However, if slower still means better then this may be acceptable.

Here are some examples of the output:

The raw data from an I23 image:
[image: data_1_0001_image_close]

The threshold from the same image:
[image: data_1_0001_threshold_close]

The spot finding results with strong pixels marked:
[image: data_1_0001_pixels]

I have also modified the largest spot size to be 1000 pixels (10x10x10) since this will tend to find larger spots.

Another nice side effect is that we could now get pretty decent background estimates from the spot finding at little extra cost, which might also make the change desirable.

Next steps are to test thoroughly, particularly on the data that @rjgildea has been looking at. Then, once we are satisfied, we need to optimise the code!

@jmp1985 (Contributor, Author) commented Apr 17, 2019

One thing to note is that I found a kernel size of 3x3 actually performed well in the majority of cases; however, one CCD dataset I had with really massive spots did better with a 5x5 kernel. I don't think a kernel any larger than that should be required in most cases.

@ndevenish (Member):

> However, if slower still means better then this may be acceptable.

It's my understanding that we've explicitly rejected improving the spotfinding in the past because speed was a much higher priority. I'm assuming this will sit as an alternate method, but do you have any quantification of what the actual improvements are?

- Do the dispersion threshold
- Erode by half the kernel size
- Expand the kernel slightly
- Compute the local mean excluding dispersion-thresholded (i.e. non-background) pixels
- Compute the strong pixel mask

@rjgildea (Contributor):

Anecdotally the spot-finding results look a lot better, e.g. kernel_size=3,3 for image 144 of the i23 germanate dataset:
Before:
[image]
After:
[image]
Comparison of merging statistics with the old algorithm (b=before) and the new algorithm (a=after):
[charts: i_over_sigma_mean, r_pim]
Possibly still some improvements to be made by increasing the kernel size slightly, but the changes are a lot smaller than before.

@rjgildea (Contributor):

For the i03_thaum_eiger dataset there seems to be a lot less dependence of the integration results on kernel size:
[charts: i_over_sigma_mean, r_pim]

@rjgildea (Contributor):

Similarly for I04_Thaum_eiger:
[charts: i_over_sigma_mean, r_pim]

@rjgildea (Contributor):

I04 thaumatin pilatus (i04_bag_training):
[charts: i_over_sigma_mean, r_pim]

@rjgildea (Contributor):

VMXi thaumatin:
[charts: i_over_sigma_mean, r_pim]

@jmp1985 (Contributor, Author) commented Apr 24, 2019

@rjgildea Thanks for doing these tests. The results look positive.

rjgildea referenced this pull request, Apr 25, 2019:
Fixes the case of looking at old pickles which do not have the n_signal column; it was looked for when setting the defaults and failed.
@jmp1985 (Contributor, Author) commented Apr 26, 2019

I've now added a faster implementation and done a crude benchmark using the bag training data.

The old spot finding algorithm is obviously faster but extracts fewer spots, mainly due to filtering out small spots of fewer than 3 pixels. Since the spots extracted by the new algorithm are generally larger, this is less of a problem. However, 4 spots were filtered out for having more than 1000 pixels. The new algorithm takes approximately 1.6 times as long to run but, as can be seen above, gives much better results.

Old:
Saved 116082 reflections to strong.mpack
Time taken: 122.089928 seconds

New:
Saved 126782 reflections to strong.mpack
Time taken: 198.506274 seconds

@jmp1985 (Contributor, Author) commented Apr 26, 2019

The average of ten runs:

Old:
Time taken: 124.1504409 seconds

New:
Time taken: 175.5127746 seconds

@graeme-winter (Contributor):

Fails for me with Eiger data in xia2:

Grey-Area with :) $ cat DEFAULT/NATIVE/SWEEP1/index/2_dials.find_spots.log 
DIALS (2018) Acta Cryst. D74, 85-97. https://doi.org/10.1107/S2059798317017235
DIALS 2.dev.258-g67c4870f5
The following parameters have been modified:

output {
  reflections = "2_SWEEP1_strong.pickle"
  experiments = "2_SWEEP1_experiments.json"
}
spotfinder {
  write_hot_mask = True
  hot_mask_prefix = "2_hot_mask"
  scan_range = 1 488
  filter {
    min_spot_size = 3
  }
  mp {
    nproc = 8
  }
  threshold {
    dispersion {
      min_local = 0
    }
  }
}
input {
  experiments = /Users/graeme/Projects/chef-dI/improved-spotfinding/with/DEFAULT/NATIVE/SWEEP1/index/1_SWEEP1_experiments.json
}

Configuring spot finder from input parameters
--------------------------------------------------------------------------------
Finding strong spots in imageset 0
--------------------------------------------------------------------------------


Finding spots in image 1 to 488...
Setting chunksize=61
Extracting strong pixels from images
 Using multiprocessing with 8 parallel job(s)



easy_mp crash detected; subprocess trace: ----
Traceback (most recent call last):
  File "/Users/graeme/svn/cctbx/modules/cctbx_project/libtbx/scheduling/job_scheduler.py", line 64, in job_cycle
    value = target( *args, **kwargs )
  File "/Users/graeme/svn/cctbx/modules/dials/util/mp.py", line 191, in __call__
    result.append(self.func(i))
  File "/Users/graeme/svn/cctbx/modules/dials/algorithms/spot_finding/finder.py", line 285, in __call__
    result = self.function(task)
  File "/Users/graeme/svn/cctbx/modules/dials/algorithms/spot_finding/finder.py", line 147, in __call__
    threshold_mask = self.threshold_function.compute_threshold(im, mk)
  File "/Users/graeme/svn/cctbx/modules/dials/extensions/dispersion_extended_spotfinder_threshold_ext.py", line 66, in compute_threshold
    return self._algorithm(image, mask)
  File "/Users/graeme/svn/cctbx/modules/dials/algorithms/spot_finding/threshold.py", line 194, in __call__
    algorithm = threshold.DispersionExtendedThreshold(
'module' object has no attribute 'DispersionExtendedThreshold'
----------------------------------------------


Traceback (most recent call last):
  File "/Users/graeme/svn/cctbx/build/../modules/dials/command_line/find_spots.py", line 224, in <module>
    halraiser(e)
  File "/Users/graeme/svn/cctbx/build/../modules/dials/command_line/find_spots.py", line 222, in <module>
    script.run()
  File "/Users/graeme/svn/cctbx/build/../modules/dials/command_line/find_spots.py", line 153, in run
    reflections = flex.reflection_table.from_observations(experiments, params)
  File "/Users/graeme/svn/cctbx/modules/dials/array_family/flex.py", line 208, in from_observations
    return find_spots(experiments)
  File "/Users/graeme/svn/cctbx/modules/dials/algorithms/spot_finding/finder.py", line 770, in __call__
    table, hot_mask = self._find_spots_in_imageset(imageset)
  File "/Users/graeme/svn/cctbx/modules/dials/algorithms/spot_finding/finder.py", line 867, in _find_spots_in_imageset
    r, h = extract_spots(imageset[j0:j1])
  File "/Users/graeme/svn/cctbx/modules/dials/algorithms/spot_finding/finder.py", line 478, in __call__
    return self._find_spots(imageset)
  File "/Users/graeme/svn/cctbx/modules/dials/algorithms/spot_finding/finder.py", line 582, in _find_spots
    callback=process_output,
  File "/Users/graeme/svn/cctbx/modules/dials/util/mp.py", line 272, in batch_multi_node_parallel_map
    preserve_exception_message=True,
  File "/Users/graeme/svn/cctbx/modules/dials/util/mp.py", line 172, in multi_node_parallel_map
    preserve_exception_message=preserve_exception_message,
  File "/Users/graeme/svn/cctbx/modules/dials/util/mp.py", line 45, in parallel_map
    preserve_exception_message=preserve_exception_message,
  File "/Users/graeme/svn/cctbx/modules/cctbx_project/libtbx/easy_mp.py", line 628, in parallel_map
    result = res()
  File "/Users/graeme/svn/cctbx/modules/cctbx_project/libtbx/scheduling/result.py", line 119, in __call__
    self.traceback( exception = self.exception() )
  File "/Users/graeme/svn/cctbx/modules/cctbx_project/libtbx/scheduling/stacktrace.py", line 117, in __call__
    self.raise_handler( exception = exception )
  File "/Users/graeme/svn/cctbx/modules/cctbx_project/libtbx/scheduling/stacktrace.py", line 134, in raise_with_traceback
    raise_(type(exception), exception, self.traceback)
  File "/Users/graeme/svn/cctbx/modules/cctbx_project/libtbx/scheduling/mainthread.py", line 100, in poll
    value = target( *args, **kwargs )
  File "/Users/graeme/svn/cctbx/modules/dials/util/mp.py", line 88, in __call__
    preserve_exception_message=self.preserve_exception_message,
  File "/Users/graeme/svn/cctbx/modules/cctbx_project/libtbx/easy_mp.py", line 628, in parallel_map
    result = res()
  File "/Users/graeme/svn/cctbx/modules/cctbx_project/libtbx/scheduling/result.py", line 119, in __call__
    self.traceback( exception = self.exception() )
  File "/Users/graeme/svn/cctbx/modules/cctbx_project/libtbx/scheduling/stacktrace.py", line 117, in __call__
    self.raise_handler( exception = exception )
  File "/Users/graeme/svn/cctbx/modules/cctbx_project/libtbx/scheduling/stacktrace.py", line 147, in raise_with_stacktrace
    self.stacktrace( exception = exception )
  File "/Users/graeme/svn/cctbx/modules/cctbx_project/libtbx/scheduling/stacktrace.py", line 88, in __call__
    raise exception
AttributeError: Please report this error to dials-support@lists.sourceforge.net: 'module' object has no attribute 'DispersionExtendedThreshold'

@graeme-winter (Contributor):

@rjgildea asks "did you type Make"

No, I did not.

@rjgildea (Contributor) commented May 24, 2019

FYI: for testing both algorithms:

$ dials.find_spots -c -e2 | grep algorithm
    algorithm = dispersion *dispersion_extended helen single

@graeme-winter (Contributor):

On the understanding that we will add a background estimate in the future, I agree this is a good idea.

branch started in pre-precommitbx era
@@ -61,7 +61,7 @@ def generate_phil_scope():
         "to be accepted by the filtering algorithm."
         .type = int(value_min=1)

-max_spot_size = 100
+max_spot_size = 1000
Member:

Slightly worried that this might have a negative effect on the standard spotfinding?

Contributor:

max_spot_size=100 is overly conservative, especially in the context of smaller eiger pixels (remember this is the number of spot pixels in 3 dimensions).

Member:

And I suppose rings picked up will either be noise or eliminated by the other filters.

@ndevenish (Member) commented Jun 5, 2019

Do we really want to move to this as default immediately?

Otherwise, I see no reason not to merge this now; it mostly just inserts new things without affecting the existing structure.

@graeme-winter (Contributor):

@ndevenish it makes spot finding slower but better; that said, I am happy to see this merged (size change and all).

re: sizes, could we have 100 for 2D spots and 1000 for 3D?

@Anthchirp Anthchirp merged commit 137eff7 into master Jun 7, 2019
@Anthchirp Anthchirp deleted the improve_spotfinding branch June 7, 2019 11:02