Defining image groupings for serial data #2251

Merged (28 commits) on Feb 20, 2023

jbeilstenedmands
Contributor

@jbeilstenedmands jbeilstenedmands commented Oct 18, 2022

This module implements ways to define groupings of still images by arbitrary metadata, and methods that use those groupings to split experiment and reflection files.

As an example use case, consider an integrated.expt, integrated.refl file pair containing data from a dataset with the image template example_####.cbf, where the experiment is a dose series on a grid of crystals and each crystal receives 10 exposures in series.
The code works by parsing a yaml file of the following example format:

metadata:
  dose_point:
    "/path/to/example_####.cbf" : "repeat=10"
grouping:
  merge_by:
    values:
      - dose_point

The following code can use such a yaml file either to split the data into 10 .expt, .refl file pairs, one per dose point, or to write a group_id column into the data files.

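from pathlib import Path

# Import path assumed here; the PR description does not state the module location.
from dials.util.image_grouping import FilePair, ParsedYAML, get_grouping_handler

datafiles = [FilePair(Path("integrated.expt"), Path("integrated.refl"))]
parsed = ParsedYAML("groupby.yaml")
handler = get_grouping_handler(parsed, "merge_by")
handler.split_files_to_groups(Path.cwd(), datafiles)    # create a new datafile for each group
handler.write_groupids_into_files(datafiles)      # write a group_id column into integrated.refl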
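
Conceptually, both operations start from one group assignment per image. The sketch below is not the DIALS implementation: it uses plain Python lists and a hypothetical helper, `split_by_group`, to illustrate how a per-image group id can be used to partition images, assuming 20 images from 2 crystals with 10 exposures each.

```python
from collections import defaultdict


def split_by_group(image_numbers, group_ids):
    """Partition image numbers by group id (hypothetical helper, not a DIALS API)."""
    if len(image_numbers) != len(group_ids):
        raise ValueError("need exactly one group id per image")
    groups = defaultdict(list)
    for image, gid in zip(image_numbers, group_ids):
        groups[gid].append(image)
    return dict(groups)


# Assumed layout: images numbered 1..20, dose point cycling 0..9 for each crystal.
images = list(range(1, 21))
group_ids = [(i - 1) % 10 for i in images]  # analogous to a group_id column
groups = split_by_group(images, group_ids)  # analogous to one output file per group
```

Here `groups` has 10 entries, and each entry holds the images recorded at the same dose point across the two crystals.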

The fully generalised example is for h5 image files, where one can link to an arbitrary metadata array. Furthermore, data can be grouped by multiple metadata items, with tolerances specified to allow grouping of continuous metadata values such as wavelength:

metadata:
  timepoint:
    "/path/to/example_master.h5" : "/path/to/example_master.h5:/timepoint"  # metadata contained in image file
    "/path/to/example_2_master.h5" : "/path/to/meta_2.h5:/timepoint"          # metadata in separate file
  wavelength:
    "/path/to/example_master.h5" : 0.4                                      # metadata is a shared value for every image
    "/path/to/example_2_master.h5" : 0.6
  crystal_id:
    "/path/to/example_master.h5" : "/path/to/meta.h5:/crystal"
    "/path/to/example_2_master.h5" : "/path/to/meta_2.h5:/crystal"
grouping:
  merge_by:                  # define a grouping for a particular process
    values:                  # the values are keys for the metadata
      - timepoint
      - wavelength
    tolerances:
      - 0
      - 0.01
  index_by:                  # define a grouping for a different process
    values:
      - crystal_id

Finally, two special options are defined that can be helpful for templated data: "repeat={n}" and "block={f}:{l}:{n}". The first indicates that the metadata repeats on a cycle of $n$ images, so that every $n$-th image is equivalent (e.g. timepoints in a dose series); the second indicates that images are equivalent in blocks of $n$ images (e.g. to define the physical crystals in a dose series).
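
The difference between the two options can be sketched with simple index arithmetic. This is an illustration rather than the parsing code from this PR: it assumes 0-based image indices, and it does not model the `{f}` and `{l}` fields of the block option. The function names are hypothetical.

```python
def repeat_groups(n_images, n):
    """Group ids for a repeat cycle: images i, i + n, i + 2n, ... share a group."""
    return [i % n for i in range(n_images)]


def block_groups(n_images, n):
    """Group ids for blocks: each run of n consecutive images shares a group."""
    return [i // n for i in range(n_images)]


# Assumed toy dataset: 2 crystals with 3 exposures each.
dose_point = repeat_groups(6, 3)
crystal = block_groups(6, 3)
```

For this toy dataset, `dose_point` identifies the position within each crystal's exposure series, while `crystal` identifies which crystal each image came from.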

This is not necessarily something that would be exposed directly to general users of the programs; rather, it defines a backend structure for defining and using arbitrary groupings in a consistent manner.

@jbeilstenedmands jbeilstenedmands marked this pull request as ready for review November 3, 2022 14:41
@rjgildea rjgildea self-requested a review November 3, 2022 14:42
@jbeilstenedmands jbeilstenedmands merged commit 53f64c3 into dials:main Feb 20, 2023