Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling image templates that don't use zero-padding #646

Closed
dagewa opened this issue Jul 20, 2023 · 15 comments · Fixed by #705
Closed

Handling image templates that don't use zero-padding #646

dagewa opened this issue Jul 20, 2023 · 15 comments · Fixed by #705
Assignees

Comments

@dagewa
Copy link
Member

dagewa commented Jul 20, 2023

          Now that you tested the code on several datasets, I think this is ready to be merged.

Thanks!

Non-essential suggestion:
How about mentioning the file name issue as a comment, or submiting an issue to the DIALS repository, so that someone who processes ROD datasets can find the workaround.

Originally posted by @biochem-fan in #645 (review)

@dagewa
Copy link
Member Author

dagewa commented Jul 20, 2023

Context: some Rigaku datasets (such as these ones: https://xrda.pdbj.org/entry/93) use an image incremental serial number that is not zero-padded.

Can we reasonably detect this automatically and import them properly?

At worst, perhaps we could provide an option at import that allows the template to be non-padded.

@biochem-fan
Copy link
Member

Link to a workaround:

#11 (comment)

@biochem-fan
Copy link
Member

How about using a single # as a placeholder for non-zero-padded digit(s)?
This would be least disruptive.

If you have non-padded images, images_1.cbf, images_2.cbf, images_3.cbf, ...images_300.cbf:

  • images_#.cbf matches all.
  • images_###.cbf does not match.

If you have zero-padded images, images_001.cbf, images_002.cbf, images_003.cbf, ...images_300.cbf:

  • images_#.cbf does not match.
  • images_###.cbf matches all.
  • images_00#.cbf matches the first nine images.

So this does not change the existing behaviour. Am I missing corner cases?

@graeme-winter
Copy link
Collaborator

How about using a single # as a placeholder for non-zero-padded digit(s)? This would be least disruptive.

If you have non-padded images, images_1.cbf, images_2.cbf, images_3.cbf, ...images_300.cbf:

  • images_#.cbf matches all.
  • images_###.cbf does not match.

If you have zero-padded images, images_001.cbf, images_002.cbf, images_003.cbf, ...images_300.cbf:

  • images_#.cbf does not match.
  • images_###.cbf matches all.
  • images_00#.cbf matches the first nine images.

So this does not change the existing behaviour. Am I missing corner cases?

I like this proposal, but I wonder how one would select just the first 9 images in the top example. i.e. where you do want to match just one digit. I would be tempted to bring ? into play, but that is recognised by shell glob so maybe not what we want.

Maybe however my concern does not matter because on balance what you suggest is far more useful than the current behaviour.

The only other thing which occurs to me in discussing this is whether to move away from the current template to something more printf-like which in fairness most people can understand so you would have foo_%d.cbf or foo_%03d.cbf i.e. someone else has already defined the grammar. This would seem like a sensible transition.

@biochem-fan
Copy link
Member

I like this proposal, but I wonder how one would select just the first 9 images in the top example. i.e. where you do want to match just one digit. I would be tempted to bring ? into play, but that is recognised by shell glob so maybe not what we want.

In this case one has to use dials.import template=image_#.cbf scan_range=1,9 or dials.import image_?.cbf. In the latter case, ? is expanded by the shell so DIALS sees only 1 to 9. Thus, it can workout the template and set the scan range correctly.

The only other thing which occurs to me in discussing this is whether to move away from the current template to something more printf-like which in fairness most people can understand so you would have foo_%d.cbf or foo_%03d.cbf i.e. someone else has already defined the grammar. This would seem like a sensible transition.

As a programmer, I like this. But for non-programmers, the printf-syntax might seem daunting.

@graeme-winter
Copy link
Collaborator

I like this proposal, but I wonder how one would select just the first 9 images in the top example. i.e. where you do want to match just one digit. I would be tempted to bring ? into play, but that is recognised by shell glob so maybe not what we want.

In this case one has to use dials.import template=image_#.cbf scan_range=1,9 or dials.import image_?.cbf. In the latter case, ? is expanded by the shell so DIALS sees only 1 to 9. Thus, it can workout the template and set the scan range correctly.

The only other thing which occurs to me in discussing this is whether to move away from the current template to something more printf-like which in fairness most people can understand so you would have foo_%d.cbf or foo_%03d.cbf i.e. someone else has already defined the grammar. This would seem like a sensible transition.

As a programmer, I like this. But for non-programmers, the printf-syntax might seem daunting.

OK, I agree with your viewpoint here on the single # use case. I ask that we consider also adding support for the printf case as well, since I think that is potentially very powerful (but I am happy that we add that as a second issue / piece of work)

@dagewa
Copy link
Member Author

dagewa commented Aug 2, 2023

I like this proposal, but I wonder how one would select just the first 9 images in the top example. i.e. where you do want to match just one digit. I would be tempted to bring ? into play, but that is recognised by shell glob so maybe not what we want.

It's a bit ugly but apparently this is currently possible:

dials.import template=../frames/exp_864_1_#.rodhypix image_range=1,5

So something similar could be used to select any range within the single digit numbers, even if # is made to match all images.

@dagewa
Copy link
Member Author

dagewa commented Aug 2, 2023

Ah I see @biochem-fan already suggested similar, using scan_range rather than image_range (the fact we have multiple options here is also confusing)

@dagewa
Copy link
Member Author

dagewa commented Aug 2, 2023

See also dials/dials#2376 (comment)

@dagewa dagewa self-assigned this Dec 7, 2023
@dagewa
Copy link
Member Author

dagewa commented Mar 8, 2024

I have been looking at this today and unfortunately I am stuck. The problem is that there is code used to expand a template to filenames, which does not know if the filenames should be zero-padded or not:

def _expand_template(template: str, indices: Iterable[int]) -> list[str]:
"""Expand a template string to a list of filenames.
Args:
template: The template string, with a block of '#' to replace
indices: The numeric indices to insert
"""
pfx = template.split("#")[0]
sfx = template.split("#")[-1]
count = template.count("#")
return [f"{pfx}{index:0{count}}{sfx}" for index in indices]

I can put a count == 1 special case to expand to a non zero-padded list, but this will fail if the dataset is actually using a zero-padded template. That is under the proposal above,

dials.import template=$(dials.data get -q x4wide)/X4_wide_M1S4_2_#.cbf

is supposed to match the files X4_wide_M1S4_2_[0001..0090].cbf, but that information is not available to this function, so the special case will expand to X4_wide_M1S4_2_[1..90].cbf instead and then fail later on.

@dagewa
Copy link
Member Author

dagewa commented Mar 8, 2024

Rather than overloading # I feel it might be somewhat easier to add a new syntax for templates, where zero-padding is explicit. @graeme-winter's suggestion of printf-like syntax seems appealing.

@dagewa
Copy link
Member Author

dagewa commented Mar 8, 2024

Although, reviewing the above I think I may have incorrectly implemented @biochem-fan's suggestion. I was working on the assumption that X4_wide_M1S4_2_#.cbf should match sequential image numbers whether zero-padded or not. However, #646 (comment) states

  • images_#.cbf does not match.

in this case, and now I can see why.

I'll revise my branch in the light of this and see if it can be rescued.

@dagewa
Copy link
Member Author

dagewa commented Mar 8, 2024

Ok, I think it works. PR incoming 🙂

A remaining question is whether e.g. dials.import image_*.rodhypix can be made clever enough to figure out that there is a single scan of non-zero-padded images and correctly import as such. At the moment this fails for me with

ValueError: Please report this error to dials-support@lists.sourceforge.net: Could not determine filename template

@dagewa
Copy link
Member Author

dagewa commented Mar 15, 2024

I fixed the ValueError by expanding the special case single-digit # template regex to match multiple digits. This enables me to import a Rigaku dataset consisting of images with non-zero padded indices in the normal way:

$ dials.import ../Tyrosine/exp_333/frames/exp_333_1_*.rodhypix
DIALS (2018) Acta Cryst. D74, 85-97. https://doi.org/10.1107/S2059798317017235
DIALS 3.dev.1101-g0f4a43535
The following parameters have been modified:

input {
  experiments = <image files>
}

--------------------------------------------------------------------------------
  format: <class 'dxtbx.format.FormatROD.FormatROD'>
  template: /home/fcx32934/data/Rigaku/tyrosine/Tyrosine/exp_333/frames/exp_333_1_#.rodhypix:1:560
  num images: 560
  sequences:
    still:    0
    sweep:    1
  num stills: 0
--------------------------------------------------------------------------------
Writing experiments to imported.expt

🎉

@graeme-winter, the single digit template was apparently contentious when it was added (dials/dials#972) and I have extended its application. However, I can't think of any realistic circumstances where this could go wrong (perhaps due to lack of imagination). Would appreciate your comments and review at #705

@dagewa
Copy link
Member Author

dagewa commented Mar 16, 2024

NB added a commit that changes tactic slightly. Rather than extending the single-digit pattern meant for images like image1.cbf or whatever, I added a new pattern that includes the _ character, which seems to be part of the Rigaku / CAP filename template.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants