Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dials.export: Add option to specify composition of asymmetric unit so a complete SHELX instruction file can be written. #2623

Merged
merged 8 commits into from Apr 16, 2024

Conversation

huwjenkins
Copy link
Contributor

As discussed in #2608

This PR adds the option to specify the composition of the asymmetric unit when using format=shelx in dials.export. This will allow the .ins file written to be used directly for SHELXT etc. The command line option is composition. The string from the user is parsed using a variant of parse_compound() from xia2.to_shelx

Depending on the users input one of four things happen:

  1. No input. Minimal SFAC and UNIT instructions are written:
SFAC C
UNIT 0

This is hard coded and does not call the added _parse_compound() function. Personally I see this as the most useful option - users can always edit the .ins file in Olex2/ShelXle/vi etc.

  1. A list of elements is given but each element occurs only once. The list of elements is written to the SFAC instruction and all zeros to the UNIT instruction. For SHELXT this is all that is required.

  2. A formula. The list of elements is written to the SFAC instruction and Z*the number of times each element in the formula UNIT instruction. For SHELXT this is not required but I think for SHELXD it is. Obviously this requires that the space group is correct before export.

  3. A formula with unknown element - an error is raised as it seems better to correct here rather than write a potentially broken .ins file.

@huwjenkins
Copy link
Contributor Author

Comparison with xia2.to_shelx:

dials.export scaled.refl refined_cell.expt format=shelx composition="CHNOS":

TITL 19 in P212121
CELL  0.68890   5.4217   8.1333  12.0225   90.000   90.000   90.000
ZERR    4.000   0.0000   0.0000   0.0001    0.000    0.000    0.000
LATT -1
SYMM -X+1/2,-Y,Z+1/2
SYMM X+1/2,-Y+1/2,-Z
SYMM -X,Y+1/2,-Z+1/2
SFAC C H N O S
UNIT 0 0 0 0 0
HKLF 4
END

xia2.to_shelx scaled.mtz l-cys CHNOS -c l-cys.cif

TITL l-cys in P 21 21 21
CELL 0.68890 5.4217  8.1333  12.0225  90.000 90.000 90.000
ZERR 4 0.000030 0.000050 0.000070 0.000000 0.000000 0.000000
LATT -1
SYMM x+1/2,-y+1/2,-z
SYMM -x,y+1/2,-z+1/2
SYMM -x+1/2,-y,z+1/2

SFAC C H N O S
DISP C 0 0 0
DISP H 0 0 0
DISP N 0 0 0
DISP O 0 0 0
DISP S 0 0 0
UNIT 4.0 4.0 4.0 4.0 4.0


S     5    0.000000    0.000000    0.000000    11.00000    0.00000
O     4    0.000000    0.000000    0.000000    11.00000    0.00000
N     3    0.000000    0.000000    0.000000    11.00000    0.00000
C     1    0.000000    0.000000    0.000000    11.00000    0.00000
H     2    0.000000    0.000000    0.000000    11.00000    0.00000
HKLF 4

Known issues:

  1. Formulae with brackets don't get parsed correctly e.g. composition="(CH3)2" in P2(1)2(1)2(1) is written as:
SFAC C H
UNIT 4 128

Probably lots more corner cases.

Copy link
Member

@dagewa dagewa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @huwjenkins, this looks great! CI failures seem unrelated.

Copy link
Contributor

@graeme-winter graeme-winter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions which are not for this PR per se since they are improvements which are needed in the source material: the implementation you copied has much room for improvement (said the author thereof)

def _parse_compound(composition):
"""
Modified from parse_compund() in xia2/src/xia2/cli/to_shelx.py
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This I am sure is already available in cctbx somewhere else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes - cctbx/eltbx/chemical_elements.h which I assume is exported somewhere

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yes - thanks!

from cctbx.eltbx.chemical_elements import proper_caps_list
proper_caps_list()[:94]

Can replace the list of elements with scattering factors encoded in SHELXL included here.

result = {}
element = ""
number = ""
composition += "Q" # Not in any element symbol
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benjaminhwilliams has suggestions here, I would not write this this way today

@@ -584,6 +584,39 @@ def test_shelx_ins_best_unit_cell(dials_data, tmp_path):
assert result == pytest.approx(cell_esds[instruction], abs=0.001)


def test_shelx_ins_composition(dials_data, tmp_path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth adding a test which actually executes shelxt if available to verify?

@huwjenkins
Copy link
Contributor Author

@graeme-winter - I'd be happy to remove _parse_compound() and the composition option. My opinion is dials.export should write a .ins that contains the minimal input required for this file to be accepted by Olex-2 or shelXle and run SHELXT as that is I imagine the most common use case?

@dagewa
Copy link
Member

dagewa commented Mar 19, 2024

@huwjenkins, I think @benjaminhwilliams might have some concrete suggestions coming soon, so hold on for now...

@huwjenkins
Copy link
Contributor Author

ok. One other thought was perhaps making the option scatterers and then the user supplies a list of elements. I think this could be used in combination with .type = choice(multi=True) in the command line options to remove the need to check user input in export_shelx.py but I'm not sure how this would work from the command line.

Certainly .type = strings with .multiple = True would work.

I think the user supplying a comma separated list as input to a dials.xxx program is fairly common - e.g. detector.fix_list=, exclude_images=, best_unit_cell= etc

@huwjenkins
Copy link
Contributor Author

I think this could be used in combination with .type = choice(multi=True) in the command line options to remove the need to check user input in export_shelx.py but I'm not sure how this would work from the command line.

I can't get this to work. If I have:

  shelx {
    scatterers = C H N O S
    .type = choice(multi=True)
  }
$ dials.python test_phil.py shelx.scatterers=C+H 

fails:

Sorry: Not a possible choice for shelx.scatterers: C (command line argument, line 1)
  Possible choices are:
    C
    H
    N
    O
    S

but

$ dials.python test_phil.py shelx.scatterers=c+h

works:

The following parameters have been modified:

shelx {
  scatterers = *C *H N O S
}

@dagewa
Copy link
Member

dagewa commented Mar 19, 2024

I think @benjaminhwilliams was going to provide a regex-based solution? Getting string parsing right might be a bit of a pain, but in the end it will be a much more satisfactory approach than using PHIL choices

Copy link
Member

@benjaminhwilliams benjaminhwilliams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @huwjenkins for resurrecting this old code from xia2 and tidying it up somewhat. You'll see I have some non-blocking suggestions, but it may be that they really belong in a subsequent PR, which I'm happy to provide. If you prefer to roll them into this PR, I'm equally happy to add some commits to this branch. Just let me know your preference.

Comment on lines 297 to 328
composition += "Q" # Not in any element symbol
for c in composition:
if c in string.ascii_uppercase:
if not element:
element += c
continue
if number == "":
count = 1
else:
count = int(number)
if element not in result:
if element in elements:
result[element] = 0
else:
raise Sorry(f"Element {element} in {composition} not recognised")
if element in elements:
result[element] += count
element = "" + c
number = ""
if c == "Q":
break
elif c in string.ascii_lowercase:
element += c
elif c in string.digits:
number += c
if "C" in result:
# move C to start of list
elements.insert(0, elements.pop(5))
if all(n == 1 for n in result.values()):
# Assume user has just supplied list of elements so UNIT instruction cannot be caluclated
result = result.fromkeys(result, 0)
sorted_result = {e: result[e] for e in elements if e in result}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few things that have always troubled me about xia2.cli.to_shelx, which predates my time working on DIALS, and I once removed it altogether for being underused and undocumented, without sufficient support to tidy it up (that removal was subsequently reverted). However, it does now have some users so I'm glad to see it being tidied up and disentangled from intermediate MTZs.

There are lingering issues though, which I think merit a refactor, ideally with the addition of documentation of the valid (and invalid) input. Chief among the issues is this behaviour:

>>> from dials.util.export_shelx import _parse_compound as parse_compound
>>> parse_compound("C6H10N4O2")  # This is OK.
{'C': 6, 'H': 10, 'N': 4, 'O': 2}
>>> parse_compound("6CH10N4O2")  # This is dubious input, but apparently OK.
{'C': 6, 'H': 10, 'N': 4, 'O': 2}
>>> parse_compound("6C10HN4O2")  # This is equally dubious input, and doesn't fail but gives nonsense.
{'C': 610, 'H': 1, 'N': 4, 'O': 2}

Of course, this is a bit unfair as a criticism of this PR, since these are long-standing issues imported from elsewhere, so I'm happy to fix these in a subsequent PR, if you prefer to get this merged pronto, as is.

The refactor I'm proposing would exploit the fact that we're basically matching to a regular expression /([A-Z][a-z]?)(\d*)/g. We probably want to check first that the composition doesn't contain any rubbish by first doing something like

re.fullmatch(r"(?:([A-Z][a-z]?)(\d*))+", composition)

or match to r"(?:([A-Z][a-z]?)(\d*)\s*)+" or r"(?:([A-Z][a-z]?)\s*(\d*)\s*)+" or even do

re.fullmatch(r"(?:([A-Z][a-z]?)\s*(\d*)\s*)+", composition.strip())

depending how permissive you want to be of the user putting spaces in the input. Then do something like

result = {element: int(count or 1) for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", composition)}

or match to r"([A-Z][a-z]?)\s*(\d*)" if you want to permit spaces between element and count in the input, or do something slightly more sophisticated to sum up counts when handling condensed formulae with duplicate elements, as in CH3CH2CH2CH3.

As I say, I'm happy to make these changes myself (probably tomorrow, since it's now evening in UK time), either as a change set in this PR, or as a separate PR after this is merged in its present state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously, if I was being good and not just scribbling things down, I'd be using re.compile. Examples are serving suggestions only.

src/dials/util/export_shelx.py Outdated Show resolved Hide resolved
@dagewa
Copy link
Member

dagewa commented Mar 19, 2024

Just a note to say it might also be worth evaluating what's already out there for handling chemical formulae. Some quick googling led me to https://pypi.org/project/chemformula/

Edit, also: https://pypi.org/project/chemparse/

@benjaminhwilliams
Copy link
Member

Just a note to say it might also be worth evaluating what's already out there for handling chemical formulae. Some quick googling led me to https://pypi.org/project/chemformula/

Edit, also: https://pypi.org/project/chemparse/

Ooh, ChemFormula is nice. I hadn't seen that before. Scratch my review!

@huwjenkins
Copy link
Contributor Author

I just pushed an update to use the regex suggested by @benjaminhwilliams - I think as long as dials.export recognises composition like: 'C11 H10 N O', 'C52 H43 B1 F24 Mn1 N1 O5 P2', 'C H O F S P Fe Pd' etc. it'll be fine.

It doesn't currently check first that the composition doesn't contain any rubbish - I'm not sure how to turn the result of the check suggested by @benjaminhwilliams into a useful error message...

Those examples came from _chemical_oxdiff_formula in some .cif files I have been sent (so presumably record what the chemists typed into some software whilst collecting the data)?

# Assume user has just supplied list of elements so UNIT instruction cannot be calculated
result = {element: 0 for element, count in m}
else:
result = {element: int(count or 1) for element, count in m}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A thought occurs to me. Does SHELX input support non-stoichiometric formulae? Should int be float?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it is not an error for shelxl, but it would be unconventional. I haven't found any examples of non-integer values after UNIT online. Wouldn't this just be represented by the occupancy column for the atom?

As for shelxt, it ignores what's in UNIT entirely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UNIT is element count in unit cell not asu. I was multiplying the formula entered by Z. Quite common to have fractions of the formula in the asu but fractions of an atom in the unit cell ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep this as int. Convenient to have the composition multiplied by Z 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I now understand @benjaminhwilliams's comment. Take an example like CCDC 1880252. In that case I think with the current code you'd want to write 'C6 H5 B0.25 F P0.25' if you scaled the data in I-4. So maybe we should have float here and convert to int when writing the UNIT card? Or allow users to set Z (as in _cell_formula_units_Z) to override space_group().order_z()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, gotcha. Allow float input and convert to int makes sense to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking some more about this probably the best approach is to not allow non-integer values but add a parameter shelx.zprime so for the example above the user would enter 'C24 H20 B F4 P' and zprime=0.25 and get Z = 2.

In that case should shelx.composition be renamed shelx.formula? or should we have both - where shelx.composition is just element types and shelx.formula is always interpreted as a formula even when there no numbers.

dagewa and others added 2 commits April 15, 2024 14:19
Including @benjaminhwilliams's suggestion from review

Co-authored-by: Ben Williams <benjaminhwilliams@users.noreply.github.com>
@dagewa
Copy link
Member

dagewa commented Apr 16, 2024

The CI ❌ seems unrelated. Merging this now, so any further work on "rubbish detection" and non-stoichiometric forumulae could go in a future PR

@dagewa dagewa merged commit 20c0e11 into dials:main Apr 16, 2024
9 of 11 checks passed
amyjaynethompson pushed a commit to amyjaynethompson/dials that referenced this pull request Apr 24, 2024
… a complete SHELX instruction file can be written. (dials#2623)


---------

Co-authored-by: Ben Williams <benjaminhwilliams@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants