New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dials.export: Add option to specify composition of asymmetric unit so a complete SHELX instruction file can be written. #2623
Conversation
… a complete SHELX instruction file can be written.
Comparison with
Known issues:
Probably lots more corner cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @huwjenkins, this looks great! CI failures seem unrelated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions which are not for this PR per se since they are improvements which are needed in the source material: the implementation you copied has much room for improvement (said the author thereof)
src/dials/util/export_shelx.py
Outdated
def _parse_compound(composition): | ||
""" | ||
Modified from parse_compund() in xia2/src/xia2/cli/to_shelx.py | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This I am sure is already available in cctbx
somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes - cctbx/eltbx/chemical_elements.h
which I assume is exported somewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh yes - thanks!
from cctbx.eltbx.chemical_elements import proper_caps_list
proper_caps_list()[:94]
Can replace the list of elements with scattering factors encoded in SHELXL included here.
src/dials/util/export_shelx.py
Outdated
result = {} | ||
element = "" | ||
number = "" | ||
composition += "Q" # Not in any element symbol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benjaminhwilliams has suggestions here, I would not write this this way today
@@ -584,6 +584,39 @@ def test_shelx_ins_best_unit_cell(dials_data, tmp_path): | |||
assert result == pytest.approx(cell_esds[instruction], abs=0.001) | |||
|
|||
|
|||
def test_shelx_ins_composition(dials_data, tmp_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth adding a test which actually executes shelxt
if available to verify?
@graeme-winter - I'd be happy to remove |
@huwjenkins, I think @benjaminhwilliams might have some concrete suggestions coming soon, so hold on for now... |
ok. One other thought was perhaps making the option Certainly I think the user supplying a comma separated list as input to a |
I can't get this to work. If I have:
fails:
but
works:
|
I think @benjaminhwilliams was going to provide a regex-based solution? Getting string parsing right might be a bit of a pain, but in the end it will be a much more satisfactory approach than using PHIL choices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @huwjenkins for resurrecting this old code from xia2 and tidying it up somewhat. You'll see I have some non-blocking suggestions, but it may be that they really belong in a subsequent PR, which I'm happy to provide. If you prefer to roll them into this PR, I'm equally happy to add some commits to this branch. Just let me know your preference.
src/dials/util/export_shelx.py
Outdated
composition += "Q" # Not in any element symbol | ||
for c in composition: | ||
if c in string.ascii_uppercase: | ||
if not element: | ||
element += c | ||
continue | ||
if number == "": | ||
count = 1 | ||
else: | ||
count = int(number) | ||
if element not in result: | ||
if element in elements: | ||
result[element] = 0 | ||
else: | ||
raise Sorry(f"Element {element} in {composition} not recognised") | ||
if element in elements: | ||
result[element] += count | ||
element = "" + c | ||
number = "" | ||
if c == "Q": | ||
break | ||
elif c in string.ascii_lowercase: | ||
element += c | ||
elif c in string.digits: | ||
number += c | ||
if "C" in result: | ||
# move C to start of list | ||
elements.insert(0, elements.pop(5)) | ||
if all(n == 1 for n in result.values()): | ||
# Assume user has just supplied list of elements so UNIT instruction cannot be caluclated | ||
result = result.fromkeys(result, 0) | ||
sorted_result = {e: result[e] for e in elements if e in result} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few things that have always troubled me about xia2.cli.to_shelx
, which predates my time working on DIALS, and I once removed it altogether for being underused and undocumented, without sufficient support to tidy it up (that removal was subsequently reverted). However, it does now have some users so I'm glad to see it being tidied up and disentangled from intermediate MTZs.
There are lingering issues though, which I think merit a refactor, ideally with the addition of documentation of the valid (and invalid) input. Chief among the issues is this behaviour:
>>> from dials.util.export_shelx import _parse_compound as parse_compound
>>> parse_compound("C6H10N4O2") # This is OK.
{'C': 6, 'H': 10, 'N': 4, 'O': 2}
>>> parse_compound("6CH10N4O2") # This is dubious input, but apparently OK.
{'C': 6, 'H': 10, 'N': 4, 'O': 2}
>>> parse_compound("6C10HN4O2") # This is equally dubious input, and doesn't fail but gives nonsense.
{'C': 610, 'H': 1, 'N': 4, 'O': 2}
Of course, this is a bit unfair as a criticism of this PR, since these are long-standing issues imported from elsewhere, so I'm happy to fix these in a subsequent PR, if you prefer to get this merged pronto, as is.
The refactor I'm proposing would exploit the fact that we're basically matching to a regular expression /([A-Z][a-z]?)(\d*)/g
. We probably want to check first that the composition
doesn't contain any rubbish by first doing something like
re.fullmatch(r"(?:([A-Z][a-z]?)(\d*))+", composition)
or match to r"(?:([A-Z][a-z]?)(\d*)\s*)+"
or r"(?:([A-Z][a-z]?)\s*(\d*)\s*)+"
or even do
re.fullmatch(r"(?:([A-Z][a-z]?)\s*(\d*)\s*)+", composition.strip())
depending how permissive you want to be of the user putting spaces in the input. Then do something like
result = {element: int(count or 1) for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", composition)}
or match to r"([A-Z][a-z]?)\s*(\d*)"
if you want to permit spaces between element and count in the input, or do something slightly more sophisticated to sum up counts when handling condensed formulae with duplicate elements, as in CH3CH2CH2CH3
.
As I say, I'm happy to make these changes myself (probably tomorrow, since it's now evening in UK time), either as a change set in this PR, or as a separate PR after this is merged in its present state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obviously, if I was being good and not just scribbling things down, I'd be using re.compile
. Examples are serving suggestions only.
Just a note to say it might also be worth evaluating what's already out there for handling chemical formulae. Some quick googling led me to https://pypi.org/project/chemformula/ Edit, also: https://pypi.org/project/chemparse/ |
Ooh, ChemFormula is nice. I hadn't seen that before. Scratch my review! |
I just pushed an update to use the regex suggested by @benjaminhwilliams - I think as long as It doesn't currently check first that the composition doesn't contain any rubbish - I'm not sure how to turn the result of the check suggested by @benjaminhwilliams into a useful error message... Those examples came from |
# Assume user has just supplied list of elements so UNIT instruction cannot be calculated | ||
result = {element: 0 for element, count in m} | ||
else: | ||
result = {element: int(count or 1) for element, count in m} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A thought occurs to me. Does SHELX input support non-stoichiometric formulae? Should int
be float
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it is not an error for shelxl
, but it would be unconventional. I haven't found any examples of non-integer values after UNIT
online. Wouldn't this just be represented by the occupancy column for the atom?
As for shelxt
, it ignores what's in UNIT
entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UNIT
is element count in unit cell not asu. I was multiplying the formula entered by Z. Quite common to have fractions of the formula in the asu but fractions of an atom in the unit cell ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep this as int. Convenient to have the composition multiplied by Z 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I now understand @benjaminhwilliams's comment. Take an example like CCDC 1880252. In that case I think with the current code you'd want to write 'C6 H5 B0.25 F P0.25' if you scaled the data in I-4. So maybe we should have float
here and convert to int when writing the UNIT
card? Or allow users to set Z (as in _cell_formula_units_Z
) to override space_group().order_z()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, gotcha. Allow float input and convert to int makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking some more about this probably the best approach is to not allow non-integer values but add a parameter shelx.zprime
so for the example above the user would enter 'C24 H20 B F4 P' and zprime=0.25
and get Z = 2.
In that case should shelx.composition
be renamed shelx.formula
? or should we have both - where shelx.composition
is just element types and shelx.formula
is always interpreted as a formula even when there no numbers.
Including @benjaminhwilliams's suggestion from review Co-authored-by: Ben Williams <benjaminhwilliams@users.noreply.github.com>
The CI ❌ seems unrelated. Merging this now, so any further work on "rubbish detection" and non-stoichiometric forumulae could go in a future PR |
… a complete SHELX instruction file can be written. (dials#2623) --------- Co-authored-by: Ben Williams <benjaminhwilliams@users.noreply.github.com>
As discussed in #2608
This PR adds the option to specify the composition of the asymmetric unit when using
format=shelx
indials.export
. This will allow the.ins
file written to be used directly for SHELXT etc. The command line option iscomposition
. The string from the user is parsed using a variant ofparse_compound()
fromxia2.to_shelx
Depending on the users input one of four things happen:
SFAC
andUNIT
instructions are written:This is hard coded and does not call the added
_parse_compound()
function. Personally I see this as the most useful option - users can always edit the.ins
file in Olex2/ShelXle/vi etc.A list of elements is given but each element occurs only once. The list of elements is written to the
SFAC
instruction and all zeros to theUNIT
instruction. For SHELXT this is all that is required.A formula. The list of elements is written to the
SFAC
instruction and Z*the number of times each element in the formulaUNIT
instruction. For SHELXT this is not required but I think for SHELXD it is. Obviously this requires that the space group is correct before export.A formula with unknown element - an error is raised as it seems better to correct here rather than write a potentially broken
.ins
file.