Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse complex chemical formula strings #3954

Merged
merged 5 commits into from
May 7, 2020

Conversation

lorisercole
Copy link
Contributor

The "parse_formula" function can now parse more complex formulas, including elements with partial occupancies and groups of elements enclosed in parentheses.
Spaces separators are no more needed.
Elements are counted and a dictionary is returned.
This can be useful to parse the formulas included in cif files (for example to debug a cif parser), or those generated by pymatgen.

e.g. C[NH2]3NO3 --> {'C': 1, 'N': 4, 'H': 6, 'O': 3}

The "parse_formula" function can now parse more complex formulas,
including elements with partial occupancies and groups of elements enclosed
in parentheses. Spaces separators are no more needed.
Elements are counted and a dictionary is returned.
This can be useful to parse the formulas included in cif files, or those
generated by pymatgen.

e.g.  C[NH2]3NO3  -->  {'C': 1, 'N': 4, 'H': 6, 'O': 3}
@lorisercole lorisercole changed the title Parse complex chemical formulas strings Parse complex chemical formula strings Apr 21, 2020
@codecov
Copy link

codecov bot commented Apr 21, 2020

Codecov Report

Merging #3954 into develop will increase coverage by 0.01%.
The diff coverage is 87.88%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3954      +/-   ##
===========================================
+ Coverage    78.45%   78.45%   +0.01%     
===========================================
  Files          461      461              
  Lines        34079    34094      +15     
===========================================
+ Hits         26733    26746      +13     
- Misses        7346     7348       +2     
Flag Coverage Δ
#django 70.50% <87.88%> (+0.01%) ⬆️
#sqlalchemy 71.35% <87.88%> (+0.01%) ⬆️
Impacted Files Coverage Δ
aiida/orm/nodes/data/cif.py 86.58% <87.88%> (-0.54%) ⬇️
aiida/engine/daemon/client.py 73.72% <0.00%> (+1.15%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a038876...11bf5c8. Read the comment docs.

@sphuber sphuber self-requested a review April 21, 2020 21:37
Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot Loris! What I would like to ask is for you to add some unit tests, to make sure that the implementation works as expected and will continue to do so in the future. To help you along your way, I have made a start. Just create the file tests/orm/data/test_cif.py and put the following content in there:

import pytest

from aiida.orm.nodes.data.cif import parse_formula


def test_parse_formula():
    """Test the `parse_formula` utility function."""
    assert parse_formula('C H') == {'C': 1, 'H': 1}
    assert parse_formula('C5 H1') == {'C': 5, 'H': 1}
    assert parse_formula('Ca5 Ho') == {'Ca': 5, 'Ho': 1}
    assert parse_formula('H0.5 O') == {'H': 0.5, 'O': 1}
    assert parse_formula('C0 O0') == {'C': 0, 'O': 0}
    assert parse_formula('C1 H1 ') == {'C': 1, 'H': 1}
    assert parse_formula(' C1 H1') == {'C': 1, 'H': 1}
    assert parse_formula('C[NH2]3NO3') == {'C': 1, 'N': 4, 'H': 6, 'O': 3}

    with pytest.raises(ValueError):
        parse_formula('H0.5.2 O')

This should give you an idea how to write tests. You should add some more with specific cases that you think this implementation should cover.

Most of these tests already existed in another file, tests/test_dataclasses.py line 578 to 594. These were in the old style, but I have now translated them and added one example that you described in the docstring. You can delete those lines now and add some more tests in the file you will create.

After having created the file, you can run the tests by executing pytest tests/orm/data/test_cif.py.

If you run into trouble let me know.

Copy link
Contributor

@CasperWA CasperWA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great with a little love to this part of the code! Thanks @lorisercole!

I have some minor beauty-suggestions, not really related to the content and logic as a whole, but still relevant.

I haven't tested your regular expressions extensively, however, maybe you could somewhat do this yourself, i.e., create a pytest test for this function that goes through a list of varying formulas and check that what the function returns is also what you would expect.
Both for valid and invalid formulas.

aiida/orm/nodes/data/cif.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/cif.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/cif.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/cif.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/cif.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/cif.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/cif.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/cif.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/cif.py Outdated Show resolved Hide resolved
@lorisercole
Copy link
Contributor Author

OK, thanks guys.
I implemented the suggestions by @CasperWA and added some tests.
The parser should work for the chemical formulas that can be read from COD, ICSD, MPDS cif files, as well as those generated by the many modes of StructureData.get_formula, by pymatgen, and ase.

In the commit e0c6704ee35d7820d8084a10fe807f964b514f0b I also edited the function CifData.get_formulae to be able to read the chemical formulas of MPDS cif files, which use a different tag (only available for v1.0.5+ files, recently updated).

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvements @lorisercole . There are a few minor problems in the tests still, but mostly I am wondering if we should deviate from the CIF standard as this PR does. @giovannipizzi what do you think? Is it okay to support tags and tag formats that are invalid according to the CIF spec? See my comments in the PR for details

tests/orm/data/test_cif.py Outdated Show resolved Hide resolved
tests/orm/data/test_cif.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/cif.py Show resolved Hide resolved
aiida/orm/nodes/data/cif.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/cif.py Show resolved Hide resolved
aiida/orm/nodes/data/cif.py Outdated Show resolved Hide resolved
Test the chemical formula parser against many cumbersome cases.
It is possible to specify 'custom_tags' that will try to be read by the
'CifData.get_formulae' method.
@lorisercole
Copy link
Contributor Author

Thanks. I updated the branch following your suggestions.

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the improvements @lorisercole !

@sphuber sphuber dismissed CasperWA’s stale review May 7, 2020 10:12

Changes addressed

@sphuber sphuber merged commit 6220e85 into aiidateam:develop May 7, 2020
@sphuber sphuber added this to the v1.3.0 milestone May 20, 2020
@ltalirz
Copy link
Member

ltalirz commented Jul 1, 2020

Is there somewhere a specification of the "more complex" formula format we are supporting here?

@lorisercole
Copy link
Contributor Author

@ltalirz see 9ead2b1:

The "parse_formula" function can now parse more complex formulas,
including elements with partial occupancies and groups of elements enclosed
in parentheses. Spaces separators are no more needed.
Elements are counted and a dictionary is returned.
This can be useful to parse the formulas included in cif files, or those
generated by pymatgen.

e.g.  C[NH2]3NO3  -->  {'C': 1, 'N': 4, 'H': 6, 'O': 3}

More (weird) examples in the tests:

def test_parse_formula():
"""Test the `parse_formula` utility function."""
assert parse_formula('C H') == {'C': 1, 'H': 1}
assert parse_formula('C5 H1') == {'C': 5, 'H': 1}
assert parse_formula('Ca5 Ho') == {'Ca': 5, 'Ho': 1}
assert parse_formula('H0.5 O') == {'H': 0.5, 'O': 1}
assert parse_formula('C0 O0') == {'C': 0, 'O': 0}
assert parse_formula('C1 H1 ') == {'C': 1, 'H': 1}
assert parse_formula(' C1 H1') == {'C': 1, 'H': 1}
assert parse_formula('CaHClO') == {'Ca': 1, 'H': 1, 'Cl': 1, 'O': 1}
assert parse_formula('C70 H108 Al4 La4 N4 O10') == {'C': 70, 'H': 108, 'Al': 4, 'La': 4, 'N': 4, 'O': 10}
assert parse_formula('C70H108Al4Li4N4O10') == {'C': 70, 'H': 108, 'Al': 4, 'Li': 4, 'N': 4, 'O': 10}
assert parse_formula('C36 H59LiN2 O3 Si') == {'C': 36, 'H': 59, 'Li': 1, 'N': 2, 'O': 3, 'Si': 1}
assert parse_formula('C63.5H83.5Li2N2O3.25P2') == {'C': 63.5, 'H': 83.5, 'Li': 2, 'N': 2, 'O': 3.25, 'P': 2}
assert parse_formula('Fe Li0.667 O4 P1') == {'Fe': 1, 'Li': 0.667, 'O': 4, 'P': 1}
assert parse_formula('Fe2.05Ni0.05O4 Zn0.9') == {'Fe': 2.05, 'Ni': 0.05, 'O': 4, 'Zn': 0.9}
assert parse_formula('Li3O6(Al0.23Li0.77)2(Li0.07Te0.93)') == {'Li': 4.61, 'O': 6, 'Al': 0.46, 'Te': 0.93}
assert parse_formula('Li2{Cr0.05Li0.95X0.00}{Cr0.24Li0.76}2{Li0.02Te0.98}O6') == {
'Li': 4.49,
'Cr': 0.53,
'X': 0,
'Te': 0.98,
'O': 6
}
assert parse_formula('C[NH2]3NO3') == {'C': 1, 'N': 4, 'H': 6, 'O': 3}
assert parse_formula('H80 C104{C0.50 X0.50}8N8 Cl4(Cl0.50X0.50)8.0O8') == {
'H': 80,
'C': 108.0,
'X': 8.0,
'N': 8,
'Cl': 8.0,
'O': 8
}
assert parse_formula('Na1.28[NH]0.28{N H2}0.72') == {'Na': 1.28, 'N': 1.0, 'H': 1.72}
for test_formula in ('H0.5.2 O', 'Fe2.05Ni0.05.4', 'Na1.28[NH]0.28.3{NH2}0.72'):
with pytest.raises(ValueError):
parse_formula(test_formula)

@ltalirz
Copy link
Member

ltalirz commented Jul 1, 2020

Thanks, so I see it includes square brackets, curly braces as well as round braces, with and without spaces; integers, floating point numbers etc.
Is this following a standard specified somewhere or is this something you "invented"?

I'm just mentioning this because it can be very useful to tell people that this field is guaranteed to support formulas of format X. Ideally X is fully specified in itself; a second-best option would be X = specification A + features B,C

@lorisercole
Copy link
Contributor Author

Yes it supports round, square, and curly brackets.

  • round brackets are used in the chemical formula tags of some CIF files, such as from ICSD
  • square brackets are found in the chemical formula tags of some CIF files, such as from MPDS
  • curly brackets are used in the formulas returned by StructureData.get_formula()
  • Spaces are sometimes used, sometimes not.

The tests contain intentionally more general/complex cases than one should ever find (this would even allow one to parse formulas written "by hand" in weird ways, for whatever reason).
The parser counts all the elements and returns a dictionary of them. If an element is present in several bracketed groups, the sum will be returned, e.g.

parse_formula('Na1.28[NH]0.28{N H2}0.72')  -->  {'Na': 1.28, 'N': 1.0, 'H': 1.72}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants