-
Notifications
You must be signed in to change notification settings - Fork 379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Whitespace padding of atom names in mmCIF files #148
Comments
Agreed. Keeping and using the padding spaces internally can be dangerous and should be avoided. Especially if they are used sometimes to distinguish atoms, e.g. see discussion in #144. As the discussion says one reason to use the padding is to distinguish Calpha and Calcium atoms which happen to have the same name ("CA") in the PDB. But there are alternatives ways to distinguish them, like through This is especially important since mmCIF will become the official format some time soon. |
Ok, so I fixed the immediate bugs by improving the mmcif parser's |
Would it be ok to keep this open? I think the issue is going to resurface at some point. I'd like to give it a go when I get some time and try to fix the original problem. |
- some exception handling fixes biojava#111 - important bug fixed in alt loc handling, the lookup map for atom names wasn't properly reset when adding a new alt loc group - fixed implementation of StructureTools.getBackboneAtomArray (was including CB atoms, and excluding all GLY groups) - added some tests for the StructureTools methods
- some improvement in exceptions and logging biojava#111 and biojava#155
- some exception handling fixes biojava#111 - important bug fixed in alt loc handling, the lookup map for atom names wasn't properly reset when adding a new alt loc group - fixed implementation of StructureTools.getBackboneAtomArray (was including CB atoms, and excluding all GLY groups) - added some tests for the StructureTools methods
- some improvement in exceptions and logging biojava#111 and biojava#155
sorry, mistakenly re-opened. |
That's indeed an issue, which we discussed already some time ago in #175. As already discussed there, fixing it requires a new method In any case this really calls for removing the lookup HashMap in |
What's the proper way to deal with atom names? For PDB files we are careful to treat them as 4-character strings and to include whitespace in all checks (eg alpha-carbon " CA " is different from calcium "CA "). HetatomImpl even keeps separate dictionaries of the trimmed and untrimmed names for accessing atoms. However, in mmCIF the atom names are always left justified, rather than using quoting to preserve the spaces. So calcium and alpha carbon can only be distinguished by comparing the element.
This bug was exposed by 41d53c3, which results in
TestAltLocs.test3PIUmmcif()
failing despitetest3PIUpdb
passing. Basically, the mmcif has "CA " atoms, which are matched to alpha carbons in some contexts but not others.Are there any reasons to store the 4-letter version besides outputting PDB files? We need to use the correct justification when writing PDB files, but internally it should be whitespace insensitive, right? Could we store the trimmed version, and only generate the whitespace version in toPDB based on a lookup table for each element?
The text was updated successfully, but these errors were encountered: