Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Whitespace padding of atom names in mmCIF files #148

Closed
sbliven opened this issue Aug 15, 2014 · 5 comments
Closed

Whitespace padding of atom names in mmCIF files #148

sbliven opened this issue Aug 15, 2014 · 5 comments
Assignees
Labels
bug Bugs and bugfixes question Open discussions about the library
Milestone

Comments

@sbliven
Copy link
Member

sbliven commented Aug 15, 2014

What's the proper way to deal with atom names? For PDB files we are careful to treat them as 4-character strings and to include whitespace in all checks (eg alpha-carbon " CA " is different from calcium "CA "). HetatomImpl even keeps separate dictionaries of the trimmed and untrimmed names for accessing atoms. However, in mmCIF the atom names are always left justified, rather than using quoting to preserve the spaces. So calcium and alpha carbon can only be distinguished by comparing the element.

This bug was exposed by 41d53c3, which results in TestAltLocs.test3PIUmmcif() failing despite test3PIUpdb passing. Basically, the mmcif has "CA " atoms, which are matched to alpha carbons in some contexts but not others.

Are there any reasons to store the 4-letter version besides outputting PDB files? We need to use the correct justification when writing PDB files, but internally it should be whitespace insensitive, right? Could we store the trimmed version, and only generate the whitespace version in toPDB based on a lookup table for each element?

@josemduarte
Copy link
Contributor

Agreed. Keeping and using the padding spaces internally can be dangerous and should be avoided. Especially if they are used sometimes to distinguish atoms, e.g. see discussion in #144. As the discussion says one reason to use the padding is to distinguish Calpha and Calcium atoms which happen to have the same name ("CA") in the PDB. But there are alternatives ways to distinguish them, like through atom.getElement() or through group.getType().

This is especially important since mmCIF will become the official format some time soon.

@sbliven sbliven mentioned this issue Aug 15, 2014
@sbliven
Copy link
Member Author

sbliven commented Aug 15, 2014

Ok, so I fixed the immediate bugs by improving the mmcif parser's fixFullAtomName hack to add spaces. Should I close this (after the merge), or does someone want to eventually tackle the underlying problem? I don't have the time myself.

@josemduarte
Copy link
Contributor

Would it be ok to keep this open? I think the issue is going to resurface at some point. I'd like to give it a go when I get some time and try to fix the original problem.

@josemduarte josemduarte reopened this Aug 15, 2014
josemduarte added a commit to josemduarte/biojava that referenced this issue Sep 20, 2014
- some exception handling fixes biojava#111
- important bug fixed in alt loc handling, the lookup map for atom names wasn't properly reset when adding a new alt loc group
- fixed implementation of StructureTools.getBackboneAtomArray (was including CB atoms, and excluding all GLY groups)
- added some tests for the StructureTools methods
josemduarte added a commit to josemduarte/biojava that referenced this issue Sep 20, 2014
- some improvement in exceptions and logging biojava#111 and biojava#155
josemduarte added a commit to josemduarte/biojava that referenced this issue Sep 20, 2014
- some exception handling fixes biojava#111
- important bug fixed in alt loc handling, the lookup map for atom names wasn't properly reset when adding a new alt loc group
- fixed implementation of StructureTools.getBackboneAtomArray (was including CB atoms, and excluding all GLY groups)
- added some tests for the StructureTools methods
josemduarte added a commit to josemduarte/biojava that referenced this issue Sep 20, 2014
- some improvement in exceptions and logging biojava#111 and biojava#155
@andreasprlic andreasprlic added this to the BioJava 4.0.0 milestone Oct 4, 2014
@andreasprlic andreasprlic reopened this Jan 28, 2016
@andreasprlic
Copy link
Member

sorry, mistakenly re-opened.

@josemduarte
Copy link
Contributor

That's indeed an issue, which we discussed already some time ago in #175.

As already discussed there, fixing it requires a new method getAtom(String, Element) which would be more precise than just getAtom(String) (the javadoc of Group.getAtom(String) also explains the issue).

In any case this really calls for removing the lookup HashMap in HetatomImpl as discussed in #391. It really gives very little speed-up in expense of quite some memory. This atom name ambiguity issue is another indication that it's not useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugs and bugfixes question Open discussions about the library
Projects
None yet
Development

No branches or pull requests

3 participants