Whitespace padding of atom names in mmCIF files #148

sbliven · 2014-08-15T12:59:27Z

What's the proper way to deal with atom names? For PDB files we are careful to treat them as 4-character strings and to include whitespace in all checks (e.g. alpha carbon " CA " is different from calcium "CA  "). HetatomImpl even keeps separate dictionaries of the trimmed and untrimmed names for accessing atoms. However, in mmCIF the atom names are always left-justified, rather than using quoting to preserve the spaces. So calcium and alpha carbon can only be distinguished by comparing the element.

This bug was exposed by 41d53c3, which results in TestAltLocs.test3PIUmmcif() failing despite test3PIUpdb passing. Basically, the mmcif has "CA " atoms, which are matched to alpha carbons in some contexts but not others.

Are there any reasons to store the 4-letter version besides outputting PDB files? We need to use the correct justification when writing PDB files, but internally it should be whitespace insensitive, right? Could we store the trimmed version, and only generate the whitespace version in toPDB based on a lookup table for each element?
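To make the last suggestion concrete, here is a minimal sketch of regenerating the padded name only at PDB-output time. It assumes the usual PDB convention that the element symbol is right-justified in the first two columns of the 4-column name field, and it uses the length of the element symbol instead of a per-element lookup table. The class `PdbAtomNames` and its `pad` method are hypothetical and not part of BioJava.

```java
// Hypothetical helper, not part of BioJava: rebuilds the 4-column PDB atom
// name from a trimmed name plus the element symbol.
final class PdbAtomNames {

    private PdbAtomNames() {}

    /**
     * Pads a trimmed atom name to the 4-character PDB name field.
     * Assumed convention: names of one-letter elements (C, N, O, ...) start
     * in the second column, names of two-letter elements (Ca, Fe, ...) start
     * in the first column. Names of 4 or more characters are returned as-is.
     */
    static String pad(String atomName, String elementSymbol) {
        String name = atomName.trim();
        if (name.length() >= 4) {
            return name; // e.g. "HD11" already fills the whole field
        }
        boolean oneLetterElement = elementSymbol.trim().length() == 1;
        String shifted = oneLetterElement ? " " + name : name;
        // Left-justify and pad with trailing spaces to width 4.
        return String.format("%-4s", shifted);
    }

    public static void main(String[] args) {
        System.out.println("[" + pad("CA", "C") + "]");  // alpha carbon
        System.out.println("[" + pad("CA", "Ca") + "]"); // calcium
    }
}
```

With something like this, the internal representation could stay trimmed ("CA" in both cases) and only the writer would need to know about justification.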


josemduarte · 2014-08-15T13:25:33Z

Agreed. Keeping and using the padding spaces internally can be dangerous and should be avoided, especially if they are sometimes used to distinguish atoms (see e.g. the discussion in #144). As that discussion says, one reason to use the padding is to distinguish C-alpha and calcium atoms, which happen to have the same name ("CA") in the PDB. But there are alternative ways to distinguish them, such as atom.getElement() or group.getType().

This is especially important since mmCIF will become the official format some time soon.

sbliven · 2014-08-15T15:26:21Z

Ok, so I fixed the immediate bugs by improving the mmcif parser's fixFullAtomName hack to add spaces. Should I close this (after the merge), or does someone want to eventually tackle the underlying problem? I don't have the time myself.

josemduarte · 2014-08-15T18:06:35Z

Would it be ok to keep this open? I think the issue is going to resurface at some point. I'd like to give it a go when I get some time and try to fix the original problem.

- some exception handling fixes biojava#111
- important bug fixed in alt loc handling, the lookup map for atom names wasn't properly reset when adding a new alt loc group
- fixed implementation of StructureTools.getBackboneAtomArray (was including CB atoms, and excluding all GLY groups)
- added some tests for the StructureTools methods

- some improvement in exceptions and logging biojava#111 and biojava#155


andreasprlic · 2016-01-28T06:01:54Z

sorry, mistakenly re-opened.

josemduarte · 2016-01-28T06:16:33Z

That's indeed an issue, which we discussed already some time ago in #175.

As already discussed there, fixing it requires a new method getAtom(String, Element) which would be more precise than just getAtom(String) (the javadoc of Group.getAtom(String) also explains the issue).

In any case this really calls for removing the lookup HashMap in HetatomImpl, as discussed in #391. It gives very little speed-up at the expense of quite some memory. This atom name ambiguity issue is another indication that it's not useful.
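For illustration, a rough sketch of what a name-plus-element lookup could look like, done as a plain linear scan with no lookup map. The nested `Atom` and `Element` types are simplified, hypothetical stand-ins, not BioJava's own classes, and the static `getAtom` signature is an assumption rather than the actual API.

```java
// Hypothetical stand-ins for an atom and its element; only the lookup logic
// matters here. None of this is BioJava's actual implementation.
final class AtomLookupSketch {

    enum Element { C, N, O, Ca }

    static final class Atom {
        final String name;     // trimmed name, e.g. "CA"
        final Element element;

        Atom(String name, Element element) {
            this.name = name;
            this.element = element;
        }
    }

    /**
     * Linear scan over the atoms of a group. Returns the first atom whose
     * trimmed name and element both match, or null if there is none.
     */
    static Atom getAtom(java.util.List<Atom> atoms, String name, Element element) {
        String wanted = name.trim();
        for (Atom atom : atoms) {
            if (atom.name.equals(wanted) && atom.element == element) {
                return atom;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Artificial example: a calcium and an alpha carbon sharing the name "CA".
        java.util.List<Atom> atoms = java.util.Arrays.asList(
                new Atom("CA", Element.Ca),
                new Atom("CA", Element.C));
        Atom alphaCarbon = getAtom(atoms, "CA", Element.C);
        System.out.println(alphaCarbon.name + " / " + alphaCarbon.element);
    }
}
```

Since a group typically holds only a handful of atoms, a scan like this is probably cheap enough that the map is not needed, though that would have to be measured.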

sbliven added bug labels Aug 15, 2014

sbliven mentioned this issue Aug 15, 2014

Core bug fixes #149

Merged

sbliven closed this as completed in 5f1a731 Aug 15, 2014

josemduarte reopened this Aug 15, 2014

sbliven assigned josemduarte Aug 19, 2014

josemduarte added a commit to josemduarte/biojava that referenced this issue Sep 20, 2014

Full fix for biojava#148, all tests pass

6613e03

- some improvement in exceptions and logging biojava#111 and biojava#155

josemduarte added a commit to josemduarte/biojava that referenced this issue Sep 20, 2014

Full fix for biojava#148, all tests pass

6b50276

- some improvement in exceptions and logging biojava#111 and biojava#155

josemduarte mentioned this issue Sep 20, 2014

Removing padding spaces in internal atom name representation #175

Merged

josemduarte closed this as completed Oct 4, 2014

andreasprlic added this to the BioJava 4.0.0 milestone Oct 4, 2014

andreasprlic reopened this Jan 28, 2016

andreasprlic closed this as completed Jan 28, 2016
