PDB File Models Missing Element Column #537

lafita · 2016-07-15T08:58:19Z

We have recently noticed that some structural bioinformatics programs (structure refinement or modelling) generate PDB files where the Element column is missing. The element column is the last column, where the periodic table element of the Atom is indicated.

Parsing these files with BioJava currently does not allow the calculation of structural alignments or symmetry (or any other analysis using C-alpha atoms), because extracting the C-alpha atoms of a structure checks both the name (CA) and the element (C) of each Atom (in StructureTools.getRepresentativeAtoms()).
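
To make the failure mode concrete, here is a minimal sketch of a name-plus-element check. `SimpleAtom` and `isCAlpha` are hypothetical stand-ins written for this illustration; they are not BioJava's `Atom` class or the actual body of `StructureTools.getRepresentativeAtoms()`.

```java
class CAlphaCheck {
    /** Hypothetical stand-in for an atom: trimmed name plus element symbol (null if unknown). */
    static final class SimpleAtom {
        final String name;
        final String element;

        SimpleAtom(String name, String element) {
            this.name = name;
            this.element = element;
        }
    }

    /** True only when the name is CA and the element is carbon. */
    static boolean isCAlpha(SimpleAtom atom) {
        return "CA".equals(atom.name) && "C".equals(atom.element);
    }

    public static void main(String[] args) {
        System.out.println(isCAlpha(new SimpleAtom("CA", "C")));   // C-alpha carbon
        System.out.println(isCAlpha(new SimpleAtom("CA", "CA")));  // calcium ion
        System.out.println(isCAlpha(new SimpleAtom("CA", null)));  // element column missing
    }
}
```

With a check of this shape, an atom whose element could not be read is rejected even though its name is CA, which is why files without the Element column yield no representative atoms.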

The Element column is not completely redundant, because in the case of a modified amino acid with calcium bound to it, the name CA alone does not distinguish the calcium from the C-alpha carbon, and the element column is needed to do so.

On the other hand, we could print a warning when parsing such models, then guess and fill in the Element column from the Atom names (at least for the Atoms in amino acids), in order to support the incomplete files.

Question: is there any drawback in guessing and filling the Element of the Atoms? Can we use the Chemical Components for that?

lafita · 2016-07-15T09:23:08Z

The Element column is not completely redundant, because in the case of a modified amino acid with calcium bound to it, the name CA alone does not distinguish the calcium from the C-alpha carbon, and the element column is needed to do so.

@gcapitani noticed that the atom name column in PDB files is shifted one position to the left for calcium, with respect to C-alpha atoms, so the Element column is redundant in this case.
Thus, it is possible to identify while parsing whether a CA Atom is a carbon or a calcium.
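
The alignment hint can be sketched as follows, assuming the writing program follows the PDB convention that the atom name occupies columns 13-16 and that one-letter elements start in column 14. The class and method names are hypothetical, and the result is only a hint: files from programs that ignore the alignment convention would be misclassified.

```java
class AtomNameAlignment {
    /** Returns the raw, untrimmed 4-character atom name field (PDB columns 13-16). */
    static String rawAtomName(String pdbLine) {
        return pdbLine.substring(12, 16);
    }

    /** Guesses "C" (carbon) or "CA" (calcium) for an atom named CA from its alignment. */
    static String guessCaElement(String pdbLine) {
        String raw = rawAtomName(pdbLine);
        if (!raw.trim().equals("CA")) {
            throw new IllegalArgumentException("Not a CA atom: '" + raw + "'");
        }
        // A leading space means the element symbol has one letter (carbon);
        // a name starting in column 13 means a two-letter symbol (calcium).
        return raw.startsWith(" ") ? "C" : "CA";
    }

    public static void main(String[] args) {
        // Lines are assembled in pieces so the column positions are easy to verify.
        String cAlpha  = "ATOM      2 " + " CA " + " ALA A   1";
        String calcium = "HETATM 1000 " + "CA  " + "  CA A 201";
        System.out.println(guessCaElement(cAlpha));
        System.out.println(guessCaElement(calcium));
    }
}
```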

josemduarte · 2016-07-15T14:09:32Z

Guessing the elements for standard amino acids should be fine, but we need to print warnings. You don't even need to look at the column where "CA" is: standard amino acids don't contain calcium, so it's safe to always assign a "C". For any other molecule I'd say we shouldn't try guessing and should instead fail with a clear error message such as "Element is missing for line ..."

josemduarte · 2016-07-15T14:13:12Z

This relates to #305

josemduarte · 2016-07-15T14:18:30Z

Actually using the chemical component dictionary (CCD) for this would provide a general solution: if the residue's 3-letter code is that of a valid chem comp and the element is not present we can read it by looking up the atom name in the CCD. That would work for any kind of molecule.
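
A rough sketch of that lookup, using a tiny hand-written map in place of the real dictionary. `TOY_CCD` and `lookupElement` are hypothetical names for this illustration only; an actual implementation would read the atom definitions from the CCD rather than from a hard-coded table.

```java
import java.util.Map;
import java.util.Optional;

class CcdElementLookup {
    // Hypothetical excerpt standing in for the CCD:
    // component id -> (atom name -> element symbol).
    static final Map<String, Map<String, String>> TOY_CCD = Map.of(
        "ALA", Map.of("N", "N", "CA", "C", "C", "C", "O", "O", "CB", "C"),
        "CA", Map.of("CA", "CA"));

    /** Returns the element for an atom of a known component, or empty if either is unknown. */
    static Optional<String> lookupElement(String resName, String atomName) {
        Map<String, String> atoms = TOY_CCD.get(resName.trim());
        if (atoms == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(atoms.get(atomName.trim()));
    }

    public static void main(String[] args) {
        System.out.println(lookupElement("ALA", " CA "));  // C-alpha of alanine
        System.out.println(lookupElement(" CA", "CA  "));  // calcium ion component
        System.out.println(lookupElement("XYZ", "CA"));    // unknown component
    }
}
```

Because the residue code is part of the key, the atom name CA resolves to carbon inside ALA and to calcium inside the CA component, without relying on whitespace in the name.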

lafita · 2016-07-15T14:29:18Z

Great! I think that the optimal solution then is to complete the Element field of Atoms from the Chemical Component information when parsing and print a warning explaining the PDB format error and the bugs it may cause.

sbliven · 2016-07-18T13:35:00Z

BioJava used to include whitespace in atom names, so that " CA " (carbon) was distinct from "CA " (calcium). However, this broke in the transition to mmCIF, which trims whitespace from fields. I suppose it could be used as a hint for guessing, but it shouldn't be stored in data structures.

lafita · 2016-07-27T09:01:29Z

I have been looking at the code. The Element field is parsed in PDBFileParser with the following code, which already handles a missing Element column in PDB files:

// Parse element from the element field. If this field is
// missing (i.e. misformatted PDB file), then parse the
// name from the atom name.
Element element = Element.R;
if ( line.length() > 77 ) {
    // parse element from element field
    try {
        element = Element.valueOfIgnoreCase(line.substring (76, 78).trim());
    }  catch (IllegalArgumentException e){}
} else {
    // parse the name from the atom name
    String elementSymbol = null;
    // for atom names with 4 characters, the element is
    // at the first position, example HG23 in Valine
    if (fullname.trim().length() == 4) {
        elementSymbol = fullname.substring(0, 1);
    } else if ( fullname.trim().length() > 1){
        elementSymbol = fullname.substring(0, 2).trim();
    } 

    try {
        if (elementSymbol!=null)
            element = Element.valueOfIgnoreCase(elementSymbol);
    } catch (IllegalArgumentException e){
        logger.warn("Element {} was not recognised. Assigning generic element R to it",
                    elementSymbol);
    }
}
atom.setElement(element);

There are three things to improve:

1. There is a bug where Atom names of a single letter are ignored and Element.R is always assigned (the bug is in the conditional fullname.trim().length() > 1, where it should be >= 1).
2. The Atom names have to be trimmed before taking the substrings, otherwise leading spaces make a difference (like the " CA " and "CA " that Spencer described).
3. If the Element column is present but empty (spaces), the Element assigned is always Element.R.
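
As a rough illustration of the three points above, the following sketch falls back to the atom name when the Element column is missing or blank, trims the name first, and accepts single-letter names. It assumes the atoms belong to standard amino acids, where every element symbol has one letter, so it takes the first letter of the trimmed name; it would be wrong for metals and other two-letter elements. `ElementGuesser` and its methods are hypothetical and use plain strings instead of BioJava's `Element` enum.

```java
class ElementGuesser {
    /** Placeholder symbol for an unknown element, mirroring Element.R in the quoted code. */
    static final String UNKNOWN = "R";

    /** Reads columns 77-78 if present and non-blank, otherwise guesses from the atom name. */
    static String parseElement(String line, String fullName) {
        if (line.length() > 77) {
            String symbol = line.substring(76, 78).trim();
            if (!symbol.isEmpty()) {
                return symbol.toUpperCase();
            }
            // Column present but blank: fall through to the name-based guess.
        }
        return guessFromName(fullName);
    }

    /** First letter of the trimmed name; only valid for standard amino acid atoms. */
    static String guessFromName(String fullName) {
        String name = fullName.trim();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isLetter(c)) {  // skips leading digits, as in "1HB"
                return String.valueOf(Character.toUpperCase(c));
            }
        }
        return UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(guessFromName(" N  "));
        System.out.println(guessFromName("HG23"));
        System.out.println(parseElement(String.format("%-80s", "ATOM"), " CA "));
    }
}
```

A real parser would also log a warning whenever the name-based guess is used, as discussed earlier in the thread.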

I would rewrite this handling, because I think that using the Chemical Component Dictionary information is a better solution. I will create a pull request with the change.

lafita added the question Open discussions about the library label Jul 15, 2016

lafita added this to the BioJava 5.0.0 milestone Jul 15, 2016

lafita added enhancement Improvement of existing code or method and removed question Open discussions about the library labels Jul 15, 2016

lafita self-assigned this Jul 27, 2016

lafita added a commit to lafita/biojava that referenced this issue Jul 27, 2016

Handle an empty Element column in a PDB file

db1839e

Fix biojava#537

lafita mentioned this issue Jul 27, 2016

Fix #537 - handle missing and empty Element column in PDB files #540

Merged

josemduarte closed this as completed in 231b41a Jul 27, 2016

josemduarte added a commit that referenced this issue Jul 27, 2016

Merge pull request #540 from lafita/fix537

0694acc

Fix #537 - handle missing and empty Element column in PDB files

josemduarte added a commit to josemduarte/biojava that referenced this issue Jul 27, 2016

Fixing issue introduced in PR biojava#537

4fa6833

josemduarte mentioned this issue Jul 28, 2016

Fix for issue 517 #541

Merged
