PDB File Models Missing Element Column #537

Closed
lafita opened this issue Jul 15, 2016 · 7 comments
@lafita
Member

lafita commented Jul 15, 2016

We have recently noticed that some structural bioinformatics programs (structure refinement or modelling) generate PDB files where the Element column is missing. The element column is the last column, where the periodic table element of the Atom is indicated.

Parsing these files with BioJava currently does not allow the calculation of structural alignments or symmetry (or any other analysis that uses C-alpha atoms), because extracting the C-alpha atoms of a structure checks both the name (CA) and the element (C) of each Atom (in StructureTools.getRepresentativeAtoms()).

The Element column is not completely redundant: for a modified amino acid with calcium bound to it, the name CA alone does not distinguish the calcium from the C-alpha carbon, so the element column is needed to tell them apart.
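To make the failure mode concrete, here is a minimal sketch of a C-alpha filter that checks both name and element. The `SimpleAtom` class and the `selectCAlpha` method are hypothetical stand-ins written for this example; they are not BioJava's `Atom` class or the actual body of `StructureTools.getRepresentativeAtoms()`.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in types; these are NOT BioJava's Atom or Element classes.
class CAlphaSelectionSketch {

    static final class SimpleAtom {
        final String name;     // trimmed atom name, e.g. "CA"
        final String element;  // element symbol, or "R" when it could not be parsed

        SimpleAtom(String name, String element) {
            this.name = name;
            this.element = element;
        }
    }

    // Keep only atoms that are named "CA" AND whose element is carbon.
    static List<SimpleAtom> selectCAlpha(List<SimpleAtom> atoms) {
        List<SimpleAtom> result = new ArrayList<>();
        for (SimpleAtom atom : atoms) {
            if ("CA".equals(atom.name) && "C".equals(atom.element)) {
                result.add(atom);
            }
        }
        return result;
    }
}
```

With a filter like this, a C-alpha whose element was never filled in (generic `R`) is dropped just like a calcium atom named CA, which is why files without the Element column yield no representative atoms.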

On the other hand, we could print a warning when parsing such models and guess and fill the Element column from the Atom names (at least for the Atoms in aminoacids), in order to support the incomplete files.

Question: is there any drawback in guessing and filling the Element of the Atoms? Can we use the Chemical Components for that?

@lafita added the "question" label (Open discussions about the library) Jul 15, 2016
@lafita
Member Author

lafita commented Jul 15, 2016

> The Element column is not completely redundant, because in case of a modified aminoacid with calcium bound to it, the name CA alone does not distinguish the calcium from the C-alpha carbon and the element column is needed to do so.

@gcapitani noticed that the atom name column in PDB files is shifted one position to the left for calcium with respect to C-alpha atoms, so in this case the Element column is redundant after all.
Thus, it is possible to tell while parsing whether a CA Atom is a carbon or a calcium.
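A minimal sketch of that check, assuming the standard fixed-width layout in which the atom name occupies columns 13-16 of an ATOM/HETATM record. The class and method names are made up for illustration, and the sketch only covers the CA case discussed here.

```java
// Hypothetical helper, not part of BioJava.
class AtomNameAlignmentSketch {

    // Returns "C" for a C-alpha, "Ca" for calcium, or null when the
    // line is too short or the atom is not named CA at all.
    static String guessCaElement(String pdbLine) {
        if (pdbLine.length() < 16) {
            return null;
        }
        // Columns 13-16 (0-based 12..15), deliberately NOT trimmed.
        String fullName = pdbLine.substring(12, 16);
        if (fullName.equals(" CA ")) {
            return "C";   // one-letter element: name starts in column 14
        }
        if (fullName.equals("CA  ")) {
            return "Ca";  // two-letter element: name starts in column 13
        }
        return null;
    }
}
```

The alignment is only available while the raw line is at hand; once the name has been trimmed, the two cases collapse into the same string.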

@josemduarte
Contributor

Guessing the elements for standard amino acids should be fine, but we need to print warnings. You don't even need to look at the column where "CA" is: standard amino acids don't contain calcium, so it's safe to always assign a "C". For any other molecule I'd say we shouldn't try guessing, and should instead fail with a nice error message: "Element is missing for line ..."
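A rough sketch of that policy follows. It assumes that, within the 20 standard amino acids, the first letter of the atom name (after any leading digit, as in old-style hydrogen names) is the element symbol. The class name, method signature, and use of `java.util.logging` are choices made for this example, not BioJava code.

```java
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical helper illustrating "guess for standard residues, fail otherwise".
class StandardResidueElementGuess {

    private static final Logger LOG = Logger.getLogger("ElementGuess");

    static final Set<String> STANDARD_AMINO_ACIDS = Set.of(
        "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
        "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL");

    static String guessElement(String resName, String atomName, int lineNumber) {
        if (!STANDARD_AMINO_ACIDS.contains(resName.trim())) {
            // Anything else is not guessed: report the offending line instead.
            throw new IllegalArgumentException("Element is missing for line " + lineNumber);
        }
        String name = atomName.trim();
        int i = 0;
        // Skip leading digits such as the "1" in the old-style name "1HB".
        while (i < name.length() && Character.isDigit(name.charAt(i))) {
            i++;
        }
        if (i == name.length()) {
            throw new IllegalArgumentException("Element is missing for line " + lineNumber);
        }
        String element = String.valueOf(name.charAt(i));
        LOG.warning("Element column missing at line " + lineNumber
            + "; guessed " + element + " from atom name " + name);
        return element;
    }
}
```

Under this rule CA in a standard residue always comes out as carbon, regardless of column alignment.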

@josemduarte
Contributor

This relates to #305

@josemduarte
Contributor

Actually using the chemical component dictionary (CCD) for this would provide a general solution: if the residue's 3-letter code is that of a valid chem comp and the element is not present we can read it by looking up the atom name in the CCD. That would work for any kind of molecule.
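The lookup could look roughly like the sketch below. The `DICTIONARY` map is a tiny hand-written excerpt standing in for the CCD, and `lookupElement` is a hypothetical method; BioJava's real chemical component classes are not used here.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory stand-in for a chemical component dictionary.
class CcdElementLookupSketch {

    // residue 3-letter code -> (atom name -> element symbol)
    static final Map<String, Map<String, String>> DICTIONARY = Map.of(
        "ALA", Map.of("N", "N", "CA", "C", "C", "C", "O", "O", "CB", "C"),
        "CA", Map.of("CA", "Ca"),   // calcium ion
        "HEM", Map.of("FE", "Fe")); // excerpt only, most atoms omitted

    // Empty result means the residue or atom is unknown and nothing is guessed.
    static Optional<String> lookupElement(String resName, String atomName) {
        Map<String, String> atoms = DICTIONARY.get(resName.trim());
        if (atoms == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(atoms.get(atomName.trim()));
    }
}
```

Because the residue code is part of the key, CA resolves to carbon inside ALA and to calcium inside the CA component, without relying on column alignment.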

@lafita lafita added this to the BioJava 5.0.0 milestone Jul 15, 2016
@lafita
Member Author

lafita commented Jul 15, 2016

Great! I think that the optimal solution then is to complete the Element field of Atoms from the Chemical Component information when parsing and print a warning explaining the PDB format error and the bugs it may cause.

@lafita added the "enhancement" label (Improvement of existing code or method) and removed the "question" label (Open discussions about the library) Jul 15, 2016
@sbliven
Member

sbliven commented Jul 18, 2016

BioJava used to include whitespace in atom names, so that " CA " (carbon) was distinct from "CA  " (calcium). However, this broke in the transition to mmCIF, which trims whitespace from fields. I suppose the alignment could be used as a hint for guessing, but it shouldn't be stored in data structures.

@lafita lafita self-assigned this Jul 27, 2016
@lafita
Member Author

lafita commented Jul 27, 2016

I have been looking at the code. The Element field is parsed in PDBFileParser with the following code, which already handles a missing Element column in PDB files:

```java
// Parse element from the element field. If this field is
// missing (i.e. misformatted PDB file), then parse the
// name from the atom name.
Element element = Element.R;
if ( line.length() > 77 ) {
    // parse element from element field
    try {
        element = Element.valueOfIgnoreCase(line.substring (76, 78).trim());
    }  catch (IllegalArgumentException e){}
} else {
    // parse the name from the atom name
    String elementSymbol = null;
    // for atom names with 4 characters, the element is
    // at the first position, example HG23 in Valine
    if (fullname.trim().length() == 4) {
        elementSymbol = fullname.substring(0, 1);
    } else if ( fullname.trim().length() > 1){
        elementSymbol = fullname.substring(0, 2).trim();
    }

    try {
        if (elementSymbol!=null)
            element = Element.valueOfIgnoreCase(elementSymbol);
    } catch (IllegalArgumentException e){
        logger.warn("Element {} was not recognised. Assigning generic element R to it",
                    elementSymbol);
    }
}
atom.setElement(element);
```

There are three things to improve:

  1. There is a bug where single-letter Atom names are ignored and Element.R is always assigned (the bug is in the conditional fullname.trim().length() > 1, which should be >= 1).
  2. The Atom names have to be trimmed before taking the substrings, otherwise leading spaces make a difference (like the " CA " and "CA  " case that Spencer described).
  3. If the Element column is present but empty (only spaces), the Element assigned is always Element.R, because the fallback to the Atom name is never reached.

I would rewrite this handling, because I think that using the Chemical Component Dictionary information is a better solution. I will create a pull request with the change.
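For reference, points 1 and 3 could be addressed along the lines of the sketch below. It is not the change from the pull request: the class and method names are hypothetical, element symbols are returned as plain strings rather than `Element` values, and the name-based guess keeps the original substring-on-untrimmed-name behaviour, so point 2 is not resolved here.

```java
import java.util.Optional;

// Hypothetical helper, not the code merged into BioJava.
class ElementColumnSketch {

    // Columns 77-78 hold the element symbol. A missing column and a
    // blank column are both reported as "no element given".
    static Optional<String> readElementField(String line) {
        if (line.length() < 78) {
            return Optional.empty();
        }
        String field = line.substring(76, 78).trim();
        if (field.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(field);
    }

    // Fallback guess from the untrimmed 4-character atom name field.
    static Optional<String> guessFromAtomName(String fullName) {
        String trimmed = fullName.trim();
        if (trimmed.isEmpty()) {
            return Optional.empty();
        }
        // 4-character names (e.g. HG23) and 1-character names (e.g. N)
        // both take the first character as the element symbol.
        if (trimmed.length() == 4 || trimmed.length() == 1) {
            return Optional.of(trimmed.substring(0, 1));
        }
        String symbol = fullName.substring(0, 2).trim();
        if (symbol.isEmpty()) {
            symbol = trimmed.substring(0, 1);
        }
        return Optional.of(symbol);
    }
}
```

A caller would try `readElementField` first and fall back to `guessFromAtomName` (or to a dictionary lookup) only when it returns an empty result, logging a warning in that case.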

lafita added a commit to lafita/biojava that referenced this issue Jul 27, 2016
josemduarte added a commit that referenced this issue Jul 27, 2016
Fix #537 - handle missing and empty Element column in PDB files
josemduarte added a commit to josemduarte/biojava that referenced this issue Jul 27, 2016