More robust PDB/mmCIF parsing for non-deposited files #305

josemduarte · 2015-07-24T11:14:03Z

BioJava is very good at parsing deposited PDB/mmCIF files but some work is required to make it more robust to unconventional formatting when parsing non-deposited PDB/mmCIF files.

The aim should be that the parsers are robust enough to produce informative warnings for bad formatting, if possible without crashing altogether.

The 2 formats (especially PDB format) are abused a lot and many programs depart from the standard, so it is surely impossible to cover all cases.

A good idea would be to add a test and a set of files from the most popular programs that produce coordinate files (refmac, phenix, cyana, rosetta, modeller...). Note that there is already a test for some of these issues in org.biojava.nbio.structure.io.TestNonDepositedFiles.


andreasprlic · 2015-07-24T17:19:14Z

While I agree that we should not be too strict about the file format definitions, I don't think we should go too far out of our way, either. The PDB file format has been abused quite heavily in the past and the future is with the more extensible (and XML-like) mmCIF.


andreasprlic · 2016-01-09T16:33:43Z

What is the status of this one? I see there were several commits. Can we close it?

josemduarte · 2016-01-09T16:44:59Z

I think it needs some more testing with a wider set of files. We are going to try doing more testing on this in the next 2-3 weeks, I'll close it after.

larsonmattr · 2016-01-11T02:50:05Z

Before this ticket is closed, I propose to add a commit for better supporting non-deposited files. The pdb parser is already robust for handling chains containing only non-polymeric hetatm and water Groups. In this case, these Groups will still be added to the produced Structure instance.

The current mmCIF parser will discard these non-polymeric groups after parsing to prevent them from being included in the Structure.

In this branch (https://github.com/edlunde-dnastar/biojava/tree/mmcif-nonpolymer-chains) an added FileParsingParameter sets these Groups to be included in the parsed Structure, as they are for the PDB parser. This would make the mmCIF parser behave more like the PDB parser and better handle non-deposited structures, which often contain newly added ligands on unique chains. Discarding such ligands in this case would present incorrect data.

andreasprlic · 2016-01-11T05:51:49Z

@larsonmattr This is a rare thing in official PDB files. e.g. one such case is 3o6j, which has a single water molecule in chain Z. If you want to get such cases added, your branch makes sense.

I guess the tricky thing about creating non-deposited mmCIF files is to get the data relationships right. To get a Compound there should be an entity_src_nat (or _gen, _syn), StructAsym, _pdbx_entity_nonpoly, and perhaps other categories...

josemduarte · 2016-01-11T19:47:23Z

In my opinion the behavior should be:

Purely water chains: should not be allowed at all, that really breaks many assumptions about what a chain is supposed to be in the "classic" PDB data model, i.e. the model currently followed in Biojava.
Purely non-polymeric chains: they should not be allowed either in our current data model. A chain right now must be either a nucleic acid or protein chain with optionally some non-polymer ligands and waters. Non-polymeric-only chains wouldn't make sense in Biojava, e.g. they don't have a sequence, so lots of the assumptions about what can be done with chains break for them.

Having said that, a mode that makes this optional is a good compromise to be able to keep that data in those rare cases when deposited files have purely non-polymeric chains (in my opinion those are errors of data modelling at deposition time). One thing is important though, the allowNonPolymericChains mode should be respected by both the PDB parser and the mmCIF parser. Also I'd say that by default it should be switched off.
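To make the proposal concrete, here is a minimal sketch of a single filtering step that both parsers could share, so that the flag is honoured in the same way regardless of input format. The classes SketchChain and NonPolyChainFilter are hypothetical stand-ins invented for this illustration and are not BioJava classes; only the name allowNonPolymericChains comes from the comment above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a parsed chain (not a BioJava class).
final class SketchChain {
    final String id;
    final int polymerResidues;   // amino acid or nucleotide groups
    final int nonPolymerGroups;  // ligands and waters

    SketchChain(String id, int polymerResidues, int nonPolymerGroups) {
        this.id = id;
        this.polymerResidues = polymerResidues;
        this.nonPolymerGroups = nonPolymerGroups;
    }

    // A chain with no polymer residues consists of ligands and/or waters only.
    boolean isPurelyNonPolymeric() {
        return polymerResidues == 0;
    }
}

// Hypothetical shared post-processing step that a PDB parser and an mmCIF
// parser would both call after reading their input.
final class NonPolyChainFilter {
    static List<SketchChain> apply(List<SketchChain> parsed,
                                   boolean allowNonPolymericChains) {
        List<SketchChain> kept = new ArrayList<>();
        for (SketchChain chain : parsed) {
            // With the flag off (the suggested default), drop chains that
            // contain no polymer residues at all.
            if (allowNonPolymericChains || !chain.isPurelyNonPolymeric()) {
                kept.add(chain);
            }
        }
        return kept;
    }
}
```

Routing both parsers through one step like this is only one possible design; its advantage is that the two formats cannot drift apart in how they treat such chains.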

All this is in any case related to a more general issue in the biojava-structure module. At the moment we are following the "classic" PDB data model (essentially the PDB file data model) instead of using the mmCIF data model. With the move to mmCIF and more and more complicated structures coming I do think Biojava should move to follow the strict mmCIF model. That would mean every polymer is a chain with its own chain id AND also every non-polymer is an independent chain with its own chain id. We would then need to split the current Structure.getChains into a Structure.getPolymericChains and a Structure.getNonPolymericChains. That would make a lot more sense in general and would avoid inconsistency issues we are seeing already (see for instance #337, #294).
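A rough sketch of what such a split could look like, assuming each chain carries its mmCIF entity type. The types EntityType, AsymChain and SplitStructure are hypothetical and exist only for this illustration; the two accessor names are the ones suggested above. Grouping waters with the non-polymers is an assumption made here for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Entity types as distinguished by the mmCIF data model.
enum EntityType { POLYMER, NON_POLYMER, WATER }

// Hypothetical chain: one asym unit with its own id and entity type.
final class AsymChain {
    final String asymId;
    final EntityType type;

    AsymChain(String asymId, EntityType type) {
        this.asymId = asymId;
        this.type = type;
    }
}

// Hypothetical structure that keeps polymer and non-polymer chains apart
// instead of exposing a single mixed getChains() list.
final class SplitStructure {
    private final List<AsymChain> polymeric = new ArrayList<>();
    private final List<AsymChain> nonPolymeric = new ArrayList<>();

    void addChain(AsymChain chain) {
        if (chain.type == EntityType.POLYMER) {
            polymeric.add(chain);
        } else {
            // Assumption of this sketch: waters are stored with non-polymers.
            nonPolymeric.add(chain);
        }
    }

    List<AsymChain> getPolymericChains() {
        return Collections.unmodifiableList(polymeric);
    }

    List<AsymChain> getNonPolymericChains() {
        return Collections.unmodifiableList(nonPolymeric);
    }
}
```

Code that needs a sequence would then iterate only over getPolymericChains(), which removes the need to guard against chains without one.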

larsonmattr · 2016-01-12T04:05:52Z

The option to let the user of the API determine if they want the information provided by non-polymeric chains would be helpful. Beyond convention or the intent to say that X ligand and Y waters are associated with Z chain there isn't a benefit to the PDB convention. If the mmCIF format is completely adopted, the additional freedom of unlimited asym ids with groupings of molecular entities could dissolve this convention.

It seems that the BioJava PDB parser is at the moment allowing completely non-polymeric chains to be parsed by default? I don't think this is a bug, because it helps deal with the wide gamut of PDB files that often break conventions, and there is no need to penalize people for 'breaking' the convention. Until structures are cleaned up for submission it is common to have non-polymeric/solvent chains.

larsonmattr · 2016-01-20T15:49:46Z

Issue 332 on DBref records also discussed how best to handle short line lengths when writing/parsing PDB files. I wanted to add a comment here so that, if issue 332 is closed, some of the fixes related to short lines can still be handled as part of making parsing more robust.

+1 to all toPDB() methods to return 80 char lines to conform with PDB format.
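As an illustration of the 80-character point, a minimal padding helper might look like the following. PdbLineFormat and padToPdbWidth are hypothetical names for this sketch, not existing BioJava methods, and rejecting over-long records is a design choice made here rather than something prescribed by the discussion.

```java
// Hypothetical helper for writing fixed-width PDB records.
final class PdbLineFormat {
    static final int PDB_LINE_LENGTH = 80;

    // Right-pads a record with spaces so that it is exactly 80 characters.
    static String padToPdbWidth(String record) {
        if (record.length() > PDB_LINE_LENGTH) {
            // Truncating silently could drop data, so this sketch refuses instead.
            throw new IllegalArgumentException(
                    "PDB record longer than " + PDB_LINE_LENGTH + " characters");
        }
        StringBuilder padded = new StringBuilder(record);
        while (padded.length() < PDB_LINE_LENGTH) {
            padded.append(' ');
        }
        return padded.toString();
    }
}
```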

Also, when parsing CONECT records the parser expects a minimum line length or throws an exception:

java.lang.StringIndexOutOfBoundsException: String index out of range: 26
    at java.lang.String.substring(String.java:1907)
    at org.biojava.nbio.structure.io.PDBFileParser.conect_helper(PDBFileParser.java:2097)
    at org.biojava.nbio.structure.io.PDBFileParser.pdb_CONECT_Handler(PDBFileParser.java:2144)
    at org.biojava.nbio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2779)
    at org.biojava.nbio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2675)

Any parsing of a record that would expect a minimum line length should check line length.
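A minimal sketch of such a check for CONECT records, assuming the PDB v3.3 column layout (atom serial in columns 7-11, bonded atom serials in columns 12-16, 17-21, 22-26 and 27-31). ConectSketch and parseSerials are hypothetical names and this is not the code in PDBFileParser; it only shows bounding every substring by the actual line length and warning instead of throwing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical length-tolerant reader for the serial fields of a CONECT line.
final class ConectSketch {
    private static final int FIRST_FIELD_START = 6; // 0-based index of column 7
    private static final int FIELD_WIDTH = 5;
    private static final int LAST_FIELD_END = 31;   // end of column 31

    // Returns the atom serials found on the line; blank or missing fields
    // are skipped, malformed ones produce a warning instead of an exception.
    static List<Integer> parseSerials(String line) {
        List<Integer> serials = new ArrayList<>();
        for (int start = FIRST_FIELD_START;
             start < LAST_FIELD_END && start < line.length();
             start += FIELD_WIDTH) {
            // Never read past the end of a short line.
            int end = Math.min(start + FIELD_WIDTH, line.length());
            String field = line.substring(start, end).trim();
            if (field.isEmpty()) {
                continue;
            }
            try {
                serials.add(Integer.parseInt(field));
            } catch (NumberFormatException e) {
                System.err.println("Warning: ignoring malformed CONECT field '"
                        + field + "' in line: " + line);
            }
        }
        return serials;
    }
}
```

With this approach a line such as "CONECT  413  412", which is only 16 characters long, yields the two serials it contains rather than a StringIndexOutOfBoundsException. Note that a field cut off mid-number would still be read as a shorter number, so a stricter variant might warn whenever a line ends inside a field.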

larsonmattr · 2016-01-29T22:18:19Z

_All this is in any case related to a more general issue in the biojava-structure module. At the moment we are following the "classic" PDB data model (essentially the PDB file data model) instead of using the mmCIF data model. With the move to mmCIF and more and more complicated structures coming I do think Biojava should move to follow the strict mmCIF model. That would mean every polymer is a chain with its own chain id AND also every non-polymer is an independent chain with its own chain id. We would then need to split the current Structure.getChains into a Structure.getPolymericChains and a Structure.getNonPolymericChains. That would make a lot more sense in general and would avoid inconsistency issues we are seeing already (see for instance #337, #294)._

+1 to Jose's comment. BioJava may need a new StructureImpl that is an accurate representation of mmCIF data structure. A shared Structure interface could support common functionality of PDB/mmCIF, but it might be more difficult with time to shoehorn both PDB/mmCIF-derived data structures into one class. Long term, it might be better to more closely respect the mmCIF entities (polymer, non-polymer, and water) and to provide methods to access polymer, non-polymer, water asyms (Chains) and query things more closely to the mmCIF data structure.

For now, I'm wanting to support similar behavior between PDB/mmCIF. I am back-tracking on making another FileParsingParameter - it seems better to not clutter the API with too many options. I've submitted a PR for this - if this is a no-go I can take it back to the drawing board.


josemduarte · 2016-02-02T22:22:23Z

I've now added a few commits that should take care of non-poly chains issues

andreasprlic · 2016-03-02T16:00:17Z

Can we close this one?

josemduarte self-assigned this Jul 24, 2015

josemduarte mentioned this issue Jul 24, 2015

Testing version 3 with non-deposited files eppic-team/eppic#52

Closed

josemduarte added a commit that referenced this issue Jul 24, 2015

Better parsing of COMPND lines in PDB files, #305

36a8fea

josemduarte added a commit that referenced this issue Jul 27, 2015

More parsing fixes, #305

6f70909

josemduarte added a commit that referenced this issue Jul 27, 2015

Reverting parser change: removing pure non-poly chains is a bad idea.

bec7990

Now all non-poly chains will be ignored in compound finder, #305

josemduarte added a commit that referenced this issue Jul 27, 2015

Adding some more non-deposited files tests, #305.

782e2f7

Also reverting mmcif parser to not remove purely non-poly chains. Like that it matches what the pdb parser does.

josemduarte mentioned this issue Jan 11, 2016

Modified PDBParser to fix bug #330 #332

Merged

larsonmattr mentioned this issue Jan 29, 2016

More robust support for MMcif parsing non-polymeric chains #394

Merged

josemduarte added a commit that referenced this issue Feb 2, 2016

Fixing issue related to PR #394, see also issue #305.

ff7c661

This changes the behavior of parsers quite a lot: compounds are assigned to pure non-poly chains.

josemduarte added a commit that referenced this issue Feb 2, 2016

Removing unused code, relates to #305 and #394

7d0fb89

andreasprlic added this to the BioJava 4.2.0 milestone Mar 2, 2016

andreasprlic closed this as completed Mar 2, 2016

josemduarte mentioned this issue Jul 15, 2016

PDB File Models Missing Element Column #537

Closed
