Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More robust PDB/mmCIF parsing for non-deposited files #305

Closed
josemduarte opened this issue Jul 24, 2015 · 11 comments
Closed

More robust PDB/mmCIF parsing for non-deposited files #305

josemduarte opened this issue Jul 24, 2015 · 11 comments
Assignees
Milestone

Comments

@josemduarte
Copy link
Contributor

BioJava is very good at parsing deposited PDB/mmCIF files but some work is required to make it more robust to unconventional formatting when parsing non-deposited PDB/mmCIF files.

The aim should be that the parsers are robust enough to produce informative warnings for bad formatting, if possible without crashing altogether.

The 2 formats (especially PDB format) are abused a lot and many softwares depart from the standard, so surely all cases are impossible to cover.

A good idea would be to add a test and a set of files from the most popular softwares that produce coordinate files (refmac, phenix, cyana, rosetta, modeller... ). Note that there is already a test for some of these issues in org.biojava.nbio.structure.io.TestNonDepositedFiles.

@andreasprlic
Copy link
Member

While I agree that we should not be too strict about the file format definitions, I don't think we should try to go out of our way too far, either. The PDB file format has been abused quite heavily in the past and the future is with the more extensible (and XML-like) mmCif.

josemduarte added a commit that referenced this issue Jul 27, 2015
josemduarte added a commit that referenced this issue Jul 27, 2015
Now all non-poly chains will be ignored in compound finder, #305
josemduarte added a commit that referenced this issue Jul 27, 2015
Also reverting mmcif parser to not remove purely non-poly chains. Like
that it matches what the pdb parser does.
@andreasprlic
Copy link
Member

What is the status of this one. I see there were several commits. Can we close it?

@josemduarte
Copy link
Contributor Author

I think it needs some more testing with a wider set of files. We are going to try doing more testing on this in the next 2-3 weeks, I'll close it after.

@larsonmattr
Copy link
Contributor

Before this ticket is closed, I propose to add a commit for better supporting non-deposited files. The pdb parser is already robust for handling chains containing only non-polymeric hetatm and water Groups. In this case, these Groups will still be added to the produced Structure instance.

The current mmCIF parser will discard these non-polymeric groups after parsing to prevent them from being included in the Structure.

In this branch (https://github.com/edlunde-dnastar/biojava/tree/mmcif-nonpolymer-chains) an added FileParsingParameter sets these Groups to be included in the parsed Structure as they are for the PDB parser. This would be beneficial to make the mmCIF parser behave more closely to the PDB parser and better handle non-deposited structures that often will contain newly added ligands on unique chains. Discarding such ligands in this case would be presenting incorrect data.

@andreasprlic
Copy link
Member

@larsonmattr This is a rare thing in official PDB files. e.g. one such case is 3o6j, which has a single water molecule in chain Z. If you want to get such cases added, your branch makes sense.

I guess the tricky thing about creating non-deposited mmCif files is to get the data relationships right. To get a Compound there should be an entity_src_nat (or _gen, _syn) , StructAsym, _pdbx_entity_nonpoly , and perhaps other categories...

@josemduarte
Copy link
Contributor Author

In my opinion the behavior should be:

  • Purely water chains: should not be allowed at all, that really breaks many assumptions about what a chain is supposed to be in the "classic" PDB data model, i.e. the model currently followed in Biojava.
  • Purely non-polymeric chains: they should not be allowed either in our current data model. A chain right now must be either a nucleic acid or protein chain with optionally some non-polymer ligands and waters. Non-polymeric-only chains wouldn't make sense in Biojava, e.g. they don't have a sequence, so lots of the assumptions about what can be done with chains break for them.

Having said that, a mode that makes this optional is a good compromise to be able to keep that data in those rare cases when deposited files have purely non-polymeric chains (in my opinion those are errors of data modelling at deposition time). One thing is important though, the allowNonPolymericChains mode should be respected by both the PDB parser and the mmCIF parser. Also I'd say that by default it should be switched off.

All this is in in any case related to a more general issue in the biojava-structure module. At the moment we are following the "classic" PDB data model (essentially the PDB file data model) instead of using the mmCIF data model. With the move to mmCIF and more and more complicated structures coming I do think Biojava should move to follow the strict mmCIF model. That would mean every polymer is a chain with its own chain id AND also every non-polymer is an independent chain with its own chain id. We would then need to split the current Structure.getChains into a Structure.getPolymericChains and a Structure.getNonPolymericChains That would make a lot more sense in general and would avoid inconsistency issues we are seeing already (see for instance #337, #294).

@larsonmattr
Copy link
Contributor

The option to let the user of the API determine if they want the information provided by non-polymeric chains would be helpful. Beyond convention or the intent to say that X ligand and Y waters are associated with Z chain there isn't a benefit to the PDB convention. If the mmCIF format is completely adopted, the additional freedom of unlimited asym ids with groupings of molecular entities could dissolve this convention.

It seems that the BioJava PDB parser is at the moment allowing completely non-polymeric chains to parsed by default? I don't think this a bug because it helps deal with the wide gamut of PDB files that often break conventions and there is no need to penalize people for 'breaking' the convention. Until cleaning up of the structures for submission it is common to have non-polymeric/solvent chains.

@larsonmattr
Copy link
Contributor

Issue 332 on DBref records also discussed how best to handle short line lengths when writing/parsing PDB files. I wanted to add a comment here so that if the issue 332 is closed, that some of the fixing related to short lines might be handled as part of making more robust parsing.

+1 to all toPDB() methods to return 80 char lines to conform with PDB format.

Also, when parsing CONECT records the parser expects a minimum line length or throws an exception:

java.lang.StringIndexOutOfBoundsException: String index out of range: 26
    at java.lang.String.substring(String.java:1907)
    at org.biojava.nbio.structure.io.PDBFileParser.conect_helper(PDBFileParser.java:2097)
    at org.biojava.nbio.structure.io.PDBFileParser.pdb_CONECT_Handler(PDBFileParser.java:2144)
    at org.biojava.nbio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2779)
    at org.biojava.nbio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2675)

Any parsing of a record that would expect a minimum line length should check line length.

@larsonmattr
Copy link
Contributor

_All this is in in any case related to a more general issue in the biojava-structure module. At the moment we are following the "classic" PDB data model (essentially the PDB file data model) instead of using the mmCIF data model. With the move to mmCIF and more and more complicated structures coming I do think Biojava should move to follow the strict mmCIF model. That would mean every polymer is a chain with its own chain id AND also every non-polymer is an independent chain with its own chain id. We would then need to split the current Structure.getChains into a Structure.getPolymericChains and a Structure.getNonPolymericChains That would make a lot more sense in general and would avoid inconsistency issues we are seeing already (see for instance #337, #294). _

+1 to Jose's comment. BioJava may need a new StructureImpl that is an accurate representation of mmCIF data structure. A shared Structure interface could support common functionality of PDB/mmCIF, but it might be more difficult with time to shoehorn both PDB/mmCIF-derived data structures into one class. Long term, it might be better to more closely respect the mmCIF entities (polymer, non-polymer, and water) and to provide methods to access polymer, non-polymer, water asyms (Chains) and query things more closely to the mmCIF data structure.

For now, I'm wanting to support similar behavior between PDB/mmCIF. I am back-tracking on making another FileParsingParameter - it seems better to not clutter the API with too many options. I've submitted a PR for this - if this is a no-go I can take it back to the drawing board.

josemduarte added a commit that referenced this issue Feb 2, 2016
This changes the behavior of parsers quite a lot: compounds are assigned
to pure non-poly chains.
@josemduarte
Copy link
Contributor Author

I've now added a few commits that should take care of non-poly chains issues

@andreasprlic
Copy link
Member

can we close this one ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants