New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More robust PDB/mmCIF parsing for non-deposited files #305
Comments
While I agree that we should not be too strict about the file format definitions, I don't think we should try to go out of our way too far, either. The PDB file format has been abused quite heavily in the past and the future is with the more extensible (and XML-like) mmCif. |
Now all non-poly chains will be ignored in compound finder, #305
Also reverting mmcif parser to not remove purely non-poly chains. Like that it matches what the pdb parser does.
What is the status of this one. I see there were several commits. Can we close it? |
I think it needs some more testing with a wider set of files. We are going to try doing more testing on this in the next 2-3 weeks, I'll close it after. |
Before this ticket is closed, I propose to add a commit for better supporting non-deposited files. The pdb parser is already robust for handling chains containing only non-polymeric hetatm and water Groups. In this case, these Groups will still be added to the produced Structure instance. The current mmCIF parser will discard these non-polymeric groups after parsing to prevent them from being included in the Structure. In this branch (https://github.com/edlunde-dnastar/biojava/tree/mmcif-nonpolymer-chains) an added FileParsingParameter sets these Groups to be included in the parsed Structure as they are for the PDB parser. This would be beneficial to make the mmCIF parser behave more closely to the PDB parser and better handle non-deposited structures that often will contain newly added ligands on unique chains. Discarding such ligands in this case would be presenting incorrect data. |
@larsonmattr This is a rare thing in official PDB files. e.g. one such case is 3o6j, which has a single water molecule in chain Z. If you want to get such cases added, your branch makes sense. I guess the tricky thing about creating non-deposited mmCif files is to get the data relationships right. To get a Compound there should be an entity_src_nat (or _gen, _syn) , StructAsym, _pdbx_entity_nonpoly , and perhaps other categories... |
In my opinion the behavior should be:
Having said that, a mode that makes this optional is a good compromise to be able to keep that data in those rare cases when deposited files have purely non-polymeric chains (in my opinion those are errors of data modelling at deposition time). One thing is important though, the All this is in in any case related to a more general issue in the |
The option to let the user of the API determine if they want the information provided by non-polymeric chains would be helpful. Beyond convention or the intent to say that X ligand and Y waters are associated with Z chain there isn't a benefit to the PDB convention. If the mmCIF format is completely adopted, the additional freedom of unlimited asym ids with groupings of molecular entities could dissolve this convention. It seems that the BioJava PDB parser is at the moment allowing completely non-polymeric chains to parsed by default? I don't think this a bug because it helps deal with the wide gamut of PDB files that often break conventions and there is no need to penalize people for 'breaking' the convention. Until cleaning up of the structures for submission it is common to have non-polymeric/solvent chains. |
Issue 332 on DBref records also discussed how best to handle short line lengths when writing/parsing PDB files. I wanted to add a comment here so that if the issue 332 is closed, that some of the fixing related to short lines might be handled as part of making more robust parsing. +1 to all toPDB() methods to return 80 char lines to conform with PDB format. Also, when parsing CONECT records the parser expects a minimum line length or throws an exception:
Any parsing of a record that would expect a minimum line length should check line length. |
_All this is in in any case related to a more general issue in the biojava-structure module. At the moment we are following the "classic" PDB data model (essentially the PDB file data model) instead of using the mmCIF data model. With the move to mmCIF and more and more complicated structures coming I do think Biojava should move to follow the strict mmCIF model. That would mean every polymer is a chain with its own chain id AND also every non-polymer is an independent chain with its own chain id. We would then need to split the current Structure.getChains into a Structure.getPolymericChains and a Structure.getNonPolymericChains That would make a lot more sense in general and would avoid inconsistency issues we are seeing already (see for instance #337, #294). _ +1 to Jose's comment. BioJava may need a new StructureImpl that is an accurate representation of mmCIF data structure. A shared Structure interface could support common functionality of PDB/mmCIF, but it might be more difficult with time to shoehorn both PDB/mmCIF-derived data structures into one class. Long term, it might be better to more closely respect the mmCIF entities (polymer, non-polymer, and water) and to provide methods to access polymer, non-polymer, water asyms (Chains) and query things more closely to the mmCIF data structure. For now, I'm wanting to support similar behavior between PDB/mmCIF. I am back-tracking on making another FileParsingParameter - it seems better to not clutter the API with too many options. I've submitted a PR for this - if this is a no-go I can take it back to the drawing board. |
This changes the behavior of parsers quite a lot: compounds are assigned to pure non-poly chains.
I've now added a few commits that should take care of non-poly chains issues |
can we close this one ? |
BioJava is very good at parsing deposited PDB/mmCIF files but some work is required to make it more robust to unconventional formatting when parsing non-deposited PDB/mmCIF files.
The aim should be that the parsers are robust enough to produce informative warnings for bad formatting, if possible without crashing altogether.
The 2 formats (especially PDB format) are abused a lot and many softwares depart from the standard, so surely all cases are impossible to cover.
A good idea would be to add a test and a set of files from the most popular softwares that produce coordinate files (refmac, phenix, cyana, rosetta, modeller... ). Note that there is already a test for some of these issues in
org.biojava.nbio.structure.io.TestNonDepositedFiles
.The text was updated successfully, but these errors were encountered: