mmCIF parsing support for missing SEQRES information #353
Comments
We have implemented this feature in my company and are testing it internally. We will be submitting a PR. |
There is some support for SEQRES records for mmCIF files at the moment, but it will only work when setting
Related to that and regarding testing, there is a very useful test in the biojava-integrationtest module that thoroughly compares the data parsed from PDB and mmCIF files to check the consistency of the parsers: TestLongPdbVsMmCifParsing. It runs the parsers through 1000 randomly selected PDB entries. The test is not enabled by default because it takes many minutes to run, but I've found it very useful whenever I modify anything related to the parsers. Run it manually from the command line with:
|
Jose, If the current SimpleMMCIFConsumer populates the SEQRES groups when Another related feature we were looking at is to be able to parse SEQRES Best regards, On Mon, Nov 23, 2015 at 6:07 PM, Jose Manuel Duarte <
Matt Larson, PhD |
I would say that it is not such a bad idea to set As a related note, there's yet another flag related to this in |
Regarding the For instance in the use case you present, I'm not convinced that one would really gain much in avoiding parsing of atom sites. After all (especially in the mmCIF case) the tokenizer/parser has to scan the whole file anyway (whether you keep headers only or store the whole thing). If it can really be shown that there's considerable overhead and that there's a measurable difference, then I'm all for it. |
I coarsely estimate the overhead is 3-4%, which translates to an extra hour of parsing over the entire PDB (single thread, Intel Core i7-2600K @ 3.4 GHz x 4, 32 GB RAM, SSD drive). Back-of-the-envelope estimation:
I expect there will be a measurable difference; however, I wasn't planning to perform a definitive experiment in the near future.
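As a rough sanity check on the figures above (not a reconstruction of the original estimate), the two numbers quoted together imply a baseline run time: if a 3-4% overhead corresponds to about one extra hour, the full single-threaded parse would take somewhere around 25 to 33 hours. The class and method names below are purely illustrative.

```java
public class ParseOverheadEstimate {

    /**
     * Baseline run time (hours) implied when an overhead fraction of the
     * baseline corresponds to the given number of extra hours.
     */
    static double impliedBaselineHours(double extraHours, double overheadFraction) {
        return extraHours / overheadFraction;
    }

    public static void main(String[] args) {
        // Figures quoted in the comment: 3-4% overhead, about 1 extra hour.
        double extraHours = 1.0;
        System.out.printf("At 3%% overhead: ~%.1f h baseline%n",
                impliedBaselineHours(extraHours, 0.03));
        System.out.printf("At 4%% overhead: ~%.1f h baseline%n",
                impliedBaselineHours(extraHours, 0.04));
    }
}
```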
|
Great estimate, thanks for that! So it seems there is some overhead that could be avoided with the headerOnly mode. Only one point: we should reconsider all the different modes we have right now in
Perhaps we could drop storeEmptySeqRes from there? In any case, we should make alignSeqRes the default. We should also double-check that the current modes work well for mmCIF; most of them were implemented for PDB and are sometimes not (fully) implemented for mmCIF. |
I agree that alignSeqRes should be the default. Also, just FYI, all PDB sequences are available in a separate FASTA file at: ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz On Thu, Dec 3, 2015 at 10:54 AM, Jose Manuel Duarte <
Peter Rose, Ph.D. |
@pwrose, thanks for sharing that link for everyone. My motivation is to continually update a non-redundant protein sequence set of the PDB in the style of the Dunbrack Lab, such that resolution and R-values are considered in the selection process. I could not find the R-values in the derived data tables from wwPDB. Rather than abuse the PDB REST services, I planned to parse the header information locally. I considered this task to be "data mining the PDB header," which BioPython admits is a weakness of theirs. Since BioJava supports header-only parsing, I thought this could further distinguish BioJava (albeit in a small way). Also, 👍 for alignSeqRes = true by default |
So, |
@andreasprlic This sounds very safe to make default, if it's not already. |
There are eight combinations to support for the three current SeqRes related flags (headerOnly, storeEmptySeqRes, alignSeqRes). I started to try describing how each should behave, but quickly came to the conclusion that the current mechanism is too complicated. Since SEQRES groups are expected for full internal support, I present this design for consideration. I attempted to combine suggestions from @josemduarte and @andreasprlic:
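To make the size of the current configuration space concrete, here is a small standalone sketch that simply enumerates the eight on/off combinations of the three flags named above. It does not call BioJava and says nothing about how each combination behaves or should behave; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SeqResFlagCombinations {

    /**
     * Returns every on/off combination of the three flags as
     * {headerOnly, storeEmptySeqRes, alignSeqRes}, in binary counting order.
     */
    static List<boolean[]> enumerate() {
        List<boolean[]> combos = new ArrayList<>();
        for (int mask = 0; mask < 8; mask++) {
            boolean headerOnly = (mask & 4) != 0;
            boolean storeEmptySeqRes = (mask & 2) != 0;
            boolean alignSeqRes = (mask & 1) != 0;
            combos.add(new boolean[] {headerOnly, storeEmptySeqRes, alignSeqRes});
        }
        return combos;
    }

    public static void main(String[] args) {
        System.out.println("headerOnly  storeEmptySeqRes  alignSeqRes");
        for (boolean[] c : enumerate()) {
            System.out.printf("%-11b %-17b %b%n", c[0], c[1], c[2]);
        }
    }
}
```

Dropping one flag, as suggested earlier in the thread for storeEmptySeqRes, would halve this table to four rows.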
|
That looks like a great plan to me! +1 to dropping storeEmptySeqRes, +1 to alignSeqRes=true by default |
+1 |
+1 On Thu, Dec 10, 2015 at 9:16 AM, darnells notifications@github.com wrote:
Peter Rose, Ph.D. |
ok great, thanks for volunteering! I have assigned the respective tickets to you. |
Andreas, I am working with Steve on these issues - I would like to submit the pull request for issues 342 and 343 - could you On Thu, Dec 10, 2015 at 5:56 PM, Andreas Prlic notifications@github.com
Matt Larson, PhD |
Hi Matt, You can also submit pull requests without being assigned to a ticket. Assignment is really more to keep track of who is working on which issue. GitHub will keep track of the changes to code and give due credit for the contributions even without ticket ownership. |
Pull request #389 should have fixed all this; please reopen if anyone finds any problems with it. |
The current PDBParser supports parsing SEQRES records from a PDB file header; however, the MMCIFParser does not yet support parsing sequence information from an mmCIF file.
The current implementation in SimpleMMCIFConsumer is not using the mmCIF-equivalent SEQRES records; it uses ATOM group equivalents to construct the current component list.
The SimpleMMCIFParser should parse the analogous records from an mmCIF file (_pdbx_poly_seq_scheme), the direct equivalent to the SEQRES record, and construct a component list representing the SEQRES sequence, as the PDBParser does. This is required for passing existing integration tests:
HeaderOnlyTest>MMcifTest.testLoad:63->MMcifTest.comparePDB2cif
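To illustrate what reading the _pdbx_poly_seq_scheme category involves, here is a minimal standalone sketch that collects the mon_id values per asym_id from a loop_ block. It is not BioJava's implementation: the class and method names are hypothetical, and it assumes the category is written in loop form with whitespace-separated, unquoted values on one line per row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PolySeqSchemeSketch {

    private static final String CATEGORY = "_pdbx_poly_seq_scheme.";

    /** Maps each asym_id to its residue names (mon_id), in file order. */
    static Map<String, List<String>> seqresByChain(String mmcif) {
        Map<String, List<String>> chains = new LinkedHashMap<>();
        List<String> columns = new ArrayList<>();
        boolean inCategory = false;

        for (String raw : mmcif.split("\\R")) {
            String line = raw.trim();
            if (line.startsWith(CATEGORY)) {
                // Column header line, e.g. "_pdbx_poly_seq_scheme.mon_id".
                columns.add(line.substring(CATEGORY.length()));
                inCategory = true;
            } else if (inCategory) {
                // A blank line, comment, new item, or new loop ends the category.
                if (line.isEmpty() || line.startsWith("#")
                        || line.startsWith("_") || line.equals("loop_")) {
                    break;
                }
                int asymIdx = columns.indexOf("asym_id");
                int monIdx = columns.indexOf("mon_id");
                if (asymIdx < 0 || monIdx < 0) {
                    throw new IllegalArgumentException("asym_id or mon_id column missing");
                }
                String[] fields = line.split("\\s+");
                if (fields.length != columns.size()) {
                    continue; // Skip rows this simple tokenizer cannot handle.
                }
                chains.computeIfAbsent(fields[asymIdx], k -> new ArrayList<>())
                      .add(fields[monIdx]);
            }
        }
        return chains;
    }

    public static void main(String[] args) {
        // Tiny hand-written fragment, not taken from a real PDB entry.
        String sample = String.join("\n",
                "loop_",
                "_pdbx_poly_seq_scheme.asym_id",
                "_pdbx_poly_seq_scheme.entity_id",
                "_pdbx_poly_seq_scheme.seq_id",
                "_pdbx_poly_seq_scheme.mon_id",
                "A 1 1 MET",
                "A 1 2 GLY",
                "B 2 1 ALA",
                "#");
        System.out.println(seqresByChain(sample));
    }
}
```

Because the rows list every residue in the deposited sequence, including those with no observed atoms, the per-chain lists here play the role the SEQRES groups play on the PDB side. A real parser would also need a proper mmCIF tokenizer to handle quoted values and multi-line fields.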