Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mmCIF parsing support for missing SEQRES information #353

Closed
darnells opened this issue Nov 23, 2015 · 20 comments
Closed

mmCIF parsing support for missing SEQRES information #353

darnells opened this issue Nov 23, 2015 · 20 comments
Assignees
Milestone

Comments

@darnells
Copy link

The current PDBParser supports parsing SEQRES records from a PDB file header; however, the MMCIFParser does not yet support parsing sequence information from a mmCIF file.

The current implementation in SimpleMMCIFConsumer is not using the mmCIF-equivalent SEQRES records; it uses ATOM group equivalents to construct the current component list.

The SimpleMMCIFParser should parse the analogous records from an mmCIF file (_pdbx_poly_seq_scheme)--the direct equivalent to the SEQRES record--and construct a component list representing the SEQRES sequence as does the PDBParser (this is required for passing existing integration tests: HeaderOnlyTest>MMcifTest.testLoad:63->MMcifTest.comparePDB2cif).

@darnells
Copy link
Author

We have implemented this feature in my company and are testing it internally. We will be submitting a PR.

@josemduarte
Copy link
Contributor

There is some support for SEQRES records for mmCIF files at the moment, but it will only work when setting FileParsingParameters.setAlignSeqRes() to true. When that flag is set, the SEQRES groups in the Chains are populated. What other information from pdbx_poly_seq_scheme are you looking for? More that what's stored in Chain.getSeqResGroups() ?

Related to that and regarding testing, there is a very useful test in the biojava-integrationtest module that does a thorough comparison of the parsed data from PDB and mmCIF files to check the consistency of the parsers: TestLongPdbVsMmCifParsing. It runs the parsers through 1000 randomly-selected PDB entries. The test is not enabled by default because it takes many minutes to run. But I've found it very useful whenever I modify anything related to the parsers. Run it manually from command line with:

mvn -Dtest=TestLongPdbVsMmCifParsing#testLongPdbVsMmCif test

@larsonmattr
Copy link
Contributor

Jose,

If the current SimpleMMCIFConsumer populates the SEQRES groups when
setAlignSeqRes(true), then the parsing of the SEQRES is working as
intended. We were looking at filling the SEQRES groups for each chain
based on the _pdbx_poly_seq_scheme. Do you know if SEQRES groups are
parsed by default when reading PDB files? Perhaps the seAlignSeqRes()
should be a default behavior?

Another related feature we were looking at is to be able to parse SEQRES
groups from mmCIF when operating in headerOnly() mode. The goal would be
pull sequence information out of structure files without having to load and
build all the AtomGroups. Would the current implementation work with this?

Best regards,
Matt

On Mon, Nov 23, 2015 at 6:07 PM, Jose Manuel Duarte <
notifications@github.com> wrote:

There is some support for SEQRES records for mmCIF files at the moment,
but it will only work when setting FileParsingParameters.setAlignSeqRes()
to true. When that flag is set, the SEQRES groups in the Chains are
populated. What other information from pdbx_poly_seq_scheme are you looking
for? More that what's stored in Chain.getSeqResGroups() ?

Related to that and regarding testing, there is a very useful test in the
biojava-integrationtest module that does a thorough comparison of the
parsed data from PDB and mmCIF files to check the consistency of the
parsers: TestLongPdbVsMmCifParsing
https://github.com/biojava/biojava/blob/master/biojava-integrationtest/src/test/java/org/biojava/nbio/structure/test/io/TestLongPdbVsMmCifParsing.java.
It runs the parsers through 1000 randomly-selected PDB entries. The test is
not enabled by default because it takes many minutes to run. But I've found
it very useful whenever I modify anything related to the parsers. Run it
manually from command line with:

mvn -Dtest=TestLongPdbVsMmCifParsing#testLongPdbVsMmCif test


Reply to this email directly or view it on GitHub
#353 (comment).

Matt Larson, PhD
Madison, WI 53705 U.S.A.

@josemduarte
Copy link
Contributor

setAlignSeqRes() is by default false for all parsers. That means that by default SEQRES groups are not filled in Chains. The issue is that setAlignSeqRes() does introduce some (minimal) overhead because it needs to align the 2 sequences. That's the reason why it is not set by default. Have a look at TestLongPdbVsMmCifParsing#L509 where it is explicitly set to true so that SEQRES tests can pass. Simply set it to false and you will see where it fails.

I would say that it is not such a bad idea to set setAlignSeqRes() to true by default. At the moment a few things depend on the SEQRES groups being filled and it is confusing when they fail and there's no indication anywhere why they fail. So +1 on that.

As a related note, there's yet another flag related to this in FileParsingParameters: setStoreEmptySeqRes(). The intention of the flag is to store SEQRES without going through the alignment overhead. It looks like it is not actually implemented yet for the mmCIF parser. Perhaps that could serve for the use case where one needs to get the sequence only (the headerOnly() behavior you describe)?

@josemduarte
Copy link
Contributor

Regarding the isHeaderOnly() flag, I'm not sure what it is really doing right now. Does it actually save any parsing time?

For instance in the use case you present, I'm not convinced that one would really gain much in avoiding parsing of atom sites. After all (especially in the mmCIF case) the tokenizer/parser has to scan the whole file anyway (whether you keep headers only or store the whole thing).

If it can really be shown that there's considerable overhead and that there's a measurable difference, then I'm all for it.

@darnells
Copy link
Author

I coarsely estimate the overhead is 3-4%, which translates to an extra hour of parsing over the entire PDB (single thread, Intel Core i7-2600K @ 3.4 GHz x 4, 32 GB RAM, SSD drive).

Back-of-the-envelope estimation:

  1. Bin the bottom 95% of the PDB (113,971 structures) by molecular weight into 10 KDa bins (10 to 250 KDa)
  2. Calculate the weighted average using the central value of the interval (as of 11/30/15: 31.9 KDa)
  3. Assuming every structure is entirely protein (no water, ligands, nucleic acid), calculate the average residues in a structure (31,000 Da / 110 Da per aa residue = 290 aa)
  4. Measure the average parse time for 4WZ6 (one chain, 290 residues) using setHeaderOnly(true) and then setAlignSeqRes(true) (code snippet below, 16 trials and throw out the top/bottom 3, independent runs to avoid AtomCache; HeaderOnly = 0.669 s, AlignSeqRes = 0.692 s, Diff = 0.024 s or 3.4%)
  5. Extrapolate over entire PDB: 0.024 s * 111,971 = 2678 s ~= 45 min)

I expect there will be a measurable difference; however, I wasn't planning to perform a definitive experiment in the near future.

public class Issue353 {

    public static void main(String[] args) {
        MMCIFFileReader fr = new MMCIFFileReader();
        FileParsingParameters par = new FileParsingParameters();
        //par.setAlignSeqRes(true);
        par.setHeaderOnly(true);
        fr.setFileParsingParameters(par);
        fr.setFetchBehavior(FetchBehavior.FETCH_FILES);

        Structure s = null;
        long start = System.nanoTime();
        try {
            s = fr.getStructureById("4WZ6");
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }
        long stop = System.nanoTime();
        double diff = (stop - start) / 1000000000.0;
        System.out.println(String.format("[%s] Elapsed time: %.3f s", s.getIdentifier(), diff));
    }

}

@josemduarte
Copy link
Contributor

Great estimate, thanks for that! So as it seems there is some overhead that could be avoided with the headerOnly mode.

Only one point: we should reconsider all the different modes we have right now in FileParsingParameters and try not to repeat/overlap functionality between them. An explosion of the number of parsing modes is not desirable and very difficult to maintain. The list of modes related to SEQRES parsing is at the moment:

  • alignSeqRes
  • storeEmptySeqRes
  • headerOnly

Perhaps we could drop storeEmptySeqRes from there? In any case we should make alignSeqRes the default.

We should also double-check that the current modes work well for mmCIF, most of them were implemented for PDB and sometimes not (fully) implemented for mmCIF.

@pwrose
Copy link
Member

pwrose commented Dec 3, 2015

I agree that alignSeqRes should be the default.

Also, just FYI, all PDB sequences are available in a separate FASTA file at:

ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz

On Thu, Dec 3, 2015 at 10:54 AM, Jose Manuel Duarte <
notifications@github.com> wrote:

Great estimate, thanks for that! So as it seems there is some overhead
that could be avoided with the headerOnly mode.

Only one point: we should reconsider all the different modes we have right
now in FileParsingParameters and try not to repeat/overlap functionality
between them. An explosion of the number of parsing modes is not desirable
and very difficult to maintain. The list of modes related to SEQRES parsing
is at the moment:

  • alignSeqRes
  • storeEmptySeqRes
  • headerOnly

Perhaps we could drop storeEmptySeqRes from there? In any case we should
make alignSeqRes the default.

We should also double-check that the current modes work well for mmCIF,
most of them were implemented for PDB and sometimes not (fully) implemented
for mmCIF.


Reply to this email directly or view it on GitHub
#353 (comment).

Peter Rose, Ph.D.
Site Head, RCSB Protein Data Bank West (http://www.rcsb.org)
San Diego Supercomputer Center (http://bioinformatics.sdsc.edu)
University of California, San Diego
+1-858-822-5497

@darnells
Copy link
Author

darnells commented Dec 3, 2015

@pwrose, thanks for sharing that link for everyone.

My motivation is to continually update a non-redundant protein sequence set of the PDB in the style of the Dunbrack Lab, such that resolution and R-values are considered in the selection process. I could not find the R-values in the derived data tables from wwPDB. Rather than abuse the PDB REST services, I planned to parse the header information locally. I considered this task to be "data mining the PDB header," which BioPython admits is a weakness of theirs. Since BioJava supports header-only parsing, I thought this could further distinguish BioJava (albeit in a small way).

Also, 👍 for alignSeqRes = true by default

@andreasprlic
Copy link
Member

So, alignSeqRes = true triggers several loops and calls to SeqRes2AtomAligner, which recreates the mapping between SEQRES and ATOM records. NOT having to go through this will speed up parsing. By default the parser should use the information from _pdbx_poly_seq_scheme to rebuild this mapping and only call alignSeqRes() if this is a custom mmCIF file and the data missing.

@sbliven
Copy link
Member

sbliven commented Dec 8, 2015

@andreasprlic This sounds very safe to make default, if it's not already.

@darnells
Copy link
Author

darnells commented Dec 9, 2015

There are eight combinations to support for the three current SeqRes related flags (headerOnly, storeEmptySeqRes, alignSeqRes). I started to try describing how each should behave, but quickly came to the conclusion that the current mechanism is too complicated.

Since SEQRES groups are expected for full internal support, I present this design for consideration. I attempted to combine suggestions from @josemduarte and @andreasprlic:

  • The storeEmptySeqRes option is dropped
  • The SEQRES/_pdbx_poly_seq_scheme records are always parsed
    • Consistent with the PDB format v3.3 since other Primary Structure Section records (DBREF, DBREF1, DBREF2) are already considered in "the header"
    • Guarantees compatibility with other BioJava methods
    • Minor overhead for 'headerOnly = true', but reduces confusion for new developers
  • 'headerOnly = true' disables all atomic coordinate parsing
  • alignSeqRes is true by default
    • For canonical mmcif files, mapping is defined by _pdbx_poly_seq_scheme records
    • For custom mmcif files or PDB format files, mapping is defined by sequence alignment
    • 'alignSeqRes = false' is now functionally equivalent to the deprecated 'storeEmptySeqRes = true'

@josemduarte
Copy link
Contributor

That looks like a great plan to me! +1 to dropping storeEmptySeqRes, +1 to alignSeqRes=true by default

@sbliven
Copy link
Member

sbliven commented Dec 10, 2015

+1

@darnells
Copy link
Author

We are willing to implement this design in conjunction with the other mmcif issues we opened (#341, #342, #343).

@pwrose
Copy link
Member

pwrose commented Dec 10, 2015

+1

On Thu, Dec 10, 2015 at 9:16 AM, darnells notifications@github.com wrote:

We are willing to implement this design in conjunction with the other
mmcif issues we opened (#341
#341, #342
#342, #343
#343).


Reply to this email directly or view it on GitHub
#353 (comment).

Peter Rose, Ph.D.
Site Head, RCSB Protein Data Bank West (http://www.rcsb.org)
San Diego Supercomputer Center (http://bioinformatics.sdsc.edu)
University of California, San Diego
+1-858-822-5497

@andreasprlic
Copy link
Member

ok great, thanks for volunteering! I have assigned the respective tickets to you.

@larsonmattr
Copy link
Contributor

Andreas,

I am working with Steve on these issues -

I would like to submit the pull request for issues 342 and 343 - could you
reassign those tickets to me? We will finish the implementation for 341
and submit a pull request soon as well.

On Thu, Dec 10, 2015 at 5:56 PM, Andreas Prlic notifications@github.com
wrote:

ok great, thanks for volunteering! I have assigned the respective tickets
to you.


Reply to this email directly or view it on GitHub
#353 (comment).

Matt Larson, PhD
Madison, WI 53705 U.S.A.

@andreasprlic
Copy link
Member

Hi Matt,

You can submit pull requests also without being assigned to a ticket. This is really more to keep track who is working on which issue. GitHub will keep track of the changes to code and give due credit for the contributions also without ticket ownership.

@josemduarte
Copy link
Contributor

The pull request #389 should have fixed all this, please reopen if someone finds any problems with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants