Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle feature references another sequence #2334

Merged

Conversation

peterjc
Copy link
Member

@peterjc peterjc commented Nov 8, 2019

This pull request addresses issue #808, including the work of Adam Sjøgren via the mailing list.

  • I hereby agree to dual licence this and any previous contributions under both
    the Biopython License Agreement AND the BSD 3-Clause License.

  • I have read the CONTRIBUTING.rst file, have run flake8 locally, and
    understand that AppVeyor and TravisCI will be used to confirm the Biopython unit
    tests and style checks pass with these changes.

  • I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

@codecov
Copy link

codecov bot commented Nov 8, 2019

Codecov Report

Merging #2334 into master will increase coverage by 0.00%.
The diff coverage is 87.50%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #2334   +/-   ##
=======================================
  Coverage   83.88%   83.89%           
=======================================
  Files         317      317           
  Lines       51548    51556    +8     
=======================================
+ Hits        43242    43251    +9     
+ Misses       8306     8305    -1     
Impacted Files Coverage Δ
Bio/SeqFeature.py 87.12% <87.50%> (+0.38%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db43482...585fc9a. Read the comment docs.

@peterjc
Copy link
Member Author

peterjc commented Dec 20, 2019

Rebasing and dealing with conflicts and new release...

@peterjc peterjc force-pushed the handle-feature-references-another-sequence branch from 026f378 to 4d87f65 Compare December 20, 2019 16:04
@peterjc
Copy link
Member Author

peterjc commented Feb 13, 2020

I haven't forgotten about this, but we've been distracted with the Python 2/3 cleanup in the meantime.

@peterjc
Copy link
Member Author

peterjc commented Mar 23, 2020

Rebasing again (and leaving out the dual licensing changes)...

@peterjc peterjc force-pushed the handle-feature-references-another-sequence branch 3 times, most recently from 4f3320d to 07d000d Compare March 23, 2020 18:44
@peterjc
Copy link
Member Author

peterjc commented Mar 23, 2020

Code seems to be working fine, but have not found a good example yet - for instance none of the features here give the expected translation:

from Bio import SeqIO

files = ("FP885871.gbk", "FP885876.gbk")

references = SeqIO.to_dict(SeqIO.read(_, "gb") for _ in files)

for acc, ref in references.items():
    print(f"{acc} length {len(ref)} has {len(ref.features)} features")
    for f in ref.features:
        if f.type == "CDS" and len(f.location.parts) > 1:
            print(f"{f.type} feature at {f.location}")
            #print(f)
            expected = f.qualifiers["translation"][0]
            try:
                translation = f.extract(ref.seq, references).translate(cds=True)
            except:
                translation = f.extract(ref.seq, references).translate(to_stop=True)
            print(expected + "<-- expected")
            print(translation + "<-- computed", expected == translation)
            if "exception" in f.qualifiers and f.qualifiers["exception"] == ["RNA editing"]:
                continue
            assert expected == translation, f"{expected} vs {translation}"

Some (all?) of them have human readable comments explaining why (in more detail than just "RNA editing") at least.

@peterjc peterjc force-pushed the handle-feature-references-another-sequence branch from 67bd109 to 90e50a6 Compare March 23, 2020 21:42
@peterjc
Copy link
Member Author

peterjc commented Mar 29, 2020

Sadly this example from ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Arabidopsis_thaliana/ used in the Tutorial from drawing tRNA on chromosomes does not appear to have any clear trans-splicing annotation:

filenames = [
    "CHR_I/NC_003070.gbk",
    "CHR_II/NC_003071.gbk",
    "CHR_III/NC_003074.gbk",
    "CHR_IV/NC_003075.gbk",
    "CHR_V/NC_003076.gbk",
]

references = SeqIO.index_db("Arabidopsis.idx", filenames, "genbank")

Perhaps a newer version might... or maybe this is just not an appropriate example.

@peterjc
Copy link
Member Author

peterjc commented May 14, 2020

Zebrafish to the rescue - a well behaved example from Adam Sjøgren, BX465837.11 which has a couple of CDS with trans-splicing with BX537337.9

https://www.ncbi.nlm.nih.gov/nuccore/BX465837.11
https://www.ncbi.nlm.nih.gov/nuccore/BX537337.9

Should work in EMBL format too:

https://www.ebi.ac.uk/ena/browser/api/embl/BX465837.11
https://www.ebi.ac.uk/ena/browser/api/embl/BX537337.9

@peterjc
Copy link
Member Author

peterjc commented May 16, 2020

@MarkusPiotrowski - would you be a good person to review this?

I'm hoping to turn the Zebrafish example into something for the Tutorial and/or docstrings.

@peterjc
Copy link
Member Author

peterjc commented Jul 28, 2020

This is going to need some merge conflicts resolved now...

Adam Sjøgren and others added 5 commits August 29, 2020 14:45
If the location refers to other records, those records can be supplied
in an optional references dictionary, where the records will be looked
up by the ref (key) and the value is expected to be the same type as the
parent_sequence parameter (and thus the type extract() returns).

Refs biopython#808
This is to faciliate using SeqIO.to_dict or SeqIO.index etc with the extract method.

In that use case, want to focus on the annotation of the feature, not the parent records.
@peterjc peterjc force-pushed the handle-feature-references-another-sequence branch from 90e50a6 to 1fdc9ab Compare August 29, 2020 14:02
@peterjc
Copy link
Member Author

peterjc commented Aug 29, 2020

@mdehoon do you want to review?

@mdehoon
Copy link
Contributor

mdehoon commented Aug 30, 2020

I guess these exceptions:

            raise ValueError("Feature references another sequence.")

in _shift and _flip can now be removed?

@peterjc
Copy link
Member Author

peterjc commented Aug 30, 2020

Yes, as per my comment #808 (comment) in those cases _shift and _flip are relative to the local sequence, and thus are a no-operation relative to an external sequence.

Bio/SeqFeature.py Outdated Show resolved Hide resolved
@peterjc peterjc force-pushed the handle-feature-references-another-sequence branch from 304dcd3 to 585fc9a Compare August 30, 2020 12:53
@peterjc peterjc merged commit cbd0ea3 into biopython:master Aug 31, 2020
@peterjc peterjc deleted the handle-feature-references-another-sequence branch August 31, 2020 13:43
@peterjc
Copy link
Member Author

peterjc commented Aug 31, 2020

Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants