Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to deal with "Feature references another sequence?" #808

Closed
clalogiudice opened this issue Apr 12, 2016 · 23 comments
Closed

How to deal with "Feature references another sequence?" #808

clalogiudice opened this issue Apr 12, 2016 · 23 comments

Comments

@clalogiudice
Copy link

Hi to all,
I'm extracting CDS with biopython but I'm experiencing this problem on some sequences like this:

join{[37692:37922](-), [35638:36854](-), FP885871.1[246160:246182](+), FP885871.1[163236:163631](+), FP885871.1[164613:164763](+)}

I'm quite sure that the reference to other sequence is not managed (yet) by biopython.
Any suggestion to overcome?

THanx in advance.

Edit: I added the triple back-tick characters to display the Biopython compound location string properly; Peter

@peterjc
Copy link
Member

peterjc commented Apr 12, 2016

If you are hoping to use the location information to extract the sequence, Biopython doesn't do that (yet). One idea was to extend the SeqFeature object's .extract(...) method to optionally take a dictionary of sequences keyed by the accessions, as well as the main sequence.

However, it may be there is a better way to get the data you want from existing online resources?

Update - To be explicit, you'd get a ValueError message of Feature references another sequence if you try.

@clalogiudice
Copy link
Author

Dear Peterjc, thanks for your kind reply:
I'm still a novice.
I've tried in this manner:

for rec in SeqIO.parse(handle, 'genbank'):
    for feature in rec.features:
        try:
            if feature.type == 'CDS':
                print (feature.location.extract(rec).seq)
        except ValueError:
            print(rec.seq[feature.location.start.position:feature.location.end.position])

My problem is how to manage transplicing events in this case....
Could you suggest me the easiest way?

Edit: I added the triple back-tick characters to display the Python code example properly on GitHub. Peter.

@peterjc
Copy link
Member

peterjc commented Apr 12, 2016

You've picked a hard problem (trans-splicing). The easiest way would as I said probably be to find the data elsewhere online - perhaps via the NCBI gene/protein databases?

With trans-splicing you must not just use the overall location start and end values - they won't give you want you want. This snippet of code might help:

for sub_location in feature.location.parts:
    print sub_location

What's the accession / URL of the example record you are looking at? It would be easier for me to work from that.

@clalogiudice
Copy link
Author

Dear Peter,
here's an example of record that goes wrong:

http://www.ncbi.nlm.nih.gov/nuccore/FP885871.1

@peterjc
Copy link
Member

peterjc commented Apr 12, 2016

Thanks for giving the specific example. Unfortunately I have some bad news.

First look at http://www.ncbi.nlm.nih.gov/nuccore/FP885871.1 is the default "GenBank" view:

LOCUS       FP885871              269136 bp    DNA     circular PLN 29-SEP-2011
DEFINITION  Beta vulgaris subsp. maritima genotype male-sterile G
            mitochondrion, partial genome.
ACCESSION   FP885871
VERSION     FP885871.1  GI:320147991
KEYWORDS    Beta vulgaris; mitochondrion.
SOURCE      mitochondrion Beta vulgaris subsp. maritima
  ORGANISM  Beta vulgaris subsp. maritima
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
            Pentapetalae; Caryophyllales; Amaranthaceae; Betoideae; Beta.
...
     gene            order(93403..93788,complement(50250..50308),
                     FP885876.1:59654..59912,89375,90851,90854,90861,90897,
                     90898,93404,93569,93617,93667,93709,93710,89326..89407,
                     90830..91021)
                     /gene="nad1"
     CDS             join(93403..93788,89326..89407,90830..91021,
                     complement(50250..50308),FP885876.1:59654..59912)
                     /gene="nad1"
                     /exception="RNA editing"
                     /trans_splicing
                     /note="Protein sequence is in conflict with the conceptual
                     translation; Protein sequence is in conflict with the
                     conceptual translation; C to U RNA editing creates start
                     codon"
                     /codon_start=1
                     /product="NADH dehydrogenase subunit 1"
                     /protein_id="CBX33244.1"
                     /db_xref="GI:320147992"
                     /db_xref="GOA:E6ZDX8"
                     /db_xref="InterPro:IPR001694"
                     /db_xref="InterPro:IPR018086"
                     /db_xref="UniProtKB/TrEMBL:E6ZDX8"
                     /translation="MYIAVPAEILGIILPLLLGVAFLVLAERKVMAFVQRRKGPDVVG
                     SFGLLQPLADGLKLILKEPISPSSANFFLFRMAPVTTFMLSLVAWAVVPFDYGMVLSD
                     LNIGLLYLFAISSLGVYGIIIAGWSSNSKYAFLGALRSAAQMVSYEVSIGLILITVLI
                     CVGSCNLSEIVMAQKQIWFGIPLFPVLVMFFISCLAETNRAPFDLPEAEAELVAGYNV
                     EYSSMGFALFFLGEYANMILMSGLCTLLFLGGWLPILDLPIFKRIPGSIWFSIKVILF
                     LFLYIWVRAAFPRYRYDQLMGLGWKVFLPLSLAWVVAVSGVLVTFQWLP"
...

Now toggle the view from "GenBank" to "GenBank (full)" and I get http://www.ncbi.nlm.nih.gov/nuccore/FP885871.1?report=gbwithparts&log$=seqview

LOCUS       FP885871              269136 bp    DNA     circular PLN 29-SEP-2011
DEFINITION  Beta vulgaris subsp. maritima genotype male-sterile G
            mitochondrion, partial genome.
ACCESSION   FP885871
VERSION     FP885871.1  GI:320147991
KEYWORDS    Beta vulgaris; mitochondrion.
SOURCE      mitochondrion Beta vulgaris subsp. maritima
  ORGANISM  Beta vulgaris subsp. marítima
...

     gene            order(93403..93788,complement(50250..50308),89375,90851,
                     90854,90861,90897,90898,93404,93569,93617,93667,93709,
                     93710,89326..89407,90830..91021)
                     /gene="nad1"
     CDS             join(93403..93788,89326..89407,90830..91021,
                     complement(50250..50308))
                     /gene="nad1"
                     /exception="RNA editing"
                     /trans_splicing
                     /note="Protein sequence is in conflict with the conceptual
                     translation; Protein sequence is in conflict with the
                     conceptual translation; C to U RNA editing creates start
                     codon"
                     /codon_start=1
                     /product="NADH dehydrogenase subunit 1"
                     /protein_id="CBX33244.1"
                     /db_xref="GI:320147992"
                     /db_xref="GOA:E6ZDX8"
                     /db_xref="InterPro:IPR001694"
                     /db_xref="InterPro:IPR018086"
                     /db_xref="UniProtKB/TrEMBL:E6ZDX8"
                     /translation="MYIAVPAEILGIILPLLLGVAFLVLAERKVMAFVQRRKGPDVVG
                     SFGLLQPLADGLKLILKEPISPSSANFFLFRMAPVTTFMLSLVAWAVVPFDYGMVLSD
                     LNIGLLYLFAISSLGVYGIIIAGWSSNSKYAFLGALRSAAQMVSYEVSIGLILITVLI
                     CVGSCNLSEIVMAQKQIWFGIPLFPVLVMFFISCLAETNRAPFDLPEAEAELVAGYNV
                     EYSSMGFALFFLGEYANMILMSGLCTLLFLGGWLPILDLPIFKRIPGSIWFSIKVILF
                     LFLYIWVRAAFPRYRYDQLMGLGWKVFLPLSLAWVVAVSGVLVTFQWLP"

Notice the problematic bit of the location FP885876.1:59654..59912 has actually vanished! Do you agree? Does it do the same from your browser? If so, please report this to the NCBI.

I reported a very similar NCBI bug once before: http://blastedbio.blogspot.co.uk/2012/03/missing-external-exons-in-genbank-with.html

So be very careful if you ever use the "GenBank (full)" view.

@clalogiudice
Copy link
Author

Dear Peter, first of all thank you.
This is good for me, even I hope that in a near future biopython will have also implement this tricky function.
My problem was also that I was trying to parse gb files (merged with cat) and directly downloaded with a script that was catching genbank.records not in the full view..
Just to show you:

for genome_id in genome_ids:
    record = Entrez.efetch(db="nucleotide", id=genome_id, rettype="gb", retmode="text")
    filename = path + 'genBankRecord_{}.gb'.format(genome_id) 

Is there a possibility to download genbank record full?
Thanks for your time and I hope that this will be of help also for other people.

@peterjc
Copy link
Member

peterjc commented Apr 12, 2016

In Entrez you can replace rettype='gb' (or rettype='genbank') with rettype='gbwithparts' which is the equivalent of "GenBank (full)" on the NCBI website.

See https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly

However, since there seems to be a problem with the NCBI script which currently generates the "full" GenBank files, I would avoid this for now. You would be safer using the plain GenBank file instead.

@clalogiudice
Copy link
Author

Dear Peter,
thanks for your tips ;)

@peterjc
Copy link
Member

peterjc commented Apr 12, 2016

For reference, you can see the same NCBI GenBank "full" or gbwithparts problem using Entrez at the terminal (without Biopython):

curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?rettype=gb&db=nuccore&id=FP885871.1&retmode=text" | head -n 50
...
     gene            order(93403..93788,complement(50250..50308),
                     FP885876.1:59654..59912,89375,90851,90854,90861,90897,
                     90898,93404,93569,93617,93667,93709,93710,89326..89407,
                     90830..91021)
                     /gene="nad1"
...

Versus:

$ curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?rettype=gbwithparts&db=nuccore&id=FP885871.1&retmode=text" | head -n 50
...
     gene            order(93403..93788,complement(50250..50308),89375,90851,
                     90854,90861,90897,90898,93404,93569,93617,93667,93709,
                     93710,89326..89407,90830..91021)
                     /gene="nad1"
...

Notice the external exon FP885876.1:59654..59912 is missing for gbwithparts.

I have reported this to the NCBI by email.

@peterjc
Copy link
Member

peterjc commented Apr 12, 2016

Getting back to your original question, I downloaded this example file at the command line - you could do this with Bio.Entrez in Biopython, or in your web browser too and ought to get the same raw file:

$ curl -o FP885871.gbk "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?rettype=gb&db=nuccore&id=FP885871.1&retmode=text" 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  458k    0  458k    0     0   538k      0 --:--:-- --:--:-- --:--:--  538k

Now, does this code help?

from Bio import SeqIO
record = SeqIO.read("FP885871.gbk", "gb")
cds_feature = record.features[2]
print(cds_feature.location)
print("This has %i parts" % len(cds_feature.location.parts))
for part in cds_feature.location.parts:
    print(part)
    print("Reference: %r, start: %r, end: %r" % (part.ref, part.start, part.end))

You should get:

join{[93402:93788](+), [89325:89407](+), [90829:91021](+), [50249:50308](-), FP885876.1[59653:59912](+)}
This has 5 parts
[93402:93788](+)
Reference: None, start: ExactPosition(93402), end: ExactPosition(93788)
[89325:89407](+)
Reference: None, start: ExactPosition(89325), end: ExactPosition(89407)
[90829:91021](+)
Reference: None, start: ExactPosition(90829), end: ExactPosition(91021)
[50249:50308](-)
Reference: None, start: ExactPosition(50249), end: ExactPosition(50308)
FP885876.1[59653:59912](+)
Reference: 'FP885876.1', start: ExactPosition(59653), end: ExactPosition(59912)

Compare this to the raw GenBank file entry,

...
     CDS             join(93403..93788,89326..89407,90830..91021,
                     complement(50250..50308),FP885876.1:59654..59912)
                     /gene="nad1"
...

In order to process the complex feature, you would need to check the .ref and match it to the externally referenced sequence, and then get the described region from the start/end values.

If you try cds_feature.extract(record.seq) here you get an explicit error (which perhaps should be a not yet implemented exception):

ValueError: Feature references another sequence.

The reason this isn't yet handled in Biopython directly is (a) it's very complicated, and (b) I had no reason to use it so couldn't justify spending work time on it. What I was thinking was extending the .extract(...) method to allow you to do cds_feature.extract(record.seq, dictionary_of_extra_sequences).

@clalogiudice
Copy link
Author

Yes even I'm a novice with python I think this is very complicated. The only thing....
In the genbank full record in browser, just clicking on cds is possible to catch all the sequence reconstructed... that I can use as a ref.

@clalogiudice
Copy link
Author

Peter,
just for curiosity and if you have time...
How do you manage this? KF754801 join{86890:87276, KF754803.1470127:470209, KF754803.1471593:471785, KF754803.12893473:2893532, KF754803.12504114:2504373}

I think there is another bug on NCBI

@peterjc
Copy link
Member

peterjc commented Apr 13, 2016

I would expect this Genbank (Full) problem to affect all the NCBI pages with splicing between records, not just the one record we looked at. They have not replied to me yet - maybe you should email them too?

To expand on my outline/hint a bit more:

from Bio import SeqIO
record = SeqIO.read("FP885871.gbk", "gb")
cds_feature = record.features[2]
print(cds_feature.location)
print("This has %i parts" % len(cds_feature.location.parts))
for part in cds_feature.location.parts:
    print("\nReference: %r, start: %r, end: %r" % (part.ref, part.start, part.end))
    if part.ref:
        print("You must download the record %s" % part.ref)
        print("and then do part.extract(other_sequence) or similar!")
    else:
        print("Extracted from current record: %s" % part.extract(record.seq))

giving:

join{[93402:93788](+), [89325:89407](+), [90829:91021](+), [50249:50308](-), FP885876.1[59653:59912](+)}
This has 5 parts

Reference: None, start: ExactPosition(93402), end: ExactPosition(93788)
Extracted from current record: ACGTACATAGCTGTTCCAGCTGAAATACTTGGAATAATTCTACCACTTCTACTAGGAGTAGCCTTTTTAGTGCTAGCTGAACGTAAAGTAATGGCTTTTGTGCAACGTCGAAAGGGTCCTGATGTAGTGGGATCGTTCGGATTGTTACAACCTCTAGCAGATGGTTCGAAATTGATTCTAAAAGAACCTATTTCACCAAGTAGTGCTAATTTCTCCCTTTTTAGAATGGCTCCAGTCACTACATTTATGCTAAGTCTGGTTGCTCGGGCCGTTGTACCTTTTGATTATGGTATGGTATTGTCAGATCCGAACATAGGGCTACTTTATTTGTTTGCCATATCTTCGCTAGGTGTTTATGGAATTATTATAGCAGGTTGGTCTAGTAA

Reference: None, start: ExactPosition(89325), end: ExactPosition(89407)
Extracted from current record: TTCGAAATATGCCTTTCTAGGAGCATTACGATCTGCAGCTCAAATGGTCCCTTATGAAGTCTCTATTGGTCTTATTCTTATT

Reference: None, start: ExactPosition(90829), end: ExactPosition(91021)
Extracted from current record: ACTGTACTAATATGTGTAGGTCCTCGTAATTCGAGTGAGATTGTCATGGCGCAAAAGCAGATATGGTCCGGTATTCCCTTGTTCCCTGTATTGGTTATGTTCTTTATTTCTTGTTTAGCAGAAACTAATCGAGCTCCGTTTGATCTCCCAGAAGCGGAAGCTGAATTAGTTGCAGGCTATAATGTAGAATAT

Reference: None, start: ExactPosition(50249), end: ExactPosition(50308)
Extracted from current record: TCTTCAATGGGGTCTGCTCTTTTTTTTTTAGGAGAGTATGCCAATATGATCTTAATGAG

Reference: 'FP885876.1', start: ExactPosition(59653), end: ExactPosition(59912)
You must download the record FP885876.1
and then do part.extract(other_sequence) or similar!

I would do this in steps, first loop over all the features checking what extra sequences you need, download them, and index them with SeqIO.index(...) or SeqIO.index_db(...) so you can pull out the externally referenced sequence easily. Here's the first bit:

from Bio import SeqIO
record = SeqIO.read("FP885871.gbk", "gb")
external = set()
for feature in record.features:
    for part in feature.location.parts:
        if part.ref:
            external.add(part.ref)
print("%i external records are referenced:" % len(external))
for name in sorted(external):
    print(name)

In this case, just there is only one externally referenced record:

1 external records are referenced:
FP885876.1

@clalogiudice
Copy link
Author

Yes I'll email them explaining the situation,
I hope they will solve in some manner also because I've found at least 12 records affected by this.
In particular with:
KF754801 join{86890:87276, KF754803.1470127:470209, KF754803.1471593:471785, KF754803.12893473:2893532, KF754803.12504114:2504373}

Clicking on cds it will extract only the sequence between 8691 and 87276...
Fasta format is bugged as you can see from:
http://www.ncbi.nlm.nih.gov/nuccore/KF754801

@peterjc
Copy link
Member

peterjc commented Apr 13, 2016

I got a reply from the NCBI earlier today, the "full" record problem has been passed to their developers.

@clalogiudice
Copy link
Author

Same here.
Biopython has a nice extract method, I hope to keep using it in the future, even I'm still not so able with dictionaries.

@szhan
Copy link

szhan commented Jan 3, 2018

Hi, I think I am experiencing a related issue. There has been no follow-up? Thanks!

@peterjc
Copy link
Member

peterjc commented Jan 3, 2018

The NCBI didn't get back to me, I have emailed them again to see where things stand.

Right now the nad1 example discussed earlier in FP885871.1 appears to still be broken :(

@clalogiudice
Copy link
Author

Dear Peter, if it can be of help I would like to participate to solve the problem acting on biopython source.

Happy new year.

@peterjc
Copy link
Member

peterjc commented Nov 8, 2019

I'm looking at some code from Adam Sjøgren to extend the extract method:
https://mailman.open-bio.org/pipermail/biopython/2019-November/016725.html

I've just been looking at the two trans-splicing examples from Amborella trichopoda between mtDNA I and III (accessions KF754803.1 and KF754801.1), i.e. proteins AHA47098.1 and AHA47124.1). Their annotation says there is RNA editing, so even while we can get a CDS and translate it, the exact protein does not match (even trying alternative genetic codes).

Likewise for the Beta vulgaris mtDNA example (FP885871 and FP885876), there are three trans-spliced proteins. CBJ20660.1 is close, but CBX33245.3 and CBJ23338.3 looks to be one amino acid longer. At least all three are annotated with the warning Protein sequence is in conflict with the conceptual translation.

I wonder where the annotated protein sequences came from - not clear. Anyway, sadly these do not make good test cases, nor are they ideal examples for documentation.

peterjc pushed a commit to peterjc/biopython that referenced this issue Nov 8, 2019
If the location refers to other records, those records can be supplied
in an optional references dictionary, where the records will be looked
up by the ref (key) and the value is expected to be the same type as the
parent_sequence parameter (and thus the type extract() returns).

Refs biopython#808
peterjc pushed a commit to peterjc/biopython that referenced this issue Nov 8, 2019
If the location refers to other records, those records can be supplied
in an optional references dictionary, where the records will be looked
up by the ref (key) and the value is expected to be the same type as the
parent_sequence parameter (and thus the type extract() returns).

Refs biopython#808
peterjc pushed a commit to peterjc/biopython that referenced this issue Dec 20, 2019
If the location refers to other records, those records can be supplied
in an optional references dictionary, where the records will be looked
up by the ref (key) and the value is expected to be the same type as the
parent_sequence parameter (and thus the type extract() returns).

Refs biopython#808
peterjc pushed a commit to peterjc/biopython that referenced this issue Mar 23, 2020
If the location refers to other records, those records can be supplied
in an optional references dictionary, where the records will be looked
up by the ref (key) and the value is expected to be the same type as the
parent_sequence parameter (and thus the type extract() returns).

Refs biopython#808
@mdehoon
Copy link
Contributor

mdehoon commented Aug 29, 2020

It looks like one of our test cases has the same issue:

>>> from Bio import SeqIO
>>> record = SeqIO.read("GenBank/one_of.gb", 'genbank')
>>> record[1:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/biopython-1.77-py3.8-macosx-10.9-x86_64.egg/Bio/SeqRecord.py", line 516, in __getitem__
    answer.features.append(f._shift(-start))
...
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/biopython-1.77-py3.8-macosx-10.9-x86_64.egg/Bio/SeqFeature.py", line 1015, in _shift
    raise ValueError("Feature references another sequence.")
ValueError: Feature references another sequence.

@peterjc
Copy link
Member

peterjc commented Aug 29, 2020

@mdehoon that example is related, certainly. In that case rather than a ValueError the shift should probably be a noop here (since the shift in coordinates is in terms of the main reference sequence, not the externally referenced sequence).

I should probably deal with the merge conflicts and merge #2334 even without a compelling example for the documentaiton.

peterjc pushed a commit to peterjc/biopython that referenced this issue Aug 29, 2020
If the location refers to other records, those records can be supplied
in an optional references dictionary, where the records will be looked
up by the ref (key) and the value is expected to be the same type as the
parent_sequence parameter (and thus the type extract() returns).

Refs biopython#808
peterjc pushed a commit that referenced this issue Aug 31, 2020
If the location refers to other records, those records can be supplied
in an optional references dictionary, where the records will be looked
up by the ref (key) and the value is expected to be the same type as the
parent_sequence parameter (and thus the type extract() returns).

Refs #808
@peterjc
Copy link
Member

peterjc commented Aug 31, 2020

Fixed in #2334, thank you Adam.

@peterjc peterjc closed this as completed Aug 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants