How to deal with "Feature references another sequence?" #808

clalogiudice · 2016-04-12T12:17:52Z

Hi to all,
I'm extracting CDS with biopython but I'm experiencing this problem on some sequences like this:

join{[37692:37922](-), [35638:36854](-), FP885871.1[246160:246182](+), FP885871.1[163236:163631](+), FP885871.1[164613:164763](+)}

I'm quite sure that the reference to other sequence is not managed (yet) by biopython.
Any suggestion to overcome?

THanx in advance.

Edit: I added the triple back-tick characters to display the Biopython compound location string properly; Peter

The text was updated successfully, but these errors were encountered:

peterjc · 2016-04-12T12:25:41Z

If you are hoping to use the location information to extract the sequence, Biopython doesn't do that (yet). One idea was to extend the SeqFeature object's .extract(...) method to optionally take a dictionary of sequences keyed by the accessions, as well as the main sequence.

However, it may be there is a better way to get the data you want from existing online resources?

Update - To be explicit, you'd get a ValueError message of Feature references another sequence if you try.

clalogiudice · 2016-04-12T12:59:08Z

Dear Peterjc, thanks for your kind reply:
I'm still a novice.
I've tried in this manner:

for rec in SeqIO.parse(handle, 'genbank'):
    for feature in rec.features:
        try:
            if feature.type == 'CDS':
                print (feature.location.extract(rec).seq)
        except ValueError:
            print(rec.seq[feature.location.start.position:feature.location.end.position])

My problem is how to manage transplicing events in this case....
Could you suggest me the easiest way?

Edit: I added the triple back-tick characters to display the Python code example properly on GitHub. Peter.

peterjc · 2016-04-12T13:06:45Z

You've picked a hard problem (trans-splicing). The easiest way would as I said probably be to find the data elsewhere online - perhaps via the NCBI gene/protein databases?

With trans-splicing you must not just use the overall location start and end values - they won't give you want you want. This snippet of code might help:

for sub_location in feature.location.parts:
    print sub_location

What's the accession / URL of the example record you are looking at? It would be easier for me to work from that.

clalogiudice · 2016-04-12T13:28:27Z

Dear Peter,
here's an example of record that goes wrong:

http://www.ncbi.nlm.nih.gov/nuccore/FP885871.1

peterjc · 2016-04-12T14:38:21Z

Thanks for giving the specific example. Unfortunately I have some bad news.

First look at http://www.ncbi.nlm.nih.gov/nuccore/FP885871.1 is the default "GenBank" view:

LOCUS       FP885871              269136 bp    DNA     circular PLN 29-SEP-2011
DEFINITION  Beta vulgaris subsp. maritima genotype male-sterile G
            mitochondrion, partial genome.
ACCESSION   FP885871
VERSION     FP885871.1  GI:320147991
KEYWORDS    Beta vulgaris; mitochondrion.
SOURCE      mitochondrion Beta vulgaris subsp. maritima
  ORGANISM  Beta vulgaris subsp. maritima
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
            Pentapetalae; Caryophyllales; Amaranthaceae; Betoideae; Beta.
...
     gene            order(93403..93788,complement(50250..50308),
                     FP885876.1:59654..59912,89375,90851,90854,90861,90897,
                     90898,93404,93569,93617,93667,93709,93710,89326..89407,
                     90830..91021)
                     /gene="nad1"
     CDS             join(93403..93788,89326..89407,90830..91021,
                     complement(50250..50308),FP885876.1:59654..59912)
                     /gene="nad1"
                     /exception="RNA editing"
                     /trans_splicing
                     /note="Protein sequence is in conflict with the conceptual
                     translation; Protein sequence is in conflict with the
                     conceptual translation; C to U RNA editing creates start
                     codon"
                     /codon_start=1
                     /product="NADH dehydrogenase subunit 1"
                     /protein_id="CBX33244.1"
                     /db_xref="GI:320147992"
                     /db_xref="GOA:E6ZDX8"
                     /db_xref="InterPro:IPR001694"
                     /db_xref="InterPro:IPR018086"
                     /db_xref="UniProtKB/TrEMBL:E6ZDX8"
                     /translation="MYIAVPAEILGIILPLLLGVAFLVLAERKVMAFVQRRKGPDVVG
                     SFGLLQPLADGLKLILKEPISPSSANFFLFRMAPVTTFMLSLVAWAVVPFDYGMVLSD
                     LNIGLLYLFAISSLGVYGIIIAGWSSNSKYAFLGALRSAAQMVSYEVSIGLILITVLI
                     CVGSCNLSEIVMAQKQIWFGIPLFPVLVMFFISCLAETNRAPFDLPEAEAELVAGYNV
                     EYSSMGFALFFLGEYANMILMSGLCTLLFLGGWLPILDLPIFKRIPGSIWFSIKVILF
                     LFLYIWVRAAFPRYRYDQLMGLGWKVFLPLSLAWVVAVSGVLVTFQWLP"
...

Now toggle the view from "GenBank" to "GenBank (full)" and I get http://www.ncbi.nlm.nih.gov/nuccore/FP885871.1?report=gbwithparts&log$=seqview

LOCUS       FP885871              269136 bp    DNA     circular PLN 29-SEP-2011
DEFINITION  Beta vulgaris subsp. maritima genotype male-sterile G
            mitochondrion, partial genome.
ACCESSION   FP885871
VERSION     FP885871.1  GI:320147991
KEYWORDS    Beta vulgaris; mitochondrion.
SOURCE      mitochondrion Beta vulgaris subsp. maritima
  ORGANISM  Beta vulgaris subsp. marítima
...

     gene            order(93403..93788,complement(50250..50308),89375,90851,
                     90854,90861,90897,90898,93404,93569,93617,93667,93709,
                     93710,89326..89407,90830..91021)
                     /gene="nad1"
     CDS             join(93403..93788,89326..89407,90830..91021,
                     complement(50250..50308))
                     /gene="nad1"
                     /exception="RNA editing"
                     /trans_splicing
                     /note="Protein sequence is in conflict with the conceptual
                     translation; Protein sequence is in conflict with the
                     conceptual translation; C to U RNA editing creates start
                     codon"
                     /codon_start=1
                     /product="NADH dehydrogenase subunit 1"
                     /protein_id="CBX33244.1"
                     /db_xref="GI:320147992"
                     /db_xref="GOA:E6ZDX8"
                     /db_xref="InterPro:IPR001694"
                     /db_xref="InterPro:IPR018086"
                     /db_xref="UniProtKB/TrEMBL:E6ZDX8"
                     /translation="MYIAVPAEILGIILPLLLGVAFLVLAERKVMAFVQRRKGPDVVG
                     SFGLLQPLADGLKLILKEPISPSSANFFLFRMAPVTTFMLSLVAWAVVPFDYGMVLSD
                     LNIGLLYLFAISSLGVYGIIIAGWSSNSKYAFLGALRSAAQMVSYEVSIGLILITVLI
                     CVGSCNLSEIVMAQKQIWFGIPLFPVLVMFFISCLAETNRAPFDLPEAEAELVAGYNV
                     EYSSMGFALFFLGEYANMILMSGLCTLLFLGGWLPILDLPIFKRIPGSIWFSIKVILF
                     LFLYIWVRAAFPRYRYDQLMGLGWKVFLPLSLAWVVAVSGVLVTFQWLP"

Notice the problematic bit of the location FP885876.1:59654..59912 has actually vanished! Do you agree? Does it do the same from your browser? If so, please report this to the NCBI.

I reported a very similar NCBI bug once before: http://blastedbio.blogspot.co.uk/2012/03/missing-external-exons-in-genbank-with.html

So be very careful if you ever use the "GenBank (full)" view.

clalogiudice · 2016-04-12T14:53:55Z

Dear Peter, first of all thank you.
This is good for me, even I hope that in a near future biopython will have also implement this tricky function.
My problem was also that I was trying to parse gb files (merged with cat) and directly downloaded with a script that was catching genbank.records not in the full view..
Just to show you:

for genome_id in genome_ids:
    record = Entrez.efetch(db="nucleotide", id=genome_id, rettype="gb", retmode="text")
    filename = path + 'genBankRecord_{}.gb'.format(genome_id)

Is there a possibility to download genbank record full?
Thanks for your time and I hope that this will be of help also for other people.

peterjc · 2016-04-12T15:12:40Z

In Entrez you can replace rettype='gb' (or rettype='genbank') with rettype='gbwithparts' which is the equivalent of "GenBank (full)" on the NCBI website.

See https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly

However, since there seems to be a problem with the NCBI script which currently generates the "full" GenBank files, I would avoid this for now. You would be safer using the plain GenBank file instead.

clalogiudice · 2016-04-12T15:14:51Z

Dear Peter,
thanks for your tips ;)

peterjc · 2016-04-12T15:29:40Z

For reference, you can see the same NCBI GenBank "full" or gbwithparts problem using Entrez at the terminal (without Biopython):

curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?rettype=gb&db=nuccore&id=FP885871.1&retmode=text" | head -n 50
...
     gene            order(93403..93788,complement(50250..50308),
                     FP885876.1:59654..59912,89375,90851,90854,90861,90897,
                     90898,93404,93569,93617,93667,93709,93710,89326..89407,
                     90830..91021)
                     /gene="nad1"
...

Versus:

$ curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?rettype=gbwithparts&db=nuccore&id=FP885871.1&retmode=text" | head -n 50
...
     gene            order(93403..93788,complement(50250..50308),89375,90851,
                     90854,90861,90897,90898,93404,93569,93617,93667,93709,
                     93710,89326..89407,90830..91021)
                     /gene="nad1"
...

Notice the external exon FP885876.1:59654..59912 is missing for gbwithparts.

I have reported this to the NCBI by email.

peterjc · 2016-04-12T15:44:05Z

Getting back to your original question, I downloaded this example file at the command line - you could do this with Bio.Entrez in Biopython, or in your web browser too and ought to get the same raw file:

$ curl -o FP885871.gbk "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?rettype=gb&db=nuccore&id=FP885871.1&retmode=text" 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  458k    0  458k    0     0   538k      0 --:--:-- --:--:-- --:--:--  538k

Now, does this code help?

from Bio import SeqIO
record = SeqIO.read("FP885871.gbk", "gb")
cds_feature = record.features[2]
print(cds_feature.location)
print("This has %i parts" % len(cds_feature.location.parts))
for part in cds_feature.location.parts:
    print(part)
    print("Reference: %r, start: %r, end: %r" % (part.ref, part.start, part.end))

You should get:

join{[93402:93788](+), [89325:89407](+), [90829:91021](+), [50249:50308](-), FP885876.1[59653:59912](+)}
This has 5 parts
[93402:93788](+)
Reference: None, start: ExactPosition(93402), end: ExactPosition(93788)
[89325:89407](+)
Reference: None, start: ExactPosition(89325), end: ExactPosition(89407)
[90829:91021](+)
Reference: None, start: ExactPosition(90829), end: ExactPosition(91021)
[50249:50308](-)
Reference: None, start: ExactPosition(50249), end: ExactPosition(50308)
FP885876.1[59653:59912](+)
Reference: 'FP885876.1', start: ExactPosition(59653), end: ExactPosition(59912)

Compare this to the raw GenBank file entry,

...
     CDS             join(93403..93788,89326..89407,90830..91021,
                     complement(50250..50308),FP885876.1:59654..59912)
                     /gene="nad1"
...

In order to process the complex feature, you would need to check the .ref and match it to the externally referenced sequence, and then get the described region from the start/end values.

If you try cds_feature.extract(record.seq) here you get an explicit error (which perhaps should be a not yet implemented exception):

ValueError: Feature references another sequence.

The reason this isn't yet handled in Biopython directly is (a) it's very complicated, and (b) I had no reason to use it so couldn't justify spending work time on it. What I was thinking was extending the .extract(...) method to allow you to do cds_feature.extract(record.seq, dictionary_of_extra_sequences).

clalogiudice · 2016-04-12T15:55:34Z

Yes even I'm a novice with python I think this is very complicated. The only thing....
In the genbank full record in browser, just clicking on cds is possible to catch all the sequence reconstructed... that I can use as a ref.

clalogiudice · 2016-04-13T09:10:53Z

Peter,
just for curiosity and if you have time...
How do you manage this? KF754801 join{86890:87276, KF754803.1470127:470209, KF754803.1471593:471785, KF754803.12893473:2893532, KF754803.12504114:2504373}

I think there is another bug on NCBI

peterjc · 2016-04-13T10:50:49Z

I would expect this Genbank (Full) problem to affect all the NCBI pages with splicing between records, not just the one record we looked at. They have not replied to me yet - maybe you should email them too?

To expand on my outline/hint a bit more:

from Bio import SeqIO
record = SeqIO.read("FP885871.gbk", "gb")
cds_feature = record.features[2]
print(cds_feature.location)
print("This has %i parts" % len(cds_feature.location.parts))
for part in cds_feature.location.parts:
    print("\nReference: %r, start: %r, end: %r" % (part.ref, part.start, part.end))
    if part.ref:
        print("You must download the record %s" % part.ref)
        print("and then do part.extract(other_sequence) or similar!")
    else:
        print("Extracted from current record: %s" % part.extract(record.seq))

giving:

join{[93402:93788](+), [89325:89407](+), [90829:91021](+), [50249:50308](-), FP885876.1[59653:59912](+)}
This has 5 parts

Reference: None, start: ExactPosition(93402), end: ExactPosition(93788)
Extracted from current record: ACGTACATAGCTGTTCCAGCTGAAATACTTGGAATAATTCTACCACTTCTACTAGGAGTAGCCTTTTTAGTGCTAGCTGAACGTAAAGTAATGGCTTTTGTGCAACGTCGAAAGGGTCCTGATGTAGTGGGATCGTTCGGATTGTTACAACCTCTAGCAGATGGTTCGAAATTGATTCTAAAAGAACCTATTTCACCAAGTAGTGCTAATTTCTCCCTTTTTAGAATGGCTCCAGTCACTACATTTATGCTAAGTCTGGTTGCTCGGGCCGTTGTACCTTTTGATTATGGTATGGTATTGTCAGATCCGAACATAGGGCTACTTTATTTGTTTGCCATATCTTCGCTAGGTGTTTATGGAATTATTATAGCAGGTTGGTCTAGTAA

Reference: None, start: ExactPosition(89325), end: ExactPosition(89407)
Extracted from current record: TTCGAAATATGCCTTTCTAGGAGCATTACGATCTGCAGCTCAAATGGTCCCTTATGAAGTCTCTATTGGTCTTATTCTTATT

Reference: None, start: ExactPosition(90829), end: ExactPosition(91021)
Extracted from current record: ACTGTACTAATATGTGTAGGTCCTCGTAATTCGAGTGAGATTGTCATGGCGCAAAAGCAGATATGGTCCGGTATTCCCTTGTTCCCTGTATTGGTTATGTTCTTTATTTCTTGTTTAGCAGAAACTAATCGAGCTCCGTTTGATCTCCCAGAAGCGGAAGCTGAATTAGTTGCAGGCTATAATGTAGAATAT

Reference: None, start: ExactPosition(50249), end: ExactPosition(50308)
Extracted from current record: TCTTCAATGGGGTCTGCTCTTTTTTTTTTAGGAGAGTATGCCAATATGATCTTAATGAG

Reference: 'FP885876.1', start: ExactPosition(59653), end: ExactPosition(59912)
You must download the record FP885876.1
and then do part.extract(other_sequence) or similar!

I would do this in steps, first loop over all the features checking what extra sequences you need, download them, and index them with SeqIO.index(...) or SeqIO.index_db(...) so you can pull out the externally referenced sequence easily. Here's the first bit:

from Bio import SeqIO
record = SeqIO.read("FP885871.gbk", "gb")
external = set()
for feature in record.features:
    for part in feature.location.parts:
        if part.ref:
            external.add(part.ref)
print("%i external records are referenced:" % len(external))
for name in sorted(external):
    print(name)

In this case, just there is only one externally referenced record:

1 external records are referenced:
FP885876.1

clalogiudice · 2016-04-13T11:08:29Z

Yes I'll email them explaining the situation,
I hope they will solve in some manner also because I've found at least 12 records affected by this.
In particular with:
KF754801 join{86890:87276, KF754803.1470127:470209, KF754803.1471593:471785, KF754803.12893473:2893532, KF754803.12504114:2504373}

Clicking on cds it will extract only the sequence between 8691 and 87276...
Fasta format is bugged as you can see from:
http://www.ncbi.nlm.nih.gov/nuccore/KF754801

peterjc · 2016-04-13T17:03:09Z

I got a reply from the NCBI earlier today, the "full" record problem has been passed to their developers.

clalogiudice · 2016-04-13T17:19:25Z

Same here.
Biopython has a nice extract method, I hope to keep using it in the future, even I'm still not so able with dictionaries.

szhan · 2018-01-03T07:18:59Z

Hi, I think I am experiencing a related issue. There has been no follow-up? Thanks!

peterjc · 2018-01-03T11:40:46Z

The NCBI didn't get back to me, I have emailed them again to see where things stand.

Right now the nad1 example discussed earlier in FP885871.1 appears to still be broken :(

clalogiudice · 2018-01-04T15:49:47Z

Dear Peter, if it can be of help I would like to participate to solve the problem acting on biopython source.

Happy new year.

peterjc · 2019-11-08T12:20:00Z

I'm looking at some code from Adam Sjøgren to extend the extract method:
https://mailman.open-bio.org/pipermail/biopython/2019-November/016725.html

I've just been looking at the two trans-splicing examples from Amborella trichopoda between mtDNA I and III (accessions KF754803.1 and KF754801.1), i.e. proteins AHA47098.1 and AHA47124.1). Their annotation says there is RNA editing, so even while we can get a CDS and translate it, the exact protein does not match (even trying alternative genetic codes).

Likewise for the Beta vulgaris mtDNA example (FP885871 and FP885876), there are three trans-spliced proteins. CBJ20660.1 is close, but CBX33245.3 and CBJ23338.3 looks to be one amino acid longer. At least all three are annotated with the warning Protein sequence is in conflict with the conceptual translation.

I wonder where the annotated protein sequences came from - not clear. Anyway, sadly these do not make good test cases, nor are they ideal examples for documentation.

If the location refers to other records, those records can be supplied in an optional references dictionary, where the records will be looked up by the ref (key) and the value is expected to be the same type as the parent_sequence parameter (and thus the type extract() returns). Refs biopython#808

mdehoon · 2020-08-29T04:18:15Z

It looks like one of our test cases has the same issue:

>>> from Bio import SeqIO
>>> record = SeqIO.read("GenBank/one_of.gb", 'genbank')
>>> record[1:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/biopython-1.77-py3.8-macosx-10.9-x86_64.egg/Bio/SeqRecord.py", line 516, in __getitem__
    answer.features.append(f._shift(-start))
...
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/biopython-1.77-py3.8-macosx-10.9-x86_64.egg/Bio/SeqFeature.py", line 1015, in _shift
    raise ValueError("Feature references another sequence.")
ValueError: Feature references another sequence.

peterjc · 2020-08-29T13:41:05Z

@mdehoon that example is related, certainly. In that case rather than a ValueError the shift should probably be a noop here (since the shift in coordinates is in terms of the main reference sequence, not the externally referenced sequence).

I should probably deal with the merge conflicts and merge #2334 even without a compelling example for the documentaiton.

If the location refers to other records, those records can be supplied in an optional references dictionary, where the records will be looked up by the ref (key) and the value is expected to be the same type as the parent_sequence parameter (and thus the type extract() returns). Refs biopython#808

If the location refers to other records, those records can be supplied in an optional references dictionary, where the records will be looked up by the ref (key) and the value is expected to be the same type as the parent_sequence parameter (and thus the type extract() returns). Refs #808

peterjc · 2020-08-31T13:44:18Z

Fixed in #2334, thank you Adam.

peterjc mentioned this issue Jan 17, 2018

add translate method to Bio.SeqFeature #1509

Merged

peterjc mentioned this issue Nov 8, 2019

Handle feature references another sequence #2334

Merged

3 tasks

peterjc closed this as completed Aug 31, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to deal with "Feature references another sequence?" #808

How to deal with "Feature references another sequence?" #808

clalogiudice commented Apr 12, 2016

peterjc commented Apr 12, 2016 •

edited

clalogiudice commented Apr 12, 2016

peterjc commented Apr 12, 2016

clalogiudice commented Apr 12, 2016

peterjc commented Apr 12, 2016

clalogiudice commented Apr 12, 2016

peterjc commented Apr 12, 2016

clalogiudice commented Apr 12, 2016

peterjc commented Apr 12, 2016

peterjc commented Apr 12, 2016

clalogiudice commented Apr 12, 2016

clalogiudice commented Apr 13, 2016

peterjc commented Apr 13, 2016

clalogiudice commented Apr 13, 2016

peterjc commented Apr 13, 2016

clalogiudice commented Apr 13, 2016

szhan commented Jan 3, 2018

peterjc commented Jan 3, 2018 •

edited

clalogiudice commented Jan 4, 2018

peterjc commented Nov 8, 2019

mdehoon commented Aug 29, 2020

peterjc commented Aug 29, 2020

peterjc commented Aug 31, 2020

How to deal with "Feature references another sequence?" #808

How to deal with "Feature references another sequence?" #808

Comments

clalogiudice commented Apr 12, 2016

peterjc commented Apr 12, 2016 • edited

clalogiudice commented Apr 12, 2016

peterjc commented Apr 12, 2016

clalogiudice commented Apr 12, 2016

peterjc commented Apr 12, 2016

clalogiudice commented Apr 12, 2016

peterjc commented Apr 12, 2016

clalogiudice commented Apr 12, 2016

peterjc commented Apr 12, 2016

peterjc commented Apr 12, 2016

clalogiudice commented Apr 12, 2016

clalogiudice commented Apr 13, 2016

peterjc commented Apr 13, 2016

clalogiudice commented Apr 13, 2016

peterjc commented Apr 13, 2016

clalogiudice commented Apr 13, 2016

szhan commented Jan 3, 2018

peterjc commented Jan 3, 2018 • edited

clalogiudice commented Jan 4, 2018

peterjc commented Nov 8, 2019

mdehoon commented Aug 29, 2020

peterjc commented Aug 29, 2020

peterjc commented Aug 31, 2020

peterjc commented Apr 12, 2016 •

edited

peterjc commented Jan 3, 2018 •

edited