How to deal with "Feature references another sequence?" #808
If you are hoping to use the location information to extract the sequence, Biopython doesn't do that (yet). One idea was to extend the However, it may be there is a better way to get the data you want from existing online resources? Update - To be explicit, you'd get a ValueError message of |
Dear Peterjc, thanks for your kind reply:
My problem is how to manage transplicing events in this case.... Edit: I added the triple back-tick characters to display the Python code example properly on GitHub. Peter. |
You've picked a hard problem (trans-splicing). The easiest way, as I said, would probably be to find the data elsewhere online - perhaps via the NCBI gene/protein databases? With trans-splicing you must not just use the overall location start and end values - they won't give you what you want. This snippet of code might help:
What's the accession / URL of the example record you are looking at? It would be easier for me to work from that. |
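As an illustration of why the overall start and end values are misleading for a trans-spliced feature, here is a minimal sketch that builds a compound location by hand and walks over its parts. It is not the snippet from the original comment. Only the external exon `FP885876.1:59654..59912` comes from this thread; the two local exon coordinates are made up for the example.

```python
from Bio.SeqFeature import CompoundLocation, FeatureLocation

# Hypothetical trans-spliced location. The first part points at another
# record (coordinates taken from the thread); the two local parts are
# invented for illustration only.
location = CompoundLocation(
    [
        FeatureLocation(59653, 59912, strand=+1, ref="FP885876.1"),
        FeatureLocation(8690, 8772, strand=-1),
        FeatureLocation(87000, 87276, strand=+1),
    ]
)

# Each part carries its own coordinates, strand and (optional) reference.
for part in location.parts:
    source = part.ref if part.ref else "this record"
    print(source, int(part.start), int(part.end), part.strand)

# The overall start/end are just the minimum and maximum over all parts,
# mixing coordinates from two different records.
span = int(location.end) - int(location.start)

# The spliced length is the sum of the individual part lengths.
exon_total = sum(len(part) for part in location.parts)

print("overall span:", span, "spliced length:", exon_total)
```

The span covers tens of thousands of bases while the exons add up to a few hundred, which is why slicing the parent sequence from start to end does not give the CDS.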
Dear Peter, |
Thanks for giving the specific example. Unfortunately I have some bad news. First, look at http://www.ncbi.nlm.nih.gov/nuccore/FP885871.1, which is the default "GenBank" view:
Now toggle the view from "GenBank" to "GenBank (full)", which gives http://www.ncbi.nlm.nih.gov/nuccore/FP885871.1?report=gbwithparts&log$=seqview
Notice the problematic bit of the location I reported a very similar NCBI bug once before: http://blastedbio.blogspot.co.uk/2012/03/missing-external-exons-in-genbank-with.html So be very careful if you ever use the "GenBank (full)" view. |
Dear Peter, first of all thank you.
Is it possible to download the full GenBank record? |
In Entrez you can replace However, since there seems to be a problem with the NCBI script which currently generates the "full" GenBank files, I would avoid this for now. You would be safer using the plain GenBank file instead. |
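For readers who want to see what the two download variants look like, here is a small sketch that only builds the NCBI E-utilities `efetch` URLs and does not contact the network. It assumes the plain record corresponds to `rettype=gb` and the "full" one to `rettype=gbwithparts` (the same name that appears in the `report=gbwithparts` web URL above); the helper name `efetch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb"):
    """Build an efetch URL for a nucleotide record (hypothetical helper)."""
    query = urlencode(
        {
            "db": "nuccore",
            "id": accession,
            "rettype": rettype,  # "gb" = plain, "gbwithparts" = full (assumed)
            "retmode": "text",
        }
    )
    return f"{EFETCH}?{query}"


plain_url = efetch_url("FP885871.1")
full_url = efetch_url("FP885871.1", rettype="gbwithparts")

print(plain_url)
print(full_url)
```

Given the problem described above with the "full" files, the plain `rettype=gb` form is the safer one to use.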
Dear Peter, |
For reference, you can see the same NCBI GenBank "full" or
Versus:
Notice the external exon FP885876.1:59654..59912 is missing for I have reported this to the NCBI by email. |
Getting back to your original question, I downloaded this example file at the command line - you could do this with
Now, does this code help?
You should get:
Compare this to the raw GenBank file entry,
In order to process the complex feature, you would need to check the If you try
The reason this isn't yet handled in Biopython directly is (a) it's very complicated, and (b) I had no reason to use it so couldn't justify spending work time on it. What I was thinking was extending the |
Yes even I'm a novice with python I think this is very complicated. The only thing.... |
Peter, I think there is another bug on NCBI |
I would expect this Genbank (Full) problem to affect all the NCBI pages with splicing between records, not just the one record we looked at. They have not replied to me yet - maybe you should email them too? To expand on my outline/hint a bit more:
giving:
I would do this in steps, first loop over all the features checking what extra sequences you need, download them, and index them with
In this case, there is only one externally referenced record:
|
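The first step outlined above (loop over all the features and note which extra sequences are needed) could look roughly like the sketch below. It uses a small hand-built record rather than the real GenBank file, so the coordinates are toy values; the function name `external_references` is hypothetical. Downloading and indexing the referenced records is left out.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord


def external_references(record):
    """Return the set of accessions referenced by any feature location."""
    refs = set()
    for feature in record.features:
        if feature.location is None:
            continue
        # Simple locations expose a one-element .parts list, so the same
        # loop handles both simple and compound (join) locations.
        for part in feature.location.parts:
            if part.ref:
                refs.add(part.ref)
    return refs


# Toy stand-in for the parsed record; coordinates are illustrative only.
record = SeqRecord(Seq("ACGT" * 25), id="FP885871.1")
record.features.append(
    SeqFeature(location=FeatureLocation(0, 30, strand=+1), type="CDS")
)
record.features.append(
    SeqFeature(
        location=CompoundLocation(
            [
                FeatureLocation(10, 20, strand=+1, ref="FP885876.1"),
                FeatureLocation(40, 60, strand=+1),
            ]
        ),
        type="CDS",
    )
)

needed = external_references(record)
print(sorted(needed))
```

With a real file the record would come from `SeqIO.read(..., "genbank")` instead of being built by hand.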
Yes I'll email them explaining the situation, Clicking on cds it will extract only the sequence between 8691 and 87276... |
I got a reply from the NCBI earlier today, the "full" record problem has been passed to their developers. |
Same here. |
Hi, I think I am experiencing a related issue. There has been no follow-up? Thanks! |
The NCBI didn't get back to me, I have emailed them again to see where things stand. Right now the nad1 example discussed earlier in FP885871.1 appears to still be broken :( |
Dear Peter, if it would help, I would like to take part in solving the problem by working on the Biopython source. Happy new year. |
I'm looking at some code from Adam Sjøgren to extend the extract method: I've just been looking at the two trans-splicing examples from Amborella trichopoda between mtDNA I and III (accessions KF754803.1 and KF754801.1), i.e. proteins AHA47098.1 and AHA47124.1. Their annotation says there is RNA editing, so even though we can get a CDS and translate it, the exact protein does not match (even trying alternative genetic codes). Likewise for the Beta vulgaris mtDNA example (FP885871 and FP885876), there are three trans-spliced proteins. CBJ20660.1 is close, but CBX33245.3 and CBJ23338.3 look to be one amino acid longer. At least all three are annotated with the warning "Protein sequence is in conflict with the conceptual translation". I wonder where the annotated protein sequences came from - it is not clear. Anyway, sadly these do not make good test cases, nor are they ideal examples for documentation. |
If the location refers to other records, those records can be supplied in an optional references dictionary, where the records will be looked up by the ref (key) and the value is expected to be the same type as the parent_sequence parameter (and thus the type extract() returns). Refs biopython#808
It looks like one of our test cases has the same issue:
>>> from Bio import SeqIO
>>> record = SeqIO.read("GenBank/one_of.gb", 'genbank')
>>> record[1:]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/biopython-1.77-py3.8-macosx-10.9-x86_64.egg/Bio/SeqRecord.py", line 516, in __getitem__
answer.features.append(f._shift(-start))
...
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/biopython-1.77-py3.8-macosx-10.9-x86_64.egg/Bio/SeqFeature.py", line 1015, in _shift
raise ValueError("Feature references another sequence.")
ValueError: Feature references another sequence. |
@mdehoon that example is related, certainly. In that case, rather than raising a ValueError, the shift should probably be a no-op here (since the shift in coordinates is in terms of the main reference sequence, not the externally referenced sequence). I should probably deal with the merge conflicts and merge #2334 even without a compelling example for the documentation. |
Fixed in #2334, thank you Adam. |
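For anyone landing here later, this is a minimal sketch of how the `references` dictionary described in the commit message can be used. It assumes a Biopython release that includes #2334 (newer than the 1.77 shown in the traceback above). The sequences and the accession `OTHER.1` are toy values invented for the example, not real data.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, FeatureLocation, SeqFeature

# Toy sequences standing in for the main record and the external record.
parent = Seq("AAAACCCCGGGGTTTT")
other = Seq("TTTTTGATTACATTTTT")

# A two-part join: the first exon lives on the external record, the
# second on the parent. "OTHER.1" is a made-up accession.
feature = SeqFeature(
    location=CompoundLocation(
        [
            FeatureLocation(5, 12, strand=+1, ref="OTHER.1"),
            FeatureLocation(4, 8, strand=+1),
        ]
    ),
    type="CDS",
)

# Without the references dictionary the external part cannot be resolved.
try:
    feature.extract(parent)
    failed_without_refs = False
except ValueError as err:
    failed_without_refs = True
    print("Without references:", err)

# Keys are the ref values used in the location; values should be the same
# type as the parent sequence (here Seq objects).
spliced = feature.extract(parent, references={"OTHER.1": other})
print(spliced)
```

In practice the dictionary would be filled with the sequences of the externally referenced records after downloading them.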
Hi to all,
I'm extracting CDS with biopython but I'm experiencing this problem on some sequences like this:
I'm fairly sure that references to other sequences are not handled (yet) by Biopython.
Any suggestions for working around this?
Thanks in advance.
Edit: I added the triple back-tick characters to display the Biopython compound location string properly; Peter