Circular permutations of sequences #2158
Comments
Wouldn't it make more sense to just make the We've been thinking about contributing some higher level "neighbourhood" functions to Biopython along a similar line. Basically a way to say "give me 20 kbp upstream and downstream of feature X", and then have the |
I think that could well be a good way to go, though patching on to the SeqRecord as it currently exists struck me as possibly a drastic change. I can foresee the possible need to redefine the That kind of neighbourhood functionality is exactly what I had in mind though, just to hide the string slicing and subsequence logic away from the end user. |
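As a rough illustration of the "neighbourhood" idea discussed above, here is a minimal sketch of a window extraction that wraps around the origin. The helper name `circular_window` and its parameters are hypothetical, not part of Biopython. It assumes 0-based, end-exclusive coordinates, `start <= end`, and an object that supports `len()`, slicing and `+` (a plain string is used here).

```python
def circular_window(seq, start, end, flank):
    """Return the feature at seq[start:end] plus `flank` letters on each side.

    Hypothetical helper: the sequence is treated as circular, so the
    window may wrap around the origin.
    """
    n = len(seq)
    if n == 0:
        return seq
    # Never return more than one full turn of the circle.
    length = min((end - start) + 2 * flank, n)
    first = (start - flank) % n
    # Doubling the sequence lets an ordinary slice cross the origin.
    doubled = seq + seq
    return doubled[first:first + length]


window = circular_window("ABCDEFGHIJ", 1, 3, 2)  # feature "BC", 2 letters each side
```

Doubling the sequence keeps the slicing logic trivial at the cost of a temporary copy; for very long records, two separate slices would avoid that.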
I think a whole new class is overkill, there are a lot of special cases to support. What I did ponder was a method along these lines:

```python
def roll(self, shift):
    """Roll the sequence assuming it is circular."""
    # TODO - which direction should be positive?
    # TODO - bounds checking, or apply modulo arithmetic
    # TODO - can we preserve more annotation than this does?
    # Specifically if the cut is mid-feature, turn the feature into a join
    # (or in the case of a source feature, it can probably be reused as is)
    # Cut at `shift` and swap the two halves so the cut becomes the new origin.
    return self[shift:] + self[:shift]
```

However, provided you avoid cutting in the middle of a feature, the documented cut-and-add approach is still short, perhaps even clearer? |
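One possible way to settle the "bounds checking, or apply modulo arithmetic" TODO is sketched below as a standalone function on a plain string. This is an assumption about how it could be done, not the approach Biopython settled on. The direction convention is also just one choice: a positive `shift` makes the letter at index `shift` the new first letter.

```python
def roll(seq, shift):
    """Return `seq` restarted at index `shift`, treating it as circular."""
    if len(seq) == 0:
        return seq
    # Modulo arithmetic folds negative and oversized shifts into range.
    shift %= len(seq)
    return seq[shift:] + seq[:shift]


forward = roll("ABCDE", 2)    # new origin at the old index 2
backward = roll("ABCDE", -1)  # equivalent to a shift of 4
wrapped = roll("ABCDE", 7)    # equivalent to a shift of 2
```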
Breaking features at the new origin seems fairly well defined, just use the join functionality (in INSDC terminology, the First, origin spanning features like Second, the special case of the source feature usually You would need a comprehensive set of test cases as part of this work. As to the other annotation, per-letter-annotation is simple (e.g. a circular contig in FASTQ format), but the others like id, name, description, features, etc need some thought. Probably the same approach as the https://github.com/biopython/biopython/blob/biopython-173/Bio/SeqRecord.py#L1002 When I was looking at this, I probably convinced myself this was complicated enough that I didn't have a clear day or two spare to do it properly ;) Also, one of the main reasons to apply an origin shift and roll the sequence is to fix an origin spanning feature, so careful choice of the shift value avoids this - thus the documented approach is usually enough. |
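To make "breaking features at the new origin" concrete, here is a minimal sketch of the coordinate arithmetic using plain `(start, end)` tuples. The function name `shift_interval` is hypothetical, and the sketch assumes a simple forward-strand, 0-based, end-exclusive interval that does not already span the origin. It ignores strands, fuzzy positions and existing compound locations, all of which a real implementation would need test cases for.

```python
def shift_interval(start, end, shift, length):
    """Map an interval onto a circular sequence rolled by `shift`.

    Returns a list of (start, end) parts: one part normally, or two
    parts (to be joined in order) when the interval spans the new origin.
    """
    shift %= length
    new_start = (start - shift) % length
    new_end = new_start + (end - start)
    if new_end <= length:
        return [(new_start, new_end)]
    # The feature now wraps: its first part sits at the end of the
    # rolled sequence and the remainder at the beginning.
    return [(new_start, length), (0, new_end - length)]


seq = "ABCDEFGHIJ"
rolled = seq[4:] + seq[:4]              # origin moved into the feature seq[2:6]
parts = shift_interval(2, 6, 4, len(seq))
rejoined = "".join(rolled[s:e] for s, e in parts)
```

Joining the parts in order recovers the original feature sequence, which is the property a join-style location is meant to preserve.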
If you apply the The |
I think the Another option might be to 'preserve the history' of any rolled sequences by adding an additional per-letter annotation for the original source 0th character during the For preserving the annotations, I think preserving everything 'as is' but with altered coordinates, similar to slicing out a subrecord, makes the most sense to me (as an end user I'd prefer not to delve into manually reconstructing the annotation). This seems to be what the plasmid viewer SnapGene does. The only issue, unless I'm missing something, would be how to label a feature which spans the new join. Copying the Truthy/Falsey behaviour of Dillon Barker pointed out to me that the For reference, this is the

```python
def rotate(self, n=1):
    length = len(self)
    if length <= 1:
        return
    halflen = length >> 1
    # Reduce n so that at most half the items are moved.
    if n > halflen or n < -halflen:
        n %= length
        if n > halflen:
            n -= length
        elif n < -halflen:
            n += length
    # Positive n: move items from the right end to the left end.
    while n > 0:
        self.appendleft(self.pop())
        n -= 1
    # Negative n: move items from the left end to the right end.
    while n < 0:
        self.append(self.popleft())
        n += 1
```

EDIT: I see that Chris has pointed out |
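The quoted code uses `appendleft`, `pop`, `append` and `popleft`, which matches the interface of Python's `collections.deque`. Assuming that is the reference intended, the short example below shows the direction convention of `deque.rotate`, which may be worth mirroring (or deliberately not mirroring) in any sequence-level method.

```python
from collections import deque

seq = "ABCDE"

d = deque(seq)
d.rotate(1)  # positive n rotates right: the last item wraps to the front
rotated_right = "".join(d)  # "EABCD"

d = deque(seq)
d.rotate(-2)  # negative n rotates left, like seq[2:] + seq[:2]
rotated_left = "".join(d)  # "CDEAB"
```

Note that with this convention a positive argument moves the origin backwards, which is the opposite sign to a cut-and-add of the form `seq[shift:] + seq[:shift]`.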
Two votes for As to chimeras with complex source features, I wrote about some examples here: That brings up another potentially problematic feature location type which would require consideration ( |
Hi all, Maybe I am late for the party, but the |
After reading this thread, I am wondering what use cases should be served and if the SeqRecord class is the right place to implement a rotation feature. Concrete use cases would definitely help to find out what would be most useful. In general, I think the SeqRecord should serve as a ground truth and any mutation/extraction/annotation should be something separate, either a function or a class containing the SeqRecord and (potentially) returning a new one. What I think would be valuable additions to SeqRecord are the following:
The reason I would argue against a rotate method for SeqRecord is that for me it is not clear if the record id and name should remain the same after the rotation and/or other sequence manipulations or if we are talking about something new. Do you want rotate to change the object or should it just provide a different view point? |
The only use case I've come up with is to apply an origin shift and roll the sequence to fix an origin-spanning feature (either during assembly and annotation, or in visualisation of someone else's published genome). Here careful choice of the shift value avoids most of the issues and the simple approach is usually enough: The GenBank parser already records the circular topology in the SeqRecord. The SeqFeature extract method already supports origin-spanning locations. As to recording naming etc, I'd expect any |
@kblin did you ever do anything like this? I have code (that I will hopefully open source later this year) that does neighborhood extraction (and lots of other useful things) on linear contigs. I'm currently trying to patch it to handle origin-spanning annotations and neighborhoods that cross the origin. The comments on this page have been helpful so far, but I still have a ways to go. |
@seanrjohnson I did implement this in pydna, which sits on top of Biopython.
https://gist.github.com/BjornFJohansson/d334dc74cdc79b203acc8283f62327c4 |
@BjornFJohansson It would also be great to have something like this integrated into Biopython. |
I think my colleague @SJShaw built something for us, but I don't think it hit mainline yet. |
There are two circular-related things I've written. One is a simple rotate script that just cuts the record in two and reassembles the pieces in reverse order (with some optional padding), adjusting all the features/locations as appropriate. It's simple enough that it doesn't cover the case of merging as was mentioned much further above. The other builds on antiSMASH's |
@SJShaw have you finished the code? I would be interested in a ready to use |
@gatoniel This was implemented in pydna with lots of unit tests. There is the dseqrecord.shifted method, and also the shift_feature and shift_location functions in pydna.utils. |
The script to rotate a file is here, though it's fairly trivial to adjust it to use an in-memory SeqRecord: https://gist.github.com/SJShaw/30df7a6b7551a219a0f8779702a425d4 The antiSMASH implementation won't have a |
Hi all,
This isn't a bug report, so apologies if this belongs elsewhere.
I've recently been asked about a task which requires some slightly advanced circular permutation manipulation.
The existing docs offer this 'manual' approach to shifting the origin:
I'm thinking it might make sense to put some of this behaviour behind a layer of abstraction, e.g. a function that rotates the sequence by a specified amount, and so on.
Since this isn't already a feature, I'd consider making this a pull request as it seems like something which could be generally useful. With that in mind, I wanted to get your input about how best to implement it such that it would have the most seamless integration. I will just add that I've not contributed to something as big as BioPython before, so I want to make sure I go about this sensibly (I'm reading the contributing guidelines and testing documentation presently).
My first thought is that a CircularSeqRecord class which inherits from SeqRecord might be the way to go, and then to override the relevant functions where the existing ones don't make sense for a circular sequence if necessary. A constructor which handles the 'forced' carry over of all the annotations and cross references might also make sense.
Has anyone given this much thought before? I thought it would make sense to ask first before I lead myself down a blind alley, or to know whether this functionality isn't in BioPython for some very good reason!
Thanks,
Joe
PS. If there's a better place for discussing this, please let me know and I'll move it there.