Circular permutations of sequences #2158

jrjhealey · 2019-07-02T16:27:52Z

Hi all,

This isn't a bug report, so apologies if this belongs elsewhere.

I've recently been asked about a task which requires some slightly advanced circular permutation manipulation.

The existing docs offer this 'manual' approach to shifting the origin:

Now, for an example with features, we’ll use a GenBank file. Suppose you have a circular genome:

>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=['Project:10638'])
>>> len(record)
9609
>>> len(record.features)
41
>>> record.dbxrefs
['Project:58037']
>>> record.annotations.keys()
['comment', 'sequence_version', 'source', 'taxonomy', 'keywords', 'references',
'accessions', 'data_file_division', 'date', 'organism', 'gi']
You can shift the origin like this:

>>> shifted = record[2000:] + record[:2000]
>>> shifted
SeqRecord(seq=Seq('GATACGCAGTCATATTTTTTACACAATTCTCTAATCCCGACAAGGTCGTAGGTC...GGA',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=[])
>>> len(shifted)
9609

I'm thinking that some of this behaviour might make sense to have a layer of abstraction, e.g. a function that rotates the sequence by a specified amount, and so on.

Since this isn't already a feature, I'd consider making this a pull request as it seems like something which could be generally useful. With that in mind, I wanted to get your input about how best to implement it such that it would have the most seamless integration. I will just add that I've not contributed to something as big as BioPython before so want to make sure I go about this sensibly (I'm reading the contributing guidlines and testing documentation presently).

My first thought is that a CircularSeqRecord class which inherits from SeqRecord might be the way to go, and then to override the relevant functions where the existing ones don't make sense for a circular sequence if necessary.

A constructor which handles the 'forced' carry over of all the annotations and cross references might also make sense.

Has anyone given this much thought before? I thought it would make sense to ask first before I lead myself down a blind alley, or to know whether this functionality isn't in BioPython for some very good reason!

Thanks,

Joe

PS. If there's a better place for discussing this, please let me know and I'll move it there.

The text was updated successfully, but these errors were encountered:

kblin · 2019-07-03T07:12:13Z

Wouldn't it make more sense to just make the SeqRecord run some slightly different logic for circular genomes? You could simply make your rotate function raise an exception when called for a linear chromosome and stick it on the SeqRecord.

We've been thinking about contributing some higher level "neighbourhood" functions to Biopython along a similar line. Basically a way to say "give me 20 kbp upstream and downstream of feature X", and then have the SeqRecord do the right thing also for circular genomes.

jrjhealey · 2019-07-03T07:28:51Z

I think that could well be a good way to go, though patching on to the SeqRecord as it currently exists struck me as possibly a drastic change.

I can foresee the possible need to redefine the reprs, str methods, and probably a bunch of other stuff too, which would mean that each method would then also need to test if the sequence is circular (or has been designated as circular), before returning its ‘linear’ behaviour or ‘circular’ behaviour accordingly. This would presumably lead to redefining a bunch of the existing seqrecord tests too? I’m happy to be wrong though!

That kind of neighbourhood functionality is exactly what I had in mind though, just to hide the string slicing and subsequence logic away from the end user.

peterjc · 2019-07-03T08:32:52Z

I think a whole new class is overkill, there are a lot of special cases to support. What I did ponder was a .roll() method which would work like this:

    def roll(self, shift):
        """Roll the sequence assuming it is circular."""
        # TODO - which direction should be positive?
        # TODO - bounds checking, or apply modulo arithmetic
        # TODO - can we preserve more annotation than this does?
        # Specifically if the cut is mid-feature, turn the feature into a join
        # (or in the case of a source feature, it can probably be reused as is)
        return self[:shift] + self[shift:]

However, provided you avoid cutting in the middle of a feature, the documented cut-and-add approach is still short, perhaps even clearer?

peterjc · 2019-07-03T09:06:11Z

Breaking features at the new origin seems fairly well defined, just use the join functionality (in INSDC terminology, the CompoundLocation object in Biopython). However, I can think of two corner cases immediately where that default processing would be less than ideal.

First, origin spanning features like join(Y..N,1..X) on a sequence of length N, if the shift is large enough they could become a single contiguous region and expressed as A..B rather than a join of two touching regions. I think we'd need special case code that checks to see if we have abutting pieces at the shift point aka old origin, e.g. anything like join(A..X,Y,..B) where X and Y=X+1 is the old origin break / shift value applied becomes A..B instead.

Second, the special case of the source feature usually 1..N on a sequence of length N, should stay as 1..N rather than becoming something like join(Y..N,1..X) where X and Y=X+1 is slice point. However, chimeric source features are far more tricky - here we're sort of overlapping with the case above, but not quite.

You would need a comprehensive set of test cases as part of this work.

As to the other annotation, per-letter-annotation is simple (e.g. a circular contig in FASTQ format), but the others like id, name, description, features, etc need some thought. Probably the same approach as the . reverse_complement(...) method would be best - although perhaps preserving everything except the name and ID by default?

https://github.com/biopython/biopython/blob/biopython-173/Bio/SeqRecord.py#L1002

When I was looking at this, I probably convinced myself this was complicated enough that I didn't have a clear day or two spare to do it properly ;)

Also, one of the main reasons to apply an origin shift and roll the sequence is to fix an origin spanning feature, so careful choice of the shift value avoids this - thus the documented approach is usually enough.

chris-rands · 2019-07-03T09:21:43Z

If you apply the .roll() method within the Seq class rather than SeqRecord it would avoid these complications, but might not be worthwhile.

The .roll() method reminds me of deque.rotate()

jrjhealey · 2019-07-03T09:23:36Z

I think the source feature would indeed be fine to (re)define as necessary as simply 1:len(seq) regardless of how the sequence has been permuted. I'm not too familiar with chimeric sources though, do you have an example to illustrate?

Another option might be to 'preserve the history' of any rolled sequences by adding an additional per-letter annotation for the original source 0th character during the roll() process? This doesn't particularly matter to me for the functionality I have in mind for my own needs, but I can imagine this might end up being useful to people if they want to restore the sequence (and therefore source) relative to the original sequence.

For preserving the annotations, I think preserving everything 'as is' but with altered coordinates, similar to slicing out a subrecord, makes the most sense to me (as an end user I'd prefer not to delve in to manually reconstructing the annotation). This seems to be what the plasmid viewer SnapGene does. The only issue, unless I'm missing something, would be how to label a feature which spans the new join. Copying the Truthy/Falsey behaviour of .reverse_complement looks like it could make sense :)

Dillon Barker pointed out to me that the collections.deque object has a .rotate() method. I would also perhaps propose sticking with this nomenclature too? We may also be able to borrow from it to answer the question of which direction should be positive etc.

For reference, this is the deque.rotate() method in pure python from PyPy:

    def rotate(self, n=1):
        length = len(self)
        if length <= 1:
            return
        halflen = length >> 1
        if n > halflen or n < -halflen:
            n %= length
            if n > halflen:
                n -= length
            elif n < -halflen:
                n += length
        while n > 0:
            self.appendleft(self.pop())
            n -= 1
        while n < 0:
            self.append(self.popleft())
            n += 1

EDIT: I see that Chris has pointed out deque.rotate too.

peterjc · 2019-07-03T11:54:00Z

Two votes for .rotate, which also makes sense for a circle - so I'm OK with that.

As to chimeras with complex source features, I wrote about some examples here:
https://blastedbio.blogspot.com/2013/11/entrez-trouble-with-chimeras.html

That brings up another potentially problematic feature location type which would require consideration (order(...), used in place of join(...) where the order is not known or not important).

BjornFJohansson · 2020-05-08T14:40:52Z

Hi all, Maybe I am late for the party, but the pydna Dseqrecord class has a shifted method for circular sequences. The Dseqrecord is a derivative of the Seqrecord for double stranded sequences. Perhaps the code could be of use or inspiration
https://github.com/BjornFJohansson/pydna/blob/master/pydna/dseqrecord.py

fabianegli · 2021-06-10T10:36:02Z

After reading this thread, I am wondering what use cases should be served and if the SeqRecord class is the right place to implement a rotation feature. Concrete use cases would definitely help to find out what would be most useful.

In general, I think the SeqRecord should serve as a ground truth and any mutation/extraction/annotation should be something separate, either a function or a class containing the SeqRecord and (potentially) returning a new on.

What I think would be valuable additions to SeqRecord are the following:

attribute that says it is a circular sequence (this might already be the case)
ability to extract features that span the origin (if the sequence is circular)

The reason I would argue against a rotate method for SeqRecord is that for me it is not clear if the record id and name should remain the same after the rotation and/or other sequence manipulations or if we are talking about something new. Do you want rotate to change the object or should it just provide a different view point?

peterjc · 2021-06-11T12:22:26Z

The only use case I've come up with is to apply an origin shift and roll the sequence is to fix an origin spanning feature (either during assembly and annotation, or in visualisation of someone else's published genome). Here careful choice of the shift value avoids most of the issues and the simple approach is usually enough: record[:shift] + record[shift:]

The GenBank parser already records the circular topology in the SeqRecord. The SeqFeature extract method already supports origin spanning locations.

As to recording naming etc, I'd expect any .roll(...) method to adopt something like the .reverse_complete(...) for controlling which annotation properties would be kept by default https://github.com/biopython/biopython/blob/biopython-179/Bio/SeqRecord.py#L1051

seanrjohnson · 2023-02-02T22:55:04Z

We've been thinking about contributing some higher level "neighbourhood" functions to Biopython along a similar line. Basically a way to say "give me 20 kbp upstream and downstream of feature X", and then have the SeqRecord do the right thing also for circular genomes.

@kblin did you ever do anything like this? I have code (that I will hopefully open source later this year) that does neighborhood extraction (and lots of other useful things) on linear contigs. I'm currently trying to patch it to handle origin-spanning annotations and neighborhoods that cross the origin. The comments on this page have been helpful so far, but I still have a ways to go.

BjornFJohansson · 2023-02-03T12:00:16Z

@seanrjohnson I did implement this into pydna, that sits on top of biopython.
I reproduced the example at the start of this thread:

# Now, for an example with features, we’ll use a GenBank file. 
# Suppose you have a circular genome:

from pydna.genbank import genbank

record = genbank("NC_005816.1")

print(record.seq[:30])

# TGTAACGAACGGTGCAATAGTGATCCACAC

assert len(record) == 9609

assert record.circular

assert record.description == 'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'

assert record.dbxrefs==['BioProject:PRJNA224116', 'BioSample:SAMN02602970',  'Assembly:GCF_000007885.1']

assert len(record.features) == 19

shifted_record = record.shifted(2000)

assert len(shifted_record.features) == 19

print(shifted_record.seq[:30])

# GATACGCAGTCATATTTTTTACACAATTCT

assert sorted(ft.extract(record).seq for ft in record.features) == sorted(ft.extract(shifted_record).seq for ft in shifted_record.features)

https://gist.github.com/BjornFJohansson/d334dc74cdc79b203acc8283f62327c4

seanrjohnson · 2023-02-03T19:48:28Z

@BjornFJohansson
I found the code for shifted in the pydna repository. It is very helpful. Thanks!

It would also be great to have something like this integrated into Biopython.

kblin · 2023-02-03T21:38:01Z

I think my colleague @SJShaw built something for us, but I don't think it hit mainly yet.

SJShaw · 2023-02-08T12:11:30Z

I think my colleague @SJShaw built something for us, but I don't think it hit mainly yet.

There's two circular-related things I've written, one is a simple rotate script that just cuts the record in two and reassembles them in reverse order (with some optional padding), adjusting all the features/locations as appropriate. It's simple enough that it doesn't cover the case of merging as was mentioned much further above.

The other builds on antiSMASH's secmet.Record that allows for locations to be extended, while being aware of circularity, then supports all the normal functionality like fetching features within a location and so on. I think it's nearly ready, but I want to do some more thorough testing of it. Once it's ready I'm happy to link to it in here, in case people find it of use.

gatoniel · 2024-02-07T17:08:07Z

@SJShaw have you finished the code? I would be intereseted in a ready to use .roll() function that also handles features wrapping the origin in the original sequence but also in the rolled one.

BjornFJohansson · 2024-02-08T07:21:36Z

@gatoniel This was implemented in pydna with lots of unit tests. There is a method for the dseqrecord.shifted method and also shift_feature and shift_location functions in pydna.utils.

SJShaw · 2024-02-08T08:59:16Z

@SJShaw have you finished the code? I would be intereseted in a ready to use .roll() function that also handles features wrapping the origin in the original sequence but also in the rolled one.

The script to rotate a file is here, though it's fairly trivial to adjust it to use an in-memory SeqRecord: https://gist.github.com/SJShaw/30df7a6b7551a219a0f8779702a425d4

The antiSMASH implementation won't have a .roll() as such, it just supports cross-origin features and extractions and allows for circular aware extensions of locations, as antiSMASH doesn't (and shouldn't) ever change the input sequence itself.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Circular permutations of sequences #2158

Circular permutations of sequences #2158

jrjhealey commented Jul 2, 2019

kblin commented Jul 3, 2019

jrjhealey commented Jul 3, 2019 •

edited

peterjc commented Jul 3, 2019

peterjc commented Jul 3, 2019

chris-rands commented Jul 3, 2019

jrjhealey commented Jul 3, 2019 •

edited

peterjc commented Jul 3, 2019

BjornFJohansson commented May 8, 2020 •

edited

fabianegli commented Jun 10, 2021

peterjc commented Jun 11, 2021

seanrjohnson commented Feb 2, 2023 •

edited

BjornFJohansson commented Feb 3, 2023 •

edited

seanrjohnson commented Feb 3, 2023

kblin commented Feb 3, 2023

SJShaw commented Feb 8, 2023

gatoniel commented Feb 7, 2024

BjornFJohansson commented Feb 8, 2024

SJShaw commented Feb 8, 2024

Circular permutations of sequences #2158

Circular permutations of sequences #2158

Comments

jrjhealey commented Jul 2, 2019

kblin commented Jul 3, 2019

jrjhealey commented Jul 3, 2019 • edited

peterjc commented Jul 3, 2019

peterjc commented Jul 3, 2019

chris-rands commented Jul 3, 2019

jrjhealey commented Jul 3, 2019 • edited

peterjc commented Jul 3, 2019

BjornFJohansson commented May 8, 2020 • edited

fabianegli commented Jun 10, 2021

peterjc commented Jun 11, 2021

seanrjohnson commented Feb 2, 2023 • edited

BjornFJohansson commented Feb 3, 2023 • edited

seanrjohnson commented Feb 3, 2023

kblin commented Feb 3, 2023

SJShaw commented Feb 8, 2023

gatoniel commented Feb 7, 2024

BjornFJohansson commented Feb 8, 2024

SJShaw commented Feb 8, 2024

jrjhealey commented Jul 3, 2019 •

edited

jrjhealey commented Jul 3, 2019 •

edited

BjornFJohansson commented May 8, 2020 •

edited

seanrjohnson commented Feb 2, 2023 •

edited

BjornFJohansson commented Feb 3, 2023 •

edited