Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Circular permutations of sequences #2158

Open
jrjhealey opened this issue Jul 2, 2019 · 18 comments
Open

Circular permutations of sequences #2158

jrjhealey opened this issue Jul 2, 2019 · 18 comments

Comments

@jrjhealey
Copy link

Hi all,

This isn't a bug report, so apologies if this belongs elsewhere.

I've recently been asked about a task which requires some slightly advanced circular permutation manipulation.

The existing docs offer this 'manual' approach to shifting the origin:

Now, for an example with features, well use a GenBank file. Suppose you have a circular genome:

>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=['Project:10638'])
>>> len(record)
9609
>>> len(record.features)
41
>>> record.dbxrefs
['Project:58037']
>>> record.annotations.keys()
['comment', 'sequence_version', 'source', 'taxonomy', 'keywords', 'references',
'accessions', 'data_file_division', 'date', 'organism', 'gi']
You can shift the origin like this:

>>> shifted = record[2000:] + record[:2000]
>>> shifted
SeqRecord(seq=Seq('GATACGCAGTCATATTTTTTACACAATTCTCTAATCCCGACAAGGTCGTAGGTC...GGA',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=[])
>>> len(shifted)
9609

I'm thinking that some of this behaviour might make sense to have a layer of abstraction, e.g. a function that rotates the sequence by a specified amount, and so on.

Since this isn't already a feature, I'd consider making this a pull request as it seems like something which could be generally useful. With that in mind, I wanted to get your input about how best to implement it such that it would have the most seamless integration. I will just add that I've not contributed to something as big as BioPython before so want to make sure I go about this sensibly (I'm reading the contributing guidlines and testing documentation presently).

My first thought is that a CircularSeqRecord class which inherits from SeqRecord might be the way to go, and then to override the relevant functions where the existing ones don't make sense for a circular sequence if necessary.

A constructor which handles the 'forced' carry over of all the annotations and cross references might also make sense.

Has anyone given this much thought before? I thought it would make sense to ask first before I lead myself down a blind alley, or to know whether this functionality isn't in BioPython for some very good reason!

Thanks,

Joe

PS. If there's a better place for discussing this, please let me know and I'll move it there.

@kblin
Copy link
Contributor

kblin commented Jul 3, 2019

Wouldn't it make more sense to just make the SeqRecord run some slightly different logic for circular genomes? You could simply make your rotate function raise an exception when called for a linear chromosome and stick it on the SeqRecord.

We've been thinking about contributing some higher level "neighbourhood" functions to Biopython along a similar line. Basically a way to say "give me 20 kbp upstream and downstream of feature X", and then have the SeqRecord do the right thing also for circular genomes.

@jrjhealey
Copy link
Author

jrjhealey commented Jul 3, 2019

I think that could well be a good way to go, though patching on to the SeqRecord as it currently exists struck me as possibly a drastic change.

I can foresee the possible need to redefine the reprs, str methods, and probably a bunch of other stuff too, which would mean that each method would then also need to test if the sequence is circular (or has been designated as circular), before returning its ‘linear’ behaviour or ‘circular’ behaviour accordingly. This would presumably lead to redefining a bunch of the existing seqrecord tests too? I’m happy to be wrong though!

That kind of neighbourhood functionality is exactly what I had in mind though, just to hide the string slicing and subsequence logic away from the end user.

@peterjc
Copy link
Member

peterjc commented Jul 3, 2019

I think a whole new class is overkill, there are a lot of special cases to support. What I did ponder was a .roll() method which would work like this:

    def roll(self, shift):
        """Roll the sequence assuming it is circular."""
        # TODO - which direction should be positive?
        # TODO - bounds checking, or apply modulo arithmetic
        # TODO - can we preserve more annotation than this does?
        # Specifically if the cut is mid-feature, turn the feature into a join
        # (or in the case of a source feature, it can probably be reused as is)
        return self[:shift] + self[shift:]

However, provided you avoid cutting in the middle of a feature, the documented cut-and-add approach is still short, perhaps even clearer?

@peterjc
Copy link
Member

peterjc commented Jul 3, 2019

Breaking features at the new origin seems fairly well defined, just use the join functionality (in INSDC terminology, the CompoundLocation object in Biopython). However, I can think of two corner cases immediately where that default processing would be less than ideal.

First, origin spanning features like join(Y..N,1..X) on a sequence of length N, if the shift is large enough they could become a single contiguous region and expressed as A..B rather than a join of two touching regions. I think we'd need special case code that checks to see if we have abutting pieces at the shift point aka old origin, e.g. anything like join(A..X,Y,..B) where X and Y=X+1 is the old origin break / shift value applied becomes A..B instead.

Second, the special case of the source feature usually 1..N on a sequence of length N, should stay as 1..N rather than becoming something like join(Y..N,1..X) where X and Y=X+1 is slice point. However, chimeric source features are far more tricky - here we're sort of overlapping with the case above, but not quite.

You would need a comprehensive set of test cases as part of this work.

As to the other annotation, per-letter-annotation is simple (e.g. a circular contig in FASTQ format), but the others like id, name, description, features, etc need some thought. Probably the same approach as the . reverse_complement(...) method would be best - although perhaps preserving everything except the name and ID by default?

https://github.com/biopython/biopython/blob/biopython-173/Bio/SeqRecord.py#L1002

When I was looking at this, I probably convinced myself this was complicated enough that I didn't have a clear day or two spare to do it properly ;)

Also, one of the main reasons to apply an origin shift and roll the sequence is to fix an origin spanning feature, so careful choice of the shift value avoids this - thus the documented approach is usually enough.

@chris-rands
Copy link
Contributor

If you apply the .roll() method within the Seq class rather than SeqRecord it would avoid these complications, but might not be worthwhile.

The .roll() method reminds me of deque.rotate()

@jrjhealey
Copy link
Author

jrjhealey commented Jul 3, 2019

I think the source feature would indeed be fine to (re)define as necessary as simply 1:len(seq) regardless of how the sequence has been permuted. I'm not too familiar with chimeric sources though, do you have an example to illustrate?

Another option might be to 'preserve the history' of any rolled sequences by adding an additional per-letter annotation for the original source 0th character during the roll() process? This doesn't particularly matter to me for the functionality I have in mind for my own needs, but I can imagine this might end up being useful to people if they want to restore the sequence (and therefore source) relative to the original sequence.

For preserving the annotations, I think preserving everything 'as is' but with altered coordinates, similar to slicing out a subrecord, makes the most sense to me (as an end user I'd prefer not to delve in to manually reconstructing the annotation). This seems to be what the plasmid viewer SnapGene does. The only issue, unless I'm missing something, would be how to label a feature which spans the new join. Copying the Truthy/Falsey behaviour of .reverse_complement looks like it could make sense :)

Dillon Barker pointed out to me that the collections.deque object has a .rotate() method. I would also perhaps propose sticking with this nomenclature too? We may also be able to borrow from it to answer the question of which direction should be positive etc.

For reference, this is the deque.rotate() method in pure python from PyPy:

    def rotate(self, n=1):
        length = len(self)
        if length <= 1:
            return
        halflen = length >> 1
        if n > halflen or n < -halflen:
            n %= length
            if n > halflen:
                n -= length
            elif n < -halflen:
                n += length
        while n > 0:
            self.appendleft(self.pop())
            n -= 1
        while n < 0:
            self.append(self.popleft())
            n += 1

EDIT: I see that Chris has pointed out deque.rotate too.

@peterjc
Copy link
Member

peterjc commented Jul 3, 2019

Two votes for .rotate, which also makes sense for a circle - so I'm OK with that.

As to chimeras with complex source features, I wrote about some examples here:
https://blastedbio.blogspot.com/2013/11/entrez-trouble-with-chimeras.html

That brings up another potentially problematic feature location type which would require consideration (order(...), used in place of join(...) where the order is not known or not important).

@BjornFJohansson
Copy link

BjornFJohansson commented May 8, 2020

Hi all, Maybe I am late for the party, but the pydna Dseqrecord class has a shifted method for circular sequences. The Dseqrecord is a derivative of the Seqrecord for double stranded sequences. Perhaps the code could be of use or inspiration
https://github.com/BjornFJohansson/pydna/blob/master/pydna/dseqrecord.py

@fabianegli
Copy link
Contributor

After reading this thread, I am wondering what use cases should be served and if the SeqRecord class is the right place to implement a rotation feature. Concrete use cases would definitely help to find out what would be most useful.

In general, I think the SeqRecord should serve as a ground truth and any mutation/extraction/annotation should be something separate, either a function or a class containing the SeqRecord and (potentially) returning a new on.

What I think would be valuable additions to SeqRecord are the following:

  • attribute that says it is a circular sequence (this might already be the case)
  • ability to extract features that span the origin (if the sequence is circular)

The reason I would argue against a rotate method for SeqRecord is that for me it is not clear if the record id and name should remain the same after the rotation and/or other sequence manipulations or if we are talking about something new. Do you want rotate to change the object or should it just provide a different view point?

@peterjc
Copy link
Member

peterjc commented Jun 11, 2021

The only use case I've come up with is to apply an origin shift and roll the sequence is to fix an origin spanning feature (either during assembly and annotation, or in visualisation of someone else's published genome). Here careful choice of the shift value avoids most of the issues and the simple approach is usually enough: record[:shift] + record[shift:]

The GenBank parser already records the circular topology in the SeqRecord. The SeqFeature extract method already supports origin spanning locations.

As to recording naming etc, I'd expect any .roll(...) method to adopt something like the .reverse_complete(...) for controlling which annotation properties would be kept by default https://github.com/biopython/biopython/blob/biopython-179/Bio/SeqRecord.py#L1051

@seanrjohnson
Copy link
Contributor

seanrjohnson commented Feb 2, 2023

We've been thinking about contributing some higher level "neighbourhood" functions to Biopython along a similar line. Basically a way to say "give me 20 kbp upstream and downstream of feature X", and then have the SeqRecord do the right thing also for circular genomes.

@kblin did you ever do anything like this? I have code (that I will hopefully open source later this year) that does neighborhood extraction (and lots of other useful things) on linear contigs. I'm currently trying to patch it to handle origin-spanning annotations and neighborhoods that cross the origin. The comments on this page have been helpful so far, but I still have a ways to go.

@BjornFJohansson
Copy link

BjornFJohansson commented Feb 3, 2023

@seanrjohnson I did implement this into pydna, that sits on top of biopython.
I reproduced the example at the start of this thread:

# Now, for an example with features, we’ll use a GenBank file. 
# Suppose you have a circular genome:

from pydna.genbank import genbank

record = genbank("NC_005816.1")

print(record.seq[:30])

# TGTAACGAACGGTGCAATAGTGATCCACAC

assert len(record) == 9609

assert record.circular

assert record.description == 'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'

assert record.dbxrefs==['BioProject:PRJNA224116', 'BioSample:SAMN02602970',  'Assembly:GCF_000007885.1']

assert len(record.features) == 19

shifted_record = record.shifted(2000)

assert len(shifted_record.features) == 19

print(shifted_record.seq[:30])

# GATACGCAGTCATATTTTTTACACAATTCT

assert sorted(ft.extract(record).seq for ft in record.features) == sorted(ft.extract(shifted_record).seq for ft in shifted_record.features)

https://gist.github.com/BjornFJohansson/d334dc74cdc79b203acc8283f62327c4

@seanrjohnson
Copy link
Contributor

@BjornFJohansson
I found the code for shifted in the pydna repository. It is very helpful. Thanks!

It would also be great to have something like this integrated into Biopython.

@kblin
Copy link
Contributor

kblin commented Feb 3, 2023

I think my colleague @SJShaw built something for us, but I don't think it hit mainly yet.

@SJShaw
Copy link

SJShaw commented Feb 8, 2023

I think my colleague @SJShaw built something for us, but I don't think it hit mainly yet.

There's two circular-related things I've written, one is a simple rotate script that just cuts the record in two and reassembles them in reverse order (with some optional padding), adjusting all the features/locations as appropriate. It's simple enough that it doesn't cover the case of merging as was mentioned much further above.

The other builds on antiSMASH's secmet.Record that allows for locations to be extended, while being aware of circularity, then supports all the normal functionality like fetching features within a location and so on. I think it's nearly ready, but I want to do some more thorough testing of it. Once it's ready I'm happy to link to it in here, in case people find it of use.

@gatoniel
Copy link

gatoniel commented Feb 7, 2024

@SJShaw have you finished the code? I would be intereseted in a ready to use .roll() function that also handles features wrapping the origin in the original sequence but also in the rolled one.

@BjornFJohansson
Copy link

@gatoniel This was implemented in pydna with lots of unit tests. There is a method for the dseqrecord.shifted method and also shift_feature and shift_location functions in pydna.utils.

@SJShaw
Copy link

SJShaw commented Feb 8, 2024

@SJShaw have you finished the code? I would be intereseted in a ready to use .roll() function that also handles features wrapping the origin in the original sequence but also in the rolled one.

The script to rotate a file is here, though it's fairly trivial to adjust it to use an in-memory SeqRecord: https://gist.github.com/SJShaw/30df7a6b7551a219a0f8779702a425d4

The antiSMASH implementation won't have a .roll() as such, it just supports cross-origin features and extractions and allows for circular aware extensions of locations, as antiSMASH doesn't (and shouldn't) ever change the input sequence itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants