# How to get a slice that wraps around a string

In [1]:
string = "0123456789"

Suppose I'm given a starting position, and I want the 3 characters before and the 2 characters after that position. Together with the character at the position itself, that gives a string of length 6. Plain slicing is fine as long as the starting position comes after 2 and before 8.

In [2]:
left = 3    # upstream of sequence
right = 2   # downstream of sequence
for idx in range(len(string)):
    print(idx, string[idx-left:idx+right+1])

0 
1 
2 
3 012345
4 123456
5 234567
6 345678
7 456789
8 56789
9 6789
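
The empty and truncated rows follow from how Python slices treat out-of-range bounds. A minimal sketch of the two behaviours, using the same `string` (the names `empty` and `clipped` are only for illustration):

```python
string = "0123456789"

# A negative start counts from the end, so string[-3:3] means string[7:3].
# The start lies after the stop, so the slice is empty.
empty = string[-3:3]

# A stop past the end is silently clipped to len(string).
clipped = string[5:11]

print(repr(empty))    # ''
print(repr(clipped))  # '56789'
```

Neither case raises an error, which is why the loop above runs to completion but returns the wrong windows near both ends.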


Suppose this is a circular string, like a bacterial genomic sequence. 

In [3]:
"..." + string*3 + "..."

'...012345678901234567890123456789...'

So if `(idx - left) < 0` or `(idx + right) >= len(string)`, we need to wrap around.
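
As a quick check of which positions are affected when `left = 3` and `right = 2`, the sketch below lists them. The right-hand test is written against the last valid index, `len(string) - 1`; the names `wrap_left` and `wrap_right` are illustrative.

```python
string = "0123456789"
left = 3
right = 2

# Positions whose window would start before index 0.
wrap_left = [idx for idx in range(len(string)) if idx - left < 0]

# Positions whose window would end past the last index, len(string) - 1.
wrap_right = [idx for idx in range(len(string)) if idx + right > len(string) - 1]

print(wrap_left)   # [0, 1, 2]
print(wrap_right)  # [8, 9]
```

These are exactly the positions that produced empty or truncated output in the plain-slicing loop.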

### Case 1: `(idx - left) < 0`

...01234567 *890**1**23* 4567890123456789...

If we focus on index 1, we need to capture 890123.

In terms of the original string, we need to capture what's in bold: **0123**4567**89**

In [4]:
print( string[:4], string[-2:] )

0123 89


Everything from the start of the string through `idx + right` can be sliced directly (the `+ 1` is there because the end of a slice is exclusive). The remaining `left - idx` characters have to be taken from the end of the sequence.

In [5]:
idx = 1
print( string[0:(idx+right+1)], string[-(left-idx):] )

0123 89


Notice that this only works when the index is less than 3, that is, less than `left`.

In [6]:
idx = 0
print( string[0:(idx+right+1)], string[-(left-idx):] )
idx = 1
print( string[0:(idx+right+1)], string[-(left-idx):] )
idx = 2
print( string[0:(idx+right+1)], string[-(left-idx):] )
idx = 3
print( string[0:(idx+right+1)], string[-(left-idx):] )
idx = 8
print( string[0:(idx+right+1)], string[-(left-idx):] )

012 789
0123 89
01234 9
012345 0123456789
0123456789 56789
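
The last two rows go wrong because of the sign of `left - idx`. At `idx = 3` it is 0, and `string[-0:]` is the same as `string[0:]`, the whole string. At `idx = 8` it is -5, so the slice becomes `string[5:]`. A small sketch of the zero case, with one possible workaround (the names `count` and `tail` are illustrative):

```python
string = "0123456789"
left = 3

idx = 3
count = left - idx          # 0 characters are needed from the end
print(-count == 0)          # True: -0 is just 0
print(string[-count:])      # 0123456789, the whole string

# One possible workaround: slice from an explicit non-negative start.
tail = string[len(string) - count:]
print(repr(tail))           # ''
```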


### Case 2: `(idx + right) + 1 > len(string)`, or equivalently `(len(string) - right) - 1 < idx`

In [7]:
idx = 9
diff = len(string) - right - 1
print( string[idx-left:], string[0:(idx - diff)] )
print( string[idx-left:] + string[0:(idx - diff)] )

6789 01
678901
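
To see why `idx - diff` is the right slice end, note that it equals the number of characters that spill past the end of the string, `idx + right + 1 - len(string)`. A short sketch checking this for the two affected positions (`overflow` and `wrapped` are names introduced here for illustration):

```python
string = "0123456789"
left, right = 3, 2
n = len(string)
diff = n - right - 1  # last index that needs no right-hand wrap (7 here)

wrapped = {}
for idx in (8, 9):
    overflow = idx + right + 1 - n   # characters that fall past the end
    assert overflow == idx - diff    # same quantity, written two ways
    wrapped[idx] = string[idx-left:] + string[:overflow]
    print(idx, overflow, wrapped[idx])

# prints:
# 8 1 567890
# 9 2 678901
```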


Putting these two cases together, along with the case where no wrapping around is necessary, we can define the following function. Just plug in the position/index from which you want to go upstream and downstream.

In [8]:
def cut_string(idx):
    if idx - left < 0:
        # wrap around left
        segment = string[-(left-idx):] + string[0:(idx+right+1)]
        print(segment)
    elif idx + right + 1 > len(string):
        # wrap around right
        diff = len(string) - right - 1
        segment = string[idx-left:] + string[0:(idx - diff)]
        print(segment)
    else:
        segment = string[idx-left:idx+right+1]
        print(segment)

Let's check this. Run this function for every starting position of this string, from 0 to 9.

In [9]:
for i in range(10):
    cut_string(i)

789012
890123
901234
012345
123456
234567
345678
456789
567890
678901
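
As an independent cross-check, the same windows can be built one character at a time with modulo arithmetic. This is an alternative sketch, not the slicing approach used above, and `window_by_modulo` is a hypothetical helper name:

```python
string = "0123456789"
left, right = 3, 2
n = len(string)

def window_by_modulo(idx):
    # Map every offset from -left to +right back into range(n) with %.
    return "".join(string[(idx + k) % n] for k in range(-left, right + 1))

for i in range(n):
    print(window_by_modulo(i))
```

It should print the same ten windows as `cut_string`. Building the result character by character is simple to reason about but slower than slicing for long windows.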


## Get promoter on circular genome

Let's say we need to capture promoter regions 80 nt upstream and 20 nt downstream of a transcription site. If the transcription site is at position 70, we cannot go 80 nt upstream without running off the start of the sequence. If the transcription site is less than 20 nucleotides from the end of the sequence, we have the same problem on the other side. But if our genome is circular, as many bacterial genomes are (e.g. *E. coli*), we need to wrap around the sequence.

A naive approach is to start by pasting 2-3 copies of the sequence together. That probably would have worked for the problem above, since the sequence is so short. But the complete *E. coli* genome is almost 5 million base pairs long, which already takes up a lot of memory when stored as a single string. So we will reuse the approach spelled out above, with a few modifications.
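
For reference, the copy-pasting approach looks like this on the short example string. It is a minimal sketch; `tripled` and `window_from_copies` are names made up for illustration.

```python
string = "0123456789"
left, right = 3, 2
n = len(string)

# Three copies: the middle copy holds the "real" positions, and the
# outer copies supply the wrapped characters on either side.
tripled = string * 3

def window_from_copies(idx):
    # Shift idx by n so that it points into the middle copy.
    return tripled[n + idx - left : n + idx + right + 1]

print(window_from_copies(0))  # 789012
print(window_from_copies(9))  # 678901
print(len(tripled))           # 30, three times the original length
```

The slicing itself is trivial, but the helper string is three times the size of the original, which is the cost that matters for a genome-sized sequence.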

Since we have to do this for each transcription site, we will repeat this process many times, so it's best to write it as a function. The function takes the transcription site position and how far upstream and downstream to go, and it returns the promoter.

In [10]:
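seq = "AAAACCCCTTTTGGGG"
def subsequence(tss, upstream, downstream):
    # Return the window around tss on the circular sequence `seq`.
    # Assumes upstream + downstream + 1 <= len(seq), so at most one wrap is needed.
    idx = tss
    left = upstream
    right = downstream
    if idx - left < 0:
        # wrap around left: take the missing characters from the end of seq
        promoter = seq[-(left-idx):] + seq[0:(idx+right+1)]
        return promoter
    elif idx + right + 1 > len(seq):
        # wrap around right: idx - diff characters spill past the end of seq
        diff = len(seq) - right - 1
        promoter = seq[idx-left:] + seq[0:(idx - diff)]
        return promoter
    else:
        # no wrapping needed
        promoter = seq[idx-left:idx+right+1]
        return promoter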

In [11]:
idx = 2
up = 4
down = 3
print(seq[idx])
subsequence(idx, up, down)

A


'GGAAAACC'

In [12]:
idx = 8
up = 10
down = 5
print(seq[idx])
subsequence(idx, up, down)

T


'GGAAAACCCCTTTTGG'
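
The three branches can also be folded into one computation by taking the window start modulo the sequence length. The following is a hedged sketch rather than a replacement for the function above: `circular_window` is a hypothetical name, the sequence is passed in as an argument instead of being read from a global, and raising `ValueError` for a window longer than the sequence is an assumption about the desired behaviour.

```python
def circular_window(seq, tss, upstream, downstream):
    """Return upstream + 1 + downstream characters around tss on a circular seq."""
    n = len(seq)
    length = upstream + downstream + 1
    if length > n:
        # Assumption: a window longer than the sequence is treated as an error.
        raise ValueError("window is longer than the sequence")

    start = (tss - upstream) % n   # wraps negative starts to the end of seq
    end = start + length
    if end <= n:
        # The window fits without crossing the end of the sequence.
        return seq[start:end]
    # Otherwise take the tail of seq, then the overflow from its beginning.
    return seq[start:] + seq[:end - n]


seq = "AAAACCCCTTTTGGGG"
print(circular_window(seq, 2, 4, 3))    # GGAAAACC
print(circular_window(seq, 8, 10, 5))   # GGAAAACCCCTTTTGG
```

Because it only ever takes two slices of the original string, this keeps the memory advantage over pasting copies of the genome together.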