Take one strain that has one long chromosome, e.g. DA63366

Check the repeat coordinates that you got:

1. How many instances of a repeat does each record in the GRF output correspond to?
2. What are their locations?
3. What are the between-match distances?


In [2]:
import pandas as pd

def parse_spacer(header):
    """
    Parse a spacer header into chromosome coordinates of both repeat copies.

    :param header: a char string like '>1:0-190543:951:190482:14m'
    :return: a list [record_id, start_1, end_1, start_2, end_2, length]
    """
    first_split = header.split(":")
    record_id = first_split[0][1:]  # drop the leading '>'
    repeat_len = int(first_split[-1][:-1])  # drop the trailing 'm'
    range_ = first_split[1].split("-")
    range_start, range_end = int(range_[0]), int(range_[-1])
    # the next two fields are offsets relative to range_start
    repeat_1_start_in_range = int(first_split[2])
    repeat_2_end_in_range = int(first_split[3])

    # first copy: the header gives its start, so add the length to get the end
    repeat_1_start_in_chrom = range_start + repeat_1_start_in_range
    repeat_1_end_in_chrom = repeat_1_start_in_chrom + repeat_len

    # second copy: the header gives its end, so subtract the length to get the start
    repeat_2_end_in_chrom = range_start + repeat_2_end_in_range
    repeat_2_start_in_chrom = repeat_2_end_in_chrom - repeat_len

    return [record_id, repeat_1_start_in_chrom, repeat_1_end_in_chrom, repeat_2_start_in_chrom,
            repeat_2_end_in_chrom, repeat_len]


in_spacers = "/home/andrei/Data/HeteroR/results/direct_repeats/DA63366/repeats/perfect.spacer.id"
with open(in_spacers) as f:
    # skip blank lines so parse_spacer never receives an empty header
    spacer_ids = [line.rstrip() for line in f if line.strip()]

parsed_headers = [parse_spacer(line) for line in spacer_ids]
repeats = pd.DataFrame(columns=["record_id", "start_1", "end_1", "start_2", "end_2", "length"], data=parsed_headers)
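
As a quick sanity check of the coordinate arithmetic in `parse_spacer`, the docstring header can be worked through step by step. This is only a sketch that repeats the same steps inline; the variable names `offset_1`, `offset_2` and `row` are introduced here for illustration and are not part of the notebook.

```python
# Worked example for the header from the parse_spacer docstring.
header = ">1:0-190543:951:190482:14m"

# Drop the leading '>' and split into the five colon-separated fields.
record_id, range_, offset_1, offset_2, length = header[1:].split(":")

range_start = int(range_.split("-")[0])  # 0
repeat_len = int(length[:-1])            # strip the trailing 'm' -> 14

# First copy: header stores its start offset within the range.
start_1 = range_start + int(offset_1)
end_1 = start_1 + repeat_len

# Second copy: header stores its end offset within the range.
end_2 = range_start + int(offset_2)
start_2 = end_2 - repeat_len

row = [record_id, start_1, end_1, start_2, end_2, repeat_len]
print(row)  # ['1', 951, 965, 190468, 190482, 14]
```

Because the range starts at 0 in this header, the offsets and the chromosome coordinates coincide; for any other range they differ by `range_start`.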

In [3]:
repeats

Unnamed: 0,record_id,start_1,end_1,start_2,end_2,length
0,1,156785,156796,355577,355588,11
1,1,155793,155803,354429,354439,10
2,1,156991,157004,355523,355536,13
3,1,155876,155886,354108,354118,10
4,1,157996,158006,355931,355941,10
...,...,...,...,...,...,...
1536315,2,67576,67587,68111,68122,11
1536316,2,57930,57940,58456,58466,10
1536317,2,55882,55892,56401,56411,10
1536318,2,90519,90534,91034,91049,15
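
With the table in this shape, the between-match distance and a rough instance count can be read off directly. The sketch below uses only the first five rows shown above as toy data. It assumes that "distance" means the gap between the end of the first copy and the start of the second (`start_2 - end_1`), and that rows sharing the same `record_id` and `start_1` belong to the same repeat; both are assumptions, and the column name `distance` is made up for this example.

```python
import pandas as pd

# Toy data: the first five rows of the table above.
repeats = pd.DataFrame(
    [
        ("1", 156785, 156796, 355577, 355588, 11),
        ("1", 155793, 155803, 354429, 354439, 10),
        ("1", 156991, 157004, 355523, 355536, 13),
        ("1", 155876, 155886, 354108, 354118, 10),
        ("1", 157996, 158006, 355931, 355941, 10),
    ],
    columns=["record_id", "start_1", "end_1", "start_2", "end_2", "length"],
)

# Assumed definition: gap between the two copies (end-exclusive coordinates).
repeats["distance"] = repeats["start_2"] - repeats["end_1"]

# Rough instance count: how many rows start at the same first-copy position.
rows_per_start = repeats.groupby(["record_id", "start_1"]).size()

print(repeats[["start_1", "end_1", "start_2", "distance"]])
print(rows_per_start.max())
```

On the full table the same two lines apply unchanged; only the toy `DataFrame` construction would be replaced by the real `repeats`.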


So the first match (= repeat), 156785-156796 (end-exclusive), has 11 matches in the chromosome:

355578-355588 (this one is in the same row as the original match)

452879-452889

753332-753342

1126877-1126887

...
and so on

Can you find these coordinates in the table?

(You can just grep the original output file.)

```
 $ grep "156785" perfect.spacer.id
>1:157454-360569:34544:156785:11m
>1:434016-635175:156785:183822:10m
>1:1054774-1255306:39708:156785:11m
>1:1261074-1464189:89622:156785:11m
>1:1358770-1559935:154610:156785:10m
>1:1359933-1561473:4250:156785:10m
>1:1527329-1728974:27411:156785:10m
>1:1679323-1880692:114499:156785:10m
>1:1736873-1938041:140950:156785:13m
>1:2246869-2447236:51288:156785:10m
>1:2247221-2447552:2073:156785:10m
>1:3032219-3233453:57464:156785:10m
>1:3032219-3233453:144235:156785:12m
>1:3522920-3724079:145392:156785:10m
>1:4242343-4443376:156785:157479:10m
```

That is 15 hits.

This would suggest that we can find 156785 in columns **start_1** and **end_2**.

But in practice I can't find it there:

In [9]:
repeats[repeats.end_2 == 156785]


Unnamed: 0,record_id,start_1,end_1,start_2,end_2,length


In [10]:
repeats[repeats.start_1 == 156785]

Unnamed: 0,record_id,start_1,end_1,start_2,end_2,length
0,1,156785,156796,355577,355588,11
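
The empty result for `end_2` is consistent with how `parse_spacer` builds the table: the third and fourth header fields are offsets within the range, and the table stores `range_start + offset`. So the 156785 that grep finds is a range-relative offset, not a chromosome coordinate. A minimal sketch, using three of the grep hits above, shows where that offset lands on the chromosome; the names `query`, `hits` and `absolute_positions` are illustrative only.

```python
# Three of the grep hits, copied from the output above.
hits = [
    ">1:157454-360569:34544:156785:11m",
    ">1:434016-635175:156785:183822:10m",
    ">1:4242343-4443376:156785:157479:10m",
]
query = 156785

absolute_positions = []
for hit in hits:
    _, range_, offset_1, offset_2, _ = hit[1:].split(":")
    range_start = int(range_.split("-")[0])
    # Which header field carried the grep match?
    field = "repeat 1 start" if int(offset_1) == query else "repeat 2 end"
    # Same conversion as parse_spacer: offset within range -> chromosome.
    absolute = range_start + query
    absolute_positions.append(absolute)
    print(f"{hit}: {field} -> chromosome position {absolute}")
```

The same offset maps to a different chromosome position for every range, so filtering the table on the raw value 156785 only finds rows whose chromosome coordinate happens to equal it.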


In [11]:
a = 157454+34544

repeats[repeats.start_1 == a]

Unnamed: 0,record_id,start_1,end_1,start_2,end_2,length
4259,1,191998,192009,314228,314239,11
30647,1,191998,192009,314228,314239,11
57257,1,191998,192009,314228,314239,11


There are duplicated rows!
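
Before removing them, it may be worth counting how many rows are exact repeats of an earlier row. The sketch below does this on toy data (the three identical rows above plus one unrelated row from the table); `n_extra` and `deduped` are names chosen for this example.

```python
import pandas as pd

cols = ["record_id", "start_1", "end_1", "start_2", "end_2", "length"]
repeats = pd.DataFrame(
    [
        ("1", 191998, 192009, 314228, 314239, 11),
        ("1", 191998, 192009, 314228, 314239, 11),
        ("1", 191998, 192009, 314228, 314239, 11),
        ("1", 156785, 156796, 355577, 355588, 11),
    ],
    columns=cols,
)

# duplicated() flags every row that repeats an earlier one (index is ignored).
n_extra = int(repeats.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = repeats.drop_duplicates().reset_index(drop=True)

print(f"{n_extra} redundant rows, {len(deduped)} unique rows")
```

Returning a new frame instead of using `inplace=True` keeps the original table available for comparison; whether that matters here is a judgment call.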


In [13]:
repeats.drop_duplicates(inplace=True)

repeats[repeats.start_1 == a]

Unnamed: 0,record_id,start_1,end_1,start_2,end_2,length
4259,1,191998,192009,314228,314239,11
