## Test insertions

I have sampled local matches of different lengths and error rates from `ref.fasta`. I have then inserted these local matches into `query/query.fasta` at random positions to create `query/with_insertions_{er}.fasta`. The file `ground_truth/{er}.tsv` lists the position and length of each local match in `query/with_insertions_{er}.fasta`.

This notebook checks whether the ground truth file lists the correct locations.

In [287]:
from Bio import SeqIO
import pandas as pd
import random

Compare the original query file and the query with insertions. Check if matches from ground truth actually appear where they are supposed to.

In [288]:
er = "0"

original_query_file = "../query/one_line.fasta"
query_with_insertions_file = "../query/with_insertions_" + er + ".fasta"
# the ground truth file shows positions in the query with insertions
ground_truth_file = "../ground_truth/" + er + ".tsv"
local_matches_file = "../local_matches/" + er + ".fastq"

In [289]:
original_query = list(SeqIO.parse(original_query_file, "fasta"))
query_with_insertions = list(SeqIO.parse(query_with_insertions_file, "fasta"))
local_matches = list(SeqIO.parse(local_matches_file, "fastq"))
ground_truth = pd.read_csv(ground_truth_file, sep='\t')

Take a random local match from the local matches file. Find its assigned position in the ground truth file and extract that region from the query with insertions.

In [290]:
random.seed(42)

random_insertion = random.choice(local_matches)
random_insertion

SeqRecord(seq=Seq('CAGGCGAACCTATCACAGCATATTGCGTTTTCTATCATCTGGCGATCCATAGAA...TAG', SingleLetterAlphabet()), id='l150-77', name='l150-77', description='l150-77', dbxrefs=[])

In [291]:
ground_truth.loc[ground_truth['id'] == random_insertion.name]

Unnamed: 0,id,position,length
21,l150-77,55351,150


In [292]:
position = ground_truth.loc[ground_truth['id'] == random_insertion.name]["position"].values[0]
position

55351

In [293]:
length = ground_truth.loc[ground_truth['id'] == random_insertion.name]["length"].values[0]
length

150

In [294]:
original_insertion = str(random_insertion.seq)
original_insertion

'CAGGCGAACCTATCACAGCATATTGCGTTTTCTATCATCTGGCGATCCATAGAAAAGCTGACTCCGTGCCCCGCATCTAAGCGCCCGGTGCCAGATTCGGGTGGACCCTCGGTCAAGCGAGCTAAGCCTGAGAGGTGAAAGCCGCTTTAG'

In [295]:
inserted_into_query = str(query_with_insertions[0].seq[position:position + length])
inserted_into_query

'CAGGCGAACCTATCACAGCATATTGCGTTTTCTATCATCTGGCGATCCATAGAAAAGCTGACTCCGTGCCCCGCATCTAAGCGCCCGGTGCCAGATTCGGGTGGACCCTCGGTCAAGCGAGCTAAGCCTGAGAGGTGAAAGCCGCTTTAG'

In [296]:
assert original_insertion == inserted_into_query, "Something went wrong with the insertion"

### Check insertions

Does the simulated local match actually appear at the position specified in the ground truth file?

In [297]:
def check_insertion_correctness():
    """Pick a random local match and assert that it sits at its ground truth position."""
    random_insertion = random.choice(local_matches)
    # look up where the ground truth says this match was inserted
    position = ground_truth.loc[ground_truth['id'] == random_insertion.name]["position"].values[0]
    length = ground_truth.loc[ground_truth['id'] == random_insertion.name]["length"].values[0]
    original_insertion = str(random_insertion.seq)
    # slice the same region out of the query with insertions
    inserted_into_query = str(query_with_insertions[0].seq[position:position + length])
    assert original_insertion == inserted_into_query, "Something went wrong with the insertion"
    print("Insertion " + random_insertion.name + " found at position " + str(position))

In [305]:
for i in range(10):
    check_insertion_correctness()

Insertion l200-39 found at position 398846
Insertion l200-70 found at position 390861
Insertion l50-3 found at position 249116
Insertion l200-13 found at position 21185
Insertion l200-37 found at position 924291
Insertion l50-81 found at position 111065
Insertion l150-107 found at position 17906
Insertion l100-91 found at position 254782
Insertion l100-49 found at position 894858
Insertion l100-17 found at position 1077273
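
The random spot check above can also be run over every row of the ground truth. The following is a minimal sketch of that idea using plain Python stand-ins instead of the notebook's objects. The function name `find_mismatched_insertions` and the toy data are hypothetical and not taken from the real files.

```python
def find_mismatched_insertions(query_seq, matches, truth_rows):
    """Return the ids of matches that are not found at their ground truth position.

    query_seq  : the query with insertions as a plain string
    matches    : dict mapping match id -> inserted sequence
    truth_rows : iterable of (id, position, length) tuples
    """
    mismatched = []
    for match_id, position, length in truth_rows:
        found = query_seq[position:position + length]
        if found != matches[match_id]:
            mismatched.append(match_id)
    return mismatched


# Toy data (hypothetical): two insertions, "CGT" at index 4 and "GG" at index 11
toy_query = "AAAA" + "CGT" + "AAAA" + "GG" + "AA"
toy_matches = {"m1": "CGT", "m2": "GG"}
toy_truth = [("m1", 4, 3), ("m2", 11, 2)]

print(find_mismatched_insertions(toy_query, toy_matches, toy_truth))  # []
```

With the objects loaded in this notebook, the arguments could presumably be built as `str(query_with_insertions[0].seq)`, `{r.name: str(r.seq) for r in local_matches}` and `ground_truth[['id', 'position', 'length']].itertuples(index=False)`.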


### Check positions before first insertion

Is the leading sequence (before the first insertion) equal for the original query and the query with insertions?

In [299]:
ground_truth.head()

Unnamed: 0,id,position,length
0,l100-109,851,100
1,l150-101,1299,150
2,l200-124,3796,200
3,l150-97,4970,150
4,l200-43,7405,200


In [300]:
first_insertion_position = ground_truth.iloc[0]['position']

str(original_query[0].seq[0:first_insertion_position]) == \
str(query_with_insertions[0].seq[0:first_insertion_position])

True

### Check a sample of positions in the middle

In [301]:
# get random positions from the ground truth list to verify them
ran_ind_list = random.sample(range(len(ground_truth['id'])), 10)
ran_ind_list.sort()

In [302]:
for ind in ran_ind_list:
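    # skip the first and last rows: they have no previous or next insertion,
    # and those ends are covered by the leading and trailing checks
    if ind == 0 or ind == len(ground_truth) - 1:
        continue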
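    # the ground truth positions represent indices in the query with insertions;
    # subtracting the total length of all earlier insertions (the "bias") gives the
    # matching index in the original query.
    # "downstream" is the stretch between the previous insertion and this one,
    # "upstream" is the stretch between this insertion and the next one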
    previous_insert_position = ground_truth.iloc[ind-1]['position']
    previous_insert_length = ground_truth.iloc[ind-1]['length']
    downstream_bias = sum(ground_truth.truncate(after=(ind-1))['length'])

    insert_position = ground_truth.iloc[ind]['position']
    insert_length = ground_truth.iloc[ind]['length']
    upstream_bias = sum(ground_truth.truncate(after=(ind))['length'])

    next_insert_position = ground_truth.iloc[ind+1]['position']

    original_query_downstream = str(original_query[0].seq[previous_insert_position - downstream_bias \
                                                          + previous_insert_length \
                                                          :insert_position - downstream_bias])

    insertion_downstream = str(query_with_insertions[0].seq[previous_insert_position + \
                                                            previous_insert_length:insert_position])

    if (original_query_downstream == insertion_downstream):
        print("Verified downstream sequence for insertion " + ground_truth.iloc[ind]['id'])
    else:
        print("Incorrect downstream " + ground_truth.iloc[ind]['id'])

    original_query_upstream = str(original_query[0].seq[insert_position - upstream_bias \
                                                        + insert_length \
                                                        :next_insert_position - upstream_bias])
    insertion_upstream = str(query_with_insertions[0].seq[insert_position + insert_length:next_insert_position])

    if (original_query_upstream == insertion_upstream):
        print("Verified upstream sequence for insertion " + ground_truth.iloc[ind]['id'])
    else:
        print("Incorrect upstream " + ground_truth.iloc[ind]['id'])
    

Verified downstream sequence for insertion l200-71
Verified upstream sequence for insertion l200-71
Verified downstream sequence for insertion l150-6
Verified upstream sequence for insertion l150-6
Verified downstream sequence for insertion l50-80
Verified upstream sequence for insertion l50-80
Verified downstream sequence for insertion l150-111
Verified upstream sequence for insertion l150-111
Verified downstream sequence for insertion l50-72
Verified upstream sequence for insertion l50-72
Verified downstream sequence for insertion l200-102
Verified upstream sequence for insertion l200-102
Verified downstream sequence for insertion l50-116
Verified upstream sequence for insertion l50-116
Verified downstream sequence for insertion l200-115
Verified upstream sequence for insertion l200-115
Verified downstream sequence for insertion l100-7
Verified upstream sequence for insertion l100-7
Verified downstream sequence for insertion l50-14
Verified upstream sequence for insertion l50-14
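
The bias terms in the loop above map a coordinate in the query with insertions back to the original query by subtracting the total length of all insertions that come before it. A minimal sketch of that mapping on toy strings follows. It assumes the ground truth rows are sorted by position and do not overlap. The function name `flanks_match` and the toy sequences are hypothetical.

```python
from itertools import accumulate


def flanks_match(original, with_insertions, truth_rows):
    """Compare every stretch outside the insertions against the original sequence.

    truth_rows : list of (position, length) tuples, sorted by position,
                 with positions given in with_insertions coordinates.
    Returns one boolean per stretch: leading, between insertions, trailing.
    """
    lengths = [length for _, length in truth_rows]
    # shift[i] = number of inserted bases that precede insertion i
    shift = [0] + list(accumulate(lengths))

    results = []
    prev_end = 0  # end of the previous insertion, in with_insertions coordinates
    for i, (position, length) in enumerate(truth_rows):
        stretch_new = with_insertions[prev_end:position]
        stretch_old = original[prev_end - shift[i]:position - shift[i]]
        results.append(stretch_new == stretch_old)
        prev_end = position + length
    # trailing stretch after the last insertion
    results.append(with_insertions[prev_end:] == original[prev_end - shift[-1]:])
    return results


# Toy data (hypothetical): "GGG" and "TA" inserted into a 10-base original
toy_original = "AAAACCCCTT"
toy_inserted = "AAAA" + "GGG" + "CCCC" + "TA" + "TT"
toy_rows = [(4, 3), (11, 2)]

print(flanks_match(toy_original, toy_inserted, toy_rows))  # [True, True, True]
```

Computing the cumulative lengths once avoids re-summing the `length` column for every sampled row.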


### Check positions after last insertion

Is the trailing sequence (after the last insertion) equal for the original query and the query with insertions?

In [303]:
ground_truth.tail()

Unnamed: 0,id,position,length
495,l50-108,1086625,50
496,l100-42,1086830,100
497,l100-69,1097227,100
498,l100-8,1099134,100
499,l50-25,1109413,50


In [304]:
last_insertion_position = ground_truth.iloc[-1]['position']
last_insertion_length = ground_truth.iloc[-1]['length']
bias = sum(ground_truth["length"]) - last_insertion_length

# everything after the last insertion must equal the tail of the original query
str(original_query[0].seq[last_insertion_position - bias:]) == \
str(query_with_insertions[0].seq[last_insertion_position + last_insertion_length:])

True
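
One more sanity check follows from the same bookkeeping: the query with insertions should be exactly as long as the original query plus the summed insertion lengths, and each insertion should start at or after the end of the previous one. The sketch below assumes that insertions are never nested inside one another, which the notebook does not state explicitly. The function name `check_ground_truth_layout` and the toy numbers are hypothetical.

```python
def check_ground_truth_layout(original_len, with_insertions_len, truth_rows):
    """Return a list of human-readable problems (empty if the layout is consistent).

    truth_rows : list of (id, position, length) tuples in file order.
    """
    problems = []
    prev_end = 0
    for match_id, position, length in truth_rows:
        # assumption: rows are sorted and insertions do not overlap
        if position < prev_end:
            problems.append(
                f"{match_id} starts at {position}, before the previous insertion ends at {prev_end}"
            )
        prev_end = max(prev_end, position + length)

    total_inserted = sum(length for _, _, length in truth_rows)
    if original_len + total_inserted != with_insertions_len:
        problems.append(
            f"length mismatch: {original_len} + {total_inserted} != {with_insertions_len}"
        )
    return problems


# Toy numbers (hypothetical): 10 original bases plus insertions of 3 and 2 bases
toy_layout = [("m1", 4, 3), ("m2", 11, 2)]
print(check_ground_truth_layout(10, 15, toy_layout))  # []
```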