## Test insertions

I sampled local matches of different lengths and error rates from `ref.fasta`. I then inserted these local matches into `query/query.fasta` at random positions to create `query/with_insertions_{er}.fasta`. The file `ground_truth/{er}.tsv` records the location of each local match in `query/with_insertions_{er}.fasta`.

This notebook tests if the ground truth file has the correct locations.  

In [6]:
from Bio import SeqIO
import pandas as pd
import random

Compare the original query file and the query with insertions. Check if matches from ground truth actually appear where they are supposed to.

In [26]:
rep = "2"
er = "0075"

original_query_file = "../query/one_line_" + rep + ".fasta"
query_with_insertions_file = "../query/with_insertions_" + rep + "_" + er + ".fasta"
# the ground truth file shows positions in the query with insertions
ground_truth_file = "../ground_truth/" + rep + "_" + er + ".tsv"
local_matches_file = "../local_matches/" + rep + "_" + er + ".fastq"

In [27]:
original_query = list(SeqIO.parse(original_query_file, "fasta"))
query_with_insertions = list(SeqIO.parse(query_with_insertions_file, "fasta"))
local_matches = list(SeqIO.parse(local_matches_file, "fastq"))
ground_truth = pd.read_csv(ground_truth_file, sep='\t')

Take a random local match from the local matches file. Find its assigned position in the ground truth file and extract that region from the query with insertions file.

In [30]:
random.seed(42)

random_insertion = random.choice(local_matches)
random_insertion

SeqRecord(seq=Seq('GTAATGTTGTGGGGATAGGGCGTCAGCTGGCAGGTTACGACACAACCTAGACCC...CTC', SingleLetterAlphabet()), id='l150-77', name='l150-77', description='l150-77', dbxrefs=[])

In [31]:
ground_truth.loc[ground_truth['id'] == random_insertion.name]

Unnamed: 0,id,position,length
21,l150-77,55351,150


In [32]:
position = ground_truth.loc[ground_truth['id'] == random_insertion.name]["position"].values[0]
position

55351

In [33]:
length = ground_truth.loc[ground_truth['id'] == random_insertion.name]["length"].values[0]
length

150

In [34]:
original_insertion = str(random_insertion.seq)
original_insertion

'GTAATGTTGTGGGGATAGGGCGTCAGCTGGCAGGTTACGACACAACCTAGACCCCACACCGGCCACTCGTCCAACTAGAGGGCACAATTACGTGTTGATAGCTTGCACGCCCGTCTCTGAAAGTTTATTCAATCCACGACGGTAAGCCTC'

In [35]:
inserted_into_query = str(query_with_insertions[0].seq[position:position + length])
inserted_into_query

'GTAATGTTGTGGGGATAGGGCGTCAGCTGGCAGGTTACGACACAACCTAGACCCCACACCGGCCACTCGTCCAACTAGAGGGCACAATTACGTGTTGATAGCTTGCACGCCCGTCTCTGAAAGTTTATTCAATCCACGACGGTAAGCCTC'

In [36]:
assert original_insertion == inserted_into_query, "Something went wrong with the insertion"
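
To make the meaning of the ground truth positions concrete, here is a minimal sketch of an insertion step on toy data. It is not the script that produced the query with insertions. `insert_matches` and the toy sequences are hypothetical, and the sketch assumes positions are 0-based offsets into the sequence with insertions, which is what the slice comparison above implies.

```python
import random

def insert_matches(query, matches, rng):
    """Insert each (id, sequence) match at a random position of `query`.

    Returns the new sequence and a list of (id, position, length) rows,
    with positions given in the coordinates of the new sequence.
    """
    # choose insertion points in original coordinates, ordered left to right
    points = sorted(rng.randrange(len(query) + 1) for _ in matches)
    pieces, rows, prev, shift = [], [], 0, 0
    for (match_id, seq), point in zip(matches, points):
        pieces.append(query[prev:point])  # untouched original stretch
        pieces.append(seq)                # the inserted local match
        # every earlier insertion shifts this one to the right
        rows.append((match_id, point + shift, len(seq)))
        shift += len(seq)
        prev = point
    pieces.append(query[prev:])
    return "".join(pieces), rows

rng = random.Random(0)
toy_query = "".join(rng.choice("ACGT") for _ in range(200))
toy_matches = [("m0", "A" * 10), ("m1", "C" * 15), ("m2", "G" * 5)]
toy_with_insertions, toy_rows = insert_matches(toy_query, toy_matches, rng)

# same check as in the notebook: slice at the recorded position
toy_lookup = dict(toy_matches)
for match_id, position, length in toy_rows:
    assert toy_with_insertions[position:position + length] == toy_lookup[match_id]
```

The recorded position is the original insertion point plus the total length of all insertions to its left, which is why the later sections subtract a running total of lengths to get back to original coordinates.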

### Check insertions

Does the simulated local match actually appear at the position that is specified in the ground truth file?

In [37]:
def check_insertion_correctness():
    # pick a random local match and look up its ground truth row
    random_insertion = random.choice(local_matches)
    row = ground_truth.loc[ground_truth['id'] == random_insertion.name]
    position = row["position"].values[0]
    length = row["length"].values[0]
    original_insertion = str(random_insertion.seq)
    # slice the same region out of the query with insertions
    inserted_into_query = str(query_with_insertions[0].seq[position:position + length])
    assert original_insertion == inserted_into_query, "Something went wrong with the insertion"
    print("Insertion " + random_insertion.name + " found at position " + str(position))

In [38]:
for i in range(10):
    check_insertion_correctness()

Insertion l50-57 found at position 475341
Insertion l50-12 found at position 141772
Insertion l200-4 found at position 162131
Insertion l100-15 found at position 620859
Insertion l100-0 found at position 842225
Insertion l50-114 found at position 1038871
Insertion l50-71 found at position 359361
Insertion l200-2 found at position 353762
Insertion l50-52 found at position 938049
Insertion l150-96 found at position 952990
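
Ten random draws cover only a small part of the ground truth, so an exhaustive pass may be worth adding. The sketch below is one possible shape for it. It uses plain `(id, position, length)` tuples and a dict of match sequences in place of the DataFrame and the `SeqRecord` list, and `find_mismatched_insertions` is a hypothetical helper, not part of the notebook.

```python
def find_mismatched_insertions(sequence, rows, match_seqs):
    """Return the ids whose ground truth slice differs from the match sequence."""
    mismatched = []
    for match_id, position, length in rows:
        if sequence[position:position + length] != match_seqs[match_id]:
            mismatched.append(match_id)
    return mismatched

# toy data: two matches embedded in a short sequence
toy_sequence = "TTTTACGTACTTTTGGGCCTT"
toy_match_seqs = {"a": "ACGTAC", "b": "GGGCC"}
good_rows = [("a", 4, 6), ("b", 14, 5)]
bad_rows = [("a", 4, 6), ("b", 15, 5)]  # position of "b" is off by one

print(find_mismatched_insertions(toy_sequence, good_rows, toy_match_seqs))  # []
print(find_mismatched_insertions(toy_sequence, bad_rows, toy_match_seqs))   # ['b']
```

Returning the list of failing ids, rather than asserting on the first failure, makes it easier to see whether an error is isolated or systematic.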


### Check positions before first insertion

Is the leading sequence (before the first insertion) equal for the original query and the query with insertions?

In [39]:
ground_truth.head()

Unnamed: 0,id,position,length
0,l100-109,851,100
1,l150-101,1299,150
2,l200-124,3796,200
3,l150-97,4970,150
4,l200-43,7405,200


In [40]:
first_insertion_position = ground_truth.iloc[0]['position']

str(original_query[0].seq[0:first_insertion_position]) == \
str(query_with_insertions[0].seq[0:first_insertion_position])

True
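
The comparison above treats the first row of the ground truth as the leftmost insertion, and the bias arithmetic in the remaining checks only works if rows are ordered by position and do not overlap. A small sketch of how that assumption could be verified, using the five rows shown by `ground_truth.head()`. `is_sorted_and_disjoint` is a hypothetical helper.

```python
def is_sorted_and_disjoint(rows):
    """True if rows are ordered by position and no two insertions overlap."""
    for (_, pos, length), (_, next_pos, _) in zip(rows, rows[1:]):
        # an insertion must end at or before the start of the next one
        if pos + length > next_pos:
            return False
    return True

# (id, position, length) rows copied from the ground_truth.head() output
head_rows = [
    ("l100-109", 851, 100),
    ("l150-101", 1299, 150),
    ("l200-124", 3796, 200),
    ("l150-97", 4970, 150),
    ("l200-43", 7405, 200),
]

print(is_sorted_and_disjoint(head_rows))        # True
print(is_sorted_and_disjoint(head_rows[::-1]))  # False
```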

### Check a sample of positions in the middle

For each sampled insertion, check that the sequence between it and the previous insertion (labelled "downstream" in the code) and the sequence between it and the next insertion (labelled "upstream") match the corresponding regions of the original query.

In [41]:
# get random positions from the ground truth list to verify them
ran_ind_list = random.sample(range(len(ground_truth['id'])), 10)
ran_ind_list.sort()

In [42]:
for ind in ran_ind_list:
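    # the first row has no previous insertion to compare against
    if (ind == 0):
        continue
    # the last row has no following insertion, so ind+1 would be out of range
    if (ind == len(ground_truth['id']) - 1):
        continue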
    # the ground truth positions represent indices in the query with insertions
    previous_insert_position = ground_truth.iloc[ind-1]['position']
    previous_insert_length = ground_truth.iloc[ind-1]['length']
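    # total inserted length up to and including the previous insertion;
    # subtracting it maps a position back to original query coordinates
    downstream_bias = sum(ground_truth.truncate(after=(ind-1))['length'])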

    insert_position = ground_truth.iloc[ind]['position']
    insert_length = ground_truth.iloc[ind]['length']
    # total inserted length up to and including the current insertion
    upstream_bias = sum(ground_truth.truncate(after=(ind))['length'])

    next_insert_position = ground_truth.iloc[ind+1]['position']

    original_query_downstream = str(original_query[0].seq[previous_insert_position - downstream_bias \
                                                          + previous_insert_length \
                                                          :insert_position - downstream_bias])

    insertion_downstream = str(query_with_insertions[0].seq[previous_insert_position + \
                                                            previous_insert_length:insert_position])

    if (original_query_downstream == insertion_downstream):
        print("Verified downstream sequence for insertion " + ground_truth.iloc[ind]['id'])
    else:
        print("Incorrect downstream " + ground_truth.iloc[ind]['id'])

    original_query_upstream = str(original_query[0].seq[insert_position - upstream_bias \
                                                        + insert_length \
                                                        :next_insert_position - upstream_bias])
    insertion_upstream = str(query_with_insertions[0].seq[insert_position + insert_length:next_insert_position])

    if (original_query_upstream == insertion_upstream):
        print("Verified upstream sequence for insertion " + ground_truth.iloc[ind]['id'])
    else:
        print("Incorrect upstream " + ground_truth.iloc[ind]['id'])
    

Verified downstream sequence for insertion l200-104
Verified upstream sequence for insertion l200-104
Verified downstream sequence for insertion l100-9
Verified upstream sequence for insertion l100-9
Verified downstream sequence for insertion l50-81
Verified upstream sequence for insertion l50-81
Verified downstream sequence for insertion l50-45
Verified upstream sequence for insertion l50-45
Verified downstream sequence for insertion l200-16
Verified upstream sequence for insertion l200-16
Verified downstream sequence for insertion l150-64
Verified upstream sequence for insertion l150-64
Verified downstream sequence for insertion l200-102
Verified upstream sequence for insertion l200-102
Verified downstream sequence for insertion l100-26
Verified upstream sequence for insertion l100-26
Verified downstream sequence for insertion l100-12
Verified upstream sequence for insertion l100-12
Verified downstream sequence for insertion l150-38
Verified upstream sequence for insertion l150-38
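
The `downstream_bias` and `upstream_bias` values are running totals of inserted length, used to shift a position in the query with insertions back to original query coordinates. The sketch below computes the same kind of running total with `itertools.accumulate` for the five rows shown by `ground_truth.head()`. `inserted_before` is a hypothetical helper, and the arithmetic assumes the rows are sorted by position.

```python
from itertools import accumulate

def inserted_before(rows):
    """For each row, the total length of all insertions that precede it."""
    lengths = [length for _, _, length in rows]
    # initial=0 gives the first row an offset of 0; drop the final grand total
    return list(accumulate(lengths, initial=0))[:-1]

# (id, position, length) rows copied from the ground_truth.head() output
head_rows = [
    ("l100-109", 851, 100),
    ("l150-101", 1299, 150),
    ("l200-124", 3796, 200),
    ("l150-97", 4970, 150),
    ("l200-43", 7405, 200),
]

offsets = inserted_before(head_rows)
# where each insertion point falls in the original query
original_points = [pos - off for (_, pos, _), off in zip(head_rows, offsets)]

print(offsets)          # [0, 100, 250, 450, 600]
print(original_points)  # [851, 1199, 3546, 4520, 6805]
```

In the loop above, `downstream_bias` for row `ind` corresponds to `offsets[ind]`, and `upstream_bias` is that value plus the length of row `ind` itself.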


### Check positions after last insertion

Is the trailing sequence (after the last insertion) equal for the original query and the query with insertions?

In [43]:
ground_truth.tail()

Unnamed: 0,id,position,length
495,l50-108,1086625,50
496,l100-42,1086830,100
497,l100-69,1097227,100
498,l100-8,1099134,100
499,l50-25,1109413,50


In [44]:
last_insertion_position = ground_truth.iloc[-1]['position']
last_insertion_length = ground_truth.iloc[-1]['length']
# total length of all insertions that precede the last one
bias = sum(ground_truth["length"]) - last_insertion_length

str(original_query[0].seq[last_insertion_position - bias:]) == \
str(query_with_insertions[0].seq[last_insertion_position + last_insertion_length:])

True
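
The checks in this notebook compare flanking regions piece by piece. As a complementary sketch, removing every ground truth interval from the query with insertions should give back the original query in a single comparison. `strip_insertions` and the toy sequences are hypothetical; lowercase letters mark the inserted bases.

```python
def strip_insertions(sequence, rows):
    """Remove every (id, position, length) interval from `sequence`."""
    kept, cursor = [], 0
    for _, position, length in sorted(rows, key=lambda row: row[1]):
        kept.append(sequence[cursor:position])  # stretch before this insertion
        cursor = position + length              # skip over the inserted bases
    kept.append(sequence[cursor:])              # trailing stretch
    return "".join(kept)

toy_original = "ACGTTGCAAC"
toy_with_insertions = "ACtttGTTGCggAAC"
toy_rows = [("t", 2, 3), ("g", 10, 2)]

restored = strip_insertions(toy_with_insertions, toy_rows)
print(restored == toy_original)  # True

# the lengths must also add up
total_inserted = sum(length for _, _, length in toy_rows)
print(len(toy_with_insertions) == len(toy_original) + total_inserted)  # True
```

Unlike the sampled checks, this covers the leading, middle, and trailing regions at once, at the cost of not saying which insertion is wrong when the comparison fails.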