# Overlaps and Edit Distance - An Analysis



> In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

*[Edit Distance - Wikipedia](https://en.wikipedia.org/wiki/Edit_distance)*
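
To make the definition above concrete, here is a minimal sketch of the classic (global) edit distance with unit costs for insertion, deletion and substitution. The name `edit_distance` is hypothetical and is not the `editDistance` function imported from `Py.editDistance` later in this notebook, whose implementation is not shown here.

```python
def edit_distance(a, b):
    """Classic edit distance between strings a and b (unit costs)."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        # Turning a[:i] into the empty string takes i deletions.
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # skip a character of a
                          curr[j - 1] + 1,     # skip a character of b
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Only two rows are kept in memory at a time, which is enough when the distance itself is all that is needed.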



In [None]:
from Py.geneReader import geneReader

filename = 'SeqFiles/chr1.GRCh38.excerpt.fasta'

# geneReader takes the filename directly, so no separate file handle is needed.

reads = geneReader ( filename )

In [None]:
from Py.editDistance import editDistance

import numpy as np

In [None]:
x = "GATTTACCAGATTGAG"

y = reads

D = [ ]

In [None]:
# One row per pattern character, plus one extra row for the empty prefix.

for i in range ( len ( x ) + 1 ) :

    # Each row holds one zero per sequence character, plus one for the empty prefix.

    D.append ( [ 0 ] * ( len ( y ) + 1 ) )

In [None]:
print ( 'Length of pattern:', len  ( x ) )

In [None]:
print ( 'Length of sequence:', len ( y ) )

In [None]:
D1 = np.array ( D )

print ( D1 )

In [None]:
np.shape ( D )

In [None]:
# First column: matching the first i pattern characters against an empty sequence costs i edits.

for i in range ( len ( x ) + 1 ) :

    D [ i ] [ 0 ] = i

In [None]:
D1 = np.array ( D )

print ( D1 )

In [None]:
# First row stays 0, so a match may begin at any offset in the sequence at no cost.

for j in range ( len ( y ) + 1 ) :

    D [ 0 ] [ j ] = 0

In [None]:
D1 = np.array ( D )

print ( D1 )

In [None]:
# Fills in the rest of the matrix, row by row, starting at row 1.

for i in range ( 1, len ( x ) + 1 ) :

    # Within each row, goes column by column, starting at column 1.

    for j in range ( 1, len ( y ) + 1 ) :

        # Value to the left of the current cell, plus a penalty of 1
        # for skipping a character of y.

        distHor = D [ i ] [ j - 1 ] + 1

        # Value above the current cell, plus a penalty of 1
        # for skipping a character of x.

        distVer = D [ i - 1 ] [ j ] + 1

        # A match incurs no penalty, so the diagonal (up/left) value
        # carries over unchanged.

        if x [ i - 1 ] == y [ j - 1 ] :

            distDiag = D [ i - 1 ] [ j - 1 ]

        # Otherwise, a substitution adds 1 to the diagonal value.

        else :

            distDiag = D [ i - 1 ] [ j - 1 ] + 1

        # The cell at row i, column j takes the minimum of the three candidates.

        D [ i ] [ j ] = min ( distHor, distVer, distDiag )

In [None]:
D1 = np.array ( D )

print ( D1 )

In [None]:
# The minimum of the bottom row is the edit distance of the best approximate match of x within y.

print ( min ( D [ -1 ] ) )
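
The cells above can be gathered into one function. The sketch below follows the same scheme (first column set to `i`, first row left at 0, minimum taken over the bottom row). The name `approximate_match` and the returned end offset are illustrative additions and not part of the original notebook.

```python
def approximate_match(pattern, text):
    """Return (distance, end) for the best approximate occurrence of pattern in text.

    `end` is the column of the first minimum in the bottom row, i.e. the number
    of text characters consumed when that occurrence ends.
    """
    rows, cols = len(pattern) + 1, len(text) + 1
    D = [[0] * cols for _ in range(rows)]

    # First column: i pattern characters against an empty text cost i edits.
    for i in range(rows):
        D[i][0] = i
    # The first row stays 0, so an occurrence may start anywhere in the text.

    for i in range(1, rows):
        for j in range(1, cols):
            dist_hor = D[i][j - 1] + 1
            dist_ver = D[i - 1][j] + 1
            mismatch = 0 if pattern[i - 1] == text[j - 1] else 1
            dist_diag = D[i - 1][j - 1] + mismatch
            D[i][j] = min(dist_hor, dist_ver, dist_diag)

    best = min(D[-1])
    return best, D[-1].index(best)


print(approximate_match("GATTACA", "CCGATTACACC"))  # (0, 9)
```

Returning the end column alongside the distance makes it possible to locate the match in the sequence. Recovering the start offset would additionally require a traceback through the matrix, which is not shown here.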

In [33]:
from Py.geneReader_Q import geneReader_Q

filename = 'SeqFiles/ERR266411_1.for_asm.fastq'

reads = geneReader_Q ( filename )

In [34]:
# Checking that our function works by outputting the first 10 sequences. 

reads[:10]

['TAAACAAGCAGTAGTAATTCCTGCTTTATCAAGATAATTTTTCGACTCATCAGAAATATCCGAAAGTGTTAACTTCTGCGTCATGGAAGCGATAAAACTC',
 'AACAAGCAGTAGTAATTCCTGCTTTATCAAGATAATTTTTCGACTCATCAGAAATATACGAAAGTGTTAACTTCTGCGTCATGGACACGAAAAAACTCCC',
 'AACAAGCAGTAGTAATTCCTGCTTTATCAAGATAATTTTTCGACTCATCAGAAATATCCGAAAGTGTTAACTTCTGCGTCATGGAAGCGATAAAACTCTG',
 'AGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATC',
 'GACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCG',
 'CTGTAGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTT',
 'CTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATCGCTTCCATGACGCAGAAGTTAAC',
 'CAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGA',
 'GTAAACAAGCAGTAGTAATTCCTGCTTTATCAAGATAATTTTTCGACTCATCAGCAATATCCGAAAGAGTTAACTTTTGCGTCATGGAAGCGATAAAACC',
 'GTAAACAAGCAGTAGTAATTCCTGCTTTATCAAGATAATTTTTCGACTCATCA

In [None]:
from itertools import permutations

list ( permutations ( [ 1, 2, 3 ], 2 ) )
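
The notebook does not say what the ordered pairs are used for. One plausible use, assumed here, is comparing every ordered pair of reads for suffix/prefix overlaps. The helper names `overlap` and `overlap_pairs`, the `min_length` parameter and the toy reads are all hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if no overlap of at least min_length exists. Assumes both
    strings are at least min_length characters long.
    """
    start = 0
    while True:
        # Look for the first min_length characters of b somewhere in a.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        # Check whether the whole suffix of a from this point is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def overlap_pairs(reads, min_length=3):
    """Map each ordered pair (a, b) with an overlap to the overlap length."""
    result = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_length)
        if n > 0:
            result[(a, b)] = n
    return result


toy_reads = ["ACGGATC", "GATCAAGT", "TTCACGGA"]
print(overlap_pairs(toy_reads))
```

`permutations` yields ordered pairs, which matters because the overlap of `a` with `b` is generally different from the overlap of `b` with `a`.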

In [31]:
# Starting with an empty set object, we will then add every k-mer association to it

k = 30

setObj = set()

?setObj

Type:        set
String form: set()
Length:      0
Docstring:
set() -> new empty set object
set(iterable) -> new set object

Build an unordered collection of unique elements.

In [51]:
k = 30

read = reads [ 0 ]

for i in range ( 0, len ( read ) - k + 1 ) :
    
    # We use the add method because we are dealing with a set, not a list.
    
    setObj.add ( read [ i : i + k ] )
    
?setObj

Type:        set
String form: {'AGCAGTAGTAATTCCTGCTTTATCAAGATA', 'TCCTGCTTTATCAAGATAATTTTTCGACTC', 'CTTTATCAAGATAATTTTTCGACTCAT <...> GCAGTAGTAATTCCTGCTTTATCAAGA', 'TATCAAGATAATTTTTCGACTCATCAGAAA', 'GAAAGTGTTAACTTCTGCGTCATGGAAGCG'}
Length:      71
Docstring:
set() -> new empty set object
set(iterable) -> new set object

Build an unordered collection of unique elements.
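
The loop above collects the k-mers of a single read. As a sketch of how this might extend to all reads, the hypothetical helpers below (`kmers` and `kmer_index` are not part of the original notebook) build the same kind of set per read and then map each k-mer to the reads that contain it. Reading "k-mer association" as such a mapping is an assumption.

```python
def kmers(read, k):
    """Return the set of distinct length-k substrings of read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def kmer_index(reads, k):
    """Map each k-mer to the set of reads in which it occurs."""
    index = {}
    for read in reads:
        for kmer in kmers(read, k):
            index.setdefault(kmer, set()).add(read)
    return index


print(sorted(kmers("ACGTAC", 3)))  # ['ACG', 'CGT', 'GTA', 'TAC']
```

A read of length 100 has 100 - 30 + 1 = 71 windows of length 30, which matches the set length of 71 reported above and implies that every 30-mer in that read is distinct.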