Dan Shea  
2021-06-02  
#### Edit Distance
Given two strings $s$ and $t$ (of possibly different lengths), the edit distance $d_E(s,t)$ is the minimum number of edit operations needed to transform $s$ into $t$, where an edit operation is defined as the substitution, insertion, or deletion of a single symbol.

The latter two operations incorporate the case in which a contiguous interval is inserted into or deleted from a string; such an interval is called a gap. For the purposes of this problem, the insertion or deletion of a gap of length $k$ still counts as $k$ distinct edit operations.
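
The definition above can be made concrete with the standard dynamic-programming recurrence for edit distance. The symbol $D(i,j)$ is introduced here for illustration only (it is not part of the problem statement) and denotes the edit distance between the first $i$ symbols of $s$ and the first $j$ symbols of $t$:

$$
D(i,0) = i, \qquad D(0,j) = j,
$$

$$
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 & \text{(deletion)} \\
D(i,j-1) + 1 & \text{(insertion)} \\
D(i-1,j-1) + [s_i \neq t_j] & \text{(match or substitution)}
\end{cases}
$$

Here $[s_i \neq t_j]$ is $1$ when the symbols differ and $0$ otherwise, and the edit distance of $s$ and $t$ is $D(|s|,|t|)$. Because every step adds or removes a single symbol, a gap of length $k$ costs $k$ under this recurrence.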

__Given:__ Two protein strings $s$ and $t$ in FASTA format (each of length at most 1000 aa).

__Return:__ The edit distance $d_E(s,t)$.

##### Sample Dataset
```
>Rosalind_39
PLEASANTLY
>Rosalind_11
MEANLY
```

##### Sample Output
```
5
```

In [1]:
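import numpy as np
def levenshtein(a, b):
    # Trivial cases: transforming to or from the empty string.
    if len(a) == 0:
        return len(b)
    if len(b) == 0:
        return len(a)
    I, J = len(a) + 1, len(b) + 1
    # M[i, j] is the edit distance between a[:i] and b[:j].
    # Use an integer table so the result is an exact count, not a float.
    M = np.zeros((I, J), dtype=int)
    # First column: delete all i symbols of a[:i].
    for i in range(I):
        M[i, 0] = i
    # First row: insert all j symbols of b[:j].
    for j in range(J):
        M[0, j] = j
    # Fill the table column by column (outer loop over b).
    for j, lb in enumerate(b, 1):
        for i, la in enumerate(a, 1):
            # A substitution costs 1 unless the symbols already match.
            cost = 0 if la == lb else 1
            M[i, j] = min(M[i, j-1] + 1,       # insertion
                          M[i-1, j] + 1,       # deletion
                          M[i-1, j-1] + cost)  # match or substitution
    return int(M[-1, -1])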
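
The full table is only needed if the alignment itself has to be recovered. For the distance alone, a two-row variant is enough. The following is a minimal sketch in plain Python, not the implementation used in this notebook; the function name `edit_distance_two_rows` is hypothetical.

```python
def edit_distance_two_rows(a, b):
    """Levenshtein distance keeping only two rows of the DP table."""
    # prev[j] is the distance between the symbols of a processed so far
    # (initially none) and the first j symbols of b.
    prev = list(range(len(b) + 1))
    for i, la in enumerate(a, 1):
        # Transforming a[:i] into the empty string costs i deletions.
        curr = [i]
        for j, lb in enumerate(b, 1):
            cost = 0 if la == lb else 1
            curr.append(min(curr[j - 1] + 1,      # insertion
                            prev[j] + 1,          # deletion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


print(edit_distance_two_rows("PLEASANTLY", "MEANLY"))  # prints 5
```

This uses memory proportional to the length of `b` instead of the product of both lengths, which matters little at 1000 aa but avoids the NumPy dependency.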

In [2]:
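from Bio import SeqIO
def parse_input_print_ans(filename):
    # Read the first two FASTA records and return (not print) their edit distance.
    seqio = SeqIO.parse(filename, 'fasta')
    # Convert the Seq objects to plain strings before comparing symbols.
    A = str(next(seqio).seq)
    B = str(next(seqio).seq)
    return int(levenshtein(A, B))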
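
If Biopython is not available, the two records can be read with a few lines of standard Python. This is a rough sketch that assumes well-formed FASTA input (a `>` header line followed by one or more sequence lines); the helper name `read_fasta_records` is hypothetical and not part of any library.

```python
def read_fasta_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequences may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


sample = [">Rosalind_39", "PLEASANTLY", ">Rosalind_11", "MEANLY"]
records = list(read_fasta_records(sample))
print(records)
```

An open file handle can be passed in place of the list, since iterating over a file yields its lines.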

In [3]:
parse_input_print_ans('sample.fasta')

5

In [4]:
parse_input_print_ans('rosalind_edit.txt')

417