### The Levenshtein distance (edit distance) problem

Edit distance quantifies how different two strings are by counting the minimum number of operations required to transform one string into the other.

The Levenshtein distance between two words is the minimum number of single-character edits (i.e., insertions, deletions, or substitutions) required to change one word into the other. Each of these operations has a unit cost.

For example:
```
Levenshtein distance between kitten and sitting is 3.

kitten -> sitten (substitution of s for k)
sitten -> sittin (substitution of i for e)
sittin -> sitting (insertion of g at the end)
```
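
As a quick check of the example, the three edits can be replayed one at a time. This is only an illustrative sketch: the variable names `word` and `steps` are made up here, and it shows that three edits are enough, not that fewer are impossible.

```python
# Replay the three edits from the example, one single-character operation at a time.
# `word` and `steps` are hypothetical names used only for this illustration.
word = "kitten"
steps = [word]

word = "s" + word[1:]              # substitute 's' for 'k'
steps.append(word)
word = word[:4] + "i" + word[5:]   # substitute 'i' for 'e'
steps.append(word)
word = word + "g"                  # insert 'g' at the end
steps.append(word)

print(" -> ".join(steps))  # kitten -> sitten -> sittin -> sitting
print(len(steps) - 1)      # 3
```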

This problem has optimal substructure: the best solution can be built from the best solutions to smaller, simpler subproblems, which can in turn be broken down until the solution becomes trivial.

We can think about transforming X[1...m] into Y[1...n] by first thinking about transforming substring X[1...i] into substring Y[1...j].

#### Case 1: We have reached the end of either substring.

If substring X is empty, insert all remaining characters of Y into X. The cost is equal to the number of characters left in substring Y.

If substring Y is empty, delete all remaining characters of X. The cost is equal to the number of characters left in X.

#### Case 2: The last characters of substrings X and Y are the same.

If the last characters match, nothing needs to be done. Keep going for X[1...i-1] and Y[1...j-1]. No cost.

#### Case 3: The last characters of substrings X and Y are different.

If the last characters are different, return the minimum of the following operations:

1. Insert the last character of Y into X. Then continue with the recursion. The size of Y reduces by 1 and X remains the same. ```('ABA', 'ABC') -> ('ABAC', 'ABC') -> ('ABA', 'AB')```
2. Delete the last character of X. Then continue with the recursion. ```('ABA', 'ABC') -> ('AB', 'ABC')```
3. Substitute the current character of X by the current character of Y. Then continue with the recursion. ```('ABA', 'ABC') -> ('ABC', 'ABC') -> ('AB', 'AB')```


In [10]:
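def edit_distance_recursion(a, b, ia=None, ib=None):
    # ia and ib index the last character still under consideration in a and b.
    if ia is None or ib is None:
        ia = len(a) - 1
        ib = len(b) - 1
    # Case 1: one string is exhausted, so the cost is what remains of the other.
    if ia == -1 or ib == -1:
        return max(ia, ib) + 1
    # Case 2: the last characters match, so move on at no cost.
    if a[ia] == b[ib]:
        return edit_distance_recursion(a, b, ia-1, ib-1)
    # Case 3: try each operation and keep the cheapest.
    insertion = edit_distance_recursion(a, b, ia, ib-1) + 1
    deletion = edit_distance_recursion(a, b, ia-1, ib) + 1
    substitution = edit_distance_recursion(a, b, ia-1, ib-1) + 1
    return min(insertion, deletion, substitution)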
    

In [7]:
string1 = "kitten"
string2 = "sitting"

In [11]:
%%time
edit_distance_recursion(string1, string2)

CPU times: user 1.59 ms, sys: 8 µs, total: 1.6 ms
Wall time: 1.62 ms


3
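
The plain recursion revisits the same `(ia, ib)` pairs many times, because each mismatch branches three ways. One possible remedy, which is not part of the original notebook, is to cache results with the standard library's `functools.lru_cache`. The sketch below is only illustrative: `edit_distance_memo` and `solve` are hypothetical names, and it tracks prefix lengths rather than last indices.

```python
from functools import lru_cache


def edit_distance_memo(a, b):
    """Top-down edit distance with caching (hypothetical helper name)."""

    @lru_cache(maxsize=None)
    def solve(ia, ib):
        # ia and ib are the lengths of the prefixes of a and b still in play.
        if ia == 0 or ib == 0:
            return max(ia, ib)  # Case 1: only insertions or deletions remain
        if a[ia - 1] == b[ib - 1]:
            return solve(ia - 1, ib - 1)  # Case 2: last characters match
        return 1 + min(
            solve(ia, ib - 1),      # insertion
            solve(ia - 1, ib),      # deletion
            solve(ia - 1, ib - 1),  # substitution
        )

    return solve(len(a), len(b))


print(edit_distance_memo("kitten", "sitting"))  # 3
```

Each pair of prefix lengths is now solved once, although very long strings could still run into Python's recursion limit.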

### With dynamic programming, we can build from the bottom up

Make a 2D array so we can visualize comparing strings as they grow.

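To convert an empty string into another string, every operation is an insertion, so the cost equals the length of the desired string. Converting a string into the empty string is all deletions, at the same cost. These values fill the first row and first column and serve as our baseline.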

As we fill in the array, if the letters match, there is no work to be done. We look at what the cost would be if those letters weren't there (upper left) and reuse that value.

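If the letters do not match, we add a cost of one to the cheapest of the three neighboring cells, which correspond to substituting, deleting, or inserting the letter. For substitution, look to the upper left. For deletion, look left. For insertion, look up. In the table below, kitten runs across the columns and sitting runs down the rows.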

                  k     i     t     t     e     n
             0    1     2     3     4     5     6
         s   1    1     2     3     4     5     6
         i   2    2     1     2     3     4     5
         t   3    3     2     1     2     3     4
         t   4    4     3     2     1     2     3
         i   5    5     4     3     2     2     3
         n   6    6     5     4     3     3     2
         g   7    7     6     5     4     4     3

In [17]:
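def edit_distance_dy(a, b):
    # lookup[i][j] holds the edit distance between a[:i] and b[:j].
    lookup = [[0] * (len(b) + 1) for _ in range(len(a)+1)]
    # Baseline: converting a prefix to or from the empty string costs its length.
    for i in range(len(lookup)):
        lookup[i][0] = i
    for j in range(len(lookup[0])):
        lookup[0][j] = j
    for row in range(1, len(lookup)):
        for col in range(1, len(lookup[0])):
            if a[row-1] == b[col-1]:
                # Matching letters: reuse the cost from the upper-left cell.
                lookup[row][col] = lookup[row-1][col-1]
            else:
                # Otherwise add one to the cheapest of the three neighboring cells.
                lookup[row][col] = min(lookup[row-1][col-1], lookup[row-1][col], lookup[row][col-1]) + 1
    return lookup[-1][-1]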

In [19]:
%%time
edit_distance_dy(string1, string2)

CPU times: user 46 µs, sys: 1 µs, total: 47 µs
Wall time: 48.9 µs


3
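
To inspect the whole table rather than just the final cell, a small variant can return every row. This is a sketch under stated assumptions: `edit_distance_table` is a hypothetical name, and it lays the table out as in the hand-drawn example (the source word across the columns, the target word down the rows), which is the transpose of what `edit_distance_dy(string1, string2)` builds internally.

```python
def edit_distance_table(source, target):
    """Return the full cost table (hypothetical helper name).

    Rows follow `target`, columns follow `source`.
    """
    rows, cols = len(target) + 1, len(source) + 1
    table = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        table[r][0] = r  # build target[:r] from an empty source: r insertions
    for c in range(cols):
        table[0][c] = c  # reduce source[:c] to an empty target: c deletions
    for r in range(1, rows):
        for c in range(1, cols):
            if target[r - 1] == source[c - 1]:
                table[r][c] = table[r - 1][c - 1]
            else:
                table[r][c] = 1 + min(
                    table[r - 1][c - 1],  # substitution: upper left
                    table[r][c - 1],      # deletion: left
                    table[r - 1][c],      # insertion: up
                )
    return table


table = edit_distance_table("kitten", "sitting")
# Label each row with the target letter it ends on (blank for the empty prefix).
for label, row in zip(" " + "sitting", table):
    print(label, *row)
```

Because every operation has unit cost, the distance is symmetric, so swapping the two arguments transposes the table without changing the bottom-right value.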