In [None]:
# setup
from IPython.display import display, HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
display(HTML(open('rise.css').read()))

# imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid", font_scale=1.5, rc={'figure.figsize':(12, 6)})


# CMPS 2200
# Introduction to Algorithms

## Dynamic Programming - Edit Distance


Let's look at 4 different paradigms of algorithm design:

- Divide and Conquer
- Brute Force
- Greedy
- Dynamic Programming 

(Randomization can be added to any of these.)

Divide and Conquer and Greedy both had a restriction: we always recurse on solutions to one or more *fixed-size* subproblems.

In certain cases, for example using a greedy approach for the Knapsack Problem, this prevented us from reaching an optimal solution.

Dynamic Programming considers **all possible** subproblems. This is sort of a hybrid of brute force and Greedy/Divide and Conquer.

### Edit Distance

Given two strings $S, T \in \Sigma^*$, how similar are they?

We can measure this using *edit distance*, which is the number of insertions and deletions needed to turn $S$ into $T$. Note that we can also go from $T$ to $S$ by reversing the edits (turning insertions into deletions and deletions into insertions).

Example: $S$ = `abcdefghijkl`, $T$ = `abcdghikjl`. How many edits are needed?

Consider the following edit sequence:

$S$: `abcdefghijkl---`<br>
$T$: `abcd--ghi---kjl`

This has 5 deletions and 3 insertions, for a total of 8 edits. What about this one:

$S$: `abcdefghijk-l`<br>
$T$: `abcd--ghi-kjl`

We have 3 deletions and 1 insertion for a total of 4 edits.
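The counts in the two alignments above can be checked mechanically. The following is a minimal sketch, assuming an alignment is written as two equal-length strings with `-` marking a gap; the helper name `count_edits` is hypothetical and not part of the original notes.

```python
def count_edits(s_aligned, t_aligned):
    """Count (deletions, insertions) in a gapped alignment of S and T.

    A '-' in t_aligned means the character of S in that column is deleted;
    a '-' in s_aligned means the character of T in that column is inserted.
    """
    assert len(s_aligned) == len(t_aligned), "alignment rows must have equal length"
    deletions = insertions = 0
    for a, b in zip(s_aligned, t_aligned):
        if a == '-' and b != '-':
            insertions += 1
        elif b == '-' and a != '-':
            deletions += 1
        else:
            # No substitutions are allowed, so aligned characters must match.
            assert a == b, "aligned characters must match"
    return deletions, insertions

print(count_edits('abcdefghijkl---', 'abcd--ghi---kjl'))  # (5, 3)
print(count_edits('abcdefghijk-l', 'abcd--ghi-kjl'))      # (3, 1)
```

The total edit count of an alignment is the sum of the two numbers returned.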

Our goal is to compute the **minimum edit distance** between two strings $S$ and $T$ of lengths $m$ and $n$, respectively.

It might seem like a toy problem, but it is critical for comparing gene/protein sequences, and it also appears in online tools (e.g., Git, Overleaf, Google Docs). By attaching weights to insertions and deletions, we can assess the evolutionary distance between two sequences.



Notice that once again, if we greedily apply edits to the beginning or end of the string we might miss a set of edits interspersed throughout the string. 


**Can we identify an optimal substructure property for this problem?**

<br>

Let's use case-based reasoning about the optimal solution as we did for Knapsack. Let $\mathit{MED}(S, T)$ be the optimal number of edits between $S$ and $T$. 

<br><br><br>

In an optimal sequence of edits, how would we deal with the first characters of $S$ and $T$, that is, $S[0]$ and $T[0]$?

<br><br>

For the base cases, if $S$ is empty and $T$ is not, what is the edit cost?  
S=` ` T=`abcde`

<br><br><br>


If either string is empty, then the edit cost is simply the length of the other string.

<br><br>

What if $S[0] = T[0]$?  
S=`abc` T=`ade`

<br><br>

Then there is no benefit to editing the first character, and $\mathit{MED}(S, T) = \mathit{MED}(S[1:], T[1:])$. 

<br><br>
What if $S[0] \neq T[0]$?  
S=`abc` T=`bde`

<br><br><br>
Then we must incur one edit, either an insertion or a deletion. The two options are `Delete S[0] from S` and `Insert T[0] at the front of S`; we take whichever leads to fewer total edits.

$\rightarrow 1+\mathit{MED}(S[1:], T)~~~~$    e.g., $1+\mathit{MED}($ `bc` , `bde` $)$  
or   
$\rightarrow 1+\mathit{MED}(S, T[1:])~~~~$  e.g., $1+\mathit{MED}($ `abc` , `de` $)$  


<br><br>
If we also allow substitution, then we can replace `S[0]` with `T[0]` at a cost of one edit, which gives a third option:  
$\rightarrow 1+\mathit{MED}(S[1:], T[1:])~~~~$<br>
e.g., S = `abc`, T = `dbc`, where a single substitution suffices.  
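As a rough illustration of the substitution case, the sketch below adds a third branch to the recursion. It assumes every edit (insertion, deletion, substitution) costs 1; the function name `med_with_substitution` is hypothetical and chosen only for this example.

```python
def med_with_substitution(S, T):
    """Edit distance when insertions, deletions, and substitutions each cost 1."""
    if S == "":
        return len(T)
    if T == "":
        return len(S)
    if S[0] == T[0]:
        return med_with_substitution(S[1:], T[1:])
    return 1 + min(
        med_with_substitution(S[1:], T),      # delete S[0]
        med_with_substitution(S, T[1:]),      # insert T[0]
        med_with_substitution(S[1:], T[1:]),  # substitute S[0] with T[0]
    )

print(med_with_substitution('abc', 'dbc'))  # 1
```

Without substitution, the same pair needs two edits (delete `a`, insert `d`), so allowing substitution can only lower the distance or leave it unchanged.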



<br>

**Optimal Substructure for Edit Distance**: Let $S$ and $T$ be strings of lengths $m$ and $n$, respectively. Then,

$$\mathit{MED}(S, T) = 
\begin{cases}
n, & \mbox{if}~S~\mbox{is empty} \\
m, & \mbox{if}~T~\mbox{is empty} \\
\mathit{MED}(S[1:], T[1:]), & \mbox{if}~S[0]=T[0] \\
1+\min\{\mathit{MED}(S[1:], T),\mathit{MED}(S, T[1:])\}, & \mbox{otherwise}
\end{cases}
$$

Just as with Knapsack, the recursion tree for this recurrence has an exponential number of nodes. How many nodes are there, and what is the depth? 


The recursion tree has $O(2^{m+n})$ nodes and depth $O(m+n)$. Are there shared subproblems?

For $S$=`ABC` and $T$=`DBC` we have the following DAG:

<img src="edit_distance_DAG.jpg" width="60%">

How much sharing is possible? In other words, how many distinct subproblems are there?

Every recursive call removes the first character from one or both strings, so every subproblem is a pair consisting of a suffix of $S$ and a suffix of $T$. $S$ has $m+1$ suffixes and $T$ has $n+1$, so there are at most $(m+1)(n+1) = O(mn)$ distinct subproblems, each of which can be computed in $O(1)$ time (if we have precomputed the necessary dependencies). The longest path in the recursion DAG is $O(m+n)$.
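To see the gap between the size of the recursion tree and the number of distinct subproblems, we can instrument the recursion. This is only a sketch: the helper `count_calls` and the test strings are made up for illustration, and the strings are chosen to share no characters so that every call branches twice.

```python
def count_calls(S, T, seen):
    """Follow the plain recursion and return the number of calls it makes.

    `seen` collects every distinct (S, T) pair the recursion visits.
    """
    seen.add((S, T))
    if S == "" or T == "":
        return 1
    if S[0] == T[0]:
        return 1 + count_calls(S[1:], T[1:], seen)
    return 1 + count_calls(S[1:], T, seen) + count_calls(S, T[1:], seen)

S, T = 'abcdefgh', 'ijklmnop'  # no characters in common
seen = set()
calls = count_calls(S, T, seen)
# total calls, distinct subproblems visited, and the (m+1)(n+1) bound
print(calls, len(seen), (len(S) + 1) * (len(T) + 1))  # 25739 80 81
```

Even for two strings of length 8, the tree makes tens of thousands of calls while visiting fewer than $(m+1)(n+1)$ distinct pairs.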


In [1]:
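def MED(S, T):
    """Minimum number of insertions and deletions needed to turn S into T."""
    #print("S:%s, T:%s" % (S, T))
    # Base cases: if one string is empty, the cost is the length of the other.
    if S == "":
        return len(T)
    if T == "":
        return len(S)
    # Matching first characters cost nothing; recurse on the remainders.
    if S[0] == T[0]:
        return MED(S[1:], T[1:])
    # Otherwise pay one edit: insert T[0] (first call) or delete S[0] (second call).
    return 1 + min(MED(S, T[1:]), MED(S[1:], T))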

# S= "abcdefghijkl"
# T= "abcdghikjl"
S = 'kitten'
T = 'sitting'
print(MED(S, T))

5


Any ideas for memoization?
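One possible approach, sketched here under assumptions rather than given as the definitive answer: since every subproblem is a pair of suffixes, we can identify it by a pair of start indices $(i, j)$ and cache the result. The names `MED_memo` and `med` are hypothetical, and `functools.lru_cache` is just one convenient way to memoize in Python.

```python
from functools import lru_cache

def MED_memo(S, T):
    """Top-down edit distance (insertions and deletions only) with memoization."""
    @lru_cache(maxsize=None)
    def med(i, j):
        # med(i, j) is the edit distance between the suffixes S[i:] and T[j:].
        if i == len(S):
            return len(T) - j
        if j == len(T):
            return len(S) - i
        if S[i] == T[j]:
            return med(i + 1, j + 1)
        return 1 + min(med(i + 1, j), med(i, j + 1))
    return med(0, 0)

print(MED_memo('kitten', 'sitting'))           # 5
print(MED_memo('abcdefghijkl', 'abcdghikjl'))  # 4
```

Each $(i, j)$ pair is computed at most once, so the number of evaluations is bounded by $(m+1)(n+1)$. Using indices instead of string slices also avoids copying the strings at every call.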

<img src="med_table.png" width="60%">


<img src="med_table_rs.png" width="60%">
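The $O(mn)$ subproblems can also be filled in iteratively as a table. The sketch below assumes the table is indexed by suffix start positions, so that `table[i][j]` is the distance between `S[i:]` and `T[j:]`; the pictured tables may use a different indexing, and the name `MED_table` is hypothetical.

```python
def MED_table(S, T):
    """Bottom-up edit distance (insertions and deletions only)."""
    m, n = len(S), len(T)
    # table[i][j] holds the edit distance between S[i:] and T[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    # Base cases: one suffix is empty, so the cost is the length of the other.
    for i in range(m + 1):
        table[i][n] = m - i
    for j in range(n + 1):
        table[m][j] = n - j
    # Fill from the ends of the strings back toward the start, so that
    # every entry's dependencies are already computed.
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if S[i] == T[j]:
                table[i][j] = table[i + 1][j + 1]
            else:
                table[i][j] = 1 + min(table[i + 1][j], table[i][j + 1])
    return table[0][0]

print(MED_table('kitten', 'sitting'))  # 5
```

The two nested loops do constant work per entry, which matches the $O(mn)$ count of subproblems with $O(1)$ work each.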