In [1]:
try:
    %load_ext autotime
except Exception:  # the extension is not installed yet, so install it and retry
    !pip install ipython-autotime
    %load_ext autotime

Collecting ipython-autotime
  Downloading ipython_autotime-0.3.1-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: ipython-autotime
Successfully installed ipython-autotime-0.3.1
time: 1.98 ms (started: 2023-10-21 02:11:12 +00:00)


## Dynamic Programming Algorithm for Longest Common Subsequence

$\newcommand\len{\mathsf{len}}
\newcommand\lcss{\mathsf{lcss}}$


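Recall the recurrence, written here with `Python`-friendly 0-based indices. We assume that the arguments $i,j$ satisfy
$0 \leq i \leq \len(s_1)$ and $0 \leq j \leq \len(s_2)$, and $\lcss(i, j)$ denotes the length of a longest common subsequence of the suffixes of $s_1$ and $s_2$ that start at positions $i$ and $j$.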


$$\lcss(i, j) = \begin{cases}
0 & i \geq \len(s_1) \\
0 & j \geq \len(s_2) \\
1 + \lcss(i+1, j+1 ) &  s_1[i] = s_2[j] \\
\max( \lcss(i+1, j), \lcss(i, j+1) ) & \text{otherwise} \\
\end{cases} $$

In [2]:
def lcs(s1, s2, i, j):
    assert 0 <= i and i <= len(s1)
    assert 0 <= j and j <= len(s2)
    if i == len(s1):
        return 0
    if j == len(s2):
        return 0
    if s1[i] == s2[j]:
        return 1 + lcs(s1, s2, i+1, j+1)
    else:
        return max(lcs(s1, s2, i+1, j), lcs(s1, s2, i, j+1))

time: 2.17 ms (started: 2023-10-21 02:11:12 +00:00)


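Warning: implemented directly, the recurrence above is very inefficient. The same $(i, j)$ subproblems are recomputed over and over, so the running time can grow exponentially with the lengths of the strings. See for yourself.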

In [3]:
s1 = "GATTACA"
s2 = "ACTGATAACAA"
print(lcs(s1, s2, 0, 0))

6
time: 6.36 ms (started: 2023-10-21 02:11:12 +00:00)


In [4]:
s1 = "GGATTACCATTATGGAGGCGGA"
s2 = "ACTTAGGTAGG"
print(lcs(s1, s2, 0, 0))

10
time: 348 ms (started: 2023-10-21 02:11:12 +00:00)


In [5]:
# This is just slightly longer and will take more than a minute and a half to run
s1 = "GGATTACCATTATGGAGGCGGA"
s2 = "ACTTAGGTAGATTATCCG"
print(lcs(s1, s2, 0, 0))

11
time: 1min 39s (started: 2023-10-21 02:11:12 +00:00)
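
To see where the time goes, here is a minimal sketch that counts how often the plain recursion is entered. The helper name `count_lcs_calls` is our own and is not part of the notebook; it simply wraps the same recurrence with a counter. Compare the number of calls with the number of distinct $(i, j)$ pairs, which is only $(\len(s_1)+1)(\len(s_2)+1)$.

```python
def count_lcs_calls(s1, s2):
    """Return (lcs_length, number_of_calls) for the plain recursion.

    Hypothetical helper for illustration only: same recurrence as lcs above,
    plus a counter that is bumped on every call.
    """
    calls = 0

    def rec(i, j):
        nonlocal calls
        calls += 1
        if i == len(s1) or j == len(s2):
            return 0
        if s1[i] == s2[j]:
            return 1 + rec(i + 1, j + 1)
        return max(rec(i + 1, j), rec(i, j + 1))

    return rec(0, 0), calls


a, b = "GATTACA", "ACTGATAACAA"
length, calls = count_lcs_calls(a, b)
distinct = (len(a) + 1) * (len(b) + 1)
print(f"length={length}, calls={calls}, distinct (i, j) pairs={distinct}")
```

Even for these short strings the call count should be far larger than the number of distinct pairs, which is exactly the redundancy that a memo table removes.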


In [6]:
# Slightly longer strings take "forever" to run (this run was interrupted by hand)
s1 = "GGATTACACATTACCTATAGGTATAAT"
s2 = "GGATTTATCTATAAATTACCTATTTATTATATTACCGTATGGTATGC"
print(lcs(s1, s2, 0, 0))

KeyboardInterrupt: ignored

time: 43min 57s (started: 2023-10-21 02:12:52 +00:00)
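
Before filling a table by hand, one quick way to see the effect of memoization is to cache the recursion itself. The sketch below is an alternative we are adding for comparison, not the approach used in the rest of this notebook: it assumes the standard library's `functools.lru_cache`, and the name `lcs_length_topdown` is our own. It returns only the length, not the subsequence.

```python
from functools import lru_cache


def lcs_length_topdown(s1, s2):
    """Length of a longest common subsequence, via a cached recursion."""

    @lru_cache(maxsize=None)  # remember the answer for every (i, j) we have seen
    def rec(i, j):
        if i == len(s1) or j == len(s2):
            return 0
        if s1[i] == s2[j]:
            return 1 + rec(i + 1, j + 1)
        return max(rec(i + 1, j), rec(i, j + 1))

    return rec(0, 0)


# The pair of strings that had to be interrupted above.
print(lcs_length_topdown("GGATTACACATTACCTATAGGTATAAT",
                         "GGATTTATCTATAAATTACCTATTTATTATATTACCGTATGGTATGC"))
```

Each $(i, j)$ pair is now evaluated at most once. One caveat: the recursion depth can reach roughly $\len(s_1) + \len(s_2)$, so very long strings may hit Python's recursion limit, which is one reason to prefer an explicit table.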


In [7]:
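# Let's memoize: store the answer for every (i, j) in a table,
# filling it from the ends of both strings back towards (0, 0).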

def memoize_lcs(s1, s2):
    m = len(s1)
    n = len(s2)
    # let's create a memo table and fill it with zeros. This will nicely take care of the base cases.
    memo_tbl = [ [0 for j in range(n+1)] for i in range(m+1)]
    sol_info = [ ['' for j in range(n+1)] for i in range(m+1)] # This will help us recover solutions
    for i in range(m-1, -1, -1): # iterate from m-1 to 0 with a step of -1
        for j in range(n-1, -1, -1):
            if s1[i] == s2[j]:
                memo_tbl[i][j] = memo_tbl[i+1][j+1] + 1
                sol_info[i][j] = 'match'
            else:
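                # Python compares tuples element by element, so max() picks the larger
                # length and carries its label along; on a tie the labels are compared
                # as strings and 'right' wins. If you are new to Python, feel free to
                # write this out as an explicit if/else instead.
                # 'right' means advance i (skip s1[i]); 'down' means advance j (skip s2[j]).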
                memo_tbl[i][j], sol_info[i][j] = max((memo_tbl[i+1][j],'right'), (memo_tbl[i][j+1], 'down'))
    # Now let us recover the longest common subsequence
    lcs = '' # initialize it to the empty string
    match_locations = [] # (i, j) index pairs where s1[i] is matched with s2[j]
    i = 0
    j = 0 # start at top left corner
    while (i < m and j < n):
        if sol_info[i][j] == 'match':
            assert s1[i] == s2[j]
            lcs = lcs + s1[i]
            match_locations.append((i,j))
            i,j = i + 1, j + 1
        elif sol_info[i][j] == 'right':
            i, j = i+1, j
        else:
            assert sol_info[i][j] == 'down'
            i, j = i, j+1
    return lcs, match_locations

time: 2.68 ms (started: 2023-10-21 02:56:53 +00:00)
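
The tuple trick inside `memoize_lcs` is worth a closer look. The sketch below shows how `max` behaves on `(length, label)` tuples and spells out the same choice as an explicit `if`/`else`. The helper name `pick_move` is our own, introduced only for illustration.

```python
# Tuples are compared element by element, left to right.
print(max((3, 'right'), (5, 'down')))   # (5, 'down'): the larger length wins
print(max((4, 'right'), (4, 'down')))   # (4, 'right'): tie on length, and 'right' > 'down' as strings


def pick_move(len_skip_s1, len_skip_s2):
    """Explicit version of max((len_skip_s1, 'right'), (len_skip_s2, 'down')).

    Hypothetical helper: len_skip_s1 plays the role of memo_tbl[i+1][j]
    and len_skip_s2 the role of memo_tbl[i][j+1].
    """
    if len_skip_s1 >= len_skip_s2:
        return len_skip_s1, 'right'
    return len_skip_s2, 'down'


print(pick_move(4, 4))
```

Ties therefore always resolve to `'right'`. A different tie-breaking rule would still give a longest common subsequence, but possibly a different one with different match locations.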


In [8]:
s1 = "GATTACA"
s2 = "ACTGATAACAA"
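# Use a fresh name for the result so that the lcs() function defined earlier is not overwritten
lcs_str, match_locations = memoize_lcs(s1, s2)
print(f'Longest common subsequence: {lcs_str} length= {len(lcs_str)}')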
print('Matches:')
print('\t Char:\t i, j')
for (i, j) in match_locations:
    print(f'\t {s1[i]}:\t {i}, {j}')


Longest common subsequence: ATTACA length= 6
Matches:
	 Char:	 i, j
	 A:	 1, 0
	 T:	 2, 2
	 T:	 3, 5
	 A:	 4, 6
	 C:	 5, 8
	 A:	 6, 9
time: 8.39 ms (started: 2023-10-21 02:56:56 +00:00)
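
It is easy to sanity-check a result like this one. The sketch below uses two small helpers of our own (`is_subsequence` and `check_matches`, neither is part of the notebook) to confirm that the reported string is a subsequence of both inputs and that the match locations are consistent. The data is copied from the output above.

```python
def is_subsequence(sub, s):
    """True if the characters of sub appear in s in the same order."""
    it = iter(s)
    # `ch in it` advances the iterator until it finds ch (or runs out),
    # so later characters can only be matched further to the right.
    return all(ch in it for ch in sub)


def check_matches(s1, s2, matches):
    """True if every (i, j) pairs equal characters and both indices strictly increase."""
    same_char = all(s1[i] == s2[j] for i, j in matches)
    increasing = all(a[0] < b[0] and a[1] < b[1]
                     for a, b in zip(matches, matches[1:]))
    return same_char and increasing


s1 = "GATTACA"
s2 = "ACTGATAACAA"
reported = "ATTACA"
reported_matches = [(1, 0), (2, 2), (3, 5), (4, 6), (5, 8), (6, 9)]

print(is_subsequence(reported, s1), is_subsequence(reported, s2))
print(check_matches(s1, s2, reported_matches))
```

These checks confirm that the answer is a common subsequence. They do not prove that it is a longest one; that guarantee comes from the recurrence.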


In [9]:
s1 = "GGATTACCATTATGGAGGCGGA"
s2 = "ACTTAGGTAGG"
lcs_str, match_locations = memoize_lcs(s1, s2)
print(f'Longest common subsequence: {lcs_str} length= {len(lcs_str)}')
print('Matches:')
print('\t Char:\t i, j')
for (i, j) in match_locations:
    print(f'\t {s1[i]}:\t {i}, {j}')

Longest common subsequence: ACTTAGGAGG length= 10
Matches:
	 Char:	 i, j
	 A:	 2, 0
	 C:	 6, 1
	 T:	 9, 2
	 T:	 10, 3
	 A:	 11, 4
	 G:	 13, 5
	 G:	 14, 6
	 A:	 15, 8
	 G:	 16, 9
	 G:	 17, 10
time: 1.95 ms (started: 2023-10-21 02:56:59 +00:00)


In [10]:
s1 = "GGATTACCATTATGGAGGCGGA"
s2 = "ACTTAGGTAGATTATCCG"
lcs_str, match_locations = memoize_lcs(s1, s2)
print(f'Longest common subsequence: {lcs_str} length= {len(lcs_str)}')
print('Matches:')
print('\t Char:\t i, j')
for (i, j) in match_locations:
    print(f'\t {s1[i]}:\t {i}, {j}')

Longest common subsequence: ACTTAGGAGCG length= 11
Matches:
	 Char:	 i, j
	 A:	 2, 0
	 C:	 6, 1
	 T:	 9, 2
	 T:	 10, 3
	 A:	 11, 4
	 G:	 13, 5
	 G:	 14, 6
	 A:	 15, 8
	 G:	 16, 9
	 C:	 18, 15
	 G:	 20, 17
time: 2.04 ms (started: 2023-10-21 02:57:01 +00:00)


In [11]:
# The longer strings that took "forever" with the plain recursion now finish in
# milliseconds, because each (i, j) entry of the table is computed only once
s1 = "GGATTACACATTACCTATAGGTATAAT"
s2 = "GGATTTATCTATAAATTACCTATTTATTATATTACCGTATGGTATGC"
lcs_str, match_locations = memoize_lcs(s1, s2)
print(f'Longest common subsequence: {lcs_str} length= {len(lcs_str)}')
print('Matches:')
print('\t Char:\t i, j')
for (i, j) in match_locations:
    print(f'\t {s1[i]}:\t {i}, {j}')

Longest common subsequence: GGATTACAATTACCTATATATAAT length= 24
Matches:
	 Char:	 i, j
	 G:	 0, 0
	 G:	 1, 1
	 A:	 2, 2
	 T:	 3, 3
	 T:	 4, 4
	 A:	 5, 6
	 C:	 6, 8
	 A:	 7, 10
	 A:	 9, 12
	 T:	 10, 15
	 T:	 11, 16
	 A:	 12, 17
	 C:	 13, 18
	 C:	 14, 19
	 T:	 15, 20
	 A:	 16, 21
	 T:	 17, 22
	 A:	 18, 25
	 T:	 21, 26
	 A:	 22, 28
	 T:	 23, 29
	 A:	 24, 30
	 A:	 25, 33
	 T:	 26, 37
time: 6.44 ms (started: 2023-10-21 02:57:41 +00:00)
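
If only the length is needed, the full table is more than we have to keep: row $i$ of the table depends only on row $i+1$. The following sketch is a variation of our own (the name `lcs_length_two_rows` is hypothetical). It keeps two rows at a time, at the cost of no longer being able to recover the subsequence itself.

```python
def lcs_length_two_rows(s1, s2):
    """Length of a longest common subsequence using O(len(s2)) extra memory."""
    m, n = len(s1), len(s2)
    nxt = [0] * (n + 1)              # plays the role of row i+1 of the memo table
    for i in range(m - 1, -1, -1):
        cur = [0] * (n + 1)          # row i; cur[n] stays 0 as the base case
        for j in range(n - 1, -1, -1):
            if s1[i] == s2[j]:
                cur[j] = nxt[j + 1] + 1
            else:
                cur[j] = max(nxt[j], cur[j + 1])
        nxt = cur
    return nxt[0]


print(lcs_length_two_rows("GATTACA", "ACTGATAACAA"))
```

The running time is still proportional to $\len(s_1) \cdot \len(s_2)$; only the memory use changes.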
