## Longest Repeated Subsequence Problem

https://www.techiedelight.com/longest-repeated-subsequence-problem/

The longest repeated subsequence (LRS) problem is to find the longest subsequence of a string that occurs at least twice.

Unlike substrings, subsequences are not required to occupy consecutive positions within the original string.

Consider the sequence `ATACTCGGA`.

The longest repeated subsequence is `ATCG`:

**A** **T** A **C** T C **G** G A

A T **A** C **T** **C** G **G** A

At each position of the subsequence, the two occurrences must use a different index of the input string.
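As a quick check of the example above, the two highlighted occurrences can be written as index lists and compared. This is only an illustrative sketch: the names `first` and `second` are not from the original notebook, and the 0-based indices are read off the bold letters in the two lines above.

```python
s = "ATACTCGGA"

# 0-based positions of the bold letters in the two lines above
# (hypothetical names, introduced only for this check).
first = [0, 1, 3, 6]
second = [2, 4, 5, 7]

# Both index lists spell out the same subsequence...
assert "".join(s[i] for i in first) == "ATCG"
assert "".join(s[j] for j in second) == "ATCG"

# ...and at every position the two occurrences use different indices.
assert all(i != j for i, j in zip(first, second))
```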

In [1]:
"""
Solve with recursion
"""
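def lrs_rec(string, subseq=0, i=0, j=1):
    # i walks the earlier occurrence, j the later one; keeping i < j
    # guarantees a character is never matched against itself.
    if len(string) < 2:
        return 0
    if j >= len(string):
        # j ran off the end: no further matches are possible.
        return subseq
    if i >= j:
        # i caught up with j: push j forward before comparing again.
        return lrs_rec(string, subseq, i, j+1)
    if string[i] == string[j]:
        # Match: count it and advance both indices.
        return lrs_rec(string, subseq+1, i+1, j+1)
    # No match: skip a character on either side and keep the better result.
    iskip = lrs_rec(string, subseq, i+1, j)
    jskip = lrs_rec(string, subseq, i, j+1)
    return max(iskip, jskip)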

In [2]:
%%time
lrs_rec("ATACTCGGA")

CPU times: user 2.9 ms, sys: 38 µs, total: 2.94 ms
Wall time: 3.36 ms


4

In [3]:
lrs_rec("ATATT")

3

In [4]:
"""
Given solution, which is pretty much the same thing
"""

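def lrs_rec_sol(X, m, n):
    # m and n are prefix lengths; an empty prefix has no repeated subsequence.
    if m == 0 or n == 0:
        return 0
    # Last characters match at different indices (m != n): count the match.
    if X[m-1] == X[n-1] and m != n:
        return lrs_rec_sol(X, m-1, n-1) + 1
    # Otherwise drop the last character from one prefix or the other.
    return max(lrs_rec_sol(X, m, n-1), lrs_rec_sol(X, m-1, n))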

In [5]:
%%time
X = "SQUIDDYPIEISTOOSALTYSQUIDPIESHOULDNOTBESOSALTY"
m = len(X)
lrs_rec_sol(X, m, m)

KeyboardInterrupt: 

In [18]:
X = "ATACTCGGA"
m = len(X)
lrs_rec_sol(X, m, m)

4

In [7]:
X = "AAAAAAAAAA"
m = len(X)
lrs_rec_sol(X, m, m)

9

In [8]:
X = "ATATT"
m = len(X)
lrs_rec_sol(X, m, m)

3
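The call on the 46-letter string above had to be interrupted because the plain recursion solves the same `(m, n)` pairs over and over. A minimal sketch of one way around that, assuming the same recurrence as `lrs_rec_sol`, is to cache results with `functools.lru_cache`. The name `lrs_memo` and the inner helper `solve` are hypothetical and not part of the original notebook.

```python
from functools import lru_cache


def lrs_memo(X):
    """Length of the longest repeated subsequence of X, with memoization."""

    @lru_cache(maxsize=None)
    def solve(m, n):
        # Same recurrence as lrs_rec_sol, but each (m, n) pair is solved once.
        if m == 0 or n == 0:
            return 0
        if X[m - 1] == X[n - 1] and m != n:
            return solve(m - 1, n - 1) + 1
        return max(solve(m, n - 1), solve(m - 1, n))

    return solve(len(X), len(X))


print(lrs_memo("ATACTCGGA"))  # 4
print(lrs_memo("SQUIDDYPIEISTOOSALTYSQUIDPIESHOULDNOTBESOSALTY"))  # 17
```

With the cache there are at most `(len(X) + 1) ** 2` distinct calls, so the long string finishes quickly. The recursion depth still grows with the length of the input, so very long strings would run into Python's recursion limit.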

## With dynamic programming

Build a matrix that compares each letter of the string with every other letter.

Never compare a letter with itself or with anything past it. Only the half of the matrix below the diagonal is needed to get the answer.

The idea is to solve the problem for a prefix of one letter, then two letters, then three letters, and so on.

Each cell carries over the best answer found so far and adds one when a new match is found. On a match, take the value diagonally up and to the left (the answer from before either matching letter was used) and add one.

        A    T    A    C    T    C    G    G    A
    A   0    0    0    0    0    0    0    0    0
    T   0    0    0    0    0    0    0    0    0
    A   1    1    1    1    1    1    1    1    1
    C   1    1    1    1    1    1    1    1    1
    T   1    2    2    2    2    2    2    2    2
    C   1    2    2    3    3    3    3    3    3
    G   1    2    2    3    3    3    3    3    3
    G   1    2    2    3    3    3    4    4    4
    A   1    2    3    3    3    3    4    4    4
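The rules described above can also be written as a recurrence. The notation here is an assumption made for illustration and is not from the original notebook: $T[r][c]$ is the table entry for row $r$ and column $c$ (with an extra row and column of zeros at index 0), and $s_r$ is the $r$-th letter of the string, counting from 1.

$$
T[r][c] =
\begin{cases}
0 & \text{if } r = 0 \text{ or } c = 0 \\
T[r][c-1] & \text{if } c \ge r \\
T[r-1][c-1] + 1 & \text{if } c < r \text{ and } s_r = s_c \\
\max\bigl(T[r][c-1],\ T[r-1][c]\bigr) & \text{otherwise}
\end{cases}
$$

For a string of length $n$, the answer is $T[n][n]$.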

In [9]:
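def dy_lcr(string):
    # lookup[row][col] is the best length when the later occurrence uses the
    # first `row` letters and the earlier occurrence uses the first `col`.
    # Row 0 and column 0 stay 0 (empty prefix).
    lookup = [[0]*(len(string) + 1) for _ in range(len(string) + 1)]
    for row in range(1, len(string) + 1):
        for col in range(1, len(string) + 1):
            if col >= row:
                # On or past the diagonal: nothing new to compare, copy left.
                lookup[row][col] = lookup[row][col-1]
            elif string[row-1] == string[col-1]:
                # Match at different indices: extend the upper-left answer.
                lookup[row][col] = lookup[row-1][col-1] + 1
            else:
                # No match: carry over the better of left and up.
                lookup[row][col] = max(lookup[row][col-1], lookup[row-1][col])
    return lookup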

In [10]:
def dy_lcr_length(string):
    return dy_lcr(string)[-1][-1]
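`dy_lcr_length` builds the whole table only to read one cell. If only the length is needed, two rows are enough, because each cell depends only on the current row and the one above it. The sketch below is one possible variant and not the notebook's method: it uses the `m != n` condition from `lrs_rec_sol` instead of the half-matrix rule, and `lrs_length_two_rows` is a hypothetical name.

```python
def lrs_length_two_rows(s):
    """Length of the longest repeated subsequence, keeping only two rows."""
    n = len(s)
    prev = [0] * (n + 1)  # row i - 1 of the table
    for i in range(1, n + 1):
        curr = [0] * (n + 1)  # row i of the table
        for j in range(1, n + 1):
            if s[i - 1] == s[j - 1] and i != j:
                # Same letter at different indices: extend the diagonal value.
                curr[j] = prev[j - 1] + 1
            else:
                # Otherwise carry over the better of left and up.
                curr[j] = max(curr[j - 1], prev[j])
        prev = curr
    return prev[n]


print(lrs_length_two_rows("ATACTCGGA"))  # 4
```

This trades the ability to recover the subsequence itself for memory that grows linearly with the string length instead of quadratically.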

In [11]:
X = "ATACTCGGA"
test1 = dy_lcr_length(X)

In [12]:
test1

4

In [13]:
X = "AAAAAAAAAA"
dy_lcr(X)

[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2],
 [0, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3],
 [0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4],
 [0, 1, 2, 3, 4, 5, 5, 5, 5, 5, 5],
 [0, 1, 2, 3, 4, 5, 6, 6, 6, 6, 6],
 [0, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 8, 8],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9]]

In [14]:
X = "ATATT"
dy_lcr(X)

[[0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 1, 1, 1, 1, 1],
 [0, 1, 2, 2, 2, 2],
 [0, 1, 2, 2, 3, 3]]

In [15]:
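def printSubsequence(string, lookup):
    # Walk back from the bottom-right cell and return the subsequence
    # as a string (nothing is printed).
    subsequence = ""
    row = len(lookup) - 1
    col = len(lookup[0]) - 1
    length = lookup[row][col]
    curr = length
    while len(subsequence) < length:
        # Move left, then up, while the value is unchanged: those cells
        # did not add a letter.
        while lookup[row][col-1] == curr:
            col -= 1
        while lookup[row-1][col] == curr:
            row -= 1
        # The value drops both to the left and above, so this cell is a match.
        subsequence = string[row-1] + subsequence
        # Step diagonally to the cell the match was built on.
        row -= 1
        col -= 1
        curr = lookup[row][col]
    return subsequence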

In [16]:
X = "ATACTCGGA"
printSubsequence(X, dy_lcr(X))

'ATCG'
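One way to sanity-check a recovered string is to confirm independently that it really occurs twice. The helper below is a hypothetical addition (`occurs_twice` and its inner function `can` are not from the original notebook). It tries every pair of positions for each character and requires the two occurrences to use different indices at each step. It is a brute-force sketch intended for short inputs only.

```python
from functools import lru_cache


def occurs_twice(s, sub):
    """Return True if `sub` occurs twice in `s` as a subsequence,
    using a different index of `s` at each position of `sub`."""

    @lru_cache(maxsize=None)
    def can(k, i, j):
        # Can sub[k:] be matched with the first copy starting at index >= i
        # and the second copy starting at index >= j?
        if k == len(sub):
            return True
        for a in range(i, len(s)):
            if s[a] != sub[k]:
                continue
            for b in range(j, len(s)):
                if b != a and s[b] == sub[k] and can(k + 1, a + 1, b + 1):
                    return True
        return False

    return can(0, 0, 0)


print(occurs_twice("ATACTCGGA", "ATCG"))   # True
print(occurs_twice("ATACTCGGA", "ATCGA"))  # False
```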

    SQUIDDYPIEISTOOSALTYSQUIDPIESHOULDNOTBESOSALTY
    SQUID  PIE S OO   T SQUIDPIES O    OT ESO  LT
                               ES O L   T

`SQUIDPIESOOSALTY` has length 16.

`SQUIDPIESOOTESOLT` has length 17.

In [19]:
X = "SQUIDDYPIEISTOOSALTYSQUIDPIESHOULDNOTBESOSALTY"
printSubsequence(X, dy_lcr(X))

'SQUIDPIESOOTESOLT'