Dan Shea  
2021-06-02  
#### Rosalind problem LCSQ

http://rosalind.info/problems/lcsq/

##### Problem
A string $u$ is a common subsequence of strings $s$ and $t$ if the symbols of $u$ appear in order as a subsequence of both $s$ and $t$.

For example, `ACTG` is a common subsequence of `AACCTTGG` and `ACACTGTGA`.

Analogously to the definition of longest common substring, $u$ is a longest common subsequence of $s$ and $t$ if there does not exist a longer common subsequence of the two strings. Continuing our above example, `ACCTTG` is a longest common subsequence of `AACCTTGG` and `ACACTGTGA`, as is `AACTGG`.

__Given:__ Two DNA strings $s$ and $t$ (each having length at most 1 kbp) in FASTA format.

__Return:__ A longest common subsequence of $s$ and $t$. (If more than one solution exists, you may return any one.)

##### Sample Dataset
```
>Rosalind_23
AACCTTGG
>Rosalind_64
ACACTGTGA
```
##### Sample Output
```
ACCTGG
```
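
The definition of a common subsequence can be checked directly. The following is a minimal sketch, not part of the original solution; the helper names `is_subsequence` and `is_common_subsequence` are made up for illustration. It confirms that the strings mentioned in the problem statement, and the sample output, are common subsequences of the two sample strings.

```python
def is_subsequence(u, s):
    """Return True if the symbols of u appear in order (not necessarily contiguously) in s."""
    it = iter(s)
    # `symbol in it` consumes the iterator up to the first match, which enforces the ordering
    return all(symbol in it for symbol in u)


def is_common_subsequence(u, s, t):
    """Return True if u is a subsequence of both s and t."""
    return is_subsequence(u, s) and is_subsequence(u, t)


s, t = 'AACCTTGG', 'ACACTGTGA'
for candidate in ('ACTG', 'ACCTTG', 'AACTGG', 'ACCTGG'):
    print(candidate, is_common_subsequence(candidate, s, t))
```

Each of the four candidates should print `True`. This check only establishes that a string is a common subsequence, not that it is a longest one.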

In [1]:
from Bio import SeqIO
import numpy as np

In [2]:
# Build the LCS dynamic-programming table in O(len(A)*len(B)) time and space
def compute_scores(A, B):
    # scoring_matrix[i, j] holds the LCS length of the prefixes A[:i] and B[:j]
    scoring_matrix = np.zeros((len(A)+1, len(B)+1))
    # The first row and column stay 0 (empty prefix), so comparison indices start at 1
    for ridx, a in enumerate(A, start=1):
        for cidx, b in enumerate(B, start=1):
            if a == b:
                # Matching symbols extend the LCS of the two shorter prefixes
                scoring_matrix[ridx,cidx] = scoring_matrix[ridx-1,cidx-1] + 1
            else:
                # Otherwise keep the better of dropping the last symbol of B or of A
                scoring_matrix[ridx,cidx] = max(scoring_matrix[ridx,cidx-1], scoring_matrix[ridx-1,cidx])
    return scoring_matrix

In [3]:
compute_scores('AACCTTGG', 'ACACTGTGA')

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [0., 1., 1., 2., 2., 2., 2., 2., 2., 2.],
       [0., 1., 2., 2., 3., 3., 3., 3., 3., 3.],
       [0., 1., 2., 2., 3., 3., 3., 3., 3., 3.],
       [0., 1., 2., 2., 3., 4., 4., 4., 4., 4.],
       [0., 1., 2., 2., 3., 4., 4., 5., 5., 5.],
       [0., 1., 2., 2., 3., 4., 5., 5., 6., 6.],
       [0., 1., 2., 2., 3., 4., 5., 5., 6., 6.]])
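
The table above follows the standard LCS recurrence that `compute_scores` fills in. The notation here is introduced only for this note: $L$ stands for `scoring_matrix`, $a_i$ for the $i$-th symbol of `A`, and $b_j$ for the $j$-th symbol of `B`, both counted from 1.

$$
L_{i,j} =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0,\\
L_{i-1,j-1} + 1 & \text{if } a_i = b_j,\\
\max(L_{i,j-1},\, L_{i-1,j}) & \text{otherwise.}
\end{cases}
$$

The bottom-right entry is the length of a longest common subsequence of the full strings, which is 6 for the sample pair.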

In [4]:
# Reconstruct the LCS by walking backwards through the matrix
def construct_lcs(A, B, scoring_matrix):
    lcs = []
    # We start in the lower right corner of the matrix
    ridx, cidx = scoring_matrix.shape
    ridx -= 1
    cidx -= 1
    while (ridx != 0) and (cidx != 0):
        if A[ridx-1] == B[cidx-1]:
            lcs.append(A[ridx-1])
            ridx -= 1
            cidx -= 1
        else:
            # No match: step toward the neighbor with the larger score (ties go left)
            if scoring_matrix[ridx,cidx-1] >= scoring_matrix[ridx-1,cidx]:
                cidx -= 1
            else:
                ridx -= 1
    return ''.join(reversed(lcs))

In [5]:
def compute_lcs(A,B):
    scoring_matrix = compute_scores(A, B)
    lcs = construct_lcs(A, B, scoring_matrix)
    return lcs

In [6]:
compute_lcs('AACCTTGG', 'ACACTGTGA')

'ACCTGG'
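
Because any longest common subsequence is accepted, a returned answer can be sanity-checked independently of the traceback. The sketch below is an assumption-laden illustration rather than part of the original solution: `lcs_length`, `is_subsequence`, and `is_valid_lcs` are hypothetical helper names. It recomputes the LCS length while keeping only two rows of the table, then checks that a candidate is a common subsequence of that length.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence, storing only two rows of the table."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]  # column 0 corresponds to the empty prefix of b
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(current[j - 1], previous[j]))
        previous = current
    return previous[-1]


def is_subsequence(u, s):
    """Return True if the symbols of u appear in order in s."""
    it = iter(s)
    return all(symbol in it for symbol in u)


def is_valid_lcs(candidate, a, b):
    """A candidate is valid if it is a common subsequence with the maximum length."""
    return (is_subsequence(candidate, a)
            and is_subsequence(candidate, b)
            and len(candidate) == lcs_length(a, b))


print(is_valid_lcs('ACCTGG', 'AACCTTGG', 'ACACTGTGA'))
```

The two-row version cannot reconstruct the subsequence itself, which is why the solution above keeps the full matrix for the traceback.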

In [7]:
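def parse_input_print_result(filename):
    # Read the first two FASTA records and return their LCS;
    # in a notebook the returned string is displayed as the cell output
    seqio = SeqIO.parse(filename, 'fasta')
    A = next(seqio).seq
    B = next(seqio).seq
    return compute_lcs(A, B)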

In [8]:
parse_input_print_result('rosalind_lcsq.txt')

'TGCGGTAGTGATGACTTTAACAACGGCTCAATTTAGTCTAAGGGGGGACTCACCATATCTGCCTCTTATGCTCAGTTGTAGGACTCCAGTCTACGAACTGGCTAATTGAGTAGGGGGCCCATAGTAGGTGGCATGGTGGCGTATAAGAGCCCGAGCGCCAACTAGTAGTGCCCTTGGACGGACTTAATGACAGATAAATGTTGGGGTTCATTAGCAACATCCAAGTTCTAAAAGGGAATCCTATAATCTCCATTACTCTTTACCTCGGTTGATCGTTCGTAGAAGTACGCCCCACACCTAACACTAAGGATTTGTGTTGGTTTTAACTAATCAGATGCAATGATTGAACCCGTGCGTATTTGCGCTAAACGATACACCCGCTATGAATTAACAGTTTTTTCGTAAAGGTTTCGAGTCCGTTAACGAAGAGGCGACATGGGGCAGCAGGCTATGTGCGGGCGGGTATGTTTTGTTACGCCCCCAGATTGTTAATGAACAAGGGCCAGTGGGCCAACAGAGCCTCCGCGAATGGAGCTCGCCAGTTGGGCAGGAT'