# Finding a Shared Spliced Motif

## Background Info

### Locating Motifs Despite Introns

* **Motif:** An interval of nucleotides (in DNA/RNA) or of amino acids (in proteins) that has biological importance. It could represent an important functional unit of a protein shared by many members of the same species, or a rare gene encoding a disorder. A motif is usually represented by a substring of a genetic string that we'd like to locate.

We can search through a database containing multiple genetic strings (DNA, RNA, or amino acid sequences) to find a longest common substring of these strings, which serves as a **motif** shared by them. However, coding regions of DNA are often interrupted by introns that do not code for proteins. Therefore, we need to locate shared motifs that are spread across exons, so the motif does not have to be contiguous. To model this situation, we use subsequences: a subsequence keeps the characters of a string in their original order but may skip characters in between.
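To make the difference between a substring and a subsequence concrete, here is a minimal sketch. The helper name `is_subsequence` is hypothetical and not part of the original solution; the strings are the sample sequences used later in this notebook.

```python
def is_subsequence(candidate, text):
    """Return True if `candidate` can be read off `text` left to right, skipping characters."""
    remaining = iter(text)
    # `ch in remaining` advances the iterator until it finds ch, so order is preserved.
    return all(ch in remaining for ch in candidate)


s = "ACACTGTGA"
t = "AACCTTGG"
motif = "AACTTG"

print(motif in s, motif in t)                              # substring test
print(is_subsequence(motif, s), is_subsequence(motif, t))  # subsequence test
```

Here `motif` is not a substring of either string, yet it is a subsequence of both, which is the kind of non-contiguous match this problem asks for.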

## Problem

**Given:** Two DNA strings $s$ and $t$ (each having length at most 1 kbp) in FASTA format. The two strings are not necessarily equal in length.<br>
**Return:** A longest common subsequence of $s$ and $t$.

## Solution Explained

We can use dynamic programming to solve this problem, because it has both optimal substructure and overlapping subproblems. It has optimal substructure because a longest common subsequence of $s$ and $t$ can be built from longest common subsequences of their prefixes. It has overlapping subproblems because a naive recursive algorithm ends up solving the same pair of prefixes many times, so storing each result once avoids the repeated work.<br>
To obtain the length of the longest common subsequence, let's focus only on the very last character of $s$ and of $t$. We will call this function $LCS(s,t)$. If either string is empty, $LCS(s,t) = 0$. Otherwise, there are 2 possible scenarios:<br>
1. If the last character of $s$ and that of $t$ are the same, then that character can end a longest common subsequence. We cut off the last character from both $s$ and $t$, compute $LCS$ of the shortened strings, and add 1 to the result.
2. If the last character of $s$ and that of $t$ are not the same, then we compute the length of the longest common subsequence between $s$ and $t$ where $t$ has its last character cut off, and the length of the longest common subsequence between $s$ and $t$ where $s$ has its last character cut off. The larger of these 2 values is the length of the longest common subsequence between $s$ and $t$.
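The two scenarios can be written down almost literally as a memoized recursive function. This is only a sketch to illustrate the recursion, not the approach used in the rest of this notebook; the names `lcs_length` and `solve` are hypothetical.

```python
from functools import lru_cache


def lcs_length(s, t):
    """Length of a longest common subsequence, following the two scenarios above."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        # i and j are the lengths of the prefixes of s and t still under consideration.
        if i == 0 or j == 0:
            return 0  # an empty prefix shares nothing with the other string
        if s[i - 1] == t[j - 1]:
            # Scenario 1: the last characters match, so cut both off and add 1.
            return solve(i - 1, j - 1) + 1
        # Scenario 2: cut the last character off one string at a time, keep the larger.
        return max(solve(i, j - 1), solve(i - 1, j))

    return solve(len(s), len(t))


print(lcs_length("ACACTGTGA", "AACCTTGG"))
```

The cache ensures each pair of prefixes is solved only once. For strings close to 1 kbp the recursion can go deeper than Python's default recursion limit, which is one reason to fill a table iteratively instead.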

First, let's read in the two DNA sequences.

In [8]:
from Bio import SeqIO
import numpy as np

dnas = SeqIO.parse("../Sample_Finding_a_Shared_Spliced_Motif.fa", "fasta")
# dnas = SeqIO.parse("./rosalind_lcsq.fa", "fasta")
dna_lst = []
for dna in dnas:
    dna_lst.append(str(dna.seq))
dna_lst = sorted(dna_lst, key=len, reverse=True)
s = dna_lst[0]
t = dna_lst[1]

print(dna_lst)

['ACACTGTGA', 'AACCTTGG']


The dynamic programming table needs an extra row and column for the empty prefix, i.e. the case of comparing a string against nothing. To get them, I prepended a space to each of the two strings, so that index 0 stands for the empty prefix and the real characters start at index 1.

In [10]:
s = " " + s
t = " " + t

Each row corresponds to a character of $s$ and each column to a character of $t$, in order; the first row and the first column correspond to the empty string $""$. The cell at row $i$ and column $j$ stores $LCS(\text{first } i \text{ characters of } s, \text{first } j \text{ characters of } t)$, so the bottom-right cell holds the answer for the full strings.
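Written as a recurrence, the table can be described as follows. The symbol $L(i, j)$ is introduced here for illustration only and denotes the value stored at row $i$ and column $j$; $s_i$ and $t_j$ are the $i$-th character of $s$ and the $j$-th character of $t$, counting from 1, and row 0 and column 0 stand for the empty string.

$$
L(i, j) =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0, \\
L(i-1, j-1) + 1 & \text{if } s_i = t_j, \\
\max\bigl(L(i-1, j),\; L(i, j-1)\bigr) & \text{otherwise.}
\end{cases}
$$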

In [18]:
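# Row 0 and column 0 stay 0: the LCS with an empty prefix has length 0.
dp_table = np.zeros((len(s), len(t)))

for i in range(1, len(s)):
    for j in range(1, len(t)):
        if s[i] == t[j]:
            # Matching characters extend the LCS of the two shorter prefixes.
            dp_table[i][j] = dp_table[i - 1][j - 1] + 1
        else:
            # Otherwise keep the better result of dropping one character.
            dp_table[i][j] = max(dp_table[i - 1][j], dp_table[i][j - 1])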

print("The length of the longest common subsequence between s and t is:", dp_table[-1][-1])

The length of the longest common subsequence between s and t is: 6.0


Now that the dynamic programming table gives us the length of the longest common subsequence, we trace back through it from the bottom-right cell to recover an actual longest common subsequence.

In [13]:
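# Start at the bottom-right cell and walk back towards the top-left.
i = len(s) - 1
j = len(t) - 1
lcs = ""
while dp_table[i][j] > 0:
    curr = dp_table[i][j]
    left = dp_table[i][j - 1]
    up = dp_table[i - 1][j]
    if up == curr:
        # The cell above already has this length, so s[i] can be skipped.
        i -= 1
    elif left == curr:
        # The cell to the left already has this length, so t[j] can be skipped.
        j -= 1
    else:
        # Neither neighbour has this value, so s[i] == t[j] is part of the LCS.
        lcs += s[i]
        i -= 1
        j -= 1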

In [29]:
lcs = lcs[::-1]

print("The longest common subsequence between s and t is: ", lcs)

The longest common subsequence between s and t is:  AACTTG


The sequence above might not match the sample answer provided by Rosalind, but it has the same length. A pair of strings can have several longest common subsequences, and any one of them is a valid answer.
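Because different tie-breaking choices in the traceback lead to different, equally long answers, it can help to check an answer by its properties rather than by comparing it with the sample output. The sketch below wraps the table fill and the traceback in one function and then checks the result. The names `longest_common_subsequence` and `is_subsequence` are hypothetical, and as an assumption for this sketch it uses plain integer lists with an explicit extra row and column instead of NumPy and padded strings.

```python
def longest_common_subsequence(s, t):
    """Return one longest common subsequence of s and t (table fill plus traceback)."""
    rows, cols = len(s) + 1, len(t) + 1
    # Row 0 and column 0 represent the empty prefix, so no padding character is needed.
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if s[i - 1] == t[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Trace back from the bottom-right cell, collecting matched characters in reverse.
    i, j = len(s), len(t)
    picked = []
    while table[i][j] > 0:
        if table[i - 1][j] == table[i][j]:
            i -= 1
        elif table[i][j - 1] == table[i][j]:
            j -= 1
        else:
            picked.append(s[i - 1])
            i -= 1
            j -= 1
    return "".join(reversed(picked))


def is_subsequence(candidate, text):
    """Return True if `candidate` appears in `text` in order, not necessarily contiguously."""
    remaining = iter(text)
    return all(ch in remaining for ch in candidate)


result = longest_common_subsequence("ACACTGTGA", "AACCTTGG")
print(result, len(result))
print(is_subsequence(result, "ACACTGTGA") and is_subsequence(result, "AACCTTGG"))
```

An answer that is a subsequence of both inputs and has the length found in the bottom-right cell is a valid longest common subsequence, whichever path the traceback took.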

## Actual dataset

In [28]:
dnas = SeqIO.parse("../rosalind_lcsq.fa", "fasta")
dna_lst = []
for dna in dnas:
    dna_lst.append(str(dna.seq))
dna_lst = sorted(dna_lst, key=len, reverse=True)
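# Prepend a space so that index 0 stands for the empty prefix, as in the sample run above.
s = " " + dna_lst[0]
t = " " + dna_lst[1]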

In [30]:
dp_table = np.zeros((len(s), len(t)))

for i in range(1, len(s)):
    for j in range(1, len(t)):
        if s[i] == t[j]:
            dp_table[i][j] = dp_table[i - 1][j - 1] + 1
        else:
            dp_table[i][j] = max(dp_table[i - 1][j], dp_table[i][j - 1])

In [31]:
i = len(s) - 1
j = len(t) - 1
lcs = ""
while dp_table[i][j] > 0:
# while i * j != 0:
    curr = dp_table[i][j]
    left = dp_table[i][j - 1]
    up = dp_table[i - 1][j]
    if up == curr:
        i -= 1
    elif left == curr:
        j -= 1
    else:
        lcs += s[i]
        i -= 1
        j -= 1

In [32]:
lcs = lcs[::-1]

print("The longest common subsequence between s and t is: ", lcs)

The longest common subsequence between s and t is:  CCGGGTAAGCTTTAACACGTGCCGATGTGGCATGCTCCGCGTTGTCATTCTTTCCTGGAAGTTTTGCTCGCCTCAATTGTAAGGCCTTTCGTCCAGACCGAGCCCAAAGTAAGGGAAGTTATTTATGCACGTAAGGCCCCTACTGTGAGAAACTGCATCCCCTTCCGTACGCAGTGTCACTAGGGTGCAGTTTCCGCCAGAAATAGGTGTGAGGGTGCCCACAACCGAGACACTTTATCTGTAAATTTGCGTAGACGAGCAGGTGCTGAATGCTTATTCCCCTACTCCTCCCCATGGGGCCCATCACCATAGGTTATTTGTCGCACCATCAGCAGGCTGGTTCTATTTACCCGCAGCGGAAAGACCACCCGTTGGGGAGAGTGTTAATATGAGCATCCCGGCAACTCACCTCTCCATATTCAAACCGTGGGAGGACCAGCTCATCTCCGAACTGCAGCCTCCACGTGGGTTGTGGCTAGCTCAACTTGATTTGGAGTACGGGCCCTCAGACCTAAATAGTCTATGTAGACGGTTGACGCGCGTAGCTGAGGAGCTGACGGACACCAACAACTACTAGGTCAAACCGTCCATTCGCCTTTTATTAGACAGGCTA


## Problem Solved!