# Longest Common Subsequence
## Own simple implementation

First, a simple baseline implementation of the longest common subsequence that finds only one possible sequence:

In [1]:
# dynamic programming variant that returns one longest common subsequence
def longest_common_subsequenceDP(str1, str2):
    # matrix[i][j] holds one LCS of str1[:i] and str2[:j]; the extra row and
    # column of empty strings are the base case (an empty prefix shares nothing)
    matrix = [["" for _ in range(len(str2) + 1)] for _ in range(len(str1) + 1)]

    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            # characters match: extend the LCS of both shorter prefixes
            if str1[i-1] == str2[j-1]:
                matrix[i][j] = matrix[i-1][j-1] + str1[i-1]
            # no match: keep the longer of the two neighbouring results
            else:
                matrix[i][j] = max(matrix[i-1][j], matrix[i][j-1], key=len)

    # resulting sequence (the empty string if either input is empty)
    return matrix[-1][-1]
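
The recurrence that the table fills in can be written out explicitly. The notation is introduced here only for illustration: $a$ and $b$ stand for `str1` and `str2`, $a_i$ is the $i$-th character of $a$ (counting from 1), $\varepsilon$ is the empty string, and $L(i, j)$ is one longest common subsequence of the first $i$ characters of $a$ and the first $j$ characters of $b$.

$$
L(i, j) =
\begin{cases}
\varepsilon & \text{if } i = 0 \text{ or } j = 0 \\
L(i-1, j-1) \cdot a_i & \text{if } a_i = b_j \\
\text{the longer of } L(i-1, j) \text{ and } L(i, j-1) & \text{otherwise}
\end{cases}
$$

The result is $L(|a|, |b|)$. When both candidates in the last case have the same length, `max(..., key=len)` keeps the first one, $L(i-1, j)$, which is why only one of possibly several longest sequences is returned.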

A small helper that formats the result as a readable string:

In [2]:
def stringify_lcs(sequence, str1, str2):
    # an f-string keeps the values out of the template, so braces in the inputs are safe
    return f'Longest common subsequence of {str1} and {str2}: "{sequence}"; length of sequence: {len(sequence)}'

A few examples:

In [3]:
sequence = longest_common_subsequenceDP("misspelled", "misinterpretted")

print(stringify_lcs(sequence, "misspelled", "misinterpretted"))

Longest common subsequence of misspelled and misinterpretted: "mispeed"; length of sequence: 7


In [4]:
sequence = longest_common_subsequenceDP("xBCDxFGxxxKLMx", "aBCDeFGhijKLMn")

print(stringify_lcs(sequence, "xBCDxFGxxxKLMx", "aBCDeFGhijKLMn"))

Longest common subsequence of xBCDxFGxxxKLMx and aBCDeFGhijKLMn: "BCDFGKLM"; length of sequence: 8


In [5]:
sequence = longest_common_subsequenceDP("information", "retrieval")

print(stringify_lcs(sequence, "information", "retrieval"))

Longest common subsequence of information and retrieval: "rti"; length of sequence: 3
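
As a quick sanity check, the results above can be tested for being subsequences of both inputs. The helper `is_subsequence` below is a hypothetical name introduced for this sketch, not part of the notebook; it only confirms that a result is common to both strings, not that it is the longest one.

```python
def is_subsequence(candidate, text):
    """Return True if candidate can be obtained from text by deleting characters."""
    remaining = iter(text)
    # `ch in remaining` advances the iterator until it finds ch,
    # so the matches are forced to appear in left-to-right order
    return all(ch in remaining for ch in candidate)


# the three results printed above, together with their inputs
results = [
    ("mispeed", "misspelled", "misinterpretted"),
    ("BCDFGKLM", "xBCDxFGxxxKLMx", "aBCDeFGhijKLMn"),
    ("rti", "information", "retrieval"),
]

for sequence, str1, str2 in results:
    print(sequence, is_subsequence(sequence, str1) and is_subsequence(sequence, str2))
```

Each line should end in `True`; a `False` would point to a bug in the dynamic programming table.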


## Library implementation

For the same examples as above, we compute the longest common subsequence with the `pylcs` library:

In [6]:
import pylcs

In [7]:
str1 = "misspelled"
str2 = "misinterpretted"

indexes_to_keep = pylcs.lcs_sequence_idx(str1, str2)
sequence = ''.join([str2[i] for i in indexes_to_keep if i != -1])

print(stringify_lcs(sequence, str1, str2))

Longest common subsequence of misspelled and misinterpretted: "mispeed"; length of sequence: 7


In [8]:
str1 = "xBCDxFGxxxKLMx"
str2 = "aBCDeFGhijKLMn"

indexes_to_keep = pylcs.lcs_sequence_idx(str1, str2)
sequence = ''.join([str2[i] for i in indexes_to_keep if i != -1])

print(stringify_lcs(sequence, str1, str2))

Longest common subsequence of xBCDxFGxxxKLMx and aBCDeFGhijKLMn: "BCDFGKLM"; length of sequence: 8


In [9]:
str1 = "information"
str2 = "retrieval"

indexes_to_keep = pylcs.lcs_sequence_idx(str1, str2)
sequence = ''.join([str2[i] for i in indexes_to_keep if i != -1])

print(stringify_lcs(sequence, str1, str2))

Longest common subsequence of information and retrieval: "rti"; length of sequence: 3
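
The way the index list is used above suggests its shape: each entry is either an index into `str2` or `-1` for "no match". The pure-Python sketch below produces a list of that shape, assuming one entry per character of `str1`. It is only an illustration under that assumption, not the library's actual implementation, and `lcs_index_map` is a hypothetical name.

```python
def lcs_index_map(str1, str2):
    """For each position in str1, give the matched index in str2, or -1."""
    n, m = len(str1), len(str2)

    # lengths[i][j] = LCS length of str1[:i] and str2[:j]
    lengths = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if str1[i - 1] == str2[j - 1]:
                lengths[i][j] = lengths[i - 1][j - 1] + 1
            else:
                lengths[i][j] = max(lengths[i - 1][j], lengths[i][j - 1])

    # walk back from the bottom-right corner, recording matched positions
    index_map = [-1] * n
    i, j = n, m
    while i > 0 and j > 0:
        if str1[i - 1] == str2[j - 1]:
            index_map[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif lengths[i - 1][j] >= lengths[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return index_map


str1 = "xBCDxFGxxxKLMx"
str2 = "aBCDeFGhijKLMn"

index_map = lcs_index_map(str1, str2)
print(''.join(str2[i] for i in index_map if i != -1))  # BCDFGKLM
```

Storing lengths instead of strings in the table and recovering the sequence by backtracking is a common alternative to the string-valued table used in the first section.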


If only the length of the sequence is needed, the library provides a separate function:

In [10]:
str1 = "misspelled"
str2 = "misinterpretted"

lcs_length = pylcs.lcs_sequence_length(str1, str2)

print(lcs_length)

7
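
When only the length matters, the full table is not needed, because each row depends only on the row before it. The sketch below shows that idea in plain Python; `lcs_length_two_rows` is a hypothetical helper written for this notebook and makes no claim about how `pylcs` computes its result.

```python
def lcs_length_two_rows(str1, str2):
    """Length of the longest common subsequence, keeping only two DP rows."""
    # iterate over the longer string so the stored rows are as short as possible
    if len(str2) > len(str1):
        str1, str2 = str2, str1

    # previous[j] = LCS length of the processed prefix of str1 and str2[:j]
    previous = [0] * (len(str2) + 1)
    for ch1 in str1:
        current = [0]
        for j, ch2 in enumerate(str2, start=1):
            if ch1 == ch2:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current

    return previous[-1]


print(lcs_length_two_rows("misspelled", "misinterpretted"))  # 7
```

Memory use drops from one cell per pair of positions to two rows as long as the shorter string, at the cost of no longer being able to reconstruct the sequence itself.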
