### Implement Diff Utility

Given two similar strings, efficiently list out all differences between them.

The diff utility is a data comparison tool that calculates and displays the differences between two texts. It tries to determine the smallest set of deletions and insertions needed to create one text from the other. The standard diff tool is line-oriented rather than character-oriented, unlike edit distance; the example here compares single characters to keep it short.

```
Input:
X = "XMJYAUZ"
Y = "XMJAATZ"

Output:
XMJ -Y A -U +A +T Z

(- indicates the character is present in X but deleted in Y)
(+ indicates the character is inserted in Y at a position where X has no match)

Since the longest common subsequence is XMJAZ, we can walk both strings against it. X, M and J all match. The Y in the first string is missing from the LCS, so we emit -Y. A matches. U is missing from the LCS, so we emit -U. Next we are looking for Z, but the second string has A and T first, so we emit +A and +T. Then both strings reach Z, which matches.
```

We can use the Longest Common Subsequence to solve this problem. Find the longest sequence of characters present in both original sequences in the same order. From there, it is only a small step to get the diff-like output:

If a character is absent in the subsequence but present in the first original sequence, it must have been deleted.

If a character is absent in the subsequence but present in the second original sequence, it must have been inserted.
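
The two rules can be checked by hand before building the full solution. The sketch below assumes the LCS is already known (`XMJAZ`, taken from the example) and uses a hypothetical helper, `unmatched`, that is not part of the original notebook. It matches the subsequence greedily from left to right and collects whatever is left over.

```python
def unmatched(seq, subseq):
    """Return the items of seq left over after greedily matching subseq in order."""
    leftover = []
    k = 0  # index of the next subsequence character we are waiting for
    for ch in seq:
        if k < len(subseq) and ch == subseq[k]:
            k += 1  # ch is part of the common subsequence
        else:
            leftover.append(ch)  # ch is absent from the subsequence
    return leftover


lcs = "XMJAZ"  # assumed known here; taken from the worked example
deleted = unmatched("XMJYAUZ", lcs)   # leftovers of the first string
inserted = unmatched("XMJAATZ", lcs)  # leftovers of the second string
print(deleted)   # ['Y', 'U']
print(inserted)  # ['A', 'T']
```

The leftovers of the first string are the deletions and the leftovers of the second string are the insertions. What this sketch does not give is the interleaved order of the final diff, which needs both strings walked together.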

In [39]:
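def longestCommonSubsequence(a, b):
    # lookup[row][col] holds the LCS length of b[:row] and a[:col];
    # row 0 and column 0 stay 0 because one of the prefixes is empty.
    lookup = [[0] * (len(a)+1) for _ in range(len(b)+1)]

    for row in range(1, len(lookup)):
        for col in range(1, len(lookup[0])):
            if a[col-1] != b[row-1]:
                # No match: keep the better result of shortening b or shortening a.
                lookup[row][col] = max(lookup[row-1][col], lookup[row][col-1])
            else:
                # Match: extend the LCS of the two shorter prefixes by one.
                lookup[row][col] = lookup[row-1][col-1] + 1
    return lookup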

In [48]:
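def getSubsequence(a, lookup):
    # Start at the bottom-right cell, which holds the full LCS length.
    row = len(lookup) - 1
    col = len(lookup[0]) - 1
    count = lookup[row][col]
    subseq = ""
    while count > 0:
        # Move up, then left, while the LCS length stays the same. The cell we
        # stop on is one where a[col-1] matched b[row-1].
        while lookup[row-1][col] == count:
            row -= 1
        while lookup[row][col-1] == count:
            col -= 1
        subseq = a[col-1] + subseq
        # Step diagonally to the prefixes before the matched character.
        row -= 1
        col -= 1
        count = lookup[row][col]
    return subseq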

In [57]:
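def diffUtility(a, b):
    lookup = longestCommonSubsequence(a, b)
    s = getSubsequence(a, lookup)
    diff = ""
    ia = 0
    ib = 0
    char = 0
    # Walk a and b in step with the common subsequence s.
    while char < len(s):
        if s[char] == a[ia] and s[char] == b[ib]:
            # Both strings are at the next common character.
            diff += s[char]
            ia += 1
            ib += 1
            char += 1
        elif s[char] == b[ib] and s[char] != a[ia]:
            # a has an extra character before the next common one: deletion.
            diff += f"-{a[ia]}"
            ia += 1
        elif s[char] == a[ia] and s[char] != b[ib]:
            # b has an extra character before the next common one: insertion.
            diff += f"+{b[ib]}"
            ib += 1
        else:
            # Neither string is at the common character yet.
            diff += f"-{a[ia]}"
            diff += f"+{b[ib]}"
            ia += 1
            ib += 1
    # Anything left after the last common character is a trailing deletion
    # (rest of a) or insertion (rest of b); without this step it would be lost.
    diff += "".join(f"-{c}" for c in a[ia:])
    diff += "".join(f"+{c}" for c in b[ib:])
    return diff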

In [58]:
seq1 = "XMJYAUZ"
seq2 = "XMJAATZ"

diffUtility(seq1, seq2)

'XMJ-YA-U+A+TZ'
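
The introduction notes that diff is line-oriented. As a rough sketch of how the same LCS table could drive a line-level diff, the block below compares lists of lines instead of characters and reads the operations straight off the table, without building the subsequence first. The names `lcs_table` and `line_diff` and the `"  "`, `"- "`, `"+ "` prefixes are assumptions made for illustration; this is not the output format of the real `diff` tool.

```python
def lcs_table(a, b):
    """table[i][j] = LCS length of a[:i] and b[:j]; works for any sequences."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def line_diff(old_text, new_text):
    """Return diff lines prefixed with '  ', '- ' or '+ ' (hypothetical format)."""
    a, b = old_text.splitlines(), new_text.splitlines()
    table = lcs_table(a, b)
    i, j = len(a), len(b)
    out = []
    # Walk back from the bottom-right corner, collecting operations in reverse.
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            out.append("  " + a[i - 1])  # line common to both texts
            i -= 1
            j -= 1
        elif j > 0 and (i == 0 or table[i][j - 1] >= table[i - 1][j]):
            out.append("+ " + b[j - 1])  # line only in the new text
            j -= 1
        else:
            out.append("- " + a[i - 1])  # line only in the old text
            i -= 1
    out.reverse()
    return out


old = "alpha\nbeta\ngamma"
new = "alpha\ngamma\ndelta"
print("\n".join(line_diff(old, new)))
```

For these two inputs the result is `alpha` unchanged, `beta` removed, `gamma` unchanged and `delta` added. Because the walk continues until both indices reach zero, leading and trailing changes are covered without a separate clean-up step.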