# CSCI 5454: Assignment 6

Dynamic Programming

__Your Name: __ Dania Elmadhun


# Problem 1 (Longest Pattern Match, 25 points)

Consider the problem of finding the longest pattern in a string. You are given a string $s$ of length $n$. For simplicity, assume that the string is made up of the $4$ characters $A, T, C$ and $G$. You are also given a regular expression pattern of the form $a_1^*a_2^*\cdots a_m^*$, that is, zero or more repetitions of $a_1$, followed by zero or more repetitions of $a_2$, ..., followed by zero or more repetitions of $a_m$, wherein $a_1, \ldots, a_m \in \{ A, T, C, G\}$.

As an example consider the string $s:\ ATCATTTCGAGGGG$ and the pattern $A^*T^*G^*$. 

You have to find the longest substring (subsequence) of $s$ that matches the regular expression 
pattern. For instance $AATTTGGGGG$ is a substring of $s$ of length $10$ that matches the pattern. Is there a longer substring that matches the pattern?


__Inputs:__ String $s$ made up of characters $A, T, C, G$ and a pattern $p$ given as a string, as well. We do not need to specify the Kleene star next to each character in the pattern: they are implicitly assumed to be there.
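
To make the input convention concrete, here is a minimal sketch that checks the example above. It expands a star-free pattern string such as `'ATG'` into the regular expression `A*T*G*` and tests whether a candidate is a subsequence of $s$. The helper names `matches_pattern` and `is_subsequence` are illustrative assumptions and are not part of the assignment.

```python
import re

def matches_pattern(t, p):
    """Return True if t matches p[0]* p[1]* ... p[m-1]* (stars are implicit)."""
    regex = ''.join(re.escape(c) + '*' for c in p)
    return re.fullmatch(regex, t) is not None

def is_subsequence(t, s):
    """Return True if t can be obtained from s by deleting characters."""
    it = iter(s)
    # 'c in it' advances the iterator, so characters must appear in order.
    return all(c in it for c in t)

s = 'ATCATTTCGAGGGG'
candidate = 'AATTTGGGGG'
print(matches_pattern(candidate, 'ATG'), is_subsequence(candidate, s))  # True True
```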

__ (A) __ Write a recurrence $LPM(j, k)$ that represents the longest pattern match for the substring
$s[j], \ldots, s[n-1]$ and the sub-pattern from $p[k], \ldots, p[m-1]$. Do not forget to write the base cases. Use Latex to make your answer readable.

Also note that Python indices start at index 0 and end at index length of array - 1. Your recurrence must assume that the strings form such Python strings.

## Solution


__RECURRENCE__

The string is denoted $S$ (length $n$) and the regular expression pattern is denoted $P$ (length $m$), both indexed from $0$. For $j < n$ and $k < m$:

\begin{equation*}
LPM(j,k) = \begin{cases} 1+LPM(j+1,k) & \text{if } S[j] = P[k] \\ \max\left( \underbrace{LPM(j+1,k)}_{\text{skip to next character in string}},\ \underbrace{LPM(j,k+1)}_{\text{skip to next character in pattern}} \right) & \text{if } S[j] \neq P[k]
\end{cases}
\end{equation*}


__BASE CASES__

The base case is reached when the string or the pattern is exhausted, since nothing more can be matched:

\begin{equation*}
LPM(j,k)=0 \quad \text{if } j \geq n \text{ (string ends) or } k \geq m \text{ (pattern ends)}
\end{equation*}
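
As a sanity check on the recurrence and base cases above, here is a minimal top-down sketch that computes only the length of the longest match. It uses `functools.lru_cache` as the memo table; the name `lpm_length` is an assumption made for illustration, and the sketch does not recover the matched string.

```python
from functools import lru_cache

def lpm_length(s, p):
    """Length of the longest subsequence of s matching p[0]* p[1]* ... (sketch)."""
    n, m = len(s), len(p)

    @lru_cache(maxsize=None)
    def LPM(j, k):
        # Base cases: the string or the pattern is exhausted.
        if j >= n or k >= m:
            return 0
        # Match: take s[j] and stay on the same pattern character.
        if s[j] == p[k]:
            return 1 + LPM(j + 1, k)
        # Mismatch: skip a string character or move on in the pattern.
        return max(LPM(j + 1, k), LPM(j, k + 1))

    return LPM(0, 0)

print(lpm_length('ATCCG', 'CT'))  # 2
```

Each pair $(j, k)$ is evaluated once, so the sketch makes $O(nm)$ distinct calls.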

## (B) Implement

Implement the recursion above using memoization and recover the solution. 

In particular, the function `lpm(s,p)` must return a string `t` that is the longest substring of `s` and matches the pattern `p`.

In [37]:
def lpm(s, p):
    n = len(s)
    m = len(p)
    # Two tables, both with rows indexed by j (string) and columns by k (pattern).
    # length[j][k] holds LPM(j, k); the extra row and column are the base cases (0).
    length = [[0 for x in range(m + 1)] for y in range(n + 1)]
    # decision[j][k] records the choice made at (j, k):
    #   0 = s[j] matches p[k], take s[j] and move down
    #   1 = skip s[j] and move down
    #   2 = skip p[k] and move right
    decision = [[0 for x in range(m + 1)] for y in range(n + 1)]

    # Fill the tables bottom-up, from the last row and column backwards, so that
    # length[j+1][k] and length[j][k+1] are already known when they are needed.
    for j in range(n - 1, -1, -1):
        for k in range(m - 1, -1, -1):
            if s[j] == p[k]:
                length[j][k] = length[j + 1][k] + 1
                decision[j][k] = 0
            else:
                length[j][k] = max(length[j + 1][k], length[j][k + 1])
                if length[j + 1][k] == length[j][k]:
                    decision[j][k] = 1
                else:
                    decision[j][k] = 2

    # Follow the decisions from (0, 0) to recover the matched string.
    match = ''
    j = 0
    k = 0
    while j < n and k < m:
        if decision[j][k] == 0:
            match += s[j]
            j = j + 1
        elif decision[j][k] == 1:
            j = j + 1
        else:
            k = k + 1

    return match

In [38]:
# TESTS: DO NOT EDIT
# I wonder if these solutions are unique or other solutions are possible of equal length.
# If you find other solutions, post them on piazza under a single post please.
assert( len(lpm('ACTTTTACTTTTTGGATT','TGA')) == len('TTTTTTTTTGGA') )
assert( len(lpm('ATCATCATCTCATCATCGATTAACA', 'ACT')) == len('ATTTTTTTT') )
assert( len(lpm('ATCCG','CT')) == 2)
assert( len(lpm('GATTACAAAAAACTAGAGAGAGAGATTAAATACCAACACCTAT','GATAC')) == len('GATTAAAAAAAAAAAAAAAAACCCCC'))
assert( len(lpm('GGAATTAACCAACACAA','CAT')) == len('AAAAAAAAA'))

TTTTTTTTTGGA
AAAAAAAAA
CC
GAAAAAAAAAAAAAATTAAAAAAACC
AAAAAAAAA
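
On the question of whether the solutions are unique: for short inputs this can be checked by brute force. The sketch below is an assumption-laden illustration, not part of the assignment. It enumerates subsequences of `s` from longest to shortest and returns every distinct string of maximum length that matches the pattern. It takes time exponential in `len(s)`, so it is only practical for strings of roughly 20 characters or fewer.

```python
import re
from itertools import combinations

def all_longest_matches(s, p):
    """Return the set of all distinct longest subsequences of s matching p."""
    regex = re.compile(''.join(re.escape(c) + '*' for c in p))
    # Try sizes from largest to smallest; the empty string always matches,
    # so the loop is guaranteed to return by size 0.
    for size in range(len(s), -1, -1):
        found = set()
        for idx in combinations(range(len(s)), size):
            t = ''.join(s[i] for i in idx)
            if regex.fullmatch(t):
                found.add(t)
        if found:
            return found

print(all_longest_matches('ATCCG', 'CT'))  # {'CC'}
```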


# Problem 2: Reduce Total Variation (25 points)

You are given an array $a$ of integers of length $n$, e.g., $a = [1,2, 3, -1, 3, 2]$. 
The sum of the array is simply $sum(a) = a[0] + \cdots + a[n-1]$. For example, $sum(a) = 10$. 
The _total variation_ is the sum of the absolute values of the differences between successive elements of the array:
$tv(a) =   |a[0] - a[1]| + | a[1] - a[2] | + \cdots + |a[n-2] - a[n-1] | $.
For instance, in the example, $tv(a) =  |1-2| + | 2-3| + | 3-(-1)| + |-1-3| + |3-2| = 11$.


You are allowed to add/subtract $0, 1$ or $2$ to each element of the array such that 
(a) the sum of the array remains the same and (b) the total variation of the array is minimized.

For instance, consider the array $a$ with $tv(a) = 11$.
We can modify it as  $a = [1,2, 3\color{red}{-1}, -1\color{red}{+2}, 3\color{red}{-1}, 2]$, yielding
$[1,2,2,1,2,2]$. The sum remains unchanged but the new total variation becomes $3$.
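
The arithmetic in this example can be checked with a short sketch. The function name `total_variation` is chosen here for illustration only.

```python
def total_variation(a):
    """Sum of absolute differences between successive elements."""
    return sum(abs(x - y) for x, y in zip(a, a[1:]))

original = [1, 2, 3, -1, 3, 2]
modified = [1, 2, 2, 1, 2, 2]   # changes applied: 0, 0, -1, +2, -1, 0

print(sum(original), total_variation(original))  # 10 11
print(sum(modified), total_variation(modified))  # 10 3
```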

Design a dynamic programming solution that will modify each element of the array by adding/subtracting $0,1,$ or $2$ in order to reduce the total variation of an array while the sum remains unchanged.

## Set Up Recursion

Define a recursive function 

$minTV(j, S, p)$ as the minimum total variation distance solution for the sub array 
$a[j], \ldots, a[n-1]$, starting from the index $j$ when $S$ is the total change to the sum for
the prefix $a[0], \ldots, a[j-1]$, and $p \in \{ -2, -1, 0, 1, 2 \}$ is the change that was made to 
$a[j-1]$.

Write down the base cases for this recursion. Also specify how you would call this recursion to solve the
problem for a given array $a$.

**Hint** Convince yourself why we need to track the values of $S$ and $p$ in the recurrence.


## Solution

__ Write Recurrence and Base Cases __

__Recurrence__

Given an array $a$ of length $n$:

$\bullet$ $j$ is the current position in the array

$\bullet$ $S$ is the total change made so far to the sum, i.e. the sum of the changes applied to $a[0], \ldots, a[j-1]$

$\bullet$ $p_1$ is the change made in the current step ($j$)

$\bullet$ $p$ is the change made in the previous step ($j-1$)


\begin{equation*}
minTV(j,S,p)=\min_{p_1 \in \{-2,-1,0,1,2\}} \Big[ \text{abs}\big[\underbrace{(a[j]+p_1)}_{\text{current step}}-\underbrace{(a[j-1]+p)}_{\text{previous step}}\big]+minTV(j+1,S+p_1,p_1) \Big]
\end{equation*}


Expand the equation above for each of the 5 values $p_1 \in \{-2,-1,0,1,2\}$. Each option recursively calls $minTV$ on the rest of the array, from $j+1$ onward.
    
$\blacktriangleright$ for $j \geq 1$


\begin{equation*}
minTV(j,S,p)=min \begin{cases} \text{abs}[(a[j]-2)-(a[j-1]+p)]+[minTV(j+1,S-2,-2)] \\ \text{abs}[(a[j]-1)-(a[j-1]+p)]+[minTV(j+1,S-1,-1)] \\ \text{abs}[(a[j])-(a[j-1]+p)]+[minTV(j+1,S,0)]  \\ \text{abs}[(a[j]+1)-(a[j-1]+p)]+[minTV(j+1,S+1,1)] \\ \text{abs}[(a[j]+2)-(a[j-1]+p)]+[minTV(j+1,S+2,2)]
\end{cases}
\end{equation*}


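$\blacktriangleright$ for $j=0$ there is no previous element, so the absolute-difference term is dropped and only the recursive calls remain:

\begin{equation*}
minTV(0,S,p)=\min_{p_1 \in \{-2,-1,0,1,2\}} minTV(1,S+p_1,p_1)
\end{equation*}

The value of $p$ is not used here, since a difference with a previous element cannot be computed when there is no previous element.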


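__Base Case__

When $j=n$ the whole array has been processed, and a solution is valid only if the total change $S^*$ to the sum is zero:

\begin{equation*}
minTV(n, S^*, p) = \begin{cases} 0 & \text{if } S^* = 0 \\ \infty & \text{if } S^* \neq 0 \end{cases}
\end{equation*}

To solve the problem for a given array $a$, call $minTV(0, 0, 0)$: no changes have been made yet, and the third argument is ignored at $j=0$.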
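
The recurrence and base case above can be transcribed directly into a plain recursion, without memoization, that returns only the minimum total variation. This is a minimal sketch for small arrays, since it explores $5^n$ combinations; the names `min_tv_value` and `rec` are illustrative assumptions.

```python
from math import inf

def min_tv_value(a):
    """Minimum total variation reachable with changes in {-2,...,2} summing to 0."""
    n = len(a)

    def rec(j, S, p):
        # Base case: the whole array is processed; valid only if the sum is unchanged.
        if j == n:
            return 0 if S == 0 else inf
        best = inf
        for p1 in (-2, -1, 0, 1, 2):
            # No difference term at j == 0 because there is no previous element.
            step = abs((a[j] + p1) - (a[j - 1] + p)) if j > 0 else 0
            best = min(best, step + rec(j + 1, S + p1, p1))
        return best

    return rec(0, 0, 0)

print(min_tv_value([2, 1, 2, -1]))  # 0, reached by [1, 1, 1, 1]
```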


## Implementation

Implement a function `minimizeTotalVariation` that given an array $a$ returns a new array $\hat{a}$ 
wherein each element of $\hat{a}$ is obtained by adding/subtracting either 0, 1 or 2 to the corresponding element of $a$, the sum of $\hat{a}$ equals that of $a$, and the total variation of $\hat{a}$ is as small as possible.

Note that building a memo table is slightly harder for this example. You may just want to implement the recursion and just cache previously seen recursive calls in a hashtable.

__Suggestion__ Solve this problem in two steps. First implement the recursion without memoization and work on how to recover the solution. Next, use a dictionary to memoize.

In [65]:
from math import inf

def minTV(a, j, s, q, memo):
    # memo maps (j, s, q) -> (minimum total variation, list of changes for a[j:])
    if (j, s, q) in memo:
        v, solution = memo[(j, s, q)]
        return v, list(solution)  # return a copy so the caller can modify it safely
    # Base case: the whole array is processed; valid only if the net change is 0
    if j == len(a):
        if s == 0:
            return 0, []
        else:
            return inf, []
    # Recursion: try each change p for a[j] and keep the best of the 5 choices
    v, solution = inf, []
    for p in [-2, -1, 0, 1, 2]:
        v1, solution1 = minTV(a, j + 1, s + p, p, memo)
        if j > 0:
            # add the difference between the modified a[j] and the modified a[j-1]
            v1 = v1 + abs((a[j] + p) - (a[j - 1] + q))
        solution1.insert(0, p)
        if v1 < v:
            v = v1
            solution = solution1
    memo[(j, s, q)] = v, list(solution)

    return v, solution


def minimizeTotalVariation(a):
    memo = {}
    totalvariation, solution = minTV(a, 0, 0, 0, memo)

    # apply the recovered changes to the original array
    return [a[i] + solution[i] for i in range(len(a))]


In [66]:
minimizeTotalVariation([-2,1,-1,-1])

[-1, -1, -1, 0]
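
For small arrays, the minimum can be cross-checked independently of the dynamic program by exhaustive search. The sketch below is an illustration with an assumed name, `brute_force_min_tv`. It tries all $5^n$ change vectors, keeps those that sum to zero, and reports the smallest total variation found.

```python
from itertools import product

def brute_force_min_tv(a):
    """Exhaustively find the minimum total variation over sum-preserving changes."""
    best = None
    for changes in product((-2, -1, 0, 1, 2), repeat=len(a)):
        if sum(changes) != 0:
            continue  # the sum of the array must stay the same
        b = [x + c for x, c in zip(a, changes)]
        tv = sum(abs(u - v) for u, v in zip(b, b[1:]))
        if best is None or tv < best:
            best = tv
    return best

print(brute_force_min_tv([-2, 1, -1, -1]))  # 1
```

The output above, `[-1, -1, -1, 0]`, has total variation 1 and the same sum as the input, so it agrees with this check.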

In [67]:
# TEST CODE DO NOT EDIT
def calculateTotalVariation(a):
    n = len(a)
    tv = 0
    for i in range(1,n):
        tv = tv + abs(a[i]- a[i-1])
    return tv

def checkResults(a, b):
    sol=minimizeTotalVariation(a)
    assert (sum(sol) == sum(a)), 'Test failed: you do not preserve the sum of elements of the array'
    assert (calculateTotalVariation(sol) == calculateTotalVariation(b)), 'Test failed: your solution does not minimize the total variation'
    print('Test Passed')

In [68]:
checkResults([2,1,2,-1],[1,1,1,1])

Test Passed


In [69]:
checkResults([1,3,4,-2,1,4,2], [3, 3, 3, 0, 0, 2, 2] ) 

Test Passed


In [70]:
checkResults([-2,1,-1,-1],[0, -1, -1, -1])

Test Passed


In [71]:
checkResults([-1,-1,1,-1], [-1, -1, 0, 0])

Test Passed


In [72]:
checkResults([-1, -1, 3, 4, 1, 0, 9, -2, 4, -3], [-1, -1, 2, 2, 2, 2, 7, 0, 2, -1])

Test Passed
