**String Alignment as a Maximum Scoring Path Problem**

Make sure you read the supplementary pdf document in Canvas.

In the previous homework, you learned about the use of dynamic programming to solve the least cost path problem. In this assignment we explore how the same method can be used to align strings. String alignment is useful in a variety of contexts, including but certainly not limited to

- bioinformatics - comparing DNA or RNA strings
- plagiarism detection - comparing two documents for similarity

We will focus on the following problem. We start with 

- a fixed alphabet of available characters, all assumed to be non-white space characters for simplicity.
- a special non-white space character that is not in the alphabet that we will use for a special purpose to be described below.

In the following example, we take the alphabet to be the four-character set {A, C, G, T} and the special character to be the equal sign (=).

We are given a pair of strings formed from the alphabet to be aligned. For example

string1 = AAAGTCCAGCTA

and

string2 = AAGTCCCGGGCTA

These strings **do not** have to be of the same length.

An alignment will refer to a new pair of strings astring1 and astring2 with the following rules:


- astring1 is obtained from string1 by *possibly* inserting the special character in some position(s) (possibly at the beginning or end of string1).
- astring2 is obtained from string2 by *possibly* inserting the special character in some position(s) (possibly at the beginning or end of string2).
- astring1 and astring2 are the same length.
- if the special character appears in position i of astring1 then it cannot appear in position i of astring2.
- if the special character appears in position i of astring2 then it cannot appear in position i of astring1.

For the above example, a possible alignment could be

$\verb+astring1: AAA=GTCCAG==CTA+$ <br>
$\verb+astring2: AAGT==CCCGGGCTA+$
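As a quick sanity check, the rules can be verified directly on this example. The snippet below is only a sketch: the variable names are chosen here for illustration and are not part of the assignment.

```python
Ichar = "="
string1, string2 = "AAAGTCCAGCTA", "AAGTCCCGGGCTA"
astring1, astring2 = "AAA=GTCCAG==CTA", "AAGT==CCCGGGCTA"

# Removing the special character must give back the original strings.
recovers_1 = astring1.replace(Ichar, "") == string1
recovers_2 = astring2.replace(Ichar, "") == string2

# Both alignment strings must have the same length.
same_length = len(astring1) == len(astring2)

# No position may hold the special character in both strings
# (the index loop is safe here because the lengths are equal).
no_double_insertion = all(
    not (astring1[i] == Ichar == astring2[i]) for i in range(len(astring1))
)

print(recovers_1, recovers_2, same_length, no_double_insertion)
```

All four checks should print `True` for this pair.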

**Problem 1 (5 points)**

If we have an alignment (astring1 and astring2) of two strings (string1 and string2) that satisfies the rules of an alignment based on a special insertion character, there is a simple way to determine string1 from astring1 (and string2 from astring2).

Write a function called **GetStringFromAlignmentString** that takes as input 

- **Ichar** = the insertion character
- **astring** = a string that has some non-special characters and possibly some occurrences of the special insertion character,

and which gives as output

- **string** = the string obtained by removing all occurrences of the special insertion character from astring.

**Your function can be written using a single line of code after the def line - full credit will only be given if you give a version with this property.**

**You are not allowed to use semicolons to get the code into one line.**

Use the following cell for your code.

In [1]:
# Code cell for Problem 1 - do not delete or modify this line
def GetStringFromAlignmentString(Ichar,astring):
    #return ''.join(astring.split(Ichar))
    return astring.replace(Ichar,'')

In [2]:
# Test cell for Problem 1 - do not modify this cell
# Do execute it.
GetStringFromAlignmentString("_","_AB_CDE_FGH_IJ_")

'ABCDEFGHIJ'

**Problem 2 (15 points)**

Write a function called **CheckAlignment** that takes as input

- **string1** = first input string 
- **string2** = second input string
- **Ichar** = special insertion character (not in the string alphabet)
- **astring1** = an alignment string obtained from string 1
- **astring2** = an alignment string obtained from string 2

and which checks whether or not the alignment strings satisfy the alignment rules given above.

This function assumes that the input strings 

- have positive length, and 
- use characters only from the alphabet and no insertion characters.

The function gives as output **result** = a dictionary with the following key/value pairs

> ("rules_satisfied", True/False), where True means the rules are satisfied and False means they are not satisfied

> ("same_length",True/False), where True means the alignment strings are the same length and False means they are not

> ("correct1",True/False), where True means that astring1 is obtained from string1 by possibly (but not necessarily) adding an insertion character in some position(s) of string1, and False means it is not.

> ("correct2",True/False), where True means that astring2 is obtained from string2 by possibly (but not necessarily) adding an insertion character in some position(s) of string2, and False means it is not.

> ("no_insertion_violations",True/False), where True means that  there are no instances in which both alignment strings have an insertion character in the same position. Note - if the alignment strings are of different lengths, you should only check the positions in which comparison of characters is possible. In other words, only check positions  from 0 to the minimum of the two alignment string lengths.
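
The remark about alignment strings of different lengths can be illustrated with `zip`, which stops at the shorter of its arguments. This is only a sketch: the helper name `no_insertion_violations` is hypothetical, and in Problem 2 the same expression would have to live inside the function itself. The sample strings are taken from the printed output of the Problem 2 test cells.

```python
def no_insertion_violations(astring1, astring2, Ichar):
    """Hypothetical helper: True if no compared position has Ichar in both strings."""
    # zip stops at the shorter string, so only positions
    # 0 .. min(len(astring1), len(astring2)) - 1 are compared.
    return not any(a == Ichar and b == Ichar for a, b in zip(astring1, astring2))

# Lengths 14 and 15: the last character of the second string is never compared.
print(no_insertion_violations("CHZWLIZIX__ZUU", "YGZRBCH_IAV_XWN", "_"))
# Equal lengths, with the insertion character in both strings at positions 7 and 8.
print(no_insertion_violations("AVTH_PD__Z_TK_Z", "TTZQIUC__BY_HGG", "_"))
```

The first call should print `True` and the second `False`, in agreement with the dictionaries printed by the test cells.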

**This function will be used in Problem 3 so make sure you get it right.**

Your function **should be completely self-contained** meaning that no packages can be imported, and no variables or functions can be defined outside of your function.

In particular, your function **is not allowed to use a call of the function in Problem 1.** 

Use the following cell for your code.


In [3]:
# Code cell for Problem 2 - do not modify or delete this line
def CheckAlignment(string1, string2,  Ichar, astring1, astring2):
    # Use the name "result" rather than "dict", which would shadow the built-in type.
    result={}
    result["same_length"]=(len(astring1)==len(astring2))
    # Removing every insertion character must give back the original string.
    result["correct1"]=(astring1.replace(Ichar,'')==string1)
    result["correct2"]=(astring2.replace(Ichar,'')==string2)
    # Positions of the insertion character in each alignment string; a shared
    # position can only occur within the first min(len) positions.
    gaps1={i for i, c in enumerate(astring1) if c == Ichar}
    gaps2={i for i, c in enumerate(astring2) if c == Ichar}
    result["no_insertion_violations"]=gaps1.isdisjoint(gaps2)
    result["rules_satisfied"]=all(result.values())
    return result

In [4]:
# Test cell1 for Problem 2 - do not delete or modify this cell
# Do execute it.
import numpy as np
np.random.seed(189)
alphabet=[chr(i) for i in range(65,91)]
n1=10
n2=12
m=15
Ichar="_"
string1="".join(list(np.random.choice(alphabet,size=n1)))
string2="".join(list(np.random.choice(alphabet,size=n2)))
print(string1)
print(string2)
astring1=[string1[i] for i in range(len(string1))]
astring2=[string2[i] for i in range(len(string2))]
for i in range(m-n1):
    pos=np.random.choice(range(len(astring1)+1))
    astring1.insert(pos,Ichar)
for i in range(m-n2):
    pos=np.random.choice(range(len(astring2)+1))
    astring2.insert(pos,Ichar)
astring1="".join(astring1)
astring2="".join(astring2)
print(astring1)
print(astring2)
res=CheckAlignment(string1, string2, Ichar, astring1, astring2)
print(res)


AVTHPDZTKZ
TTZQIUCBYHGG
AVTH_PD__Z_TK_Z
TTZQIUC__BY_HGG
{'same_length': True, 'correct1': True, 'correct2': True, 'no_insertion_violations': False, 'rules_satisfied': False}


In [5]:
# Test cell2 for Problem 2 - do not delete or modify this cell
# Do execute it.
import numpy as np
np.random.seed(192)
alphabet=[chr(i) for i in range(65,91)]
n1=12
n2=13
m1=14
m2=15
Ichar="_"
string1="".join(list(np.random.choice(alphabet,size=n1)))
string2="".join(list(np.random.choice(alphabet,size=n2)))
print(string1)
print(string2)
astring1=[string1[i] for i in range(len(string1))]
astring2=[string2[i] for i in range(len(string2))]
for i in range(m1-n1):
    pos=np.random.choice(range(len(astring1)+1))
    astring1.insert(pos,Ichar)
for i in range(m2-n2):
    pos=np.random.choice(range(len(astring2)+1))
    astring2.insert(pos,Ichar)
astring1="".join(astring1)
astring2="".join(astring2)
print(astring1)
print(astring2)
res=CheckAlignment(string1, string2, Ichar, astring1, astring2)
print(res)

CHZWLIZIXZUU
YGZRBCHIAVXWN
CHZWLIZIX__ZUU
YGZRBCH_IAV_XWN
{'same_length': False, 'correct1': True, 'correct2': True, 'no_insertion_violations': True, 'rules_satisfied': False}


In [6]:
# Test cell3 for Problem 2 - do not delete or modify this cell
# Do execute it.
import numpy as np
np.random.seed(191)
alphabet=[chr(i) for i in range(65,91)]

n1=1000
n2=1050
m=1075
Ichar="_"
string1="".join(list(np.random.choice(alphabet,size=n1)))
string2="".join(list(np.random.choice(alphabet,size=n2)))
astring1=[string1[i] for i in range(len(string1))]
astring2=[string2[i] for i in range(len(string2))]
for i in range(m-n1):
    pos=np.random.choice(range(len(astring1)+1))
    astring1.insert(pos,Ichar)
for i in range(m-n2):
    pos=np.random.choice(range(len(astring2)+1))
    astring2.insert(pos,Ichar)
astring1="".join(astring1)
astring2="".join(astring2)
print(astring1)
print(astring2)
res=CheckAlignment(string1, string2, Ichar, astring1, astring2)
print(res)


RDGFBEO_UYTMUHPDCIYCMTPQFVDTMSXEOBNA_JDAAXVGUQ_NJFKRNAFJTAKWN_GWGWFDKKEBLYDGPRBWQXVTQADGRWBWYSIYSDVZPQLCK_ZPKEJWF_C_SJZJEVFKDIGFMUQSYUEIZ_MCZD_BYG_H_S_OIARAAIKOUSTQFTZJGSHRMWLSQSSKKYVZ_TMBKMQOKIHEPHP_XKXARNYNYRLBEU_AWNFVJKNHRCEOUHEAERARJBYWYUWGGASMEINCTLZUJGUCVRDWMNNRDHF_HSPWPCVBVCQSGSGPHANABGXBEEDUH_NJRMFODBKAAKKIS_JTXWPNWCISUACRPUCDGCFSFZIXZJGZZZTPOWRMCXAVBTDDNSDYQSTKCRDAFROPQNSZSINZ_Z_WAKWW_XPYZLVYDXA_ZXMNGETLLIPRSFHXVUQMAGJVNJGM_TL_GBILGXXOOATKFPENDUXUFYWPLNEOUQDGZG_ICLCOTN_MMMVKQ_EJIFAJRGRSLIMFYVMPVYNNRZRDSMRH_FBDUWYDBPEDX_GKTQJJTSCOGXTH_AMCTNZUN_TMWOOR_LBQSRYTLRXPJOWFLRYNVHAAUUYLK_IOQSLUJOJYZUHQCWTMDULBCDJBGYDAWUR_NKPB_BTNR_OTAHGIQDFZHXBQPTFEDGBFVAHMJTUNZVXYYRJ_MRWLONKHCKOKBTFLV_OYFTJYSTOJRZKLVPAWEDGTHGWHHBJSXBEFUFDTFFCRT__TQPWMPA_WJN_SBQMMZGY_XKGALLVVNKWI_NN_ZHLDAUVZZMK_FRMIOIA_DGPALU_PEWY_QVMVQINIKCAQWSYMDXEWTTRJ__LGPBII_ITOLH_COCPDA_QC_DEIPPDKTVLBYATNFAFMRNMDUBS_ZYUO_KRUHTAXVAFWJVYG_UEZZCYT_OERZETHGAQ_EB_SVLDGBNRKVE_WYKPHHVP_DYXEU_JGOMOGEPA_JYPXJPECMFDMWSMEMDPPKS_JOAESDZJQ_YVP

In [7]:
# Test cell4 for Problem 2 - do not delete or modify this cell
# Do execute it.
import numpy as np
np.random.seed(191)
alphabet=[chr(i) for i in range(65,91)]
n1=1040
n2=1050
m=1060
Ichar="_"
string1="".join(list(np.random.choice(alphabet,size=n1)))
string2="".join(list(np.random.choice(alphabet,size=n2)))
astring1=[string1[i] for i in range(len(string1))]
astring2=[string2[i] for i in range(len(string2))]
for i in range(m-n1):
    u=np.random.uniform(0,1)
    pos=np.random.choice(range(len(astring1)+1))
    if u>.1:
        astring1.insert(pos,Ichar)
    else:
        astring1.insert(pos,"T")
for i in range(m-n2):
    pos=np.random.choice(range(len(astring2)+1))
    astring2.insert(pos,Ichar)
astring1="".join(astring1)
astring2="".join(astring2)
print(astring1)
print(astring2)
res=CheckAlignment(string1, string2, Ichar, astring1, astring2)
print(res)

RDGFBEOUYTMUHPDCIYCMTPQFVDTMSXEOBNAJDAAXVGUQNJFKRNAFJTAKWNGWGWFDKKEBLYDGPRBWQXVTQADGRWBWYSIYSDVZPQLCKZPKEJWFCSJZJEVFKDIGFMUQSYUEIZMCZDBYGHSOIARAAIKO_USTQFTZJGSHRMWLSQSSKKYVZTMBKMQOKIHEPHPXKXARNYNYRL_BEUAWNFVJKNHRCEOUHEAERARJBYWYUWGGASMEINCTLZUJGUCVRDWMNNRDHFHSPWPCVBVCQSGSGPHANABGXBEEDUHNJRMFODBKAA_KKISJTXWPNWCISUACRPUCDGCFSFZIXZJGZZZTPOWRMCXAVBTDDNSDYQSTKCRDAFROPQNSZSINZZWAKWWXPYZLVTYDXAZXM_NGETLLIPRSFHXVUQMAGJVNJGMTLGBILGXXOOA__TKFPENDUXUFYWPLNEOUQDGZGICLCOTNMMMVKQEJIFAJ_RGRSLIMFYV_MPVYNNRZRDSMRHFBDUWYDBPEDX_GKTQJJTSCOGXTHAMCTNZUNTMWOORLBQSRYTLRXPJOWFLRYNVHAAUUYLKIOQSLUJOJYZUHQCWTMDULBCDJBGYDAWURNKPBBTNROTAHGIQDFZHXBQPTFEDGBFVAHMJTUNZVXYYRJMRWLONKHCKOKBTFLVOYFTJYSTOJRZKLVPAWEDGTHGWHHBJSXBEFUFDTFFCRTTQPWMPAWJNSBQMMZGYXKGALLVVNKWINNZHLDAUVZZMKFRMIOIADGPALUPEWYQVMVQINIKCAQWSYMD_XEWTTRJL_GPBI_IITOLHCOCPDAQCDEI_PPDKTVLBYATNFAFMRNMDUBSZYUOKRUHTA_XVAFWJVYGUEZZCYTOERZETHGAQEBSVLDGB_NRKVEWYKPHHVPDYXEUJGOM_OGEPAJYPXJPECMFDMWSMEMDPPKSJOAESDZJQYVPTRJSZFXYOAAATTQPEOFDHSQNGGSTRMPIOVAENSKYGGZWRICZGW

**Scoring an Alignment**

We're interested in finding a best alignment, and for this we need a **scoring system**. The score of an alignment is obtained by summing, over all positions of astring1 and astring2, one of the following:

- the score $s_M$ of a mismatch, i.e. when the characters in astring1 and astring2 are both from the alphabet but are different

- the score $s_I$ of an insertion, i.e. when the character in one of the strings is the special character and the character in the other string is from the alphabet

- the score $s_P$ of a perfect match, i.e. when the characters in the position are the same alphabetic character

For example, assume $s_M=-2,$ $s_I=-1,$ and $s_P=3.$ In the alignment

$\verb+astring1: AAA=GTCCAG==CTA+$ <br>
$\verb+astring2: AAGT==CCCGGGCTA+$

we have the following

$$
\begin{array}{ccccc}
\mbox{position} & \mbox{astring1} & \mbox{astring2} & \mbox{property} & \mbox{score} \\ \hline
0 & A & A & \mbox{perfect match} & 3 \\
1 & A & A& \mbox{perfect match} & 3 \\
2 & A & G & \mbox{mismatch} & -2 \\
3 & = & T & \mbox{insertion} & -1 \\
4 & G & = & \mbox{insertion} & -1 \\
5 & T & = & \mbox{insertion} & -1 \\
6 & C & C & \mbox{perfect match} & 3 \\
7 & C & C& \mbox{perfect match} & 3 \\
8 & A & C& \mbox{mismatch} & -2 \\
9 & G & G& \mbox{perfect match} & 3 \\
10 &= & G& \mbox{insertion} & -1 \\
11 &= & G& \mbox{insertion} & -1 \\
12 & C& C & \mbox{perfect match} & 3 \\
13 &T & T &\mbox{perfect match} & 3 \\
14 &A & A& \mbox{perfect match} & 3 \\ \hline
\end{array} 
$$

There are 8 perfect matches, 2 mismatches and 5 insertions making the total score 

$$
8 \times 3+ 2\times (-2) + 5 \times (-1) = 24 - 4 -5 = 15.
$$
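
The table above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the two alignment strings already satisfy the alignment rules; the helper name `score_columns` is introduced here for illustration only and is not the function requested in Problem 3.

```python
def score_columns(astring1, astring2, Ichar, sM, sI, sP):
    """Hypothetical helper: classify each position and return (nM, nI, nP, score)."""
    nM = nI = nP = 0
    for a, b in zip(astring1, astring2):
        if Ichar in (a, b):
            nI += 1      # one side holds the insertion character
        elif a == b:
            nP += 1      # same alphabet character on both sides
        else:
            nM += 1      # two different alphabet characters
    return nM, nI, nP, sM * nM + sI * nI + sP * nP

print(score_columns("AAA=GTCCAG==CTA", "AAGT==CCCGGGCTA", "=", -2, -1, 3))
# (2, 5, 8, 15)
```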

**Problem 3 (15 points)**

We need a function to compute scores. Write a function called **AlignmentScore** that takes as input:

- **sM** = score of a mismatch
- **sI** = score of an insertion
- **sP** = score of a perfect match
- **Ichar** = special character (not in the alphabet used for input strings) for representing insertions in an alignment
- **astring1** = alignment string 1 
- **astring2** = alignment string 2

and gives as output **result**, which is a tuple of size 1 or 5.

If astring1 and astring2 do not satisfy the rules for the alignment of a pair of strings then 
result should be the 1-tuple **(False,).**

If astring1 and astring2 do satisfy the rules for the alignment of a pair of strings then 
result should be the 5-tuple **(True, nM, nI, nP, score)** where

- **nM** = the number of mismatches of non special characters
- **nI** = the number of times a non special character in one string is matched with the special insertion character
- **nP** = the number of perfect matches
- **score** = the score of the alignment

**Your function should make use of the function in Problem 2, but otherwise should be self-contained**

Use the following cell for your code.

In [8]:
# Code cell for Problem 3 - do not remove or modify this line.
def AlignmentScore(sM,sI,sP,Ichar,astring1,astring2):
    # Recover the unaligned strings directly, so that only the Problem 2 function is called.
    str1 = astring1.replace(Ichar,'')
    str2 = astring2.replace(Ichar,'')
    res=CheckAlignment(str1, str2, Ichar, astring1, astring2)
    if not res['rules_satisfied']:
        return (False,)
    k=len(astring1)
    nM=0
    nI=0
    nP=0
    for i in range(k):
        if astring1[i]==astring2[i]:
            nP += 1
        elif Ichar in [astring1[i], astring2[i]]:
            nI += 1
        else:
            nM += 1
    score = sM*nM+sI*nI+sP*nP
    return (True, nM, nI, nP, score)

In [10]:
# Test cell1 for Problem 3 - do not delete or modify this cell
# Do execute it.
import numpy as np
np.random.seed(18391)
sM=-2
sI=-1
sP=5
Ichar="_"
n=10
astring1=np.random.choice(["A","C","G","T"],size=n)
astring2=np.random.choice(["A","C","G","T"],size=n)
p=np.random.choice(range(n),size=5)
for i in p[0:2]:
    astring1[i]="_"
for i in p[2:5]:
    astring2[i]="_"
astring1="".join(astring1)
astring2="".join(astring2)
res=AlignmentScore(sM,sI,sP,Ichar,astring1,astring2)
print(astring1)
print(astring2)
print(res)


CCTG_GGGG_
_CCTAGG__T
(True, 2, 5, 3, 6)


In [11]:
# Test cell2 for Problem 3 - do not delete or modify this cell
# Do execute it.
import numpy as np
np.random.seed(18391)
sM=-2
sI=-1
sP=7
Ichar="_"
n=300
astring1=np.random.choice(["A","C","G","T"],size=n,p=[.4,.3,.2,.1])
astring2=np.random.choice(["A","C","G","T"],size=n,p=[.4,.3,.2,.1])
p=np.random.choice(range(n),size=10)
for i in p[0:6]:
    astring1[i]="_"
for i in p[6:10]:
    astring2[i]="_"
astring1="".join(astring1)
astring2="".join(astring2)
print(astring1)
print(astring2)
res=AlignmentScore(sM,sI,sP,Ichar,astring1,astring2)
print(res)


AATAA_GCAACAGGAAAGACTCAAGAACAAT_CTTAG_CAGTACAAGTACGAGAAATGCCAGCCTGGACGACCCACAATACATTAAAAAACCCAAGGGAGCACCGACAACGCAAACACGACC_ACGGGGCCAA__AGAAAGTCGCACGCTGGAGAATGCCATGAGACCAACCCAAGAACCCATGACCAGGCCGCACACACCTCGCCGAATCAAAGCCGCAACCCGGAATAATAATCGAAGGAAATACAATAACACGGCCCCTAGCGACAGCAACGGAGAAAACGTACGAGTCCAATGCCA
ATCACTGACTGAGGGCCCTCGACACAATGAAAAGCAAATTGAGGGACCGCGA_GCCGTAATCCTTACGCCGTCTGATGCTATCCCACCTGACGAAGGAAACAGAAATAACACACCAGAGGATGCAACTATCC_ACGAGACACAAGCGCAAACCAACCATACATCATACA_ACACGAGCGAGCAAGACGAAATTCAAGCCAACCAATAACCGCAGACCACCGATCAACCGCACAGCACGAGAAAGACCACATGGAAAAAAACGCCCCGACTTAATG_CGAAAGCGAGGCATGTTGCGACCA
(True, 212, 10, 78, 112)


**Distinct Permutations**

Recall the function from the Least Cost Path assignment for iterating over all distinct permutations of lists containing some identical elements. For example, we might want to list all permutations of the list ["a","a","b"]. The following code gives such a list:

In [12]:
from sympy.utilities.iterables import multiset_permutations as mp
list(mp(["a","a","b"]))

[['a', 'a', 'b'], ['a', 'b', 'a'], ['b', 'a', 'a']]
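
For very short lists, the same result can be cross-checked with the standard library alone. The sketch below is not a replacement for `multiset_permutations`: it generates all $k!$ orderings of a $k$-element list before removing duplicates, so it is only sensible for tiny inputs. The helper name `distinct_permutations` is hypothetical.

```python
from itertools import permutations

def distinct_permutations(items):
    """Hypothetical helper: distinct orderings of items, as a sorted list of lists."""
    # set() removes the duplicates produced by identical elements.
    return [list(p) for p in sorted(set(permutations(items)))]

print(distinct_permutations(["a", "a", "b"]))
# [['a', 'a', 'b'], ['a', 'b', 'a'], ['b', 'a', 'a']]
```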

**Brute Force Approach**

For short strings it is feasible to try all possible alignments. Given string1 has length m and string2 has length n, as described in the pdf document, the possible alignments correspond to the possible paths from (0,0) to (m,n) in which the only allowable moves are:

- H (horizontal) : (i,j)->(i,j+1) 
- V (vertical) : (i,j)->(i+1,j) 
- D (diagonal): (i,j)->(i+1,j+1) 

Here:

- a V move corresponds to appending the next available character from string1 to astring1 and appending the insertion character to astring2,

- an H move corresponds to appending the next available character from string2 to astring2 and appending the insertion character to astring1, and

- a D move corresponds to appending the next available character from string1 to astring1 and appending the next available character from string2 to astring2.



The number of diagonal moves can be any value from the set {0,1...,min(m,n)} and if we fix the number of diagonal moves at $n_D,$ there must be $m-n_D$ vertical moves and $n-n_D$ horizontal moves. 
Thus, the collection of all possible paths corresponds to the permutations of lists consisting of

- $n_D$ D's for some $n_D=0,1,...,\min(m,n)$ (including $\min(m,n)$).
- $m-n_D$ V's, and
- $n-n_D$ H's. 
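
This description also gives a way to count the paths without listing them. For a fixed $n_D$ the list has $m+n-n_D$ entries, so the number of distinct permutations is the multinomial coefficient $(m+n-n_D)!/(n_D!\,(m-n_D)!\,(n-n_D)!)$, and summing over $n_D$ gives the total. The following sketch uses a hypothetical helper name, `count_paths`; it can serve as a check on the length of the list produced in Problem 4.

```python
from math import factorial

def count_paths(m, n):
    """Hypothetical helper: number of H/V/D paths from (0,0) to (m,n)."""
    total = 0
    for nD in range(min(m, n) + 1):
        # Distinct orderings of nD D's, (m - nD) V's and (n - nD) H's.
        total += factorial(m + n - nD) // (
            factorial(nD) * factorial(m - nD) * factorial(n - nD)
        )
    return total

print(count_paths(3, 3), count_paths(4, 3), count_paths(5, 7))
# 63 129 7183
```

These counts grow quickly with m and n, which is why the brute force approach is only feasible for short strings.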

**Problem 4 (25 points)** Write a program called **AllPaths** that takes as input

- **m** = a positive integer
- **n** = a positive integer

and that outputs

- a list of all permutations of $m-n_D$ V's, $n-n_D$ H's and $n_D$ D's for $n_D=0,1,...,\min(m,n).$ Each permutation should be a list.

Use the following cell for your code.

In [13]:
# Code cell for Problem 4 - do not modify or delete this line
from sympy.utilities.iterables import multiset_permutations as mp
def AllPaths(m,n):
    paths=[]
    for nD in range(min(m,n)+1):
        # nD diagonal moves leave n-nD horizontal and m-nD vertical moves.
        L=['H']*(n-nD)+['V']*(m-nD)+['D']*nD
        paths += list(mp(L))
    return paths

In [14]:
# Test cell1 for Problem 4  - do not delete or modify this cell
# Do execute it
m=3
n=3
Paths=AllPaths(m,n)
print(len(Paths))
Symbols=[]
Error=False
for p in Paths:
    Symbols.extend(set(p))
    nD=p.count("D")
    nH=p.count("H")
    nV=p.count("V")
    if nD+nV!=m or nD+nH!=n:
        Error=True
Symbols=set(Symbols)
print(Symbols)
print(Error)

63
{'H', 'V', 'D'}
False


In [15]:
# Test cell2 for Problem 4 - do not delete or modify this cell
# Do execute it
m=4
n=3
Paths=AllPaths(m,n)
print(len(Paths))
Symbols=[]
Error=False
for p in Paths:
    Symbols.extend(set(p))
    nD=p.count("D")
    nH=p.count("H")
    nV=p.count("V")
    if nD+nV!=m or nD+nH!=n:
        Error=True
Symbols=set(Symbols)
print(Symbols)
print(Error)

129
{'H', 'V', 'D'}
False


In [16]:
# Test cell3 for Problem 4 - do not delete or modify this cell
# Do execute it
m=5
n=7
Paths=AllPaths(m,n)
print(len(Paths))
Symbols=[]
Error=False
for p in Paths:
    Symbols.extend(set(p))
    nD=p.count("D")
    nH=p.count("H")
    nV=p.count("V")
    if nD+nV!=m or nD+nH!=n:
        Error=True
Symbols=set(Symbols)
print(Symbols)
print(Error)

7183
{'H', 'V', 'D'}
False


**Problem 5 (25 points)**

Write a function called **AlignmentFromPath** that takes as input

- **string1** = first string to be aligned
- **string2** = second string to be aligned
- **path** = a tuple consisting of H's, V's and D's where, with $m$ the length of string1 and $n$ the length of string2, the number of D's is some number $n_D$ from $\{0,...,\min(m,n)\}$, the number of V's is $m-n_D$ and the number of H's is $n-n_D.$
- **Ichar** = special insertion character

and which returns 

- **astring1** = alignment string associated with string1 based on path
- **astring2** = alignment string associated with string2 based on path


Use the following cell for your code.

In [17]:
# Code cell for Problem 5 - do not delete or modify this line
def AlignmentFromPath(string1,string2,path,Ichar):
    astring1=""
    astring2=""
    k = len(path)
    cnt1=0
    cnt2=0
    for i in range(k):
        if path[i]=='H':
            astring1 += Ichar
            astring2 += string2[cnt2]
            cnt2 += 1
        elif path[i]=='V':
            astring1 += string1[cnt1]
            cnt1 += 1
            astring2 += Ichar
        else:
            astring1 += string1[cnt1]
            astring2 += string2[cnt2]
            cnt1 += 1
            cnt2 += 1
    return astring1, astring2
        

In [18]:
# Test cell1 for Problem 5 - do not delete or modify this cell
# Do execute it
import numpy as np
np.random.seed(18492)
m=5
n=7
string1="".join(np.random.choice(["A","C","G","T"],size=m,p=[.5,.3,.1,.1]))
string2="".join(np.random.choice(["A","C","G","T"],size=n,p=[.5,.3,.1,.1]))
print(string1)
print(string2)
nD=np.random.choice(range(min(m,n)))
nV=m-nD
nH=n-nD
path=["D" for i in range(nD)]+["V" for i in range(nV)]+["H" for i in range(nH)]
path=np.random.permutation(path)
print(path)
astring1,astring2=AlignmentFromPath(string1,string2,path,"_")
print(astring1)
print(astring2)

AAAAT
AAAGCCA
['D' 'H' 'H' 'D' 'H' 'D' 'V' 'V' 'H']
A__A_AAT_
AAAGCC__A


In [19]:
# Test cell2 for Problem 5 - do not delete or modify this cell
# Do execute it
import numpy as np
np.random.seed(18492)
m=120
n=150
string1="".join(np.random.choice(["A","C","G","T"],size=m,p=[.5,.3,.1,.1]))
string2="".join(np.random.choice(["A","C","G","T"],size=n,p=[.5,.3,.1,.1]))
print(string1)
print(string2)
nD=np.random.choice(range(min(m,n)))
nV=m-nD
nH=n-nD
path=["D" for i in range(nD)]+["V" for i in range(nV)]+["H" for i in range(nH)]
path=np.random.permutation(path)
print(path)
astring1,astring2=AlignmentFromPath(string1,string2,path,"_")
print(astring1)
print(astring2)

AAAATAAAGCCAAAACAAATCCAACGCTACCACGACTTAGTATTAAGAATTCAAAAGAACCAAACAAAACGAAGCGCCATTCTACGACAGAAGAAGCGTAAAAAAAAAAATAACACGAAG
AAAAAAAAAACAGCTAAGAAACATCGACAACATCGGAAGCCAAAAACATAGCAGAACCACAACCAACCAAAATAAAAGCCTAACCACACCTGATCCCAAAACCAAAAATACCGGCCTGCCTACCATGAAAACACAAACCCAAACAAACAA
['H' 'D' 'D' 'D' 'D' 'V' 'H' 'H' 'D' 'D' 'H' 'D' 'H' 'H' 'D' 'H' 'D' 'H'
 'D' 'H' 'H' 'D' 'D' 'H' 'H' 'V' 'D' 'D' 'D' 'H' 'H' 'D' 'H' 'D' 'V' 'D'
 'H' 'D' 'H' 'H' 'V' 'H' 'V' 'H' 'V' 'D' 'V' 'D' 'H' 'V' 'H' 'D' 'D' 'H'
 'D' 'V' 'D' 'H' 'D' 'D' 'D' 'H' 'D' 'H' 'D' 'D' 'D' 'H' 'H' 'H' 'D' 'V'
 'D' 'V' 'D' 'H' 'V' 'V' 'V' 'D' 'H' 'H' 'D' 'H' 'H' 'H' 'H' 'D' 'D' 'H'
 'V' 'D' 'H' 'H' 'H' 'D' 'V' 'D' 'H' 'H' 'D' 'D' 'D' 'D' 'D' 'V' 'D' 'V'
 'D' 'D' 'D' 'H' 'D' 'D' 'D' 'D' 'D' 'D' 'D' 'D' 'D' 'D' 'D' 'D' 'H' 'D'
 'D' 'H' 'D' 'D' 'D' 'D' 'D' 'D' 'H' 'D' 'H' 'H' 'D' 'H' 'H' 'H' 'H' 'D'
 'D' 'D' 'D' 'V' 'D' 'H' 'V' 'V' 'V' 'D' 'V' 'D' 'H' 'D' 'V' 'V' 'H' 'H'
 'H' 'D' 'D' 'V' 'D' 'D' 'D' 'H' 'V' 'H' 'D' 'V' 'D' 'V' 'D' 'D' 'D']
_

**Problem 6 (15 points)**

Create a function called **OptimalAlignmentBruteForce** that takes as input

- **string1** = first string to be aligned
- **string2** = second string to be aligned
- **Ichar** = insertion character
- **sM** = score for a mismatch
- **sI** = score for an insertion
- **sP** = score for a match

and that goes through every possible path that is associated with an alignment of the two strings
using your **AllPaths** function from Problem 4 and your **AlignmentFromPath** function from Problem 5, and returns as output:

- **maxscore** = the score of a best alignment
- **astring1** = alignment string corresponding to string1 of a best alignment
- **astring2** = alignment string corresponding to string2 of a best alignment

**Notes:** 

- There may be multiple alignments leading to the same optimal score. Your function need only return one of them
- This code should only be used for strings that are small in size.

Use the following cell for your code

In [20]:
# Code cell for Problem 6 - do not delete or modify this line
def OptimalAlignmentBruteForce(string1,string2,Ichar,sM,sI,sP):
    m=len(string1)
    n=len(string2)
    Paths = AllPaths(m,n)
    maxscore = float("-inf")  # any real score beats this starting value
    ASTRING1 = ""
    ASTRING2 = ""
    for path in Paths:
        astring1, astring2 = AlignmentFromPath(string1,string2,path,Ichar)
        res = AlignmentScore(sM,sI,sP,Ichar,astring1,astring2)
        score = res[4]    
        if score > maxscore:
            maxscore = score
            ASTRING1 = astring1
            ASTRING2 = astring2
    return maxscore, ASTRING1, ASTRING2

In [21]:
#OptimalAlignmentBruteForce("AAAGTCCA","AAGTCCCGA","_",-2,-1,3)

For the test cells below, you will need to execute the code in the following cell.

In [22]:
# Execution cell - do not remove or modify this cell
# Do execute it.
np.random.seed(1842)
clist=["A","C","G","T"]
def replace_random_char(s,clist):
    pos=np.random.choice(len(s))
    c=np.random.choice(clist)
    s=s[0:pos]+c+s[pos+1:]
    return(s)
print(replace_random_char("ATGCATGA",clist))
def insert_random_char(s,clist):
    pos=np.random.choice(len(s))
    c=np.random.choice(clist)
    s=s[0:pos]+c+s[pos:]
    return(s)
print(insert_random_char("ATGCATGA",clist))
def delete_random_char(s):
    pos=np.random.choice(len(s))
    c=np.random.choice(clist)
    s=s[0:pos]+s[pos+1:]
    return(s)
print(delete_random_char("ATGCATGA"))


ATGCATGC
ATTGCATGA
ATGCATA


In [23]:
# Test cell for Problem 6 - do not delete or modify this cell
# Do execute it.
import numpy as np
np.random.seed(14891)
for trial in range(10):
    m=7
    string1="".join(np.random.choice(clist,size=m))
    k=np.random.choice(range(2,5))
    string2=string1
    for i in range(k):
        u=np.random.uniform(0,1)
        if u<.5:
            string2=replace_random_char(string2,["A","C","G","T"])
        elif u<.7:
            string2=insert_random_char(string2,["A","C","G","T"])
        else:
            string2=delete_random_char(string2)
    print(string1)
    print(string2)

    sM=np.random.choice(range(-3,0))
    sI=sM+1
    sP=np.random.choice(range(1,3))
    print(sM,sI,sP)
    score,astring1,astring2=OptimalAlignmentBruteForce(string1,string2,Ichar,sM,sI,sP)
    print(score)
    print(astring1)
    print(astring2)
    print("\n")

GCAGTTC
GCTACGTC
-2 -1 2
9
GC_A_GTTC
GCTACGT_C


TGATTTG
TGAATTGTG
-3 -2 1
3
TGA_TT_TG
TGAATTGTG


CCTGGTC
CCGGAG
-2 -1 1
-1
CC_TG_GTC
CCG_GAG__


TCGACCT
TCGATCGT
-2 -1 2
9
TCGA_C_CT
TCGATCG_T


GGGTTCA
GTGTCA
-3 -2 2
5
GGGTTCA
GTGT_CA


TGTACTC
GTGTACTCGC
-1 0 2
14
_TGTACTC__
GTGTACTCGC


CTCATAA
CTCCATGA
-3 -2 2
7
CTC_ATAA
CTCCATGA


GGCCATC
GCCAC
-1 0 1
5
GGCCATC
G_CCA_C


TATGTAA
ATGTAA
-2 -1 1
5
TATGTAA
_ATGTAA


CCACCTG
CTCCCCG
-1 0 2
10
C_C_ACCTG
CTCC_CC_G




**Optimal String Alignment Using Dynamic Programming**

As described in the above-mentioned pdf document, the optimal alignment problem can be solved using
dynamic programming. 
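
For reference, one standard way to write the recurrence is sketched below. The notation $C(i,j)$ and $\delta(i,j)$ is an assumption made here and may differ from the pdf: $C(i,j)$ denotes the best score of an alignment of the first $i$ characters of string1 with the first $j$ characters of string2, and $\delta(i,j)$ equals $s_P$ if character $i$ of string1 equals character $j$ of string2 and $s_M$ otherwise.

$$
\begin{aligned}
C(0,0) &= 0, \qquad C(i,0) = i\, s_I, \qquad C(0,j) = j\, s_I, \\
C(i,j) &= \max\big\{\, C(i-1,j-1) + \delta(i,j),\;\; C(i-1,j) + s_I,\;\; C(i,j-1) + s_I \,\big\}, \qquad 1 \le i \le m,\; 1 \le j \le n.
\end{aligned}
$$

The three terms in the maximum correspond to arriving at $(i,j)$ by a D, V, or H move respectively, and the optimal score is $C(m,n)$.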

**Problem 7 (30 points)**

Write a program called **OptimalAlignmentDynamicProgramming** that takes as input:

- **string1** = input string of some positive length (call this **m**)
- **string2** = input string of some positive length (call this **n**)
- **Ichar** = character to be used as the special insertion character
- **sM** = score for a mismatch
- **sI** = score for an insertion
- **sP** = score for a perfect match 

and computes an optimal alignment (one maximizing the alignment score) **using dynamic programming** (not a brute-force algorithm that tries every possible alignment) and outputs the following, in this order:

- **score** = total score for the alignment
- **astring1** = alignment string corresponding to string1
- **astring2** = alignment string corresponding to string2

Use the following cell for your code. Your code should be completely self-contained.

**Try to make your code as efficient as possible.**
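
On the efficiency point: if only the optimal score is needed, the full $(m+1)\times(n+1)$ table does not have to be stored, because each row depends only on the row before it. The sketch below illustrates that idea under this assumption; the function name `alignment_score_two_rows` is hypothetical, and since it discards the information needed for the traceback it cannot return the alignment strings, so it is not a full solution to Problem 7.

```python
def alignment_score_two_rows(string1, string2, sM, sI, sP):
    """Hypothetical helper: optimal alignment score only, keeping two rows in memory."""
    n = len(string2)
    # Row 0: aligning the empty prefix of string1 costs j insertions.
    prev = [sI * j for j in range(n + 1)]
    for i, a in enumerate(string1, start=1):
        # Column 0: aligning the empty prefix of string2 costs i insertions.
        curr = [sI * i] + [0] * n
        for j, b in enumerate(string2, start=1):
            diag = prev[j - 1] + (sP if a == b else sM)   # D move
            vert = prev[j] + sI                           # V move
            horiz = curr[j - 1] + sI                      # H move
            curr[j] = max(diag, vert, horiz)
        prev = curr
    return prev[n]

print(alignment_score_two_rows("TATGTAA", "ATGTAA", -2, -1, 1))
# 5
```

Memory use is proportional to n rather than to m times n, while the running time is still proportional to m times n.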

In [24]:
# Code cell for Problem 7 - do not delete or modify this line
def OptimalAlignmentDynamicProgramming(string1,string2,Ichar,sM,sI,sP):
    m=len(string1)
    n=len(string2)
    C=np.zeros((m+1,n+1))
    D=[['' for i in range(n+1)] for j in range(m+1)]
    for i in range(1,m+1):
        C[i][0]=sI*i
        D[i][0]='V'
    for j in range(1,n+1):
        C[0][j]=sI*j
        D[0][j]='H'
    
    C[0][0]=0

    for i in range(1,m+1):
        for j in range(1,n+1):
            # Score of a diagonal move: perfect match or mismatch.
            if string1[i-1]==string2[j-1]:
                cost = sP
            else: 
                cost = sM
            diag = C[i-1][j-1] + cost
            horiz = C[i][j-1] + sI
            vert = C[i-1][j] + sI
            # Keep the best move; ties prefer D, then H, then V.
            if diag >= horiz and diag >= vert:
                C[i][j]=diag
                D[i][j]='D'
            elif horiz >= vert:
                C[i][j]=horiz
                D[i][j]='H'
            else:
                C[i][j]=vert
                D[i][j]='V'
    
    path=[(m,n)]
    Path=[]
    cost=C[m,n]
    indx=m
    indy=n
    k=m+n+1
    while k > 1:
        t = path[len(path)-1]
        indx=t[0]
        indy=t[1]
        if D[indx][indy] == 'V':
            path.append((indx-1,indy))
            Path.append('V')
            k -= 1
        elif D[indx][indy] == 'H':
            path.append((indx,indy-1))
            Path.append('H')
            k -= 1
        else:
            path.append((indx-1,indy-1))
            Path.append('D')
            k -= 2
    Path = Path[::-1]
    astring1, astring2 = AlignmentFromPath(string1,string2,Path,Ichar)
    return cost, astring1, astring2
    

In [26]:
# Test cell1 for Problem 7 - do not delete or modify this cell
# Do execute it.

import numpy as np
np.random.seed(14891)
for trial in range(10):
    m=7
    string1="".join(np.random.choice(clist,size=m))
    k=np.random.choice(range(2,5))
    string2=string1
    for i in range(k):
        u=np.random.uniform(0,1)
        if u<.5:
            string2=replace_random_char(string2,["A","C","G","T"])
        elif u<.7:
            string2=insert_random_char(string2,["A","C","G","T"])
        else:
            string2=delete_random_char(string2)

    sM=np.random.choice(range(-3,0))
    sI=sM+1
    sP=np.random.choice(range(1,3))
    bf_score,astring1,astring2=OptimalAlignmentBruteForce(string1,string2,Ichar,sM,sI,sP)
    print(astring1)
    print(astring2)
    dp_score,astring1,astring2=OptimalAlignmentDynamicProgramming(string1,string2,"_",sM,sI,sP)
    print(bf_score-dp_score)

GC_A_GTTC
GCTACGT_C
GC_A_GTTC
GCTACG_TC
0.0
TGA_TT_TG
TGAATTGTG
TG_ATT_TG
TGAATTGTG
0.0
CC_TG_GTC
CCG_GAG__
CCTGGTC
CC_GGAG
0.0
TCGA_C_CT
TCGATCG_T
TCGA_CCT
TCGATCGT
0.0
GGGTTCA
GTGT_CA
GGGTTCA
GTG_TCA
0.0
_TGTACTC__
GTGTACTCGC
_TGTACT__C
GTGTACTCGC
0.0
CTC_ATAA
CTCCATGA
CT_CATAA
CTCCATGA
0.0
GGCCATC
G_CCA_C
GGCCATC
_GCCA_C
0.0
TATGTAA
_ATGTAA
TATGTAA
_ATGTAA
0.0
C_C_ACCTG
CTCC_CC_G
C_CACCT_G
CTC_CC_CG
0.0


In [27]:
# Test cell2 for Problem 7 - do not delete or modify this cell
# Do execute it.
string1="AAGCTATCCATTAATCTCT"
string2="ATGCTGCATCCATATTTCAGTCAGCTCT"
score_tuples=[(-3,-2,1),(-3,-2,2),(-3,-2,3),(-3,-2,4),(-3,-2,5),
              (-4,-2,1),(-4,-2,2),(-4,-2,3),(-4,-2,4),(-4,-2,5),
              (-5,-2,1),(-5,-2,2),(-5,-2,3),(-5,-2,4),(-5,-2,5)]
for st in score_tuples:
    sM,sI,sP=st
    dp_score,astring1,astring2=OptimalAlignmentDynamicProgramming(string1,string2,"_",sM,sI,sP)
    print(dp_score)



-7.0
10.0
27.0
44.0
61.0
-9.0
8.0
25.0
42.0
59.0
-9.0
8.0
25.0
42.0
59.0


In [28]:
# Test cell3 for Problem 7 - do not delete or modify this cell
# Do execute it.
import numpy as np

m=100
string1="".join(np.random.choice(clist,size=m))
k=np.random.choice(range(2,5))
string2=string1
for i in range(5):
    string2=replace_random_char(string2,["A","C","G","T"])
for i in range(10):
    string2=insert_random_char(string2,["A","C","G","T"])
for i in range(5):
    string2=delete_random_char(string2)
sM=-5
sI=-4
sP=3
print(string1)
print(string2)
dp_score,astring1,astring2=OptimalAlignmentDynamicProgramming(string1,string2,"_",sM,sI,sP)
print(astring1)
print(astring2)
print(dp_score)

ACTACCTTCTCTAAAAAGCGTGAGTCCATTCCATGCATGATGACTACGTTCGGCACTTTGCCTGCGTCACCTAGAAACTGCTATCGTGCTAATTGTCATC
ACTACTTGCTAAAAGAGTGAGTCCATCTCCAATAGCTATGTATGACGACGTTCGGCAGTTGCCCTGCGTCACTCTAGAAACTGACTATCGATGCTAATTGTCATC
ACTACCTTCTCTAAAAAGCGTGAGTCCAT_TCC_AT_GC_ATG_ATGACTACGTTCGGCACTTTG_CCTGCGTCAC_CTAGAAACTG_CTATCG_TGCTAATTGTCATC
ACTA_CTT_GCT_AAAAGAGTGAGTCCATCTCCAATAGCTATGTATGACGACGTTCGGCA_GTTGCCCTGCGTCACTCTAGAAACTGACTATCGATGCTAATTGTCATC
204.0


In [30]:
# Test cell4 for Problem 7 - do not delete or modify this cell
# Do execute it.
import numpy as np

m=1000
string1="".join(np.random.choice(clist,size=m))
k=np.random.choice(range(2,5))
string2=string1
for i in range(5):
    string2=replace_random_char(string2,["A","C","G","T"])
for i in range(10):
    string2=insert_random_char(string2,["A","C","G","T"])
for i in range(5):
    string2=delete_random_char(string2)
sM=-5
sI=-4
sP=3
dp_score,astring1,astring2=OptimalAlignmentDynamicProgramming(string1,string2,"_",sM,sI,sP)
print(dp_score)


2893.0


Before you upload your notebook, make sure that

- every cell is executed without any errors
- you save it.