<h1 id="toctitle">Exercise solutions</h1>
<ul id="toc"/>

## Accession names, again

The bulk of the work here will be coming up with patterns to describe the various criteria. Here's a skeleton program that will create a list to hold the accession numbers and loop over them:

In [1]:
import re
accs = ["xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]
for acc in accs:
    print(acc)

xkn59438
yhdck2
eihd39d9
chdsye847
hedle3455
xjhd53e
45da
de37dp


The first criterion is easy - the pattern we are looking for is just the number `5`:

In [2]:
for acc in accs: 
    if re.search("5", acc): 
        print("\t" + acc)

	xkn59438
	hedle3455
	xjhd53e
	45da


Next, accessions that contain `d` or `e`. The easiest way to solve this is probably with alternation:

In [3]:
for acc in accs: 
    if re.search("(d|e)", acc): 
        print("\t" + acc) 

	yhdck2
	eihd39d9
	chdsye847
	hedle3455
	xjhd53e
	45da
	de37dp
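
Because both alternatives here are single characters, a character group should give the same result as the alternation. The following is a minimal sketch of that comparison; the variable names `matches_alternation` and `matches_group` are just for illustration:

```python
import re

accs = ["xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]

# alternation between two single-character patterns
matches_alternation = [acc for acc in accs if re.search("(d|e)", acc)]

# character group containing the same two characters
matches_group = [acc for acc in accs if re.search("[de]", acc)]

# both approaches should pick out the same accessions
print(matches_alternation == matches_group)
```

Alternation is still the tool to reach for when the alternatives are longer than one character.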


For accessions that contain both `d` and `e` in that order we can't use an alternation because we need __both__ letters. We can express it like this: `d`, followed by any character repeated any number of times, followed by `e`:

In [4]:
for acc in accs: 
    if re.search("d.*e", acc): 
        print("\t" + acc) 

	chdsye847
	hedle3455
	xjhd53e
	de37dp


We can use a very similar pattern for the next problem: `d` and `e` with a single letter between them. A dot matches any single character, which covers this case:

In [5]:
for acc in accs: 
    if re.search("(d.e)", acc): 
        print("\t" + acc) 

	hedle3455
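
Note that a dot matches any character, not only letters. If we wanted to insist on a letter between `d` and `e`, one option is a character range. The sketch below compares the two patterns on a few test strings; `ad3e` is a made-up string that is not part of the exercise data, and the range `[a-z]` assumes lower-case accessions:

```python
import re

# "ad3e" is a hypothetical accession added to show the difference
tests = ["hedle3455", "ad3e", "de37dp"]

results = {}
for test in tests:
    any_character = bool(re.search("d.e", test))     # dot: any single character
    letters_only = bool(re.search("d[a-z]e", test))  # range: lower-case letters only
    results[test] = (any_character, letters_only)
    print(test, any_character, letters_only)
```

For the accessions in this exercise both patterns give the same answer, so the simpler one is fine.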


The next one, accessions that contain both `d` and `e` in any order, is surprisingly tricky. If we re-frame it as `d` followed by anything followed by `e` __or__ `e` followed by anything followed by `d`, it becomes a bit clearer:

In [6]:
for acc in accs: 
    if re.search("d.*e", acc) or re.search("e.*d", acc): 
        print("\t" + acc) 

	eihd39d9
	chdsye847
	hedle3455
	xjhd53e
	de37dp
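
The two searches can also be folded into a single pattern by putting the alternation inside the regular expression. This is a sketch of one way to do it, not the only one; the names `either_order`, `two_searches` and `one_search` are for illustration:

```python
import re

accs = ["xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]

# one pattern covering both orders: d ... e  or  e ... d
either_order = re.compile("d.*e|e.*d")

two_searches = [acc for acc in accs if re.search("d.*e", acc) or re.search("e.*d", acc)]
one_search = [acc for acc in accs if either_order.search(acc)]

print(one_search)
print(two_searches == one_search)
```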


To find accessions that start with either `x` or `y`, we need to combine an alternation with a start-of-string anchor:

In [7]:
for acc in accs: 
    if re.search("^(x|y)", acc): 
        print("\t" + acc) 

	xkn59438
	yhdck2
	xjhd53e


We can modify this quite easily to add the requirement that the accession ends with `e`. Watch out for the bit in the middle - it has to match anything, any number of times:

In [8]:
for acc in accs: 
    if re.search("^(x|y).*e$", acc): 
        print("\t" + acc) 

	xjhd53e


To match three or more digits in a row, we need a more specific quantifier (the curly brackets) and a character group which contains all the digits:

In [9]:
for acc in accs: 
    if re.search("[0123456789]{3,}", acc): 
        print("\t" + acc) 

	xkn59438
	chdsye847
	hedle3455


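Alternatively we can use a shortcut: `\d` means any digit. Writing the pattern as a raw string (with an `r` prefix) stops Python from treating the backslash as the start of a string escape:

In [10]:
for acc in accs:
    if re.search(r"\d{3,}", acc):
        print("\t" + acc)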

	xkn59438
	chdsye847
	hedle3455
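
To see which part of each accession the pattern matched, we can ask the match object for its text with `group()`. A small sketch, in which the dict `runs` is only there to collect the results:

```python
import re

accs = ["xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]

runs = {}
for acc in accs:
    match = re.search(r"\d{3,}", acc)
    if match:
        # group() returns the text that the whole pattern matched
        runs[acc] = match.group()
        print(acc, match.group())
```

Because the quantifier is greedy, the match extends over the whole run of digits rather than stopping after three.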


The last one uses a character group and an end-of-string anchor:

In [11]:
for acc in accs: 
    if re.search("d[arp]$", acc): 
        print("\t" + acc) 

	45da
	de37dp


## Double digest

Let's write a pattern for the enzyme AbcI. `N` means any base, so the pattern is:

`A[ATGC]TAAT`

We can use `re.finditer()` to find the start of all the cut sites:

In [12]:
dna = open("long_dna.txt").read().rstrip("\n") 

print("AbcI cuts at:") 
for match in re.finditer("A[ATGC]TAAT", dna): 
    print(match.start()) 

AbcI cuts at:
1140
1625


Be careful though: the enzyme cuts three bases downstream of the start of the match, so we need to add 3 to each position:

In [13]:
dna = open("long_dna.txt").read().rstrip("\n") 

print("AbcI cuts at:") 
for match in re.finditer("A[ATGC]TAAT", dna): 
    print(match.start() + 3) 

AbcI cuts at:
1143
1628
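
It can help to check the offset on a sequence short enough to read by eye. The sequence below is made up for this purpose (it is not taken from `long_dna.txt`) and contains two AbcI sites:

```python
import re

# made-up sequence with AbcI sites starting at positions 4 and 14
dna = "GGGGACTAATCCCCAGTAATTT"

cut_positions = []
for match in re.finditer("A[ATGC]TAAT", dna):
    cut_position = match.start() + 3
    cut_positions.append(cut_position)
    # show the sequence on either side of the cut
    print(cut_position, dna[:cut_position][-3:] + "/" + dna[cut_position:][:3])
```

Slicing the sequence at each cut position shows the last three bases before the cut and the first three after it, which should correspond to the two halves of the motif.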


Once we've got the cut positions, how do we calculate the fragment sizes? Measure the distance from the current cut site to the previous one (or to the start of the sequence):

In [14]:
dna = open("long_dna.txt").read().rstrip("\n") 

last_cut = 0
for match in re.finditer("A[ATGC]TAAT", dna): 
    cut_position = match.start() + 3
    fragment_size = cut_position - last_cut
    print("fragment size is " + str(fragment_size))
    last_cut = cut_position

fragment size is 1143
fragment size is 485


Notice how the current cut position becomes the last cut position for the next iteration. We also have to remember the last fragment, from the last cut to the end:

In [15]:
dna = open("long_dna.txt").read().rstrip("\n") 

last_cut = 0
for match in re.finditer("A[ATGC]TAAT", dna): 
    cut_position = match.start() + 3
    fragment_size = cut_position - last_cut
    print("fragment size is " + str(fragment_size))
    last_cut = cut_position
    
# now the last fragment
fragment_size = len(dna) - last_cut
print("fragment size is " + str(fragment_size))

fragment size is 1143
fragment size is 485
fragment size is 384
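
A quick sanity check on this logic is that the fragment sizes must add up to the length of the sequence. The sketch below runs the same loop on a short made-up sequence (not the contents of `long_dna.txt`) and collects the sizes in a list, here called `fragment_sizes`, so they can be summed:

```python
import re

# made-up sequence with two AbcI sites
dna = "GGGGACTAATCCCCAGTAATTT"

fragment_sizes = []
last_cut = 0
for match in re.finditer("A[ATGC]TAAT", dna):
    cut_position = match.start() + 3
    fragment_sizes.append(cut_position - last_cut)
    last_cut = cut_position

# the last fragment runs from the final cut to the end of the sequence
fragment_sizes.append(len(dna) - last_cut)

print(fragment_sizes)
print(sum(fragment_sizes) == len(dna))
```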


Doing the same for two enzymes is trickier. We need to change our strategy. First, make a big list of all the cut positions for both enzymes:

In [17]:
all_cuts = []
# add cut positions for AbcI 
for match in re.finditer("A[ATGC]TAAT", dna): 
    all_cuts.append(match.start() + 3) 
 
# add cut positions for AbcII 
for match in re.finditer("GC[AG][AT]TG", dna): 
    all_cuts.append(match.start() + 4) 

print(all_cuts)

[1143, 1628, 488, 1577]


These aren't in the right order, so we have to sort them:

In [20]:
all_cuts.sort()
all_cuts

[488, 1143, 1577, 1628]

Now we can go through the list of all cuts with the same logic:

In [22]:
last_cut = 0
for cut_position in all_cuts:
    fragment_size = cut_position - last_cut
    print("fragment size is " + str(fragment_size))
    last_cut = cut_position
    
# now the last fragment
fragment_size = len(dna) - last_cut
print("fragment size is " + str(fragment_size))

fragment size is 488
fragment size is 655
fragment size is 434
fragment size is 51
fragment size is 384
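
An alternative way to express the same calculation is to add the start and end of the sequence to the sorted list of cuts, then take the difference between each neighbouring pair. This is a sketch on a made-up sequence containing sites for both enzymes; the names `boundaries` and `fragment_sizes` are for illustration:

```python
import re

# made-up sequence: two AbcI sites and one AbcII site
dna = "GGGGACTAATCCGCATTGCCAGTAATTT"

all_cuts = []
for match in re.finditer("A[ATGC]TAAT", dna):    # AbcI
    all_cuts.append(match.start() + 3)
for match in re.finditer("GC[AG][AT]TG", dna):   # AbcII
    all_cuts.append(match.start() + 4)

# fragment boundaries: start of sequence, every cut in order, end of sequence
boundaries = [0] + sorted(all_cuts) + [len(dna)]

# pair each boundary with the next one and measure the gap
fragment_sizes = [end - start for start, end in zip(boundaries, boundaries[1:])]
print(fragment_sizes)
```

This avoids the separate step for the last fragment, at the cost of being a little less explicit than the loop.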


## Super bonus exercise

This is going to be complicated, so let's start by taking the bit we already wrote above and turning it into a function. Our function will take a single DNA sequence, a single motif regular expression, and an offset telling us how far downstream of the start of the motif the enzyme cuts, and will return a list of fragment lengths:

In [19]:
import re

def get_fragment_lengths(dna, motif, offset):

    fragment_lengths = []
    
    last_cut = 0
    for match in re.finditer(motif, dna): 
        cut_position = match.start() + offset
        fragment_size = cut_position - last_cut
        fragment_lengths.append(fragment_size)
        last_cut = cut_position

    # now the last fragment
    fragment_size = len(dna) - last_cut
    fragment_lengths.append(fragment_size)
    
    return fragment_lengths

Let's test it by re-running our AbcI example from the previous exercise and making sure we get the same result:

In [20]:
dna = open("long_dna.txt").read().rstrip("\n") 

get_fragment_lengths(dna, "A[ATGC]TAAT", 3)

[1143, 485, 384]

OK. Now to look at the file and split each line into a name and a motif:

In [10]:
for line in open('restriction_enzyme_data.txt'):
    fields = line.rstrip().split(',')
    enzyme_name = fields[0]
    enzyme_motif = fields[1]
    print(enzyme_name, enzyme_motif)

('AatII', 'gacgt/c')
('AbsI', 'cc/tcgagg')
('AccI', 'gt/mkac')
('Acc65I', 'g/gtacc')
('AclI', 'aa/cgtt')
('AfeI', 'agc/gct')
('AflII', 'c/ttaag')
('AflIII', 'a/crygt')
('AgeI', 'a/ccggt')
('AgsI', 'tts/aa')
('AhdI', 'gacnnn/nngtc')
('AleI', 'cacnn/nngtg')
('AluI', 'ag/ct')
('AlwNI', 'cagnnn/ctg')
('AoxI', '/ggcc')
('ApaI', 'gggcc/c')
('ApaBI', 'gcannnnn/tgc')
('ApaLI', 'g/tgcac')
('ApoI', 'r/aatty')
('AscI', 'gg/cgcgcc')
('AseI', 'at/taat')
('Asi256I', 'g/atc')
('AsiSI', 'gcgat/cgc')
('AvaI', 'c/ycgrg')
('AvaII', 'g/gwcc')
('AvrII', 'c/ctagg')
('BaeGI', 'gkgcm/c')
('BamHI', 'g/gatcc')
('BanI', 'g/gyrcc')
('BanII', 'grgcy/c')
('BclI', 't/gatca')
('BfaI', 'c/tag')
('BglI', 'gccnnnn/nggc')
('BglII', 'a/gatct')
('BisI', 'gc/ngc')
('BlpI', 'gc/tnagc')
('BlsI', 'gcn/gc')
('BmtI', 'gctag/c')
('BsaAI', 'yac/gtr')
('BsaBI', 'gatnn/nnatc')
('BsaHI', 'gr/cgyc')
('BsaJI', 'c/cnngg')
('BsaWI', 'w/ccggw')
('BsiEI', 'cgry/cg')
('BsiHKAI', 'gwgcw/c')
('BsiWI', 'c/gtacg')
('BslI', 'ccnnnnn/nngg')
('Bsp

Looks good, but before we can start searching with a motif we need to do some work. We need to figure out where the forward slash is, use that as the offset, and remove it from the motif string:

In [11]:
for line in open('restriction_enzyme_data.txt'):
    fields = line.rstrip().split(',')
    enzyme_name = fields[0]
    enzyme_motif = fields[1]
    
    offset = enzyme_motif.find('/')
    enzyme_motif = enzyme_motif.replace('/', '')
    print(enzyme_name, enzyme_motif, offset)

('AatII', 'gacgtc', 5)
('AbsI', 'cctcgagg', 2)
('AccI', 'gtmkac', 2)
('Acc65I', 'ggtacc', 1)
('AclI', 'aacgtt', 2)
('AfeI', 'agcgct', 3)
('AflII', 'cttaag', 1)
('AflIII', 'acrygt', 1)
('AgeI', 'accggt', 1)
('AgsI', 'ttsaa', 3)
('AhdI', 'gacnnnnngtc', 6)
('AleI', 'cacnnnngtg', 5)
('AluI', 'agct', 2)
('AlwNI', 'cagnnnctg', 6)
('AoxI', 'ggcc', 0)
('ApaI', 'gggccc', 5)
('ApaBI', 'gcannnnntgc', 8)
('ApaLI', 'gtgcac', 1)
('ApoI', 'raatty', 1)
('AscI', 'ggcgcgcc', 2)
('AseI', 'attaat', 2)
('Asi256I', 'gatc', 1)
('AsiSI', 'gcgatcgc', 5)
('AvaI', 'cycgrg', 1)
('AvaII', 'ggwcc', 1)
('AvrII', 'cctagg', 1)
('BaeGI', 'gkgcmc', 5)
('BamHI', 'ggatcc', 1)
('BanI', 'ggyrcc', 1)
('BanII', 'grgcyc', 5)
('BclI', 'tgatca', 1)
('BfaI', 'ctag', 1)
('BglI', 'gccnnnnnggc', 7)
('BglII', 'agatct', 1)
('BisI', 'gcngc', 2)
('BlpI', 'gctnagc', 2)
('BlsI', 'gcngc', 3)
('BmtI', 'gctagc', 5)
('BsaAI', 'yacgtr', 3)
('BsaBI', 'gatnnnnatc', 5)
('BsaHI', 'grcgyc', 2)
('BsaJI', 'ccnngg', 1)
('BsaWI', 'wccggw', 1)
('BsiEI',
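
The same two steps could be wrapped in a small helper so they can be tested on their own. The function name `split_site` is hypothetical, and raising an error when no slash is present is an assumption made for this sketch; the code above does not handle that case:

```python
def split_site(site):
    """Return (motif, offset) for a recognition site written like 'gacgt/c'."""
    offset = site.find('/')
    if offset == -1:
        # str.find() returns -1 when there is no slash; treating that as an
        # error is an assumption made for this sketch
        raise ValueError("no cut position marked in " + site)
    return site.replace('/', ''), offset

# three sites taken from the output above
print(split_site("gacgt/c"))
print(split_site("/ggcc"))
print(split_site("gccnnnn/nggc"))
```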

We also need to turn all the ambiguity codes into regular expression character groups. We could start doing it with multiple calls to replace:

In [14]:
motif = 'ccannnnrnnwtgg'

# n means any base
motif = motif.replace('n', '[atgc]')

# r means either a or g
motif = motif.replace('r', '[ag]')

# etc. etc.

but this will quickly get messy; instead let's build a dict of ambiguity codes and replacements, and iterate over it to turn a motif into a regular expression:

In [18]:
code2regex = {
    'k' : '[gt]',
    'm' : '[ac]',
    'r' : '[ag]',
    'y' : '[ct]',
    's' : '[cg]',
    'w' : '[at]',
    
    'b' : '[cgt]',
    'v' : '[acg]',
    'h' : '[act]',
    'd' : '[agt]',
    
    'n' : '[atgc]'
}

for line in open('restriction_enzyme_data.txt'):
    fields = line.rstrip().split(',')
    enzyme_name = fields[0]
    enzyme_motif = fields[1]
    
    # calculate offset
    offset = enzyme_motif.find('/')
    enzyme_motif = enzyme_motif.replace('/', '')
    
    # turn motif into a regex
    for ambiguity_code, replacement in code2regex.items():
        enzyme_motif = enzyme_motif.replace(ambiguity_code, replacement)
    
    print(enzyme_name, enzyme_motif, offset)

('AatII', 'gacgtc', 5)
('AbsI', 'cctcgagg', 2)
('AccI', 'gt[ac][gt]ac', 2)
('Acc65I', 'ggtacc', 1)
('AclI', 'aacgtt', 2)
('AfeI', 'agcgct', 3)
('AflII', 'cttaag', 1)
('AflIII', 'ac[ag][ct]gt', 1)
('AgeI', 'accggt', 1)
('AgsI', 'tt[cg]aa', 3)
('AhdI', 'gac[atgc][atgc][atgc][atgc][atgc]gtc', 6)
('AleI', 'cac[atgc][atgc][atgc][atgc]gtg', 5)
('AluI', 'agct', 2)
('AlwNI', 'cag[atgc][atgc][atgc]ctg', 6)
('AoxI', 'ggcc', 0)
('ApaI', 'gggccc', 5)
('ApaBI', 'gca[atgc][atgc][atgc][atgc][atgc]tgc', 8)
('ApaLI', 'gtgcac', 1)
('ApoI', '[ag]aatt[ct]', 1)
('AscI', 'ggcgcgcc', 2)
('AseI', 'attaat', 2)
('Asi256I', 'gatc', 1)
('AsiSI', 'gcgatcgc', 5)
('AvaI', 'c[ct]cg[ag]g', 1)
('AvaII', 'gg[at]cc', 1)
('AvrII', 'cctagg', 1)
('BaeGI', 'g[gt]gc[ac]c', 5)
('BamHI', 'ggatcc', 1)
('BanI', 'gg[ct][ag]cc', 1)
('BanII', 'g[ag]gc[ct]c', 5)
('BclI', 'tgatca', 1)
('BfaI', 'ctag', 1)
('BglI', 'gcc[atgc][atgc][atgc][atgc][atgc]ggc', 7)
('BglII', 'agatct', 1)
('BisI', 'gc[atgc]gc', 2)
('BlpI', 'gct[atgc]agc', 2)
('B
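
Another way to do the conversion is to walk along the motif once and look up each character, leaving unambiguous bases unchanged. This is a sketch rather than a replacement for the loop above; the function name `motif_to_regex` is hypothetical, and the dict uses the standard IUPAC nucleotide codes:

```python
# IUPAC ambiguity codes mapped to regular expression character groups
code2regex = {
    'k': '[gt]', 'm': '[ac]', 'r': '[ag]', 'y': '[ct]', 's': '[cg]', 'w': '[at]',
    'b': '[cgt]', 'v': '[acg]', 'h': '[act]', 'd': '[agt]',
    'n': '[atgc]',
}

def motif_to_regex(motif):
    # look up each character; plain bases are not in the dict and pass through
    return "".join(code2regex.get(base, base) for base in motif)

print(motif_to_regex("gtmkac"))
print(motif_to_regex("raatty"))
print(motif_to_regex("gacgtc"))
```

Because each character is handled exactly once, there is no chance of a later replacement touching the output of an earlier one.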

Now we can plug the regular expression and offset into our function and produce the output. Warning: this will produce a lot of output if we run it on the whole chromosome (which might cause IDLE to crash), so here we're just looking at the first 10000 bases.

In [6]:
dna = open('ce1.txt').read().lower()[:10000]

code2regex = {
    'k' : '[gt]',
    'm' : '[ac]',
    'r' : '[ag]',
    'y' : '[ct]',
    's' : '[cg]',
    'w' : '[at]',
    
    'v' : '[acg]',
    'b' : '[cgt]',
    'h' : '[act]',
    'd' : '[agt]',
    
    'n' : '[atgc]'
}

for line in open('restriction_enzyme_data.txt'):
    fields = line.rstrip().split(',')
    enzyme_name = fields[0]
    enzyme_motif = fields[1]
    
    # calculate offset
    offset = enzyme_motif.find('/')
    enzyme_motif = enzyme_motif.replace('/', '')
    
    # turn motif into a regex
    for ambiguity_code, replacement in code2regex.items():
        enzyme_motif = enzyme_motif.replace(ambiguity_code, replacement)
        
    # get fragment lengths
    lengths = get_fragment_lengths(dna, enzyme_motif, offset)
    
    print(enzyme_name, enzyme_motif, offset, lengths)

('AatII', 'gacgtc', 5, [3840, 2472, 3643, 45])
('AbsI', 'cctcgagg', 2, [2707, 7293])
('AccI', 'gt[ac][gt]ac', 2, [1967, 1910, 6123])
('Acc65I', 'ggtacc', 1, [535, 9465])
('AclI', 'aacgtt', 2, [10000])
('AfeI', 'agcgct', 3, [9923, 77])
('AflII', 'cttaag', 1, [10000])
('AflIII', 'ac[ag][ct]gt', 1, [1831, 144, 318, 2457, 239, 1372, 427, 2731, 481])
('AgeI', 'accggt', 1, [3990, 77, 5933])
('AgsI', 'tt[cg]aa', 3, [470, 23, 512, 25, 151, 253, 101, 304, 561, 245, 244, 495, 160, 41, 386, 175, 12, 156, 359, 28, 334, 24, 21, 76, 79, 168, 113, 63, 48, 17, 47, 43, 154, 113, 83, 99, 256, 31, 102, 98, 92, 149, 66, 14, 172, 162, 56, 313, 28, 126, 483, 712, 33, 66, 175, 359, 207, 117])
('AhdI', 'gac[atgc][atgc][atgc][atgc][atgc]gtc', 6, [8298, 1702])
('AleI', 'cac[atgc][atgc][atgc][atgc]gtg', 5, [10000])
('AluI', 'agct', 2, [517, 354, 57, 117, 236, 107, 422, 499, 552, 614, 939, 3628, 273, 74, 36, 139, 263, 1173])
('AlwNI', 'cag[atgc][atgc][atgc]ctg', 6, [9423, 577])
('AoxI', 'ggcc', 0, [583, 1265, 206
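
One way to keep the output manageable is to print a short summary for each enzyme, such as the number of fragments and the longest one, instead of every length. The sketch below does this on made-up stand-ins: a short `dna` string in place of `ce1.txt`, and a list `enzyme_lines` of three lines in the same format as `restriction_enzyme_data.txt`. The `summary` dict is only there to collect the results:

```python
import re

# made-up stand-ins for ce1.txt and restriction_enzyme_data.txt
dna = "ccagctaaggatccttagctgg"
enzyme_lines = ["AatII,gacgt/c", "AluI,ag/ct", "BamHI,g/gatcc"]

code2regex = {
    'k': '[gt]', 'm': '[ac]', 'r': '[ag]', 'y': '[ct]', 's': '[cg]', 'w': '[at]',
    'b': '[cgt]', 'v': '[acg]', 'h': '[act]', 'd': '[agt]',
    'n': '[atgc]',
}

def get_fragment_lengths(dna, motif, offset):
    fragment_lengths = []
    last_cut = 0
    for match in re.finditer(motif, dna):
        cut_position = match.start() + offset
        fragment_lengths.append(cut_position - last_cut)
        last_cut = cut_position
    # the last fragment, from the final cut to the end
    fragment_lengths.append(len(dna) - last_cut)
    return fragment_lengths

summary = {}
for line in enzyme_lines:
    enzyme_name, enzyme_motif = line.rstrip().split(',')

    # calculate offset and remove the slash
    offset = enzyme_motif.find('/')
    enzyme_motif = enzyme_motif.replace('/', '')

    # turn motif into a regex
    for ambiguity_code, replacement in code2regex.items():
        enzyme_motif = enzyme_motif.replace(ambiguity_code, replacement)

    lengths = get_fragment_lengths(dna, enzyme_motif, offset)
    summary[enzyme_name] = (len(lengths), max(lengths))
    print(enzyme_name, len(lengths), "fragments, longest", max(lengths))
```

An enzyme with no sites in the sequence shows up as a single fragment as long as the sequence itself.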
