<h1 id="toctitle">Lists and loops exercise solutions</h1>
<ul id="toc"/>

## Processing DNA in a file

First, just read each line and print to the screen:

In [2]:
file = open("../data files/input.txt") 
for dna in file: 
    print(dna) 

ATTCGATTATAAGCTCGATCGATCGATCGATCGATCGATCGATCGATCGAT

ATTCGATTATAAGCACTGATCGATCGATCGATCGATCGATGCTATCGTCGT

ATTCGATTATAAGCATCGATCACGATCTATCGTACATTCGATTATAAGCGT

ATTCGATTATAAGCACTATCGATGATCTAGCTACGATCGTAGCTGTAGGGT

ATTCGATTATAAGCACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATG


Now we can grab the part of the sequence from the 15th base to the end:

In [6]:
file = open("../data files/input.txt") 
for dna in file:
    trimmed_dna = dna[14:] 
    print(trimmed_dna) 

TCGATCGATCGATCGATCGATCGATCGATCGATCGAT

ACTGATCGATCGATCGATCGATCGATGCTATCGTCGT

ATCGATCACGATCTATCGTACATTCGATTATAAGCGT

ACTATCGATGATCTAGCTACGATCGTAGCTGTAGGGT

ACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATG


Looks good, the printed sequences are definitely shorter!

Now switch to writing to a file:

In [4]:
file = open("../data files/input.txt") 
output = open("trimmed.txt", "w") 
for dna in file: 
    trimmed_dna = dna[14:] 
    output.write(trimmed_dna) 
output.close()

Notice where we open, write, and close - before, during, and after the loop.
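
The same open/write/close pattern can also be written with a `with` block, which closes the files automatically when the block ends. This is a minimal sketch rather than the approach the exercise asks for; the two shortened sequences and the temporary directory are stand-ins so that the block runs on its own.

```python
import os
import tempfile

# stand-in data: two shortened lines in place of the real input.txt
workdir = tempfile.mkdtemp()
input_path = os.path.join(workdir, "input.txt")
output_path = os.path.join(workdir, "trimmed.txt")
with open(input_path, "w") as sample:
    sample.write("ATTCGATTATAAGCTCGATCGAT\nATTCGATTATAAGCACTGATCG\n")

# both files are closed automatically at the end of the with block
with open(input_path) as dna_file, open(output_path, "w") as output:
    for dna in dna_file:
        output.write(dna[14:])

# read the result back to check what was written
with open(output_path) as result:
    trimmed_lines = result.read().splitlines()
print(trimmed_lines)
```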

Add one more statement to print the length of the trimmed sequence to the screen:

In [7]:
# open the input file 
file = open("../data files/input.txt") 
 
# open the output file 
output = open("trimmed.txt", "w") 
 
# go through the input file one line at a time 
for dna in file: 

    # get the substring from the 15th character to the end 
    trimmed_dna = dna[14:]

    # write the trimmed sequence to the output file
    output.write(trimmed_dna)

    # print out the length to the screen
    print("processed sequence with length " + str(len(trimmed_dna))) 
output.close()

processed sequence with length 38
processed sequence with length 38
processed sequence with length 38
processed sequence with length 38
processed sequence with length 37
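
The lengths above are 38 rather than 37 for every line except the last because each line read from the file still ends with a newline character, which `len()` counts; the final line of the file presumably has no trailing newline. A small sketch, using `rstrip("\n")` and an in-memory stand-in for the file, shows the difference:

```python
import io

# in-memory stand-in for a file whose last line has no trailing newline
dna_file = io.StringIO("ATTCGATTATAAGCTCGATCGAT\nATTCGATTATAAGCACTGATCG")

lengths_with_newline = []
lengths_without_newline = []
for dna in dna_file:
    trimmed_dna = dna[14:]
    lengths_with_newline.append(len(trimmed_dna))
    # rstrip("\n") drops the newline, if there is one, before counting
    lengths_without_newline.append(len(trimmed_dna.rstrip("\n")))

print(lengths_with_newline)
print(lengths_without_newline)
```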


## Multiple exons from genomic DNA

There are two files involved here - the DNA and the exon locations. Start with the locations:

In [7]:
exon_locations = open("exons.txt") 
for line in exon_locations: 
    print(line) 

5,58

72,133

190,276

340,398



Use `split()` to turn each line into a list of two elements:

In [9]:
exon_locations = open("exons.txt") 
for line in exon_locations: 
    positions = line.split(',') 
    print(positions) 

['5', '58\n']
['72', '133\n']
['190', '276\n']
['340', '398\n']


To make it easier to work with, let's assign the start and stop to variables:

In [10]:
exon_locations = open("exons.txt") 
for line in exon_locations: 
    positions = line.split(',') 
    start = positions[0] 
    stop = positions[1] 
    print("start is " + start + ", stop is " + stop)

start is 5, stop is 58

start is 72, stop is 133

start is 190, stop is 276

start is 340, stop is 398



Looks good. Next we tackle the DNA part: open and read the sequence, then use the start/stop positions to extract the exon:

In [15]:
genomic_dna = open("genomic_dna2.txt").read() 
exon_locations = open("exons.txt") 
for line in exon_locations: 
    positions = line.split(',') 
    start = positions[0] 
    stop = positions[1] 
    exon = genomic_dna[start:stop] 
    print("exon is: " + exon) 

TypeError: slice indices must be integers or None or have an __index__ method

Problem: when we split a string, the resulting elements of the list are strings. Look at the output from this:

In [12]:
"123,456,789".split(',')

['123', '456', '789']

and notice how the numbers are surrounded by quotes. We need to turn them into numbers with

```python
    start = int(positions[0]) 
    stop = int(positions[1]) 
```

In [16]:
genomic_dna = open("genomic_dna2.txt").read() 
exon_locations = open("exons.txt") 
for line in exon_locations: 
    positions = line.split(',') 
    start = int(positions[0]) 
    stop = int(positions[1])
    exon = genomic_dna[start:stop] 
    print("exon is: " + exon) 

exon is: CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCG
exon is: CGATCGATCGATATCGATCGATATCATCGATGCATCGATCATCGATCGATCGATCGATCGA
exon is: CGATCGATCGATCGTAGCTAGCTAGCTAGATCGATCATCATCGTAGCTAGCTCGACTAGCTACGTACGATCGATGCATCGATCGTA
exon is: CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGTAGCTAGCTACGATCG
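
The second element of each split line still ends in a newline (`'58\n'`), but `int()` ignores leading and trailing whitespace, so the conversion works without stripping it first. The sketch below wraps the parsing step in a hypothetical helper, `parse_exon_line`, which is not part of the original solution:

```python
def parse_exon_line(line):
    # turn a line of the form "start,stop" into a pair of integers
    positions = line.split(',')
    # int() tolerates surrounding whitespace, including a trailing newline
    return int(positions[0]), int(positions[1])

start, stop = parse_exon_line("5,58\n")
print(start, stop)
```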


OK. Next step - do something useful with the exons. We have to concatenate them all to make one long coding sequence. Because we are only dealing with a single exon at a time, we have to do it inside the loop. Here's the easiest way:

In [17]:
genomic_dna = open("genomic_dna2.txt").read() 
exon_locations = open("exons.txt") 

# create a new variable to hold the coding sequence
# at first it is just an empty string
coding_sequence = "" 


for line in exon_locations: 
    positions = line.split(',') 
    start = int(positions[0]) 
    stop = int(positions[1]) 
    exon = genomic_dna[start:stop] 
    
    # take the original coding sequence,
    # add the new exon on to the end, 
    # then store the result back in the coding sequence variable
    coding_sequence = coding_sequence + exon 
    
    
    print("coding sequence is : " + coding_sequence) 

coding sequence is : CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCG
coding sequence is : CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCGCGATCGATCGATATCGATCGATATCATCGATGCATCGATCATCGATCGATCGATCGATCGA
coding sequence is : CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCGCGATCGATCGATATCGATCGATATCATCGATGCATCGATCATCGATCGATCGATCGATCGACGATCGATCGATCGTAGCTAGCTAGCTAGATCGATCATCATCGTAGCTAGCTCGACTAGCTACGTACGATCGATGCATCGATCGTA
coding sequence is : CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCGCGATCGATCGATATCGATCGATATCATCGATGCATCGATCATCGATCGATCGATCGATCGACGATCGATCGATCGTAGCTAGCTAGCTAGATCGATCATCATCGTAGCTAGCTCGACTAGCTACGTACGATCGATGCATCGATCGTACGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGTAGCTAGCTACGATCG


Notice how the coding sequence gets longer each time round the loop as more exons are added to it. 

After the loop has finished we can write the coding sequence to a file. Final version:

In [19]:
# open the genomic dna file and read the contents 
genomic_dna = open("genomic_dna2.txt").read() 
 
# open the exons locations file 
exon_locations = open("exons.txt") 
 
# create a variable to hold the coding sequence 
coding_sequence = "" 
 
# go through each line in the exon locations file 
for line in exon_locations: 

    # split the line using a comma 
    positions = line.split(',') 

    # get the start and stop positions 
    start = int(positions[0]) 
    stop = int(positions[1]) 

    # extract the exon from the genomic dna 
    exon = genomic_dna[start:stop] 

    # append the exon to the end of the current coding sequence 
    coding_sequence = coding_sequence + exon 

# write the coding sequence to an output file 
output = open("coding_sequence.txt", "w") 
output.write(coding_sequence) 
output.close() 
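
An alternative sketch, not the method used above: collect the exons in a list and join them once after the loop. The genomic sequence and exon coordinates here are short made-up values so that the block runs without the data files.

```python
# made-up stand-ins for the contents of genomic_dna2.txt and exons.txt
genomic_dna = "AAACCCGGGTTTAAACCCGGGTTT"
exon_lines = ["0,3\n", "6,9\n", "12,15\n"]

exons = []
for line in exon_lines:
    positions = line.split(',')
    start = int(positions[0])
    stop = int(positions[1])
    exons.append(genomic_dna[start:stop])

# join all the exons with nothing in between
coding_sequence = "".join(exons)
print(coding_sequence)
```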

## Bonus exercise: sliding windows

We can start by defining some variables: a DNA sequence and a window size.

In [29]:
dna = "aacgtcgat"
window_size = 4

Let's get the first few windows manually:

In [30]:
window1 = dna[0:4]
window2 = dna[1:5]
window3 = dna[2:6]
print(window1)
print(window2)
print(window3)

aacg
acgt
cgtc


You can see the pattern here:
- the stop position of the window is always 4 more than the start (or whatever the window size is)
- the start position increases by one each time

So we can use `range()` to generate the list of start positions:

In [31]:
start_positions = list(range(len(dna)))
print(start_positions)

[0, 1, 2, 3, 4, 5, 6, 7, 8]


And now write a loop to get the window for each start position:

In [32]:
for start in range(len(dna)):
    stop = start + window_size
    window = dna[start:stop]
    print(window)

aacg
acgt
cgtc
gtcg
tcga
cgat
gat
at
t


We can see what's going on even more clearly if we print some spaces at the start of the window to make it line up with the original sequence. Use `*` to repeat a string:

In [33]:
print(dna)
for start in range(len(dna)):
    stop = start + window_size
    window = dna[start:stop]
    print((' ' * start) + window)

aacgtcgat
aacg
 acgt
  cgtc
   gtcg
    tcga
     cgat
      gat
       at
        t


Notice that we have some incomplete windows at the end. If we want to avoid this, we need to stop the `range()` while there are still enough bases at the end:

In [34]:
print(dna)
for start in range(len(dna) - window_size + 1):
    stop = start + window_size
    window = dna[start:stop]
    print((' ' * start) + window)

aacgtcgat
aacg
 acgt
  cgtc
   gtcg
    tcga
     cgat
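
The loop can be wrapped up for reuse. The function below is a hypothetical helper (the name `sliding_windows` is not from the exercise) that returns the complete windows as a list instead of printing them:

```python
def sliding_windows(sequence, window_size):
    # return every complete window of window_size characters
    windows = []
    for start in range(len(sequence) - window_size + 1):
        windows.append(sequence[start:start + window_size])
    return windows

print(sliding_windows("aacgtcgat", 4))
```

If the sequence is shorter than the window, the `range()` is empty and the function returns an empty list.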


Now that we have a loop to generate all the windows, it's easy to calculate their AT content. Remember the first line: in Python 2 it makes `/` do true division rather than integer division.

In [41]:
from __future__ import division
for start in range(len(dna) - window_size + 1):
    stop = start + window_size
    window = dna[start:stop]
    at = (window.count('a') + window.count('t')) / len(window)
    print(start, window, at)

(0, 'aacg', 0.5)
(1, 'acgt', 0.5)
(2, 'cgtc', 0.25)
(3, 'gtcg', 0.25)
(4, 'tcga', 0.5)
(5, 'cgat', 0.5)


If we want to include partial windows at the start and end, we just have to adjust the call to `range()` so that it starts with a negative number and ends at the length of the sequence. For the negative start positions we have to bump them up to zero:

In [1]:
from __future__ import division

# define these again so that the cell runs on its own
dna = "aacgtcgat"
window_size = 4

for start in range(1 - window_size, len(dna)):
    stop = start + window_size
    if start < 0:
        start = 0

    window = dna[start:stop]
    at = (window.count('a') + window.count('t')) / len(window)
    print(start,window,at)
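
Here is a self-contained sketch of the partial-window loop that defines `dna` and `window_size` itself, so it does not depend on earlier cells having been run. It uses `max(start, 0)` in place of the `if` statement and stores the results in a list (`results` is a name introduced here for illustration); it assumes Python 3, where `/` always gives a fractional result.

```python
dna = "aacgtcgat"
window_size = 4

results = []
for start in range(1 - window_size, len(dna)):
    stop = start + window_size
    # negative start positions are bumped up to zero
    start = max(start, 0)
    window = dna[start:stop]
    at = (window.count('a') + window.count('t')) / len(window)
    results.append((start, window, at))

for result in results:
    print(result)
```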

## Bonus exercise: alignment columns

The key here is to think about what we want to iterate over. It's tempting to start off by writing something like this to iterate over the bases in the first sequence:

In [2]:
alignment = ['atgctcgatcgctag',
             'aag-tcgctcgct--',
             'atcctc--tcgcggg']

first_sequence = alignment[0]
for base in first_sequence:
    print(base)

a
t
g
c
t
c
g
a
t
c
g
c
t
a
g


but we quickly realize that we can't go any further with this approach as it doesn't give us any way to get the corresponding columns from the other two sequences. If we think about iterating over the positions of the bases, then using those positions to get the bases themselves, it becomes clearer:

In [3]:
for column_number in range(15):
    first_sequence_base = first_sequence[column_number]
    print("column " + str(column_number) + ' : ' + first_sequence_base)

column 0 : a
column 1 : t
column 2 : g
column 3 : c
column 4 : t
column 5 : c
column 6 : g
column 7 : a
column 8 : t
column 9 : c
column 10 : g
column 11 : c
column 12 : t
column 13 : a
column 14 : g


We can add some extra lines to get the same base positions for the other two sequences:

In [5]:
first_sequence = alignment[0]
second_sequence = alignment[1]
third_sequence = alignment[2]

for column_number in range(15):
    first_sequence_base = first_sequence[column_number]
    second_sequence_base = second_sequence[column_number]
    third_sequence_base = third_sequence[column_number]
    
    column_string = first_sequence_base + second_sequence_base + third_sequence_base
    
    print("column " + str(column_number) + ' : ' + column_string)

column 0 : aaa
column 1 : tat
column 2 : ggc
column 3 : c-c
column 4 : ttt
column 5 : ccc
column 6 : gg-
column 7 : ac-
column 8 : ttt
column 9 : ccc
column 10 : ggg
column 11 : ccc
column 12 : ttg
column 13 : a-g
column 14 : g-g


Of course, this is cheating because it will only work if the alignment has exactly three sequences. Better to have another loop that iterates over each sequence in the alignment:

In [8]:
for column_number in range(15):
    column_string = ''
    for sequence in alignment:
        column_string = column_string + sequence[column_number]
    print("column " + str(column_number) + ' : ' + column_string)

column 0 : aaa
column 1 : tat
column 2 : ggc
column 3 : c-c
column 4 : ttt
column 5 : ccc
column 6 : gg-
column 7 : ac-
column 8 : ttt
column 9 : ccc
column 10 : ggg
column 11 : ccc
column 12 : ttg
column 13 : a-g
column 14 : g-g
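
For comparison, Python's built-in `zip()` can produce the columns directly without indexing by position. This is an alternative sketch rather than the solution the exercise is aiming for; it assumes all sequences in the alignment have the same length (`zip()` stops at the shortest one).

```python
alignment = ['atgctcgatcgctag',
             'aag-tcgctcgct--',
             'atcctc--tcgcggg']

# zip(*alignment) yields one tuple of characters per alignment column
columns = []
for column in zip(*alignment):
    columns.append(''.join(column))

for column_number in range(len(columns)):
    print("column " + str(column_number) + ' : ' + columns[column_number])
```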


## Bonus exercise: restriction fragments

Let's start with a very simple case: a DNA sequence that contains two EcoRI motifs:

In [9]:
dna = 'ATCGATGTACGTAGAATTCGCTCGATCGTAGCTAGAATTCGCTGATCGTACGTCAGT'

We know that we can get the location of the first one using `find()`:

In [11]:
dna.find('GAATTC')

13

If we then tell `find()` to start looking at the next position, it will find the next motif:

In [12]:
dna.find('GAATTC', 14)

34

How do we capture this idea in a loop? Whenever we find a restriction site, we need to start looking for the next one at that position plus one:

In [21]:
last_position = 0
for i in range(2):
    this_position = dna.find('GAATTC', last_position + 1) + 1
    print('cut site at ' + str(this_position))
    last_position = this_position

cut site at 14
cut site at 35


Notice how in the code above we're adding one in two different places: once to account for the fact that the enzyme cuts one base after the start of the motif, and once so that we start looking for the next cut site one base after the current one.

This code should work for any number of cut sites, as long as we fill in the correct value in the `range()`. If we're not sure how many cut sites there are, we can put in a big number:

In [22]:
last_position = 0
for i in range(10):
    this_position = dna.find('GAATTC', last_position + 1) + 1
    print('cut site at ' + str(this_position))
    last_position = this_position

cut site at 14
cut site at 35
cut site at 0
cut site at 14
cut site at 35
cut site at 0
cut site at 14
cut site at 35
cut site at 0
cut site at 14


but you can see what happens: once the real cut sites run out, the same ones are reported again and again, along with a spurious cut site at 0. We'll see how to fix this when we talk about conditions.
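
The repetition comes from the way `find()` reports failure: when there is no further match it returns -1, so adding one gives 0 and the search starts again from the beginning of the sequence. A quick check:

```python
dna = 'ATCGATGTACGTAGAATTCGCTCGATCGTAGCTAGAATTCGCTGATCGTACGTCAGT'

# searching after the second cut site finds nothing, so find() returns -1
no_match = dna.find('GAATTC', 36)
print(no_match)

# adding one turns the "not found" value into position 0
print(no_match + 1)
```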

To go from the list of cut positions to the list of fragment lengths, we can subtract the position of the last match from that of the current one, remembering to subtract the position of the final match from the length of the sequence to get the size of the last fragment:

In [24]:
last_position = 0
for i in range(2):
    this_position = dna.find('GAATTC', last_position + 1) + 1
    fragment_length = this_position - last_position
    print(fragment_length)
    last_position = this_position

# now add last fragment
fragment_length = len(dna) - last_position
print(fragment_length)

14
21
22


Note: this code makes a few assumptions - namely that there isn't a motif right at the start of the sequence, and that there are no overlapping motifs. Taking these things into account would complicate things a lot!
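
One way to avoid hard-coding the number of cut sites without using conditions is to ask the sequence how many motifs it contains with `count()` and use that in `range()`. This is a sketch under the same assumptions as above (no motif at the very start of the sequence, no overlapping motifs); the list `fragment_lengths` is introduced here for illustration.

```python
dna = 'ATCGATGTACGTAGAATTCGCTCGATCGTAGCTAGAATTCGCTGATCGTACGTCAGT'
motif = 'GAATTC'

fragment_lengths = []
last_position = 0
# count() gives the number of non-overlapping occurrences of the motif
for i in range(dna.count(motif)):
    this_position = dna.find(motif, last_position + 1) + 1
    fragment_lengths.append(this_position - last_position)
    last_position = this_position

# the last fragment runs from the final cut site to the end of the sequence
fragment_lengths.append(len(dna) - last_position)
print(fragment_lengths)
```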
