<h1 id="toctitle">Conditions exercise solutions</h1>
<ul id="toc"/>

## Processing tabular data

Let's start by just opening the file and splitting each line into its separate fields. Remember the column order (species name, sequence, gene name, expression level):

In [1]:
data = open("data.csv") 
for line in data: 

    # split the line up
    columns = line.rstrip("\n").split(",") 

    # assign the columns to variables 
    species = columns[0] 
    sequence = columns[1] 
    name = columns[2] 
    expression = columns[3] 

    print(name)

kdy647
jdg766
kdy533
hdt739
hdu045
teg436
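
As an aside, the standard library's `csv` module can do the splitting for us. The following is only a sketch: the two rows are made up for illustration (they are not the contents of `data.csv`), and they are supplied as a list of strings instead of an open file so that the example is self-contained.

```python
import csv

# Hypothetical rows in the same column order as data.csv
# (species, sequence, gene name, expression level); not the real file contents.
lines = [
    "Drosophila melanogaster,atatatatat,abc123,15\n",
    "Drosophila simulans,gcgcgcgcgc,xyz789,200\n",
]

names = []
for columns in csv.reader(lines):
    # unpack the four fields in one step
    species, sequence, name, expression = columns
    names.append(name)
    print(name)
```

`csv.reader` removes the line ending and also copes with quoted fields that contain commas, which a plain `split(",")` does not.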


Now add the condition for the first part of the exercise. It's a yes/no decision: if the species name is _Drosophila melanogaster_ or _Drosophila simulans_ then we want to print the gene name. If not, we don't want to do anything.

In [2]:
data = open("data.csv") 
for line in data: 
    columns = line.rstrip("\n").split(",") 
    species = columns[0] 
    sequence = columns[1] 
    name = columns[2] 
    expression = columns[3] 

    if species == "Drosophila melanogaster" or species == "Drosophila simulans": 
        print(name) 

kdy647
jdg766
kdy533


For the next part of the exercise, what do we need to change? Only the condition - everything else is the same:

In [3]:
data = open("data.csv") 
for line in data: 
    columns = line.rstrip("\n").split(",") 
    species = columns[0] 
    sequence = columns[1] 
    name = columns[2] 
    expression = columns[3] 

    if len(sequence) > 90 and len(sequence) < 110: 
        print(name) 

kdy647
teg436
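
Python also lets us chain the two comparisons into a single range test. Here is a small sketch, using made-up lengths that sit on and around the two boundaries, to show that both forms give the same answer:

```python
# Hypothetical sequence lengths, chosen to sit on and around the boundaries.
lengths = [90, 91, 100, 109, 110]

for length in lengths:
    long_form = length > 90 and length < 110   # the condition used above
    chained = 90 < length < 110                # the same test, chained
    print(str(length) + " " + str(long_form) + " " + str(chained))
```

Both forms exclude 90 and 110 themselves, because the comparisons are strict.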


For the AT content bit, we need to re-use code from earlier in the course. 

In [4]:
from __future__ import division

data = open("data.csv") 
for line in data: 
    columns = line.rstrip("\n").split(",") 
    species = columns[0] 
    sequence = columns[1] 
    name = columns[2] 
    expression = columns[3] 
    
    at = ( sequence.count('A') + sequence.count('t') ) / len(sequence)

    if at < 0.5 and expression > 200: 
        print(name) 

kdy647
jdg766
kdy533
hdt739
hdu045
teg436


Hmm, this doesn't look right - too many names. Here's the problem:

In [5]:
'50' > 200

True

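We are accidentally comparing a string and a number. In Python 2 such a comparison does not raise an error, and a string always counts as greater than a number, so `expression > 200` is true for every line (Python 3 raises a `TypeError` instead). The AT calculation above also counts upper case `'A'` where it should count lower case `'a'`, so it only picks up the `t` bases. We need to turn the expression level into a number so that we can compare it correctly, and count `'a'` rather than `'A'`: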

In [15]:
from __future__ import division

data = open("data.csv") 
for line in data: 
    columns = line.rstrip("\n").split(",") 
    species = columns[0] 
    sequence = columns[1] 
    name = columns[2] 
    expression = int(columns[3])
    
    at = ( sequence.count('a') + sequence.count('t') ) / len(sequence)

    if at < 0.5 and expression > 200: 
        print(name) 

teg436


Now it works. 
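
To see the conversion in isolation, here is a minimal sketch that uses made-up expression strings, not the values in `data.csv`:

```python
# A value as it comes out of line.split(","): always a string.
raw_expression = "50"

# Converting first makes the comparison numeric.
expression = int(raw_expression)
print(expression > 200)

# int() tolerates surrounding whitespace, so a stray newline is harmless ...
print(int("250\n") > 200)

# ... but a decimal value such as "12.5" needs float() instead of int().
print(float("12.5") > 200)
```

If the expression column could ever hold decimal values, `float` would be the safer choice; the exercise data appears to use whole numbers, so `int` is enough here.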

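The next part requires a more complex condition. Be careful with the parentheses: `and` binds more tightly than `or`, so without them the species test would only apply to the names that start with `h`: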

In [7]:
data = open("data.csv") 
for line in data: 
    columns = line.rstrip("\n").split(",") 
    species = columns[0] 
    sequence = columns[1] 
    name = columns[2] 
    expression = columns[3] 
    if (name.startswith('k') or name.startswith('h')) and species != "Drosophila melanogaster": 
        print(name) 

kdy533
hdt739
hdu045


We could also write the condition using `not`:

```python
if (name.startswith('k') or name.startswith('h')) and not species == "Drosophila melanogaster": 
```

depending on what you think is easier to read.
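
Another way to shorten the condition is to pass both prefixes to `startswith` as a tuple, which removes the `or` altogether. A sketch, with made-up names and species standing in for the real data:

```python
# Hypothetical (name, species) pairs for illustration only.
records = [
    ("kab001", "Drosophila melanogaster"),
    ("kab002", "Drosophila simulans"),
    ("hcd003", "Drosophila yakuba"),
    ("tef004", "Drosophila yakuba"),
]

selected = []
for name, species in records:
    # startswith accepts a tuple and is True if any one of the prefixes matches
    if name.startswith(("k", "h")) and species != "Drosophila melanogaster":
        selected.append(name)
        print(name)
```

With only one `and` left in the condition, there are no parentheses to get wrong.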

For the final part of the exercise, a single `if` is no longer enough. There are three possible outcomes, so we use `if`, `elif` and `else`:

In [21]:
from __future__ import division

data = open("data.csv") 
for line in data: 
    columns = line.rstrip("\n").split(",") 
    species = columns[0] 
    sequence = columns[1] 
    name = columns[2] 
    expression = int(columns[3])
    
    at = ( sequence.count('a') + sequence.count('t') ) / len(sequence)
    
    if at > 0.65:
        print(name + " has a high AT content (" + str(at) + ")")
    elif at < 0.45:
        print(name + " has a low AT content (" + str(at) + ")")
    else:
        print(name + " has a medium AT content (" + str(at) + ")")

        

kdy647 has a high AT content (0.724770642202)
jdg766 has a medium AT content (0.564102564103)
kdy533 has a medium AT content (0.533333333333)
hdt739 has a low AT content (0.285714285714)
hdu045 has a medium AT content (0.529914529915)
teg436 has a medium AT content (0.459183673469)
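
It is worth checking what happens exactly on the thresholds. The sketch below runs the same three branches over made-up AT values (not taken from the data file) that sit on and around 0.65 and 0.45:

```python
# AT contents chosen to sit on and around the two thresholds (hypothetical values).
values = [0.70, 0.65, 0.50, 0.45, 0.30]

labels = []
for at in values:
    if at > 0.65:
        label = "high"
    elif at < 0.45:
        label = "low"
    else:
        label = "medium"
    labels.append(label)
    # "%.2f" rounds the number to two decimal places for display
    print("%.2f is %s" % (at, label))
```

Because both comparisons are strict, a value of exactly 0.65 or 0.45 falls through to the `else` branch and is reported as medium.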


## Pairwise distance

First let's forget about pairs and just calculate the distance between two sequences. 

In [22]:
seqA = 'ATTGTACGGA'
seqB = 'AATGAACCGA'

We want to look at each position in turn, so we'll use a range:

In [23]:
for position in range(len(seqA)):
    print("looking at position " + str(position))

looking at position 0
looking at position 1
looking at position 2
looking at position 3
looking at position 4
looking at position 5
looking at position 6
looking at position 7
looking at position 8
looking at position 9


At each position, we want to grab the bases from both sequences:

In [24]:
for position in range(len(seqA)):
    baseA = seqA[position]
    baseB = seqB[position]
    print(baseA, baseB)

('A', 'A')
('T', 'A')
('T', 'T')
('G', 'G')
('T', 'A')
('A', 'A')
('C', 'C')
('G', 'C')
('G', 'G')
('A', 'A')


We check if they match:

In [27]:
for position in range(len(seqA)):
    baseA = seqA[position]
    baseB = seqB[position]
    if baseA == baseB:
        print(baseA, baseB, "match!")
    else:
        print(baseA, baseB, "no match!")

('A', 'A', 'match!')
('T', 'A', 'no match!')
('T', 'T', 'match!')
('G', 'G', 'match!')
('T', 'A', 'no match!')
('A', 'A', 'match!')
('C', 'C', 'match!')
('G', 'C', 'no match!')
('G', 'G', 'match!')
('A', 'A', 'match!')


Now we add a running total to keep track of the matches:

In [31]:
matches = 0
for position in range(len(seqA)):
    baseA = seqA[position]
    baseB = seqB[position]
    if baseA == baseB:
        matches = matches + 1
        print(baseA, baseB, "match!")
    else:
        print(baseA, baseB, "no match!")
    print(position, matches)

('A', 'A', 'match!')
(0, 1)
('T', 'A', 'no match!')
(1, 1)
('T', 'T', 'match!')
(2, 2)
('G', 'G', 'match!')
(3, 3)
('T', 'A', 'no match!')
(4, 3)
('A', 'A', 'match!')
(5, 4)
('C', 'C', 'match!')
(6, 5)
('G', 'C', 'no match!')
(7, 5)
('G', 'G', 'match!')
(8, 6)
('A', 'A', 'match!')
(9, 7)


To get the proportion of matching bases, we divide the number of matches by the length of the sequence:

In [34]:
from __future__ import division

matches = 0
for position in range(len(seqA)):
    baseA = seqA[position]
    baseB = seqB[position]
    if baseA == baseB:
        matches = matches + 1
match_proportion = matches/len(seqA)

print(match_proportion)

0.7
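
The same calculation can be written without positions by using `zip`, which hands us the bases from both sequences in step. This is an alternative sketch, not a replacement for the solution above; it divides by `float(len(seqA))` so that it gives the same result without the `__future__` import.

```python
seqA = 'ATTGTACGGA'
seqB = 'AATGAACCGA'

matches = 0
# zip pairs up the first bases, then the second bases, and so on
for baseA, baseB in zip(seqA, seqB):
    if baseA == baseB:
        matches = matches + 1

# float() forces true division on Python 2 as well as Python 3
match_proportion = matches / float(len(seqA))
print(match_proportion)
```

Note that `zip` stops at the end of the shorter sequence, so this version assumes, like the original, that both sequences have the same length.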


To solve the full problem, we need to try every possible pair of sequences as `seqA` and `seqB`.

The easiest way to generate pairs is to use two nested `for` loops:

In [35]:
letters = ['a', 'b', 'c', 'd']

for l1 in letters:
    for l2 in letters:
        print(l1,l2)

('a', 'a')
('a', 'b')
('a', 'c')
('a', 'd')
('b', 'a')
('b', 'b')
('b', 'c')
('b', 'd')
('c', 'a')
('c', 'b')
('c', 'c')
('c', 'd')
('d', 'a')
('d', 'b')
('d', 'c')
('d', 'd')


If we do the same for our sequences:

In [36]:
sequences = ['ATTGTACGGA', 'AATGAACCGA', 'AATGAACCCA', 'AATGGGAATA']
for seqA in sequences:
    for seqB in sequences:
        print(seqA, seqB)

('ATTGTACGGA', 'ATTGTACGGA')
('ATTGTACGGA', 'AATGAACCGA')
('ATTGTACGGA', 'AATGAACCCA')
('ATTGTACGGA', 'AATGGGAATA')
('AATGAACCGA', 'ATTGTACGGA')
('AATGAACCGA', 'AATGAACCGA')
('AATGAACCGA', 'AATGAACCCA')
('AATGAACCGA', 'AATGGGAATA')
('AATGAACCCA', 'ATTGTACGGA')
('AATGAACCCA', 'AATGAACCGA')
('AATGAACCCA', 'AATGAACCCA')
('AATGAACCCA', 'AATGGGAATA')
('AATGGGAATA', 'ATTGTACGGA')
('AATGGGAATA', 'AATGAACCGA')
('AATGGGAATA', 'AATGAACCCA')
('AATGGGAATA', 'AATGGGAATA')


Now we can plug in our distance code:

In [37]:
sequences = ['ATTGTACGGA', 'AATGAACCGA', 'AATGAACCCA', 'AATGGGAATA']
for seqA in sequences:
    for seqB in sequences:
        
        matches = 0
        for position in range(len(seqA)):
            baseA = seqA[position]
            baseB = seqB[position]
            if baseA == baseB:
                matches = matches + 1
        match_proportion = matches/len(seqA)
        
        print(seqA, seqB, match_proportion)

('ATTGTACGGA', 'ATTGTACGGA', 1.0)
('ATTGTACGGA', 'AATGAACCGA', 0.7)
('ATTGTACGGA', 'AATGAACCCA', 0.6)
('ATTGTACGGA', 'AATGGGAATA', 0.4)
('AATGAACCGA', 'ATTGTACGGA', 0.7)
('AATGAACCGA', 'AATGAACCGA', 1.0)
('AATGAACCGA', 'AATGAACCCA', 0.9)
('AATGAACCGA', 'AATGGGAATA', 0.5)
('AATGAACCCA', 'ATTGTACGGA', 0.6)
('AATGAACCCA', 'AATGAACCGA', 0.9)
('AATGAACCCA', 'AATGAACCCA', 1.0)
('AATGAACCCA', 'AATGGGAATA', 0.5)
('AATGGGAATA', 'ATTGTACGGA', 0.4)
('AATGGGAATA', 'AATGAACCGA', 0.5)
('AATGGGAATA', 'AATGAACCCA', 0.5)
('AATGGGAATA', 'AATGGGAATA', 1.0)


There's no point comparing a sequence to itself, so we skip any pair in which the two sequences are the same:

In [38]:
sequences = ['ATTGTACGGA', 'AATGAACCGA', 'AATGAACCCA', 'AATGGGAATA']
for seqA in sequences:
    for seqB in sequences:
        if seqA != seqB:

            matches = 0
            for position in range(len(seqA)):
                baseA = seqA[position]
                baseB = seqB[position]
                if baseA == baseB:
                    matches = matches + 1
            match_proportion = matches/len(seqA)

            print(seqA, seqB, match_proportion)

('ATTGTACGGA', 'AATGAACCGA', 0.7)
('ATTGTACGGA', 'AATGAACCCA', 0.6)
('ATTGTACGGA', 'AATGGGAATA', 0.4)
('AATGAACCGA', 'ATTGTACGGA', 0.7)
('AATGAACCGA', 'AATGAACCCA', 0.9)
('AATGAACCGA', 'AATGGGAATA', 0.5)
('AATGAACCCA', 'ATTGTACGGA', 0.6)
('AATGAACCCA', 'AATGAACCGA', 0.9)
('AATGAACCCA', 'AATGGGAATA', 0.5)
('AATGGGAATA', 'ATTGTACGGA', 0.4)
('AATGGGAATA', 'AATGAACCGA', 0.5)
('AATGGGAATA', 'AATGAACCCA', 0.5)


We can also avoid comparing each pair in both directions. This is more complicated, because we have to switch to using ranges to refer to the index of each element in the list:

In [44]:
letters = ['a', 'b', 'c', 'd']
for index1 in range(len(letters)):
    for index2 in range(index1+1, len(letters)):
        print(index1, index2)

(0, 1)
(0, 2)
(0, 3)
(1, 2)
(1, 3)
(2, 3)


This gives us the positions of the element pairs we want. Now we switch to the elements themselves:

In [45]:
letters = ['a', 'b', 'c', 'd']
for index1 in range(len(letters)):
    for index2 in range(index1+1, len(letters)):
        l1 = letters[index1]
        l2 = letters[index2]
        print(l1, l2)

('a', 'b')
('a', 'c')
('a', 'd')
('b', 'c')
('b', 'd')
('c', 'd')


Plugging the same logic into our sequences code:

In [46]:
sequences = ['ATTGTACGGA', 'AATGAACCGA', 'AATGAACCCA', 'AATGGGAATA']
for index1 in range(len(sequences)):
    for index2 in range(index1+1, len(sequences)):
        seqA = sequences[index1]
        seqB = sequences[index2]
        matches = 0
        for position in range(len(seqA)):
            baseA = seqA[position]
            baseB = seqB[position]
            if baseA == baseB:
                matches = matches + 1
        match_proportion = matches/len(seqA)

        print(seqA, seqB, match_proportion)

('ATTGTACGGA', 'AATGAACCGA', 0.7)
('ATTGTACGGA', 'AATGAACCCA', 0.6)
('ATTGTACGGA', 'AATGGGAATA', 0.4)
('AATGAACCGA', 'AATGAACCCA', 0.9)
('AATGAACCGA', 'AATGGGAATA', 0.5)
('AATGAACCCA', 'AATGGGAATA', 0.5)


We will be able to make this much easier to read once we learn how to write our own functions.
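
As a preview, here is one way it might look. This is a sketch only: the function name `match_proportion` is our own choice, and `itertools.combinations` from the standard library stands in for the two index loops.

```python
from itertools import combinations

def match_proportion(seqA, seqB):
    """Return the proportion of positions at which two equal-length sequences match."""
    matches = 0
    for baseA, baseB in zip(seqA, seqB):
        if baseA == baseB:
            matches = matches + 1
    # float() forces true division on Python 2 as well as Python 3
    return matches / float(len(seqA))

sequences = ['ATTGTACGGA', 'AATGAACCGA', 'AATGAACCCA', 'AATGGGAATA']

results = []
# combinations(sequences, 2) yields each pair once and never pairs an element with itself
for seqA, seqB in combinations(sequences, 2):
    proportion = match_proportion(seqA, seqB)
    results.append(proportion)
    print(seqA + " " + seqB + " " + str(proportion))
```

The pairs come out in the same order as in the index-based version above, so the six proportions should match its output.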

## Kmer counting

Given a DNA sequence:

In [47]:
dna="ATGCATCATG"

and a given kmer, we know how to count the number of times it occurs:

In [48]:
dna.count('AT')

3

So the problem is just: how do we check all the kmers in turn? There are two options. 

One is to try each possible kmer of length k one after another. This is a bad idea, because:

- it will take a very long time when k is big, and most of the counts will be zero
- the code to do this is actually quite difficult
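
To get a feel for the first point, here is a sketch that generates every possible kmer for a small k. It leans on `itertools.product` from the standard library, which hides the difficult part, and it assumes an alphabet of just the four bases A, C, G and T.

```python
from itertools import product

k = 2
# Every possible kmer of length k over the four DNA bases
all_kmers = [''.join(bases) for bases in product('ACGT', repeat=k)]

print(len(all_kmers))
print(all_kmers[:5])

# The number of candidates is 4 ** k, which grows very quickly with k
for size in [2, 5, 10]:
    print(str(size) + " " + str(4 ** size))
```

For k = 10 there are already more than a million candidate kmers, while a sequence of length 10 contains only one window of that size.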

The second option is to just look at the kmers that are in the DNA sequence. It's exactly the same problem as the sliding window example from before:

In [50]:
window_size = 2
for start in range(len(dna) - window_size + 1):
    stop = start + window_size
    window = dna[start:stop]
    print((' ' * start) + window)

AT
 TG
  GC
   CA
    AT
     TC
      CA
       AT
        TG


We can re-use this code to get a list of all kmers in the DNA sequence. We change the names of the variables to better reflect the problem:

In [51]:
dna="ATGCATCATG"
k = 2
kmers = []

for start in range(len(dna) - k + 1):
    stop = start + k
    kmer = dna[start:stop]
    kmers.append(kmer)

print(kmers)

['AT', 'TG', 'GC', 'CA', 'AT', 'TC', 'CA', 'AT', 'TG']


Now we can just loop over the list of kmers, count each one, and print if it occurs more than n times:

In [52]:
dna="ATGCATCATG"
k = 2
n = 2
kmers = []

for start in range(len(dna) - k + 1):
    stop = start + k
    kmer = dna[start:stop]
    kmers.append(kmer)
    
for kmer in kmers:
    if dna.count(kmer) > n:
        print(kmer)

AT
AT
AT


Now the problem is that we get the same kmer multiple times. One solution is to only add a kmer to the list if it's not already in there:

In [1]:
dna="ATGCATCATG"
k = 2
n = 2
kmers = []

for start in range(len(dna) - k + 1):
    stop = start + k
    kmer = dna[start:stop]
    if kmer not in kmers:
        kmers.append(kmer)
    
for kmer in kmers:
    if dna.count(kmer) > n:
        print(kmer)

AT
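
Another possible solution is to count each window as we visit it, using a dictionary keyed by kmer. This is a sketch that assumes dictionaries, which may not have come up yet in the course; the variable names `counts` and `frequent` are our own.

```python
dna = "ATGCATCATG"
k = 2
n = 2

# Count every window once as it is visited, using a dictionary keyed by kmer
counts = {}
for start in range(len(dna) - k + 1):
    kmer = dna[start:start + k]
    counts[kmer] = counts.get(kmer, 0) + 1

frequent = []
for kmer in sorted(counts):
    if counts[kmer] > n:
        frequent.append(kmer)
        print(kmer)
```

This version only walks along the sequence once. It also counts overlapping occurrences, which `dna.count` does not: `'AAAA'.count('AA')` is 2, although the sliding window finds `AA` three times. For the example sequence here the two approaches agree.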

