<h1 id="toctitle">Writing functions exercise solutions</h1>
<ul id="toc"/>

## Amino acid percentage part one

This is not too different from the AT counting code. We can start off with a specific protein sequence and amino acid residue:

In [1]:
from __future__ import division

protein = "MSRSLLLRFLLFLLLLPPLP"
aa = "R"

then write the code to do the calculation:

In [2]:
aa_count = protein.count(aa)
protein_length = len(protein)
percentage = aa_count * 100 / protein_length
print(percentage)

10.0


To turn it into a function, we want the `protein` and `aa` variables to be arguments, and the `percentage` variable to be the return value:

In [3]:
def get_aa_percentage(protein, aa):
    aa_count = protein.count(aa)
    protein_length = len(protein)
    percentage = aa_count * 100 / protein_length
    return percentage

get_aa_percentage( "MSRSLLLRFLLFLLLLPPLP", "R")

10.0

Looks good. We can test it with the assertions:

In [4]:
assert get_aa_percentage("MSRSLLLRFLLFLLLLPPLP", "M") == 5
assert get_aa_percentage("MSRSLLLRFLLFLLLLPPLP", "r") == 10
assert get_aa_percentage("msrslllrfllfllllpplp", "L") == 50
assert get_aa_percentage("MSRSLLLRFLLFLLLLPPLP", "Y") == 0

AssertionError: 

The function gives the wrong answer when the protein sequence and the amino acid residue are in different cases, so one of the assertions fails. Let's modify it to do everything in upper case:

In [5]:
def get_aa_percentage(protein, aa):

    # convert both inputs to upper case
    protein = protein.upper()
    aa = aa.upper()

    aa_count = protein.count(aa)
    protein_length = len(protein)
    percentage = aa_count * 100 / protein_length
    return percentage

and try the assertions again:

In [6]:
assert get_aa_percentage("MSRSLLLRFLLFLLLLPPLP", "M") == 5
assert get_aa_percentage("MSRSLLLRFLLFLLLLPPLP", "r") == 10
assert get_aa_percentage("msrslllrfllfllllpplp", "L") == 50
assert get_aa_percentage("MSRSLLLRFLLFLLLLPPLP", "Y") == 0

Now everything runs without errors. 
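
To see why the first version failed, it can help to look at `count` on its own. The string method is case-sensitive, so a lower-case residue never matches an upper-case sequence. The following is a minimal sketch that reuses the protein from above; the variable names are made up for illustration:

```python
protein = "MSRSLLLRFLLFLLLLPPLP"

# count() is case-sensitive: "r" does not match "R"
lower_count = protein.count("r")
upper_count = protein.count("R")

# converting both strings to upper case removes the mismatch
normalised_count = protein.upper().count("r".upper())

print(lower_count)
print(upper_count)
print(normalised_count)
```

The first count is zero, which is why the assertion with `"r"` failed before the inputs were converted to upper case.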

## Amino acid percentage part two

The new thing here is that we need to write a function that takes a list as one of its arguments. Let's take the same approach as for part one - pick a specific input and write the code to solve it, then turn it into a function. 

We will count the number of each amino acid in the list in turn and use a `total` variable to keep a running total of the count:

In [12]:
protein = "MSRSLLLRFLLFLLLLPPLP"
aa_list = ['M', 'L', 'F']

# the total variable will hold the total number of matching residues 
total = 0 
for aa in aa_list: 
    print("counting number of " + aa)
    aa = aa.upper() 
    aa_count = protein.count(aa) 

    # add the number for this residue to the total count
    total = total + aa_count 
    print("running total is " + str(total))

percentage = total * 100 / len(protein)
print("final percentage is " + str(percentage))

counting number of M
running total is 1
counting number of L
running total is 11
counting number of F
running total is 13
final percentage is 65.0


This looks OK, time to turn it into a function:

In [8]:
def get_aa_percentage(protein, aa_list): 
    protein = protein.upper() 
    protein_length = len(protein) 
    total = 0 
    for aa in aa_list: 
        aa = aa.upper() 
        aa_count = protein.count(aa) 
        total = total + aa_count 
    percentage = total * 100 / protein_length 
    return percentage 

get_aa_percentage("MSRSLLLRFLLFLLLLPPLP", ['M', 'L', 'F'])

65.0

Now to test with the assertions:

In [10]:
assert get_aa_percentage("MSRSLLLRFLLFLLLLPPLP", ["M"]) == 5
assert get_aa_percentage("MSRSLLLRFLLFLLLLPPLP", ['F', 'S', 'L']) == 70
assert get_aa_percentage("MSRSLLLRFLLFLLLLPPLP") == 65

TypeError: get_aa_percentage() takes exactly 2 arguments (1 given)

The first two assertions pass, but the last one raises an error because we have not yet given `aa_list` a default value. All we need to change is the definition line:

In [5]:
def get_aa_percentage(protein, aa_list=['A','I','L','M','F','W','Y','V']): 
    protein = protein.upper() 
    protein_length = len(protein) 
    total = 0 
    for aa in aa_list: 
        aa = aa.upper() 
        aa_count = protein.count(aa) 
        total = total + aa_count 
    percentage = total * 100 / protein_length 
    return percentage 

and now the tests all pass.
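
One possible variation, offered only as a sketch rather than as the intended solution: Python evaluates default values once, when the function is defined, so a list default is shared between calls. That is harmless here because the function never modifies `aa_list`, but a tuple default avoids the question entirely. The names `DEFAULT_RESIDUES` and `get_aa_percentage_alt` are hypothetical, and the loop is condensed with `sum`:

```python
# hypothetical names, not part of the original solution
DEFAULT_RESIDUES = ('A', 'I', 'L', 'M', 'F', 'W', 'Y', 'V')

def get_aa_percentage_alt(protein, aa_list=DEFAULT_RESIDUES):
    protein = protein.upper()
    # add up the count of each residue in the list
    total = sum(protein.count(aa.upper()) for aa in aa_list)
    # 100.0 keeps the division fractional without the __future__ import
    return 100.0 * total / len(protein)

assert get_aa_percentage_alt("MSRSLLLRFLLFLLLLPPLP", ["M"]) == 5
assert get_aa_percentage_alt("MSRSLLLRFLLFLLLLPPLP", ['F', 'S', 'L']) == 70
assert get_aa_percentage_alt("MSRSLLLRFLLFLLLLPPLP") == 65
```

The behaviour matches the version above for the three assertions; the only differences are the immutable default and the shorter loop.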

## Base counter

This time we will jump straight to writing the definition. There are a couple of different ways to measure the number of undetermined bases. One is to add up the number of A, T, G and C then subtract the total from the length of the sequence:

In [21]:
def count_undetermined(dna):
    total_good_bases = 0
    for base in ['a', 't', 'g', 'c']:
        total_good_bases = total_good_bases + dna.count(base)
    return len(dna) - total_good_bases

count_undetermined('atucgtgractanctgactg')

3

Another approach is to look at each base in the DNA sequence individually and check if it's undetermined:

In [22]:
def count_undetermined(dna):
    total_undetermined = 0
    for base in dna:
        if base not in ['a', 't', 'g', 'c']:
            total_undetermined = total_undetermined + 1
    return total_undetermined

count_undetermined('atucgtgractanctgactg')

3
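
Because both versions share a name, the second definition replaces the first when the cells are run in order. To check that the two approaches agree, we can give them different names and compare their results on a few inputs. This is a sketch; the function names and the extra test sequences are made up for illustration:

```python
def count_by_subtraction(dna):
    # count the determined bases, then subtract from the length
    total_good_bases = 0
    for base in ['a', 't', 'g', 'c']:
        total_good_bases = total_good_bases + dna.count(base)
    return len(dna) - total_good_bases

def count_by_scanning(dna):
    # look at each base in turn and count the undetermined ones
    total_undetermined = 0
    for base in dna:
        if base not in ['a', 't', 'g', 'c']:
            total_undetermined = total_undetermined + 1
    return total_undetermined

# hypothetical test sequences, including an empty one
test_sequences = ['atucgtgractanctgactg', 'atgc', 'nnnn', '']
for dna in test_sequences:
    assert count_by_subtraction(dna) == count_by_scanning(dna)
```

Note that both versions only recognise lower-case bases, so an upper-case sequence would be counted as entirely undetermined. Calling `dna.lower()` first would be one way to handle that, if mixed-case input is expected.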

Let's go with the second way. We need to convert the number to a proportion...

In [24]:
def count_undetermined(dna):
    total_undetermined = 0
    for base in dna:
        if base not in ['a', 't', 'g', 'c']:
            total_undetermined = total_undetermined + 1
    prop_undetermined = total_undetermined / len(dna)
    return prop_undetermined

count_undetermined('atucgtgractanctgactg')

0.15

... add a threshold argument ...

In [27]:
def check_undetermined(dna, threshold = 0.1):
    total_undetermined = 0
    for base in dna:
        if base not in ['a', 't', 'g', 'c']:
            total_undetermined = total_undetermined + 1
    prop_undetermined = total_undetermined / len(dna)
    if prop_undetermined > threshold:
        print("sequence has a high proportion of undetermined bases")
    else:
        print("sequence has a low proportion of undetermined bases")
        
check_undetermined('atucgtgractanctgactg', 0.1)
check_undetermined('atucgtgractanctgactg', 0.2)

sequence has a high proportion of undetermined bases
sequence has a low proportion of undetermined bases


... and switch to returning True or False:

In [29]:
def check_undetermined(dna, threshold = 0.1):
    total_undetermined = 0
    for base in dna:
        if base not in ['a', 't', 'g', 'c']:
            total_undetermined = total_undetermined + 1
    prop_undetermined = total_undetermined / len(dna)
    if prop_undetermined > threshold:
        return True
    else:
        return False
        
print(check_undetermined('atucgtgractanctgactg', 0.1))
print(check_undetermined('atucgtgractanctgactg', 0.2))

True
False


When we are checking a condition and returning True/False like this, we can use a shortcut: the comparison itself already evaluates to True or False, so we can return it directly:

In [31]:
def check_undetermined(dna, threshold = 0.1):
    total_undetermined = 0
    for base in dna:
        if base not in ['a', 't', 'g', 'c']:
            total_undetermined = total_undetermined + 1
    prop_undetermined = total_undetermined / len(dna)
    return prop_undetermined > threshold
        
print(check_undetermined('atucgtgractanctgactg', 0.1))
print(check_undetermined('atucgtgractanctgactg', 0.2))

True
False
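
A function that returns True or False is convenient to use inside an `if` statement. As a rough sketch of how it might be applied, the code below repeats the final definition and uses it to pick out the sequences that exceed the default threshold. The list of sequences is made up for illustration, and `float()` is used so that the block does not rely on the `__future__` import from the first cell:

```python
def check_undetermined(dna, threshold=0.1):
    total_undetermined = 0
    for base in dna:
        if base not in ['a', 't', 'g', 'c']:
            total_undetermined = total_undetermined + 1
    # float() keeps the division fractional on Python 2 as well
    prop_undetermined = float(total_undetermined) / len(dna)
    return prop_undetermined > threshold

# hypothetical example sequences
sequences = ['atucgtgractanctgactg', 'atgcatgcat', 'atgcnatgca']

# collect the sequences that exceed the default threshold
flagged = []
for dna in sequences:
    if check_undetermined(dna):
        flagged.append(dna)

print(flagged)
```

The last sequence has exactly one undetermined base in ten, a proportion of 0.1. Because the comparison is a strict `>`, a sequence that sits exactly on the threshold is not flagged.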

