<h1 id="toctitle">Writing functions</h1>
<ul id="toc"/>

## Writing a simple function

Look at the AT content calculation code from the start of the course:

In [1]:
my_dna = "ACTGATACATATATATCGATGCGTTCAT"
length = len(my_dna)
a_count = my_dna.count('A')
t_count = my_dna.count('T')
at_content = (a_count + t_count) / length
print("AT content is " + str(at_content))

AT content is 0.6785714285714286


There's one line to define the input:

```python
my_dna = "ACTGATACATATATATCGATGCGTTCAT"
```

and one line to define the output:

```python
print("AT content is " + str(at_content))
```

and four lines to do the actual calculation:

```python
length = len(my_dna)
a_count = my_dna.count('A')
t_count = my_dna.count('T')
at_content = (a_count + t_count) / length
```

We can turn this part of the code into a function that we can call:

In [6]:
# define a new function
# def is short for "define"; get_at_content is the name of the function
# dna is the argument: a variable (the name is arbitrary) that only exists inside the function
def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content  # not every function has to return a value, but returning the result is usually best

def get_gc_content(same_dna):
    # a function can call another function; this argument could also be named dna without clashing
    return 1 - get_at_content(same_dna)

# call the function
at = get_at_content("ACTGATACATATATATCGATGCGTTCAT")
print(at)

0.6785714285714286


Things to note about the definition line:

- the definition line starts with `def`
- next comes the name of our new function (arbitrary, as with variable names)
- the argument names go in parentheses (also arbitrary)
- the definition line ends with a colon
- the body is indented
- the argument names become variables inside the body
- the body ends with `return` to hand back the result

__Defining__ a function doesn't cause it to run; that only happens when we __call__ it. 

Argument variables (`dna`) and local variables (e.g. `a_count`) only exist inside the function.

When we write a function we don't have to worry about how it will be used - we just need to know the __inputs__ (arguments) and __output__.

When we use a function we don't have to worry about how it works inside - we just need to know the __inputs__ and __output__.
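
A small sketch of the two points above, reusing the `get_at_content` function from this section: defining it runs nothing, and its local variable `a_count` cannot be seen from outside. The `leaked` flag is only there for illustration.

```python
def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    return (a_count + t_count) / length

# Nothing has been calculated yet; the work happens when we call the function
at = get_at_content("ATGC")
print(at)

# a_count only existed while the function was running
try:
    print(a_count)
    leaked = True
except NameError:
    leaked = False
    print("a_count is not defined outside the function")
```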

### Calling our new function

Once we've written our function, we can use it in lots of different ways:


Calculate the AT content of a sequence in a file...
```python
with open('dna.txt') as dna_file:  # the file is closed automatically
    dna = dna_file.read()
at = get_at_content(dna.rstrip('\n'))
```

Print the AT content for a given sequence without storing it to a variable...
```python
print(get_at_content("ATTAGCGTAGC"))
```

Write the AT content for a sequence directly to a file...
```python
with open('output.txt', 'w') as result:  # closing the file makes sure the text is written
    result.write(str(get_at_content('ACTGTCGA')) + "\n")
```

Use our function in another function...
```python
def get_gc_content(dna):
    return 1.0 - get_at_content(dna)
```

Separating code into neat building blocks is very valuable; it is called __encapsulation__. Without encapsulation, you'll be writing "spaghetti code". 🍝


## Things to avoid when writing functions
### No input

We could write a function that relies on variables defined outside it rather than arguments:

In [7]:
dna = "ATCGCTAGCTGC"
def get_at_content(): #BAD
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content

at = get_at_content()
print(at)

0.4166666666666667


This breaks encapsulation - now we have to know what variables are set in order to write the function. 
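
To see why this matters, here is a minimal sketch in which the same call gives two different answers because a variable outside the function changed. The name `get_at_content_from_global` is made up for this illustration.

```python
dna = "ATCGCTAGCTGC"

def get_at_content_from_global():  # BAD: silently depends on the outside variable dna
    return (dna.count('A') + dna.count('T')) / len(dna)

before = get_at_content_from_global()

dna = "AAAA"  # some other part of the program reuses the name
after = get_at_content_from_global()

# same call, different answers
print(before, after)
```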

### No output

We could write a function that prints the value instead of returning it:

In [8]:
def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    print(at_content)

get_at_content("ATGCGTATTTGAGCA") #BAD

0.6


but this also breaks encapsulation - we have to know how the function will be used in order to write it. 

A good rule of thumb: _information gets in by arguments, information gets out by return value_.
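
A short sketch of what goes wrong when a function prints instead of returning. The two function names are hypothetical and only used to put the versions side by side.

```python
def at_content_returned(dna):
    return (dna.count('A') + dna.count('T')) / len(dna)

def at_content_printed(dna):  # BAD: the value is shown but not handed back
    print((dna.count('A') + dna.count('T')) / len(dna))

returned = at_content_returned("ATGC")
printed = at_content_printed("ATGC")  # the number appears on screen, but printed is None

gc = 1 - returned  # works: we got a number back
# 1 - printed would raise a TypeError, because None is not a number
```

A function that returns its result can still be printed by the caller, so nothing is lost by returning.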

## Pure functions

The rule of thumb above has a more formal definition. A **pure** function is a function which:
- Always returns the same result given the same arguments
- Has no side-effects

We've already seen some **pure** functions, like `len()` and `str()`.

And some **impure** ones, like `print()` (its side effect is making text appear) and `open()` (its result depends on external factors, such as which files exist).

Many functions cannot be pure, but in most cases if you can make a pure function then do so.
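
As an illustration, here is a pure function next to an impure variant of it. The names `at_fraction`, `at_fraction_logged` and `call_log` are invented for this sketch.

```python
def at_fraction(dna):
    # pure: the result depends only on the argument, and nothing else changes
    return (dna.count('A') + dna.count('T')) / len(dna)

call_log = []  # state that lives outside the functions

def at_fraction_logged(dna):
    # impure: same return value, but every call also changes call_log and prints
    call_log.append(dna)
    print("calls so far:", len(call_log))
    return at_fraction(dna)

pure_first = at_fraction("ATGC")
pure_second = at_fraction("ATGC")
logged_first = at_fraction_logged("ATGC")
logged_second = at_fraction_logged("ATGC")
```

Calling `at_fraction` twice leaves the program exactly as it was; calling `at_fraction_logged` twice leaves two entries in `call_log` and two lines on the screen.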

## Improving our function
### Adding another argument

One problem currently is that we get too many decimal places:

In [9]:
def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content

get_at_content("ATGCGTATTTTTGAGCA")

0.6470588235294118

We can call the built-in `round()` function on the answer before we return it:

In [10]:
round(1.23456789, 5)  # rounds to 5 decimal places

1.23457

In [11]:
def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, 2)
                 
get_at_content("ATGCGTATTTTTGAGCA")

0.65

What if we want more/fewer decimal places? Make the argument to `round()` an argument of our function:

In [12]:
def get_at_content(dna, decimal_places):  # a second argument controls the number of decimal places
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, decimal_places)

get_at_content("ATGCGTATTTTTGAGCA", 2)

0.65

In [13]:
get_at_content("ATGCGTATTTTTGAGCA", 4)

0.6471

### Default values for function arguments

In many cases, we don't really care about picking a number of decimal places. We can add a default to the definition:

In [2]:
def get_at_content(dna, decimal_places=2):  # defaults to 2, but can be overridden in the call
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, decimal_places)

In [3]:
get_at_content("ATGCGTATTTTTGAGCA", 4)

0.6471

In [4]:
get_at_content("ATGCGTATTTTTGAGCA")  # no value given for decimal_places, so the default is used

0.65

### Keyword arguments

In all the examples above, we supply the arguments in the same order as the definition. If we want to use a different order (or just be more explicit) we can use the argument names explicitly when calling the function:

In [5]:
get_at_content(decimal_places=3, dna="ATGCGTATTTTTGAGCA")  # named arguments can go in any order, which helps keep track of what each value means

0.647
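
Positional and keyword arguments can also be mixed, as long as the positional ones come first. A brief sketch using the same function:

```python
def get_at_content(dna, decimal_places=2):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, decimal_places)

# three equivalent calls
a = get_at_content("ATGCGTATTTTTGAGCA", 3)
b = get_at_content("ATGCGTATTTTTGAGCA", decimal_places=3)
c = get_at_content(decimal_places=3, dna="ATGCGTATTTTTGAGCA")

# get_at_content(decimal_places=3, "ATGCGTATTTTTGAGCA") is a SyntaxError:
# a positional argument cannot follow a keyword argument
```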

## Testing functions

When we're working on a new function, we might want to test if it's working correctly. Use `assert` with a condition:

In [8]:
assert get_at_content("A") == 1
assert get_at_content("G") == 0
assert get_at_content("ATGC") == 0.5
assert get_at_content("AGG") == 0.33
assert get_at_content("AGG", 1) == 0.3
assert get_at_content("AGG", 5) == 0.33333

# easier than re-checking the output by hand every time the function changes

If the condition after `assert` is true, nothing happens; if it is false, Python stops and prints an error message:

In [7]:
assert get_at_content("AGNN") == 0.5  # the function returns 0.25 here, so the assertion fails

AssertionError: 
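
The bare `AssertionError:` above says nothing about what went wrong. An `assert` can take an optional message after a comma, which is shown with the error. A minimal sketch; the `try`/`except` is only there so the example keeps running after the failure:

```python
def get_at_content(dna, decimal_places=2):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, decimal_places)

# passes silently
assert get_at_content("ATGC") == 0.5, "half of ATGC is A or T"

message = None
try:
    # fails: only one of the four characters is A or T
    assert get_at_content("AGNN") == 0.5, "expected 0.5 for AGNN"
except AssertionError as error:
    message = str(error)
    print("assertion failed:", message)
```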

Assertions are good for:

- providing some documentation about the behaviour of the function
- reassuring you that your function is giving the right answer
- letting you know if you break the function, should you ever change it 
- demonstrating how the function can be used to other people

Writing assertions about your functions is a **very good habit**, but many programmers don't bother. So now after the course, ask your local bioinformatician if they write assertion tests (also known as unit tests) for their own functions. If not, just sigh, and shake your head sadly.

---

## Exercises

### Amino acid percentage, part one

Write a function that takes two arguments – a protein sequence and an amino acid residue code – and returns the percentage of the protein that the amino acid makes up. Use the following assertions to test your function:

```python
assert my_function("MSRSLLLRFLLFLLLLPPLP", "M") == 5
assert my_function("MSRSLLLRFLLFLLLLPPLP", "r") == 10
assert my_function("MSRSLLLRFLLFLLLLPPLP", "L") == 50
assert my_function("MSRSLLLRFLLFLLLLPPLP", "Y") == 0
```

If you choose a different name for your function then change the name in the `assert` statements to match your function name. But don't change the assertions themselves - modify your function until they all pass.

In [1]:
def percent_protien(protien, AA):
    AA = AA.upper()  # accept lower-case residue codes such as "r"
    aa_count = protien.count(AA)
    percent = (aa_count / len(protien)) * 100
    return int(percent)  # int() truncates to a whole number; without it we would get a float

assert percent_protien("MSRSLLLRFLLFLLLLPPLP", "M") == 5  # compare numbers with ==; `is` tests identity, not value
assert percent_protien("MSRSLLLRFLLFLLLLPPLP", "r") == 10  # passes because of the .upper() call
assert percent_protien("MSRSLLLRFLLFLLLLPPLP", "L") == 50
assert percent_protien("MSRSLLLRFLLFLLLLPPLP", "Y") == 0  # no output means every assertion passed



### Amino acid percentage, part two

Modify the function from part one so that it accepts a list of amino acid residues rather than a single one. If no list is given, the function should return the percentage of hydrophobic amino acid residues (A, I, L, M, F, W, Y and V). Your function should pass the following assertions:

```python
assert my_function("MSRSLLLRFLLFLLLLPPLP", ["M"]) == 5
assert my_function("MSRSLLLRFLLFLLLLPPLP", ['F', 'S', 'L']) == 70
assert my_function("MSRSLLLRFLLFLLLLPPLP") == 65
```

To get this one to work, you'll have to go through the list of amino acid residues one at a time, generate the count for each one, and come up with a total count.

In [38]:
hydrophobic_aa=['A','I','L','M','F','W','Y','V']
print(hydrophobic_aa)

def percent_protien(protien, AA=hydrophobic_aa):
    # first attempt: collects one count per residue, but never totals or returns them
    aa_count_all = []
    for i in AA:
        aa_count = protien.count(i)
        aa_count_all.append(aa_count)

percent_protien("MSRSLLLRFLLFLLLLPPLP", ['F', 'S', 'L'])

['A', 'I', 'L', 'M', 'F', 'W', 'Y', 'V']

In [53]:
protien = "MSRSLLLRFLLFLLLLPPLP"
AA = ['F', 'S', 'L']
aa_count_all=[]

for i in AA:
    aa_count = protien.count(i)
    aa_count_all.append(aa_count)
    

    
print(aa_count_all) 

[2, 2, 10]


In [9]:
hydrophobic_aa=['A','I','L','M','F','W','Y','V']
print(hydrophobic_aa)

def percent_protien(protien, AA=hydrophobic_aa):
    residue_count = 0  # running total across all residues in AA
    for i in AA:
        residue_count += protien.count(i)
    percent = (residue_count / len(protien)) * 100
    return int(percent)  # truncate to a whole number instead of returning a float

assert percent_protien("MSRSLLLRFLLFLLLLPPLP", ["M"]) == 5
assert percent_protien("MSRSLLLRFLLFLLLLPPLP", ['F', 'S', 'L']) == 70
assert percent_protien("MSRSLLLRFLLFLLLLPPLP") == 65

['A', 'I', 'L', 'M', 'F', 'W', 'Y', 'V']

### Base counter

Write a function that will take a DNA sequence along with an optional threshold and return `True` or `False` to indicate whether the DNA sequence contains a high proportion of undetermined bases (i.e. not A, T, G or C).

Write at least three `assert` statements about your function before you write the function itself.
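
One possible sketch, not the only valid answer. The function name `has_many_undetermined`, the default threshold of 0.1, and the reading of "high proportion" as "strictly greater than the threshold" are all assumptions made for this example. The assertions are written first, inside a helper, so they can be run once the function exists.

```python
def check_has_many_undetermined():
    # written before the function itself, as the exercise asks
    assert not has_many_undetermined("ATGCATGCAT")
    assert has_many_undetermined("ATNNNNGCAT")
    assert not has_many_undetermined("ATGCATGCAN", threshold=0.1)  # exactly 10% is not above 10%
    assert has_many_undetermined("ATGCATGCAN", threshold=0.05)

def has_many_undetermined(dna, threshold=0.1):
    # count every character that is not A, T, G or C (upper or lower case)
    undetermined = sum(1 for base in dna.upper() if base not in "ATGC")
    return undetermined / len(dna) > threshold

check_has_many_undetermined()
```

An empty sequence would raise a `ZeroDivisionError` here; how to handle that case is left open.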