<div align=right>
<img src="img/logosmall.png" width="100px" align=right>
</div>

# Writing our own functions

<div class="alert alert-warning">
Parts of this section have been adapted from copyrighted material in *Jones, M: Python for Biologists: A complete programming course for beginners (2013)*.

**Please do not distribute it!**
</div>

## Why do we want to write our own functions?

Take a look back at the very first exercise in the first section – the one where we had to write a program to calculate the GC content of a DNA sequence:

```python
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"

length = len(my_dna)
g_count = my_dna.count('G')
c_count = my_dna.count('C')
gc_content = (g_count + c_count) / length

print("GC content is " + str(gc_content))
```

The four lines in the middle do the actual work of calculating the GC content. Every place in our code where we want to calculate the GC content of a sequence, we need these same four lines, copied exactly and without any mistakes.

It would be much simpler if Python had a built-in function (let’s call it `get_gc_content()`) for calculating GC content. If that were the case, then we could just run `get_gc_content()` in the same way we run `print()`, or `len()`, or `open()`. Being a general-purpose language, Python does not have such a built-in function.  It does have the next best thing — a way for us to create our own functions.

Creating our own function to carry out a particular job has many benefits. It allows us to **re-use the same code** many times within a program without having to copy it out each time. Additionally, if we find that we have to make a change to the code, we only have to do it in one place. We can also re-use code across multiple programs!

Splitting our code into functions also allows us to tackle larger problems, as we can work on different bits of the code independently.

## Defining a function

Let’s create our `get_gc_content()` function!

Before we start, we need to figure out:

* What are the inputs going to be? (the function's *arguments*)
* What are the outputs going to be? (the function's *return value*)

For this function it seems obvious:

* The input is going to be a single DNA sequence (a string).
* The output is going to be a decimal number (a float).

Here's the code:

In [3]:
def get_gc_content(dna):
    length = len(dna)
    g_count = dna.count('G')
    c_count = dna.count('C')
    gc_content = (g_count + c_count) / length
    return gc_content

* The first line of the function definition starts with the keyword `def` (short for *define*)…

* followed by the name of the function…

* followed by the names of the argument variables, in parentheses.

The function name and the argument names are arbitrary.  As always, it's good to use descriptive names.

This first line ends with a colon, just like the first line of a `for` loop.

And just like `for` loops, this line is followed by a *code block* of indented lines – we call this the *function body*.

The function body can have as many lines of code as we like, as long as they all have the same indentation. Within the function body, we can refer to the arguments by using the variable names from the first (definition) line. In this case, the variable `dna` refers to the sequence that was passed in as the argument to the function.

The last line of the function causes it to *return* the GC content that was calculated in the function body. We write the keyword `return` followed by the value that the function should return.

There are a couple of important things to be aware of when writing functions. Firstly, we need to make a clear distinction between *defining* a function, and *calling* it.  The code we’ve written above will not cause anything obvious to happen when we evaluate it, because we’ve not actually asked Python to execute the `get_gc_content()` function – we have simply defined what it is.

In fact, something *does* happen behind the scenes.  A new function object is created, and a variable called `get_gc_content` now refers to this function object.

We can evaluate the function we've just defined:

In [35]:
get_gc_content

<function __main__.get_gc_content>

You'll see that Python regards it as an object of type *function*.

The code in the function body will only be executed when we call the function.  And we call it in the same way we've called built-in functions like `print`:

In [36]:
get_gc_content("ATGACTGGACCA")

0.5

In order to use the calculated return value to do something useful, we must either store the result in a variable:

In [7]:
gc_cnt = get_gc_content("ATGACTGACCA")
print(gc_cnt)

0.45454545454545453


…or use it directly:

In [5]:
print("GC content is", get_gc_content("ATGACTGGACCA"))

GC content is 0.5


Secondly, it’s important to understand that the argument variable `dna` does not refer to any particular value when the function is defined. Instead, its job is to refer to whatever value is *passed* as the argument to the function when it is called.

Note that the variables `dna`, `length`, `g_count`, `c_count` and `gc_content` exist **only inside the function** `get_gc_content`.  We say the variables are *local* to the function, and that the code block which forms the body of the function is the *scope* of these variables.

If we try to use or evaluate a function's local variable from outside the function, we'll get the standard error for an undefined name:

In [33]:
print(g_count)

NameError: name 'g_count' is not defined

If we define a variable in a Jupyter code box outside of any code block, that variable becomes a *global variable* in the current Notebook:

In [37]:
my_variable = 42

We say such a variable is in the *global scope*.

Similarly, if we define a variable outside of any code block in a Python *module* (a "module" is simply a text file containing Python code), that variable is *global* in that module.

In Python every object has a *namespace*.  Inside a given namespace, variables can be assigned to objects… just like the luggage tags we saw in the illustrations in the first section of this course.

However, namespaces are *completely independent*.  It's quite possible to have two variables called `gc_content` in two different namespaces that reference completely different bits of data.

While we are inside a function, any reference to a variable name (say, `gc_content`) resolves to the *local variable* of that name (the one in the function's namespace).  If, however, there is no local variable called `gc_content` in the function's namespace, the name `gc_content` will be looked up in the *global* namespace.

In a sense, local variables "override" global ones.  We often say that the local variable *shadows* the global variable.

Once we are outside our function again, a reference to `gc_content` once more refers to the variable of that name in the global namespace (which still has the same value we assigned to it all the way back up in the first code box).

Let's define two short functions, one with a local variable called `gc_content` and one without:

In [10]:
def func1():
    return gc_content
    
def func2():
    gc_content = 0.4
    return gc_content

And play with them:

In [38]:
func1()

NameError: name 'gc_content' is not defined

In [39]:
func2()

0.4

Now we define a variable called `gc_content` in the global scope of this Notebook:

In [40]:
gc_content = 0.5

And we try calling our functions again:

In [41]:
func1()

0.5

In [42]:
func2()

0.4

Can you explain exactly what happened there?

## Calling and improving our function

Let’s write some code that uses our `get_gc_content()` function to see how it works:

In [12]:
my_gc_content = get_gc_content("ATGCGCGATCGATCGAATCG")
print(my_gc_content)
print(get_gc_content("ATGCATGCAACTGTAGC"))
print(get_gc_content("aactgtagctagctagcagcgta"))

0.55
0.47058823529411764
0.0


Looking at the output, we can see that the first function call works fine – the GC content is calculated to be 0.55, is stored in the variable `my_gc_content`, then printed.

However, the output for the next two calls is not so great. The call at line 3 produces a number with way too many figures after the decimal point, and the call at line 4, with the input sequence in lower case, gives a result of `0.0`, which is definitely not correct.  (Can you guess why?)

We’ll fix these problems by making a couple of changes to the `get_gc_content()` function:

* We can add a rounding step in order to limit the number of digits after the decimal point in the result.

  Python has a built-in `round()` function that takes two arguments – the number we want to round, and the number of digits we want to keep after the decimal point.
  

* We can fix the lower case problem by converting the input sequence to upper case before starting the calculation.
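
As a quick check of the two building blocks used in these fixes, the sketch below calls `round()`, `count()` and `upper()` directly. The sample values are made up for illustration; note that the second argument of `round()` counts digits after the decimal point.

```python
# round(number, ndigits) keeps ndigits digits after the decimal point
print(round(0.47058823529411764, 2))   # 0.47
print(round(1234.5678, 1))             # 1234.6

# count() is case-sensitive, so lower-case bases are missed ...
print("aactg".count("G"))              # 0
# ... unless we convert the sequence to upper case first
print("aactg".upper().count("G"))      # 1
```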

Here’s the new version of the function, with the same three function calls:

In [15]:
def get_gc_content(dna):
    length = len(dna)
    g_count = dna.upper().count('G')
    c_count = dna.upper().count('C')
    gc_content = (g_count + c_count) / length
    return round(gc_content, 2)
 
my_gc_content = get_gc_content("ATGCGCGATCGATCGAATCG")
print(my_gc_content)
print(get_gc_content("ATGCATGCAACTGTAGC"))
print(get_gc_content("aactgtagctagctagcagcgta"))

0.55
0.47
0.48


Much better, but we can do better still:

Why not make it so that we can specify the number of digits after the decimal point when we call the function?

We add a second argument variable `sig_digs` to the function definition, and use it in the call to `round()`:

In [16]:
def get_gc_content(dna, sig_digs):
    length = len(dna)
    g_count = dna.upper().count('G')
    c_count = dna.upper().count('C')
    gc_content = (g_count + c_count) / length
    return round(gc_content, sig_digs)
 
test_dna = "ATGCATGCAACTGTAGC"
print(get_gc_content(test_dna, 1))
print(get_gc_content(test_dna, 2))
print(get_gc_content(test_dna, 3))

0.5
0.47
0.471


The output confirms that the rounding works as intended.

## Encapsulation

Let’s pause for a moment and consider what we've just done:

We wrote a function, and then wrote some code that used that function. In the process of writing the code that used the function, we discovered a couple of problems with our original function definition. **We were then able to go back and change the function definition, without having to make any changes to the code that used the function.**

This is a programming phenomenon that we call *encapsulation*. Encapsulation just means dividing up a complex program into little bits which we can work on independently. In our example the code is divided into two parts: The part where we define the function, and the part where we call it.  And we can make changes to one part without worrying about the effects on the other.

This is a very powerful idea, because without it, the size of programs we can write is limited to the number of lines of code we can hold in our head at one time. Some of the example code in the solutions to exercises in the previous section was starting to push at this limit already, even for relatively simple problems. By contrast, using functions allows us to build up a complex program from small building blocks, each of which individually is small enough to understand in its entirety.

## Functions don’t always have to take an argument

There's nothing that says that your function *must* take an argument. It's perfectly possible to define a function with no arguments:

```python
def get_a_number():
    return 42
```

…but such functions tend not to be very useful. For example, we can write a version of `get_gc_content` that doesn’t require any arguments by setting the value of the `dna` variable inside the function:

In [17]:
def get_gc_content_2():
    dna = "ACTGATGCTAGCTA"
    length = len(dna)
    g_count = dna.upper().count('G')
    c_count = dna.upper().count('C')
    gc_content = (g_count + c_count) / length
    return round(gc_content, 2)

…but this version will always calculate the same value, unless we change the DNA sequence directly in the code.  It's not reusable, and therefore not very useful as a function.

Another thing we should avoid doing is writing a function that uses a global variable rather than a parameter:

In [43]:
def get_gc_content_3():
    length = len(dna)
    g_count = dna.upper().count('G')
    c_count = dna.upper().count('C')
    gc_content = (g_count + c_count) / length
    return round(gc_content, 2)
 
dna = "ACTGATCGATCG"
print(get_gc_content_3())

0.5


It works because the function gets the value of the global `dna` variable.

>Remember:  If you reference a variable inside a function, Python first tries to look it up in the function's own namespace.  If it doesn't find a variable of that name bound in the local namespace, it also tries the global namespace.  So you can indeed use global variables inside the body of a function.

But this is practically **never a good idea**.  It **breaks the encapsulation** that we worked so hard to achieve. The function now only works if there is a variable called `dna` set in the bit of the code where the function is called, so the two pieces of code are no longer independent.

If you find yourself writing code like this, it's usually a good idea to identify which variables from outside the function are being used inside it, and turn them into arguments.
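
As a sketch of that refactoring, the hypothetical function below first reads a global variable `threshold` and is then rewritten to receive it as an argument. The function names and the threshold values are invented for illustration and are not part of the course code.

```python
threshold = 0.5  # global variable (illustrative value)

def is_gc_rich_global(dna):
    # Depends on a global: only works if `threshold` exists outside
    gc = (dna.count('G') + dna.count('C')) / len(dna)
    return gc > threshold

def is_gc_rich(dna, threshold):
    # Everything the function needs now arrives through its arguments
    gc = (dna.count('G') + dna.count('C')) / len(dna)
    return gc > threshold

print(is_gc_rich("GGCCAT", 0.5))   # True
print(is_gc_rich("GGCCAT", 0.8))   # False
```

The second version can be understood, tested and re-used without knowing anything about the code around it.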

## Functions don’t always have to return a value

Consider this variation of our function – instead of returning the GC content, this function prints it to the screen:

In [44]:
def print_gc_content_p(dna):
    length = len(dna)
    g_count = dna.upper().count('G')
    c_count = dna.upper().count('C')
    gc_content = (g_count + c_count) / length
    print(round(gc_content, 2))

When you first start writing functions, it’s very tempting to do this kind of thing. You think *“OK, I need to calculate and print the GC content – I’ll write a function that does both”*. The trouble with this approach is that it results in a function that is less flexible. Right now you want to print the GC content to the screen, but what if you later discover that you want to write it to a file, or use it as part of some other calculation? You’ll have to write more functions to carry out these tasks.

As far as possible, the functions you write should be like mathematical functions.  A mathematical function takes certain arguments, does its calculation, and returns the result of that calculation.  It does not affect the state of the world in any way except by returning its result.

For instance, this mathematical function `f`:

\begin{equation*}f(x) = x^2\end{equation*}

If we call it with an argument it consumes that argument, and returns a result which is the square of that argument:

\begin{equation*}f(2) = 4\end{equation*}

While doing its calculation, it doesn't make anything appear on your computer screen, it doesn't switch on the room light, and it doesn't launch any nuclear missiles;  it *simply and only returns its result*.

Similarly (for the most part!) the functions we write in our code should strive to be "real functions" which should only communicate with the outside world through the results they return.  This helps to ensure a reasonable degree of encapsulation.  The above mathematical function can be implemented and called in Python like this:

In [18]:
def f(x):
    return x**2

f(2)

4

(The double-asterisk (“`**`”) is Python's exponentiation operator.)

Anything a function does over and above returning a result, and which does affect the state of the world (like writing some text to the terminal) is called a *side effect*.  Examples of side effects are:  Changing the value of a global variable, and printing a result to the screen.

Of course, we have to print results to the screen *sometimes*, but in such cases it's good to group all the input/output operations together in functions that *only* do input and output.
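
One possible way to arrange this is sketched below. The names `gc_fraction()`, `format_gc_report()` and `print_gc_report()` are hypothetical and chosen only for this example: the first two functions just return values, and only the last one touches the screen.

```python
def gc_fraction(dna):
    # Pure calculation: no printing, no global variables
    dna = dna.upper()
    gc_count = dna.count('G') + dna.count('C')
    return round(gc_count / len(dna), 2)

def format_gc_report(name, dna):
    # Builds the text of the report but does not print it
    return name + ": GC content is " + str(gc_fraction(dna))

def print_gc_report(name, dna):
    # The only function here with a side effect
    print(format_gc_report(name, dna))

print_gc_report("seq1", "ATGCGC")   # seq1: GC content is 0.67
```

If we later want to write the report to a file instead, only the last function needs a sibling; the calculation and the formatting can be re-used as they are.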

## Functions can be called with named arguments

What do we need to know about a function in order to be able to use it? We need to know what the return value and type is, and we need to know the number and type of the arguments. For most of the examples we’ve seen so far we also need to know the **order** of the arguments.

For instance, to use the `open()` function we need to know that the name of the file comes first, followed by the mode flag. And to use our two-argument version of `get_gc_content()`, we need to know that the DNA sequence comes first, followed by the number of digits to round to.

Python supports *keyword arguments*, which allow us to call functions in a slightly different way. Instead of giving a list of arguments in parentheses:

In [21]:
get_gc_content("ATCGTGACTCG", 2)

0.55

…we can supply a list of argument variable names and values joined by equals signs:

In [22]:
get_gc_content(dna="ATCGTGACTCG", sig_digs=2)

0.55

This style of calling functions has several advantages. It doesn’t rely on the order of arguments, so we can use whichever order we prefer. This statement behaves identically to the one above:

In [23]:
get_gc_content(sig_digs=2, dna="ATCGTGACTCG")

0.55

It’s also clearer to read what’s happening when the argument names are given explicitly.

We can even mix and match the two styles of calling – the following are all identical:

In [24]:
get_gc_content("ATCGTGACTCG", 2)

0.55

In [25]:
get_gc_content(dna="ATCGTGACTCG", sig_digs=2)

0.55

In [26]:
get_gc_content("ATCGTGACTCG", sig_digs=2)

0.55

Although we’re not allowed to start using keyword arguments then switch back to positional – this will cause an error:

In [45]:
get_gc_content(dna="ATCGTGACTCG", 2)

SyntaxError: positional argument follows keyword argument (<ipython-input-45-8c6ce3e321d5>, line 1)

Keyword arguments can be particularly useful for functions and methods that have a lot of arguments, and we’ll use them where appropriate in the examples and exercise solutions in the rest of this course.
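
Several built-in functions are commonly called this way. The short sketch below uses the `sep` and `end` arguments of `print()`, the `reverse` argument of `sorted()` and the `base` argument of `int()`; the input values are arbitrary examples.

```python
# sep and end are keyword arguments of the built-in print()
print("A", "T", "G", sep="-", end="!\n")      # A-T-G!

# reverse is a keyword argument of sorted()
print(sorted(["T", "A", "G"], reverse=True))  # ['T', 'G', 'A']

# base is a keyword argument of int()
print(int("ff", base=16))                     # 255
```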

## Function arguments can have defaults

We’ve encountered function arguments with defaults before: Recall that the `open()` function takes two arguments – a file name and a mode flag – but that if we call it with just a file name it uses a default value of `'r'` for the mode flag.

We can write our own functions to have default arguments:  We simply specify the default value in the function definition. Here's a version of our `get_gc_content()` function where the default value of `sig_digs` is two:

In [46]:
def get_gc_content(dna, sig_digs=2):
    length = len(dna)
    g_count = dna.upper().count('G')
    c_count = dna.upper().count('C')
    gc_content = (g_count + c_count) / length
    return round(gc_content, sig_digs)

Now we have the best of both worlds. If the function is called with two arguments, it will round to the number of digits specified; if it’s called with one argument, it will use the default of two digits after the decimal point. Let’s see some examples:

In [47]:
get_gc_content("ATCGTGACTCG")

0.55

In [29]:
get_gc_content("ATCGTGACTCG", 3)

0.545

In [30]:
get_gc_content("ATCGTGACTCG", sig_digs=4)

0.5455

The function takes care of filling in the default value for `sig_digs` for the first function call where none is supplied.

Argument defaults allow us to write very flexible functions which can have varying numbers of arguments. It only makes sense to use them for arguments where a sensible default can be chosen.  (There's no point specifying a default for the `dna` argument in our example.)  They are particularly useful for functions where some of the options are only going to be used infrequently.

Note that you must still supply every non-optional argument (i.e. every argument without a default value):

In [48]:
get_gc_content()

TypeError: get_gc_content() missing 1 required positional argument: 'dna'

At least the error is very specific!

## Testing functions

When writing code of any type, it’s important to periodically check that your code does what you intend it to do. If you look back over the solutions to exercises from the first few sections, you can see that we generally test our code at each step by printing some output to the screen and checking that it looks OK. For example, when we were first calculating GC content, we used a very short test sequence to verify that our code worked before running it on the real input.

The reason we used a test sequence was that, because it was so short, we could easily work out the answer by eye and compare it to the answer given by our code. This idea – running code on a test input and comparing the result to an answer that we know to be correct – is such a useful one that Python has a built-in tool for expressing it: `assert`. An assertion consists of the keyword `assert`, followed by a call to our function, then two equals signs, then the result that we expect.

For example, we know that if we run our `get_gc_content` function on the DNA sequence “ATGC” we should get an answer of `0.5`. This assertion will test whether that’s the case:

In [50]:
assert get_gc_content("ATGC") == 0.5

The way that assertion statements work is very simple: if the assertion is true, as it is here, nothing is printed and the program carries on. If an assertion turns out to be false (i.e. if Python executes our function on the input “ATGC” and the answer isn’t `0.5`) then the program will fail with an `AssertionError`.

Assertions are useful in a number of ways. They provide a means for us to check whether our functions are working as intended and therefore help us track down errors in our programs. If we get some unexpected output from a program that uses a particular function, and the assertion tests for that function all pass, then we can be reasonably confident that the error doesn’t lie in the function but in the code that calls it.

They also let us modify a function and check that we haven’t introduced any errors. If we have a function that passes a series of assertion tests, and we make some changes to it, we can re-run the assertion tests and, assuming they all pass, be reasonably confident that we haven’t broken the function.

Assertions are also useful as a form of documentation. By including a collection of assertion tests alongside a function, we can show exactly what output is expected from a given input.

Finally, we can use assertions to test the behaviour of our function for unusual inputs. For example, what is the expected behaviour of `get_gc_content()` when given a DNA sequence that includes unknown bases (usually represented as N)? A sensible way to handle unknown bases would be to exclude them from the GC content calculation – in other words, the GC content for a given sequence shouldn’t be affected by adding a bunch of unknown bases. We can write an assertion that expresses this:

In [51]:
assert get_gc_content("ATGCNNNNNNNNNN") == 0.5

AssertionError: 

This assertion fails for the current version of `get_gc_content`. However, we can easily modify the function to remove all N characters before carrying out the calculation:

In [52]:
def get_gc_content(dna, sig_digs=2):
    dna = dna.upper().replace('N', '')
    length = len(dna)
    g_count = dna.count('G')  # dna is already upper case here
    c_count = dna.count('C')
    gc_content = (g_count + c_count) / length
    return round(gc_content, sig_digs)

…and now the assertion should pass:

In [53]:
assert get_gc_content("ATGCNNNNNNNNNN") == 0.5

It’s common to group a collection of assertions for a particular function together to test for the correct behaviour on different types of input. Here’s an example for `get_gc_content` which shows a range of different types of behaviour:

In [54]:
assert get_gc_content("A") == 0
assert get_gc_content("G") == 1
assert get_gc_content("ATGC") == 0.5
assert get_gc_content("AAG") == 0.33
assert get_gc_content("AAG", 1) == 0.3
assert get_gc_content("AAG", 5) == 0.33333

In fact, this idea of grouping sets of tests together is such a good one that we have special words for it (*test suites* and *unit testing*) and there are multiple built-in Python tools for carrying out such tests.  There's even a style of software development that makes testing central — *"test-driven development"*.  Covering these concepts in detail is beyond this introductory course.
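
For the curious, here is a rough sketch of what a small test suite might look like with `unittest`, one of the testing tools in Python's standard library. It uses classes, which this course has not covered, so treat it as a preview rather than something to learn now. The function is a copy of the final version from this section, and the class and test names are made up for the example.

```python
import unittest

def get_gc_content(dna, sig_digs=2):
    # Copy of the final version from this section
    dna = dna.upper().replace('N', '')
    gc_count = dna.count('G') + dna.count('C')
    return round(gc_count / len(dna), sig_digs)

class GcContentTests(unittest.TestCase):
    def test_simple_sequences(self):
        self.assertEqual(get_gc_content("A"), 0)
        self.assertEqual(get_gc_content("ATGC"), 0.5)

    def test_unknown_bases_are_ignored(self):
        self.assertEqual(get_gc_content("ATGCNNNN"), 0.5)

    def test_rounding(self):
        self.assertEqual(get_gc_content("AAG", 1), 0.3)

# Build and run the suite by hand; this also works inside a Notebook
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GcContentTests)
result = unittest.TextTestRunner().run(suite)
```

Each `test_...` method plays the same role as one group of assertions, and the runner reports how many tests passed or failed.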

---

## Exercises

### 1. Percentage of amino acid residues, Part 1

Write a function that takes two arguments – a protein sequence and an amino acid residue code – and returns the percentage of the protein that the amino acid makes up.

In [8]:
# Exercise 1

def my_function(protein, residue):
    protein = protein.upper()
    residue = residue.upper()
    return protein.count(residue) / len(protein) * 100

Use the following assertions to test your function:

In [9]:
assert my_function("MSRSLLLRFLLFLLLLPPLP", "M") == 5
assert my_function("MSRSLLLRFLLFLLLLPPLP", "r") == 10
assert my_function("MSRSLLLRFLLFLLLLPPLP", "L") == 50
assert my_function("MSRSLLLRFLLFLLLLPPLP", "Y") == 0

### 2. Percentage of amino acid residues, Part 2

Modify the function from part one so that it accepts a list of amino acid residues rather than a single one. If no list is given, the function should return the percentage of hydrophobic amino acid residues (A, I, L, M, F, W, Y and V).

In [1]:
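# Exercise 2
def my_function(protein, residues=("A", "I", "L", "M", "F", "W", "Y", "V")):
    # If no list is given, the default is the set of hydrophobic residues
    protein = protein.upper()
    total = 0
    for residue in residues:
        # Add up the count of each requested residue, ignoring case
        total += protein.count(residue.upper())
    # Multiply before dividing so that exact percentages such as 55 stay exact
    return total * 100 / len(protein)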

Your function should pass the following assertions:

In [None]:
assert my_function("MSRSLLLRFLLFLLLLPPLP", ["M"]) == 5
assert my_function("MSRSLLLRFLLFLLLLPPLP", ['M', 'L']) == 55
assert my_function("MSRSLLLRFLLFLLLLPPLP", ['F', 'S', 'L']) == 70
assert my_function("MSRSLLLRFLLFLLLLPPLP") == 65

### 3. Factorial

The *factorial* of a number `n` — denoted by `n!` — is defined as the product of all positive integers less than or equal to `n`.  For example:

    5! = 5 * 4 * 3 * 2 * 1 = 120
    
Write a function `factorial(n)` that returns the factorial of its argument, `n`.

>Hint:  There are a number of ways to do this, but the easiest may be a *recursive* function, that is, a function that calls itself.
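
To illustrate what a recursive function looks like without giving away the answer, here is a sketch of a different, hypothetical function, `sum_to(n)`, which adds up the integers from 1 to `n`. It assumes `n` is a non-negative integer.

```python
def sum_to(n):
    # Base case: stops the recursion
    if n == 0:
        return 0
    # Recursive case: the function calls itself with a smaller argument
    return n + sum_to(n - 1)

print(sum_to(4))   # 10
```

Every recursive function needs a base case like the one above, otherwise it would keep calling itself until Python raises an error.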

In [None]:
# Exercise 3

