## Learning objectives


1. Running python scripts

2. Writing functions

3. Understanding assert statements

4. Parsing FASTA files

5. Writing a k-mer counter

---

## Running a python script from the command line

There are several ways to run a python script from the command line. We have been using one, passing a script to the python program. Alternatively, we can make the script executable itself.

```bash
>chmod a+rx SCRIPT.py
```

This means that we can now call our script directly, using a path to it:

```bash
>./SCRIPT.py
```

Alternatively, the script could be put into any folder listed in the `PATH` environment variable, after which it can be called by name alone.

However, running any of the scripts we've seen so far this way gives us an error. Why? The computer doesn't know what language the code is written in, so it defaults to the language of the shell you're in (we're in bash). So how do we fix this?

### The Shebang!!!

The shebang is a special comment on the very first line of the script that tells the computer which program to run the script with. We can even use it to tell the computer a specific version of python to use.
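As a minimal sketch, a script with a shebang might look like the following. The `#!/usr/bin/env python3` line is one common choice and assumes that `python3` can be found on your `PATH`; the function name `describe_interpreter` is made up for this illustration.

```python
#!/usr/bin/env python3
# The line above is the shebang. It has to be the very first line of the file.
# It asks the `env` program to find `python3` on the PATH and run this script with it.
import sys


def describe_interpreter():
    # Report which version of python is actually running the script
    return "Running under python {}.{}".format(
        sys.version_info.major, sys.version_info.minor
    )


print(describe_interpreter())
```

Python itself treats the shebang as an ordinary comment, so the script still works when passed to the python program directly.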

Now we can run our script as a stand-alone program and the computer will know how to run it as a python script! The same approach works for scripts written in any other interpreted language.

## Writing functions

Any time you want to run the same set of operations more than once, a function saves you from writing the code over again. Functions also allow you to create much more readable and understandable code. A function definition contains four parts: the word "def" to let the interpreter know that you're defining a function, the name of the function, the arguments that the function can take, and the code to execute.
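Here is a small sketch showing those four parts. The function `gc_content` is a hypothetical example, not something defined elsewhere in this notebook.

```python
# "def" starts the definition, "gc_content" is the name,
# "sequence" is the argument, and the indented lines are the code to execute
def gc_content(sequence):
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence)


# The same code can now be reused for as many sequences as we like
print(gc_content("GGCC"))  # 1.0
print(gc_content("ATGC"))  # 0.5
```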

Arguments to a function have a specific order, and the order in which you pass them to the function will correspond to the order they are given in the function definition. However, you can use the names of the arguments in the function definition to pass arguments in a different order.
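To see both calling styles side by side, here is a short sketch using a hypothetical function named `subsequence`.

```python
def subsequence(sequence, start, end):
    # Return the bases from start up to, but not including, end
    return sequence[start:end]


# Positional: values are matched to arguments by their order
print(subsequence("AGATCTCC", 2, 5))  # ATC

# Keyword: values are matched by name, so the order no longer matters
print(subsequence(end=5, sequence="AGATCTCC", start=2))  # ATC
```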

Sometimes you don't need to specify every argument every time you call the function. You can give arguments default values when you define the function. However, you need to put all arguments without default values before those with defaults.
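A quick sketch of a default value, using a hypothetical function `first_kmer` whose `k` argument defaults to 3:

```python
# "sequence" has no default, so it must come before "k", which does
def first_kmer(sequence, k=3):
    return sequence[:k]


print(first_kmer("AGATCT"))       # AGA, k falls back to its default of 3
print(first_kmer("AGATCT", k=4))  # AGAT, the default is overridden
```

Writing the definition as `def first_kmer(k=3, sequence):` instead would be rejected with a `SyntaxError`.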

There are two ways that functions can pass back information. The `return` statement will end the function and pass back whatever follows the return. If there is no return statement, then the function returns `None`. The second way is to pass a mutable variable to the function. When a function is called, immutable variables (such as numbers, strings, and tuples) remain unchanged by operations in the function. Mutable variable types (such as lists and dictionaries), however, allow changes made within the function to persist after the function call finishes.
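The difference between the two can be sketched as follows. Both function names are hypothetical and exist only for this comparison.

```python
def add_base_return(sequence, base):
    # Strings are immutable, so this builds a new string and passes it back
    sequence = sequence + base
    return sequence


def add_base_in_place(bases, base):
    # Lists are mutable, so append changes the list the caller passed in
    # There is no return statement, so this function returns None
    bases.append(base)


seq = "AGA"
new_seq = add_base_return(seq, "T")
print(seq, new_seq)  # AGA AGAT

base_list = ["A", "G", "A"]
result = add_base_in_place(base_list, "T")
print(base_list, result)  # ['A', 'G', 'A', 'T'] None
```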

What can we do if we have many different results that we want to return from a function? Given the following function, can you edit the code to return all of the variables that we created within the function?
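One common approach, sketched here with a hypothetical function `sequence_stats` (not necessarily the function this exercise was written for), is to return several values at once as a tuple and unpack them when calling.

```python
def sequence_stats(sequence):
    length = len(sequence)
    gc_count = sequence.count("G") + sequence.count("C")
    at_count = sequence.count("A") + sequence.count("T")
    # Separating values with commas packs them into a single tuple
    return length, gc_count, at_count


# The tuple can be unpacked into separate variables in one step
length, gc_count, at_count = sequence_stats("AGATCTCC")
print(length, gc_count, at_count)  # 8 4 4
```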

Functions can also be called recursively. That is, a function can call itself. This is very useful for things like traversing a tree. Let's try finding a factorial value.
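A minimal recursive sketch of the factorial, assuming the input is a non-negative integer:

```python
def factorial(n):
    # Base case: 0! and 1! are both 1, so the recursion stops here
    if n <= 1:
        return 1
    # Recursive case: n! = n * (n - 1)!
    return n * factorial(n - 1)


print(factorial(5))  # 120
```

Every recursive function needs a base case like this one. Without it, the function would keep calling itself until python raises a `RecursionError`.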

If you haven't encountered it by now, one of the quirks of python (though not unique to it) is that a script is executed from top to bottom, which practically means that running code can't refer to anything that hasn't been defined above it, including functions. So, if you have functions but the main part of your code is outside a function, you have to pay attention to ordering. As you can see, the following doesn't work.

```python
# This raises a NameError: some_function is called before it has been defined
a = some_function()

def some_function():
    return "my value"
```

Therefore, it is useful to enclose ALL of your code in functions. Traditionally, the main part of your code will occur in the cleverly named `main` function.
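A minimal sketch of that layout, reusing the name `some_function` from the failing example:

```python
def main():
    # some_function is defined further down the file, but that is fine here:
    # the name is only looked up when main actually runs
    a = some_function()
    print(a)
    return a


def some_function():
    return "my value"
```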

But, that code doesn't seem to do anything!!! Why not? Can we fix it?
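Defining a function does not run it, so a script that only contains definitions does nothing visible. One common fix, sketched here, is to call `main` at the bottom of the file inside an `if __name__ == "__main__":` check, which is only true when the file is run as a script rather than imported.

```python
def main():
    a = some_function()
    print(a)


def some_function():
    return "my value"


# By this point every function has been defined, so it is safe to start running
if __name__ == "__main__":
    main()
```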

## Defensive programming

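One thing to watch out for is code being executed in a way that you didn't intend. Therefore, it's important to check that things are what you expect before moving on. An `assert` statement does exactly this: it raises an `AssertionError` if the condition it is given is not true.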
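As a sketch of how such checks might look, here is a hypothetical `reverse_complement` function that uses `assert` statements to verify its input before doing any work. The function and its error messages are illustrative assumptions.

```python
def reverse_complement(sequence):
    # Stop early with a clear message if the input is not what we expect
    assert isinstance(sequence, str), "sequence must be a string"
    assert set(sequence) <= set("ACGT"), "sequence may only contain A, C, G, and T"
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(sequence))


print(reverse_complement("AGATC"))  # GATCT

# A failed assert raises an AssertionError carrying our message
try:
    reverse_complement("AGXTC")
except AssertionError as error:
    print("Assertion failed:", error)
```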

## Parsing a FASTA sequence

Now, let's use our newfound skills to read in a FASTA file. The FASTA format is used to hold raw sequence. Each sequence begins with a ">" followed by the name of the sequence. All of the lines following this, until the next name line or the end of the file, are the sequence associated with that name, broken by line breaks for readability. So, let's break down the steps we'll need to take to read in a FASTA file with a single sequence. And let's do it in a function that accepts a file stream as an argument.

### FASTA format


<br><div style="background: #EEE"> \>sequence 1<br> AGATCTCCCTGAGAGAAGAGCTCTCTCTCGA<br> TCTCGGATTACGTAGGCTAGAGAGAGAGCTA<br> TTCAA<br> \>sequence 2<br> GATCTCGGGATAAAAAAACTGGGATCTGATC<br> ATCTAAAGAGAG </div><br>


*Write your pseudo-code here*


```python
# Let's have it accept an open file object
# That way it can be passed a file or standard input
```

## Parsing a FASTA file

Now that we've read in a single sequence, let's alter that function to read in all of the sequences in a file. How do we need to alter the function?


*Write your pseudo-code here*


To test this, we'll need a FASTA file, so copy `/Users/cmdb/qbb2021/data/subset.fa` into the same directory as this notebook.
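One way the multi-sequence version might look is sketched below. The name `read_fasta` and the choice to return a list of `(name, sequence)` tuples are assumptions made for illustration. The key change is that a new name line saves the sequence collected so far instead of stopping.

```python
import io


def read_fasta(stream):
    sequences = []
    name = None
    sequence_lines = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new name line closes out the previous sequence, if there was one
            if name is not None:
                sequences.append((name, "".join(sequence_lines)))
            name = line[1:]
            sequence_lines = []
        else:
            assert name is not None, "sequence line found before any name line"
            sequence_lines.append(line)
    # The last sequence has no name line after it, so save it here
    if name is not None:
        sequences.append((name, "".join(sequence_lines)))
    return sequences


example = io.StringIO(
    ">sequence 1\nAGATCTCCCTGAGAGAAGAGCTCTCTCTCGA\n"
    "TCTCGGATTACGTAGGCTAGAGAGAGAGCTA\nTTCAA\n"
    ">sequence 2\nGATCTCGGGATAAAAAAACTGGGATCTGATC\nATCTAAAGAGAG\n"
)
records = read_fasta(example)
for record_name, record_sequence in records:
    print(record_name, len(record_sequence))
```

With the copied file in place, the same function could be given `open("subset.fa")` instead of the `io.StringIO` stand-in.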

## K-mer counting

A k-mer is a sequence of length k, where k can be any length we choose. One characteristic of genomes is that some k-mers are reused often while others are rare or absent. This can allow us to distinguish different species by k-mer distributions, identify copy number variation, repeat expansions or contractions, etc. K-mers are also used in sequence alignment. Now that we've got a function for reading in sequences, let's break those sequences down into k-mers and count the occurrences of each k-mer. But what steps do we need to take to do this?

*Write your pseudo-code here*
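One possible sketch of a k-mer counter slides a window of width k along the sequence and tallies each k-mer in a dictionary. The function name `count_kmers` is hypothetical.

```python
def count_kmers(sequence, k):
    counts = {}
    # The last full window starts at position len(sequence) - k
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Start from 0 the first time a k-mer is seen, then add 1
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


print(count_kmers("AGAGAT", 2))  # {'AG': 2, 'GA': 2, 'AT': 1}
```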

While it is useful to know the distribution of k-mers for a set of sequences, for some applications, such as sequence alignment, you also need to know the position of each k-mer. Since we have seen that some k-mers occur more than once, we need to be able to hold multiple positions for each k-mer. What changes to the above code do we need to make to hold the locations of each k-mer?
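One way to sketch this is to have the dictionary map each k-mer to a list of start positions rather than to a count. The function name `find_kmer_positions` is hypothetical, and positions are counted from 0.

```python
def find_kmer_positions(sequence, k):
    positions = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # setdefault creates an empty list the first time a k-mer is seen
        positions.setdefault(kmer, []).append(i)
    return positions


print(find_kmer_positions("AGAGAT", 2))  # {'AG': [0, 2], 'GA': [1, 3], 'AT': [4]}
```

The counts are still available if needed, since the count of a k-mer is just the length of its position list.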