# Outline of Defensive Programming and Automation

We've gone over the basics of Python and Bash, and we've done some data exploration using the Jupyter Notebook. This is a good entrypoint to discuss some programming *practices*: how to program in ways that are understandable/transparent, will save you time, and will prevent headaches down the road.

Outline:
1. **Defensive Programming** - defending yourself and others against "bad" code
2. **Working with files** - allowing our Jupyter/Python functions to interact with files outside this notebook
3. **Scripts** - saving our code for later
4. **Automation** - letting the computer run everything for you

Code is read much more often than it is written. Documenting your code and writing it efficiently from the start means less head-scratching when you inevitably revisit it to redo analyses, add more functions, or use other datasets, all of which you'll probably experience in lab.

# Defensive Programming

It is very easy to start coding, and then subsequently lose track of why and how you wrote your code. You may also have collaborators with whom you share your code, you may have auditors or reviewers that need to check your work, or you may even step away from your computer for the day and forget what you were doing! "Defensive programming" is a set of tactics that protects you and others from code that is difficult to read, understand, or troubleshoot. There are many components to defensive programming, some of which we'll cover in today's module:
* Style guides
* Naming conventions
* Comments
* Docstrings
* Don't Repeat Yourself - DRY
* Error handling
* Outlining your code - Pseudocode

## Style Guides

Programming languages have style guides that ensure your code is readable, usable, and debuggable by others. Most Python developers choose to follow the PEP8 style guide: https://www.python.org/dev/peps/pep-0008/. Other languages, such as R, have their own style guides (e.g. Tidyverse: https://style.tidyverse.org/). Many automated tools rely on these standards to parse your code for debugging and generating documentation. We won't go through all of PEP8, but we'll start with a subset of conventions that are common across different languages.

Adopting a style guide provides **consistency** to your code, making it easier for you and others to read and edit. That being said, while the above guides are commonly used, they aren't required. If you have a reason to format something differently, as long as others can understand it and you've documented the reason, go for it!

## Naming Conventions

How do you decide which names to give to your variables, functions, and scripts? In general, they should be **descriptive**: the name you give a variable should tell you about what information it holds. For example:
* A variable named `path_to_fasta` would tell you that it stores the location of a [FASTA file](https://en.wikipedia.org/wiki/FASTA_format) on your computer, probably as a `str`.
* A variable named `phred_score` would tell you that it stores a [PHRED score](https://en.wikipedia.org/wiki/Phred_quality_score), probably as an `int`.
* A variable named `temp2` probably doesn't tell you anything about what data it holds.

There are many common conventions as to how variable names should be formatted. The ones below are not specific to Python; you'll see them across other languages:

1. **snake_case**: All words are lower case with underscores between them
2. **CamelCase**: Words start with capital letters and are not separated
3. **mixedCase**: Like CamelCase but the first word is lowercase
4. **UPPERCASE_WITH_UNDERSCORES**: All letters are uppercase, separated by underscores 

Python's style guide outlines rules for using naming conventions:

1. Variables: **snake_case**
    - dna_sequence, file_path
2. Functions: **snake_case**
    - combine_replicates(), align_sequences()
3. Errors: **CamelCase**
    - ValueError, SyntaxError, FileNotFoundError

There are more guidelines, but these are common ones that are encountered early on.

### Exercise: Using naming conventions

Edit the code block below to conform to PEP8 naming conventions. Post your answers in the collaborative document.

In [None]:
def Velocity(TOTALDISTANCE, time):
    "This calculates the distance over time"
    Velocity_Result = TOTALDISTANCE / time
    return(Velocity_Result)
Velocity(10, 2)

### Dangers in variable naming

Aside from making names look professional, there are some rules about naming that help prevent errors and bugs. Specifically, you should never give a variable the same name as a function built into Python.

Let's use the function `sum()` to see what happens if we use `sum` as a variable name:

Let's try calling the sum() function again.
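
A minimal sketch of the problem. The function `shadowed_sum_demo` is a hypothetical name used here only so that the reassignment stays contained inside the function instead of affecting the rest of the notebook:

```python
def shadowed_sum_demo(values):
    """Show what happens when a variable reuses the name of a built-in."""
    sum = 10  # this local variable now hides the built-in sum() function
    try:
        return sum(values)  # fails: 10 is an int, not a function
    except TypeError as error:
        return f"TypeError: {error}"


print(shadowed_sum_demo([1, 2, 3]))

# Outside the function, the built-in sum() is untouched
print(sum([1, 2, 3]))
```

If you run `sum = 10` directly in a notebook cell instead, the name stays overwritten in every later cell until you run `del sum` or restart the kernel.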

## Comments

Comments are human-readable annotations that document various lines/functions in your code document. In Python, they are denoted using `#` - everything written after it on the same line is not run. This means we can use it to help both ourselves and others understand what the code is trying to do. 

In [None]:
def my_function(x):
    # This is a comment.
    # print(x + x)
    # The print function above will not run due to the '#'
    print(x)

my_function(1)

You should write comments for your code often, specifically when you are doing a task that is specialized like using a formula or applying a custom function to do a task. Some good advice from PEP8:

>Comments that contradict the code are worse than no comments. Always make a priority of keeping the comments up-to-date when the code changes!

## Docstrings

We previously covered docstrings within functions as a string just after the def statement. However, what if you want to write **more than just a single line of documentation**? Fortunately there is a way to do so by using **triple-quotes**.

In [None]:
def my_function(x):
    """Prints the value of x.

    Triple quotes let a docstring span several lines, so there is
    room for details beyond the one-line summary.
    """
    print(x)

my_function(1)
help(my_function)

### PEP guidelines on docstrings

Python PEP guidelines suggest the following format:

"""One line description

More details about your function, in triple-quotes

"""

**or**

"""Only a single line description in triple-quotes"""

### ... are supplemented by community formats

Docstrings are pretty flexible, even using PEP standards. There are a couple common format guidelines for docstrings that you can choose from. Why start with these?

1. It gives a good overview of what information people expect in your docstrings
2. The formats here can be parsed by common tools

While the formats for documentation may differ slightly depending on the language choice, the information expected from them is fundamental. 

In [None]:
# Google format
"""Takes a string and returns a list of letters

Args:
    string (str): A string to parse for letters
    upper (bool): The letters are returned uppercase 
        (default is False)

Returns:
    list: A list of each letter in the string
"""

# NumPy format
"""Takes a string and returns a list of letters

Parameters
----------
string : str
    A string to parse for letters
upper : bool, optional
    The letters are returned uppercase  (default is False)

Returns
-------
list
    A list of each letter in the string
"""

# reStructuredText format
"""Takes a string and returns a list of letters

:param string: A string to parse for letters
:type string: str
:param upper: The letters are returned uppercase (default is False)
:type upper: bool
:returns: A list of each letter in the string
:rtype: list
"""


### Exercise: Importance of naming and documentation

Given the following function with poor naming and no documentation, determine:
1. What are the 2 inputs
2. What does it return

In [None]:
def FUNCTION(number, words):
    Smallest = 0
    for dictionary in number:
        LETTER = (dictionary / words) * 100
        if LETTER > Smallest:
            Smallest = LETTER
    return Smallest

### Exercise: Refactoring a Function

**Refactoring** is a term that means re-writing code without changing the task it performs. 

Refactor the above function with poor naming and no documentation. You will want to:

1. Rename variables to an appropriate name
2. Write a docstring explaining what the function does using one of the example formats (Google, Numpy, reStructured)

Test your function by running it after you refactor to see if it still produces the same output.

In [None]:
# Your answer here!


## Don't Repeat Yourself - DRY

DRY is a concept in programming to avoid writing redundant code. A good sign your code is redundant would be if you copy and paste parts of it and then edit the new copies.

Suppose we had three lists of proteins and wanted to check if there are any matches to a list of proteins of interest and wrote the following code:

In [None]:
protein_data1 = ["CREG1", "ELK1", "SF1", "GATA1", "GATA3", "CREB1"]
protein_data2 = ["ATF1", "GATA1", "STAT3", "P53", "CREG1"]
protein_data3 = ["RELA", "MYC", "SF1", "CREG1", "GATA3", "ELK1"]
proteins_of_interest = ["ELK1", "MITF", "KAL1", "CREG1"]

# Are there any matches in the first list?
match_list1 = []
for protein in protein_data1:
    if protein in proteins_of_interest:
        match_list1.append(protein)

# Are there any matches in the second?
match_list2 = []
for protein in protein_data2:
    if protein in proteins_of_interest:
        match_list2.append(protein)

# Are there any matches in the third?
match_list3 = []
for protein in protein_data3:
    if protein in proteins_of_interest:
        match_list3.append(protein)

We want to do an analysis on multiple datasets. But there are a few problems with this approach:

1. You have to copy/paste this code for each list you want to compare.
2. If you make a change to the analysis, you need to edit every copy.
    - Very hard to remain consistent
3. If you want to compare to another list, you need to either:
    - Edit every copy.
    - Change the variable `proteins_of_interest`.
        - May affect your analysis somewhere else

To prevent repeating code, we can write a function to do our repeated task. 

### Exercise: Refactor an Analysis Into a Function

Write a function that does the repeated task and run it on the three lists.  

In [None]:
# Your function here!


## A Helpful Way to Format Strings

In the next sections, we will learn how to make custom error messages. Having a good way to format strings for these messages will make our job easier. Previously, we concatenated strings using the `+` operator. 

Suppose we have an integer, and we want to add it to a string:

In [None]:
# Adding an int to a string


If we directly add a string and an integer, we get a `TypeError`. One way to neatly get around this is to use what is called an **f-string**. This is a Python-specific method that takes care of formatting the string for us.

In [None]:
# Example with f strings
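
A minimal sketch comparing the two approaches. The variable names and values are made up for illustration:

```python
read_count = 42
mean_quality = 37.456

# "Reads aligned: " + read_count   # would raise a TypeError (str + int)

# With +, the integer has to be converted to a string by hand
with_plus = "Reads aligned: " + str(read_count)

# An f-string converts whatever is inside the braces for us
with_f_string = f"Reads aligned: {read_count}"

# A format specifier after a colon controls how the value is shown
rounded = f"Mean quality: {mean_quality:.1f}"

print(with_plus)
print(with_f_string)
print(rounded)
```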


## Error Handling in Python

One fundamental concept in programming is troubleshooting errors. A wide variety of errors can be reported when something goes wrong in Python. For example, look at the error message when the following code is run:

In [None]:
print(hello)

Notice that in the command `print(hello)` when `hello` is not defined, Python raises something called a `NameError`. Python has many different types of errors built in, including:

- NameError
- ZeroDivisionError
- TypeError
- IndexError
- ... and more!

These errors are also referred to as **exceptions**, meaning "exceptional events". More information on the other exceptions built into Python is here: https://docs.python.org/3/library/exceptions.html.

We can also call an error on purpose using the `raise` keyword with a custom message. This message is part of the **traceback**, which is a report Python gives on what happened.  

In [None]:
# Raising a custom error
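
One way this could look, using a hypothetical function `check_percent` that refuses values outside 0 to 100. The function name and message wording are assumptions for illustration:

```python
def check_percent(value):
    """Return value unchanged if it is a valid percentage."""
    if value < 0 or value > 100:
        # The f-string builds a custom message that appears in the traceback
        raise ValueError(f"Expected a percentage between 0 and 100, got {value}")
    return value


print(check_percent(20))

# check_percent(120) would stop the program with a ValueError
# carrying the custom message above.
```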


For errors we get without directly calling `raise`, both the class of `Error` and the message are predetermined. This is useful in some situations, but less so in others. Suppose you wrote a reverse complement function:

...and suppose someone imports this function from a script we wrote (which we will do later today). Maybe they will run our script with a sequence that contains lowercase letters (indicating masked genomic regions):

In [None]:
reverse_complement("CACGtgcatggTGAAA")

For a user, this is a really confusing error. Instead of returning the reverse complement, the function raises an error saying `KeyError: 'g'`.

### Try and Except Keywords

One way we can pre-empt errors is through `try` and `except`. A `try-except` code block runs the code under `try`, and if there is a particular error we think would happen, we can write a custom response under `except`.



In [None]:
# Try-Except demonstration


The `try-except` block works as follows:
1. The code under `try` is run.
2. If no exception occurs after running all of the code under `try`, the code under `except` is skipped.
3. If an exception occurs, *and* it matches the one in the `except` line, then any remaining code under `try` is skipped and the code under `except` is now run.
4. If an exception occurs, but it *doesn't* match the exception we gave it, this exception is **unhandled** and the program stops.
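
The steps above can be traced in a short sketch. The list `messages` and the name `undefined_variable` are illustrative only; the list simply records which lines actually ran:

```python
messages = []

try:
    messages.append("before the error")
    print(undefined_variable)          # raises NameError: the name was never defined
    messages.append("never reached")   # skipped, because the line above failed
except NameError:
    messages.append("caught a NameError")

# Code after the whole try-except block still runs
messages.append("after the try-except block")

for message in messages:
    print(message)
```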

Note that, even though we **caught** the `NameError`, the program actually keeps going. This is because the `try-except` block only **controls the flow** of the program by specifying what code should be run in the case an error is encountered; it doesn't `raise` an exception by itself. We can visualize this by adding a `print()` statement after the whole `try-except` block:

In [None]:
# Example with print() before and after - continue from previous block

If we want the program to stop, we can use `raise` within the code run under `except`:

In [None]:
# Example with raise added within the except code - continue from previous block
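
A small sketch of re-raising, assuming a hypothetical helper `parse_phred` that converts text to an integer. A bare `raise` inside `except` passes the original exception along, so the program stops and the full traceback is shown:

```python
def parse_phred(text):
    """Convert text to an integer PHRED score."""
    try:
        return int(text)
    except ValueError:
        print(f"Could not read '{text}' as a PHRED score")
        raise  # re-raise the same ValueError so the program stops here


print(parse_phred("40"))

# parse_phred("forty") would print the custom message and then
# stop with the original ValueError.
```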

So, why go through the trouble of **catching exceptions**?

1. If our program receives input that causes an error, we want the program to **fail fast**: don't do anything else, and immediately notify the user.
2. We want to avoid returning **incorrect or unexpected** results.

### Printing Error Messages

Before we use our new keywords, let's go over the output for errors. When we print something in Jupyter or in the command line, by default the text goes to a destination called **standard output**, or `stdout`. Errors and warnings go to a separate destination called **standard error**, or `stderr`. It is dangerous to print error messages to `stdout` because some workflows use it for data and the error messages can get mixed in (e.g., when using pipes `|` on the command line).

We can set a destination for our `print()` command using the `file` argument. The default destination is `sys.stdout`. To send to `sys.stderr`, we will need to import the `sys` module.

### A brief digression on modules
* `sys` is from the **standard library**, a collection of modules that ship with Python but must be imported before you can use them.
* Libraries are incredibly useful: there are libraries for working with numeric and scientific data, generating plots, fetching data from the web, working with image and document files, databases, and more.
* The full documentation for `sys` is part of the general Python documentation: https://docs.python.org/3/library/sys.html.

In [None]:
# Help for print


In [None]:
# Import sys, compare stdout vs stderr


In Jupyter, `stderr` has a red background and `stdout` has no background color.

**Note**: The **traceback** messages shown when an error occurs are going to `stderr`. In Jupyter, they are formatted differently than other output to `stderr`. 

Back to the observation that `reverse_complement` can give cryptic error messages: let's add a `try` statement and place the `for` loop in it. We can then use `except` with our expected `KeyError` to print a different message and `raise` to both **fail fast** and show the full **traceback**.

In [None]:
# Edit the function below!
def reverse_complement(dna_sequence):
    """Returns the reverse complement of a DNA sequence"""
    complements = {"T":"A", "A":"T", "C":"G", "G":"C"}
    reverse = dna_sequence[::-1]
    result = ""
    # Add try - except - raise statements
    for letter in reverse:
        result = result + complements[letter]
    return(result)

help(reverse_complement)
print(reverse_complement("CAAg"))

Note that `Error`s automatically output to `sys.stderr`.

Alternatively, we can check the input for the dictionary and produce an error ourselves. 

In [None]:
# Edit the function below!
def reverse_complement(dna_sequence):
    """Returns the reverse complement of a DNA sequence"""
    complements = {"T":"A", "A":"T", "C":"G", "G":"C"}
    reverse = dna_sequence[::-1]
    result = ""
    for letter in reverse:
        # Check that letter is valid, if not raise an Error.
        result = result + complements[letter]
    return(result)

print(reverse_complement("CAAg"))

### Some notes on `if-else` versus `try-except` for errors

We just demonstrated two different ways to handle potential errors: `if-else` blocks and `try-except` blocks. Both of them protected against invalid input, so what's the difference?
* In the `if-else` block, we checked the value of `letter` *before* trying to use it in the `complements` dictionary. In the `try-except` block, we immediately called `complements[letter]`, which raised an error.
* If you already know what your inputs should or shouldn't contain, it's *probably* a good idea to code in that information beforehand through an `if-else`.
* `try-except` immediately runs code and stops when an error happens. This means the things done in `try` aren't reversed when you hit an error.
* If you don't know *how exactly* something might break, but know it *can* happen, use a `try-except`. For example, catch the `KeyError` raised for keys that don't exist in your dictionary when you can't guess which keys the user might try to use.
* For some more discussion on this, see this StackOverflow forum post: https://stackoverflow.com/questions/7604636/better-to-try-something-and-catch-the-exception-or-test-if-its-possible-first.
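
To make the comparison concrete, here is a sketch of both styles applied to a single-letter lookup. The function names and the error message are assumptions for illustration, not the only way to write either version:

```python
COMPLEMENTS = {"T": "A", "A": "T", "C": "G", "G": "C"}


def complement_with_check(letter):
    """Check the input first (if-else), then look it up."""
    if letter not in COMPLEMENTS:
        raise ValueError(f"Invalid nucleotide: '{letter}'")
    return COMPLEMENTS[letter]


def complement_with_try(letter):
    """Try the lookup first, and respond to the KeyError if it fails."""
    try:
        return COMPLEMENTS[letter]
    except KeyError:
        raise ValueError(f"Invalid nucleotide: '{letter}'")


print(complement_with_check("A"))
print(complement_with_try("G"))
```

Both versions turn the cryptic `KeyError: 'g'` into a `ValueError` that names the problem; they differ only in whether the check happens before or after the lookup is attempted.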

### Application: Sanitizing Input

Another problem is when programs produce incorrect results instead of producing an error. Suppose we have a function that prints all k-mers of a given *k* from a sequence:

In [None]:
# Function to return all k-mers from a sequence


In [None]:
kmers_from_sequence("CACGTGACTAG", 3)
print("After the function")

In [None]:
kmers_from_sequence("CACGTGACTAG", -3)
print("After the function")
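
To see why the negative value is a problem, here is a sketch of a variant that collects the k-mers in a list instead of printing them. It assumes the same slicing logic as the `kmers_from_sequence` function in the exercise further down; `kmers_as_list` is a hypothetical name:

```python
def kmers_as_list(dna_sequence, k):
    """Return the k-mers as a list (the lesson's version prints them)."""
    positions = len(dna_sequence) - k + 1
    kmers = []
    for i in range(positions):
        kmers.append(dna_sequence[i:i + k])
    return kmers


good = kmers_as_list("CACGTGACTAG", 3)
bad = kmers_as_list("CACGTGACTAG", -3)

print(len(good), good[0])
# No error is raised for k = -3, but the slices are not 3-mers:
# a negative end index counts from the end of the string.
print(len(bad), bad[0])
```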

We can **sanitize** the inputs to solve this. The value, *k*, should be a number greater than 0 and not longer than the length of the sequence.

Refactor the following function to check that the value of `k` is both:
- A positive number
- Not longer than the length of `dna_sequence`

If there is a problem, `raise` a `ValueError` with an appropriate message. 

In [None]:
# Edit the function below
def kmers_from_sequence(dna_sequence, k):
    # Write code to check input here!
    
    positions = len(dna_sequence) - k + 1
    for i in range(positions):
        kmer = dna_sequence[i:i + k]
        print(kmer)

In [None]:
kmers_from_sequence("CAATCGACGTA", 12)  # Should raise a ValueError

## Syntactical shortcut: Separate code with line breaks

If you have lines that are long and hard to read, putting in line breaks can help. In Python, you can have line breaks inside parentheses. Let's demonstrate this on some code we wrote yesterday:

In [None]:
# import data from yesterday
import pandas as pd
gapminder = pd.read_table("gapminderDataFiveYear_superDirty.txt", sep = "\t")
gapminder['region'] = gapminder['region'].astype(str)

# Method 1 for formatting the 'region' column:
gapminder['region'] = gapminder['region'].str.lstrip() # Strip white space on left
gapminder['region'] = gapminder['region'].str.rstrip() # Strip white space on right
gapminder['region'] = gapminder['region'].str.lower() # Convert to lowercase

# Method 2 for formatting the 'region' column:
gapminder['region'] = gapminder['region'].str.lstrip().str.rstrip().str.lower() # Strip white space on left and right, and convert to lowercase

print(gapminder['region'])

There are three different transformations happening above: removing whitespace on the left, removing whitespace on the right, and converting the text to lowercase. We can make this one line more intuitive by breaking it up into three:

In [None]:
# New method of chaining functions


We get the same output as above! This code is functionally the same as methods 1 and 2. We benefit from explicitly delineating each step like in method 1, and we also get the nicer syntax of applying all cleaning steps at the same time with method 2.
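
The same parenthesized layout works for any chain of method calls. A minimal sketch with a plain string standing in for one value of the `region` column (the example value is made up); on the DataFrame column, each step would use the `.str` accessor as in method 2:

```python
raw_region = "   Asia_China  "

# Inside parentheses, Python lets us break the chain across lines
cleaned_region = (
    raw_region
    .lstrip()   # strip white space on left
    .rstrip()   # strip white space on right
    .lower()    # convert to lowercase
)

print(cleaned_region)
```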

## Outlining your code - Pseudocode

Thinking about what you want your future code to do *before* coding anything reduces the time you spend physically coding. It forces you to think about the big pieces that go into solving your problem and how they'll fit together, revealing potential problems much earlier. Let's take an example from day 1 to illustrate:

In [None]:
percent = 20
if percent < 39:
    print('Low')
elif percent < 47:
    print('Normal')
else:
    print('High')

To make this code more relevant for us, let's introduce a real biological variable: *hematocrit*. Hematocrit is the percentage (by volume) of red blood cells in blood. The normal values for humans are:
* Male: 41% - 50%
* Female: 36% - 44%
* Average: 39% - 47%  -  these percentages used in the code above

Now let's go over the logic of our previous code:
1. We first check if the percent is less than 39: if so, then label as "Low".
2. We then check if the percent is less than 47: if so, then label as "Normal".
3. Otherwise, label as "High".

This code seems straightforward at first glance, but there's a conceptual oversight: the second case here isn't strictly correct. Values less than 47% are normal *only if* the value isn't already less than 39%. The logic presented in this code works by virtue of checking the 39% case before the 47% case. Suppose we unintentionally coded in case 2 before case 1: now we first check for values less than 47, and any values that fulfill that condition are "Normal", even if they're *also* less than 39. This is an easy and common accident that can lead to erroneous results.

It's more accurate to explicitly state the values that fall under each category:
* Values between 0% - 38% are "Low"
* Values between 39% - 47% are "Normal"
* Values between 48% - 100% are "High"

How would we rewrite the previous logic to follow this biological meaning?

This is better! We've done a few defensive programming concepts here:
* Written up **pseudo-code**: this is not actual code, but an outline of how you want your code to be structured.
* Sanitized our input and guarded against a potential error.
* More explicitly stated the biological meaning of our code, making it easier for others to follow.
* Defended against the concept of a "wrong order" for our if/else statements - now it doesn't matter how the three conditions are ordered.

Now, re-write the actual code to follow this new logic:

In [None]:
# New code here!


# Working with files

Real-world data will typically be in a file, **not** in your code. Python has functions (and libraries) to read files in various formats, including text, image, and binary formats. Sequence data is typically stored in plain text files, like FASTA. Let's walk through reading one of those.

## Open and close

When you use a word processor or spreadsheet, you open files, work with them, and then close them when you're done. In Python, you do the same thing.

In [None]:
# Opening and closing a file


Let's go through the steps we just did:
1. We used the `open()` function on a string that represents a path to a file.
    * The result of that function was saved to the variable `f`. This value is called a *file object*.
2. We wrote a `for` loop. When you write a `for` loop for a file object, each time a loop happens, one line in the file is read and stored in the loop variable `line`.
3. We printed the loop variable `line` for each loop.
4. We used the `close()` function to close the file.
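
The four steps above can be sketched as follows. So that the example runs anywhere, it first writes a tiny made-up FASTA file to a temporary location; with the lesson data you would pass the path to your own file to `open()` instead:

```python
import os
import tempfile

# Set-up only: create a small two-entry FASTA file to read
tmp = tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False)
tmp.write(">seq1\nACGT\n>seq2\nTTGA\n")
tmp.close()
example_path = tmp.name

f = open(example_path)        # 1. open the file and keep the file object
lines_read = []
for line in f:                # 2. each loop reads one line into `line`
    print(line)               # 3. print the loop variable
    lines_read.append(line)
f.close()                     # 4. close the file when done

os.remove(example_path)       # clean up the temporary file
```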

Text files are generally read line-by-line in a loop like we just did with this FASTA file. Each line of text from the file is set into our loop variable, at which point we're free to do whatever we want with the loop variable.

## Reading lines

Notice the blank line in between each line from the file. Text files and programs indicate that there is a new line using a special newline character, `\n`. In our previous example, each line in the file includes the `\n` newline character at the end.

Let's visualize this by adding each line to a single string, `all_lines`, and compare printing the string versus the raw data.

In [None]:
# Visualizing newline characters


If desired, we can strip off the newline character with the `.strip()` method:

In [None]:
# Use strip() to remove newlines


Let's try this with our original code to print each line in a file:

In [None]:
# Modifying our original code to remove newlines


In general, when reading text files, check that you're not accidentally including extra `\n` characters (or excluding them when you need to separate lines).
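
A short sketch of the difference, using a made-up list of lines in place of a real file. `repr()` shows the raw string, including the otherwise invisible `\n`:

```python
lines = ["ACGT\n", "TTGA\n"]

with_newlines = ""
without_newlines = ""
for line in lines:
    with_newlines = with_newlines + line
    without_newlines = without_newlines + line.strip()

print(repr(with_newlines))      # newlines are still embedded in the string
print(repr(without_newlines))   # one continuous sequence
```

Note that `.strip()` removes all leading and trailing whitespace, not only the newline; `.rstrip("\n")` removes just the trailing newline.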

## Exercise: Processing a FASTA file

Let's put some concepts together: write a function called `read_fasta` that takes a filename and returns a `list` of all the DNA sequences in the file.
* Each element in the list should correspond to one FASTA entry.
* Make sure to remove the header rows and only include the DNA sequences.
* Test your code with the `ls_orchid.fasta` file.

Some hints:
* Look at `ls_orchid.fasta` before coding to know how your input data is formatted.
* There are multiple FASTA sequences in this file - you'll need to know when to stop reading the previous sequence and start reading the next one.
* Remember to strip the newlines off each FASTA sequence.
* It's very helpful to write pseudocode first, to go through the logic of your function before writing it.

In [None]:
# Your code here!


In [None]:
print(read_fasta('ls_orchid.fasta'))

# Scripts

We've done a good job of organizing our code into functions, but we've only been running them from this notebook. Suppose you want to run this program outside of Jupyter Notebook (e.g. on the command line) or give it to someone else to run: we want to nicely package this script in a way that makes it easy for the next person to run. We wouldn't just want to hand them this entire Jupyter Notebook and say "the right function's in there somewhere, just keep looking" or "yeah, you just need to edit the arguments to use your own files"!

Let's take our `read_fasta` function and put it in a script that reads the `ls_orchid.fasta` file and prints its contents:

Notice that the first line contains a `%%` operator followed by the command `writefile` and a file name. This operator is specific to Jupyter notebooks, called a "Cell Magic Command", and copies the code written in a cell into a file.

Our script reads the `ls_orchid.fasta` file (and only the `ls_orchid.fasta` file) every time it's run, an example of **hard-coding** an argument. Most programs let you specify the arguments you want to use, as opposed to being inflexible and pre-determined every time you run it. Let's change our script again to achieve this goal.

In Python, our program can get these arguments, but we have to load the `sys` module. Remember from earlier when we imported `sys` to call `sys.stderr`? Well, `sys` can also grab arguments from the command line. Let's change our `read_fasta` function slightly to allow it to read in *any* file by using the `sys` module:

In [None]:
# Edit function above!

But what happens if we don't have an input file name? According to the `sys` documentation, `sys.argv` is a list where the first item `sys.argv[0]` is the name of the script by default, and any additional items in the list are the command line arguments. If no arguments were passed, `sys.argv` should be a list with exactly one element, the script name. Let's add some more code to our script to check whether the user passed in a filename:

In [None]:
# Edit function above!

Here, we use yet another function of `sys`, `sys.exit`, which lets us stop a script or program before the rest of it runs.
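
A sketch of what such a check might look like. The helper `filename_from_arguments` is a hypothetical name; it takes the argument list as a parameter so it can be tried in a notebook, whereas in the script you would pass it `sys.argv`:

```python
import sys


def filename_from_arguments(arguments):
    """Return the first command-line argument, or stop if it is missing."""
    if len(arguments) < 2:
        # Passing a string to sys.exit() prints it to stderr
        # and ends the program with a non-zero exit status.
        sys.exit(f"Usage: python {arguments[0]} <fasta_file>")
    return arguments[1]


print(filename_from_arguments(["cool_functions.py", "ls_orchid.fasta"]))
```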

## Importing your own scripts

So far, we have used modules to help us work on our analyses such as:
- Standard Library
    - `sys`
- Third Party
    - `pandas`
    - `numpy`
    - `matplotlib`

These are imported using the `import` keyword, which lets us use their functions. We also write functions for use in our own code. Having our own functions available to import into other scripts gives the benefit of:
1. Letting us reuse code over multiple analyses (DRY).
2. Letting others use our code in their own scripts without copy/pasting (DRY).

While it may seem like going out of one's way to write a module for analysis, you can have the same Python file work in two different ways:
1. A module that's imported into another Python notebook or script
2. A standalone script that's run from the command line

Let's demonstrate this by making a new Jupyter notebook file: `New -> Python3 Notebook`. In this notebook:

In [None]:
import cool_functions

Why is this giving an error when we import it into our Jupyter notebook? This is because we're using `sys.argv` to read arguments from the command line, but we're *not* on the command line right now, so there are no arguments/files to read in. We're attempting to do method #1 (import as a module within another Python file) with code that's meant for method #2 (run as a standalone script on the command line).

So, how do we separate the `print` statement containing `sys.argv` such that it's only run when the script is run from the command line?

Answer: Add this `if` statement before the `print` statement: `if __name__ == "__main__":`.

In [None]:
# Edit function above!

Go back to your new Jupyter notebook file and re-run the `import` cell - it should no longer throw an error.
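
A minimal sketch of a file that works both ways. Everything here is hypothetical: `sequence_length` stands in for `read_fasta`, and `example_module.py` is an assumed file name:

```python
import sys


def sequence_length(dna_sequence):
    """Return the number of bases in a sequence (stand-in for read_fasta)."""
    return len(dna_sequence)


def main(arguments):
    """Code that should only run when the file is used as a script."""
    if len(arguments) < 2:
        print("Usage: python example_module.py <sequence>", file=sys.stderr)
        return 1
    print(sequence_length(arguments[1]))
    return 0


# __name__ is "__main__" only when the file is run directly,
# e.g. `python example_module.py ACGT`. When the file is imported,
# __name__ is the module name instead, so main() is skipped.
if __name__ == "__main__":
    main(sys.argv)
```

With this layout, `import example_module` makes `sequence_length` available without touching `sys.argv`.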

## Exercise: Adding new functions to your module

In your new notebook, you've now imported the `cool_functions.py` file, which contains the `read_fasta` function that returns a list of DNA sequences from a fasta file. Building off this module, how would you get a count of unique dinucleotides (AA, AT, AC, AG, TA, etc.) for each FASTA sequence? Write out some pseudocode first, then implement it.
* Hint: you can loop over a string using `for letter in string`, giving you the actual letter.
* Another hint: you can loop over the positions in a string using `for position in range(len(string))`, giving you the integer position of each letter.

In [None]:
# Put your answer here!


# Automation: Combining Python and Bash

Automation saves both time and effort in large quantities. You save your own time by having the computer automatically loop over data files, performing multiple analyses in parallel, and generating output files and visualizations. By saving these directions in the form of code, you also save time by making your work documented and reproducible, saving you the effort of re-doing them manually should the need arise.

Suppose you're given a few *hundred* fasta files you need to concatenate. You could manually type them all into a list in your Jupyter notebook and run it, or you could have `Bash` automate the `Python` script for you!

First, a technical check: in your Git Bash or Terminal window, run `python --version`. Hopefully, you get some info about the version of Python you're running.

Recall our bash lesson on the first day:

Using the `python` command in the terminal, we can also run Python files without needing to open up Jupyter notebook! You use this `python` command just like any other bash command:

In [None]:
python cool_functions.py ls_orchid.fasta

And we can use it in a for loop in Bash, just like any other command:

You should get a lot of text outputted to your terminal. Better yet, let's redirect it to another file to save for later:

You may have seen the `>` bash symbol before, which means to redirect the output of a command into another file. Here, we use `>>`: it is similar in that it also redirects output to a file, but it *appends* the result to whatever is already in the file, instead of overwriting the whole file with the new results.

Now we can open up and check `output_fastas.txt`. There's one line for every file processed.

## Automating the Bash command

Okay, time for one more level of automation: it's great that we can get Bash to loop over our Python script and our input files, but what if we don't want to type out the `for` loop in bash every single time we run an analysis? We can store that command in a Bash script, similar to how we stored our Python code in a Python script.

In either `nano` or `vim`, copy and paste the above code into a new file called `script.sh`. Remember that we can create a new file by directly typing `nano script.sh` or `vim script.sh` on the command line. Once you've created your `script.sh` file, run it on the command line with:

It should work as intended (i.e. not output anything to the terminal), and you can open up your `output_fastas.txt` file to see the results.

## Exercise: Optimizing your script

How would you modify this `script.sh` so that it empties the contents of `output_fastas.txt` before running your program? Hint: there are multiple bash commands you could potentially use.

## Automating your time: Walking away from your computer

Congratulations, you now have a `Bash` script that automates a `Python` script over however many fasta files you have! What if you were dealing with gigabytes or even terabytes of FASTA files? You'd be waiting forever for your script to finish, but you probably don't want to sit in front of your computer that long. What can you do?

* `nohup`: Stands for `no hangup` - even if you close your bash terminal, the script will continue to run in the background of your computer. Just make sure you don't shut down your computer before it finishes! On a computing cluster, this isn't really a problem since compute clusters generally stay online 24/7.
  * The Duke Compute Cluster (DCC) is an example of a computing cluster: https://oit-rc.pages.oit.duke.edu/rcsupportdocs/dcc/. It provides free computational time and storage (specific amounts will vary) to Duke affiliates. Ask your PI for access if you're interested!
* `&`: Run this program in the background of your terminal. This frees up your terminal so you can work on other things and run more commands. Note that this *does not* keep it running if the terminal is closed; you'll still need `nohup` for that.

TL;DR: `&` *frees up* your terminal so you can do other things, `nohup` *lets you close* your terminal while still running your script.