### defensive programming

In [1]:
#run train
#add an extra .brake() - what's going on here?

#### print statements:
How do we figure out why our code doesn't work? We have some expectation of what values our variables hold, and they don't hold those values. We need to see what's in the variables!
The simplest method is just print(x). This is honestly very good!! Obviously don't leave print statements in your final code, but honestly I use a lot of them. The alternative is the logging module, but I prefer print statements: they're much faster to write, I don't mind having my output spammed with stuff, and going into another file, with potentially one log file per Python file, is a lot of clutter.
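
A minimal sketch of print debugging, assuming Python 3.8+ for the `f"{name=}"` shorthand, which prints both the variable name and its value. The `running_mean` function is a made-up example, not part of the train code.

```python
def running_mean(values):
    """Hypothetical function we want to inspect while it runs."""
    total = 0.0
    for i, v in enumerate(values):
        total += v
        # f"{x=}" prints the expression and its value, e.g. "total=2.0"
        print(f"{i=} {v=} {total=}")
    return total / len(values)


mean = running_mean([2.0, 4.0, 9.0])
print(f"{mean=}")
```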

#### intro to debugging
PyCharm's visual debugger is extremely useful!! show operation. set a breakpoint, code will stop running there. then you get:
- evaluating expressions. good to see what the values of operations that break your program will be, check array operations, etc.
- view numpy arrays!!! extremely useful
- program execution control - step over = executes line by line without descending into function calls, step into = goes into function calls, step into (my code) is the same but ignores library code, step out = runs until the current function returns and goes up a level. run to cursor - "mobile breakpoint"
- conditional breakpoints!! I learned about these when making this, but god these seem very useful - you can have the breakpoint trigger only when something is met, so if there's a loop that breaks towards the end, make it break when the index variable is 99% done, etc.
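
If you are not in PyCharm, a rough equivalent of a conditional breakpoint can be sketched with the built-in `breakpoint()` function (Python 3.7+), which opens `pdb` by default. The `process` function, the `DEBUG` flag, and the 99% threshold are assumptions for illustration.

```python
DEBUG = False  # flip to True to actually drop into the debugger


def process(i):
    """Hypothetical stand-in for one iteration of real work."""
    return i * i


n_steps = 1000
results = []
for i in range(n_steps):
    # Same condition you would type into a conditional breakpoint:
    # only stop during the last 1% of iterations.
    if DEBUG and i >= 0.99 * n_steps:
        breakpoint()
    results.append(process(i))
```

The condition `i >= 0.99 * n_steps` is what you would enter in the breakpoint's condition field in a visual debugger.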

In [2]:
#add print statements after accelerate and brakes
#debug the thing
#mention conditional debugger

#### asserts:
Print statements are nice for when things go wrong, but how do we prevent errors before they happen? Asserts!
https://blog.regehr.org/archives/1091 is the best philosophical resource about asserts I've come across. 
Key points: "An assertion is a Boolean expression at a specific point in a program which will be true unless there is a bug in the program."
Basically, they're a way to reassure yourself that things are as they should be. Sanity checks. I think two main types of asserts are useful in research computing:
1. math stuff - if variables, operations, etc. are mathematically constrained, assert that this is the case! e.g. asserting probabilities sum to 1. 
2. preconditions - at the top of functions, make sure that arrays that will be multiplied have compatible shapes, etc. Python doesn't enforce types at runtime, so this is a useful equivalent. Don't literally use this to check types though!

Asserts should be pretty sparing - the blog above says empirically 1 in 70 lines of code.
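
A minimal sketch of both kinds of assert, using plain lists so it runs without extra packages. The function names `normalize` and `dot` are made up for illustration.

```python
import math


def normalize(weights):
    """Turn non-negative weights into probabilities."""
    # Preconditions: checked at the top of the function.
    assert len(weights) > 0, "need at least one weight"
    assert all(w >= 0 for w in weights), "weights must be non-negative"
    total = sum(weights)
    assert total > 0, "weights must not all be zero"

    probs = [w / total for w in weights]

    # Math constraint: probabilities must sum to 1 (up to float error).
    assert math.isclose(sum(probs), 1.0), f"probabilities sum to {sum(probs)}"
    return probs


def dot(a, b):
    """Dot product of two equal-length sequences."""
    # Precondition: the plain-list analogue of a shape check.
    assert len(a) == len(b), f"length mismatch: {len(a)} vs {len(b)}"
    return sum(x * y for x, y in zip(a, b))


probs = normalize([1, 2, 1])
print(probs)
print(dot([1, 2, 3], [4, 5, 6]))
```

Note that Python skips `assert` statements entirely when run with the `-O` flag, so they should guard against bugs, not handle bad user input.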

In [3]:
#add an assert to the odometer - maybe someone adds something to let the train drive backwards. odometer still shouldn't be negative.
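
One way the odometer assert could look. This `Train` class is a hypothetical stand-in (the real class from the session is not reproduced in these notes), and `advance` is an invented method name.

```python
class Train:
    """Hypothetical train; attribute and method names are assumptions."""

    def __init__(self):
        self.speed = 0.0     # negative speed means driving backwards
        self.odometer = 0.0  # total distance travelled, never negative

    def accelerate(self, delta):
        self.speed += delta

    def advance(self, dt):
        # Distance travelled counts the same in either direction.
        self.odometer += abs(self.speed) * dt
        # Invariant: if someone later "simplifies" the line above to
        # self.speed * dt, this assert catches it on the first reverse trip.
        assert self.odometer >= 0, f"negative odometer: {self.odometer}"


train = Train()
train.accelerate(-5.0)  # drive backwards
train.advance(2.0)
print(train.odometer)
```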

## Coding for not just you: reproducibility and accessibility
### argparse
sometimes, you cannot run your code inside PyCharm, but must run it from the command line. The two primary instances of this are:
- when publishing, people often want a command line tool. idrk why but they do.
- for cluster work it's somewhere between much easier and the only way to get jobs to run.

You can use sys.argv and make your command something like `python science.py 4 10 "linear" 500 1e4 "fast" 8`, where nobody (including you in six months) remembers what each position means.
Or use argparse! Python's built-in library for, unsurprisingly, parsing command line arguments.

How argparse works: 3 easy steps.
    
1. set up ArgumentParser

In [None]:
import argparse
parser = argparse.ArgumentParser()

2. add arguments

In [None]:
parser.add_argument("filename", type=argparse.FileType("rb"), help="input file")
parser.add_argument("-n", "--number", type=int, default=5, help="an optional integer")
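# default=[] so that looping over args.print_this_stuff works when the flag is omitted (otherwise it is None)
parser.add_argument("--print_this_stuff", nargs="*", default=[], help="prints all the extra args you put in")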

3. parse args

In [None]:
args = parser.parse_args()

#args now has a variable for each argument:
print(args.filename)
print(args.number)
for val in args.print_this_stuff:
    print(val)

#also the documentation is built in! run the script with -h to see it

Because it is a well-written module, argparse can handle pretty much whatever you might want out of your inputs: different types, required/optional arguments, and mutually exclusive groups (e.g. "verbose" mode vs. "quiet" mode). For the latter, use `group = parser.add_mutually_exclusive_group()` and then `group.add_argument()`.
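
A short sketch of a mutually exclusive group plus a required option. The flag names and the `--iterations` option are invented for illustration. Passing a list to `parse_args` lets this run inside a notebook; in a real script you would call `parser.parse_args()` with no arguments.

```python
import argparse

parser = argparse.ArgumentParser(description="toy example of argument groups")

# At most one of these two flags may be given.
group = parser.add_mutually_exclusive_group()
group.add_argument("-v", "--verbose", action="store_true", help="print extra output")
group.add_argument("-q", "--quiet", action="store_true", help="print nothing")

# A required option with a type conversion.
parser.add_argument("--iterations", type=int, required=True, help="number of iterations")

args = parser.parse_args(["--verbose", "--iterations", "100"])
print(args.verbose, args.quiet, args.iterations)
```

Giving both `-v` and `-q` makes argparse print an error message and exit, with no extra code on your side.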

### a brief introduction to environments

Python is highly dependent on packages (numpy, scipy, matplotlib, keras, scikit-learn, scikit-allel, whatever actual biologists use...). That's great! However, we need to keep track of all of these packages. Otherwise there are some issues, such as:
- how do people know what packages your project uses? Or what version?
- What if you were to HYPOTHETICALLY update all of your packages because you're publishing your paper and then PURELY HYPOTHETICALLY nothing is compatible anymore and your code doesn't work?

Environments are not the solution to these questions but they are the key component to all solutions.

Each environment is basically a fresh version of Python. You can put in just the set of packages you need to do something and no more. And then you can start another environment for a different project, etc. How do we set up environments? There are 3 basic commands that are used and that's pretty much it.

1. `conda create --name <env>` (optionally followed by `python=3.x` and any packages you want)
2. `conda activate <env>`
3. `conda deactivate`

From there, install packages you need using the command `conda install package-name`. That's pretty much it!

### my transformation into a software engineer

so for my first project I didn't use git or anything. No environments. No code testing. Nothing. All of my code was on my UChicago Box (still recommend, you can get more granular version history than git). Then we had to upload it like good scientists and it was a disaster. 

So: **maintain a GitHub repository with a pyproject.toml and occasionally push changes to it**. Do this more frequently if you have collaborators. More advanced GitHub stuff below, but that's not needed for every project. See https://github.com/steinrue/EMSel for a fairly simple pyproject.toml you can copy.

Also, I don't actually know how to use git properly. Like, doing commits on the command line and stuff. Instead, I use GitHub Desktop! Highly recommend a GUI for modern git usage.

#### my transformation into a software engineer II: electric snakemake

automated workflow creation! no more running 10 python scripts in a row oh wait I forgot I updated the 6th one now I have to rerun everything again oh wait I ran these two out of order so it didn't run on the latest dataset time to rerun everything etc

99% of rules for snakemake look like this:

In [None]:
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

A snakemake workflow consists of multiple rules that interact. Snakemake infers the relations between your rules and builds a DAG of dependencies:
![snakemake-dag](https://snakemake.readthedocs.io/en/stable/_images/dag_call.png)

Support for clusters (with special SLURM support!) is built in as well. Snakemake 8 made everything seem super complicated (what is "snakemake-executor-plugin-cluster-generic"?) but https://github.com/jdblischak/smk-simple-slurm is an **incredible** resource.

lastly, **INPUT AS JSON OUTPUT AS CSV/PDF/DATAFRAME**
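
One way to read this rule (an assumption, the notes don't spell it out): each script takes its parameters from a JSON file and writes its results to a plain CSV, so every workflow step has explicit input and output files. A minimal sketch using only the standard library. The `run_analysis` function and the `start`/`stop` parameters are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def run_analysis(config_path, output_path):
    """Read parameters from JSON, compute something, write rows to CSV."""
    config = json.loads(Path(config_path).read_text())
    rows = [
        {"n": n, "n_squared": n ** 2}
        for n in range(config["start"], config["stop"])
    ]
    # newline="" is what the csv module expects for files it writes.
    with open(output_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["n", "n_squared"])
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    output_path = Path(tmp) / "results.csv"
    config_path.write_text(json.dumps({"start": 1, "stop": 4}))
    rows = run_analysis(config_path, output_path)
    csv_lines = output_path.read_text().splitlines()

print(csv_lines)
```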

### Optional optimization 1: a better GitHub setup

All of my setup is heavily adapted from the project setup cookiecutter at https://learn.scientific-python.org/development/

GitHub Actions! Environment specification! Tests! Linters! All of this is by far the easiest to set up at the beginning of a project (or grad school) so let's go through it now.

### Optional optimization 2: tests

Tests are good! Tests are a little bit harder in a research context than a pure software development context. "Test-driven development" doesn't make sense - we are driven by research questions. Instead, I use tests to make sure that core computational functionality doesn't change, more or less.
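
A sketch of what such a test might look like, written as pytest-style functions with plain asserts. The `selection_step` function is a made-up stand-in for "core computational functionality", and the pinned value 11/21 is just this toy formula evaluated by hand at `freq=0.5, s=0.1`.

```python
import math


def selection_step(freq, s):
    """Hypothetical core computation: one generation of selection."""
    return freq * (1 + s) / (1 + s * freq)


def test_selection_step_matches_known_value():
    # Pin the output for one input; if a refactor changes it, we find out.
    assert math.isclose(selection_step(0.5, 0.1), 11 / 21, rel_tol=1e-12)


def test_selection_step_neutral_is_identity():
    # With no selection the frequency must not move.
    assert selection_step(0.3, 0.0) == 0.3


# pytest would discover and run these automatically; calling them directly
# also works in a notebook cell.
test_selection_step_matches_known_value()
test_selection_step_neutral_is_identity()
```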

### Optional optimization 3: containers

How do you make sure your code runs the same everywhere? Simply package a little operating system environment around your code so that it always runs in the same setting! This is called "containerization". The most common containerization platform is Docker. It is not too hard to set up but almost certainly overkill for research.