## Lecture 2 - Numpy and defensive programming

In lecture 1, we learned why you should write Python, why you should write your Python in PyCharm, and several cool features of base Python.

- f-strings are the easiest way to insert data into a string
- save your data via pickling dictionaries
- list comprehensions are very pythonic
- walrus operator good
- uhhhhhh decorators? kinda?

### Intro to Numpy

Numpy is built around the **ndarray**, which you can think of as a matrix of arbitrary dimension.

![ndarray instantiation](https://numpy.org/doc/stable/_images/np_array.png)

In [None]:
import numpy as np

Most common ways of instantiating numpy arrays:
- np.array(list)
- np.zeros(shape), where shape is a tuple such as (3, 4)
- np.arange(start, stop, step)
    
Properties of arrays:
- array.shape: a tuple giving the size of the array along each dimension. E.g. generate_binary_numbers(5).shape = (32, 5)
- array.ndim: the number of dimensions, i.e. len(array.shape).
- array.size: the total number of elements, i.e. np.prod(array.shape)
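
A quick sketch of these constructors and properties in action. The variable names are made up for illustration:

```python
import numpy as np

from_list = np.array([[1, 2, 3], [4, 5, 6]])  # 2x3 array from a nested list
zeros = np.zeros((2, 3, 4))                    # the shape is passed as a tuple
evens = np.arange(0, 10, 2)                    # start, stop (exclusive), step

print(from_list.shape, from_list.ndim, from_list.size)  # (2, 3) 2 6
print(zeros.shape, zeros.ndim, zeros.size)              # (2, 3, 4) 3 24
print(evens)                                            # [0 2 4 6 8]
```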

#### Accessing elements of arrays: array slicing

Basic slicing works the same way as on Python lists, except that each dimension gets its own slice, separated by commas. For example:

In [None]:
a = np.arange(25).reshape(5,5)
print(a)

In [None]:
a[1:4, 0:2]

In [None]:
a[::2, 3]

###### mini-exercises (3 hands)

What is the output of a[3:, :3]?

How about a[2:4:-1, 1::2]?

#### advanced indexing

Numpy is much cooler than base Python. Specifically, you can index a numpy array with *a numpy array*. This is called "advanced indexing". A simple example:

In [None]:
zip_code_index = np.array([6,0,6,3,7])
b = np.arange(30)**2
print(b)

What do I get here? (3 hands)

In [None]:
b[zip_code_index]

This can get somewhat wild - what do I get as the output of these two cells?

In [None]:
b[a]

In [None]:
a[b]

Note that even though b is a one-dimensional array, we can index it with a two-dimensional array, and the result is two-dimensional!!
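
The shape rule is easier to see on a smaller example: when you index a 1-D array with an integer array, the result takes the shape of the *index* array. The names below are just for illustration:

```python
import numpy as np

letters = np.array(["a", "b", "c", "d"])
idx = np.array([[0, 1], [3, 3], [2, 0]])  # shape (3, 2)

picked = letters[idx]
print(picked.shape)  # (3, 2): same shape as idx, not as letters
print(picked)
```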

##### Advanced slicing: use case

Boolean arrays can be easily generated in numpy:

In [None]:
a > 12

and can be used to index arrays (most commonly the array itself):

In [None]:
a[a > 12] = 100
a
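
A small sketch of the same idea on a fresh array (so that a above is left alone; the names are illustrative). It also shows that comparisons can be combined with & and |, as long as each comparison gets its own parentheses:

```python
import numpy as np

grid = np.arange(25).reshape(5, 5)

# Combine conditions elementwise: even values greater than 12.
mask = (grid > 12) & (grid % 2 == 0)
print(grid[mask])  # selecting with a mask returns a 1-D array

# Assign through a mask on a copy, so grid itself is unchanged.
clipped = grid.copy()
clipped[clipped > 12] = 100
print(clipped)
```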

#### Operations on arrays

We just saw one example - we set some values in an array to an integer. Other examples:

In [None]:
c = np.repeat(np.arange(5),5).reshape(5,5)
c

In [None]:
a * c

In [None]:
a + c

Operations are *all* element-wise unless otherwise specified (e.g. for "normal" matrix multiplication, use @). Because things are elementwise, arrays of the same shape can be operated on as you would expect. But what about something like

In [None]:
a + c[:, 0]

What happened here?

#### Broadcasting

Broadcasting is numpy's process of attempting to "morph" two arrays into having the same shape so that element-wise operations can be applied. (I thought this was more black magic and I just sort of... tried transposing arrays, reshaping things, etc. until something worked, until about two months ago. Now I more or less understand broadcasting, and it's actually pretty simple.)

Broadcasting works as follows:

Numpy _prepends_ dimensions of size 1 to the shape of the lower-dimensional array until both arrays have the same number of dimensions, then compares dimensions starting from the rightmost element of the shape tuples, and deems two arrays compatible if, for each dimension, either:
1. both arrays have the same size, or
2. one (or both) of the arrays has size 1.

Wherever one array has size 1 in a dimension and the other does not, numpy will then "stretch" (repeat) the size-1 array along that dimension to match the other array before doing the operation. Here is a picture depicting this process:

![ndarray broadcasting](https://numpy.org/doc/stable/_images/broadcasting_2.png)

As a numerical example, if you have matrices d with d.shape = (8,3,1,8) and e with e.shape = (3,5,1), d + e will not throw an error and will have shape (8,3,5,8).

So, from the earlier code, a + c[:, 0] worked because a has shape (5,5) and c[:, 0] has shape (5,) -> (1,5), which can be broadcast to (5,5). In other words, c[:, 0] was added to every row of a.
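
The rule is short enough to write out by hand. Below is a minimal sketch, where broadcast_shape is a hypothetical helper written for this lecture (not a numpy function), checked against what numpy actually does:

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: result shape under the broadcasting rule."""
    ndim = max(len(shape_a), len(shape_b))
    # Prepend dimensions of size 1 until both shapes have the same length.
    padded_a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    padded_b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for size_a, size_b in zip(padded_a, padded_b):
        if size_a == size_b or size_a == 1 or size_b == 1:
            # A size-1 dimension is stretched to match the other array.
            result.append(size_b if size_a == 1 else size_a)
        else:
            raise ValueError(f"incompatible sizes {size_a} and {size_b}")
    return tuple(result)

d = np.zeros((8, 3, 1, 8))
e = np.zeros((3, 5, 1))
print(broadcast_shape(d.shape, e.shape))  # (8, 3, 5, 8)
print((d + e).shape)                      # numpy agrees: (8, 3, 5, 8)
```

If you want a length-5 vector added to every *column* of a (5,5) array instead of every row, give it shape (5,1) first, e.g. with vector[:, np.newaxis].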

###### mini exercises:

For each of the following, determine whether the two arrays have compatible dimensions, and if they do, what the dimensions of the resulting array after a binary operation are.

1. f.shape = (5,1,3,2), g.shape = (1,3)
2. f.shape = (5,1,3,2), h.shape = (1,3,1)
3. f.shape = (5,1,3,2), i.shape = (1,3,1,1)
4. f.shape = (5,1,3,2), j.shape = (1,3,1,1,1)
5. f.shape = (5,1,3,2), k.shape = (1,3,1,1,1,1)

Everything else in numpy is just functions. Numpy (+ scipy) has functions for everything you could ever want, seriously. As an example, I was calculating p-values by fitting points to a null distribution, and using the definition of a p-value as 1-cdf. Some of my p-values were very very small, so they were being returned as 0, which made their log -inf, etc.

Turns out every (continuous) distribution in scipy.stats has a logsf method that returns log(1-cdf) with more precision than manually computing the log of 1 minus the cdf. wild.
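
Here is a small sketch of that problem, assuming a standard normal null distribution and a made-up test statistic of 40. scipy.stats distributions expose log(1-cdf) as the logsf method ("log survival function"):

```python
from scipy import stats

x = 40.0  # hypothetical test statistic, far out in the tail

# The cdf rounds to exactly 1.0 in floating point, so the p-value comes out as 0.
p_naive = 1 - stats.norm.cdf(x)

# logsf works in log space directly and stays finite.
log_p = stats.norm.logsf(x)

print(p_naive)  # 0.0
print(log_p)    # a finite, very negative number
```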

Lastly, the axis keyword is important and a little confusing - basically applies a numpy function along a "direction":

In [None]:
d = np.arange(15).reshape(3,5)
print(d)
print(np.sum(d))
print(np.sum(d, axis=0))
print(np.sum(d, axis=1))

Oftentimes, you know what the shape of the resulting array you want is but not what axis that corresponds to - for example, you know you want to average something over time within 100 different experiments, is that axis=0? 1?
My preferred way to remember this is that axis=i will delete the ith value from the shape.
d.shape = (3,5) -> axis=0 makes the shape (5,), axis=1 makes the shape (3,)
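
Applying that rule to the experiments example, as a sketch: the layout (100 experiments by 50 time points) and the variable names are assumptions for illustration.

```python
import numpy as np

# Assumed layout: one row per experiment, one column per time point.
recordings = np.arange(100 * 50).reshape(100, 50)

# We want one average per experiment, i.e. a result of shape (100,).
# The time dimension (size 50) is the one that has to disappear, so axis=1.
mean_over_time = recordings.mean(axis=1)
print(mean_over_time.shape)  # (100,)

# keepdims=True keeps the reduced axis with size 1, which is handy for broadcasting.
print(recordings.mean(axis=1, keepdims=True).shape)  # (100, 1)
```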

### defensive programming

In [None]:
#run car
#add an extra .brake() - what's going on here?
#add print statements after accelerate and brakes
#debug the thing
#mention conditional debugger
#add an assert to the odometer - maybe someone adds something to let the car drive backwards. odometer still shouldn't be negative.

defensive programming things:
print statements:
How do we figure out why our code doesn't work? We have some expectation of what values our variables hold, and they don't hold those values, so we need to see what's actually in the variables!
The simplest method is just print(x). This is honestly very good!! Obviously don't leave print statements in your final code, but I use a lot of them while developing. The alternative is the logging module, but I prefer print statements: they're much faster to write, I don't mind having my output spammed with stuff, and going into another file (potentially one log file per Python file) is a lot of clutter.
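
One trick that makes print debugging nicer, tying back to the f-strings from lecture 1: the = specifier (Python 3.8+) prints the expression together with its value. The variables here are made up:

```python
velocity = 3.5
positions = [0.0, 3.5, 7.0]

# "{expr=}" expands to the text of the expression, an equals sign, and its value.
print(f"{velocity=}")        # velocity=3.5
print(f"{len(positions)=}")  # len(positions)=3
```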

debugging (10 min)
PyCharm's visual debugger is extremely useful!! show operation. set a breakpoint, and the code will stop running there. then you get:
- evaluating expressions. good for seeing what the values of the operations that break your program will be, checking array operations, etc.
- view numpy arrays!!! extremely useful
- program execution control: step over = executes line by line without descending into function calls, step into = goes into function calls, step into (my code) is the same but ignoring libraries, step out = runs until the current function returns, going up a level. run to cursor = a "mobile breakpoint"
- conditional breakpoints!! I learned about these when making this, but god these seem very useful. you can have the breakpoint trigger only when some condition is met, so if there's a loop that breaks towards the end, make it stop only when the index variable is 99% of the way through, etc.

asserts:
print statements are nice for when things go wrong, but how do we prevent errors before they happen? asserts!
https://blog.regehr.org/archives/1091 is the best philosophical resource about asserts I've come across. 
Key points: "An assertion is a Boolean expression at a specific point in a program which will be true unless there is a bug in the program."
Basically, they're a way to reassure yourself that things are as they should be. Sanity checks. I think two main types of asserts are useful in research computing:
1. math stuff - if variables, operations, etc. are mathematically constrained, assert that this is the case! e.g. the assert with probabilities summing to 1 in the hmm
2. preconditions - at the top of functions, make sure that arrays that will be multiplied have the complementary shapes, etc. no real type checking in Python so this is a useful equivalent. don't literally use this to check types though! Asserts should be pretty sparing - the blog above says empirically 1 in 70 lines of code.
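
A minimal sketch of both kinds of assert in one place. The normalize function is a hypothetical example written for these notes, not code from the hmm:

```python
import numpy as np

def normalize(weights):
    """Hypothetical helper: turn non-negative weights into probabilities."""
    weights = np.asarray(weights, dtype=float)

    # Preconditions: check the shape and values at the top of the function.
    assert weights.ndim == 1, f"expected a 1-D array, got shape {weights.shape}"
    assert np.all(weights >= 0), "weights must be non-negative"
    assert weights.sum() > 0, "at least one weight must be positive"

    probs = weights / weights.sum()

    # Math check: probabilities must sum to 1 (up to floating point error).
    assert np.isclose(probs.sum(), 1.0), f"probabilities sum to {probs.sum()}"
    return probs

print(normalize([1, 1, 2]))
```

Note the use of np.isclose rather than ==: floating point sums are rarely exactly 1.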

code publishing things:
argparse (10 min)
sometimes, you cannot run your code inside PyCharm, but must run it from the command line. The two primary instances of this are:
- when publishing, people often want a command line tool. idrk why but they do.
- for cluster work it's somewhere between much easier and the only way to get jobs to run.

you can use sys.argv and make your command something like python science.py 4 10 "linear" 500 1e4 "fast" 8, where nobody (including you) will remember what each position means,
or use argparse! Python's built-in library for, unsurprisingly, parsing command line arguments.

How argparse works: 3 easy steps.
    
1. set up ArgumentParser

In [None]:
import argparse
parser = argparse.ArgumentParser()

2. add arguments

In [None]:
parser.add_argument("filename", type=argparse.FileType("rb"), help="input file")
parser.add_argument("-n", "--number", type=int, default=5, help="an optional integer")
parser.add_argument("--print_this_stuff", nargs="*", help="prints all the extra args you put in")

3. parse args

In [None]:
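# with no argument, parse_args() reads the command line (sys.argv);
# in a notebook, pass a list instead, e.g. parser.parse_args(["some_file", "-n", "3"])
# ("some_file" is a placeholder for a file that actually exists)
args = parser.parse_args()

# args now has an attribute for each argument:
print(args.filename)
print(args.number)
# print_this_stuff is None when the flag is not given, so fall back to an empty list
for val in args.print_this_stuff or []:
    print(val)

# also the documentation is built-in! run the script with -h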

Because it is a well-written module, argparse can handle whatever stuff you might want out of your inputs. Different types, required/optional arguments, mutually exclusive groups (e.g. "verbose" mode vs. "quiet" mode) - use group = parser.add_mutually_exclusive_group() and then group.add_argument()
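
A short sketch of the mutually exclusive group mentioned above. The flag names are illustrative, and the list passed to parse_args stands in for the command line so the cell runs inside a notebook:

```python
import argparse

parser = argparse.ArgumentParser(description="illustrative example")
parser.add_argument("-n", "--number", type=int, default=5, help="an optional integer")

# At most one of --verbose / --quiet may be given.
group = parser.add_mutually_exclusive_group()
group.add_argument("-v", "--verbose", action="store_true", help="print more output")
group.add_argument("-q", "--quiet", action="store_true", help="print less output")

# Passing a list to parse_args() replaces reading sys.argv.
args = parser.parse_args(["-n", "10", "--verbose"])
print(args.number, args.verbose, args.quiet)  # 10 True False
```

Passing both -v and -q makes argparse print a usage message and exit, so you never have to write that check yourself.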