# Numpy

Numpy is *the* library that makes Python usable for fairly high-performance computing.  You'll need a good handle on it to be productive and fast when doing your data analysis.  Unfortunately, Numpy is a *huge* library, so we will only scratch the surface.

The library is built around the *array* data structure.  This is what we'll focus on for this session; we'll largely ignore most of the routines and functions.

Install with:

```bash
conda install numpy
```

## Digression: why do we need Numpy in the first place?

This is going to be a pretty technical dive into Python's type system, and how it makes Python slow compared to other languages.

By now you're familiar with how to create variables in Python:

```python
x = 5
```

Note that there is no point at which we've told Python "`x` is going to store an integer."  We also haven't told Python "`5` is an integer."  But, in order for the computer to execute any code using `x`/`5`, it does need to look up and work with that type information.

Python does this in a bit of a "dumb" way.  Every time it accesses a value (either as a literal, or through a variable), it runs some code to check that value's type, and then checks whether it can do the requested operation using that data type.  It does this every time it needs to access a value.

This has a huge benefit for us as programmers: we don't need to worry about exact types, and we can move faster when developing code.  We don't, for example, need to worry about whether `x` is a floating point number or integer--that distinction usually doesn't matter to us as humans thinking about the problem.

But there's a huge downside: the computer has to do a huge amount of extra work every time it does any operation.  Consider the following simple example:

```python
5 + 6
```

Python has to:
1. Run code to check what the type of `5` is.
2. Run code to check what the type of `6` is.
3. Check whether the `+` operation is supported for these two types.
4. Run the `+` operation.
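
The four steps above can be roughly imitated in Python itself.  This is only a simplified sketch for illustration: the interpreter does the equivalent work internally, in C, and handles more cases than shown here.  The variable names are made up for this example.

```python
# A simplified, hand-written picture of what `5 + 6` involves.
# The interpreter does this internally; you would never write it yourself.
left, right = 5, 6

# Steps 1 and 2: look up the type of each operand at runtime.
left_type = type(left)    # <class 'int'>
right_type = type(right)  # <class 'int'>

# Step 3: ask the left operand's type whether it supports `+` with the right one.
# Unsupported combinations return the special value NotImplemented.
result = left_type.__add__(left, right)
unsupported = left_type.__add__(left, "six")

# Step 4: the addition itself is the cheap part.
print(result)                         # 11
print(unsupported is NotImplemented)  # True
```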

This is a bigger problem when we have, say, a list of values.  Python does not have a built-in way to say "everything in this list is the same type, so you only need to check the first one."  Or, even better, "this is a list of integers, so you don't need to do any checking at all."

Consider the following example:

```python
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for i in range(len(my_list)):
    my_list[i] = my_list[i] + 1
```

In this case, Python has to run the type-checking on every element of `my_list`, and on `1`, every single time.  This can *really* add up when you have a lot of values in this list.  If you had a way to bypass this--e.g., to tell Python "this is a list of integers, and I am only adding integers to it"--you could get a huge speedup.

Many other languages let you say "this is a list of integers" and bypass the expensive runtime type-checking.  Languages like C, C++, Java, Haskell, etc. allow this, and as a result, those languages can be extremely fast for doing lots of small math operations.

Numpy brings this kind of functionality to Python.  Numpy allows you to specify the equivalent of "this is a list of all integer values," so Python doesn't have to do any of the slow and expensive type checking.

Essentially, this means that we get to skip steps 1-3 in the above list!  There are some other speedups we get from Numpy, too, mostly due to the fact that it pushes as much logic as possible into C--which is very, very fast--and gives us very nice and easy Python interfaces to that fast code.
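
As a small sketch of the idea (using `np.array`, which is introduced properly in the next section): the element type is stored once on the array, in its `dtype` attribute, instead of being rediscovered for every element.  The variable names here are just for illustration.

```python
import numpy as np

my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Plain Python: every iteration re-checks the types of both operands.
incremented_list = [value + 1 for value in my_list]

# Numpy: the element type is recorded once, on the array itself...
my_array = np.array(my_list)
print(my_array.dtype)  # an integer dtype (exact width depends on platform)

# ...so the whole addition can run in a compiled loop with no per-element checks.
incremented_array = my_array + 1
print(incremented_array)
```

Both approaches produce the same numbers; only the amount of bookkeeping per element differs.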

# Numpy: the quick-start

In [1]:
# Conventionally imported with the `np` alias
import numpy as np

# make an array--the core data structure in Numpy
my_array = np.array([1, 2, 3, 4, 5])
print(f"{my_array=}")

# Arrays can be indexed just like lists.
print(f"{my_array[0]=}")
print(f"{my_array[-1]=}")
print(f"{my_array[::-1]=}")

# numpy supports *vectorized* operations.  This will look familiar to R users.
# This is equivalent to: `np.array([my_array[0] + 1, my_array[1] + 1, ...])`
print(f"{my_array + 1=}")

# numpy supports *element-wise* operation between pairs of arrays.
# Equivalent to: np.array([my_array[0] * my_array[0], my_array[1] * my_array[1], ...])
print(f"{my_array * my_array=}")

my_array=array([1, 2, 3, 4, 5])
my_array[0]=1
my_array[-1]=5
my_array[::-1]=array([5, 4, 3, 2, 1])
my_array + 1=array([2, 3, 4, 5, 6])
my_array * my_array=array([ 1,  4,  9, 16, 25])


Expressions like `my_array + 1` are called *vectorized* expressions (or operations).  We treat the array like it's a single value, and then do math on it--addition, subtraction, multiplication, exponentiation, whatever we want.  Numpy then "intercepts" that expression and translates it into a loop over the array's elements, performing our desired operation once per element.  All of this is done behind the scenes for us, and involves a lot of very, very fast C code to give us some extra speed boosts.

Expressions like `my_array * my_array` are usually called *element-wise* operations.  When we have two arrays on either side of a mathematical operator, Numpy will try to line them up element-for-element, and run the given operation on the paired values.

It's very easy to tell when you'll get vectorized versus element-wise operations:
- If you have a mathematical operation where one argument is an array and another is a single number (e.g. a float or int), you'll get a vectorized operation.
- If both arguments are arrays with the same dimensions and size, you'll get element-wise operations.
  - If they have different dimensions/sizes, you will either get an error or you'll get *broadcasting*, which we'll briefly see later.
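
The three cases in the list above can be checked with a few tiny arrays.  This is a minimal sketch; the arrays `a`, `b`, and `c` are made up for the example.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
c = np.array([10, 20])

print(a + 1)  # array and a single number
print(a + b)  # two arrays of the same shape, paired element by element

# Shapes (3,) and (2,) cannot be lined up, so Numpy raises an error.
try:
    a + c
except ValueError as error:
    print("ValueError:", error)
```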

Numpy has a lot of built-in functions for doing operations over arrays, e.g. computing their sums, performing dot products and matrix multiplication, and more; I'm going to generally gloss over these functions and focus more on the important core concepts of the library.  Numpy also has a lot of ways to construct new arrays, which I will also generally gloss over (with a few important exceptions).

Let's do a quick benchmark to show how fast this is.  We'll pick a pretty reasonable example: a list of 10,000 numbers, which we want to normalize so they sum to 1.

In [2]:
def normalize_list(my_list):
    """Normalize numeric values in a Python list so they all
    sum to 1."""
    total = sum(my_list)
    for i in range(len(my_list)):
        my_list[i] /= total
    return my_list

def normalize_array(my_array):
    """Normalize a numpy array so its values sum to 1."""
    return my_array / np.sum(my_array)

# Create a numpy array of random values.  We'll come back to random generation
# later; it's worth addressing on its own.
rng = np.random.default_rng()
random_array = rng.random(size=10_000)

# list() converts a numpy array into a Python list
random_list = list(random_array)

from timeit import timeit
print(
    "Python list:",
    timeit("normalize_list(random_list)", globals=globals(), number=10000),
)
print(
    "Numpy array:",
    timeit("normalize_array(random_array)", globals=globals(), number=10000),
)

Python list: 13.2514705
Numpy array: 0.13752679999999984


That's almost 100x faster!

So why would we ever use anything other than numpy arrays?  The answer: flexibility.  Numpy arrays are *homogeneously typed.*  Every element in the array has to have the exact same type.  This means you can't mix and match strings and numbers; you either have an array of integers, or you have an array of strings.  Python lists, as you hopefully know, do not have this limitation.

```python
a_valid_list = [1, "2", 3.4, 10+6j, ["another list", "with some elements"]]
```

Sometimes, this flexibility that Python lists give us is super important.  But if you're just dealing with lists of numbers that you need to do math on, then by all means, switch to Numpy arrays!
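
To see what "homogeneously typed" means in practice, here is a short sketch (the array names are illustrative).  When the inputs mix integers and floats, Numpy chooses a single dtype for the whole array; once an array exists, values assigned into it are converted to fit that dtype.

```python
import numpy as np

# Mixing ints and floats: Numpy picks one dtype that can hold every value.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64
print(mixed)        # [1.  2.  3.5]

# An existing integer array keeps its dtype; assigned floats are truncated to fit.
ints = np.array([1, 2, 3])
ints[0] = 9.7
print(ints)  # [9 2 3]
```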

There is one other important area where Python lists outperform Numpy arrays: appending values.  Without getting too deep into the weeds, a Python list keeps some spare room reserved at its end, so most appends just drop the new item into that space, and Python only occasionally has to ask the operating system for more memory.  When you append an item to a Numpy array, Numpy asks the operating system for a new chunk of memory big enough to store *the entire original array, plus the new item,* then copies all of its data over to the new location.  This sounds like a lot of unnecessary work, but it is actually important; it guarantees that all the data is always right next to each other in RAM (fancy terms: the arrays are *memory-contiguous*), which has some *enormous* speed benefits.  So, if you need to do a lot of appending and resizing, use a Python list.

In [3]:
def list_append():
    my_list = []
    for i in range(10_000):
        my_list.append(i)
    return my_list

def array_append():
    my_array = np.array([])
    for i in range(10_000):
        my_array = np.append(my_array, i)
    return my_array

print("List appending:", timeit("list_append()", globals=globals(), number=100))
print("Array appending:", timeit("array_append()", globals=globals(), number=100))

List appending: 0.056842500000001905
Array appending: 5.820718200000002


But, if we already know how many values we have, Numpy arrays can be fast.  Not always as fast as appending to a list, but way faster than appending to a Numpy array.

In [4]:
def array_fill():
    # Allocate an uninitialized array with room for 10,000 elements.
    my_array = np.empty(10_000)
    for i in range(10_000):
        my_array[i] = i
    return my_array

print("Array filling:", timeit("array_fill()", globals=globals(), number=100))

Array filling: 0.09004249999999914
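
When the final size is not known up front, one common pattern is to do the appending on a Python list and convert to an array once at the end.  The sketch below is not benchmarked here, and `list_then_convert` is a hypothetical name chosen for this example.

```python
import numpy as np

def list_then_convert():
    # Append to a Python list (cheap), then build the array once at the end.
    values = []
    for i in range(10_000):
        values.append(i)
    return np.array(values)

result = list_then_convert()
print(result.shape)  # (10000,)
print(result[:5])    # [0 1 2 3 4]
```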


## Multi-dimensional arrays

Arrays in Numpy can be multi-dimensional.  The examples above use one-dimensional arrays, but we can make a two-dimensional array (which you might recognize as a matrix, from linear algebra).  When constructing an array literal, a multi-dimensional array is just a list of lists passed to `np.array()`.

In [5]:
my_2d_array = np.array([
    [1, 2],
    [3, 4]
])
print(my_2d_array)
print()

# We can still do vectorized and element-wise operations.
print(my_2d_array + 1)
print()
print(my_2d_array ** my_2d_array)
print()

# One operator is special: the @ operator, which is reserved
# for matrix multiplication.
print(my_2d_array @ my_2d_array)
print()

# np.matmul does the same thing as @, just in function form.
print(np.matmul(my_2d_array, my_2d_array))
print()

[[1 2]
 [3 4]]

[[2 3]
 [4 5]]

[[  1   4]
 [ 27 256]]

[[ 7 10]
 [15 22]]

[[ 7 10]
 [15 22]]



We can *reshape* arrays, which can be extremely useful for a few very specific applications.  Arrays are stored linearly in memory, with some metadata about where new rows start.  So we can just change that metadata!

In [6]:
# .ravel() -> flattens the array down to 1d.
my_2d_array = my_2d_array.ravel()
print(my_2d_array)
print()

# .reshape((a, b, c, ...)) -> fill the array's contents into an a-by-b-by-c-by-... array.
# Note the extra parentheses!  The argument here is a *tuple* of the new
# dimensions.  (.reshape() also accepts the dimensions as separate arguments.)
my_2d_array = my_2d_array.reshape((2, 2))
print(my_2d_array)

[1 2 3 4]

[[1 2]
 [3 4]]


An important note about reshaping arrays.  You can let *one, and only one* of the dimensions be the value `-1` when reshaping.  Numpy will figure out what the right value is for this--it's a shorthand for "however big it needs to be in this dimension to still have the same number of elements."  There'll be an example of this later.
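
A quick sketch of the `-1` shorthand, using a made-up 12-element array.  It also checks the earlier point that reshaping only changes metadata: `np.shares_memory` reports whether two arrays use the same underlying data.

```python
import numpy as np

flat = np.arange(12)

# -1 in one position: Numpy works out that 12 elements / 3 rows = 4 columns.
three_rows = flat.reshape((3, -1))
print(three_rows.shape)  # (3, 4)

# The same idea with the column count fixed instead.
two_columns = flat.reshape((-1, 2))
print(two_columns.shape)  # (6, 2)

# Reshaping this array only changes the shape metadata; the data is shared.
print(np.shares_memory(flat, three_rows))  # True
```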

Arrays can be 3d, 4d, 5d, or however many dimensions you need (but it might not always make sense).

In [7]:
# a 3x3x3 array
my_3d_array = np.arange(0, 27).reshape((3, 3, 3))
print(my_3d_array)
print(my_3d_array.shape)

[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]]

 [[ 9 10 11]
  [12 13 14]
  [15 16 17]]

 [[18 19 20]
  [21 22 23]
  [24 25 26]]]
(3, 3, 3)


In [8]:
# a 2x5x10 array
my_3d_array = np.arange(0, 100).reshape((2, 5, 10))
print(my_3d_array)
print(my_3d_array.shape)

[[[ 0  1  2  3  4  5  6  7  8  9]
  [10 11 12 13 14 15 16 17 18 19]
  [20 21 22 23 24 25 26 27 28 29]
  [30 31 32 33 34 35 36 37 38 39]
  [40 41 42 43 44 45 46 47 48 49]]

 [[50 51 52 53 54 55 56 57 58 59]
  [60 61 62 63 64 65 66 67 68 69]
  [70 71 72 73 74 75 76 77 78 79]
  [80 81 82 83 84 85 86 87 88 89]
  [90 91 92 93 94 95 96 97 98 99]]]
(2, 5, 10)


An important caveat: a one-dimensional array is *not* the same as, say, a two-dimensional array that only has one row!  This can cause bizarre bugs if you're not aware of it.

In [9]:
x = np.arange(0, 10)
y = x.reshape(1, 10)
print(x)
print(y)

# In this case, they will behave like two one-dimensional arrays, but the result will be
# a 2d array.
print(x + y)

# A 10x1 array--or a column vector--is very different, though.
# Note the -1; this says "make it have one column, and however many
# rows are needed to keep the same total number of elements."
y = y.reshape(-1, 1)
print(y)

# This will do something weird...
print(x + y)

[0 1 2 3 4 5 6 7 8 9]
[[0 1 2 3 4 5 6 7 8 9]]
[[ 0  2  4  6  8 10 12 14 16 18]]
[[0]
 [1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]]
[[ 0  1  2  3  4  5  6  7  8  9]
 [ 1  2  3  4  5  6  7  8  9 10]
 [ 2  3  4  5  6  7  8  9 10 11]
 [ 3  4  5  6  7  8  9 10 11 12]
 [ 4  5  6  7  8  9 10 11 12 13]
 [ 5  6  7  8  9 10 11 12 13 14]
 [ 6  7  8  9 10 11 12 13 14 15]
 [ 7  8  9 10 11 12 13 14 15 16]
 [ 8  9 10 11 12 13 14 15 16 17]
 [ 9 10 11 12 13 14 15 16 17 18]]


The last expression does something called *broadcasting.*  The most common place you'll see broadcasting is probably when you have a row and a column vector, like in the above example.  Numpy will essentially compute *every possible pairwise operation* between members of the two arrays--i.e., a Cartesian, or outer, product--and return a new array with all of those combinations.  This can be the source of weird bugs--so be careful about your array dimensions!
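
A smaller version of the same situation makes the result easier to read.  In this sketch, `row` and `column` are illustrative names; `np.add.outer` computes every pairwise sum directly, so it should agree with the broadcast result.

```python
import numpy as np

row = np.arange(3)                     # shape (3,)
column = np.arange(3).reshape(-1, 1)   # shape (3, 1)

# Broadcasting stretches both operands to shape (3, 3) before adding.
result = row + column
print(result.shape)  # (3, 3)
print(result)

# Entry [i, j] is column[i] + row[j]: every pairwise sum.
print(np.array_equal(result, np.add.outer(np.arange(3), np.arange(3))))  # True

# Checking shapes before combining arrays catches most surprises.
print(row.shape, column.shape)  # (3,) (3, 1)
```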

# A note on indexing

Numpy arrays use *matrix-style indexing* and are *C-ordered.*  This means:

- With a two-dimensional array, the first axis is the *row* index.  The *top* row is row 0.
- With a two-dimensional array, the second axis is the *column* index.  The *leftmost* column is 0.
- Beyond the first two dimensions, there isn't really as good of an intuition, and you usually don't need to worry about it.

Since Numpy arrays have multiple dimensions, you have to pass multiple slicing or indexing expressions:

```python
x = np.array([[0, 1], [2, 3]])
print(x[1, 0]) # row 1, column 0 -> prints 2
```

If you want to get a whole column out of an array, you can use the syntax:

```python
my_column = my_array[:, column_number]
```

This says "get me a slice with everything from the first axis (all rows), but only position `column_number` on the second axis (just the one column)."  Similarly, to get a specific row:

```python
my_row = my_array[row_number, :]
```

Or, just:

```python
my_row = my_array[row_number]
```

The shorter form works because Numpy assumes you wanted to use `:` for any axes after the ones you've explicitly indexed.

In [10]:
x = np.array([[0, 1], [2, 3]])
print(x[1, 0])

print(x[:, 1]) # note that this prints the column as a 1d array!
print(x[1, :])

2
[1 3]
[2 3]


Lastly, and most cool: you can index an array *with a list or another array.*  This gives the indices to return from the specified axis; it acts kind of like a filter.  (A tuple is different: Numpy reads it as one index per axis, so `x[(0, 2)]` means the same as `x[0, 2]`.)

In [11]:
x = np.array([5, 3, 7, 9, 1, 2])
print(x)
print(x[[1, 4, 2]])

# Works for row/column indexing too.
x = np.arange(9).reshape(3, 3)
print()
print(x)
print()
print(x[[0, 2]]) # just rows 0 and 2
print()
print(x[:, [0, 2]]) # just columns 0 and 2
print()
print(x[[0, 2], [1, 2]]) # just the values at indices (0, 1) and (2, 2)

[5 3 7 9 1 2]
[3 1 7]

[[0 1 2]
 [3 4 5]
 [6 7 8]]

[[0 1 2]
 [6 7 8]]

[[0 2]
 [3 5]
 [6 8]]

[1 8]
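
One subtlety worth checking for yourself: a list and a tuple are not interchangeable as indices.  This sketch reuses the 3x3 array from the cell above.

```python
import numpy as np

x = np.arange(9).reshape(3, 3)

# A list selects several positions along the first axis: rows 0 and 2.
print(x[[0, 2]])

# A tuple is read as one index per axis: row 0, column 2.
print(x[(0, 2)])  # 2, the same as x[0, 2]

# An index array behaves like the list.
print(x[np.array([0, 2])])
```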


# The `axis` keyword

Many of Numpy's functions--and many functions that are compatible with Numpy arrays--support an `axis` keyword argument.  This argument is both extremely useful and kind of tricky to fully explain.

It's easiest to show what this does on a simple 2d array, using the `np.sum` function.

In [12]:
my_array = np.arange(6).reshape(2, 3)
print(my_array)

print("Sum of each row:", np.sum(my_array, axis=1))

print("Sum of each column:", np.sum(my_array, axis=0))

print("Sum of all dimensions:", np.sum(my_array))

print("Sum of all dimensions:", np.sum(my_array, axis=(0, 1)))

[[0 1 2]
 [3 4 5]]
Sum of each row: [ 3 12]
Sum of each column: [3 5 7]
Sum of all dimensions: 15
Sum of all dimensions: 15


What's going on here is that, normally, `np.sum` wants to add up every value in the array.  But, by specifying an `axis` argument, we're telling it something like: calculate the function separately on "slices" of the array that are in the direction of the axis/axes I've specified.  Glue all those results back together into a new array with fewer dimensions/axes.

So with a 2d array, and `axis=1`:
- Axis 1 is the column axis.  (The coordinates of an array are in the form (row, column), so column is the second index; since Python starts counting at 0, it gets the numeric value 1.)
- `np.sum(array, axis=1)` tells Numpy to:
  1. Split the array up into "slices," where each slice has a different value of axis 0 (the row axis), since axis 0 isn't specified in the `axis=` argument.  (in a 2d matrix, each of these "slices" is a row).
  2. For each of these "slices," calculate `np.sum`.
  3. Stick all the results back into a new array.
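
The three steps can be written out by hand to confirm they match what `axis=1` does.  This is only a sketch for building intuition (`by_hand` is an illustrative name); in real code you would just pass `axis=`.

```python
import numpy as np

my_array = np.arange(6).reshape(2, 3)

# Steps 1-3 by hand: one slice per row, sum each, collect the results.
by_hand = np.array([np.sum(my_array[row, :]) for row in range(my_array.shape[0])])
print(by_hand)                   # [ 3 12]
print(np.sum(my_array, axis=1))  # [ 3 12]

# The summed-over axis disappears from the result's shape.
print(my_array.shape, "->", np.sum(my_array, axis=1).shape)  # (2, 3) -> (2,)
```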
  
This will either make sense almost instantly, or it will take some time.  If it takes some time, don't worry; just practice with it, using small arrays and a few different summary/descriptive functions (`np.mean`, `np.sum`, `np.std`, etc) until it makes intuitive sense.

By default, if you don't specify `axis=`, Numpy assumes you want *all* axes specified.

When the array is 3d, some of our intuitions can still apply.  `np.sum(array, axis=1)` means "take a slice for each combination of values of axes 0 and 2 (each slice is a 1d array running along axis 1).  Calculate `np.sum` for each of those.  Stick the results into a new 2d array."
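
A small 3d sketch, using a made-up 2x3x4 array named `cube`, shows which axis disappears and what one of the slices looks like.

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)

# Summing over axis 1 removes it: shape (2, 3, 4) -> (2, 4).
summed = np.sum(cube, axis=1)
print(summed.shape)  # (2, 4)

# Each result entry is the sum of one 1d slice running along axis 1.
print(cube[0, :, 0])                       # [0 4 8]
print(summed[0, 0], cube[0, :, 0].sum())   # 12 12
```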

Most of the time, we're going to use `axis=` to *reduce the number of dimensions in an array* (like in the examples above).  But, there are a few cases where this does not reduce the number of dimensions, e.g., with functions that return arrays rather than scalar values.  `np.cumsum` is one such example; it calculates a cumulative sum from its inputs:

In [13]:
my_array = np.arange(6)
print(my_array)
print(np.cumsum(my_array)) # -> [0, 0+1, 0+1+2, 0+1+2+3, ...]
print()
my_array = my_array.reshape(2, 3)
print(my_array)
print()

# Replace each row with its own, independent cumulative sum
print(np.cumsum(my_array, axis=1))
print()

# Or, do the same, but to the columns.
print(np.cumsum(my_array, axis=0))

[0 1 2 3 4 5]
[ 0  1  3  6 10 15]

[[0 1 2]
 [3 4 5]]

[[ 0  1  3]
 [ 3  7 12]]

[[0 1 2]
 [3 5 7]]


Fortunately, we will *usually* use this with reduction-like functions.

# Data Types

Numpy has its own set of types (they get re-used by a lot of other libraries that rely on Numpy, but they're all Numpy's types at the end of the day).  These types are *machine types*: they represent the view that the CPU, not the programmer, has of different data types.  Numpy calls these *dtypes*.

A short list of some of the most common dtypes (this is by no means exhaustive):
- Signed integer types (can be positive or negative): np.int8, np.int16, np.int32, np.int64
- Unsigned integer types (can only be zero or positive): np.uint8, np.uint16, np.uint32, np.uint64
- Floating point numbers: np.float64, np.float32, np.float16

Just like with plain Python, integers store whole numbers, and floats store numbers with decimal places.  Unlike plain Python, though, Numpy values are limited to predefined *bit widths*.  E.g., `np.int64` is a 64-bit integer (an integer that can be up to 64 binary digits long); `np.int8` is an 8-bit integer (an integer that can be up to 8 binary digits long).  The number of bits is important for three reasons:

- It determines the maximum numeric value that can be stored using this type.  Exactly how big the number is depends on whether the type is an int, uint, or float type.
- It determines how much memory a value of the type uses.  A 64-bit number always uses 64 bits, a 32-bit number always uses 32 bits, etc.  This does not depend on whether the type is an int, uint, or float type.
- Math on some kinds of values is faster than others.  Most notably, np.float32 and np.float64 tend to be much faster to do math on than np.float16.

Usually, a good rule of thumb for what dtype to use is this: use `np.float64`, or `np.int64`, depending on what your data looks like.  If you need to change dtypes, you'll almost always know, e.g. if you need to cut down the amount of RAM your program uses (you can fit eight 8-bit numbers in the same space as one 64-bit number, for example), or if you're running on a 32-bit-only machine.
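
If you want to see the actual limits for a given dtype rather than work them out, `np.iinfo` (for integer types) and `np.finfo` (for floating point types) report them.  A minimal sketch:

```python
import numpy as np

# Integer limits for a few dtypes.
for dtype in (np.int8, np.uint8, np.int64):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max)

# Floating point limits work the same way, via np.finfo.
print(np.finfo(np.float32).max)
print(np.finfo(np.float64).max)
```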

Most array construction methods in Numpy have a `dtype` keyword argument.  You can check the dtype of an array by accessing the `.dtype` attribute.

In [14]:
x = np.array([1, 2, 3], dtype=np.int8)
print(x)
print(x.dtype)
print(f"The array's contents takes up {x.size * x.itemsize} bytes of memory.")
x = x.astype(np.float64)
print(x)
print(x.dtype)
print(f"The array's contents takes up {x.size * x.itemsize} bytes of memory.")

[1 2 3]
int8
The array's contents takes up 3 bytes of memory.
[1. 2. 3.]
float64
The array's contents takes up 24 bytes of memory.


But as I said: most of the time, you can stick with the "default" dtype that Numpy gives you.  It'll work for most cases, and you'll usually know if you need to switch--you'll be running out of RAM, most of the time, or hitting over/underflow errors.

In [15]:
# an overflow example.  uint8 can store numbers between 0 and 255.
# if you do addition and it would go over 255, it wraps back around to 0.
x = np.array([253, 254, 255], dtype=np.uint8)
print(x)
x = x + 1
print(x)

# and similarly: subtracting wraps back around to the max value.
print(x - 1)

# for 32 and especially 64 bit integers, this will happen very rarely.
# 32 and especially 64 bit floating point values do not wrap around; in the rare
# case that they overflow, the result becomes inf instead.

[253 254 255]
[254 255   0]
[253 254 255]
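
Floating point values behave differently from integers when they go out of range: instead of wrapping around, the result becomes `inf`.  A short sketch with `np.float32`, where the starting value is chosen to sit just under the type's maximum:

```python
import numpy as np

big = np.array([3e38], dtype=np.float32)  # close to the float32 maximum

# Silence the overflow RuntimeWarning so the output stays readable.
with np.errstate(over="ignore"):
    too_big = big * np.float32(10)

print(too_big)               # [inf]
print(np.isinf(too_big[0]))  # True
```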


# Random Number Generation

Numpy has some very efficient tools for generating random numbers.  If you need a big array of random values, it's very efficient to compute them all at once--this is *way* faster than computing one random number at a time in Python and appending to a list.  Numpy essentially generates a big chunk of random 1s and 0s and splits it up into chunks that are the right size for the `dtype` you're using; in plain Python, you have to generate each number one at a time, which involves a lot more overhead.

This is one of the most useful array construction features in Numpy, so it's worth covering.  Basically: you call `numpy.random.default_rng()`, save the result to a variable, and then call methods on that variable to create new random arrays.  Most of the array construction methods have a `size` argument; this tells Numpy how many values to generate, and thus how big to make the resulting array.

If you give `size=` a single number, Numpy creates a one-dimensional array of that many elements.  If you give it a tuple of values, the tuple gives the shape of the final array.  The first element of the tuple is the size along the first axis (number of rows, in a 2d array), the second element is the size along the second axis (number of columns, in a 2d array), etc.  For some random generation methods, `size` is the first argument; for others, it comes later, in which case it is usually passed by name.

In [16]:
rng = np.random.default_rng()

# an array of 10 random values, each at least 0 and less than 1
rand_array = rng.random(10)
print(rand_array)

# a 2x3 array.  The first argument to rng.random() is either a single number, for a
# one-dimensional array, or a tuple for a multi-dimensional array.
rand_array = rng.random((2, 3))
print(rand_array)

# random array of integers from 0 to 9.  The upper bound (10) is excluded
# by default; pass endpoint=True to include it.
rand_array = rng.integers(0, 10, size=10)
print(rand_array)

[0.36694548 0.67957525 0.98065981 0.71282351 0.67986744 0.83943533
 0.54494691 0.10603038 0.61796627 0.44867307]
[[0.07356242 0.37427293 0.22345655]
 [0.37995634 0.59775639 0.33965679]]
[5 5 5 0 1 1 0 6 8 6]


There are also methods to generate random values from specific distributions, e.g., an array of normally-distributed random values.  Check the Numpy documentation for more details.
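
As one example, here is a sketch that draws normally-distributed values with `rng.normal`.  It also passes a seed to `default_rng`, which is not something the cells above do; the seed value 42 is an arbitrary choice, used here only so the result is repeatable.

```python
import numpy as np

# Passing a seed (42 here is an arbitrary choice) makes results repeatable.
rng = np.random.default_rng(seed=42)

# A 2x3 array drawn from a normal distribution with mean 0 and standard deviation 1.
normal_array = rng.normal(loc=0.0, scale=1.0, size=(2, 3))
print(normal_array.shape)  # (2, 3)

# A second generator with the same seed reproduces the same values.
same_rng = np.random.default_rng(seed=42)
same_array = same_rng.normal(loc=0.0, scale=1.0, size=(2, 3))
print(np.array_equal(normal_array, same_array))  # True

# rng.integers excludes the upper bound unless endpoint=True is passed.
print(rng.integers(0, 10, size=1000).max() <= 9)                  # True
print(rng.integers(0, 10, size=1000, endpoint=True).max() <= 10)  # True
```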

# Wrap-up

Numpy is a *big* library, and we've barely scratched the surface of what you can do with it.  But it is *extremely* important.  You need to get proficient with it, because every other library in the data science ecosystem ultimately builds off of it, and uses its arrays under the hood.

A lot of the stuff in Numpy is also heavily based in mathematics, mostly linear algebra.  You'll get a lot more out of Numpy if you're familiar with things like matrices, vectors, and different operations you can do on them.