In [None]:
# Makes it so any variable or statement on its own line gets printed w/o print()
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Using `numpy` and `pandas` to hold and manipulate data


Two of the most useful libraries for working with scientific data are `numpy` and `pandas`. 

`Numpy` is a library of math functions that provides:
  1. An array object of arbitrary homogeneous items
  2. Fast mathematical operations over arrays
  3. Linear Algebra, Fourier Transforms, Random Number Generation

That first point is where we'll start. `Numpy` introduces a new object for holding groups of variables: n-dimensional arrays of data. Within `numpy` they're referred to as ndarrays, and I will use that name for the rest of this class.

After we introduce you to ndarrays we will switch to `pandas`, which is a wrapper for ndarrays that makes them much easier to use. Then we'll come back to `numpy` and show you some of the functions from `numpy` and `scipy`, which adds more complex mathematical and statistical functions to Python.

### Adding libraries to python

First we need to import the libraries we want to use. This is the same process you used for the last homework to add a function to python, but libraries can add hundreds of new functions.

When we import these libraries we can assign them an alias, which is easier to remember and type. The ones used below are common for these packages. 

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Let's take a look at the description of `numpy`.
# Remember, almost every function and library has a small help file

#np?

In [None]:
# Try hitting tab after the period to see all available `numpy` functions and subfunctions
#np.

Note: `numpy` has several sub-libraries that group together functions by category, like np.random for getting random numbers, np.linalg for doing linear algebra, etc. We will mostly use np.random. 
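
As a quick illustration of those sub-libraries, here is a minimal sketch that calls one function from `np.random` and one from `np.linalg`. The seed value and the example vector are arbitrary choices made for this illustration.

```python
import numpy as np

# np.random: draw random numbers from a seeded generator
rng = np.random.default_rng(seed=42)   # the seed value is arbitrary
samples = rng.random(5)                # five floats in the interval [0, 1)
print(samples)

# np.linalg: length (Euclidean norm) of the vector [3, 4]
length = np.linalg.norm(np.array([3.0, 4.0]))
print(length)  # 5.0
```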

### NumPy arrays: a new thing to hold other things

NumPy arrays (ndarrays) are essentially lists with two important restrictions:

 - An array can only hold **one** type of data
 - Arrays are only partly mutable: 
     - you **can** change the contents of an ndarray 
     - but you **can't** change the size or the data type of the array 
 
Why would we want a list with extra restrictions? Speed, both in terms of time to compute and time to program. 

The computer only needs to check the type of data once for the array, not once for each variable. This adds up when you have huge arrays like the results of an 'omics' experiment. 

Being forced to organize your data makes it simpler to search, manipulate, and subset your data. 
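
The two restrictions listed above can be seen directly in a short sketch. The array names here (`mixed`, `counts`, `bigger`) are made up for this example.

```python
import numpy as np

# Mixing integers and a float: everything is stored as one common type
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)        # float64

# The contents of an existing array can be changed in place
counts = np.array([1, 2, 3])
counts[0] = 10
print(counts)             # [10  2  3]

# The size is fixed: there is no slot at index 3 to assign to
try:
    counts[3] = 4
except IndexError as err:
    print("IndexError:", err)

# "Growing" an array really builds a new, larger array
bigger = np.append(counts, 4)
print(counts.size, bigger.size)   # 3 4
```

Note that `np.append` leaves `counts` untouched and returns a new array, which is why repeatedly growing arrays this way is slow compared with appending to a list.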

### Making a new ndarray
There are a number of ways to make ndarrays. You can import data from text files, convert a list to an ndarray, or use one of the `numpy` functions that build basic ndarray types useful for data analysis.
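
Here is a small sketch of two of those routes: converting a list with `np.array()` and building a pre-filled array with `np.zeros()`. The timepoint values are hypothetical.

```python
import numpy as np

# Convert an existing Python list into an ndarray
timepoints_list = [0, 15, 30, 60]          # hypothetical sampling times
timepoints = np.array(timepoints_list)
print(type(timepoints))

# Build a pre-filled array: five zeros, stored as floats by default
blank = np.zeros(5)
print(blank)     # [0. 0. 0. 0. 0.]
```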

Let's use a function from `numpy` for making arrays containing a series of numbers: `np.arange()`.

In [None]:
# The `numpy` function arange(start, stop, step) gives you an array of values
# between the start and stop (not including the stop) incremented by step
# The default step is 1
a = np.arange(0,10)

a
type(a)

Arrays like these can be useful for describing experimental variables like sampling timepoints, threshold values or bins, or any other linear series. 

Note that many of the functions we introduced previously work with this new data type. Try using `len()` to see how long the `a` array is.

In [None]:
# Check the size of an array
# with len(), like we did with lists
len(a)

An ndarray is an object, which is a collection of things, methods that act on those things, and attributes. 

You can open the help document for an ndarray by typing its name followed by `?`, for example `a?`, which shows the top-level description of `a`. You can see all of the methods and attributes of `a` by entering a `.` after the name and hitting tab. 

In [None]:
# The top level help file for an ndarray
#a?

In [None]:
# Check out the methods of an ndarray
#a.

`numpy` arrays store variables _about_ the array as attributes. Let's look at a few of `a`'s attributes:

- size : the total number of things contained by the array
- ndim : arrays can have any number of dimensions
- shape : the size of each dimension

In [None]:
a.size
a.ndim
a.shape # we'll come back to this when we make arrays with more dimensions

# Note that these aren't methods, so you don't use parentheses

### Math operations on arrays
Math operators ($+, -, *, /$) work on arrays by acting on each element or variable in the array. 

In [None]:
# Try multiplying all of `a`'s values by 3 and adding 3 to each value
a*3
a+3

Let's make another array called `b`, with the same shape as `a`, that contains values from 0 to 1 (not including 1), stepping by 0.1. Then try using various mathematical operators on both arrays.

In [None]:
# Use np.arange to make b
b = np.arange(0,1,0.1)
a
b

In [None]:
b+a
b*a
b**a

In [None]:
# Notice that operators work differently with arrays and lists
# You can convert an array to a list using the method ndarray.tolist()
# Convert a to a_list and then multiply both by three

a_list = a.tolist()
a
a_list
a*3
a_list*3

In [None]:
# You can also use boolean operators on arrays
# That gives us an array of True and False values
a >= 8 # Which values are greater than or equal to 8
a == 99 # Which values are equal to 99

### Adding and reshaping dimensions

So far all of the arrays we've worked with have been one-dimensional. NumPy arrays can have any number of dimensions. What does that mean? It just means we are keeping track of that many different variables for each sample.

We can plot the expression of one gene on a line, two genes on a grid, three genes in 3D, maybe show data for another few genes by mapping that to the size and color of the marker, but beyond a few dimensions it's difficult to imagine higher dimensional data. 

M. tuberculosis has ~4,000 genes. If we do an RNAseq experiment with three samples and measure the expression of each gene for each sample, we are generating 4,000-dimensional data.

Let's make a multidimensional NumPy array where each dimension has a length of 2. 

In [None]:
# Easy start. Make a 1D ndarray that contains 1 and 2
nda = np.array([1,2])
nda

In [None]:
# Now let's make a 2D array, extending the series to 4
nda = np.array([[1, 2], [3, 4]])
nda

In [None]:
# Make a second 2D array by adding four to nda
nda2 = nda + 4
nda2

In [None]:
# Now make a 3D array by combining nda and nda2
nda3 = np.array([nda, nda2])
nda3

In [None]:
# Check the size and shape of nda3
nda3.size
nda3.shape

In [None]:
# You can reshape an ndarray into any shape that can contain all of the elements
nda3.reshape([1, 8])
nda3.reshape([2, 4])
nda3.reshape([2, 2, 2])

Let's continue adding length two dimensions. Past 2D it gets tedious to make these arrays by hand. That is a sure sign that we should use code to do this repetitive task.
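
Before adding more dimensions, it is worth checking the reshape rule from the cell above: the new shape has to hold exactly the same number of elements as the original. A minimal sketch, using a throwaway array called `values`:

```python
import numpy as np

values = np.arange(8)           # 8 elements: 0 through 7

# 2 * 4 = 8, so this shape works
print(values.reshape([2, 4]).shape)   # (2, 4)

# 3 * 3 = 9 does not match 8 elements, so numpy raises a ValueError
try:
    values.reshape([3, 3])
except ValueError as err:
    print("ValueError:", err)
```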

***
### <font color=brown>Hands on practice</font>

Let's write a function `square_seq_nda` that takes as input a number of dimensions and a value for the length of every dimension, and returns an array of that shape filled with a sequence of numbers. 

We can make that sequence with another `numpy` function, `np.linspace`. This function returns evenly spaced numbers over a specified interval, including both endpoints. That means that we don't have to worry about zero indexing or adding one to the stop value, like we did with `arange`.
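
To make the difference concrete, here is a short side-by-side sketch of counting from 1 to 10 with each function. The variable names are made up for this comparison.

```python
import numpy as np

# arange excludes the stop value, so counting 1..10 needs stop=11
with_arange = np.arange(1, 11)

# linspace includes both endpoints; num is how many values to return
with_linspace = np.linspace(1, 10, num=10)

print(with_arange)     # integers 1 through 10
print(with_linspace)   # the same values, stored as floats
print(np.array_equal(with_arange, with_linspace))   # True
```

Note that `linspace` returns floats unless you pass a `dtype`.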

In [None]:
# Here's an example of using np.linspace to count from 1 to 10
# The 'num=' sets the number of equally spaced variables desired
np.linspace(1, 10, num = 10, dtype='int')

In [None]:
# Hint: you can create the list [2, 2, 2] (which works as a shape) with 
([2]*3)

In [None]:
def square_seq_nda(dims, length):
    '''
    Function takes the number of dimensions and the size of each dimension
    and returns a dims-dimensional array filled with integers in sequence. 
    So square_seq_nda(2,2) would return
    [[1, 2], [3, 4]]
    '''
    # Start by calculating the total number of elements in the array
    size = length ** dims
    # Make a 1-D array containing the integers 1 through 'size'
    # (linspace returns floats unless we ask for integers with dtype)
    nda = np.linspace(1, size, num=size, dtype=int)
    # Make a list holding the length of each dimension, e.g. [2, 2, 2]
    shape = [length] * dims
    nda = nda.reshape(shape)
    return nda

In [None]:
square_seq_nda(4,3)

In [None]:
# We can use mathematical operators on multidimensional arrays just like we did before
a_3d = square_seq_nda(3,2)
print(a_3d*3)
print(a_3d**2)

### Slicing and subsetting of ndarrays

Indexing and slicing ndarrays uses the same methods we learned for other collections, with one important difference. 

When you slice an ndarray you are only making a 'view' of that slice, not a copy, so changing the view also changes the original array. If you want a copy, you have to make it explicitly.
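
A minimal sketch of the difference between a view and a copy, using a small throwaway array (`original`, `view`, and `safe` are names chosen for this example):

```python
import numpy as np

original = np.arange(5)          # [0 1 2 3 4]

view = original[1:4]             # a slice: shares memory with original
view[0] = 99                     # this also changes original[1]
print(original)                  # [ 0 99  2  3  4]

safe = original[1:4].copy()      # an explicit, independent copy
safe[0] = -1                     # original is not affected
print(original)                  # [ 0 99  2  3  4]

print(np.shares_memory(original, view))   # True
print(np.shares_memory(original, safe))   # False
```

This is different from lists, where a slice such as `my_list[1:4]` always produces a new list.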

Let's use our square_seq_nda() function to make a 3 x 3 x 3 ndarray to work with.

In [None]:
a = square_seq_nda(3,3)
a

In [None]:
# As before, you can get specific values or ranges of values using square brackets and slices
print("a is:\n",a)
print("a_3d is:\n",a_3d)

In [None]:
# Get the value on the second sheet, in the second row and first column of a
a[1,1,0]

In [None]:
# Get the second row of values from a on the final sheet
# Remember, if you want all of the values use a colon
a[2,1,:]

In [None]:
# Make a new array 'b' by copying 'a' using the nda.copy() method and replacing 23 with 99
b = a.copy()
b[2,1,1] = 99
b

In [None]:
# Think of two ways to get the last sheet of 'a'
# Write a boolean operation to check if the values in the last
# sheet are the same for 'a' and 'b'
a[2,:,:] == b[-1,:,:]