# Using `numpy` and `pandas` to hold and manipulate data


Two of the most useful libraries for working with scientific data are `numpy` and `pandas`. 

`Numpy` is a library of math functions we need to do data analysis. 

`Numpy` also introduces a new object for holding groups of variables: n-dimensional arrays of data. Within `numpy` they're referred to as ndarrays, but I'll just call them arrays for this class. 

We'll start by introducing you to arrays and `numpy` functions, why you might want to use them, and how they work. Later we'll cover `pandas`, a "wrapper" for `numpy` arrays that makes them simpler to use, and `scipy`, which adds more complex mathematical and statistical functions to python using arrays. 


### Let's upgrade! Adding libraries to python

First we need to import the libraries we want to use. This is the same process you used for the last homework to add new functions to python, but these packages add hundreds of new functions.

When we import these libraries we can give them an alias, which is easier to remember and type. The ones used below are common for these packages. 

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Let's take a look at the description of `numpy`.
# Remember, almost every function and library has a small help file

#np?

# Try hitting tab after the period to see all the `numpy` function options
#np.

Note: `numpy` has several sub-libraries that group together functions by category, like np.random for getting random numbers, np.linalg for doing linear algebra, etc. We will mostly use np.random. 

### NumPy arrays: a new thing to hold other things

NumPy arrays are essentially lists that have two important restrictions:

 - An array can only hold one type of data
 - Arrays have a fixed size: you **can** change the contents, but you **can't** change the size of the array or the data type in place
     - To "add" to an array you have to make a new array that is a copy plus the new data
 
Why would we want a list with extra restrictions? The short answer is speed. The computer only needs to check the type of data once for the array, not once for each variable. This adds up when you have huge arrays like the results of an 'omics' experiment. 

There are a number of ways to make `numpy` arrays. You can import data from text files (covered later), you can convert a list to an array, or you can use one of the `numpy` functions that builds some basic array types useful for data analysis.
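
As a rough sketch of the list-to-array route and the "one type of data" rule, the cell below converts a few lists with `np.array()`. The values and variable names are made up for illustration; the exact integer type you see depends on your operating system.

```python
import numpy as np

# Convert a plain Python list into an array (made-up example values)
counts = [12, 7, 30, 4]
counts_arr = np.array(counts)
print(counts_arr, counts_arr.dtype)

# Mixing ints and floats: every element is stored as a float
mixed_arr = np.array([1, 2.5, 3])
print(mixed_arr, mixed_arr.dtype)

# Mixing numbers and text: every element is stored as text
text_arr = np.array([1, 'two', 3])
print(text_arr, text_arr.dtype)
```

Because an array holds a single data type, NumPy quietly converts mixed input to the one type that can represent everything.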

Let's look at two new functions from NumPy for making arrays: `np.arange()` and `np.zeros()`.

In [3]:
# The `numpy` function arange(start, stop, step) gives you an array of values
# between the start and stop (not including the stop) incremented by step
# The default step is 1
a = np.arange(0,10)

print(a)
print(type(a))

[0 1 2 3 4 5 6 7 8 9]
<class 'numpy.ndarray'>


In [4]:
# Let's check the size of our new array

# We can do this with len(), like we did with lists
print (len (a))

# Remember how an object is a collection of variables and methods?
# In addition to the variables contained by the array, 
# `numpy` arrays store variables _about_ the array

print(a.shape) # we'll come back to this when we make arrays with more dimensions
print(a.size)

# Note that these aren't methods, so you don't use parentheses

10
(10,)
10


In [5]:
# We can get data from the array just like we did with lists and tuples with square brackets and slicing
print(a[3])
print(a[0:3])
print(a[2:])

# We use square brackets to assign new values to our array
print(a)
a[3]=99
print(a)

3
[0 1 2]
[2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9]
[ 0  1  2 99  4  5  6  7  8  9]


#### Math operations on arrays
Math operators ($+, -, *, /$) work on arrays by acting on each element or variable in the array. 

In [6]:
print (a*3)
print (a+3)

[  0   3   6 297  12  15  18  21  24  27]
[  3   4   5 102   7   8   9  10  11  12]


In [7]:
# You can use operators with two arrays
b = np.arange(0,1,0.1)
print(a)
print(b)
print(b+a)


[ 0  1  2 99  4  5  6  7  8  9]
[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
[ 0.   1.1  2.2 99.3  4.4  5.5  6.6  7.7  8.8  9.9]


In [8]:
# Notice that operators work differently with arrays and lists
# You can convert an array to a list using the method ndarray.tolist()
# Convert a to a_list and then multiply both by three

a_list = a.tolist()
print (a)
print(a_list)
print(a*3)
print (a_list*3)

[ 0  1  2 99  4  5  6  7  8  9]
[0, 1, 2, 99, 4, 5, 6, 7, 8, 9]
[  0   3   6 297  12  15  18  21  24  27]
[0, 1, 2, 99, 4, 5, 6, 7, 8, 9, 0, 1, 2, 99, 4, 5, 6, 7, 8, 9, 0, 1, 2, 99, 4, 5, 6, 7, 8, 9]


In [9]:
# You can also use boolean operators on arrays
# That gives us an array of True and False values
print(a >= 8) # Which values are greater than or equal to 8
print(a == 99) # Which values are equal to 99

[False False False  True False False False False  True  True]
[False False False  True False False False False False False]


#### Adding dimensions

So far all of the arrays we've worked with have been one dimensional. NumPy arrays can be any number of dimensions. What does that mean? It just means we are keeping track of that many different variables for each sample. 

M. tuberculosis has ~4,000 genes. If we do an RNAseq experiment with three samples and measure the expression of each gene for each sample, we are generating 4,000 dimensional data. We could plot the expression of one gene on a line, two genes on a grid, three genes in 3D, maybe show data for another few genes by mapping that to the size and color of the marker. 

Let's start by making a two dimensional NumPy array.

In [10]:
# First, let's reshape our 1D array from above using the ndarray.reshape() method
# We know from a couple cells above that a is 10 variables long
# Reshape that to a 2 by 5 table
print(a.reshape(2,5))

# Note that didn't change the shape of a!
print('a is still:',a)

# If you don't store the reshaped array it simply shows us the table... sometimes that's all we want
# Otherwise you can store that view in another variable, or overwrite the existing variable
a = a.reshape(2,5)
print(a)

[[ 0  1  2 99  4]
 [ 5  6  7  8  9]]
a is still: [ 0  1  2 99  4  5  6  7  8  9]
[[ 0  1  2 99  4]
 [ 5  6  7  8  9]]


In [11]:
# Here's an array with three dimensions, each of length three
a_3d = np.arange(1,28,1).reshape(3,3, 3)
print(a_3d)

[[[ 1  2  3]
  [ 4  5  6]
  [ 7  8  9]]

 [[10 11 12]
  [13 14 15]
  [16 17 18]]

 [[19 20 21]
  [22 23 24]
  [25 26 27]]]


In [12]:
# We can use mathematical operators just like we did with the 1D arrays
print(a_3d*3)
print(a_3d**2)

[[[ 3  6  9]
  [12 15 18]
  [21 24 27]]

 [[30 33 36]
  [39 42 45]
  [48 51 54]]

 [[57 60 63]
  [66 69 72]
  [75 78 81]]]
[[[  1   4   9]
  [ 16  25  36]
  [ 49  64  81]]

 [[100 121 144]
  [169 196 225]
  [256 289 324]]

 [[361 400 441]
  [484 529 576]
  [625 676 729]]]


In [13]:
# As before, you can get specific values or ranges of values using square brackets and slices
print("a is:\n",a)
print("a_3d is:\n",a_3d)

a is:
 [[ 0  1  2 99  4]
 [ 5  6  7  8  9]]
a_3d is:
 [[[ 1  2  3]
  [ 4  5  6]
  [ 7  8  9]]

 [[10 11 12]
  [13 14 15]
  [16 17 18]]

 [[19 20 21]
  [22 23 24]
  [25 26 27]]]


In [14]:
# Find the fourth value in the first row of a
a[0,3]

99

In [15]:
# Get the second row of values from a
# Remember, if you want all of the values use a colon
a[1,:]

array([5, 6, 7, 8, 9])

In [16]:
# Get the first layer of a_3d (index 0 along the last axis)
a_3d[:,:,0]

array([[ 1,  4,  7],
       [10, 13, 16],
       [19, 22, 25]])

In [17]:
# Think of two ways to get the last layer of a_3d
a_3d[:,:,2] == a_3d[:,:,-1]

array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])
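
Which axis counts as a "layer" is a matter of convention, so it can help to compare indexing along the first axis with indexing along the last one. This is a small sketch; `cube` is just a stand-in name for an array with the same values as `a_3d`.

```python
import numpy as np

cube = np.arange(1, 28).reshape(3, 3, 3)  # same values as a_3d above

first_block = cube[0]            # shorthand for cube[0, :, :]
first_last_axis = cube[:, :, 0]  # index 0 along the last axis

print(first_block)       # the first 3x3 block that print(cube) displays
print(first_last_axis)   # the first value from every row of every block
```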

#### Functions for modifying arrays
We've already seen ndarray.reshape for changing the structure of our data. We can also divide arrays with np.split() and add data with np.append() or np.stack().

In [18]:
# The np.split() function will split an array into equal sized chunks
# You can specify if you want to break up the array by rows, columns, sheets, etc.
np.split(a_3d, 3, axis = 0)

[array([[[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]]), array([[[10, 11, 12],
         [13, 14, 15],
         [16, 17, 18]]]), array([[[19, 20, 21],
         [22, 23, 24],
         [25, 26, 27]]])]

In [19]:
# Remember that NumPy arrays have a fixed size, so functions that "add" data return a new array
# For instance, we can add data to an array using np.append, but that won't change the original array
# (without an axis argument, np.append also flattens the result to 1D)
np.append(a, (1,2))

array([ 0,  1,  2, 99,  4,  5,  6,  7,  8,  9,  1,  2])
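
Here is a short sketch of how `np.append()` behaves with and without an `axis` argument, and how `np.stack()` differs from it. The array `table` and the new row are made-up stand-ins, not part of the class data.

```python
import numpy as np

table = np.arange(10).reshape(2, 5)

# Without an axis, np.append flattens everything into a 1D array
flat = np.append(table, (1, 2))

# With axis=0 the new data is added as an extra row; it must have 5 columns
new_row = np.array([[10, 11, 12, 13, 14]])
taller = np.append(table, new_row, axis=0)

# np.stack joins arrays of the same shape along a brand new axis
stacked = np.stack([table, table * 2])

print(table.shape)    # (2, 5): the original is unchanged
print(flat.shape)     # (12,)
print(taller.shape)   # (3, 5)
print(stacked.shape)  # (2, 2, 5)
```

In every case you get a new array back, so store the result in a variable if you want to keep it.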

In [33]:
# If we use an array in a for loop it iterates over the rows
for row in a:
    print(row)
    print(row+1)

[ 0  1  2 99  4]
[  1   2   3 100   5]
[5 6 7 8 9]
[ 6  7  8  9 10]


In [31]:
# Finally, we can transpose an array using the attribute ndarray.T (no parentheses)
a.T

array([[ 0,  5],
       [ 1,  6],
       [ 2,  7],
       [99,  8],
       [ 4,  9]])

### NumPy functions to make arrays shine

NumPy gives you a library of functions that work with arrays. These can be split into functions that:
 - modify the structure of arrays
 - access data in arrays
 - use arrays as inputs to a range of mathematical functions 

I'll cover a few of the most useful functions here, but we will see many more as the class goes on.

#### Magic functions
The first function isn't actually part of NumPy. 

IPython, the interactive shell that runs the code in Jupyter notebooks, has a set of "magic" commands that work on memory and the operating system. Not surprisingly, you can easily cause things to crash doing that, so magic commands limit your ability to do damage by only giving you a few powerful functions. All magic functions start with a "%" symbol. 

We will run into a few more of these later, but for now I just want to show you one really useful magic function: `%whos`.

You may find that you lose track of the variables you've created so far. Let's see what's there with `%whos`.

In [39]:
# This will show us every variable in memory
%whos

Variable   Type       Data/Info
-------------------------------
a          ndarray    2x5: 10 elems, type `int32`, 40 bytes
a_3d       ndarray    3x3x3: 27 elems, type `int32`, 108 bytes
a_list     list       n=10
b          ndarray    10: 10 elems, type `float64`, 80 bytes
np         module     <module 'numpy' from 'C:\<...>ges\\numpy\\__init__.py'>
pd         module     <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>
row        ndarray    5: 5 elems, type `int32`, 20 bytes


In [40]:
# This will show us every variable in memory that has the data type 'ndarray'
%whos ndarray

Variable   Type       Data/Info
-------------------------------
a          ndarray    2x5: 10 elems, type `int32`, 40 bytes
a_3d       ndarray    3x3x3: 27 elems, type `int32`, 108 bytes
b          ndarray    10: 10 elems, type `float64`, 80 bytes
row        ndarray    5: 5 elems, type `int32`, 20 bytes


#### Random number generators

Random numbers are a good way of simulating expected results or sampling a random subset of data. 

We can generate random numbers using the functions in the np.random sub-library. These numbers can be taken from a uniform distribution (all numbers equally possible) or from a normal distribution (a 'bell-shape' centered on the mean) or many other distributions we won't cover here. 

In [None]:
# We will start by setting a random seed so that all our random variables match
np.random.seed(42)

In [None]:
# A random integer with randint(start, stop(not included), number of values desired)
np.random.randint(1, 11, 9)

In [None]:
# Random integers from 0 to 9 (the stop value, 10, is not included) in a 2 by 2 array
print(np.random.randint(0, 10, size=[2,2]))

In [None]:
# Three random floating-point numbers between 0 and 1
print(np.random.rand(3))

In [None]:
# Normal distribution with mean=0 and variance=1 in a 1 by 5 array
print(np.random.randn(1, 5 ))

In [None]:
# Pick 10 items from a given list, with equal probability
print(np.random.choice(['a', 'e', 'i', 'o', 'u'], size=10))  

# Pick 10 items from a given list with a predefined probability 'p'
print(np.random.choice(['a', 'e', 'i', 'o', 'u'], size=10, p=[0.3, .1, 0.1, 0.4, 0.1])) 

In [None]:
# Let's make a 5 by 5 matrix of random numbers from the normal distribution
# And let's be a bit fancy- keep the mean at zero but
# use a for loop and fill in the first row with sd =1, the second with sd = 2, etc.

normarray = np.zeros ((5,5))

for row in np.arange(5):
    randrow = np.random.randn(5)*(row + 1)
    normarray[row, :] = randrow

normarray = normarray.round(3)
print(normarray)


#### Built in methods for NumPy arrays 
Lastly, let's look at some of the methods that every NumPy array has. We've already used ndarray.tolist() and ndarray.reshape().

Now we will learn how to get the average, min, max, or median of ndarrays.

In [None]:
# ndarray.mean() gives you the mean of the whole array
print(normarray.mean())

# You can specify an axis: axis=0 averages down the rows (one mean per column),
# axis=1 averages across the columns (one mean per row)
print(normarray.mean(axis=0))
print(normarray.mean(axis=1))

In [None]:
# You get min and max the same way
# (median is a NumPy function rather than an array method: np.median(normarray, axis=1))
print(normarray.max(axis=1))
print(normarray.min(axis=1))
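
The `axis` argument is easier to follow on a tiny array where you can check the numbers by eye. This is only a sketch with made-up values; `small` is not part of the class data. It also shows the median, which is called as `np.median()` because arrays do not have a `.median()` method.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 60]])

col_means = small.mean(axis=0)          # collapses the rows: one value per column
row_means = small.mean(axis=1)          # collapses the columns: one value per row
row_medians = np.median(small, axis=1)  # function call, not a method

print(col_means)    # [ 2.5  3.5 31.5]
print(row_means)    # [ 2. 23.]
print(row_medians)  # [2. 5.]
```

Notice how the outlier (60) pulls the mean of the second row up to 23 while the median stays at 5.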

## `Pandas` for tidy data management

For many of us, Excel is the go-to program for data analysis. `R` (another programming language) has been popular for analysis and modeling, partly because `R` has a type of object called a DataFrame that functions like an Excel spreadsheet in computer memory. 

`Pandas` brings DataFrames to python, introducing a convenient way to import, store, and save tables of data. `Pandas` is a "wrapper" around NumPy - `pandas` methods use NumPy functions and objects, but try to make things simpler. The trade off is that `pandas` is slower, but for the types of analysis biologists do that will rarely be a problem.

## Import data into a DataFrame
Let's import some data to work with. `Pandas` provides simple tools for importing from Excel, csv, or any other common data format. Here we are going to use the `read_csv` command to pull in a table of RNAseq data from a [melanoma study from 2017](https://www.nature.com/articles/s41467-017-02353-y) published in Nature Communications. 

`Numpy` arrays let you slice and manipulate data and perform lots of mathematical operations on those arrays. `Pandas` builds on that functionality by focusing on rigidly defined lists (Series) and 2D tables (DataFrames aka "panel data", whence comes `pandas`). `Pandas` also makes data import and manipulation simpler and more intuitive than `numpy`. However in exchange for being simpler to write and read, `pandas` can be slower. 

If you have data already you can import it directly from a text file like a comma-separated values, or csv, file. `Pandas` makes this easy. But in `numpy`, well...

**DO NOT TRY TO UNDERSTAND THIS CODE** 
This is the code for importing csv files the old-fashioned way, with Python's built-in `csv` module:
```python
import csv

with open('employee_birthday.txt', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        print(f'\t{row["name"]} works in the {row["department"]} department, and was born in {row["birthday month"]}.')
        line_count += 1
    print(f'Processed {line_count} lines.')
```
OK, that's probably unfair to NumPy, which has its own import functions such as `np.genfromtxt()`. But in a lot of older code this is exactly what data import looks like. 

This is much easier to do in `pandas`.
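
To make the comparison concrete, here is a minimal sketch that reads the same small table with NumPy and with `pandas`. The table is made up and is read from a string with `io.StringIO` so the cell runs without a file on disk; with a real file you would pass the file path instead.

```python
import io
import numpy as np
import pandas as pd

# A tiny made-up table standing in for a csv file on disk
csv_text = "gene,sample_1,sample_2\ngeneA,10,12\ngeneB,0,3\n"

# NumPy: skip the header row and keep only the numeric columns
values = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                       skip_header=1, usecols=(1, 2))

# pandas: the header and the row labels are picked up in the same call
table = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(values)
print(table)
```

The NumPy version returns only the numbers, so you have to keep track of the gene and sample names yourself. The `pandas` version keeps them attached to the data.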

We'll start by importing a table of gene expression data using the `pandas` function `pd.read_csv()` and a table of metadata using `pd.read_excel()`.

### `Pandas` DataFrames and Series
`Pandas` introduces two new ways of collecting variables:

 - Series: A named list of values, all of the same type
 - DataFrame: An Excel spreadsheet in computer memory made by bundling Series. Each column is a different Series of data, and each row is a separate observation or sample.

A `pandas` Series is a one dimensional NumPy array with extra methods attached.
A DataFrame (df) is a collection of related Series, where each column of data is a Series and each row is a different observation of those variables. 
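
A small sketch of how these pieces fit together, using a made-up table of counts (the gene and sample names are hypothetical):

```python
import pandas as pd

# Made-up counts for three genes in two samples
toy = pd.DataFrame(
    {"sample_1": [10, 0, 250], "sample_2": [12, 3, 240]},
    index=["geneA", "geneB", "geneC"],
)

one_column = toy["sample_1"]         # a single column is a Series

print(type(toy))
print(type(one_column))
print(type(one_column.to_numpy()))   # the values underneath are a NumPy array
```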

In [42]:
# Let's import comma-separated data from a text file

df = pd.read_csv('data/GSE88741-expression.csv', index_col=0)

In [43]:
# Let's take a look at the imported data
# The number of rows and columns in a DataFrame can be found at df.shape
print ("Dimensions of DataFrame:",df.shape,"\n")

# Let's take a look at the top 5 rows of `df` using df.head()
print (df.head())

# Note: Why don't we need parentheses after `df.shape`?

Dimensions of DataFrame: (35238, 12) 

             GSM2344965  GSM2344966  GSM2344967  GSM2344968  GSM2344969  \
gene_symbol                                                               
A1BG                400         320         490         331         363   
A1CF                  1           1           3           0           0   
A2M               23278       47606       20484        2652        2707   
A2ML1                 6           8          10           1           7   
A2MP1                21           7          34           0           6   

             GSM2344970  GSM2344971  GSM2344972  GSM2344973  GSM2344974  \
gene_symbol                                                               
A1BG                390         225         248         301         755   
A1CF                  1           0           2           3           1   
A2M                2854           4           7           3       26726   
A2ML1                 4           3           0           1 

In [44]:
# describe() shows a quick statistics summary of your data
# round() rounds each value to a given number of decimal places (zero by default)
# You can chain methods together like we do here with round
df.describe().round()

| | GSM2344965 | GSM2344966 | GSM2344967 | GSM2344968 | GSM2344969 | GSM2344970 | GSM2344971 | GSM2344972 | GSM2344973 | GSM2344974 | GSM2344975 | GSM2344976 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 35238.0 | 35238.0 | 35238.0 | 35238.0 | 35238.0 | 35238.0 | 35238.0 | 35238.0 | 35238.0 | 35238.0 | 35238.0 | 35238.0 |
| mean | 879.0 | 891.0 | 885.0 | 906.0 | 999.0 | 987.0 | 1040.0 | 1007.0 | 963.0 | 962.0 | 1030.0 | 971.0 |
| std | 7328.0 | 6617.0 | 6925.0 | 4686.0 | 5050.0 | 5326.0 | 4732.0 | 4626.0 | 4458.0 | 4757.0 | 4832.0 | 4562.0 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 | 3.0 | 2.0 | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 |
| 75% | 453.0 | 465.0 | 476.0 | 431.0 | 485.0 | 480.0 | 516.0 | 499.0 | 526.0 | 569.0 | 556.0 | 529.0 |
| max | 723398.0 | 798530.0 | 664427.0 | 366345.0 | 378896.0 | 394524.0 | 301857.0 | 290756.0 | 308816.0 | 271799.0 | 297404.0 | 312086.0 |


In [45]:
# This is a large dataset, so let's take a small random sample to work with
# the sample() method randomly selects a number of rows or columns from a larger DataFrame
# We are setting the random_state here so that we all use the same random genes
df_sample = df.sample(100, axis = 0, random_state = 333)

In [46]:
print ("Dimensions of DataFrame:",df_sample.shape)
print()
print (df_sample.head())

Dimensions of DataFrame: (100, 12)

             GSM2344965  GSM2344966  GSM2344967  GSM2344968  GSM2344969  \
gene_symbol                                                               
ASPDH                 1           1           1           4           0   
KRT18P19              0           0           0           1           0   
ANKIB1             2578        2432        2067        2634        3238   
AGGF1P6               0           0           0           0           0   
ZNF618             1489        1441        1089         997        1088   

             GSM2344970  GSM2344971  GSM2344972  GSM2344973  GSM2344974  \
gene_symbol                                                               
ASPDH                 1           2           1           1           2   
KRT18P19              0           2           1           2           0   
ANKIB1             3158        2373        1908        2324        2918   
AGGF1P6               0           0           0           0    

So that was relatively painless, but it required you to save your data as a csv file. Much of the data I work with is in Excel spreadsheets, and you _can_ save those as csv files. However, `pandas` lets you import directly from Excel files.

In [47]:
# Let's bring in the metadata from an excel spreadsheet
meta = pd.read_excel("data/GSE88741-metadata.xlsx", index_col=1)
meta

| Sample Title | Sample_geo_accession | Stage | cell type |
|---|---|---|---|
| FM_1 | GSM2344965 | primary melanocytes | normal melanocytes |
| FM_2 | GSM2344966 | primary melanocytes | normal melanocytes |
| FM_3 | GSM2344967 | primary melanocytes | normal melanocytes |
| SK_MEL_28_1 | GSM2344968 | metastatic | melanoma cell line |
| SK_MEL_28_2 | GSM2344969 | metastatic | melanoma cell line |
| SK_MEL_28_3 | GSM2344970 | metastatic | melanoma cell line |
| SK_MEL_147_1 | GSM2344971 | metastatic | melanoma cell line |
| SK_MEL_147_2 | GSM2344972 | metastatic | melanoma cell line |
| SK_MEL_147_3 | GSM2344973 | metastatic | melanoma cell line |
| UACC_62_1 | GSM2344974 | metastatic | melanoma cell line |


In [None]:
# Now let's extract the Sample Titles and use them to replace the ugly GSM names
columns = meta.index
print (type(columns))
df_sample.columns = columns

In [None]:
df_sample.head()

In [None]:
# Notice how we set the column names above by assigning to df_sample.columns
# We can read the same attributes to show us the index and column names as well
print(df_sample.index)
print()
print(df_sample.columns)
print()
print(df_sample.dtypes)

## Slicing and sorting
Getting subsets of data out of `pandas` DataFrames is done primarily in one of two ways.

If you want to select data by row and column names, use `.loc[]`. If you want to select by integer position instead, use `.iloc[]`; for instance, the value in the 3rd row and 2nd column is `df.iloc[2, 1]`. Note that both use square brackets, not parentheses.
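
Here is a minimal sketch of both approaches, plus sorting by the index, on a made-up table (the gene and sample names are hypothetical, not from the melanoma dataset):

```python
import pandas as pd

toy = pd.DataFrame(
    {"sample_1": [10, 0, 250], "sample_2": [12, 3, 240]},
    index=["geneC", "geneA", "geneB"],
)

by_label = toy.loc["geneA", "sample_2"]   # select by row and column names
by_position = toy.iloc[1, 1]              # second row, second column
first_two_rows = toy.iloc[0:2, :]         # slices work as they do for arrays
sorted_toy = toy.sort_index()             # returns a new DataFrame sorted by row name

print(by_label, by_position)
print(first_two_rows)
print(sorted_toy)
```

Here `by_label` and `by_position` point at the same cell, reached two different ways.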

In [None]:
# Let's sort by the index 

In [None]:
df_sample.UACC_62_1

In [None]:
# We can send our DataFrame into a `numpy` array
# ndarrays can only be one type of data, so if we added any metadata
# this would convert to whatever data type works for all data types present

nda = df.to_numpy()
type(nda)
#nda.head()
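
To see what happens to the data type when a table mixes numbers and text, here is a small sketch with made-up values. The `stage` column is a hypothetical piece of metadata added for illustration.

```python
import pandas as pd

counts_only = pd.DataFrame({"sample_1": [10, 0], "sample_2": [12, 3]})
with_text = counts_only.assign(stage=["primary", "metastatic"])

numeric_array = counts_only.to_numpy()
mixed_array = with_text.to_numpy()

print(numeric_array.dtype)  # an integer type
print(mixed_array.dtype)    # a catch-all type that can hold numbers and text
```

The catch-all type gives up most of the speed advantage of arrays, so it is usually better to convert only the numeric columns.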


### Homework for Class 5

Let's use NumPy arrays and functions with pandas data frames in a demo experiment. 

We have a new drug, excitin (tm), that we think acts by blocking fatty acid synthesis. If we are right, genes in that pathway are likely induced to try and compensate. We are going to check if we are right by looking at expression of those genes in the presence and absence of excitin. 

We don't have a good idea *when* we expect to see the FA synthesis pathway induced, so we are sampling every 8 hours for three days in the presence and absence of excitin. We will measure the expression of six genes at each time point, three FA biosynthesis genes and three control genes that we don't expect to see changed. 

While the experiment is running let's use dummy data to set up an analysis pipeline.

For the homework you need to:
> 1. Make an array containing all of the time points in this experiment
1. 

In [None]:
# Let's set up an array of sampling times using a new function: np.arange()
# Pass np.arange() the start, stop (not included), and step size

sample_times = np.arange(0, 3*24+1, 8)

print(sample_times)
print(type(sample_times))

In [None]:
# Now let's make an array to hold all of our results when we get them
# Let's say we're going to use qRT-PCR to measure expression of six genes at each time point
# We can use np.zeros((tuple)) to set up an array filled with zeros,
# where tuple is the size of the array we want, 10 by 6

array_size = (10, 6)
data_table = np.zeros (array_size)
print(data_table)

Let's generate a table of data that assumes the null hypothesis, i.e. we won't see any expression changes. If that's the case we will have a mean of zero (no change) with some noise that follows a normal distribution. 
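
One way this could look is sketched below. The seed and the noise level (`noise_sd`) are assumptions chosen only for illustration, and the six genes follow the experiment description above.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed so the dummy data is repeatable

sample_times = np.arange(0, 3*24 + 1, 8)  # 0, 8, ..., 72 hours
n_genes = 6                               # three pathway genes plus three controls
noise_sd = 0.5                            # assumed noise level, for illustration only

# Mean of zero (no change) plus normally distributed noise
null_data = np.random.randn(len(sample_times), n_genes) * noise_sd

print(null_data.shape)  # (10, 6)
print(null_data.round(2))
```

Each row is a time point and each column is a gene, so the overall mean should sit close to zero.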