# Manipulating Data with Numpy, Scipy, and Pandas
## Topics

- Array data types with numpy
- Basic statistical analysis with numpy tools
- Introduction to the scipy packages
- Pandas for data manipulation

## Introduction

Today we're covering libraries that are an important part of the core Python data stack - numpy, scipy, and pandas.

**Numpy** is a library for fast numerical processing on 1-dimensional (vectors), 2-dimensional (tables/matrices), or higher-dimensional array data types.

**Scipy** contains more advanced tools for statistics and regression.

**Pandas** introduces the concept of a DataFrame to Python (familiar to those of you coming from R).

## NumPy Basics
Numerical Python is a powerful library of functions, methods, and data types we can use to analyze our data. Unfortunately for those of us whose heads continue to spin in a crash-course of syntax, it also uses a different set of rules. I hope you'll understand why when you see the power and speed NumPy's data types afford us. Let's start off creating some empty arrays, which look sort of like lists, and are in fact vectors.

They differ in a few fundamental ways from lists:

1. **Arrays cannot be of mixed types.** They can be all integers, floats, strings, logical (or boolean) values, or other immutable values. But they cannot be some characters, some numbers, or any other olio of data types. They also cannot contain mutable types such as lists. So, we can have a list of lists, but not an array of lists. We can, however, have an array of arrays (sort of). Which brings us to:
2. Arrays can be multidimensional, but they must be rectangular. You can have a list of lists, where the first interior list is 3 elements long, the second 5, and the third 12, but for your multidimensional arrays, every row must have the same number of columns.
3. We can perform vector operations on them, which can be algebraic functions (like a dot product), or simple replacements of values in a slice of the array.
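
A minimal sketch of these three points, using small made-up arrays (the variable names are illustrative only). It uses `print(...)` with parentheses so that it runs under both Python 2 and Python 3:

```python
import numpy as np

# 1. Mixed numeric input is converted to one common type (here, float)
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# 2. A rectangular list of lists becomes a 2-dimensional array
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid.shape)

# 3. One statement operates on every element at once
doubled = grid * 2
print(doubled)
```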

## Arrays
Here's one way to make an array: start with a list and convert it with the `np.array` function:

In [None]:
import numpy as np

a = [0] * 40
print type(a)
a = np.array(a)
print type(a)
print a

You now have an array `a` holding 40 zeros (a one-dimensional vector). But there's a better way to get a vector of zeros:

In [None]:
a = np.zeros(40)

And here's how to declare something that's not all zeros:

In [None]:
a = np.arange(40)
print a
type(a[0])

Notice the int type.

What if we want a float? There are a couple of ways to do it:

In [None]:
a = np.arange(40, dtype=float)  # Explicitly tell it to use floats
print type(a[0])

a = np.arange(40.0)  # If you give it a float for the length, it will automatically use floats
print type(a[0])
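
Rather than inspecting a single element, you can also ask the array itself for its element type through the `dtype` attribute, and convert an existing array with `astype`. A short sketch (the names `ints` and `floats` are just for illustration):

```python
import numpy as np

ints = np.arange(5)            # integer array
floats = ints.astype(float)    # a converted copy; `ints` itself is unchanged

print(ints.dtype)              # an integer type (the exact width depends on your platform)
print(floats.dtype)
```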

Like with range(), you can also give arange() more parameters:

In [None]:
np.arange(40, 50)  # Start and Stop

In [None]:
np.arange(40, 50, 2) # Start, Stop, and increment


In [None]:
np.arange(40,50,.25)
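
With a fractional step, floating-point rounding can make it hard to predict exactly how many values `arange` returns. When you know how many points you want rather than the spacing, `np.linspace` is an alternative worth knowing. Note that, unlike `arange`, it includes the stop value by default. A small sketch:

```python
import numpy as np

points = np.linspace(40, 50, 5)   # 5 evenly spaced values from 40 to 50, inclusive
print(points)

steps = np.arange(40, 50, 2.5)    # same spacing, but the stop value (50) is excluded
print(steps)
```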

As I said above, you can have arrays with more than one dimension

In [None]:
a = np.zeros(  (10, 10)   ) # Note the inner set of parentheses. (Rows, Columns)
a

And you can modify a particular element with syntax that is subtly different from our list-of-lists: both indices go inside a single set of brackets.

In [None]:
a[5,5] = 3  # choose row, then column
a[6,6] = 42  # Only one set of []
a

You can even add a number to a specific position using the '+=' notation.

In [None]:
a[6,6] += 10 # Add 10 to the element in row 6, column 6
a

So far, the coolest thing I've shown you isn't really that exciting: a range function that can have floats. The real power of arrays is the ability to have one statement affect a large chunk of an array:

In [None]:
a[1,:] = 1
a

In [None]:
a[:,0] = 7
a

In [None]:
a[a == 0] = -1
a

Let us pause for a moment and think about how we would do this with a for loop in lists:

In [None]:
# Create a list of lists of all zeros
LoL = [[0]*10 for i in range(10)] #LoL - List of Lists
 
# Set entries in row 1 to 1
for i, elem in enumerate(LoL[1]):
    LoL[1][i] = 1

# Set entries in column zero to 7
for L in LoL:
    L[0] = 7
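
To check that the two approaches really do the same thing, here is a sketch that runs all three steps (row 1, column 0, and replacing the remaining zeros) both ways and compares the results. The loop for the third step is my own addition, written to mirror `a[a == 0] = -1`:

```python
import numpy as np

# Loop version on a list of lists
LoL = [[0] * 10 for i in range(10)]
for i in range(len(LoL[1])):
    LoL[1][i] = 1              # row 1 -> 1
for row in LoL:
    row[0] = 7                 # column 0 -> 7
for row in LoL:
    for j in range(len(row)):
        if row[j] == 0:
            row[j] = -1        # remaining zeros -> -1

# Array version: one statement per step
a = np.zeros((10, 10))
a[1, :] = 1
a[:, 0] = 7
a[a == 0] = -1

print(np.array_equal(a, np.array(LoL)))
```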

We can also take slices of arrays, just as if they were lists:

In [None]:
a = np.arange(10)
a[2:5]

In [None]:
a[-1]

In [None]:
a[::-1]
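
One difference from lists is worth knowing about: a basic slice of an array is a *view* onto the same data, not a copy, so writing to the slice changes the original. A minimal sketch of the difference:

```python
import numpy as np

a = np.arange(10)
view = a[2:5]        # a slice of an array is a view onto the same data
view[0] = 99
print(a[2])          # the original array changed too

lst = list(range(10))
piece = lst[2:5]     # a slice of a list is a new list
piece[0] = 99
print(lst[2])        # the original list is untouched

independent = a[2:5].copy()   # ask for a copy explicitly when you need one
independent[0] = -1           # does not affect `a`
```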

Maybe you can see the advantage of the array syntax. But wait, there's more! Act now, and we'll throw in math operations for free!

## Vector Math with Arrays
We can do math on many values at once with arrays, no for loop required.

In [None]:
a = np.arange(0, 100, 2)
b = np.arange(50)

a

In [None]:
b

In [None]:
b / 2.0

In [None]:
a * b # Pairwise multiplication

In [None]:
np.sum(a * b) 

In [None]:
np.dot(a, b) # or can take the dot product
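
As a sanity check, the sketch below computes the same quantity three ways: with an explicit Python loop, with `np.sum(a * b)`, and with `np.dot`. All three should agree:

```python
import numpy as np

a = np.arange(0, 100, 2)
b = np.arange(50)

loop_total = 0
for x, y in zip(a, b):          # pure Python: multiply and accumulate pair by pair
    loop_total += x * y

vector_total = np.sum(a * b)    # pairwise multiply, then sum
dot_total = np.dot(a, b)        # the same thing in one call

print(loop_total == vector_total == dot_total)
```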

## Basic Statistics with Numpy

NumPy is **huge**, with around 1200 pages of [reference documentation](http://docs.scipy.org/doc/numpy/reference/index.html), but all of you will, at some point, use some basic statistics to get a feel for your data. So let's make sure we hit some of those functions:

### Random distributions

In [None]:
a = np.random.uniform(0, 100, 10) # Low, High, Size of output
a

In [None]:
a = np.random.uniform(0, 100, (3,3)) # Can also give a shape for the third argument
a

In [None]:
a = np.random.normal(0, 1, 10) # Normal distribution with mean=0, std=1, 10 samples
a

### Summary Statistics

In [None]:
a = np.random.normal(5, 3, 1000)  # Draw 1000 numbers from a normal distribution with mean 5 and std 3
np.mean(a) # Calculate the mean of this sample

In [None]:
np.std(a) # Standard deviation

In [None]:
np.min(a)

In [None]:
np.max(a)
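
One detail worth knowing: by default `np.std` divides by N (the population standard deviation). If you want the sample standard deviation, which divides by N - 1, pass `ddof=1`. A small sketch with made-up numbers:

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print(np.mean(x))
print(np.std(x))            # population standard deviation (divide by N)
print(np.std(x, ddof=1))    # sample standard deviation (divide by N - 1)
```

For large samples like the 1000 draws above, the two values are nearly identical; the difference matters mostly for small samples.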

### Operating on 2d arrays
One of the areas where numpy really shines is its ability to quickly operate along an axis of a 2d array.

In [None]:
a = np.ones((5,3))# 5 rows, 3 columns
a

In [None]:
a.sum()  # Sum over all elements

In [None]:
a.sum(axis=0)  # Sum down the rows (axis 0): one total per column

Rows are axis 0 and columns are axis 1.  The order here makes sense because it's the same order that you use when indexing an array: rows first, then columns.

In [None]:
a.sum(axis=1) # Sum across the columns (axis 1): one total per row
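
With an array of all ones it is hard to see which direction was summed, so here is the same idea on a small array with distinct values. The axis you name is the one that gets collapsed:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # rows are [0, 1, 2] and [3, 4, 5]

col_totals = a.sum(axis=0)   # collapse the rows: one value per column (3, 5, 7)
row_totals = a.sum(axis=1)   # collapse the columns: one value per row (3, 12)

print(col_totals)
print(row_totals)
```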

## Boolean Numpy Arrays for Selection and Filtering

In [None]:
a = np.zeros(10, dtype=bool)
a

In [None]:
# Slicing and mass-assignment still work
a[2:5] = True
a

In [None]:
# The ~ character inverts the boolean array
b = ~a
b

In [None]:
# Demonstrating "&" and "|"
a = np.array([True, False, True])
b = np.array([False, False, True])

print "A and B"
print a & b

print "A or B"
print a | b

Using boolean expressions, you can read out or assign to specific pieces of an array based on the values in the array.

In [None]:
data = np.random.randn(10)
print data

In [None]:
data_less_than_zero = data < 0
print data_less_than_zero

In [None]:
data[data_less_than_zero] = 0   # Replace all values less than zero, with zero
print data

In [None]:
data = np.random.randn(10)
data[data < 0] = 0  # You could also do this without a temporary variable (data_less_than_zero)
print data

In [None]:
data = np.random.rand(20,5)*10 # Random data from 0 to 10
print data

In [None]:
# Show me the mean of each row
row_means = data.mean(axis=1)
print row_means

In [None]:
# Give me a subset of the data matrix, containing only rows with a mean > 5
mean_greater_five = data.mean(axis=1) > 5
print "mean_greater_five: ", mean_greater_five
print

In [None]:
new_matrix = data[mean_greater_five, :]
print
print "new_matrix"
print new_matrix

In [None]:
# OR, all in one line - without a temporary variable
new_matrix = data[ (data.mean(axis=1) > 5) ]
print
print "new_matrix"
print new_matrix
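
Boolean masks can be combined with `&` and `|` to filter on more than one condition, for example rows with a mean > 5 whose second column is also < 4. A sketch on a small made-up matrix (so the result is easy to check by eye):

```python
import numpy as np

data = np.array([[9, 3, 8],
                 [9, 5, 8],
                 [1, 2, 3],
                 [6, 1, 9]])

high_mean = data.mean(axis=1) > 5     # True for rows 0, 1 and 3
low_second = data[:, 1] < 4           # True for rows 0, 2 and 3

# Wrap each comparison in parentheses: & binds more tightly than > and <
selected = data[(data.mean(axis=1) > 5) & (data[:, 1] < 4)]
print(selected)                       # rows 0 and 3
```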

### Why Numpy?
1. Avoid writing loops (don't re-invent the wheel)
2. **Efficient Computation**

Regarding the second point, numpy is useful because operations using it are many times faster than their pure Python implementations.  This is because numpy processes arrays using code written in 'low-level' languages like C or Fortran.  These languages are much more tedious to write programs in, but run much faster than a 'high-level' language like Python.  However, by using Python to call functions written by other people in low-level languages, you can get the best of both worlds.

**Quick performance comparison**

In [None]:
# Setup: create a 1000 x 1000 list of lists (2d matrix)
N_ROWS = 1000
N_COLS = 1000
python_matrix = [[1]*N_COLS for i in range(N_ROWS)]

In [None]:
%%timeit
# Add 1 to every entry in the matrix
for i in xrange(N_ROWS):
    for j in xrange(N_COLS):
        python_matrix[i][j] += 1

In [None]:
%%timeit
# List comprehensions help...a little
result = [[x+1 for x in row] for row in python_matrix]

In [None]:
%%timeit numpy_matrix = np.zeros((1000, 1000))
numpy_matrix += 1

Because numpy knows that every element is going to be a float, it can make a lot of optimizations that it wouldn't be able to make if each element could conceivably be a different type. Furthermore, in the pure Python loops a lot of the time is spent on per-element overhead, such as checking that i and j aren't too big or small for the size of the lists, while the numpy code just loads the size of the array once and never checks again.
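
The `%%timeit` magic only works inside IPython/Jupyter. In a plain script, the standard library's `timeit` module gives a rough equivalent. A sketch, using a smaller 200 x 200 matrix (an arbitrary size chosen to keep it quick); the actual timings will vary by machine, so none are shown here:

```python
import timeit
import numpy as np

N = 200
python_matrix = [[1] * N for i in range(N)]
numpy_matrix = np.ones((N, N))

def add_one_lists():
    # Build a new list of lists with 1 added to every entry
    return [[x + 1 for x in row] for row in python_matrix]

def add_one_numpy():
    # The same operation as a single vectorized statement
    return numpy_matrix + 1

# Total seconds for 20 repetitions of each approach
list_seconds = timeit.timeit(add_one_lists, number=20)
numpy_seconds = timeit.timeit(add_one_numpy, number=20)

print(list_seconds)
print(numpy_seconds)
```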

## SciPy and Fitting

SciPy (pronounced "Sigh Pie") is a collection of libraries that builds on NumPy, and has lots of convenient, fast functions for working with large amounts of scientific data. It's slightly smaller than NumPy, with only 900-odd pages of documentation. That includes sections on integrating C or Fortran code into Python, which is way outside the scope of this course, but if you ever do get to the point where you need a super-efficient implementation of something, you're covered. Especially in the one-off nature of academic science, you're often better served spending less time writing code that takes longer to run, compared to spending lots and lots of time writing code that runs slightly faster.

The [stats](http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html) module of SciPy has functions for even more statistical distributions, statistical tests, and other assorted functions that a good statistician might need. As an example, let's see how we might use the [linregress](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html#scipy.stats.linregress) function, which does a linear regression on some data. Linear regression is the process of finding the line that minimizes the sum of the squares of the vertical distances from each point to the line.

First, we'll set up some noisy data:

In [None]:
import numpy as np

slope = 0.5
intercept = -10

x = np.arange(0, 100)
y = slope*x + intercept
noise = 5 * np.random.normal(0, 1, size=len(x))

y = y + noise

# Plot the line - more details on this are covered in a later lecture
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(y);

With a bit of algebra, we can compute the best-fit slope, intercept, and correlation coefficient directly:

In [None]:
n = len(x)
 
m = (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x**2) - (sum(x))**2)
b = (sum(y) - m * sum(x))/n
r = (n * sum(x * y) - sum(x) * sum(y)) / np.sqrt((n * sum(x**2) - sum(x)**2)
                                                 * (n * sum(y**2) - sum(y)**2))
 
print m, b, r

y2 = m*x + b
plt.plot(x,y)
plt.plot(x,y2);
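
For reference, the expressions typed in above are the standard closed-form least-squares formulas for the slope $m$, intercept $b$, and correlation coefficient $r$, where $n$ is the number of points and every sum runs over all points:

$$
m = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
b = \frac{\sum y_i - m\sum x_i}{n}
$$

$$
r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}
         {\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}}
$$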

This gives us pretty much the right result, but it was kind of a pain to type in. If only the libraries had some sort of function that could do linear regression for us...

In [None]:
from scipy import stats
 
r_slope, r_int, r_rval, r_pval, r_stderr = stats.linregress(x, y)
 
print "Regression Slope: ", r_slope
print "Regression Intercept: ", r_int
print "Regression correlation: ", r_rval
print "R^2: ", r_rval**2
print "p(slope is 0): ", r_pval
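
A quick way to convince yourself that `linregress` is doing what you expect is to feed it noise-free data, where it should recover the slope and intercept almost exactly. The sketch below also reads the results by attribute name, which recent SciPy versions support alongside tuple unpacking:

```python
import numpy as np
from scipy import stats

x = np.arange(0, 100)
y = 0.5 * x - 10          # an exact line, no noise

result = stats.linregress(x, y)
print(result.slope)       # should be very close to 0.5
print(result.intercept)   # should be very close to -10
print(result.rvalue)      # should be very close to 1
```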

Or, if you just want to compute the correlation, there's a function for that:

In [None]:
from scipy.stats import pearsonr, spearmanr

result = pearsonr(x, y)
print "Pearson: ", result

print

result = spearmanr(x, y)
print "Spearman: ", result

Scipy can also be used to run a t-test:

In [None]:
from scipy.stats import ttest_ind

# Generate two sets of samples from the normal distribution

group1 = np.random.normal(1.3, 1, 1000)
group2 = np.random.normal(1, 1, 1000)

# Some plotting code (ignore for now)
plt.hist(group1, 100, (-5, 5), alpha=.6)
plt.hist(group2, 100, (-5, 5), alpha=.6)
plt.show()


result = ttest_ind(group1, group2)
print "P =", result.pvalue



## End of Part 1 - Break for Exercises

<ol>
<li style="margin-bottom: 20px"><b>Writing Mathematical Functions</b>
    <ol>
    <li>Write a function that accepts an array of floats as inputs. Return an array where every value of the input array has been divided by 1.5.</li>
    <li>Use a random function (uniform or normal) to generate an array of floats. Write a function that accepts this array, and returns a list of values that are more than one standard deviation greater or less than the mean of the array.</li>
    <li>Write a function that estimates a p-value from the exponential distribution (another distribution in numpy).  The function should take a number as an input (let's call it x), and return an estimate of the probability that a number drawn from the exponential distribution will be equal to or greater than x.  <br/><br/>To do this, generate many samples from the exponential distribution (use the default scale=1.0), count the number of samples greater than x, and divide the result by the number of samples you generated.  <br/><br/>Don't use a loop to count the number of samples greater than x.  Instead look at what happens when you use np.sum() on a boolean array, or read about the function np.count_nonzero().<br/><br/>Calling your function should look like this:<br/>
    ```
    out = my_function(3)
    print out  # prints 0.050316 (or close to this number)
    ```</li>
    </ol>
</li>

<li><b>Strings to arrays</b><br/>
So we had this idea that we might be able to find a periodicity in the spacing of pyrimidine residues downstream of the termination site in Rho dependent genes (by and large, we don't). Nevertheless:
    <ol>
    <li>Make a function that takes a DNA string as input (Only G, C, A, or T's) and an arbitrary substring (e.g. "CT"). The function should find all locations of the substring in the string and return it as an array.<br/><i>For Example:</i><br/>
    ```
    a = find_substring("GCACTTGCACGTACGCCGT", "AC") 
    # output a contains [2, 8, 12] (or a numpy array with these values)
    ```</li>
    <li>Using the result of find_substring from (a), find the distance between each pair of adjacent substrings. (i.e. How many basepairs separate each position where we found the substring.) Check if a numpy function does this.<br/><i>For Example:</i><br/>
    ```
    differences = find_differences(a)
    # differences contains [6, 4]
    ```</li>
    <li>Use the fasta-parser you've written to read the S.cerevisiae genome fasta file from Lecture 1.1. Then, using the functions in part (a) and (b), generate a full list of the spacings between 'CT' nucleotide pairs for each chromosome and return an array of the differences between adjacent positions.</li>
<li>Using numpy, compute the histogram of these spacings (we'll show you how to plot them later).  Use Google (or the documentation we linked above) to look up the right numpy function and how to use it.</li></ol></li></ol>

# Introduction to Pandas

Pandas is a great tool for working with data in Python.  The main object in Pandas you will use is the **DataFrame**.  It has several advantages over numpy ndarrays:

1. Allows mixed-types
2. Label-based row-column indices
3. Easy database-like operations (merge, join, groupby, sort, etc...)

In [None]:
import pandas as pd  # Pandas is usually abbreviated this way in python

To play around with Pandas, first let's read in some data from file.

In the nycflights13 folder, we have a set of files with data on all the flights that departed NYC airports in 2013.

In [None]:
# Read in data from a tab-delimited text-file
planes = pd.read_table("nycflights13/planes.txt")

# Pandas also has read_excel, read_csv, read_json, read_sql and others

In [None]:
# What's this 'planes' variable have in it?
print type(planes)

In [None]:
planes

In [None]:
# How big is it?
print planes.shape  # same as for numpy array

In [None]:
# What are the column labels?
print planes.columns

In [None]:
# What are the row labels?
print planes.index

There are three important types to know when working with DataFrames:

- DataFrame
- Series
- Index

## Series

One-dimensional - represents a single column or row of data.  Only has one Index

## DataFrame

Two-dimensional.  Has both row and column labels (two Indexes)

## Index

This represents the row or column labels in Series and DataFrames

![DataFrame Vs Series](DataFrameVsSeries_.png)

In [None]:
print planes.columns
print type(planes.columns)

### DataFrame Indexing

You can grab a single column using
```
dataframe[column_name]
```

To grab a row, use:
```
dataframe.loc[row_name]
```

And to grab a specific element use:
```
dataframe.loc[row_name, column_name]
```

In [None]:
planes.head(10) # show just the first 10 rows

In [None]:
planes['manufacturer']

In [None]:
print type(planes['manufacturer'])

When you grab a single column, you get a Series.

![ColumnIndex](ColumnIndex.png)

In [None]:
rowthree = planes.loc[3]  # We use 3 because the row index is just numbers right now
print rowthree
print type(rowthree)

You'll notice the row is a 'Series', and it has its own index - the same as the columns of the data frame!

![Row Indexing](RowIndex.png)

### Dataframe index

So far the row-index has been numeric (just 0 through ~3300).  However, we might want to use labels here too.

To do this, we can select a column to be the dataframe's index
**Only do this if the column contains unique data**

In [None]:
planes.head(5) # Before

In [None]:
planes = planes.set_index('tailnum')

In [None]:
planes.head(5) # After
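
Before calling `set_index`, you can check the uniqueness requirement directly with the column's `is_unique` attribute. A sketch on a tiny made-up table (the tail numbers and seat counts here are invented for illustration, not taken from the planes data):

```python
import pandas as pd

# Made-up tail numbers, purely for illustration
toy = pd.DataFrame({"tailnum": ["N1", "N2", "N3"],
                    "seats": [150, 140, 50]})

print(toy["tailnum"].is_unique)   # safe to use as an index if this is True

if toy["tailnum"].is_unique:
    toy = toy.set_index("tailnum")

print(toy.loc["N2", "seats"])
```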

You can also set the index column when you read the file in:

```python
planes = pd.read_table('nycflights13/planes.txt', index_col=0) #Set the first column as the index
```

In [None]:
# Now we can grab a row by name:
planes.loc['N10156']

In [None]:
# Also use .loc to grab a single value

print planes.loc['N10156', 'model']

### But now how do I get the 3rd row since we changed the index to tail-numbers?

Here's where **iloc** comes into play.

It works like **loc**, but uses integer positions (counting from 0) instead of labels.

In [None]:
print planes.iloc[3] # Get the row at position 3 (positions start at 0)

In [None]:
print planes.iloc[:, 3] # Get the column at position 3 (positions start at 0)

### Indexing: In-summary

You can grab a single column using
```python
dataframe[column_name]
```

To grab a row, use:
```python
dataframe.loc[row_name]
```

And to grab a specific element use:
```python
dataframe.loc[row_name, column_name]
```

If you want to grab rows or column based on their position, use:
```python
dataframe.iloc[row_number or :, column_number or :]
```
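
Here is the whole summary in one runnable sketch, on a tiny made-up DataFrame (the labels and values are invented for illustration):

```python
import pandas as pd

# Made-up data, purely for illustration
df = pd.DataFrame({"model": ["A320", "737", "CRJ"],
                   "seats": [150, 140, 50]},
                  index=["N1", "N2", "N3"])

column = df["seats"]              # a Series, indexed by the row labels
row = df.loc["N2"]                # a Series, indexed by the column labels
value = df.loc["N2", "seats"]     # a single element, by label
same_value = df.iloc[1, 1]        # the same element, by position
first_column = df.iloc[:, 0]      # every row of the column at position 0

print(value == same_value)
```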

## Let's explore the 'flights' table

In [None]:
flights = pd.read_table("nycflights13/flights.txt")

In [None]:
flights.head(5) # first 5 rows

In [None]:
flights.tail(5) # last 5 rows

In [None]:
flights.sample(5) # random 5 rows

### Perform functions along an axis

In [None]:
# Get the average air_time across all flights
flights['air_time'].mean()

In [None]:
subset = flights[['air_time', 'dep_delay', 'arr_delay']]  # Grab only these three columns
subset.mean(axis=0)  # Take the mean down the rows: one mean per column

In addition to mean, there's also:
- min
- max
- median
- sum
- var (for variance)
- std (for standard deviation)

There's also `sort_values` to sort by one or more columns:

In [None]:
flights.sort_values("air_time").head(10)

# Shortest flights are only ~20 minutes from NYC to Philadelphia or Connecticut!

In [None]:
flights.sort_values(['year', 'month', 'day', 'hour', 'minute']).head(10)

# Sorts by year, then by month, then by day....and so on

**unique()** is useful for checking out the values in a column

In [None]:
flights['origin'].unique()  # Three departure airports in the NYC area in the data set

## Selecting specific rows

What if we wanted to find the average departure delay for each of the three airports?

A few ways we could do this:

In [None]:
ewr_delays = []
lga_delays = []
jfk_delays = []

for i in flights.sample(10000).index:  # Only running over a small part, this takes ~2 minutes over the whole thing!
    row = flights.loc[i]
    origin = row['origin']
    delay = row['dep_delay']
    
    if pd.isnull(delay): continue   #  Skip NaNs
        
    if origin == 'JFK':
        jfk_delays.append(delay)
    if origin == 'EWR':
        ewr_delays.append(delay)
    if origin == 'LGA':
        lga_delays.append(delay)
        
print 'JFK Delay: ', sum(jfk_delays) / len(jfk_delays)
print 'EWR Delay: ', sum(ewr_delays) / len(ewr_delays)
print 'LGA Delay: ', sum(lga_delays) / len(lga_delays)
        

In [None]:
# A better way

lga_rows = (flights['origin'] == 'LGA')
print lga_rows

In [None]:
jfk_delays = flights.loc[flights['origin'] == 'JFK', 'dep_delay']
ewr_delays = flights.loc[flights['origin'] == 'EWR', 'dep_delay']
lga_delays = flights.loc[flights['origin'] == 'LGA', 'dep_delay']

print 'JFK Delay: ', jfk_delays.mean()  # pandas mean ignores NaNs by default
print 'EWR Delay: ', ewr_delays.mean()
print 'LGA Delay: ', lga_delays.mean()

That's nice and all, but what if there were 100 origins?  

Wouldn't want to write 100 lines here!


### Using Groupby

In [None]:
# All in one statement
flights.groupby('origin')['dep_delay'].mean()

### What's happening here?

![GroupByExample](GroupBy.png)

In [None]:
# Could group by another variable - with more levels
flights.groupby('carrier')['dep_delay'].mean().sort_values()
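
To see that `groupby` gives the same answer as selecting rows with a boolean mask, here is a sketch on a five-row made-up table (the delays are invented, and one is deliberately missing to show that `mean` skips NaNs in both cases):

```python
import numpy as np
import pandas as pd

# Made-up rows, purely for illustration
toy = pd.DataFrame({"origin": ["JFK", "JFK", "EWR", "LGA", "EWR"],
                    "dep_delay": [10.0, np.nan, 4.0, -2.0, 6.0]})

# One origin at a time, with a boolean mask
by_mask = toy.loc[toy["origin"] == "EWR", "dep_delay"].mean()

# Every origin at once: split by origin, take the mean of each piece
by_group = toy.groupby("origin")["dep_delay"].mean()

print(by_mask)
print(by_group)
```

The result of the `groupby` is a Series indexed by origin, so `by_group["EWR"]` gives the same number as the mask-based calculation.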

## Merging

Merging provides a way to combine two tables together based on the data in them

To demonstrate this, we'll look at combining the planes and flights tables

In [None]:
# Notice that planes has 'tailnum' as an index
planes.head(10)

In [None]:
# Flights has a column for tailnum - every flight corresponds to a row in planes
flights.head(10)

What if we want to know how many seats (total) were on flights that took off on February 1st?

In [None]:
# First, subset flights to just have rows for February 1st

feb1_flights = flights.loc[ (flights.month == 2) & (flights.day == 1)]

feb1_flights.head(10)

Next, we're going to merge the two tables together.  For every row in flights, we're going to add in the columns from the row in planes that matches the flight's 'tailnum'.

In [None]:
feb1_flights_w_planes = feb1_flights.merge(planes, left_on='tailnum', right_index=True)
feb1_flights_w_planes.head(10) # Let's look at the resulting table

In this statement:

```python
feb1_flights.merge(planes, left_on='tailnum', right_index=True)
```

'left' refers to the first dataframe (feb1_flights), and 'right' refers to the second dataframe (planes)

`left_on='tailnum'` means:  Use the 'tailnum' column for the feb1_flights dataframe

We could also supply `right_on` to tell it what column to use in the planes dataframe, but since we want the index, we use `right_index=True` instead (you can't do `right_on='Index'` because what if a column was named Index?)

In [None]:
# Finally, we can answer our question now by summing the 'seats' column, which came from the 'planes' table

print feb1_flights_w_planes['seats'].sum()
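
One thing to keep in mind when reading this total: by default `merge` performs an inner join, so any flight whose tailnum has no matching row in planes is dropped from the result. The sketch below shows the difference between the default and `how="left"` on made-up tables (the flights, tail numbers, and seat counts are invented):

```python
import pandas as pd

# Made-up rows, purely for illustration
toy_flights = pd.DataFrame({"flight": [1, 2, 3],
                            "tailnum": ["N1", "N2", "N9"]})
toy_planes = pd.DataFrame({"seats": [150, 140]}, index=["N1", "N2"])

inner = toy_flights.merge(toy_planes, left_on="tailnum", right_index=True)
left = toy_flights.merge(toy_planes, left_on="tailnum", right_index=True, how="left")

print(len(inner))             # flight 3 is dropped: N9 is not in toy_planes
print(len(left))              # all flights kept; seats is NaN for N9
print(inner["seats"].sum())
```

Comparing `len(feb1_flights)` with `len(feb1_flights_w_planes)` is a simple way to find out whether any flights were dropped in the real data.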

That's a lot of seats!  

Just for flights leaving three airports in the NYC area on one day.

I'm sure all the flights weren't full, but I bet the real number of people departing is at least 80% of that figure.

## Writing to file

Pandas makes writing the results to a file very simple:

In [None]:
# Write to a CSV (comma-separated value) file

# top 20 rows
top = flights.head(20)
top.to_csv("flights_top.csv")

top.to_csv("flights_top.tsv", sep="\t")  # Use tab as a separator instead of comma (separate file, so the CSV above isn't overwritten)


top.to_excel("flights_top.xlsx", sheet_name='FlightsTop')  # Write to an Excel file, in a sheet named 'FlightsTop'

# You might need to install the openpyxl module for Excel writing to work
# To do this, open a terminal and type in "conda install openpyxl", then restart the jupyter notebook by
# going to Kernel (at the top) and selecting 'Restart'.  You will have to re-run the earlier cells that load the data
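
If you want to see what will be written before creating a file, calling `to_csv` without a path returns the text as a string. The sketch below uses a made-up two-row table and also shows `index=False`, which leaves out the row labels that are otherwise written as the first column:

```python
import pandas as pd

# Made-up rows, purely for illustration
toy = pd.DataFrame({"carrier": ["UA", "DL"], "distance": [200, 760]})

with_index = toy.to_csv(sep="\t")          # no path given: returns the text instead of writing a file
without_index = toy.to_csv(index=False)    # leave the row labels out

print(with_index)
print(without_index)
```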

## Exercises - Part 2

<ol start="3">
<li style="margin-bottom: 20px"><b>Does rain cause airline delays?</b><br/>
Let's see if we can use the data to answer this question - does rain cause airline delays?
    <ol>
        <li>
Load the `nycflights13/weather.txt` table.  Investigate the precip column - find out the average amount of precipitation when the precipitation is not 0.  Also find the standard deviation of the precipitation (but only in the hours when it isn't zero).
        </li>
        <li>
Merge the 'flights' table from earlier with the 'weather' table on the ['year', 'month', 'day', 'hour', 'origin'] columns.  This will give you weather information for each flight.  Select only the 'dep_delay' and 'precip' columns so you have a table with only two columns.
        </li>
        <li>
Select only the rows where precip == 0.  What is the average dep_delay for these rows?  What about the dep_delay where there is high precipitation (use a cutoff where "high precipitation" is precipitation that is greater than the mean + 1 standard deviation as calculated in part (a))?
        </li>
        <li>
There's a difference in delay from part (c), but is it significant?  Use Google to look up the ranksums function from scipy and use it to test whether the delays from the No-Precipitation group are significantly different from the delays in the High-Precipitation group.
        </li>
    </ol>
</li>

<li style="margin-bottom: 20px"><b>Explore</b><br/>
Check out the tables and see if there is another question you could ask and try to answer.
</li>
</ol>