# Manipulating Data with Numpy and Scipy
## Topics

- Array data types with numpy
- Basic statistical analysis with numpy tools
- Introduction to the scipy packages

## Introduction

Today we're covering two libraries that are an important part of the core Python data stack: NumPy and SciPy.

**NumPy** is a library for fast numerical processing on 1-dimensional (vectors), 2-dimensional (tables/matrices), or higher-dimensional arrays.

**Scipy** contains more advanced tools for statistics and regression.

[This](https://www.w3resource.com/python-exercises/numpy/index.php) is a great resource for additional problems and examples.


## NumPy Basics
Numerical Python (NumPy) is a powerful library of functions, methods, and data types we can use to analyze our data. Unfortunately, it also uses a different set of rules. Let's start off creating some empty arrays, which look like lists, but are in fact vectors.

NumPy arrays differ in a few fundamental ways from Python lists:

1. **Arrays cannot be of mixed types.** They can be all integers, floats, strings, or logical (boolean) values, but every element shares one type. For our purposes, they also don't hold mutable types such as lists.

2. Arrays can be multidimensional, but they must be rectangular. You can have a list of lists, where the first interior list is 3 elements long, the second 5, and the third 12, but for your multidimensional arrays, every row must have the same number of columns.

3. We can perform vector operations on them, which can be algebraic functions (like a dot product), or simple replacements of values in a slice of the array.
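A quick sketch of these three points, using small made-up values (the variable names are illustrative only):

```python
import numpy as np

# 1. Mixed input is coerced to a single common type
mixed = np.array([1, 2.5, 3])        # int and float -> all float64
print(mixed.dtype)

# 2. A rectangular list of lists becomes a 2-dimensional array
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid.shape)                    # (rows, columns)

# 3. Arithmetic applies to every element at once
doubled = grid * 2
print(doubled)
```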

### Reviewing basic python lists

We have looked at lists before, but we haven't explicitly discussed multi-dimensional arrays. This has actually been implied - we can place anything in a list, so why not another list? Thusly:

In [None]:
flat_list = [1,2,3]
list_of_lists = [[1,2,3],[4,5,6],[7,8,9]]
print(list_of_lists[1][2])

Note that what Python calls lists, most other languages - and NumPy - call arrays. They're pretty similar, with the exceptions called out above.

## Arrays
Here's one way to make an array: start with a list and convert it with the `np.array` function:

In [None]:
import numpy as np

a = [0] * 40 # Remember this trick? This creates a list with 40 elements, each 0
print(type(a))
a = np.array(a)
print(type(a))
print(a)

You now have a one-dimensional array `a` with 40 elements, all zero. But there's a better way to get a vector of zeros:

In [None]:
a = np.zeros(40)

And here's how to declare something that's not all zeros:

In [None]:
a = np.arange(40)
print(a)
print(type(a[0]))  # Without print, only the last expression in a cell is displayed
print(a[31])

Notice the int type.

What if we want a float? There are a couple of ways to do it:

In [None]:
a = np.arange(40, dtype=float)  # Explicitly tell it to use floats
print(type(a[0]))

a = np.arange(40.0)  # If you give it a float for the length, it will automatically use floats
print(type(a[0]))

Like with range(), you can also give arange() more parameters:

In [None]:
np.arange(40, 50)  # Start and Stop

In [None]:
np.arange(40, 50, 2) # Start, Stop, and increment


In [None]:
np.arange(40,50,.25)
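With a fractional step, floating-point rounding can occasionally change how many elements `arange` returns. When you care about the number of points rather than the step size, `np.linspace` is a common alternative. A small sketch comparing the two on the example above:

```python
import numpy as np

by_step = np.arange(40, 50, .25)                      # step of 0.25, stop excluded
by_count = np.linspace(40, 50, 40, endpoint=False)    # 40 evenly spaced points, stop excluded

print(len(by_step))
print(np.allclose(by_step, by_count))
```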

As I said above, you can have arrays with more than one dimension:

In [None]:
a = np.zeros(  (3, 9)   ) # Note the inner set of parentheses. (Rows, Columns)
a

We can ask what the __shape__ of the array is - that is, what is the matrix size and number of dimensions?

In [None]:
a.shape

In [None]:
a.reshape(3,3,3)
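Note that `reshape` returns a new array object and leaves the shape of `a` itself alone; to keep the result you need to assign it to a name. A short sketch (the names `cube` and `flat` are just for illustration):

```python
import numpy as np

a = np.zeros((3, 9))
cube = a.reshape(3, 3, 3)   # 27 elements rearranged as 3 x 3 x 3

print(a.shape)              # the original is unchanged
print(cube.shape)

# Passing -1 lets numpy work out that dimension from the total size
flat = cube.reshape(-1)
print(flat.shape)
```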

In [None]:
# let's define a 10x10 array...
a = np.zeros(  (10, 10)   )

You can also modify a particular element, using syntax that is subtly different from our list-of-lists: a single pair of brackets with the indices separated by a comma.

In [None]:
a[5,5] = 3  # choose row, then column
a[6,6] = 42  # Only one set of []
a

You can even add a number to a specific position using the '+=' notation.

In [None]:
a[6,6] += 10 # Add 10 to the element in row 6, column 6
a

So far, the coolest thing I've shown you isn't really that exciting: a range function that can have floats. The real power of arrays is the ability to have one statement affect a large chunk of an array:

In [None]:
# First, let's see how to use numpy-style array indexing to pick a subset of our array...
a[1,:]

In [None]:
# Now, let's actually use that syntax to assign everything in the second row to one...
a[1,:] = 1
a

In [None]:
#... and all values in the first column to 7.
a[:,0] = 7
a

In [None]:
# We can add smart tests in our indexing! If it's zero, assign it to -1!
a[a == 0] = -1
a

Let us pause for a moment and think about how we would do this with a for loop in lists...
... in fact, let's try it in the [debugger](http://www.pythontutor.com/visualize.html#mode=edit)

In [None]:
# Create a list of lists of all zeros
LoL = [[0]*10 for i in range(10)] #LoL - List of Lists
 
# Set entries in row 1 to 1
for i, elem in enumerate(LoL[1]):
    LoL[1][i] = 1

# Set entries in column zero to 7
for L in LoL:
    L[0] = 7
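To convince ourselves the loops and the array one-liners do the same job, here is a sketch that builds both versions from scratch, including the step that replaces zeros with -1, and compares them (the names `arr`, `row`, and `value` are illustrative):

```python
import numpy as np

# Array version: three one-line assignments
arr = np.zeros((10, 10))
arr[1, :] = 1
arr[:, 0] = 7
arr[arr == 0] = -1

# List-of-lists version: explicit loops
LoL = [[0] * 10 for i in range(10)]
for i in range(len(LoL[1])):
    LoL[1][i] = 1
for row in LoL:
    row[0] = 7
for row in LoL:
    for j, value in enumerate(row):
        if value == 0:
            row[j] = -1

print(np.array_equal(arr, np.array(LoL)))
```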

We can also take slices of arrays, just as if they were lists:

In [None]:
a = np.arange(10)
a

In [None]:
a[2:5]

In [None]:
a[-1]

In [None]:
a[::-1]
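One difference from lists is worth knowing here: slicing a list makes a copy, while slicing an array gives a view onto the same data, so writing to the slice changes the original. A minimal sketch:

```python
import numpy as np

a = np.arange(10)
view = a[2:5]
view[0] = 99          # writes through to a
print(a[2])

lst = list(range(10))
piece = lst[2:5]
piece[0] = 99         # only the copy changes
print(lst[2])

# Use .copy() when you want an independent array
independent = a[2:5].copy()
independent[0] = -1
print(a[2])
```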

## Vector Math with Arrays
We can do math on many values at once with arrays, no for loop required.

In [None]:
a = np.arange(0, 100, 2)
b = np.arange(50)

a

In [None]:
b

In [None]:
b / 2.0

In [None]:
a * b # Pairwise multiplication

In [None]:
(a * b).sum()

In [None]:
np.dot(a, b) # or can take the dot product
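The last two cells compute the same number: for 1-dimensional arrays, a dot product is the sum of the element-wise products. A small check, which also shows the `.dot` method as another spelling of the same thing:

```python
import numpy as np

a = np.arange(0, 100, 2)
b = np.arange(50)

elementwise_total = (a * b).sum()   # multiply pairwise, then add everything up
dot_function = np.dot(a, b)
dot_method = a.dot(b)

print(elementwise_total)
print(elementwise_total == dot_function == dot_method)
```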

## Basic Statistics with Numpy

NumPy is **huge**, with around 1200 pages of [reference documentation](http://docs.scipy.org/doc/numpy/reference/index.html), but all of you will, at some point, use some basic statistics to get a feel for your data. So let's make sure we hit some of those functions:

### Random distributions

In [None]:
a = np.random.uniform(0, 100, 10) # Low, High, Size of output
a

In [None]:
# Let's not take the comment's word for it...
np.random.uniform?

In [None]:
a = np.random.uniform(0, 100, (3,3)) # Can also give a shape for the third argument
a

In [None]:
a = np.random.normal(0, 1, 10) # Normal distribution with mean=0, std=1, 10 samples
a

In [None]:
# A quick station break for something cool...
import matplotlib.pyplot as plt
plt.plot(a);

### Summary Statistics

In [None]:
a = np.random.normal(5, 3, 1000)  # Draw 1000 numbers from a normal distribution with mean 5 and std 3
a

In [None]:
np.mean(a) # Calculate the mean of this sample

In [None]:
np.std(a) # Standard deviation

In [None]:
np.min(a)

In [None]:
np.max(a)
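Two details are easy to miss here: the samples change on every run unless you seed the generator, and `np.std` divides by N by default (pass `ddof=1` for the N-1 sample estimate). A sketch using `np.random.default_rng`, assuming NumPy 1.17 or newer; the seed value 0 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)        # seeded generator: same numbers on every run
sample = rng.normal(5, 3, 1000)       # mean 5, std 3, 1000 draws

print(np.mean(sample))                # close to 5, but not exactly 5
print(np.std(sample))                 # population formula (divides by N)
print(np.std(sample, ddof=1))         # sample formula (divides by N - 1)
```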

### Operating on 2d arrays
One of the areas where numpy really shines is its ability to quickly operate along an axis of a 2d array.

In [None]:
a = np.ones((5,3))# 5 rows, 3 columns
a

In [None]:
# Hey, why are they all floats? Let's find out...
np.ones?

In [None]:
a.sum()  # Sum over all elements

In [None]:
a.sum(axis=0)  # Sum down the rows (along axis 0): one total per column

Well, that whole axis thing seems arbitrary. Let's read the manual a bit...

In [None]:
np.sum?

Complicated, but I think I get it.. Let's try it. Since everything is 1 in all directions, that's maybe not the best test. Let's set one of the rows to all twos...

In [None]:
a[1]=2
a

In [None]:
a.sum(axis=0)

Okay, now it makes sense. 

Rows are axis 0 and columns are axis 1. The order makes sense because it's the same order you use when indexing an array: rows first, then columns. Summing along an axis collapses it, so `axis=0` adds up the rows and leaves one total per column.

In [None]:
a.sum(axis=1)  # Sum across the columns (along axis 1): one total per row
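A small array with distinct values makes the axis behaviour easier to see than an array of ones and twos. A sketch (the names `m`, `col_totals`, and `row_totals` are illustrative):

```python
import numpy as np

m = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]: 2 rows, 3 columns

col_totals = m.sum(axis=0)       # collapse the rows: one total per column
row_totals = m.sum(axis=1)       # collapse the columns: one total per row

print(col_totals)                # [3 5 7]
print(row_totals)                # [ 3 12]
```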

## Boolean Numpy Arrays for Selection and Filtering

In [None]:
a = np.zeros(10, dtype=bool)
a

In [None]:
# Slicing and mass-assignment still work
a[2:5] = True
a

In [None]:
# The ~  character inverts the boolean array, operating as a "not" operator.
b = ~a
b

In [None]:
# Demonstrating "&" and "|"
a = np.array([True, False, True])
b = np.array([False, False, True])

print("A and B")
print(a & b)

print("A or B")
print(a | b)

Using boolean expressions, you can read out or assign to specific pieces of the array based on the values in the array:

In [None]:
data = np.random.randn(10)
print(data)

In [None]:
data_less_than_zero = data < 0
print(data_less_than_zero)

In [None]:
# Indexing with a boolean array (a "mask") selects only the positions where
# the mask is True. Here, every element of data whose entry in
# data_less_than_zero is True gets overwritten; the rest are left alone.
# This is an in-place operation.
data[data_less_than_zero] = 0   # Replace all values less than zero with zero
print(data)

In [None]:
data = np.random.randn(10)
data[data < 0] = 0  # You could also do this without a temporary variable (data_less_than_zero)
print(data)
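The mask can also combine several conditions with `&` and `|`; each comparison needs its own parentheses because `&` binds more tightly than `<` or `>`. For the specific job of flooring values at zero, `np.maximum` gives the same result without modifying the input. A sketch with fixed values so the result is predictable:

```python
import numpy as np

data = np.array([-2.0, -0.5, 0.0, 0.5, 1.5, 3.0])

# Keep values strictly between -1 and 2
in_range = data[(data > -1) & (data < 2)]
print(in_range)

# Same values as after data[data < 0] = 0, but returned as a new array
floored = np.maximum(data, 0)
print(floored)
```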

In [None]:
# Random! It returns a random number between 0 and 1 with a uniform distribution
# Let's take a look:
np.random.rand?

In [None]:
data = np.random.rand(8,4)*10 # Random data from 0 to 10
print(data)

In [None]:
# Show me the mean of each row (as above, axis=1 collapses the columns)
row_means = data.mean(axis=1)
print(row_means)

In [None]:
np.mean?

In [None]:
# Build a boolean mask marking the rows with a mean > 5
# For each row, take the mean, then compare it against 5 to get a one-dimensional boolean array.
mean_greater_five = data.mean(axis=1) > 5
print("mean_greater_five: ", mean_greater_five)
print()

In [None]:
# If the row mean is greater than five, retain it
# Otherwise, bye-bye!
new_matrix = data[mean_greater_five,:]
print()
print("new_matrix")
print(new_matrix)

In [None]:
print(data)

In [None]:
# OR, all in one line - without a temporary variable
new_matrix = data[ (data.mean(axis=1) > 5) ]
print()
print("new_matrix")
print(new_matrix)

## SciPy and Fitting

SciPy (pronounced "Sigh Pie") is a collection of libraries that builds on NumPy, and has lots of convenient, fast functions for working with large amounts of scientific data. It's slightly smaller than NumPy, with only 900-odd pages of documentation. That includes sections on integrating C or Fortran code into Python, which is way outside the scope of this course, but if you ever do get to the point where you need a super-efficient implementation of something, you're covered. Especially in the one-off nature of academic science, you're often better served spending less time writing code that takes longer to run, compared to spending lots and lots of time writing code that runs slightly faster.

The [stats](http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html) module of SciPy has functions for even more statistical distributions, statistical tests, and other assorted functions that a good statistician might need. As an example, let's see how we might use the [linregress](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html#scipy.stats.linregress) function, which does a linear regression on some data. Linear regression is the process of finding a line that minimizes the sum of the square of the vertical distances from each point to the line.

First, we'll set up some noisy data:

In [None]:
import numpy as np

slope = 0.5
intercept = -10

x = np.arange(0, 100)
y = slope*x + intercept
noise = 5 * np.random.normal(0, 1, size=len(x))

y = y + noise

# Plot the line - more details on this are covered in a later lecture
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(y);

Using a bit of algebra, we can compute the best-fit slope, intercept, and correlation coefficient directly:
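For reference, the next cell evaluates the standard closed-form least-squares expressions. They are written here with the same names `m`, `b`, `r`, and `n` that the code uses; every sum runs over all `n` data points:

```latex
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}
\qquad
b = \frac{\sum y_i - m \sum x_i}{n}

r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}
         {\sqrt{\left(n \sum x_i^2 - \left(\sum x_i\right)^2\right)
                \left(n \sum y_i^2 - \left(\sum y_i\right)^2\right)}}
```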

In [None]:
n = len(x)
 
m = (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x**2) - (sum(x))**2)
b = (sum(y) - m * sum(x))/n
r = (n * sum(x * y) - sum(x) * sum(y)) / np.sqrt((n*sum(x**2) - sum(x)**2)
                                                 * (n * sum(y**2) - sum(y)**2))
 
print(m, b, r)

y2 = m*x + b
plt.plot(x,y)
plt.plot(x,y2);

This gives us pretty much the right result, but it was kind of a pain to type in. If only the libraries had some sort of function that could do linear regression for us...

In [None]:
from scipy import stats
 
r_slope, r_int, r_rval, r_pval, r_stderr = stats.linregress(x, y)
 
print("Regression Slope: ", r_slope)
print("Regression Intercept: ", r_int)
print("Regression correlation: ", r_rval)
print("R^2: ", r_rval**2)
print("p(slope is 0): ", r_pval)
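As a cross-check that uses only NumPy, `np.polyfit` with degree 1 fits the same least-squares line, and `np.corrcoef` gives the correlation coefficient. A sketch on freshly generated data with the same slope and intercept as above, assuming NumPy 1.17 or newer for `default_rng` (the seed is arbitrary and the `fit_*` names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(0, 100)
y = 0.5 * x - 10 + 5 * rng.normal(0, 1, size=len(x))

# polyfit returns coefficients from the highest power down: [slope, intercept]
fit_slope, fit_intercept = np.polyfit(x, y, 1)
fit_r = np.corrcoef(x, y)[0, 1]

print(fit_slope)
print(fit_intercept)
print(fit_r)
```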

Or, if you just want to compute the correlation, there's a function for that:

In [None]:
from scipy.stats import pearsonr, spearmanr

result = pearsonr(x, y)
print("Pearson: ", result)

print()

result = spearmanr(x, y)
print("Spearman: ", result)
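The two coefficients answer slightly different questions: Pearson measures how linear the relationship is, while Spearman is the Pearson correlation of the ranks, so it only cares whether the relationship is monotonic. A NumPy-only sketch on a made-up cubic relationship (the `ranks` helper is hypothetical and assumes no tied values):

```python
import numpy as np

def ranks(values):
    """Rank of each element (0 = smallest); assumes no ties."""
    return np.argsort(np.argsort(values))

x = np.arange(1, 21)
y = x ** 3                       # monotonic, but far from a straight line

pearson = np.corrcoef(x, y)[0, 1]
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]

print(pearson)                   # below 1: the points do not lie on a line
print(spearman)                  # 1 up to floating-point rounding: the ordering is preserved
```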

SciPy can also be used to run a t-test:

In [None]:
from scipy.stats import ttest_ind

# Generate two sets of samples from the normal distribution

group1 = np.random.normal(1.3, 1, 1000)
group2 = np.random.normal(1, 1, 1000)

# Some plotting code (ignore for now)
plt.hist(group1, 100, (-5, 5), alpha=.6)
plt.hist(group2, 100, (-5, 5), alpha=.6)
plt.show()


result = ttest_ind(group1, group2)
print("P =", result.pvalue)
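For intuition about what `ttest_ind` computes by default (equal variances assumed), here is a NumPy-only sketch of the pooled-variance t statistic on two tiny made-up groups. The function name `pooled_t` is hypothetical, and this computes only the statistic, not the p-value:

```python
import numpy as np

def pooled_t(g1, g2):
    """Student's two-sample t statistic with a pooled variance estimate."""
    n1, n2 = len(g1), len(g2)
    var1 = np.var(g1, ddof=1)    # sample variances (divide by n - 1)
    var2 = np.var(g2, ddof=1)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    std_err = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (np.mean(g1) - np.mean(g2)) / std_err

group_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
group_b = np.array([2.0, 3.0, 4.0, 5.0, 6.0])

t_stat = pooled_t(group_a, group_b)
print(t_stat)
```

The further the statistic is from zero, the smaller the p-value; the p-value itself comes from the t distribution with `n1 + n2 - 2` degrees of freedom.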



## End of Part 1 - Break for Exercises

<ol>
<li style="margin-bottom: 20px"><b>Writing Mathematical Functions</b>
    <ol>
    <li>Write a function that accepts an array of floats as inputs. Return an array where every value of the input array has been divided by 1.5.</li>
    <li>Use a random function (uniform or normal) to generate an array of floats. Write a function that accepts this array, and returns a list of values that are more than one standard deviation greater or less than the mean of the array.</li>
    <li>Write a function that estimates a p-value from the exponential distribution (another distribution in numpy).  The function should take a number as an input (let's call it x), and return an estimate of the probability that a number drawn from the exponential distribution will be equal to or greater than x.  <br/><br/>To do this, generate many samples from the exponential distribution (use the default scale=1.0), count the number of samples greater than x, and divide the result by the number of samples you generated.  <br/><br/>Don't use a loop to count the number of samples greater than x.  Instead look at what happens when you use np.sum() on a boolean array, or read about the function np.count_nonzero().<br/><br/>Calling your function should look like this:<br/>
    ```
    out = my_function(3)
    print(out)  # prints 0.050316 (or close to this number)
    ```</li>
    </ol>
</li>

<li><b>Strings to arrays</b><br/>
So we had this idea that we might be able to find a periodicity in the spacing of pyrimidine residues downstream of the termination site in Rho dependent genes (by and large, we don't). Nevertheless:
    <ol>
    <li>Make a function that takes a DNA string as input (Only G, C, A, or T's) and an arbitrary substring (e.g. "CT"). The function should find all locations of the substring in the string and return it as an array.<br/><i>For Example:</i><br/>
    ```
    a = find_substring("GCACTTGCACGTACGCCGT", "AC")
    # output a contains [2, 8, 12] (or a numpy array with these values)
    ```</li>
    <li>Using the result of find_substring from (a), find the distance between each pair of adjacent substrings (i.e. how many basepairs separate each position where we found the substring). Check if a numpy function does this.<br/><i>For Example:</i><br/>
    ```
    differences = find_differences(a)
    # differences contains [6, 4]
    ```</li>
    <li>Use the fasta-parser you've written to read the S. cerevisiae genome fasta file from Lecture 1.1. Then, using the functions in parts (a) and (b), generate a full list of the spacings between 'CT' nucleotide pairs for each chromosome and return an array of the differences between adjacent positions.</li>
<li>Using numpy, compute the histogram of these spacings (we'll show you how to plot them later).  Use Google (or the documentation we linked above) to look up the right numpy function and how to use it.</li></ol></li></ol>