# Intro to Numpy 

Tonight we're going to start off with a little detour into the `Numpy` library, which is the fundamental package for scientific computing in Python. It turns out that the `Pandas DataFrames` that we worked with last class are actually built off of the `numpy array` (which we'll get to), so it's important to have some basic knowledge of what's running under the hood of our `DataFrames`. We started with `Pandas DataFrames` as opposed to `Numpy` and `numpy arrays` because they are a little bit more intuitive, and we're able to interact with them from a much higher level. 

While `Numpy` offers a number of things (see the [docs](http://www.numpy.org/) for a better idea), one of its mainstays is the `numpy array` (an n-dimensional array), which is what we'll focus on tonight. There are loads of things that you can do with `numpy arrays`, and tonight we're going to take a survey through them. Learning all of these really just takes working with them day in and day out, so tonight we'll aim for breadth over depth. We want you to walk out of tonight with an idea of the many things you can do with `numpy` and `numpy arrays`. 

## The basics of the Array

#### What's the big deal with Numpy Arrays?

What's so special about a `numpy array`? From a high level, they are kind of like lists - they just store a bunch of stuff in a container. Okay, great, so what's the big deal? Well, it turns out that a `numpy array` is much faster to interact with and perform calculations with than a standard list. Why is that, though? The two main reasons that they are faster are: 

1. They are stored as one contiguous block of memory, rather than being spread out across multiple locations like the items of a list. 
2. Each item in a `numpy array` is of the same data type (i.e. all integers, all floats, etc.), rather than a conglomerate of any number of data types (as a list is). 

Just how much faster are they? Well, let's take the numbers from 0 up to 1 million, and sum those numbers, timing it with both a list and a numpy array.

In [2]:
import numpy as np # Standard import - follow this!
def sum_np_array(): 
    a = np.arange(0, 1000000)
    return a.sum()
    
def sum_lst(): 
    # xrange hands back plain Python integers one at a time, just like iterating over a list would.
    a = xrange(1000000)
    return sum(a)

In [5]:
%timeit sum_np_array()

1000 loops, best of 3: 1.63 ms per loop


In [7]:
%timeit sum_lst()

100 loops, best of 3: 7.91 ms per loop


Woah, so it's about 5 times faster! This is because of those two points above. Because numpy arrays store data in contiguous blocks of memory, numpy is able to take advantage of **vectorization**, which is the ability of a CPU to perform one operation on multiple pieces of data at once. In addition, since a numpy array knows what type each object it is storing is (and those types don't change), it doesn't have to waste time checking what type each object is (like a list does). The combo of these two things speeds up our calculation quite a bit. 
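
To get a feel for what "one operation on multiple pieces of data" looks like in code, here is a minimal sketch. The names `values_list` and `values_arr` are made up for this illustration and are not part of the timing example above.

```python
import numpy as np

# Hypothetical example data: the integers 0 through 9.
values_list = list(range(10))
values_arr = np.arange(10)

# With a list, we loop over the elements ourselves.
doubled_list = [2 * x for x in values_list]

# With a numpy array, one expression is applied to every element.
doubled_arr = 2 * values_arr

print(doubled_list)
print(doubled_arr)
```

Both give the same numbers, but the array version leaves the looping to numpy's compiled code rather than the Python interpreter.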

It's also worth noting that all we did above was a sum - just a **simple** sum. When we move to doing more complicated operations, we'll save even more time! Let's look at what else numpy arrays can do...

#### Getting a Numpy Array

Now that we know how awesome `numpy arrays` are, let's dive into them. We're not going to cover everything that you can do with `numpy arrays` (see the [methods docs](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.ndarray.html#numpy.ndarray) for that), but we'll look at the basics. 

Let's start with how we can create a `numpy array`. To do this, we use the `np.array()` constructor, which as the docs state, expects some kind of array or something that exposes the array interface (i.e. acts like an array - lists and tuples are examples). So, this means that we can create a numpy array by passing in a list or tuple. 

In [26]:
my_lst_ndarray = np.array([1, 2, 3, 4, 5])
my_tuple_ndarray = np.array((1, 2, 3, 4, 5), np.int32) # You can specify the data type upon creation. 

In [23]:
# Just like we have the shape attribute on Pandas DataFrames, we also have it on numpy arrays.
print my_lst_ndarray.shape
print my_tuple_ndarray.shape

(5,)
(5,)


In [24]:
# We also have the dtype attribute, which will tell us the data type of the objects in our 
# ndarray. Remember that every item in a numpy array has the same data type. 
print my_lst_ndarray.dtype
print my_tuple_ndarray.dtype

int64
int32


In [36]:
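# If we mix strings and integers without specifying a data type, numpy picks one type that 
# can hold everything, so every item gets stored as a string (|S2 is a string of at most 2 characters). 
my_lst_ndarray2 = np.array(["1", 2, 3, "10", 5])
print my_lst_ndarray2.dtype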

|S2


In [39]:
# If you try to tell the ndarray to be a certain data type, it will try to cast every 
# object you pass in to that data type (here int32), and fail if it can't cast it to that 
# data type. Below, we are able to cast "10" to a 32 bit integer, so this is fine. 
my_lst_ndarray3 = np.array([1, 2, 3, "10", 5], np.int32) 
print my_lst_ndarray3.dtype

int32


In [40]:
# This will not work, because Python can't cast the string 'bozo' to a 32 bit integer. 
my_lst_ndarray3 = np.array([1, 2, 3, "bozo", 5], np.int32) 

ValueError: invalid literal for long() with base 10: 'bozo'
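
The casting behavior in the last few cells can be sketched in one place. This is a minimal illustration with made-up variable names; it uses `.astype()`, an array method that converts an existing array to another data type, which the cells above don't use.

```python
import numpy as np

# Mixing ints and floats: numpy upcasts everything to float.
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)

# Strings of digits can be converted after the fact with .astype().
digit_strings = np.array(["1", "10", "5"])
as_ints = digit_strings.astype(np.int32)
print(as_ints)

# A string that is not a number raises a ValueError, which we can catch.
try:
    np.array(["1", "bozo"]).astype(np.int32)
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)
```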

In [3]:
# Other ways of creating numpy arrays. We just want you to know these exist. 
zeros_arr = np.zeros((3,4)) # Create a matrix of zeros with 3 rows and 4 columns. 
ones_arr = np.ones((10,20)) # Create a matrix of ones with 10 rows and 20 columns.
identity_arr = np.identity(50) # Create an identity matrix with 50 rows and 50 columns. 
random_arr = np.random.rand(2, 2) # Create a 2x2 array of random floats ranging from 0 to 1. 
range_arr = np.arange(0, 20, 0.5) # Create a numpy array with arguments (start, end, step_size); the end value itself is not included. 
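
A quick way to check what these helpers hand back is to look at shapes and end points. The sketch below uses smaller, made-up sizes (and its own variable names) so the output stays readable.

```python
import numpy as np

small_zeros = np.zeros((3, 4))
small_identity = np.identity(3)
half_steps = np.arange(0, 20, 0.5)

print(small_zeros.shape)   # (rows, columns)
print(small_identity)      # ones on the diagonal, zeros elsewhere
print(half_steps[0])       # the start value is included
print(half_steps[-1])      # the end value (20) is not
print(len(half_steps))
```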

## A Little bit more of Numpy Arrays

Okay, so now that I have a `numpy array`, what can I do with it? Well, there are a ton of things! You can index into them, perform calculations, ask for aggregation type metrics, etc. 

#### Indexing 

Let's begin by indexing into them. With `numpy arrays`, we don't have the `.loc[]`, `.iloc[]`, or `.ix[]` indexers like we do on a DataFrame - we simply index into them like we would a list. An array is effectively a multidimensional list, though, so we can pass it multiple indexing values (one per dimension). Let's take a look. 

In [11]:
# Reshape will reshape the data to the shape that you tell it to (here 5 rows, 4 columns). 
range_arr = np.arange(0, 20, 1).reshape(5, 4)
range_arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [23]:
range_arr[:, 2] # Grab every row, but only the element at index 2 in those rows. 
range_arr[0:2] # With no second index, this defaults to taking the rows. 
range_arr[0:2, 1:3] # The first set of numbers refers to the rows to grab, the second set the columns.  

array([[1, 2],
       [5, 6]])
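
To see how this differs from a plain nested list, here is a small sketch. `nested_lst` and `grid` are made-up names for this comparison only.

```python
import numpy as np

nested_lst = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
grid = np.array(nested_lst)

# A nested list needs one pair of brackets per level.
print(nested_lst[1][2])

# An array takes the row and column in a single pair of brackets.
print(grid[1, 2])

# Grabbing a column is one step on an array, but needs a loop on a nested list.
print(grid[:, 2])
print([row[2] for row in nested_lst])
```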

#### Other methods

Now let's look at some of the other methods that are available. Again, there is a ton we can do, and we're aiming here to at least get your eyes on a lot of the things that are possible, and give you a notebook here that you can look back at to see what is possible (Google is also amazing for this). 

In [24]:
# Remember what this looks like. 
range_arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [28]:
print range_arr.sum(axis=0) # Sum along the rows (i.e. get column totals)
print range_arr.sum(axis=1) # Sum along the columns (i.e. get row totals)
print range_arr.sum() # Get sum of all elements in numpy array. 

[40 45 50 55]
[ 6 22 38 54 70]
190
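
One way to keep the `axis` argument straight is to watch the shapes: the axis you sum over is the one that disappears. A minimal sketch, rebuilding the same 5x4 array so it runs on its own:

```python
import numpy as np

range_arr = np.arange(0, 20, 1).reshape(5, 4)

col_totals = range_arr.sum(axis=0)  # collapses the 5 rows, leaving 4 column totals
row_totals = range_arr.sum(axis=1)  # collapses the 4 columns, leaving 5 row totals

print(range_arr.shape)
print(col_totals.shape)
print(row_totals.shape)
```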


In [43]:
# This grabs the mean, standard deviation, max, and min values along the rows 
# (i.e. for the columns). We could also do this along the columns, or for the 
# array as a whole. 
print range_arr.mean(axis=0)
print range_arr.std(axis=0)
print range_arr.max(axis=0)
print range_arr.min(axis=0)

[  8.   9.  10.  11.]
[ 5.65685425  5.65685425  5.65685425  5.65685425]
[16 17 18 19]
[0 1 2 3]


In [44]:
# If we want to instead grab the **index** at which those min and max values occur (either
# along the rows or columns), then we can use the .argmin() and .argmax() methods available 
# on our numpy array. 
print range_arr.argmin(axis=0) # We see that the mins of each column occur at row 1 (index 0). 
print range_arr.argmax(axis=0) # We see that the maxes of each column occur at row 5 (index 4).
print range_arr.argmin() # With no axis, we get the index of the overall minimum in the flattened array (index 0). 
print range_arr.argmax() # Likewise, the overall maximum sits at the last index of the flattened array (index 19). 

[0 0 0 0]
[4 4 4 4]
0
19
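
When `.argmax()` is called with no axis, the index it returns counts through the array as if it were flattened. If you want the row and column instead, one option is `np.unravel_index()`, a numpy function that isn't used elsewhere in this notebook. A small sketch:

```python
import numpy as np

range_arr = np.arange(0, 20, 1).reshape(5, 4)

# Position of the maximum in the flattened array.
flat_index = range_arr.argmax()

# Convert that flat position back into a (row, column) pair.
row, col = np.unravel_index(flat_index, range_arr.shape)

print(flat_index)
print(row)
print(col)
print(range_arr[row, col])
```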


In [49]:
# We can get the cumulative sum or product with the following. 
print range_arr.cumsum(axis=0)  # Here it gets the cumsum along the rows (i.e. from top to bottom)
print range_arr.cumprod(axis=0) # Gets the cumprod along the rows

[[ 0  1  2  3]
 [ 4  6  8 10]
 [12 15 18 21]
 [24 28 32 36]
 [40 45 50 55]]
[[    0     1     2     3]
 [    0     5    12    21]
 [    0    45   120   231]
 [    0   585  1680  3465]
 [    0  9945 30240 65835]]
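
A handy sanity check on a cumulative sum is that its final entry along an axis should match the plain sum along that same axis. A minimal sketch of that check:

```python
import numpy as np

range_arr = np.arange(0, 20, 1).reshape(5, 4)

running_totals = range_arr.cumsum(axis=0)

# The last row of the running totals equals the column totals.
print(running_totals[-1])
print(range_arr.sum(axis=0))

# With no axis, cumsum runs over the flattened array, ending at the overall sum.
print(range_arr.cumsum()[-1])
```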


In [53]:
# We can flatten our arrays as follows. flatten() always returns a copy, while ravel() 
# returns a view of the original data when it can. 
print range_arr.flatten()
print range_arr.ravel()

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
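
The two outputs look identical, but the methods differ in whether they copy the data. The sketch below illustrates this under the assumption that the array is a freshly reshaped `np.arange` (as ours is), in which case `ravel()` hands back a view.

```python
import numpy as np

range_arr = np.arange(0, 20, 1).reshape(5, 4)

flat_copy = range_arr.flatten()
flat_view = range_arr.ravel()

# Changing the copy leaves the original alone.
flat_copy[0] = 100
print(range_arr[0, 0])

# Changing the view writes through to the original.
flat_view[0] = -1
print(range_arr[0, 0])
```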


## A Brief look at a cool Numpy method

The majority, if not all, of the methods that we looked at for numpy arrays are available on Pandas columns. They might have some slightly different naming conventions (`idxmax` on a column versus `argmax` on a numpy array, for example), but since Pandas DataFrames are built on Numpy arrays, the methods available on Numpy arrays largely coincide with the methods available on a column from a Pandas DataFrame. 
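
Here is a small sketch of that naming difference, using a made-up Series called `scores` rather than the wine data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three scores labelled by name.
scores = pd.Series([3, 9, 4], index=["a", "b", "c"])

# On the pandas Series, idxmax returns the index *label* of the maximum.
print(scores.idxmax())

# On the underlying numpy array, argmax returns the integer *position*.
print(scores.values.argmax())

# Aggregations such as sum go by the same name in both libraries.
print(scores.sum())
print(scores.values.sum())
```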

Many of these methods are available as functions on the `numpy` module itself, as well. Just like we can call `argmax` on a numpy array, we can call `np.argmax()`, and pass in a list or tuple. Before we move back to DataFrames, let's look at one last method that is available in `numpy`, `np.where()`. `np.where()` can help us to find what elements in a numpy array meet some condition. 

In [54]:
my_ndarray = np.array([2, 4, 6, 8, 24, 3, 8, 9, 12])

In [60]:
print np.where(my_ndarray <= 2) # Returns a tuple holding an array of the indices where the condition is true. 
print np.where(my_ndarray == 8)
print np.where(my_ndarray > 6)

(array([0]),)
(array([3, 6]),)
(array([3, 4, 6, 7, 8]),)
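
Indices on their own are only half the story, so here is a sketch of two things you might do next: pull out the matching values, and use the three-argument form of `np.where()` to choose between two values per element. The `"big"` and `"small"` labels are made up for the illustration.

```python
import numpy as np

my_ndarray = np.array([2, 4, 6, 8, 24, 3, 8, 9, 12])

# Use the indices from np.where() to pull out the matching values.
big_idx = np.where(my_ndarray > 6)
print(my_ndarray[big_idx])

# Indexing with the condition itself gives the same values.
print(my_ndarray[my_ndarray > 6])

# With three arguments, np.where() picks between two values for each element.
labels = np.where(my_ndarray > 6, "big", "small")
print(labels)
```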


# Combining Datasets With Pandas 

Pandas' ways of combining two sets of data include the use of `pd.merge()`, `df.join()`, and `pd.concat()`. For the most part, these three do largely the same things (although you'll notice the slight syntax difference, with `.merge()` and `.concat()` called via the Pandas module and `.join()` called on the DataFrame itself). There are some cases where one of them might be better than the others in terms of writing less code or performing some kind of data combining in an easier way. The major differences between these, though, largely depend on what they do by default when you try to combine different data sets. By default, `.merge()` looks to join on common columns, `.join()` on common indices, and `.concat()` by just appending on a given axis.

You can find more detail about the differences between all three of these in the [docs](http://pandas.pydata.org/pandas-docs/stable/merging.html). We'll look at some examples below. 
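
Before turning to the wine data, the default behaviors can be sketched on a tiny pair of tables. The DataFrames `left` and `right` and their columns are made up for this illustration.

```python
import pandas as pd

# Hypothetical toy data: two small tables that share a 'key' column.
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "right_val": [10, 20, 40]})

# merge: matches rows on the shared 'key' column, keeping keys found in both.
merged = pd.merge(left, right)
print(merged)

# join: matches rows on the index, keeping every row of the calling DataFrame.
joined = left.set_index("key").join(right.set_index("key"))
print(joined)

# concat: simply stacks the frames along an axis (rows by default).
stacked = pd.concat([left, right])
print(stacked)
```

Notice that `merged` only has keys `a` and `b`, `joined` keeps `c` with a missing `right_val`, and `stacked` has all six rows with gaps wherever a column didn't exist in one of the inputs.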

In [62]:
# We'll go back to our wine data set. Who doesn't love wine?
import pandas as pd
wine_df = pd.read_csv('data/winequality-red.csv', delimiter=';')
wine_df.columns

Index([u'fixed acidity', u'volatile acidity', u'citric acid',
       u'residual sugar', u'chlorides', u'free sulfur dioxide',
       u'total sulfur dioxide', u'density', u'pH', u'sulphates', u'alcohol',
       u'quality'],
      dtype='object')

In [65]:
# get_dummies is a function called on the pandas module - you simply pass in a Pandas Series 
# or DataFrame, and it will convert categorical variables into dummy/indicator variables. 
quality_dummies = pd.get_dummies(wine_df.quality, prefix='quality')
quality_dummies.head()

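| | quality_3 | quality_4 | quality_5 | quality_6 | quality_7 | quality_8 |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 |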


In [68]:
# Okay, so let's start off by looking at df.join. Remember, this joins on indices by default. 
# This means that we can simply join our quality dummies dataframe back to our original 
# wine dataframe. 
joined_df = wine_df.join(quality_dummies)
joined_df.head() 

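| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | quality_3 | quality_4 | quality_5 | quality_6 | quality_7 | quality_8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 | 0 | 1 | 0 | 0 | 0 |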


In [73]:
# Let's now look at concat. With axis=1, the DataFrames are placed side by side and lined 
# up on their indices. 
joined_df2 = pd.concat([quality_dummies, wine_df], axis=1)
joined_df2.head()

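| | quality_3 | quality_4 | quality_5 | quality_6 | quality_7 | quality_8 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |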


In [76]:
# Let's read in a different data set: the white wines, alongside the red wines we already have. 
red_wines_df = pd.read_csv('data/winequality-red.csv', delimiter=';')
white_wines_df = pd.read_csv('data/winequality-white.csv', delimiter=';')

In [77]:
red_wines_df.columns

Index([u'fixed acidity', u'volatile acidity', u'citric acid',
       u'residual sugar', u'chlorides', u'free sulfur dioxide',
       u'total sulfur dioxide', u'density', u'pH', u'sulphates', u'alcohol',
       u'quality'],
      dtype='object')

In [78]:
white_wines_df.columns

Index([u'fixed acidity', u'volatile acidity', u'citric acid',
       u'residual sugar', u'chlorides', u'free sulfur dioxide',
       u'total sulfur dioxide', u'density', u'pH', u'sulphates', u'alcohol',
       u'quality'],
      dtype='object')

In [86]:
# Get the mean fixed acidity for each quality score, and turn quality back into a regular column. 
red_wines_quality_df = red_wines_df.groupby('quality').mean()['fixed acidity'].reset_index()

In [85]:
# Same thing for the white wines. 
white_wines_quality_df = white_wines_df.groupby('quality').mean()['fixed acidity'].reset_index()

In [87]:
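# Merge on the shared quality column. Since both DataFrames have a 'fixed acidity' column, 
# pandas adds the suffixes _x (left DataFrame, red) and _y (right DataFrame, white). 
pd.merge(red_wines_quality_df, white_wines_quality_df, on=['quality'])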

| | quality | fixed acidity_x | fixed acidity_y |
|---|---|---|---|
| 0 | 3 | 8.36 | 7.6 |
| 1 | 4 | 7.779245 | 7.129448 |
| 2 | 5 | 8.167254 | 6.933974 |
| 3 | 6 | 8.347179 | 6.837671 |
| 4 | 7 | 8.872362 | 6.734716 |
| 5 | 8 | 8.566667 | 6.657143 |
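
The `_x` and `_y` endings are what `pd.merge()` adds by default when both inputs have a column with the same name. If you'd like friendlier names, `pd.merge()` accepts a `suffixes` argument. A minimal sketch, using small made-up stand-ins for the two per-quality DataFrames:

```python
import pandas as pd

# Hypothetical stand-ins for the per-quality means of each wine type.
red_means = pd.DataFrame({"quality": [5, 6], "fixed acidity": [8.2, 8.3]})
white_means = pd.DataFrame({"quality": [5, 6], "fixed acidity": [6.9, 6.8]})

# Name the overlapping columns after the DataFrame they came from.
merged = pd.merge(red_means, white_means, on="quality",
                  suffixes=("_red", "_white"))
print(merged.columns.tolist())
```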


## Pivot Tables

From [wiki](https://en.wikipedia.org/wiki/Pivot_table): 'Among other functions, a pivot table can automatically sort, count total or give the average of the data stored in one table or spreadsheet, displaying the results in a second table showing the summarized data. Pivot tables are also useful for quickly creating unweighted cross tabulations.'

As you might have guessed, we have pivot tables that are available for our use in Pandas. The way that we do this is by calling the `.pivot_table()` function that is available on the pandas module (which we've stored as `pd`). As the [docs](http://pandas.pydata.org/pandas-docs/stable/reshaping.html#pivot-tables-and-cross-tabulations) tell us, `.pivot_table()` expects a number of different arguments: 

1. data: A DataFrame object
2. values: a column or a list of columns to aggregate
3. index: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table index. If an array is passed, it is used in the same manner as column values.
4. columns: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table column. If an array is passed, it is used in the same manner as column values.
5. aggfunc: function to use for aggregation, defaulting to numpy.mean

So notice by default this uses the mean for the `aggfunc` parameter. 

In [89]:
# Let's recall what the data looks like. 
red_wines_df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [98]:
# So if we wanted to get the mean fixed acidity for each quality category like we did 
# earlier, but now with a pivot_table, we could do this..
pd.pivot_table(red_wines_df, values='fixed acidity', index='quality')

quality
3    8.360000
4    7.779245
5    8.167254
6    8.347179
7    8.872362
8    8.566667
Name: fixed acidity, dtype: float64

In [102]:
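# We can swap in a different aggregation function with aggfunc - here, the sum instead of the mean. 
pd.pivot_table(red_wines_df, values='fixed acidity', index='quality', aggfunc=np.sum)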

quality
3      83.6
4     412.3
5    5561.9
6    5325.5
7    1765.6
8     154.2
Name: fixed acidity, dtype: float64
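
The examples above only use `index`, so here is a sketch of what the `columns` argument adds. The data is a tiny made-up DataFrame (`toy_df`, with a hypothetical `wine_type` column), not the real wine files, and the `groupby` version uses `.unstack()`, which isn't covered in this notebook, purely to show the two routes agree.

```python
import pandas as pd

# Hypothetical toy data: a few wines with a type, a quality score, and an acidity.
toy_df = pd.DataFrame({
    "wine_type": ["red", "red", "white", "white", "red", "white"],
    "quality": [5, 6, 5, 6, 5, 5],
    "fixed acidity": [7.0, 8.0, 6.0, 6.5, 9.0, 7.0],
})

# Rows are quality scores, columns are wine types, cells are mean acidity.
pivoted = pd.pivot_table(toy_df, values="fixed acidity",
                         index="quality", columns="wine_type",
                         aggfunc="mean")
print(pivoted)

# The same numbers via groupby, then unstack to move wine_type into the columns.
grouped = toy_df.groupby(["quality", "wine_type"])["fixed acidity"].mean().unstack()
print(grouped)
```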