Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "Orion"
COLLABORATORS = ""

---

# Intro to Data Science

Based on 

- Dr. VanderPlas' book. Professor at the University of Washington and Visiting Researcher at Google.

> *The Python Data Science Handbook* by Jake VanderPlas (O’Reilly). Copyright 2016 Jake VanderPlas, 978-1-491-91205-8.

Code: <br>
https://jakevdp.github.io/PythonDataScienceHandbook/


# Aggregations: Min, Max, and Everything In Between

Often when faced with a large amount of data, a first step is to compute **summary statistics** for the data in question.

Perhaps the most common summary statistics are the **mean and standard deviation**, which allow you to summarize the "typical" values in a dataset, but other aggregates are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).

NumPy has **fast built-in aggregation** functions for working on arrays; we'll discuss and demonstrate some of them here.

## Summing the Values in an Array

As a quick example, consider computing the sum of all values in an array.
Python itself can do this using the built-in ``sum`` function:

In [4]:
import numpy as np

In [13]:
L = np.random.random(100)
print(L.shape)
sum(L)

(100,)


The syntax is quite similar to that of NumPy's ``sum`` function, and the result is the same in the simplest case:

In [None]:
np.sum(L)

However, because it executes the operation in compiled code, NumPy's version of the operation is computed much more quickly:

In [15]:
big_array = np.random.rand(1000000)
# Exercise: calculate the timing profile for both sums. How big is the difference?
# (Timed here on the 100-element L; the gap grows with array size, so also try big_array.)
%timeit np.sum(L)
%timeit sum(L)

7.74 µs ± 94.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
14.2 µs ± 879 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


Be careful, though: the ``sum`` function and the ``np.sum`` function are not identical, which can sometimes lead to confusion!

In particular, their optional arguments have different meanings, and ``np.sum`` is aware of multiple array dimensions, as we will see in the following section.
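To make the difference concrete, here is a minimal sketch on a small hand-made array (the name `grid` and its values are introduced here purely for illustration). The second positional argument of the built-in ``sum`` is a start value, whereas for ``np.sum`` it is the axis to aggregate over:

```python
import numpy as np

# Hypothetical 2x3 array used only for this illustration
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# Built-in sum iterates over the first axis, adding the rows together
builtin_total = sum(grid)          # array([5, 7, 9])

# The second argument of built-in sum is a *start* value
builtin_with_start = sum(grid, 1)  # array([ 6,  8, 10])

# np.sum aggregates over every element by default
numpy_total = np.sum(grid)         # 21

# The second argument of np.sum is the *axis* to collapse
numpy_axis1 = np.sum(grid, 1)      # array([ 6, 15])
```

The same call signature therefore gives different answers, which is one more reason to use the NumPy version consistently on NumPy arrays.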

## Minimum and Maximum

Similarly, Python has built-in ``min`` and ``max`` functions, used to find the minimum value and maximum value of any given array:

In [None]:
min(big_array), max(big_array)

NumPy's corresponding functions have similar syntax, and again operate much more quickly:

In [None]:
np.min(big_array), np.max(big_array)

In [22]:
#Exercise:
#Calculate the timing profile for both mins. How big is the difference?
%time min(big_array)

%time max(big_array)

CPU times: user 104 ms, sys: 0 ns, total: 104 ms
Wall time: 106 ms
CPU times: user 104 ms, sys: 0 ns, total: 104 ms
Wall time: 104 ms


0.99999875381888

For ``min``, ``max``, ``sum``, and several other NumPy aggregates, **a shorter syntax** is to use methods of the array object itself:

In [None]:
print(big_array.min(), big_array.max(), big_array.sum())

Whenever possible, make sure that you are using the NumPy version of these aggregates when operating on NumPy arrays!

### Multidimensional aggregates

One common type of aggregation operation is an aggregate along a row or column.
Say you have some data stored in a two-dimensional array:

In [23]:
M = np.random.random((3, 4))
print(M)

[[0.57629432 0.52438499 0.85286367 0.14103562]
 [0.23963977 0.81917582 0.45835055 0.58055543]
 [0.23457779 0.80908304 0.62789438 0.99714459]]


By default, each NumPy aggregation function will return the aggregate over the entire array:

In [None]:
M.sum()

Aggregation functions take an additional argument specifying the **axis** along which the aggregate is computed. 

For example, we can find the minimum value within each column by specifying ``axis=0``:

In [None]:
print(M.shape)
M.min(axis=0)

The function returns four values, corresponding to the four columns of numbers.

Similarly, we can find the maximum value within each row:

In [None]:
print(M.shape)
M.max(axis=1)

The way the axis is specified here can be confusing to users coming from other languages.

The ``axis`` keyword specifies the **dimension of the array that will be collapsed**, rather than the dimension that will be returned.

So specifying ``axis=0`` means that the first axis will be collapsed: for two-dimensional arrays, this means that values within each column will be aggregated.
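A quick way to check this rule is to look at the shape of the result. The sketch below uses a small array with hand-picked values (the name `X` is an assumption for this example, not part of the data above):

```python
import numpy as np

# Hypothetical 2x3 array with easy-to-check values
X = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = X.sum(axis=0)  # collapses the 2 rows: one value per column
row_sums = X.sum(axis=1)  # collapses the 3 columns: one value per row

print(col_sums, col_sums.shape)  # [5 7 9] (3,)
print(row_sums, row_sums.shape)  # [ 6 15] (2,)
```

In both cases the collapsed axis disappears from the shape, and the remaining axis determines how many values come back.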

In [36]:
#Exercise:
# Create a 5x3 random matrix and calculate its mean for each row.
arr = np.random.random((5,3))
print(arr)

print(arr.mean(axis=1))


[[0.10049576 0.71920838 0.80243118]
 [0.81981964 0.7603464  0.62229574]
 [0.12068436 0.21859548 0.48511698]
 [0.00234583 0.62997519 0.18621269]
 [0.32061441 0.14854938 0.01296926]]
[0.54071177 0.73415392 0.27479894 0.27284457 0.16071102]


### Other aggregation functions

NumPy provides many other aggregation functions, but we won't discuss them in detail here.

Additionally, most aggregates have a ``NaN``-safe counterpart that computes the result while ignoring missing values, which are marked by the special IEEE floating-point ``NaN`` value (for a fuller discussion of missing data, see Ch 3: Handling Missing Data).

Some of these ``NaN``-safe functions were not added until NumPy 1.8, so they will not be available in older NumPy versions.

The following table provides a list of useful aggregation functions available in NumPy:

|Function Name      |   NaN-safe Version  | Description                                   |
|-------------------|---------------------|-----------------------------------------------|
| ``np.sum``        | ``np.nansum``       | Compute sum of elements                       |
| ``np.prod``       | ``np.nanprod``      | Compute product of elements                   |
| ``np.mean``       | ``np.nanmean``      | Compute mean of elements                      |
| ``np.std``        | ``np.nanstd``       | Compute standard deviation                    |
| ``np.var``        | ``np.nanvar``       | Compute variance                              |
| ``np.min``        | ``np.nanmin``       | Find minimum value                            |
| ``np.max``        | ``np.nanmax``       | Find maximum value                            |
| ``np.argmin``     | ``np.nanargmin``    | Find index of minimum value                   |
| ``np.argmax``     | ``np.nanargmax``    | Find index of maximum value                   |
| ``np.median``     | ``np.nanmedian``    | Compute median of elements                    |
| ``np.percentile`` | ``np.nanpercentile``| Compute rank-based statistics of elements     |
| ``np.any``        | N/A                 | Evaluate whether any elements are true        |
| ``np.all``        | N/A                 | Evaluate whether all elements are true        |

We will see these aggregates often throughout the rest of the book.
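As a small illustration of the ``NaN``-safe column in the table, the sketch below compares a regular aggregate with its ``NaN``-safe counterpart. The array `values` is made up for this example:

```python
import numpy as np

# Hypothetical measurements with one missing value marked as NaN
values = np.array([1.0, 2.0, np.nan, 4.0])

plain_sum = np.sum(values)          # nan, because NaN propagates through the sum
safe_sum = np.nansum(values)        # 7.0, the NaN entry is ignored
safe_mean = np.nanmean(values)      # 7/3, the mean of the three valid entries
safe_argmax = np.nanargmax(values)  # 3, the index of the largest valid entry
```

Note that the regular aggregate does not raise an error; it silently returns ``nan``, so it is worth checking for missing values before summarizing.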

## Example: What is the Average Height of US Presidents?

Aggregates available in NumPy can be extremely useful for summarizing a set of values.

As a simple example, let's consider the heights of all US presidents.

This data is available in the file *president_heights.csv*, which is a simple comma-separated list of labels and values:

In [1]:
#Bash command to output the first part of files:
!head -4 data/president_heights.csv

order,name,height(cm)
1,George Washington,189
2,John Adams,170
3,Thomas Jefferson,189


We'll use the __Pandas__ package, which the author explores more fully in Chapter 3, to read the file and extract this information (note that the heights are measured in centimeters).

In [None]:
import pandas as pd
data = pd.read_csv('data/president_heights.csv')
print(data.columns)

In [None]:
#save the heights column to an array
heights = np.array(data['height(cm)'])
print(heights)
#can also use an index


Now that we have this data array, we can compute a variety of summary statistics:

In [None]:
print("Mean height:       ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height:    ", heights.min())

In [None]:
#Exercise: Calculate the max: 
print("Maximum height:    ", )

In [None]:
# Solution:
print("Maximum height:    ", heights.max())

Note that in each case, the aggregation operation reduced the entire array to a single summarizing value, which gives us information about the distribution of values.

We may also wish to compute quantiles:

In [None]:
print("25th percentile:   ", np.percentile(heights, 25))
print("Median:            ", np.median(heights))
print("75th percentile:   ", np.percentile(heights, 75))

We see that the median height of US presidents is 182 cm, or just shy of six feet.
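As a side note, ``np.percentile`` also accepts a sequence of percentiles, so all three quartiles can be computed in one call. The sketch below uses a tiny made-up array (`sample`) rather than the presidents data, so the results can be checked by hand:

```python
import numpy as np

# Hypothetical sample, chosen so the quartiles are easy to verify by hand
sample = np.array([1, 2, 3, 4, 5])

# Passing a list of percentiles returns one value per requested percentile
q25, q50, q75 = np.percentile(sample, [25, 50, 75])

print(q25, q50, q75)             # 2.0 3.0 4.0
print(q50 == np.median(sample))  # True
```

The 50th percentile and the median are the same quantity, which is why the two calls agree.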

Of course, sometimes it's more useful to see a visual representation of this data, which we can accomplish using tools in Matplotlib (the author discusses Matplotlib more fully in Chapter 4).

For example, this code generates the following chart:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
# seaborn is an extension to matplotlib.
# It prettifies plots and eases working with pandas DataFrames
import seaborn; seaborn.set()  # set plot style

In [None]:
plt.hist(heights)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');

These aggregates are some of the fundamental pieces of **exploratory data analysis** that we'll explore in more depth in later chapters of the book.