# Soft Introduction to Descriptive Statistics with `NumPy` 

**Note: The Game of Thrones Data Set is explained in the R Notebook Lab.**

Now that we have the basics of data manipulation with `pandas` under our belts, we are going to begin diving into descriptive statistics. Don't worry; we aren't leaving data manipulation behind. It will play a big role throughout the rest of the course and our data science careers.

Today we will be introduced to another popular data science library known as `NumPy`. `NumPy` provides us with functions for running mathematical operations on large arrays and matrices. Today we will be using it for some of our descriptive statistics...

### Read in Data Set

In [None]:
import pandas as pd
import numpy as np

with open('../../../datasets/game-of-thrones/GoT_age_at_death.csv') as file:
    df = pd.read_csv(file)
    df.columns  = ['character', 'age', 'dead', 'gender', 'affiliation'] # change file header names
    
    # change column types
    df['dead'] = df['dead'].astype('category')
    df['gender'] = df['gender'].astype('category')
    df['affiliation'] = df['affiliation'].astype('category')

In [None]:
df.head() 

### The Mean

#### A brief note about `pandas` and `NumPy`

It is important to know that `pandas` is built on top of `NumPy` and therefore much of the functionality that we will be introducing today as `NumPy` methods is also available with `pandas`. Let's see for ourselves...

In [None]:
pandas_mean = df.age.mean()
numpy_mean = np.mean(df.age)

print("I am the mean constructed with pandas: {}".format(pandas_mean))
print("And I am the mean constructed with NumPy: {}".format(numpy_mean))

See... the two produce the same result, although the methods look a bit different. Either way is fine for finding the mean from a numeric column of a data frame. But the `pandas` way will only work on a `pandas` object. Imagine if we wanted to find the mean of a numeric list we created without `pandas`. 

In [None]:
x = [1,2,3,4,5] # create a list of numbers

The `pandas` way won't work...

In [None]:
x.mean()  # raises AttributeError: a plain Python list has no mean() method

But the `NumPy` way will...

In [None]:
np.mean(x)

For the majority of the following descriptive statistics we will be using `NumPy` so we can familiarize ourselves with the `NumPy` package, which will be used heavily in other modules and courses.
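
As a small illustration of the point above, the sketch below shows two ways of getting the mean of a plain list: handing it straight to `np.mean`, or wrapping it in a `pandas` Series first. The variable names are made up for this example.

```python
import numpy as np
import pandas as pd

x = [1, 2, 3, 4, 5]  # a plain Python list, no pandas involved

# NumPy converts the list to an array behind the scenes
numpy_result = np.mean(x)

# Wrapping the list in a Series makes the pandas method available
series_result = pd.Series(x).mean()

print(numpy_result, series_result)
```

Both approaches give the same answer; the `NumPy` call simply saves us the conversion step.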

### The Standard Deviation

For a more detailed description of all of the statistics that we will be using in this notebook, refer to the `R` lab notebook on the topic, as it discusses a bit more of the logic behind each one. Below is how to find the standard deviation with `NumPy`.

In [None]:
np.std(df.age)

Hmm...this looks a little odd. If you have a good memory and can remember back to our `R` lab notebook, you might recall that the standard deviation of this dataset's `Age` variable was 19.0something. Although 18.992 is rather close to 19.0..., the two results differ by design.

When looking at statistics we have populations and then we have samples. Simply put, a population is everyone while a sample is a subset of the population. Imagine we wanted to measure the heights of Oregonians. The population would be a measurement for every single person from Oregon. This would be impossible to do, so instead we would take a sample of individuals from Oregon. 

Now, standard deviation is found by taking the square root of a statistic called the "variance", and the way variance is computed for the population is just a tad bit different from how it is computed for the sample (we will get more into the math behind this in the Statistical and Mathematical Foundations course). It is this small difference that produces the different results. `NumPy` defaults to the population standard deviation.

Here is how we specify the sample standard deviation...

In [None]:
np.std(df.age, ddof=1)

There we have it! Instead of getting too deep into the math behind this `ddof` parameter and the argument that we passed it, just know that it subtracts 1 from the sample size (the number of data points) in the denominator of the variance, which is what it takes to produce the sample standard deviation. We will get into this parameter a bit more during the Mathematical and Statistical Foundations course.
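
To make the effect of `ddof` a little more concrete, here is a minimal sketch that computes both versions of the standard deviation by hand and compares them with `np.std`. The numbers are made up for illustration and are not from the *Game of Thrones* dataset.

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up data
n = len(values)

# Sum of squared distances from the mean, shared by both versions
squared_deviations = ((values - values.mean()) ** 2).sum()

population_sd = np.sqrt(squared_deviations / n)    # divide by n
sample_sd = np.sqrt(squared_deviations / (n - 1))  # divide by n - 1

print(population_sd, np.std(values))        # these two match
print(sample_sd, np.std(values, ddof=1))    # and so do these two
```

One thing worth knowing: the `pandas` method `Series.std()` uses `ddof=1` by default, so `df.age.std()` returns the sample standard deviation without any extra argument.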

### The Median

The median is simple to find with `NumPy`. Remember, the median is far less sensitive to outliers than the mean and is therefore sometimes preferable when we want a single number that describes the "typical" value.

Here's how we do it...

In [None]:
np.median(df.age)
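
To see why the median holds up better against outliers, here is a small sketch using made-up ages (not values from our dataset). We compute the mean and median, add one extreme value, and compute them again.

```python
import numpy as np

ages = [20, 22, 25, 27, 30]       # made-up values
ages_with_outlier = ages + [300]  # same values plus one extreme point

mean_before, median_before = np.mean(ages), np.median(ages)
mean_after, median_after = np.mean(ages_with_outlier), np.median(ages_with_outlier)

print("before:", mean_before, median_before)
print("after: ", mean_after, median_after)
```

The single outlier drags the mean far upward, while the median barely moves.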

### Quartiles

In `R`, you may recall, the `summary` function will print the quartiles as well as the mean of a numeric variable. In `pandas` the `describe()` method is akin to that, except that it also reports the number of values in the variable and its standard deviation.

Let's try it out below.

In [None]:
df.age.describe()

That's nice and all, but what if we wanted to see the value below which some other percentage of the data falls? `NumPy` provides us with a very convenient function to do so.

In [None]:
np.percentile(df.age, 65)

So 65% of the values in the age variable are 41 and below. This function is simple to use. We passed it two arguments: the first is a numeric object, in this case the `age` variable of our *Game of Thrones* dataset, and the second is a number between 0 and 100 signifying the percentage.
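
Here is a minimal sketch of how `np.percentile` relates to the quartiles we saw earlier, using the made-up numbers 1 through 100 rather than our dataset. Note that the second argument can also be a list of percentages.

```python
import numpy as np

scores = np.arange(1, 101)  # the integers 1 to 100, made up for illustration

# The quartiles are just the 25th, 50th, and 75th percentiles
q1, q2, q3 = np.percentile(scores, [25, 50, 75])

# Any other percentage works the same way
p65 = np.percentile(scores, 65)

# Share of the values that sit at or below the 65th percentile
share_at_or_below = np.mean(scores <= p65)

print(q1, q2, q3)
print(p65, share_at_or_below)
```

The 50th percentile is the same number that `np.median` returns. By default `np.percentile` interpolates between neighboring data points, so the result is not always a value that appears in the data.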

### Maximum and Minimum Values

Once again, we see that there are multiple ways to do the same thing. The `describe()` output above already includes them, but we can also find the maximum and minimum values of a variable by calling the `amax` and `amin` functions.

In [None]:
np.amax(df.age)

In [None]:
np.amin(df.age)
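
As a quick sketch of the "multiple ways" idea, the example below finds the maximum and minimum of a made-up Series with both the `NumPy` functions and the equivalent `pandas` methods.

```python
import numpy as np
import pandas as pd

ages = pd.Series([12, 45, 7, 88, 33])  # made-up values

numpy_max, numpy_min = np.amax(ages), np.amin(ages)  # NumPy functions
pandas_max, pandas_min = ages.max(), ages.min()      # pandas methods

print(numpy_max, pandas_max)
print(numpy_min, pandas_min)
```

Either style is fine for a column of a data frame; the `NumPy` functions also work on plain lists.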

## Bivariate Analysis

We're going to jump over to our Stature Hand and Foot dataset to find the covariance and correlation. Remember, these are statistics that look at the relationship between two variables. For these two statistics, we will be using `pandas` again.

### Read in the Data

In [None]:
with open('../../../datasets/stature-hand-foot/stature-hand-foot.csv') as file2:
    df2 = pd.read_csv(file2)
    # rename the columns first so the names used below are guaranteed to exist
    df2.columns = ['gender', 'height', 'hand_length', 'foot_length']
    df2['gender'] = df2['gender'].astype('category')

In [None]:
df2.head()

### Covariance

When finding the covariance of two variables, you are going to be calling a method on a `pandas` Series object. In other words, we are going to be calling a method on one numeric variable and passing another variable as an argument. Take a look at the example below.

In [None]:
df2.hand_length.cov(df2.foot_length)

As you can see, we run the method on the `hand_length` variable and pass the `foot_length` variable as the argument.
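
For the curious, here is a minimal sketch of what `cov` is doing under the hood, using made-up hand and foot lengths rather than the real dataset. It compares the `pandas` result with a by-hand calculation and with `NumPy`'s `np.cov`, which returns a 2x2 matrix whose off-diagonal entry is the covariance.

```python
import numpy as np
import pandas as pd

# Made-up measurements, for illustration only
hand = pd.Series([18.0, 19.5, 20.0, 21.0, 22.5])
foot = pd.Series([24.0, 25.0, 26.5, 27.0, 29.0])

pandas_cov = hand.cov(foot)

# Sample covariance by hand: average product of deviations, dividing by n - 1
deviation_products = (hand - hand.mean()) * (foot - foot.mean())
manual_cov = deviation_products.sum() / (len(hand) - 1)

# np.cov returns a covariance matrix; [0, 1] is the hand/foot covariance
numpy_cov = np.cov(hand, foot)[0, 1]

print(pandas_cov, manual_cov, numpy_cov)
```

All three numbers agree because both `pandas` and `np.cov` divide by n - 1 by default.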

### Correlation

And now to assess the strength of the relationship between these two variables. As you can imagine, the syntax is the same as it was for the covariance.

In [None]:
df2.hand_length.corr(df2.foot_length)
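
To show how correlation relates to covariance, the sketch below uses made-up hand and foot lengths (not the real dataset). The Pearson correlation, which is what `corr` computes by default, is the covariance divided by the product of the two standard deviations, so it always lands between -1 and 1.

```python
import numpy as np
import pandas as pd

# Made-up measurements, for illustration only
hand = pd.Series([18.0, 19.5, 20.0, 21.0, 22.5])
foot = pd.Series([24.0, 25.0, 26.5, 27.0, 29.0])

pandas_corr = hand.corr(foot)

# Covariance rescaled by both standard deviations
manual_corr = hand.cov(foot) / (hand.std() * foot.std())

# np.corrcoef returns a correlation matrix; [0, 1] is the hand/foot entry
numpy_corr = np.corrcoef(hand, foot)[0, 1]

print(pandas_corr, manual_corr, numpy_corr)
```

Because the units cancel out in the division, the correlation is easier to interpret than the raw covariance.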

This wraps up our lab notebooks for this module. In the practices, we will be combining our data manipulation skills from module 1 with what we just learned. This allows us to add another layer of exploration and perhaps uncover some more meaningful information and better relationships.