
In [None]:
from scipy import stats
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

%matplotlib inline

plt.style.use('fivethirtyeight')

# Descriptive Analysis 

When you are trying to understand a dataset, looking at the raw values alone rarely yields much insight. We need ways to condense many values into a small set of numbers that summarize the data and are easy to digest, both for you and for the people you work with. We call these summaries **descriptive statistics**.

## Objectives

- Use business context to guide exploratory analyses
- Pose clear business-relevant questions and answer them with data
- Identify limitations of data for solving business problems

# Sample Data

Let's build a simple dataset, based on a hypothetical survey of the number of pairs of shoes owned by 11 random people:

In [None]:
# Assign the numpy array to the 'data' variable.
data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])                     
data

# Different Statistical Measures 📊

## Measures of Center

One natural place to begin is to ask where the **middle** of the data is. In other words, which single value best represents all of the others?

There are three common measures used to describe the "middle":

- **Mean**: The sum of values / number of values
- **Median**: The value with as many values above it as below it
    - If the dataset has an even number of values, the median is the mean of the two middle numbers.
- **Mode**: The most frequent value(s)
    - A dataset can have multiple modes if multiple values are tied for the most frequent.

Let's see what we have for our example:

In [None]:
print(f"Mean: {np.mean(data)}")
print(f"Median: {np.median(data)}")
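# np.atleast_1d handles both older SciPy (array result) and newer SciPy (scalar result)
print(f"Mode: {np.atleast_1d(stats.mode(data).mode)[0]}")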
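To connect the definitions to the library calls above, here is a minimal sketch that computes the mean and median by hand. The names `manual_mean`, `manual_median`, and `sorted_data` are illustrative, and the array is repeated so the snippet runs on its own.

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

# Mean: sum of values divided by the number of values.
manual_mean = data.sum() / len(data)

# Median: sort, then take the middle value
# (or the mean of the two middle values when the count is even).
sorted_data = np.sort(data)
n = len(sorted_data)
if n % 2 == 1:
    manual_median = sorted_data[n // 2]
else:
    manual_median = (sorted_data[n // 2 - 1] + sorted_data[n // 2]) / 2

print(f"Manual mean: {manual_mean:.2f}")    # 52 / 11, prints 4.73
print(f"Manual median: {manual_median}")    # prints 4
```

With 11 values, the median is simply the 6th value of the sorted array, so the even-count branch is not used here.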

In [None]:
# You can also find the mode(s) using np.unique()
counts = np.unique(data, return_counts=True)
counts

# np.unique returns a tuple of two arrays: the sorted unique values, then how many times each one occurs in our dataset.
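Building on the `np.unique()` output, the following sketch picks out every value tied for the highest count, so it also handles datasets with more than one mode. The names `values`, `value_counts`, `modes`, and the small `tied` array are assumptions made for illustration.

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

# Unpack the unique values and their counts into separate arrays.
values, value_counts = np.unique(data, return_counts=True)

# Keep every value whose count equals the largest count.
modes = values[value_counts == value_counts.max()]
print(f"Mode(s): {modes}")    # prints Mode(s): [8]

# A small hypothetical dataset where two values are tied for most frequent.
tied = np.array([1, 1, 2, 2, 3])
tied_values, tied_counts = np.unique(tied, return_counts=True)
tied_modes = tied_values[tied_counts == tied_counts.max()]
print(f"Mode(s) with a tie: {tied_modes}")    # prints Mode(s) with a tie: [1 2]
```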

**Discussion**: If somebody asked you "How many pairs of shoes do people usually have?", how would you answer (based on these data)?

## Measures of Spread

Another natural question is about the **spread** of the data or the **dispersion**. In other words, how wide a range of values do you have? And how close or far are they from the "middle"?

### Min, Max, and Range

The minimum and maximum values of a dataset tell you the full extent of the values of your dataset. The range of the dataset is the difference between those two values.

In [None]:
print(f"Min: {data.min()}")
print(f"Max: {data.max()}")
print(f"Range: {data.max() - data.min()}")

### Percentiles and IQR

You can also calculate values at various **percentiles** to understand the spread. An "Nth percentile" is a value below which roughly N% of the data falls. The 25th and 75th percentiles are commonly used to describe spread, and the **interquartile range (IQR)** is the difference between these two values.

See [the docs](https://numpy.org/doc/stable/reference/generated/numpy.percentile.html) for more specifics about how percentiles are calculated, which is surprisingly tricky.

In [None]:
print(f"25th Percentile: {np.percentile(data, 25)}")
print(f"75th Percentile: {np.percentile(data, 75)}")
print(f"IQR: {np.percentile(data, 75) - np.percentile(data, 25)}")
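Part of what makes percentiles tricky is that the requested position usually falls between two data points. As a rough sketch, the helper below (a hypothetical function named `linear_percentile`, not part of NumPy) interpolates linearly between the two nearest sorted values, which is the approach `np.percentile` takes by default.

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

def linear_percentile(values, q):
    """Percentile via linear interpolation between the two nearest sorted values."""
    ordered = np.sort(values)
    # Fractional index into the sorted array (0 for q=0, n-1 for q=100).
    position = (len(ordered) - 1) * q / 100
    lower = int(np.floor(position))
    upper = int(np.ceil(position))
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

q1 = linear_percentile(data, 25)
q3 = linear_percentile(data, 75)
print(f"25th: {q1}, 75th: {q3}, IQR: {q3 - q1}")    # prints 25th: 3.0, 75th: 7.0, IQR: 4.0
```

For the 75th percentile, the position is 7.5, halfway between the sorted values 6 and 8, which gives 7.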

### Standard Deviation

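The **standard deviation** measures how far a typical data point lies from the mean. For a dataset of $n$ values with mean $\bar{x}$, it is defined as: $$\sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}$$ This is the population form, which divides by $n$; it is what `data.std()` computes by default.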


In [None]:
print(f"Standard Deviation: {data.std()}")
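The formula can be checked step by step. The sketch below computes the standard deviation directly from the deviations, and also shows the sample version that divides by $n - 1$, which NumPy exposes through the `ddof=1` argument. The variable names are illustrative.

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

# Distance of each point from the mean.
deviations = data - data.mean()
sum_of_squares = (deviations ** 2).sum()

# Population form: divide by n (matches data.std()).
population_std = np.sqrt(sum_of_squares / len(data))

# Sample form: divide by n - 1 (matches data.std(ddof=1)).
sample_std = np.sqrt(sum_of_squares / (len(data) - 1))

print(f"Population std (divide by n): {population_std:.4f}")
print(f"Sample std (divide by n - 1): {sample_std:.4f}")
```

Which divisor is appropriate depends on whether the 11 respondents are treated as the whole population or as a sample drawn from a larger one.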

# Visual Description

A picture is worth a thousand words - or numbers! Here we will show how to use histograms and box-and-whisker plots to describe your data.

## Histograms

One natural way of starting to understand a dataset is to construct a **histogram**, which is a bar chart showing the counts of the different values in the dataset.

There will usually be many distinct values in your dataset, and you will need to decide how many **bins** to use in the histogram. The bins define the ranges of values captured in each bar in your chart. 

In [None]:
fig, ax = plt.subplots()
ax.hist(data, bins=14)
ax.set_title('Counts, 14 Bins')

In [None]:
fig, ax = plt.subplots()
ax.hist(data, bins=10)
ax.set_title('Counts, 10 Bins')

In [None]:
fig, ax = plt.subplots()
ax.hist(data, bins=5)
ax.set_title('Counts, 5 Bins')

In [None]:
fig, ax = plt.subplots()
ax.hist(data, bins=7)
ax.set_title('Counts, 7 Bins')
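To see what a choice of bins does numerically, `np.histogram` returns the bin edges and the count in each bin without drawing anything. The sketch below inspects the 7-bin case and, as a hypothetical alternative, uses half-integer edges so that each bar covers exactly one whole number of pairs.

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

# Seven equal-width bins between the min (1) and the max (8).
counts_7, edges_7 = np.histogram(data, bins=7)
print(edges_7)     # bin edges at 1, 2, ..., 8
print(counts_7)    # prints [1 1 2 2 1 1 3]

# Assumed alternative: edges at 0.5, 1.5, ..., 8.5, one bin per integer value.
integer_edges = np.arange(0.5, 9.5, 1.0)
counts_int, _ = np.histogram(data, bins=integer_edges)
print(counts_int)  # prints [1 1 2 2 1 1 0 3]
```

In the 7-bin version the last bin includes its right edge, so it covers both 7 and 8, and the gap at 7 is hidden. The half-integer edges make that gap visible.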

## Box and Whisker Plot

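A box-and-whisker plot can also be useful for visually summarizing your data. The box spans the IQR (25th to 75th percentile) with a line at the median. By default in matplotlib, the whiskers extend to the most extreme data points within 1.5 × IQR of the box, and any points beyond them are drawn individually as outliers. When there are no outliers, as in our data, the whiskers end at the min and max.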

In [None]:
fig, ax = plt.subplots()
ax.boxplot(data)
ax.set_title('Counts of Pairs of Shoes')
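The numbers behind the plot can be reproduced with NumPy. This sketch assumes matplotlib's default whisker rule (whiskers reach the most extreme data points within 1.5 × IQR of the box); the variable names are illustrative.

```python
import numpy as np

data = np.array([5, 6, 3, 4, 3, 4, 8, 8, 1, 8, 2])

# Box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences: the farthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences.
lower_whisker = data[data >= lower_fence].min()
upper_whisker = data[data <= upper_fence].max()

# Anything outside the fences would be drawn as an individual outlier point.
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Box: {q1} to {q3}, median {median}")
print(f"Whiskers: {lower_whisker} to {upper_whisker}")
print(f"Outliers: {outliers}")
```

For this dataset the fences fall at -3 and 13, so no point is an outlier and the whiskers coincide with the min (1) and max (8).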