# Distributions

## Objectives
- Relate the concepts of distributions and normality back to data we are familiar with.
- Discuss Z-scores and the ways we can calculate them.
- Discuss normality and how to assess it.

#### Loading Data

In [None]:
# Import pandas.
import pandas as pd

In [None]:
# Load data and check it out!


# Visualizing Distributions

In [None]:
# Imports
### 
# NOTE: Best practice for your notebooks is to perform all imports 
# in the same cell at the top of the notebook.
###
import seaborn as sns
import matplotlib.pyplot as plt
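# NOTE: Matplotlib 3.6 renamed the seaborn styles to 'seaborn-v0_8-*'.
# On older versions of Matplotlib, use 'seaborn-talk' instead.
plt.style.use('seaborn-v0_8-talk')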

## Normal Distribution
> *AKA: Gaussian / Bell Curve*

- Normal distributions are symmetric around the mean.
- The mean, median, and mode of a normal distribution are equal.
- The area under the bell curve is $1.0$.
- Normal distributions are dense in the center and less dense in the tails.
- Normal distributions are defined by two parameters: $\mu$ (mean) and $\sigma$ (standard deviation).
- Approximately $68\%$ of the area of a normal distribution is within one standard deviation of the mean.
    - $\pm1\sigma$
- Approximately $95\%$ of the area of a normal distribution is within two standard deviations of the mean.
    - $\pm2\sigma$
    
    
<img src='data/standard-normal-distribution-with-probabilities.png'>
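
The $68\%$ and $95\%$ figures above can be checked numerically. The following is a minimal sketch using only the standard library's `statistics.NormalDist`; the helper name `area_within` is made up for illustration, and `scipy.stats.norm.cdf` could be used in the same way.

```python
from statistics import NormalDist

# Standard normal distribution: mu = 0, sigma = 1.
standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k, dist=standard_normal):
    """Area under the curve within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


within_one = area_within(1)  # roughly 0.68
within_two = area_within(2)  # roughly 0.95

print(f"Within 1 sigma: {within_one:.4f}")
print(f"Within 2 sigma: {within_two:.4f}")
```

The same proportions hold for any $\mu$ and $\sigma$, for example `area_within(1, NormalDist(mu=100, sigma=15))`.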

### Probability Mass Function (PMF)

- The **Probability Mass Function** associates probabilities with **discrete** variables.
    - *Discrete:* a countable set of possible outcomes (for example, the faces of a die).

In [None]:
# PMF Demo
a = (
    pd.Series(
        ['a','a','a','a',
         'b','b','b','b',
         'c','c',
         'd',
         'e','e']
    )
    .value_counts(normalize=True)
    .sort_index()
)

fig, ax = plt.subplots()
ax.bar(a.index, a)
ax.set(title='PMF', xlabel='Discrete Variable', ylabel='Probability')
fig.tight_layout()

### Probability Density Function (PDF)
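- The **Probability Density Function** describes the relative likelihood of the values of a **continuous** variable. Probabilities come from the area under the curve over an interval, not from the height at a single point.
    - *Continuous:* any value within a range is possible, so the number of outcomes is infinite.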

<img src='data/pdf_demo.png'>
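
For a continuous variable, the height of the PDF is a density rather than a probability. The sketch below illustrates this with a deliberately narrow normal distribution: the density at the mean exceeds $1$, while the area over an interval is a valid probability. The distribution parameters and the helper `integrate_pdf` are assumptions chosen for illustration.

```python
from statistics import NormalDist

# A narrow normal distribution (hypothetical parameters).
narrow = NormalDist(mu=0, sigma=0.1)

# The density at a point can be larger than 1, so it is not a probability.
peak_density = narrow.pdf(0)


def integrate_pdf(dist, a, b, steps=10_000):
    """Approximate the area under dist.pdf between a and b (trapezoid rule)."""
    width = (b - a) / steps
    total = 0.5 * (dist.pdf(a) + dist.pdf(b))
    for i in range(1, steps):
        total += dist.pdf(a + i * width)
    return total * width


# Probability of landing within one standard deviation of the mean.
area = integrate_pdf(narrow, -0.1, 0.1)

# The same probability, computed from the CDF.
exact = narrow.cdf(0.1) - narrow.cdf(-0.1)

print(f"Density at the mean: {peak_density:.3f}")
print(f"Area by integration: {area:.4f}")
print(f"Area from the CDF:   {exact:.4f}")
```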

### Cumulative Distribution Function (CDF)
- The **Cumulative Distribution Function** gives $P(X \leq x)$: the probability that the variable takes a value less than or equal to $x$, for any $x$ in the sample space.

<img src='data/cdf.png'>
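
For observed data, $P(X \leq x)$ can be estimated as the share of values at or below $x$ (an *empirical* CDF). This is a minimal sketch; the `runtimes` list is invented for illustration and is not taken from the notebook's dataset.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) is the share of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return F


# Hypothetical runtimes in minutes.
runtimes = [88, 95, 95, 101, 110, 124, 131, 150]
F = ecdf(runtimes)

for x in (80, 95, 120, 150):
    print(f"P(X <= {x}) = {F(x):.3f}")
```

The resulting function starts at $0$, never decreases, and reaches $1$ at the largest observation, which is the shape shown in the CDF figure.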

## Visualizing our Dataset

In [None]:
# Domestic Gross - Histogram


In [None]:
# Domestic Gross - KDE


# Bonus: Add vertical lines depicting the mean and median. (`axvline`)


In [None]:
# Domestic Gross - Boxplot


# Skew / Kurtosis

In [None]:
from scipy import stats

***Skew***
<img src='data/skew.png' width=75%>

In [None]:
# Domestic Gross - Skew & Skew-Test


***Kurtosis***
<img src='data/kurtosis.png'>
> *https://analystprep.com/cfa-level-1-exam/quantitative-methods/kurtosis-and-skewness-types-of-distributions/*

In [None]:
# Domestic Gross - Kurtosis & Kurtosis-Test


# Z-Score
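- A *Z-Score* tells us how many standard deviations a point lies from the mean. Converting values to z-scores rescales a distribution to a mean of $0$ and a standard deviation of $1$; for normally distributed data, the result is the *Standard Normal Distribution*.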

$ \LARGE z = \frac{(x - \mu)}{\sigma} $
> - *$z$: z-score*
> - *$x$: observed value*
> - *$\mu$ (mu): population mean*
> - *$\sigma$ (sigma): population standard deviation*
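
The formula can be applied directly to a small sample. This is a minimal sketch with invented values (not the notebook's dataset), using the population standard deviation to match the definition of $\sigma$ above.

```python
from statistics import fmean, pstdev

# Hypothetical runtimes in minutes.
runtimes = [90, 100, 110, 120, 130]

mu = fmean(runtimes)      # population mean
sigma = pstdev(runtimes)  # population standard deviation (divides by N)

# z = (x - mu) / sigma for every observation.
z_scores = [(x - mu) / sigma for x in runtimes]

for x, z in zip(runtimes, z_scores):
    print(f"x = {x:>3}  z = {z:+.3f}")
```

After the transformation, the z-scores have a mean of $0$ and a standard deviation of $1$. Note that `pandas.Series.std()` defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` divides by $N$, so the two give slightly different z-scores unless `ddof=0` is passed.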

---

- *Z-Scores* are associated with cumulative probabilities (retrieved from a *z-table*).
<img src='data/z-table.png' width=500>

> <a href='https://www.chegg.com/homework-help/questions-and-answers/usage-z-table-1-assuming-normal-distribution-using-z-tables-find-z-score-corresponding--75-q19507214'>*Chegg.com - Usage of Z-Table*</a>
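
A z-table lookup can also be reproduced in code. The sketch below uses the standard library's `statistics.NormalDist`; the helper names are made up for illustration, and `scipy.stats.norm.cdf` and `scipy.stats.norm.ppf` are the SciPy counterparts.

```python
from statistics import NormalDist

# Standard normal distribution: mu = 0, sigma = 1.
standard_normal = NormalDist()


def z_to_probability(z):
    """Cumulative probability P(Z <= z), the value a z-table lists."""
    return standard_normal.cdf(z)


def probability_to_z(p):
    """Inverse lookup: the z-score whose cumulative probability is p."""
    return standard_normal.inv_cdf(p)


p = z_to_probability(1.96)   # about 0.975
z = probability_to_z(0.975)  # about 1.96

print(f"P(Z <= 1.96) = {p:.4f}")
print(f"z for a cumulative probability of 0.975 = {z:.2f}")
```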

### By Hand - Don't do this!

In [None]:
# Calculating z-score by hand for RUNTIME_MINUTES.


### Using `sklearn` and `StandardScaler`

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# Instantiate the Object.


In [None]:
# Fit the object.


In [None]:
# Check mean attribute.


In [None]:
# Check var attribute.


In [None]:
# Transform column.


In [None]:
# Revert column back to original values.


In [None]:
# Shorthand, using `fit_transform`.


In [None]:
# Plot PDF of original values.


In [None]:
# Plot PDF of scaled values.


In [None]:
# Create a plot with side-by-side axes showing the original and scaled distributions.


# Bonus: Add vertical line showing the mean of each sample.
