# Exploratory Data Analysis
### Foundations of Machine Learning

## EDA is Powerful
- It's the early 2010s, and Anne Case and Angus Deaton are doing research on mortality at Princeton
- They get the CDC WONDER data and analogous international sources, and plot mortality. They see this:



<img
  src="./src/all_cause_mortality.png"
  style="width:45%; max-width:none;"
  alt="All cause mortality, 2000-2015."
/>


## EDA
- Case and Deaton -- having determined that this is a real phenomenon and not a mistake -- dig into the mortality data by cause:

<img
  src="./src/mortality_by_cause.png"
  style="width:50%; max-width:none;"
  alt="Mortality by cause, 2000-2015."
/>


## EDA
- What is this? What have they discovered?
- This is one of the most important trends in the last 25 years, and it was discovered because some people were playing with data and making very simple visualizations --- this is kind of shocking, perhaps even disturbing
- The second plot is very clever: For example, Lung Cancer and Diabetes are included as counterpoints to the other causes, to show that some causes are declining or flat, and they are dashed instead of solid to set off that distinction
- You cannot overstate the value of a well-constructed visualization

## Exploratory Data Analysis
- Today we look at some tools for exploratory data analysis, or EDA: How to visualize one or two variables, and basic statistics
- Our goal is for you to be able to analyze one or more variables, and come to some basic conclusions about where the variation in the data is
- Some good datasets for today are:
    1. `nhanes_data_17_18.csv`
    2. `airbnb_NYC.csv`
    3. `craigslist_cville_Cars_long.csv`
- We'll plot in both Pandas and Seaborn

```python
import numpy as np # Numerical analysis
import matplotlib.pyplot as plt # Basic plotting
import seaborn as sns # Advanced plotting
import pandas as pd # Dataframes and basic statistics
```

## Outline
1. Analyzing a single variable
2. Analyzing two variables

# Analyzing a Single Variable

## The Histogram
- The classic way of visualizing the relative frequency with which a variable takes particular values is the **histogram**:
    1. Group similar observations into $B$ distinct **bins**:
        - For a categorical variable, use the original class labels, or consolidate them until you have at most $B$ bins
        - For a numerical variable, find the maximum and minimum, and set the bin size equal to $\Delta = (x_{max}-x_{min})/B$. Then the $k$-th bin is $[x_{min} + (k-1) \Delta, x_{min} + k \Delta)$ for $k = 1, \dots, B$, with the last bin closed on the right so that it includes $x_{max}$.
    2. Make a bar graph where the height of the $k$-th bin is proportional to the number of observations taking that range of values

| Method | Usage |
| :---: | :---:|
| `df[var].plot.hist()` | Pandas histogram |
| `sns.histplot(df[var])` | Seaborn histogram |
| `sns.histplot(data=df,x=var)` | Seaborn histogram |
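
The binning rule for a numerical variable can be checked by hand. The sketch below uses a small made-up sample (an assumption, not one of the course datasets), applies the equal-width formula directly, and compares the counts with `np.histogram`, which uses the same kind of equal-width bins when given an integer number of bins.

```python
import numpy as np

# Hypothetical sample; in class this would be a column such as df[var].
x = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 7.0, 9.0, 10.0])

B = 3                                  # number of bins
x_min, x_max = x.min(), x.max()
delta = (x_max - x_min) / B            # bin width

# Bin edges: x_min, x_min + delta, ..., x_max
edges = x_min + delta * np.arange(B + 1)

# Zero-based bin index for each observation.
# The maximum would land in bin B, so it is placed in the last bin instead.
k = np.minimum(((x - x_min) // delta).astype(int), B - 1)
counts = np.bincount(k, minlength=B)

# NumPy's histogram with an integer bin count for comparison.
np_counts, np_edges = np.histogram(x, bins=B)

print("edges: ", edges)
print("manual:", counts)
print("numpy: ", np_counts)
```

The bar heights in a histogram plot are these counts, or a rescaled version of them.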

## Long Tails
- Many variables have a distinct pattern: A huge number of small values clustered in a handful of bars, and then many bars with almost no observations in them
- This is a **right-skewed variable** or a variable with a **heavy tail**
- Neither our eyes nor computers like these

| Method | Usage |
| :---: | :---:|
| `df[var_ln] = np.log( df[var] )` | Natural logarithm, if all values are strictly positive |
| `df[var_ihs] = np.arcsinh( df[var] )` | Inverse hyperbolic sine, if there are zero (or negative) values |


- Let's make a plot of log and inverse hyperbolic sine, to see how they work, then look at some examples in the data
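
One way to draw that comparison is sketched below. The grid of values and the extra reference curve $\log(2x)$ are choices made here for illustration. The point is that $\text{arcsinh}(x) = \log(x + \sqrt{x^2+1})$ is defined at zero and tracks $\log(2x)$ closely once $x$ is large.

```python
import numpy as np
import matplotlib.pyplot as plt

# Strictly positive grid so that the logarithm is defined everywhere on it.
x = np.linspace(0.01, 100, 500)

ln_x = np.log(x)          # natural logarithm
ihs_x = np.arcsinh(x)     # inverse hyperbolic sine: log(x + sqrt(x^2 + 1))

fig, ax = plt.subplots()
ax.plot(x, ln_x, label="log(x)")
ax.plot(x, ihs_x, label="arcsinh(x)")
ax.plot(x, np.log(2 * x), linestyle="--", label="log(2x)")
ax.set_xlabel("x")
ax.set_ylabel("transformed value")
ax.legend()

# Unlike the logarithm, arcsinh is finite at zero.
print(np.arcsinh(0.0))
```

Both transformations compress large values, which is what pulls a long right tail back toward the bulk of the data.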

## Statistics

- The histogram is great
- The histogram is complicated
- To summarize variables, we often use **statistics** to capture information about the histogram in a handful of numbers

## Statistics
- A **sample** is a set of values drawn for some variable. 
- We usually denote a **sample of size $N$** as a list/vector, $X = [ x_1, x_2, ..., x_N]$
- The **sample mean** is a measure of central tendency of a variable $X$:
$$
\bar{x} = \frac{x_1 + x_2 + ... + x_N}{N} = \frac{1}{N} \sum_{i=1}^N x_i
$$
- The **sample variance** is a measure of variation of a variable $X$:
$$
\bar{s}^2 = \frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + ... + (x_N-\bar{x})^2}{N} =  \dfrac{1}{N} \sum_{i=1}^N (x_i - \bar{x})^2
$$
- The **sample standard deviation** is the square root of the sample variance, so it measures variation in the same units as $X$:
$$
\bar{s} = \sqrt{ \dfrac{1}{N} \sum_{i=1}^N (x_i - \bar{x})^2 }
$$

## Statistics

- Of course, we can use Pandas to compute these:

| Method | Usage |
| :---: | :---:|
| `df[var].describe()` | Standard summary |
| `df[var].mean()` | Mean |
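| `df[var].var()` | Variance (divides by $N-1$ by default; pass `ddof=0` to divide by $N$) |
| `df[var].std()` | Standard Deviation (same default; pass `ddof=0` to divide by $N$) |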
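
As a quick check, the formulas can be computed by hand and compared with the Pandas methods. The sample below is made up for illustration. Note that Pandas divides by $N-1$ by default, so `ddof=0` is needed to reproduce the divide-by-$N$ definitions given earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of size N = 8.
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
N = len(s)

# Formulas written out directly, dividing by N.
mean = s.sum() / N
var_N = ((s - mean) ** 2).sum() / N
sd_N = np.sqrt(var_N)

print("by hand:      ", mean, var_N, sd_N)
print("pandas ddof=0:", s.mean(), s.var(ddof=0), s.std(ddof=0))
print("pandas default (divides by N-1):", s.var(), s.std())
```

For large $N$ the two conventions give nearly the same number, but in small samples the gap is visible.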

## Kernel Density Plots
- The classic histogram has some flaws; namely, the number of bins is not easy to select, and the apparent results can vary depending on the number of bins
- The alternative is to fix a particular $x_0$, weight the points around it by distance, and average; this is called a **kernel density estimator (KDE)**

| Method | Usage |
| :---: | :---:|
| `df[var].plot.kde()` | Pandas KDE plot |
| `sns.kdeplot(df[var])` | Seaborn KDE plot |
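
The "weight by distance and average" idea can be written out in a few lines. This is a minimal sketch that assumes a Gaussian kernel and a hand-picked bandwidth `h`; both are assumptions made here, and the Pandas and Seaborn functions choose a bandwidth automatically. The data are simulated.

```python
import numpy as np

def gaussian_kde_at(x0, sample, h):
    """Estimate the density at x0 by averaging Gaussian weights of width h."""
    u = (x0 - sample) / h                          # scaled distance to each point
    weights = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return weights.mean() / h

rng = np.random.default_rng(0)
sample = rng.normal(size=200)      # hypothetical data
h = 0.4                            # bandwidth, chosen by hand for this sketch

grid = np.linspace(-6, 6, 601)
density = np.array([gaussian_kde_at(x0, sample, h) for x0 in grid])

# A density estimate should integrate to roughly one.
area = density.sum() * (grid[1] - grid[0])
print(area)
```

A larger `h` gives a smoother curve and a smaller `h` a bumpier one, which plays the same role as the number of bins in a histogram.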

## Always, always, always, look at your data

<img
  src="./src/lawyerSalaries2018.jpg"
  style="width:70%; max-width:none;"
  alt="Histogram of entry-level lawyer salaries."
/>

- How *useful* is it to say, "The average yearly salary of a lawyer is about $100k?"
- Statistics can be incredibly misleading

# Analyzing Multiple Variables

## Conditioning Categorically

- We often want to understand how one variable $Y$ behaves as another variable $X$ is varied
- If $Y$ is numerical and $X$ is categorical, there is a very nice way to do this, very easily: Plot a different KDE for each category
- This is an extremely powerful way to make expressive visualizations

| Method | Usage |
| :---: | :---:|
| `sns.kdeplot(data=df, x=var, hue=cat, common_norm=False)` | Conditional KDE Plot |

## Conditioning Categorically
- We can do the same kind of thing as above with statistics instead of graphs
- Let `var` be the variable of interest, and `cat` a categorical variable

| Method | Usage |
| :---: | :---:|
| `df.loc[:,[var,cat] ].groupby(cat).mean()` | Groupby calculation |
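
Here is a minimal sketch of the groupby calculation on a tiny made-up table (the column names and values are assumptions). The second call shows one way to get several statistics per group at once.

```python
import pandas as pd

# Hypothetical data: one numerical variable and one categorical variable.
df = pd.DataFrame({
    "price": [100.0, 120.0, 80.0, 200.0, 220.0],
    "room_type": ["private", "private", "private", "entire", "entire"],
})
var, cat = "price", "room_type"

# Mean of var within each category of cat.
group_means = df.loc[:, [var, cat]].groupby(cat).mean()

# Several statistics per category in one table.
group_summary = df.groupby(cat)[var].agg(["count", "mean", "std"])

print(group_means)
print(group_summary)
```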

## Scatter Plot
- With the hued KDE, we condition a numerical variable on a categorical one
- If we condition a numerical $Y$ on a numerical $X$, we get: A scatterplot
- Precisely, we have a set of points $(x_i, y_i)$, and plot them on the same graph
- This gives us a sense of how $Y$ varies with $X$

| Method | Usage |
| :---: | :---:|
| `sns.scatterplot(x=df[var_x], y=df[var_y], alpha=.1)` | Scatter plot |
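
The sketch below draws a scatter plot from simulated data. The variables `age` and `price` and the relationship between them are invented for illustration. With a couple of thousand points, a small `alpha` makes dense regions show up as darker areas instead of a solid blob.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
n = 2000

# Hypothetical data: price declines with age, plus noise.
df = pd.DataFrame({"age": rng.uniform(0, 20, n)})
df["price"] = 30000 * np.exp(-0.1 * df["age"]) + rng.normal(0, 2000, n)

fig, ax = plt.subplots()
sns.scatterplot(data=df, x="age", y="price", alpha=0.1, ax=ax)
```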

## Covariance
- What is the linear association between two variables?
- We do something like the variance, but interacting the two variables together:
$$
\hat{c}_{XY} = \frac{(x_1-\bar{x})(y_1-\bar{y})+ (x_2-\bar{x})(y_2-\bar{y})+...+(x_N-\bar{x})(y_N-\bar{y})}{N}
$$
- This only captures **linear association**: We can find examples of non-linear association where the "positive" and "negative" parts cancel out to zero, suggesting no association at all (e.g. $y = x^2 + \varepsilon$, with $x$ between $-1$ and $+1$)
- Let `var_list = [var_1, var_2, ...]` be a list of variables for which we want covariances

| Method | Usage |
| :---: | :---:|
| `df.loc[:,var_list].cov(numeric_only=True)` | Covariance Matrix |
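
The quadratic example can be simulated directly. In this sketch the grid, the noise level, and the column names are assumptions. One column depends on $x$ linearly and another through $x^2$, and only the linear one shows up in the covariance with $x$.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

x = np.linspace(-1, 1, 201)              # symmetric around zero
eps = rng.normal(0, 0.05, x.size)        # small noise

df = pd.DataFrame({
    "x": x,
    "y_line": 2 * x + eps,               # linear in x
    "y_quad": x**2 + eps,                # strongly related to x, but not linearly
})

var_list = ["x", "y_line", "y_quad"]
# Pandas divides by N-1 here, which barely matters at this sample size.
cov = df.loc[:, var_list].cov()
print(cov.round(3))
```

The entry for `x` and `y_quad` is close to zero even though `y_quad` is almost entirely determined by `x`, which is why a scatter plot should accompany any covariance.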

## Correlation
- Covariance is useful, but it is hard to interpret across pairs because the units vary
- To normalize it, we often look at the correlation instead:
$$
r_{XY}= \dfrac{\hat{c}_{XY}}{\bar{s}_X \, \bar{s}_Y}
$$
- This number must be between -1 and 1, and it gives a normalized, unit-free summary of how strongly $Y$ and $X$ are linearly related

| Method | Usage |
| :---: | :---:|
| `df.loc[:,var_list].corr(numeric_only=True)` | Correlation Matrix |
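
To see the normalization at work, the sketch below computes the correlation from the covariance and standard deviations, compares it with Pandas, and then changes the units of one variable. The heights and weights are made-up numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [52.0, 58.0, 69.0, 77.0, 90.0],
})
x, y = df["height_cm"], df["weight_kg"]

# Covariance and standard deviations, all dividing by N.
c_xy = ((x - x.mean()) * (y - y.mean())).mean()
s_x, s_y = x.std(ddof=0), y.std(ddof=0)
r_manual = c_xy / (s_x * s_y)

# Pandas gives the same value: the choice of N or N-1 cancels in the ratio.
r_pandas = x.corr(y)

# Converting height to inches rescales the covariance but not the correlation.
x_in = x / 2.54
r_inches = x_in.corr(y)

print(r_manual, r_pandas, r_inches)
print(x.cov(y), x_in.cov(y))
```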

## Contingency Table
- We've covered numerical by categorical and numerical by numerical
- What about categorical by categorical?
- We can make a table with the labels for each of the variables along the side and top, and tabulate the number of co-occurrences in the table
- This is more interpretable if we normalize, so we can see the proportions of counts in each of the cases

| Method | Usage |
| :---: | :---:|
| `pd.crosstab(df[var_1], df[var_2])` | Raw contingency table |
| `pd.crosstab(df[var_1], df[var_2], dropna=True, normalize=True)` | Crosstabulate and normalize (cells are shares of the total) |
| `pd.crosstab(df[var_1], df[var_2], margins=True, dropna=True, normalize=True)` | Add margins (row and column totals) |
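
A small made-up example (the columns `fuel` and `transmission` are assumptions) shows the difference between raw counts, shares of the total, and shares within each row. The last of these uses `normalize="index"`, which is often the easiest table to read when the row categories have very different sizes.

```python
import pandas as pd

# Hypothetical data: two categorical variables.
df = pd.DataFrame({
    "fuel": ["gas", "gas", "gas", "diesel", "diesel", "electric", "gas", "electric"],
    "transmission": ["auto", "manual", "auto", "manual", "auto", "auto", "auto", "auto"],
})

# Raw counts of each combination.
counts = pd.crosstab(df["fuel"], df["transmission"])

# Shares of the whole table: all cells sum to one.
shares = pd.crosstab(df["fuel"], df["transmission"], normalize=True)

# Shares within each row: every row sums to one.
row_shares = pd.crosstab(df["fuel"], df["transmission"], normalize="index")

print(counts)
print(shares)
print(row_shares)
```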
