# Visualization
### Foundations of Machine Learning 

## Introduction
- The histogram and the kernel density plot are our most common ways of visualizing data, and they provide great intuition about the **relative** likelihood of values occurring: More height, more likely
- Choosing the number of bins or the bandwidth can feel arbitrary, and a poor choice can be misleading
- Today we focus on more **robust** tools that summarize variables without being as sensitive to extreme values

## Outline
1. ECDF and Quantiles
2. Outliers

# The ECDF and Quantiles

## The Empirical Cumulative Distribution Function (ECDF)
- The ECDF of a variable $X$ is built as follows:
    1. Fix a value; let's call it $x$
    2. Compute the proportion of observations whose value is at or below $x$: That's the value of the ECDF at $x$
    3. Do this for all values of $x$ that occur in your data
- This generates a graph that increases as you move from the left to the right, with a jump at every value in the dataset

| Method | Usage |
| :---: | :---:|
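| `df[var].plot.hist(cumulative=True, density=True, bins=100, histtype='step')` | Pandas cumulative histogram (a binned approximation to the ECDF) |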
| `sns.ecdfplot(df[var])` | Seaborn ECDF |
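
The three steps above can also be carried out by hand. This is a minimal sketch, assuming NumPy and a small made-up sample `x`; the helper name `ecdf` is ours, not a library function, and it uses the "at or below" convention for the proportion.

```python
import numpy as np

# Hypothetical sample; any 1-D numeric array works
x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

def ecdf(values):
    """Return the distinct sorted values and the share of observations at or below each."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]                     # ignore missing values
    grid, counts = np.unique(values, return_counts=True)   # sorted distinct values
    heights = np.cumsum(counts) / values.size              # cumulative proportion
    return grid, heights

grid, heights = ecdf(x)
for g, h in zip(grid, heights):
    print(f"ECDF at {g:g}: {h:.3f}")
```

The value 1 appears twice in this sample, so the graph jumps by 2/8 there and by 1/8 at every other value.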


## The Median
- The **median** is a measure of central tendency, like the mean
- It is the value $m$ for which (approximately) 50% of the population is above $m$ and 50% of the population is below $m$
- It is more **robust** than the mean, because if we adjust very high or very low values, it typically won't change the median at all, but would often impact the value of the mean
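
A small illustration of that robustness, assuming NumPy and made-up numbers: we push the largest value far out and compare how the mean and the median react.

```python
import numpy as np

incomes = np.array([30.0, 35.0, 40.0, 45.0, 50.0])   # hypothetical values
incomes_extreme = incomes.copy()
incomes_extreme[-1] = 5000.0                          # make the largest value extreme

mean_before, median_before = np.mean(incomes), np.median(incomes)
mean_after, median_after = np.mean(incomes_extreme), np.median(incomes_extreme)

print("mean:  ", mean_before, "->", mean_after)      # moves a lot
print("median:", median_before, "->", median_after)  # does not move
```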

## Quantiles
- The concept of a quantile generalizes the median
- Take the ECDF, and pick a number on the vertical axis, like .4
- Trace horizontally from that value to the graph
- Now trace down from the graph to the horizontal axis
- This value is the 40th percentile or .4 quantile: The value of $X$ for which 40%, or a proportion of .4, of the sample is at or below that value

| Method | Usage |
| :---: | :---:|
| `np.quantile(X,q)` | Gives the $q$ quantile of $X$ |
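
The "trace across, then trace down" recipe can be sketched directly. This assumes NumPy and a made-up sample; the variable names are ours. Note that `np.quantile` by default interpolates linearly between neighboring sorted values, so it can differ slightly from the value read off the ECDF.

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])  # hypothetical sample
q = 0.4

x_sorted = np.sort(x)
# ECDF height reached at each sorted observation: 1/n, 2/n, ..., 1
heights = np.arange(1, x_sorted.size + 1) / x_sorted.size

# Trace across at height q, then down: first sorted value whose height reaches q
idx = np.searchsorted(heights, q, side="left")
q_traced = x_sorted[idx]

# NumPy's default: linear interpolation between order statistics
q_numpy = np.quantile(x, q)

print(q_traced, q_numpy)
```

With only eight observations the two answers differ visibly; in larger samples the gap is usually negligible.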

## 5-Number Summary
- Since there are many quantiles to consider (one for each data point), we typically focus on a handful of key ones:
    - The 0-quantile, the sample minimum
    - The .25-quantile
    - The .5-quantile, the sample median
    - The .75-quantile
    - The 1-quantile, the sample maximum
- This is called a **five-number summary**

| Method | Usage |
| :---: | :---:|
| `df['var'].describe()` | Count, mean, standard deviation, and the five-number summary |
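
As a sketch, the five numbers can also be pulled out explicitly. This assumes pandas and a made-up column `var`; the labels `min`, `q25`, and so on are our own.

```python
import pandas as pd

df = pd.DataFrame({"var": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]})  # hypothetical data

# The five quantiles, requested in one call
five_num = df["var"].quantile([0, 0.25, 0.5, 0.75, 1])
five_num.index = ["min", "q25", "median", "q75", "max"]
print(five_num)

# describe() reports the same five values, plus count, mean, and std
summary = df["var"].describe()
print(summary[["min", "25%", "50%", "75%", "max"]])
```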

## Exercise
- Plot the ECDF and compute a 5 number summary for a numeric variable of interest in your data

# Outliers

## Extreme Behavior
- Our models are often sensitive to **extreme values** or **outliers** 
- Our models are **robust** if changes to the highest and lowest values do not significantly impact our estimates
- The mean is not robust, but the median is
- In many scenarios, identifying extreme values and handling them is an important part of the process

## Interquartile Range
- The **interquartile range** or IQR is the distance between the .25 and .75 quantiles: the .75 quantile minus the .25 quantile
- The interval between those two quantiles contains (approximately) the middle 50% of the observations
- If the IQR is relatively small, most of the data are in a tight window; if it is large, the data are relatively spread out
- Just as the median is a robust counterpart to the mean, the IQR is a robust counterpart to the variance or standard deviation: It quantifies how spread out the data are, but isn't sensitive to extreme values

| Method | Usage |
| :---: | :---:|
| `iqr = np.quantile(df['var'], .75) - np.quantile(df['var'], .25) ` | Compute IQR |
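
To see the robustness claim in numbers, here is a small sketch assuming NumPy and made-up data; the helper `iqr_of` is our own name. We replace the largest value with an extreme one and compare the IQR with the standard deviation.

```python
import numpy as np

x = np.array([10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 20.0])  # hypothetical values
x_extreme = x.copy()
x_extreme[-1] = 200.0                                      # make the largest value extreme

def iqr_of(values):
    """Distance between the .75 and .25 quantiles."""
    q25, q75 = np.quantile(values, [0.25, 0.75])
    return q75 - q25

print("IQR:", iqr_of(x), "->", iqr_of(x_extreme))   # unchanged
print("std:", np.std(x), "->", np.std(x_extreme))   # much larger
```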

## Boxplots
- The histogram/density plot visualizes relative frequency and the ECDF visualizes cumulative relative frequency
- The **boxplot** visualizes the five-number summary along with the extreme values

| Method | Usage |
| :---: | :---:|
| `df.boxplot(column='var')` or `df['var'].plot.box()` | Pandas boxplot |
| `sns.boxplot(x=df['var'])` | Seaborn boxplot |

## The Whiskers
- Boxplots typically have two lines that extend from the top and bottom of the box, and then a "scatterplot" of points outside those lines. The **whiskers** extend to the most extreme observations that are no farther out than:
    - The .75-quantile plus $1.5 \times IQR$
    - The .25-quantile minus $1.5 \times IQR$
- The points outside the whiskers are typically called **outliers**, but that word is squishy



```python
import numpy as np

# Compute the whisker limits, ignoring missing values:
q75 = np.nanquantile(df['var'], .75)
q25 = np.nanquantile(df['var'], .25)
iqr = q75 - q25
upper_whisker = q75 + 1.5 * iqr  # .75-quantile plus 1.5 IQRs
lower_whisker = q25 - 1.5 * iqr  # .25-quantile minus 1.5 IQRs
```
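
A worked example may help separate two things that are easy to conflate: the $1.5 \times IQR$ limits computed above, and where the drawn whiskers actually end. With the default settings in matplotlib and seaborn, each whisker stops at the most extreme observation inside its limit. The sketch below assumes NumPy and made-up data; the names `lower_fence` and `upper_fence` are ours.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])  # hypothetical data

q25, q75 = np.quantile(x, [0.25, 0.75])
iqr = q75 - q25
lower_fence = q25 - 1.5 * iqr   # limit below the box
upper_fence = q75 + 1.5 * iqr   # limit above the box

# Drawn whiskers end at the most extreme observations inside the limits
inside = x[(x >= lower_fence) & (x <= upper_fence)]
lower_whisker_end = inside.min()
upper_whisker_end = inside.max()

# Points beyond the limits are plotted individually
flagged = x[(x < lower_fence) | (x > upper_fence)]

print("limits:      ", lower_fence, upper_fence)
print("whisker ends:", lower_whisker_end, upper_whisker_end)
print("flagged:     ", flagged)
```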

## Outliers
- The outliers are part of the data
    - Hurricanes, earthquakes, stock market crashes, heart attacks are all extreme events --- and we care about them, very much
    - But some observations might not be representative of the population of interest, like a Maserati or Ferrari in a dataset of used cars
- We have three common strategies for handling outliers:
    1. Drop the outliers
    2. Winsorize: Replace the values outside the whiskers with the value of the nearest whisker
    3. Transform: Take a logarithm or inverse hyperbolic sine to "squash" the data into a smaller interval
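
A minimal sketch of the transform strategy, assuming NumPy and made-up data with a long right tail. The logarithm requires strictly positive values, while the inverse hyperbolic sine is also defined at zero and for negative values.

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])  # hypothetical long right tail

x_log = np.log(x)          # natural log: only valid for x > 0
x_arcsinh = np.arcsinh(x)  # inverse hyperbolic sine: valid for any real x

print("raw range:    ", x.max() - x.min())
print("log range:    ", x_log.max() - x_log.min())
print("arcsinh range:", x_arcsinh.max() - x_arcsinh.min())

# arcsinh handles zero and negative values, unlike the log
print(np.arcsinh(0.0), np.arcsinh(-5.0))
```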

## Handling Outliers
- Suppose the variable is `var`
- We can create an outlier dummy with:
```python
# Outlier dummy: 1 if the value lies outside the whiskers, 0 otherwise
df['var_is_outlier'] = ((df['var'] < lower_whisker) |
                        (df['var'] > upper_whisker)).astype(int)
```

- And to winsorize:
```python
# Winsorize: values outside the whiskers are set to the nearest whisker
df['var_winsorize'] = df['var'].clip(lower=lower_whisker, upper=upper_whisker)
```
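
For the remaining strategy, dropping, one possible sketch is below. It assumes pandas and NumPy, a made-up column `var`, and the same whisker limits as before; `df_dropped` is our own name. Note that `between` returns `False` for missing values, so rows with a missing `var` are dropped too.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"var": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0]})  # hypothetical data

# Whisker limits, computed as before
q25 = np.nanquantile(df["var"], 0.25)
q75 = np.nanquantile(df["var"], 0.75)
iqr = q75 - q25
lower_whisker = q25 - 1.5 * iqr
upper_whisker = q75 + 1.5 * iqr

# Keep only rows whose value lies within the limits (inclusive on both ends)
keep = df["var"].between(lower_whisker, upper_whisker)
df_dropped = df.loc[keep].copy()

print(len(df), "->", len(df_dropped), "rows")
```

Working on a copy leaves the original `df` intact, which makes it easy to compare results with and without the flagged rows.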

## Handling Outliers
- If there are only a handful of extreme values and they are not representative of the population of interest, dropping them is often appropriate
- Taking a log or arcsinh transformation works best when the variable just has a very long tail
- If there's a large number of extreme values, but they're not **that** far from the whiskers, then winsorizing is a good compromise

## Violin Plots
- The violin plot smashes the boxplot and the kernel density plot together
- This is nice if you want to see everything all at once

| Method | Usage |
| :---: | :---:|
| `sns.violinplot( x=df['var'], inner='quart',fill=False) ` | Seaborn violin plot |


## Exercise
- Make a plot of a numeric variable of interest in your data
- Detect outliers
- Winsorize, drop, or transform?