# 5. Outliers

### Introduction

There is no formal statistical definition of an outlier, but generally speaking, we think of an outlier as an abnormal observation that lies far from the other points. There has been a lot of research [dedicated to outlier detection](https://en.wikipedia.org/wiki/Outlier#Detection), but for our purposes we will rely on our natural human ability to notice slight deviations from a standard.

Box plots are great tools for visually detecting outliers. Seaborn (and most other plotting tools) defaults to labeling as an outlier any observation that lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.
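
To make the 1.5 times the IQR rule concrete, here is a minimal sketch that computes the two cutoffs by hand. The `values` Series is made-up toy data, not the diamonds dataset, and plotting libraries may compute quartiles slightly differently from `Series.quantile`.

```python
import pandas as pd

# Hypothetical toy data; 40 is deliberately far from the other values
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

# Anything beyond these cutoffs is labeled an outlier by the 1.5 * IQR rule
lower_cutoff = q1 - 1.5 * iqr
upper_cutoff = q3 + 1.5 * iqr

is_outlier = (values < lower_cutoff) | (values > upper_cutoff)
flagged = values[is_outlier].tolist()
print(flagged)  # [40]
```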

First, let's recreate our data.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

diamonds = pd.read_csv('../data/diamonds.csv')

new_order = ['cut', 'color', 'clarity','carat', 'price', 'x', 'y','z','depth', 'table']
diamonds = diamonds[new_order]

order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
diamonds['cut'] = pd.Categorical(diamonds['cut'], ordered=True, categories=order)

order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
diamonds['color'] = pd.Categorical(diamonds['color'], ordered=True, categories=order)

order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
diamonds['clarity'] = pd.Categorical(diamonds['clarity'], ordered=True, categories=order)

### Plot box plots for each column simultaneously
Pandas is well suited to plotting each column of a dataset independently. We will use it to make box plots of the numeric data. By default, only the numeric columns are plotted, so we don't have to drop the non-numeric columns before plotting. Setting `subplots` to `True` plots each column on its own Axes. Control the number of rows and columns of the grid with `layout`.

In [None]:
diamonds.plot(kind='box', subplots=True, figsize=(18, 10), layout=(2, 4));
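
If you want to check ahead of time which columns count as numeric, `select_dtypes` gives a quick preview. This is only a sketch on a made-up `toy` DataFrame; the selection pandas performs internally when plotting may differ in edge cases.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and two numeric columns
toy = pd.DataFrame({
    'cut': pd.Categorical(['Fair', 'Good', 'Ideal']),
    'carat': [0.3, 0.7, 1.1],
    'price': [400, 2500, 6000],
})

# Keep only the numeric columns and list their names
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
print(numeric_cols)  # ['carat', 'price']
```

The number of names in this list tells you how many cells the `layout` grid needs.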

### Handling outliers
During EDA, we are not necessarily interested in acting on an outlier. Instead, we can label it, investigate it further, and then decide what to do with it.

### Labeling the outliers
A simple procedure can be used to label outliers. Use the comparison operators to create a boolean Series for each variable. For instance, any depth less than 45 or greater than 75 will be labeled as an outlier.

In [None]:
x_out = diamonds['x'] < 3
y_out = diamonds['y'] > 20
carat_out = diamonds['carat'] > 4
depth_out = (diamonds['depth'] < 45) | (diamonds['depth'] > 75)
table_out = (diamonds['table'] < 40) | (diamonds['table'] > 90)
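
A two-sided threshold can also be written with `Series.between`, which some readers find easier to scan. The sketch below uses a made-up `depth` Series and assumes there are no missing values. With missing values the two forms differ, because a NaN is never "between" the bounds and would therefore be flagged after negation.

```python
import pandas as pd

# Hypothetical depth values; 43.0 and 79.0 fall outside the 45 to 75 range
depth = pd.Series([61.5, 43.0, 62.8, 79.0, 45.0])

# Same rule written two ways
by_comparison = (depth < 45) | (depth > 75)
by_between = ~depth.between(45, 75)  # between() includes both endpoints

print(by_comparison.tolist())  # [False, True, False, True, False]
print(by_comparison.equals(by_between))
```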

### Put outliers in their own DataFrame
Let's make an entirely new DataFrame to hold the outliers. We pass the DataFrame constructor a dictionary mapping the column name to the outlier Series.

In [None]:
d = {'x': x_out, 
     'y': y_out, 
     'carat': carat_out, 
     'depth': depth_out, 
     'table': table_out}

outliers = pd.DataFrame(d)
outliers.head()

### Use the outlier DataFrame to select rows with outliers
Let's select all the rows that have an outlier in the `x` column. Each column is a boolean Series, so you can pass it directly to the indexing operator to make the selection.

In [None]:
diamonds[outliers['x']]

### Operations on the outliers DataFrame
We can find the total number of outliers in each column.

In [None]:
outliers.sum()
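
Summing works because each `True` counts as 1 and each `False` as 0. By the same logic, `mean` gives the share of rows flagged in each column. A small sketch on a made-up `flags` DataFrame:

```python
import pandas as pd

# Hypothetical boolean flags for four observations
flags = pd.DataFrame({
    'x': [True, False, False, False],
    'y': [False, False, True, True],
})

counts = flags.sum()   # number of flagged rows per column
shares = flags.mean()  # fraction of flagged rows per column

print(counts.to_dict())  # {'x': 1, 'y': 2}
print(shares.to_dict())  # {'x': 0.25, 'y': 0.5}
```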

### Get all rows with an outlier
Use the `any` DataFrame method to determine if there are any True values in each row. This returns a boolean Series which can be used to select all rows in the original DataFrame that have an outlier.

In [None]:
any_outlier = outliers.any(axis=1)
any_outlier.head()

In [None]:
diamonds[any_outlier]
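
It can also help to know how many columns flag each row, since a row flagged in several columns at once is a stronger candidate for a data-entry error. The sketch below uses a made-up `flags` DataFrame in place of `outliers`.

```python
import pandas as pd

# Hypothetical flags for three observations
flags = pd.DataFrame({
    'x': [True, False, False],
    'depth': [True, False, True],
})

has_any = flags.any(axis=1)  # True if at least one column flags the row
n_flags = flags.sum(axis=1)  # how many columns flag the row

print(has_any.tolist())  # [True, False, True]
print(n_flags.tolist())  # [2, 0, 1]
```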

### Comments on outliers
* There are several rows with `x`, `y`, and `z` all equal to 0. These variables must be positive, so they can't possibly be correct.
* The two `y` values over 30 mm can't possibly be right, as one of them would be wider than the largest diamond ever found and the price is much too low.

### Calculated Depth
The data dictionary tells us that **`depth`** is equal to **`z / mean(x,y)`**. The recorded depth is expressed as a percentage, so the calculation multiplies the ratio by 100. Let's calculate the depth using this formula and compare it to the depth from the data.

In [None]:
diamonds['calculated_depth'] = diamonds['z'] / ((diamonds['x'] + diamonds['y']) / 2) * 100
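
It is worth seeing how this formula behaves on rows that contain zeros. The sketch below applies it to three made-up rows: a plausible diamond, one with only `z` equal to 0, and one with `x`, `y`, and `z` all equal to 0.

```python
import pandas as pd

# Hypothetical rows: a normal diamond, z == 0, and x == y == z == 0
toy = pd.DataFrame({
    'x': [3.95, 6.6, 0.0],
    'y': [3.98, 6.6, 0.0],
    'z': [2.43, 0.0, 0.0],
})

# Same formula as above: z divided by the mean of x and y, as a percentage
calc = toy['z'] / ((toy['x'] + toy['y']) / 2) * 100

print(calc.round(2).tolist())  # [61.29, 0.0, nan]
```

A `z` of 0 yields a calculated depth of 0, while a row of all zeros yields a missing value because 0 divided by 0 is undefined.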

In [None]:
diamonds.head()

In [None]:
diamonds['depth_diff'] = (diamonds['depth'] - diamonds['calculated_depth']).abs()

In [None]:
diamonds.sort_values('depth_diff', ascending=False).head(25)

In [None]:
(diamonds['depth_diff'] < 5).mean(), (diamonds['depth_diff'] > 5).sum()
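
One caveat when reading these two numbers: a missing `depth_diff` is neither less than nor greater than 5, so it is counted in neither group. A small sketch with a made-up `diff` Series shows the effect.

```python
import pandas as pd

# Hypothetical differences, including one missing value
diff = pd.Series([0.1, 7.0, float('nan'), 0.3])

share_small = (diff < 5).mean()  # NaN compares as False, so it lowers the share
n_large = (diff > 5).sum()       # NaN is not counted here either
n_missing = diff.isna().sum()    # count the missing values separately

print(share_small, n_large, n_missing)  # 0.5 1 1
```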

### Depth vs calculated depth
If this were a pristine dataset, the calculated depth would equal the recorded depth for every observation. About .1% (or 40) of the observations have an absolute depth difference greater than 5. What does this mean for those observations? There must be a measurement or input error in `x`, `y` or `z`. The table above sorts by largest absolute depth difference. A **`z`** of 0 is responsible for many of the largest depth differences.

The observations whose calculated depth disagrees with the recorded depth may need further investigation.

### Coverage is limited to outliers in a single dimension
It is possible for outliers to exist as a result of a combination of variables, but this discussion is limited to just outliers in a single dimension.

# Exercise
Complete these steps on your dataset.