### Outlier Analysis
- Use Tukey's method to identify outliers for each feature.
- Identify each instance that is an outlier for more than one feature.
- Assess what percentage of the total data are outliers for:
    - one feature
    - two features
    - more than two features
- Come up with a plan for handling outliers.

In [1]:
import pandas as pd
import numpy as np

In [2]:
housing_log_sc_df = pd.read_pickle('final_log_sc.p')

**Use Tukey's method to identify outliers**

In [3]:
def display_outliers(dataframe, col, param=1.5):
    """Return the rows of `dataframe` whose `col` value lies outside the Tukey fences."""
    Q1 = np.percentile(dataframe[col], 25)
    Q3 = np.percentile(dataframe[col], 75)
    # Width of the fence beyond each quartile: param times the interquartile range
    tukey_window = param*(Q3-Q1)
    less_than_Q1 = dataframe[col] < Q1 - tukey_window
    greater_than_Q3 = dataframe[col] > Q3 + tukey_window
    tukey_mask = (less_than_Q1 | greater_than_Q3)
    return dataframe[tukey_mask]
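
As a quick sanity check of the fence rule, the sketch below applies it to a small made-up array. The `tukey_fences` helper and the toy values are illustrative assumptions and are not part of the housing data.

```python
import numpy as np

def tukey_fences(values, param=1.5):
    """Return the (lower, upper) Tukey fences for a 1-D array (hypothetical helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    window = param * (q3 - q1)  # param times the interquartile range
    return q1 - window, q3 + window

# Toy values, invented for illustration only
toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper = tukey_fences(toy)

# Keep only the values that fall outside the fences
toy_outliers = toy[(toy < lower) | (toy > upper)]
print(lower, upper, toy_outliers)
```

With the default `param=1.5`, only the value 100 falls outside the fences here. A larger `param` (3 is a common choice for "far out" points) widens the fences and flags fewer values.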

In [4]:
for col in housing_log_sc_df:
    print(col, display_outliers(housing_log_sc_df, col).shape)

CRIM (0, 11)
INDUS (2, 11)
NOX (0, 11)
RM (27, 11)
AGE (17, 11)
DIS (0, 11)
RAD (0, 11)
TAX (0, 11)
PTRATIO (16, 11)
B (78, 11)
LSTAT (1, 11)


In [5]:
from collections import Counter

In [6]:
raw_outliers = []
for col in housing_log_sc_df:
    outlier_df = display_outliers(housing_log_sc_df, col)
    raw_outliers += list(outlier_df.index)

In [7]:
outlier_count = Counter(raw_outliers)
outliers = [k for k,v in outlier_count.items() if v > 1]

In [8]:
len(outliers)

9

In total, 9 instances are outliers for more than one feature.

In [9]:
housing_log_sc_df.shape

(506, 11)
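
With 506 rows, the 9 instances flagged for more than one feature make up roughly 1.8% of the data. The breakdown by number of flagged features (one, two, more than two) could be computed along the lines of the sketch below. It assumes a vectorized variant of the same Tukey rule built on `DataFrame.quantile`. The helper name `outlier_share_by_feature_count` and the toy frame are hypothetical and stand in for `housing_log_sc_df`.

```python
import pandas as pd

def outlier_share_by_feature_count(dataframe, param=1.5):
    """Percentage of rows flagged as Tukey outliers in exactly one, exactly two,
    and more than two columns (hypothetical helper)."""
    q1 = dataframe.quantile(0.25)
    q3 = dataframe.quantile(0.75)
    window = param * (q3 - q1)
    # Boolean frame: True where a value lies outside that column's fences
    flags = (dataframe < q1 - window) | (dataframe > q3 + window)
    per_row = flags.sum(axis=1)  # number of flagged features per row
    n_rows = len(dataframe)
    return {
        'one feature': 100.0 * (per_row == 1).sum() / n_rows,
        'two features': 100.0 * (per_row == 2).sum() / n_rows,
        'more than two': 100.0 * (per_row > 2).sum() / n_rows,
    }

# Toy frame, invented for illustration only
toy_df = pd.DataFrame({
    'a': [-40, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    'b': [1, 2, 3, 4, 5, 6, 7, 8, -50, 100],
    'c': [1, 2, 3, 4, 5, 6, 7, 8, 90, 100],
})
shares = outlier_share_by_feature_count(toy_df)
print(shares)
```

Calling the same function on `housing_log_sc_df` would give the three percentages asked for in the task list, provided every column is numeric.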

To handle the outliers, we can either remove the affected rows from the analysis or replace the outlying values with the feature's mean or median.
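
Both options can be sketched on a single toy column, as below. The `tukey_mask` helper and the values are assumptions made for illustration. Replacing with the column median is one possible choice, not necessarily the one this analysis will settle on.

```python
import pandas as pd

def tukey_mask(series, param=1.5):
    """Boolean mask, True where a value lies outside the Tukey fences (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    window = param * (q3 - q1)
    return (series < q1 - window) | (series > q3 + window)

# Toy column, invented for illustration only
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
mask = tukey_mask(toy)

# Option 1: drop the flagged rows
dropped = toy[~mask]

# Option 2: replace flagged values with the column median
imputed = toy.mask(mask, toy.median())
print(len(dropped), imputed.iloc[-1])
```

The median is used here because it is less affected by the extreme value than the mean. Dropping rows is simpler but shrinks the dataset, which matters more when a single feature flags many rows.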