# Identifying and Removing Outliers

To identify outliers in the data, we will use the [Tukey Method](http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/).

This means that we will flag points that lie more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile.
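
As a quick illustration of the rule, here is a minimal sketch on a made-up array (the values and variable names are hypothetical and not taken from the customer data). It computes the two Tukey fences and keeps the points that fall outside them.

```python
import numpy as np

# Hypothetical toy data: nine evenly spaced values plus one extreme point
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0])

# First and third quartiles, then the interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey fences: 1.5 * IQR below Q1 and above Q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
flagged = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, flagged)
```

Only the extreme point lies outside the fences here; the multiplier 1.5 is the conventional choice and can be raised to flag fewer points.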

In [25]:
import pandas as pd
import numpy as np

In [12]:
run src/load_data.py

In [13]:
whos DataFrame

Variable             Type         Data/Info
-------------------------------------------
customer_df          DataFrame         Fresh   Milk  Grocer<...>n\n[440 rows x 6 columns]
customer_final_df    DataFrame            Fresh      Milk  <...>n\n[435 rows x 6 columns]
customer_log_df      DataFrame             Fresh       Milk<...>n\n[440 rows x 6 columns]
customer_log_sc_df   DataFrame            Fresh      Milk  <...>n\n[440 rows x 6 columns]
customer_sc_df       DataFrame            Fresh      Milk  <...>n\n[440 rows x 6 columns]


In [14]:
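def display_outliers(dataframe, col, param=1.5):
    """Return the rows of `dataframe` whose `col` value lies outside the Tukey fences."""
    Q1 = np.percentile(dataframe[col], 25)
    Q3 = np.percentile(dataframe[col], 75)
    # Fence distance: `param` times the interquartile range (Q3 - Q1)
    tukey_window = param*(Q3-Q1)
    less_than_Q1 = dataframe[col] < Q1 - tukey_window
    greater_than_Q3 = dataframe[col] > Q3 + tukey_window
    # A row is flagged if it falls below the lower fence or above the upper fence
    tukey_mask = (less_than_Q1 | greater_than_Q3)
    return dataframe[tukey_mask]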

In [15]:
for col in customer_log_sc_df:
    print(col, display_outliers(customer_log_sc_df, col).shape)

Fresh (16, 6)
Milk (4, 6)
Grocery (2, 6)
Frozen (10, 6)
Detergents_Paper (2, 6)
Delicatessen (14, 6)


How many rows are flagged as an outlier in more than one feature?

In [16]:
from collections import Counter

In [17]:
raw_outliers = []
for col in customer_log_sc_df:
    outlier_df = display_outliers(customer_log_sc_df, col)
    raw_outliers += list(outlier_df.index)

In [18]:
outlier_count = Counter(raw_outliers)
outliers = [k for k,v in outlier_count.items() if v > 1]

In [19]:
len(outliers)

5

In [20]:
customer_log_sc_df.shape

(440, 6)

In [21]:
outliers

[65, 66, 128, 154, 75]
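
The `whos` listing above shows a `customer_final_df` with 435 rows, five fewer than the 440-row frames, which is consistent with dropping these five rows. The notebook does not show that step, so the following is only a sketch of one way it could be done, assuming `DataFrame.drop` on the index labels. The toy frame and its names are hypothetical stand-ins for `customer_log_sc_df` and `outliers`.

```python
import pandas as pd

# Hypothetical stand-in for customer_log_sc_df, with explicit index labels
toy_df = pd.DataFrame(
    {"Fresh": [0.1, -0.2, 3.5, 0.0], "Milk": [0.3, 0.1, -4.0, 0.2]},
    index=[64, 65, 66, 67],
)

# Hypothetical stand-in for the `outliers` list of index labels
toy_outliers = [66]

# Drop the flagged rows by index label; the original frame is left unchanged
toy_final_df = toy_df.drop(index=toy_outliers)
print(toy_final_df.shape)
```

Because `drop` returns a new frame, the unfiltered data remains available for comparison.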

In [22]:
pwd

'/home/jovyan/UCLA_CSX_450_2_2018_W/09-wholesale_customers-3'

In [27]:
customer_df.head()

   Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicatessen
0  12669  9656     7561     214              2674          1338
1   7057  9810     9568    1762              3293          1776
2   6353  8808     7684    2405              3516          7844
3  13265  1196     4221    6404               507          1788
4  22615  5410     7198    3915              1777          5185


In [28]:
customer_sc_df.head()

       Fresh       Milk    Grocery     Frozen  Detergents_Paper  Delicatessen
0   0.052933   0.523568  -0.041115  -0.589367         -0.043569     -0.066339
1  -0.391302   0.544458   0.170318  -0.270136          0.086407      0.089151
2  -0.447029   0.408538  -0.028157  -0.137536          0.133232      2.243293
3   0.100111   -0.62402  -0.392977   0.687144         -0.498588      0.093411
4   0.840239  -0.052396  -0.079356   0.173859         -0.231918      1.299347


In [30]:
customer_log_sc_df.head()

      Fresh       Milk    Grocery    Frozen  Detergents_Paper  Delicatessen
0  0.486184   0.976299   0.440155  -1.50925          0.644143      0.408966
1  0.087889   0.990956   0.652171  0.134052          0.766043      0.627926
2  0.016356   0.891151   0.454687  0.376899          0.804405      1.776833
3  0.517477  -0.957973  -0.084792  1.141574         -0.328712      0.633133
4  0.880631   0.439662   0.395847  0.757322          0.404939      1.456588


In [31]:
customer_df.keys()

Index(['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper',
       'Delicatessen'],
      dtype='object')