# Identifying and Removing Outliers

To identify outliers in the data, we will use [the Tukey Method](http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/).

This means we flag any point that lies more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile.
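As a quick illustration of the rule, here is a minimal sketch on a small toy vector. The values in `x` and the names `lower_fence`, `upper_fence`, and `toy_outliers` are made up for this example and do not come from the wholesale customers data. It relies on R's default quantile algorithm (type 7).

```r
# Toy data (hypothetical values, not from the wholesale customers dataset)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Quartiles using R's default quantile algorithm (type 7)
q1 <- unname(quantile(x, 0.25))  # 3.25
q3 <- unname(quantile(x, 0.75))  # 7.75
iqr <- q3 - q1                   # 4.5

# Tukey fences: 1.5 * IQR beyond each quartile
lower_fence <- q1 - 1.5 * iqr    # -3.5
upper_fence <- q3 + 1.5 * iqr    # 14.5

# Points outside the fences are flagged as outliers
toy_outliers <- x[x < lower_fence | x > upper_fence]
print(toy_outliers)              # [1] 100
```

Only the value 100 falls outside the interval from -3.5 to 14.5, so it is the single point flagged in this toy example.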

In [None]:
customer_df = read.csv('Wholesale_customers_data.csv')
customer_df$Channel <- NULL
customer_df$Region <- NULL
dim(customer_df)

In [None]:
customer_log_df = log(customer_df)
customer_log_sc_df = data.frame(scale(customer_log_df))

In [None]:
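# Return the rows of `dataframe` whose `feature` value lies outside the Tukey fences.
# `param` is the IQR multiplier (1.5 by default).
display_outliers <- function (dataframe, feature, param=1.5) {
    feature_vec <- as.vector(dataframe[[feature]])
    # First and third quartiles of the feature
    Q1 <- quantile(feature_vec, .25)
    Q3 <- quantile(feature_vec, .75)
    # Distance beyond each quartile at which a point counts as an outlier
    tukey_window <- param*(Q3-Q1)
    less_than_Q1 <- feature_vec < Q1 - tukey_window
    greater_than_Q3 <- feature_vec > Q3 + tukey_window
    tukey_mask <- (less_than_Q1 | greater_than_Q3)
    return(dataframe[tukey_mask,])
}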

In [None]:
display_outliers(customer_log_sc_df, 'Grocery')

In [None]:
display_outliers(customer_log_sc_df, 'Milk')

In [None]:
for (feature in colnames(customer_log_sc_df)){
    outlier_count = nrow(display_outliers(customer_log_sc_df, feature))
    print(paste(feature, outlier_count))
}

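Some rows may be flagged as outliers for more than one feature. To find them, we collect the row names of the outliers for every feature and count how often each row name appears.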

In [None]:
raw_outliers = c()
for (feature in colnames(customer_log_sc_df)){
    outlier_df = display_outliers(customer_log_sc_df, feature)
    outlier_indices = rownames(outlier_df)
    raw_outliers = c(raw_outliers, outlier_indices)
}
raw_outliers

In [None]:
table(raw_outliers)
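One possible way to act on these counts is to keep only the row names flagged more than once and drop those rows. The sketch below uses hypothetical stand-ins (`toy_raw_outliers`, `toy_df`, `repeated`, `cleaned`) rather than the real output above, and the threshold of "more than once" is an assumption, not a rule from the Tukey Method itself.

```r
# Hypothetical outlier row names, as if collected across several features
toy_raw_outliers <- c("3", "7", "3", "12", "7", "3")

# Count how many times each row name was flagged
counts <- table(toy_raw_outliers)

# Row names flagged for more than one feature (assumed removal threshold)
repeated <- names(counts[counts > 1])
print(repeated)

# Toy data frame with default row names "1" to "15"
toy_df <- data.frame(Grocery = 1:15, Milk = 15:1)

# Drop the repeatedly flagged rows; drop = FALSE keeps a data frame
cleaned <- toy_df[!(rownames(toy_df) %in% repeated), , drop = FALSE]
print(dim(cleaned))
```

Matching on row names rather than positions keeps the removal consistent with how `rownames(outlier_df)` identifies rows in the loop above.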

In [None]:
dim(customer_log_sc_df)