# Outliers

In this section, we'll look at techniques for detecting and treating outliers, using our simulated customer lifetime value (CLV) data. We'll cover:

**Outlier Detection**
- Box Plots
- Violin Plots
- Z-Scores and Percentiles
- Isolation Forests

**Outlier Treatment**
- Removal
- Winsorize

## Import Libraries

First, we'll import the relevant libraries. We'll use `pandas` and `numpy` for data manipulation, `matplotlib` and `seaborn` for plotting, and `scipy.stats` for computing z-scores.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats

## Load Data

Next, we'll load our customer lifetime value dataset. The dataset has about six columns; `purchases` is the one we care about for our customer lifetime value problem.

In [None]:
df = pd.read_csv("/kaggle/input/clv-data/clv_data.csv")

# Outlier Detection

First, we'll dive into different methods for detecting outliers.

## Box Plot

The first plot we'll use is a boxplot. A boxplot displays the distribution of the data using the median, the lower and upper quartiles, and whiskers that mark the range of typical values. By convention, the whiskers extend to the most extreme points that lie within 1.5 times the interquartile range (IQR) of the quartiles. An outlier is a data point that falls outside the whiskers. In this plot, any point drawn beyond a whisker would be considered an outlier:

In [None]:
sns.boxplot(x=df['purchases'])

In [None]:
def extract_outliers_from_boxplot(array):
    """Return the values that fall outside the 1.5 * IQR whiskers."""
    # First and third quartiles
    iqr_q1 = np.quantile(array, 0.25)
    iqr_q3 = np.quantile(array, 0.75)

    # Interquartile range
    iqr = iqr_q3 - iqr_q1

    # Upper and lower whisker limits
    upper_bound = iqr_q3 + (1.5 * iqr)
    lower_bound = iqr_q1 - (1.5 * iqr)

    outliers = array[(array > upper_bound) | (array < lower_bound)]
    print('Outliers outside the whiskers: {}'.format(outliers))
    return outliers

# Assuming df is your DataFrame and 'purchases' is a column in it
outliers = extract_outliers_from_boxplot(df['purchases'])
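
As a quick sanity check of the whisker rule, here is a minimal sketch on a small made-up array. The values are invented for illustration and are not taken from the CLV dataset; the variable names are likewise only for this example.

```python
import numpy as np

# Toy data, invented for illustration: eight ordinary values and one extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = np.quantile(values, 0.25)   # lower quartile -> 3.0
q3 = np.quantile(values, 0.75)   # upper quartile -> 7.0
iqr = q3 - q1                    # interquartile range -> 4.0

lower_fence = q1 - 1.5 * iqr     # -3.0
upper_fence = q3 + 1.5 * iqr     # 13.0

# Only values beyond the fences count as outliers
toy_outliers = values[(values < lower_fence) | (values > upper_fence)]
print(toy_outliers)              # [100]
```

Only the value 100 lies beyond the upper fence of 13, so it is the single point a boxplot would draw past the whisker.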

## Violin Plot

An alternative to a boxplot is a violin plot. A violin plot conveys the same kind of summary as a boxplot and adds a density estimate of the data. This lets you see where the values are concentrated across their full range:

In [None]:
plt.violinplot(df['purchases'], showmedians=True)

## Z-Scores

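Z-scores offer a similar approach to boxplots. A z-score measures how many standard deviations a point lies from the mean, and we flag a point as an outlier when its z-score falls beyond a cutoff we choose. A closely related approach sets the cutoffs directly as percentiles of the data. The core difference from the boxplot is that we specify the cutoffs ourselves: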

In [None]:
purchases = df['purchases']

def percentile_outliers(array,
                        lower_bound_perc,
                        upper_bound_perc):
    
    # Percentile cutoffs computed from the array that was passed in
    upper_bound = np.percentile(array, upper_bound_perc)
    lower_bound = np.percentile(array, lower_bound_perc)
    
    outliers = array[(array <= lower_bound) | (array >= upper_bound)]
    
    return outliers

def z_score_outliers(array,
                     z_score_lower,
                     z_score_upper):

    # Standardize the values, then flag those beyond the requested cutoffs
    z_scores = scipy.stats.zscore(array)
    outliers = (z_scores > z_score_upper) | (z_scores < z_score_lower)
    
    return array[outliers]

In [None]:
outliers = percentile_outliers(df['purchases'],
               upper_bound_perc = 99,
               lower_bound_perc = 1)

In [None]:
z_score_outliers(df['purchases'],
                     z_score_lower = -1.96,
                     z_score_upper = 1.96)
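
To make the z-score calculation concrete, the following is a minimal sketch that computes it by hand on a small made-up array. The data and variable names are invented for illustration. It uses the population standard deviation (`ddof=0`), which is also the default in `scipy.stats.zscore`.

```python
import numpy as np

# Toy data, invented for illustration: nine similar values and one large one.
values = np.array([10, 12, 11, 13, 12, 11, 12, 10, 13, 50], dtype=float)

mean = values.mean()
std = values.std()            # population standard deviation (ddof=0)

# z-score: how many standard deviations each value lies from the mean
z = (values - mean) / std

# Flag values whose z-score is beyond +/- 1.96
flagged = values[np.abs(z) > 1.96]
print(flagged)                # [50.]
```

Keep in mind that for normally distributed data, about 5% of points fall beyond ±1.96 standard deviations, so this cutoff will flag some points even when nothing unusual is going on. A wider cutoff such as ±3 is a common, more conservative choice.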

## Isolation Forests

The next approach is an algorithm-based method called Isolation Forest. An isolation forest is an ensemble of random trees: each tree repeatedly picks a random feature and a random split value to partition the data. Outliers tend to be isolated after only a few splits, so points that are separated quickly across the trees are flagged as anomalies:

In [None]:
from sklearn.ensemble import IsolationForest

features = ['age','income','days_on_platform','purchases']

## We'll do a simple drop null for now
df = df.dropna()

## Create a training-test set (first 4,000 rows for training, the rest for testing)
X = df[features]
X_train = X[:4000]
X_test = X[4000:]

## Fit Model (fixed seed so the results are reproducible)
clf = IsolationForest(n_estimators=50, max_samples=100, random_state=42)
clf.fit(X_train)

## Get Scores for every row (lower scores are more anomalous)
df['scores'] = clf.decision_function(X)
df['anomaly'] = clf.predict(X)

## Get Anomalies (predict returns -1 for outliers and 1 for inliers)
outliers = df.loc[df['anomaly'] == -1]

outliers
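
To see the behaviour in isolation, here is a minimal sketch on synthetic data rather than the CLV dataset. The data, the variable names, and the `contamination=0.01` setting (the share of points we expect to be outliers) are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data, invented for illustration: 500 "normal" points around the
# origin plus 5 points placed far away from that cluster.
normal_points = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
far_points = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0],
                       [-8.5, -9.0], [9.5, 0.0]])
X_demo = np.vstack([normal_points, far_points])

# contamination is our assumed fraction of outliers; it sets the score cutoff
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = model.fit_predict(X_demo)         # 1 = inlier, -1 = outlier
scores = model.decision_function(X_demo)   # lower = more anomalous

print("Number flagged:", (labels == -1).sum())
print("Mean score, normal points:", scores[:500].mean())
print("Mean score, far points:", scores[-5:].mean())
```

The far-away points should receive noticeably lower scores than the bulk of the data, because they can be separated from everything else in very few random splits.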

# Outlier Treatment

Now that we have some techniques for detecting outliers, let's look into different ways to treat outliers. 

## Removal

The first method is simply removing our outliers. A typical approach is to choose a z-score or percentile cutoff, then remove any point that falls above the upper threshold or below the lower one. We've written a few functions you can use:

In [None]:
def z_score_removal(df, column, lower_z_score, upper_z_score):
    
    col_df = df[column]

    # z-scores of the selected column
    z_scores = scipy.stats.zscore(col_df)
    outliers = (z_scores > upper_z_score) | (z_scores < lower_z_score)
    return df[~outliers]

def percentile_removal(df, column, lower_bound_perc, upper_bound_perc):
    
    col_df = df[column]
    
    upper_bound = np.percentile(col_df, upper_bound_perc)
    lower_bound = np.percentile(col_df, lower_bound_perc)

    # Compare the raw values against the percentile bounds
    outliers = (col_df > upper_bound) | (col_df < lower_bound)
    return df[~outliers]

filtered_df = z_score_removal(df, 'purchases', -1.96, 1.96)
percentile_removal(df, 'purchases', lower_bound_perc = 1, upper_bound_perc = 99)

## Winsorize

Dropping outliers is the crudest approach. If you feel those rows are valuable, you can winsorize the outliers instead, also known as "capping". Rather than keeping the outlier value, we replace any value above the upper threshold with that threshold, and any value below the lower threshold with the lower threshold. Here, we've written a function for you:

In [None]:
def winsorize(df, column, upper, lower):
    # Work on a copy so the caller's DataFrame is left unchanged
    df = df.copy()
    
    # Percentile thresholds (on a 0-100 scale) used as caps
    perc_upper = np.percentile(df[column], upper)
    perc_lower = np.percentile(df[column], lower)
    
    # Cap values above the upper threshold and below the lower threshold
    df[column] = df[column].clip(lower=perc_lower, upper=perc_upper)
    
    return df

In [None]:
winsorize(df, 'purchases', upper=97.5, lower=2.5)
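
SciPy also ships a ready-made winsorizing function, `scipy.stats.mstats.winsorize`. The sketch below applies it to a small made-up array (invented for illustration). Note that its `limits` argument is the fraction of values to cap on each side, not a percentile value, and that it is imported under an alias here so it does not clash with the `winsorize` function defined above.

```python
import numpy as np
from scipy.stats.mstats import winsorize as scipy_winsorize

# Toy data, invented for illustration: one unusually large value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Cap the lowest 10% and the highest 10% of the values (one value on each side).
capped = scipy_winsorize(values, limits=(0.1, 0.1))

# The result is a masked array; convert it to a plain NumPy array.
capped = np.asarray(capped)
print(capped)   # [2 2 3 4 5 6 7 8 9 9]
```

The smallest value is raised to the next smallest (2) and the largest is lowered to the next largest (9), while every row is kept.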

## Additional Outlier Detection Techniques

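Beyond these methods, there are several other techniques:

- Mahalanobis Distance: A distance metric that measures how far a point lies from the center of the data while accounting for the correlations between features, which helps us detect multivariate outliers.
- Robust Mahalanobis Distance: Builds on the original by estimating the mean and covariance from only the subset of data points whose covariance matrix has the smallest possible determinant, so that the outliers themselves do not distort the estimate.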
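
As a rough illustration of the (non-robust) Mahalanobis distance, here is a minimal sketch on synthetic two-feature data. The data, variable names, and the 0.999 cutoff are assumptions made for this example. The threshold relies on the fact that, for normally distributed data, the squared distance approximately follows a chi-squared distribution with as many degrees of freedom as there are features; for two features its quantile has the closed form `-2 * ln(1 - p)`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, strongly correlated 2-D data, invented for illustration.
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X_demo = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=300)

# Add one point that is only moderately unusual on each axis on its own,
# but goes against the positive correlation between the two features.
X_demo = np.vstack([X_demo, [2.0, -2.0]])

center = X_demo.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X_demo, rowvar=False))
diff = X_demo - center

# Squared Mahalanobis distance for every row: (x - mu)^T S^{-1} (x - mu)
d_squared = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Chi-squared quantile for 2 degrees of freedom at p = 0.999
threshold = -2 * np.log(1 - 0.999)

flagged_rows = np.where(d_squared > threshold)[0]
print("Threshold:", threshold)
print("Flagged row indices:", flagged_rows)
```

A per-feature z-score would not single out the added point, since each coordinate is only two standard deviations from the mean; the Mahalanobis distance flags it because the combination of values is unusual.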

There are a number of additional algorithm-based techniques:

- DBSCAN Cluster Outlier Detection
- K-Means Cluster Outlier Detection
- Hierarchical Clustering Detection 
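
As one example of the clustering-based idea, the following is a minimal sketch of DBSCAN outlier detection with scikit-learn on synthetic data. The data, variable names, and the `eps` and `min_samples` values are assumptions chosen for illustration. DBSCAN assigns the label `-1` to points that do not belong to any dense region, and those points can be treated as outliers.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic data, invented for illustration: two tight clusters
# plus two points that sit far from both.
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2))
isolated = np.array([[2.5, 2.5], [10.0, 0.0]])
X_demo = np.vstack([cluster_a, cluster_b, isolated])

# eps: neighbourhood radius; min_samples: neighbours needed for a dense region
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo)

# Points labelled -1 are "noise", i.e. candidate outliers
noise_mask = labels == -1
print("Clusters found:", len(set(labels.tolist()) - {-1}))
print("Noise points:")
print(X_demo[noise_mask])
```

Because `eps` is a distance, features on very different scales (such as age and income) would usually need to be standardized first for the result to be meaningful.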

There are also tree-based algorithms that are relatively robust to outliers in the input features, so you don't need to worry as much if you're using these models:

- Random Forest
- Gradient Boosted Trees

We will add these methods in future iterations of the course. 

## Conclusion

In conclusion, we've gone over techniques for both detecting and treating outliers. To review, we went over the following methods for detecting outliers:

- Box Plots
- Violin Plots
- Z-score method
- Percentile Method
- Isolation Forests

To treat outliers, we went over: 

- Z-score Removal
- Percentile Removal
- Winsorizing