Ref: 
https://machinelearningmastery.com/model-based-outlier-detection-and-removal-in-python/

**<h1> Automatic Outlier Detection Algorithms**

* The presence of outliers in a **classification or regression dataset** can result in a **poor fit** and **lower** predictive modelling performance.
* Identifying and **removing outliers** is challenging with simple statistical methods when a dataset has a large number of input variables.
* In this notebook you'll see how to use automatic outlier detection and removal to improve machine learning predictive modelling performance.
* **Automatic outlier detection models** provide an alternative to statistical techniques when there are many input variables with complex and unknown inter-relationships.
* We'll see how to apply outlier detection and removal to the training dataset only, to avoid data leakage.
* We'll also see how to evaluate and compare predictive modelling pipelines with outliers removed from the training dataset.

**<h2>Overview**

1. Outlier Detection and Removal
2. Dataset and Performance Baseline
   1. House Price Regression Dataset
   2. Baseline Model Performance
3. Automatic Outlier Detection
   1. Isolation Forest
   2. Minimum Covariance Determinant
   3. Local Outlier Factor
   4. One-Class SVM

**<h2>Outlier Detection and Removal**

* The most common type of outlier is an observation that is far from the rest of the observations or from the center of mass of the observations.
* This is easy to understand when we have one or two variables and can visualize the data as a histogram or scatter plot. With many input features, the task becomes more challenging.
* Simple statistical methods for identifying outliers use **standard deviations** or the **interquartile range**.
* Outliers can skew statistical measures and data distributions, providing a misleading representation of the underlying data and relationships.
* Removing outliers from training data before modeling can result in better fit of the data and in turn more skillful predictions.
* The outlier detection algorithms discussed below approach the definition of an outlier in slightly different ways.
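
The simple statistical rules mentioned above can be sketched for a single variable as follows. This is a minimal illustration, not part of the original notebook: the sample values, the helper names `iqr_outlier_mask` and `std_outlier_mask`, and the cutoffs (1.5 times the IQR, 2 standard deviations) are all assumptions chosen for the example.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def std_outlier_mask(values, n_std=3.0):
    """True where a value lies more than n_std standard deviations from the mean."""
    return np.abs(values - values.mean()) > n_std * values.std()

# Hypothetical one-dimensional sample with one obvious extreme value
sample = np.array([10.0, 11.0, 12.0, 12.5, 13.0, 13.5, 14.0, 15.0, 95.0])

iqr_mask = iqr_outlier_mask(sample)
# A 2-sigma cutoff is used here because the sample is tiny and the extreme
# value itself inflates the standard deviation
std_mask = std_outlier_mask(sample, n_std=2.0)

print('IQR rule flags:', sample[iqr_mask])
print('Std rule flags:', sample[std_mask])
```

Both rules look at one variable at a time, which is why they become hard to apply when outliers only show up in the combination of many features.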

**<h2>Dataset and Performance Baseline**

**<h3> House Price Regression Dataset**

* We'll use the **Boston house price regression** dataset for the rest of the notebook.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = pd.read_csv(url, header = None)

data = df.values

#splitting the dataset into input and output variables
X, y = data[:, :-1], data[:, -1]

print(X.shape, y.shape)

#Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 1)

#summarize the shape of the train and test sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(506, 13) (506,)
(339, 13) (167, 13) (339,) (167,)


**<h3>Baseline Model Performance**

* We are dealing with a **regression** dataset, meaning we will be predicting a numeric value.
* We'll fit a linear regression algorithm and evaluate predictions using the **mean absolute error (MAE)**.

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

#fitting the model
model = LinearRegression()
model.fit(X_train, y_train)

#evaluating the model
y_pred = model.predict(X_test)

#evaluating the predictions
mae = mean_absolute_error(y_test, y_pred)
print('MAE : %.3f'% mae)

MAE : 3.417
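
For reference, MAE is the average of the absolute differences between the true and predicted values. The short check below computes it by hand and compares it with `mean_absolute_error`; the four target and prediction values are made up for illustration and are not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true values and predictions
y_true_demo = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

# MAE by hand: mean of |true - predicted|
mae_manual = np.mean(np.abs(y_true_demo - y_pred_demo))

# Same quantity from scikit-learn
mae_sklearn = mean_absolute_error(y_true_demo, y_pred_demo)

print('manual: %.3f, sklearn: %.3f' % (mae_manual, mae_sklearn))
```

Because MAE is in the same units as the target, the baseline figure above can be read directly as the average size of the prediction error.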


**<h2>Automatic Outlier Detection**

* scikit-learn provides several built-in automatic methods for identifying outliers in data.
* For each method, the fitted model predicts which examples in the training dataset are outliers and which are not.
* The outliers are then removed from the training dataset, the regression model is fit on the remaining examples, and it is evaluated on the entire test dataset.
* We fit the outlier detection method on the training dataset only. Fitting it on the entire dataset (training and test rows together) would lead to **data leakage**.
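
The procedure described above is the same for every detector, so it can be sketched as one small helper. This is an illustrative sketch rather than code from the original notebook: the name `remove_training_outliers` and the synthetic data are assumptions, and the helper relies only on the `fit_predict` method, which labels outliers as -1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def remove_training_outliers(detector, X_train, y_train):
    """Fit the detector on training inputs only and drop rows labelled -1."""
    labels = detector.fit_predict(X_train)
    keep = labels != -1
    return X_train[keep], y_train[keep]

# Synthetic stand-in for a training set (assumption, not the housing data)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
X_demo[:5] += 10.0  # shift five rows far away from the rest
y_demo = X_demo.sum(axis=1)

X_clean, y_clean = remove_training_outliers(
    IsolationForest(contamination=0.05, random_state=0), X_demo, y_demo
)
print(X_demo.shape, X_clean.shape)
```

The test set never passes through the helper, so no information from it can influence which training rows are kept.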

**<h3> Isolation Forest**

* It is a tree-based anomaly detection algorithm.
* It isolates anomalies that are **few in number** and **different in the feature space**.
* The **IsolationForest class** in the scikit-learn library is used.
* **contamination** is perhaps the most important hyperparameter; it sets the expected proportion of outliers in the dataset.
* As a number it lies in the range **(0.0, 0.5]**. Older scikit-learn releases defaulted to **0.1**; recent releases default to **'auto'**, so we set **0.1** explicitly.

In [3]:
from sklearn.ensemble import IsolationForest

#identifying outliers in the training dataset
iso = IsolationForest(contamination = 0.1)
y_pred = iso.fit_predict(X_train)

#select all rows that are not outliers
mask = y_pred != -1

X_train_isf, y_train_isf = X_train[mask, :], y_train[mask]

#shape of updated training dataset
print(X_train_isf.shape, y_train_isf.shape)

#fit the model
model = LinearRegression()
model.fit(X_train_isf, y_train_isf)

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
print('MAE : %.3f' % mae)

(305, 13) (305,)
MAE : 3.224


* In this case, the model identified and removed 34 outliers and achieved a MAE of about **3.224**, which is lower than the baseline value of **3.417**.
* Results may differ given the stochastic nature of the learning algorithm, the evaluation procedure, and/or differences in precision across systems. Try running the example a few times.
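
One way to see how much the result moves between runs is to repeat the procedure with several seeds and summarize the scores. The sketch below does this on synthetic data from `make_regression`, which is an assumption made to keep the example self-contained; the seed range and dataset sizes are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, not the housing dataset)
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=1)

scores = []
for seed in range(5):
    # random_state makes each individual run repeatable
    iso = IsolationForest(contamination=0.1, random_state=seed)
    keep = iso.fit_predict(X_tr) != -1
    model = LinearRegression().fit(X_tr[keep], y_tr[keep])
    scores.append(mean_absolute_error(y_te, model.predict(X_te)))

print('MAE mean: %.3f, std: %.3f' % (np.mean(scores), np.std(scores)))
```

Fixing `random_state` to a single value gives a repeatable number, while the spread across seeds gives a sense of how much of a reported improvement could be noise.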

**<h3>Minimum Covariance Determinant**

* If the input features have a **Gaussian distribution**, then simple statistical methods can be used to detect outliers.
* For example, if the dataset has two input variables and both are Gaussian, then the feature space forms a multi-dimensional Gaussian.
* Knowledge of this distribution can be used to identify values that are far from it.
* This approach can be generalized by defining a **hypersphere (ellipsoid)** that covers the normal data; data that falls outside this shape is considered an **outlier**. An efficient implementation of this technique for multivariate data is known as the **Minimum Covariance Determinant (MCD)**.
* The scikit-learn library provides access to this method via the **EllipticEnvelope** class.
* This method also has a **contamination** hyperparameter that defines the expected proportion of outliers in the data.
* We'll set it to **0.01**, found with a little trial and error.

In [4]:
from sklearn.covariance import EllipticEnvelope

#identifying outliers in the training dataset
ee = EllipticEnvelope(contamination = 0.01)
y_pred = ee.fit_predict(X_train)

#select all rows that are not outliers
mask = y_pred!=-1
X_train_mcd, y_train_mcd = X_train[mask, :], y_train[mask]

#shape of updated training dataset
print(X_train_mcd.shape, y_train_mcd.shape)

#fit the model
model = LinearRegression()
model.fit(X_train_mcd, y_train_mcd)

#evaluate the model
y_pred = model.predict(X_test)

#evaluate predictions
mae = mean_absolute_error(y_test, y_pred)
print('MAE : %.3f'% mae)

(335, 13) (335,)
MAE : 3.388


* In this case, the model identified and removed 4 outliers and achieved a MAE of about **3.388**, which is lower than the baseline value of **3.417**.
* Results may differ given the stochastic nature of the learning algorithm, the evaluation procedure, and/or differences in precision across systems. Try running the example a few times.
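
The ellipsoid idea can be made concrete with the distances that the fitted model exposes. In this sketch, two-dimensional Gaussian data with two hand-placed distant points is an assumption made for illustration; `mahalanobis` returns squared Mahalanobis distances under the robust covariance estimate, and points with the largest distances fall outside the envelope.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic correlated Gaussian data plus two far-away points (assumption)
rng = np.random.RandomState(42)
X_gauss = rng.multivariate_normal(
    mean=[0.0, 0.0], cov=[[1.0, 0.6], [0.6, 1.0]], size=200
)
X_far = np.array([[6.0, -6.0], [7.0, 7.0]])
X_all = np.vstack([X_gauss, X_far])

ee_demo = EllipticEnvelope(contamination=0.01, random_state=0)
labels_mcd = ee_demo.fit_predict(X_all)

# Squared Mahalanobis distance of every row from the robust center
d2 = ee_demo.mahalanobis(X_all)

print('rows flagged as outliers:', np.where(labels_mcd == -1)[0])
print('largest squared distances:', np.sort(d2)[-3:])
```

The **contamination** value effectively chooses the distance cutoff, which is why a small value such as 0.01 removes only the most extreme rows.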

**<h3>Local Outlier Factor**

* A simple approach to identify outliers is to locate points that are far from the other examples in the feature space.
* This method works well for feature spaces with low dimensionality, but becomes less reliable as the number of features is increased -- referred to as the **curse of dimensionality**.
* The local outlier factor (LOF) is a technique that harnesses the idea of **nearest neighbors** for outlier detection.
* Each example is assigned a score of **how isolated** it is, or **how likely it is to be an outlier**, based on the size of its local neighborhood.
* Examples with the **largest scores** are more likely to be outliers.
* The scikit-learn library provides an implementation of this approach in the **LocalOutlierFactor** class.
* The **contamination** hyperparameter is the expected proportion of outliers in the dataset. Older scikit-learn releases defaulted to 0.1; recent releases default to 'auto', so the number of rows removed may vary with the installed version.

In [5]:
from sklearn.neighbors import LocalOutlierFactor


#identify outliers in the training dataset
lof = LocalOutlierFactor()
y_pred = lof.fit_predict(X_train)

#select all rows that are not outliers
mask = y_pred != -1
X_train_lof, y_train_lof = X_train[mask, :], y_train[mask]

print(X_train_lof.shape, y_train_lof.shape)

#fit the model
model = LinearRegression()
model.fit(X_train_lof, y_train_lof)

#evaluate model
y_pred = model.predict(X_test)

#evaluate predictions
mae = mean_absolute_error(y_test, y_pred)
print('MAE : %.3f' % mae)

(305, 13) (305,)
MAE : 3.356
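
The per-example scoring described above can be inspected directly. In this sketch the clustered data and the single distant point are assumptions made for illustration; `negative_outlier_factor_` holds the negated LOF of each training row, so flipping the sign gives a score where larger means more isolated.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# One tight cluster plus a single isolated point (assumption)
rng = np.random.RandomState(7)
X_cluster = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
X_isolated = np.array([[8.0, 8.0]])
X_lof_demo = np.vstack([X_cluster, X_isolated])

lof_demo = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels_lof = lof_demo.fit_predict(X_lof_demo)

# Larger score means more isolated relative to the local neighborhood
lof_scores = -lof_demo.negative_outlier_factor_
most_isolated = int(np.argmax(lof_scores))

print('index of most isolated row:', most_isolated)
print('its label:', labels_lof[most_isolated])
```

Scores close to 1 indicate a point whose local density matches that of its neighbors, and **contamination** decides how far above 1 a score must be before the row is labelled -1.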


**<h3>One-Class SVM**

* The SVM algorithm was developed initially for binary classification and can also be used for **one-class** classification.
* When modelling one class, the algorithm captures the **density of the majority class** and classifies points on the extremes of the density function as outliers.
* Although SVM is a classification algorithm and **One-class SVM** is also a classification algorithm, it can be used to identify outliers in input data for both **regression** and **classification** datasets.
* **OneClassSVM** is the class provided by scikit-learn for one-class SVM.
* The main hyperparameter is **nu**, which specifies the approximate ratio of outliers in the dataset; in scikit-learn it defaults to 0.5.
* In this case we set it to 0.01, found with a little trial and error.

In [6]:
from sklearn.svm import OneClassSVM

#identify outliers in the training dataset
ocsvm = OneClassSVM(nu = 0.01)
y_pred = ocsvm.fit_predict(X_train)

mask = y_pred != -1

X_train_ocsvm, y_train_ocsvm = X_train[mask, :], y_train[mask]

print(X_train_ocsvm.shape, y_train_ocsvm.shape)

#fit the model
model = LinearRegression()
model.fit(X_train_ocsvm, y_train_ocsvm)

#evaluate the model
y_pred = model.predict(X_test)

#evaluate predictions
mae = mean_absolute_error(y_test, y_pred)
print('MAE : %.3f' %mae)

(336, 13) (336,)
MAE : 3.431


* In this case, only three outliers were identified and removed, and the model achieved a MAE of about 3.431, which is no better than the baseline of 3.417. Perhaps better performance can be achieved with more tuning.
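
A minimal sketch of what such tuning might look like follows. It is not from the original notebook: the synthetic data, the candidate `nu` values, and the split sizes are assumptions. The key point is that candidates are compared on a validation split carved out of the training data, so the test set stays untouched.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the training portion of a dataset (assumption)
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
X_fit, X_val, y_fit, y_val = train_test_split(X_syn, y_syn, test_size=0.3, random_state=1)

results = {}
for nu in [0.01, 0.05, 0.1, 0.2]:
    # Detect and drop outliers using only the rows the model will be fit on
    keep = OneClassSVM(nu=nu).fit_predict(X_fit) != -1
    model = LinearRegression().fit(X_fit[keep], y_fit[keep])
    results[nu] = mean_absolute_error(y_val, model.predict(X_val))

best_nu = min(results, key=results.get)
for nu, mae in results.items():
    print('nu=%.2f MAE: %.3f' % (nu, mae))
print('best nu:', best_nu)
```

The same loop works for the **contamination** hyperparameter of the other three detectors, since they all expose `fit_predict`.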