<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/7_AutomaticOutlierDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

There are three notebooks in the Outliers section:<br>
1. This notebook
2. [InterquartileRange](https://colab.research.google.com/github/cagBRT/Data/blob/main/6_InterquartileRange.ipynb)
3. [AutomaticOutlierDetection](https://colab.research.google.com/github/cagBRT/Data/blob/main/7_AutomaticOutlierDetection.ipynb)

The seaborn pairplot takes a while to run. <br>
Do the following: 
1.   Set Runtime type to GPU
2.   Run All cells 



In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/Data.git cloned-repo
%cd cloned-repo

In [None]:
!pip install seaborn

In [None]:
%matplotlib inline
import pandas as pd
from pandas import read_csv
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor

A simple approach to identifying outliers is to locate those examples that are far from the other examples in the multi-dimensional feature space. This can work well for feature spaces with low dimensionality (few features), although it becomes less reliable as the number of features increases (the so-called curse of dimensionality).
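
As a rough illustration of why distance-based detection degrades with more features, the sketch below compares the nearest and farthest distances from one query point in 2 and in 500 dimensions. It uses uniformly random synthetic points only; the helper name `relative_contrast` and the sizes chosen are assumptions made for this illustration and are not part of the notebook's dataset.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """Return (farthest - nearest) / nearest distance from one random
    query point to n_points random points in the unit hypercube."""
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so the run is repeatable
contrast_2d = relative_contrast(200, 2, rng)
contrast_500d = relative_contrast(200, 500, rng)

# A large value means "near" and "far" are clearly different;
# a small value means all points look roughly equally distant.
print('2 dimensions:   %.2f' % contrast_2d)
print('500 dimensions: %.2f' % contrast_500d)
```

When the contrast is small, almost every point is about as far away as every other point, so "far from the other examples" stops being a useful signal.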

The local outlier factor, or LOF for short, is a technique that uses the idea of nearest neighbors for outlier detection. <br>
Each example is assigned a score that reflects how isolated it is, based on how sparse its local neighborhood is compared with the neighborhoods of its nearest neighbors. <br>
Examples with the largest scores are the most likely to be outliers. <br>
The scikit-learn library provides an implementation of this approach in the LocalOutlierFactor class.
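
To make the scoring idea concrete, here is a minimal pure-Python sketch of the LOF calculation on six hand-picked 2-D points. It is a simplified illustration, not scikit-learn's implementation: the function name `lof_scores` is hypothetical, ties in distance are broken by index, and duplicate points (zero distances) are not handled.

```python
import math


def lof_scores(points, k):
    """Simplified local outlier factor for a small list of points."""
    n = len(points)
    # For each point, the k nearest neighbours as (distance, index) pairs.
    neighbours = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbours.append(dists[:k])
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neighbours]
    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(p, o) = max(k_dist(o), dist(p, o)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbours divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neighbours[i]) / (k * lrd[i]) for i in range(n)
    ]


# Five points in a tight cluster plus one point far away from it.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (8.0, 8.0)]
scores = lof_scores(points, k=3)
for point, score in zip(points, scores):
    print(point, round(score, 2))
```

Scores near 1 mean a point is about as dense as its neighbors; the isolated point at (8.0, 8.0) receives a much larger score. Note that scikit-learn exposes the opposite sign convention through its `negative_outlier_factor_` attribute.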

The dataset used here is the Boston housing dataset, described at https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html


There are 14 attributes in each case of the dataset. They are:<br>
CRIM - per capita crime rate by town<br>
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.<br>
INDUS - proportion of non-retail business acres per town.<br>
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)<br>
NOX - nitric oxides concentration (parts per 10 million)<br>
RM - average number of rooms per dwelling<br>
AGE - proportion of owner-occupied units built prior to 1940<br>
DIS - weighted distances to five Boston employment centres<br>
RAD - index of accessibility to radial highways<br>
TAX - full-value property-tax rate per 10,000 dollars<br>
PTRATIO - pupil-teacher ratio by town<br>
B - 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town<br>
LSTAT - percentage of lower status of the population<br>
MEDV - Median value of owner-occupied homes in $1000's<br>

In [None]:
# load the dataset
df = pd.read_csv('Bostonhousing.csv', header=None)
# retrieve the array
data = df.values

In [None]:
df.head()

In [None]:
sns.scatterplot( x=df[0], y=df[1], data=df)
plt.show()

In [None]:
sns.pairplot(df, diag_kind="kde")
plt.show()

In [None]:
# split into input (all columns but the last) and output (last column) elements;
# row 0 is skipped by the 1: slice
X, y = data[1:, :-1], data[1:, -1]
# summarize the shape of the dataset
print(X.shape, y.shape)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) 
# summarize the shape of the train and test sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
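# pairplot expects a DataFrame, so wrap the NumPy array and make sure it is numeric
sns.pairplot(pd.DataFrame(X_test).astype(float), diag_kind="kde")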
plt.show()

Use the Linear Regression model 

In [None]:
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)

Test the model

In [None]:
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat) 
print('MAE: %.3f' % mae)

**Remove outliers from the training dataset** <br>
The theory is that the outliers are causing the linear regression model to learn a biased or skewed understanding of the problem. If the outliers are removed from the training set, this may result in a better-performing model. <br>
We can do this by defining the LocalOutlierFactor model and then using it to make a prediction on the training dataset, marking each row in the training dataset as normal (1) or an outlier (-1).<br>
In this case, the default parameters are used for the outlier detection model, although it is a good idea to tune the configuration to the specifics of your dataset.
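
As a hedged sketch of what tuning might look like, the example below varies two real `LocalOutlierFactor` parameters, `n_neighbors` and `contamination`, on a small synthetic array and counts how many rows are flagged. The data (`X_demo`) and the specific values tried are assumptions for illustration only; they are not recommendations for the housing data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Synthetic stand-in for a training set: 100 ordinary rows plus 5 distant rows.
inliers = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
distant = rng.uniform(low=8.0, high=10.0, size=(5, 2))
X_demo = np.vstack([inliers, distant])

flagged = {}
for n_neighbors, contamination in [(20, "auto"), (10, 0.05), (35, 0.10)]:
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    labels = lof.fit_predict(X_demo)  # 1 = normal, -1 = outlier
    flagged[(n_neighbors, contamination)] = int((labels == -1).sum())

for setting, count in flagged.items():
    print(setting, 'rows flagged:', count)
```

A numeric `contamination` fixes roughly what fraction of rows is flagged, while `n_neighbors` controls how local the density comparison is, so both are worth checking against what you know about your own data.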


In [None]:
# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)

**Remove the outliers from the training set**

In [None]:
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]

In [None]:
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Recall the training and test set shapes for the original dataset, before the outliers were removed:<br>
(339, 13) (167, 13) (339,) (167,)

Note the shape of the training set after the outliers have been removed. The test set is unchanged.

In [None]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)