# Data Preprocessing: Outliers and Missing Values

In [15]:
from __future__ import print_function  # __future__ imports must come first in a module

from sklearn import datasets
from sklearn.neighbors import LocalOutlierFactor
from pylab import *
import seaborn as sns
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, fixed, interact_manual
%matplotlib notebook
from kemlglearn.preprocessing import Discretizer

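## Outlier Detection

We will use the Local Outlier Factor (LOF) method to mark a percentage of the examples of the Iris dataset as outliers.
This method computes an outlierness factor for each example and marks as outliers the given percentage of examples with the most extreme values. scikit-learn stores the negated factor in `negative_outlier_factor_`, so the most negative values correspond to the most abnormal examples.

You can play with the number of neighbors used to compute the LOF.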


In [16]:
NEIGHBORS=3
OUTLIERS=0.1

iris = datasets.load_iris()
lof = LocalOutlierFactor(n_neighbors=NEIGHBORS, contamination=OUTLIERS)    
labels = lof.fit_predict(iris['data'])
fig = plt.figure(figsize=(16,6))
plt.subplot(1,2,1)
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=labels,s=100)
plt.title('Outliers/Inliers')
plt.subplot(1,2,2)
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=lof.negative_outlier_factor_,s=100)
plt.title('LOF');
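
To see how the neighborhood size and the contamination rate interact without plotting, the following minimal sketch refits LOF for a few values of `n_neighbors` and counts the flagged examples. The candidate values `(3, 10, 20)` and the `counts` dictionary are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import LocalOutlierFactor

X = datasets.load_iris()['data']

counts = {}
for k in (3, 10, 20):  # candidate neighborhood sizes (arbitrary choices)
    lof = LocalOutlierFactor(n_neighbors=k, contamination=0.1)
    labels = lof.fit_predict(X)             # -1 = outlier, 1 = inlier
    scores = lof.negative_outlier_factor_   # lower = more abnormal
    # Every flagged example scores below the fitted threshold offset_
    assert np.all(scores[labels == -1] < lof.offset_)
    counts[k] = int((labels == -1).sum())
    print(k, counts[k], round(float(lof.offset_), 3))
```

With a contamination of 0.1 and 150 examples, at most 15 examples are flagged for any neighborhood size; what changes with `n_neighbors` is which examples fall below the threshold.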


## Missing Value Imputation

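We obtain a corrupted copy of the iris dataset by adding some missing values: up to 75 missing values distributed over the four dimensions. The positions are drawn at random with replacement, so repeated positions can make the actual count smaller than 75.

The plot shows the original data, with the examples that are going to be corrupted marked in yellow.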

In [17]:
from sklearn.preprocessing import Imputer
from numpy.random import randint
iris = datasets.load_iris()
dimX, dimY = iris['data'].shape
lrandX = randint(dimX, size=75)
lrandY = randint(dimY, size=75)
lcols = [['r','g','b'][i]  for i in iris['target']]
for i in lrandX:
    lcols[i] = 'y'
fig = plt.figure(figsize=(8,8))
plt.scatter(iris['data'][:, 2], iris['data'][:, 1], c=lcols,s=100);
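
Because the row and column indices are sampled independently with replacement, the same cell can be chosen more than once. The sketch below counts the distinct positions for one draw; the fixed seed and the names `rows`, `cols`, `n_cells` and `n_rows` are assumptions made for repeatability and illustration.

```python
import numpy as np

rng = np.random.RandomState(0)       # fixed seed, an assumption for repeatability
rows = rng.randint(150, size=75)     # row index of each corrupted cell
cols = rng.randint(4, size=75)       # column index of each corrupted cell

cells = set(zip(rows.tolist(), cols.tolist()))
n_cells = len(cells)                 # distinct (row, column) positions
n_rows = len(set(rows.tolist()))     # distinct examples that would be painted yellow
print(n_cells, n_rows)
```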


This is a kernel density estimate of the distribution of the values of the first dimension (column 0) for the original data (no missing values).

In [18]:
orig = datasets.load_iris()['data']

vshow=0

fig = plt.figure(figsize=(10,10))
sns.distplot(orig[:,vshow], hist=False, rug=True, color="g", kde_kws={"shade": True})
plt.show()
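
As a rough illustration of what the curve represents, the same kind of estimate can be computed numerically. This sketch uses `scipy.stats.gaussian_kde` with its default bandwidth, which is an assumption: seaborn may choose a different bandwidth, so the curve need not match the plot exactly.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn import datasets

x = datasets.load_iris()['data'][:, 0]    # first dimension (column 0)
kde = gaussian_kde(x)                      # Gaussian kernels, default (Scott) bandwidth
grid = np.linspace(x.min() - 2.0, x.max() + 2.0, 500)
density = kde(grid)                        # estimated density at each grid point
step = grid[1] - grid[0]
area = float(density.sum() * step)         # Riemann sum, should be close to 1
print(round(area, 3))
```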


Now we corrupt the data and apply a missing value imputation algorithm to complete it. In this case we replace the missing values with the mean of the attribute.

In [19]:
for x,y in zip(lrandX,lrandY):
    iris['data'][x,y]=float('NaN')
imp = Imputer(missing_values='NaN', strategy='mean')
imp_iris1 = imp.fit_transform(iris['data'])
fig = plt.figure(figsize=(8,8))
plt.scatter(imp_iris1[:, 2], imp_iris1[:, 1], c=lcols,s=100);


As we can see, the examples with a missing value in either of the two plotted dimensions (columns 1 and 2) appear aligned on the mean of that attribute. The next plot shows that the distribution of the first dimension (column 0) has also changed: its variance has been reduced.

In [20]:
vshow=0

fig = plt.figure(figsize=(10,10))
sns.distplot(orig[:,vshow], hist=False, rug=True, color="r", kde_kws={"shade": True})
sns.distplot(imp_iris1[:,vshow], hist=False, rug=True, color="g", kde_kws={"shade": True});
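
The variance reduction can also be checked numerically. The sketch below is written against recent scikit-learn releases, where `Imputer` is no longer available and `sklearn.impute.SimpleImputer` with `missing_values=np.nan` plays its role. The fixed seed and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.impute import SimpleImputer

rng = np.random.RandomState(0)             # fixed seed, an assumption
data = datasets.load_iris()['data'].copy()
rows = rng.randint(data.shape[0], size=75)
cols = rng.randint(data.shape[1], size=75)
data[rows, cols] = np.nan                  # same corruption scheme as above

imputed = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(data)

var_observed = np.nanvar(data, axis=0)     # variance of the surviving values
var_imputed = imputed.var(axis=0)          # variance after filling with the mean
print(var_observed.round(3), var_imputed.round(3))
```

Filling with the column mean adds points with zero deviation from the mean, so the variance of each column can only stay equal or shrink relative to the observed values.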


Now we use the most frequent value of the attribute to impute the missing values.

In [21]:
imp = Imputer(missing_values='NaN', strategy='most_frequent')
imp_iris2 = imp.fit_transform(iris['data'])
fig = plt.figure(figsize=(8,8))
plt.scatter(imp_iris2[:, 2], imp_iris2[:, 1], c=lcols,s=100);


As expected, the imputed examples now appear aligned on the most frequent value, and the variance is also reduced.

In [22]:
vshow=0

fig = plt.figure(figsize=(10,10))
sns.distplot(orig[:,vshow], hist=False, rug=True, color="r", kde_kws={"shade": True})
sns.distplot(imp_iris2[:,vshow], hist=False, rug=True, color="g", kde_kws={"shade": True});
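
To confirm which value is used as the fill, one can inspect the fitted imputer. This is a minimal sketch assuming a recent scikit-learn with `SimpleImputer`; the seed, the chosen column and the helper names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.impute import SimpleImputer

rng = np.random.RandomState(0)             # fixed seed, an assumption
data = datasets.load_iris()['data'].copy()
rows = rng.randint(data.shape[0], size=75)
cols = rng.randint(data.shape[1], size=75)
data[rows, cols] = np.nan

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputed = imp.fit_transform(data)

col = 0                                     # column to inspect
observed = data[~np.isnan(data[:, col]), col]
values, counts = np.unique(observed, return_counts=True)
fill = imp.statistics_[col]                 # value used to fill this column
fill_count = int(counts[np.isclose(values, fill)].sum())
print(fill, fill_count, int(counts.max()))
```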


Now we use the Euclidean distance to find the closest examples and replace each missing value with the mean of the values of its 3 nearest neighbors.

In [23]:
from kemlglearn.preprocessing import KnnImputer
knnimp = KnnImputer(missing_values='NaN', n_neighbors=3)
imp_iris3 = knnimp.fit_transform(iris['data'])
fig = plt.figure(figsize=(8,8))
plt.scatter(imp_iris3[:, 2], imp_iris3[:, 1], c=lcols,s=100);


As we can see, the examples look more naturally distributed, and the distribution of the attribute now looks more similar to the original one.

In [24]:
vshow=0

fig = plt.figure(figsize=(10,10))
sns.distplot(orig[:,vshow], hist=False, rug=True, color="r", kde_kws={"shade": True})
sns.distplot(imp_iris3[:,vshow], hist=False, rug=True, color="g", kde_kws={"shade": True});
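
If `kemlglearn` is not available, a similar experiment can be sketched with scikit-learn's own `KNNImputer`. This is a substitute, not the imputer used above: it relies on a NaN-aware Euclidean distance and may differ in details. The seed and the error measure (mean absolute error on the corrupted cells) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.impute import KNNImputer

rng = np.random.RandomState(0)              # fixed seed, an assumption
original = datasets.load_iris()['data']
data = original.copy()
rows = rng.randint(data.shape[0], size=75)
cols = rng.randint(data.shape[1], size=75)
data[rows, cols] = np.nan
mask = np.isnan(data)                       # True where a value was removed

# Mean of the 3 nearest neighbors under the NaN-aware Euclidean distance
imputed = KNNImputer(n_neighbors=3, weights='uniform').fit_transform(data)

# Compare both strategies against the known original values
col_means = np.broadcast_to(np.nanmean(data, axis=0), data.shape)
err_knn = float(np.abs(imputed[mask] - original[mask]).mean())
err_mean = float(np.abs(col_means[mask] - original[mask]).mean())
print(err_knn, err_mean)
```

Comparing the two errors on the cells whose true values are known gives a numeric counterpart to the visual comparison of the density plots.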
