## Detecting Outliers

In [1]:
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Create Data
X, _ = make_blobs(n_samples=10,
                  n_features=2,
                  centers=1,
                  random_state=1)
X

array([[-1.83198811,  3.52863145],
       [-2.76017908,  5.55121358],
       [-1.61734616,  4.98930508],
       [-0.52579046,  3.3065986 ],
       [ 0.08525186,  3.64528297],
       [-0.79415228,  2.10495117],
       [-1.34052081,  4.15711949],
       [-1.98197711,  4.02243551],
       [-2.18773166,  3.33352125],
       [-0.19745197,  2.34634916]])

In [4]:
# Replace two values with extreme numbers to create outliers
X[1, 1] = 99999
X[8, 0] = 99999
X

array([[-1.83198811e+00,  3.52863145e+00],
       [-2.76017908e+00,  9.99990000e+04],
       [-1.61734616e+00,  4.98930508e+00],
       [-5.25790464e-01,  3.30659860e+00],
       [ 8.52518583e-02,  3.64528297e+00],
       [-7.94152277e-01,  2.10495117e+00],
       [-1.34052081e+00,  4.15711949e+00],
       [-1.98197711e+00,  4.02243551e+00],
       [ 9.99990000e+04,  3.33352125e+00],
       [-1.97451969e-01,  2.34634916e+00]])

### Detect Outliers

<font color=blue>EllipticEnvelope</font> assumes the data is normally distributed and, based on that assumption, “draws” an ellipse around the data. Any observation inside the ellipse is classified as an <b>inlier</b> (<font color=blue>labeled as 1</font>), and any observation outside it as an <b>outlier</b> (<font color=blue>labeled as -1</font>). <br>
A major limitation of this approach is that we must specify a <font color=blue>contamination</font> parameter, the proportion of observations that are outliers, which is a value we usually don’t know in advance.
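
To make the effect of <font color=blue>contamination</font> concrete, the following is a minimal sketch that refits the detector at a few settings and counts how many rows are flagged. It rebuilds the same toy data so it can run on its own; the `random_state=0` argument and the `outlier_counts` name are assumptions added here for repeatability and are not part of the original notebook.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Rebuild the same toy data and inject the same two extreme values
X, _ = make_blobs(n_samples=10, n_features=2, centers=1, random_state=1)
X[1, 1] = 99999
X[8, 0] = 99999

# Count how many observations are flagged at each contamination setting.
# random_state is fixed (an assumption, not in the original) so each fit is repeatable.
outlier_counts = {}
for contamination in (0.1, 0.2, 0.3):
    detector = EllipticEnvelope(contamination=contamination, random_state=0)
    labels = detector.fit(X).predict(X)
    outlier_counts[contamination] = int((labels == -1).sum())

print(outlier_counts)
```

The number of flagged rows tracks the contamination value rather than the data itself, which is why a poor guess can hide a real outlier or flag a normal point.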

In [7]:
# Note the issue: we injected two outliers into 10 observations (rows).
# With contamination=.1 (10%, i.e. 1 of 10 rows), only X[1] is flagged as an outlier.
# With contamination=.2 (20%, i.e. 2 of 10 rows), both X[1] and X[8] are flagged.

# Create detector (contamination=.2 matches the two injected outliers)
outlier_detector = EllipticEnvelope(contamination=.2)

# Fit detector
outlier_detector.fit(X)

# Predict outliers
outlier_detector.predict(X)

array([ 1, -1,  1,  1,  1,  1,  1,  1, -1,  1])
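
The label array is easier to act on once it is turned into row indices or a boolean mask. A small sketch, using the labels printed above copied into a NumPy array (the names `outlier_rows` and `inlier_mask` are illustrative, not from the original):

```python
import numpy as np

# Labels copied from the predict() output above
labels = np.array([1, -1, 1, 1, 1, 1, 1, 1, -1, 1])

# Row indices of the observations flagged as outliers (label -1)
outlier_rows = np.where(labels == -1)[0]
print(outlier_rows)  # [1 8]

# Boolean mask that keeps only the inliers, e.g. X[inlier_mask]
inlier_mask = labels == 1
print(inlier_mask.sum())  # 8
```

The two indices match the rows where the extreme values were injected, X[1] and X[8].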