## Some outlier detection techniques

### Paul Anzel, 12/16/15

The basic technique I was taught for finding outliers was to find the $z$-score of various points--find the mean $\mu$ and the standard deviation $\sigma$ and calculate

$$z_i = \frac{x_i - \mu}{\sigma}$$

(and with various extensions for multidimensional data).

This is fine as far as it goes when you have hundreds of points, a tiny fraction of outliers, and normally distributed data. However, for the data I have to deal with, it doesn't work well.

In [1]:
from __future__ import print_function, division
import numpy as np
import scipy as sp
np.set_printoptions(suppress=True, precision=3)
np.random.seed(200)

In [2]:
# x[3] is a bogus datapoint where a scraper gave a dumb datapoint
x = np.array([10, 11, 10, 100001, 9, 10, 11])
mean_x = x.mean()
print('Mean x = %.1f' % mean_x)
std_x = x.std()
print('StDev x = %.1f' % std_x)
z_score = (x - mean_x)/std_x
print(z_score)

Mean x = 14294.6
StDev x = 34989.5
[-0.408 -0.408 -0.408  2.449 -0.408 -0.408 -0.408]
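
The 2.449 is not specific to this scraper error. With the population standard deviation (which is what `x.std()` computes), no point in a sample of $N$ values can have $|z|$ above $\sqrt{N-1}$, and $\sqrt{6} \approx 2.449$ for seven points. Here is a minimal sketch of that cap using only the standard library; the helper name `max_abs_z` is made up for illustration.

```python
import math
import statistics

def max_abs_z(values):
    """Largest |z| in the sample, using the population standard deviation (like x.std())."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return max(abs(v - mean) for v in values) / std

# Make the bogus point more and more extreme; the largest z-score barely moves.
for bogus in (1e3, 1e5, 1e9):
    sample = [10, 11, 10, bogus, 9, 10, 11]
    cap = math.sqrt(len(sample) - 1)
    print('bogus = %.0e  max |z| = %.3f  cap = %.3f' % (bogus, max_abs_z(sample), cap))
```

However large the bad value gets, it inflates the standard deviation along with its own deviation, so the ratio stays pinned near the cap.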


So, if we had flagged only points with $|z| > 2.5$, we'd completely miss the bogus point. We could try a more robust estimate of the center, the median.

In [3]:
median_x = np.median(x)
print('Median x = %.1f' % median_x)
zm_score = (x - median_x)/std_x
print(zm_score)

Median x = 10.0
[ 0.     0.     0.     2.858 -0.     0.     0.   ]


But that would still fail if we used $|z| > 3$ as a threshold. The next step is a more robust estimate of the spread, say the interquartile range (IQR), which is about $1.349\sigma$ for normally distributed data.

In [4]:
q75, q25 = np.percentile(x, [75 ,25])
iqr = q75 - q25
fake_std = iqr/1.349
print('IQR = %.1f' % iqr)
zmi_score = (x - median_x)/fake_std
print(zmi_score)

IQR = 1.0
[      0.          1.349       0.     134887.859      -1.349       0.
       1.349]


But we can run into some issues, particularly if we have very few data-points (how do you even really define IQR on all of 3 data points?).

In [5]:
x2 = [9, 10, 11, 100001]
q75, q25 = np.percentile(x2, [75 ,25])
iqr = q75 - q25
fake_std = iqr/1.349
print('IQR = %.2f :(' % iqr)
zmi2 = (x2 - np.median(x2))/fake_std
print(zmi2)

x3 = [1000, 9, 9, 10, 11, 100001]
q75, q25 = np.percentile(x3, [75, 25])
iqr = q75 - q25
fake_std = iqr/1.349
print('IQR = %.2f :(' % iqr)
zmi3 = (x3 - np.median(x3))/fake_std
print(zmi3)

IQR = 24998.75 :(
[-0.    -0.     0.     5.396]
IQR = 743.50 :(
[   1.795   -0.003   -0.003   -0.001    0.001  181.422]


In [6]:
x3 = [1000, 9, 9, 9, 10, 11, 100001]
q75, q25 = np.percentile(x3, [75, 25])
iqr = q75 - q25
fake_std = iqr/1.349
print('IQR = %.2f :(' % iqr)
zmi3 = (x3 - np.median(x3))/fake_std
print(zmi3)

IQR = 496.50 :(
[   2.69    -0.003   -0.003   -0.003    0.       0.003  271.677]
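
Part of the problem is that, with this few points, the quartiles themselves depend on which interpolation convention is used. As a rough illustration (not part of the original analysis), the standard library's `statistics.quantiles` offers two conventions: `'inclusive'` matches the default linear interpolation of `np.percentile`, while `'exclusive'` places the quartiles differently. The helper name `iqr` is made up here.

```python
import statistics

x2 = [9, 10, 11, 100001]

def iqr(data, method):
    """Interquartile range using one of the statistics.quantiles conventions."""
    q25, _median, q75 = statistics.quantiles(data, n=4, method=method)
    return q75 - q25

for method in ('inclusive', 'exclusive'):
    print('%-9s IQR = %.2f' % (method, iqr(x2, method)))
```

On this four-point sample the two conventions give 24998.75 and 74994.25, a factor of three apart, so any "standard deviation" derived from the IQR is mostly an artifact of the convention.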


For these very small data sets, here are some alternative methods.

### [Dixon's Q test](https://en.wikipedia.org/wiki/Dixon%27s_Q_test)

[Original paper](http://depa.fquim.unam.mx/amyd/archivero/ac1951_23_636_13353.pdf)

[Python implementation](http://sebastianraschka.com/Articles/2014_dixon_test.html) - source for much of the code here.

The basic idea is this--sort the data, and look at the minimum and maximum data points. For these points, calculate

$$ Q = \frac{\text{gap to next point}}{\text{range}}$$

If Q is above some critical value (i.e., there's a massive gap), there's a good chance that the point is an outlier. The critical values of Q are given below for sample sizes of 3, 4, ..., 30 points (28 entries per list); 90, 95, and 99 are the confidence levels.

In [7]:
q90 = [0.941, 0.765, 0.642, 0.56, 0.507, 0.468, 0.437,
       0.412, 0.392, 0.376, 0.361, 0.349, 0.338, 0.329,
       0.32, 0.313, 0.306, 0.3, 0.295, 0.29, 0.285, 0.281,
       0.277, 0.273, 0.269, 0.266, 0.263, 0.26
      ]

q95 = [0.97, 0.829, 0.71, 0.625, 0.568, 0.526, 0.493, 0.466,
       0.444, 0.426, 0.41, 0.396, 0.384, 0.374, 0.365, 0.356,
       0.349, 0.342, 0.337, 0.331, 0.326, 0.321, 0.317, 0.312,
       0.308, 0.305, 0.301, 0.29
      ]

q99 = [0.994, 0.926, 0.821, 0.74, 0.68, 0.634, 0.598, 0.568,
       0.542, 0.522, 0.503, 0.488, 0.475, 0.463, 0.452, 0.442,
       0.433, 0.425, 0.418, 0.411, 0.404, 0.399, 0.393, 0.388,
       0.384, 0.38, 0.376, 0.372
       ]

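# Each list has 28 entries, one per sample size starting at n = 3.
# (Zipping against range(3, len(q90)+1) would silently drop the last two.)
Q90 = {n: q for n, q in enumerate(q90, start=3)}
Q95 = {n: q for n, q in enumerate(q95, start=3)}
Q99 = {n: q for n, q in enumerate(q99, start=3)}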

For x2 above...

In [8]:
x2.sort()
print(x2)

[9, 10, 11, 100001]


In [9]:
Q_min = (x2[1]-x2[0])/(x2[-1]-x2[0])
print('Q_min = %.3f' % Q_min)
Q_max = (x2[-1]-x2[-2])/(x2[-1]-x2[0])
print('Q_max = %.3f' % Q_max)
print(Q_max > Q95[len(x2)])

Q_min = 0.000
Q_max = 1.000
True


Two things to note:
- This algorithm only works for one point at a time, though you could potentially iterate it (Risky! Think about [10, 10.1, 11, 1000] with Q90). There is an extension for multiple outliers, but it's not often used.
- The algorithm assumes that the underlying data is normally distributed.
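
The steps above can be wrapped in a small helper. This is a minimal sketch, not the implementation from the linked article: the names `dixon_q` and `Q95_SMALL` are made up here, and the table holds only the first few 95% critical values from the list above (3 to 7 points).

```python
# First few 95% critical values from the table above, keyed by sample size.
Q95_SMALL = {3: 0.97, 4: 0.829, 5: 0.71, 6: 0.625, 7: 0.568}

def dixon_q(data, critical=Q95_SMALL):
    """Return (Q_min, Q_max, suspect); suspect is the flagged value or None."""
    values = sorted(data)
    n = len(values)
    if n not in critical:
        raise ValueError('no critical value for n = %d' % n)
    spread = values[-1] - values[0]
    if spread == 0:
        return 0.0, 0.0, None  # all points identical, nothing to flag
    q_min = (values[1] - values[0]) / spread
    q_max = (values[-1] - values[-2]) / spread
    # Only the more extreme of the two end points is tested.
    q, suspect = max((q_min, values[0]), (q_max, values[-1]))
    return q_min, q_max, (suspect if q > critical[n] else None)

print(dixon_q([9, 10, 11, 100001]))   # the x2 example: 100001 is flagged
print(dixon_q([10, 10.1, 11, 1000]))  # first pass flags 1000
print(dixon_q([10, 10.1, 11]))        # second pass: Q_max is about 0.9, below 0.97
```

The last two calls walk through the iteration from the first note: once 1000 is removed, the remaining three points give a Q of about 0.9 for the 11, which is uncomfortably close to the three-point critical values even though nothing looks wrong with it.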

### [Grubbs' test](https://en.wikipedia.org/wiki/Grubbs%27_test_for_outliers)

This is another outlier test. To do this, compute the $z$ scores of the sample (using the sample standard deviation) and find the largest absolute value, $G = \max |z|$. We reject the hypothesis that there are no outliers at significance level $\alpha$ if

$$ G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N), N-2}}{N - 2 + t^2_{\alpha/(2N), N-2}}}$$

with $t_{\alpha/(2N), N-2}$ the upper critical value of the t-distribution with $N - 2$ degrees of freedom at significance level $\alpha/(2N)$.
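
As a rough worked example of the right-hand side, assuming $N = 7$ and $\alpha = 0.05$ (values picked here for illustration), the t critical value is taken at $0.05/14 \approx 0.0036$ with 5 degrees of freedom:

$$ t_{0.05/14,\,5} \approx 4.38, \qquad \frac{6}{\sqrt{7}} \sqrt{\frac{4.38^2}{5 + 4.38^2}} \approx 2.268 \times 0.891 \approx 2.02 $$

With the sample standard deviation, $G$ can never exceed $(N-1)/\sqrt{N} \approx 2.27$ for seven points, so at this sample size only a point that dominates the spread can clear the cutoff.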

Run this, removing the biggest outlier each time (if it exists), until it stops. But don't run this on too-small sample sizes (say $N \leq 6$).

In [21]:
from scipy import stats
def Grubbs_outlier_test(y_i, alpha=0.05):
    """
    Perform Grubbs' outlier test.
    
    ARGUMENTS
    y_i (list or numpy array) - dataset
    alpha (float) - significance level for the test (e.g. 0.05)

    RETURNS
    G_i (numpy array) - Grubbs G statistic for each member of the dataset
    Gtest (float) - rejection cutoff; the hypothesis that no outliers exist is
    rejected if G_i.max() > Gtest
    no_outliers (bool) - True if no outlier is detected at the specified
    significance level
    index (int) - integer index of the point with maximum G_i
    
    Code adapted from https://github.com/choderalab/cadd-grc-2013/blob/master/notebooks/Outliers.ipynb
    """
    y_i = np.asarray(y_i, dtype=float)
    # Grubbs' test is defined with the sample standard deviation (N - 1 denominator)
    s = y_i.std(ddof=1)
    G_i = np.abs(y_i - y_i.mean()) / s
    N = y_i.size
    # Upper critical value of the t-distribution with N - 2 degrees of freedom and a
    # significance level of alpha/(2N)
    t = stats.t.isf(alpha/(2*N), N-2)
    Gtest = (N-1)/np.sqrt(N) * np.sqrt(t**2 / (N-2+t**2))
    G = G_i.max()
    index = G_i.argmax()
    # No outlier is detected when the largest G stays at or below the cutoff
    no_outliers = (G <= Gtest)
    return [G_i, Gtest, no_outliers, index]

In [23]:
x_grubbs1 = np.array([9, 10, 11, 9, 10, 100000, 11])
G1, Gtest, noout, indexval = Grubbs_outlier_test(x_grubbs1)
print(G1)
print("%.3f" % Gtest)
print(noout)
print(indexval)

[ 0.378  0.378  0.378  0.378  0.378  2.268  0.378]
2.020
False
5


Note that both of these tests assume that your data is normally distributed.
- Let's say the data comes in digitized, and we get $[10, 12, 10]$. This might not raise any flags by eye, but the Q-test would see a gap/range ratio of 1 for the 12 value and flag it as an outlier. By contrast, something like $[10, 11, 10, 11, 10, 11, 10, 11, 20]$ gives $Q = 9/10 = 0.9$ for the 20, well above the 95% critical value of 0.493 for nine points, so there the 20 is flagged, as it should be given how out of place it looks.
- If the data is not normal, we can also run into problems. For example, with bimodal data centered around 1 and -1, it wouldn't be unreasonable to draw $[1.01, -0.9, 1.13]$, but this would throw these tests for a loop. In this case, you just can't run them.

I'll need to look for additional routines for non-normal data.