# Outliers

*SG2227 Saleh Rezaeiravesh and Philipp Schlatter*

In [16]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import math as mt
from scipy import special
from scipy import stats
# norm is the normal distribution from scipy.stats (not the vector norm from numpy.linalg)
from scipy.stats import chi2, t, norm

π = mt.pi

Here we show a method based on *Chauvenet's criterion* as one way to identify an outlier in a series of data. It goes back to William Chauvenet (24 May 1820 in Milford, Pennsylvania – 13 December 1870 in St. Paul, Minnesota), who worked mainly for the US Navy.

The basic idea is to define a band of acceptable values around the mean of the data set. The width of this band depends on both the variance (or its estimator) and the number of samples in the series. The more samples you have, the more spread you allow.

More precisely, the band is chosen such that the expected number of points lying outside it is $N \cdot P = 1/2$, where $P$ is the probability that a single point falls outside the band. The value $1/2$ is the threshold at which this expected count would be rounded down (to zero) or up (to one) with equal probability. Valid points should therefore be "close" to the mean with probability

$$P_N=1-\frac{1}{2N}=\frac{N-\frac12}{N} \ . $$ 

Assuming a normal distribution and using the normalised Gaussian variable $z=(x-\mu)/\sigma$, this translates to

$$P(-z_\alpha \leq z \leq z_\alpha) = 1-\alpha = \frac{1}{\sqrt{2\pi}}\int_{-z_\alpha}^{z_\alpha} \exp\left(-\frac{u^2}{2}\right) \mathrm d u $$

with $P_N=1-\alpha = 1-1/(2N)$. The valid points are then in the interval

$$x\in [\overline{x}-z_\alpha s_x, \overline{x}+z_\alpha s_x ] \ .$$
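As a quick plausibility check of the construction above, the following minimal sketch draws many samples of size $N$ from a standard normal distribution and counts how many points per sample fall outside $\pm z_\alpha$. It assumes that $\mu$ and $\sigma$ are known exactly (here 0 and 1), which is a simplification compared to using $\overline{x}$ and $s_x$. The seed, `N` and `n_trials` are arbitrary illustrative choices.

```python
import numpy as np
from scipy import special

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily for reproducibility
N = 10                           # points per sample (illustrative choice)
n_trials = 20000                 # number of repeated samples (illustrative choice)

# Chauvenet: probability for one point to lie outside the band is 1/(2N)
alpha = 1 / (2 * N)
z_alpha = np.sqrt(2) * special.erfinv(1 - alpha)

# Each row is one sample of N standard-normal values (mu = 0, sigma = 1 assumed known)
z = rng.standard_normal((n_trials, N))

# Number of points outside [-z_alpha, z_alpha] in each sample, then the average
n_outside = np.sum(np.abs(z) > z_alpha, axis=1)
mean_outside = n_outside.mean()

print(f"z_alpha = {z_alpha:.4f}, mean number of points outside = {mean_outside:.3f}")
```

The average count should come out close to $N \cdot P = 1/2$, up to sampling noise.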

The relevant values of $z_\alpha$ for a few $N$ are listed in the table below. For instance, for a sample of 10 points, $P_N = 95\%$, so a valid point must lie in the region that contains $95\%$ of the population. For $N=1000$, this probability increases to $99.95\%$. Thus, the more points one has, the more forgiving one may be, as the estimated parameters ($\hat{\mu}$, $\hat{\sigma}$) are less affected by individual points far from the mean. Note that for small $N$ one may use the t-distribution instead of the normal distribution.

In [17]:
Nlist=(3,4,5,6,10,20,50,100,200,500,1000)
print('N        z_α      1-α')
for N in Nlist:
    α = 1/(2*N)
    print('%4d   %.4f   %.4f' % (N,np.sqrt(2)*special.erfinv(1-α),1-α))

N        z_α      1-α
   3   1.3830   0.8333
   4   1.5341   0.8750
   5   1.6449   0.9000
   6   1.7317   0.9167
  10   1.9600   0.9500
  20   2.2414   0.9750
  50   2.5758   0.9900
 100   2.8070   0.9950
 200   3.0233   0.9975
 500   3.2905   0.9990
1000   3.4808   0.9995
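The same $z_\alpha$ can also be written in terms of the normal quantile function, since $P(-z_\alpha \leq z \leq z_\alpha) = 1-\alpha$ is equivalent to $\Phi(z_\alpha) = 1-\alpha/2$. The sketch below checks this with `scipy.stats.norm.ppf` and, as one possible reading of the remark on small $N$, compares it with the corresponding two-sided quantile of the t-distribution. Using $N-1$ degrees of freedom is an assumption made here for illustration, and the helper names `z_normal` and `z_student` are hypothetical.

```python
import numpy as np
from scipy import special
from scipy.stats import norm, t

def z_normal(N):
    """Two-sided normal quantile for the Chauvenet probability 1 - 1/(2N)."""
    alpha = 1 / (2 * N)
    return norm.ppf(1 - alpha / 2)

def z_student(N):
    """Same quantile from the t-distribution with N-1 degrees of freedom (assumption)."""
    alpha = 1 / (2 * N)
    return t.ppf(1 - alpha / 2, df=N - 1)

print('   N   erfinv   norm.ppf   t.ppf')
for N in (3, 5, 10, 100):
    z_erf = np.sqrt(2) * special.erfinv(1 - 1 / (2 * N))   # formula used in the table above
    print('%4d   %.4f   %.4f     %.4f' % (N, z_erf, z_normal(N), z_student(N)))
```

The first two columns should agree, while the t-based value is larger, most noticeably for small $N$, which widens the acceptance band.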


Typically, one applies Chauvenet's criterion just once, but it can also be applied iteratively to detect so-called *shielded outliers*, which only become apparent once other, stronger outliers have been removed.

In [33]:
T=np.array([24.67,24.75,25.02,24.70,24.83,24.08,25.11,25.00])
np.mean(T),np.std(T)

(24.77, 0.3012474066278416)

In [34]:
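N = T.size
α = 1/(2*N)                            # probability of lying outside the band
z_α = np.sqrt(2)*special.erfinv(1-α)   # half-width of the band in units of the standard deviation
d = z_α*np.std(T)                      # np.std uses ddof=0 by default
m = np.mean(T)
TT = np.array([])                      # accepted (valid) points
for i in range(N):
    if (T[i]>(m+d) or T[i]<(m-d)):
        print('Detected outlier: ',T[i])
    else:
        TT = np.append(TT,T[i])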

Detected outlier:  24.08


In [35]:
np.mean(TT),np.std(TT)

(24.868571428571425, 0.16119452059356004)

The mean changes only slightly (from 24.77 to 24.87), whereas the standard deviation is reduced by almost a factor of two (from 0.30 to 0.16).
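The iterative application mentioned earlier can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names `chauvenet_mask` and `chauvenet_iterative`, the `max_iter` limit and the stop at fewer than three points are assumptions made here. As in the cells above, `np.std` is used with its default `ddof=0`.

```python
import numpy as np
from scipy import special

def chauvenet_mask(x):
    """Boolean mask, True for points accepted by one pass of Chauvenet's criterion."""
    N = x.size
    alpha = 1 / (2 * N)
    z_alpha = np.sqrt(2) * special.erfinv(1 - alpha)
    # Accept points within z_alpha standard deviations of the mean (ddof=0, as above)
    return np.abs(x - np.mean(x)) <= z_alpha * np.std(x)

def chauvenet_iterative(x, max_iter=10):
    """Repeat the criterion until no point is rejected (or max_iter is reached)."""
    x = np.asarray(x, dtype=float)
    removed = []
    for _ in range(max_iter):
        if x.size < 3:            # too few points left to judge (arbitrary guard)
            break
        keep = chauvenet_mask(x)
        if keep.all():            # nothing rejected in this pass: stop
            break
        removed.extend(x[~keep].tolist())
        x = x[keep]
    return x, removed

T = np.array([24.67, 24.75, 25.02, 24.70, 24.83, 24.08, 25.11, 25.00])
kept, removed = chauvenet_iterative(T)
print('Removed:', removed)
print('Kept   :', kept)
```

For the data set used here, only the first pass rejects a point (24.08); the second pass accepts all remaining values, so there are no shielded outliers in this example.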