
> **Finding Outliers**





https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/

* **Standard Deviation** can be used to identify outliers in Gaussian or Gaussian-like data.
* **Interquartile range** can be used to identify outliers in any data regardless of distribution.

* We use the randn() function to draw random Gaussian values with mean 0 and std 1, then scale them by 5 and shift them by 50 to get a sample with mean 50 and std 5.

In [None]:
from numpy.random import seed,randn
from numpy import mean, std

#seeding the random generator to get same values
seed(1)

#generating the data
data = 5* randn(10000) + 50

#printing mean and std of generated data

print(f'mean = {round(mean(data),3)}, Std = {round(std(data),3)}')

mean = 50.049, Std = 4.994


* These values are close to the mean (50) and std (5) we used for generating the data.

# Standard Deviation Method

* If we know the data is Gaussian or Gaussian-like, we can use this method.
* The Gaussian distribution gives us the 68 - 95 - 99.7% rule, the share of values that lie within 1, 2 and 3 standard deviations of the mean:
* **μ ± σ : 68%**
* **μ ± 2σ : 95%**
* **μ ± 3σ : 99.7%**
* Values more than 3 standard deviations from the mean (outside the 99.7% range) are often treated as outliers in a Gaussian distribution.
* For smaller samples, 2 standard deviations may be used as the cut-off for outliers.
* For larger samples, even 4 standard deviations may be used as the cut-off for outliers.
* Sometimes the data is standardized first, i.e. μ = 0 and σ = 1. In that case we detect outliers by applying the cut-off directly to the Z-scores.
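
The 68 - 95 - 99.7% rule and the Z-score cut-off can be checked empirically. The following is a minimal sketch, not part of the original notebook: the seed, the sample size of 100,000 and the names `sample`, `z_scores` and `shares` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)                    # assumed seed, for reproducibility
sample = 5 * rng.standard_normal(100_000) + 50    # Gaussian sample, mean 50, std 5

# Z-score: distance from the mean in units of standard deviation
z_scores = (sample - sample.mean()) / sample.std()

# Share of values within 1, 2 and 3 standard deviations of the mean
shares = {k: float(np.mean(np.abs(z_scores) <= k)) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f'within {k} std : {share:.1%}')

# On standardized data the cut-off is applied to the Z-scores directly
outliers = sample[np.abs(z_scores) > 3]
print(f'No. of identified outliers : {len(outliers)}')
```

The printed shares should land close to 68%, 95% and 99.7%, with small deviations because the sample is finite.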

* We'll calculate the mean and std of the given sample, then use them to calculate the cut-off for detecting outliers.

In [None]:
data_mean, data_std = mean(data), std(data)

#Identifying outliers
cut_off = data_std*3

lower, upper = data_mean - cut_off, data_mean + cut_off

#Collecting the outliers that fall outside the above range

outliers = [x for x in data if x <lower or x > upper]
print(f'No. of identified outliers : {len(outliers)}')

#filtering the data from outliers
outliers_removed = [x for x in data if x >= lower and x <= upper]
print(f'No. of non outliers : {len(outliers_removed)}')

No. of identified outliers : 29
No. of non outliers : 9971


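* The above handles data with a single variable.
* If the data has two features, the same kind of cut-off defines an ellipse around the mean.
* With three features, it defines an ellipsoid.
* With more features, it defines the corresponding higher-dimensional ellipsoid.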
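
One common way to obtain such an elliptical bound is the Mahalanobis distance, which measures the distance from the mean in units of standard deviation while accounting for correlation between features. The notebook does not name this technique, so treat the following as a sketch under that assumption. The two-feature data set, its covariance matrix and the cut-off of 3 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)                     # assumed seed
# Hypothetical two-feature Gaussian data with correlated features
true_cov = np.array([[25.0, 10.0],
                     [10.0, 16.0]])
points = rng.multivariate_normal(mean=[50.0, 20.0], cov=true_cov, size=10_000)

# Estimate the centre and the inverse covariance from the sample
centre = points.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(points, rowvar=False))

# Squared Mahalanobis distance of every row from the centre
diff = points - centre
d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)

# Points farther than the cut-off lie outside the ellipse
cut_off = 3.0                                      # assumed, mirrors the 3 std rule
is_outlier = np.sqrt(d2) > cut_off
print(f'No. of identified outliers : {int(is_outlier.sum())}')
print(f'No. of non outliers : {int((~is_outlier).sum())}')
```

Note that in two dimensions a Mahalanobis distance of 3 covers roughly 98.9% of a Gaussian rather than 99.7%, so the cut-off may need to grow with the number of features.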

# Interquartile Range (IQR) Method

* This method can be used when the data is not Gaussian (not normally distributed).
* The IQR is a good summary statistic for non-Gaussian data.
* IQR = difference between the 75th and 25th %ile. It defines the box in a box-and-whisker plot and covers the middle 50% of the data.
* %iles can be calculated by sorting the observations and selecting values at specific indices.
* The 50th %ile is the middle value, or the average of the two middle values when the number of observations is even.
* The IQR can be used to detect outliers by setting limits a factor k of the IQR below the 25th %ile and above the 75th %ile.
* A common value is k = 1.5. A factor of 3 or more is used to detect extreme outliers.
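
The sort-and-index calculation of a %ile can be sketched as follows. The helper `percentile_by_sorting` is a hypothetical name, not a NumPy function. It assumes linear interpolation between neighbouring sorted values, which is the default behaviour of `numpy.percentile`.

```python
import numpy as np

def percentile_by_sorting(values, q):
    """q-th percentile via sorting and linear interpolation (illustrative helper)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    # Fractional index of the requested percentile in the sorted data
    position = (len(ordered) - 1) * q / 100.0
    low, high = int(np.floor(position)), int(np.ceil(position))
    fraction = position - low
    # Interpolate between the two neighbouring sorted values
    return ordered[low] + fraction * (ordered[high] - ordered[low])

values = [7, 1, 5, 3]
# Even number of observations: the 50th %ile is the average of 3 and 5
print(percentile_by_sorting(values, 50))   # 4.0
```

For everyday use `numpy.percentile` is the better choice. The helper only makes the indexing step visible.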

In [None]:
from numpy import percentile

#calculating interquartile range
q25, q75 = percentile(data,25), percentile(data,75)

iqr = q75 - q25

#cut off for outliers = 1.5 times iqr
cut_off = 1.5 * iqr

#setting limits
lower, upper = q25-cut_off, q75 + cut_off

#detecting outliers
outliers = [x for x in data if x <lower or x > upper]
print(f'No. of identified outliers : {len(outliers)}')

#keeping only the values inside the limits
outliers_removed = [x for x in data if x>=lower and x<=upper]
print(f'No. of non outliers : {len(outliers_removed)}')

No. of identified outliers : 81
No. of non outliers : 9919


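* The above approach can be extended to higher dimensions by computing the limits separately for each feature.
* The decision boundary is then a rectangle (two features) or a hyper-rectangle (more features).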
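
A minimal sketch of the per-feature extension: each column gets its own IQR limits, and a row is kept only if every feature lies inside them. The three-feature data set, the seed and the names `features`, `inside` and `kept` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)                     # assumed seed
# Hypothetical data: 10,000 rows, 3 independent Gaussian features
scales = np.array([5.0, 2.0, 10.0])
shifts = np.array([50.0, 0.0, 100.0])
features = rng.standard_normal((10_000, 3)) * scales + shifts

# Per-feature quartiles and IQR (axis=0 works column by column)
q25, q75 = np.percentile(features, [25, 75], axis=0)
iqr = q75 - q25

# Limits of the hyper-rectangle, using k = 1.5
k = 1.5
lower, upper = q25 - k * iqr, q75 + k * iqr

# A row is inside only if every feature is within its limits
inside = np.all((features >= lower) & (features <= upper), axis=1)
outliers, kept = features[~inside], features[inside]
print(f'No. of identified outliers : {len(outliers)}')
print(f'No. of non outliers : {len(kept)}')
```

Because a row is flagged when any single feature falls outside its limits, the share of flagged rows tends to grow as features are added.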