### An outlier is a data point that lies far from the other observations in a dataset, i.e. outside the overall distribution of the data.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Identifying the Outlier
1. A data point that falls more than 1.5 times the interquartile range (IQR) below the 1st quartile or above the 3rd quartile.
2. A data point whose Z-score is greater than 3 or less than -3, i.e. it lies more than 3 standard deviations from the mean.

### Reason for the outlier

1. Variability in the data.
2. An experimental measurement error

### Impact of outliers on the data
1. Can distort the results of statistical analysis.
2. Significant impact on mean and standard deviation
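
To make the second point concrete, here is a minimal sketch that compares the mean and standard deviation of a small sample with and without one extreme value. The sample values are made up for illustration and are not part of this notebook's dataset.

```python
import numpy as np

# Hypothetical sample, chosen only for illustration
clean = np.array([10, 12, 11, 13, 12, 14])
with_outlier = np.append(clean, 100)  # same values plus one extreme point

# Both statistics shift noticeably once the extreme point is included
print("mean:", clean.mean(), "->", with_outlier.mean())
print("std: ", clean.std(), "->", with_outlier.std())
```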

### Finding outliers
1. Scatter plot
2. Box plot
3. Z-score
4. IQR (Inter Quartile Range)
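
As a rough sketch of the box plot approach, the cell below draws a box plot with matplotlib and reads back the points it draws beyond the whiskers. It uses the same sample values as this notebook's `dataset`, repeated here so the cell runs on its own; the name `flier_values` is introduced only for this example.

```python
import matplotlib.pyplot as plt

# Same values as the notebook's dataset, repeated so this cell is self-contained
dataset = [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107, 10,13,12,14,12,108,
           12,11,14,13,15,10,15,12,10,14,13,15,10]

fig, ax = plt.subplots()
result = ax.boxplot(dataset)  # default whis=1.5, i.e. the 1.5 * IQR rule
ax.set_ylabel("value")

# Points drawn individually beyond the whiskers ("fliers") are candidate outliers
flier_values = sorted(int(v) for v in result["fliers"][0].get_ydata())
print(flier_values)
```

With `%matplotlib inline` the figure is rendered at the end of the cell; in a plain script a call to `plt.show()` would be needed.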

In [2]:
dataset = [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107, 10,13,12,14,12,108,12,11,14,13,15,10,15,12,10,14,13,15,10]

### Outlier detection using Z score

Formula for Z score = (observation - mean) / standard deviation

z = (X - μ) / σ

where μ is the mean and σ is the standard deviation of the dataset.
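
The formula can also be applied to the whole array at once. The following is a minimal vectorized sketch, assuming the population standard deviation (NumPy's default, `ddof=0`) and the usual cutoff of 3; the names `z_scores` and `flagged` are introduced only for this example.

```python
import numpy as np

# Same values as the notebook's dataset, repeated so this cell is self-contained
dataset = [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107, 10,13,12,14,12,108,
           12,11,14,13,15,10,15,12,10,14,13,15,10]

data = np.asarray(dataset, dtype=float)

# z = (X - mean) / std, computed element-wise for every observation
z_scores = (data - data.mean()) / data.std()

# Keep the observations whose absolute Z-score exceeds 3
flagged = data[np.abs(z_scores) > 3]
print(flagged)
```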

In [3]:
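def detect_outliers(data, threshold=3):
    """Return the values whose absolute Z-score exceeds the threshold."""
    outliers = []  # local list, so repeated calls do not accumulate results
    mean = np.mean(data)
    std = np.std(data)  # population standard deviation (ddof=0)

    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers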

In [5]:
outliers_pt = detect_outliers(dataset)

In [6]:
outliers_pt

[102, 107, 108]

### InterQuartile Range
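The interquartile range (IQR) is the difference between the 75th percentile (third quartile) and the 25th percentile (first quartile) of the dataset.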

#### Steps

1. Arrange the data in increasing order.
2. Calculate the first quartile (q1) and the third quartile (q3).
3. Find the interquartile range: IQR = q3 - q1.
4. Find the lower bound: q1 - 1.5 * IQR.
5. Find the upper bound: q3 + 1.5 * IQR.

Anything that lies below the lower bound or above the upper bound is an outlier.

In [7]:
## Sort the data
sorted(dataset)

[10,
 10,
 10,
 10,
 10,
 11,
 11,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 13,
 13,
 13,
 13,
 14,
 14,
 14,
 14,
 14,
 14,
 15,
 15,
 15,
 15,
 15,
 17,
 19,
 102,
 107,
 108]

In [8]:
quantile1, quantile3 = np.percentile(dataset,[25,75])

In [9]:
print(quantile1,quantile3)

12.0 15.0


In [11]:
## Find the IQR

iqr_value=quantile3-quantile1
print(iqr_value)

3.0


In [13]:
## Find the lower bound and upper bound value

lower_bound_val = quantile1 - (1.5 *  iqr_value)
upper_bound_val = quantile3 + (1.5 * iqr_value)

In [14]:
print(lower_bound_val, upper_bound_val)

7.5 19.5
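
With the bounds computed, the remaining step is to flag the values that fall outside them. The cell below is one possible sketch of that step; it recomputes the quartiles so that it runs on its own, and the name `iqr_outliers` is introduced only for this example.

```python
import numpy as np

# Same values as the notebook's dataset, repeated so this cell is self-contained
dataset = [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107, 10,13,12,14,12,108,
           12,11,14,13,15,10,15,12,10,14,13,15,10]

q1, q3 = np.percentile(dataset, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep every value that lies below the lower bound or above the upper bound
iqr_outliers = [x for x in dataset if x < lower or x > upper]
print(iqr_outliers)
```

For this dataset the values flagged are the same three that the Z-score method reported.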
