## What is an outlier?

An outlier is a data point that lies far from the other observations in a dataset, that is, a point that falls outside the overall 
distribution of the data.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

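CRITERIA TO IDENTIFY AN OUTLIER:
1. A data point that falls more than 1.5 times the IQR below the first quartile (Q1) or above the third quartile (Q3).
2. A data point that falls more than 3 standard deviations from the mean, i.e. whose z-score has an absolute value greater than 3. A stricter cutoff of 2 standard deviations is sometimes used.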

REASONS FOR AN OUTLIER TO EXIST:
1. Variability in the data
2. An experimental measurement error.

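Z-score: the number of standard deviations a value lies from the mean. Standardized data has mean = 0 and std = 1 (for normally distributed data this is the standard normal distribution).
$Z = \frac{X_i - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation (std).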
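
As a quick check of the formula, the sketch below standardizes a small sample with NumPy and confirms that the resulting z-scores have mean 0 and standard deviation 1. The `sample` values are made up for illustration, and `np.std` is used in its default population form (`ddof=0`).

```python
import numpy as np

# Hypothetical sample, chosen only for illustration
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = sample.mean()
sigma = sample.std()           # population standard deviation (ddof=0)
z = (sample - mu) / sigma      # z-score of every value at once

print(mu, sigma)               # 5.0 2.0
print(z.mean(), z.std())       # approximately 0 and 1
```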

IMPACTS OF HAVING OUTLIERS IN A DATASET:
1. It causes various problems during statistical analysis.
2. It may have a significant impact on the mean and the standard deviation.
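
To illustrate the second point, the sketch below compares the mean and standard deviation of a small sample with and without a single extreme value. The median is printed as a contrast because it barely moves. All values here are hypothetical.

```python
import numpy as np

# Hypothetical values: five ordinary points, then the same points plus one extreme value
base = [10, 12, 13, 14, 15]
with_outlier = base + [100]

for name, values in [("without outlier", base), ("with outlier", with_outlier)]:
    mean = round(float(np.mean(values)), 2)
    std = round(float(np.std(values)), 2)
    median = float(np.median(values))
    print(name, "mean:", mean, "std:", std, "median:", median)
```

The mean more than doubles and the standard deviation grows by more than a factor of ten, while the median shifts only from 13.0 to 13.5.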

WAYS TO FIND AN OUTLIER:
1. Using scatter plots
2. Using box plots
3. Using the z-score
4. Using the IQR
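
The box plot approach can be sketched with matplotlib as follows. The `values` list is a short hypothetical sample. With matplotlib's default whisker length of 1.5 times the IQR, points beyond the whiskers are drawn as individual markers ("fliers"), which are the candidate outliers.

```python
import matplotlib.pyplot as plt

# Hypothetical sample with two obvious extreme values
values = [11, 12, 12, 13, 14, 14, 15, 16, 45, 50]

fig, ax = plt.subplots()
parts = ax.boxplot(values)     # default whiskers reach 1.5 * IQR beyond the box
ax.set_ylabel("value")
ax.set_title("Box plot of the sample")

# Points drawn beyond the whiskers are the candidate outliers
fliers = [float(v) for v in parts["fliers"][0].get_ydata()]
print(sorted(fliers))

# With %matplotlib inline the figure renders automatically;
# in a plain script, call plt.show() to display it.
```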

In [4]:
dataset=[11,10,12,14,12,15,14,13,15,102,12,14,17,19,107,10,13,12,14,12,108,12,11,14,13,15,10,15,12,10]

In [5]:
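# Using Z Score

def detect_outliers(data):
    """Return the values in data whose absolute z-score exceeds the threshold."""
    outliers=[]  # created inside the function so repeated calls do not accumulate results
    threshold=3
    mean=np.mean(data)
    std=np.std(data)  # population standard deviation (ddof=0)

    for i in data:
        z_score=(i-mean)/std
        if np.abs(z_score)>threshold:
            outliers.append(i)
    return outliers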

In [6]:
outlier_pt=detect_outliers(dataset)

In [7]:
outlier_pt

[107, 108]
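
The value 102 looks almost as extreme as 107 and 108, yet it is not returned. The check sketched below computes the z-score of each of the three largest values. Because the extreme values themselves inflate the mean and the standard deviation, 102 ends up just under the threshold of 3. The snippet repeats the dataset so that it runs on its own; `z_scores` is a name introduced here for illustration.

```python
import numpy as np

dataset = [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107,10,13,12,14,12,
           108,12,11,14,13,15,10,15,12,10]

mean = np.mean(dataset)
std = np.std(dataset)   # population standard deviation, as in detect_outliers

# z-score of each of the three largest values
z_scores = {value: float((value - mean) / std) for value in (102, 107, 108)}

for value, z in z_scores.items():
    print(value, round(z, 2))
# 102 2.86
# 107 3.04
# 108 3.07
```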

In [9]:
# Using IQR
# Sorted here for inspection only; np.percentile does not require sorted input

sorted(dataset)

[10,
 10,
 10,
 10,
 11,
 11,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 13,
 13,
 13,
 14,
 14,
 14,
 14,
 14,
 15,
 15,
 15,
 15,
 17,
 19,
 102,
 107,
 108]

In [10]:
q1,q3=np.percentile(dataset,[25,75])

In [11]:
print(q1,q3)

12.0 15.0


In [12]:
iqr_value=q3-q1
iqr_value

3.0

In [15]:
upperbound=q3+(1.5*iqr_value)
lowerbound=q1-(1.5*iqr_value)
print(upperbound, lowerbound)

19.5 7.5


In [17]:
outlier_new=[]
for i in dataset:
    # flag values that fall outside either bound
    if i<lowerbound or i>upperbound:
        outlier_new.append(i)
outlier_new

[102, 107, 108]
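
For reuse, the IQR steps can be wrapped into a single function. The sketch below is one possible way to do it with a NumPy boolean mask. The function name `iqr_outliers` and its `k` parameter are hypothetical names introduced here, with the default `k=1.5` matching the multiplier used in this notebook.

```python
import numpy as np

def iqr_outliers(data, k=1.5):
    """Return values lying more than k * IQR outside the quartiles (hypothetical helper)."""
    values = np.asarray(data)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Boolean mask that is True wherever a value falls outside either bound
    mask = (values < lower) | (values > upper)
    return values[mask].tolist()

dataset = [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107,10,13,12,14,12,
           108,12,11,14,13,15,10,15,12,10]

print(iqr_outliers(dataset))   # [102, 107, 108]
```

Unlike the z-score check with a threshold of 3, this flags 102 as well, because quartiles are far less affected by the extreme values than the mean and standard deviation are.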