### What is an Outlier?

- An outlier is a data point in a dataset that differs markedly from the other observations, i.e. a point that lies outside the overall distribution of the dataset.

In [9]:
import numpy as np
import pandas as pd
%matplotlib inline

### What are the criteria to identify an outlier?

1. A data point that lies more than 1.5 times the interquartile range (IQR) above the 3rd quartile or below the 1st quartile.
2. A data point that lies more than 3 standard deviations from the mean. This can be checked with a z-score; a stricter cutoff of 2 standard deviations is sometimes used instead.

### What is the reason for an outlier to exist in a dataset?

1. Variability in the data
2. An experimental or measurement error

### What are the impacts of having outliers in a dataset?

1. They can distort statistical analyses
2. They can significantly shift the mean and standard deviation
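
To make the second point concrete, here is a minimal sketch that compares summary statistics with and without the three largest values. It reuses the sample values defined later in this notebook; the cutoff `data < 100` and the name `trimmed` are assumptions chosen only for this illustration, not a detection rule.

```python
import numpy as np

# Same sample values as the `dataset` list used later in this notebook
dataset = [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107,10,13,12,14,17,19,107,10,13,12,14,13,15,10,15,12,10]
data = np.array(dataset)

# Hypothetical cutoff for illustration only: drop 102, 107 and 107
trimmed = data[data < 100]

print("mean  :", data.mean(), "->", trimmed.mean())
print("median:", np.median(data), "->", np.median(trimmed))
print("std   :", data.std(), "->", trimmed.std())
```

On this data the mean falls from about 22 to about 13 and the standard deviation from about 27 to about 2.5, while the median barely moves (13.5 to 13.0).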

### What are the various ways of finding an outlier?

1. Scatter Plot
2. Box Plot
3. Using Z Score
4. Using the IQR (interquartile range)
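
The first two options are visual. As a rough sketch, assuming matplotlib is installed, the cell below draws a scatter plot and a box plot of the sample values used later in this notebook. The figure layout and the name `flier_values` are choices made for this example only.

```python
import matplotlib.pyplot as plt

# Same sample values as the `dataset` list used later in this notebook
dataset = [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107,10,13,12,14,17,19,107,10,13,12,14,13,15,10,15,12,10]

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: position in the list against value; outliers sit far from the main band
ax_scatter.scatter(range(len(dataset)), dataset)
ax_scatter.set_xlabel("index")
ax_scatter.set_ylabel("value")

# Box plot: whis=1.5 (matplotlib's default) ends the whiskers at the last
# data point within 1.5 * IQR of the box; points beyond are drawn as "fliers"
result = ax_box.boxplot(dataset, whis=1.5)
ax_box.set_ylabel("value")

# Read back the values that matplotlib drew as fliers
flier_values = sorted(float(v) for v in result["fliers"][0].get_ydata())
print(flier_values)
```

In both plots the values above 100 stand well apart from the rest of the data.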

In [10]:
dataset = [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107,10,13,12,14,17,19,107,10,13,12,14,13,15,10,15,12,10]

# Detecting Outliers Using Z-Score

### Using Z-Score

- Formula for the Z-score: z = (x - μ) / σ, i.e. (Observation - Mean) / Standard Deviation

In [11]:
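def detect_outliers(data):
    # Build the result list inside the function so repeated calls do not accumulate
    outliers = []
    threshold = 3
    mean = np.mean(data)
    std = np.std(data)  # population standard deviation (ddof=0)

    for i in data:
        z_score = (i - mean) / std
        # Flag points more than `threshold` standard deviations from the mean
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers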

In [12]:
outlier_pt = detect_outliers(dataset)

In [13]:
outlier_pt

[107, 107]
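
The same check can be written without an explicit loop. The sketch below is one possible vectorized variant; the function name `zscore_outliers` and its `threshold` parameter are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np

def zscore_outliers(data, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    values = np.asarray(data, dtype=float)
    # Population standard deviation (ddof=0), matching np.std's default
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold]

dataset = [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107,10,13,12,14,17,19,107,10,13,12,14,13,15,10,15,12,10]

strict = zscore_outliers(dataset, threshold=3)  # flags both 107s
loose = zscore_outliers(dataset, threshold=2)   # additionally flags 102
print(strict, loose)
```

Note that 102 has a z-score of roughly 2.97 on this data, so it slips just under a threshold of 3. The outliers themselves inflate the mean and standard deviation, which is one reason the z-score method can miss points in small samples.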

# INTERQUARTILE RANGE

The IQR is the difference between the 75th and 25th percentile values of a dataset.

### Steps
- Arrange the data in increasing order
- Calculate the first (q1) and third (q3) quartiles
- Find the interquartile range (q3 - q1)
- Find the lower bound: q1 - 1.5 * IQR
- Find the upper bound: q3 + 1.5 * IQR

Anything that lies outside the lower or upper bound is considered an outlier.
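
The steps above can be condensed into one helper. This is a minimal sketch, assuming NumPy's default (linear interpolation) percentile method; the function name `iqr_bounds` and the multiplier argument `k` are made up for this example.

```python
import numpy as np

def iqr_bounds(data, k=1.5):
    """Return (lower_bound, upper_bound) using the k * IQR rule."""
    # np.percentile sorts internally, so no explicit sort is needed
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

dataset = [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107,10,13,12,14,17,19,107,10,13,12,14,13,15,10,15,12,10]

lower, upper = iqr_bounds(dataset)
print(lower, upper)
```

For this dataset the quartiles are 12.0 and 15.0, so the bounds come out as 7.5 and 19.5.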

### Perform all the steps of IQR

In [17]:
sorted(dataset)

[10,
 10,
 10,
 10,
 10,
 11,
 12,
 12,
 12,
 12,
 12,
 12,
 13,
 13,
 13,
 13,
 14,
 14,
 14,
 14,
 14,
 15,
 15,
 15,
 15,
 17,
 17,
 19,
 19,
 102,
 107,
 107]

In [19]:
quantile1, quantile3 = np.percentile(dataset,[25,75])

In [20]:
print(quantile1, quantile3)

12.0 15.0


#Find the IQR

In [23]:
iqr_value = quantile3-quantile1
print(iqr_value)

3.0


In [25]:
# Find the lower and upper bounds
lower_bound = quantile1 - (1.5 * iqr_value)
upper_bound = quantile3 + (1.5 * iqr_value)

In [26]:
print(lower_bound, upper_bound)

7.5 19.5


# Any value below lower_bound or above upper_bound is considered an outlier
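
As a closing sketch, the rule can be applied with pandas to separate the flagged values from the rest. The variable names (`values`, `inside`, `cleaned`) are illustrative, and the cell recomputes the bounds itself so it can run on its own.

```python
import pandas as pd

dataset = [11,10,12,14,12,15,14,13,15,102,12,14,17,19,107,10,13,12,14,17,19,107,10,13,12,14,13,15,10,15,12,10]
values = pd.Series(dataset)

# Quartiles and IQR (pandas uses linear interpolation by default, like NumPy)
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# True for values inside [lower, upper]; everything else is flagged
inside = values.between(lower, upper)
outliers = values[~inside]
cleaned = values[inside]

print(outliers.tolist())
print(len(cleaned), "values remain")
```

Unlike the z-score check with a threshold of 3, the IQR rule also flags 102 on this data. Whether flagged points should actually be dropped depends on why they occur (genuine variability or measurement error).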