<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/5_Outliers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates a method for identifying outliers.

There are three notebooks in the Outliers section:<br>
1. This notebook
2. [InterquartileRange](https://colab.research.google.com/github/cagBRT/Data/blob/main/6_InterquartileRange.ipynb)
3. [AutomaticOutlierDetection](https://colab.research.google.com/github/cagBRT/Data/blob/main/7_AutomaticOutlierDetection.ipynb)

In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/Data.git cloned-repo
%cd cloned-repo

In [None]:
!pip install outlier-plotting

# **Outliers are** ...<br>
“Observation which deviates so much from other observations as to arouse suspicion it was generated by a different mechanism” (Hawkins, 1980)

**Why we care about outliers**

In [None]:
from IPython.display import Image
Image("outlier.png" , width=640)

This notebook demonstrates methods to identify outliers

# **What is an outlier?** <br>

A dataset can contain values that are outside the range of what is expected and unlike the other data. <br>
These are called outliers.<br><br>
Identifying data outliers is an important skill for machine learning.

Outliers can be caused by:<br>
- measurement error
- data input error
- corruption of data
- true outlier observation

What constitutes an outlier depends on the data. <br>
When working with data, a subject matter expert (SME) should be consulted.<br>
The SME will interpret the data and decide whether a value is a true outlier.

In [None]:
import matplotlib.pyplot as plt
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std

**Create a Synthetic Dataset**

In [None]:
# seed the random number generator
# a fixed seed makes every run generate the same dataset
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# summarize
print('mean=%.3f stdv=%.3f' % (mean(data), std(data)))

In [None]:
# build one y value (the observation index) per data point for plotting
y = list(range(len(data)))
data.shape

It is worthwhile to plot the identified outlier values in the context of the non-outlier values to see whether there is any systematic relationship or pattern to the outliers. If there is, perhaps they are not outliers and can be explained, or perhaps the outliers themselves can be identified more systematically.

In [None]:
plt.scatter(data,y )

# **Standard Deviation Method**

In [None]:
Image("standardDev2.png" , width=640)

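For normally distributed (Gaussian) data:<br>
Within 1 standard deviation of the mean: about 68% of the values<br>
Within 2 standard deviations of the mean: about 95% of the values<br>
Within 3 standard deviations of the mean: about 99.7% of the values<br>

Values that are more than 3 standard deviations from the mean are rare or unlikely, but they are still part of the distribution.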
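
The percentages above can be checked empirically. The following is a minimal sketch, not part of the original notebook: it assumes a Gaussian sample with mean 50 and standard deviation 5 (the same parameters as the synthetic dataset) and uses only Python's standard library so it runs on its own. The helper name `fraction_within` is hypothetical.

```python
import random
import statistics

random.seed(1)
# assumed sample: 10,000 Gaussian values with mean 50 and standard deviation 5
sample = [random.gauss(50, 5) for _ in range(10000)]

mu = statistics.mean(sample)
sigma = statistics.pstdev(sample)  # population standard deviation


def fraction_within(values, center, spread, k):
    """Return the fraction of values within k * spread of center."""
    inside = sum(1 for v in values if abs(v - center) <= k * spread)
    return inside / len(values)


fractions = {k: fraction_within(sample, mu, sigma, k) for k in (1, 2, 3)}
for k, frac in fractions.items():
    print(f"within {k} standard deviation(s): {frac:.1%}")
```

The printed fractions should land close to 68%, 95%, and 99.7%, with small differences caused by sampling.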

**Using the Z-score**<br>
The Z-score tells you how many standard deviations a value is from the mean: z = (value - mean) / standard deviation.<br>
A Z-score of 0 means the value is equal to the mean.<br>
A Z-score of 1 or -1 means the value is 1 standard deviation above or below the mean.<br>
A Z-score of 2 or -2 means the value is 2 standard deviations above or below the mean.<br>
A value with a Z-score above 3 or below -3 lies more than 3 standard deviations from the mean.<br>
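
As a small worked example, the Z-score of a value is (value - mean) / standard deviation. The sketch below is an illustration only: the list `values` is hypothetical data chosen so that the mean is 5 and the standard deviation is 2, and it uses the standard library's `statistics.pstdev`, which computes the population standard deviation (the same default as numpy's `std`).

```python
import statistics

# hypothetical sample, chosen so the mean is 5 and the standard deviation is 2
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = statistics.mean(values)
sigma = statistics.pstdev(values)  # population standard deviation

# Z-score: how many standard deviations each value lies from the mean
z_scores = [(v - mu) / sigma for v in values]

for v, z in zip(values, z_scores):
    print(f"value={v}  z={z:+.1f}")
```

Here the value 5 has a Z-score of 0 because it equals the mean, and the value 9 has a Z-score of 2 because it lies two standard deviations above the mean.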

**Calculate the mean and standard deviation**

In [None]:
# calculate summary statistics
data_mean, data_std = mean(data), std(data)
print(data_mean)
print(data_std)

**Define outliers**<br>
This definition will depend upon the data

In [None]:
# define outliers:
# any value more than 3 standard deviations from the mean
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off
print("Cutoff:", cut_off)
print("lower:", lower, "upper:", upper)

Plot the data with the outlier cutoff lines shown

In [None]:
plt.scatter(data,y )
plt.axvline(x=lower, c='red')
plt.axvline(x=upper, c='red')

Create a list of the outliers

In [None]:
#outliers = [x for x in data if x < lower or x > upper]
outliers = []
for x in data:
  if x<lower or x> upper:
    outliers.append(x)

In [None]:
print(outliers)
print("There are %d outliers" % len(outliers))

Once you have identified the outliers, you can start thinking about what to do with them.
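
One possible option, assuming an SME has confirmed that the flagged values are errors rather than true observations, is to drop them and keep the rest. The sketch below is only an illustration of that choice, not the notebook's prescribed method: the data is hypothetical (Gaussian values plus two injected extreme values), and it uses the same 3 standard deviation cutoff as above.

```python
import random
import statistics

random.seed(1)
# hypothetical data: 1000 Gaussian values plus two injected extreme values
values = [random.gauss(50, 5) for _ in range(1000)] + [120.0, -30.0]

mu = statistics.mean(values)
sigma = statistics.pstdev(values)  # population standard deviation
lower, upper = mu - 3 * sigma, mu + 3 * sigma

# split the data into flagged outliers and the values that are kept
outliers = [v for v in values if v < lower or v > upper]
kept = [v for v in values if lower <= v <= upper]

print("flagged:", len(outliers), "kept:", len(kept))
```

Whether dropping is appropriate depends on the data. If the flagged values turn out to be true observations, they may need to stay in the dataset.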

**Assignment**<br>
Find the outliers in the file Outliers.csv
1. Sort the data to identify the outliers
2. Plot the data to identify the outliers
3. Use the Z-Score method to identify the outliers

In [None]:
#Assignment

In [None]:
#@title
!pwd

In [None]:
#@title
import csv
with open('Outliers.csv') as csvfile:
  file_reader = csv.reader(csvfile, delimiter=',', quotechar='|')
  height_list = list(file_reader)
#remove the header row
height_list.pop(0)
height_list

In [None]:
#@title
# x holds the height strings, y holds the row index of each height
x = [row[0] for row in height_list]
y = list(range(len(height_list)))

In [None]:
#@title
import numpy as np
#convert strings to floats
x = np.array(x).astype(float)
fig = plt.figure(figsize =(3,5))
# Creating plot
plt.boxplot(x)
# show plot
plt.show()

In [None]:
#@title
fig = plt.figure(figsize =(3,3))
# Creating plot
plt.scatter(x,y)
plt.xlabel("height")
plt.ylabel("dataID")
# show plot
plt.show()

In [None]:
#@title
data_mean, data_std = mean(x), std(x)
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off
print("lower: ",lower,"upper:", upper)
outliers = []
for i in range(len(height_list)):
  #print(height_list[i][0])
  if float(height_list[i][0])<= lower or float(height_list[i][0]) >= upper:
    outliers.append(height_list[i][0])
outliers

In [None]:
#@title
plt.scatter(x,y)
# axvline takes x, not y; uncomment to also draw the lower cutoff
#plt.axvline(x=lower, c='red')
plt.axvline(x=upper, c='red')