<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/6_InterquartileRange.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates how to use the interquartile range (IQR) and boxplots to find outliers.

There are three notebooks in the Outliers section:<br>
1. This notebook
2. [InterquartileRange](https://colab.research.google.com/github/cagBRT/Data/blob/main/6_InterquartileRange.ipynb)
3. [AutomaticOutlierDetection](https://colab.research.google.com/github/cagBRT/Data/blob/main/7_AutomaticOutlierDetection.ipynb)

In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/Data.git cloned-repo
%cd cloned-repo

Not all data is normal, or close enough to normal, to be treated as drawn from a Gaussian distribution. <br>
A good statistic for summarizing a sample with a non-Gaussian distribution is the interquartile range, or<br>
 IQR for short.

The IQR is the difference between the 75th and 25th percentiles, that is, between the upper and lower quartiles:<br>

 >IQR = Q3 − Q1

The IQR is visible on a box plot of the data as the length of the box, which spans Q1 to Q3.
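
As a small worked example, the sketch below computes Q1, Q3 and the IQR for nine made-up values. The `sample` list is illustrative only and is not data from this notebook; `numpy.percentile` is used with its default settings.

```python
from numpy import percentile

# made-up sample, chosen so the quartiles fall exactly on data points
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Q1 and Q3 are the 25th and 75th percentiles
q1, q3 = percentile(sample, [25, 75])
iqr_value = q3 - q1
print('Q1=%.1f, Q3=%.1f, IQR=%.1f' % (q1, q3, iqr_value))
# prints: Q1=3.0, Q3=7.0, IQR=4.0
```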

The interquartile range is often used to find outliers in data. <br>
Outliers here are defined as observations that:<br>
 >fall below Q1 − 1.5 × IQR, or <br>
 fall above Q3 + 1.5 × IQR. <br>

In a boxplot, the lowest and highest values that lie within these limits are marked by the whiskers of the box (frequently with an additional bar at the end of each whisker), and any outliers are drawn as individual points.
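
The rule above can be sketched as a small helper. The function name `iqr_fences`, its parameter `k`, and the `sample` values are hypothetical names and data introduced for illustration; they do not come from this notebook.

```python
from numpy import percentile

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) outlier limits Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = percentile(values, [25, 75])
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

# made-up measurements with one obviously extreme value
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
low, high = iqr_fences(sample)

# keep only the observations that fall outside the limits
flagged = [v for v in sample if v < low or v > high]
print('limits: %.3f to %.3f' % (low, high))
print('flagged:', flagged)
# flagged: [102]
```

The multiplier is kept as a parameter so that a stricter or looser limit can be tried without rewriting the rule; 1.5 is the value used throughout this notebook.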

In [None]:
from IPython.display import Image
Image("iqr.png" , width=400)

In [None]:
# identify outliers with interquartile range
from numpy.random import seed
from numpy.random import randn
from numpy import percentile

In [None]:
# seed the random number generator
seed(1)
# generate univariate observations (Gaussian, mean 50, standard deviation 5)
data = 5 * randn(10000) + 50
data

In [None]:
data.shape

In [None]:
# calculate interquartile range
q25, q75 = percentile(data, 25), percentile(data, 75)
iqr = q75 - q25
print('Percentiles: 25th=%.3f, 75th=%.3f, IQR=%.3f' % (q25, q75, iqr))
# calculate the outlier cutoff
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off
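
Because the sample in this notebook is drawn from a Gaussian, the 1.5 × IQR rule is expected to flag a small share of perfectly ordinary points. The sketch below works that share out for an ideal normal distribution using the standard library's `statistics.NormalDist`. It is a theoretical figure, so the count found in a random sample will differ somewhat.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Q3 of a standard normal; by symmetry Q1 = -z75, so IQR = 2 * z75
z75 = std_normal.inv_cdf(0.75)
upper_fence = z75 + 1.5 * (2 * z75)

# probability of landing beyond either fence (both tails)
expected_fraction = 2 * (1 - std_normal.cdf(upper_fence))
print('fence at %.3f standard deviations' % upper_fence)
print('expected outlier fraction: %.4f' % expected_fraction)
```

The fraction comes out at roughly 0.7%, so for 10,000 Gaussian observations about 70 flagged points are expected even when nothing is wrong with the data.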

In [None]:
import matplotlib.pyplot as plt
# use each observation's index as its y coordinate
y = list(range(len(data)))

In [None]:
plt.scatter(data,y)
plt.axvline(lower, c='red')
plt.axvline(upper, c='red')
plt.show()

In [None]:
# identify outliers
outliers = [x for x in data if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))
# remove outliers
outliers_removed = [x for x in data if lower <= x <= upper]
print('Non-outlier observations: %d' % len(outliers_removed))
# use each remaining observation's index as its y coordinate
yy = list(range(len(outliers_removed)))
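
The same split can also be written with a NumPy boolean mask instead of list comprehensions. This is an alternative sketch, not the notebook's own method. It regenerates a sample with the same recipe so that it runs on its own, and the names `values`, `is_outlier`, `flagged` and `kept` are introduced here for illustration.

```python
from numpy import percentile
from numpy.random import seed, randn

# regenerate the same kind of sample so this cell runs on its own
seed(1)
values = 5 * randn(10000) + 50

q1, q3 = percentile(values, [25, 75])
margin = 1.5 * (q3 - q1)
low, high = q1 - margin, q3 + margin

# boolean mask: True where an observation lies outside the limits
is_outlier = (values < low) | (values > high)
flagged = values[is_outlier]
kept = values[~is_outlier]
print('Identified outliers: %d' % flagged.size)
print('Non-outlier observations: %d' % kept.size)
```

The mask keeps the result as a NumPy array, which is convenient if the cleaned data is passed on to further numerical steps.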

In [None]:
plt.scatter(outliers_removed,yy)
plt.axvline(lower, c='red')
plt.axvline(upper, c='red')
plt.show()

In [None]:
# y is not passed: the second positional argument of boxplot is notch, not a y coordinate
plt.boxplot(data)
plt.show()

In [None]:
# same plot after removing the outliers found with the 1.5 × IQR rule
plt.boxplot(outliers_removed)
plt.show()
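
To read the numbers behind a boxplot rather than judging them by eye, matplotlib's `cbook.boxplot_stats` helper can be called directly. The sketch below assumes the default whisker multiplier of 1.5 and regenerates a sample with the same recipe as above; the name `values` is introduced here for illustration.

```python
from numpy.random import seed, randn
from matplotlib.cbook import boxplot_stats

# regenerate the same kind of sample so this cell runs on its own
seed(1)
values = 5 * randn(10000) + 50

# boxplot_stats returns one dict per dataset; whis=1.5 sets the whisker reach
stats = boxplot_stats(values, whis=1.5)[0]
print('Q1=%.3f, Q3=%.3f, IQR=%.3f' % (stats['q1'], stats['q3'], stats['iqr']))
print('whiskers end at %.3f and %.3f' % (stats['whislo'], stats['whishi']))
print('points drawn individually: %d' % len(stats['fliers']))
```

The `fliers` entry holds the observations that lie beyond the whiskers, which are the points a boxplot draws individually.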

Assignment <br>
Use the IQR and a boxplot to find any outliers in:<br>
Outliers.csv

In [None]:
#@title
#!cat /content/cloned-repo/Outliers.csv
import csv

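# read every row of the CSV file into a list of lists
with open('/content/cloned-repo/Outliers.csv') as csvfile:
  file_reader = csv.reader(csvfile, delimiter=',', quotechar='|')
  qoutlist = list(file_reader)

# skip the first row (assumed to be a header) and convert the first column to floats
x = [float(row[0]) for row in qoutlist[1:]]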

In [None]:
#@title
# boxplot of the first column; yout is dropped because the second positional
# argument of boxplot is notch, not a y coordinate
plt.boxplot(x)
plt.show()