*The code was developed on Windows 10 and has not been tested on other platforms. Python was provided by the Anaconda 5.2.0 distribution.*

# Outlier Detection and Correction

This notebook introduces the concepts and methods for detecting and correcting outliers in a data set.

<br>
<div class="alert alert-info">
<h4>Disclaimer</h4><p>The information on this page is based on the petroleum engineering class <b>Introduction to Geostatistics</b>, taught by <b>Dr. Michael Pyrcz</b> at the <i>University of Texas at Austin</i> in Fall 2018. This notebook is the work of a petroleum engineering student, <b>Eric Kim</b>.</p>
</div>

# 0. Sample Data Set

The provided spreadsheet **PoroPermSampleData.xlsx** includes sample permeability data that will be used throughout this notebook.

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

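# Load the sample data and pull the two columns used below into NumPy arrays
data = pd.read_excel('PoroPermSampleData.xlsx', sheet_name='Sheet1')
permeability = np.array(data['Permeability (mD)'])
depth = np.array(data['Depth'])

# Preview the Depth and Permeability columns (column positions 0 and 2)
data.iloc[:, [0, 2]].head()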

| | Depth | Permeability (mD) |
|---|---|---|
| 0 | 0.25 | 265.528738 |
| 1 | 0.5 | 116.89122 |
| 2 | 0.75 | 136.920016 |
| 3 | 1.0 | 216.668629 |
| 4 | 1.25 | 131.594114 |


# 1. Purpose - What Do We Want To Achieve?

> **1. Detect outliers by defining upper and lower fences**

> **2. Perform actions on outliers - remove, transform, or separate**

# 2. Detect Outliers

Two methods can be used to detect outliers:

> **Standard Deviation Method** - if the data are approximately Gaussian (normally distributed)

> **Interquartile Range (IQR) Method** - if the data are not close enough to Gaussian

## 2.1 Standard Deviation Method
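
As a minimal sketch of the standard deviation method, the fences can be placed a fixed number of standard deviations on either side of the mean. The multiplier `k = 3`, the helper name `std_fences`, and the synthetic sample are illustrative assumptions, not values taken from the course material or the spreadsheet.

```python
import numpy as np


def std_fences(values, k=3.0):
    """Return (lower, upper) fences at mean -/+ k sample standard deviations.

    k = 3 is an assumed, commonly used multiplier.
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std(ddof=1)  # ddof=1 gives the sample standard deviation
    return mean - k * std, mean + k * std


# Synthetic, roughly Gaussian sample with one injected extreme value
# (hypothetical data, not the permeability measurements).
rng = np.random.RandomState(0)
sample = np.append(rng.normal(loc=150.0, scale=20.0, size=200), 400.0)

lower, upper = std_fences(sample)
outliers = sample[(sample < lower) | (sample > upper)]
print('fences:', lower, upper)
print('outliers:', outliers)
```

Because the mean and standard deviation are themselves pulled by extreme values, this approach is only reasonable when the bulk of the data is close to Gaussian.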
## 2.2 Interquartile Range (IQR) Method

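Not all data are Gaussian, or close enough to Gaussian, for the *Standard Deviation Method* to apply. In that case, the *Interquartile Range (IQR) Method* can be used to detect outliers. It places one fence 1.5 IQR below the 25th percentile (P25) and another 1.5 IQR above the 75th percentile (P75); values outside the fences are flagged as outliers.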

<p><center>IQR = P75 - P25</center></p>

<p><center>Lower Fence = P25 - 1.5 $\times$ IQR</center></p>

<p><center>Upper Fence = P75 + 1.5 $\times$ IQR</center></p>

First, P75 and P25 (the 75th and 25th percentiles of the data) must be calculated. This is done with **numpy.percentile**; see the [NumPy documentation](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.percentile.html).

In [24]:
P75 = np.percentile(permeability, 75)
P25 = np.percentile(permeability, 25)
IQR = P75 - P25

pd.DataFrame([[P75, P25, IQR]], columns=['P75', 'P25', 'IQR'], index=['Permeability (mD)']).round(1)

| | P75 | P25 | IQR |
|---|---|---|---|
| Permeability (mD) | 206.6 | 104.0 | 102.6 |
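
With the quartiles known, the fences follow directly from the formulas above. The sketch below plugs in the rounded P25 and P75 reported in the table, so it runs without the spreadsheet. The helper name `iqr_fences` is an illustrative assumption, and the last entry of `readings` (520.0) is a made-up value added only to show a flagged point; the other five are the rounded permeabilities from the data preview.

```python
import numpy as np


def iqr_fences(p25, p75, k=1.5):
    """Return (lower, upper) fences: P25 - k*IQR and P75 + k*IQR."""
    iqr = p75 - p25
    return p25 - k * iqr, p75 + k * iqr


# Rounded quartiles from the table above
lower_fence, upper_fence = iqr_fences(p25=104.0, p75=206.6)
print(round(lower_fence, 1), round(upper_fence, 1))  # -49.9 360.5

# First five permeabilities from the preview plus one hypothetical extreme value
readings = np.array([265.5, 116.9, 136.9, 216.7, 131.6, 520.0])
is_outlier = (readings < lower_fence) | (readings > upper_fence)
outliers = readings[is_outlier]
print(outliers)
```

The lower fence is negative here, so no physically meaningful permeability can fall below it. With these quartiles, only unusually high values can be flagged.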


The same values can also be obtained graphically.
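
One common graphical view of the quartiles and fences is a box plot; whether this is the plot the notebook intended is an assumption. The sketch below uses matplotlib's `boxplot` with `whis=1.5`, which applies the same 1.5 × IQR rule, on a small synthetic sample (hypothetical values, not the spreadsheet data).

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic sample with one obvious extreme value (hypothetical data)
values = np.array([100, 110, 120, 130, 140, 150, 160, 170, 180, 900], dtype=float)

fig, ax = plt.subplots()
# whis=1.5 draws whiskers out to the last data point inside P25 - 1.5*IQR
# and P75 + 1.5*IQR; points beyond those fences are drawn as "fliers".
box = ax.boxplot(values, whis=1.5)
ax.set_ylabel('Value (synthetic)')
ax.set_title('Box plot with 1.5 x IQR whiskers')

# The plotted fliers are the points the IQR rule flags as outliers
fliers = box['fliers'][0].get_ydata()
print(fliers)
```

The box spans P25 to P75. Note that the whiskers stop at the most extreme data points inside the fences, not at the fence values themselves.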

In [None]:
# 