# Exploratory Data Analysis (EDA)

# Notebook Content

# Outlier Treatment
1. Z Score
2. IQR Score

# Outlier Treatment
An outlier is an **observation point that is distant from other observations**. Box plots and scatter plots are two visualization tools that are usually effective for detecting outliers.<br>
Two methods of outlier treatment are explained in detail below: <br>
1. Z score
2. IQR

### 1. Z Score
The __Z-score__ is the signed number of standard deviations by which the value of an observation or data point lies above or below the mean of what is being observed or measured.<br>
To calculate the Z-score we centre and re-scale the data, then look for data points that are too far from zero. These points are treated as outliers. In most cases a __threshold of 3__ is used, i.e. a data point is identified as an outlier if its Z-score is greater than 3 or less than -3.
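The definition above can be sketched by hand on a small sample. This is only an illustration: the array `values` is a made-up toy sample, not part of the notebook's data, and the population standard deviation (NumPy's default, `ddof=0`) is assumed, which is also the default used by `scipy.stats.zscore`.

```python
import numpy as np

# Hypothetical toy sample: 19 values at 10 plus one extreme value.
values = np.array([10.0] * 19 + [100.0])

# Z-score: centre on the mean, then divide by the (population) standard deviation.
z_scores = (values - values.mean()) / values.std()

# Flag points whose Z-score is greater than 3 or less than -3.
threshold = 3
outlier_mask = np.abs(z_scores) > threshold
outliers = values[outlier_mask]
print(outliers)
```

Only the extreme value exceeds the threshold here; the remaining points sit well within one standard deviation of the mean.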

There are no data points with |z| > 3 in our data. Hence we will load the Boston housing data set to show how outlier removal can be done.

In [43]:
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston = load_boston()
x = boston.data
y = boston.target
columns = boston.feature_names
# Create the dataframe
boston_df = pd.DataFrame(boston.data, columns=columns)
boston_df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


In [44]:
#Finding the z score
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(boston_df))


In [45]:
# Finding the outliers
threshold = 3
# print(np.where(z > threshold))
# To remove or filter the outliers and get the clean data:
boston_df_zs = boston_df[(z < threshold).all(axis=1)]
boston_df_zs.shape, boston_df.shape

((415, 13), (506, 13))

Hence we see that the rows containing outliers are removed: 415 of the original 506 rows remain.
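To make the row-wise filtering concrete, here is a minimal sketch on a made-up two-column dataframe (`toy_df` and its values are hypothetical, not the Boston data). It computes the Z-scores with pandas directly rather than `scipy.stats.zscore`, assuming the population standard deviation (`ddof=0`), and shows that `.all(axis=1)` keeps a row only when every column is within the threshold.

```python
import pandas as pd

# Hypothetical toy dataframe: each column has exactly one extreme value.
toy_df = pd.DataFrame({
    "a": [100.0] + [10.0] * 19,              # extreme value in row 0
    "b": [5.0] * 5 + [-50.0] + [5.0] * 14,   # extreme value in row 5
})

# Column-wise Z-scores (population standard deviation, ddof=0).
z = (toy_df - toy_df.mean()) / toy_df.std(ddof=0)

threshold = 3
# A row is kept only if all of its columns are within the threshold.
keep_mask = (z.abs() < threshold).all(axis=1)
toy_clean = toy_df[keep_mask]
flagged_rows = toy_df.index[~keep_mask].tolist()

print(toy_clean.shape, flagged_rows)
```

The number of columns is unchanged; only the rows that contain an outlier in at least one column are dropped.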

In [46]:
boston_df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33




### 2. IQR Score
The **interquartile range (IQR)**, also called the midspread, middle 50%, or technically H-spread, is a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles, i.e. between the upper and lower quartiles: **IQR = Q3 − Q1**<br>

In other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot of the data.<br>

It is a measure of dispersion like the standard deviation or variance, but is much more robust against outliers.<br>
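As a small worked example of the definition above, the sketch below computes Q1, Q3 and the IQR for a made-up sample (`values` is hypothetical) and flags points outside the commonly used fences Q1 − 1.5 · IQR and Q3 + 1.5 · IQR. The 1.5 multiplier is a convention, and NumPy's default linear interpolation is assumed for the percentiles.

```python
import numpy as np

# Hypothetical toy sample with one extreme value.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# First and third quartiles (25th and 75th percentiles).
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything outside [lower, upper] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(q1, q3, iqr, outliers)
```

Because the quartiles depend only on the middle of the sample, the extreme value barely affects the fences, which is what makes the IQR robust.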

In [47]:
# Function for removing outliers based on the IQR
# It should only be used for continuous variables
def remove_outlier(df):
    # Calculating the IQR of each column of the dataframe passed in
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    # Removing rows with a value outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] in any column
    df1 = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
    print(IQR)
    return df1

remove_outlier(boston_df).head()

CRIM         3.595038
ZN          12.500000
INDUS       12.910000
CHAS         0.000000
NOX          0.175000
RM           0.738000
AGE         49.050000
DIS          3.088250
RAD         20.000000
TAX        387.000000
PTRATIO      2.800000
B           20.847500
LSTAT       10.005000
dtype: float64


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


• This is the IQR for each column in the dataframe<br>
• The IQR scores help in detecting the outliers.

Just like with the Z-score, the outliers can be filtered out by keeping only the valid values.

