<h2><u>Statistical Foundations</u></h2>

# Module 2 – Exploratory Data Analysis
<h2>Demo 1: Detecting and Removing Outliers</h2>

This demo shows how to detect and remove outliers using two methods: the Z-score and the interquartile range (IQR) score.

In [20]:
#Import the required libraries
import pandas as pd
from sklearn import datasets
from scipy import stats
import numpy as np

In [21]:
#Load the California Housing dataset, which is included in the sklearn dataset API
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X = housing.data
y = housing.target
columns = housing.feature_names

In [22]:
#Create the dataframe (the variable keeps the name boston_df, but it holds the California Housing data)
boston_df = pd.DataFrame(housing.data, columns=columns)
boston_df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


### Using Z-Score

In [23]:
#Step1: Use Z-score function defined in scipy library to detect the outliers
boston_df_z = boston_df
z = np.abs(stats.zscore(boston_df))
print(z)

         MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  \
0      2.344766  0.982143  0.628559   0.153758    0.974429  0.049597   
1      2.332238  0.607019  0.327041   0.263336    0.861439  0.092512   
2      1.782699  1.856182  1.155620   0.049016    0.820777  0.025843   
3      0.932968  1.856182  0.156966   0.049833    0.766028  0.050329   
4      0.012881  1.856182  0.344711   0.032906    0.759847  0.085616   
...         ...       ...       ...        ...         ...       ...   
20635  1.216128  0.289187  0.155023   0.077354    0.512592  0.049110   
20636  0.691593  0.845393  0.276881   0.462365    0.944405  0.005021   
20637  1.142593  0.924851  0.090318   0.049414    0.369537  0.071735   
20638  1.054583  0.845393  0.040211   0.158778    0.604429  0.091225   
20639  0.780129  1.004309  0.070443   0.138403    0.033977  0.043682   

       Latitude  Longitude  
0      1.052548   1.327835  
1      1.043185   1.322844  
2      1.038503   1.332827  
3      1.038503   1

Looking at the code and the output above, it is difficult to say which data point is an outlier.
So let’s define a threshold to identify an outlier.
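
As background for the threshold, the z-score of a value is its distance from the column mean measured in standard deviations, z = (x - mean) / std. The following is a minimal sketch on made-up data (the array `values` is hypothetical and not taken from the housing dataset). It uses NumPy's population standard deviation, which should match the default `ddof=0` of `scipy.stats.zscore`.

```python
import numpy as np

# Hypothetical toy data: nineteen ordinary values and one extreme value
values = np.array([10.0] * 19 + [100.0])

# z-score: distance from the mean in units of the (population) standard deviation
z_toy = np.abs((values - values.mean()) / values.std())

# Indices whose absolute z-score exceeds the threshold
threshold = 3
flagged = np.where(z_toy > threshold)[0].tolist()
print(flagged)
```

The toy array has 20 entries on purpose: with the population standard deviation, the largest possible |z| in a sample of n values is sqrt(n - 1), so a threshold of 3 can never be exceeded in very small samples.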

In [24]:
#Step2: Define a threshold and locate the entries whose z-score exceeds it
threshold = 3
print(np.where(z > threshold))

(array([  131,   283,   409,   510,   511,   512,   514,   570,   576,
         710,   780,   799,   864,   865,   867,   869,   871,   922,
         923,   977,   985,   986,   995,  1010,  1021,  1024,  1024,
        1039,  1060,  1086,  1102,  1102,  1233,  1233,  1234,  1234,
        1235,  1235,  1238,  1238,  1239,  1239,  1240,  1240,  1541,
        1560,  1561,  1563,  1564,  1566,  1566,  1574,  1582,  1583,
        1586,  1591,  1593,  1602,  1617,  1621,  1636,  1637,  1642,
        1644,  1645,  1646,  1700,  1867,  1867,  1872,  1872,  1879,
        1889,  1889,  1910,  1910,  1911,  1911,  1912,  1912,  1913,
        1913,  1914,  1914,  1925,  1926,  1926,  1930,  1978,  1978,
        1979,  1979,  2025,  2119,  2213,  2294,  2311,  2392,  2392,
        2395,  2395,  2396,  2396,  2397,  2397,  2398,  2398,  2511,
        2511,  2776,  2826,  2963,  2969,  2975,  2978,  2999,  3004,
        3086,  3086,  3167,  3177,  3258,  3258,  3292,  3334,  3350,
        3350,  3364

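`np.where` returns two arrays: the first contains the row numbers and the second contains the respective column numbers, which means that <b>each (row, column) pair marks a z-score higher than 3</b>. A row number that appears twice, such as 1024, has two columns above the threshold. The printed output is truncated, so only part of the row array is visible. Row 55 does not appear in it (the row numbers are sorted and start at 131), so <b><i>z.iloc[55, 1]</i></b> is expected to be below 3; the next cell checks this.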
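
To make the structure of the `np.where` result concrete, here is a small sketch on a hypothetical 3x2 array of absolute z-scores (`z_toy` is made up for illustration). It pairs each row index with its column index to recover the position of every flagged entry.

```python
import numpy as np

# Hypothetical 3x2 table of absolute z-scores (rows = records, columns = features)
z_toy = np.array([
    [0.5, 3.2],
    [1.0, 0.1],
    [3.5, 4.0],
])

# np.where on a 2-D condition returns (row_indices, column_indices)
rows, cols = np.where(z_toy > 3)

# Pair them up to get the (row, column) position of each flagged entry
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)  # [(0, 1), (2, 0), (2, 1)]
```

Row 2 is listed twice because both of its columns exceed the threshold, which is the same pattern as the repeated row numbers in the housing output.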

In [25]:
#Step3: Print the z-score at row 55, column 1 (HouseAge)
#z is a pandas DataFrame here, so use iloc for integer-based indexing
print(z.iloc[55, 1])

1.8561815225324745


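So, the z-score is about 1.86, which is below the threshold of 3: the value in row 55 of column HouseAge (column index 1) is not an outlier. This is consistent with row 55 being absent from the row array printed above.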

In [26]:
#Step4: Remove the outliers using the z-score (keep only rows where every column is below the threshold)
boston_df_z = boston_df_z[(z < threshold).all(axis=1)]

print("The no. of rows before outlier filtering was: ", boston_df.shape)
print("The no. of rows after outlier filtering is: ", boston_df_z.shape)

The no. of rows before outlier filtering was:  (20640, 8)
The no. of rows after outlier filtering is:  (19794, 8)


Hence, we filtered out 846 rows (20640 - 19794), about 4.1% of the dataset: every row with at least one z-score of 3 or more has been removed.
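
The row filter can be sketched on a small example. The arrays `z_toy` and `data_toy` below are hypothetical; only the two row counts at the end are taken from the output above.

```python
import numpy as np

# Hypothetical absolute z-scores for four records and two features
z_toy = np.array([
    [0.5, 3.2],   # second feature exceeds the threshold
    [1.0, 0.1],
    [3.5, 4.0],   # both features exceed the threshold
    [2.9, 0.0],
])
data_toy = np.arange(8).reshape(4, 2)  # placeholder feature values

# Keep a row only if all of its z-scores are below the threshold
keep = (z_toy < 3).all(axis=1)
filtered = data_toy[keep]
print(filtered.shape)

# Row counts reported by the cell above
rows_before, rows_after = 20640, 19794
removed = rows_before - rows_after
print(removed, round(100 * removed / rows_before, 1))
```

Because `all(axis=1)` is used, a single flagged column is enough to drop the whole row, so the number of removed rows can be smaller than the number of flagged entries.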

### Using IQR Score

In [27]:
#Step1: Calculate the IQR
boston_df_iqr = boston_df
Q1 = boston_df_iqr.quantile(0.25)
Q3 = boston_df_iqr.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

MedInc          2.179850
HouseAge       19.000000
AveRooms        1.611665
AveBedrms       0.093447
Population    938.000000
AveOccup        0.852520
Latitude        3.780000
Longitude       3.790000
dtype: float64


In [28]:
#Step2: Detect the outliers using the IQR fences
#Build the boolean mask first, then print it
outliers = (boston_df_iqr < (Q1 - 1.5 * IQR)) | (boston_df_iqr > (Q3 + 1.5 * IQR))
print(outliers)

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0       False     False     False      False       False     False     False   
1       False     False     False      False       False     False     False   
2       False     False     False      False       False     False     False   
3       False     False     False      False       False     False     False   
4       False     False     False      False       False     False     False   
...       ...       ...       ...        ...         ...       ...       ...   
20635   False     False     False      False       False     False     False   
20636   False     False     False      False       False     False     False   
20637   False     False     False      False       False     False     False   
20638   False     False     False      False       False     False     False   
20639   False     False     False      False       False     False     False   

       Longitude  
0          False  
1

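Note: a `TypeError: Cannot perform 'ror_' with a dtyped [bool] array and scalar of type [NoneType]` at this step means that `print()` was wrapped around only the first comparison, as in `print(boston_df_iqr < (Q1 - 1.5 * IQR)) | (boston_df_iqr > (Q3 + 1.5 * IQR))`. `print()` returns `None`, so the `|` operator receives `None` as its left operand. Building the full mask first and printing it afterwards avoids the error.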

In this output, <b><i>False</i></b> means the value lies within the IQR fences and is treated as valid, whereas <b><i>True</i> indicates the presence of an outlier</b>.
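
The rule behind the True/False mask can be worked through on a single column. The sketch below uses a hypothetical array `values` and NumPy's default linear interpolation for the quartiles, which should agree with the default of pandas `quantile`.

```python
import numpy as np

# Hypothetical toy column with one extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Quartiles with NumPy's default linear interpolation
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: values outside [lower, upper] are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print(q1, q3, iqr, lower, upper)
print(values[is_outlier].tolist())
```

Here Q1 = 3.25 and Q3 = 7.75, so the IQR is 4.5 and the fences are -3.5 and 14.5; only the value 100 falls outside them.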

In [None]:
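#Step3: Remove the outliers using the IQR score
#Flag values outside the fences, then drop every row with at least one flagged value
outlier_mask = (boston_df_iqr < (Q1 - 1.5 * IQR)) | (boston_df_iqr > (Q3 + 1.5 * IQR))
boston_df_out = boston_df_iqr[~outlier_mask.any(axis=1)]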

print("The no. of rows before outlier filtering was: ", boston_df_iqr.shape)
print("The no. of rows after outlier filtering is: ", boston_df_out.shape)

Hence, every row that contains at least one IQR outlier has been removed.
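
Since the cell above has no recorded output, the same filter can be checked on a small frame. This is a sketch on hypothetical data (`toy_df` is made up and is not the housing data). It applies the fences per column and drops any row with at least one flagged value.

```python
import pandas as pd

# Hypothetical toy frame: column "a" has a high outlier, column "b" a low one
toy_df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, -50, 19],
})

q1 = toy_df.quantile(0.25)
q3 = toy_df.quantile(0.75)
iqr = q3 - q1

# True wherever a value falls outside its own column's fences
mask = (toy_df < (q1 - 1.5 * iqr)) | (toy_df > (q3 + 1.5 * iqr))

# Drop every row that has at least one flagged value
toy_clean = toy_df[~mask.any(axis=1)]
print(toy_df.shape, toy_clean.shape)
```

Rows 8 and 9 are dropped, each because of a different column, which shows that `any(axis=1)` removes a record as soon as one of its features is flagged.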