
In [None]:
#Data preprocessing techniques prepare the data on which we are going to perform inferential statistics.
#The goal of these techniques is to make the data compatible with the algorithms used for AI modelling.

#Basic Expectation from Inferential Stats
# a. Data Must be complete
# b. Data must be STRICTLY NUMERIC

#Preprocessing Task:
# 1. Check and Handle Missing Data
# 2. Check and Handle Categorical Data
# 3. Check and Handle Ordinal Data
# 4. Perform Data Standardization/Normalization
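The two expectations listed above (complete data, strictly numeric data) can be checked before any other step. The following is a minimal sketch on a small made-up frame; the frame `toy` and its values are hypothetical and are not taken from `melb_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not taken from melb_data.csv).
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Car": [1.0, np.nan, 2.0, np.nan],
    "Type": ["h", "u", None, "h"],
})

# a. Completeness: count the missing values in every column.
missing_counts = toy.isna().sum()

# b. Numeric-only: list the columns whose dtype is not numeric.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_counts[missing_counts > 0])
print(non_numeric_cols)
```

Columns that show up in `missing_counts` need imputation, and columns in `non_numeric_cols` need encoding before they can be used.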

In [None]:
#1. Check and Handle Missing Data
#
# The process of filling in missing values such that the NATURE of the data is preserved is called IMPUTATION.
#
# Guidelines to Handle Missing Data
# ======================================
#
# Missing data can be handled in 3 ways:
# 1. Statistical Way
# 2. Domain Way
# 3. Hybrid Way
#
# ================================================================================================================
# 1. Statistical way
# ================================================================================================================
#
#  a. For Numerical Columns:
#          a. Continuous ND: Replace Missing value(NaN) with the mean value of the column
#          b. Discrete ND:   Replace Missing value(NaN) with the median value of the column
#
#  b. For Non-numerical column:
#         Replace Missing value(NaN) with the Mode's first value of the column
# ===============================================================================================================
# 2. Domain way
# ===============================================================================================================
#
# Replace Missing value(NaN) with the default value of the column
# The default value can be obtained from a domain expert
#
# e.g. Real Estate in Mumbai, Maharashtra, India ----------------- MMRDA ------Guidelines and rules for Builder
#
# Car Parking
#   1. 2BHK flat -------------- 1 car parking
#   2. 3BHK flat -------------- 2 car parking
#   3. 4BHK or more ----------- 3 car parking
#
# If Parking is NaN, fill it with the default for that flat type
# ===============================================================================================================
# 3. Hybrid way
# ===============================================================================================================
#
# Some columns follow the domain approach whereas others follow the statistical approach
#
#================================================================================================================
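The domain approach can be sketched with the car parking rule quoted in the comments. This is only an illustration: the frame `flats`, the column names `BHK` and `Parking`, and the helper `default_parking` are hypothetical names introduced here, and the data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical listing data; "BHK" and "Parking" are illustrative column names.
flats = pd.DataFrame({
    "BHK": [2, 3, 4, 5, 2],
    "Parking": [1.0, np.nan, np.nan, np.nan, np.nan],
})

def default_parking(bhk):
    """Default parking count per flat type, following the rule in the comments."""
    if bhk >= 4:
        return 3
    if bhk == 3:
        return 2
    if bhk == 2:
        return 1
    return np.nan  # the quoted rule says nothing about smaller flats

# Fill only the missing entries; observed values are left untouched.
flats["Parking"] = flats["Parking"].fillna(flats["BHK"].map(default_parking))
print(flats)
```

A hybrid approach would apply a rule like this to the columns that have a domain default and fall back to the mean, median or mode for the rest.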


In [1]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv('melb_data.csv')

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

In [5]:
data.isna().sum()

Unnamed: 0,0
Suburb,0
Address,0
Rooms,0
Type,0
Price,0
Method,0
SellerG,0
Date,0
Distance,0
Postcode,0


In [6]:
data

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.79960,144.99840,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.80790,144.99340,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.80930,144.99440,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.79690,144.99690,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.80720,144.99410,Northern Metropolitan,4019.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13575,Wheelers Hill,12 Strada Cr,4,h,1245000.0,S,Barry,26/08/2017,16.7,3150.0,...,2.0,2.0,652.0,,1981.0,,-37.90562,145.16761,South-Eastern Metropolitan,7392.0
13576,Williamstown,77 Merrett Dr,3,h,1031000.0,SP,Williams,26/08/2017,6.8,3016.0,...,2.0,2.0,333.0,133.0,1995.0,,-37.85927,144.87904,Western Metropolitan,6380.0
13577,Williamstown,83 Power St,3,h,1170000.0,S,Raine,26/08/2017,6.8,3016.0,...,2.0,4.0,436.0,,1997.0,,-37.85274,144.88738,Western Metropolitan,6380.0
13578,Williamstown,96 Verdon St,4,h,2500000.0,PI,Sweeney,26/08/2017,6.8,3016.0,...,1.0,5.0,866.0,157.0,1920.0,,-37.85908,144.89299,Western Metropolitan,6380.0


In [None]:
# ================================================================================================================
# 1. Statistical way
# ================================================================================================================
#
#  a. For Numerical Columns:
#          a. Continuous ND: Replace Missing value(NaN) with the mean value of the column
#          b. Discrete ND:   Replace Missing value(NaN) with the median value of the column
#
#  b. For Non-numerical column:
#         Replace Missing value(NaN) with the Mode's first value of the column
#
# In the cells below, the three columns with missing values
# (Car, BuildingArea and YearBuilt) are all imputed with the column median
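The three statistical rules (mean, median, first mode) can be tried on a small frame before touching the real data. This is a minimal sketch; the frame `toy` and its column names are hypothetical and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "Area": [100.0, np.nan, 140.0, 120.0],            # continuous
    "Car": [1.0, 2.0, np.nan, 2.0],                   # discrete
    "Council": ["Yarra", None, "Yarra", "Bayside"],   # non-numerical
})

# Assign the result back to the column rather than relying on inplace=True.
toy["Area"] = toy["Area"].fillna(toy["Area"].mean())              # mean
toy["Car"] = toy["Car"].fillna(toy["Car"].median())               # median
toy["Council"] = toy["Council"].fillna(toy["Council"].mode()[0])  # first mode
print(toy)
```

`mode()` returns a Series because several values can tie for most frequent, which is why the first entry is taken explicitly.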

In [7]:
data['Car'].median()

2.0

In [8]:
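# Assign the result back; fillna(..., inplace=True) on a selected column may not update `data` in recent pandas
data['Car'] = data['Car'].fillna(data['Car'].median())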

In [9]:
data['BuildingArea'].median()

126.0

In [10]:
data['BuildingArea'] = data['BuildingArea'].fillna(data['BuildingArea'].median())

In [11]:
data['YearBuilt'].median()

1970.0

In [12]:
data['YearBuilt'] = data['YearBuilt'].fillna(data['YearBuilt'].median())

In [13]:
data

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,126.0,1970.0,Yarra,-37.79960,144.99840,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.80790,144.99340,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.80930,144.99440,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,126.0,1970.0,Yarra,-37.79690,144.99690,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.80720,144.99410,Northern Metropolitan,4019.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13575,Wheelers Hill,12 Strada Cr,4,h,1245000.0,S,Barry,26/08/2017,16.7,3150.0,...,2.0,2.0,652.0,126.0,1981.0,,-37.90562,145.16761,South-Eastern Metropolitan,7392.0
13576,Williamstown,77 Merrett Dr,3,h,1031000.0,SP,Williams,26/08/2017,6.8,3016.0,...,2.0,2.0,333.0,133.0,1995.0,,-37.85927,144.87904,Western Metropolitan,6380.0
13577,Williamstown,83 Power St,3,h,1170000.0,S,Raine,26/08/2017,6.8,3016.0,...,2.0,4.0,436.0,126.0,1997.0,,-37.85274,144.88738,Western Metropolitan,6380.0
13578,Williamstown,96 Verdon St,4,h,2500000.0,PI,Sweeney,26/08/2017,6.8,3016.0,...,1.0,5.0,866.0,157.0,1920.0,,-37.85908,144.89299,Western Metropolitan,6380.0


In [14]:
data.isna().sum()

Unnamed: 0,0
Suburb,0
Address,0
Rooms,0
Type,0
Price,0
Method,0
SellerG,0
Date,0
Distance,0
Postcode,0


In [15]:
#Normality Test
# ======================================================================================
# Check if the given column has a Normal Distribution | Bell Curve | Gaussian Dist
# ========================================================================================

#Perform Hypothesis Test
#
# 1. Create a Viable Question (Binary Outcome)
#
#      Check if Car is Normally Distributed
#
# 2. Convert the question into Hypothesis (H0, H1)
#
#      Null Hypothesis (H0) ------- Car is Normally Distributed
#      Alt  Hypothesis (H1) ------- Car is NOT Normally Distributed
#
#      (The Shapiro test takes normality as its null hypothesis.)
#
# 3. Select the statistical test/formula/tool/method to validate the hypothesis
#
#      Normality Test ----- Shapiro-Wilk Test
#
# 4. Select/Determine the significance level (SL) of the project
#
#      SL = 0.05
#
# 5. Calculate the p-value from the test and compare it with the SL:
#
#      p-value >= SL ------- fail to reject H0 (no evidence against normality)
#      p-value <  SL ------- reject H0 (the column is NOT normally distributed)
#

SL = 0.05

from scipy.stats import shapiro

w_statistic,pvalue = shapiro(data['Car'])

if pvalue >= SL:
  print("Fail to reject H0 ------- Car is Normally Distributed")
  print(w_statistic)
else:
  print("Reject H0 (accept H1) ------- Car is NOT Normally Distributed")
  print(w_statistic)

Reject H0 (accept H1) ------- Car is NOT Normally Distributed
0.8226456444282846




In [18]:
SL = 0.05

from scipy.stats import shapiro

# H0: BuildingArea is normally distributed; H1: it is not
w_statistic,pvalue = shapiro(data['BuildingArea'])

if pvalue >= SL:
  print("Fail to reject H0 ------- BuildingArea is Normally Distributed")
  print(w_statistic)
else:
  print("Reject H0 (accept H1) ------- BuildingArea is NOT Normally Distributed")
  print(w_statistic)

Reject H0 (accept H1) ------- BuildingArea is NOT Normally Distributed
0.02961871516301673


In [19]:
SL = 0.05

from scipy.stats import shapiro

# H0: YearBuilt is normally distributed; H1: it is not
w_statistic,pvalue = shapiro(data['YearBuilt'])

if pvalue >= SL:
  print("Fail to reject H0 ------- YearBuilt is Normally Distributed")
  print(w_statistic)
else:
  print("Reject H0 (accept H1) ------- YearBuilt is NOT Normally Distributed")
  print(w_statistic)

Reject H0 (accept H1) ------- YearBuilt is NOT Normally Distributed
0.8420733359042144
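The three normality cells repeat the same steps, so they could be folded into one helper. The sketch below is an assumption-laden illustration, not the notebook's method: `looks_normal`, `max_n` and `seed` are hypothetical names, and the random subsampling is added here because SciPy documents that the Shapiro p-value may not be accurate for more than 5000 observations, while this dataset has 13580 rows. The sample is synthetic.

```python
import numpy as np
from scipy.stats import shapiro

def looks_normal(values, alpha=0.05, max_n=5000, seed=0):
    """Shapiro-Wilk check; returns (is_normal, w_statistic, pvalue).

    H0: the sample comes from a normal distribution.
    Samples larger than max_n are randomly subsampled (an assumption of
    this sketch, not something the notebook does).
    """
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]        # the test needs complete data
    if values.size > max_n:
        rng = np.random.default_rng(seed)
        values = rng.choice(values, size=max_n, replace=False)
    w_statistic, pvalue = shapiro(values)
    # p-value >= alpha means we fail to reject H0 (normality).
    return bool(pvalue >= alpha), float(w_statistic), float(pvalue)

# Synthetic, clearly right-skewed sample (not the Melbourne data).
skewed = np.random.default_rng(42).exponential(scale=2.0, size=8000)
is_normal, w, p = looks_normal(skewed)
print(is_normal, w, p)
```

With the real data, the call would look like `looks_normal(data['Car'])` once the missing values have been imputed.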


In [None]:
# Test for Feature Elimination
#=========================================================================================================
# The goal of this technique is to identify which columns/variables in the dataset can HIGHLY CONTRIBUTE to
# knowledge discovery (Statistical Significance)

# In simple terms, it's all about identifying which columns help us better understand the population's pattern
#
# a. Parametric Test
#=========================================================================================================
# If both columns pass the NORMALITY TEST (both columns are NORMALLY DISTRIBUTED), you can
# use a parametric test with the following rule:
#
# H0 ----- The two columns are statistically the same: eliminate one of the features
# Ha ----- The two columns are statistically different: preserve both features
#
#
# b. Non Parametric Test
#=========================================================================================================
#
# If either column fails the NORMALITY TEST (it is NOT NORMALLY DISTRIBUTED), you can
# use a non-parametric test with the following rule:
#
# H0 ----- The two columns are statistically the same: eliminate one of the features
# Ha ----- The two columns are statistically different: preserve both features
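The choice between the two kinds of test can be sketched as a single function. This is an illustration under stated assumptions: the notebook does not name a parametric test, so the paired t-test (`ttest_rel`) is an assumed choice, `columns_differ` and `both_normal` are hypothetical names, and the data is synthetic.

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

def columns_differ(a, b, both_normal, alpha=0.05):
    """Return True when the paired samples differ significantly (Ha).

    both_normal=True  -> paired t-test (an assumed choice of parametric test)
    both_normal=False -> Wilcoxon signed-rank test, as used in this notebook
    """
    test = ttest_rel if both_normal else wilcoxon
    _, pvalue = test(a, b)
    return bool(pvalue <= alpha)

# Synthetic paired data, not taken from the Melbourne dataset.
k = np.arange(1, 41, dtype=float)
base = 100.0 + k
shifted = base + 5.0 + 0.01 * k                 # consistently larger than base
jittered = base + np.where(k % 2 == 0, -k, k)   # differences alternate in sign

print(columns_differ(base, shifted, both_normal=False))
print(columns_differ(base, jittered, both_normal=False))
```

Both tests work on the paired differences between the two columns, so the columns must have the same length and row order.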

In [20]:
SL = 0.05

from scipy.stats import wilcoxon

_,pvalue = wilcoxon(data['Car'], data['BuildingArea'])

if pvalue <= SL:
  print("Ha ---- Car and BuildingArea are Statistically different. Therefore preserve both")
else:
  print("H0 ---- Car and BuildingArea are Statistically same. Therefore eliminate one of them")

Ha ---- Car and BuildingArea are Statistically different. Therefore preserve both


In [21]:
SL = 0.05

from scipy.stats import wilcoxon

_,pvalue = wilcoxon(data['YearBuilt'], data['BuildingArea'])

if pvalue <= SL:
  print("Ha ---- YearBuilt and BuildingArea are Statistically different. Therefore preserve both")
else:
  print("H0 ---- YearBuilt and BuildingArea are Statistically same. Therefore eliminate one of them")

Ha ---- YearBuilt and BuildingArea are Statistically different. Therefore preserve both
