<a href="https://colab.research.google.com/github/abiflynn/supervised_machine_learning/blob/main/housing_prices_classification/2_housing_prices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 7: Housing Prices
* Iteration 3, One Hot Encoding

# Read the Data / Import Packages

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

path = '/content/drive/MyDrive/WBS CODING/Projects/Project 7: (Supervised Machine Learning)/housing-classification-iter3.csv'

housing_df = pd.read_csv(path)
housing_df.head(5)

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch,Expensive,MSZoning,Condition1,Heating,Street,CentralAir,Foundation
0,8450,65.0,856,3,0,0,2,0,0,0,RL,Norm,GasA,Pave,Y,PConc
1,9600,80.0,1262,3,1,0,2,298,0,0,RL,Feedr,GasA,Pave,Y,CBlock
2,11250,68.0,920,3,1,0,2,0,0,0,RL,Norm,GasA,Pave,Y,PConc
3,9550,60.0,756,3,1,0,3,0,0,0,RL,Norm,GasA,Pave,Y,BrkTil
4,14260,84.0,1145,4,1,0,3,192,0,0,RL,Norm,GasA,Pave,Y,PConc


### The Data


* LotFrontage: Linear feet of street connected to property
* LotArea: Lot size in square feet
* TotalBsmtSF: Total square feet of basement area
* BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)
* Fireplaces: Number of fireplaces
* PoolArea: Pool area in square feet
* GarageCars: Size of garage in car capacity
* WoodDeckSF: Wood deck area in square feet
* ScreenPorch: Screen porch area in square feet
* MSZoning: Identifies the general zoning classification of the sale.
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density

* Condition1: Proximity to various conditions
	
       Artery	Adjacent to arterial street
       Feedr	Adjacent to feeder street	
       Norm	Normal	
       RRNn	Within 200' of North-South Railroad
       RRAn	Adjacent to North-South Railroad
       PosN	Near positive off-site feature--park, greenbelt, etc.
       PosA	Adjacent to positive off-site feature
       RRNe	Within 200' of East-West Railroad
       RRAe	Adjacent to East-West Railroad

* Heating: Type of heating
		
       Floor	Floor Furnace
       GasA	Gas forced warm air furnace
       GasW	Gas hot water or steam heat
       Grav	Gravity furnace	
       OthW	Hot water or steam heat other than gas
       Wall	Wall furnace

* Street: Type of road access to property

       Grvl	Gravel	
       Pave	Paved

* CentralAir: Central air conditioning

       N	No
       Y	Yes

* Foundation: Type of foundation
		
       BrkTil	Brick & Tile
       CBlock	Cinder Block
       PConc	Poured Concrete
       Slab	Slab
       Stone	Stone
       Wood	Wood

# Upgrade Scikit-learn

In [2]:
# Run this once per new runtime: Colab has scikit-learn 1.0.2 pre-installed,
# and the .set_output() method used below requires version 1.2.0 or higher.
!pip install scikit-learn --upgrade

# If you re-run the whole notebook during the same runtime,
# you can comment out the line above.

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
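
A quick way to confirm that the upgrade took effect is to check whether the installed version exposes `set_output` at all. This is only a sketch: it tests for the method on `SimpleImputer` rather than parsing the version string, which is a choice made for this example.

```python
import sklearn
from sklearn.impute import SimpleImputer

# set_output() was added to transformers in scikit-learn 1.2,
# so its presence tells us whether the upgrade is still needed.
has_set_output = hasattr(SimpleImputer, "set_output")

print(f"scikit-learn {sklearn.__version__}; set_output available: {has_set_output}")
```

If this prints `False`, the runtime is most likely still using the older pre-installed version and may need a restart after the upgrade.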


# X and Y Creation & Data Splitting

In [4]:
#X and y creation (copy so that housing_df keeps its "Expensive" column and the cell can be re-run)
x = housing_df.copy()
y = x.pop("Expensive")

#Data splitting
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=123)

# Categorical Encoding - "MANUAL" Approach (Without using Pipelines)


We will need two different strategies to deal with missing values in numerical and categorical features.

### Replacing NaNs in Categorical Features

For numerical data, we can impute the mean. That is not possible for categorical values, which have no "mean". Instead, we will replace NaNs with a string that marks them: "N_A".

In [5]:
#Selecting non-numerical columns
x_train_cat = x_train.select_dtypes(exclude="number")

#Defining the imputer to use "N_A" as replacement value
cat_imputer = SimpleImputer(strategy="constant", 
                            fill_value="N_A").set_output(transform='pandas')

#Fitting and transforming
x_cat_imputed = cat_imputer.fit_transform(x_train_cat)

x_cat_imputed.head()

Unnamed: 0,MSZoning,Condition1,Heating,Street,CentralAir,Foundation
318,RL,Norm,GasA,Pave,Y,PConc
580,RL,Norm,GasA,Pave,Y,CBlock
961,RL,PosN,GasA,Pave,Y,CBlock
78,RL,Norm,GasA,Pave,N,CBlock
5,RL,Norm,GasA,Pave,Y,Wood


### Replacing NaNs in Numerical Features

In [6]:
#Selecting numerical columns
x_train_num = x_train.select_dtypes(include="number")

#Imputing the mean
num_imputer = SimpleImputer(strategy="mean").set_output(transform='pandas')

#Fitting and transforming
x_num_imputed = num_imputer.fit_transform(x_train_num)

x_num_imputed.head()

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch
318,9900.0,90.0,1347.0,4.0,1.0,0.0,3.0,340.0,0.0
580,14585.0,69.58427,1144.0,3.0,2.0,0.0,2.0,216.0,0.0
961,12227.0,69.58427,1330.0,4.0,1.0,0.0,2.0,550.0,0.0
78,10778.0,72.0,1768.0,4.0,0.0,0.0,0.0,0.0,0.0
5,14115.0,85.0,796.0,1.0,0.0,0.0,2.0,40.0,0.0


### Concatenating All Columns 

In [7]:
#Concatenating all columns
x_imputed = pd.concat([x_cat_imputed, x_num_imputed], axis=1)

x_imputed.head()

Unnamed: 0,MSZoning,Condition1,Heating,Street,CentralAir,Foundation,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch
318,RL,Norm,GasA,Pave,Y,PConc,9900.0,90.0,1347.0,4.0,1.0,0.0,3.0,340.0,0.0
580,RL,Norm,GasA,Pave,Y,CBlock,14585.0,69.58427,1144.0,3.0,2.0,0.0,2.0,216.0,0.0
961,RL,PosN,GasA,Pave,Y,CBlock,12227.0,69.58427,1330.0,4.0,1.0,0.0,2.0,550.0,0.0
78,RL,Norm,GasA,Pave,N,CBlock,10778.0,72.0,1768.0,4.0,0.0,0.0,0.0,0.0,0.0
5,RL,Norm,GasA,Pave,Y,Wood,14115.0,85.0,796.0,1.0,0.0,0.0,2.0,40.0,0.0


### One Hot Encoding 

One-hot encoding creates a new binary column for each category in every categorical column. Fortunately, Scikit-Learn's `OneHotEncoder` transformer takes care of everything.

As with any transformer, using the OneHotEncoder takes four steps:
1.   Import it
2.   Initialize it
3.   Fit it to the data
4.   Use it to transform the data

In [8]:
#Import
from sklearn.preprocessing import OneHotEncoder

#Initialize
my_onehot = OneHotEncoder(sparse_output=False).set_output(transform='pandas')

#Fit
my_onehot.fit(x_cat_imputed)

#Transform
x_cat_imputed_onehot = my_onehot.transform(x_cat_imputed)

NOTE: If we leave the default, sparse_output=True, the result will be a "sparse matrix": a SciPy object that stores only the non-zero entries, which saves memory when a matrix contains mostly zeros. Pandas output does not support sparse data, so in that case we would not be able to use .set_output(transform='pandas').

In [9]:
x_cat_imputed_onehot.head()

Unnamed: 0,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Condition1_Artery,Condition1_Feedr,Condition1_Norm,Condition1_PosA,Condition1_PosN,...,Street_Grvl,Street_Pave,CentralAir_N,CentralAir_Y,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood
318,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
580,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
961,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
78,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0


### Concatenating "One-Hot" Columns and Numerical Columns 

In [10]:
x_imputed = pd.concat([x_cat_imputed_onehot, x_num_imputed], axis=1)

x_imputed.head(3)

Unnamed: 0,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Condition1_Artery,Condition1_Feedr,Condition1_Norm,Condition1_PosA,Condition1_PosN,...,Foundation_Wood,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch
318,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,9900.0,90.0,1347.0,4.0,1.0,0.0,3.0,340.0,0.0
580,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,14585.0,69.58427,1144.0,3.0,2.0,0.0,2.0,216.0,0.0
961,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,12227.0,69.58427,1330.0,4.0,1.0,0.0,2.0,550.0,0.0


# Categorical Encoding - "Automated" Approach (Using Pipelines)

All the steps in the manual approach can be condensed using Scikit-Learn Pipelines together with `ColumnTransformer`, which applies different transformations to two or more groups of columns: in our case, the categorical and the numerical columns.

This process is also called creating "branches" in the pipeline: one branch for the categorical columns and another for the numerical columns. Each branch can contain as many transformers as we want. The branches then meet again, and the transformed columns are automatically concatenated.

### Creating the "Numeric Pipe" and the "Categoric Pipe"

In [11]:
# select categorical and numerical column names
x_cat_columns = x.select_dtypes(exclude="number").copy().columns
x_num_columns = x.select_dtypes(include="number").copy().columns

# create numerical pipeline, only with the SimpleImputer(strategy="mean")
numeric_pipe = make_pipeline(
    SimpleImputer(strategy="mean"))
 
 # create categorical pipeline, with the SimpleImputer(fill_value="N_A") and the OneHotEncoder
categoric_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="N_A"),
    OneHotEncoder(sparse_output=False, handle_unknown='ignore')
)

### Using ColumnTransformer to Build a Pipeline with 2 Branches

In [12]:
#Import
from sklearn.compose import ColumnTransformer

#Create the ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num_pipe", numeric_pipe, x_num_columns), #1 branch called "num_pipe", will apply the steps in the numeric_pipe to the columns named in x_num_columns
        ("cat_pipe", categoric_pipe, x_cat_columns), #2 branch called "cat_pipe", will apply the steps in the categoric_pipe to the columns named in x_cat_columns
    ]
)

### Creating the Full_Pipeline

In [13]:
full_pipeline = make_pipeline(preprocessor, 
                              DecisionTreeClassifier()).set_output(transform='pandas')

In [14]:
full_pipeline.fit(x_train, y_train)

In [15]:
full_pipeline.predict(x_train)

array([1, 0, 1, ..., 1, 0, 0])

# Using the New Pipeline (with Branches) to Train a DecisionTree with GridSearch Cross Validation
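
Grid search addresses parameters inside a pipeline by joining step names with double underscores. `make_pipeline` names each step after its lowercased class name, while the branches of a `ColumnTransformer` keep the names we gave them. The sketch below lists a few of these names on a hypothetical one-branch pipeline; `"LotArea"` is only a placeholder column name, and nothing is fitted.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical one-branch preprocessor, used only to inspect parameter names
toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num_pipe", make_pipeline(SimpleImputer(strategy="mean")), ["LotArea"]),
    ]
)
toy_pipeline = make_pipeline(toy_preprocessor, StandardScaler(), DecisionTreeClassifier())

# Nested parameters contain "__"; these are the keys a param_grid can use
tunable = sorted(name for name in toy_pipeline.get_params() if "__" in name)

# Show only a few of them to keep the output short
for name in tunable:
    if name.endswith(("strategy", "with_mean", "max_depth")):
        print(name)
```

Calling `get_params()` on the real pipeline is a convenient way to look up the exact key for any setting before adding it to a parameter grid.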

In [16]:

from sklearn.preprocessing import StandardScaler

full_pipeline = make_pipeline(preprocessor,
                              StandardScaler(), 
                              DecisionTreeClassifier()).set_output(transform='pandas')

# param_grid = {
#     "columntransformer__num_pipe__simpleimputer__strategy":["mean", "median"],
#     "decisiontreeclassifier__max_depth": range(2, 14, 2),
#     "decisiontreeclassifier__min_samples_leaf": range(3, 12, 2)
# }

param_grid = {
    "columntransformer__num_pipe__simpleimputer__strategy":["mean", "median"],
    "standardscaler__with_mean":[True, False],
    "standardscaler__with_std":[True, False],
    "decisiontreeclassifier__max_depth": range(2, 14),
    "decisiontreeclassifier__min_samples_leaf": range(3, 10),
    "decisiontreeclassifier__criterion":["gini", "entropy"]
}

search = GridSearchCV(full_pipeline,
                      param_grid,
                      cv=5,
                      verbose=1, 
                      error_score='raise')

search.fit(x_train, y_train)

Fitting 5 folds for each of 1344 candidates, totalling 6720 fits
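
The candidate count reported above is the product of the number of values tried for each parameter: 2 × 2 × 2 × 12 × 7 × 2 = 1344, and each candidate is fitted once per fold. As a sketch, the same numbers can be computed before launching a long search with `ParameterGrid`, using the grid from the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
param_grid = {
    "columntransformer__num_pipe__simpleimputer__strategy": ["mean", "median"],
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
    "decisiontreeclassifier__max_depth": range(2, 14),
    "decisiontreeclassifier__min_samples_leaf": range(3, 10),
    "decisiontreeclassifier__criterion": ["gini", "entropy"],
}

n_candidates = len(ParameterGrid(param_grid))  # every combination of values
n_fits = n_candidates * 5                      # one fit per fold with cv=5

print(n_candidates, n_fits)  # 1344 6720
```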


In [17]:
# training accuracy
y_train_pred = search.predict(x_train)

train_accuracy = accuracy_score(y_train, y_train_pred)

train_accuracy_rounded = round(train_accuracy, 4)

print("The training data prediction is {:.2%} accurate".format(train_accuracy_rounded))

The training data prediction is 93.07% accurate


In [18]:
# testing accuracy
y_test_pred = search.predict(x_test)

test_accuracy = accuracy_score(y_test, y_test_pred)

test_accuracy_rounded = round(test_accuracy, 4)

print("The test data prediction is {:.2%} accurate".format(test_accuracy_rounded))

The test data prediction is 90.75% accurate
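
Besides training and test accuracy, a fitted `GridSearchCV` object also exposes the winning parameter combination (`best_params_`) and its mean cross-validated score (`best_score_`). The sketch below shows these attributes on synthetic data from `make_classification`, not on the housing data; the names `toy_search`, `X_tr` and so on are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, used only to illustrate the attributes
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=123)

toy_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4]},
    cv=5,
)
toy_search.fit(X_tr, y_tr)

print("Best parameters:", toy_search.best_params_)
print("Mean cross-validated accuracy:", round(toy_search.best_score_, 4))
print("Test accuracy:", round(toy_search.score(X_te, y_te), 4))
```

On the real `search` object, the same attributes show which imputation strategy, tree depth and split criterion were selected.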
