# Handling Categorical Features

Although the data was already cleaned earlier on, I'm going to build a pipeline that cleans both the numerical and the categorical features, purely to make use of more of what scikit-learn has to offer. Repeating the work might seem counterintuitive, but I want to experiment with as many approaches as possible during this project.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
import os

sys.path.append(os.path.abspath(os.path.join("..")))

pd.set_option("mode.copy_on_write", True)

In [2]:
# Using the raw datasets to perform cleaning using pipeline(s)

train = pd.read_csv("../data/raw/train.csv", index_col="Id")
test = pd.read_csv("../data/raw/test.csv", index_col="Id")

In [3]:
train.drop(columns="SalePrice", inplace=True)

In [4]:
from src.data_pipelines import Pipelines

pipelines = Pipelines()

ct_mm = pipelines.create_column_transformer_mm()
ct_ss = pipelines.create_column_transformer_ss()

In [5]:
train_trans_mm = ct_mm.fit_transform(train)
test_trans_mm = ct_mm.transform(test)

In [6]:
train_trans_ss = ct_ss.fit_transform(train)
test_trans_ss = ct_ss.transform(test)

By combining a ColumnTransformer with Pipelines for the numerical and categorical types, I have imputed missing values, encoded the categorical types with OrdinalEncoder and OneHotEncoder, and scaled the values with MinMaxScaler and StandardScaler in separate ColumnTransformers, so the models receive consistently prepared data. Compared to the cleaning that was done before, the ColumnTransformer makes it simple to target one type of data by specifying which pipeline to apply to which columns. This makes the results reproducible if new data is ever added and reduces the chance of errors in each feature that required modification. One thing to note is that the heavy-tailed numerical features found during data exploration were not log-transformed. This might affect overall model performance later, but that can be assessed when the time comes.

Using these dataset transformations, let's create a dataframe for each.

In [7]:
# No .toarray() needed here: MinMaxScaler is not able to take a sparse matrix,
# so OneHotEncoder uses sparse_output=False and the output is already dense
train_mm_df = pd.DataFrame(train_trans_mm, columns=ct_mm.get_feature_names_out(), index=train.index)
test_mm_df = pd.DataFrame(test_trans_mm, columns=ct_mm.get_feature_names_out(), index=test.index)


# Since this OneHotEncoder outputs a sparse matrix, each result must be converted into a dense array to create a dataframe with the correct size
train_ss_df = pd.DataFrame(train_trans_ss.toarray(), columns=ct_ss.get_feature_names_out(), index=train.index)
test_ss_df = pd.DataFrame(test_trans_ss.toarray(), columns=ct_ss.get_feature_names_out(), index=test.index)

In [10]:
print("MinMax ColumnTransform Shape:")
print(f"Training: {train_mm_df.shape}")
print(f"Testing: {test_mm_df.shape}")

print("\nStandardScaler ColumnTransform Shape:")
print(f"Training: {train_ss_df.shape}")
print(f"Testing: {test_ss_df.shape}")

MinMax ColumnTransform Shape:
Training: (1460, 226)
Testing: (1459, 226)

StandardScaler ColumnTransform Shape:
Training: (1460, 307)
Testing: (1459, 307)


In [12]:
ct_mm.get_feature_names_out()

array(['pipeline-1__MSSubClass', 'pipeline-1__LotFrontage',
       'pipeline-1__LotArea', 'pipeline-1__OverallQual',
       'pipeline-1__OverallCond', 'pipeline-1__YearBuilt',
       'pipeline-1__YearRemodAdd', 'pipeline-1__MasVnrArea',
       'pipeline-1__BsmtFinSF1', 'pipeline-1__BsmtFinSF2',
       'pipeline-1__BsmtUnfSF', 'pipeline-1__TotalBsmtSF',
       'pipeline-1__1stFlrSF', 'pipeline-1__2ndFlrSF',
       'pipeline-1__LowQualFinSF', 'pipeline-1__GrLivArea',
       'pipeline-1__BsmtFullBath', 'pipeline-1__BsmtHalfBath',
       'pipeline-1__FullBath', 'pipeline-1__HalfBath',
       'pipeline-1__BedroomAbvGr', 'pipeline-1__KitchenAbvGr',
       'pipeline-1__TotRmsAbvGrd', 'pipeline-1__Fireplaces',
       'pipeline-1__GarageYrBlt', 'pipeline-1__GarageCars',
       'pipeline-1__GarageArea', 'pipeline-1__WoodDeckSF',
       'pipeline-1__OpenPorchSF', 'pipeline-1__EnclosedPorch',
       'pipeline-1__3SsnPorch', 'pipeline-1__ScreenPorch',
       'pipeline-1__PoolArea', 'pipeline-1__Mis

The `get_feature_names_out()` method makes it simple to observe the new features created by the ColumnTransformer and its pipelines. Focusing on the categorical data, the output differs depending on the type (nominal or ordinal). The OneHotEncoder, mainly used for nominal data, creates a new feature column for each category within a column. Take the Utilities feature for example: OneHotEncoder creates four new feature columns, AllPub, NoSewr, NoSeWa, and ELO. Depending on the category of a given entry, each column under that 'parent feature' is assigned either a 0 or a 1. For ordinal data, the OrdinalEncoder maps the categories of a feature to a list of numbers, one per category, and replaces each value with its numerical representation.

Now that the categorical data has been handled, let's move on to the feature engineering portion, which will be much shorter.

In [14]:
train_mm_df.to_csv("../data/processed/train_minmax.csv")
test_mm_df.to_csv("../data/processed/test_minmax.csv")

train_ss_df.to_csv("../data/processed/train_standard.csv")
test_ss_df.to_csv("../data/processed/test_standard.csv")