After referencing data_description, the encoding plan is as follows:

## Ordinal
Fence: Fence quality (GdPrv	Good Privacy MnPrv	Minimum Privacy GdWo	Good Wood MnWw	Minimum Wood/Wire NA	No Fence)
PavedDrive: Paved driveway (Y	Paved P	Partial Pavement N	Dirt/Gravel)
Functional: Home functionality (Assume typical unless deductions are warranted) (Typ	Typical Functionality Min1	Minor Deductions 1 Min2	Minor Deductions 2 Mod	Moderate Deduction Maj1	Major Deductions 1 Maj2	Major Deductions 2 Sev	Severely Damaged Sal	Salvage only)
Electrical: Electrical system (2 SBrkr	Standard Circuit Breakers & Romex 1 FuseA	Fuse Box over 60 AMP and all Romex wiring (Average) 0 Mix	Mixed -1 FuseF	60 AMP Fuse Box and mostly Romex wiring (Fair)  -2 FuseP	60 AMP Fuse Box and mostly knob & tube wiring (poor))
BsmtExposure: Refers to walkout or garden level walls (Gd	Good Exposure Av	Average Exposure (split levels or foyers typically score average or above)	 Mn	Mimimum Exposure No	No Exposure NA	No Basement)

__{"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, NaN: 0}__

HeatingQC: Heating quality and condition 
KitchenQual: Kitchen quality 
GarageCond: Garage condition 
GarageQual: Garage quality 
FireplaceQu: Fireplace quality 
BsmtQual: Evaluates the height of the basement
BsmtCond: Evaluates the general condition of the basement
ExterCond: Evaluates the present condition of the material on the exterior
ExterQual: Evaluates the quality of the material on the exterior 
PoolQC: Pool quality 

## Nominal
### OHE (option to do drop if binary)
MoSold: Month Sold (MM)
GarageFinish: Interior finish of the garage
CentralAir: Central air conditioning binary
Street: Type of road access to property
Alley: Type of alley access to property
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property

### Hash
SaleCondition: Condition of sale
SaleType: Type of sale
GarageType: Garage location
Heating: Type of heating
MSSubClass: Identifies the type of dwelling involved in the sale.
MSZoning: Identifies the general zoning classification of the sale.
Neighborhood: Physical locations within Ames city limits			
Condition1: Proximity to various conditions
Condition2: Proximity to various conditions (if more than one is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
RoofStyle: Type of roof	
RoofMatl: Roof material
Exterior1st: Exterior covering on house	
Exterior2nd: Exterior covering on house (if more than one material)	
MasVnrType: Masonry veneer type
Foundation: Type of foundation
BsmtFinType1: Rating of basement finished area 
BsmtFinType2: Rating of basement finished area (if multiple types)

## Numeric Columns
MiscVal: $Value of miscellaneous feature
YrSold: Year Sold (YYYY)
PoolArea: Pool area in square feet
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageYrBlt: Year garage was built
Fireplaces: Number of fireplaces
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)
KitchenAbvGr: Kitchens above grade
BsmtFinSF1: Type 1 finished square feet
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
MasVnrArea: Masonry veneer area in square feet
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
YearBuilt: Original construction date
YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
OverallQual: Rates the overall material and finish of the house
OverallCond: Rates the overall condition of the house

__sum all porches into one Porch column__

ScreenPorch: Screen porch area in square feet
3SsnPorch: Three season porch area in square feet
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet

# Target
SalePrice

Steps:
1. Split data into train and test

2. Drops from EDA
- drop `Id`
- drop `MiscFeature`: Miscellaneous feature not covered in other categories
- drop duplicates

3. Preprocessing
- Instantiate Transformers
- Build Pipelines
- Select columns
- Create tuples
- Column Transformer
- Encoding

4. New Baseline Model!

## Library imports

In [None]:
import pandas as pd
pd.set_option('display.max_columns', 100)
import numpy as np
np.set_printoptions(legacy="1.25")
import seaborn as sns
import matplotlib.pyplot as plt
import math
import scipy

from sklearn import set_config
set_config(transform_output='pandas')

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import (mean_absolute_error, mean_squared_error, r2_score, 
                             mean_absolute_percentage_error, root_mean_squared_error)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler, PolynomialFeatures, FunctionTransformer
from sklearn.tree import DecisionTreeRegressor, plot_tree

import category_encoders as ce
from category_encoders.hashing import HashingEncoder
from category_encoders.ordinal import OrdinalEncoder

from xgboost import XGBRegressor

## Load dataset

In [18]:
file_path = "../../house-prices-advanced-regression-techniques/input/train.csv"
houses = pd.read_csv(file_path)

# Train/Test Split

In [None]:
X = houses.drop(columns = "SalePrice")
y = houses["SalePrice"].copy()
y = np.log(y) # taking logarithm for a more normally distributed target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Drop indicated columns

In [20]:
for _ in [X_train, X_test]:
    _.drop(columns=["Id", "MiscFeature"], inplace=True)

# Build Pipelines

In [21]:
# column selection
ohe_cols = ["MoSold",
            "GarageFinish",
            "CentralAir",
            "Street",
            "Alley",
            "LotShape",
            "LandContour",
            "Utilities",
            "LotConfig",
            "LandSlope"]
num_cols = ["MiscVal",
            "YrSold",
            "PoolArea",
            "GarageCars",
            "GarageArea",
            "GarageYrBlt",
            "Fireplaces",
            "TotRmsAbvGrd",
            "1stFlrSF",
            "2ndFlrSF",
            "LowQualFinSF",
            "GrLivArea",
            "BsmtFullBath",
            "BsmtHalfBath",
            "FullBath",
            "HalfBath",
            "BedroomAbvGr",
            "KitchenAbvGr",
            "BsmtFinSF1",
            "BsmtFinSF2",
            "BsmtUnfSF",
            "TotalBsmtSF",
            "MasVnrArea",
            "LotFrontage",
            "LotArea",
            "YearBuilt",
            "YearRemodAdd",
            "OverallQual",
            "OverallCond"]
porch_cols = ["ScreenPorch",
            "3SsnPorch",
            "WoodDeckSF",
            "OpenPorchSF",
            "EnclosedPorch"]
hash_cols = ["SaleCondition",
            "SaleType",
            "GarageType",
            "Heating",
            "MSSubClass",
            "MSZoning",
            "Neighborhood",		
            "Condition1",
            "Condition2",
            "BldgType",
            "HouseStyle",
            "RoofStyle",
            "RoofMatl",
            "Exterior1st",
            "Exterior2nd",
            "MasVnrType",
            "Foundation",
            "BsmtFinType1",
            "BsmtFinType2"]
ord_cols = ["Fence",
            "PavedDrive",
            "Functional",
            "Electrical",
            "BsmtExposure",
            "HeatingQC",
            "KitchenQual",
            "GarageCond",
            "GarageQual",
            "FireplaceQu",
            "BsmtQual",
            "BsmtCond",
            "ExterCond",
            "ExterQual",
            "PoolQC"]

# Porch sum function
def porch_func(df):
    df = df.copy()
    df["Porch_sum"] = 0
    for porch in porch_cols:
        df["Porch_sum"] = df["Porch_sum"] + df[porch]
    df.drop(porch_cols, axis=1, inplace=True)
    return df

# find hash components needed and add one for safety
hash_n = math.ceil(math.log2(max(houses.select_dtypes(include="object").nunique()))) + 1

# ordinal map
ord_map = [{"col": "Fence", "mapping": {"Missing":0, "MnWw":1, "GdWo":2, "MnPrv":3, "GdPrv":4}},
             {"col": "PavedDrive", "mapping": {"N":1, "P":2, "Y":3}},
             {"col": "Functional", "mapping": {"Sal":-7, "Sev":-6, "Maj2":-5, "Maj1":-4, "Mod":-3, "Min2":-2, "Min1":-1, "Missing": 0, "Typ":0}},
             {"col": "Electrical", "mapping": {"Missing":0, "FuseP": -2, "FuseF":-1, "Mix":0, "FuseA":1, "SBrkr":2}},
             {"col": "BsmtExposure", "mapping": {"Missing":0, "No":1, "Mn":2, "Av":3, "Gd":4}},
             {"col": "HeatingQC", "mapping": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "Missing": 0}},
             {"col": "KitchenQual", "mapping": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "Missing": 0}},
             {"col": "GarageCond", "mapping": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "Missing": 0}},
             {"col": "GarageQual", "mapping": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "Missing": 0}},
             {"col": "FireplaceQu", "mapping": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "Missing": 0}},
             {"col": "BsmtQual", "mapping": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "Missing": 0}},
             {"col": "BsmtCond", "mapping": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "Missing": 0}},
             {"col": "ExterCond", "mapping": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "Missing": 0}},
             {"col": "ExterQual", "mapping": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "Missing": 0}},
             {"col": "PoolQC", "mapping": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "Missing": 0}}]

# Instantiate Transformers
zero_imputer = SimpleImputer(strategy="constant", fill_value=0)
missing_imputer = SimpleImputer(strategy="constant", fill_value="Missing")

ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore', drop="if_binary")
porch_sum = FunctionTransformer(porch_func)
hash = HashingEncoder(cols=hash_cols, n_components=hash_n)
ord = OrdinalEncoder(cols=ord_cols, mapping=ord_map)

# pipelines
ohe_pipe = Pipeline([("Missing Imputer", missing_imputer),
                     ("One Hot Encoder", ohe)])
hash_pipe = Pipeline([("Missing Imputer", missing_imputer),
                     ("Hashing Encoder", hash)])
ord_pipe = Pipeline([("Missing Imputer", missing_imputer),
                     ("Ordinal Encoder", ord)])


# Column Transformer Tuples
num_tuple = ("Numeric Imputation", zero_imputer, num_cols)
ohe_tuple = ("One Hot Encoder", ohe_pipe, ohe_cols)
porch_tuple = ("Sum of Porches", porch_sum, porch_cols)
hash_tuple = ("Hashing Encoder", hash_pipe, hash_cols)
ord_tuple = ("Ordinal Encoder", ord_pipe, ord_cols)

In [22]:
# Column Selector
preprocessor = ColumnTransformer([num_tuple, ohe_tuple, porch_tuple, hash_tuple, ord_tuple], remainder='drop',
                                 verbose_feature_names_out=False)

# Transform Data
preprocessor.fit(X_train)
X_train_proc = preprocessor.transform(X_train)
X_test_proc = preprocessor.transform(X_test)

# New Baseline Submission

In [23]:
rf_proc = RandomForestRegressor(random_state=42)
rf_proc.fit(X_train_proc, y_train)
proc_train_pred = rf_proc.predict(X_train_proc)
rf_proc_RMSE_train = root_mean_squared_error(np.log1p(y_train), np.log1p(proc_train_pred))
print("Training log RMSE:", rf_proc_RMSE_train)
proc_test_pred = rf_proc.predict(X_test_proc)
rf_proc_RMSE_test = root_mean_squared_error(np.log1p(y_test), np.log1p(proc_test_pred))
print("Test log RMSE:", rf_proc_RMSE_test)

Training log RMSE: 0.06021361995696828
Test log RMSE: 0.15096888170194378


In [None]:
test_file_path = "../../house-prices-advanced-regression-techniques/input/test.csv"
test_data = pd.read_csv(test_file_path)
ids = test_data.pop("Id")

test_data.drop(columns="MiscFeature", inplace=True)
test_data_proc = preprocessor.transform(test_data)

preds = rf_proc.predict(test_data_proc)
preds = np.exp(preds)

output = pd.DataFrame({"Id": ids,
                       "SalePrice": preds})

output.head()



Unnamed: 0,Id,SalePrice
0,1461,130900.33
1,1462,157418.75
2,1463,180003.53
3,1464,190552.55
4,1465,206770.68


In [26]:
sample_submission_file_path = "../../house-prices-advanced-regression-techniques/input/sample_submission.csv"
sample_submission_df = pd.read_csv(sample_submission_file_path)
sample_submission_df["SalePrice"] = preds
sample_submission_df.to_csv("01_post_data_processing_submission.csv", index=False)
sample_submission_df.head()

Unnamed: 0,Id,SalePrice
0,1461,130900.33
1,1462,157418.75
2,1463,180003.53
3,1464,190552.55
4,1465,206770.68
