In [2]:
nominal_cols = ['MSSubClass', 'MoSold', 'BldgType', 'MasVnrType', 'GarageType', 'SaleType', 'Condition1',
                'Condition2', 'SaleCondition', 'Neighborhood', 'Exterior1st', 'HouseStyle', 'RoofMatl',
                'BsmtFinType2', 'RoofStyle', 'BsmtFinType1', 'Heating', 'Foundation', 'LotConfig', 'MSZoning',
                'Electrical']

collapse_to_binary_cols = ['LowQualFinSF', 'MiscVal', '3SsnPorch', 'PoolArea', 'BsmtFullBath', 'HalfBath',
                           'BsmtHalfBath', 'BsmtFinSF2', 'EnclosedPorch', 'ScreenPorch', 'Fence']

right_skewed_cols = ['LotArea', 'GrLivArea', 'BsmtUnfSF', 'SalePrice', '1stFlrSF', 'TotalBsmtSF', 'LotFrontage']

skewed_and_binary = ['2ndFlrSF', 'OpenPorchSF', 'WoodDeckSF', 'MasVnrArea']

drop = ['PoolQC', 'MiscFeature', 'Utilities', 'Id', 'GarageArea', 'TotRmsAbvGrd', 'Alley', 'Exterior2nd']

obj_ordinal_cols = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC', 'KitchenQual', 'FireplaceQu',
                    'GarageQual', 'GarageCond', 'Functional', 'LandContour', 'LotShape', 'BsmtExposure',
                    'LandSlope', 'GarageFinish', 'PavedDrive']

obj_already_binary_cols = ['Street', 'CentralAir']

year_cols = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']

imbalanced_numerical = {
    'KitchenAbvGr': 'collapse to a multiple-kitchens flag.',
    'TotRmsAbvGrd': 'join 2 and 3, and 10, 11, 12, 14',
    'OverallCond': 'collapse into poor / average / good',
    'BedroomAbvGr': 'Collapse to 3 bins (Small / Typical / Large) OR drop in favor of TotRmsAbvGrd.',
    'GarageCars': 'collapse 4 into 3'
}

imbalanced_object = {
    'Condition1': 'Collapse very rare categories (<= 5 obs) into "other"',
    'Condition2': 'Collapse into "Norm" vs "Other"',
    'SaleCondition': 'collapse AdjLand, Alloca, and Family into other',
    'GarageType': 'collapse less than 10 into other',
    'SaleType': 'Collapse less than 10 into other',
    'Exterior2nd': 'combine less than and combine Wd Sdng and Wd Shng',
    'Exterior1st': 'combine less than and combine Wd Sdng and Wd Shng',
    'RoofMatl': 'collapse into other all except CompShg and WdShngl',
    'RoofStyle': 'keep Gable and Hip, collapse others into other',
    'Heating': 'keep GasA and GasW, collapse others into other',
    'Foundation': 'collapse Stone and Wood into other',
    'Electrical': 'collapse FuseP and Mix'
}

lists_dict = {
    "nominal_cols": nominal_cols,
    "collapse_to_binary_cols": collapse_to_binary_cols,
    "right_skewed_cols": right_skewed_cols,
    "skewed_and_binary": skewed_and_binary,
    "drop": drop,
    "obj_ordinal_cols": obj_ordinal_cols,
    "obj_already_binary_cols": obj_already_binary_cols,
    "year_cols": year_cols
}
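
Several notes in `imbalanced_object` boil down to "collapse categories with fewer than N observations into other". A minimal sketch of that step follows; the helper name `collapse_rare`, the default threshold of 10, and the toy counts are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd


def collapse_rare(series: pd.Series, min_count: int = 10, other: str = "Other") -> pd.Series:
    """Replace categories seen fewer than `min_count` times with `other`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; missing values are left untouched.
    return series.where(~series.isin(rare), other)


# Toy data with made-up frequencies (not the real SaleType distribution).
sale_type = pd.Series(["WD"] * 12 + ["New"] * 10 + ["COD"] * 3 + ["Con"] * 1)
collapsed = collapse_rare(sale_type, min_count=10)
print(collapsed.value_counts().to_dict())
```

The rare set is best computed on the training split only and then reused on validation and test data, so the same categories are collapsed everywhere.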

🏠 Size & Area Features

TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF → overall living + basement size.

PorchSF = OpenPorchSF + EnclosedPorch + 3SsnPorch + ScreenPorch → one combined porch measure.

TotalBaths = FullBath + HalfBath*0.5 + BsmtFullBath + BsmtHalfBath*0.5 → cleaner bathroom measure.

BasementPct = TotalBsmtSF / TotalSF → proportion of house that’s basement.

LotRatio = GrLivArea / LotArea → density of construction.

AvgRoomSize = GrLivArea / TotRmsAbvGrd (even if you drop TotRmsAbvGrd, this ratio is useful).
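
The size and area features above could be sketched as follows. The function name `add_size_features` and the two toy rows are assumptions; replacing a zero denominator with NaN is one possible way to avoid division by zero, not something the notes prescribe.

```python
import numpy as np
import pandas as pd

# Two toy rows with made-up values; column names follow the dataset.
df = pd.DataFrame({
    "TotalBsmtSF": [800, 0], "1stFlrSF": [1000, 900], "2ndFlrSF": [600, 0],
    "OpenPorchSF": [40, 0], "EnclosedPorch": [0, 0],
    "3SsnPorch": [0, 0], "ScreenPorch": [60, 0],
    "FullBath": [2, 1], "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0], "BsmtHalfBath": [0, 0],
    "GrLivArea": [1600, 900], "LotArea": [8000, 6000], "TotRmsAbvGrd": [8, 5],
})

PORCH_COLS = ["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]


def add_size_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["TotalSF"] = out["TotalBsmtSF"] + out["1stFlrSF"] + out["2ndFlrSF"]
    out["PorchSF"] = out[PORCH_COLS].sum(axis=1)
    out["TotalBaths"] = (out["FullBath"] + 0.5 * out["HalfBath"]
                         + out["BsmtFullBath"] + 0.5 * out["BsmtHalfBath"])
    # Assumption: a zero denominator becomes NaN instead of raising or giving inf.
    out["BasementPct"] = out["TotalBsmtSF"] / out["TotalSF"].replace(0, np.nan)
    out["LotRatio"] = out["GrLivArea"] / out["LotArea"]
    out["AvgRoomSize"] = out["GrLivArea"] / out["TotRmsAbvGrd"].replace(0, np.nan)
    return out


sized = add_size_features(df)
print(sized[["TotalSF", "PorchSF", "TotalBaths", "BasementPct", "LotRatio", "AvgRoomSize"]])
```

These ratios need the raw columns, so they should be computed before any column in `drop` is removed and before the skewed columns are transformed.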

⏳ Time-Based Features

Use YrSold as reference:

HouseAge = YrSold - YearBuilt → age of the house at sale.

RemodAge = YrSold - YearRemodAdd → time since the last remodel.

GarageAge = YrSold - GarageYrBlt (clip negatives to 0).

IsRemodeled = binary flag if YearBuilt != YearRemodAdd.

DecadeBuilt = (YearBuilt // 10) * 10 → bin into decades.
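
A possible implementation of the time-based features, using made-up rows. The function name `add_time_features` is hypothetical, and leaving `GarageAge` missing when `GarageYrBlt` is missing is an assumption; the notes do not say how to treat houses without a garage.

```python
import pandas as pd

# Toy rows with made-up years; the second house has no garage year.
df = pd.DataFrame({
    "YrSold": [2008, 2007, 2010],
    "YearBuilt": [2003, 1976, 2009],
    "YearRemodAdd": [2003, 1990, 2010],
    "GarageYrBlt": [2003.0, None, 2011.0],
})


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
    out["RemodAge"] = out["YrSold"] - out["YearRemodAdd"]
    # Clip negatives to 0; a missing GarageYrBlt stays NaN here.
    out["GarageAge"] = (out["YrSold"] - out["GarageYrBlt"]).clip(lower=0)
    out["IsRemodeled"] = (out["YearBuilt"] != out["YearRemodAdd"]).astype(int)
    out["DecadeBuilt"] = (out["YearBuilt"] // 10) * 10
    return out


aged = add_time_features(df)
print(aged[["HouseAge", "RemodAge", "GarageAge", "IsRemodeled", "DecadeBuilt"]])
```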

🚪 Garage & Basement Features

HasGarage = binary flag, set when GarageType is present or GarageCars > 0.

GarageCapacityPerSF = GarageCars / GrLivArea → relative size.

FinishedBsmtPct = (TotalBsmtSF - BsmtUnfSF) / TotalBsmtSF → % finished basement.

HasBasement = binary if TotalBsmtSF > 0.
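
The garage and basement features might look like this on toy data. Two choices here are assumptions rather than part of the notes: `HasGarage` is set when either signal is present, and `FinishedBsmtPct` is set to 0 for houses without a basement.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values; the second house has no garage and no basement.
df = pd.DataFrame({
    "GarageType": ["Attchd", None, "Detchd"],
    "GarageCars": [2, 0, 1],
    "GrLivArea": [1600, 900, 1200],
    "TotalBsmtSF": [800, 0, 600],
    "BsmtUnfSF": [200, 0, 600],
})

out = df.copy()
out["HasGarage"] = (out["GarageType"].notna() | (out["GarageCars"] > 0)).astype(int)
out["GarageCapacityPerSF"] = out["GarageCars"] / out["GrLivArea"]
out["HasBasement"] = (out["TotalBsmtSF"] > 0).astype(int)

finished_sf = out["TotalBsmtSF"] - out["BsmtUnfSF"]
# Avoid 0/0 for houses without a basement, then fill with 0 (assumption).
out["FinishedBsmtPct"] = (finished_sf / out["TotalBsmtSF"].replace(0, np.nan)).fillna(0.0)

print(out[["HasGarage", "GarageCapacityPerSF", "HasBasement", "FinishedBsmtPct"]])
```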

🏡 Location & Lot Features

CornerLot = flag from LotConfig == 'Corner'.

LotFrontageRatio = LotFrontage / LotArea → shape measure.

Neighborhood_Tier = group neighborhoods by median SalePrice (high/med/low tier).

MSSubClass_Category = map MSSubClass codes into meaningful categories (e.g., “1-Story,” “2-Story,” “Split Level”).
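
A sketch of the location and lot features. The prices are made up, the three-way split uses `pd.qcut` on neighborhood medians as one possible tiering rule, and `SUBCLASS_MAP` is a deliberately partial, illustrative mapping (unlisted codes fall back to "Other").

```python
import pandas as pd

# Toy rows; neighborhood names are real labels, prices are made up.
df = pd.DataFrame({
    "LotConfig": ["Corner", "Inside", "Inside", "CulDSac", "Corner", "Inside"],
    "LotFrontage": [80.0, 60.0, 50.0, 40.0, 70.0, 65.0],
    "LotArea": [8000, 6000, 5000, 10000, 7000, 6500],
    "Neighborhood": ["NridgHt", "NridgHt", "NAmes", "NAmes", "MeadowV", "MeadowV"],
    "SalePrice": [300000, 320000, 200000, 210000, 100000, 120000],
    "MSSubClass": [20, 60, 80, 20, 60, 190],
})

# Partial, illustrative mapping only; extend it to cover all codes.
SUBCLASS_MAP = {20: "1-Story", 60: "2-Story", 80: "Split Level"}

out = df.copy()
out["CornerLot"] = (out["LotConfig"] == "Corner").astype(int)
out["LotFrontageRatio"] = out["LotFrontage"] / out["LotArea"]

# Tier neighborhoods by median SalePrice. Compute the medians on training
# rows only, since SalePrice is the target.
medians = out.groupby("Neighborhood")["SalePrice"].median()
tiers = pd.qcut(medians, q=3, labels=["low", "med", "high"]).astype(str)
out["Neighborhood_Tier"] = out["Neighborhood"].map(tiers)

out["MSSubClass_Category"] = out["MSSubClass"].map(SUBCLASS_MAP).fillna("Other")

print(out[["CornerLot", "LotFrontageRatio", "Neighborhood_Tier", "MSSubClass_Category"]])
```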

🔥 Quality / Condition Interactions

OverallQualityIndex = combine OverallQual (numeric) + OverallCond (ordinal) → more stable quality measure.

ExterScore = mean of ExterQual and ExterCond (after ordinal encoding).

GarageScore = mean of GarageQual and GarageCond (after ordinal encoding).

KitchenScore = just KitchenQual, or combine with Functional.

QualityAgeInteraction = OverallQual * HouseAge → newer but poor-quality vs. older but well-built.
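
The quality scores need the text ratings converted to numbers first. The sketch below assumes the usual Po < Fa < TA < Gd < Ex ordering mapped to 1 to 5, encodes a missing rating as 0, and uses a plain mean for `OverallQualityIndex`; the notes only say "combine", so that last choice is a placeholder.

```python
import pandas as pd

# Assumed ordinal encoding for the quality/condition ratings.
QUAL_MAP = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Toy rows with made-up values; the second house has no garage.
df = pd.DataFrame({
    "ExterQual": ["Gd", "TA"], "ExterCond": ["TA", "TA"],
    "GarageQual": ["TA", None], "GarageCond": ["Gd", None],
    "OverallQual": [7, 5], "OverallCond": [5, 6],
    "YrSold": [2008, 2007], "YearBuilt": [2003, 1976],
})


def encode_quality(series: pd.Series) -> pd.Series:
    """Map rating strings to integers; missing ratings become 0 (assumption)."""
    return series.map(QUAL_MAP).fillna(0).astype(int)


out = df.copy()
out["ExterScore"] = (encode_quality(out["ExterQual"]) + encode_quality(out["ExterCond"])) / 2
out["GarageScore"] = (encode_quality(out["GarageQual"]) + encode_quality(out["GarageCond"])) / 2
# Placeholder combination: simple mean of the two overall ratings.
out["OverallQualityIndex"] = (out["OverallQual"] + out["OverallCond"]) / 2
out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
out["QualityAgeInteraction"] = out["OverallQual"] * out["HouseAge"]

print(out[["ExterScore", "GarageScore", "OverallQualityIndex", "QualityAgeInteraction"]])
```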

⚡ Utility / Convenience Flags

HasCentralAir = from CentralAir.

Has2ndFlr = binary if 2ndFlrSF > 0.

HasPorch = binary if PorchSF > 0.

HasWoodDeck = binary if WoodDeckSF > 0.

HasMasonryVeneer = binary if MasVnrArea > 0.
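
The flags above are simple comparisons. A minimal sketch on toy rows, assuming `CentralAir` uses "Y"/"N" and that a missing `MasVnrArea` means no veneer:

```python
import pandas as pd

# Toy rows with made-up values.
df = pd.DataFrame({
    "CentralAir": ["Y", "N"],
    "2ndFlrSF": [600, 0],
    "OpenPorchSF": [40, 0], "EnclosedPorch": [0, 0],
    "3SsnPorch": [0, 0], "ScreenPorch": [0, 0],
    "WoodDeckSF": [0, 120],
    "MasVnrArea": [196.0, None],
})

out = df.copy()
porch_sf = out[["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]].sum(axis=1)

out["HasCentralAir"] = (out["CentralAir"] == "Y").astype(int)
out["Has2ndFlr"] = (out["2ndFlrSF"] > 0).astype(int)
out["HasPorch"] = (porch_sf > 0).astype(int)
out["HasWoodDeck"] = (out["WoodDeckSF"] > 0).astype(int)
# Assumption: a missing MasVnrArea counts as no veneer.
out["HasMasonryVeneer"] = (out["MasVnrArea"].fillna(0) > 0).astype(int)

print(out[["HasCentralAir", "Has2ndFlr", "HasPorch", "HasWoodDeck", "HasMasonryVeneer"]])
```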

🎯 Interaction Ideas (cross-bucket)

Size × Neighborhood: a big house in a low-value neighborhood doesn’t increase price as much as in a high-value one. Could be modeled with interaction terms.

Quality × Area: OverallQual * GrLivArea → large but poor-quality homes might not price the same as smaller but high-quality ones.

Condition × YearBuilt: newer homes in “adjacent to positive feature” conditions might be premium.
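
One way to express these three interactions as explicit columns. The output column names (`Qual_x_Area`, `GrLivArea_x_Tier_*`, `NearPositive_x_YearBuilt`) are hypothetical, and the sketch assumes `Neighborhood_Tier` has already been built and that "PosA" and "PosN" are the positive-feature codes in `Condition1`.

```python
import pandas as pd

# Toy rows with made-up values.
df = pd.DataFrame({
    "OverallQual": [8, 4],
    "GrLivArea": [1500, 2500],
    "Neighborhood_Tier": ["high", "low"],
    "Condition1": ["PosN", "Norm"],
    "YearBuilt": [2005, 1960],
})

out = df.copy()

# Quality x Area
out["Qual_x_Area"] = out["OverallQual"] * out["GrLivArea"]

# Size x Neighborhood: one area column per tier, zero outside that tier.
tier_dummies = pd.get_dummies(out["Neighborhood_Tier"], prefix="Tier").astype(int)
for col in tier_dummies.columns:
    out[f"GrLivArea_x_{col}"] = out["GrLivArea"] * tier_dummies[col]

# Condition x YearBuilt: year built, kept only for homes near a positive feature.
near_positive = out["Condition1"].isin(["PosA", "PosN"]).astype(int)
out["NearPositive_x_YearBuilt"] = near_positive * out["YearBuilt"]

print(out.filter(like="_x_"))
```

Tree-based models can often find such interactions on their own; explicit columns like these mainly help linear models.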

The cleanest approach is to build a pipeline for each feature bucket (e.g., skewed numeric, ordinal, nominal, year, binary), chaining the relevant steps like imputing, encoding, scaling, or transformations. These pipelines are then combined inside a ColumnTransformer, which applies the right sequence of transformations to each group while keeping the whole preprocessing reproducible and consistent.
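
A minimal sketch of that structure with scikit-learn, using three of the buckets and a handful of columns each. The specific steps are assumptions: median imputation plus `np.log1p` for skewed columns, most-frequent imputation for categoricals, and a Po < Fa < TA < Gd < Ex ordering for the ordinal columns. The toy rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, OneHotEncoder,
                                   OrdinalEncoder, StandardScaler)

# Small subsets of the buckets, for illustration only.
skewed = ["LotArea", "GrLivArea"]
nominal = ["Neighborhood", "Foundation"]
ordinal = ["ExterQual", "KitchenQual"]
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]  # assumed ordering

skewed_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])
nominal_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
ordinal_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(categories=[quality_order] * len(ordinal))),
])

preprocess = ColumnTransformer(
    [
        ("skewed", skewed_pipe, skewed),
        ("nominal", nominal_pipe, nominal),
        ("ordinal", ordinal_pipe, ordinal),
    ],
    remainder="drop",
    sparse_threshold=0,  # always return a dense array
)

# Toy rows with made-up values; one LotArea is missing on purpose.
X = pd.DataFrame({
    "LotArea": [8450, 9600, np.nan, 11250],
    "GrLivArea": [1710, 1262, 1786, 1717],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor"],
    "Foundation": ["PConc", "CBlock", "PConc", "BrkTil"],
    "ExterQual": ["Gd", "TA", "Gd", "TA"],
    "KitchenQual": ["Gd", "TA", "Ex", "Gd"],
})

Xt = preprocess.fit_transform(X)
print(Xt.shape)
```

Fitting `preprocess` on the training split and calling `transform` on the other splits keeps imputation values, scaling statistics, and category lists consistent across them.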