# Disclaimer
This is my first Kaggle submission. While I hope my work can be useful, please note that some steps may appear non-standard or unconventional. Any seemingly “weird” decisions were likely made intentionally based on the underlying logic and problem context.

---

# Notebook Overview

In this notebook, I perform the following key steps:

- **Basic Exploratory Data Analysis (EDA):**
    A high-level examination of the features to identify missing values and data types.
- **Feature Engineering:**
    Missing values are handled with group-based imputation, logical transformations, and categorical encoding.
- **Modeling and Hyperparameter Tuning:**
    Several models (`XGBoost`, `ElasticNet`, `Bagging Regressor`, and `Random Forest`) are tuned using `GridSearchCV` with `KFold` cross-validation. <br>
    **Training Metric:** `Root Mean Squared Error (RMSE)` computed between the logarithms of the predicted and observed sale prices.
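
For reference, the metric can be sketched as below. This is only an illustration with made-up prices, and `rmse_log` is a hypothetical helper name that does not appear elsewhere in the notebook. It uses `np.log1p`, the same transform this notebook applies to `SalePrice`.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log1p-transformed observed and predicted prices."""
    log_true = np.log1p(np.asarray(y_true, dtype=float))
    log_pred = np.log1p(np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((log_true - log_pred) ** 2)))

# Made-up prices, for illustration only
observed = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]
print(rmse_log(observed, predicted))
```

Working on the log scale means that errors are penalized in relative terms, so a 10% miss on a cheap house weighs about as much as a 10% miss on an expensive one.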

In [None]:
!pip install scikit-learn==1.2.2   # xgboost compatibility issue (February 2025) 

In [None]:
import pandas as pd
import numpy as np
import seaborn as sb

from copy import deepcopy
from pandas.api.types import is_numeric_dtype

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, ParameterGrid

import warnings
warnings.filterwarnings(action="ignore")

In [None]:
train_raw = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test_raw = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
y_train = train_raw[['Id', 'SalePrice']].copy()
y_train['SalePrice'] = np.log1p(y_train['SalePrice'])

x_id = test_raw['Id'].copy()

train_features = train_raw.drop('SalePrice', axis=1)
# ignore_index=True gives the stacked frame a unique row index
combined_raw = pd.concat([train_features, test_raw], axis=0, ignore_index=True)

# EDA

In [None]:
print("*** combined_raw")

print(f"--- Shape of the dataset: {combined_raw.shape}\n")
print(f"--- Null values:")
combined_null_stats = combined_raw.isnull().sum()
print(combined_null_stats[combined_null_stats > 0])

In [None]:
print("--- Dataset info:")
combined_raw.info()  # info() prints its report directly and returns None

# Cleanup Functions Declaration

The key idea behind the data cleanup is to fill missing values with as much meaningful information as possible and to drop low-information columns.

General and group-specific rules (garage/basement) were applied. The data was also transformed into numerical format.

The following function, `fillna_and_transform_numeric`, handles missing values in numeric features by performing median imputation within groups. The underlying assumption is that similar observations (as defined by one or more grouping features) tend to have similar values for a given numeric feature.

In [None]:
def fillna_and_transform_numeric(dataframe, numeric_features, grouping_features):

    for feature in numeric_features:
        # First try to fill using the full grouping_features list
        dataframe[feature] = dataframe[feature].fillna(
            dataframe.groupby(grouping_features)[feature].transform('median')
        )
        # If any values are still missing, fall back to grouping by the first grouping_feature only
        if dataframe[feature].isnull().any():
            dataframe[feature] = dataframe[feature].fillna(
                dataframe.groupby(grouping_features[0])[feature].transform('median')
            )

    return dataframe
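
The two-pass logic can be seen on a small made-up frame. This is a sketch on toy data, not the housing data; the column names are borrowed from the dataset purely for readability.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two neighborhoods, with gaps in LotFrontage
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "MSSubClass":   [20, 20, 60, 20, 20],
    "LotFrontage":  [60.0, 80.0, np.nan, np.nan, 50.0],
})

# Pass 1: median within each (Neighborhood, MSSubClass) group
fine = toy.groupby(["Neighborhood", "MSSubClass"])["LotFrontage"].transform("median")
toy["LotFrontage"] = toy["LotFrontage"].fillna(fine)

# Pass 2: the (A, 60) group has no observed value, so fall back to Neighborhood alone
coarse = toy.groupby("Neighborhood")["LotFrontage"].transform("median")
toy["LotFrontage"] = toy["LotFrontage"].fillna(coarse)

print(toy)
```

The `(B, 20)` gap is filled with 50 in the first pass, while the `(A, 60)` gap only gets a value (70, the median of neighborhood `A`) in the second pass.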

This function, `fillna_and_transform_categorical`, processes categorical columns by first converting non-missing category labels into integers and then imputing missing values with the most frequent category (mode) within similar groups.

In [None]:
def fillna_and_transform_categorical(dataframe, categorical_features, grouping_features):

    for feature in categorical_features:
        # Map non-missing categories to integers
        unique_categories = dataframe[feature].dropna().unique()
        category_mapping = {cat: idx for idx, cat in enumerate(unique_categories, start=1)}
        dataframe[feature] = dataframe[feature].map(category_mapping)

        # Fill missing values groupwise by mode; groups without a mode stay missing for now
        dataframe[feature] = dataframe.groupby(grouping_features)[feature].transform(
            lambda x: x.fillna(x.mode().iloc[0]) if not x.mode().empty else x
        )
        # Fallback: group by the first grouping feature only; use 0 if that group has no mode either
        if dataframe[feature].isnull().any():
            dataframe[feature] = dataframe.groupby(grouping_features[0])[feature].transform(
                lambda x: x.fillna(x.mode().iloc[0] if not x.mode().empty else 0)
            )

    return dataframe
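
A minimal sketch of the two steps (integer mapping, then group-wise mode fill) on made-up data, using a single grouping column for brevity:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "KitchenQual":  ["TA", "TA", np.nan, "Gd", np.nan],
})

# Map labels to integers in order of first appearance, starting at 1
labels = toy["KitchenQual"].dropna().unique()
mapping = {label: code for code, label in enumerate(labels, start=1)}
toy["KitchenQual"] = toy["KitchenQual"].map(mapping)

# Fill each gap with the most frequent code in its neighborhood
toy["KitchenQual"] = toy.groupby("Neighborhood")["KitchenQual"].transform(
    lambda s: s.fillna(s.mode().iloc[0]) if not s.mode().empty else s
)

print(mapping)
print(toy["KitchenQual"].tolist())
```

Note that the codes follow the order in which labels first appear, so they carry no ordinal meaning: here `TA` receives a lower code than `Gd` only because it shows up first.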

The function `fillna_and_transform_provided_zero_features` processes the features listed in `to_zero_features` so that they contain no missing values: non-numeric features are first mapped to integer codes, and any remaining missing values are then set to 0.

In [None]:
def fillna_and_transform_provided_zero_features(dataframe, to_zero_features):

    for feature in to_zero_features:
        # If feature is not numeric, first map unique categories to integers.
        if not is_numeric_dtype(dataframe[feature]):
            unique_categories = dataframe[feature].dropna().unique()
            category_mapping = {cat: idx for idx, cat in enumerate(unique_categories, start=1)}
            dataframe[feature] = dataframe[feature].map(category_mapping)
        # Finally, fill any missing values with 0.
        dataframe[feature] = dataframe[feature].fillna(0)

    return dataframe
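
As a small illustration on made-up values, the same logic applied to one text column and one numeric column looks like this. For a feature such as `PoolQC`, a missing value means the house has no pool, so 0 acts as an "absent" code.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

toy = pd.DataFrame({
    "PoolQC":     [np.nan, "Ex", np.nan, "Gd"],   # text labels
    "MasVnrArea": [120.0, np.nan, 0.0, 45.0],     # already numeric
})

for col in ["PoolQC", "MasVnrArea"]:
    if not is_numeric_dtype(toy[col]):
        # Text column: map labels to 1, 2, ... in order of first appearance
        labels = toy[col].dropna().unique()
        toy[col] = toy[col].map({label: code for code, label in enumerate(labels, start=1)})
    # Whatever is still missing becomes 0
    toy[col] = toy[col].fillna(0)

print(toy)
```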

This function, `fillna_and_transform_basement`, handles missing values for basement-related features. It treats numeric and categorical basement columns separately and uses the key column (`TotalBsmtSF`) to guide the imputation logic.

In [None]:
def fillna_and_transform_basement(dataframe, basement_numeric_cols, basement_categorical_cols, grouping_features, key_col):
    # Ensure key col (e.g. TotalBsmtSF) has no NAs
    dataframe[key_col] = dataframe[key_col].fillna(0)

    # Create masks for rows with and without basement info.
    mask_no_basement = dataframe[key_col] == 0
    mask_with_basement = ~mask_no_basement

    # Case 1: For rows with no basement info, set both numeric and categorical columns to 0.
    dataframe.loc[mask_no_basement, basement_numeric_cols] = 0
    dataframe.loc[mask_no_basement, basement_categorical_cols] = 0

    # Case 2: For rows with basement info
    # For numeric columns: fill missing values with an equal split of the deficit.
    def fill_numeric_basement(row):
        if row[key_col] > 0:
            missing = [col for col in basement_numeric_cols if pd.isna(row[col])]
            if missing:
                known_sum = row[basement_numeric_cols].sum(skipna=True)
                deficit = row[key_col] - known_sum
                if deficit < 0:
                    deficit = 0
                fill_value = deficit / len(missing)
                for col in missing:
                    row[col] = fill_value
        return row

    dataframe.loc[mask_with_basement, :] = dataframe.loc[mask_with_basement, :].apply(fill_numeric_basement, axis=1)

    # For categorical columns: apply mapping and groupwise fill only to rows with basement info.
    for col in basement_categorical_cols:
        # Only consider rows with basement (mask_with_basement) and non-zero, non-null values.
        non_missing = dataframe.loc[mask_with_basement & dataframe[col].notna() & (dataframe[col] != 0), col].unique()
        category_mapping = {cat: idx for idx, cat in enumerate(non_missing, start=1)}

        # Apply mapping only on rows with basement info.
        dataframe.loc[mask_with_basement, col] = dataframe.loc[mask_with_basement, col].map(category_mapping)

        # Groupwise fill missing values using mode (only among rows with basement info).
        dataframe.loc[mask_with_basement, col] = dataframe.loc[mask_with_basement].groupby(grouping_features)[col].transform(
            lambda x: x.fillna(x.mode().iloc[0] if not x.mode().empty else 0)
        )

        # Any remaining missing values in rows with basement are set to 0.
        dataframe.loc[mask_with_basement, col] = dataframe.loc[mask_with_basement, col].fillna(0)

    return dataframe
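
The "equal split of the deficit" rule for numeric columns can be checked by hand on a single made-up row. The sketch below uses only the three area columns for simplicity, whereas the function above works on the full `basement_numeric_cols` list it is given.

```python
import numpy as np
import pandas as pd

area_cols = ["BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF"]
row = pd.Series({"TotalBsmtSF": 1000.0, "BsmtFinSF1": 600.0,
                 "BsmtFinSF2": np.nan, "BsmtUnfSF": np.nan})

missing = [col for col in area_cols if pd.isna(row[col])]
known_sum = row[area_cols].sum(skipna=True)          # 600.0
deficit = max(row["TotalBsmtSF"] - known_sum, 0)     # 400.0, floored at 0
row[missing] = deficit / len(missing)                # 200.0 for each missing column

print(row)
```

After the fill, the three area columns add up to `TotalBsmtSF` again.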


The function `fillna_and_transform_garage` handles missing values and standardizes the garage-related features. It uses the key column (`GarageType`) to split the data into rows with and without a garage, then imputes or maps features accordingly.

In [None]:
def fillna_and_transform_garage(dataframe, garage_cols, key_col):
    # Replace missing GarageType with 0 and create masks
    dataframe[key_col] = dataframe[key_col].fillna(0)
    mask_no_garage = dataframe[key_col] == 0
    mask_garage = ~mask_no_garage

    # For rows without GarageType (no garage info), set all garage-related columns to 0.
    dataframe.loc[mask_no_garage, garage_cols] = 0

    # For rows with GarageType information, impute numeric values.
    # Impute 'GarageCars' using mode within each GarageType group.
    mode_by_type = dataframe.loc[mask_garage].groupby(key_col)['GarageCars'].agg(
        lambda x: x.mode().iloc[0] if not x.mode().empty else 0
    )
    def impute_garage_cars(row):
        if pd.isna(row['GarageCars']):
            return mode_by_type.get(row[key_col], 0)
        return row['GarageCars']
    dataframe.loc[mask_garage, 'GarageCars'] = dataframe.loc[mask_garage].apply(impute_garage_cars, axis=1)

    # Impute 'GarageArea' using median within each GarageType group.
    median_by_type = dataframe.loc[mask_garage].groupby(key_col)['GarageArea'].median()
    def impute_garage_area(row):
        if pd.isna(row['GarageArea']):
            return median_by_type.get(row[key_col], 0)
        return row['GarageArea']
    dataframe.loc[mask_garage, 'GarageArea'] = dataframe.loc[mask_garage].apply(impute_garage_area, axis=1)

    # Process remaining categorical garage columns: 'GarageFinish', 'GarageQual', 'GarageCond'
    cat_cols = ['GarageFinish', 'GarageQual', 'GarageCond']
    for col in cat_cols:
        # Restrict processing to rows with garage information; exclude 0 from unique values.
        non_missing = dataframe.loc[mask_garage & (dataframe[col] != 0) & (dataframe[col].notna()), col].unique()
        mapping = {cat: idx for idx, cat in enumerate(non_missing, start=1)}
        dataframe.loc[mask_garage, col] = dataframe.loc[mask_garage, col].map(mapping)

        # Within each GarageType group, fill missing values in the column using mode.
        dataframe.loc[mask_garage, col] = dataframe.loc[mask_garage].groupby(key_col)[col].transform(
            lambda x: x.fillna(x.mode().iloc[0] if not x.mode().empty else 0)
        )
        # Any leftover missing values are set to 0.
        dataframe.loc[mask_garage, col] = dataframe.loc[mask_garage, col].fillna(0)

    return dataframe
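
The split into "no garage" and "garage present" rows can be sketched on made-up data as follows. Instead of the row-wise `apply` used above, this illustration fills the gaps with a vectorized `groupby(...).transform("median")`, which is assumed to give the same values for `GarageArea`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "GarageType": ["Attchd", "Attchd", "Detchd", np.nan, "Detchd"],
    "GarageArea": [400.0, 500.0, np.nan, np.nan, 300.0],
})

# A missing GarageType marks a house without a garage
toy["GarageType"] = toy["GarageType"].fillna(0)
has_garage = toy["GarageType"] != 0

# No garage: the area is set to 0
toy.loc[~has_garage, "GarageArea"] = 0

# Garage present: fill gaps with the median area of the same GarageType
medians = toy[has_garage].groupby("GarageType")["GarageArea"].transform("median")
toy.loc[has_garage, "GarageArea"] = toy.loc[has_garage, "GarageArea"].fillna(medians)

print(toy)
```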

The function `transform_categorical_grouping_features` converts the specified grouping features from categorical labels into numeric codes.

In [None]:
def transform_categorical_grouping_features(dataframe, grouping_features):
    for feature in grouping_features:
        unique_categories = dataframe[feature].dropna().unique()
        mapping = {cat: idx for idx, cat in enumerate(unique_categories, start=1)}
        dataframe[feature] = dataframe[feature].map(mapping)
    return dataframe


The function `drop_low_information_features` removes columns deemed to have low information content. It is defined here as a helper; `data_transformation` below does not call it.

In [None]:
def drop_low_information_features(dataframe, features_to_drop):
    dataframe.drop(features_to_drop, axis=1, inplace=True)

    return dataframe

In [None]:
def data_transformation(df):

  grouping_features=['Neighborhood', 'MSSubClass']
  grouping_keys_for_transform = ['GarageType']

  # Feature groups used by the cleanup steps

  to_zero_features = ['MasVnrArea','FireplaceQu', 'GarageYrBlt', 'PoolQC',
                      'MiscFeature','Alley','Fence','MasVnrType']

  numeric_features = ['LotFrontage']

  # 'GarageType' is deliberately left out: fillna_and_transform_garage needs its
  # missing values intact to detect houses without a garage.
  categorical_features = ['MSZoning', 'Exterior1st', 'Exterior2nd', 'KitchenQual',
                          'Utilities', 'Functional', 'SaleType', 'Street',
                          'LotShape', 'LandContour', 'LotConfig', 'LandSlope',
                          'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
                          'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation',
                          'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
                          'PavedDrive', 'SaleCondition']

  basement_numeric_cols = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'BsmtFullBath', 'BsmtHalfBath']
  basement_categorical_cols = ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']
  key_col_basement = 'TotalBsmtSF'

  garage_cols = ['GarageCars', 'GarageArea', 'GarageFinish', 'GarageQual', 'GarageCond']
  key_col_garage = 'GarageType'

  # Work on a copy so the raw dataframe stays untouched

  df_transformed = deepcopy(df)

  df_transformed = fillna_and_transform_provided_zero_features(df_transformed, to_zero_features)
  df_transformed = fillna_and_transform_numeric(df_transformed, numeric_features, grouping_features)
  df_transformed = fillna_and_transform_categorical(df_transformed, categorical_features, grouping_features)

  df_transformed = fillna_and_transform_basement(df_transformed, basement_numeric_cols,
                                basement_categorical_cols, grouping_features, key_col_basement)

  df_transformed = fillna_and_transform_garage(df_transformed, garage_cols, key_col_garage)

  df_transformed = transform_categorical_grouping_features(df_transformed, grouping_features)
  df_transformed = transform_categorical_grouping_features(df_transformed, grouping_keys_for_transform)

  for feature in df_transformed.columns:
    df_transformed[feature] = df_transformed[feature].astype('int64')

  return df_transformed

# Combined Dataframe : Employing Cleanup

In [None]:
combined = data_transformation(combined_raw)

In [None]:
print("*** combined")

print(f"--- Shape of the dataset:{combined.shape}")
print(f"--- Null values: ", end="")
test_null_stats = combined.isnull().sum()
print(test_null_stats[test_null_stats > 0])

In [None]:
X_train = combined[combined['Id'].isin(y_train['Id'])].copy()
X_train.drop(['Id'],axis=1,inplace=True)
X_test = combined[~combined['Id'].isin(y_train['Id'])].copy()
X_test.drop(['Id'],axis=1,inplace=True)

y_train = y_train['SalePrice']  # 1-D target, as scikit-learn estimators expect

# Models & Hyperparameters Declaration

In [None]:
model_dict = {
    'xgb': {
        'model': xgb.XGBRegressor(random_state=42),
        'params': {
            'n_estimators': [100, 200, 300, 400, 500],
            'learning_rate': [0.04],
            'max_depth': [2, 3, 4, 5],
            'min_child_weight': [1, 2, 3]
        }
    },
    'elasticnet': {
        'model': ElasticNet(random_state=42, max_iter=1000),
        'params': {
            'alpha': [0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1, 1.25, 1.5],
            'l1_ratio': [0, 0.25, 0.5, 0.75, 1]
        }
    },
    'bagging': {
        'model': BaggingRegressor(random_state=42),
        'params': {
            'n_estimators': [375, 400, 425],
            'max_samples': [0.3, 0.4, 0.5]
        }
    },
    'random_forest': {
        'model': RandomForestRegressor(random_state=42),
        'params': {
            'n_estimators': [300, 400, 500],
            'max_depth': [10, 15, 20, 25, 30],
            'min_samples_split': [2, 3, 4]
        }
    }
}


cv = KFold(n_splits=5, shuffle=True, random_state=42)
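
To get a feel for the cost of the search, the number of model fits implied by each grid can be counted with `ParameterGrid`. The sketch below repeats the grids from the cell above in a standalone dictionary (`param_grids` is a name introduced here for illustration only).

```python
from sklearn.model_selection import ParameterGrid

n_splits = 5
param_grids = {
    "xgb": {"n_estimators": [100, 200, 300, 400, 500], "learning_rate": [0.04],
            "max_depth": [2, 3, 4, 5], "min_child_weight": [1, 2, 3]},
    "elasticnet": {"alpha": [0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1, 1.25, 1.5],
                   "l1_ratio": [0, 0.25, 0.5, 0.75, 1]},
    "bagging": {"n_estimators": [375, 400, 425], "max_samples": [0.3, 0.4, 0.5]},
    "random_forest": {"n_estimators": [300, 400, 500], "max_depth": [10, 15, 20, 25, 30],
                      "min_samples_split": [2, 3, 4]},
}

# Every parameter combination is fitted once per fold
fits = {name: len(ParameterGrid(grid)) * n_splits for name, grid in param_grids.items()}
print(fits)
```

With 5 folds this amounts to 300 fits for `xgb`, 225 each for `elasticnet` and `random_forest`, and 45 for `bagging`, plus one final refit per model on the full training set.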

# Best Model Search

In [None]:
best_models = {}
for name, definition in model_dict.items():
    model = definition['model']
    param_grid = definition['params']

    grid = GridSearchCV(model, param_grid, cv=cv,
                        scoring='neg_root_mean_squared_error',
                        n_jobs=-1)
    grid.fit(X_train, y_train)

    best_models[name] = grid
    print(f"Model: {name}")
    print("Best Parameters:", grid.best_params_)
    print("Best RMSE:", -grid.best_score_)
    print("--------------------------------------------------")
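
The sign flip in `-grid.best_score_` comes from the scorer: `neg_root_mean_squared_error` returns the negated RMSE so that "higher is better" holds for every scorer. A minimal sketch on synthetic data (not the housing data) shows this behaviour:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

search = GridSearchCV(
    ElasticNet(random_state=42),
    {"alpha": [0.01, 0.1, 1.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

print(search.best_score_ <= 0)   # scores are negated RMSE values
print(-search.best_score_)       # flip the sign to read it as an RMSE
print(search.best_estimator_)    # already refit on all rows (refit=True by default)
```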

# Employing Best Model on Test Dataset & Submission File Generation

In [None]:
best_model_name = min(best_models, key=lambda model: -best_models[model].best_score_)

model_details = best_models[best_model_name]

print("Best model based on RMSE:",best_model_name)
print("Best Parameters:", model_details.best_params_)
print("Best RMSE:", -model_details.best_score_)

print("\nRetrieving best model...")

# GridSearchCV (refit=True by default) has already refit the best configuration
# on the full training set, keeping base settings such as random_state.
best_model = model_details.best_estimator_

In [None]:
predictions = best_model.predict(X_test)

predicted_prices = np.expm1(predictions)

pd.DataFrame({'Id': x_id, 'SalePrice': predicted_prices}).to_csv('predictions.csv', index=False)
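
Since the models are trained on `np.log1p(SalePrice)`, the predictions must be mapped back with `np.expm1` before submission. A quick round-trip check on made-up prices confirms that the two functions are inverses of each other:

```python
import numpy as np

prices = np.array([34_900.0, 163_000.0, 755_000.0])  # made-up values
log_prices = np.log1p(prices)     # transform applied to the training target
restored = np.expm1(log_prices)   # inverse applied to the predictions

print(np.allclose(restored, prices))
```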