# Motivation and Other Submissions
Eda.ipynb and model.ipynb were my quick first attempt (score around 0.135):
- No feature engineering
- Sqrt of SalePrice
- Ordinal encoding for all categorical variables
- Grid search on XGBoost (single model)

This second model (model2) is inspired by other Kaggle submissions, summarized below.

[Top 1% Solution w/ Data Leakage](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/discussion/83751)

Feature Engineering

1. Fill NA values (some data leakage here from using medians)
2. Fill NA numerics with 0
3. Transform numeric features with Box-Cox (data leakage here as well)
4. Summed and year features (e.g. total bathrooms = sum of the bathroom counts)
5. Has features (has pool, has 2nd floor, etc.)

Additional Engineering

6. Add dummies for categorical variables
7. Remove columns where one value dominates
8. Remove outliers
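
The leakage flagged in steps 1 and 3 presumably comes from computing medians and Box-Cox parameters on train and test data together. A minimal sketch of the leakage-free variant, assuming SciPy is available: every statistic is estimated on the training data and then reused on the test data. The data is synthetic, and `fit_boxcox_lambda` and `apply_boxcox` are hypothetical helper names, not taken from the linked solution.

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax


def fit_boxcox_lambda(train_values):
    """Estimate the Box-Cox lambda from training data only."""
    # boxcox_normmax needs strictly positive input, hence the +1 shift
    return boxcox_normmax(np.asarray(train_values, dtype=float) + 1)


def apply_boxcox(values, lmbda):
    """Apply boxcox1p, i.e. Box-Cox of (1 + x), with a fixed lambda."""
    return boxcox1p(np.asarray(values, dtype=float), lmbda)


rng = np.random.default_rng(0)
train_area = rng.lognormal(mean=7.0, sigma=0.5, size=500)  # synthetic skewed feature
test_area = rng.lognormal(mean=7.0, sigma=0.5, size=200)

train_median = np.median(train_area)      # imputation value taken from train only
lmbda = fit_boxcox_lambda(train_area)     # lambda taken from train only
train_t = apply_boxcox(train_area, lmbda)
test_t = apply_boxcox(test_area, lmbda)   # the train lambda is reused on test
```

The "with leakage" variant would differ only in passing the concatenated train and test values to `fit_boxcox_lambda` and `np.median`.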

Modelling

1. Ridge Regression (with RobustScaler)
2. Lasso Regression
3. Elastic Net Regression
4. Support Vector Regression
5. Gradient Boosting Regressor
6. Light GBM
7. XGBoost
8. StackingCVRegressor w/ XGBoost as meta regressor

The final prediction is a blend of all the models above (including the stacking regressor as one of them).
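
Blending here can be read as a weighted average of each model's predictions. A minimal sketch under that assumption; the model names, prediction values, and weights are made up for illustration and are not those of the linked solution, and `blend_predictions` is a hypothetical helper.

```python
import numpy as np


def blend_predictions(predictions, weights):
    """Weighted average of per-model predictions.

    predictions: dict mapping model name -> 1-D array of predictions
    weights:     dict mapping model name -> weight (weights must sum to 1)
    """
    total = sum(weights.values())
    if not np.isclose(total, 1.0):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return sum(weights[name] * np.asarray(predictions[name], dtype=float)
               for name in weights)


# Illustrative log-price predictions from three fitted models
preds = {
    'ridge': [12.0, 11.5],
    'xgb': [12.2, 11.7],
    'stack': [12.1, 11.6],
}
weights = {'ridge': 0.2, 'xgb': 0.3, 'stack': 0.5}
blended = blend_predictions(preds, weights)
```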

[Top 4% no Data Leakage](https://www.kaggle.com/code/miftahuladib/housing-price-regression-top-4?scriptVersionId=202452540)

Feature Engineering

1. Sum features (like baths, porcharea, rooms)
2. Year features (ex. transform to 2025 - yearbuilt)
3. Fill NA Values (simply fill with 0 or 'No')
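
A rough sketch of these three steps on a toy frame. The output column names `TotalBaths` and `HouseAge` and the helper `add_sum_and_age_features` are hypothetical, and filling is done first here so that missing values do not propagate into the sums, which may differ from the order used in the linked notebook.

```python
import pandas as pd

REFERENCE_YEAR = 2025  # reference year from the example above


def add_sum_and_age_features(df):
    """Return a copy with NA filled and simple sum/age features added."""
    out = df.copy()
    num_cols = out.select_dtypes(include='number').columns
    other_cols = out.columns.difference(num_cols)
    out[num_cols] = out[num_cols].fillna(0)         # numeric NA -> 0
    out[other_cols] = out[other_cols].fillna('No')  # categorical NA -> 'No'
    out['TotalBaths'] = out['FullBath'] + 0.5 * out['HalfBath']
    out['HouseAge'] = REFERENCE_YEAR - out['YearBuilt']
    return out


toy = pd.DataFrame({
    'FullBath': [2, 1],
    'HalfBath': [1, None],
    'YearBuilt': [2000, 1950],
    'PoolQC': [None, 'Gd'],
})
engineered = add_sum_and_age_features(toy)
```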

Additional Engineering

1. Remove Outliers (based on scatterplots of various numerical features)
2. ColumnTransformer (numeric --> StandardScaler, ordinal --> OrdinalEncoder, categorical --> OneHotEncoder)
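
A sketch of what such a ColumnTransformer might look like in scikit-learn. The column choices, the toy data, and the quality ordering (with 'No' as the lowest level) are assumptions for illustration, not copied from the linked notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

numeric_cols = ['GrLivArea']
ordinal_cols = ['ExterQual']
nominal_cols = ['Neighborhood']
quality_order = ['No', 'Po', 'Fa', 'TA', 'Gd', 'Ex']  # assumed worst-to-best order

preprocessor = ColumnTransformer(
    [
        ('num', StandardScaler(), numeric_cols),
        ('ord', OrdinalEncoder(categories=[quality_order]), ordinal_cols),
        ('nom', OneHotEncoder(handle_unknown='ignore'), nominal_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy = pd.DataFrame({
    'GrLivArea': [1500.0, 2000.0, 1000.0],
    'ExterQual': ['TA', 'Gd', 'Ex'],
    'Neighborhood': ['NAmes', 'CollgCr', 'NAmes'],
})
encoded = preprocessor.fit_transform(toy)
# Columns: scaled GrLivArea, ordinal ExterQual, one-hot Neighborhood (2 columns)
```

Passing an explicit category order to `OrdinalEncoder` keeps the integer codes meaningful (better quality gets a larger code), whereas the default order is alphabetical.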

Modelling

1. Random forest regressor
2. XGBoost
3. Ridge regression
4. Light GBM
5. CatBoost
6. VotingRegressor (not used)
7. StackingRegressor
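
For reference, a small sketch of scikit-learn's `StackingRegressor` on synthetic data. The base models, the Ridge final estimator, and all hyperparameters are placeholder assumptions; the linked notebook stacks a different and larger set of models.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
coefs = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_toy = X_toy @ coefs + rng.normal(scale=0.1, size=200)  # synthetic linear target

stack = StackingRegressor(
    estimators=[
        ('ridge', Ridge(alpha=1.0)),
        ('rf', RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),  # meta-model fit on out-of-fold base predictions
    cv=5,
)
stack.fit(X_toy[:150], y_toy[:150])
holdout_pred = stack.predict(X_toy[150:])
holdout_r2 = stack.score(X_toy[150:], y_toy[150:])
```

Unlike a fixed-weight blend, the final estimator learns how much to trust each base model from cross-validated predictions.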

# Goals and Plan

Goal: Understand which components of the other submissions matter most (how much each one affects the score)

These components include:

1. Feature Engineering (which new features are best: summed features, year features, has features)
2. Feature Transformation and Filling (Box-Cox transformation, filling NA with placeholders such as 0/'No' vs. estimated values)
3. Data leakage (how much does it help)
4. Column Transformers (scaling, ordinal encoding vs. one-hot encoding)
5. Removing outliers
6. Model blending

Base model:
1. All possible engineered features
2. No boxcox transformations, fill na with 0/'No'
3. No scaling, all ordinal transform
4. No outlier removal
5. XGBoost only

Change each of the following one at a time, relative to the base model, and record the test score:
1. Removing engineered features
2. Boxcox transformation (with and without data leakage)
3. Scaling + Separate ordinal and One Hot Transformations
4. Outlier removal
5. Models blended
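
The one-change-at-a-time plan can be sketched as a small ablation loop. This example covers only the first item (removing feature groups), uses synthetic data, and swaps in Ridge for XGBoost to stay lightweight; `cv_rmse` and `ablation_table` are hypothetical helpers.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def cv_rmse(X, y, model=None):
    """Mean 5-fold cross-validated RMSE (lower is better)."""
    model = model if model is not None else Ridge(alpha=1.0)
    scores = cross_val_score(model, X, y, cv=5,
                             scoring='neg_root_mean_squared_error')
    return -scores.mean()


def ablation_table(X, y, feature_groups):
    """Score the base feature set, then each variant with one group removed."""
    rows = [{'removed': 'none (base)', 'cv_rmse': cv_rmse(X, y)}]
    for name, cols in feature_groups.items():
        rows.append({'removed': name,
                     'cv_rmse': cv_rmse(X.drop(columns=cols), y)})
    return pd.DataFrame(rows)


rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'signal': rng.normal(size=300),
    'noise': rng.normal(size=300),
    'other': rng.normal(size=300),
})
y_demo = 3.0 * X_demo['signal'] + rng.normal(scale=0.1, size=300)

table = ablation_table(X_demo, y_demo,
                       {'signal': ['signal'], 'noise': ['noise']})
```

A group whose removal raises the RMSE well above the base row is doing real work; a group whose removal leaves it unchanged is a candidate to drop.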

In [56]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import make_scorer,root_mean_squared_error

In [57]:
train_path = './data/train.csv'
df = pd.read_csv(train_path)
X, Y = df.drop(labels=['SalePrice'], axis=1), np.log1p(df['SalePrice'])

In [58]:
# Fill NA

# display(X.isna().sum().T.sort_values(ascending=False)[:20])

numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
def fillNull(X):
    for col in X.columns:
        if X[col].dtype in numeric_dtypes: X[col] = X[col].fillna(0)
        else: X[col] = X[col].fillna('No')

fillNull(X)

# display(X.isna().sum().T.sort_values(ascending=False)[:20])

In [60]:
# Feature Engineering (simply taken from other Kaggle submissions, goal is to see effectiveness)

def featureEngineering(X: pd.DataFrame):
    # Summed features
    X['YrBltAndRemod']=X['YearBuilt']+X['YearRemodAdd']
    X['TotalSF']=X['TotalBsmtSF'] + X['1stFlrSF'] + X['2ndFlrSF']
    X['Total_sqr_footage'] = (X['BsmtFinSF1'] + X['BsmtFinSF2'] +
                                    X['1stFlrSF'] + X['2ndFlrSF'])
    X['Total_Bathrooms'] = (X['FullBath'] + (0.5 * X['HalfBath']) +
                                X['BsmtFullBath'] + (0.5 * X['BsmtHalfBath']))
    X['Total_porch_sf'] = (X['OpenPorchSF'] + X['3SsnPorch'] +
                                X['EnclosedPorch'] + X['ScreenPorch'] +
                                X['WoodDeckSF'])

    # Has features (1 if the corresponding area/count is positive, else 0)
    X['haspool'] = (X['PoolArea'] > 0).astype(int)
    X['has2ndfloor'] = (X['2ndFlrSF'] > 0).astype(int)
    X['hasgarage'] = (X['GarageArea'] > 0).astype(int)
    X['hasbsmt'] = (X['TotalBsmtSF'] > 0).astype(int)
    X['hasfireplace'] = (X['Fireplaces'] > 0).astype(int)

    # Remove useless features
    # NOTE: this only rebinds the local name X, so the caller's DataFrame keeps these columns
    X = X.drop(labels=['Id', 'Utilities', 'Street', 'PoolQC'], axis=1)

featureEngineering(X)

In [61]:
categorical_cols = [col for col in X.columns if X[col].dtype not in numeric_dtypes]

pipeline = Pipeline([
    ('encoder', ColumnTransformer([
        ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', 
                                    unknown_value=-1), 
        categorical_cols)
    ], remainder='passthrough')),
    
    ('xgb', xgb.XGBRegressor(
        random_state=42
    ))
])

In [70]:
params = {
    'xgb__n_estimators': [100,500,3000],
    'xgb__learning_rate': [0.005, 0.01, 0.1],
    'xgb__max_depth': [3]
    }

gridsearch = GridSearchCV(
    pipeline,
    scoring='neg_root_mean_squared_error',
    param_grid=params,
    n_jobs=-1,
    cv=5,
    return_train_score=True
    )

In [71]:
gridsearch.fit(X, Y)

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



In [74]:
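results = pd.DataFrame(gridsearch.cv_results_['params'])
# Scores use 'neg_root_mean_squared_error', i.e. negated RMSE (closer to 0 is better)
results['train_rmse'] = gridsearch.cv_results_['mean_train_score']
results['test_rmse'] = gridsearch.cv_results_['mean_test_score']  # mean over CV validation folds
results = results.sort_values(by='test_rmse', ascending=False)  # best (least negative) first
results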

| | xgb__learning_rate | xgb__max_depth | xgb__n_estimators | train_rmse | test_rmse |
|---|---|---|---|---|---|
| 5 | 0.01 | 3 | 3000 | -0.052551 | -0.123028 |
| 2 | 0.005 | 3 | 3000 | -0.07097 | -0.123848 |
| 7 | 0.1 | 3 | 500 | -0.035925 | -0.124003 |
| 8 | 0.1 | 3 | 3000 | -0.001917 | -0.124582 |
| 6 | 0.1 | 3 | 100 | -0.080482 | -0.124786 |
| 4 | 0.01 | 3 | 500 | -0.09911 | -0.132139 |
| 1 | 0.005 | 3 | 500 | -0.133338 | -0.153381 |
| 3 | 0.01 | 3 | 100 | -0.215986 | -0.223903 |
| 0 | 0.005 | 3 | 100 | -0.283858 | -0.287842 |


In [75]:
test_path = './data/test.csv'
test_df = pd.read_csv(test_path)
fillNull(test_df)
featureEngineering(test_df)

test_y = gridsearch.best_estimator_.predict(test_df)
display(gridsearch.best_params_)

test_df['SalePrice'] = np.expm1(test_y)
test_df[['Id', 'SalePrice']].to_csv('results.csv', index=False)

{'xgb__learning_rate': 0.01, 'xgb__max_depth': 3, 'xgb__n_estimators': 3000}

In [77]:
solution_path = './data/solution.csv'
solution_df = pd.read_csv(solution_path)
score = root_mean_squared_error(np.log(test_df['SalePrice']), np.log(solution_df['SalePrice']))
score

0.13113237881924056