# Motivation and Other Submissions
`Eda.ipynb` and `model.ipynb` contain my quick first model (score around 0.135):
- No feature engineering
- Sqrt of SalePrice
- Ordinal encoding for all categorical variables
- Grid search on XGBoost (single model)

This second model (`model2`) is inspired by other Kaggle submissions.

[Top 1% Solution w/ Data Leakage](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/discussion/83751)

Feature Engineering

1. Fill NA values (filling with medians introduces some data leakage)
2. Fill NA numerics with 0
3. Transform numeric features with Box-Cox (data leakage here as well)
4. Summed and year features (e.g. total bathrooms = sum of the bathroom counts)
5. Has features (has pool, has 2nd floor, etc.)
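
The leakage in step 1 comes from computing the fill value over train and test together, so test rows influence the training features. A minimal sketch of the difference, assuming hypothetical toy frames `train` and `test` with made-up `LotFrontage` values (not taken from the competition data):

```python
import numpy as np
import pandas as pd

# Toy frames; the values are illustrative only.
train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 200.0]})

# Leaky variant: the median is computed over train and test together.
leaky_median = pd.concat([train, test])["LotFrontage"].median()

# Leakage-free variant: the median comes from the training rows only,
# and the same value is reused to fill the test rows.
train_median = train["LotFrontage"].median()
train_filled = train.fillna({"LotFrontage": train_median})
test_filled = test.fillna({"LotFrontage": train_median})
```

In this toy case the two medians differ (75 vs. 70) only because of the large test value, which is exactly the information a model should not see during training.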

Additional Engineering

6. Add dummies for categorical variables
7. Remove columns where one value dominates
8. Remove outliers
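
Step 7 can be sketched as follows. The helper name `dominated_columns` and the threshold are assumptions for illustration; the submission's exact cutoff is not reproduced here.

```python
import pandas as pd

def dominated_columns(df: pd.DataFrame, threshold: float = 0.99) -> list:
    """Return columns whose most frequent value covers more than `threshold` of the rows."""
    dominated = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the top share
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share > threshold:
            dominated.append(col)
    return dominated

# Toy frame with illustrative values
toy = pd.DataFrame({
    "Street": ["Pave"] * 99 + ["Grvl"],   # one value covers 99% of rows
    "PoolArea": [0] * 100,                # constant column
    "LotArea": list(range(100)),          # all values distinct
})
to_drop = dominated_columns(toy, threshold=0.95)
reduced = toy.drop(columns=to_drop)
```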

Modelling

1. Ridge Regression (with RobustScaler)
2. Lasso Regression
3. Elastic Net Regression
4. Support Vector Regression
5. Gradient Boosting Regressor
6. LightGBM
7. XGBoost
8. StackingCVRegressor w/ XGBoost as meta regressor

The final prediction is a blend of all the models above (including the stacking regressor as one of them).
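
A blend of this kind is typically a weighted average of each model's predictions. A minimal sketch, assuming predictions in log-price space; the model names, weights, and values are hypothetical and not the submission's actual ones:

```python
import numpy as np

def blend(predictions: dict, weights: dict) -> np.ndarray:
    """Weighted average of per-model predictions; the weights must sum to 1."""
    if not np.isclose(sum(weights.values()), 1.0):
        raise ValueError("blend weights must sum to 1")
    return sum(weights[name] * np.asarray(pred) for name, pred in predictions.items())

# Hypothetical log-price predictions from three fitted models
preds = {
    "ridge": np.array([12.0, 11.5]),
    "xgb": np.array([12.2, 11.7]),
    "stack": np.array([12.1, 11.6]),
}
weights = {"ridge": 0.25, "xgb": 0.25, "stack": 0.5}
blended = blend(preds, weights)
```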

[Top 4% no Data Leakage](https://www.kaggle.com/code/miftahuladib/housing-price-regression-top-4?scriptVersionId=202452540)

Feature Engineering

1. Sum features (like baths, porcharea, rooms)
2. Year features (e.g. transform to 2025 - YearBuilt)
3. Fill NA Values (simply fill with 0 or 'No')
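
The year transformation in step 2 can be sketched like this. The new column names (`HouseAge`, `RemodAge`, `AgeAtSale`) are hypothetical, and the age-at-sale variant is my own assumption rather than something taken from the submission:

```python
import pandas as pd

REFERENCE_YEAR = 2025  # the constant used in the example above

# Toy rows with illustrative values
toy = pd.DataFrame({
    "YearBuilt": [2003, 1976],
    "YearRemodAdd": [2003, 1990],
    "YrSold": [2008, 2007],
})

# Turn calendar years into ages relative to a fixed reference year
toy["HouseAge"] = REFERENCE_YEAR - toy["YearBuilt"]
toy["RemodAge"] = REFERENCE_YEAR - toy["YearRemodAdd"]

# Alternative (an assumption): age of the house at the time it was sold
toy["AgeAtSale"] = toy["YrSold"] - toy["YearBuilt"]
```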

Additional Engineering

1. Remove Outliers (based on scatterplots of various numerical features)
2. Column Transformer (numeric --> StandardScaler, ordinal --> OrdinalEncoder, categorical --> OneHotEncoder)
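
A minimal sketch of such a column transformer on a toy frame. The column-to-encoder assignment and the quality ordering below are my assumptions for illustration, not the submission's exact configuration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy frame with illustrative values
toy = pd.DataFrame({
    "GrLivArea": [1500.0, 2000.0, 1200.0],
    "ExterQual": ["TA", "Gd", "Ex"],
    "Neighborhood": ["NAmes", "CollgCr", "NAmes"],
})

# Explicit order so that the ordinal codes reflect quality (worst to best)
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["GrLivArea"]),
        ("ord", OrdinalEncoder(categories=[quality_order]), ["ExterQual"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
features = preprocess.fit_transform(toy)
```

The result has one scaled column, one ordinal column, and one indicator column per neighborhood.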

Modelling

1. Random forest regressor
2. XGBoost
3. Ridge regression
4. LightGBM
5. CatBoost
6. VotingRegressor (not used)
7. StackingRegressor
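
For reference, a stacking regressor fits a meta-model on out-of-fold predictions of the base models. A small sketch with scikit-learn on synthetic data; the choice of base models, meta-model, and hyperparameters is an assumption and much smaller than the submission's setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge

# Synthetic regression data (not the housing data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),  # meta-model trained on out-of-fold predictions
    cv=5,
)
stack.fit(X_toy, y_toy)
r2 = stack.score(X_toy, y_toy)  # in-sample R^2, only a sanity check
```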

# Goals and Plan

Goal: Understand which components of the other submissions matter most (how much each one affects the score)

These components include:

1. Feature Engineering (which new features are best: summed features, year features, has features)
2. Feature Transformation and Filling (Box-Cox transformation, leaving NA as null vs. filling with values)
3. Data leakage (how much does it help)

4. Column Transformers (scaling, ordinal transform vs. one hot encoding)
5. Removing outliers

6. Model blending

Base model:
1. All possible engineered features
2. No Box-Cox transformations, fill NA with 0/'No'
3. No scaling, ordinal encoding for all categorical variables
4. No outlier removal
5. XGBoost only

Change one of the following at a time and record the test score:
1. Removing engineered features
2. Box-Cox transformation (with and without data leakage)
3. Scaling + separate ordinal and one hot encoding
4. Outlier removal
5. Models blended
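
The plan above amounts to a one-change-at-a-time ablation. A minimal sketch of how the experiment configurations could be enumerated; all keys, option names, and the `one_at_a_time` helper are hypothetical and not part of this notebook's code:

```python
# Baseline settings, mirroring the base model described above
BASELINE = {
    "engineered_features": True,
    "boxcox": "none",          # "none", "train_only", or "train_and_test" (leaky)
    "encoding": "ordinal",     # "ordinal" or "scaled_onehot"
    "remove_outliers": False,
    "models": ("xgb",),
}

# Each entry overrides exactly one baseline setting
CHANGES = [
    ("no_engineered_features", {"engineered_features": False}),
    ("boxcox_no_leakage", {"boxcox": "train_only"}),
    ("boxcox_with_leakage", {"boxcox": "train_and_test"}),
    ("scaled_onehot", {"encoding": "scaled_onehot"}),
    ("remove_outliers", {"remove_outliers": True}),
    ("blend", {"models": ("xgb", "ridge")}),
]

def one_at_a_time(baseline, changes):
    """Yield (name, config) pairs that each differ from the baseline in one setting."""
    yield "baseline", dict(baseline)
    for name, override in changes:
        yield name, {**baseline, **override}

experiments = dict(one_at_a_time(BASELINE, CHANGES))
```

Keeping every run one step away from the baseline makes each score difference attributable to a single component.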

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler, RobustScaler
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import make_scorer, root_mean_squared_error

In [2]:
train_path = './data/train.csv'
df = pd.read_csv(train_path)
X, Y = df.drop(labels=['SalePrice'], axis=1), np.log1p(df['SalePrice'])

In [3]:
# Fill NA

numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

def fillNull(X):
    """Fill missing values in place: 0 for numeric columns, 'No' for all others."""
    for col in X.columns:
        if X[col].dtype in numeric_dtypes:
            X[col] = X[col].fillna(0)
        else:
            X[col] = X[col].fillna('No')

fillNull(X)

In [4]:
# Feature Engineering (simply taken from other Kaggle submissions, goal is to see effectiveness)

def featureEngineering(X: pd.DataFrame):
    # Summed features
    X['YrBltAndRemod']=X['YearBuilt']+X['YearRemodAdd']
    X['TotalSF']=X['TotalBsmtSF'] + X['1stFlrSF'] + X['2ndFlrSF']
    X['Total_sqr_footage'] = (X['BsmtFinSF1'] + X['BsmtFinSF2'] +
                                    X['1stFlrSF'] + X['2ndFlrSF'])
    X['Total_Bathrooms'] = (X['FullBath'] + (0.5 * X['HalfBath']) +
                                X['BsmtFullBath'] + (0.5 * X['BsmtHalfBath']))
    X['Total_porch_sf'] = (X['OpenPorchSF'] + X['3SsnPorch'] +
                                X['EnclosedPorch'] + X['ScreenPorch'] +
                                X['WoodDeckSF'])

    # Has features (1 if the corresponding area/count is positive, else 0)
    X['haspool'] = (X['PoolArea'] > 0).astype(int)
    X['has2ndfloor'] = (X['2ndFlrSF'] > 0).astype(int)
    X['hasgarage'] = (X['GarageArea'] > 0).astype(int)
    X['hasbsmt'] = (X['TotalBsmtSF'] > 0).astype(int)
    X['hasfireplace'] = (X['Fireplaces'] > 0).astype(int)

featureEngineering(X)

In [6]:
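def find_model(modelTuple, params):
    """Grid-search `params` for the (name, estimator) step `modelTuple` on the
    global X, Y and return the best fitted pipeline."""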
    categorical_cols = [col for col in X.columns if X[col].dtype not in numeric_dtypes]
    numerical_cols = [col for col in X.columns if X[col].dtype in numeric_dtypes]

    pipeline = Pipeline([
        ('encoder', ColumnTransformer([
            ('scalar', RobustScaler(), numerical_cols),
            ('ohe', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
        ], remainder='passthrough')),
        
        modelTuple
    ])

    gridsearch = GridSearchCV(
        pipeline,
        scoring='neg_root_mean_squared_error',
        param_grid=params,
        n_jobs=-1,
        cv=5,
        return_train_score=True
    )

    gridsearch.fit(X, Y)

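    # scikit-learn reports negated RMSE (higher is better), so the scores below
    # are negative and sorting in descending order puts the best setting first.
    results = pd.DataFrame(gridsearch.cv_results_['params'])
    results['train_rmse'] = gridsearch.cv_results_['mean_train_score']
    results['test_rmse'] = gridsearch.cv_results_['mean_test_score']
    results = results.sort_values(by='test_rmse', ascending=False)
    display(results.head())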

    return gridsearch.best_estimator_

In [5]:
# XGB
categorical_cols = [col for col in X.columns if X[col].dtype not in numeric_dtypes]
numerical_cols = [col for col in X.columns if X[col].dtype in numeric_dtypes]

pipeline = Pipeline([
    ('encoder', ColumnTransformer([
        ('scalar', RobustScaler(), numerical_cols),
        ('ohe', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ], remainder='passthrough')),
    
    ('xgb', xgb.XGBRegressor(
        random_state=42
    ))
])

params = {
    'xgb__n_estimators': [100,500,3000],
    'xgb__learning_rate': [0.005, 0.01, 0.1],
    'xgb__max_depth': [3]
    }

gridsearch = GridSearchCV(
    pipeline,
    scoring='neg_root_mean_squared_error',
    param_grid=params,
    n_jobs=-1,
    cv=5,
    return_train_score=True
    )

gridsearch.fit(X, Y)

# View results
results = pd.DataFrame(gridsearch.cv_results_['params'])
results['train_rmse'] = gridsearch.cv_results_['mean_train_score']
results['test_rmse'] = gridsearch.cv_results_['mean_test_score']
results = results.sort_values(by='test_rmse', ascending=False)
results

KeyboardInterrupt: 

In [54]:
# Ridge Regression
categorical_cols = [col for col in X.columns if X[col].dtype not in numeric_dtypes]
numerical_cols = [col for col in X.columns if X[col].dtype in numeric_dtypes]

param_grid_ridge = {
    'ridge__alpha': [16,17,18,19,20],
    'ridge__solver': ['auto', 'lsqr', 'sparse_cg', 'sag']
}

pipeline = Pipeline([
    ('encoder', ColumnTransformer([
        ('scalar', RobustScaler(), numerical_cols),
        ('ohe', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ], remainder='passthrough')),
    
    ('ridge', Ridge())
])

gridsearch = GridSearchCV(
    pipeline,
    scoring='neg_root_mean_squared_error',
    param_grid=param_grid_ridge,
    n_jobs=-1,
    cv=5,
    return_train_score=True
    )

gridsearch.fit(X, Y)

# View results
results = pd.DataFrame(gridsearch.cv_results_['params'])
results['train_rmse'] = gridsearch.cv_results_['mean_train_score']
results['test_rmse'] = gridsearch.cv_results_['mean_test_score']
results = results.sort_values(by='test_rmse', ascending=False)
results

| | ridge__alpha | ridge__solver | train_rmse | test_rmse |
|---|---|---|---|---|
| 15 | 19 | sag | -0.112776 | -0.138088 |
| 14 | 19 | sparse_cg | -0.112776 | -0.138088 |
| 12 | 19 | auto | -0.112776 | -0.138088 |
| 10 | 18 | sparse_cg | -0.112445 | -0.13809 |
| 11 | 18 | sag | -0.112445 | -0.13809 |
| 8 | 18 | auto | -0.112445 | -0.13809 |
| 4 | 17 | auto | -0.112104 | -0.138094 |
| 6 | 17 | sparse_cg | -0.112104 | -0.138094 |
| 7 | 17 | sag | -0.112104 | -0.138094 |
| 0 | 16 | auto | -0.111717 | -0.138118 |


In [55]:
def test_model(model):
    test_path = './data/test.csv'
    test_df = pd.read_csv(test_path)
    fillNull(test_df)
    featureEngineering(test_df)

    test_y = model.predict(test_df)

    test_df['SalePrice'] = np.expm1(test_y)
    test_df[['Id', 'SalePrice']].to_csv('results.csv', index=False)

    solution_path = './data/solution.csv'
    solution_df = pd.read_csv(solution_path)
    # Kaggle scores RMSE on log prices; pass the ground truth first (y_true, y_pred)
    score = root_mean_squared_error(np.log(solution_df['SalePrice']), np.log(test_df['SalePrice']))
    return score

test_model(gridsearch.best_estimator_)

0.13418579739329983
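
The score computed by `test_model` is the root mean squared error between log prices, $\sqrt{\frac{1}{n}\sum_i (\log \hat{y}_i - \log y_i)^2}$, and it relies on the prediction and solution rows being in the same order. A more defensive sketch aligns the two frames on `Id` first. The helper name `rmsle_by_id` and the toy values are hypothetical; it assumes both frames have `Id` and `SalePrice` columns, as used in `test_model`:

```python
import numpy as np
import pandas as pd

def rmsle_by_id(pred: pd.DataFrame, truth: pd.DataFrame) -> float:
    """RMSE between log prices after aligning the two frames on `Id`."""
    merged = truth.merge(pred, on="Id", suffixes=("_true", "_pred"), validate="one_to_one")
    if len(merged) != len(truth):
        raise ValueError("some Ids are missing from the predictions")
    diff = np.log(merged["SalePrice_pred"]) - np.log(merged["SalePrice_true"])
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy frames with illustrative values; the prediction rows are deliberately shuffled
truth = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [100000.0, 200000.0]})
pred = pd.DataFrame({"Id": [1462, 1461], "SalePrice": [200000.0, 110000.0]})
score = rmsle_by_id(pred, truth)
```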

# Results
Feature Engineering:
1. No feature engineering: 0.1345
2. Summed features: 0.1307
3. Has features: 0.1345

Scaling:
1. Baseline ordinal encoding: 0.1307
2. One hot encoding (instead of ordinal): 0.1317
3. One hot + RobustScaler: 0.1309

Models:
1. XGB: 0.1309
2. Ridge regression: 0.134
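
To read these numbers as effects, each score can be compared against its section's baseline (lower is better, so a negative delta is an improvement). A small sketch using only the scores listed above; the `deltas` helper is hypothetical:

```python
def deltas(baseline: float, variants: dict) -> dict:
    """Score change of each variant relative to the baseline, rounded to 4 decimals."""
    return {name: round(score - baseline, 4) for name, score in variants.items()}

# Feature engineering, relative to no feature engineering (0.1345)
feature_deltas = deltas(0.1345, {"Summed features": 0.1307, "Has features": 0.1345})

# Encoding and scaling, relative to baseline ordinal encoding (0.1307)
encoding_deltas = deltas(0.1307, {"One hot encoding": 0.1317, "One hot + RobustScaler": 0.1309})
```

Only the summed features move the score noticeably (about 0.004); the other changes are within about 0.001 of their baselines.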