# **Modeling and Evaluation**

## Objectives

This notebook focuses on training, optimizing, and evaluating a regression model that predicts the **sale price of houses** in Ames, Iowa. It is the core modeling step for **Business Requirement 2**, where the client wants to estimate property values based on known house attributes.

Key goals include:
- Selecting a subset of features that most strongly predict sale price
- Trying multiple regression models and choosing the most effective
- Evaluating the model using R², MAE, RMSE, and visualization
- Saving the model pipeline for deployment in the Streamlit app
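
As a rough illustration of the three metrics named above, the sketch below computes R², MAE, and RMSE with `sklearn.metrics`. The helper name `regression_report` and the four price values are made up for the example; they are not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return R², MAE and RMSE for a set of predictions (hypothetical helper)."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": float(np.sqrt(mse)),  # RMSE is the square root of the MSE
    }


# Illustrative values only, not taken from the Ames data
y_true = np.array([200_000, 150_000, 320_000, 180_000])
y_pred = np.array([210_000, 140_000, 300_000, 185_000])

report = regression_report(y_true, y_pred)
print(report)
```

MAE and RMSE are expressed in the same unit as the target (dollars here), while R² is unitless, which is why the three are usually reported together.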

## Inputs

- `outputs/datasets/cleaned/TrainSetCleaned.csv`
- `outputs/datasets/cleaned/TestSetCleaned.csv`

## Outputs

- `X_train.csv` and `X_test.csv`: Train/test sets with selected features
- `y_train.csv` and `y_test.csv`: Corresponding targets (SalePrice)
- Fitted regression pipeline (`best_regressor_pipeline.pkl`)
- Feature importance plot (`feature_importance.png`)
- Evaluation metrics and model summary


---

# Change working directory

* The notebooks are stored in a subfolder, so the working directory must be changed before running them in the editor.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Arthur\\OneDrive\\Documentos\\Code Institute\\PP5\\PP5-heritage-housing-issues-ml\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
print("New working directory set to:", os.getcwd())

New working directory set to: c:\Users\Arthur\OneDrive\Documentos\Code Institute\PP5\PP5-heritage-housing-issues-ml


---

# Load Data

Full dataset (the train/test split is made later in this notebook)

In [4]:
import numpy as np
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/house_prices_records.csv")
      .drop(labels=['EnclosedPorch', 'WoodDeckSF'], axis=1)
  )

print(df.shape)
df.head(5)

(1460, 22)


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GarageYrBlt,...,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,548,RFn,2003.0,...,8450,65.0,196.0,61,5,7,856,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,460,RFn,1976.0,...,9600,80.0,0.0,0,8,6,1262,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,608,RFn,2001.0,...,11250,68.0,162.0,42,5,7,920,2001,2002,223500
3,961,,,No,216,ALQ,540,642,Unf,1998.0,...,9550,60.0,0.0,35,5,7,756,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,836,RFn,2000.0,...,14260,84.0,350.0,84,5,8,1145,2000,2000,250000


---

# Step 2: ML Pipeline with all data

## ML pipeline for Data Cleaning and Feature Engineering

In [14]:
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameImputer(BaseEstimator, TransformerMixin):
    def __init__(self, imputer, column_names):
        self.imputer = imputer
        self.column_names = column_names

    def fit(self, X, y=None):
        self.imputer.fit(X)
        return self

    def transform(self, X):
        array = self.imputer.transform(X)
        return pd.DataFrame(array, columns=self.column_names, index=X.index)
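
A possible usage sketch for the wrapper above, assuming it is paired with scikit-learn's `SimpleImputer` (the notebook does not say which imputer it was written for). The class is repeated so the example runs on its own, and the toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer


class DataFrameImputer(BaseEstimator, TransformerMixin):
    # Same wrapper as above, repeated so the sketch is self-contained
    def __init__(self, imputer, column_names):
        self.imputer = imputer
        self.column_names = column_names

    def fit(self, X, y=None):
        self.imputer.fit(X)
        return self

    def transform(self, X):
        array = self.imputer.transform(X)
        return pd.DataFrame(array, columns=self.column_names, index=X.index)


# Toy data with a non-default index to show that the index is preserved
toy = pd.DataFrame(
    {"BedroomAbvGr": [3.0, np.nan, 2.0], "LotFrontage": [65.0, 80.0, np.nan]},
    index=[10, 20, 30],
)

wrapped = DataFrameImputer(SimpleImputer(strategy="median"), list(toy.columns))
filled = wrapped.fit_transform(toy)
print(filled)
```

The point of the wrapper is that the imputer's NumPy output is turned back into a DataFrame with the original column names and row index, so later steps that select columns by name keep working.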


In [16]:
class DebugMissingChecker(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        missing_cols = X.columns[X.isnull().any()].tolist()
        if missing_cols:
            print("⚠️ Still missing values in:", missing_cols)
            raise ValueError("Pipeline halted: NaNs remain in the data")
        return X

def PipelineDataCleaningAndFeatureEngineering():
    # Imports are kept local so the function is self-contained
    from sklearn.pipeline import Pipeline
    from feature_engine.imputation import (
        MeanMedianImputer,
        ArbitraryNumberImputer,
        CategoricalImputer
    )
    from feature_engine.encoding import OrdinalEncoder
    from feature_engine.transformation import LogTransformer, PowerTransformer, YeoJohnsonTransformer
    from feature_engine.selection import SmartCorrelatedSelection

    num_impute_zero = ['2ndFlrSF', 'MasVnrArea', 'GarageYrBlt']
    # LotFrontage also has missing values (see the X_train preview), so it is
    # imputed before the NaN check and the Yeo-Johnson step
    num_impute_median = ['BedroomAbvGr', 'LotFrontage']
    cat_impute_missing = ['BsmtExposure', 'BsmtFinType1', 'GarageFinish']
    ordinal_encode = ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']

    log_transform = ['GrLivArea']
    log10_transform = ['1stFlrSF']
    yeojohnson_transform = ['GarageArea', 'LotFrontage']
    power_transform = ['TotalBsmtSF']

    # Assemble the steps into an sklearn Pipeline
    pipeline = Pipeline([
        ("num_zero", ArbitraryNumberImputer(arbitrary_number=0, variables=num_impute_zero)),
        ("num_median", MeanMedianImputer(imputation_method='median', variables=num_impute_median)),
        ("cat_missing", CategoricalImputer(imputation_method='missing', variables=cat_impute_missing)),

        # Check that no missing values remain
        ("debug_nan_check", DebugMissingChecker()),

        ("ordinal_encoder", OrdinalEncoder(encoding_method='arbitrary', variables=ordinal_encode)),

        ("log_transform", LogTransformer(variables=log_transform)),
        ("log10_transform", LogTransformer(variables=log10_transform, base='10')),
        ("yeojohnson", YeoJohnsonTransformer(variables=yeojohnson_transform)),
        ("power_transform", PowerTransformer(variables=power_transform)),

        ("correlation_filter", SmartCorrelatedSelection(
            method='spearman',
            threshold=0.6,
            selection_method='variance')),
    ])

    return pipeline
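
The NaN guard used in the pipeline above can be tried in isolation. The sketch below is a condensed copy of `DebugMissingChecker` (the error message is shortened) applied to two invented frames: one clean, one with a missing `LotFrontage` value.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DebugMissingChecker(BaseEstimator, TransformerMixin):
    # Condensed copy of the checker above so the sketch is self-contained
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        missing_cols = X.columns[X.isnull().any()].tolist()
        if missing_cols:
            raise ValueError(f"NaNs remain in: {missing_cols}")
        return X


checker = DebugMissingChecker()
clean = pd.DataFrame({"GarageArea": [548, 460], "LotFrontage": [65.0, 80.0]})
dirty = clean.assign(LotFrontage=[65.0, np.nan])

passed = checker.fit_transform(clean)  # clean data is returned unchanged

try:
    checker.fit_transform(dirty)
    halted = False
except ValueError as err:
    halted = True
    print("Halted:", err)
```

Placing such a step right after the imputers makes a missed column fail loudly at a known point, rather than surfacing later as a less obvious error inside an encoder or transformer.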

# Create ML Pipeline

We define a flexible and modular ML pipeline using `sklearn.pipeline`. This pipeline includes encoding, transformations, multicollinearity reduction, scaling, feature selection, and the model itself.

All transformation steps are based on decisions made in the Feature Engineering notebook, including:
- Ordinal encoding for 4 categorical features
- Log, power, and Yeo-Johnson transformations for skewed numeric features
- SmartCorrelatedSelection with threshold = 0.6
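
To give a feel for why the skewed numeric features are transformed, the sketch below measures skewness before and after a natural log, a base-10 log, and a power transform. It uses plain NumPy on synthetic log-normal data rather than the `feature_engine` transformers or the Ames columns, and the exponent 0.5 is an assumption made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed, strictly positive values (not the Ames data)
area = pd.Series(rng.lognormal(mean=7.0, sigma=0.5, size=1000), name="area")

transformed = pd.DataFrame({
    "raw": area,
    "log_e": np.log(area),             # natural log
    "log_10": np.log10(area),          # base-10 log
    "power_0.5": np.power(area, 0.5),  # square root, exponent assumed
})

# Skewness near 0 indicates a roughly symmetric distribution
skewness = transformed.skew()
print(skewness.round(2))
```

The two log variants give the same skewness because they differ only by a constant factor; which base to use is a matter of readability. Log transforms require strictly positive inputs, which is one reason a feature that can be zero may be handled with Yeo-Johnson or a power transform instead.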

In [17]:
# Feat Scaling
from sklearn.preprocessing import StandardScaler

# Feat Selection
from sklearn.feature_selection import SelectFromModel

# Pipeline container
from sklearn.pipeline import Pipeline

# ML algorithms (regressors, since SalePrice is a continuous target)
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from xgboost import XGBRegressor

def PipelineClf(model):
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        ("feat_selector", SelectFromModel(model)),
        ("model", model),
    ])
    return pipeline_base
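
A minimal sketch of how the three steps in this pipeline behave together, using synthetic data from `make_regression` and a `RandomForestRegressor` chosen only for illustration. `SelectFromModel` fits its own clone of the estimator to rank features, and the final `model` step is then trained on the features that were kept.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which drive the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("feat_selector", SelectFromModel(model)),  # keeps features above the importance threshold
    ("model", model),
])
pipe.fit(X, y)

# Boolean mask of the features that survived selection
kept = pipe.named_steps["feat_selector"].get_support()
predictions = pipe.predict(X)
print("Features kept:", int(kept.sum()), "of", X.shape[1])
```

Because the selector sits inside the pipeline, feature selection is re-fitted on each training fold during cross-validation, which avoids leaking information from the validation fold.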


Custom Class for hyperparameter optimisation

In [18]:
from sklearn.model_selection import GridSearchCV
import numpy as np
import pandas as pd

class HyperparameterOptimizationSearch:
    def __init__(self, models, params):
        """
        Initializes the search with dictionaries of models and hyperparameter grids.
        
        Parameters:
        - models: dict of model name → estimator (e.g., {'XGB': XGBRegressor()})
        - params: dict of model name → dict of hyperparameter grid
        """
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}  # Will store results after .fit()

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=True):
        """
        Runs GridSearchCV for each model in self.models.

        Parameters:
        - X, y: training data
        - cv: number of cross-validation folds
        - n_jobs: number of cores for parallel processing
        - verbose: verbosity level (0–3)
        - scoring: scoring metric, e.g., 'r2'
        - refit: whether to refit the best model on the whole data
          (default: True, matching GridSearchCV's own default)
        """
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")

            # Wrap the estimator in the scaling / feature-selection / model pipeline
            model = PipelineClf(self.models[key])

            # Retrieve the grid of hyperparameters
            params = self.params[key]

            # Run Grid Search, forwarding refit so the argument takes effect
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring, refit=refit)
            gs.fit(X, y)

            # Store result in dictionary
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        """
        Builds a summary DataFrame with scores and hyperparameters for all models.
        """
        def row(key, scores, params):
            """
            Helper function to generate one row of the results summary.
            """
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})  # Combine param dict with scores

        rows = []

        for k in self.grid_searches:
            # Get all parameter combinations tested
            params = self.grid_searches[k].cv_results_['params']
            scores = []

            # Collect scores from each CV split (split0_test_score, split1_test_score, etc.)
            for i in range(self.grid_searches[k].cv):
                key = f"split{i}_test_score"
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))  # reshape for stacking

            all_scores = np.hstack(scores)  # Combine split scores into one array

            # Build a row for each hyperparameter combination tested
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        # Combine all rows into a summary DataFrame
        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        # Set column order: estimator, min/mean/max scores, std, then hyperparameters
        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches
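
The summary logic above relies on the per-fold score arrays that `GridSearchCV` stores in `cv_results_`. The sketch below shows that structure on a small synthetic problem; `Ridge` and the `alpha` grid are stand-ins picked for the example, not models used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

cv = 3
gs = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=cv, scoring="r2")
gs.fit(X, y)

params = gs.cv_results_["params"]

# One column per fold, one row per hyperparameter combination
all_scores = np.column_stack(
    [gs.cv_results_[f"split{i}_test_score"] for i in range(cv)]
)

for p, s in zip(params, all_scores):
    print(p, "min=%.3f mean=%.3f max=%.3f" % (s.min(), s.mean(), s.max()))
```

Each row of `all_scores` corresponds to one entry of `params`, so the min, mean, max, and standard deviation per row are exactly the columns that `score_summary` reports. Note that `range(self.grid_searches[k].cv)` in the class only works when `cv` is passed as an integer.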


## Split Train Test Set

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['SalePrice'], axis=1),
    df['SalePrice'],
    test_size=0.2,
    random_state=0
)

print("* Train set:", X_train.shape, y_train.shape,
      "\n* Test set:",  X_test.shape, y_test.shape)


* Train set: (1168, 21) (1168,) 
* Test set: (292, 21) (292,)


In [20]:
X_train.head(10)

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GarageYrBlt,...,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,YearBuilt,YearRemodAdd
618,1828,0.0,,Av,48,,1774,774,Unf,2007.0,...,Gd,11694,90.0,452.0,108,5,9,1822,2007,2007
870,894,0.0,2.0,No,0,Unf,894,308,,1962.0,...,TA,6600,60.0,0.0,0,5,5,894,1962,1962
92,964,0.0,2.0,No,713,ALQ,163,432,Unf,1921.0,...,TA,13360,80.0,0.0,0,7,5,876,1921,2006
817,1689,0.0,3.0,No,1218,GLQ,350,857,RFn,2002.0,...,Gd,13265,,148.0,59,5,8,1568,2002,2002
302,1541,0.0,3.0,No,0,Unf,1541,843,RFn,2001.0,...,Gd,13704,118.0,150.0,81,5,7,1541,2001,2002
1454,1221,0.0,2.0,No,410,GLQ,811,400,RFn,2004.0,...,Gd,7500,62.0,0.0,113,5,7,1221,2004,2005
40,1324,0.0,3.0,No,643,Rec,445,440,,1965.0,...,TA,8658,84.0,101.0,138,5,6,1088,1965,1965
959,696,720.0,3.0,No,604,ALQ,92,484,Unf,1999.0,...,Gd,2572,24.0,0.0,44,5,7,696,1999,1999
75,526,462.0,2.0,Gd,462,GLQ,0,297,Unf,1973.0,...,TA,1596,21.0,0.0,101,5,4,462,1973,1973
1389,869,349.0,3.0,No,375,ALQ,360,440,Unf,2003.0,...,TA,6000,60.0,0.0,0,6,6,735,1941,1950


## Run PipelineDataCleaningAndFeatureEngineering

In [21]:
pipeline_data_cleaning_feat_eng = PipelineDataCleaningAndFeatureEngineering()
X_train = pipeline_data_cleaning_feat_eng.fit_transform(X_train)
X_test = pipeline_data_cleaning_feat_eng.transform(X_test)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

## Grid Search CV - Sklearn

### Use default hyperparameters to find most suitable algorithm

In [None]:
models_quick_search = {
    'LinearRegression': LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "XGBRegressor": XGBRegressor(random_state=0),
}

params_quick_search = {
    'LinearRegression': {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}
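
An empty dictionary is a valid grid: it expands to a single candidate with no overrides, so each estimator is cross-validated once with its default hyperparameters. The sketch below checks this with `ParameterGrid`; the second grid is an invented example for comparison.

```python
from sklearn.model_selection import ParameterGrid

# An empty grid yields exactly one candidate: the estimator's defaults
empty_grid = list(ParameterGrid({}))

# For comparison, a hypothetical 2 x 2 grid yields four candidates
small_grid = list(ParameterGrid({"max_depth": [3, 5], "n_estimators": [50, 100]}))

print(len(empty_grid), empty_grid)
print(len(small_grid))
```

This keeps the quick search cheap: seven models times the number of folds, with no grid expansion. Hyperparameter tuning can then be limited to the algorithms that score best.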

Do a hyperparameter optimisation search using default hyperparameters

In [None]:
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary