# **5. Regression:**

## Objectives

* Develop an ML pipeline for data cleaning, feature engineering, and model training.
* Perform hyperparameter optimization using GridSearchCV to find the best model.
* Split the data into training and testing sets.
* Evaluate the models using various regression metrics and visualize the results.
* Identify and visualize the most important features.
* Refit the pipeline using only the most important features.

## Inputs

1. Data File
* house_prices_records.csv

2. Models for hyperparameter search:
    * Linear Regression
    * Random Forest Regressor
    * Gradient Boosting Regressor
    * Decision Tree Regressor
    * Extra Trees Regressor
    * AdaBoost Regressor

3. Hyperparameter Grids
    * Various hyperparameters for each model

## Outputs
1. Train-Test Split
2. Grid Search Results
3. Best Model and Parameters
4. Feature importance
5. Model performance metrics
6. Saved files 


---

# Change working directory

Change the working directory from its current folder to its parent folder
* Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

Make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You have set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
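
As a side-effect-free illustration of the step above, the sketch below shows what `os.path.dirname()` returns for a directory path. The path `/workspace/project/jupyter_notebooks` is a hypothetical example and is not taken from this project.

```python
import os

# Hypothetical notebook folder, used only for illustration.
example_dir = "/workspace/project/jupyter_notebooks"

# os.path.dirname() strips the last path component, giving the parent folder.
parent_dir = os.path.dirname(example_dir)
print(parent_dir)
```

Passing this parent path to `os.chdir()` is what makes relative paths such as `outputs/...` resolve from the project root.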

---

# Load the data

In [None]:
import numpy as np
import pandas as pd
df1 = pd.read_csv("/workspace/heritage-housing-mvp/outputs/datasets/collection/house_prices_records.csv")
print(df1.shape)
df1.head(3)

---

# ML Pipeline with all data

#### ML pipeline for data cleaning and feature engineering

In [None]:
from sklearn.pipeline import Pipeline
from feature_engine.selection import DropFeatures
from feature_engine.imputation import MeanMedianImputer, CategoricalImputer
from feature_engine.encoding import OrdinalEncoder
from feature_engine.outliers import Winsorizer
from feature_engine.transformation import LogTransformer, PowerTransformer
from feature_engine.selection import SmartCorrelatedSelection
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.decomposition import PCA

# ML algorithms
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

def PipelineOptimization(model):
    pipeline = Pipeline([
        ('median', MeanMedianImputer(imputation_method='median',
                                variables=['LotFrontage','GarageYrBlt','MasVnrArea','2ndFlrSF']) ),

        ('drop', DropFeatures(features_to_drop=['EnclosedPorch', 'WoodDeckSF']) ),

        ('categorical', CategoricalImputer(imputation_method='missing',
                                fill_value='None',
                                variables=['BsmtFinType1', 'GarageFinish']) ),
    
        ('mean', MeanMedianImputer(imputation_method='mean',
                                variables=['BedroomAbvGr']) ),
    
        ('winsorizer', Winsorizer(capping_method='iqr', fold=1.5, tail='both', variables=[
                            'GarageArea', 'LotArea', 'LotFrontage', 'MasVnrArea',
                            'OpenPorchSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF']) ),
    
        ('OrdinalCategoricalEncoder', OrdinalEncoder(encoding_method='arbitrary',
                                                    variables = ['BsmtExposure',
                                                                'BsmtFinType1',
                                                                'GarageFinish',
                                                                'KitchenQual']) ),
    
        ('lt', LogTransformer(variables = ['1stFlrSF']) ),
    
        ('power', PowerTransformer(variables= ['2ndFlrSF',
                                        'BsmtUnfSF',
                                        'GarageArea',
                                        'MasVnrArea',
                                        'TotalBsmtSF']) ),

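        # Note: despite the step name, this applies PowerTransformer, not a Box-Cox transform.
        ('boxcox', PowerTransformer(variables=['GrLivArea', 'LotFrontage', 'LotArea']) ),

        # Note: despite the step name, this applies PowerTransformer, not a Yeo-Johnson transform.
        ('yeo_johnson', PowerTransformer(variables=['BedroomAbvGr', 'GarageYrBlt', 'OpenPorchSF']) ),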

        ('SmartCorrelatedSelection', SmartCorrelatedSelection(variables=None,
        method='spearman', threshold=0.7, selection_method='variance')),

        ('feat_scaling', StandardScaler() ),

        ('feat_selection', SelectFromModel(model) ),

        ("model", model),
    ])

    return pipeline
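
To make the `winsorizer` step more concrete, here is a minimal standard-library sketch of IQR capping with `fold=1.5` on both tails. It is an illustration of the idea, not feature_engine's implementation: the helper `iqr_cap` and the sample values are made up, and the real `Winsorizer` learns its caps from the training data and reuses them on new data.

```python
import statistics

def iqr_cap(values, fold=1.5):
    """Cap values at Q1 - fold*IQR and Q3 + fold*IQR (both tails)."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - fold * iqr, q3 + fold * iqr
    return [min(max(v, lower), upper) for v in values]

# Made-up values with one extreme observation.
lot_areas = [1, 2, 3, 4, 100]
capped = iqr_cap(lot_areas)
print(capped)
```

Here the quartiles are 2 and 4, so the upper cap is 4 + 1.5 * 2 = 7 and the extreme value is pulled down to it, while the other values are unchanged.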

Custom Class for Hyperparameter Optimisation

* Code is copied from the walkthrough project prepared by Code Institute.

In [None]:
from sklearn.model_selection import GridSearchCV

class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")

            model = PipelineOptimization(self.models[key])
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                            verbose=verbose, scoring=scoring )
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score',
                    'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches
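
The aggregation in `score_summary` can be sketched without pandas as follows. The scores and parameter values are hypothetical; only the `split{i}_test_score` key layout mirrors what the class reads from `cv_results_`.

```python
import statistics

# Hypothetical per-fold R2 scores for two parameter combinations,
# laid out like the split{i}_test_score entries of cv_results_.
cv_results = {
    "params": [{"model__max_depth": 3}, {"model__max_depth": 15}],
    "split0_test_score": [0.80, 0.70],
    "split1_test_score": [0.82, 0.74],
    "split2_test_score": [0.78, 0.78],
}
n_splits = 3

rows = []
for idx, params in enumerate(cv_results["params"]):
    # Collect this combination's score from every fold.
    scores = [cv_results[f"split{i}_test_score"][idx] for i in range(n_splits)]
    rows.append({
        **params,
        "min_score": min(scores),
        "max_score": max(scores),
        "mean_score": statistics.fmean(scores),
        "std_score": statistics.pstdev(scores),  # population std, like np.std
    })

# Highest mean score first, as in sort_by='mean_score'.
rows.sort(key=lambda r: r["mean_score"], reverse=True)
best = rows[0]
print(best)
```

Each row corresponds to one parameter combination, so sorting by `mean_score` puts the best candidate first.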

### Split into train and test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df1.drop(['SalePrice'], axis=1),
    df1['SalePrice'],
    test_size=0.2,
    random_state=0,
)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
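
The idea behind an 80/20 split with a fixed seed can be sketched as below. This is an assumption-laden illustration: the helper `split_indices` is hypothetical and does not reproduce the exact shuffling of `train_test_split`, only the principle of a reproducible, disjoint partition of row indices.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices reproducibly and hold out a test_size fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)
    # First n_test shuffled indices form the test set, the rest the train set.
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(n_rows=10)
print(len(train_idx), len(test_idx))
```

Fixing the seed, as `random_state=0` does above, means the same rows land in the test set on every run.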

---

# Grid Search CV - Sklearn

Use default hyperparameters to find the most suitable algorithm.

In [None]:
models_quick_search = {
    'LinearRegression': LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor(random_state=0),
    'GradientBoostingRegressor': GradientBoostingRegressor(random_state=0),
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=0),
    'ExtraTreesRegressor': ExtraTreesRegressor(random_state=0),
    'AdaBoostRegressor': AdaBoostRegressor(random_state=0)
}

params_quick_search = {
    'LinearRegression': {},
    'RandomForestRegressor': {},
    'GradientBoostingRegressor': {},
    'DecisionTreeRegressor': {},
    'ExtraTreesRegressor': {},
    'AdaBoostRegressor': {},
}

Run a hyperparameter optimization search using default hyperparameters

In [None]:
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)
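
With an integer `cv=5` and a regressor, scikit-learn splits the training data into five unshuffled folds and validates on each in turn. The sketch below illustrates that scheme on row indices; the helper `k_fold_indices` is a hypothetical stand-in, not scikit-learn's code.

```python
def k_fold_indices(n_rows, n_splits=5):
    """Yield (train, validation) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_rows, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional row.
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(k_fold_indices(n_rows=10, n_splits=5))
for train, val in folds:
    print(len(train), val)
```

Every row is used for validation exactly once, which is why each candidate receives five test scores.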

Check the results produced

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

The client's business requirement is an R2 score of at least 0.75.
The quick search gave the best results for GradientBoostingRegressor. The extensive search below tunes hyperparameters for all six estimators, including GradientBoostingRegressor.

### Extensive search

In [None]:
models_search = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

params_search = {
    "LinearRegression": {},

    "DecisionTreeRegressor": {'model__max_depth': [None,4, 15],
                             'model__min_samples_split': [2,50],
                             'model__min_samples_leaf': [1,50],
                             'model__max_leaf_nodes': [None,50],
        },

    "RandomForestRegressor": {'model__n_estimators': [100,50, 140],
                             'model__max_depth': [None,4, 15],
                             'model__min_samples_split': [2,50],
                             'model__min_samples_leaf': [1,50],
                             'model__max_leaf_nodes': [None,50],
        },

    "ExtraTreesRegressor": {'model__n_estimators': [100,50,150],
        'model__max_depth': [None, 3, 15],
        'model__min_samples_split': [2, 50],
        'model__min_samples_leaf': [1,50],
        },

    "AdaBoostRegressor": {'model__n_estimators': [50,25,80,150],
                          'model__learning_rate':[1,0.1, 2],
                          'model__loss':['linear', 'square', 'exponential'],
        },

    "GradientBoostingRegressor": {'model__n_estimators': [100,50,140],
                                  'model__learning_rate':[0.1, 0.01, 0.001],
                                  'model__max_depth': [3,15, None],
                                  'model__min_samples_split': [2,50],
                                  'model__min_samples_leaf': [1,50],
                                  'model__max_leaf_nodes': [None,50],
        },
}
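
To gauge the cost of a grid before running it, the number of candidates can be counted directly. The sketch below does this for the GradientBoostingRegressor grid defined above, assuming the 5-fold cross-validation used in this notebook.

```python
from itertools import product

# The GradientBoostingRegressor grid from the extensive search.
gbr_grid = {
    "model__n_estimators": [100, 50, 140],
    "model__learning_rate": [0.1, 0.01, 0.001],
    "model__max_depth": [3, 15, None],
    "model__min_samples_split": [2, 50],
    "model__min_samples_leaf": [1, 50],
    "model__max_leaf_nodes": [None, 50],
}

# GridSearchCV evaluates every combination of the listed values.
combinations = list(product(*gbr_grid.values()))
n_folds = 5
print(len(combinations), "candidates,", len(combinations) * n_folds, "cross-validation fits")
```

This grid alone yields 216 candidates and 1080 cross-validation fits, which explains why the extensive search takes much longer than the quick search.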

Run a hyperparameter optimization search using the hyperparameter grids defined above

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

Narrow the search to ExtraTreesRegressor, which performed best in the extensive search, and define its model and parameters

In [None]:
models_search = {
        "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0)
}

params_search={"ExtraTreesRegressor": 
        {'model__n_estimators': [100,50,150],
        'model__max_depth': [None, 3, 15],
        'model__min_samples_split': [2, 50],
        'model__min_samples_leaf': [1,50],
        },
}

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

---

Establish the best model

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

Parameters for best model

In [None]:
best_parameters = grid_search_pipelines[best_model].best_params_
best_parameters

Define the best regressor, based on search

In [None]:
best_regressor_pipeline = grid_search_pipelines[best_model].best_estimator_
best_regressor_pipeline

---

# Assess the feature importance

Code has been adjusted from Code Institute's churnometer project.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

data_cleaning_feat_eng_steps = 11
columns_after_data_cleaning_feat_eng = (Pipeline(best_regressor_pipeline.steps[:data_cleaning_feat_eng_steps])
                                        .transform(X_train)
                                        .columns)

best_features = columns_after_data_cleaning_feat_eng[best_regressor_pipeline['feat_selection'].get_support(
)].to_list()

df_feature_importance = (pd.DataFrame(data={
    'Feature': columns_after_data_cleaning_feat_eng[best_regressor_pipeline['feat_selection'].get_support()],
    'Importance': best_regressor_pipeline['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)

print(f"* These are the {len(best_features)} most important features in descending order. "
    f"The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")

df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

These are the 5 most important features. The model was trained on them: OverallQual, GrLivArea, TotalBsmtSF, YearBuilt and GarageArea.
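
The way the boolean support mask lines up column names with importances can be sketched as follows. The column list, mask, and importance values are made up for illustration; they only mimic the shapes returned by `get_support()` and `feature_importances_`.

```python
from itertools import compress

# Hypothetical column names, support mask, and importances for illustration.
columns = ["GarageArea", "GrLivArea", "LotArea", "OverallQual"]
support = [True, True, False, True]   # like feat_selection.get_support()
importances = [0.15, 0.30, 0.55]      # like model.feature_importances_

# Keep only the columns the selector retained.
selected = list(compress(columns, support))

# The model saw only the selected columns, so importances align with them.
ranked = sorted(zip(selected, importances), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

The importances have one entry per selected column, not per original column, which is why the mask must be applied before pairing them.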

**Evaluate performance on Train and Test sets:**

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np

def regression_performance(X_train, y_train, X_test, y_test, pipeline):
    print("Model Evaluation \n")
    print("* Train Set")
    regression_evaluation(X_train, y_train, pipeline)
    print("* Test Set")
    regression_evaluation(X_test, y_test, pipeline)

def regression_evaluation(X, y, pipeline):
    prediction = pipeline.predict(X)
    # Built-in round() works for both Python floats and NumPy scalars.
    print('R2 Score:', round(r2_score(y, prediction), 3))
    print('Mean Absolute Error:', round(mean_absolute_error(y, prediction), 3))
    print('Mean Squared Error:', round(mean_squared_error(y, prediction), 3))
    print('Root Mean Squared Error:', round(np.sqrt(mean_squared_error(y, prediction)), 3))
    print("\n")

def regression_evaluation_plots(X_train, y_train, X_test, y_test, pipeline, alpha_scatter=0.5):
    pred_train = pipeline.predict(X_train)
    pred_test = pipeline.predict(X_test)

    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
    sns.scatterplot(x=y_train, y=pred_train, alpha=alpha_scatter, ax=axes[0])
    sns.lineplot(x=y_train, y=y_train, color='red', ax=axes[0])
    axes[0].set_xlabel("Actual")
    axes[0].set_ylabel("Predictions")
    axes[0].set_title("Train Set")

    sns.scatterplot(x=y_test, y=pred_test, alpha=alpha_scatter, ax=axes[1])
    sns.lineplot(x=y_test, y=y_test, color='red', ax=axes[1])
    axes[1].set_xlabel("Actual")
    axes[1].set_ylabel("Predictions")
    axes[1].set_title("Test Set")

    plt.show()
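
For reference, the four metrics printed above can be computed from their definitions in a few lines. This is a plain-Python sketch with made-up numbers; the helper `regression_metrics` is hypothetical and is not part of the notebook's evaluation code.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE and RMSE computed from their definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(r) for r in residuals) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
    }

metrics = regression_metrics([3, 5, 7], [2, 5, 9])
print(metrics)
```

R2 compares the model's squared error with that of always predicting the mean, so a value of 0.75 means the model removes three quarters of that baseline error.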

In [None]:
regression_performance(X_train, y_train, X_test, y_test, best_regressor_pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test, best_regressor_pipeline)

Both the train and test sets meet the business requirement, with R2 scores above 0.75.

In [None]:
best_regressor_pipeline

# Pipeline refitting with best features

### Rewrite Pipeline

In [None]:
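def PipelineOptimization(model):
    # Note: the `model` argument is not used; the final step hard-codes an
    # ExtraTreesRegressor, whose `model__` hyperparameters the grid search overrides.
    pipeline_base = Pipeline([
    
        ('winsorizer', Winsorizer(capping_method='iqr', fold=1.5, tail='both', variables=[
                            'GarageArea', 'TotalBsmtSF']) ),
    
        ('power', PowerTransformer(variables= ['GarageArea',
                                        'TotalBsmtSF']) ),

        # Step is named 'boxcox' but applies PowerTransformer, as in the full pipeline.
        ('boxcox', PowerTransformer(variables=['GrLivArea']) ),

        ('feat_scaling', StandardScaler() ),

        ("model", ExtraTreesRegressor(max_depth=15, min_samples_split=50,
                                        n_estimators=150, random_state=0))])

    return pipeline_base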

Subset the train and test sets to the best features only.

In [None]:
X_train = X_train.filter(best_features)
X_test = X_test.filter(best_features)

In [None]:
print("* Train set:", X_train.shape, y_train.shape, "\n* Test Set:", X_test.shape, y_test.shape)
X_train.head(3)

Use the same model as in the previous grid search, and search the same hyperparameter grid again on the reduced feature set.

In [None]:
models_search

In [None]:
best_parameters

In [None]:
params_search = {
    "ExtraTreesRegressor": {'model__n_estimators': [50,100,150],
        'model__max_depth': [None, 3, 15],
        'model__min_samples_split': [2, 50],
        'model__min_samples_leaf': [1, 50],
        },
}

Perform GridSearch CV

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring= 'r2', n_jobs=-1, cv=5)

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

In [None]:
best_pipeline_regressor = grid_search_pipelines[best_model].best_estimator_
best_pipeline_regressor

---

## Save the files

In [None]:
import joblib
import os

version = 'v1'
file_path = f'outputs/ml_pipeline/predict_price/{version}'

try:
    os.makedirs(name=file_path)
except Exception as e:
    print(e)
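
The `try`/`except` above prints an error message when the folder already exists. One possible alternative, sketched here in a temporary folder so that nothing in the project tree is touched, is the `exist_ok=True` argument of `os.makedirs`.

```python
import os
import tempfile

# Use a temporary folder so the sketch does not touch the project tree.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "outputs", "ml_pipeline", "predict_price", "v1")
    os.makedirs(target, exist_ok=True)  # creates all missing parent folders
    os.makedirs(target, exist_ok=True)  # second call is a no-op, no exception
    created = os.path.isdir(target)

print(created)
```

With this argument the cell can be re-run safely, while other failures such as permission errors still raise.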

In [None]:
X_train.head()

In [None]:
X_train.to_csv(f"{file_path}/X_train.csv", index=False)

In [None]:
y_train.head()

In [None]:
y_train.to_csv(f"{file_path}/y_train.csv", index=False)

In [None]:
X_test.head()

In [None]:
X_test.to_csv(f"{file_path}/X_test.csv", index=False)

In [None]:
y_test.head()

In [None]:
y_test.to_csv(f"{file_path}/y_test.csv", index=False)

ML pipeline for predicting sale price

In [None]:
best_pipeline_regressor

In [None]:
joblib.dump(value=best_pipeline_regressor, filename=f"{file_path}/regression_pipeline.pkl")

---

## Feature importance plot

In [None]:
# Folder for the plot; the file name is appended when saving.
file_path = 'docs/plots'
os.makedirs(file_path, exist_ok=True)

In [None]:
df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.savefig(f'{file_path}/feature_importance.png', bbox_inches='tight')

---

## Conclusion

This project developed a machine learning pipeline to predict house prices using several regression models. It began with data cleaning and feature engineering steps. The core of the project was hyperparameter optimization through GridSearchCV, which tested multiple models and configurations to identify the best-performing one.

The Gradient Boosting Regressor initially showed promising results, but further optimization revealed that the Extra Trees Regressor with specific hyperparameters performed best, exceeding the business requirement of an R2 score of at least 0.75 on both the train and test sets. The most influential features were then identified, and the model was refit using only these key predictors. Finally, the optimized model and the relevant datasets were saved for future use.

This structured approach produced a model that meets the business requirement and highlighted the importance of feature selection and model tuning. The project demonstrates how a systematic ML pipeline can be applied to real-world regression problems.