# Notebook 05 - Modelling & Evaluation

## Objectives

* Fit and evaluate a model for predicting the sale price of a house

## Inputs

* CSV file generated in Notebook 01: outputs/datasets/collection/house_price_records.csv
* Instructions on which variables to use for data cleaning and feature engineering. These are found in their respective notebooks.

## Outputs

* Train set (features and target)
* Test set (features and target)
* ML Pipeline to predict the sale price for a given property
* Feature importance plot

## Additional Comments / Conclusions

---

# Import Packages

In [13]:
import os
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Data Cleaning
from feature_engine.imputation import (ArbitraryNumberImputer,
                                       CategoricalImputer,
                                       MeanMedianImputer)

# Feature Engineering
from feature_engine import transformation as vt
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import DropFeatures, SmartCorrelatedSelection

# Feature Scaling
from sklearn.preprocessing import StandardScaler

# Feature Selection
from sklearn.feature_selection import SelectFromModel

# ML algorithms & hyperparameter optimisation
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from xgboost import XGBRegressor

---

# Change working directory

* This notebook is stored in the `jupyter_notebooks` subfolder
* The working directory therefore needs to be changed from this folder to its parent folder, the workspace root

First, access the current directory with `os.getcwd()`

In [4]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\housing-price-predictor\\jupyter_notebooks'

Next, the working directory is set as the parent of the current `jupyter_notebooks` directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` defines the new current directory
* This allows access to all the files and folders within the workspace, rather than solely those within the `jupyter_notebooks` directory

In [5]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Finally, confirm that the new current directory has been successfully set

In [6]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\housing-price-predictor'

---

# Load Data

In [7]:
df = pd.read_csv("outputs/datasets/collection/house_price_records.csv")
print(df.shape)
df.head()

(1460, 24)


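| Unnamed: 0 | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | Unf | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | RFn | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |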


---

# Pipeline, Custom Class and Data Preparation

This section of the notebook covers the following tasks:
* Create the ML pipeline for regression
* Define the custom class for hyperparameter optimisation
* Split the data into train and test sets

## Create ML Pipeline

* We combine the data cleaning and feature engineering pipelines that were created in their respective notebooks.
* We then add feature selection, feature scaling and modelling.

In [11]:
# Pipeline optimisation
def PipelineOptimisation(model):
    pipeline_base = Pipeline([
        # Data Cleaning (copied from Data Cleaning notebook)
        ('median', MeanMedianImputer(imputation_method='median',
                                     variables=['2ndFlrSF', 'BedroomAbvGr',
                                                'LotFrontage', 'MasVnrArea'])),
        ('categorical_missing', CategoricalImputer(imputation_method='missing',
                                                   fill_value='Missing',
                                                   variables=['BsmtFinType1'])),
        ('garage_absent', ArbitraryNumberImputer(arbitrary_number=0,
                                                 variables=['GarageYrBlt'])),
        ('categorical_frequent', CategoricalImputer(imputation_method='frequent',
                                                    variables=['GarageFinish'])),
        ('drop', DropFeatures(features_to_drop=['EnclosedPorch', 'WoodDeckSF'])),

        # Feature Engineering (copied from Feature Engineering notebook)
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=['BsmtExposure',
                                                                'BsmtFinType1',
                                                                'GarageFinish',
                                                                'KitchenQual'])),

        ('log_transform', vt.LogTransformer(variables=['1stFlrSF',
                                                       'GrLivArea',
                                                       'LotArea',
                                                       'LotFrontage'])),

        ('yeo_johnson_transform', vt.YeoJohnsonTransformer(variables=['BsmtUnfSF',
                                                                      'GarageArea',
                                                                      'OpenPorchSF'])),

        ('power_transform', vt.PowerTransformer(variables=['TotalBsmtSF',
                                                           'MasVnrArea'])),

        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
                                                              method="spearman",
                                                              threshold=0.6,
                                                              selection_method="variance")),

        # Feature scaling, feature selection and modelling
        ("feat_scaling", StandardScaler()),

        ("feat_selection", SelectFromModel(model)),

        ("model", model),
    ])

    return pipeline_base


## Custom Class for Hyperparameter Optimisation

Custom Class for Hyperparameter Optimisation, copied and adapted from Code Institute's Walkthrough Project 2 on customer churn:

In [12]:
class HyperparameterOptimisationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = PipelineOptimisation(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches
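
The `score_summary` method works by stacking the per-split test scores that `GridSearchCV` stores in `cv_results_`, then summarising each parameter candidate across splits. The sketch below illustrates that mechanism in isolation. It is a minimal example under stated assumptions: the data is synthetic (generated with `make_regression`), and the estimator and the `max_depth` grid are chosen only for illustration, not taken from this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the house price records)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

cv = 3
gs = GridSearchCV(DecisionTreeRegressor(random_state=0),
                  {"max_depth": [2, 4, 8]},  # illustrative grid
                  cv=cv, scoring="r2")
gs.fit(X, y)

# cv_results_ holds one array per split ("split0_test_score", ...), each with
# one entry per parameter candidate. Stacking them column-wise gives a matrix
# with one row per candidate and one column per split.
all_scores = np.column_stack(
    [gs.cv_results_[f"split{i}_test_score"] for i in range(cv)]
)

# Summarise each candidate across the splits, as score_summary does
for candidate, scores in zip(gs.cv_results_["params"], all_scores):
    print(candidate, scores.min(), scores.mean(), scores.max(), scores.std())
```

Note that iterating over `range(cv)` only works when `cv` is an integer, which is also an assumption the custom class makes.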


## Split Train and Test Sets

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['SalePrice'], axis=1),
    df['SalePrice'],
    test_size=0.2,
    random_state=0
)

print("* Train set:", X_train.shape, y_train.shape,
      "\n* Test set:", X_test.shape, y_test.shape)

* Train set: (1168, 23) (1168,) 
* Test set: (292, 23) (292,)
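
As a quick sanity check on the split, the sketch below uses a placeholder DataFrame with the same number of rows (1460) to confirm that `test_size=0.2` yields 1168 train and 292 test rows, that the two sets do not overlap, and that a fixed `random_state` makes the split reproducible. The frame `df_demo` and its values are hypothetical and stand in for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with the same row count as the dataset (values are arbitrary)
df_demo = pd.DataFrame({"feature": np.arange(1460),
                        "SalePrice": np.arange(1460)})

X_train, X_test, y_train, y_test = train_test_split(
    df_demo.drop(["SalePrice"], axis=1),
    df_demo["SalePrice"],
    test_size=0.2,
    random_state=0,
)

# Repeating the call with the same random_state returns the same rows
X_train_again, X_test_again, _, _ = train_test_split(
    df_demo.drop(["SalePrice"], axis=1),
    df_demo["SalePrice"],
    test_size=0.2,
    random_state=0,
)

print("Train rows:", len(X_train), "Test rows:", len(X_test))
print("Overlap:", len(set(X_train.index) & set(X_test.index)))
```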


---

# Hyperparameter Optimisation

## Grid Search CV - sklearn

* First, run each candidate algorithm with its default hyperparameters to find the most suitable algorithm
* Then, perform an extensive search on the most suitable algorithm to find its best hyperparameter configuration
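
The two-stage approach above can be sketched as follows. This is a minimal illustration rather than the search used in this project: the data is synthetic, only three of the imported algorithms are included, and the grids in `extensive_grids` are hypothetical values chosen to keep the example fast. An empty parameter grid (`{}`) tells `GridSearchCV` to evaluate the estimator with its default hyperparameters.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the house price records)
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# Stage 1: quick search with default hyperparameters for each candidate
candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}
quick_scores = {}
for name, estimator in candidates.items():
    quick = GridSearchCV(estimator, {}, cv=3, scoring="r2")
    quick.fit(X, y)
    quick_scores[name] = quick.best_score_

best_name = max(quick_scores, key=quick_scores.get)

# Stage 2: extensive search on the best candidate only.
# These grids are hypothetical; each one includes the default values.
extensive_grids = {
    "LinearRegression": {"fit_intercept": [True, False]},
    "RandomForestRegressor": {"n_estimators": [50, 100],
                              "max_depth": [None, 5]},
    "GradientBoostingRegressor": {"n_estimators": [50, 100],
                                  "learning_rate": [0.05, 0.1]},
}
extensive = GridSearchCV(candidates[best_name], extensive_grids[best_name],
                         cv=3, scoring="r2")
extensive.fit(X, y)

print("Best algorithm:", best_name)
print("Best parameters:", extensive.best_params_)
```

In the notebook itself, the same pattern would go through `HyperparameterOptimisationSearch`, so that every candidate is wrapped in the full pipeline before being scored.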

---

# Assess Feature Importance

* Identify which features the fitted pipeline retains and plot their relative importance
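
One way to assess feature importance in a pipeline that ends with `feat_selection` and `model` steps is to read the boolean mask from `SelectFromModel.get_support()` and pair the retained columns with the model's `feature_importances_`. The sketch below assumes a tree-based model (which exposes `feature_importances_`), synthetic data, and hypothetical column names; it omits the data cleaning and feature engineering steps, which would also drop or transform columns in the real pipeline.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical column names
X_arr, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                           noise=5.0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("feat_selection", SelectFromModel(RandomForestRegressor(random_state=0))),
    ("model", RandomForestRegressor(random_state=0)),
])
pipeline.fit(X, y)

# Boolean mask of the columns kept by the feature selection step
selected = X.columns[pipeline["feat_selection"].get_support()]

# The final model only sees the selected columns, so the two lengths match
df_feature_importance = (
    pd.DataFrame({"Feature": selected,
                  "Importance": pipeline["model"].feature_importances_})
    .sort_values(by="Importance", ascending=False)
)
print(df_feature_importance)

# With matplotlib installed, a bar plot could be drawn with:
# df_feature_importance.plot(kind="bar", x="Feature", y="Importance")
```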

---

# Evaluate Regressor on Train and Test Sets
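
A fitted regressor can be evaluated on both sets with standard regression metrics. The sketch below is one possible approach, not the evaluation used in this project: the helper `regression_evaluation` is a hypothetical name, the data is synthetic, and a plain `LinearRegression` stands in for the optimised pipeline. Comparing train and test scores side by side helps to spot overfitting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def regression_evaluation(X, y, model):
    """Return R2, MAE and RMSE for a fitted model on the given data."""
    prediction = model.predict(X)
    return {
        "R2 Score": r2_score(y, prediction),
        "Mean Absolute Error": mean_absolute_error(y, prediction),
        "Root Mean Squared Error": np.sqrt(mean_squared_error(y, prediction)),
    }


# Synthetic stand-in data (assumption: not the house price records)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)

train_metrics = regression_evaluation(X_train, y_train, model)
test_metrics = regression_evaluation(X_test, y_test, model)

for name in train_metrics:
    print(f"{name}: train={train_metrics[name]:.3f}, test={test_metrics[name]:.3f}")
```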

---

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [8]:
import os
try:
    # create your folder here
    # os.makedirs(name='')
    pass  # placeholder so that the try block is not empty
except Exception as e:
    print(e)