# **Sale Price prediction**

## Objectives

* Fit and evaluate a regression model to predict the Sale Price of a house in Ames, Iowa

## Inputs

* outputs/datasets/collection/house_price_records.csv
* Instructions on which variables to use for data cleaning and feature engineering, found in the data cleaning and feature engineering notebooks.


## Outputs
* Generate a list of variables to engineer

## Conclusions

* Train set (features and target)
* Test set (features and target)
* Data cleaning and Feature Engineering pipeline
* Modeling pipeline
* Feature importance plot

---

# Change working directory

In [1]:
import numpy as np
import os

In [2]:
current_dir = os.getcwd()
current_dir

'/workspaces/heritage-housing-issues/jupyter_notebooks'

Change the working directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/heritage-housing-issues'
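
Re-running the `os.chdir` cell moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the path printed above; `project_root` is a hypothetical helper and not part of the original notebook.

```python
import os


def project_root(cwd, notebooks_dir="jupyter_notebooks"):
    """Return the parent of cwd if cwd is the notebooks folder, else cwd itself."""
    cwd = os.path.normpath(cwd)
    if os.path.basename(cwd) == notebooks_dir:
        return os.path.dirname(cwd)
    return cwd


# Safe to run repeatedly: only moves up when currently inside the notebooks folder.
os.chdir(project_root(os.getcwd()))
```

With this guard, a second execution of the cell leaves the working directory unchanged.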

---

# Load Data

Train Set

In [5]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/house_price_records.csv")
     .drop(labels=['EnclosedPorch', 'WoodDeckSF', '1stFlrSF', 'GarageArea', 'GarageYrBlt', 'GrLivArea', 'YearRemodAdd'],axis=1))
df.head()

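|   | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | GarageFinish | KitchenQual | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | YearBuilt | SalePrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 854.0 | 3.0 | No | 706 | GLQ | 150 | RFn | Gd | 8450 | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 2003 | 208500 |
| 1 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | RFn | TA | 9600 | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | 1976 | 181500 |
| 2 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | RFn | Gd | 11250 | 68.0 | 162.0 | 42 | 5 | 7 | 920 | 2001 | 223500 |
| 3 | NaN | NaN | No | 216 | ALQ | 540 | Unf | Gd | 9550 | 60.0 | 0.0 | 35 | 5 | 7 | 756 | 1915 | 140000 |
| 4 | NaN | 4.0 | Av | 655 | GLQ | 490 | RFn | Gd | 14260 | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | 2000 | 250000 |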


Test Set

# ML Pipeline with all Data

## ML pipeline for Data Cleaning and Feature Engineering

*Checking variable type and distribution, missing levels and what these variables mean in a business context*

In [6]:
from sklearn.pipeline import Pipeline

### Data Cleaning
from feature_engine.imputation import MeanMedianImputer
from feature_engine.imputation import CategoricalImputer

### Feature Engineering
from feature_engine import creation
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer


def PipelineDataCleaningAndFeatureEngineering():
    pipeline_base = Pipeline([
        # Data Cleaning 
        ("mean", MeanMedianImputer(imputation_method='mean', variables=['BedroomAbvGr', 'LotFrontage'])),

        ("median", MeanMedianImputer(imputation_method='median', variables=['2ndFlrSF', 'MasVnrArea'])),

        ("categorical", CategoricalImputer(imputation_method='missing', fill_value='None', variables=['GarageFinish', 'BsmtFinType1'])),

        # Feature Engineering
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary', variables=['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'])),

        ("lt", vt.LogTransformer(variables = ['LotArea'])),

        ("pt", vt.PowerTransformer(variables = ['LotFrontage', 'MasVnrArea', 'OpenPorchSF', 'TotalBsmtSF', '2ndFlrSF'])),

        # ("yj", vt.YeoJohnsonTransformer(variables = ['1stFlrSF'])),

        ("winsorizer", Winsorizer(capping_method='iqr', tail='both', fold=1.5, variables = ['LotArea', 'LotFrontage', 'MasVnrArea', 'OpenPorchSF', 'TotalBsmtSF', '2ndFlrSF'])),

        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")),

    ])

    return pipeline_base


PipelineDataCleaningAndFeatureEngineering()

Pipeline(steps=[('mean',
                 MeanMedianImputer(imputation_method='mean',
                                   variables=['BedroomAbvGr', 'LotFrontage'])),
                ('median',
                 MeanMedianImputer(variables=['2ndFlrSF', 'MasVnrArea'])),
                ('categorical',
                 CategoricalImputer(fill_value='None',
                                    variables=['GarageFinish',
                                               'BsmtFinType1'])),
                ('OrdinalCategoricalEncoder',
                 OrdinalEncoder(encoding_method='arbitrary',
                                variab...
                 PowerTransformer(variables=['LotFrontage', 'MasVnrArea',
                                             'OpenPorchSF', 'TotalBsmtSF',
                                             '2ndFlrSF'])),
                ('winsorizer',
                 Winsorizer(capping_method='iqr', fold=1.5, tail='both',
                            variables=['LotArea', 'Lot
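
The `winsorizer` step caps values that fall outside the IQR fences (the quartiles extended by `fold` times the interquartile range). The sketch below illustrates that idea in plain pandas. `iqr_cap` is a hypothetical helper, the last value in the series is a made-up outlier added for illustration, and feature_engine's own implementation may differ in detail.

```python
import pandas as pd


def iqr_cap(series, fold=1.5):
    """Clip values to [Q1 - fold * IQR, Q3 + fold * IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - fold * iqr, upper=q3 + fold * iqr)


# First five LotArea values from df.head(), plus one hypothetical outlier.
lot_area = pd.Series([8450, 9600, 11250, 9550, 14260, 100000])
capped = iqr_cap(lot_area)
print(capped)
```

Only the outlier is pulled in to the upper fence; values inside the fences are left unchanged.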

## Pipeline for Modelling and Hyperparameter Optimisation

In [7]:
# Feat Scaling
from sklearn.preprocessing import StandardScaler

# Feat Selection
from sklearn.feature_selection import SelectFromModel

# ML algorithms
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression


def PipelineClf(model):
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        ("feat_selection", SelectFromModel(model)),
        ("model", model),
    ])

    return pipeline_base



Custom Class for Hyperparameter Optimisation

In [8]:
from sklearn.model_selection import GridSearchCV


class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")

            model = PipelineClf(self.models[key])
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches
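
`score_summary` rebuilds a candidates-by-folds score matrix from the `split{i}_test_score` entries of `cv_results_`. A minimal sketch of that layout on synthetic data, assuming a plain `DecisionTreeRegressor` rather than the notebook's pipeline. It reads the number of folds from the fitted `n_splits_` attribute, which also works when `cv` is not an integer.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only to produce a cv_results_ dictionary.
X, y = make_regression(n_samples=60, n_features=4, random_state=0)

gs = GridSearchCV(DecisionTreeRegressor(random_state=0),
                  {"max_depth": [2, 4]}, cv=3, scoring="r2")
gs.fit(X, y)

# One array per fold, each holding one score per candidate.
split_keys = [f"split{i}_test_score" for i in range(gs.n_splits_)]
all_scores = np.column_stack([gs.cv_results_[k] for k in split_keys])

# Rows are candidates, columns are folds.
print(all_scores.shape)
print(all_scores.mean(axis=1))
```

The row means of this matrix correspond to the `mean_score` column that `score_summary` reports.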

## Split Train and Test Set

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['SalePrice'], axis=1),
    df['SalePrice'],
    test_size=0.2,
    random_state=0,
)

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

* Train set: (1168, 16) (1168,) 
* Test set: (292, 16) (292,)


## Grid Search CV: Sklearn

Use default hyperparameters to find the most suitable algorithm

In [10]:
models_quick_search = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "XGBRegressor": XGBRegressor(random_state=0),
}

params_quick_search = {
    "LinearRegression": {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}
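
An empty parameter dictionary means each algorithm is evaluated once with its default hyperparameters, which matches the "1 candidates" line in the search log. The sketch below shows how the grid size grows once values are listed. The `model__` prefix targets the `model` step of `PipelineClf`; the hyperparameter values themselves are illustrative assumptions, not the notebook's choices.

```python
from sklearn.model_selection import ParameterGrid

# Empty grid: a single candidate that keeps all defaults.
quick = ParameterGrid({})

# Hypothetical extensive grid for a tree ensemble inside PipelineClf.
tuned = ParameterGrid({
    "model__n_estimators": [50, 100],
    "model__max_depth": [3, 10],
})

print(len(quick), list(quick))
print(len(tuned))
```

Every extra value multiplies the number of candidates, and each candidate is fitted once per fold.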

Run the quick search on the Train Set

In [11]:
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train, scoring = 'r2', n_jobs=-1, cv=5)


Running GridSearchCV for LinearRegression 

Fitting 5 folds for each of 1 candidates, totalling 5 fits




ValueError: could not convert string to float: 'Av'

Traceback (most recent call last):
  File "/home/codeany/.local/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/codeany/.local/lib/python3.8/site-packages/sklearn/pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/home/codeany/.local/lib/python3.8/site-packages/sklearn/pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/home/codeany/.local/lib/python3.8/site-packages/joblib/memory.py", line 353, in __call__
    return self.func(*args, **kwargs)
  File "/home/codeany/.local/lib/python3.8/site-packages/sklearn/pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/codeany/.local/lib/python3.8/site-packages/sklearn/base.py", line 702, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/home/codeany/.local/lib/python
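
The traceback is consistent with string categories such as `'Av'` (a `BsmtExposure` level) reaching `StandardScaler`: `PipelineClf` has no encoding step, and `PipelineDataCleaningAndFeatureEngineering` is not applied to `X_train` in the cells shown. The sketch below reproduces the failure on a toy frame built from the `df.head()` rows and then encodes before scaling. scikit-learn's `OrdinalEncoder` stands in for the feature_engine encoder here, an assumption made only to keep the example self-contained.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Two feature columns and the target, taken from the df.head() output.
X = pd.DataFrame({
    "BsmtExposure": ["No", "Gd", "Mn", "No", "Av"],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})
y = pd.Series([208500, 181500, 223500, 140000, 250000])

# Scaling the raw strings fails in the same way as the search above.
try:
    Pipeline([("scaler", StandardScaler()),
              ("model", LinearRegression())]).fit(X, y)
    raw_fit_failed = False
except ValueError as err:
    raw_fit_failed = True
    print("Raw fit failed:", err)

# Encoding the categorical column first lets the scaler see numbers only.
encode = ColumnTransformer(
    [("cat", OrdinalEncoder(), ["BsmtExposure"])],
    remainder="passthrough",
)
fixed = Pipeline([("encode", encode),
                  ("scaler", StandardScaler()),
                  ("model", LinearRegression())])
fixed.fit(X, y)
preds = fixed.predict(X)
```

In the notebook, the equivalent would presumably be to fit `PipelineDataCleaningAndFeatureEngineering()` on `X_train`, transform both `X_train` and `X_test` with it, and only then call `search.fit`.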

Use SMOTE to balance Train Set target

In [14]:
from imblearn.over_sampling import SMOTE
oversample = SMOTE(sampling_strategy='minority', random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

ValueError: Expected n_neighbors <= n_samples,  but n_samples = 1, n_neighbors = 6
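
SMOTE is designed for categorical targets. Applied to `SalePrice`, it treats every distinct price as its own class, so most "classes" hold a single sample. The check below is a minimal sketch of that diagnosis, assuming imblearn's default `k_neighbors=5`, which requires at least 6 samples in the class being oversampled and is consistent with `n_neighbors = 6` in the error message.

```python
import pandas as pd

# First five SalePrice values from the df.head() output.
y = pd.Series([208500, 181500, 223500, 140000, 250000], name="SalePrice")

# Each distinct target value counts as one class for SMOTE.
class_sizes = y.value_counts()

# Classes with fewer than k_neighbors + 1 samples cannot be oversampled.
too_small = int((class_sizes < 6).sum())
print(f"{too_small} of {len(class_sizes)} target values have fewer than 6 samples")
```

For a regression target, class balancing does not apply; a histogram of `y_train` is a more suitable way to inspect the distribution.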

Check Train Set Target distribution after SMOTE

In [None]:
import matplotlib.pyplot as plt
y_train.value_counts().plot(kind='bar',title='Train Set Target Distribution')
plt.show()

# Conclusions and Next Steps

Feature Engineering Transformers:

* Categorical encoding: BsmtExposure, BsmtFinType1, GarageFinish, KitchenQual

* Numerical transformation & winsorizer: GrLivArea, GarageArea, LotArea, LotFrontage, MasVnrArea, OpenPorchSF, TotalBsmtSF, 1stFlrSF, 2ndFlrSF

* Smart Correlation: 1stFlrSF, GarageArea, GarageYrBlt, GrLivArea, YearRemodAdd
