# Prediction of Sale Price

## Objective 
Develop and assess a regression model to predict SalePrice, addressing Business Requirement 2.

## Inputs
* outputs/datasets/cleaned/clean_set.csv

## Outputs
* Train set (features and target)
* Test set (features and target)
* Data cleaning and feature engineering pipeline
* Feature importance plot

## CRISP-DM
Modelling and evaluation.


## Change working directory

We need to change the working directory from the notebook's folder to its parent folder, so that relative paths such as outputs/datasets/... resolve from the project root.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/Heritage-Housing/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.chdir() defines the new current directory

In [2]:
os.chdir('/workspaces/Heritage-Housing')
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/Heritage-Housing'
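
The cell above hard-codes the project path. As a minimal sketch, the same move can be derived from the current directory instead, which keeps working if the repository is cloned elsewhere. The helper names `parent_dir` and `move_to_parent_dir` are hypothetical and not part of the original notebook.

```python
import os


def parent_dir(path):
    """Return the folder that contains `path` (hypothetical helper)."""
    return os.path.dirname(os.path.normpath(path))


def move_to_parent_dir():
    """Make the parent of the current working directory the new one."""
    new_dir = parent_dir(os.getcwd())
    os.chdir(new_dir)
    return new_dir


# Folder printed by the first cell above
notebook_dir = "/workspaces/Heritage-Housing/jupyter_notebooks"
print(parent_dir(notebook_dir))
```

In the notebook, calling `move_to_parent_dir()` once would replace the hard-coded `os.chdir(...)` call. Running it twice moves up two levels, so the cell should only be executed once per session.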

## Load Data

In [5]:
import pandas as pd
%matplotlib inline
train_set_path = "outputs/datasets/cleaned/clean_set.csv"
df = pd.read_csv(train_set_path)
df.head(3)

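| Unnamed: 0 | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | GarageArea | GarageFinish | GarageYrBlt | ... | KitchenQual | LotArea | LotFrontage | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 548 | RFn | 2003.0 | ... | Gd | 8450 | 65.0 | 61 | 5 | 7 | 856 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | 460 | RFn | 1976.0 | ... | TA | 9600 | 80.0 | 0 | 8 | 6 | 1262 | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 608 | RFn | 2001.0 | ... | Gd | 11250 | 68.0 | 42 | 5 | 7 | 920 | 2001 | 2002 | 223500 |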
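
The `Unnamed: 0` column in the output above typically appears when a CSV was written together with its index and then read back without `index_col`. Unless it is dropped later, it would be passed to the pipeline as if it were a feature. The sketch below shows the effect on a tiny in-memory CSV; the three rows are a stand-in assumption, not the real clean_set.csv file.

```python
import io

import pandas as pd

# Tiny stand-in for a CSV that was saved with its index (assumption)
csv_text = ",1stFlrSF,SalePrice\n0,856,208500\n1,1262,181500\n2,920,223500\n"

# Default read: the saved index comes back as an extra column
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the index instead
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

An alternative with the same effect is to keep the default read and call `df.drop(columns=["Unnamed: 0"])` afterwards.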


## Machine Learning Pipeline

* We first create an ML pipeline for our data cleaning and feature engineering

In [6]:
from sklearn.pipeline import Pipeline

### Feature Engineering
from feature_engine import transformation as vt
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection

### Feature Scaling
from sklearn.preprocessing import StandardScaler

### Feature Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

selection_method = "variance"
corr_method = "spearman"

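def PipelineOptimization(model):
    """Return the full cleaning, feature engineering and modelling pipeline.

    Relies on the module-level settings corr_method and selection_method.
    """
    pipeline_base = Pipeline([
        # Encode the categorical variables as integers
        ("OrdinalCategoricalEncoder",
         OrdinalEncoder(encoding_method='arbitrary',
                        variables=['BsmtExposure', 'BsmtFinType1',
                                   'GarageFinish', 'KitchenQual'])),

        # Numerical transformations (the log transform needs positive values)
        ("NumericLogTransform",
         vt.LogTransformer(variables=['1stFlrSF', 'LotArea',
                                      'GrLivArea', 'LotFrontage'])),
        ("NumericPowerTransform",
         vt.PowerTransformer(variables=['TotalBsmtSF', 'OpenPorchSF'])),
        # Note: TotalBsmtSF was already power-transformed in the previous step
        ("NumericYeoJohnsonTransform",
         vt.YeoJohnsonTransformer(variables=['TotalBsmtSF'])),

        # From each group of features correlated above the threshold, keep one
        ("SmartCorrelatedSelection",
         SmartCorrelatedSelection(variables=None,
                                  method=corr_method,
                                  threshold=0.8,
                                  selection_method=selection_method)),

        ("feat_scaling", StandardScaler()),

        # Keep only the features the model considers important
        ("feat_selection", SelectFromModel(model)),

        ("model", model),
    ])

    return pipeline_base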
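
To illustrate how the last three steps behave once the pipeline is fitted, here is a reduced sketch that uses only scikit-learn. It is not the notebook's pipeline: `build_tail_pipeline` is a hypothetical helper that keeps only the scaling, feature selection and model steps, and the data is synthetic (an assumption), so the numbers it prints say nothing about the house price data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 10 features, only 3 informative
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)


def build_tail_pipeline(model):
    """Hypothetical helper: only the last three steps of the pipeline."""
    return Pipeline([
        ("feat_scaling", StandardScaler()),
        ("feat_selection", SelectFromModel(model)),
        ("model", model),
    ])


pipeline = build_tail_pipeline(
    RandomForestRegressor(n_estimators=50, random_state=0))
pipeline.fit(X_train, y_train)

# Boolean mask of the features that SelectFromModel kept
selected_mask = pipeline.named_steps["feat_selection"].get_support()
r2_test = pipeline.score(X_test, y_test)

print("features kept:", int(selected_mask.sum()), "of", X.shape[1])
print("R2 on test set:", round(r2_test, 3))
```

The same model object is passed to both `SelectFromModel` and the final step. `SelectFromModel` fits its own clone to rank the features, and the final step is then trained only on the features that were kept.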