# **Modelling and Evaluation: Regression**

## Objectives

1. Fit and evaluate a regression model that predicts house prices.
2. Answer business criterion 2: predict prices for the client's house data.
3. Achieve the agreed R2 score of at least 0.75 on both the train set and the test set.
4. Predict house prices from input data supplied through the Streamlit dashboard.

## Inputs

1. House_prices_records_clean.csv
2. Inherited_houses_clean.csv
3. Findings of feature engineering notebook

## Outputs

1. Train, validation and test data sets.
2. Feature engineering pipeline.
3. Trained Model.
4. Predictions for client house prices and supporting data.



---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Heritage-Housing/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Heritage-Housing'

# Load data

Note that we are using the cleaned data sets from the clean data notebook.

In [4]:
import numpy as np
import pandas as pd

house_prices_clean_df = pd.read_csv("outputs/datasets/clean_data/House_prices_records_clean.csv")
inherited_houses_clean_df = pd.read_csv("outputs/datasets/clean_data/Inherited_houses_clean.csv")

In [5]:
house_prices_clean_df.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GarageYrBlt,...,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,548,RFn,2003.0,...,8450,65.0,196.0,61,5,7,856,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,460,RFn,1976.0,...,9600,80.0,0.0,0,8,6,1262,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,608,RFn,2001.0,...,11250,68.0,162.0,42,5,7,920,2001,2002,223500
3,961,0.0,3.0,No,216,ALQ,540,642,Unf,1998.0,...,9550,60.0,0.0,35,5,7,756,1915,1970,140000
4,1145,0.0,4.0,Av,655,GLQ,490,836,RFn,2000.0,...,14260,84.0,350.0,84,5,8,1145,2000,2000,250000


In [6]:
inherited_houses_clean_df.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GarageYrBlt,...,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,YearBuilt,YearRemodAdd
0,896,0,2,No,468.0,Rec,270.0,730.0,Unf,1961.0,...,TA,11622,80.0,0.0,0,6,5,882.0,1961,1961
1,1329,0,3,No,923.0,ALQ,406.0,312.0,Unf,1958.0,...,Gd,14267,81.0,108.0,36,6,6,1329.0,1958,1958
2,928,701,3,No,791.0,GLQ,137.0,482.0,Fin,1997.0,...,TA,13830,74.0,0.0,34,5,5,928.0,1997,1998
3,926,678,3,No,602.0,GLQ,324.0,470.0,Fin,1998.0,...,Gd,9978,78.0,20.0,36,6,6,926.0,1998,1998


---

# Split data

We need to split our house price data into train, validation and test sets, aiming for proportions close to the conventional 7:1:2 ratio.

First we split into train and test sets.

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    house_prices_clean_df.drop(['SalePrice'], axis=1),
    house_prices_clean_df['SalePrice'],
    test_size=0.2,
    random_state=0
)

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

* Train set: (1168, 21) (1168,) 
* Test set: (292, 21) (292,)


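Then we split the train set again to create the validation set. With test_size=0.2, the validation set takes 20% of the remaining train rows (16% of the full data set), so the overall proportions are roughly 64:16:20, close to but not exactly 7:1:2.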

In [8]:
X_train, X_val, y_train, y_val = train_test_split(
    X_train,
    y_train,
    test_size=0.2,
    random_state=0
)

print("* Train set:", X_train.shape, y_train.shape)
print("* Validation set:",  X_val.shape, y_val.shape)
print("* Test set:",   X_test.shape, y_test.shape)

* Train set: (934, 21) (934,)
* Validation set: (234, 21) (234,)
* Test set: (292, 21) (292,)
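
If an exact 7:1:2 split were wanted instead, one option is to hold out 0.125 of the remaining rows in the second split, since 0.125 × 0.8 = 0.1. The following is a minimal sketch on made-up data; the `demo_*` names and the 1,000-row frame are assumptions for illustration, not part of the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, two features and a target
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({"a": rng.normal(size=1000), "b": rng.normal(size=1000)})
demo_y = pd.Series(rng.normal(size=1000), name="target")

# First split: hold out 20% of all rows as the test set
demo_X_rest, demo_X_test, demo_y_rest, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0
)

# Second split: 0.125 of the remaining 80% equals 10% of all rows
demo_X_train, demo_X_val, demo_y_train, demo_y_val = train_test_split(
    demo_X_rest, demo_y_rest, test_size=0.125, random_state=0
)

print(len(demo_X_train), len(demo_X_val), len(demo_X_test))
```

On 1,000 rows this yields 700 train, 100 validation and 200 test rows.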


---

# Pipeline

We apply the feature engineering steps identified in the feature engineering notebook.

In [9]:
from sklearn.pipeline import Pipeline
from feature_engine.selection import DropFeatures
from feature_engine.encoding import OrdinalEncoder
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from sklearn.preprocessing import StandardScaler



pipeline = Pipeline([
    ('drop_features', DropFeatures(features_to_drop = ['1stFlrSF',
                                                        '2ndFlrSF',
                                                        'GarageArea',
                                                        'GarageYrBlt',
                                                        'YearBuilt']) ),

    ("OrdinalCategoricalEncoder",OrdinalEncoder(encoding_method='arbitrary', 
                                                  variables = ['BsmtExposure',
                                                               'BsmtFinType1',
                                                               'GarageFinish',
                                                               'KitchenQual'] ) ),

    ('pt', vt.PowerTransformer(variables = ['BedroomAbvGr',
                                            'BsmtFinSF1',
                                            'BsmtUnfSF',
                                            'GrLivArea',
                                            'LotArea',
                                            'LotFrontage',
                                            'MasVnrArea',
                                            'OpenPorchSF',
                                            'OverallCond',
                                            'OverallQual',
                                            'TotalBsmtSF',
                                            'YearRemodAdd']) ),

    ('winsorizer_iqr', Winsorizer(capping_method='iqr', fold=1.5, tail='both')),
    
    ( "feat_scaling",StandardScaler() )
  ])
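
The `winsorizer_iqr` step caps extreme values instead of dropping rows. As a rough illustration of the IQR rule it is configured with (`fold=1.5`, both tails), the sketch below reimplements the capping in plain NumPy. The helper name `cap_iqr` and the toy values are hypothetical, and this is not feature_engine's own implementation.

```python
import numpy as np


def cap_iqr(values, fold=1.5):
    """Cap values outside [Q1 - fold*IQR, Q3 + fold*IQR] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - fold * iqr
    upper = q3 + fold * iqr
    # Values beyond the fences are replaced by the fence value
    return np.clip(values, lower, upper)


toy_values = [1, 2, 3, 4, 5, 100]
print(cap_iqr(toy_values))
```

Here the outlier 100 is pulled down to the upper fence while the other values are left unchanged. In the pipeline, the fences are learned from the train set during fitting and then reused for the validation and test sets.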

We fit the pipeline on the train set only, then use the fitted pipeline to transform the train, validation and test sets.

In [10]:
X_train = pipeline.fit_transform(X_train)
X_val = pipeline.transform(X_val)
X_test = pipeline.transform(X_test)
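
Only the train set is passed to `fit_transform`; the validation and test sets go through `transform`, so they are processed with parameters learned from the train rows alone. A minimal sketch of that behaviour with a single `StandardScaler` on made-up numbers (the `toy_*` arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers standing in for one feature
toy_train = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_val = np.array([[10.0], [20.0]])

scaler = StandardScaler()
toy_train_scaled = scaler.fit_transform(toy_train)  # learns mean and std from train only
toy_val_scaled = scaler.transform(toy_val)          # reuses the train mean and std

print(scaler.mean_)              # mean learned from the train rows
print(toy_train_scaled.mean())   # approximately 0 by construction
print(toy_val_scaled.ravel())    # scaled with the train statistics, not its own
```

Because the validation values are scaled with the train mean and standard deviation, they do not end up centred on zero. This keeps information from the held-out sets from leaking into the fitted transformers.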

---

# The Model
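
Once a regressor has been fitted, the agreed criterion of an R2 score of at least 0.75 on both the train and test sets could be checked along the following lines. This is only a sketch: the `LinearRegression` estimator, the synthetic data, the `demo_*` names and the helper `meets_r2_target` are assumptions for illustration, not the project's chosen model or results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

R2_TARGET = 0.75  # threshold agreed in the objectives


def meets_r2_target(model, X, y, target=R2_TARGET):
    """Return (passed, score) for a fitted model on one data set (hypothetical helper)."""
    score = r2_score(y, model.predict(X))
    return score >= target, score


# Synthetic, nearly linear data used only to exercise the helper
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(200, 3))
demo_y = demo_X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

demo_X_tr, demo_X_te, demo_y_tr, demo_y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0
)

demo_model = LinearRegression().fit(demo_X_tr, demo_y_tr)

# Report the score and the pass/fail result for each set
for name, X_part, y_part in [("train", demo_X_tr, demo_y_tr),
                             ("test", demo_X_te, demo_y_te)]:
    passed, score = meets_r2_target(demo_model, X_part, y_part)
    print(f"{name}: R2 = {score:.3f}, meets target: {passed}")
```

The same helper could be called with the real fitted model and the transformed `X_train`, `X_test`, `y_train` and `y_test` from the cells above.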

---

# Predict house prices

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid Python while the line above is commented out
except Exception as e:
  print(e)
