# **Regression**

## Objectives

* Fit and evaluate a "Regression Model" to predict "SalePrice"

## Inputs

* "outputs/datasets/collection/cleaned/CleanedHousePrices.csv"

## Outputs

* To be added at the end


---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor, the working directory must be changed.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/P5-Heritage-Housing/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/P5-Heritage-Housing'

# Loading of Data

In [4]:
import numpy as np
import pandas as pd
# Relative path works because the working directory is now the project root
cleaned_house_prices_df = pd.read_csv("outputs/datasets/collection/cleaned/CleanedHousePrices.csv")
print(cleaned_house_prices_df.shape)
cleaned_house_prices_df.head(3)

(1460, 22)


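| Unnamed: 0 | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | GarageArea | GarageFinish | GarageYrBlt | ... | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 548 | RFn | 2003.0 | ... | 8450 | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | 460 | RFn | 1976.0 | ... | 9600 | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 608 | RFn | 2001.0 | ... | 11250 | 68.0 | 162.0 | 42 | 5 | 7 | 920 | 2001 | 2002 | 223500 |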


---

# ML Pipeline: Regressor

## Creation of ML Pipeline

### Packages

In [5]:
# For Feature Engineering
from feature_engine.encoding import OrdinalEncoder
from feature_engine import transformation as vt
from feature_engine.selection import SmartCorrelatedSelection

# For Feature Scaling
from sklearn.preprocessing import StandardScaler

# For Feature Selection
from sklearn.feature_selection import SelectFromModel

# For ML Algorithms
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor

In [6]:
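def PipelineOptimization(model):
    """Build the preprocessing and modelling pipeline around the given regressor."""
    pipeline_base = Pipeline([
        # Encode the categorical variables as arbitrary integers
        ("OrdinalCategoricalEncoder", OrdinalEncoder(
            encoding_method='arbitrary',
            variables=['KitchenQual', 'GarageFinish', 'BsmtFinType1', 'BsmtExposure'])),

        # Apply a Yeo-Johnson power transform to selected numerical variables
        ("YeoJohnsonTransformer", vt.YeoJohnsonTransformer(
            variables=['1stFlrSF', 'GrLivArea', 'LotArea', 'LotFrontage'])),

        # From each group of features with Spearman correlation above 0.6,
        # keep the one with the highest variance
        ("SmartCorrelatedSelection", SmartCorrelatedSelection(
            variables=None, method="spearman", threshold=0.6,
            selection_method="variance")),

        # Standardize the features
        ("feat_scaling", StandardScaler()),

        # Keep the features the model ranks as most important
        ("feat_selection", SelectFromModel(model)),

        ("model", model),
    ])

    return pipeline_base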

---

### Hyperparameter Optimization

- The custom class is taken from the "Scikit-Learn Unit 9B: NLP(Natural Language Processing) Best Algorithm Hyperparameters" lesson.

In [7]:
from sklearn.model_selection import GridSearchCV


class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = PipelineOptimization(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches
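
As a rough sketch of what `score_summary` does with the per-split scores, the snippet below builds a small mock `cv_results_` dictionary and stacks the `split{i}_test_score` arrays so that each row holds one candidate's scores across the folds. The mock values and the names `mock_cv_results` and `summary` are invented for illustration and are not produced by this notebook.

```python
import numpy as np

# Mock of GridSearchCV.cv_results_ for 2 candidates and 3 folds.
# All numbers are invented for illustration only.
mock_cv_results = {
    "params": [{"model__max_depth": 3}, {"model__max_depth": 7}],
    "split0_test_score": np.array([0.70, 0.60]),
    "split1_test_score": np.array([0.80, 0.75]),
    "split2_test_score": np.array([0.75, 0.90]),
}
n_splits = 3

params = mock_cv_results["params"]
# Each split array holds one score per candidate. Reshape it to a column
# and stack the columns side by side: shape (n_candidates, n_splits).
columns = [
    mock_cv_results[f"split{i}_test_score"].reshape(len(params), 1)
    for i in range(n_splits)
]
all_scores = np.hstack(columns)

# One summary row per candidate, mirroring the row() helper above.
summary = [
    {**p, "min_score": s.min(), "mean_score": s.mean(), "max_score": s.max()}
    for p, s in zip(params, all_scores)
]

print(all_scores.shape)
for row in summary:
    print(row)
```

Stacking by column keeps the candidate order of `cv_results_['params']`, which is why the rows can be zipped with the parameter dictionaries.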

### Splitting Dataset to TrainSet and TestSet

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    cleaned_house_prices_df.drop(['SalePrice'], axis=1),
    cleaned_house_prices_df['SalePrice'],
    test_size=0.2,
    random_state=0
)

print("* TrainSet:", X_train.shape, y_train.shape,
      "\n* TestSet:",  X_test.shape, y_test.shape)

* TrainSet: (1168, 21) (1168,) 
* TestSet: (292, 21) (292,)
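
The set sizes above follow directly from `test_size=0.2` applied to the 1460 rows. A minimal arithmetic sketch, assuming the test size is rounded up and the remaining rows go to the train set:

```python
import math

n_samples = 1460  # rows reported by cleaned_house_prices_df.shape
test_size = 0.2

# Assumption: the test set size is rounded up and the rest is used for training.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print("train rows:", n_train, "test rows:", n_test)
```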


### Grid Search CV

- GridSearchCV is a scikit-learn tool that evaluates every combination of values in a parameter grid using cross-validation and reports the best-scoring hyperparameters for a model.
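
To illustrate the idea in isolation, here is a minimal sketch of `GridSearchCV` on synthetic data. The data set, the parameter grid and the choice of `DecisionTreeRegressor` are assumptions made for the example and are unrelated to the house price data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only to illustrate the API.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 2 x 2 = 4 candidate combinations, each evaluated with 5-fold cross-validation.
param_grid = {"max_depth": [2, 4], "min_samples_split": [2, 10]}
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="r2",
)
grid.fit(X, y)

print("candidates evaluated:", len(grid.cv_results_["params"]))
print("best parameters:", grid.best_params_)
print("best mean R2:", round(grid.best_score_, 3))
```

An empty grid (`{}`) yields a single candidate, the estimator with its default hyperparameters.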

In [11]:
models_crude_search = {
    'LinearRegression': LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "XGBRegressor": XGBRegressor(random_state=0),
}

params_crude_search = {
    'LinearRegression': {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}

#### Comprehensive Searching

In [13]:
crude_search = HyperparameterOptimizationSearch(models=models_crude_search, params=params_crude_search)
crude_search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)


Running GridSearchCV for LinearRegression 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for DecisionTreeRegressor 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for RandomForestRegressor 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for ExtraTreesRegressor 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for AdaBoostRegressor 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for GradientBoostingRegressor 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for XGBRegressor 

Fitting 5 folds for each of 1 candidates, totalling 5 fits


In [14]:
grid_search_summary, grid_search_pipelines = crude_search.score_summary(sort_by='mean_score')
grid_search_summary

| Unnamed: 0 | estimator | min_score | mean_score | max_score | std_score |
|---|---|---|---|---|---|
| 5 | GradientBoostingRegressor | 0.643137 | 0.772857 | 0.834133 | 0.067405 |
| 3 | ExtraTreesRegressor | 0.682734 | 0.766507 | 0.805553 | 0.042962 |
| 2 | RandomForestRegressor | 0.618654 | 0.762363 | 0.822354 | 0.073567 |
| 0 | LinearRegression | 0.7225 | 0.761412 | 0.830006 | 0.038917 |
| 4 | AdaBoostRegressor | 0.507389 | 0.673009 | 0.789042 | 0.098587 |
| 6 | XGBRegressor | 0.554363 | 0.653108 | 0.710833 | 0.056162 |
| 1 | DecisionTreeRegressor | 0.355924 | 0.531838 | 0.650633 | 0.111503 |


- A more extensive hyperparameter search can now be run for the four best-performing models from the crude search. The models and their parameter grids are defined below:

In [25]:
models_extensive_search = {
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    'LinearRegression': LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
}

params_extensive_search = {
    "GradientBoostingRegressor": {
        'model__n_estimators': [50, 100],
        'model__max_depth': [3,7,10],
        'model__min_samples_split': [5,7],

    },

    "ExtraTreesRegressor":{
        'model__n_estimators': [7],
        'model__max_depth': [3,7],
        'model__min_samples_split': [4,5,7],
    },
    
    'LinearRegression': {},
    
    "RandomForestRegressor": {
        'model__n_estimators': [7],
        'model__max_depth': [3,7,10],
        'model__min_samples_split': [4,5,7],
    },
}

In [26]:
search = HyperparameterOptimizationSearch(models=models_extensive_search, params=params_extensive_search)
search.fit(X_train, y_train, scoring = 'r2', n_jobs=-1, cv=6)


Running GridSearchCV for ExtraTreesRegressor 

Fitting 6 folds for each of 6 candidates, totalling 36 fits



Running GridSearchCV for GradientBoostingRegressor 

Fitting 6 folds for each of 12 candidates, totalling 72 fits

Running GridSearchCV for LinearRegression 

Fitting 6 folds for each of 1 candidates, totalling 6 fits

Running GridSearchCV for RandomForestRegressor 

Fitting 6 folds for each of 9 candidates, totalling 54 fits
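
The candidate and fit counts in the log above can be checked by hand: the number of candidates is the product of the list lengths in each grid, and the number of fits is that product times the number of folds. A small sketch using only the standard library, with the grids copied from `params_extensive_search` (the names `param_grids` and `fit_counts` are introduced here for illustration):

```python
from itertools import product

# Grids copied from params_extensive_search.
param_grids = {
    "ExtraTreesRegressor": {
        "model__n_estimators": [7],
        "model__max_depth": [3, 7],
        "model__min_samples_split": [4, 5, 7],
    },
    "GradientBoostingRegressor": {
        "model__n_estimators": [50, 100],
        "model__max_depth": [3, 7, 10],
        "model__min_samples_split": [5, 7],
    },
    "LinearRegression": {},
    "RandomForestRegressor": {
        "model__n_estimators": [7],
        "model__max_depth": [3, 7, 10],
        "model__min_samples_split": [4, 5, 7],
    },
}
cv = 6  # number of folds used in search.fit

fit_counts = {}
for name, grid in param_grids.items():
    # product() over no lists yields one empty combination,
    # so an empty grid counts as a single candidate.
    n_candidates = len(list(product(*grid.values())))
    fit_counts[name] = (n_candidates, n_candidates * cv)
    print(f"{name}: {n_candidates} candidates, {n_candidates * cv} fits")
```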


In [33]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

| Unnamed: 0 | estimator | min_score | mean_score | max_score | std_score | model__max_depth | model__min_samples_split | model__n_estimators |
|---|---|---|---|---|---|---|---|---|
| 12 | GradientBoostingRegressor | 0.610738 | 0.760618 | 0.807907 | 0.067881 | 7.0 | 7.0 | 50.0 |
| 6 | GradientBoostingRegressor | 0.528962 | 0.755841 | 0.822732 | 0.103223 | 3.0 | 5.0 | 50.0 |
| 8 | GradientBoostingRegressor | 0.528962 | 0.755424 | 0.822732 | 0.102862 | 3.0 | 7.0 | 50.0 |
| 13 | GradientBoostingRegressor | 0.59896 | 0.754889 | 0.810252 | 0.07106 | 7.0 | 7.0 | 100.0 |
| 10 | GradientBoostingRegressor | 0.598433 | 0.75486 | 0.816733 | 0.071386 | 7.0 | 5.0 | 50.0 |
| 11 | GradientBoostingRegressor | 0.591726 | 0.750328 | 0.815616 | 0.072757 | 7.0 | 5.0 | 100.0 |
| 18 | LinearRegression | 0.616022 | 0.74968 | 0.817007 | 0.06982 | | | |
| 16 | GradientBoostingRegressor | 0.627971 | 0.747726 | 0.803854 | 0.057732 | 10.0 | 7.0 | 50.0 |
| 17 | GradientBoostingRegressor | 0.624305 | 0.745066 | 0.805475 | 0.059174 | 10.0 | 7.0 | 100.0 |
| 7 | GradientBoostingRegressor | 0.486392 | 0.744606 | 0.831732 | 0.116949 | 3.0 | 5.0 | 100.0 |


In [34]:
best_model = grid_search_summary.iloc[0,0]
best_model

'GradientBoostingRegressor'

In [29]:
best_regressor_pipeline = grid_search_pipelines[best_model].best_estimator_
best_regressor_pipeline
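
One possible next step is to evaluate the selected pipeline on the held-out test set. The sketch below shows one way this could look; `regression_performance` is a hypothetical helper, and synthetic data with a plain `LinearRegression` stands in for `best_regressor_pipeline` and the house price data so that the example runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def regression_performance(pipeline, X_train, y_train, X_test, y_test):
    """Hypothetical helper: return R2 and MAE for the train and test sets."""
    report = {}
    for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
        prediction = pipeline.predict(X)
        report[name] = {
            "r2": r2_score(y, prediction),
            "mae": mean_absolute_error(y, prediction),
        }
    return report


# Stand-in data and model (assumptions for illustration only).
X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

report = regression_performance(model, X_tr, y_tr, X_te, y_te)
for split, metrics in report.items():
    print(split, {k: round(v, 3) for k, v in metrics.items()})
```

In the notebook itself, the call would presumably take `best_regressor_pipeline` together with `X_train`, `y_train`, `X_test` and `y_test`. A large gap between the train and test scores would suggest overfitting.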


---

# Push files to Repo


In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid Python until a folder is defined
except Exception as e:
  print(e)
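
A minimal sketch of creating an output folder safely is shown below. The folder name `example_folder` is hypothetical, and a temporary directory stands in for the repository root so that the example leaves no files behind.

```python
import os
import tempfile

# A temporary directory stands in for the repository root (assumption).
with tempfile.TemporaryDirectory() as repo_root:
    # Hypothetical output folder name, for illustration only.
    output_dir = os.path.join(repo_root, "outputs", "example_folder")

    # exist_ok=True makes the call safe to repeat when the notebook is re-run.
    os.makedirs(output_dir, exist_ok=True)
    os.makedirs(output_dir, exist_ok=True)

    created = os.path.isdir(output_dir)

print("folder created:", created)
```

With `exist_ok=True`, re-running the cell does not raise an error when the folder already exists, so the `try`/`except` wrapper is only needed for other failures such as missing permissions.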
