# XGBoost/Pipeline/GridSearchCV/EarlyStopping Machine Learning Model - GENERAL TEMPLATE

**Author: Daniel Diaz Moncada**

**Date: 2022-09-29**

**Version: 0.1.1**


* This is a general Machine Learning template I built mainly for XGBoost models. It combines hyperparameter tuning with GridSearchCV, Early Stopping, and data pre-processing through a Pipeline.

* During the hyperparameter-tuning phase, the data is pre-processed externally and then fed to the model, so the preprocessor object and the predicting model are not bundled in the same Pipeline. The reason for this limitation is Early Stopping: it requires an evaluation dataset, and a Pipeline that bundles a preprocessor and a predicting model only transforms the training dataset it is fitted on. The evaluation dataset handed to the model through that Pipeline would arrive unprocessed.

* The evaluation dataset is needed to check, for each combination of hyperparameters being tested, whether the loss function is still decreasing or has started increasing, so that training can stop early or continue for that round.

* After determining the optimal set of hyperparameters, the original training and validation datasets are joined into a complete training dataset, and a Final Model is trained with the best combination of hyperparameters found. Because only one combination of parameters is run, this Final Model needs neither Early Stopping nor an evaluation set, which removes the inconvenience of pre-processing data outside the Pipeline. The preprocessor object and the predicting model can therefore be bundled in a single Pipeline and fed to a GridSearchCV object that performs cross-validation only: it does no tuning and simply uses the single best combination of hyperparameters found during the hyperparameter-tuning phase.

* After its final training, this Final Model is dumped into a .pkl file. The .pkl file is then loaded to make predictions on new, unseen data.

* This template is a convenient and adaptable starting point for building ML models: it combines an efficient Pre-Processing Pipeline, an XGBoost model, Early Stopping for faster hyperparameter tuning, and cross-validation.

* There is much more you can add to improve the model, and adjust it to your needs, but it is definitely a great template to start with.
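The two phases described above can be illustrated with a small scikit-learn-only sketch. It is not the template's actual model: `EvalSetRecorder` is a hypothetical stand-in for an estimator whose `fit` accepts an `eval_set` (as XGBoost's scikit-learn wrapper does), `LinearRegression` stands in for the final XGBoost model, and the toy columns `area` and `zone` are invented for the example. The first half shows that a Pipeline forwards `model__eval_set` to the last step without transforming it; the second half shows a bundled Pipeline surviving a pickle round trip and predicting from raw input.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class EvalSetRecorder(BaseEstimator, RegressorMixin):
    """Hypothetical stand-in for a model whose fit() accepts eval_set."""

    def fit(self, X, y, eval_set=None):
        # Record how many columns the training data and the evaluation data have.
        self.n_train_cols_ = X.shape[1]
        self.n_eval_cols_ = None if eval_set is None else eval_set[0][0].shape[1]
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


# Toy data (invented for illustration): one numeric and one categorical column.
X_train = pd.DataFrame({"area": [50.0, 80.0, 120.0, 65.0], "zone": ["A", "B", "C", "A"]})
y_train = pd.Series([100.0, 160.0, 240.0, 130.0])
X_valid = pd.DataFrame({"area": [70.0, 90.0], "zone": ["B", "C"]})
y_valid = pd.Series([140.0, 180.0])

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["area"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["zone"]),
])

# Phase 1 problem: the Pipeline transforms X_train, but eval_set is passed through as-is.
probe = Pipeline([("preprocessor", clone(preprocessor)), ("model", EvalSetRecorder())])
probe.fit(X_train, y_train, model__eval_set=[(X_valid, y_valid)])
recorder = probe.named_steps["model"]
print(recorder.n_train_cols_, recorder.n_eval_cols_)  # transformed width vs. raw width

# Phase 2: no eval_set is needed, so preprocessor and model can be bundled and pickled.
X_all = pd.concat([X_train, X_valid], ignore_index=True)
y_all = pd.concat([y_train, y_valid], ignore_index=True)
final_model = Pipeline([("preprocessor", clone(preprocessor)), ("model", LinearRegression())])
final_model.fit(X_all, y_all)

blob = pickle.dumps(final_model)   # the template writes a .pkl file; bytes are used here
restored = pickle.loads(blob)
X_unseen = pd.DataFrame({"area": [100.0], "zone": ["B"]})
print(restored.predict(X_unseen))  # raw input is transformed inside the Pipeline
```

In the probe, the training matrix has four columns (one scaled, three one-hot) while the evaluation frame still has its two raw columns, which is why the tuning phase transforms both datasets outside the Pipeline.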

# Import Libraries, Modules and Objects

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# List all your imports
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# List all input data files contained within input directory

* This assumes the input directory is a sibling of the directory where your notebook(s) are stored.

In [2]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('../input/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

../input/house-prices-advanced-regression-techniques/sample_submission.csv
../input/house-prices-advanced-regression-techniques/data_description.txt
../input/house-prices-advanced-regression-techniques/train.csv
../input/house-prices-advanced-regression-techniques/test.csv


# Initial Variables

In [3]:
# INITIAL VARIABLES
INPUT_DIRECTORY_PATH = '../input/house-prices-advanced-regression-techniques/'
DATA_DESCRIPTION_FILE_PATH = INPUT_DIRECTORY_PATH + 'data_description.txt'
TRAINING_DATASET_FILEPATH = INPUT_DIRECTORY_PATH + 'train.csv'
TESTING_DATASET_FILEPATH = INPUT_DIRECTORY_PATH + 'test.csv'
TARGET_COLUMN_NAME = 'SalePrice'
MAXIMUM_ALLOWED_CARDINALITY = -1      # If set to -1 will allow its original maximum cardinality on all categorical columns
VALIDATION_SIZE = 0.2 # overridden below before the split, so it has no effect here; kept for later

# Differentiating Numerical and Categorical Features

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv(TRAINING_DATASET_FILEPATH, index_col='Id')
X_test_full = pd.read_csv(TESTING_DATASET_FILEPATH, index_col='Id')

# Outside of Kaggle, you usually won't have a testing dataset already split into a separate file from the very beginning;
# instead, everything will be consolidated in a single dataset that you then split on your own or use with CV.

# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=[TARGET_COLUMN_NAME], inplace=True)
y = X_full[TARGET_COLUMN_NAME]
X_full.drop([TARGET_COLUMN_NAME], axis=1, inplace=True)

# Identify numeric columns only
numeric_cols = [cname for cname in X_full.columns if X_full[cname].dtype in ['int64', 'float64', 'int32', 'float32']]

# Identify categorical columns only
categorical_cols = [cname for cname in X_full.columns if X_full[cname].dtype in ['object', 'category']]

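# Keep only categorical columns whose cardinality does not exceed MAXIMUM_ALLOWED_CARDINALITY (skipped when it is -1)
# Note: compare by value with `!=`; `is not` tests object identity and is not reliable for integers
if MAXIMUM_ALLOWED_CARDINALITY != -1: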
    categorical_cols = [cname for cname in categorical_cols if X_full[cname].nunique() <= MAXIMUM_ALLOWED_CARDINALITY]

# Identify all columns to be used
total_cols = numeric_cols + categorical_cols
X = X_full[total_cols].copy()
X_test = X_test_full[total_cols].copy()

# Print numeric and categorical columns separately
print(f"\nNUMERIC COLS = {len(numeric_cols)}")
print(f"\nCATEGORICAL COLS = {len(categorical_cols)}")

# Verify shape of X, y and X_test datasets
print(f"\nX.shape = {X.shape}")
print(f"\ny.shape = {y.shape}")
print(f"\nX_test.shape = {X_test.shape}")


NUMERIC COLS = 36

CATEGORICAL COLS = 43

X.shape = (1460, 79)

y.shape = (1460,)

X_test.shape = (1459, 79)
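As a possible alternative to comparing each column's dtype against a list of names, the same split can be sketched with `DataFrame.select_dtypes`. The helper name `split_columns` and the toy frame are hypothetical. The sketch also assumes that every non-numeric column should be treated as categorical, which is broader than the `object`/`category` check used in the cell above.

```python
import pandas as pd


def split_columns(df, max_cardinality=-1):
    """Return (numeric_cols, categorical_cols) for a DataFrame.

    Assumption: every non-numeric column is treated as categorical.
    A max_cardinality of -1 disables the cardinality filter.
    """
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
    if max_cardinality != -1:
        # Drop categorical columns with more distinct values than allowed.
        categorical_cols = [c for c in categorical_cols
                            if df[c].nunique() <= max_cardinality]
    return numeric_cols, categorical_cols


# Toy frame reusing a few column names from the dataset.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, 68.0],
    "Street": ["Pave", "Pave", "Grvl"],
    "Neighborhood": ["CollgCr", "Veenker", "Crawfor"],
})

print(split_columns(toy))
print(split_columns(toy, max_cardinality=2))  # Neighborhood (3 distinct values) is dropped
```

Filtering by cardinality before one-hot encoding keeps the encoded matrix narrow, since each nominal feature contributes one column per distinct value.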


# EXPLORATORY DATA ANALYSIS (EDA)

### (\*OPTIONAL) READING DATA DESCRIPTION OF EACH FEATURE FROM THE ORIGINAL DATASET, IF AVAILABLE

In [5]:
with open(DATA_DESCRIPTION_FILE_PATH) as f:
    contents = f.read()
    print(contents)

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM

### DESCRIPTIVE STATISTICS ON FEATURES AND TARGET:

In [6]:
# Descriptive Statistics on Features
print("\nDESCRIPTIVE STATISTICS ON FEATURES(X):")
display(X.describe().style)

# Descriptive Statistics on Target
print("\nDESCRIPTIVE STATISTICS ON TARGET(y):")
display(y.to_frame().describe())


DESCRIPTIVE STATISTICS ON FEATURES(X):


Statistic,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
count,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1379.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,46.549315,567.240411,1057.429452,1162.626712,346.992466,5.844521,1515.463699,0.425342,0.057534,1.565068,0.382877,2.866438,1.046575,6.517808,0.613014,1978.506164,1.767123,472.980137,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753
std,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,161.319273,441.866955,438.705324,386.587738,436.528436,48.623081,525.480383,0.518911,0.238753,0.550916,0.502885,0.815778,0.220338,1.625393,0.644666,24.689725,0.747315,213.804841,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095
min,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1900.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0
25%,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,0.0,223.0,795.75,882.0,0.0,0.0,1129.5,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1961.0,1.0,334.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0
50%,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,0.0,477.5,991.5,1087.0,0.0,0.0,1464.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,1980.0,2.0,480.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0
75%,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,0.0,808.0,1298.25,1391.25,728.0,0.0,1776.75,1.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,2002.0,2.0,576.0,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0
max,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,2336.0,6110.0,4692.0,2065.0,572.0,5642.0,3.0,2.0,3.0,2.0,8.0,3.0,14.0,3.0,2010.0,4.0,1418.0,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0



DESCRIPTIVE STATISTICS ON TARGET(y):


Statistic,SalePrice
count,1460.0
mean,180921.19589
std,79442.502883
min,34900.0
25%,129975.0
50%,163000.0
75%,214000.0
max,755000.0


### OBSERVE THE VALUES

In [7]:
# Take a look at the Features data of the first 5 examples
print("\nFEATURES(X):")
display(X.head().style)

# Take a look at the Target data of the first 5 examples
print("\nTARGET(y):")
display(y.to_frame().head().style)


FEATURES(X):


Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,MSZoning,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,Heating,HeatingQC,CentralAir,Electrical,KitchenQual,Functional,FireplaceQu,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,SaleType,SaleCondition
1,60,65.0,8450,7,5,2003,2003,196.0,706,0,150,856,856,854,0,1710,1,0,2,1,3,1,8,0,2003.0,2,548,0,61,0,0,0,0,0,2,2008,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,No,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,,Attchd,RFn,TA,TA,Y,,,,WD,Normal
2,20,80.0,9600,6,8,1976,1976,0.0,978,0,284,1262,1262,0,0,1262,0,1,2,0,3,1,6,1,1976.0,2,460,298,0,0,0,0,0,0,5,2007,RL,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,Gable,CompShg,MetalSd,MetalSd,,TA,TA,CBlock,Gd,TA,Gd,ALQ,Unf,GasA,Ex,Y,SBrkr,TA,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal
3,60,68.0,11250,7,5,2001,2002,162.0,486,0,434,920,920,866,0,1786,1,0,2,1,3,1,6,1,2001.0,2,608,0,42,0,0,0,0,0,9,2008,RL,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,Mn,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal
4,70,60.0,9550,7,5,1915,1970,0.0,216,0,540,756,961,756,0,1717,1,0,1,0,3,1,7,1,1998.0,3,642,0,35,272,0,0,0,0,2,2006,RL,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,Gable,CompShg,Wd Sdng,Wd Shng,,TA,TA,BrkTil,TA,Gd,No,ALQ,Unf,GasA,Gd,Y,SBrkr,Gd,Typ,Gd,Detchd,Unf,TA,TA,Y,,,,WD,Abnorml
5,60,84.0,14260,8,5,2000,2000,350.0,655,0,490,1145,1145,1053,0,2198,1,0,2,1,4,1,9,1,2000.0,3,836,192,84,0,0,0,0,0,12,2008,RL,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,Av,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal



TARGET(y):


Id,SalePrice
1,208500
2,181500
3,223500
4,140000
5,250000


### (\*OPTIONAL) RE-CLASSIFY NUMERICAL CODE FEATURES THAT ARE ACTUALLY CATEGORICAL AND REQUIRE TO BE TRANSFORMED FROM NUMBER DTYPE INTO CATEGORICAL DTYPE
These are features whose numeric values are arbitrary codes that imply no magnitude. The codes were assigned by the owner of the dataset, so their encoding was not treated as part of the pre-processing phase of the Machine Learning cycle.

In [8]:
# For example this feature is of dtype int64 when its code despite being a number is actually an
# arbitrary code which denotes no magnitude, therefore it must be transformed into type 'category'

number_categorical_features = ['MSSubClass']

for feature in number_categorical_features:
    original_dtype = X[feature].dtype
    # Change dtype for the feature
    X[feature] = X[feature].astype('category')
    print(f'{feature} : Original dtype: {original_dtype} --> Final dtype: {X[feature].dtype}\n')
    
    # Add feature to categorical_cols and remove from numeric_cols 
    categorical_cols.append(feature)
    numeric_cols.remove(feature)


MSSubClass : Original dtype: int64 --> Final dtype: category



In [9]:
# Observe the number of columns transferred from numeric_cols to categorical_cols
print(f'categorical_cols: {len(categorical_cols)}')
print(f'numeric_cols: {len(numeric_cols)}')

# Observe the order of columns after transferred from numeric_cols to categorical_cols
print('\n CATEGORICAL: ', categorical_cols)
print('\n NUMERICAL: ', numeric_cols)

# Re-order the columns in the same order as the lists for better visibility
total_cols = numeric_cols + categorical_cols
X = X[total_cols]
X_test = X_test[total_cols]

# Display X and observe the new order of columns
display(X.head().style)

# Show info about the dtypes of features
X.info()

categorical_cols: 44
numeric_cols: 35

 CATEGORICAL:  ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition', 'MSSubClass']

 NUMERICAL:  ['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBl

Id,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,MSZoning,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,Heating,HeatingQC,CentralAir,Electrical,KitchenQual,Functional,FireplaceQu,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,SaleType,SaleCondition,MSSubClass
1,65.0,8450,7,5,2003,2003,196.0,706,0,150,856,856,854,0,1710,1,0,2,1,3,1,8,0,2003.0,2,548,0,61,0,0,0,0,0,2,2008,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,No,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,,Attchd,RFn,TA,TA,Y,,,,WD,Normal,60
2,80.0,9600,6,8,1976,1976,0.0,978,0,284,1262,1262,0,0,1262,0,1,2,0,3,1,6,1,1976.0,2,460,298,0,0,0,0,0,0,5,2007,RL,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,Gable,CompShg,MetalSd,MetalSd,,TA,TA,CBlock,Gd,TA,Gd,ALQ,Unf,GasA,Ex,Y,SBrkr,TA,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal,20
3,68.0,11250,7,5,2001,2002,162.0,486,0,434,920,920,866,0,1786,1,0,2,1,3,1,6,1,2001.0,2,608,0,42,0,0,0,0,0,9,2008,RL,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,Mn,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal,60
4,60.0,9550,7,5,1915,1970,0.0,216,0,540,756,961,756,0,1717,1,0,1,0,3,1,7,1,1998.0,3,642,0,35,272,0,0,0,0,2,2006,RL,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,Gable,CompShg,Wd Sdng,Wd Shng,,TA,TA,BrkTil,TA,Gd,No,ALQ,Unf,GasA,Gd,Y,SBrkr,Gd,Typ,Gd,Detchd,Unf,TA,TA,Y,,,,WD,Abnorml,70
5,84.0,14260,8,5,2000,2000,350.0,655,0,490,1145,1145,1053,0,2198,1,0,2,1,4,1,9,1,2000.0,3,836,192,84,0,0,0,0,0,12,2008,RL,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,Av,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal,60


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 79 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   LotFrontage    1201 non-null   float64 
 1   LotArea        1460 non-null   int64   
 2   OverallQual    1460 non-null   int64   
 3   OverallCond    1460 non-null   int64   
 4   YearBuilt      1460 non-null   int64   
 5   YearRemodAdd   1460 non-null   int64   
 6   MasVnrArea     1452 non-null   float64 
 7   BsmtFinSF1     1460 non-null   int64   
 8   BsmtFinSF2     1460 non-null   int64   
 9   BsmtUnfSF      1460 non-null   int64   
 10  TotalBsmtSF    1460 non-null   int64   
 11  1stFlrSF       1460 non-null   int64   
 12  2ndFlrSF       1460 non-null   int64   
 13  LowQualFinSF   1460 non-null   int64   
 14  GrLivArea      1460 non-null   int64   
 15  BsmtFullBath   1460 non-null   int64   
 16  BsmtHalfBath   1460 non-null   int64   
 17  FullBath       1460 non-null   in

### CATEGORICAL FEATURES CARDINALITY

In [10]:
# Cardinality of categorical features (note that the numeric-coded categorical features are now included)
print("\nCATEGORICAL FEATURES CARDINALITY: (DESCENDING ORDER)")
X[categorical_cols].nunique().to_frame('CARDINALITY').sort_values('CARDINALITY', ascending=False)


CATEGORICAL FEATURES CARDINALITY: (DESCENDING ORDER)


Feature,CARDINALITY
Neighborhood,25
Exterior2nd,16
MSSubClass,15
Exterior1st,15
Condition1,9
SaleType,9
RoofMatl,8
HouseStyle,8
Condition2,8
Functional,7


# Differentiating Categorical Features into Ordinal or Nominal

### IDENTIFY EACH CATEGORICAL FEATURE AS EITHER: ORDINAL OR NOMINAL

In [11]:
# Take a look at the Features data of the first 5 examples
print("\nFEATURES(X):")
display(X.head().style)


FEATURES(X):


Id,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,MSZoning,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,Heating,HeatingQC,CentralAir,Electrical,KitchenQual,Functional,FireplaceQu,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,SaleType,SaleCondition,MSSubClass
1,65.0,8450,7,5,2003,2003,196.0,706,0,150,856,856,854,0,1710,1,0,2,1,3,1,8,0,2003.0,2,548,0,61,0,0,0,0,0,2,2008,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,No,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,,Attchd,RFn,TA,TA,Y,,,,WD,Normal,60
2,80.0,9600,6,8,1976,1976,0.0,978,0,284,1262,1262,0,0,1262,0,1,2,0,3,1,6,1,1976.0,2,460,298,0,0,0,0,0,0,5,2007,RL,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,Gable,CompShg,MetalSd,MetalSd,,TA,TA,CBlock,Gd,TA,Gd,ALQ,Unf,GasA,Ex,Y,SBrkr,TA,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal,20
3,68.0,11250,7,5,2001,2002,162.0,486,0,434,920,920,866,0,1786,1,0,2,1,3,1,6,1,2001.0,2,608,0,42,0,0,0,0,0,9,2008,RL,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,Mn,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal,60
4,60.0,9550,7,5,1915,1970,0.0,216,0,540,756,961,756,0,1717,1,0,1,0,3,1,7,1,1998.0,3,642,0,35,272,0,0,0,0,2,2006,RL,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,Gable,CompShg,Wd Sdng,Wd Shng,,TA,TA,BrkTil,TA,Gd,No,ALQ,Unf,GasA,Gd,Y,SBrkr,Gd,Typ,Gd,Detchd,Unf,TA,TA,Y,,,,WD,Abnorml,70
5,84.0,14260,8,5,2000,2000,350.0,655,0,490,1145,1145,1053,0,2198,1,0,2,1,4,1,9,1,2000.0,3,836,192,84,0,0,0,0,0,12,2008,RL,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,Av,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal,60


### EXPLICITLY INCLUDE ANY ORDINAL FEATURE YOU IDENTIFY IN THE 'explicit_ordinal' LIST VARIABLE
TO USE IT YOU MUST SET: CATEGORICAL_COLS_TYPE = 'explicit_ordinal'

In [12]:
# SETTING ORDINAL CATEGORICAL FEATURES EXPLICITLY (only works when CATEGORICAL_COLS_TYPE = 'explicit_ordinal')

explicit_ordinal = ['MSZoning', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'HouseStyle','ExterQual',
                            'ExterCond','BsmtQual','BsmtCond', 'BsmtExposure','BsmtFinType1','BsmtFinType2','HeatingQC','KitchenQual',
                            'Functional','FireplaceQu', 'GarageFinish', 'GarageQual','GarageCond', 'PavedDrive', 'PoolQC', 'Fence']

CATEGORICAL_COLS_TYPE = 'explicit_ordinal' # CAN BE: ['all_ordinal', 'all_nominal', 'explicit_ordinal']

In [13]:
if CATEGORICAL_COLS_TYPE == 'explicit_ordinal':
    # This sets ordinal features explicitly, and sets the remainder of categorical features as nominal features
    categorical_cols_ordinal = explicit_ordinal
elif CATEGORICAL_COLS_TYPE == 'all_ordinal':
    # This sets all categorical features as ordinal
    categorical_cols_ordinal = categorical_cols
elif CATEGORICAL_COLS_TYPE == 'all_nominal':
    # This sets all categorical features as nominal
    categorical_cols_ordinal = []
else:
    # Fail early on a misspelled option instead of hitting a NameError further down
    raise ValueError(f"Unknown CATEGORICAL_COLS_TYPE: {CATEGORICAL_COLS_TYPE!r}")

# NOMINAL CATEGORICAL FEATURES ARE DETERMINED IMPLICITLY (if any categorical feature is not within categorical_cols_ordinal , then it would be considered nominal)
categorical_cols_nominal = [cname for cname in categorical_cols if cname not in categorical_cols_ordinal]

print(f'Total Categorical features [{len(categorical_cols)}]: ', categorical_cols)
print(f'\nNominal features [{len(categorical_cols_nominal)}]: ', categorical_cols_nominal)
print(f'\nOrdinal features [{len(categorical_cols_ordinal)}]: ', categorical_cols_ordinal)

Total Categorical features [44]:  ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition', 'MSSubClass']

Nominal features [19]:  ['Street', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation', 'Heating', 'CentralAir', 'Electrical', 'GarageType', 'MiscFeature', 'SaleType', 'SaleCondition', 'MSSubClass']

Ordinal features [25]:  ['MSZoning', 'Alley', 'LotShape', 'LandContour', 'Utilities',
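One caveat when marking a feature as ordinal: unless it is given an explicit `categories` argument, scikit-learn's `OrdinalEncoder` numbers the categories in sorted order, which for string labels is alphabetical and may not match the intended ranking. The sketch below compares the two behaviours on a toy `ExterQual` column. The ranking `Po < Fa < TA < Gd < Ex` is an assumption taken from the quality scale in the dataset's description file, and `quality_order` is a hypothetical name.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed ranking, worst to best (from the dataset's data description).
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

sample = pd.DataFrame({"ExterQual": ["Gd", "TA", "Ex", "Fa"]})

# Default behaviour: categories are sorted, so codes follow alphabetical order.
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sample).ravel().astype(int).tolist()

# Explicit ranking: codes follow the position in quality_order.
ordered_encoder = OrdinalEncoder(
    categories=[quality_order],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
ordered_codes = ordered_encoder.fit_transform(sample).ravel().astype(int).tolist()

print(default_codes)  # [2, 3, 0, 1]  -> 'Ex' gets the lowest code
print(ordered_codes)  # [3, 2, 4, 1]  -> 'Ex' gets the highest code
```

Tree-based models such as XGBoost can still split on alphabetically assigned codes, so an explicit ranking is a refinement rather than a requirement. It needs one category list per ordinal feature, in the same order as the columns passed to the encoder.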

# Splitting Training & Validation Datasets

In [14]:
# Split validation set from training data
VALIDATION_SIZE = 0.2
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=VALIDATION_SIZE, random_state=0)

# Verify shape of X, y and X_test datasets
print(f"\nX.shape = {X.shape}   --> X_train.shape = {X_train.shape}   &   X_valid.shape = {X_valid.shape}")
print(f"\ny.shape = {y.shape}   --> y_train.shape = {y_train.shape}   &   y_valid.shape = {y_valid.shape}")
print(f"\nX_test.shape = {X_test.shape}")


X.shape = (1460, 79)   --> X_train.shape = (1168, 79)   &   X_valid.shape = (292, 79)

y.shape = (1460,)   --> y_train.shape = (1168,)   &   y_valid.shape = (292,)

X_test.shape = (1459, 79)


# PRE-PROCESSING

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer, RobustScaler, MaxAbsScaler, PolynomialFeatures # Not in use here, but always handy

numerical_transformer_standard = Pipeline(steps=[('impute', SimpleImputer(strategy='median')),
                                                 ('standard_scaling', StandardScaler())])

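# NOTE: `sparse=False` is the spelling for scikit-learn < 1.2; newer releases renamed this parameter to `sparse_output`
categorical_transformer_nominal = Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')),
                                                  ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))])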

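# NOTE: without an explicit `categories=` argument, OrdinalEncoder numbers each feature's categories in sorted order
categorical_transformer_ordinal = Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')),
                                                  ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))])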

preprocessor = ColumnTransformer(transformers=[('numerical_standard', numerical_transformer_standard, numeric_cols),
                                               ('categorical_nominal', categorical_transformer_nominal, categorical_cols_nominal),
                                               ('categorical_ordinal', categorical_transformer_ordinal, categorical_cols_ordinal)])


preprocessing_pipe = Pipeline(steps=[('preprocessor', preprocessor)])

# Data Transformation
X_train_transformed = preprocessing_pipe.fit_transform(X_train)
X_valid_transformed = preprocessing_pipe.transform(X_valid)
X_test_transformed = preprocessing_pipe.transform(X_test)

# For future use: once the final model is trained and saved, new input (X_unseen) has to go through the same fitted
# preprocessor before prediction. Think of X_unseen as the input a Flask ML app would receive from its user.
# X_unseen_transformed = preprocessing_pipe.transform(X_unseen)

# The big limitation of this setup is that the saved model does not transform data itself. If only the model is deployed
# as a .pkl file in an ML app, incoming input will not be transformed, because the transformation happens outside the model.
# I accepted this in order to combine hyperparameter tuning via GridSearchCV with early stopping to limit overfitting.
# Once the optimal parameters are known, the model should be restructured as a complete Pipeline (preprocessor + model)
# that is fed directly to GridSearchCV, without early stopping, so that all datasets are transformed inside the Pipeline
# rather than externally.
# This is not required for the notebook or for competitions, but whenever the model must be saved as a .pkl file and simply
# receive input and return a prediction, the complete Pipeline + GridSearchCV model is the way to go.

# FEATURE ENGINEERING

# MODELING

In [16]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import precision_score, recall_score, accuracy_score, make_scorer, mean_squared_error # Not in use here, but always handy
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

# Define the model

MODEL_NUM = 1

if MODEL_NUM == 1:
    model = XGBRegressor(random_state=0, n_estimators =1000, learning_rate=0.05)
elif MODEL_NUM == 2:
    model = RandomForestRegressor(random_state=0, n_estimators=300)

# Fit the model.
# It is better to first test a single model with early stopping, without GridSearchCV, changing one parameter at a time.
# This gives a feel for reasonable values and removes the most naive guesses. Only then run GridSearchCV with few
# hyperparameters and few values per hyperparameter.
# Early stopping needs validation data passed through eval_set. When the model sits inside a Pipeline, eval_set is handed
# to the model as-is and is not transformed by the Pipeline's preprocessing steps, so the validation data has to be
# transformed externally with the preprocessor fitted on the training data.
# For reference: a run of 256 hyperparameter combinations with XGBRegressor took about 2.5 hours in a Kaggle Kernel
# (no accelerator, roughly 2 GB of RAM). Too many parameters and too many naive values only inflate training time.

# pipe_model.fit(X_train, y_train, early_stopping_rounds=35, eval_set=[(X_valid, y_valid)], verbose=False)


# ORIGINAL ALTERNATIVE
# pipe_model = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

# Alternative 2
pipe_model = model

# Hyperparameter Tuning & Cross-Validation with GridSearchCV

In [17]:
# param_grid = {'model__n_estimators': [1000, 2000, 3000],
#               'model__max_depth': [3,4],
#               'model__learning_rate': [0.01, 0.04, 0.6],
#              }

param_grid_2 = {'n_estimators': [3000],
                'max_depth': [3],
                'learning_rate': [0.01, 0.05, 0.1],
             }

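# Note: passing early_stopping_rounds and eval_metric to fit() is deprecated since xgboost 1.6 and was removed in later
# releases, where both are set on the estimator (XGBRegressor(...)) instead; eval_set is still a fit() argument.
fit_params={'early_stopping_rounds':100,

            # If eval_set holds several tuples, the last one is used as the evaluation set for early stopping
            #'eval_set' : [(X_train_transformed, y_train), (X_valid_transformed, y_valid)],
            'eval_set' : [(X_valid_transformed, y_valid)],
            'eval_metric' : 'mae',
            'verbose': False}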

# Hyperparameter-Tuning Model
grid_model = GridSearchCV(pipe_model,
                          param_grid_2,
                          cv=5,
                          verbose=0,
                          n_jobs=-1,
                          scoring='neg_mean_absolute_error')

grid_model

GridSearchCV(cv=5,
             estimator=XGBRegressor(base_score=None, booster=None,
                                    callbacks=None, colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None,
                                    early_stopping_rounds=None,
                                    enable_categorical=False, eval_metric=None,
                                    gamma=None, gpu_id=None, grow_policy=None,
                                    importance_type=None,
                                    interaction_constraints=None,
                                    learning_rate=0.05, max_bin=None,
                                    max_cat...ta_step=None,
                                    max_depth=None, max_leaves=None,
                                    min_child_weight=None, missing=nan,
                                    monotone_constraints=None,
                                    n_e

# Model Training

In [18]:
%%time
import time # Just to compare fit times
start_time = time.time()

# grid_model.fit(X_train_transformed, y_train, **fit_params_2)
grid_model.fit(X_train_transformed, y_train, **fit_params)

end_time = time.time()
print("Tune Fit Time:", end_time - start_time)



Tune Fit Time: 102.61391043663025
CPU times: user 34.5 s, sys: 134 ms, total: 34.6 s
Wall time: 1min 42s


In [19]:
# evals_result() is a method of the fitted xgboost estimator, not of GridSearchCV, so it cannot be called on grid_model directly.
# Use this cell with a trained xgboost object (with GridSearchCV and refit=True, that is grid_model.best_estimator_).
# With a Pipeline, the method would have to be called on the model step, not on the Pipeline itself.
# The plot expects eval_set to contain two tuples (training, then validation), and the curves are keyed by the
# eval_metric in use ("rmse" below; "mae" for the fit_params above).

# from matplotlib import pyplot as plt

# results = model.evals_result()

# plt.figure(figsize=(10,7))
# plt.plot(results["validation_0"]["rmse"], label="Training loss")
# plt.plot(results["validation_1"]["rmse"], label="Validation loss")
# plt.axvline(model.best_ntree_limit, color="gray", label="Optimal tree number")
# plt.xlabel("Number of trees")
# plt.ylabel("Loss")
# plt.legend()

### Observe the best combination of parameters in your GridSearchCV model

In [20]:
# Return GridSearchCV results sorted by 'mean_test_score' in descending order
# Remember scoring='neg_mean_absolute_error': scores closer to zero are better, so results are sorted from best to worst
pd.DataFrame(dict(grid_model.cv_results_)).sort_values('mean_test_score', ascending=False).style

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_learning_rate | param_max_depth | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 17.405423 | 3.1768 | 0.010336 | 0.001065 | 0.05 | 3 | 3000 | {'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 3000} | -14297.698501 | -17731.855786 | -15828.880025 | -16118.027729 | -13010.239405 | -15397.340289 | 1616.356534 | 1 |
| 2 | 11.445381 | 2.268533 | 0.006964 | 0.001137 | 0.1 | 3 | 3000 | {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 3000} | -14561.782853 | -17732.872563 | -15999.920139 | -16205.310639 | -13186.386668 | -15537.254572 | 1546.558869 | 2 |
| 0 | 42.919341 | 15.090177 | 0.015835 | 0.003898 | 0.01 | 3 | 3000 | {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 3000} | -15427.561365 | -17793.874382 | -15637.144715 | -16117.178447 | -12910.156954 | -15577.183173 | 1571.262278 | 3 |
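
Because scikit-learn scorers follow a "higher is better" convention, `neg_mean_absolute_error` reports the MAE with its sign flipped. The short sketch below turns the scores above back into plain MAE values. The three rows are copied from the output above, and the column name `mean_mae` is introduced here only for illustration.

```python
import pandas as pd

# mean_test_score / std_test_score copied from the cv_results_ output above
cv_summary = pd.DataFrame({
    "learning_rate": [0.05, 0.1, 0.01],
    "mean_test_score": [-15397.340289, -15537.254572, -15577.183173],
    "std_test_score": [1616.356534, 1546.558869, 1571.262278],
})

# Flip the sign so the column reads as an ordinary MAE (lower is better)
cv_summary["mean_mae"] = -cv_summary["mean_test_score"]

# Sorting ascending on MAE gives the same ranking as sorting descending on the negated score
best_row = cv_summary.sort_values("mean_mae").iloc[0]
print(f"Best learning_rate={best_row['learning_rate']}: "
      f"MAE = {best_row['mean_mae']:.2f} +/- {best_row['std_test_score']:.2f}")
```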


# Observe the best set of parameters and its Mean CV Score as well as the Standard Deviation of its CV Score
(The scoring function used was negative Mean Absolute Error, so scores closer to zero are better)

In [21]:
mean_score = grid_model.cv_results_['mean_test_score'][grid_model.best_index_]
std_score = grid_model.cv_results_['std_test_score'][grid_model.best_index_]

# grid_model.best_params_, mean_score, std_score

print(f"Best parameters: {grid_model.best_params_}")
print(f"Mean CV score: {mean_score: .6f}")
print(f"Standard deviation of CV score: {std_score: .6f}")

Best parameters: {'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 3000}
Mean CV score: -15397.340289
Standard deviation of CV score:  1616.356534


In [22]:
"""PREVIOUS REFERENCE FROM USING THE FOLLOWING param_grid object:

param_grid = {'model__n_estimators': [100, 300, 500, 1000, 1500, 2000, 3000],
              'model__max_depth': [1,2,3,5,7,10],
              'model__learning_rate': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3],
             }

Best parameters: {'model__learning_rate': 0.01, 'model__max_depth': 3, 'model__n_estimators': 3000}
Mean CV score: -15267.545908
Standard deviation of CV score:  1608.780303

NOTE: No fit_params were used, meaning that no 'early_stopping_rounds' or 'eval_set' were used, which could have generated overfitting.
"""

In [23]:
preds_valid = grid_model.predict(X_valid_transformed)

# Calculate MAE
mae = mean_absolute_error(y_valid, preds_valid)

# Regression Metrics
print("Mean Absolute Error:" , mae)

Mean Absolute Error: 16166.37241812928


# RE-TRAIN MODEL W/ OPTIMAL PARAMETERS USING BOTH TRAINING AND VALIDATION DATASETS TOGETHER

* Now that you have tuned the hyperparameters, join the training and validation datasets and re-train the model with the best parameters.

* Here we can concatenate the raw datasets, without the external preprocessing transformation. (The external transformation was only needed to use XGBoost early stopping together with GridSearchCV; re-training no longer needs early stopping.)

* Then restructure the model as a Pipeline, without early stopping, and train it using GridSearchCV.

## Join the previous training sets with the validation sets, after defining optimal hyperparameters

In [24]:
# After determining the best set of parameters, we concatenate all the training and validation data as the new training data,
# therefore we will concatenate X_train with X_valid and y_train with y_valid:
# new_X_train = pd.concat([pd.DataFrame(X_train_transformed), pd.DataFrame(X_valid_transformed)], axis=0) # Disable this line
new_X_train = pd.concat([X_train, X_valid], axis=0) # Enable this line and restructure the model in the cells below
new_y_train = pd.concat([y_train, y_valid], axis=0)

## Use the model's attribute .best_params_ to collect the best parameters for your model

In [25]:
best_params = grid_model.best_params_
best_params

{'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 3000}

## Redefine your final model plugging in optimal hyperparameters

* This time include both the preprocessor and the model in the same Pipeline.
* Leave out fit_params: early stopping is no longer needed now that the optimal hyperparameters are known. Early stopping was what required an evaluation set preprocessed outside of the Pipeline, which was very inconvenient; without it, any new data can be fed directly through the Pipeline model.
* The Pipeline model is then fed to GridSearchCV. No further tuning takes place, since the grid holds a single combination, but GridSearchCV still cross-validates that combination and then performs the final training of the final model with the optimal parameters.
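
With a single parameter combination, `GridSearchCV` no longer searches anything: it computes a cross-validated score for that one configuration and then, with the default `refit=True`, refits the pipeline on all the data passed to `fit`. The sketch below checks this on synthetic data, using `Ridge` as a lightweight stand-in for the XGBoost model (an assumption made only to keep the example fast).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

pipe = Pipeline(steps=[("scale", StandardScaler()), ("model", Ridge(alpha=1.0))])

# One value per parameter: a "grid" with exactly one candidate
single_grid = {"model__alpha": [1.0]}
search = GridSearchCV(pipe, single_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X_demo, y_demo)

# Same folds, same scorer, no grid: plain cross-validation of the same pipeline
plain_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error")

print("GridSearchCV mean score:   ", search.best_score_)
print("cross_val_score mean score:", plain_scores.mean())
```

If only the fitted pipeline is needed afterwards, `search.best_estimator_` is that refitted pipeline.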

In [26]:
%%time

model_best = XGBRegressor(random_state=0,
                          n_estimators = best_params['n_estimators'],
                          learning_rate = best_params['learning_rate'],
                          max_depth = best_params['max_depth'])

# Fit the model
# model.fit(X_train, y_train, early_stopping_rounds=35, eval_set=[(X_valid, y_valid)], verbose=False)


# NOTE: re-enable this Pipeline model with preprocessor + model_best
pipe_model_best = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('model_best', model_best)])

# NOTE: disable this model without the preprocessor step, so that data does not have to be transformed outside the Pipeline.
# It is no longer needed, since we no longer need fit_params or early stopping.
# pipe_model_best = Pipeline(steps=[('model_best', model_best)])

param_grid_best = {'model_best__n_estimators': [best_params['n_estimators']],
                   'model_best__max_depth': [best_params['max_depth']],
                   'model_best__learning_rate': [best_params['learning_rate']]
                  }

# No fit_params are given here: we have no y values for the test data, the validation data is now part of the training data,
# and the hyperparameters are already determined, so early stopping is not needed.
# Open question to double check: should early stopping also be used when training the final model, after it was already
# used during hyperparameter tuning to avoid overfitting? With the validation dataset absorbed into the training data,
# does a second round of early stopping still help prevent overfitting, is it simply unnecessary, or could stopping
# early even hurt a model trained with the optimal parameters?

grid_model_best = GridSearchCV(pipe_model_best, param_grid_best, cv=10, n_jobs=-1, scoring='neg_mean_absolute_error')

CPU times: user 178 µs, sys: 6 µs, total: 184 µs
Wall time: 188 µs


In [27]:
grid_model_best.fit(new_X_train, new_y_train) # This data is not preprocessed outside of the Pipeline+GridSearchCV model; the Pipeline transforms it automatically

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('numerical_standard',
                                                                         Pipeline(steps=[('impute',
                                                                                          SimpleImputer(strategy='median')),
                                                                                         ('standard_scaling',
                                                                                          StandardScaler())]),
                                                                         ['LotFrontage',
                                                                          'LotArea',
                                                                          'OverallQual',
                                                                          'OverallCond',
                               

# See the model's parameters with .get_params()

In [28]:
grid_model_best.get_params()

{'cv': 10,
 'error_score': nan,
 'estimator__memory': None,
 'estimator__steps': [('preprocessor',
   ColumnTransformer(transformers=[('numerical_standard',
                                    Pipeline(steps=[('impute',
                                                     SimpleImputer(strategy='median')),
                                                    ('standard_scaling',
                                                     StandardScaler())]),
                                    ['LotFrontage', 'LotArea', 'OverallQual',
                                     'OverallCond', 'YearBuilt', 'YearRemodAdd',
                                     'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
                                     'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
                                     '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
                                     'B...
                                                     OrdinalEncoder(handle_unknown='use_encoded_value',
     

# Save model as .pkl

In [29]:
import joblib
import pickle

joblib.dump(grid_model_best, 'grid_model_best.pkl')

['grid_model_best.pkl']

# Load model

In [30]:
loaded_model = joblib.load('grid_model_best.pkl')
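
A quick way to confirm that a saved file really contains the preprocessing steps is to reload it and check that it reproduces the original predictions on raw, untransformed input. The sketch below does this with a tiny synthetic DataFrame and a `Ridge` stand-in for the model; the column names and the file name are made up for the example.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Raw data with a missing value and a categorical column (hypothetical columns)
raw = pd.DataFrame({"area": [50.0, 80.0, np.nan, 120.0, 65.0, 95.0],
                    "street": ["Pave", "Grvl", "Pave", "Pave", "Grvl", "Pave"]})
price = np.array([100.0, 150.0, 130.0, 220.0, 120.0, 180.0])

pre = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["area"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["street"]),
])
pipe = Pipeline(steps=[("preprocessor", pre), ("model", Ridge())]).fit(raw, price)

# Save and reload inside a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(pipe, path)
    reloaded = joblib.load(path)

# The reloaded object imputes and encodes by itself, so raw rows go straight in
same = np.allclose(pipe.predict(raw), reloaded.predict(raw))
print("Reloaded pipeline reproduces predictions:", same)
```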

# Make predictions with X_test

In [31]:
# preds_test = loaded_model.predict(X_test_transformed) # Not used: the data is not supposed to be preprocessed beforehand
preds_test = loaded_model.predict(X_test) # Raw data goes in; all preprocessing is handled inside the Pipeline+GridSearchCV model

# Make predictions with new, upcoming, unseen data

* Now that our final model is trained with optimal parameters, we can make predictions on new, unseen data, such as the input from an end user of a Flask ML app.

* The input only needs to be delivered as a pandas DataFrame with the same column names as the training data, specifically the columns used in the preprocessor steps, which may leave out many columns of the training dataset.

* The other option is to feed input values in the exact same order as ALL the columns of the training dataset. This is more error-prone, and it only gets worse as the number of columns grows, so the recommended way is a pandas DataFrame with named columns, which keeps the data clear and unambiguous.

In [32]:
# Make predictions from future input from unseen data such as input from a Flask ML app
# preds_unseen = grid_model_best.predict(X_unseen)

# Dump predictions to .CSV file for competition

In [33]:
# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)