# Housing Prices

This project was initially written as a submission for the "Housing Prices Competition for Kaggle Learn Users" (https://www.kaggle.com/competitions/home-data-for-ml-course/overview). The competition description is quoted below.

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

The data are provided as two CSV files: train.csv (training data, including the sale price of each house) and test.csv (test data, without the sale price). Only train.csv is used in the following analysis.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from pprint import pprint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

## 1. Importing data

In [2]:
data = pd.read_csv('./dataset/train.csv')
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


## 2. Data exploration

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [4]:
data.shape

(1460, 81)

The data describe 1460 houses in 81 columns (both numerical and object type): the 79 explanatory variables plus 'Id' and the prediction target, 'SalePrice'. Some values in this dataframe are missing. A detailed description of the individual columns can be found on the competition website, under 'data' (https://www.kaggle.com/competitions/home-data-for-ml-course/data).

A large proportion of the values in columns 'Alley', 'FireplaceQu', 'PoolQC', 'Fence' and 'MiscFeature' is missing, so these columns will not be taken into account in further analysis.
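The share of missing values per column can be computed directly before deciding what to drop. The snippet below is a minimal sketch on a made-up toy frame; the 0.5 cut-off is an assumption chosen for illustration, not a threshold used in this analysis.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550],
    'Alley': [np.nan, np.nan, 'Grvl', np.nan],
    'LotFrontage': [65.0, np.nan, 68.0, 60.0],
})

# Fraction of missing values in each column
missing_share = toy.isnull().mean()

# Hypothetical cut-off: flag columns where at least half of the values are missing
threshold = 0.5
mostly_missing = missing_share[missing_share >= threshold].index.tolist()

print(missing_share)
print('Columns above the cut-off:', mostly_missing)
```

Here only 'Alley' (3 of 4 values missing) is flagged, while 'LotFrontage' (1 of 4 missing) is kept.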

In [5]:
# Drop columns with significant proportion of missing data
data.drop(['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], axis = 1, inplace = True)

# Print remaining columns with missing data and number of missing data
data.isnull().sum()[data.isnull().sum()>0]

LotFrontage     259
MasVnrType        8
MasVnrArea        8
BsmtQual         37
BsmtCond         37
BsmtExposure     38
BsmtFinType1     37
BsmtFinType2     38
Electrical        1
GarageType       81
GarageYrBlt      81
GarageFinish     81
GarageQual       81
GarageCond       81
dtype: int64

In the remaining columns only a minority of the values is missing, so these columns can be used in further analysis.

## 3. Data manipulation

In [6]:
# Select prediction target (y) and features (X)
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1)

# Divide into training and validation dataset
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 0)

In further analysis two different approaches will be applied. In the first approach, columns with missing data will be dropped; in the second approach, missing values will be imputed with various strategies. Modelling will be performed with RandomForestRegressor. The score_df function defined below will be used to assess the effectiveness of each data manipulation.

In [7]:
def score_df(X_train, X_val, y_train, y_val):
    '''
    Fit a RandomForestRegressor on the training data and report the
    R2 value and mean absolute error (MAE) on the validation data.
    Returns [mae, r2].
    '''
    model = RandomForestRegressor(random_state=0)
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    r2 = r2_score(y_val, pred)
    mae = mean_absolute_error(y_val, pred)    
    
    print('R2 value: ', r2)
    print('Mean absolute error: ', mae, ' dollars')
    return [mae, r2]
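
The two metrics reported by score_df can be checked by hand. The sketch below uses three made-up prices and predictions (not taken from the dataset) and compares a manual calculation with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up true prices and predictions, for illustration only
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# MAE: average absolute difference between prediction and truth
mae_manual = np.mean(np.abs(y_true - y_pred))

# R2: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print('MAE:', mae_manual, 'sklearn:', mean_absolute_error(y_true, y_pred))
print('R2: ', r2_manual, 'sklearn:', r2_score(y_true, y_pred))
```

MAE is expressed in the units of the target (dollars here), while R2 is unitless, which is why both are reported.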

The cardinality of columns with categorical data will be investigated.

In [8]:
# Categorical columns in the training dataset
obj_cols = [col for col in X_train.columns if X_train[col].dtype == 'object']

# Get number of unique entries in each column with categorical data
obj_nunique = list(map(lambda col: X_train[col].nunique(), obj_cols))
obj_dict = dict(zip(obj_cols, obj_nunique))

# Print number of unique entries by column, in ascending order
sorted(obj_dict.items(), key=lambda x: x[1])

[('Street', 2),
 ('Utilities', 2),
 ('CentralAir', 2),
 ('LandSlope', 3),
 ('GarageFinish', 3),
 ('PavedDrive', 3),
 ('LotShape', 4),
 ('LandContour', 4),
 ('MasVnrType', 4),
 ('ExterQual', 4),
 ('BsmtQual', 4),
 ('BsmtCond', 4),
 ('BsmtExposure', 4),
 ('KitchenQual', 4),
 ('MSZoning', 5),
 ('LotConfig', 5),
 ('BldgType', 5),
 ('ExterCond', 5),
 ('HeatingQC', 5),
 ('Electrical', 5),
 ('GarageQual', 5),
 ('GarageCond', 5),
 ('Condition2', 6),
 ('RoofStyle', 6),
 ('Foundation', 6),
 ('BsmtFinType1', 6),
 ('BsmtFinType2', 6),
 ('Heating', 6),
 ('Functional', 6),
 ('GarageType', 6),
 ('SaleCondition', 6),
 ('RoofMatl', 7),
 ('HouseStyle', 8),
 ('Condition1', 9),
 ('SaleType', 9),
 ('Exterior1st', 15),
 ('Exterior2nd', 16),
 ('Neighborhood', 25)]

To keep the dataset at a reasonable size, only categorical columns with cardinality less than 10 will be kept and encoded.
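The reason cardinality matters is that one-hot encoding creates one new column per unique category. The sketch below illustrates this on a made-up toy frame; it uses pd.get_dummies for brevity instead of the OneHotEncoder used in this notebook.

```python
import pandas as pd

# Toy frame with one low-cardinality and one higher-cardinality column (made-up rows)
toy = pd.DataFrame({
    'Street': ['Pave', 'Grvl', 'Pave', 'Pave'],
    'Neighborhood': ['CollgCr', 'Veenker', 'Crawfor', 'NoRidge'],
})

# Number of unique categories per column
cardinality = toy.nunique()

# One-hot encoding produces one column per unique category
encoded = pd.get_dummies(toy)

print(cardinality.to_dict())
print('Encoded shape:', encoded.shape)
```

The two original columns expand to 2 + 4 = 6 encoded columns, so a column such as 'Neighborhood' with 25 categories would add 25 columns on its own.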

In [9]:
# Select categorical columns with cardinality < 10 
object_cols = [col for col in obj_cols if X_train[col].nunique() < 10]

# Select numerical columns
numerical_cols = [col for col in X_train.columns if X_train[col].dtype in ['int64', 'float64']]

selected_cols = object_cols + numerical_cols

# Keep only selected columns
X_train = X_train[selected_cols]
X_val = X_val[selected_cols]


## 4. $1^{st}$ approach

<b> Dropping columns with missing values </b>

In [10]:
# Names of columns with missing values
col_miss = [col for col in X_train.columns if X_train[col].isnull().any()]

col_miss

['MasVnrType',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Electrical',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'LotFrontage',
 'MasVnrArea',
 'GarageYrBlt']

In [11]:
# Drop columns with missing values in training and validation datasets
X_train_red = X_train.drop(col_miss, axis = 1)
X_val_red = X_val.drop(col_miss, axis = 1)

print('Shape of training dataset before:', X_train.shape,
      ' and after the removal of columns with missing data', X_train_red.shape)
print('Shape of validation dataset before:', X_val.shape,
      ' and after the removal of columns with missing data', X_val_red.shape)

Shape of training dataset before: (1168, 72)  and after the removal of columns with missing data (1168, 58)
Shape of validation dataset before: (292, 72)  and after the removal of columns with missing data (292, 58)


In [12]:
# Select columns with categorical data after reduction
obj_red = [col for col in X_train_red.columns if X_train_red[col].dtype == 'object']
print('Columns with categorical data: ',obj_red)

# Select columns with numerical data after reduction
num_red = [col for col in X_train_red.columns if X_train_red[col].dtype in ['int64', 'float64']]
print('\nColumns with numerical data: ', num_red)

Columns with categorical data:  ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition']

Columns with numerical data:  ['Id', 'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']


Because they contain missing values, 14 columns were removed from the analyzed set. Next, columns with categorical values will be encoded using OneHotEncoder.
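The effect of handle_unknown = 'ignore' can be seen on a small example. This is a minimal sketch with made-up data: the category 'Dirt' is hypothetical and appears only in the validation frame. The encoder's default sparse output is converted with .toarray() so that the snippet does not depend on how the dense-output parameter is named in a given scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up training and validation columns
train = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
val = pd.DataFrame({'Street': ['Pave', 'Dirt']})  # 'Dirt' was never seen during fit

# Categories are learned from the training data only
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# Unseen categories are encoded as a row of zeros instead of raising an error
encoded_val = encoder.transform(val).toarray()

print(encoder.categories_)
print(encoded_val)
```

The 'Pave' row is encoded as [0, 1] and the unseen 'Dirt' row as [0, 0].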

In [13]:
# Note: in scikit-learn >= 1.2 the `sparse` parameter is named `sparse_output`
OH_encoder = OneHotEncoder(sparse = False, handle_unknown = 'ignore')

# Encode categorical data
OH_train_red = pd.DataFrame(OH_encoder.fit_transform(X_train_red[obj_red]))
OH_val_red = pd.DataFrame(OH_encoder.transform(X_val_red[obj_red]))

# Restore the original row indexes (lost during encoding)
OH_train_red.index = X_train_red.index
OH_val_red.index = X_val_red.index

# Select numerical data
num_train_red = X_train_red[num_red]
num_val_red = X_val_red[num_red]

# Merge columns with numerical and categorical data
OH_X_train_red = pd.concat([num_train_red, OH_train_red], axis = 1)
OH_X_val_red = pd.concat([num_val_red, OH_val_red], axis = 1)

In [14]:
# Test first approach
score_1 = score_df(OH_X_train_red, OH_X_val_red, y_train, y_val)
mae_1 = score_1[0]
r2_1 = score_1[1]

R2 value:  0.8482606090509224
Mean absolute error:  17514.224246575344  dollars


In [15]:
print('Mean absolute error (MAE) for first approach (dropping columns with missing values) was ', mae_1, ' dollars.')
print('R2 value for first approach (dropping columns with missing values) was ', r2_1)

Mean absolute error (MAE) for first approach (dropping columns with missing values) was  17514.224246575344  dollars.
R2 value for first approach (dropping columns with missing values) was  0.8482606090509224


In [16]:
# Create dictionary with MAE values for different approaches
mae_dict = {'Dropping columns with missing values' : mae_1}

# Create dictionary with R2 values for different approaches
r2_dict = {'Dropping columns with missing values' : r2_1}

## 5. $2^{nd}$ approach

<b> Imputation of missing values</b>

Missing numerical values will be imputed using SimpleImputer with three different strategies: mean, most_frequent and constant. Each of them seems reasonable for the data that are missing.
For example, missing values in column 'LotFrontage' (linear feet of street connected to the property) might be caused by:
- the information not being recorded (here we should probably replace the gap with the mean of the other values in this column or, assuming there is a typical street frontage in this area, with the most common value),
- the property having no street frontage at all, in which case we should probably replace these values with 0 (constant).

All missing categorical data will be replaced with the most frequent value in each column.
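The difference between the three numerical strategies can be shown on a single made-up column. In this minimal sketch the values are invented for illustration; with strategy = 'constant' and no fill_value given, SimpleImputer fills numerical data with 0.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up 'LotFrontage'-like column with one missing value (third row)
lot_frontage = np.array([[60.0], [80.0], [np.nan], [80.0]])

filled = {}
for strategy in ['mean', 'most_frequent', 'constant']:
    imputer = SimpleImputer(strategy=strategy)
    # Keep only the value that replaced the missing entry
    filled[strategy] = imputer.fit_transform(lot_frontage)[2, 0]

print(filled)
```

The gap is filled with about 73.33 (mean), 80.0 (most frequent) and 0.0 (constant), respectively.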

In [17]:
cat_imputer = SimpleImputer(strategy = 'most_frequent')

# Impute categorical missing data
X_train_cat_imp = pd.DataFrame(cat_imputer.fit_transform(X_train[object_cols]))
X_val_cat_imp = pd.DataFrame(cat_imputer.transform(X_val[object_cols]))

# Put back removed column names
X_train_cat_imp.columns = X_train[object_cols].columns
X_val_cat_imp.columns = X_val[object_cols].columns

# Encode categorical data
OH_X_train = pd.DataFrame(OH_encoder.fit_transform(X_train_cat_imp[object_cols]))
OH_X_val = pd.DataFrame(OH_encoder.transform(X_val_cat_imp[object_cols]))

<b>Numerical missing data imputation</b>

In [18]:
# Test different strategies for SimpleImputer
for strategy in ['mean', 'most_frequent', 'constant']:
    num_imputer = SimpleImputer(strategy = strategy)
    
    print(strategy)
    
    # Impute numerical missing data
    X_train_num = pd.DataFrame(num_imputer.fit_transform(X_train[numerical_cols]))
    X_val_num = pd.DataFrame(num_imputer.transform(X_val[numerical_cols]))
    
    # Put back removed column names
    X_train_num.columns = X_train[numerical_cols].columns
    X_val_num.columns = X_val[numerical_cols].columns
    
    # Merge columns with numerical and categorical data
    X_train_imp = pd.concat([X_train_num, OH_X_train], axis = 1)
    X_val_imp = pd.concat([X_val_num, OH_X_val], axis = 1)
    
    # Score the second approach with the current imputation strategy
    score_strategy = score_df(X_train_imp, X_val_imp, y_train, y_val)
    print('\n')
    
    mae_name = 'Imputation - most_frequent (cat) and '+str(strategy)+' (num)'
    mae = score_strategy[0]
    
    r2_name = 'Imputation - most_frequent (cat) and '+str(strategy)+' (num)'
    r2 = score_strategy[1]
    
    # Add MAE result to the dictionary
    mae_dict[mae_name] = mae
    
    # Add R2 result to the dictionary
    r2_dict[r2_name] = r2

mean
R2 value:  0.8306449871710455
Mean absolute error:  17779.75917808219  dollars


most_frequent
R2 value:  0.8245125041079162
Mean absolute error:  17970.68174657534  dollars


constant
R2 value:  0.8261279676568541
Mean absolute error:  17745.829520547948  dollars




## 6. Conclusion - data manipulation

In [19]:
print('Mean Absolute Error (MAE) values for tested approaches:\n')
pprint(mae_dict)
print('\nR2 values for tested approaches:\n')
pprint(r2_dict)

Mean Absolute Error (MAE) values for tested approaches:

{'Dropping columns with missing values': 17514.224246575344,
 'Imputation - most_frequent (cat) and constant (num)': 17745.829520547948,
 'Imputation - most_frequent (cat) and mean (num)': 17779.75917808219,
 'Imputation - most_frequent (cat) and most_frequent (num)': 17970.68174657534}

R2 values for tested approaches:

{'Dropping columns with missing values': 0.8482606090509224,
 'Imputation - most_frequent (cat) and constant (num)': 0.8261279676568541,
 'Imputation - most_frequent (cat) and mean (num)': 0.8306449871710455,
 'Imputation - most_frequent (cat) and most_frequent (num)': 0.8245125041079162}


The lowest mean absolute error was obtained by dropping columns with missing values. However, this approach may discard information that is important for the model, and the MAE difference between dropping columns and replacing missing values with a constant (0) is small (about 232 dollars).

Within the second approach, the results for strategy = 'mean' and 'constant' are similar: 'constant' gives a slightly lower MAE, while 'mean' gives a slightly higher R2. Both differences are small.

Considering this, the second approach with strategy = 'constant' will be used in further analysis.


In [20]:
# Prepare the data for further testing
num_imputer = SimpleImputer(strategy = 'constant')
   
# Impute numerical missing data
X_train_num = pd.DataFrame(num_imputer.fit_transform(X_train[numerical_cols]))
X_val_num = pd.DataFrame(num_imputer.transform(X_val[numerical_cols]))
    
# Put back removed column names
X_train_num.columns = X_train[numerical_cols].columns
X_val_num.columns = X_val[numerical_cols].columns
    
# Merge columns with numerical and categorical data
X_train = pd.concat([X_train_num, OH_X_train], axis = 1)
X_val = pd.concat([X_val_num, OH_X_val], axis = 1)

## 7. Hyperparameter tuning

<b> RandomizedSearchCV </b>

In [21]:
rf = RandomForestRegressor()
print('Parameters currently in use: ')
pprint(rf.get_params())

Parameters currently in use: 
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}


Generally, the most important parameters are:
- n_estimators - number of trees in the forest
- max_depth - max number of levels in each decision tree
- min_samples_split - min number of data points placed in a node before the node is split
- min_samples_leaf - min number of data points allowed in a leaf node
- max_features - max number of features considered for splitting a node
- bootstrap - method for sampling data points (with or without replacement)

and they will be tuned.

Hyperparameter tuning will be performed with RandomizedSearchCV, which first requires a parameter grid to sample from.
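The mechanics of RandomizedSearchCV can be tried quickly on synthetic data before running the full search. The snippet below is a minimal sketch: the data come from make_regression, and the small grid (toy_grid) and n_iter = 4 are assumptions chosen only to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data, unrelated to the housing dataset
X_toy, y_toy = make_regression(n_samples=60, n_features=5, noise=0.1, random_state=0)

# Deliberately tiny grid: 3 x 3 = 9 possible combinations
toy_grid = {'n_estimators': [5, 10, 20],
            'max_depth': [2, 4, None]}

# Sample 4 of the 9 combinations and score each with 3-fold cross-validation
search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
                            param_distributions=toy_grid,
                            n_iter=4, cv=3, random_state=0)
search.fit(X_toy, y_toy)

print(search.best_params_)
```

Only the sampled combinations are evaluated, so the best parameters reported are the best among those tried, not necessarily the best in the whole grid.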

In [22]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 2000, num = 20)]

# Maximum number of levels in each decision tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Number of features to consider at every split
max_features = ['auto', 'sqrt', 'log2']

# Method of selecting samples for training each tree
bootstrap = [True, False]


# Create the parameter grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(random_grid)

{'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'max_features': ['auto', 'sqrt', 'log2'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [100,
                  200,
                  300,
                  400,
                  500,
                  600,
                  700,
                  800,
                  900,
                  1000,
                  1100,
                  1200,
                  1300,
                  1400,
                  1500,
                  1600,
                  1700,
                  1800,
                  1900,
                  2000]}
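
For scale, the number of combinations in the grid printed above can be counted by multiplying the number of values per parameter. The sketch below hard-codes those counts from the printed grid and compares the total with a budget of 100 sampled combinations.

```python
from math import prod

# Number of candidate values per parameter, read off the grid above
grid_sizes = {'n_estimators': 20,
              'max_depth': 12,        # 11 numeric depths plus None
              'min_samples_split': 3,
              'min_samples_leaf': 3,
              'max_features': 3,
              'bootstrap': 2}

total_combinations = prod(grid_sizes.values())
sampled = 100
share = sampled / total_combinations

print('Total combinations:', total_combinations)
print('Share explored by 100 samples: {:.2%}'.format(share))
```

With 12960 possible combinations, 100 samples cover less than 1% of the grid, which is why a randomized search is used here before any exhaustive search.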


In [23]:
# Search for best hyperparameters using RandomizedSearchCV (100 combinations, 3 fold cross validation)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, 
                               n_iter = 100, cv = 3, verbose=2, random_state=0, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 12.8min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 24.6min finished


RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestRegressor(bootstrap=True,
                                                   ccp_alpha=0.0,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   max_samples=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators=100,
                              

In [24]:
# Best hyperparameters found by RandomizedSearchCV
rf_random.best_params_

{'n_estimators': 1200,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 40,
 'bootstrap': False}

In [25]:
# Assign the best estimator found by RandomizedSearchCV to rf_random_best
rf_random_best = rf_random.best_estimator_

In [26]:
def evaluate(model, X_train, y_train, X_val, y_val):
    '''
    Function for evaluation of model performance
    '''
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    r2 = r2_score(y_val, pred)
    mae = mean_absolute_error(y_val, pred)
    
    print('Model Performance:')
    print('R2 value: ', r2)
    print('Mean absolute error: ', mae, ' dollars')
    
    return [r2, mae]

In [27]:
# Basic model evaluation
print('Basic model:')
evaluate(rf, X_train, y_train, X_val, y_val)

# Evaluation of model after RandomizedSearchCV
print('\nModel after RandomizedSearchCV:')
evaluate(rf_random_best, X_train, y_train, X_val, y_val)

Basic model:
Model Performance:
R2 value:  0.833543543019414
Mean absolute error:  17559.388767123288  dollars

Model after RandomizedSearchCV:
Model Performance:
R2 value:  0.8428845375644615
Mean absolute error:  17199.963907914764  dollars


[0.8428845375644615, 17199.963907914764]

After RandomizedSearchCV the performance of the model improved: the mean absolute error is lower by about 359 dollars and the R2 value is higher by about 0.0093.

Further improvement of the Random Forest model will be attempted using GridSearchCV. The grid of parameters to analyze will be determined based on the results from RandomizedSearchCV.

<b>GridSearchCV</b>

In [28]:
# Create the parameter grid based on the results of RandomizedSearchCV
param_grid = {'n_estimators': [1150, 1200, 1250, 1300, 1350, 1400],
               'max_depth': [35, 40, 45, 50],
               'min_samples_split': [2, 3, 4],
               'min_samples_leaf': [1, 2, 3, 4],
               'bootstrap': [False]}
pprint(param_grid)

{'bootstrap': [False],
 'max_depth': [35, 40, 45, 50],
 'min_samples_leaf': [1, 2, 3, 4],
 'min_samples_split': [2, 3, 4],
 'n_estimators': [1150, 1200, 1250, 1300, 1350, 1400]}
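
Unlike the randomized search, GridSearchCV evaluates every combination in the grid. The number of candidates and fits can be checked in advance with scikit-learn's ParameterGrid; the sketch below repeats the grid defined above and assumes 3-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined above
param_grid = {'n_estimators': [1150, 1200, 1250, 1300, 1350, 1400],
              'max_depth': [35, 40, 45, 50],
              'min_samples_split': [2, 3, 4],
              'min_samples_leaf': [1, 2, 3, 4],
              'bootstrap': [False]}

# Every combination is one candidate; each candidate is fitted once per fold
n_candidates = len(ParameterGrid(param_grid))
n_folds = 3
n_fits = n_candidates * n_folds

print('Candidates:', n_candidates)
print('Total fits:', n_fits)
```

The grid contains 6 x 4 x 3 x 4 x 1 = 288 candidates, which gives 864 fits with 3 folds.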


In [29]:
# Create a model, parameter 'max_features' will be set as 'sqrt'
rf_1 = RandomForestRegressor(max_features = 'sqrt')

# Search for best hyperparameters using GridSearchCV
grid_search = GridSearchCV(estimator = rf_1, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 288 candidates, totalling 864 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  5.9min
[Parallel(n_jobs=-1)]: Done 357 tasks      | elapsed: 13.1min
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed: 22.9min
[Parallel(n_jobs=-1)]: Done 864 out of 864 | elapsed: 30.8min finished


GridSearchCV(cv=3, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='sqrt',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jo

In [30]:
grid_search.best_params_

{'bootstrap': False,
 'max_depth': 40,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 1150}

In [31]:
# Assign the best estimator found by GridSearchCV to rf_grid_best
rf_grid_best = grid_search.best_estimator_

In [32]:
# Evaluation of model after RandomizedSearchCV
print('Model after RandomizedSearchCV:')
evaluate(rf_random_best, X_train, y_train, X_val, y_val)

# Basic model (for GridSearchCV) evaluation
print('\nBasic model (for GridSearchCV):')
evaluate(rf_1, X_train, y_train, X_val, y_val)

# Evaluation of model after GridSearchCV
print('\nModel after GridSearchCV:')
evaluate(rf_grid_best, X_train, y_train, X_val, y_val)

Model after RandomizedSearchCV:
Model Performance:
R2 value:  0.8451789333610115
Mean absolute error:  17163.150550799084  dollars

Basic model (for GridSearchCV):
Model Performance:
R2 value:  0.8276722128541301
Mean absolute error:  18247.79869863014  dollars

Model after GridSearchCV:
Model Performance:
R2 value:  0.8434768872894878
Mean absolute error:  17251.2957176891  dollars


[0.8434768872894878, 17251.2957176891]

The model tuned with GridSearchCV performed slightly worse on the validation set than the model obtained with RandomizedSearchCV. Considering this, the parameters determined by RandomizedSearchCV will be used in the final model.

In [33]:
final_model = RandomForestRegressor(n_estimators = 1200,
                                    min_samples_split = 2,
                                    min_samples_leaf = 1,
                                    max_features = 'sqrt',
                                    max_depth = 40,
                                    bootstrap = False, 
                                    random_state = 0
                                   )

In [34]:
# Evaluation of final model
print('Final model:')
evaluate(final_model, X_train, y_train, X_val, y_val)

Final model:
Model Performance:
R2 value:  0.8465803190620228
Mean absolute error:  17181.76764649924  dollars


[0.8465803190620228, 17181.76764649924]

## 8. Wrap it up! (into a pipeline)


In [35]:
# Read the data
data = pd.read_csv('./dataset/train.csv')

# Drop columns with significant proportion of missing data
data.drop(['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], axis = 1, inplace = True)

# Select prediction target (y) and features (X)
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1)

# Divide into training and validation dataset
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 0)

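# Select categorical columns with cardinality < 10
# (computed here directly so that this cell does not depend on obj_cols from an earlier cell)
object_cols = [col for col in X_train.columns
               if X_train[col].dtype == 'object' and X_train[col].nunique() < 10]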

# Select numerical columns
numerical_cols = [col for col in X_train.columns if X_train[col].dtype in ['int64', 'float64']]

selected_cols = object_cols + numerical_cols

# Keep only selected columns
X_train = X_train[selected_cols]
X_val = X_val[selected_cols]

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy = 'constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps = [
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))
]) 

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, object_cols)
    ])

# Define model
final_model = RandomForestRegressor(n_estimators = 1200,
                                    min_samples_split = 2,
                                    min_samples_leaf = 1,
                                    max_features = 'sqrt',
                                    max_depth = 40,
                                    bootstrap = False, 
                                    random_state = 0
                                   )

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', final_model)
                             ])

# Preprocessing of training dataset, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation dataset, get predictions
pred = my_pipeline.predict(X_val)

# Evaluate the model
print('Mean Absolute Error (MAE) for final model is:', mean_absolute_error(y_val, pred),' dollars')
print('R2 value for final model is:', r2_score(y_val, pred))


Mean Absolute Error (MAE) for final model is: 17181.76764649924  dollars
R2 value for final model is: 0.8465803190620228
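
One practical benefit of the pipeline is that raw data frames, with missing values and text columns, can be passed directly to fit and predict. The snippet below is a minimal sketch of the same structure on a made-up toy frame; the data, the category 'Dirt' and the small forest (n_estimators = 10) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data with missing values in both column types
train = pd.DataFrame({
    'LotArea': [8450.0, 9600.0, np.nan, 9550.0],
    'Street': ['Pave', 'Grvl', np.nan, 'Pave'],
})
prices = [208500, 181500, 223500, 140000]

# Same preprocessing layout as above: constant fill for numbers,
# most frequent value plus one-hot encoding for categories
preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='constant'), ['LotArea']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['Street']),
])

toy_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=10, random_state=0)),
])
toy_pipeline.fit(train, prices)

# New rows with a missing number, a missing category and an unseen category
new_rows = pd.DataFrame({'LotArea': [np.nan, 10000.0],
                         'Street': ['Dirt', np.nan]})
toy_pred = toy_pipeline.predict(new_rows)

print(toy_pred.shape)
```

Because the imputers and the encoder are fitted inside the pipeline, the same preprocessing learned from the training data is applied to the new rows automatically.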
