# Modelling - House Prices - Random Forest Approach

![](https://raw.githubusercontent.com/UlrikThygePedersen/Kaggle/main/Code/House%20Prices%20-%20Advanced%20Regression%20Techniques/random_forest_logo.png)

# Table of Contents

* [Introduction](#introduction)
* [House Keeping](#house)
* [Data Cleaning](#clean)
* [Feature Engineering](#feature)
* [Train-Test Split](#split)
* [Random Forest Regressor Model](#forest)
* [Hyperparameter Tuning](#hyper)
* [Conclusion](#conc)

# Introduction <a id="introduction"></a>

*Machine learning modeling refers to the process of building and training a machine learning model to make predictions or decisions based on input data. The process typically involves the following steps:*

* *Data preparation: Collecting and cleaning the data that will be used to train the model.*
* *Feature engineering: Selecting and transforming the features that will be used by the model.*
* *Model selection: Choosing an appropriate machine learning algorithm or model architecture that is suited to the problem at hand.*
* *Model training: Using the prepared data to train the model.*
* *Model evaluation: Evaluating the performance of the model using metrics suited to the task, such as accuracy, precision, recall, or F1-score for classification, or RMSE and R² for regression.*
* *Model tuning or optimization: Adjusting the hyperparameters of the model to improve its performance.*
* *Model deployment: Putting the trained model into production, where it can be used to make predictions on new data.*

*The goal of machine learning modeling is to build a model that can generalize well to new data and make accurate predictions or decisions. This process is iterative and may involve several rounds of model selection, training, and evaluation to find the best performing model.*

In this notebook I will dive into the [House Prices - Advanced Regression Techniques](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques) dataset to explore and learn along the way.

**Hope you enjoy, let me know how I can improve, and if you liked it, an upvote would help me out a lot!**

**Looking for Exploratory Data Analysis on this dataset? Check out my notebook [Exploratory Data Analysis](https://www.kaggle.com/code/ulrikthygepedersen/exploratory-data-analysis-house-prices/notebook)**

**Want to learn more about making your data ready for modelling? Check out my notebook on [Feature Engineering](https://www.kaggle.com/code/ulrikthygepedersen/feature-engineering-house-prices/notebook)**

**Want to learn more about how to further reduce features? Check out my notebook on [Principal Component Analysis](https://www.kaggle.com/code/ulrikthygepedersen/reducing-features-principal-component-analysis/notebook)**

# House Keeping <a id="house"></a>

## Import Libraries, load dataset and do a short summary

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler, OrdinalEncoder

from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, cross_val_score

# load datasets
df_train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
df_test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
df_sample_submission = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv')

# mark train and test sets for future split
df_train['train_test'] = 'Train'
df_test['train_test'] = 'Test'

#combine to a single dataframe with all data for feature engineering
df_all = pd.concat([df_train, df_test])

# print dataset shape and columns
trow, tcol = df_train.shape
erow, ecol = df_test.shape
srow, scol = df_sample_submission.shape

print(f'''
Train Dataset:
Loaded train dataset with shape {df_train.shape} ({trow} rows and {tcol} columns)

Test Dataset:
Loaded test dataset with shape {df_test.shape} ({erow} rows and {ecol} columns)

Sample Submission Dataset:
Loaded sample submission dataset with shape {df_sample_submission.shape} ({srow} rows and {scol} columns)
''')


Train Dataset:
Loaded train dataset with shape (1460, 82) (1460 rows and 82 columns)

Test Dataset:
Loaded test dataset with shape (1459, 81) (1459 rows and 81 columns)

Sample Submission Dataset:
Loaded sample submission dataset with shape (1459, 2) (1459 rows and 2 columns)



# Data Cleaning <a id="clean"></a>

Based on my [previous notebook on Exploratory Data Analysis](https://www.kaggle.com/code/ulrikthygepedersen/exploratory-data-analysis-house-prices), I will drop features with little information to reduce model training time and improve accuracy:

In [2]:
# drop the Id, PoolQC and PoolArea columns
df_all = df_all.drop(['Id', 
                      'PoolQC', 
                      'PoolArea'], 
                      axis=1)

# drop features with little information based on visualizations
df_all = (df_all.drop(['BsmtFinSF2',
                       'LowQualFinSF',
                       'BsmtHalfBath',
                       'KitchenAbvGr',
                       'EnclosedPorch',
                       '3SsnPorch',
                       'MiscVal',
                       'Street', 
                       'Utilities', 
                       'Condition2', 
                       'RoofMatl', 
                       'Heating',
                       'MiscFeature'], 
                       axis=1))

# drop features with little information based on heatmap
df_all = (df_all.drop(['MSSubClass',
                       'OverallCond',
                       'ScreenPorch',
                       'MoSold',
                       'YrSold'], 
                       axis=1))

# Feature Engineering <a id="feature"></a>

Based on my [previous notebook on Feature Engineering](https://www.kaggle.com/code/ulrikthygepedersen/feature-engineering-house-prices), I will impute missing values, encode categorical features and scale numerical features:

In [3]:
# impute missing values in numerical features with the mean of the column
# (assigning the result back avoids the chained-assignment pitfalls of inplace=True)
for col in df_all.columns:
    if (df_all[col].dtype == 'float64') or (df_all[col].dtype == 'int64'):
        df_all[col] = df_all[col].fillna(df_all[col].mean())

# impute missing values in categorical features with the most common value of the column
for col in df_all.columns:
    if df_all[col].dtype == 'object':
        df_all[col] = df_all[col].fillna(df_all[col].mode()[0])
        
# encode ordinal features
for col in ['BsmtQual', 'BsmtCond']:
    OE = OrdinalEncoder(categories=[['No', 'Po', 'Fa', 'TA', 'Gd', 'Ex']])
    df_all[col] = OE.fit_transform(df_all[[col]])

    
for col in ['ExterQual', 'ExterCond', 'KitchenQual']:
    OE = OrdinalEncoder(categories=[['Po', 'Fa', 'TA', 'Gd', 'Ex']])
    df_all[col] = OE.fit_transform(df_all[[col]])
    

OE = OrdinalEncoder(categories=[['N', 'P', 'Y']])
df_all['PavedDrive'] = OE.fit_transform(df_all[['PavedDrive']])


OE = OrdinalEncoder(categories=[['Mix', 'FuseP', 'FuseF', 'FuseA', 'SBrkr']])
df_all['Electrical'] = OE.fit_transform(df_all[['Electrical']])


for col in ['BsmtFinType1', 'BsmtFinType2']:
    OE = OrdinalEncoder(categories=[['No', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ']])
    df_all[col] = OE.fit_transform(df_all[[col]])


OE = OrdinalEncoder(categories=[['C (all)', 'RH', 'RM', 'RL', 'FV']])
df_all['MSZoning'] = OE.fit_transform(df_all[['MSZoning']])


OE = OrdinalEncoder(categories=[['Slab', 'BrkTil', 'Stone', 'CBlock', 'Wood', 'PConc']])
df_all['Foundation'] = OE.fit_transform(df_all[['Foundation']])


OE = OrdinalEncoder(categories=[['MeadowV', 'IDOTRR', 'BrDale', 'Edwards', 'BrkSide', 'OldTown', 'NAmes', 'Sawyer', 'Mitchel', 'NPkVill', 'SWISU', 'Blueste', 'SawyerW', 'NWAmes', 'Gilbert', 'Blmngtn', 'ClearCr', 'Crawfor', 'CollgCr', 'Veenker', 'Timber', 'Somerst', 'NoRidge', 'StoneBr', 'NridgHt']])
df_all['Neighborhood'] = OE.fit_transform(df_all[['Neighborhood']])


OE = OrdinalEncoder(categories=[['None', 'BrkCmn', 'BrkFace', 'Stone']])
df_all['MasVnrType'] = OE.fit_transform(df_all[['MasVnrType']])


OE = OrdinalEncoder(categories=[['AdjLand', 'Abnorml','Alloca', 'Family', 'Normal', 'Partial']])
df_all['SaleCondition'] = OE.fit_transform(df_all[['SaleCondition']])


OE = OrdinalEncoder(categories=[['Gambrel', 'Gable','Hip', 'Mansard', 'Flat', 'Shed']])
df_all['RoofStyle'] = OE.fit_transform(df_all[['RoofStyle']])

# scale all numerical features
numerical_features = df_all.select_dtypes(exclude="object").columns

scaler = StandardScaler()

df_all[numerical_features] = scaler.fit_transform(df_all[numerical_features])

# re-add the original SalePrice, since the imputation and scaling above also modified it
df_train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
df_test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
df_all2 = pd.concat([df_train, df_test])

df_all['SalePrice'] = df_all2['SalePrice']

# ONE HOT ENCODING - COMING SOON

#one_hot_features = ['Alley', 
#                    'LotShape', 
#                    'LandContour', 
#                    'LotConfig', 
#                    'LandSlope', 
#                    'Condition1', 
#                    'GarageQual', 
#                    'GarageCond', 
#                    'Fence', 
#                    'SaleType']

#df_dummies = pd.get_dummies(df_all[one_hot_features])

#df_all = df_all.drop(one_hot_features, axis=1)

#df_all = df_all.join(df_dummies)

#df_all

# drop the remaining categorical features, which have not been encoded yet
drop_col = ['Alley', 
            'LotShape', 
            'LandContour', 
            'LotConfig', 
            'LandSlope', 
            'Condition1', 
            'GarageQual', 
            'GarageCond', 
            'Fence', 
            'SaleType',
            'BldgType',
            'HouseStyle',
            'Exterior1st',
            'Exterior2nd',
            'GarageFinish',
            'GarageType',
            'FireplaceQu',
            'Functional',
            'BsmtExposure',
            'HeatingQC',
            'CentralAir'
            ]

df_all = df_all.drop(drop_col, axis=1)
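
To make the ordinal encoding step above more concrete, here is a minimal sketch on a toy column. The `toy` dataframe and its `Quality` column are made up for illustration; only the rating scale is taken from the cell above. It shows that the order of the `categories` list decides which integer each label gets:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# hypothetical toy column using the same rating scale as ExterQual / KitchenQual
toy = pd.DataFrame({'Quality': ['TA', 'Ex', 'Po', 'Gd']})

# the position in the list becomes the code: Po -> 0, Fa -> 1, TA -> 2, Gd -> 3, Ex -> 4
encoder = OrdinalEncoder(categories=[['Po', 'Fa', 'TA', 'Gd', 'Ex']])

# fit_transform returns a 2D array, so flatten it before storing it as a column
toy['QualityCode'] = encoder.fit_transform(toy[['Quality']]).ravel()

print(toy)
```

Because the codes follow the list order, a worse rating always maps to a smaller number, which is what lets a tree split on "quality below some level".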

# Train-Test Split <a id="split"></a>

In [4]:
# resplit into train and test sets
X_train = df_all[df_all['train_test'] == 'Train'].drop(['train_test', 'SalePrice'], axis =1)
X_test = df_all[df_all['train_test'] == 'Test'].drop(['train_test', 'SalePrice'], axis =1)
y_train = df_all[df_all['train_test'] == 'Train']['SalePrice']
# note: the Kaggle test set has no SalePrice, so y_test contains only NaN values
y_test = df_all[df_all['train_test'] == 'Test']['SalePrice']

print(f'Before training models our train set has {X_train.shape} rows and columns and our test set has {X_test.shape} rows and columns.')

Before training models our train set has (1460, 38) rows and columns and our test set has (1459, 38) rows and columns.
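
The marker-column pattern used for this split can be sketched on two tiny, made-up dataframes (`toy_train` and `toy_test` are hypothetical names, not part of the notebook). The sketch also shows why the target for the test rows cannot be used for evaluation: the test frame has no `SalePrice`, so those rows end up as missing values after the concat.

```python
import pandas as pd

# hypothetical toy frames; only the train frame carries the target
toy_train = pd.DataFrame({'LotArea': [8450, 9600, 11250],
                          'SalePrice': [208500, 181500, 223500]})
toy_test = pd.DataFrame({'LotArea': [11622, 14267]})

# mark the origin of each row, then combine for joint preprocessing
toy_train['train_test'] = 'Train'
toy_test['train_test'] = 'Test'
toy_all = pd.concat([toy_train, toy_test])

# split back into the original parts using the marker column
is_train = toy_all['train_test'] == 'Train'
toy_X_train = toy_all[is_train].drop(['train_test', 'SalePrice'], axis=1)
toy_X_test = toy_all[~is_train].drop(['train_test', 'SalePrice'], axis=1)
toy_y_train = toy_all[is_train]['SalePrice']
toy_y_test = toy_all[~is_train]['SalePrice']

print(toy_X_train.shape, toy_X_test.shape)
print(toy_y_test.isna().all())
```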


# Random Forest Regressor Model <a id="forest"></a>

A random forest regressor is a type of ensemble machine learning model that is used for regression tasks. It is built using a combination of multiple decision trees. Each decision tree is trained on a different random sample of the data and can be restricted to a random subset of the features at each split. The final prediction is made by averaging the predictions of all the decision trees in the forest.

The key idea behind a random forest regressor is to combine the predictions of multiple decision trees, which can decrease the variance and increase the stability of the model. Random forests are also less prone to overfitting than a single decision tree, as they average out the noise in the data.

A random forest regressor is trained using a technique called bagging (Bootstrap Aggregating) which creates multiple training sets by randomly sampling the data with replacement. Each decision tree is then trained on a different bootstrapped training set, which leads to the creation of a diverse set of decision trees.

A random forest regressor can model both linear and non-linear relationships. It is a robust model that works well on both low-dimensional and high-dimensional datasets, that is, datasets with few or many features.
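
The averaging step described above can be checked directly. The following is a small sketch on synthetic data from `make_regression` (not the house price data); it averages the predictions of the individual trees stored in `estimators_` by hand and compares the result with the forest's own prediction:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic stand-in data, used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

forest = RandomForestRegressor(n_estimators=25, random_state=42)
forest.fit(X, y)

# collect the prediction of every individual tree for the first five rows
tree_predictions = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])

# the forest prediction is the mean over the trees
manual_average = tree_predictions.mean(axis=0)
forest_prediction = forest.predict(X[:5])

print(np.allclose(manual_average, forest_prediction))
```

The spread of `tree_predictions` along the first axis also gives a rough feeling for how much the individual trees disagree on a given row.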

In [5]:
# define simple function to judge model performance
def model_performance(model, name):
    print(name)
    print(f'Best Score: {model.best_score_}')
    print(f'Best Parameters: {model.best_params_}\n') 

# instantiate a Random Forest Regressor model
rf = RandomForestRegressor(random_state=42)

# fit the model to the training data
rf.fit(X_train, y_train)

# use the model to predict on the test set
y_hat = rf.predict(X_test).astype(int)

# make a dataframe and save it as a csv for submission
submission = pd.DataFrame({'Id': df_test['Id'], 'SalePrice': y_hat})
submission.to_csv('simple_submission.csv', index=False)
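
The baseline above goes straight to a submission file, so its quality only becomes visible on the leaderboard. A local estimate can be obtained with cross-validation. The sketch below uses synthetic data as a stand-in for the house price features, and it scores the model with RMSE on the logarithm of the target, which is, to my understanding, how the competition evaluates submissions. The names `X_demo` and `y_demo` are assumptions made for this example:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# synthetic stand-in data, used only for illustration
X_demo, y_raw = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=42)

# shift the target so it is strictly positive, like a price, before taking the log
y_demo = y_raw - y_raw.min() + 1.0

model = RandomForestRegressor(n_estimators=50, random_state=42)

# scikit-learn maximizes scores, so RMSE is reported as a negative number
neg_rmse = cross_val_score(model, X_demo, np.log(y_demo),
                           cv=3, scoring='neg_root_mean_squared_error')
rmse_per_fold = -neg_rmse

print(f'Mean RMSE on the log target: {rmse_per_fold.mean():.4f}')
```

With the real data, the same call on `X_train` and `np.log(y_train)` would give a number that is comparable between the baseline and a tuned model.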

# Hyperparameter Tuning <a id="hyper"></a>

Hyperparameter tuning refers to the process of systematically searching for the best combination of hyperparameters for a machine learning model. Hyperparameters are parameters that are not learned from data, but set before the training process begins. 

Examples of hyperparameters include the learning rate of a neural network, the number of trees in a random forest, and the regularization term in a linear regression. Hyperparameter tuning is important because it can significantly impact the performance of a machine learning model. Common techniques for hyperparameter tuning include: 

* Grid search
* Random search
* Bayesian optimization

Since our **Random Forest Regressor** has such a large **parameter space**, we first use a **RandomizedSearchCV** to narrow it down and then a full **GridSearchCV** to fine-tune our model:

## Randomized Search CV

In [6]:
# instantiate a Random Forest Regressor model
rf = RandomForestRegressor(random_state=42)

# set up a parameter grid to search for the best combination of hyperparameters
parameter_grid = {'n_estimators': [50,100,200], 
                  'bootstrap': [True,False],
                  'max_depth': [10,20,50,75,None],
                  # 1.0 uses all features; the older alias 'auto' was removed in scikit-learn 1.3
                  'max_features': [1.0,'sqrt'],
                  'min_samples_leaf': [1,2,4],
                  'min_samples_split': [2,5,10]
                  }

# set up a randomized search that samples n_iter combinations from the grid
rf_random = RandomizedSearchCV(rf, 
                               param_distributions=parameter_grid,
                               n_iter=50,
                               cv=3, 
                               verbose=True,
                               random_state=42,
                               n_jobs=-1)

# fit the model to the training data
rf_random.fit(X_train, y_train)

# evaluate the model
model_performance(rf_random, 'Random Forest Regressor (RandomizedSearchCV)')

Fitting 3 folds for each of 50 candidates, totalling 150 fits
Random Forest Regressor (RandomizedSearchCV)
Best Score: 0.8887659710833334
Best Parameters: {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 50, 'bootstrap': False}



The most important arguments in RandomizedSearchCV are **n_iter**, which controls the number of different combinations to try, and **cv** which is the number of folds to use for cross validation. 

More **iterations** cover a wider part of the search space, and more **cv** folds give a more reliable estimate of each candidate's performance, which lowers the risk of picking hyperparameters that overfit. Raising either one increases the run time. Machine learning is a field of trade-offs, and **performance vs time** is one of the most fundamental.
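
The size of this trade-off can be worked out with a little arithmetic. The sketch below counts the values per hyperparameter in the grid used for the random search above and compares the full grid with what `n_iter=50` and `cv=3` actually evaluate:

```python
from math import prod

# number of candidate values per hyperparameter in the random search grid above
values_per_parameter = {'n_estimators': 3,
                        'bootstrap': 2,
                        'max_depth': 5,
                        'max_features': 2,
                        'min_samples_leaf': 3,
                        'min_samples_split': 3}

n_iter = 50
cv = 3

# an exhaustive grid search would have to try every combination
total_combinations = prod(values_per_parameter.values())

# the random search only fits n_iter sampled combinations, once per fold
total_fits = n_iter * cv
share_explored = n_iter / total_combinations

print(f'{total_combinations} possible combinations')
print(f'{total_fits} model fits for the random search')
print(f'{share_explored:.1%} of the grid explored')
```

The 150 fits match the log line printed by the search above, while an exhaustive search over the same grid with 3 folds would need 1620 fits.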

**We can view the best parameters from fitting the random search:**

In [7]:
rf_random.best_params_

{'n_estimators': 200,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 50,
 'bootstrap': False}

Random search allowed us to narrow down the range for each hyperparameter. Now that we know where to concentrate our search, we can explicitly specify every combination of settings to try. We do this with **GridSearchCV**, a method that, instead of sampling randomly from a distribution, evaluates all combinations we define. 
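
As a minimal sketch of how a grid search behaves, the example below runs `GridSearchCV` on synthetic data with a deliberately tiny grid (`small_grid` and the `_demo` names are assumptions for this example). With the default `refit=True`, the search refits the best combination on the full training data, so the tuned model is available as `best_estimator_` without fitting again:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# synthetic stand-in data, used only for illustration
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=42)

# a deliberately tiny grid: 2 x 2 = 4 combinations
small_grid = {'n_estimators': [20, 40],
              'min_samples_leaf': [1, 2]}

search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid=small_grid,
                      cv=3)
search.fit(X_demo, y_demo)

# every combination in the grid was evaluated
print(len(search.cv_results_['params']))

# the best combination has already been refit on all of X_demo
tuned_model = search.best_estimator_
print(search.best_params_)
print(tuned_model.predict(X_demo[:3]))
```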

**To use Grid Search, we make another grid based on the best values provided by random search:**

## Grid Search CV

In [8]:
# instantiate a Random Forest Regressor model
rf = RandomForestRegressor(random_state=42)

# set up a parameter grid to search for the best combination of hyperparameters
parameter_grid = {'n_estimators': [150,200,250], 
                  'bootstrap': [False],
                  'max_depth': [25,50,75],
                  'max_features': ['sqrt'],
                  'min_samples_leaf': [1,2],
                  'min_samples_split': [3,5,7],
                  }

# fit the model with all combinations of the parameters from the grid
rf_grid = GridSearchCV(estimator = rf, 
                         param_grid=parameter_grid,
                         cv=3, 
                         verbose=True,
                         n_jobs=-1)

# run the grid search on the training data
rf_grid.fit(X_train, y_train)

# evaluate the search
model_performance(rf_grid, 'Random Forest Regressor (GridSearchCV)')

# extract the best performing model, which GridSearchCV has already refit on the full training data
best_rf_model = rf_grid.best_estimator_

# use the model to predict on the test set
y_hat_tuned = best_rf_model.predict(X_test).astype(int)

# make a dataframe and save it as a csv for submission
tuned_submission = pd.DataFrame({'Id': df_test['Id'], 'SalePrice': y_hat_tuned})
tuned_submission.to_csv('submission.csv', index=False)

# print the tuned submission dataframe
tuned_submission

Fitting 3 folds for each of 54 candidates, totalling 162 fits
Random Forest Regressor (GridSearchCV)
Best Score: 0.8889382493998639
Best Parameters: {'bootstrap': False, 'max_depth': 50, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 150}


|  | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 123555 |
| 1 | 1462 | 154813 |
| 2 | 1463 | 181125 |
| 3 | 1464 | 187537 |
| 4 | 1465 | 189907 |
| ... | ... | ... |
| 1454 | 2915 | 79406 |
| 1455 | 2916 | 84294 |
| 1456 | 2917 | 166861 |
| 1457 | 2918 | 114320 |


# Conclusion <a id="conc"></a>

And we have a submission!

In this notebook we explored a **Random Forest Approach** to predicting the sale price of houses in the [House Prices - Advanced Regression Techniques dataset](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques).

* First we prepared the data for modelling
* Next we created a Random Forest model to predict the Sale Price in our dataset
* And last, but not least, we improved our model by tuning its hyperparameters

**If you made it through all my notebooks, first of all, THANK YOU!**

**It has been a blast exploring this dataset and learning along the way! Now to look for a new challenge, stay tuned and take care!**