# Modelling - House Prices

![](https://mljar.com/images/machine-learning/random_forest_logo.png)

# Table of Contents

* [Introduction](#introduction)
* [House Keeping](#house)
* [Data Cleaning](#clean)
* [Feature Engineering](#feature)
* [Train-Test Split](#split)
* [Random Forest Regressor Model](#forest)
* [XGBoost Regressor Model](#XGB)
* [Hyperparameter Tuning](#hyper)
* [Conclusion](#conc)

# Introduction <a id="introduction"></a>

*Machine learning modeling refers to the process of building and training a machine learning model to make predictions or decisions based on input data. The process typically involves the following steps:*

* *Data preparation: Collecting and cleaning the data that will be used to train the model.*
* *Feature engineering: Selecting and transforming the features that will be used by the model.*
* *Model selection: Choosing an appropriate machine learning algorithm or model architecture that is suited to the problem at hand.*
* *Model training: Using the prepared data to train the model.*
* *Model evaluation: Evaluating the performance of the model using metrics suited to the task, such as RMSE, MAE, or R² for regression, or accuracy, precision, recall, and F1-score for classification.*
* *Model tuning or optimization: Adjusting the hyperparameters of the model to improve its performance.*
* *Model deployment: Putting the trained model into production, where it can be used to make predictions on new data.*

*The goal of machine learning modeling is to build a model that can generalize well to new data and make accurate predictions or decisions. This process is iterative and may involve several rounds of model selection, training, and evaluation to find the best performing model.*
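
For a regression task such as house prices, evaluation rests on an error metric rather than on accuracy. The House Prices competition scores submissions by the RMSE between the logarithm of the predicted and the logarithm of the observed sale price. Below is a minimal sketch of that metric; the helper names `rmse` and `log_rmse` are hypothetical and the prices are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def log_rmse(y_true, y_pred):
    """RMSE on log prices, so relative errors weigh the same for cheap and expensive houses."""
    return rmse(np.log(y_true), np.log(y_pred))


# made-up prices, for illustration only
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]

print(f"RMSE:     {rmse(actual, predicted):,.0f}")
print(f"log-RMSE: {log_rmse(actual, predicted):.4f}")
```

Working on the log scale means that missing a 100,000 house by 10,000 costs about as much as missing a 1,000,000 house by 100,000.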

In this notebook I will dive into the [House Prices - Advanced Regression Techniques](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques) dataset to explore and learn along the way.

**Hope you enjoy it! Let me know how I can improve, and if you liked it, an upvote would help me out a lot!**

**Looking for Exploratory Data Analysis on this dataset? Check out my notebook [Exploratory Data Analysis - House Prices](https://www.kaggle.com/code/ulrikthygepedersen/exploratory-data-analysis-house-prices/notebook)**

**Want to learn more about making your data ready for modelling? Check out my notebook on [Feature Engineering](https://www.kaggle.com/code/ulrikthygepedersen/feature-engineering-house-prices/notebook)**

**Want to learn more about how to further reduce features? Check out my notebook on [Principal Component Analysis](https://www.kaggle.com/code/ulrikthygepedersen/reducing-features-principal-component-analysis/notebook)**

# House Keeping <a id="house"></a>

## Import Libraries, load dataset and do a short summary

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# load datasets
df_train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
df_test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
df_sample_submission = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv')

# mark train and test sets for future split
df_train['train_test'] = 1
df_test['train_test'] = 0

# combine to a single dataframe with all data for feature engineering
df_all = pd.concat([df_train, df_test])

# print dataset shape and columns
trow, tcol = df_train.shape
erow, ecol = df_test.shape
srow, scol = df_sample_submission.shape

print(f'''
Train Dataset:
Loaded train dataset with shape {df_train.shape} ({trow} rows and {tcol} columns)

Test Dataset:
Loaded test dataset with shape {df_test.shape} ({erow} rows and {ecol} columns)

Sample Submission Dataset:
Loaded sample submission dataset with shape {df_sample_submission.shape} ({srow} rows and {scol} columns)
''')


Train Dataset:
Loaded train dataset with shape (1460, 82) (1460 rows and 82 columns)

Test Dataset:
Loaded test dataset with shape (1459, 81) (1459 rows and 81 columns)

Sample Submission Dataset:
Loaded sample submission dataset with shape (1459, 2) (1459 rows and 2 columns)



# Data Cleaning <a id="clean"></a>

Based on my [previous notebook on Exploratory Data Analysis](https://www.kaggle.com/code/ulrikthygepedersen/exploratory-data-analysis-house-prices), I will drop features with little information to reduce model training time and improve accuracy:

In [2]:
# drop the Id column and the two pool features (PoolQC, PoolArea)
df_all = df_all.drop(['Id', 
                      'PoolQC', 
                      'PoolArea'], 
                      axis=1)

# drop features with little information based on visualizations
df_all = (df_all.drop(['BsmtFinSF2',
                       'LowQualFinSF',
                       'BsmtHalfBath',
                       'KitchenAbvGr',
                       'EnclosedPorch',
                       '3SsnPorch',
                       'MiscVal',
                       'Street', 
                       'Utilities', 
                       'Condition2', 
                       'RoofMatl', 
                       'Heating',
                       'MiscFeature'], 
                       axis=1))

# drop features with little information based on heatmap
df_all = (df_all.drop(['MSSubClass',
                       'OverallCond',
                       'ScreenPorch',
                       'MoSold',
                       'YrSold'], 
                       axis=1))

# Feature Engineering <a id="feature"></a>

Based on my [previous notebook on Feature Engineering](https://www.kaggle.com/code/ulrikthygepedersen/feature-engineering-house-prices), I will impute missing values, encode categorical features and scale numerical features:

In [3]:
from sklearn.preprocessing import OrdinalEncoder

# keep the target and the train/test marker out of imputation and scaling,
# so SalePrice stays in its original units and the marker stays 0/1 for the split
protected = ['SalePrice', 'train_test']

# replace missing numerical values with the mean of the column
numerical_features = df_all.select_dtypes(include='number').columns.drop(protected)
for col in numerical_features:
    df_all[col] = df_all[col].fillna(df_all[col].mean())

# replace missing categorical values with the most common value of the column
categorical_features = df_all.select_dtypes(exclude='number').columns
for col in categorical_features:
    df_all[col] = df_all[col].fillna(df_all[col].mode()[0])

# encode ordinal features, listing the categories of each column from lowest to highest
quality = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
basement_finish = ['No', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ']

ordinal_categories = {
    'BsmtQual': ['No'] + quality,
    'BsmtCond': ['No'] + quality,
    'ExterQual': quality,
    'ExterCond': quality,
    'KitchenQual': quality,
    'PavedDrive': ['N', 'P', 'Y'],
    'Electrical': ['Mix', 'FuseP', 'FuseF', 'FuseA', 'SBrkr'],
    'BsmtFinType1': basement_finish,
    'BsmtFinType2': basement_finish,
    'MSZoning': ['C (all)', 'RH', 'RM', 'RL', 'FV'],
    'Foundation': ['Slab', 'BrkTil', 'Stone', 'CBlock', 'Wood', 'PConc'],
    'Neighborhood': ['MeadowV', 'IDOTRR', 'BrDale', 'Edwards', 'BrkSide', 'OldTown', 'NAmes', 'Sawyer', 'Mitchel', 'NPkVill', 'SWISU', 'Blueste', 'SawyerW', 'NWAmes', 'Gilbert', 'Blmngtn', 'ClearCr', 'Crawfor', 'CollgCr', 'Veenker', 'Timber', 'Somerst', 'NoRidge', 'StoneBr', 'NridgHt'],
    'MasVnrType': ['None', 'BrkCmn', 'BrkFace', 'Stone'],
    'SaleCondition': ['AdjLand', 'Abnorml', 'Alloca', 'Family', 'Normal', 'Partial'],
    'RoofStyle': ['Gambrel', 'Gable', 'Hip', 'Mansard', 'Flat', 'Shed'],
}

for col, categories in ordinal_categories.items():
    OE = OrdinalEncoder(categories=[categories])
    df_all[col] = OE.fit_transform(df_all[[col]]).ravel()

# scale all numerical features (now including the encoded ordinal columns)
numerical_features = df_all.select_dtypes(include='number').columns.drop(protected)

scaler = StandardScaler()

df_all[numerical_features] = scaler.fit_transform(df_all[numerical_features])

# Train-Test Split <a id="split"></a>
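
The `train_test` marker added while loading the data is what makes this split possible: rows marked 1 came from `train.csv` and rows marked 0 from `test.csv`. The block below is a minimal sketch of that idea on a toy frame. The name `df_toy` and its values are made up for illustration, and this is not necessarily the exact split code used for the models below.

```python
import pandas as pd

# toy stand-in for the combined dataframe; values are for illustration only
df_toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 896, 1329],
    "SalePrice": [208500.0, 181500.0, 223500.0, None, None],
    "train_test": [1, 1, 1, 0, 0],
})

# rows marked 1 came from train.csv, rows marked 0 from test.csv
train = df_toy[df_toy["train_test"] == 1].drop(columns="train_test")
test = df_toy[df_toy["train_test"] == 0].drop(columns=["train_test", "SalePrice"])

# separate the features from the target for model training
X_train = train.drop(columns="SalePrice")
y_train = train["SalePrice"]

print(X_train.shape, y_train.shape, test.shape)
```

The test rows have no known `SalePrice`, so the column is dropped there and only the training rows carry a target.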

# Random Forest Regressor Model <a id="forest"></a>

A random forest regressor is a type of ensemble machine learning model that is used for regression tasks. It is built using a combination of multiple decision trees. Each decision tree is trained on a different subset of the data and with different subsets of the features. The final prediction is made by averaging the predictions of all the decision trees in the forest.
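
As a minimal sketch of the averaging step, the block below fits scikit-learn's `RandomForestRegressor` on synthetic data (made up here, not the house prices data) and checks that the forest's prediction matches the mean of its individual trees' predictions. The hyperparameter values are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic regression data, for illustration only
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X, y)

X_new = rng.uniform(-3, 3, size=(5, 2))
forest_prediction = forest.predict(X_new)

# collect every tree's prediction and average them by hand
tree_predictions = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

print("forest matches the mean of its trees:",
      np.allclose(forest_prediction, manual_average))
```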

The key idea behind a random forest regressor is to combine the predictions of multiple decision trees, which can decrease the variance and increase the stability of the model. Random forests are also less prone to overfitting than a single decision tree, as they average out the noise in the data.

A random forest regressor is trained using a technique called bagging (Bootstrap Aggregating) which creates multiple training sets by randomly sampling the data with replacement. Each decision tree is then trained on a different bootstrapped training set, which leads to the creation of a diverse set of decision trees.
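
To make the sampling step concrete, here is a small sketch of a single bootstrap sample drawn with NumPy. The row count is an arbitrary choice. Because rows are drawn with replacement, some appear several times and others not at all; for large datasets roughly 63% of the rows end up in a given sample, and the remainder are that tree's out-of-bag rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1000
row_ids = np.arange(n_rows)

# one bootstrap sample: draw n_rows row indices with replacement
bootstrap_ids = rng.integers(0, n_rows, size=n_rows)

# rows that made it into the sample at least once, and rows that were left out
in_bag = np.unique(bootstrap_ids)
out_of_bag = np.setdiff1d(row_ids, in_bag)

unique_fraction = len(in_bag) / n_rows
print(f"unique rows in the sample: {unique_fraction:.1%}")
print(f"rows left out (out-of-bag): {len(out_of_bag)}")
```

Each tree sees a different sample of this kind, which is where the diversity between the trees comes from.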

A random forest regressor can model both linear and non-linear relationships. It is a robust model that works well on both low-dimensional and high-dimensional datasets, including those with a large number of features.

# XGBoost Regressor Model <a id="XGB"></a>

XGBoost (eXtreme Gradient Boosting) is an open-source, distributed gradient boosting library used for supervised learning problems such as regression and classification. The XGBoost regressor is the library's model for regression tasks.

Like a random forest regressor, an XGBoost regressor is an ensemble method that combines the predictions of multiple decision trees. The main difference is how the trees are built. XGBoost uses gradient boosting, which builds decision trees sequentially: each new tree is trained to correct the errors still made by the trees before it.
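
The sequential idea can be sketched by hand with shallow scikit-learn trees. This is a simplified illustration of gradient boosting with squared error on synthetic data, not XGBoost's actual implementation, which adds regularization and further optimizations. The learning rate, tree depth and number of rounds are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1
n_rounds = 50

# start from a constant prediction: the mean of the target
prediction = np.full_like(y, y.mean())
trees = []
training_mse = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y - prediction                     # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                         # the next tree models those errors
    prediction += learning_rate * tree.predict(X)  # take a small step towards them
    trees.append(tree)
    training_mse.append(np.mean((y - prediction) ** 2))

print(f"training MSE before boosting: {training_mse[0]:.4f}")
print(f"training MSE after {n_rounds} rounds: {training_mse[-1]:.4f}")
```

Unlike the trees of a random forest, these trees cannot be trained in parallel, because each one depends on the predictions of all the trees before it.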

XGBoost also includes several other techniques to improve the performance of the model, such as regularization, which helps to prevent overfitting, sparsity-aware split finding, which handles missing values by learning a default direction for them at each split, and a "weighted quantile sketch", which proposes candidate split points efficiently on large datasets.

XGBoost regressor is a powerful model that is often used in machine learning competitions and has been known to perform well on a wide range of datasets. It is also computationally efficient and can scale well to large datasets.

# Hyperparameter Tuning <a id="hyper"></a>

Hyperparameter tuning refers to the process of systematically searching for the best combination of hyperparameters for a machine learning model. Hyperparameters are parameters that are not learned from data, but set before the training process begins. 

Examples of hyperparameters include the learning rate of a neural network, the number of trees in a random forest, and the regularization strength in a regularized linear regression. Hyperparameter tuning is important because it can significantly affect the performance of a machine learning model. Common techniques for hyperparameter tuning include:

* Grid search
* Random search
* Bayesian optimization
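
The first two techniques can be sketched with scikit-learn's `GridSearchCV` and `RandomizedSearchCV`. The data below is synthetic and the grid values are arbitrary choices for the illustration, not tuned recommendations for the house prices data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(scale=0.1, size=120)

# candidate values for two random forest hyperparameters
param_grid = {"n_estimators": [20, 40], "max_depth": [2, 4, None]}

# grid search: evaluate every combination (2 x 3 = 6) with 3-fold cross-validation
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)

# random search: evaluate only a random subset of 3 combinations
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_grid,
    n_iter=3,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
random_search.fit(X, y)

print("grid search best parameters:  ", grid.best_params_)
print("random search best parameters:", random_search.best_params_)
```

Grid search is exhaustive, so its cost grows quickly with every added hyperparameter, while random search keeps the cost fixed through `n_iter` at the risk of missing the best combination.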

# Conclusion <a id="conc"></a>