# **Model Building and Evaluation**

## Objectives

* **Business Requirement 2:**
    * Develop a machine learning model to predict the sale price of Lydia's four inherited houses and any house in Ames, Iowa, with at least 75% accuracy.

## Inputs

## Outputs

## Install Python packages in the Notebook

In [1]:
%pip install -r /workspace/HeritageHousing/requirements.txt

Collecting ydata-profiling (from -r /workspace/HeritageHousing/requirements.txt (line 9))
  Downloading ydata_profiling-4.9.0-py2.py3-none-any.whl.metadata (20 kB)
Collecting feature-engine (from -r /workspace/HeritageHousing/requirements.txt (line 10))
  Downloading feature_engine-1.8.1-py2.py3-none-any.whl.metadata (9.8 kB)
Collecting scipy>=1.6.0 (from scikit-learn->-r /workspace/HeritageHousing/requirements.txt (line 5))
  Downloading scipy-1.13.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting pydantic>=2 (from ydata-profiling->-r /workspace/HeritageHousing/requirements.txt (line 9))
  Downloading pydantic-2.8.2-py3-none-any.whl.metadata (125 kB)
Collecting visions<0.7.7,>=0.7.5 (from visions[type_image_path]<0.7.7,>=0.7.5->ydata-profiling->-r /workspace/HeritageHousing/requirements.txt (line 9))
  Downloading visions-0.7.6-py3-none-any.whl.metadata (11 kB)
Collecting numpy (from -r /workspace/HeritageHousing/requirements.txt (line 2))

## Change working directory

The notebook runs from the notebooks subfolder, so before starting we need to move the working directory from that folder to its parent (the project root).

We first read the current directory using os.getcwd().

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/HeritageHousing/notebooks'

We want to make the parent of the current directory the new current directory.

* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


The cell below confirms the new current directory.

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/HeritageHousing'
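
Re-running cells In [2] and In [3] together moves the working directory one level higher each time, which is easy to do by accident in a notebook. A minimal sketch of a guard, assuming the notebook folder is named notebooks as in the path shown above; project_root_from is a hypothetical helper, not part of the original notebook:

```python
import os


def project_root_from(cwd: str, notebooks_folder: str = "notebooks") -> str:
    """Return the parent of cwd only when cwd is the notebooks folder.

    Hypothetical helper: if cwd is already the project root, it is
    returned unchanged, so calling this twice is harmless.
    """
    if os.path.basename(cwd) == notebooks_folder:
        return os.path.dirname(cwd)
    return cwd


# First call: move up from the notebooks folder.
root = project_root_from("/workspace/HeritageHousing/notebooks")
# Second call: already at the root, so nothing changes.
root_again = project_root_from(root)
print(root, root_again)
```

In the notebook this would be used as os.chdir(project_root_from(os.getcwd())).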

## Libraries Import

In [5]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

## Load the Datasets

Load the main dataset and the inherited houses data, then preview each one to confirm it loaded correctly.

In [8]:
df = pd.read_csv('outputs/datasets/collection/HousePricing.csv')
inherited_houses = pd.read_csv('outputs/datasets/collection/InheritedHouses.csv')

In [9]:
print("Main Housing Dataset:")
df.head()

Main Housing Dataset:


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000


In [10]:
print("\nInherited Houses Dataset:")
inherited_houses.head()


Inherited Houses Dataset:


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd
0,896,0,2,No,468.0,Rec,270.0,0,730.0,Unf,...,11622,80.0,0.0,0,6,5,882.0,140,1961,1961
1,1329,0,3,No,923.0,ALQ,406.0,0,312.0,Unf,...,14267,81.0,108.0,36,6,6,1329.0,393,1958,1958
2,928,701,3,No,791.0,GLQ,137.0,0,482.0,Fin,...,13830,74.0,0.0,34,5,5,928.0,212,1997,1998
3,926,678,3,No,602.0,GLQ,324.0,0,470.0,Fin,...,9978,78.0,20.0,36,6,6,926.0,360,1998,1998


## Data Cleaning

In this step, we will handle missing values and ensure the data is in good shape for model training. We'll apply the same cleaning steps to both the main dataset and the inherited house dataset.

**Check for Missing Values**

First, let's check for missing values in both datasets.

In [12]:
print("Missing values in the main housing dataset:")
print(df.isnull().sum().sort_values(ascending=False))

Missing values in the main housing dataset:
EnclosedPorch    1324
WoodDeckSF       1305
LotFrontage       259
GarageFinish      235
BsmtFinType1      145
BedroomAbvGr       99
2ndFlrSF           86
GarageYrBlt        81
BsmtExposure       38
MasVnrArea          8
1stFlrSF            0
BsmtFinSF1          0
BsmtUnfSF           0
GrLivArea           0
LotArea             0
GarageArea          0
KitchenQual         0
OpenPorchSF         0
OverallQual         0
OverallCond         0
TotalBsmtSF         0
YearBuilt           0
YearRemodAdd        0
SalePrice           0
dtype: int64


In [13]:
print("\nMissing values in the inherited houses dataset:")
print(inherited_houses.isnull().sum().sort_values(ascending=False))


Missing values in the inherited houses dataset:
1stFlrSF         0
2ndFlrSF         0
BedroomAbvGr     0
BsmtExposure     0
BsmtFinSF1       0
BsmtFinType1     0
BsmtUnfSF        0
EnclosedPorch    0
GarageArea       0
GarageFinish     0
GarageYrBlt      0
GrLivArea        0
KitchenQual      0
LotArea          0
LotFrontage      0
MasVnrArea       0
OpenPorchSF      0
OverallCond      0
OverallQual      0
TotalBsmtSF      0
WoodDeckSF       0
YearBuilt        0
YearRemodAdd     0
dtype: int64
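
Raw counts are easier to judge next to the share of rows they represent. The sketch below lists only the columns that have gaps, with a percentage; missing_summary is a hypothetical helper and the toy frame is made-up data, not the housing dataset:

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data for illustration only.
toy = pd.DataFrame({
    "EnclosedPorch": [0.0, np.nan, np.nan, np.nan],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "GarageArea": [548, 460, 608, 642],
})
summary = missing_summary(toy)
print(summary)
```

Calling missing_summary(df) on the main dataset would give the same ranking as the output above, with the added percentage column.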


**Handling Missing Values**

Now that we know which columns have missing values, we will handle each one according to the nature of the feature. The same strategy used earlier is applied to both the main dataset and the inherited houses.

* **Handling Missing Values in Main Housing Data**

In [14]:
# For 'EnclosedPorch' and 'WoodDeckSF', assume missing means the absence of those features. Fill with 0.
df['EnclosedPorch'].fillna(0, inplace=True)
df['WoodDeckSF'].fillna(0, inplace=True)

# For 'LotFrontage', fill with the median value as it’s a critical numeric feature.
df['LotFrontage'].fillna(df['LotFrontage'].median(), inplace=True)

# For 'GarageFinish' and 'GarageYrBlt', assume missing means no garage. Fill with 'No Garage' and 0 respectively.
df['GarageFinish'].fillna('No Garage', inplace=True)
df['GarageYrBlt'].fillna(0, inplace=True)

# For 'BsmtFinType1' and 'BsmtExposure', assume missing means no basement. Fill with 'No Basement' and 'No Exposure'.
df['BsmtFinType1'].fillna('No Basement', inplace=True)
df['BsmtExposure'].fillna('No Exposure', inplace=True)

# For 'BedroomAbvGr', fill with the mode (most common number of bedrooms).
df['BedroomAbvGr'].fillna(df['BedroomAbvGr'].mode()[0], inplace=True)

# For '2ndFlrSF', assume missing means no second floor. Fill with 0.
df['2ndFlrSF'].fillna(0, inplace=True)

# For 'MasVnrArea', assume missing means no masonry veneer. Fill with 0.
df['MasVnrArea'].fillna(0, inplace=True)


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['EnclosedPorch'].fillna(0, inplace=True)
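
The warning itself suggests two replacement patterns: assigning the result back to the column, or passing a column-to-value dictionary to DataFrame.fillna. A minimal sketch of both on made-up data (the toy frame is an assumption for illustration, not the housing dataset):

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "EnclosedPorch": [0.0, np.nan, 112.0],
    "GarageFinish": ["RFn", None, "Unf"],
    "LotFrontage": [65.0, np.nan, 80.0],
})

# Pattern 1: assign the filled column back instead of using inplace=True.
toy["EnclosedPorch"] = toy["EnclosedPorch"].fillna(0)

# Pattern 2: fill several columns at once with a column -> value dictionary.
toy = toy.fillna({
    "GarageFinish": "No Garage",
    "LotFrontage": toy["LotFrontage"].median(),
})
print(toy)
```

Both patterns avoid operating on an intermediate copy, so they do not rely on the chained inplace behavior that the warning describes.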

* **Handling Missing Values in the Inherited Houses Dataset**

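We will use the same strategy for the inherited houses data. The check above found no missing values in this dataset, so the cell acts as a safeguard and leaves the data unchanged.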

In [15]:
# Fill missing values for inherited houses in the same way
inherited_houses['EnclosedPorch'].fillna(0, inplace=True)
inherited_houses['WoodDeckSF'].fillna(0, inplace=True)
inherited_houses['LotFrontage'].fillna(inherited_houses['LotFrontage'].median(), inplace=True)
inherited_houses['GarageFinish'].fillna('No Garage', inplace=True)
inherited_houses['GarageYrBlt'].fillna(0, inplace=True)
inherited_houses['BsmtFinType1'].fillna('No Basement', inplace=True)
inherited_houses['BsmtExposure'].fillna('No Exposure', inplace=True)
inherited_houses['BedroomAbvGr'].fillna(inherited_houses['BedroomAbvGr'].mode()[0], inplace=True)
inherited_houses['2ndFlrSF'].fillna(0, inplace=True)
inherited_houses['MasVnrArea'].fillna(0, inplace=True)


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  inherited_houses['EnclosedPorch'].fillna(0, inplace=True)
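
With only four inherited houses, a median or mode computed from that dataset alone would rest on very little data. One alternative is to learn the fill values from the main dataset and reuse them. This is a swapped-in variant, not what the cell above does, and learn_fill_values and apply_fill_values are hypothetical helpers shown on made-up data:

```python
import numpy as np
import pandas as pd


def learn_fill_values(train: pd.DataFrame) -> dict:
    """Fill values following the notebook's strategy, learned from `train`."""
    return {
        "EnclosedPorch": 0, "WoodDeckSF": 0, "2ndFlrSF": 0,
        "MasVnrArea": 0, "GarageYrBlt": 0,
        "GarageFinish": "No Garage",
        "BsmtFinType1": "No Basement",
        "BsmtExposure": "No Exposure",
        "LotFrontage": train["LotFrontage"].median(),
        "BedroomAbvGr": train["BedroomAbvGr"].mode()[0],
    }


def apply_fill_values(frame: pd.DataFrame, fill_values: dict) -> pd.DataFrame:
    """Fill only the columns that exist in `frame`; returns a new frame."""
    present = {col: val for col, val in fill_values.items() if col in frame.columns}
    return frame.fillna(present)


# Toy data for illustration only.
train = pd.DataFrame({
    "LotFrontage": [60.0, 70.0, 80.0, np.nan],
    "BedroomAbvGr": [3.0, 3.0, 2.0, np.nan],
    "GarageFinish": ["RFn", None, "Unf", "Fin"],
})
new_houses = pd.DataFrame({
    "LotFrontage": [np.nan],
    "BedroomAbvGr": [np.nan],
    "GarageFinish": [None],
})

fill_values = learn_fill_values(train)
train_clean = apply_fill_values(train, fill_values)
new_clean = apply_fill_values(new_houses, fill_values)
print(new_clean)
```

The same dictionary is applied to both frames, so the two datasets cannot drift apart in how they are cleaned.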

* **Check for Missing Values Again**

After handling the missing values, it is good practice to check again and confirm that none remain.

In [16]:
print("Missing values in the main housing dataset:")
print(df.isnull().sum().sort_values(ascending=False))

Missing values in the main housing dataset:
1stFlrSF         0
2ndFlrSF         0
BedroomAbvGr     0
BsmtExposure     0
BsmtFinSF1       0
BsmtFinType1     0
BsmtUnfSF        0
EnclosedPorch    0
GarageArea       0
GarageFinish     0
GarageYrBlt      0
GrLivArea        0
KitchenQual      0
LotArea          0
LotFrontage      0
MasVnrArea       0
OpenPorchSF      0
OverallCond      0
OverallQual      0
TotalBsmtSF      0
WoodDeckSF       0
YearBuilt        0
YearRemodAdd     0
SalePrice        0
dtype: int64


In [17]:
print("\nMissing values in the inherited houses dataset:")
print(inherited_houses.isnull().sum().sort_values(ascending=False))


Missing values in the inherited houses dataset:
1stFlrSF         0
2ndFlrSF         0
BedroomAbvGr     0
BsmtExposure     0
BsmtFinSF1       0
BsmtFinType1     0
BsmtUnfSF        0
EnclosedPorch    0
GarageArea       0
GarageFinish     0
GarageYrBlt      0
GrLivArea        0
KitchenQual      0
LotArea          0
LotFrontage      0
MasVnrArea       0
OpenPorchSF      0
OverallCond      0
OverallQual      0
TotalBsmtSF      0
WoodDeckSF       0
YearBuilt        0
YearRemodAdd     0
dtype: int64
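
Reading a long list of zeros by eye is error-prone, so the same check can be made to fail loudly. A minimal sketch, assuming a hypothetical helper assert_no_missing and made-up data:

```python
import numpy as np
import pandas as pd


def assert_no_missing(frame: pd.DataFrame, name: str) -> None:
    """Raise ValueError listing any columns that still contain NaN."""
    remaining = frame.isnull().sum()
    remaining = remaining[remaining > 0]
    if not remaining.empty:
        raise ValueError(f"{name} still has missing values:\n{remaining}")


# Toy data for illustration only.
clean = pd.DataFrame({"LotFrontage": [65.0, 80.0], "GarageArea": [548, 460]})
dirty = pd.DataFrame({"LotFrontage": [65.0, np.nan], "GarageArea": [548, 460]})

assert_no_missing(clean, "clean toy data")  # passes silently

try:
    assert_no_missing(dirty, "dirty toy data")
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

In the notebook the calls would be assert_no_missing(df, "main dataset") and assert_no_missing(inherited_houses, "inherited houses").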


## Feature Engineering

In this step, we'll create new features and transform categorical data into numerical data, preparing the dataset for modeling.

* **Creating New Features**

Since we’ve already discussed that total square footage (TotalSF) is an important attribute, we’ll create that feature by summing up the square footage from the first floor, second floor, and basement.

In [18]:
df['TotalSF'] = df['1stFlrSF'] + df['2ndFlrSF'] + df['TotalBsmtSF']
inherited_houses['TotalSF'] = inherited_houses['1stFlrSF'] + inherited_houses['2ndFlrSF'] + inherited_houses['TotalBsmtSF']

In [19]:
df[['TotalSF', 'SalePrice']].head()

Unnamed: 0,TotalSF,SalePrice
0,2566.0,208500
1,2524.0,181500
2,2706.0,223500
3,1717.0,140000
4,2290.0,250000


In [20]:
inherited_houses[['TotalSF']].head()

Unnamed: 0,TotalSF
0,1778.0
1,2658.0
2,2557.0
3,2530.0
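
The TotalSF values can be checked by hand against the first rows of the main dataset shown earlier. The sketch below reproduces that arithmetic and also shows why the missing 2ndFlrSF values had to be filled first: any NaN in the sum makes the whole total NaN. The small frame copies the relevant values from the df.head() preview; nothing else is assumed:

```python
import numpy as np
import pandas as pd

# Rows 0-3 of the main dataset as previewed above; row 3 had a missing
# 2ndFlrSF before cleaning.
sample = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920, 961],
    "2ndFlrSF": [854.0, 0.0, 866.0, np.nan],
    "TotalBsmtSF": [856, 1262, 920, 756],
})

# Summing before cleaning: the NaN propagates into the total for row 3.
raw_total = sample["1stFlrSF"] + sample["2ndFlrSF"] + sample["TotalBsmtSF"]

# Summing after filling the missing second floor with 0, as in the cleaning step.
clean_total = sample["1stFlrSF"] + sample["2ndFlrSF"].fillna(0) + sample["TotalBsmtSF"]

print(raw_total.isna().tolist())  # [False, False, False, True]
print(clean_total.tolist())       # [2566.0, 2524.0, 2706.0, 1717.0]
```

The cleaned totals match the first four TotalSF values displayed for the main dataset.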


* **Encoding Categorical Variables**

We need to convert categorical variables into numeric ones so that the machine learning algorithms can process them. Since the categories (like quality ratings) have an inherent order, we'll encode BsmtExposure, BsmtFinType1, GarageFinish, and KitchenQual as ordered integers using an explicit mapping for each column, rather than an automatic label encoder, which would assign codes in alphabetical order.

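Define mappings for each categorical column. Note that 'No Exposure' (the fill value used for houses without a basement) is coded 4 in the BsmtExposure mapping, above 'Gd', so that one code sits outside the low-to-high order.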

In [21]:
bsmt_exposure_mapping = {'No': 0, 'Mn': 1, 'Av': 2, 'Gd': 3, 'No Exposure': 4}
bsmt_fin_type_mapping = {'No Basement': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6}
garage_finish_mapping = {'No Garage': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}
kitchen_qual_mapping = {'Fa': 0, 'TA': 1, 'Gd': 2, 'Ex': 3}

Apply the mappings to the main dataset

In [22]:
df['BsmtExposure'] = df['BsmtExposure'].map(bsmt_exposure_mapping)
df['BsmtFinType1'] = df['BsmtFinType1'].map(bsmt_fin_type_mapping)
df['GarageFinish'] = df['GarageFinish'].map(garage_finish_mapping)
df['KitchenQual'] = df['KitchenQual'].map(kitchen_qual_mapping)

Apply the mappings to the inherited houses dataset

In [23]:
inherited_houses['BsmtExposure'] = inherited_houses['BsmtExposure'].map(bsmt_exposure_mapping)
inherited_houses['BsmtFinType1'] = inherited_houses['BsmtFinType1'].map(bsmt_fin_type_mapping)
inherited_houses['GarageFinish'] = inherited_houses['GarageFinish'].map(garage_finish_mapping)
inherited_houses['KitchenQual'] = inherited_houses['KitchenQual'].map(kitchen_qual_mapping)

Display the first few rows of the main dataset to verify the mappings

In [24]:
df[['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']].head()

Unnamed: 0,BsmtExposure,BsmtFinType1,GarageFinish,KitchenQual
0,0,6,2,2
1,3,5,2,1
2,1,6,2,2
3,0,5,1,2
4,2,6,2,2


Display the first few rows of the inherited houses dataset to verify the mappings

In [25]:
inherited_houses[['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']].head()

Unnamed: 0,BsmtExposure,BsmtFinType1,GarageFinish,KitchenQual
0,0,3,1,1
1,0,5,1,2
2,0,6,3,1
3,0,6,3,2
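
Series.map returns NaN for any value that is not a key in the dictionary, so a category missing from a mapping would silently reintroduce missing values. A minimal sketch of a pre-check, assuming a hypothetical helper unmapped_values and a made-up series containing a 'Po' value that the KitchenQual mapping does not cover:

```python
import pandas as pd


def unmapped_values(series: pd.Series, mapping: dict) -> set:
    """Return the categories present in `series` that `mapping` does not cover."""
    return set(series.dropna().unique()) - set(mapping)


kitchen_qual_mapping = {'Fa': 0, 'TA': 1, 'Gd': 2, 'Ex': 3}

# Toy data for illustration only; 'Po' is deliberately not in the mapping.
toy = pd.Series(['Gd', 'TA', 'Po', 'Ex'])

not_covered = unmapped_values(toy, kitchen_qual_mapping)
encoded = toy.map(kitchen_qual_mapping)

print(not_covered)           # {'Po'}
print(encoded.isna().sum())  # 1

# Hypothetical alternative for BsmtExposure that keeps the no-basement
# category at the bottom of the order (not the notebook's mapping).
bsmt_exposure_ordered = {'No Exposure': 0, 'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4}
```

Running the check on each raw column before calling map() would show whether every category is covered. Switching to the alternative BsmtExposure mapping would change the encoded values displayed above.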


* **Final Check Before Model Building**

Ensure there are no categorical variables left and confirm the data types of all columns.

In [26]:
print("Data types in the main dataset:")
print(df.dtypes)

Data types in the main dataset:
1stFlrSF           int64
2ndFlrSF         float64
BedroomAbvGr     float64
BsmtExposure       int64
BsmtFinSF1         int64
BsmtFinType1       int64
BsmtUnfSF          int64
EnclosedPorch    float64
GarageArea         int64
GarageFinish       int64
GarageYrBlt      float64
GrLivArea          int64
KitchenQual        int64
LotArea            int64
LotFrontage      float64
MasVnrArea       float64
OpenPorchSF        int64
OverallCond        int64
OverallQual        int64
TotalBsmtSF        int64
WoodDeckSF       float64
YearBuilt          int64
YearRemodAdd       int64
SalePrice          int64
TotalSF          float64
dtype: object


In [27]:
print("\nData types in the inherited houses dataset:")
print(inherited_houses.dtypes)


Data types in the inherited houses dataset:
1stFlrSF           int64
2ndFlrSF           int64
BedroomAbvGr       int64
BsmtExposure       int64
BsmtFinSF1       float64
BsmtFinType1       int64
BsmtUnfSF        float64
EnclosedPorch      int64
GarageArea       float64
GarageFinish       int64
GarageYrBlt      float64
GrLivArea          int64
KitchenQual        int64
LotArea            int64
LotFrontage      float64
MasVnrArea       float64
OpenPorchSF        int64
OverallCond        int64
OverallQual        int64
TotalBsmtSF      float64
WoodDeckSF         int64
YearBuilt          int64
YearRemodAdd       int64
TotalSF          float64
dtype: object
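
The two checks described here (no categorical columns left, and matching columns across the two datasets) can also be done programmatically. A sketch with hypothetical helpers non_numeric_columns and column_mismatch on made-up data, assuming SalePrice is the only column expected to be absent from the inherited houses:

```python
import pandas as pd


def non_numeric_columns(frame: pd.DataFrame) -> list:
    """Names of columns whose dtype is not numeric."""
    return frame.select_dtypes(exclude="number").columns.tolist()


def column_mismatch(train: pd.DataFrame, new: pd.DataFrame,
                    target: str = "SalePrice") -> dict:
    """Compare feature columns of `train` (minus the target) with `new`."""
    expected = set(train.columns) - {target}
    actual = set(new.columns)
    return {
        "missing_in_new": sorted(expected - actual),
        "extra_in_new": sorted(actual - expected),
    }


# Toy data for illustration only.
train = pd.DataFrame({
    "TotalSF": [2566.0, 2524.0],
    "KitchenQual": [2, 1],
    "SalePrice": [208500, 181500],
})
new_houses = pd.DataFrame({"TotalSF": [1778.0], "KitchenQual": ["TA"]})

leftover = non_numeric_columns(new_houses)
mismatch = column_mismatch(train, new_houses)
print(leftover)   # ['KitchenQual']
print(mismatch)   # {'missing_in_new': [], 'extra_in_new': []}
```

An empty list from non_numeric_columns and two empty lists from column_mismatch would indicate that both datasets are ready to be passed to a model.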
