# Housing Price Prediction
### Data Source: [Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview)

#### Author: Nick Faupel
#### Date: 2024-01-15

## Overview

#### Goal:
Predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 

#### Metric:
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
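
As a rough illustration of the metric described above, here is a minimal sketch of RMSE computed on log prices. The helper name `rmse_of_logs` and the example prices are assumptions made for illustration; they are not part of the competition code.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(predicted) and log(observed) prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Hypothetical prices: both predictions are 10% too high
cheap = rmse_of_logs([100_000], [110_000])
expensive = rmse_of_logs([500_000], [550_000])

# Both errors equal log(1.1), roughly 0.095, despite the 5x price difference
print(cheap, expensive)
```

Because the error depends only on the ratio of predicted to observed price, a 10% miss costs the same on a cheap house as on an expensive one.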


#### Description:
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.


In [1]:
# Import any necessary libraries

import numpy as np 
import pandas as pd 
%matplotlib inline
import matplotlib.pyplot as plt  
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
from statsmodels.graphics.gofplots import ProbPlot

import sklearn
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.metrics import mean_squared_error, make_scorer
from scipy.stats import skew
from IPython.display import display
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)
from statsmodels.nonparametric.smoothers_lowess import lowess


import warnings
# Suppress noisy warnings (from sklearn and seaborn)
warnings.filterwarnings('ignore')


from scipy import stats
from scipy.stats import norm  # skew is already imported above

# Limit float output to 3 decimal places
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))

In [2]:
# Show versions of some of the libraries used
import sys
print("Python version: ", sys.version)
print(np.__name__, np.__version__)
print(pd.__name__, pd.__version__)
print(sns.__name__, sns.__version__)
print(sklearn.__name__, sklearn.__version__)

Python version:  3.11.5 (main, Sep 11 2023, 08:31:25) [Clang 14.0.6 ]
numpy 1.24.3
pandas 2.1.4
seaborn 0.12.2
sklearn 1.3.0


## Dataset Description
### File descriptions
`train.csv` - the training set \
`test.csv` - the test set \
`data_description.txt` - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here \
`sample_submission.csv` - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms


### Data fields
In total, there are **81 variables**. This includes **79 predictor variables**, **1 Id column**, and **1 response variable**.\
Some of the predictor variables are categorical and dictionaries of their values can be found in the `data_description.txt` file. \
The response variable, `SalePrice`, is a continuous numeric variable. Thus, this is a regression problem.

`SalePrice`: the property's sale price in dollars. This is the target variable that you're trying to predict. \
`MSSubClass`: The building class \
`MSZoning`: The general zoning classification \
`LotFrontage`: Linear feet of street connected to property \
`LotArea`: Lot size in square feet \
`Street`: Type of road access \
`Alley`: Type of alley access \
`LotShape`: General shape of property \
`LandContour`: Flatness of the property \
`Utilities`: Type of utilities available \
`LotConfig`: Lot configuration \
`LandSlope`: Slope of property \
`Neighborhood`: Physical locations within Ames city limits \
`Condition1`: Proximity to main road or railroad \
`Condition2`: Proximity to main road or railroad (if a second is present) \
`BldgType`: Type of dwelling \
`HouseStyle`: Style of dwelling \
`OverallQual`: Overall material and finish quality \
`OverallCond`: Overall condition rating \
`YearBuilt`: Original construction date \
`YearRemodAdd`: Remodel date \
`RoofStyle`: Type of roof \
`RoofMatl`: Roof material \
`Exterior1st`: Exterior covering on house \
`Exterior2nd`: Exterior covering on house (if more than one material) \
`MasVnrType`: Masonry veneer type \
`MasVnrArea`: Masonry veneer area in square feet \
`ExterQual`: Exterior material quality \
`ExterCond`: Present condition of the material on the exterior \
`Foundation`: Type of foundation \
`BsmtQual`: Height of the basement \
`BsmtCond`: General condition of the basement \
`BsmtExposure`: Walkout or garden level basement walls \
`BsmtFinType1`: Quality of basement finished area \
`BsmtFinSF1`: Type 1 finished square feet \
`BsmtFinType2`: Quality of second finished area (if present) \
`BsmtFinSF2`: Type 2 finished square feet \
`BsmtUnfSF`: Unfinished square feet of basement area \
`TotalBsmtSF`: Total square feet of basement area \
`Heating`: Type of heating \
`HeatingQC`: Heating quality and condition \
`CentralAir`: Central air conditioning \
`Electrical`: Electrical system \
`1stFlrSF`: First floor square feet \
`2ndFlrSF`: Second floor square feet \
`LowQualFinSF`: Low quality finished square feet (all floors) \
`GrLivArea`: Above grade (ground) living area square feet \
`BsmtFullBath`: Basement full bathrooms \
`BsmtHalfBath`: Basement half bathrooms \
`FullBath`: Full bathrooms above grade \
`HalfBath`: Half baths above grade \
`BedroomAbvGr`: Number of bedrooms above basement level \
`KitchenAbvGr`: Number of kitchens \
`KitchenQual`: Kitchen quality \
`TotRmsAbvGrd`: Total rooms above grade (does not include bathrooms) \
`Functional`: Home functionality rating \
`Fireplaces`: Number of fireplaces \
`FireplaceQu`: Fireplace quality \
`GarageType`: Garage location \
`GarageYrBlt`: Year garage was built \
`GarageFinish`: Interior finish of the garage \
`GarageCars`: Size of garage in car capacity \
`GarageArea`: Size of garage in square feet \
`GarageQual`: Garage quality \
`GarageCond`: Garage condition \
`PavedDrive`: Paved driveway \
`WoodDeckSF`: Wood deck area in square feet \
`OpenPorchSF`: Open porch area in square feet \
`EnclosedPorch`: Enclosed porch area in square feet \
`3SsnPorch`: Three season porch area in square feet \
`ScreenPorch`: Screen porch area in square feet \
`PoolArea`: Pool area in square feet \
`PoolQC`: Pool quality \
`Fence`: Fence quality \
`MiscFeature`: Miscellaneous feature not covered in other categories \
`MiscVal`: Dollar value of miscellaneous feature \
`MoSold`: Month Sold \
`YrSold`: Year Sold \
`SaleType`: Type of sale \
`SaleCondition`: Condition of sale

In [3]:
# Load the training dataset
houses_train = pd.read_csv('train.csv')

# View the first 5 rows
houses_train.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
print(f'The training dataset contains {houses_train.shape[0]} rows and {houses_train.shape[1]} columns')

# Check the column names
houses_train.columns

The training dataset contains 1460 rows and 81 columns


Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

This dataset has many predictor variables, and the count grows further once the categorical variables with many levels are encoded as dummy variables. I will therefore likely work with a smaller subset of them, since it is unlikely that all of these variables are useful for making accurate predictions.

A smaller subset also makes sense given the size of the training set: 1460 observations is far fewer than we might have when working with real-world data. Using too many predictor variables on such a small training dataset is likely to lead to overfitting, because the model can start fitting noise rather than signal.
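
To make the effect of dummy encoding concrete, here is a small sketch using `pd.get_dummies` on a toy frame. The frame `toy` and its values are illustrative stand-ins for a few columns of `train.csv`, not the real data.

```python
import pandas as pd

# Toy frame standing in for a few columns of train.csv (values are illustrative)
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550],
    'MSZoning': ['RL', 'RM', 'RL', 'FV'],
    'Neighborhood': ['CollgCr', 'Veenker', 'Crawfor', 'NoRidge'],
})

# drop_first=True drops one level per categorical to avoid perfect collinearity
encoded = pd.get_dummies(toy, drop_first=True)

# Compare the column count before and after encoding
print(toy.shape[1], '->', encoded.shape[1], 'columns')
print(list(encoded.columns))
```

Each categorical column with k levels contributes k - 1 dummy columns here, so variables with many levels quickly inflate the number of predictors.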

In [17]:
houses_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC
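
The non-null counts in the `info()` output point to missing values, for example `LotFrontage` with 1201 of 1460 entries and `Alley` with only 91. A minimal sketch of how such gaps could be summarized per column is shown below; the frame `toy` is a hypothetical stand-in, and the same calls should apply to `houses_train`.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps, standing in for houses_train (values are illustrative)
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, 60.0],
    'Alley': [np.nan, np.nan, 'Grvl', np.nan],
    'LotArea': [8450, 9600, 11250, 9550],
})

# Count missing entries per column and express them as a share of all rows
missing = toy.isna().sum()
missing_summary = pd.DataFrame({
    'n_missing': missing,
    'share_missing': missing / len(toy),
})

# Keep only columns that have gaps, with the worst offenders first
missing_summary = (missing_summary[missing_summary['n_missing'] > 0]
                   .sort_values('n_missing', ascending=False))
print(missing_summary)
```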

#### Incorrectly coded categorical variables:

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	=      1-STORY 1946 & NEWER ALL STYLES
        30	=      1-STORY 1945 & OLDER
        40	=      1-STORY W/FINISHED ATTIC ALL AGES
        45	=      1-1/2 STORY - UNFINISHED ALL AGES
        50	=      1-1/2 STORY FINISHED ALL AGES
        60	=      2-STORY 1946 & NEWER
        70	=      2-STORY 1945 & OLDER
        75	=      2-1/2 STORY ALL AGES
        80	=      SPLIT OR MULTI-LEVEL
        85	=      SPLIT FOYER
        90	=      DUPLEX - ALL STYLES AND AGES
       120	=      1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	=      1-1/2 STORY PUD - ALL AGES
       160	=      2-STORY PUD - 1946 & NEWER
       180	=      PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	=      2 FAMILY CONVERSION - ALL STYLES AND AGES


OverallQual: Rates the overall material and finish of the house

       10	=      Very Excellent
       9	=      Excellent
       8	=      Very Good
       7	=      Good
       6	=      Above Average
       5	=      Average
       4	=      Below Average
       3	=      Fair
       2	=      Poor
       1	=      Very Poor
	
OverallCond: Rates the overall condition of the house

       10	=      Very Excellent
       9	=      Excellent
       8	=      Very Good
       7	=      Good
       6	=      Above Average
       5	=      Average
       4	=      Below Average
       3	=      Fair
       2	=      Poor
       1	=      Very Poor

In [6]:
# Fix some of the categorical variables that are coded incorrectly
cat_vars_to_fix = ['MSSubClass', 'OverallQual', 'OverallCond']

for col in cat_vars_to_fix:
    houses_train[col] = houses_train[col].astype('category')

# Check that the variable types have been correctly updated
houses_train[cat_vars_to_fix].dtypes

MSSubClass     category
OverallQual    category
OverallCond    category
dtype: object
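
`OverallQual` and `OverallCond` are rating scales from 1 (Very Poor) to 10 (Very Excellent), so their levels have a natural order that a plain `category` dtype does not record. One possible alternative, sketched here on a toy column rather than on `houses_train`, is an ordered categorical; the name `quality_dtype` is an assumption for illustration and this is not the approach taken in the cell above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy ratings standing in for the OverallQual column (values are illustrative)
toy = pd.DataFrame({'OverallQual': [7, 6, 7, 8, 5]})

# Levels 1 (Very Poor) to 10 (Very Excellent), declared in rank order
quality_dtype = CategoricalDtype(categories=list(range(1, 11)), ordered=True)
toy['OverallQual'] = toy['OverallQual'].astype(quality_dtype)

# An ordered categorical supports comparisons and min/max
is_ordered = toy['OverallQual'].cat.ordered
best_rating = toy['OverallQual'].max()
n_good_or_better = int((toy['OverallQual'] >= 7).sum())
print(is_ordered, best_rating, n_good_or_better)
```

Whether the ordering matters depends on the model: dummy encoding ignores it, while an ordinal encoding would use it.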

In [5]:
houses_train.describe().T

,count,mean,std,min,25%,50%,75%,max
Id,1460.0,730.5,421.61,1.0,365.75,730.5,1095.25,1460.0
MSSubClass,1460.0,56.897,42.301,20.0,20.0,50.0,70.0,190.0
LotFrontage,1201.0,70.05,24.285,21.0,59.0,69.0,80.0,313.0
LotArea,1460.0,10516.828,9981.265,1300.0,7553.5,9478.5,11601.5,215245.0
OverallQual,1460.0,6.099,1.383,1.0,5.0,6.0,7.0,10.0
OverallCond,1460.0,5.575,1.113,1.0,5.0,5.0,6.0,9.0
YearBuilt,1460.0,1971.268,30.203,1872.0,1954.0,1973.0,2000.0,2010.0
YearRemodAdd,1460.0,1984.866,20.645,1950.0,1967.0,1994.0,2004.0,2010.0
MasVnrArea,1452.0,103.685,181.066,0.0,0.0,0.0,166.0,1600.0
BsmtFinSF1,1460.0,443.64,456.098,0.0,0.0,383.5,712.25,5644.0
