# Iowa Housing Prices Prediction

 # Abstract
 This project utilizes Iowa's Housing dataset, which provides comprehensive information on residential properties in Ames and Des Moines, Iowa, to develop accurate predictive models for house prices. Since Des Moines is the largest populated city in Iowa, so I chose this dataset. Key features selected for analysis include "Heating," "HeatingQC," "GrLivArea," "SalePrice," "BedroomAbvGr," "KitchenQual," "GarageCars," "GarageQual," and "WoodDeckSF." Categorical variables are converted to numerical values to facilitate exploratory data analysis (EDA) and model building. By drawing on previous research and methodologies, including Gusthema's Kaggle kernel and studies on housing price determinants by Deng et al. and DiPasquale and Wheaton, this project aims to enhance the understanding of housing price dynamics in Lowa. To enhance model performance, I conducted hyperparameter tuning and cross-validation. The final model achieved a high degree of accuracy and demonstrated robust predictive capabilities. The results of this study can assist real estate professionals, potential buyers, and policymakers in making informed decisions regarding property investments and pricing strategies


# Research Question

What factors have the greatest influence on predicting the sales price of residential properties, and how accurately can these factors be leveraged to predict SalePrice in a given dataset?

## Background and Prior Work

Predicting housing prices is a crucial task in real estate, aiding buyers, sellers, and investors in making informed decisions. In the context of the Ames and Des Moines Housing dataset, which comprises extensive information on residential properties in Ames, Iowa, this project aims to develop accurate predictive models for house prices. The dataset includes diverse features such as dwelling types, zoning classifications, lot sizes, building styles, and various property attributes, offering a rich resource for exploring the determinants of housing prices in the region.

Prior work in housing price prediction has explored similar datasets and methodologies to understand the factors influencing property values and develop predictive models. For instance, Gusthema's Kaggle kernel titled "House Prices Prediction using TFDF" provides an example of utilizing TensorFlow Decision Forests (TF-DF) to predict house prices based on the Ames Housing dataset. This work likely encompasses data preprocessing, feature engineering, and model training to achieve accurate predictions.

Additionally, research in real estate economics has investigated the determinants of housing prices in various contexts. Studies like that by Deng et al. (2016) analyzed housing price determinants in Beijing, China, while DiPasquale and Wheaton (1996) examined the impact of neighborhood characteristics on property values in the United States. These studies highlight the importance of factors such as location, size, amenities, and neighborhood attributes in shaping housing prices.

By synthesizing insights from prior work and leveraging the extensive features of the Ames Housing dataset, this project aims to contribute to the understanding of housing price dynamics in Ames, Iowa. Through exploratory data analysis, feature engineering, and advanced regression techniques, I seek to develop accurate predictive models for house prices while uncovering actionable insights for stakeholders in the local real estate market.

# Hypothesis


The sales price of residential properties is primarily influenced by factors such as location, size (square footage), number of bedrooms and bathrooms, overall condition, and amenities in Iowa, based on the data of Ames and Des Moines.

The sales price of residential properties is predominantly influenced by location, size, condition, and amenities. Factors such as neighborhood desirability, square footage, overall condition, and the presence of amenities like bedrooms, bathrooms, and updated features play crucial roles in determining property value. Understanding and accurately assessing these factors are essential for predicting house prices effectively in the real estate market.

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name: Ames.csv
  - Number of observations: 1320
  - Number of variables: 15
    

- Dataset #2
  - Dataset Name: Des Moines.csv
  - Number of observations: 1610
  - Number of variables: 15
 
    
In these dataset, I will include variables that like heating, different sizes of living area, numbers of different rooms, and also their conditions. Most of the data have numbers to it, like the size or the numbers, I also assgin the number to the conditions by dividing them into different levels. Moreover, there are variables that contains strings. This dataset is very close to out ideal dataset, but I will further choose the variables that I need later and also asign the different values to certain variables for easier analysis. 

These variables description are from the original dataset. Based on our normal knowledge and experiences, I choose some features first to do the analysis.

These features include ['Foundation', 'GrLivArea', 'MSSubClass', 'TotalBsmtSF','YrSold','YearBuilt', 'BedroomAbvGr', 'OverallQual','OverallCond', 'Neighborhood', 'Functional', 'KitchenAbvGr', 'TotRmsAbvGrd', 'GarageCars']. In the future analysis and model building, I may also choose or delete some features.
However some values are string such as "Ex", but I expect to have int values so that it will be convenient for us to do EDA and model building. 

Hence, I change these values to corresponding numbers. I change Ex to 5, Gd to 4, TA to 3, Fa to 2, Po to 1, NA to 0. And also replace Typ, Min1, Min2, Mod, Maj1, Maj2, Sev and Sal correlated to 0 to 7

## Dataset training

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df_Ames = pd.read_csv('Ames.csv')
df_Des_Moines = pd.read_csv('Des Moines.csv')
df = pd.concat([df_Ames, df_Des_Moines], axis=0)
lst = ['SalePrice','Foundation', 'GrLivArea', 'MSSubClass', 'TotalBsmtSF','YrSold','YearBuilt', 'BedroomAbvGr', 'OverallQual','OverallCond', 'Neighborhood', 'Functional', 
       'KitchenAbvGr', 'TotRmsAbvGrd', 'GarageCars']
df = df[lst]
df.isna().any()
df.replace({'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'NA':0}, inplace = True)
df.replace({'Typ': 0, 'Min1': 1,'Min2': 2, 'Mod': 3, 'Maj1': 4, 'Maj2': 5, 'Sev': 6, 'Sal' : 7}, inplace = True)
df.fillna(0, inplace = True)
df['Yrowned'] = df['YrSold'] - df['YearBuilt']
df.columns

Index(['SalePrice', 'Foundation', 'GrLivArea', 'MSSubClass', 'TotalBsmtSF',
       'YrSold', 'YearBuilt', 'BedroomAbvGr', 'OverallQual', 'OverallCond',
       'Neighborhood', 'Functional', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'GarageCars', 'Yrowned'],
      dtype='object')

### Section 1 of EDA - Evaluation of Machine Learning Models for Predicting House Prices

The provided code is designed to evaluate the performance of different machine learning models for predicting house prices based on a variety of features. It begins by importing necessary libraries for data manipulation and machine learning. It then prepares the feature set (`X`) and the target variable (`y`) from a DataFrame `df`, selecting relevant columns like 'GrLivArea', 'OverallQual', and 'Neighborhood'. The data is split into training and test sets with an 80-20 ratio. The core of the process is a loop that runs ten iterations of model training and evaluation. For each iteration, it splits the data, constructs a preprocessing pipeline that includes one-hot encoding for the 'Neighborhood' categorical variable, and combines this with a RandomForestRegressor model. The pipeline is then fitted to the training data, and R² scores are computed for both the training and test sets. These scores are collected across all iterations, and the average R² scores for training and test sets are calculated and printed, providing a measure of model performance and generalization capability.

In [3]:
from scipy.stats import f_oneway

# Performing ANOVA
neighborhoods = df['Neighborhood'].unique()
neighborhoods_prices = [df[df['Neighborhood'] == neighborhood]['SalePrice'] for neighborhood in neighborhoods]
anova_result_np = f_oneway(*neighborhoods_prices)
print(anova_result_np)

foundations = df['Foundation'].unique()
foundations_prices = [df[df['Foundation'] == foundation]['SalePrice'] for foundation in foundations]
anova_result_fp = f_oneway(*foundations_prices)
print(anova_result_fp)

F_onewayResult(statistic=144.3950774998117, pvalue=0.0)
F_onewayResult(statistic=227.56782672988663, pvalue=1.2671240384616744e-205)


In [4]:
#X = df[['GrLivArea','MSSubClass', 'BedroomAbvGr', 'Yrowned', 'TotalBsmtSF','OverallQual','OverallCond', 'Neighborhood', 'Functional', 'KitchenAbvGr', 'TotRmsAbvGrd', 'GarageCars']]
y = df['SalePrice']
df.columns

Index(['SalePrice', 'Foundation', 'GrLivArea', 'MSSubClass', 'TotalBsmtSF',
       'YrSold', 'YearBuilt', 'BedroomAbvGr', 'OverallQual', 'OverallCond',
       'Neighborhood', 'Functional', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'GarageCars', 'Yrowned'],
      dtype='object')

In [5]:
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

train_r2_scores = []
test_r2_scores = []

X = df[['GrLivArea','MSSubClass', 'BedroomAbvGr', 'Yrowned', 'TotalBsmtSF','OverallQual','OverallCond', 'Neighborhood', 'Functional', 'KitchenAbvGr', 'TotRmsAbvGrd', 'GarageCars']]
y = df['SalePrice']
y

0       127500
1       149900
2       120000
3       146000
4       376162
         ...  
1605    142500
1606    131000
1607    132000
1608    170000
1609    188000
Name: SalePrice, Length: 2930, dtype: int64

In [6]:
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

X = df[['GrLivArea','MSSubClass', 'BedroomAbvGr', 'Yrowned', 'TotalBsmtSF','OverallQual','OverallCond', 'Neighborhood', 'Functional', 'KitchenAbvGr', 'TotRmsAbvGrd', 'GarageCars']]
y = df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Loop 200 times
for _ in range(200):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    preprocessor = ColumnTransformer(
        transformers=[
            ('onehot', OneHotEncoder(), ['Neighborhood'])
        ],
        remainder='passthrough'
    )
    pipeline_GB = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3))
    ])

    pipeline_LR = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ])
    
    pipeline_RF = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100))
    ])

    train_r2_scores_RF = []
    test_r2_scores_RF = []

    pipeline_RF.fit(X_train, y_train)
    train_score_RF = pipeline_RF.score(X_train, y_train)
    test_score_RF = pipeline_RF.score(X_test, y_test)
    train_r2_scores_RF.append(train_score_RF)
    test_r2_scores_RF.append(test_score_RF)

    train_r2_scores_GB = []
    test_r2_scores_GB = []

    pipeline_GB.fit(X_train, y_train)
    train_score_GB = pipeline_GB.score(X_train, y_train)
    test_score_GB = pipeline_GB.score(X_test, y_test)
    train_r2_scores_GB.append(train_score_GB)
    test_r2_scores_GB.append(test_score_GB)

    train_r2_scores_LR = []
    test_r2_scores_LR = []

    pipeline_LR.fit(X_train, y_train)
    train_score_LR = pipeline_LR.score(X_train, y_train)
    test_score_LR = pipeline_LR.score(X_test, y_test)
    train_r2_scores_LR.append(train_score_LR)
    test_r2_scores_LR.append(test_score_LR)

# Compute the average R^2 scores
avg_train_r2_RF = np.mean(train_r2_scores_RF)
avg_test_r2_RF = np.mean(test_r2_scores_RF)

avg_train_r2_GB = np.mean(train_r2_scores_GB)
avg_test_r2_GB = np.mean(test_r2_scores_GB)

avg_train_r2_LR = np.mean(train_r2_scores_LR)
avg_test_r2_LR = np.mean(test_r2_scores_LR)

print(f"Average Training R^2 score (RF) over 200 iterations: {avg_train_r2_RF:.3f}")
print(f"Average Test R^2 score (RF) over 200 iterations: {avg_test_r2_RF:.3f}")
print(f"Average Training R^2 score (GB) over 200 iterations: {avg_train_r2_GB:.3f}")
print(f"Average Test R^2 score (GB) over 200 iterations: {avg_test_r2_GB:.3f}")
print(f"Average Training R^2 score (LR) over 200 iterations: {avg_train_r2_LR:.3f}")
print(f"Average Test R^2 score (LR) over 200 iterations: {avg_test_r2_LR:.3f}")


Average Training R^2 score (RF) over 200 iterations: 0.982
Average Test R^2 score (RF) over 200 iterations: 0.916
Average Training R^2 score (GB) over 200 iterations: 0.954
Average Test R^2 score (GB) over 200 iterations: 0.925
Average Training R^2 score (LR) over 200 iterations: 0.834
Average Test R^2 score (LR) over 200 iterations: 0.877


# Discusison and Conclusion

In the course of our analysis, I explored multiple regression techniques, including Linear Regression and Gradient Boosting Regression, alongside the Random Forest Regressor. Despite their potential applicability, these models did not yield satisfactory results in terms of predictive accuracy, as indicated by relatively low R² scores. Consequently, they were not selected for further consideration in our modeling approach. This iterative process of experimentation and evaluation underscores the importance of selecting appropriate modeling techniques that align with the complexity and nuances of the housing price prediction task.

The decision to exclude Linear Regression and Random Forest Regressor models was driven by the pursuit of models that could effectively capture the non-linear relationships and interactions among the numerous features influencing housing prices. While Linear Regression assumes a linear relationship between the independent and dependent variables, the housing market often exhibits non-linear behavior due to the diverse array of factors influencing property values. Similarly, Random Forest Regressor, although capable of capturing non-linear relationships, may have struggled to adequately represent the complex interactions among features without extensive hyperparameter tuning and feature engineering. The linear model are assuming linear relationship between features and the sale price variable, while from the data visualization, some of the feature I are not sure they are guaranteed linear relationship while Gradient Boosting inherently handle non-linear relationships between features and the sale price by using multiple decision trees. Each tree is built on a random subset of the data and features, which allows the ensemble to capture complex, non-linear patterns. Compare to Gradient Boosting Regression, Random Forests are less likely to overfit compared to a single decision tree. This makes them more robust on unseen data, assuming the number of trees is sufficient and not excessively high.

Despite their exclusion from the final model selection, the exploration of Linear Regression and Gradient Boosting Regression provided valuable insights into the limitations of certain modeling approaches in the context of housing price prediction. By systematically evaluating the performance of different algorithms, I gained a deeper understanding of the strengths and weaknesses inherent in each method. This iterative process of model selection and refinement highlights the importance of rigorously assessing various modeling techniques to identify the most suitable approach for a given task. The linear model are assuming linear relationship between features and the sale price variable, while from the data visualization, some of the feature I are not sure they are guaranteed linear relationship while Random Forests is less prone to overfitting because it averages the results of many trees, which individually might overfit.  Gradient Boosting on other hand builds one tree at a time where each new tree helps to correct errors made by previously built trees. This sequential correction of residuals can lead to a highly optimized predictor.

Moreover, based on the R^2 score performed above, since Gradient Boosting Regression has the highest score: 0.926 performed, higher than two other testing methods. Random Forest has the score: 0.916 and Linear Regression has: 0.877. In this way, I decided to choose to use Gradient Boosting Regression. 

In conclusion, while Linear Regression and Random Forest were considered during the modeling process, they ultimately did not meet the criteria for predictive accuracy required for our analysis. The decision to focus on the Gradient Boosting Regression was based on its superior performance in capturing the complex relationships within the housing market data. Moving forward, continued exploration of alternative modeling techniques and robust evaluation methods will be essential for further refining our predictive models and enhancing our understanding of housing price dynamics.