The project focuses on finding the best strategy for predicting the sale price of individual residential properties in Ames, Iowa.
-
Which machine learning model is most effective at predicting sale price?
-
Which features of a residential property have the largest impact on sale price?
Ames Housing Dataset
- Original version: 2,930 observations and 82 variables describing individual residential properties sold in Ames, Iowa between 2006 and 2010 (Data Source)
- Version used in this project: 2,051 observations and 81 variables (Data Source)
- Values that appeared to be typos were replaced with a reasonable counterpart (e.g., a Garage Year Built of 2207 was replaced with 2007)
- Used a single total square-footage variable instead of the individual SF variables (Type 1 and Type 2 finished basement, unfinished basement, 1st floor, 2nd floor)
Scaling
- y variable: log-transformed
- X variables: StandardScaler
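The scaling step above can be sketched as follows; this is a minimal illustration with a toy DataFrame, and the column names are examples rather than the project's full feature set:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the Ames dataset (illustrative columns only)
df = pd.DataFrame({
    "Gr Liv Area": [1500, 2100, 900, 1750],
    "Overall Qual": [6, 8, 5, 7],
    "SalePrice": [180000, 320000, 95000, 215000],
})

# y variable: log-transform the target
y = np.log(df["SalePrice"])

# X variables: standardize to zero mean and unit variance
X = df.drop(columns="SalePrice")
X_scaled = StandardScaler().fit_transform(X)
```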
Features: all variables with a correlation of at least .5 in absolute value with SalePrice.
All variables were left unscaled; no polynomial features were added.
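The correlation-based selection rule for the baseline model can be sketched like this (toy data with illustrative column names; the real dataset has far more variables):

```python
import pandas as pd

# Toy data: two strong predictors and one weakly related variable
df = pd.DataFrame({
    "Overall Qual": [6, 8, 5, 7, 9],
    "Gr Liv Area": [1500, 2100, 900, 1750, 2600],
    "Misc Val": [200, 200, 100, 0, 0],
    "SalePrice": [180000, 320000, 95000, 215000, 410000],
})

# Keep every variable whose |correlation| with SalePrice is >= .5
corr = df.corr(numeric_only=True)["SalePrice"].drop("SalePrice")
selected = corr[corr.abs() >= 0.5].index.tolist()
```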
All features were included in the model, except:
'Id', 'PID',
'BsmtFin SF 1',
'BsmtFin SF 2',
'Bsmt Unf SF',
'Total Bsmt SF',
'2nd Flr SF',
'Low Qual Fin SF',
'Gr Liv Area',
'Wood Deck SF',
'Open Porch SF',
'Enclosed Porch',
'3Ssn Porch',
'Screen Porch',
'Pool Area'
All polynomial features were added, except the bias column.
All features were scaled with StandardScaler.
SelectKBest - 45 best features.
Y variable - log-transformed Sale Price
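The steps above (polynomial features without a bias column, scaling, then SelectKBest) can be sketched as a pipeline. The data here is synthetic, and k is reduced from the project's 45 to 5 to fit the toy feature count:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data: 100 observations, 4 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Polynomial expansion (no bias column), scaling, then keep the k best
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=5),
)
X_selected = pipe.fit_transform(X, y)
```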
sklearn.linear_model RidgeCV
Features - all variables that went into the SelectKBest transformer.
Alphas - np.logspace(0, 4, 50)
sklearn.linear_model LassoCV
Features - all variables that went into the SelectKBest transformer.
Alphas - np.logspace(0, 4, 50)
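Both regularized models can be fit over the stated alpha grid roughly as follows. The data is synthetic, and `cv=5` for LassoCV is an assumption; the project's cross-validation settings are not stated:

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

# Synthetic regression data for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=200)

# The alpha grid used in the project
alphas = np.logspace(0, 4, 50)

ridge = RidgeCV(alphas=alphas).fit(X, y)
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)  # cv=5 is an assumption
```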
The models were evaluated based on their accuracy (training dataset: R-squared score), generalizability (comparison of training and testing R-squared scores), and cross-validation R-squared scores.
| Model | Training R^2 | Testing R^2 | Cross-Val R^2 |
|---|---|---|---|
| Linear Regression (baseline) | .80 | .86 | .78 |
| Linear Regression (automated) | .87 | .85 | .78 |
| Ridge Regression | .98 | .79 | |
| Lasso Regression | .97 | .87 | |
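The three scores follow the evaluation scheme described above; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic data for illustration
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # accuracy on the training data
test_r2 = model.score(X_test, y_test)     # generalizability check
cv_r2 = cross_val_score(model, X_train, y_train, cv=5).mean()  # stability
```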
All three models performed better when the y variable was log-transformed and the test size was increased.
The baseline model seems to be the best: it has relatively low bias, explaining 80% of the variance in sale price, and the cross-validation score supports this interpretation (it is only 2 percentage points lower than the training score). The relatively higher R^2 score on the test dataset indicates that the testing subsample was not fully representative of the total dataset.
The linear regression model with automated feature selection performed better than the two regularized models, but not as well as the baseline model. Although it explains 87% of the variance in sale price and seems sufficiently generalizable (the testing R^2 is close to the training R^2), the cross-validation score, almost 10 percentage points lower than the training score, indicates that the model is unstable.
Comparing the training and testing R^2 scores of each regularized regression model suggests that both models are overfit and will quite likely not generalize well to previously unseen data.
-
Based on this preliminary analysis of the data, I would suggest using a simpler linear regression model when predicting housing prices in Ames, Iowa. Although more complex machine learning models often perform better at prediction, given the features of this particular dataset, the unregularized regression models performed much more reliably.
-
When a dataset has a large number of features, the SelectKBest function of the sklearn.feature_selection library has undeniable advantages. However, given that the model with intuitively selected features had better bias/variance indicators than the model with automated feature selection, I recommend a methodological hybrid: machine-aided, intuitive feature selection.
-
Outliers identified during EDA seem to make the more complex models unstable. Further analysis may reveal underlying trends in the data that may require special attention. Revealing those trends would help fine-tune the models to increase the reliability of our predictions.