The project focuses on finding the best strategy for predicting the sale price of individual residential properties in Ames, Iowa.
-
Which machine learning model is most effective at predicting sale price?
-
Which features of a residential property have the largest impact on sale price?
Ames Housing Dataset
- Original version: 2,930 observations and 82 variables describing individual residential properties sold in Ames, Iowa between 2006 and 2010 (Data Source)
- Version used in this project: 2,051 observations and 81 variables (Data Source)
- Values that appeared to be typos were replaced with a reasonable counterpart (e.g., a Garage Year Built of 2207 was replaced with 2007)
- Used a single total square-footage variable instead of the individual SF variables (Type 1 and Type 2 finished basement, unfinished basement, 1st floor, 2nd floor)
Scaling
- y variable: log-transformed
- X variables: StandardScaler
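The scaling step above can be sketched as follows; this is a minimal illustration with a toy DataFrame, and the column names are examples rather than the project's full feature set:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the Ames dataset (illustrative columns only)
df = pd.DataFrame({
    "Gr Liv Area": [1500, 2100, 900, 1750],
    "Overall Qual": [6, 8, 5, 7],
    "SalePrice": [180000, 320000, 95000, 215000],
})

# y variable: log-transform the target
y = np.log(df["SalePrice"])

# X variables: standardize to zero mean and unit variance
X = df.drop(columns="SalePrice")
X_scaled = StandardScaler().fit_transform(X)
```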
Features: all variables with a correlation of at least .5 in absolute value with SalePrice.
All variables were left unscaled; no polynomial features were added.
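The correlation-based selection rule for the baseline model can be sketched like this (toy data with illustrative column names; the real dataset has far more variables):

```python
import pandas as pd

# Toy data: two strong predictors and one weakly related variable
df = pd.DataFrame({
    "Overall Qual": [6, 8, 5, 7, 9],
    "Gr Liv Area": [1500, 2100, 900, 1750, 2600],
    "Misc Val": [200, 200, 100, 0, 0],
    "SalePrice": [180000, 320000, 95000, 215000, 410000],
})

# Keep every variable whose |correlation| with SalePrice is >= .5
corr = df.corr(numeric_only=True)["SalePrice"].drop("SalePrice")
selected = corr[corr.abs() >= 0.5].index.tolist()
```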
All features were included in the model, except:
'Id', 'PID',
'BsmtFin SF 1',
'BsmtFin SF 2',
'Bsmt Unf SF',
'Total Bsmt SF',
'2nd Flr SF',
'Low Qual Fin SF',
'Gr Liv Area',
'Wood Deck SF',
'Open Porch SF',
'Enclosed Porch',
'3Ssn Porch',
'Screen Porch',
'Pool Area'
All polynomial features were added, except the bias column.
All features were scaled with StandardScaler.
SelectKBest - 45 best features.
Y variable - log-transformed Sale Price
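The steps above (polynomial features without a bias column, scaling, then SelectKBest) can be sketched as a pipeline. The data here is synthetic, and k is reduced from the project's 45 to 5 to fit the toy feature count:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data: 100 observations, 4 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Polynomial expansion (no bias column), scaling, then keep the k best
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=5),
)
X_selected = pipe.fit_transform(X, y)
```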
sklearn.linear_model RidgeCV
Features - all variables that went into the SelectKBest transformer.
Alphas - np.logspace(0, 4, 50)
sklearn.linear_model LassoCV
Features - all variables that went into the SelectKBest transformer.
Alphas - np.logspace(0, 4, 50)
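Both regularized models can be fit over the stated alpha grid roughly as follows. The data is synthetic, and `cv=5` for LassoCV is an assumption; the project's cross-validation settings are not stated:

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

# Synthetic regression data for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=200)

# The alpha grid used in the project
alphas = np.logspace(0, 4, 50)

ridge = RidgeCV(alphas=alphas).fit(X, y)
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)  # cv=5 is an assumption
```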
The models were evaluated based on their accuracy (training dataset: R-squared score), generalizability (comparison of training and testing R-squared scores), and cross-validation R-squared scores.
| Model | Training R^2 | Testing R^2 | Cross-Val R^2 |
|---|---|---|---|
| Linear Regression (baseline) | .80 | .86 | .78 |
| Linear Regression (automated) | .87 | .85 | .78 |
| Ridge Regression | .98 | .79 | |
| Lasso Regression | .97 | .87 | |
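The three scores follow the evaluation scheme described above; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic data for illustration
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # accuracy on the training data
test_r2 = model.score(X_test, y_test)     # generalizability check
cv_r2 = cross_val_score(model, X_train, y_train, cv=5).mean()  # stability
```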
All three models performed better when the y variable was log-transformed and the test size was increased.
The baseline model seems to be the best: it has relatively low bias, explaining 80% of the variance in sale price, and the cross-validation score supports this interpretation (it is only 2 percentage points lower than the training score). The relatively higher R^2 score on the test dataset indicates that the testing subsample was not fully representative of the total dataset.
The linear regression model with automated feature selection performed better than the two regularized models, but not as well as the baseline model. Although it explains 87% of the variance in sale price and seems sufficiently generalizable (the testing R^2 is close to the training R^2), the cross-validation score, almost 10 percentage points lower than the training score, indicates that the model is unstable.
Comparing the training and testing R^2 scores of each regularized regression model suggests that both models are overfit and will quite likely not generalize well to previously unseen data.
-
Based on this preliminary analysis of the data, I would suggest using a simpler linear regression model when predicting housing prices in Ames, Iowa. Although more complex machine learning models often perform better at prediction, given the features of this particular dataset, the unregularized regression models performed much more reliably.
-
When a dataset has a large number of features, the SelectKBest function of the sklearn.feature_selection library has undeniable advantages. However, given that the model with intuitively selected features had better bias/variance indicators than the model with automated feature selection, I recommend a methodological hybrid: machine-aided, intuitive feature selection.
-
Outliers identified during EDA seem to make the more complex models unstable. Further analysis may reveal underlying trends in the data that may require special attention. Revealing those trends would help fine-tune the models to increase the reliability of our predictions.