In [2]:
import pandas as pd

# Predicting Median House Prices in California Census Blocks (Census 1990)

Author: DSCI 522 Group 312

Date: January 25, 2020

## Summary

This analysis focuses on predicting the median house prices in census blocks given independent variables describing the location, home characteristics, and demographics of each census block. The dataset was sourced from Kaggle, where many other people have completed [similar analyses](https://www.kaggle.com/camnugent/california-housing-prices/kernels).

Our goal is to build a model that will predict median house value with a higher model validation score than the 0.60 achieved by [Eric Chen](https://www.kaggle.com/ericfeng84), the author of [The California House Price](https://www.kaggle.com/ericfeng84/the-california-housing-price) Kaggle page from which the dataset was obtained.

We aim to add insight beyond the existing models by examining multicollinearity and by trying KNN with a range of values for `n_neighbors`.

## Methods

### Data
This dataset is a modified version of [The California Housing Dataset](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html), with [additional columns added by Aurélien Géron](https://github.com/ageron/handson-ml). It contains information about median house values per California census block, sourced from the 1990 US Census.

### Analysis
We used Linear Regression, K-Nearest Neighbours (KNN) regression, and a Random Forest Regressor to predict the median house value from the independent variables.

### Results and Discussion
The exploratory data analysis focused on identifying linear relationships between the independent variables and the dependent variable, as well as correlations among the independent variables. Previous analyses of this dataset found linear regression to be an appropriate method for predicting the median housing value, but they generally lacked insight into multicollinearity (the correlation and linear relationships between independent variables). Every variable examined had a Variance Inflation Factor (VIF) above 1, and several (`total_rooms`, `total_bedrooms`, and `households`) had a VIF above 10, a common rule-of-thumb threshold, which is strong evidence of multicollinearity.

In [3]:
pd.read_csv('eda_charts/vif_table.csv')

| Unnamed: 0 | variable | vif_val |
|---|---|---|
| 0 | housing_median_age | 1.163905 |
| 1 | total_rooms | 11.849443 |
| 2 | total_bedrooms | 34.891047 |
| 3 | population | 6.582837 |
| 4 | households | 33.871693 |
| 5 | median_income | 1.524263 |
| 6 | intercept | 18.278944 |


The variable with the highest VIF is total bedrooms, and this appears to be strongly linearly related to the total number of rooms, given that the room count includes the bedrooms.

![Image](eda_charts/total-rooms_total-bedrooms.png)
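
As a rough illustration of how a VIF table like the one above can be produced, the following sketch computes VIF_j = 1 / (1 - R_j²) by regressing each variable on the others. It is not the code used for this report: the helper name `vif_table` and the synthetic data are assumptions made for illustration, and only NumPy is used.

```python
import numpy as np

def vif_table(X, names):
    """Return {name: VIF}, where VIF_j = 1 / (1 - R_j^2) (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    vifs = {}
    for j, name in enumerate(names):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Add an intercept column so R^2 is measured around the mean.
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1.0 - residuals.var() / target.var()
        vifs[name] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic stand-in data (NOT the census data): two strongly related
# columns and one independent column.
rng = np.random.default_rng(0)
n = 2000
households = rng.normal(500, 150, n)
total_bedrooms = 1.1 * households + rng.normal(0, 30, n)
median_income = rng.normal(4, 1.5, n)

names = ["households", "total_bedrooms", "median_income"]
vifs = vif_table(np.column_stack([households, total_bedrooms, median_income]), names)
for name, value in vifs.items():
    print(f"{name}: {value:.2f}")
```

In this toy setup the two related columns receive large VIFs while the independent column stays close to 1, which mirrors the pattern seen for `total_bedrooms` and `households` in the table.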

A common approach to addressing multicollinearity is to remove variables with high VIFs. In this case, however, the Linear Regression model performed best when all (or all but one) of the variables were included. The following table shows the error on the training and testing data as a function of the number of features selected using scikit-learn's Recursive Feature Elimination (RFE). Each row shows the result of adding the next-best predictor.

In [4]:
pd.read_csv('ml_results/lr_rfe_results_table.csv')

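| Unnamed: 0 | n_features_to_select | train_error | test_error |
|---|---|---|---|
| 0 | 1 | 0.978 | 0.982 |
| 1 | 2 | 0.757 | 0.76 |
| 2 | 3 | 0.416 | 0.414 |
| 3 | 4 | 0.413 | 0.412 |
| 4 | 5 | 0.376 | 0.394 |
| 5 | 6 | 0.375 | 0.395 |
| 6 | 7 | 0.372 | 0.391 |
| 7 | 8 | 0.36 | 0.376 |
| 8 | 9 | 0.36 | 0.376 |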
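
A table of this shape could be generated along the following lines. This is a minimal sketch on synthetic data, not the project's script. It assumes that "error" means 1 - R² (the report does not define it) and that features are chosen with scikit-learn's `RFE` wrapped around `LinearRegression`.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: only the first three columns carry signal.
rng = np.random.default_rng(0)
n_samples, n_features = 1000, 5
X = rng.normal(size=(n_samples, n_features))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rfe_results = []
for k in range(1, n_features + 1):
    selector = RFE(LinearRegression(), n_features_to_select=k).fit(X_train, y_train)
    # RFE.score reduces X to the selected features, then returns the
    # estimator's R^2; the error here is ASSUMED to be 1 - R^2.
    train_error = 1 - selector.score(X_train, y_train)
    test_error = 1 - selector.score(X_test, y_test)
    rfe_results.append((k, round(train_error, 3), round(test_error, 3)))

for row in rfe_results:
    print(row)
```

Because RFE removes one feature at a time along a single elimination path, the selected sets are nested, so the training error can only stay flat or fall as `n_features_to_select` grows.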


It is also helpful to visualize the correlations between the existing variables, which confirms the results of the VIF table:
![Image](eda_charts/correlation_heatmap.png)
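
The matrix behind a heatmap like this can be computed with `DataFrame.corr()`. The sketch below uses synthetic columns that only borrow names from the dataset, and the 0.8 cutoff for flagging pairs is an arbitrary assumption for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in columns (NOT the census data).
rng = np.random.default_rng(0)
n = 1000
total_rooms = rng.normal(2600, 800, n)
df = pd.DataFrame({
    "total_rooms": total_rooms,
    "total_bedrooms": 0.2 * total_rooms + rng.normal(0, 40, n),
    "median_income": rng.normal(4, 1.5, n),
})

# Pairwise Pearson correlations between all columns.
corr = df.corr()
print(corr.round(2))

# List each pair of distinct variables whose |correlation| exceeds the
# (assumed) 0.8 cutoff.
high_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.8
]
print(high_pairs)
```

The resulting `corr` frame could then be passed to a plotting function such as seaborn's `heatmap` to draw the chart.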

We also fit a K-Nearest Neighbours (KNN) model to our data and, unlike in previous analyses of this dataset, KNN performed better than simple linear regression. A Standard Scaler was used to pre-process the data, which likely contributed to the success of KNN. The KNN model used the independent variables "latitude" and "longitude", so it would have taken into account the physical distance between census blocks. Predicting from Census data for other states is outside the scope of this project, but it is important to note that, especially for a model like KNN, the latitude and longitude values would likely skew predictions on housing data from a different location and would need to be excluded. The following table is a subset of the KNN results for different numbers of nearest neighbours. The most favourable results occur when `n_neighbors` is between 9 and 14.

In [6]:
pd.read_csv('ml_results/knn_results_table.csv')

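| Unnamed: 0 | n_neighbours | train_error | test_error |
|---|---|---|---|
| 0 | 1 | 0.0 | 0.417 |
| 1 | 2 | 0.105 | 0.324 |
| 2 | 3 | 0.147 | 0.293 |
| 3 | 4 | 0.169 | 0.282 |
| 4 | 5 | 0.187 | 0.271 |
| 5 | 6 | 0.196 | 0.264 |
| 6 | 7 | 0.205 | 0.261 |
| 7 | 8 | 0.212 | 0.26 |
| 8 | 9 | 0.218 | 0.26 |
| 9 | 10 | 0.222 | 0.258 |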
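
The sweep over `n_neighbors` can be sketched as follows. This is an illustrative example on synthetic data rather than the project's code. It assumes a `StandardScaler` + `KNeighborsRegressor` pipeline and that "error" means 1 - R², which is consistent with the training error of 0.0 at one neighbour in the table.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales,
# loosely mimicking a coordinate and a count.
rng = np.random.default_rng(0)
n = 1500
X = np.column_stack([rng.uniform(32, 42, n), rng.uniform(0, 5000, n)])
y = 10 * np.sin(X[:, 0]) + X[:, 1] / 500 + rng.normal(scale=1.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn_results = []
for k in range(1, 11):
    # Scaling inside the pipeline means the scaler is fit on training data only.
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    model.fit(X_train, y_train)
    train_error = 1 - model.score(X_train, y_train)  # ASSUMED: error = 1 - R^2
    test_error = 1 - model.score(X_test, y_test)
    knn_results.append((k, round(train_error, 3), round(test_error, 3)))

for row in knn_results:
    print(row)
```

With one neighbour, every training point is its own nearest neighbour, so the training error is zero while the test error is not. This is why the training column alone says little about which `n_neighbors` to choose.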


Opportunities for improvement of the predictive model include:
- Increasing the breadth of Machine Learning Models used to predict the median housing value;
- Obtaining cross-validation scores, rather than the simple score;
- Conducting more in-depth analysis to address multicollinearity.
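
For the cross-validation point in the list above, a minimal sketch using scikit-learn's `cross_val_score` might look like this. The data are synthetic, and the choice of five folds is an assumption rather than something the report specifies.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with a known linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

# For regressors the default scoring is R^2, computed once per held-out fold.
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)
mean_r2 = cv_scores.mean()

print(cv_scores.round(3))
print(f"mean R^2: {mean_r2:.3f} (std {cv_scores.std():.3f})")
```

Reporting the mean and spread across folds gives a more stable estimate than the score from a single train/test split.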

## References

Balla, Deepanshu. n.d. SPLITTING Data into Training and Test Sets with R. https://www.listendata.com/2015/02/splitting-data-into-training-and-test.html.

de Jonge, Edwin. 2018. docopt: Command-Line Interface Specification Language. https://CRAN.R-project.org/package=docopt.

Kuhn, Max. 2020. Caret: Classification and Regression Training. https://CRAN.R-project.org/package=caret.

Lang, Michael. 2017. checkmate: Fast Argument Checks for Defensive R Programming. https://journal.r-project.org/archive/2017/RJ-2017-028/index.html.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley. 2011. testthat: Get Started with Testing. https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.

Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.

Wickham, Hadley, and Lionel Henry. 2019. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.

Wickham, Hadley, Jim Hester, and Romain Francois. 2018. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.

Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (Oct): 2825-2830. https://scikit-learn.org/stable/.

Bernard, J. 2016. Python Data Analysis with pandas. In Python Recipes Handbook. Berkeley, CA: Apress. https://pandas.pydata.org/.

