# `RandomForestRegressor` on California housing data

This notebook is my experiment with random forest regressors on the [California housing data](http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html). Steps to prepare the data:

* downloaded the archive from a [direct link](http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz)
* merged the two files in it, one of which was the data and one of which was the headers
* I put the headers into the data file and saved it as `CaliforniaHousing.csv` in the `data/` subdirectory.
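
The merge step above could be sketched roughly as follows. This is not the exact procedure I used: `add_headers` is a hypothetical helper, the column names are copied from the table shown further down, and the sketch assumes the raw data file is comma-separated with no header row.

```python
import io
import pandas as pd

# Column names as they appear in this notebook's table (assumed order).
COLUMNS = ['longitude', 'latitude', 'housingMedianAge', 'totalRooms',
           'totalBedrooms', 'population', 'households', 'medianIncome',
           'medianHouseValue']

def add_headers(data_source, csv_target, columns=COLUMNS):
    """Hypothetical helper: read headerless comma-separated rows and
    write them back out with a header row."""
    df = pd.read_csv(data_source, header=None, names=columns)
    # index=False keeps pandas from writing the row index as an extra,
    # unnamed first column.
    df.to_csv(csv_target, index=False)
    return df

# Tiny in-memory demo. With real files you would pass paths instead, e.g.
# add_headers('cal_housing.data', 'data/CaliforniaHousing.csv')  # assumed paths
raw = io.StringIO("-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0\n")
out = io.StringIO()
demo = add_headers(raw, out)
print(out.getvalue())
```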

First step was to import the packages I would use. (This should all work with a standard Anaconda3 installation.)



In [39]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

pd.options.display.max_rows = 20 # don't display many rows

Then, I can pull in the data:

In [40]:
housing = pd.read_csv('data/CaliforniaHousing.csv')
housing

Unnamed: 0,longitude,latitude,housingMedianAge,totalRooms,totalBedrooms,population,households,medianIncome,medianHouseValue
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0
5,-122.25,37.85,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0
6,-122.25,37.84,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0
7,-122.25,37.84,52.0,3104.0,687.0,1157.0,647.0,3.1200,241400.0
8,-122.26,37.84,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0
9,-122.25,37.84,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0


Everything is numeric, so it is ready to shove into a random forest. The last column, `medianHouseValue`, is the target (dependent variable). Split the columns into independent and dependent variables:

In [41]:
X, y = housing.drop('medianHouseValue', axis=1), housing['medianHouseValue']
y

0        452600.0
1        358500.0
2        352100.0
3        341300.0
4        342200.0
5        269700.0
6        299200.0
7        241400.0
8        226700.0
9        261100.0
           ...   
20630    112000.0
20631    107200.0
20632    115600.0
20633     98300.0
20634    116800.0
20635     78100.0
20636     77100.0
20637     92300.0
20638     84700.0
20639     89400.0
Name: medianHouseValue, Length: 20640, dtype: float64

OK, now we train the regressor:

In [46]:
regr = RandomForestRegressor()
regr.fit(X, y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

Let's use $R^2$ to check the proportion of the target's variability that is captured by the model:

In [59]:
print(f"R-squared score validating with ALL training data: {regr.score(X, y)}")

R-squared score validating with ALL training data: 0.9302597902888223
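
For reference, `score` on a scikit-learn regressor returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. Here is a small sketch of that formula on made-up numbers, checked against `sklearn.metrics.r2_score`. The arrays and the `r_squared` helper are mine, for illustration only.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """Illustrative helper: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always guessing the mean
    return 1.0 - ss_res / ss_tot

# Made-up values, not taken from the housing data.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5])

manual = r_squared(y_true_demo, y_pred_demo)
library = r2_score(y_true_demo, y_pred_demo)
print(f"manual: {manual:.4f}, sklearn: {library:.4f}")
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean of the target, and the score can go negative for a model that does worse than that.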


Heh, that *seems* pretty good, and it should, since we are scoring on the same data we trained on! How do we know if it *is* good? Hmm... how well does it generalize? Split into 80/20 train/test samples and train with 80% of the data:

In [60]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

regr = RandomForestRegressor()
regr.fit(X_train, y_train)
print(f"R-squared score validating with training data: {regr.score(X_train, y_train)}")

R-squared score validating with training data: 0.9625630711089146


Hmm... OK, about the same, a little better. Let's try on test data:

In [57]:
print(f"R-squared validating with testing data: {regr.score(X_test, y_test)}")

R-squared validating with testing data: 0.7968746963172103
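
A single 80/20 split gives a somewhat different answer every time the data is reshuffled. One option, which I did not run above, is k-fold cross-validation, which scores the model on several held-out folds and reports the spread. The sketch below uses synthetic data from `make_regression` as a stand-in so it runs without the CSV; the sizes, seeds, and `_demo` names are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the housing data).
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=10.0, random_state=0)

regr_demo = RandomForestRegressor(n_estimators=20, random_state=0)

# With no scoring argument, cross_val_score uses the estimator's own
# score method, which is R-squared for regressors.
scores = cross_val_score(regr_demo, X_demo, y_demo, cv=5)

print(f"R-squared per fold: {scores}")
print(f"mean {scores.mean():.3f}, std {scores.std():.3f}")
```

With the real data, `X` and `y` from above would take the place of `X_demo` and `y_demo`.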


Much worse, but not horrible. Let's try increasing the number of trees from 10 to 100:

In [61]:
regr = RandomForestRegressor(n_estimators=100)
regr.fit(X_train, y_train)
print(f"R-squared score validating with training data: {regr.score(X_train, y_train)}")
print(f"R-squared validating with testing data: {regr.score(X_test, y_test)}")

R-squared score validating with training data: 0.9747393874709399
R-squared validating with testing data: 0.8158536020202538


Better on the test set, but there is still a big difference between how well the model fits the training data and how well it fits the test data, so is it somewhat overfit?
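
One cheap extra check on generalization, which I did not run here, is the out-of-bag (OOB) score: each row is predicted using only the trees whose bootstrap sample left that row out. `oob_score` is the parameter visible in the printed model above; everything else in this sketch (synthetic data, sizes, seeds, variable names) is an assumption for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (not the housing data).
X_demo, y_demo = make_regression(n_samples=500, n_features=8,
                                 noise=10.0, random_state=0)

# oob_score=True needs bootstrap=True, which is the default.
rf_oob = RandomForestRegressor(n_estimators=100, oob_score=True,
                               random_state=0)
rf_oob.fit(X_demo, y_demo)

print(f"R-squared on training data: {rf_oob.score(X_demo, y_demo):.3f}")
print(f"Out-of-bag R-squared:       {rf_oob.oob_score_:.3f}")
```

The OOB number comes for free from a single fit, so it can be a handy sanity check alongside the held-out test score.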

Heh, let's try gradient boosting trees, which are supposed to be very good:

In [65]:
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(n_estimators=100)
gbrt.fit(X_train, y_train)
y_pred = gbrt.predict(X_test)
print("R-squared for Train: %.2f" % gbrt.score(X_train, y_train))
print("R-squared for Test: %.2f" % gbrt.score(X_test, y_test))

R-squared for Train: 0.79
R-squared for Test: 0.77


Interesting. A little bit worse than random forests on the test set, but not by much. I note that its train and test $R^2$ scores are closer together than with random forests, so perhaps the gradient boosting trees would generalize better?
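
To make that comparison less eyeball-driven, here is a rough sketch that fits both model types on the same split and reports the train/test gap side by side. `fit_and_gap` is a hypothetical helper, and the synthetic data, sizes, and seeds are assumptions so that the block runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

def fit_and_gap(model, X_tr, X_te, y_tr, y_te):
    """Hypothetical helper: fit, then return (train R^2, test R^2, gap)."""
    model.fit(X_tr, y_tr)
    train_r2 = model.score(X_tr, y_tr)
    test_r2 = model.score(X_te, y_te)
    return train_r2, test_r2, train_r2 - test_r2

# Synthetic stand-in data (not the housing data).
X_demo, y_demo = make_regression(n_samples=500, n_features=8,
                                 noise=10.0, random_state=0)
# train_test_split returns X_train, X_test, y_train, y_test in that order.
split = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    results[name] = fit_and_gap(model, *split)
    train_r2, test_r2, gap = results[name]
    print(f"{name:18s} train {train_r2:.3f}  test {test_r2:.3f}  gap {gap:.3f}")
```

A smaller gap on its own does not guarantee better predictions on new data; the test score is still the number to watch, and on the housing data above the random forest had the higher one.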