# Review, Extending the CA Housing Predictions with Trees and Forests

This notebook extends our previous exercise notebook, "Week03_Exercise_Housing.ipynb" (my filled-in version is "Week03_Exercise_Housing_FilledIn.ipynb").

## Review (abbreviated)

Execute the following cell to import our libraries:

In [None]:
# for data wrangling, plotting, numerical analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# for our data
from sklearn.datasets import fetch_california_housing

# for ML data processing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# for our ML models
from sklearn.linear_model import LinearRegression

# for our ML evaluation
from sklearn.metrics import mean_squared_error, r2_score

We'll use a dataset from Scikit-Learn:  [California Housing Dataset](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset)

In [None]:
california_housing = fetch_california_housing(as_frame=True)
ca_housing_df = california_housing.frame

Use the cell below to look at some sample rows of the `ca_housing_df` dataframe:

In [None]:
ca_housing_df.head()

Use the `info` method to look at the number of rows & columns, and see whether there are any null values:

In [None]:
ca_housing_df.info()

Make simple histogram plots of all variables (e.g. with `ca_housing_df.hist()`).
* Are they normally distributed? Bimodal? Mostly normal with a couple of outliers? Uniformly distributed with obvious caps on the allowable range of values?
* If you'd like, tinker with the number of bins for the histogram, zoom in on the ranges, etc.
* You may also find it useful to change the figure size (e.g. `figsize=(12, 10)` as an input parameter to `hist`) or use `plt.tight_layout()` after the plotting command to keep multiple plots from overlapping

In [None]:
ca_housing_df.hist(bins=30, figsize=(12, 10))
plt.tight_layout()

You can get a matrix of correlation coefficients by using the dataframe's `corr()` method.
* Check that out in the cell below
* Which variables are most correlated with the target variable of `MedHouseVal`?
* Which pairs of variables are highly correlated with each other?
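
Both questions can also be answered programmatically from the correlation matrix. The following is a minimal sketch on a small synthetic dataframe, not the housing data; the column names and the 0.8 cutoff are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic stand-in dataframe; all column names here are hypothetical
rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n)
demo_df = pd.DataFrame({
    "feat_strong": base,
    "feat_copy": base + 0.1 * rng.normal(size=n),  # nearly duplicates feat_strong
    "feat_noise": rng.normal(size=n),
})
demo_df["target"] = 2.0 * base + rng.normal(size=n)

corr = demo_df.corr()

# Features ranked by absolute correlation with the target
ranked = corr["target"].drop("target").abs().sort_values(ascending=False)
print(ranked)

# Feature pairs whose |r| exceeds an arbitrary cutoff of 0.8
feature_cols = [c for c in demo_df.columns if c != "target"]
high_pairs = [(a, b, corr.loc[a, b])
              for a, b in combinations(feature_cols, 2)
              if abs(corr.loc[a, b]) > 0.8]
print(high_pairs)
```

With the housing data, the same pattern would use `ca_housing_df` and `'MedHouseVal'` in place of the synthetic frame and `"target"`.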

In [None]:
ca_housing_df.corr()

In [None]:
plt.figure(figsize=(8,5))
sns.heatmap(ca_housing_df.corr(), annot=True)

In [None]:
plt.figure(figsize=(8,5))
plt.plot(ca_housing_df['Longitude'], ca_housing_df['Latitude'], 'bo', alpha=0.2)
plt.xlabel('Longitude')
plt.ylabel('Latitude')

Use `train_test_split` to make a training set and test set, where `MedHouseVal` is your target variable and all other variables are your feature variables.
* You can use `california_housing.data` and `california_housing.target` to get your features and target, or you can use `ca_housing_df.loc[:, ca_housing_df.columns != 'MedHouseVal']` and `ca_housing_df.loc[:, 'MedHouseVal']`  (or other options too)

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(california_housing.data, 
                                                    california_housing.target, 
                                                    test_size=0.2, 
                                                    random_state=42)

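Next, we standardize the features. This improves convergence and numerical stability for some algorithms (e.g. those fit by gradient descent or with regularization). Tree-based models, which we add below, split on thresholds and so do not need it, although it does no harm.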

*Remember*: the training data should be used to determine scaling properties, NOT all the data.  We don't want any information about our test data to prematurely "leak" into our training set.
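
As a minimal sketch of this point, using synthetic data rather than the housing set (the array and variable names are illustrative assumptions): when the scaler is fit on the training rows only, the test rows are transformed with the training mean and standard deviation, so their scaled means end up close to zero but not exactly zero.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix (assumption: 1000 rows, 3 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

# Fit on the training rows only, then reuse those statistics for the test rows
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print("train means:", X_tr_scaled.mean(axis=0).round(3))  # zero up to rounding
print("test means: ", X_te_scaled.mean(axis=0).round(3))  # small, but not exactly zero
print("scaler statistics come from the training rows:",
      np.allclose(demo_scaler.mean_, X_tr.mean(axis=0)))
```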

In [None]:
# Standardize the features
scaler = StandardScaler()
scaler.set_output(transform='pandas')

# Use training data for the scaler fit, as well as transformation
X_train_scaled = scaler.fit_transform(X_train)

# Only use the transform (and not the fit_transform) on the test features
X_test_scaled = scaler.transform(X_test)

In [None]:
X_train

In [None]:
X_train_scaled

In [None]:
X_train.mean()

In [None]:
X_train_scaled.mean()

In the next cell:
* train `LinearRegression` on your training set
* assess the learned model's performance on the test data using `mean_squared_error`
* make a plot of the coefficient amplitudes
  * the coefficient values are stored in the `coef_` attribute of the variable for your `LinearRegression` object
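
One possible way to make the coefficient plot is sketched here on synthetic data. The feature names and the true coefficients are made up for the illustration; in the notebook, the same call would take `california_housing.feature_names` and `model_lr.coef_`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data; feature names and true coefficients are hypothetical
rng = np.random.default_rng(0)
demo_names = ["f0", "f1", "f2", "f3"]
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)

demo_lr = LinearRegression().fit(X_demo, y_demo)

# One bar per feature: the sign shows direction, the length shows amplitude
fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(demo_names, demo_lr.coef_)
ax.axvline(0, color="k", linewidth=0.8)
ax.set_xlabel("Coefficient value")
fig.tight_layout()
```

Because the features were standardized, the bar lengths are comparable across features, which is what makes this plot useful.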

In [None]:
# Initialize Linear Regression (without regularization)
model_lr = LinearRegression()

# Train the model
model_lr.fit(X_train_scaled, y_train)

In [None]:
# We now have a model with learned coefficients:
print("Linear Regression (No Regularization) - Intercept (bias) term:")
print(model_lr.intercept_)
print("Linear Regression (No Regularization) - Coefficients:")
print(model_lr.coef_)

Evaluation:

In [None]:
# 'score' requires passing in your feature and target values
# for the test set
print('The score method returns: %.2f' % model_lr.score(X_test_scaled, y_test))

# mean_squared_error and r2_score are metrics that assess predictions against target values
# so you need to get the predicted values first
y_pred1 = model_lr.predict(X_test_scaled)
print('MSE_lr = %.2f' % mean_squared_error(y_test, y_pred1))
print('R2_lr = %.2f' % r2_score(y_test, y_pred1))
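
For a regressor, `score` and `r2_score` report the same quantity, $R^2 = 1 - SS_{res}/SS_{tot}$, and the MSE is $SS_{res}/n$. A small sketch on synthetic data (all names here are illustrative, not from the notebook) computes both by hand and compares them with the library calls:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data: first 200 rows for fitting, last 100 for evaluation
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = X_demo @ np.array([1.5, -1.0]) + rng.normal(size=300)

demo_lr = LinearRegression().fit(X_demo[:200], y_demo[:200])
X_eval, y_eval = X_demo[200:], y_demo[200:]
y_hat = demo_lr.predict(X_eval)

ss_res = np.sum((y_eval - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_eval - y_eval.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot
mse_manual = ss_res / len(y_eval)

print(np.isclose(r2_manual, r2_score(y_eval, y_hat)))
print(np.isclose(r2_manual, demo_lr.score(X_eval, y_eval)))
print(np.isclose(mse_manual, mean_squared_error(y_eval, y_hat)))
```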

## Adding Trees

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree

In [None]:
# Inherent Interpretability: Decision Tree Regressor
# Create and fit a decision tree regressor

model_tree = DecisionTreeRegressor(random_state=42, max_depth=3)
model_tree.fit(X_train_scaled, y_train)

y_pred1 = model_tree.predict(X_test_scaled)
print('R2_tree = %.2f' % r2_score(y_test, y_pred1))

In [None]:
# Visualize the decision tree (for inherent interpretability)
plt.figure(figsize=(12,8))
tree.plot_tree(model_tree, 
               feature_names=california_housing.feature_names, 
               filled=True, 
               rounded=True);

And how do we know what the best value of `max_depth` is? (Or any other hyperparameter?)
* cross validation!
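
To make the idea concrete, here is a rough sketch of what 5-fold cross validation does under the hood, written out as an explicit loop on synthetic data (the data and the `max_depth=4` setting are assumptions for the example). Each fold is held out once while the model is trained on the other four, and the held-out errors are collected.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in regression problem
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + 0.2 * rng.normal(size=400)

# For a regressor, cv=5 corresponds to an unshuffled KFold with 5 splits
kfold = KFold(n_splits=5)
manual_mse = []
for train_idx, val_idx in kfold.split(X_demo):
    fold_model = DecisionTreeRegressor(random_state=42, max_depth=4)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    residuals = y_demo[val_idx] - fold_model.predict(X_demo[val_idx])
    manual_mse.append(np.mean(residuals ** 2))

# The same five numbers from the library helper (sign flipped back to MSE)
auto_mse = -cross_val_score(DecisionTreeRegressor(random_state=42, max_depth=4),
                            X_demo, y_demo, cv=5,
                            scoring="neg_mean_squared_error")
print(np.allclose(manual_mse, auto_mse))
```

The test set plays no part in this loop, which is why it can still give an honest final estimate after the hyperparameter has been chosen.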

**Cross Validation Method #1**

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
md_range = range(1,20)
md_scores = []
for md in md_range:
    model = DecisionTreeRegressor(random_state=42, max_depth=md)
    loss = cross_val_score(model,
                           X_train_scaled,
                           y_train, 
                           cv=5, 
                           scoring='neg_mean_squared_error')
    md_scores.append(np.sqrt(-loss).mean())
plt.scatter(md_range, md_scores)
plt.xlabel('Value of max_depth for Decision Tree Regression')
plt.ylabel('Cross-Validated RMSE')
plt.show()

In [None]:
md_range[md_scores.index(min(md_scores))]

In [None]:
md_best = md_range[md_scores.index(min(md_scores))]

model_tree_best = DecisionTreeRegressor(random_state=42, max_depth=md_best)

model_tree_best.fit(X_train_scaled, y_train)

y_pred1 = model_tree_best.predict(X_test_scaled)
print('R2_tree_best = %.2f' % r2_score(y_test, y_pred1))

**Cross Validation Method #2**

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
cv_grid = GridSearchCV(DecisionTreeRegressor(random_state=42),
                       param_grid = {
                           'max_depth' : range(1,20),
                       })
cv_grid.fit(X_train_scaled, y_train)
cv_grid.best_params_

In [None]:
y_pred1 = cv_grid.predict(X_test_scaled)
print('R2_tree_best = %.2f' % r2_score(y_test, y_pred1))
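
Note that Method #1 ranked depths by cross-validated RMSE, while `GridSearchCV` without a `scoring` argument ranks them by the estimator's own `score`, which is R2 for regressors. The two can disagree on the best depth. If we want the grid search to use RMSE as well, a sketch along these lines (on synthetic data, with an arbitrary depth range) passes the metric explicitly and reads the per-depth results back out:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in regression problem
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + 0.2 * rng.normal(size=400)

demo_grid = GridSearchCV(DecisionTreeRegressor(random_state=42),
                         param_grid={"max_depth": range(1, 11)},
                         cv=5,
                         scoring="neg_root_mean_squared_error")
demo_grid.fit(X_demo, y_demo)

# mean_test_score holds the negated RMSE, so flip the sign before reading it
cv_rmse = -demo_grid.cv_results_["mean_test_score"]
depths = [p["max_depth"] for p in demo_grid.cv_results_["params"]]

print("best max_depth:", demo_grid.best_params_["max_depth"])
print("CV RMSE at best depth: %.3f" % -demo_grid.best_score_)
```

The `cv_rmse` and `depths` arrays can be plotted against each other to reproduce the curve from Method #1.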

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
model_rf = RandomForestRegressor(n_estimators=500, 
                                max_leaf_nodes=9, 
                                n_jobs=-1,
                                random_state=42)

model_rf.fit(X_train_scaled, y_train)

test_score = model_rf.score(X_test_scaled, y_test)
print(f"R2 of Random Forest: {test_score:.2f}")

y_pred1 = model_rf.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred1))
print("RMSE of Random Forest: %f" % (rmse))
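
A random forest regressor predicts by averaging the predictions of its individual trees, which are available in the fitted `estimators_` list. The sketch below checks this on synthetic data; the data, the 25 trees, and the variable names are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in regression problem
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + 0.2 * rng.normal(size=300)

demo_rf = RandomForestRegressor(n_estimators=25, max_leaf_nodes=9, random_state=42)
demo_rf.fit(X_demo, y_demo)

# Predict a few new points with every tree separately, then average
X_new = rng.uniform(-3, 3, size=(5, 2))
per_tree = np.stack([t.predict(X_new) for t in demo_rf.estimators_])  # shape (25, 5)

print(np.allclose(per_tree.mean(axis=0), demo_rf.predict(X_new)))
```

Each tree is small and individually noisy (here capped at 9 leaves), and the averaging is what smooths out their errors.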

In [None]:
plt.figure(figsize=(12,8))
tree.plot_tree(model_rf.estimators_[1], 
               feature_names=california_housing.feature_names,
               filled=True);

In [None]:
plt.barh(california_housing.feature_names, 
         model_rf.feature_importances_)

In [None]:
cv_grid = GridSearchCV(RandomForestRegressor(n_jobs=-1,random_state=42),
                       param_grid = {
                           'max_depth' : [5,9,15],
                           'n_estimators' : [100,200]
                       })
cv_grid.fit(X_train_scaled, y_train)
cv_grid.best_params_

In [None]:
y_pred = cv_grid.predict(X_test_scaled)
r2score = r2_score(y_test,y_pred)
print('R2 of the best Random Forest regressor after CV is %.2f' % (r2score))

In [None]:
plt.barh(california_housing.feature_names, 
         cv_grid.best_estimator_.feature_importances_)

And Boosting?
* we can review at the end if there is interest