# Decision Trees for Regression

In this example, we will learn how to train a decision tree regressor with **scikit-learn**.

#### Dataset: California Housing

As an example dataset, we will use the California Housing dataset. The data contains information from the 1990 California census. A description of this dataset can be found here: https://www.kaggle.com/datasets/camnugent/california-housing-prices

Fortunately, scikit-learn provides a loader for this dataset, so there is no need to fetch the data from Kaggle manually. See: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html

In [1]:
# Import modules which are relevant for this project
from sklearn.datasets import fetch_california_housing
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score

In [2]:
# To ensure reproducibility

RANDOM_STATE = 42

np.random.seed(RANDOM_STATE)

## Load the dataset

In [3]:
dataset = fetch_california_housing(as_frame=True)

In [4]:
housing_df = dataset['data']
target_df = dataset['target']

housing_df['HousePrice'] = target_df

## Prepare a training, validation and testing set

In the previous notebook, we saw that a training set and a testing set alone are not sufficient to properly train a model and tune its hyperparameters. We need a third set: a validation set.

One way to create the validation set would be to use the `train_test_split` function to split the training set into a smaller training set and a validation set, then train the models on the smaller training set and evaluate them on the validation set.

A great alternative is to use **cross-validation**, more precisely, k-fold cross-validation.

**K-fold cross-validation**: It randomly splits the training set into k distinct subsets called folds, then it trains and evaluates the model k times, picking a different fold for evaluation every time and training on the other k-1 folds. For example, with k = 10 the model is trained and evaluated 10 times, each time training on 9 folds and evaluating on the remaining one. See: https://scikit-learn.org/stable/modules/cross_validation.html
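The splitting procedure described above can be sketched by hand with scikit-learn's `KFold`. This is only an illustration under stated assumptions: the data is a small synthetic stand-in (not the housing data), and `k = 5`, `max_depth=4` and the names `x_demo`, `y_demo`, `demo_model` and `fold_rmse` are arbitrary choices made for this sketch.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.default_rng(42)
x_demo = rng.uniform(-3, 3, size=(200, 2))
y_demo = np.sin(x_demo[:, 0]) + 0.5 * x_demo[:, 1] + rng.normal(0, 0.1, size=200)

k = 5  # number of folds (arbitrary choice for this sketch)
kfold = KFold(n_splits=k, shuffle=True, random_state=42)

fold_rmse = []
for train_idx, val_idx in kfold.split(x_demo):
    # Train on k-1 folds ...
    demo_model = DecisionTreeRegressor(max_depth=4, random_state=42)
    demo_model.fit(x_demo[train_idx], y_demo[train_idx])

    # ... and evaluate on the single held-out fold
    y_val_pred = demo_model.predict(x_demo[val_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y_demo[val_idx], y_val_pred)))

print(f'RMSE per fold: {np.round(fold_rmse, 3)}')
print(f'Mean RMSE over {k} folds: {np.mean(fold_rmse):.3f}')
```

Every sample is used for evaluation exactly once, and the mean over the folds serves as the validation score. The `cross_val_score` function imported earlier wraps this kind of loop in a single call.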

In [5]:
train_df, test_df = train_test_split(housing_df, test_size=0.2)
# We do not further split the training set to obtain a validation set
# and instead use k-fold cross validation.

# Prepare training features and targets
x_train = train_df.loc[:, train_df.columns != 'HousePrice'].values
y_train = train_df['HousePrice'].values

# Prepare testing features and targets
x_test = test_df.loc[:, test_df.columns != 'HousePrice'].values
y_test = test_df['HousePrice'].values

## Find the optimal depth using k-fold cross validation

In [6]:
# TODO: Determine the optimal model depth using k-fold cross validation.
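One possible way to approach this step is sketched below. It is not the notebook's reference solution: the helper `find_best_depth` is hypothetical, the candidate depths and `k` are arbitrary, and the demo runs on synthetic stand-in data so that it is self-contained. It scores each depth with `cross_val_score` using the `neg_mean_squared_error` scorer and converts the result to RMSE.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def find_best_depth(x, y, depths, k=10, random_state=42):
    """Hypothetical helper: return (best depth, {depth: mean CV RMSE})."""
    cv_rmse = {}
    for depth in depths:
        tree = DecisionTreeRegressor(max_depth=depth, random_state=random_state)
        # scikit-learn scorers follow "higher is better", hence the negated MSE
        neg_mse = cross_val_score(tree, x, y, cv=k,
                                  scoring='neg_mean_squared_error')
        cv_rmse[depth] = float(np.sqrt(-neg_mse).mean())
    # The best depth is the one with the lowest mean RMSE across folds
    best = min(cv_rmse, key=cv_rmse.get)
    return best, cv_rmse


# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.default_rng(42)
x_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(x_demo[:, 0]) + 0.5 * x_demo[:, 1] + rng.normal(0, 0.1, size=300)

demo_best_depth, demo_scores = find_best_depth(x_demo, y_demo,
                                               depths=range(1, 11), k=5)
print(f'Best depth on the demo data: {demo_best_depth}')
```

In this notebook, the same helper could be called as `best_depth, _ = find_best_depth(x_train, y_train, range(1, 21))`, where the depth range is again an assumption. Only the training set is passed in, so the testing set stays untouched until the final evaluation.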

## Evaluate the model performance on the testing set

In [7]:
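# best_depth must have been determined in the cross-validation step above
model = DecisionTreeRegressor(max_depth=best_depth)

# Retrain the model on the entire training set.
model.fit(x_train, y_train)

y_pred_test = model.predict(x_test)

# Take the square root of the MSE explicitly; the `squared=False` argument
# is deprecated in newer scikit-learn releases.
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))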

In [8]:
# The target is expressed in units of $100,000, so scale the RMSE to dollars
print(f'RMSE on the testing set: ${rmse_test*100000}')

RMSE on the testing set: $64517.060484520836


That's still not great, but at least it is better than the linear regression model (~\$80k)!