# Random Forests for Regression

In this example, we will learn how to train a random forest regressor with **scikit-learn**. <br/>

## Dataset: California Housing

As an example, we will use the California Housing dataset, which contains information from the 1990 California census. A description of this dataset can be found here: https://www.kaggle.com/datasets/camnugent/california-housing-prices

Fortunately, scikit-learn provides a loader for this dataset (`fetch_california_housing`), so there is no need to fetch the data from Kaggle manually. See: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html

In [1]:
# Import modules which are relevant for this project
from sklearn.datasets import fetch_california_housing
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# To ensure reproducibility

RANDOM_STATE = 42

np.random.seed(RANDOM_STATE)
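
Seeding NumPy's global generator helps here because scikit-learn falls back to that generator whenever `random_state` is left at `None`. The following is a minimal sketch on a small made-up array (not the housing data) that illustrates this behaviour next to the more explicit alternative of passing `random_state` directly. The array `X` and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration only (not the housing dataset)
X = np.arange(20).reshape(10, 2)

# Option 1: seed the global NumPy RNG before each call
np.random.seed(42)
train_a, test_a = train_test_split(X, test_size=0.2)
np.random.seed(42)
train_b, test_b = train_test_split(X, test_size=0.2)
same_with_global_seed = np.array_equal(test_a, test_b)

# Option 2: pass the seed explicitly, independent of the global state
train_c, test_c = train_test_split(X, test_size=0.2, random_state=42)
train_d, test_d = train_test_split(X, test_size=0.2, random_state=42)
same_with_explicit_seed = np.array_equal(test_c, test_d)

print(same_with_global_seed, same_with_explicit_seed)
```

The explicit form is less sensitive to the order in which cells are run, since any other call that draws from the global generator shifts its state.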

## Load the dataset

In [3]:
dataset = fetch_california_housing(as_frame=True)

In [4]:
housing_df = dataset['data']
target_df = dataset['target']

housing_df['HousePrice'] = target_df

## Prepare a training and testing set

As in our linear regression example, we split the dataset into two parts. <br/>
80% of the data is used for training, and 20% is used for testing.

In [5]:
# No random_state is passed, so the split draws from NumPy's global RNG seeded above
train_df, test_df = train_test_split(housing_df, test_size=0.2)

# Prepare training features and target
x_train = train_df.loc[:, train_df.columns != 'HousePrice'].values
y_train = train_df['HousePrice'].values

# Prepare testing features and target
x_test = test_df.loc[:, test_df.columns != 'HousePrice'].values
y_test = test_df['HousePrice'].values
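
The boolean mask over `columns` used above keeps every column except the target. As a small sketch of what that selection does, the example below applies it to a made-up three-row frame and compares it with `DataFrame.drop`, which should give the same feature matrix. The column names `rooms` and `age` are invented for illustration.

```python
import pandas as pd

# Toy frame for illustration; 'rooms' and 'age' are made-up column names
df = pd.DataFrame({
    'rooms': [3, 4, 5],
    'age': [10, 20, 30],
    'HousePrice': [1.5, 2.0, 2.5],
})

# Boolean mask over the column index: True for every column except the target
mask = df.columns != 'HousePrice'
x_mask = df.loc[:, mask].values

# Equivalent selection with DataFrame.drop
x_drop = df.drop(columns='HousePrice').values

# The target is taken from the single remaining column
y = df['HousePrice'].values

print(x_mask.shape, y.shape)
```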

## Training a random forest regressor

We now train our first random forest regressor, which scikit-learn provides as `RandomForestRegressor`.

In [6]:
# Create a random forest regressor
model = RandomForestRegressor()

In [7]:
# Fit the model
model.fit(x_train, y_train)
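
As a rough picture of what the fitted model does at prediction time: a random forest regressor in scikit-learn predicts the mean of the predictions of its individual trees. The sketch below checks this on synthetic data (not the housing data). The data-generating formula, `n_estimators=25`, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data for illustration only
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# Each fitted tree is available in forest.estimators_
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])

# Average the per-tree predictions and compare with the forest's own output
manual_average = per_tree.mean(axis=0)
matches = np.allclose(manual_average, forest.predict(X))

print(len(forest.estimators_), matches)
```

Averaging many trees, each fitted on a bootstrap sample, is what tends to make the forest less sensitive to noise than a single decision tree.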

In [8]:
# Evaluate the performance of the regression model
y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)

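# RMSE is the square root of the MSE; np.sqrt is used here because the
# `squared` argument is deprecated in newer scikit-learn releases
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))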

print('Train RMSE:', rmse_train)
print('Test RMSE:', rmse_test)

Train RMSE: 0.18756381286355583
Test RMSE: 0.503395775621682
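
For reference, the RMSE reported above is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand on made-up numbers and compares the result with the square root of scikit-learn's `mean_squared_error`. The arrays `y_true` and `y_pred` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions for illustration
y_true = np.array([2.0, 1.0, 3.5, 0.5])
y_pred = np.array([2.5, 1.0, 3.0, 1.5])

# RMSE by hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's MSE followed by a square root
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```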


Nice: without any hyperparameter tuning, the random forest achieves a lower test RMSE (about \$50,339, since the target is expressed in units of \$100,000) than the fine-tuned decision tree (\$60,154). This looks promising.
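
The dollar figures follow from the unit of the target: scikit-learn documents the California Housing target as the median house value in hundreds of thousands of dollars. A minimal sketch of the conversion, using the two RMSE values printed above, could look like this. The helper `rmse_in_dollars` and the constant name are hypothetical and introduced only for illustration.

```python
# The target of this dataset is the median house value in units of $100,000
TARGET_UNIT_IN_DOLLARS = 100_000


def rmse_in_dollars(rmse):
    """Convert an RMSE given in target units to dollars."""
    return rmse * TARGET_UNIT_IN_DOLLARS


# RMSE values copied from the printed output above
test_rmse_dollars = rmse_in_dollars(0.503395775621682)
train_rmse_dollars = rmse_in_dollars(0.18756381286355583)

print(f"Test RMSE:  ${test_rmse_dollars:,.0f}")
print(f"Train RMSE: ${train_rmse_dollars:,.0f}")
```

Note that the train RMSE is considerably lower than the test RMSE, so the test value is the one to compare against other models.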