<a href="https://colab.research.google.com/github/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB888_IV_11_RandomForestAndBoostingForRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Random Forest And Boosting For Regression


In this tutorial, we use random forests and boosted trees in a regression setting: we are going to predict house prices based on a dataset from a specific town.

As usual, let's start by loading the relevant libraries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import graphviz
import pydot
from io import StringIO

from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LassoCV, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import sklearn.metrics as metrics

## Setting and Data

As mentioned, we are going to use a well-known dataset, namely the House Price Data from Ames, Iowa, which is used in an [ongoing data science competition on Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques). It includes 79 features describing details of 1,460 residential homes in Ames, Iowa. We are trying to predict the price a house with given features would sell for.

### Load the data

We start by loading the data:

In [None]:
!git clone https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning.git

In [None]:
house = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB888_IV_11_HousePriceData.csv')
house.head()

In [None]:
house.shape

It turns out there are many missing values (NaN), so we need to go through some steps to...

### Prepare the Data

We will follow a relatively simple procedure: we'll check for features with lots of missing values and delete these features. We will then simply drop the remaining observations that have missing values. Of course, more advanced methods, such as imputing the missing values, would be possible.

Let's first look for features with missing values:

In [None]:
pd.set_option("display.max_rows", None)
house.isnull().sum(axis = 0)

So, let's drop those with lots of missing values:

In [None]:
house = house.drop(columns=['Id','LotFrontage','Alley','MasVnrType', 'BsmtQual', 'BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','FireplaceQu','GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond','PoolQC','Fence','MiscFeature'])

And let's then drop the observations with missing values:

In [None]:
house = house.dropna()
house.info()

In [None]:
house.shape

So we only dropped relatively few observations, which is good!
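
As a sketch of the imputation alternative mentioned above, the following fills missing values on a small made-up DataFrame using plain pandas. The column values are illustrative and not taken from the Ames data, and the choice of median for numbers and most frequent level for categories is an assumption. Whether this would beat dropping rows has to be checked on the actual data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the Ames dataset)
toy = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0, 9550.0],
    "Street": ["Pave", "Pave", np.nan, "Grvl"],
})

toy_imputed = toy.copy()
# Numerical feature: fill missing entries with the column median
toy_imputed["LotArea"] = toy["LotArea"].fillna(toy["LotArea"].median())
# Categorical feature: fill missing entries with the most frequent level
toy_imputed["Street"] = toy["Street"].fillna(toy["Street"].mode()[0])

print(toy_imputed)
```

Unlike `dropna()`, this keeps every row, at the cost of introducing values that were never observed.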

Next, we will convert the factor variables to dummies and scale the numerical variables:

In [None]:
col_types = house.columns.to_series().groupby(house.dtypes).groups
# Note: this selects integer (int64) columns only; float columns would need include='number'
numerics = list(house.select_dtypes(include=['int64']).columns)
factors = list(house.select_dtypes(include=['object']).columns)

In [None]:
house_numcols = house[numerics].drop(columns = ['SalePrice'])
house_faccols = house[factors]
dummies = pd.get_dummies(house_faccols, drop_first=True)
house_numcols_sc_0 = scale(house_numcols)
house_numcols_sc = pd.DataFrame(data=house_numcols_sc_0, columns = house_numcols.columns, index = dummies.index)
house_sc = pd.concat([house_numcols_sc, dummies], axis = 1)
house_sc = pd.concat([house_sc, house['SalePrice']], axis =1)
house_sc = house_sc.rename(columns={"SalePrice":"Y"})
house_sc.head()

So, we have the scaled numerical variables, the factor variables as dummies, and then our target variable 'Y'.
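
To make the two transformations concrete, here is a minimal sketch on a made-up three-row DataFrame (the values are illustrative, not from the Ames data). It shows that `drop_first=True` leaves one indicator column for a two-level factor, and that `scale` centers a column at zero with unit standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "LotArea": [8000, 10000, 12000],
    "Street": ["Grvl", "Pave", "Pave"],
})

# drop_first=True drops the first level ("Grvl"), leaving a single indicator column
dummies = pd.get_dummies(toy[["Street"]], drop_first=True)

# scale() subtracts the column mean and divides by the (population) standard deviation
scaled = pd.DataFrame(scale(toy[["LotArea"]]), columns=["LotArea"], index=toy.index)

toy_sc = pd.concat([scaled, dummies], axis=1)
print(toy_sc)
```

Tree-based models do not depend on the scale of the features, so the scaling mainly matters for the LASSO baseline used later.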

Let's now...

### Explore the data

... a bit. I am borrowing this plot from an [article](https://towardsdatascience.com/machine-learning-with-python-regression-complete-tutorial-47268e546cea) I liked:

In [None]:
x = "Y"
fig, ax = plt.subplots(nrows=1, ncols=2,  sharex=False, sharey=False)
fig.suptitle(x, fontsize=20)
### distribution
ax[0].title.set_text('distribution')
variable = house_sc[x].fillna(house_sc[x].mean())
breaks = np.quantile(variable, q=np.linspace(0, 1, 11))
variable = variable[ (variable > breaks[0]) & (variable <
                    breaks[10]) ]
sns.histplot(variable, kde=True, stat="density", ax=ax[0])  # replaces the deprecated sns.distplot
des = house_sc[x].describe()
ax[0].axvline(des["25%"], ls='--')
ax[0].axvline(des["mean"], ls='--')
ax[0].axvline(des["75%"], ls='--')
ax[0].grid(True)
des = round(des, 2).apply(lambda x: str(x))
box = '\n'.join(("min: "+des["min"], "25%: "+des["25%"], "mean: "+des["mean"], "75%: "+des["75%"], "max: "+des["max"]))
ax[0].text(0.95, 0.95, box, transform=ax[0].transAxes, fontsize=10, va='top', ha="right", bbox=dict(boxstyle='round', facecolor='white', alpha=1))
### boxplot
ax[1].title.set_text('outliers (log scale)')
tmp_dtf = pd.DataFrame(house_sc[x])
tmp_dtf[x] = np.log(tmp_dtf[x])
tmp_dtf.boxplot(column=x, ax=ax[1])
plt.show()

So, we see that the data is somewhat skewed (see the distribution on the left-hand side). There are also quite a few outliers on the log scale, meaning there are some very cheap and some very expensive homes. This may be a challenge for linear regression modeling.

We will split the data into *three* parts. One training set consisting of 60% of our data, one validation set we use for tuning models (20%), and finally a test set we use for comparing models (20%):

In [None]:
np.random.seed(42)
train, test = train_test_split(house_sc, test_size = 0.4)
val, test = train_test_split(test, test_size = 0.5)
X_train = train.drop(columns = ['Y']).values
y_train = train['Y'].values
X_val = val.drop(columns = ['Y']).values
y_val = val['Y'].values
X_test = test.drop(columns = ['Y']).values
y_test = test['Y'].values
X_train.shape
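
The two-step split can be checked on toy data. This sketch uses 100 made-up observations and passes `random_state` directly instead of relying on the global NumPy seed, which is an assumption on my part rather than what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(100).reshape(-1, 1)  # 100 toy observations

# First set aside 40% of the rows, then split that part in half
train, rest = train_test_split(data, test_size=0.4, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)

print(len(train), len(val), len(test))
```

Splitting the 40% remainder in half is what yields the 60/20/20 proportions described above.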

## Predictive Modeling

Let's start building and comparing models. As a baseline, let's run a...

### LASSO Regression Model

We use LassoCV for tuning parameter selection:

In [None]:
lasso_cv = LassoCV(alphas=np.logspace(-3, 3, 100), cv=5, random_state=0) # Increased alphas for wider search
lasso_cv.fit(X_train, y_train)
alpha_optimal = lasso_cv.alpha_

In [None]:
lasso_model = Lasso(alpha=alpha_optimal)
lasso_model.fit(X_train, y_train)

In [None]:
y_lasso_pred = lasso_model.predict(X_test)

In [None]:
## Kpi
print("R2 (explained variance):", round(metrics.r2_score(y_test, y_lasso_pred), 2))
print("Mean Absolute Perc Error (Σ(|y-pred|/y)/n):", round(np.mean(np.abs((y_test-y_lasso_pred)/y_test)), 2))
print("Mean Absolute Error (Σ|y-pred|/n):", "{:,.0f}".format(metrics.mean_absolute_error(y_test, y_lasso_pred)))
print("Root Mean Squared Error (sqrt(Σ(y-pred)^2/n)):", "{:,.0f}".format(np.sqrt(metrics.mean_squared_error(y_test, y_lasso_pred))))
## residuals
residuals = y_test - y_lasso_pred
max_error = max(residuals) if abs(max(residuals)) > abs(min(residuals)) else min(residuals)
max_idx = list(residuals).index(max(residuals)) if abs(max(residuals)) > abs(min(residuals)) else list(residuals).index(min(residuals))
max_true, max_pred = y_test[max_idx], y_lasso_pred[max_idx]
print("Max Error:", "{:,.0f}".format(max_error))
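
Since the same metrics are computed for every model, they could be wrapped in a small helper. The function name `regression_kpis` is hypothetical and not part of the original notebook. The percentage error divides by the true values, following the formula in the printed label. The example values are made up.

```python
import numpy as np
import sklearn.metrics as metrics

def regression_kpis(y_true, y_pred):
    """Return R2, MAPE, MAE, RMSE, and the residual with the largest magnitude."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    return {
        "r2": metrics.r2_score(y_true, y_pred),
        "mape": np.mean(np.abs(residuals / y_true)),
        "mae": metrics.mean_absolute_error(y_true, y_pred),
        "rmse": np.sqrt(metrics.mean_squared_error(y_true, y_pred)),
        "max_error": residuals[np.argmax(np.abs(residuals))],
    }

# Toy check with made-up prices
kpis = regression_kpis([100, 200, 400], [110, 180, 400])
print(kpis)
```

Calling such a helper with `y_test` and each model's predictions would produce the same numbers as the cells in this notebook without repeating the code.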

### Random Forest Model

Let's now consider a random forest model. Let's start with a model under default parameters:

In [None]:
house_rf = RandomForestRegressor(random_state=1)
house_rf.fit(X_train, y_train)

Let's predict in the test set:

In [None]:
y_rf_test_pred = house_rf.predict(X_test)

And let's check the performance:

In [None]:
## Kpi
print("R2 (explained variance):", round(metrics.r2_score(y_test, y_rf_test_pred), 2))
print("Mean Absolute Perc Error (Σ(|y-pred|/y)/n):", round(np.mean(np.abs((y_test-y_rf_test_pred)/y_test)), 2))
print("Mean Absolute Error (Σ|y-pred|/n):", "{:,.0f}".format(metrics.mean_absolute_error(y_test, y_rf_test_pred)))
print("Root Mean Squared Error (sqrt(Σ(y-pred)^2/n)):", "{:,.0f}".format(np.sqrt(metrics.mean_squared_error(y_test, y_rf_test_pred))))
## residuals
residuals = y_test - y_rf_test_pred
max_error = max(residuals) if abs(max(residuals)) > abs(min(residuals)) else min(residuals)
max_idx = list(residuals).index(max(residuals)) if abs(max(residuals)) > abs(min(residuals)) else list(residuals).index(min(residuals))
max_true, max_pred = y_test[max_idx], y_rf_test_pred[max_idx]
print("Max Error:", "{:,.0f}".format(max_error))

So, the performance is similar to, and actually a bit worse than, that of the LASSO regression. We may be able to push performance a bit via tuning, so let's try using a few more trees:

In [None]:
house_rf = RandomForestRegressor(random_state=1,n_estimators=250)
house_rf.fit(X_train, y_train)

In [None]:
y_rf_test_pred = house_rf.predict(X_test)
## Kpi
print("R2 (explained variance):", round(metrics.r2_score(y_test, y_rf_test_pred), 2))
print("Mean Absolute Perc Error (Σ(|y-pred|/y)/n):", round(np.mean(np.abs((y_test-y_rf_test_pred)/y_test)), 2))
print("Mean Absolute Error (Σ|y-pred|/n):", "{:,.0f}".format(metrics.mean_absolute_error(y_test, y_rf_test_pred)))
print("Root Mean Squared Error (sqrt(Σ(y-pred)^2/n)):", "{:,.0f}".format(np.sqrt(metrics.mean_squared_error(y_test, y_rf_test_pred))))
## residuals
residuals = y_test - y_rf_test_pred
max_error = max(residuals) if abs(max(residuals)) > abs(min(residuals)) else min(residuals)
max_idx = list(residuals).index(max(residuals)) if abs(max(residuals)) > abs(min(residuals)) else list(residuals).index(min(residuals))
max_true, max_pred = y_test[max_idx], y_rf_test_pred[max_idx]
print("Max Error:", "{:,.0f}".format(max_error))

So, the results are not much better, in line with the general observation that random forests are not very sensitive to tuning.
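
The effect of the number of trees can be explored with a small experiment. This sketch uses synthetic data from `make_regression` as a stand-in for the house-price features, which is an assumption, so the numbers it prints say nothing about the Ames data. Typically the error drops quickly for the first trees and then flattens out.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (not the Ames dataset)
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rmse_by_trees = {}
for n_trees in [10, 100, 250]:
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=1)
    rf.fit(X_tr, y_tr)
    rmse_by_trees[n_trees] = np.sqrt(mean_squared_error(y_te, rf.predict(X_te)))

for n_trees, rmse in rmse_by_trees.items():
    print(n_trees, "trees -> test RMSE:", round(rmse, 1))
```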

Let's try...

### Gradient Boosting

Again, let's start with default parameters:

In [None]:
house_boost = GradientBoostingRegressor(random_state=1)
house_boost.fit(X_train, y_train)

And let's predict and check performance:

In [None]:
y_boost_test_pred = house_boost.predict(X_test)

In [None]:
## Kpi
print("R2 (explained variance):", round(metrics.r2_score(y_test, y_boost_test_pred), 2))
print("Mean Absolute Perc Error (Σ(|y-pred|/y)/n):", round(np.mean(np.abs((y_test-y_boost_test_pred)/y_test)), 2))
print("Mean Absolute Error (Σ|y-pred|/n):", "{:,.0f}".format(metrics.mean_absolute_error(y_test, y_boost_test_pred)))
print("Root Mean Squared Error (sqrt(Σ(y-pred)^2/n)):", "{:,.0f}".format(np.sqrt(metrics.mean_squared_error(y_test, y_boost_test_pred))))
## residuals
residuals = y_test - y_boost_test_pred
max_error = max(residuals) if abs(max(residuals)) > abs(min(residuals)) else min(residuals)
max_idx = list(residuals).index(max(residuals)) if abs(max(residuals)) > abs(min(residuals)) else list(residuals).index(min(residuals))
max_true, max_pred = y_test[max_idx], y_boost_test_pred[max_idx]
print("Max Error:", "{:,.0f}".format(max_error))

So we actually see a bit of a performance increase in terms of MSE. Let's try tuning the model:

In [None]:
best_rmse = np.inf   # start at infinity so the first model always becomes the current best
best_params = None

n_estimators_grid = [100, 500, 1000]
learning_rate_grid = [0.05, 0.075, 0.1]

for n_estimators in n_estimators_grid:
  for learning_rate in learning_rate_grid:
    # Train the GradientBoostingRegressor with the current parameters
    house_boost = GradientBoostingRegressor(n_estimators=n_estimators, learning_rate=learning_rate, random_state=1)
    house_boost.fit(X_train, y_train)

    # Predict house prices for the validation set
    y_boost_val = house_boost.predict(X_val)
    rmse = np.sqrt(metrics.mean_squared_error(y_val, y_boost_val))
    print("RMSE:", rmse)


    # Update the best RMSE and parameters if the current RMSE is lower
    if rmse < best_rmse:
        best_rmse = rmse
        best_params = {'n_estimators': n_estimators, 'learning_rate': learning_rate}

print("Best RMSE:", best_rmse)
print("Best parameters:", best_params)

So we see something quite curious: the RMSE on the validation set (around 50,000) is much larger than on the test set (around 20,000). This showcases the risk of relying on a single split of our data: a few odd observations can affect the results dramatically. It is therefore unclear how helpful the guidance from the tuning exercise is. Nevertheless, in line with that guidance, let's try a lower learning rate (and moderately more trees):
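
One common way to reduce the dependence on a single validation split is k-fold cross-validation. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data, with a smaller grid than above to keep it fast. The data, the grid values, and the choice of five folds are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (not the Ames dataset)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

param_grid = {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=1),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the sign
    cv=5,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
```

Each parameter combination is scored on five different held-out folds, so a handful of unusual observations has less influence on which combination wins.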

In [None]:
house_boost = GradientBoostingRegressor(random_state=1, learning_rate=0.075, n_estimators=500)
house_boost.fit(X_train, y_train)
y_boost_test_pred = house_boost.predict(X_test)

In [None]:
## Kpi
print("R2 (explained variance):", round(metrics.r2_score(y_test, y_boost_test_pred), 2))
print("Mean Absolute Perc Error (Σ(|y-pred|/y)/n):", round(np.mean(np.abs((y_test-y_boost_test_pred)/y_test)), 2))
print("Mean Absolute Error (Σ|y-pred|/n):", "{:,.0f}".format(metrics.mean_absolute_error(y_test, y_boost_test_pred)))
print("Root Mean Squared Error (sqrt(Σ(y-pred)^2/n)):", "{:,.0f}".format(np.sqrt(metrics.mean_squared_error(y_test, y_boost_test_pred))))
## residuals
residuals = y_test - y_boost_test_pred
max_error = max(residuals) if abs(max(residuals)) > abs(min(residuals)) else min(residuals)
max_idx = list(residuals).index(max(residuals)) if abs(max(residuals)) > abs(min(residuals)) else list(residuals).index(min(residuals))
max_true, max_pred = y_test[max_idx], y_boost_test_pred[max_idx]
print("Max Error:", "{:,.0f}".format(max_error))

So, the performance is similar...

Note that so far, we have used the default squared-error loss. We can also choose the absolute-error loss:

In [None]:
house_boost = GradientBoostingRegressor(random_state=1, loss = "absolute_error", learning_rate=0.075, n_estimators=500)
house_boost.fit(X_train, y_train)
y_boost_test_pred = house_boost.predict(X_test)
## Kpi
print("R2 (explained variance):", round(metrics.r2_score(y_test, y_boost_test_pred), 2))
print("Mean Absolute Perc Error (Σ(|y-pred|/y)/n):", round(np.mean(np.abs((y_test-y_boost_test_pred)/y_test)), 2))
print("Mean Absolute Error (Σ|y-pred|/n):", "{:,.0f}".format(metrics.mean_absolute_error(y_test, y_boost_test_pred)))
print("Root Mean Squared Error (sqrt(Σ(y-pred)^2/n)):", "{:,.0f}".format(np.sqrt(metrics.mean_squared_error(y_test, y_boost_test_pred))))
## residuals
residuals = y_test - y_boost_test_pred
max_error = max(residuals) if abs(max(residuals)) > abs(min(residuals)) else min(residuals)
max_idx = list(residuals).index(max(residuals)) if abs(max(residuals)) > abs(min(residuals)) else list(residuals).index(min(residuals))
max_true, max_pred = y_test[max_idx], y_boost_test_pred[max_idx]
print("Max Error:", "{:,.0f}".format(max_error))

So, the performance is fairly similar, even though the model now optimizes the absolute error rather than the squared error. In other applications, this can make a substantive difference.
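
To see why the choice of loss can matter, consider the simplest possible model: a single constant prediction. The sketch below uses made-up targets with one outlier and searches a grid of candidate constants. The squared-error minimizer lands on the mean, which is pulled toward the outlier, while the absolute-error minimizer lands on the median.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # toy targets with one outlier
candidates = np.linspace(0, 100, 10001)      # candidate constant predictions (step 0.01)

# Average loss of each candidate constant over all targets
sq_loss = ((y[:, None] - candidates[None, :]) ** 2).mean(axis=0)
abs_loss = np.abs(y[:, None] - candidates[None, :]).mean(axis=0)

best_sq = candidates[np.argmin(sq_loss)]
best_abs = candidates[np.argmin(abs_loss)]

print("Squared-error minimizer:", best_sq, "| mean:", y.mean())
print("Absolute-error minimizer:", best_abs, "| median:", np.median(y))
```

With heavy-tailed targets such as house prices, this difference in sensitivity to extreme values is what can make the two losses behave differently.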