In [None]:
%matplotlib inline

# Random forest

A [RandomForestRegressor][gemseo.mlearning.regression.algos.random_forest.RandomForestRegressor] is a random forest model
based on [scikit-learn](https://scikit-learn.org).


In [None]:
from __future__ import annotations

import contextlib

from matplotlib import pyplot as plt
from numpy import array

from gemseo import create_design_space
from gemseo import create_discipline
from gemseo import sample_disciplines
from gemseo.mlearning import create_regression_model

## Problem

In this example,
we represent the function $f(x)=(6x-2)^2\sin(12x-4)$
by the [AnalyticDiscipline][gemseo.disciplines.analytic.AnalyticDiscipline].

!!! quote "References"
      Alexander I. J. Forrester, Andras Sobester, and Andy J. Keane.
      Engineering design via surrogate modelling: a practical guide. Wiley, 2008.



In [None]:
discipline = create_discipline(
    "AnalyticDiscipline",
    name="f",
    expressions={"y": "(6*x-2)**2*sin(12*x-4)"},
)

We seek to approximate it over the input space $[0, 1]$:



In [None]:
input_space = create_design_space()
input_space.add_variable("x", lower_bound=0.0, upper_bound=1.0)

To do this,
we create a training dataset with 6 equispaced points:



In [None]:
training_dataset = sample_disciplines(
    [discipline], input_space, "y", algo_name="PYDOE_FULLFACT", n_samples=6
)
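
As a rough cross-check of the training data, the six points can be reproduced without GEMSEO. The sketch below assumes that the full-factorial design places its six levels uniformly on $[0, 1]$, bounds included; the helper name `forrester` is introduced here for illustration only.

```python
import math


def forrester(x: float) -> float:
    """Evaluate f(x) = (6x - 2)^2 * sin(12x - 4)."""
    return (6 * x - 2) ** 2 * math.sin(12 * x - 4)


# Assumption: six equispaced levels on [0, 1], including both bounds.
n_samples = 6
training_inputs = [i / (n_samples - 1) for i in range(n_samples)]
training_outputs = [forrester(x) for x in training_inputs]

for x, y in zip(training_inputs, training_outputs):
    print(f"x = {x:.1f}, y = {y:.4f}")
```

With only six samples, the spacing between training points is 0.2, which is coarse compared with the oscillations of $f$.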

## Basics

### Training

Then,
we train a random forest regression model from these samples:



In [None]:
model = create_regression_model("RandomForestRegressor", training_dataset)
model.learn()

### Prediction

Once it is built,
we can predict the output value of $f$ at a new input point:



In [None]:
input_value = {"x": array([0.65])}
output_value = model.predict(input_value)
output_value

but we cannot predict its Jacobian:
`predict_jacobian` raises a `NotImplementedError`, which is suppressed here:



In [None]:
with contextlib.suppress(NotImplementedError):
    model.predict_jacobian(input_value)
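
One plausible reason for the missing Jacobian is that an average of regression trees is piecewise constant. The toy sketch below illustrates this with three hand-written decision stumps; the thresholds and leaf values are hypothetical and this is not how scikit-learn builds its trees.

```python
def stump(threshold, left_value, right_value):
    """Return a one-split regression tree as a Python function."""

    def predict(x):
        return left_value if x <= threshold else right_value

    return predict


# Hypothetical toy forest: three stumps with hand-picked splits.
toy_forest = [stump(0.3, 1.0, 5.0), stump(0.5, 0.0, 4.0), stump(0.7, 2.0, 9.0)]


def forest_predict(x):
    """Average the predictions of the stumps, as a forest averages its trees."""
    return sum(tree(x) for tree in toy_forest) / len(toy_forest)


def finite_difference(x, step=1e-6):
    """Approximate the derivative of the toy forest by central differences."""
    return (forest_predict(x + step) - forest_predict(x - step)) / (2 * step)


print(finite_difference(0.6))  # between two thresholds: flat prediction
print(finite_difference(0.5))  # across a threshold: a jump, not a slope
```

Away from the thresholds the derivative is zero, and at a threshold it is undefined, so a Jacobian would carry no useful information for such a model.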

### Plotting

You can see that the random forest model is fairly accurate on the left part of the input space,
but poor on the right part:



In [None]:
test_dataset = sample_disciplines(
    [discipline], input_space, "y", algo_name="PYDOE_FULLFACT", n_samples=100
)
input_data = test_dataset.get_view(variable_names=model.input_names).to_numpy()
reference_output_data = test_dataset.get_view(variable_names="y").to_numpy().ravel()
predicted_output_data = model.predict(input_data).ravel()
plt.plot(input_data.ravel(), reference_output_data, label="Reference")
plt.plot(input_data.ravel(), predicted_output_data, label="Regression - Basics")
plt.grid()
plt.legend()
plt.show()
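
The visual comparison can be complemented by an error measure computed separately on each half of the input space. The sketch below is one possible way to do this; the helper names `rmse` and `rmse_by_half` and the toy values are hypothetical and are not results from this notebook.

```python
import math


def rmse(reference, predicted):
    """Return the root mean squared error between two equal-length sequences."""
    squared_errors = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmse_by_half(inputs, reference, predicted, split=0.5):
    """Return the RMSE for inputs <= split and for inputs > split.

    Assumption: each side of the split contains at least one point.
    """
    left = [(r, p) for x, r, p in zip(inputs, reference, predicted) if x <= split]
    right = [(r, p) for x, r, p in zip(inputs, reference, predicted) if x > split]
    return rmse(*zip(*left)), rmse(*zip(*right))


# Hypothetical toy values, chosen so that the errors are easy to check by hand.
toy_inputs = [0.0, 0.25, 0.75, 1.0]
toy_reference = [0.0, 1.0, 2.0, 3.0]
toy_predicted = [0.0, 1.0, 4.0, 5.0]
left_error, right_error = rmse_by_half(toy_inputs, toy_reference, toy_predicted)
print(left_error, right_error)
```

With the arrays of this notebook, the call would be `rmse_by_half(input_data.ravel(), reference_output_data, predicted_output_data)`.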

## Settings

### Number of estimators

The main hyperparameter of random forest regression is
the number of trees in the forest (default: 100).
Here is a comparison between the default model and models with 10 and 1000 trees:



In [None]:
model = create_regression_model(
    "RandomForestRegressor", training_dataset, n_estimators=10
)
model.learn()
predicted_output_data_1 = model.predict(input_data).ravel()
model = create_regression_model(
    "RandomForestRegressor", training_dataset, n_estimators=1000
)
model.learn()
predicted_output_data_2 = model.predict(input_data).ravel()
plt.plot(input_data.ravel(), reference_output_data, label="Reference")
plt.plot(input_data.ravel(), predicted_output_data, label="Regression - Basics")
plt.plot(input_data.ravel(), predicted_output_data_1, label="Regression - 10 trees")
plt.plot(input_data.ravel(), predicted_output_data_2, label="Regression - 1000 trees")
plt.grid()
plt.legend()
plt.show()
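
To give some intuition about the effect of the number of trees, the sketch below averages noisy predictions and measures how much the average varies from one forest to another. It is an idealized model, assuming independent trees with Gaussian errors; real trees share the same training data and are correlated, so the actual gain is smaller. All names and values are hypothetical.

```python
import random
import statistics


def forest_prediction(n_trees, rng, true_value=1.0, tree_noise=1.0):
    """Average n_trees noisy tree predictions (idealized: independent trees)."""
    tree_predictions = [rng.gauss(true_value, tree_noise) for _ in range(n_trees)]
    return statistics.fmean(tree_predictions)


def prediction_spread(n_trees, n_forests=200, seed=0):
    """Return the standard deviation of the prediction over several forests."""
    rng = random.Random(seed)
    predictions = [forest_prediction(n_trees, rng) for _ in range(n_forests)]
    return statistics.stdev(predictions)


spread_10 = prediction_spread(10)
spread_1000 = prediction_spread(1000)
print(f"10 trees: {spread_10:.3f}, 1000 trees: {spread_1000:.3f}")
```

Under these assumptions, adding trees mainly stabilizes the prediction; it does not make a piecewise-constant model smoother than the training data allow.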

### Others

The `RandomForestRegressor` class of scikit-learn has many settings
([read more](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)),
and we have chosen to expose only `n_estimators` directly.
However,
any argument of `RandomForestRegressor` can be set
using the dictionary `parameters`.
For example,
we can impose a minimum of two samples per leaf:



In [None]:
model = create_regression_model(
    "RandomForestRegressor", training_dataset, parameters={"min_samples_leaf": 2}
)
model.learn()
predicted_output_data_ = model.predict(input_data).ravel()
plt.plot(input_data.ravel(), reference_output_data, label="Reference")
plt.plot(input_data.ravel(), predicted_output_data, label="Regression - Basics")
plt.plot(
    input_data.ravel(), predicted_output_data_, label="Regression - 2 samples per leaf"
)
plt.grid()
plt.legend()
plt.show()