# Trees, random forests, and boosting

We investigate in this notebook how methods based on trees perform for a regression problem (on the 'bike sharing' data set already encountered in the notebook on linear regression).

**There are 6 questions to answer. Only the part on trees is due for Apr. 3rd, 2024 (the part on boosting will be for Apr. 8th).**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt

# set up the random number generator: given seed for reproducibility, None otherwise
# (see https://numpy.org/doc/stable/reference/random/generator.html#numpy.random.default_rng)
my_seed = 1
rng = np.random.default_rng(seed=my_seed) 

## Dataset

We re-use the 'bike sharing' data set already used for linear regression. We therefore do not comment further on this example, and directly build the train and test sets.

In [None]:
df = pd.read_csv('BikeSharingDataset.csv')
df.drop(['instant','dteday','temp','casual','registered'],axis=1,inplace=True)
df.info()
df.head()

In [None]:
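# randomly splitting between test and train data
# (the split is seeded with my_seed for reproducibility; change my_seed to obtain another split)
fraction_train = 0.7
fraction_test = 1.0 - fraction_train
df_train, df_test = train_test_split(df, train_size = fraction_train, test_size = fraction_test, random_state = my_seed)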

In [None]:
# the aim is to predict the last column from the previous ones
X_train = df_train[df_train.columns[:-1]]
y_train = df_train["cnt"]
print("Train set")
X_train.head()

In [None]:
# construction of the test set
X_test = df_test[df_test.columns[:-1]]
y_test = df_test["cnt"]
print("Test set")
X_test.head()

## Decision trees

We start by performing regression with a single tree. See https://scikit-learn.org/stable/modules/tree.html for a presentation of trees in Scikit-learn (in particular the practical tips in Section 1.10.5, which include advice on how to avoid overfitting), as well as https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html for a description of the methods of the class.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

The key parameters to avoid overfitting are
- *max_depth*, which limits the maximal depth of the tree;
- *min_samples_leaf*, which sets the minimal number of training data points that each leaf must contain (a split is only considered if it leaves at least that many points on each side).

We start by playing around with a few values of these parameters to get a feeling for what is obtained.
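
To get a first intuition before working on the bike sharing data, here is a minimal sketch on a synthetic problem (a noisy sine curve, which is an assumption made for illustration and not part of the data set). The names `X_demo`, `y_demo`, `settings` and `scores` are hypothetical. It compares the R2 on the train and test sets for a very shallow tree, a moderately constrained tree and an unconstrained tree.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic 1D regression problem (NOT the bike sharing data): noisy sine curve
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.uniform(0.0, 6.0, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + 0.3 * rng_demo.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0
)

# (max_depth, min_samples_leaf): a stump, a constrained tree, an unconstrained tree
settings = [(1, 1), (4, 10), (None, 1)]
scores = {}
for depth, leaf in settings:
    tree_demo = DecisionTreeRegressor(
        max_depth=depth, min_samples_leaf=leaf, random_state=0
    )
    tree_demo.fit(X_tr, y_tr)
    # store the coefficient of determination on the train and test sets
    scores[(depth, leaf)] = (tree_demo.score(X_tr, y_tr), tree_demo.score(X_te, y_te))
    print(f"max_depth={depth}, min_samples_leaf={leaf}: "
          f"R2 train={scores[(depth, leaf)][0]:.3f}, R2 test={scores[(depth, leaf)][1]:.3f}")
```

The unconstrained tree reproduces the training data almost perfectly, noise included, which is the overfitting that the two parameters are meant to control.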

**Question 1.** Complete the code below to perform the desired prediction.

In [None]:
# choice of parameters 
chosen_max_depth = ... # TO COMPLETE
chosen_min_samples_leaf = ... # TO COMPLETE

# constructing + fitting the model and making the prediction
dt = DecisionTreeRegressor(...) # TO COMPLETE WITH CORRECT ARGUMENTS 
model = dt.fit(X_train,y_train)
y_pred = model.predict(X_test)

# computing the performance on the test set
# MSE between prediction and true values
mse = mean_squared_error(y_test,y_pred) # arguments in the documented order (true values, predictions)
print('MSE (test):',mse) 
# coefficient of determination for the regression
R2 = dt.score(X_test,y_test)
print('R2  (test):',R2)
# plotting predicted values as a function of true values to graphically assess the quality of regression
plt.title('Correlation between actual and predicted values')
plt.xlabel('actual values')
plt.ylabel('predicted values')
plt.scatter(y_test,y_pred,color='royalblue')
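# reference line 'predicted = actual', drawn over the range of the true values
lims = [y_test.min(),y_test.max()]
plt.plot(lims,lims,color='red')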
plt.show()

We can visualize the tree that was learnt using the function **plot_tree** (see https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html).
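
For trees that are too large to read as a plot, a text rendering can be easier to inspect. The sketch below uses `export_text` from `sklearn.tree` on a small synthetic problem; the data and the feature names `x0`, `x1` are hypothetical and only serve the illustration. Both `export_text` and `plot_tree` accept a `feature_names` argument, so that splits are labeled with names rather than indices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# synthetic data (NOT the bike sharing data): the target jumps when x0 crosses 0.5
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.uniform(size=(200, 2))
y_demo = 3.0 * (X_demo[:, 0] > 0.5) + 0.1 * rng_demo.normal(size=200)
feature_names_demo = ["x0", "x1"]  # hypothetical names, for illustration only

tree_demo = DecisionTreeRegressor(max_depth=2, min_samples_leaf=10, random_state=0)
tree_demo.fit(X_demo, y_demo)

# text description of the splits, labeled with the feature names
report = export_text(tree_demo, feature_names=feature_names_demo)
print(report)
```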

In [None]:
from sklearn import tree
print('Number of data points : ',y_train.shape[0])
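# correspondence between the feature indices X[i] shown in the plot and the column names
# (features only, without the target 'cnt')
for i, name in enumerate(X_train.columns):
  print('X[',i,'] = ', name)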
plt.figure(figsize=(15, 10))
tree.plot_tree(dt)
plt.show()

**Question 2.** For a minimal number of samples at leaves of 10, what is the best tree depth that you find, and the associated errors? What do you think of these results? 

WRITE YOUR ANSWER HERE

We next explore more systematically the choice of the parameters by cross validation.
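
As a reminder of what cross validation computes, here is a minimal sketch for a few fixed values of *max_depth*, using `cross_val_score` on synthetic data (a noisy sine curve, assumed for illustration only; the names `X_demo`, `y_demo` and `cv_means` are hypothetical). The training data are cut into 5 folds; each fold is used once for validation while the model is fitted on the 4 others, and the 5 validation scores are averaged.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic 1D regression problem (NOT the bike sharing data)
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.uniform(0.0, 6.0, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + 0.3 * rng_demo.normal(size=300)

cv_means = {}
for depth in [1, 3, 6, 12]:
    tree_demo = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=5, random_state=0)
    # 5-fold cross validation; the default score of a regressor is R2
    fold_scores = cross_val_score(tree_demo, X_demo, y_demo, cv=5)
    cv_means[depth] = fold_scores.mean()
    print(f"max_depth={depth}: mean R2 over the folds = {cv_means[depth]:.3f}")
```

A grid search repeats this procedure for every combination of parameters in the grid and keeps the combination with the best average score.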

In [None]:
dt = DecisionTreeRegressor()
dt_params = {'max_depth':np.arange(1,10),'min_samples_leaf':np.arange(5,20)}

**Question 3.** Complete the code below to find the best hyperparameters for the model and then retrain the corresponding model and compute the associated outputs.

In [None]:
# the best parameters are returned as a dictionary (Python 'dict')
from sklearn.model_selection import GridSearchCV
print('Grid search to find optimal parameters')
gs_dt = GridSearchCV(...) # TO COMPLETE
gs_dt.fit(X_train,y_train)
a = ... # BEST PARAMETERS, see attributes of grid search class
print('- Best maximal depth =',...)
print('- Best minimal number of samples in the leaves = ',...,'\n')

In [None]:
# training with best parameters
dt_final = DecisionTreeRegressor(...) # TO COMPLETE
model = dt_final.fit(X_train,y_train)
y_pred = model.predict(X_test)

# computing the performance on the test set
# MSE between prediction and true values
mse = mean_squared_error(y_test,y_pred) # arguments in the documented order (true values, predictions)
print('MSE (test):',mse) 
# coefficient of determination for the regression
R2 = dt_final.score(X_test,y_test)
print('R2  (test):',R2)
# plotting predicted values as a function of true values to graphically assess the quality of regression
plt.title('Correlation between actual and predicted values')
plt.xlabel('actual values')
plt.ylabel('predicted values')
plt.scatter(y_test,y_pred,color='royalblue')
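# reference line 'predicted = actual', drawn over the range of the true values
lims = [y_test.min(),y_test.max()]
plt.plot(lims,lims,color='red')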
plt.show()

**Question 4.** Is the test error better than the one found manually above? Do the parameter values found by cross validation change a lot if one uses another test/train split?

WRITE YOUR ANSWER HERE

## Random forests

The test error of a single decision tree is typically not very good. Combining many regression trees into a random forest usually performs much better (see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()

Here as well, we will find the best parameters for *RandomForestRegressor* by cross validation. The key parameters to set are 
- *n_estimators* which is the number of trees in the forest
- *max_depth* and *min_samples_leaf* which are the same parameters as for decision trees
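
To make the role of *n_estimators* concrete, the sketch below checks on synthetic data (assumed for illustration; the names `X_demo`, `y_demo`, `X_new` are hypothetical) that the prediction of a `RandomForestRegressor` is the average of the predictions of its individual trees, which are stored in the attribute `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic regression problem with 3 features (NOT the bike sharing data)
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.uniform(size=(300, 3))
y_demo = np.sin(4.0 * X_demo[:, 0]) + X_demo[:, 1] ** 2 + 0.1 * rng_demo.normal(size=300)

forest_demo = RandomForestRegressor(
    n_estimators=25, max_depth=5, min_samples_leaf=2, random_state=0
)
forest_demo.fit(X_demo, y_demo)

# predictions of each tree on a few new points: array of shape (n_trees, n_points)
X_new = rng_demo.uniform(size=(5, 3))
per_tree = np.stack([t.predict(X_new) for t in forest_demo.estimators_])
manual_average = per_tree.mean(axis=0)

print("forest prediction  :", forest_demo.predict(X_new))
print("average over trees :", manual_average)
print("spread across trees:", per_tree.std(axis=0))
```

Averaging reduces the variance of the individual trees, each of which is fitted on a bootstrap sample of the training data.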

**Question 5.** Fit the best model with the hyperparameters found by cross validation, and compare its predictive performance to the one found with a single tree.

In [None]:
rf_params = {'n_estimators':np.arange(25,150,25),'max_depth':np.arange(1,11,2),'min_samples_leaf':np.arange(2,15,3)}
print('Grid search to find optimal parameters')
gs_rf = GridSearchCV(...) # TO COMPLETE
gs_rf.fit(X_train,y_train)
b = ... # TO COMPLETE, best parameters
print('- Best number of trees = ',...)
print('- Best maximal depth =',...)
print('- Best minimal number of samples in the leaves = ',...,'\n')

In [None]:
# fitting the model with best params
RF = RandomForestRegressor(n_estimators=...,max_depth=...,min_samples_leaf=...) # TO COMPLETE
model = RF.fit(X_train,y_train)
y_pred = model.predict(X_test)

In [None]:
# compute the performance on the test set
R2 = RF.score(X_test,y_test)
print('R2  (test):',R2)
mse = mean_squared_error(y_test,y_pred) # arguments in the documented order (true values, predictions)
print('MSE (test):',mse) 
plt.title('Correlation between actual and predicted values')
plt.xlabel('actual values')
plt.ylabel('predicted values')
plt.scatter(y_test,y_pred,color='royalblue')
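# reference line 'predicted = actual', drawn over the range of the true values
lims = [y_test.min(),y_test.max()]
plt.plot(lims,lims,color='red')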
plt.show()

WRITE YOUR ANSWER HERE

## AdaBoost

We can also consider boosting methods based on trees (see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html and the example in https://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_regression.html).
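
Boosting adds weak learners one after the other, so it is instructive to look at how the test error evolves with the number of learners. The sketch below does this on synthetic data with the method `staged_predict` of `AdaBoostRegressor`; the data, the depth-3 weak learner and the names `boost_demo`, `staged_mse` are assumptions made for illustration. The weak learner is passed as the first positional argument, since the name of the corresponding keyword depends on the scikit-learn version.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic 1D regression problem (NOT the bike sharing data)
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.uniform(0.0, 6.0, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + 0.3 * rng_demo.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0
)

# weak learner: a shallow tree (hypothetical choice), passed positionally
boost_demo = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=3), n_estimators=50, random_state=0
)
boost_demo.fit(X_tr, y_tr)

# test MSE after 1, 2, ... weak learners (boosting may stop before n_estimators)
staged_mse = [
    mean_squared_error(y_te, y_stage) for y_stage in boost_demo.staged_predict(X_te)
]
print("number of weak learners actually fitted:", len(boost_demo.estimators_))
print("test MSE with 1 learner   :", staged_mse[0])
print("test MSE with all learners:", staged_mse[-1])
```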

In [None]:
from sklearn.ensemble import AdaBoostRegressor
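# the weak learner is passed with the keyword 'estimator' (called 'base_estimator' before scikit-learn 1.2)
ar = AdaBoostRegressor(estimator = dt_final)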

We find the best parameters for AdaBoostRegressor, using as base learner the decision tree found in Question 3.

**Question 6.** Fit the best model with the hyperparameters found by grid search, and compare its predictive performance to the one found with a single tree and random forests.

In [None]:
print('Tree maximal depth (weak learner) =',dt_final.max_depth)
print('Minimal number of samples in the leaves (weak learner) = ',dt_final.min_samples_leaf,'\n')
# key parameter: number of weak learners to be considered
print('Grid search to find optimal parameters')
ar_params = {'n_estimators':np.arange(10,200,10)}
gs_ar = GridSearchCV(...) # TO COMPLETE
gs_ar.fit(X_train,y_train)
c = ... # TO COMPLETE
print('- Best number of weak learners (trees) = ',...,'\n')

In [None]:
# Fitting the model with best params
ab_dt = AdaBoostRegressor(...) # TO COMPLETE
model = ab_dt.fit(X_train,y_train)
y_pred = model.predict(X_test)

In [None]:
# computing the performance on the test set
R2 = ab_dt.score(X_test,y_test)
print('R2  (test):',R2)
mse = mean_squared_error(y_test,y_pred) # arguments in the documented order (true values, predictions)
print('MSE (test):',mse) 
plt.title('Correlation between actual and predicted values')
plt.xlabel('actual values')
plt.ylabel('predicted values')
plt.scatter(y_test,y_pred,color='royalblue')
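# reference line 'predicted = actual', drawn over the range of the true values
lims = [y_test.min(),y_test.max()]
plt.plot(lims,lims,color='red')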
plt.show()

WRITE YOUR ANSWER HERE

## Extensions and complements

Here is a list of extensions and complements which can be addressed in the final project:
- perform cost-complexity pruning for trees by adapting https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
- use XGBoost https://xgboost.readthedocs.io/en/stable/python/python_intro.html, as discussed 
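
As a possible starting point for the first extension, here is a minimal sketch of cost-complexity pruning for a regression tree, on synthetic data (assumed for illustration; the names `path`, `alphas` and `results` are hypothetical). The scikit-learn example linked above treats classification; the sketch relies on the same method `cost_complexity_pruning_path` and parameter `ccp_alpha`, which are also available for `DecisionTreeRegressor`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic 1D regression problem (NOT the bike sharing data)
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.uniform(0.0, 6.0, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + 0.3 * rng_demo.normal(size=300)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0
)

# increasing sequence of pruning strengths associated with the fully grown tree
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
# keep roughly 10 values to limit the number of fits
alphas = path.ccp_alphas[:: max(1, len(path.ccp_alphas) // 10)]

results = []
for alpha in alphas:
    # guard against tiny negative values caused by rounding errors
    alpha = max(float(alpha), 0.0)
    pruned = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    results.append((alpha, pruned.get_n_leaves(), pruned.score(X_te, y_te)))
    print(f"ccp_alpha={alpha:.5f}: {results[-1][1]} leaves, R2 test={results[-1][2]:.3f}")
```

In a project, the value of `ccp_alpha` would rather be selected by cross validation on the training set than by looking at the test score as done in this sketch.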