# Decision Tree Regressor

In [1]:
# import libraries
import numpy as np
import pandas as pd

In [2]:
# load bike data
df_bikes = pd.read_csv('bike_rentals_cleaned.csv')

In [3]:
# load first five rows
df_bikes.head()

,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
0,1,1.0,0.0,1,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,985
1,2,1.0,0.0,1,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,801
2,3,1.0,0.0,1,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,1349
3,4,1.0,0.0,1,0.0,2.0,1.0,1,0.2,0.212122,0.590435,0.160296,1562
4,5,1.0,0.0,1,0.0,3.0,1.0,1,0.226957,0.22927,0.436957,0.1869,1600


In [4]:
# check info
df_bikes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   season      731 non-null    float64
 2   yr          731 non-null    float64
 3   mnth        731 non-null    int64  
 4   holiday     731 non-null    float64
 5   weekday     731 non-null    float64
 6   workingday  731 non-null    float64
 7   weathersit  731 non-null    int64  
 8   temp        731 non-null    float64
 9   atemp       731 non-null    float64
 10  hum         731 non-null    float64
 11  windspeed   731 non-null    float64
 12  cnt         731 non-null    int64  
dtypes: float64(9), int64(4)
memory usage: 74.4 KB


In [5]:
# understand the data
df_bikes.describe()

,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
count,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0
mean,366.0,2.49658,0.500684,6.519836,0.028728,2.997264,0.682627,1.395349,0.495423,0.474391,0.627908,0.190476,4504.348837
std,211.165812,1.110807,0.500342,3.451913,0.167155,2.004787,0.465773,0.544894,0.183023,0.162938,0.142074,0.077458,1937.211452
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.0,0.022392,22.0
25%,183.5,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.337083,0.337842,0.522291,0.13495,3152.0
50%,366.0,3.0,1.0,7.0,0.0,3.0,1.0,1.0,0.498333,0.486733,0.6275,0.181596,4548.0
75%,548.5,3.0,1.0,10.0,0.0,5.0,1.0,2.0,0.655417,0.608602,0.729791,0.233206,5956.0
max,731.0,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.861667,0.840896,0.9725,0.507463,8714.0


In [6]:
# check for missing and null values
df_bikes.isna().sum()

instant       0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
cnt           0
dtype: int64

No column contains missing values, so the data is ready to be used for our model.

In [7]:
# define X and y variables
X = df_bikes.iloc[:, :-1]
y = df_bikes.iloc[:, -1]

In [8]:
# import Decision Tree Regressor and cross_val
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Initialize DecisionTreeRegressor and fit the model in cross_val_score:
model = DecisionTreeRegressor(random_state=2)
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=5)

In [9]:
# Compute the root mean squared error (RMSE) and print the results:
rmse = np.sqrt(-scores)
print('RMSE mean: %0.2f' % (rmse.mean()))

RMSE mean: 1199.79
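
The conversion from negated MSE to RMSE can be illustrated with a small worked example. This is a minimal sketch with made-up fold scores (the values in `neg_mse_scores` are hypothetical, not taken from the bike data). It also shows that the mean of the per-fold RMSEs, as computed in the cell above, is not the same quantity as the square root of the mean MSE, which is what the later `np.sqrt(-grid_reg.best_score_)` calls compute.

```python
import numpy as np

# Hypothetical fold scores in the form cross_val_score returns them with
# scoring='neg_mean_squared_error' (illustrative values only).
neg_mse_scores = np.array([-4.0, -9.0, -16.0])

# Negate to recover each fold's MSE, then take the square root for RMSE.
rmse_per_fold = np.sqrt(-neg_mse_scores)

# Mean of the per-fold RMSEs (the approach used in the cell above).
mean_of_rmse = rmse_per_fold.mean()

# Square root of the mean MSE (a different, slightly larger quantity).
rmse_of_mean = np.sqrt(-neg_mse_scores.mean())

print('Mean of per-fold RMSEs: %0.3f' % mean_of_rmse)
print('RMSE of mean MSE:       %0.3f' % rmse_of_mean)
```

The two numbers are usually close, but they should not be compared as if they were identical metrics.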


Is this error so high because the model is overfitting the data, that is, because its variance is too high?

In [11]:
# Split the data, fit the model on the training set, and predict on that same training set to measure the training error:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_train)

In [13]:
from sklearn.metrics import mean_squared_error 
model_mse = mean_squared_error(y_train, y_pred)
model_rmse = np.sqrt(model_mse)
model_rmse

0.0

An RMSE of 0.0 means that the model has perfectly fit every data point in the training set! This perfect score, combined with a cross-validation error of 1199.79, is strong evidence that the decision tree is overfitting the data with high variance. The model fits the data it was trained on perfectly, but misses badly on data it has not seen. Hyperparameters may rectify the situation.
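
The same pattern can be reproduced on data where the noise level is known. The following is a minimal sketch on synthetic data (the noisy sine curve and the names `X_demo`, `y_demo`, `X_tr`, and `X_te` are assumptions for illustration, not part of the bike rentals analysis). An unrestricted tree memorizes the training noise, so its training error is at or very near zero while its error on held-out data is not.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a sine curve plus Gaussian noise.
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)

# No depth or leaf-size limits: the tree splits until leaves cannot be split further.
tree = DecisionTreeRegressor(random_state=2).fit(X_tr, y_tr)

train_rmse = np.sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
test_rmse = np.sqrt(mean_squared_error(y_te, tree.predict(X_te)))

print('Train RMSE: %0.3f' % train_rmse)
print('Test RMSE:  %0.3f' % test_rmse)
```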

## Using Hyperparameters with Decision Tree Regressor

Generally speaking, decreasing hyperparameters that set a maximum (such as max_depth) and increasing hyperparameters that set a minimum (such as min_samples_leaf) will reduce variance and help prevent overfitting.
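
To make the effect of a max hyperparameter concrete, the sketch below compares an unrestricted tree with one limited to max_depth=3. It assumes synthetic data (`X_demo` and `y_demo` are hypothetical names, not the bike data) and uses the estimator's `get_depth()` and `get_n_leaves()` methods to show how much smaller the restricted tree is.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a sine curve plus Gaussian noise.
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=200)

unrestricted = DecisionTreeRegressor(random_state=2).fit(X_demo, y_demo)
shallow = DecisionTreeRegressor(max_depth=3, random_state=2).fit(X_demo, y_demo)

# A binary tree of depth 3 can have at most 2**3 = 8 leaves.
for name, tree in [('unrestricted', unrestricted), ('max_depth=3', shallow)]:
    print(name, '| depth:', tree.get_depth(), '| leaves:', tree.get_n_leaves())
```

Fewer leaves means fewer distinct predictions, which is why a shallow tree cannot memorize individual training points.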

In [30]:
# import GridSearchCV
from sklearn.model_selection import GridSearchCV
params = {'max_depth':[None,2,3,4,6,8,10,20]}

reg_model = DecisionTreeRegressor(random_state=2)
grid_reg = GridSearchCV(reg_model, params, scoring='neg_mean_squared_error', cv=5, n_jobs=-1)

In [31]:
# fit the data
grid_reg.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeRegressor(random_state=2), n_jobs=-1,
             param_grid={'max_depth': [None, 2, 3, 4, 6, 8, 10, 20]},
             scoring='neg_mean_squared_error')

Now that GridSearchCV has been fit on the data, you can view the best hyperparameters as follows:

In [32]:
best_params = grid_reg.best_params_
print("Best params: ", best_params)

Best params:  {'max_depth': 6}


As you can see, a max_depth value of 6 resulted in the best mean cross-validation score on the training set. This training score is stored in the best_score_ attribute as a negated mean squared error, so we negate it and take the square root to display it as an RMSE:

In [33]:
best_score = np.sqrt(-grid_reg.best_score_)
print("Best score: {:.3f}".format(best_score))

Best score: 951.398
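
The best score is only one entry in the search results. As a sketch of how the scores of all candidates might be inspected, the example below reads the `cv_results_` attribute of a fitted GridSearchCV. It runs on synthetic stand-in data from `make_regression` (an assumption; `X_demo`, `y_demo`, and `demo_grid` are hypothetical names), so its numbers will not match the bike rentals results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption; not the bike rentals set).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=2)

demo_grid = GridSearchCV(
    DecisionTreeRegressor(random_state=2),
    {'max_depth': [None, 2, 4, 6]},
    scoring='neg_mean_squared_error',
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# cv_results_ holds one entry per candidate; mean_test_score is the mean
# negated MSE over the five validation folds.
results = demo_grid.cv_results_
for candidate, mean_score in zip(results['params'], results['mean_test_score']):
    print(candidate, 'CV RMSE: %0.3f' % np.sqrt(-mean_score))
```

Looking at the full list shows whether the winning value is a clear optimum or only marginally better than its neighbors.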


The test score may be displayed as follows:

In [34]:
best_model = grid_reg.best_estimator_
y_pred = best_model.predict(X_test)
rmse_test = mean_squared_error(y_test, y_pred)**0.5
print('Test score: {:.3f}'.format(rmse_test))

Test score: 864.670


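The test RMSE of 864.670 is well below the 1199.79 cross-validation RMSE of the unrestricted tree, which suggests that limiting max_depth has substantially reduced variance.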

min_samples_leaf restricts the tree by setting the minimum number of samples that a leaf must contain. As with max_depth, min_samples_leaf is designed to reduce overfitting.
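
The restriction can be checked on a fitted tree. The following is a minimal sketch on synthetic data (`X_demo` and `y_demo` are hypothetical names, not the bike data) that fits a tree with min_samples_leaf=8 and reads the leaf sizes from the fitted `tree_` structure, where a node whose `children_left` entry is -1 is a leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a sine curve plus Gaussian noise.
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=200)

tree = DecisionTreeRegressor(min_samples_leaf=8, random_state=2).fit(X_demo, y_demo)

# In the fitted tree_ structure, a node with children_left == -1 is a leaf.
is_leaf = tree.tree_.children_left == -1
leaf_sizes = tree.tree_.n_node_samples[is_leaf]

print('Number of leaves:', is_leaf.sum())
print('Smallest leaf size:', leaf_sizes.min())
```

Because each prediction is now an average over at least eight training samples, a single noisy observation has less influence on it.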
With no restrictions, the default min_samples_leaf=1 applies, meaning that a leaf may consist of a single sample, which is prone to overfitting. Increasing min_samples_leaf reduces variance. If min_samples_leaf=8, every leaf must contain eight or more samples.

Testing a range of values for min_samples_leaf requires going through the same process as before. Instead of copying and pasting, we write a function that displays the best parameters, training score, and test score using GridSearchCV, with DecisionTreeRegressor(random_state=2) assigned to reg_model as a default parameter:

In [35]:
X_train.shape

(548, 12)

In [37]:
def grid_search(params, reg_model=DecisionTreeRegressor(random_state=2)):
    # Run a 5-fold grid search over params on the training set
    grid_reg = GridSearchCV(reg_model, params, scoring='neg_mean_squared_error', cv=5, n_jobs=-1)
    grid_reg.fit(X_train, y_train)
    # Report the best hyperparameter combination
    best_params = grid_reg.best_params_
    print("Best params:", best_params)
    # Convert the best mean cross-validation score (negated MSE) to an RMSE
    best_score = np.sqrt(-grid_reg.best_score_)
    print("Training score: {:.3f}".format(best_score))
    # Score the refit best model on the held-out test set
    y_pred = grid_reg.predict(X_test)
    rmse_test = mean_squared_error(y_test, y_pred)**0.5
    print('Test score: {:.3f}'.format(rmse_test))


In [38]:
grid_search(params={'min_samples_leaf':[1, 2, 4, 6, 8, 10, 20, 30]})

Best params: {'min_samples_leaf': 8}
Training score: 895.733
Test score: 836.326


The test score (836.326) is lower than the cross-validated training score (895.733), which indicates that variance has been reduced. What happens when we put min_samples_leaf and max_depth together? Let's see:

In [39]:
grid_search(params={'max_depth':[None,2,3,4,6,8,10,20],'min_samples_leaf':[1,2,4,6,8,10,20,30]})

Best params: {'max_depth': 6, 'min_samples_leaf': 2}
Training score: 870.396
Test score: 913.000


The result may be a surprise. Even though the training score has improved, the test score has not. min_samples_leaf has decreased from 8 to 2, while max_depth has remained the same. This is a valuable lesson in hyperparameter tuning: hyperparameters should not be chosen in isolation. As for reducing variance in the preceding example, limiting min_samples_leaf to values of three or more may help:

In [40]:
grid_search(params={'max_depth':[6,7,8,9,10],'min_samples_leaf':[3,5,7,9]})

Best params: {'max_depth': 9, 'min_samples_leaf': 7}
Training score: 888.119
Test score: 878.538
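
Searching two hyperparameters together multiplies the number of candidates. As a small sketch of the cost, the example below counts the combinations in the 8 x 8 grid used earlier with scikit-learn's `ParameterGrid` (the names `joint_params`, `n_candidates`, and `n_fits` are introduced here for illustration).

```python
from sklearn.model_selection import ParameterGrid

# The joint grid from the earlier max_depth / min_samples_leaf search.
joint_params = {
    'max_depth': [None, 2, 3, 4, 6, 8, 10, 20],
    'min_samples_leaf': [1, 2, 4, 6, 8, 10, 20, 30],
}

# Every combination of the two lists is one candidate model.
n_candidates = len(ParameterGrid(joint_params))

# With cv=5, each candidate is fit five times (not counting the final refit).
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 64 320
```

Narrowing the ranges, as in the last search above, keeps the number of fits manageable while still tuning the hyperparameters jointly.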


##### We have seen how hyperparameters can help improve an algorithm's performance on test data.