<a href="https://colab.research.google.com/github/dvisionst/Ensemble_Trees_Core/blob/main/Ensemble_Trees_Core.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ensemble Trees Exercise (CORE)

- Jose Flores
- 29 July 2022

You will use the Boston Housing Data that you have used for previous exercises, including the Decision Tree Regressor. See if you can improve your results by using these ensemble methods!

Your task is to create the best possible model to predict house prices.



In [56]:
# Import mandatory libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor



In [57]:
# Creating Dataframe from the Housing Data

df = pd.read_csv('/content/Boston_Housing_from_Sklearn.csv')
df.head()

Unnamed: 0,CRIM,NOX,RM,AGE,PTRATIO,LSTAT,PRICE
0,0.00632,0.538,6.575,65.2,15.3,4.98,24.0
1,0.02731,0.469,6.421,78.9,17.8,9.14,21.6
2,0.02729,0.469,7.185,61.1,17.8,4.03,34.7
3,0.03237,0.458,6.998,45.8,18.7,2.94,33.4
4,0.06905,0.458,7.147,54.2,18.7,5.33,36.2


In [58]:
# Setting up my features matrix and my target vector

y = df['PRICE']

X = df.drop(columns='PRICE')
X.info()
# X.info() shows no missing values and that the features matrix 
# is all numerical data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   NOX      506 non-null    float64
 2   RM       506 non-null    float64
 3   AGE      506 non-null    float64
 4   PTRATIO  506 non-null    float64
 5   LSTAT    506 non-null    float64
dtypes: float64(6)
memory usage: 23.8 KB


In [59]:
# Splitting data into two sets, training and testing:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 1) Using different Tree models

## Make a Decision Tree



In [60]:
# Creating the decision tree; tree-based models do not need feature scaling
decision_tree = DecisionTreeRegressor(random_state = 42)

# Fitting the model to the training data

decision_tree.fit(X_train, y_train);


In [61]:
# predicting the target values for both training and test sets

train_prediction = decision_tree.predict(X_train)
test_prediction = decision_tree.predict(X_test)

In [62]:
# evaluating the model
train_score = round(decision_tree.score(X_train, y_train), 3)
test_score = round(decision_tree.score(X_test, y_test), 3)
print(train_score)
print(test_score)



1.0
0.619


## Bagged Tree

In [63]:
# creating the bagged tree model

bag_tree = BaggingRegressor(random_state = 42)

In [64]:
# looking at the hyperparameters for the model

bag_tree.get_params()

{'base_estimator': None,
 'bootstrap': True,
 'bootstrap_features': False,
 'max_features': 1.0,
 'max_samples': 1.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [65]:
# fitting the model on the training set
bag_tree.fit(X_train, y_train)
# using the model to make predictions
bag_tree.predict(X_test)

array([24.04, 30.77, 18.39, 24.04, 16.09, 20.4 , 19.13, 15.03, 21.07,
       21.39, 18.87, 19.46,  7.68, 19.44, 18.93, 25.27, 19.06,  7.87,
       44.92, 14.98, 24.01, 23.58, 14.26, 24.63, 14.15, 12.82, 20.67,
       14.2 , 19.37, 20.33, 20.6 , 23.18, 31.23, 21.4 , 13.94, 15.81,
       36.2 , 19.6 , 20.22, 24.6 , 18.88, 25.75, 44.15, 20.44, 22.72,
       14.5 , 14.95, 24.43, 16.76, 28.32, 22.93, 34.44, 15.92, 25.6 ,
       47.42, 22.56, 15.93, 31.44, 21.34, 20.25, 27.03, 33.4 , 27.06,
       19.23, 28.11, 16.18, 14.58, 22.82, 28.31, 16.5 , 19.59, 25.86,
        9.79, 21.71, 21.47,  6.94, 20.52, 46.12, 11.9 , 14.74, 20.45,
       11.14, 20.37,  9.44, 20.4 , 26.58, 16.95, 23.41, 24.52, 17.98,
       23.  ,  7.34, 18.97, 20.09, 26.26, 20.12, 35.89, 11.42, 12.12,
       12.27, 20.1 , 23.  , 11.72, 23.22, 20.29, 15.51, 18.07, 25.05,
       21.62, 23.58,  7.73, 14.05, 21.61, 22.51, 33.87, 12.38, 43.51,
       16.17, 18.67, 24.28, 20.11, 24.72,  8.68, 20.84, 24.5 , 21.74,
       24.2 ])

In [66]:
# evaluating the performance of the model

bag_tree_train_score = round(bag_tree.score(X_train, y_train), 3)
bag_tree_test_score = round(bag_tree.score(X_test, y_test), 3)
print(bag_tree_train_score)
print(bag_tree_test_score)



0.961
0.82
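
As a rough illustration of what bagging does conceptually, the sketch below fits several simple models on bootstrap resamples (drawn with replacement) and averages their predictions. It is not scikit-learn's implementation: `BaggingRegressor` uses full decision trees by default, whereas this sketch uses a hypothetical one-split "stump" on made-up one-dimensional data, and the helper names `fit_stump`, `predict_stump`, `fit_bagged_stumps` and `predict_bagged` are assumptions introduced only for this example.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression stump on 1-D data by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue  # a valid split needs points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values are identical: fall back to predicting the mean everywhere
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    """Predict with a stump stored as (threshold, left_mean, right_mean)."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_estimators=10, seed=42):
    """Fit n_estimators stumps, each on its own bootstrap resample."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        # Bootstrap: draw n row indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Average the predictions of all stumps in the ensemble."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)


# Made-up toy data, for illustration only (not the housing data)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [10.0, 11.0, 12.0, 13.0, 20.0, 21.0, 22.0, 23.0]

ensemble = fit_bagged_stumps(xs, ys, n_estimators=25)
low = predict_bagged(ensemble, 1.5)
high = predict_bagged(ensemble, 7.5)
print(low, high)
```

Because each stump sees a slightly different resample, the individual split points differ, and averaging them tends to smooth out the predictions of any single model. That averaging is a plausible reason the bagged model above scores better on the test set than the single unconstrained decision tree.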


## Random Forest

In [67]:
# creating the model and obtaining its hyperparameters
forest = RandomForestRegressor(random_state=42)

forest.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [68]:
# fitting the model on training set and making predictions

forest.fit(X_train, y_train)

forest.predict(X_test)

array([22.986, 31.391, 19.003, 23.141, 16.213, 20.666, 18.768, 15.219,
       21.251, 20.809, 20.253, 20.247,  8.237, 21.228, 19.717, 26.426,
       19.432,  8.497, 46.203, 15.325, 23.637, 23.557, 14.31 , 24.344,
       15.369, 13.575, 21.195, 13.96 , 18.668, 21.416, 19.64 , 23.35 ,
       28.457, 21.533, 14.143, 16.065, 34.532, 19.198, 20.46 , 23.926,
       18.542, 28.025, 45.118, 19.994, 22.885, 14.364, 15.116, 23.797,
       17.815, 28.089, 21.717, 34.018, 16.448, 25.876, 44.673, 21.957,
       16.028, 31.978, 21.921, 20.542, 26.234, 33.55 , 30.222, 19.88 ,
       27.288, 16.302, 14.934, 22.961, 27.268, 17.147, 20.538, 30.51 ,
       10.187, 21.264, 21.262,  7.225, 20.097, 46.97 , 12.082, 13.522,
       22.008, 12.609, 20.435,  8.976, 20.58 , 27.007, 16.026, 23.329,
       24.346, 17.787, 22.135,  7.881, 18.524, 20.042, 25.241, 19.298,
       32.793, 13.215, 12.961, 12.98 , 19.742, 24.277, 13.176, 20.387,
       21.179, 14.004, 19.233, 24.822, 20.402, 24.114,  9.165, 14.91 ,
      

In [69]:
# evaluating the performance of the model

forest_train_score = round(forest.score(X_train, y_train), 3)
forest_test_score = round(forest.score(X_test, y_test), 3)

print(forest_train_score)
print(forest_test_score)

0.977
0.834


# 2) Tune each model to optimize performance on the test set.



In [70]:
# obtaining the depth for the decision tree model
decision_tree.get_depth()

20

In [71]:
# creating a loop in order to determine the optimal max depth for the model

levels = list(range(2, 20))

# storing the values of each max depth in a df
scores = pd.DataFrame(index=levels, columns=['Test Score', 'Train Score'])
# looping through levels list of depths

for depth in levels:
  decision_tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
  decision_tree.fit(X_train, y_train)
  train_score = decision_tree.score(X_train, y_train)
  test_score = decision_tree.score(X_test, y_test)
  scores.loc[depth, 'Train Score'] = train_score
  scores.loc[depth, 'Test Score'] = test_score

# sorting the values to identify the optimal max depth

sorted_scores = scores.sort_values(by='Test Score', ascending=False)
sorted_scores.head()

Unnamed: 0,Test Score,Train Score
7,0.846377,0.958517
10,0.84601,0.986796
11,0.829736,0.9911
12,0.827102,0.995358
6,0.825985,0.942742


The sorted scores above show that the optimal max depth for the decision tree model is 7, with a test R^2 of about 0.846.
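
The selection step can be mirrored in plain Python using only the five rows printed in the table above. The sketch also computes the train-minus-test gap for each depth, which is used here as a rough, assumed indicator of overfitting; the names `depth_scores`, `best_depth` and `gaps` are introduced only for this illustration.

```python
# (test R^2, train R^2) per max_depth, copied from the sorted table above
depth_scores = {
    7: (0.846377, 0.958517),
    10: (0.84601, 0.986796),
    11: (0.829736, 0.9911),
    12: (0.827102, 0.995358),
    6: (0.825985, 0.942742),
}

# Depth with the highest test score
best_depth = max(depth_scores, key=lambda d: depth_scores[d][0])

# Gap between train and test R^2 for each depth (larger gap = more overfitting)
gaps = {d: round(train - test, 3) for d, (test, train) in depth_scores.items()}

print(best_depth)
print(gaps)
```

Depths 7 and 10 are nearly tied on the test score, but depth 7 has the smaller train/test gap among these rows, which is a further reason to prefer the shallower tree.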

In [72]:
# running the decision tree model with max_depth set to 7, the best value
# found in the previous code block

decision_tree_7 = DecisionTreeRegressor(max_depth=7, random_state=42)
decision_tree_7.fit(X_train, y_train)

train_7_score = round(decision_tree_7.score(X_train, y_train), 3)
test_7_score = round(decision_tree_7.score(X_test, y_test), 3)

print(train_7_score)
print(test_7_score)

0.959
0.846


## Optimizing Bagged Tree Model


In [73]:
# List of estimator values
estimators = [10, 20, 30, 40, 50, 100]
# Data frame to store the scores
scores_b = pd.DataFrame(index=estimators, columns=['Train Score', 'Test Score'])
# Iterate through the values to find the best number of estimators
for num_estimators in estimators:
   bag_reg = BaggingRegressor(n_estimators=num_estimators, random_state=42)
   bag_reg.fit(X_train, y_train)
   train_score_b = bag_reg.score(X_train, y_train)
   test_score_b = bag_reg.score(X_test, y_test)
   scores_b.loc[num_estimators, 'Train Score'] = train_score_b
   scores_b.loc[num_estimators, 'Test Score'] = test_score_b


scores_b = scores_b.sort_values(by='Test Score', ascending=False)
scores_b.head()


Unnamed: 0,Train Score,Test Score
40,0.97395,0.834365
50,0.975185,0.83391
100,0.977246,0.833051
20,0.9701,0.831147
30,0.973401,0.830604


In [74]:
# Save the index value of the best test score
# (scores_b is sorted by test score, so its first index is the best n_estimators)
best_n_estimators = int(scores_b.index[0])
# Instantiate and fit the best version of the model
bag_reg_tuned = BaggingRegressor(n_estimators=best_n_estimators, random_state=42)
bag_reg_tuned.fit(X_train, y_train)
# Evaluate the model
print(round(bag_reg_tuned.score(X_train, y_train), 3))
print(round(bag_reg_tuned.score(X_test, y_test), 3))


0.974
0.834


## Optimizing Random Forest Model

In [75]:
# obtaining the depth of each tree in the random forest model when max_depth is
# unlimited using list comprehension
est_depths = [estimator.get_depth() for estimator in forest.estimators_]
max(est_depths)


23

In [76]:
# looping over max_depth values to find the best depth for the random forest

depths = range(1, max(est_depths))
scores_f = pd.DataFrame(index=depths, columns=['Test Score', 'Train Score'])
for level in depths:    
   model = RandomForestRegressor(max_depth=level)
   model.fit(X_train, y_train)
   scores_f.loc[level, 'Train Score'] = model.score(X_train, y_train)
   scores_f.loc[level, 'Test Score'] = model.score(X_test, y_test)
   
sorted_scores_f = scores_f.sort_values(by='Test Score', ascending=False)
sorted_scores_f.head()

Unnamed: 0,Test Score,Train Score
12,0.841706,0.973439
9,0.837807,0.970105
21,0.835787,0.975739
13,0.834025,0.976309
11,0.831149,0.972756


In [77]:
# choose a couple of values for n_estimators to save time
# you can use another loop later to narrow down the best number
# by trying numbers close to the best one
n_ests = [50, 100, 150, 200, 250]
scores2 = pd.DataFrame(index=n_ests, columns=['Test Score', 'Train Score'])
for n in n_ests:
   model = RandomForestRegressor(max_depth=29, n_estimators=n)
   model.fit(X_train, y_train)
   scores2.loc[n, 'Train Score'] = model.score(X_train, y_train)
   scores2.loc[n, 'Test Score'] = model.score(X_test, y_test)
scores2.head()


Unnamed: 0,Test Score,Train Score
50,0.846184,0.976891
100,0.836432,0.974774
150,0.837664,0.977161
200,0.82523,0.976292
250,0.830019,0.977108


In [78]:
# Instantiate and fit a tuned version of the model
# (max_depth=21 ranked among the top depths in the depth search above)
forest_tuned = RandomForestRegressor(max_depth=21, random_state=42)
forest_tuned.fit(X_train, y_train)
# Evaluate the model
print(round(forest_tuned.score(X_train, y_train), 3))
print(round(forest_tuned.score(X_test, y_test), 3))

0.977
0.834


# 3) Evaluate your best model using multiple regression metrics



In [90]:
# The decision tree (max_depth=7) is the best model: it has the highest test
# R^2 (0.846) while still performing well on the training set (R^2 of 0.959).
# The metrics below are therefore computed for the decision tree model.

dt_train_pred = decision_tree_7.predict(X_train)
dt_test_pred = decision_tree_7.predict(X_test)

train_R2 = round(r2_score(y_train, dt_train_pred) , 3)
test_R2 = round(r2_score(y_test, dt_test_pred), 3)

print(f"The R^2 value for the Training Model is: {train_R2}")
print(f"The R^2 value for the Testing Model is: {test_R2}")

The R^2 value for the Training Model is: 0.959
The R^2 value for the Testing Model is: 0.846


In [91]:
# Using sklearn module to apply MAE metric to model 

train_MAE = round(mean_absolute_error(y_train, dt_train_pred), 2)
test_MAE = round(mean_absolute_error(y_test, dt_test_pred), 2)
print(f"The MAE for the training set is: {train_MAE}")
print(f"The MAE for the testing set is: {test_MAE}")

The MAE for the training set is: 1.35
The MAE for the testing set is: 2.45


In [92]:
# Using sklearn module to apply MSE metric to model 

train_MSE = round(mean_squared_error(y_train, dt_train_pred), 2)
test_MSE = round(mean_squared_error(y_test, dt_test_pred), 2)

print(f"The MSE value for the training model is {train_MSE}")
print(f"The MSE value for the testing model is {test_MSE}")

The MSE value for the training model is 3.68
The MSE value for the testing model is 10.76


In [93]:
# Taking the square root of the MSE with numpy to obtain the RMSE

train_RMSE = round(np.sqrt(train_MSE), 3)
test_RMSE = round(np.sqrt(test_MSE), 3)

print(f"Model for training RMSE:  {train_RMSE}")
print(f"Model for testing RMSE: {test_RMSE}")

Model for training RMSE:  1.918
Model for testing RMSE: 3.28
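
For reference, the four metrics used above can be reproduced by hand. The sketch below is a plain-Python version of the standard formulas, run on made-up values rather than the housing data; the helper name `regression_metrics` is an assumption introduced for this example and is not part of scikit-learn.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # same units as the target

    # R^2 = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Made-up values, for illustration only
y_true = [20.0, 25.0, 30.0, 35.0]
y_pred = [22.0, 24.0, 29.0, 38.0]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MAE and RMSE are both expressed in the units of the target, which makes them easier to interpret than MSE; RMSE penalizes large errors more heavily, so a test RMSE noticeably above the test MAE hints at a few large misses.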


# 4) Explain in a text cell how your model will perform if deployed by referring to the metrics

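I have confidence in the model because its training R^2 is approximately 0.96 and it performed best on the test set, with an R^2 of approximately 0.85. On the test set the MAE is 2.45 and the RMSE is 3.28, both in the units of PRICE, so a typical prediction error is roughly 2.5 to 3.3 PRICE units. The gap between the training and test scores indicates that the model still overfits somewhat (high variance), but given how the other models performed it is the best current option. Ideally, more data should be gathered so that a better, more balanced model can be created.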

