# Extra Tree Algorithm (Regression)

Data Source: [Concrete Compressive Strength](https://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength)

**Data Attributes**

Each attribute is listed with its name, data type, measurement unit and a brief description. Concrete compressive strength is the target of the regression problem. The order of this listing corresponds to the order of the columns in the dataset.

Name -- Data Type -- Measurement -- Description

- Cement (component 1) -- quantitative -- kg in a m^3 mixture -- Input Variable
- Blast Furnace Slag (component 2) -- quantitative -- kg in a m^3 mixture -- Input Variable
- Fly Ash (component 3) -- quantitative -- kg in a m^3 mixture -- Input Variable
- Water (component 4) -- quantitative -- kg in a m^3 mixture -- Input Variable
- Superplasticizer (component 5) -- quantitative -- kg in a m^3 mixture -- Input Variable
- Coarse Aggregate (component 6) -- quantitative -- kg in a m^3 mixture -- Input Variable
- Fine Aggregate (component 7) -- quantitative -- kg in a m^3 mixture -- Input Variable
- Age -- quantitative -- Day (1~365) -- Input Variable
- Concrete compressive strength -- quantitative -- MPa -- Output Variable

In [1]:
# Importing the necessary packages
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from math import sqrt
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Reading the data
concrete = pd.read_csv("./concrete/Concrete_Data.csv")
concrete.head()

Unnamed: 0,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Water (component 4)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day),"Concrete compressive strength(MPa, megapascals)"
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


In [3]:
# Display the characteristics of the dataset
print("Dimensions of the dataset are: ", concrete.shape)
print("The variables present in the dataset are: \n", concrete.columns)

Dimensions of the dataset are:  (1030, 9)
The variables present in the dataset are: 
 Index(['Cement (component 1)(kg in a m^3 mixture)',
       'Blast Furnace Slag (component 2)(kg in a m^3 mixture)',
       'Fly Ash (component 3)(kg in a m^3 mixture)',
       'Water  (component 4)(kg in a m^3 mixture)',
       'Superplasticizer (component 5)(kg in a m^3 mixture)',
       'Coarse Aggregate  (component 6)(kg in a m^3 mixture)',
       'Fine Aggregate (component 7)(kg in a m^3 mixture)', 'Age (day)',
       'Concrete compressive strength(MPa, megapascals) '],
      dtype='object')


In [4]:
# Rename the columns for simplicity (X1..X8 are the inputs, Y is the compressive strength)
concrete.columns = ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "Y"]
print("The variables present in the dataset are: \n", concrete.columns)

The variables present in the dataset are: 
 Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'Y'], dtype='object')
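
Because the short names discard the units and meanings, it can help to keep a lookup from short name to original name. The following is a minimal sketch, not part of the original notebook: `name_map` and `demo` are hypothetical names, the column labels are copied from the output of cell 3, and the single data row is the first row shown by `concrete.head()`.

```python
import pandas as pd

# Original column labels, copied from the output of cell 3.
original_names = [
    "Cement (component 1)(kg in a m^3 mixture)",
    "Blast Furnace Slag (component 2)(kg in a m^3 mixture)",
    "Fly Ash (component 3)(kg in a m^3 mixture)",
    "Water  (component 4)(kg in a m^3 mixture)",
    "Superplasticizer (component 5)(kg in a m^3 mixture)",
    "Coarse Aggregate  (component 6)(kg in a m^3 mixture)",
    "Fine Aggregate (component 7)(kg in a m^3 mixture)",
    "Age (day)",
    "Concrete compressive strength(MPa, megapascals) ",
]
short_names = ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "Y"]

# Hypothetical lookup table: short name -> original label.
name_map = dict(zip(short_names, original_names))

# One-row stand-in for the real data (first row of concrete.head()).
demo = pd.DataFrame(
    [[540.0, 0.0, 0.0, 162.0, 2.5, 1040.0, 676.0, 28, 79.99]],
    columns=original_names,
)
demo.columns = short_names

# Look up what a short name stands for.
print("X8 ->", name_map["X8"])
```

With such a table, a result reported for `X8` can be traced back to the age of the sample in days.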


In [5]:
# Seed NumPy's global random number generator so that the random train-test split below is reproducible
np.random.seed(3000)

In [6]:
# Train-Test Split
training, test = train_test_split(concrete, test_size = 0.3)

x_trg = training.drop("Y", axis = 1)
y_trg = training["Y"]

x_test = test.drop("Y", axis = 1)
y_test = test["Y"]
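
An alternative to seeding the global generator is to pass `random_state` directly to `train_test_split`, which pins the shuffle for that call only. The sketch below illustrates this on a hypothetical random DataFrame (`demo`) that stands in for the concrete data; the seed value 3000 is reused only for illustration and would not necessarily reproduce the split used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the concrete data: 100 rows, columns X1..X8 and Y.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(size=(100, 9)),
    columns=["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "Y"],
)

# random_state fixes the shuffle for this call, independent of np.random.seed.
train_a, test_a = train_test_split(demo, test_size=0.3, random_state=3000)
train_b, test_b = train_test_split(demo, test_size=0.3, random_state=3000)

# Both calls select exactly the same rows.
print("Same training rows:", train_a.index.equals(train_b.index))
print("Same test rows:", test_a.index.equals(test_b.index))
```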

### Creating Extra Tree Model

In [7]:
# Model building
concrete_extratree = ExtraTreesRegressor()

# Fit the model
concrete_extratree.fit(x_trg, y_trg)
# For a regressor, score() returns the coefficient of determination (R^2), not an accuracy
print("R^2 of Extra Tree model on training set is: ", concrete_extratree.score(x_trg, y_trg))
print("R^2 of Extra Tree model on test set is: ", concrete_extratree.score(x_test, y_test))

# Prediction via Extra Tree model
concrete_extratree_pred = concrete_extratree.predict(x_test)

# Compute the RMSE of the model
concrete_extratree_rmse = sqrt(mean_squared_error(y_test, concrete_extratree_pred))
print("The RMSE for Extra Tree model is: ", concrete_extratree_rmse)

R^2 of Extra Tree model on training set is:  0.9952080224890555
R^2 of Extra Tree model on test set is:  0.9126882844900375
The RMSE for Extra Tree model is:  5.018593068140219
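
For a regressor, `score` returns the coefficient of determination R^2, and the RMSE is the square root of the mean squared residual, expressed in the units of the target (MPa here). The sketch below computes both by hand and compares them with scikit-learn's functions. The arrays `y_true` and `y_pred` are hypothetical values chosen for illustration, not results from the concrete data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions (not taken from the concrete data).
y_true = np.array([30.0, 45.0, 60.0, 25.0])
y_pred = np.array([32.0, 43.0, 57.0, 28.0])

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

# R^2 = 1 - SS_res / SS_tot.
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print("RMSE (manual, sklearn):", rmse_manual, rmse_sklearn)
print("R^2  (manual, sklearn):", r2_manual, r2_sklearn)
```

An R^2 of 1 means the predictions match the targets exactly, while a model that always predicts the mean of the targets has an R^2 of 0.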


#### Creating a new Extra Tree model with Grid Search

In [8]:
# Import the necessary package 
from sklearn.model_selection import GridSearchCV

In [9]:
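# Setting the parameters
# Note: "auto" and "mse" are accepted only by older scikit-learn releases;
# newer releases spell these options 1.0 (all features) and "squared_error".
param_grid = {"max_features" : ["auto", "sqrt", "log2"], "min_samples_leaf" : [0.5, 1], "criterion" : ["mse"]}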

In [10]:
# Step to identify the best parameters
concrete_extratree_grid = ExtraTreesRegressor()

concrete_extratree_CV = GridSearchCV(estimator = concrete_extratree_grid, param_grid = param_grid, cv = 5)

# Fit the model
concrete_extratree_result = concrete_extratree_CV.fit(x_trg, y_trg)
print("Best Parameters: \n", concrete_extratree_CV.best_params_)

Best Parameters: 
 {'criterion': 'mse', 'max_features': 'auto', 'min_samples_leaf': 1}


In [11]:
# Model Building - new Extra Tree model with best parameters
concrete_extratree_best = ExtraTreesRegressor(criterion = concrete_extratree_result.best_params_["criterion"],
                                max_features = concrete_extratree_result.best_params_["max_features"],
                                min_samples_leaf = concrete_extratree_result.best_params_["min_samples_leaf"])
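
Rebuilding the estimator from `best_params_` works, but `GridSearchCV` also keeps a ready-made copy. With the default `refit=True`, the attribute `best_estimator_` is already refitted on the full training data passed to `fit`. The sketch below demonstrates this on synthetic data; the grid, the variable names, and `n_estimators=20` are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the concrete data (an assumption for illustration).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

search = GridSearchCV(
    estimator=ExtraTreesRegressor(n_estimators=20, random_state=0),
    param_grid={"max_features": ["sqrt", "log2"], "min_samples_leaf": [1, 5]},
    cv=5,
)
search.fit(X_tr, y_tr)

# With refit=True (the default), best_estimator_ is already fitted on X_tr.
best_model = search.best_estimator_
test_r2 = best_model.score(X_te, y_te)
print("Best Parameters:", search.best_params_)
print("Test R^2 of the refitted best model:", test_r2)
```

Using `best_estimator_` avoids copying each parameter by hand and guarantees that the evaluated model is the one selected by the search.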

#### Evaluating the new Extra Tree Model with best parameters

In [12]:
# Fit the new model
concrete_extratree_best.fit(x_trg, y_trg)
# For a regressor, score() returns the coefficient of determination (R^2), not an accuracy
print("R^2 of new Extra Tree model on training set is: ", concrete_extratree_best.score(x_trg, y_trg))
print("R^2 of new Extra Tree model on test set is: ", concrete_extratree_best.score(x_test, y_test))

R^2 of new Extra Tree model on training set is:  0.9952080224890555
R^2 of new Extra Tree model on test set is:  0.9200807782442204


In [13]:
# Prediction via new Extra Tree model
concrete_extratree_best_pred = concrete_extratree_best.predict(x_test)

# Compute the RMSE of new Extra Tree model
concrete_extratree_best_rmse = sqrt(mean_squared_error(y_test, concrete_extratree_best_pred))
print("The RMSE of new Extra Tree model is: ", concrete_extratree_best_rmse)

The RMSE of new Extra Tree model is:  4.8014382156797115


#### Creating a Random Forest Model

In [14]:
# Model Building
concrete_forest = RandomForestRegressor(random_state = 0)

# Fit the model
concrete_forest.fit(x_trg, y_trg)
# For a regressor, score() returns the coefficient of determination (R^2), not an accuracy
print("R^2 of Random Forest model on training set is: ", concrete_forest.score(x_trg, y_trg))
print("R^2 of Random Forest model on test set is: ", concrete_forest.score(x_test, y_test))

# Prediction via Random Forest model
concrete_forest_pred = concrete_forest.predict(x_test)

# Compute the RMSE of Random Forest model
concrete_forest_rmse = sqrt(mean_squared_error(y_test, concrete_forest_pred))
print("The RMSE of Random Forest model is: ", concrete_forest_rmse)

R^2 of Random Forest model on training set is:  0.9824288271561449
R^2 of Random Forest model on test set is:  0.8986404568056372
The RMSE of Random Forest model is:  5.407269855337327


#### Creating a Bagging Model

In [15]:
# Model Building
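# The default base estimator (a decision tree) is used, so it is not passed explicitly;
# this also avoids the keyword that was renamed in newer scikit-learn releases
concrete_bag = BaggingRegressor(n_estimators = 10, max_samples = 1.0,
                               max_features = 1.0, bootstrap = True)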

# Fit the model
concrete_bag.fit(x_trg, y_trg)
# For a regressor, score() returns the coefficient of determination (R^2), not an accuracy
print("R^2 of Bagging Model on training set is: ", concrete_bag.score(x_trg, y_trg))
print("R^2 of Bagging Model on test set is: ", concrete_bag.score(x_test, y_test))

# Prediction via Bagging Model
concrete_bag_pred = concrete_bag.predict(x_test)

# Compute the RMSE of Bagging Model
concrete_bag_rmse = sqrt(mean_squared_error(y_test, concrete_bag_pred))
print("The RMSE of Bagging Model is: ", concrete_bag_rmse)

R^2 of Bagging Model on training set is:  0.9752388798288795
R^2 of Bagging Model on test set is:  0.8837449520907321
The RMSE of Bagging Model is:  5.790974218560807


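Among the four models built above, the new Extra Tree model with the grid-search parameters has the lowest test RMSE (4.80), followed by the default Extra Tree model (5.02), Random Forest (5.41) and Bagging (5.79). It is therefore the best of these models on this train-test split. Note that the parameters selected by the grid search match the estimator's defaults in the scikit-learn release used here, so the gap between the two Extra Tree models reflects random variation in tree construction rather than a gain from tuning.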
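
A single train-test split gives one RMSE per model, so small differences can be due to chance. One way to check how stable a ranking is would be k-fold cross-validation, which yields a mean and spread of the RMSE for each model. The sketch below runs on synthetic data from `make_regression`, so its numbers say nothing about the concrete dataset; the model settings and the names `models`, `cv`, and `rmse_summary` are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    BaggingRegressor,
    ExtraTreesRegressor,
    RandomForestRegressor,
)
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the concrete data (an assumption for illustration).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=0
)

models = {
    "Extra Trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Bagging": BaggingRegressor(n_estimators=10, random_state=0),
}

# The same shuffled 5-fold partition is used for every model.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

rmse_summary = {}
for name, model in models.items():
    # The scorer returns the negative RMSE, so the sign is flipped.
    fold_rmse = -cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
    )
    rmse_summary[name] = (fold_rmse.mean(), fold_rmse.std())
    print(f"{name}: RMSE {fold_rmse.mean():.2f} +/- {fold_rmse.std():.2f}")
```

If the mean RMSE values differ by less than the fold-to-fold spread, the ranking between those models should be treated as inconclusive.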