# Apply ensemble tree methods to the Concrete dataset

The dataset comes from the following publication:
>Yeh, I-Cheng. 2006. “Analysis of Strength of Concrete Using Design of Experiments and Neural Networks.” Journal of Materials in Civil Engineering 18 (4): 597–604. https://doi.org/10.1061/(ASCE)0899-1561(2006)18:4(597)

The following results were obtained:
- Polynomial regression: training RMS = 3.96 MPa (R² = 0.890); testing RMS = 8.82 MPa (R² = 0.791)
- Neural network: training RMS = 3.01 MPa (R² = 0.940); testing RMS = 4.32 MPa (R² = 0.929)

Note: The goal of the paper is to generate detailed characterization plots for concrete mixtures from a limited set of experiments, not to predict strength for end-user applications.


In [27]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor as RF
from sklearn.ensemble import GradientBoostingRegressor as GB
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import cross_validate, KFold
import sklearn.model_selection as skm

## Load the data

In [2]:
Concrete = pd.read_csv('ConcreteData.csv',
                       header=0,
                       names=["cement", "blast_furnace_slag", "fly_ash",
                              "water", "superplasticizer", "coarse_agg",
                              "fine_agg", "age", "compressive_strength"])
Concrete.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   cement                1030 non-null   float64
 1   blast_furnace_slag    1030 non-null   float64
 2   fly_ash               1030 non-null   float64
 3   water                 1030 non-null   float64
 4   superplasticizer      1030 non-null   float64
 5   coarse_agg            1030 non-null   float64
 6   fine_agg              1030 non-null   float64
 7   age                   1030 non-null   int64  
 8   compressive_strength  1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.6 KB


## Prepare training and test sets

In [3]:
# same random state as in week 03
data_train, data_test = train_test_split(Concrete, test_size=0.2, random_state=54)
X_train, y_train = data_train.drop(columns=["compressive_strength"]), data_train["compressive_strength"]
X_test, y_test = data_test.drop(columns=["compressive_strength"]), data_test["compressive_strength"]

## Find the best model using cross-validation

In [26]:
# TODO: use kfold with n_splits=10, shuffle=True, random_state=0


RandomForestRegressor(random_state=0): cv test rmse 4.99
RandomForestRegressor(max_features='sqrt', random_state=0): cv test rmse 5.36
GradientBoostingRegressor(random_state=0): cv test rmse 4.94
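
One possible way to produce a comparison like the one above is sketched below. It is not the reference solution: the helper name `cv_test_rmse` and the synthetic `X_demo`, `y_demo` arrays are assumptions introduced so the block runs on its own. In the notebook, `X_train` and `y_train` would take their place. The scorer `"neg_root_mean_squared_error"` returns the negated RMSE, so the sign is flipped before reporting.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate


def cv_test_rmse(model, X, y, cv):
    """Return the mean cross-validated test RMSE of `model` (hypothetical helper)."""
    scores = cross_validate(model, X, y, cv=cv,
                            scoring="neg_root_mean_squared_error")
    # scikit-learn scorers follow "higher is better", so RMSE comes back negated.
    return -scores["test_score"].mean()


# Assumption: synthetic stand-in data; replace with X_train, y_train in the notebook.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0,
                                 random_state=0)

# Same splitter as requested in the TODO.
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

candidates = [
    RandomForestRegressor(random_state=0),
    RandomForestRegressor(max_features="sqrt", random_state=0),
    GradientBoostingRegressor(random_state=0),
]

results = {}
for model in candidates:
    results[str(model)] = cv_test_rmse(model, X_demo, y_demo, kfold)
    print(f"{model}: cv test rmse {results[str(model)]:.2f}")
```

Reusing the same `kfold` object for every candidate means all models are scored on identical folds, which keeps the comparison fair.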


## Grid search best Random Forest parameters

In [29]:
# TODO: Implement grid search for max_features, max_depth and n_estimators
# same cv as above


RandomForestRegressor(max_features=None, n_estimators=800, random_state=0)
cv train rmse 1.87
cv test rmse 4.97
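
A minimal sketch of such a grid search with `GridSearchCV` follows. The grid values are illustrative assumptions kept small so the example runs quickly, and `X_demo`, `y_demo` are synthetic stand-ins for `X_train`, `y_train`. Setting `return_train_score=True` is what makes the cross-validated training RMSE available alongside the test RMSE.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Assumption: synthetic stand-in data; replace with X_train, y_train in the notebook.
X_demo, y_demo = make_regression(n_samples=120, n_features=8, noise=10.0,
                                 random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)

# Assumption: illustrative grid only; a real search would likely use wider ranges.
param_grid = {
    "max_features": ["sqrt", None],
    "max_depth": [5, None],
    "n_estimators": [20, 40],
}

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=kfold,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,  # needed for the cv train rmse below
)
grid.fit(X_demo, y_demo)

best = grid.best_index_  # row of cv_results_ holding the best candidate
print(grid.best_estimator_)
print(f"cv train rmse {-grid.cv_results_['mean_train_score'][best]:.2f}")
print(f"cv test rmse {-grid.cv_results_['mean_test_score'][best]:.2f}")
```

A large gap between the two numbers, as in the output above, suggests the forest fits the training folds much more closely than it generalizes.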


## Grid search best gradient boosting parameters

In [30]:
# TODO: Implement grid search for learning_rate, max_depth, n_estimators
# same cv as above


GradientBoostingRegressor(n_estimators=800, random_state=0)
cv train rmse 1.35
cv test rmse 4.07
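
For gradient boosting it can help to look at more than the single best candidate, since `learning_rate` and `n_estimators` trade off against each other. The sketch below, under the assumption of a small illustrative grid and synthetic stand-in data (`X_demo`, `y_demo`), tabulates `cv_results_` with pandas and sorts the candidates by rank.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Assumption: synthetic stand-in data; replace with X_train, y_train in the notebook.
X_demo, y_demo = make_regression(n_samples=120, n_features=8, noise=10.0,
                                 random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)

# Assumption: illustrative grid only.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "n_estimators": [50, 100],
}

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=kfold,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,
)
grid.fit(X_demo, y_demo)

# One row per parameter combination, with scores converted back to positive RMSE.
table = pd.DataFrame(grid.cv_results_)[
    ["params", "mean_train_score", "mean_test_score", "rank_test_score"]
].copy()
table["cv_train_rmse"] = -table["mean_train_score"]
table["cv_test_rmse"] = -table["mean_test_score"]
table = table.sort_values("rank_test_score")

print(table[["params", "cv_train_rmse", "cv_test_rmse"]].head())
```

If the best candidate sits at the edge of the grid (for example at the largest `n_estimators`), that is usually a hint to extend the grid in that direction.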


## Re-train the best model on the training set - evaluate on the test set

In [None]:
# TODO: update to best model found above
model = ...
model.fit(X_train, y_train)
print(f"\n*** {model} ***")
print(f"Training RMSE={root_mean_squared_error(y_train, model.predict(X_train)):.2f}")
print(f"Test RMSE={root_mean_squared_error(y_test, model.predict(X_test)):.2f}")