## 🪨 Concrete Strength Prediction

Given *data about the composition and age of different concrete mixes*, let's try to predict the **compressive strength** of a given mix (the `csMPa` column, in MPa).

We will train nine regression models with their default settings, compare them by R^2 on a held-out test set, and then tune the best one with a grid search.

Data source: https://www.kaggle.com/datasets/maajdl/yeh-concret-data

### Importing Libraries

In [3]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import LinearSVR, SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor

from sklearn.model_selection import GridSearchCV

In [2]:
data = pd.read_csv('Concrete_Data_Yeh.csv')
data

,cement,slag,flyash,water,superplasticizer,coarseaggregate,fineaggregate,age,csMPa
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.30
...,...,...,...,...,...,...,...,...,...
1025,276.4,116.0,90.3,179.6,8.9,870.1,768.3,28,44.28
1026,322.2,0.0,115.6,196.0,10.4,817.9,813.4,28,31.18
1027,148.5,139.4,108.6,192.7,6.1,892.4,780.0,28,23.70
1028,159.1,186.7,0.0,175.6,11.3,989.6,788.9,28,32.77


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   cement            1030 non-null   float64
 1   slag              1030 non-null   float64
 2   flyash            1030 non-null   float64
 3   water             1030 non-null   float64
 4   superplasticizer  1030 non-null   float64
 5   coarseaggregate   1030 non-null   float64
 6   fineaggregate     1030 non-null   float64
 7   age               1030 non-null   int64  
 8   csMPa             1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB


### Preprocessing

In [5]:
df = data.copy()

In [6]:
y = df['csMPa'].copy()
X = df.drop('csMPa', axis=1).copy()

In [7]:
y

0       79.99
1       61.89
2       40.27
3       41.05
4       44.30
        ...  
1025    44.28
1026    31.18
1027    23.70
1028    32.77
1029    32.40
Name: csMPa, Length: 1030, dtype: float64

In [8]:
X

,cement,slag,flyash,water,superplasticizer,coarseaggregate,fineaggregate,age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360
...,...,...,...,...,...,...,...,...
1025,276.4,116.0,90.3,179.6,8.9,870.1,768.3,28
1026,322.2,0.0,115.6,196.0,10.4,817.9,813.4,28
1027,148.5,139.4,108.6,192.7,6.1,892.4,780.0,28
1028,159.1,186.7,0.0,175.6,11.3,989.6,788.9,28


In [9]:
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=123)

In [10]:
# Scale X with a Standard Scaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)

In [11]:
X_train

,cement,slag,flyash,water,superplasticizer,coarseaggregate,fineaggregate,age
666,-0.858785,2.415833,-0.847354,0.475472,-1.034530,-0.549031,-0.704369,-0.677267
237,-0.649654,0.257904,-0.463043,-0.008664,0.083955,1.203924,0.179163,0.174174
725,0.273206,-0.856855,-0.847354,0.475472,-1.034530,0.508920,0.745693,1.202330
802,0.551408,-0.856855,-0.847354,0.146448,-1.034530,1.126701,-0.221864,-0.275644
568,1.107810,-0.856855,-0.847354,0.179350,-1.034530,0.877014,-0.472665,-0.613008
...,...,...,...,...,...,...,...,...
47,0.944727,0.222677,-0.847354,2.167599,-1.034530,-0.520716,-2.258826,2.166226
638,0.896761,-0.856855,-0.847354,0.193451,-1.034530,0.843551,-0.170940,-0.275644
113,1.039699,1.290846,-0.847354,-1.691390,2.638108,-0.357261,-0.198948,-0.613008
96,1.376418,0.351085,-0.847354,-1.432871,2.070519,-0.469234,0.410867,-0.613008


In [12]:
X_train.mean()

cement             -4.878206e-16
slag                2.463740e-17
flyash              7.883969e-17
water               1.687662e-15
superplasticizer   -5.420229e-17
coarseaggregate     1.660561e-15
fineaggregate      -7.736145e-16
age                -3.572424e-17
dtype: float64

In [13]:
X_train.var()

cement              1.001389
slag                1.001389
flyash              1.001389
water               1.001389
superplasticizer    1.001389
coarseaggregate     1.001389
fineaggregate       1.001389
age                 1.001389
dtype: float64
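The variances come out as 1.001389 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `var()` defaults to the sample variance (`ddof=1`). With 721 training rows the ratio is 721 / 720 ≈ 1.001389. The sketch below reproduces this on a made-up column; the `demo` frame and its values are hypothetical stand-ins, not the concrete data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in column with the same number of rows as X_train (721)
demo = pd.DataFrame({"feature": rng.normal(loc=300.0, scale=50.0, size=721)})

scaled = pd.DataFrame(StandardScaler().fit_transform(demo), columns=demo.columns)

sample_var = scaled["feature"].var()            # pandas default: ddof=1
population_var = scaled["feature"].var(ddof=0)  # the quantity StandardScaler sets to 1

print("ddof=1: {:.6f}".format(sample_var))      # n / (n - 1) = 721 / 720
print("ddof=0: {:.6f}".format(population_var))
```

So the scaler did what it should; the small excess over 1 comes only from the two libraries using different variance conventions.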

In [14]:
y_train

666    12.79
237    47.13
725    38.70
802    31.65
568    25.45
       ...  
47     40.76
638    38.21
113    59.09
96     46.80
106    55.90
Name: csMPa, Length: 721, dtype: float64
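The preprocessing above fits the scaler on the training rows only and reuses it for the test rows, which keeps test-set statistics out of the model. The same two steps can be bundled into a scikit-learn `Pipeline`, which is convenient if the model is later cross-validated, since the scaler is then refit inside each fold. The following is a minimal sketch under stated assumptions: the data come from `make_regression` as a synthetic stand-in, `Ridge` is an arbitrary choice of model, and all `*_demo` / `Xd_*` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the concrete data (assumption: 1030 rows, 8 features)
X_demo, y_demo = make_regression(n_samples=1030, n_features=8, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=123
)

# The pipeline fits the scaler on the training rows only, then fits the model
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(Xd_train, yd_train)

# score() applies the already-fitted scaler to the test rows before predicting
pipe_r2 = pipe.score(Xd_test, yd_test)

# Manual equivalent, mirroring the fit/transform cells above
manual_scaler = StandardScaler().fit(Xd_train)
manual_model = Ridge().fit(manual_scaler.transform(Xd_train), yd_train)
manual_r2 = manual_model.score(manual_scaler.transform(Xd_test), yd_test)

print("Pipeline R^2: {:.5f}".format(pipe_r2))
print("Manual   R^2: {:.5f}".format(manual_r2))
```

Both routes should give the same score; the pipeline simply makes it harder to scale the test data with the wrong statistics by accident.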

### Model Selection

In [15]:
models = {
    "                     Linear Regression": LinearRegression(),
    "                 L2 (Ridge) Regression": Ridge(),
    "Support Vector Machine (Linear Kernel)": LinearSVR(),
    "   Support Vector Machine (RBF Kernel)": SVR(),
    "                         Decision Tree": DecisionTreeRegressor(),
    "                        Neural Network": MLPRegressor(),
    "                         Random Forest": RandomForestRegressor(),
    "             GradientBoostingRegressor": GradientBoostingRegressor(),
    "                              AdaBoost": AdaBoostRegressor()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

                     Linear Regression trained.
                 L2 (Ridge) Regression trained.
Support Vector Machine (Linear Kernel) trained.
   Support Vector Machine (RBF Kernel) trained.
                         Decision Tree trained.
                        Neural Network trained.
                         Random Forest trained.
             GradientBoostingRegressor trained.
                              AdaBoost trained.


In [16]:
for name, model in models.items():
    print(name + " R^2: {:.5f}".format(model.score(X_test, y_test)))

                     Linear Regression R^2: 0.60111
                 L2 (Ridge) Regression R^2: 0.60097
Support Vector Machine (Linear Kernel) R^2: 0.57163
   Support Vector Machine (RBF Kernel) R^2: 0.63459
                         Decision Tree R^2: 0.84438
                        Neural Network R^2: 0.47772
                         Random Forest R^2: 0.89933
             GradientBoostingRegressor R^2: 0.90701
                              AdaBoost R^2: 0.81034
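These scores are computed on the test set, and the same test set is used again later to report the tuned model's score, so the comparison may be slightly optimistic. One common alternative is to rank the candidates by k-fold cross-validation on the training data and keep the test set for a single final check. A minimal sketch, assuming synthetic `make_regression` data as a stand-in and only three of the models (the `candidates` dict and `X_cv` / `y_cv` names are hypothetical):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the scaled training set (assumption, not the concrete data)
X_cv, y_cv = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

cv_means = {}
for name, model in candidates.items():
    # For regressors the default scoring is R^2, the same metric as model.score()
    scores = cross_val_score(model, X_cv, y_cv, cv=5)
    cv_means[name] = scores.mean()
    print("{:>20} mean CV R^2: {:.5f} (std {:.5f})".format(name, scores.mean(), scores.std()))

best_name = max(cv_means, key=cv_means.get)
print("Best by cross-validation:", best_name)
```

On this synthetic data the target is linear in the features, so the ranking will not resemble the one above; the point is only the selection procedure.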


### Model Optimization

In [17]:
best_model = GradientBoostingRegressor()
best_model.fit(X_train, y_train)

print("Model R^2 (Before Optimization): {:.5f}".format(best_model.score(X_test, y_test)))

Model R^2 (Before Optimization): 0.90689
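This score (0.90689) differs slightly from the 0.90701 reported for the same model class in the comparison. The most likely reason is that no `random_state` was set: the trees inside `GradientBoostingRegressor` permute the features at each split, so ties between equally good splits can be broken differently from one fit to the next. The sketch below, on synthetic stand-in data with hypothetical variable names, shows that fixing the seed makes two separate fits agree.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, not the concrete dataset)
X_rs, y_rs = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_rs, y_rs, train_size=0.7, random_state=123
)

# Two independent fits that share the same seed
first = GradientBoostingRegressor(random_state=42).fit(Xr_train, yr_train)
second = GradientBoostingRegressor(random_state=42).fit(Xr_train, yr_train)

same_predictions = np.array_equal(first.predict(Xr_test), second.predict(Xr_test))
print("Identical predictions with a fixed seed:", same_predictions)
print("R^2 of both fits: {:.5f} / {:.5f}".format(
    first.score(Xr_test, yr_test), second.score(Xr_test, yr_test)
))
```

Passing a fixed `random_state` to the model in the cells above would make the "before" score reproducible across runs, at the cost of changing the exact numbers shown here.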


In [18]:
params = {
    "learning_rate": [0.01, 0.1, 1.0],
    "n_estimators": [100, 150, 200],
    "max_depth": [3, 4, 5]
}

clf = GridSearchCV(best_model, params)
clf.fit(X_train, y_train)

clf.best_params_

{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200}

In [19]:
print("Model R^2 (After Optimization): {:.5f}".format(clf.score(X_test, y_test)))

Model R^2 (After Optimization): 0.92722
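When no `cv` argument is given, `GridSearchCV` uses 5-fold cross-validation on the training data and, for a regressor, ranks parameter combinations by R^2. The grid above has 3 × 3 × 3 = 27 combinations, so the search runs 135 fits plus one final refit of the best combination on the full training set; `clf.score(X_test, y_test)` then evaluates that refit model. The sketch below shows how the per-combination results can be inspected. It uses synthetic stand-in data and a deliberately smaller grid (`small_grid`, an assumption made so it runs quickly), so its numbers are unrelated to the ones above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption, not the concrete dataset)
X_gs, y_gs = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
Xg_train, Xg_test, yg_train, yg_test = train_test_split(
    X_gs, y_gs, train_size=0.7, random_state=123
)

# Deliberately small grid so the sketch runs quickly (not the grid used above)
small_grid = {"learning_rate": [0.1, 1.0], "max_depth": [3, 5]}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0), small_grid, cv=5
)
search.fit(Xg_train, yg_train)

# "mean_test_score" is the mean R^2 over the validation folds, not the held-out test set
results = pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
print(results.sort_values("rank_test_score").to_string(index=False))

print("Best params:", search.best_params_)
print("Mean CV R^2 of best params: {:.5f}".format(search.best_score_))
print("Held-out test R^2: {:.5f}".format(search.score(Xg_test, yg_test)))
```

Comparing `best_score_` with the held-out score gives a rough sense of whether the tuned parameters generalize or merely fit the validation folds well.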
