# Concrete Feature Engineering

---

## Reference

> [What Is Feature Engineering](https://www.kaggle.com/ryanholbrook/what-is-feature-engineering)<br>
> [Data Source](https://www.kaggle.com/sinamhd9/concrete-comprehensive-strength)

---

## Dependencies

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

%load_ext autotime

time: 198 µs (started: 2021-08-15 12:01:03 -07:00)


---

## Import Dataset

In [2]:
df = pd.read_excel("data/Concrete_Data.xls")
columns = ['Cement', 'BlastFurnaceSlag', 'FlyAsh', 'Water', 'Superplasticizer', 
           'CoarseAggregate', 'FineAggregate', 'Age', 'CompressiveStrength']
df.columns = columns
df.head()

|   | Cement | BlastFurnaceSlag | FlyAsh | Water | Superplasticizer | CoarseAggregate | FineAggregate | Age | CompressiveStrength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.887366 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.269535 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05278 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.296075 |


time: 89.5 ms (started: 2021-08-15 12:01:03 -07:00)


---

## Baseline

The table above shows the ingredients that go into each variety of concrete. Below, we add synthetic features derived from these ingredients and check whether they help a model learn important relationships among them.

We'll first establish a baseline by training the model on the un-augmented dataset. Establishing a baseline like this is good practice at the start of the feature engineering process: the baseline score tells you whether new features are worth keeping, or whether you should discard them and try something else.

In [3]:
X = df.copy()
y = X.pop("CompressiveStrength")

# Train and score baseline model
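# "absolute_error" is the scikit-learn >= 1.0 name for the criterion formerly called "mae"
baseline = RandomForestRegressor(criterion="absolute_error", random_state=0)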
baseline_score = cross_val_score(
    baseline, X, y, cv=5, scoring="neg_mean_absolute_error"
)
baseline_score = -1 * baseline_score.mean()

print(f"MAE Baseline Score: {baseline_score:.4}")

MAE Baseline Score: 8.397
time: 9.35 s (started: 2021-08-15 12:01:03 -07:00)
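
The scoring steps in the cell above are the same for every feature set, so they could be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: the name `score_dataset`, the smaller default forest, and the synthetic stand-in data are all assumptions made for illustration. It uses the default split criterion to keep the run fast, so its numbers are not comparable to the MAE reported above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def score_dataset(X, y, model=None, cv=5):
    """Return the mean cross-validated MAE (lower is better)."""
    if model is None:
        # Hypothetical default: a small forest with the default criterion, chosen for speed
        model = RandomForestRegressor(n_estimators=50, random_state=0)
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_mean_absolute_error"
    )
    # scikit-learn reports negated MAE so that larger is better; flip the sign back
    return -scores.mean()


# Synthetic stand-in data (assumption: NOT the concrete dataset)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.uniform(1.0, 10.0, size=(200, 3)), columns=["a", "b", "c"]
)
y_demo = X_demo["a"] / X_demo["b"] + rng.normal(0.0, 0.05, size=200)

demo_score = score_dataset(X_demo, y_demo)
print(f"MAE on synthetic data: {demo_score:.4}")
```

With a helper like this, the baseline and every later experiment go through exactly the same cross-validation setup, which keeps the scores comparable.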


If you ever cook at home, you might know that the ratio of ingredients in a recipe is usually a better predictor of how the recipe turns out than their absolute amounts. We might reason then that ratios of the features above would be a good predictor of `CompressiveStrength`.

The cell below adds three new ratio features to the dataset.

In [4]:
X = df.copy()
y = X.pop("CompressiveStrength")

# Create synthetic features
X["FCRatio"] = X["FineAggregate"] / X["CoarseAggregate"]
X["AggCmtRatio"] = (X["CoarseAggregate"] + X["FineAggregate"]) / X["Cement"]
X["WtrCmtRatio"] = X["Water"] / X["Cement"]

# Train and score model on dataset with additional ratio features
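# "absolute_error" is the scikit-learn >= 1.0 name for the criterion formerly called "mae"
model = RandomForestRegressor(criterion="absolute_error", random_state=0)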
score = cross_val_score(
    model, X, y, cv=5, scoring="neg_mean_absolute_error"
)
score = -1 * score.mean()

print(f"MAE Score with Ratio Features: {score:.4}")

MAE Score with Ratio Features: 8.01
time: 12.4 s (started: 2021-08-15 12:01:13 -07:00)


And sure enough, performance improved: the cross-validated MAE dropped from 8.397 to 8.01. This suggests that the new ratio features expose information that the model wasn't picking up from the raw ingredient amounts alone.
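
A difference between two mean scores can hide fold-to-fold variation, so it may be worth looking at the per-fold errors as well. The sketch below is one way to do that, under stated assumptions: the helper `fold_maes`, the synthetic data, and the smaller forest with the default criterion are illustrative choices, not the notebook's setup. Both feature sets are scored on identical folds so the differences are paired.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score


def fold_maes(X, y, cv):
    """Return the MAE of each cross-validation fold as an array."""
    # Hypothetical model settings, chosen for speed rather than accuracy
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    return -cross_val_score(
        model, X, y, cv=cv, scoring="neg_mean_absolute_error"
    )


# Synthetic stand-in data (assumption: NOT the concrete dataset)
rng = np.random.default_rng(0)
X_base = pd.DataFrame(
    rng.uniform(1.0, 10.0, size=(200, 3)), columns=["a", "b", "c"]
)
y_demo = X_base["a"] / X_base["b"] + rng.normal(0.0, 0.05, size=200)

# Same data plus one engineered ratio feature
X_ratio = X_base.copy()
X_ratio["ab_ratio"] = X_ratio["a"] / X_ratio["b"]

# A fixed random_state makes both runs use identical train/test splits
cv = KFold(n_splits=5, shuffle=True, random_state=0)
base = fold_maes(X_base, y_demo, cv)
ratio = fold_maes(X_ratio, y_demo, cv)

improvement = base - ratio  # positive means the ratio feature helped on that fold
folds_improved = int((improvement > 0).sum())

print("Per-fold MAE (base): ", np.round(base, 3))
print("Per-fold MAE (ratio):", np.round(ratio, 3))
print(f"Folds improved: {folds_improved} of {len(improvement)}")
```

If the engineered features help on most or all folds, the improvement is less likely to be an artifact of one favorable split.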

In [5]:
X.head()

|   | Cement | BlastFurnaceSlag | FlyAsh | Water | Superplasticizer | CoarseAggregate | FineAggregate | Age | FCRatio | AggCmtRatio | WtrCmtRatio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 0.65 | 3.177778 | 0.3 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 0.640758 | 3.205556 | 0.3 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 0.637339 | 4.589474 | 0.685714 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 0.637339 | 4.589474 | 0.685714 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 0.843724 | 9.083082 | 0.966767 |


time: 11.7 ms (started: 2021-08-15 12:01:25 -07:00)
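
As a quick sanity check, the three ratio values in the first row of the table above can be reproduced by hand from that row's raw amounts. The snippet below is a small worked example; the lowercase variable names are illustrative and do not appear in the notebook.

```python
# First row of the dataset, copied from the df.head() output
cement, water = 540.0, 162.0
coarse_aggregate, fine_aggregate = 1040.0, 676.0

# Same formulas as the feature-engineering cell
fc_ratio = fine_aggregate / coarse_aggregate
agg_cmt_ratio = (coarse_aggregate + fine_aggregate) / cement
wtr_cmt_ratio = water / cement

print(f"FCRatio:     {fc_ratio:.6f}")       # 0.650000
print(f"AggCmtRatio: {agg_cmt_ratio:.6f}")  # 3.177778
print(f"WtrCmtRatio: {wtr_cmt_ratio:.6f}")  # 0.300000
```

All three ratios here use `Cement` or `CoarseAggregate` as the denominator. Columns such as `BlastFurnaceSlag`, `FlyAsh`, and `Superplasticizer` are 0.0 in some of the rows shown, so a ratio with one of them as the denominator would produce `inf` or `NaN` in pandas rather than a usable value.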
