##   🏡 Concrete Strength Prediction: From Exploration to Modeling 🏗️

<br>

---

<br>

##   1\. 🧐 Project Initialization: Library Imports

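We begin by importing the Python libraries used throughout: pandas and NumPy for data handling and calculations, and scikit-learn and XGBoost for building and evaluating models. The matplotlib and seaborn imports are commented out, matching the disabled plotting cells later in the notebook. An empty `scores` DataFrame is also created to collect each model's results.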

In [1]:
import pandas as pd
from numpy import log1p
# import matplotlib.pyplot as plt
# import seaborn as sns
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

scores = pd.DataFrame(columns=['Model', 'Train R²', 'Test R²', '|| Diff ||'])

---

<br>

##   2\. 📥 Data Loading and Inspection

In this step, we load the concrete strength dataset from a CSV file. We then identify the features (input variables) and the target variable ('strength', which is what we want to predict). Finally, we display the first few rows of the data to get an initial look at its contents.

<br>

---

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/Zaid-N-Ansari/PG-Mini-Project/refs/heads/main/Data/Concrete.csv')
features = df.columns[:-1].tolist()
target = 'strength'
df

Unnamed: 0,cement,slag,ash,water,superplastic,coarseagg,fineagg,age,strength
0,141.3,212.0,0.0,203.5,0.0,971.8,748.5,28,29.89
1,168.9,42.2,124.3,158.3,10.8,1080.8,796.2,14,23.51
2,250.0,0.0,95.7,187.4,5.5,956.9,861.2,28,29.22
3,266.0,114.0,0.0,228.0,0.0,932.0,670.0,28,45.85
4,154.8,183.4,0.0,193.3,9.1,1047.4,696.7,28,18.29
...,...,...,...,...,...,...,...,...,...
1025,135.0,0.0,166.0,180.0,10.0,961.0,805.0,28,13.29
1026,531.3,0.0,0.0,141.8,28.2,852.1,893.7,3,41.30
1027,276.4,116.0,90.3,179.6,8.9,870.1,768.3,28,44.28
1028,342.0,38.0,0.0,228.0,0.0,932.0,670.0,270,55.06
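
Before modeling, it can help to confirm the basics that the preview only hints at: shape, column types, and missing values. The sketch below rebuilds the five leading rows shown above as a stand-in for the full dataset; `summarize` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# First five rows copied from the preview above; the full dataset is not reproduced here.
sample = pd.DataFrame(
    {
        "cement": [141.3, 168.9, 250.0, 266.0, 154.8],
        "slag": [212.0, 42.2, 0.0, 114.0, 183.4],
        "ash": [0.0, 124.3, 95.7, 0.0, 0.0],
        "water": [203.5, 158.3, 187.4, 228.0, 193.3],
        "superplastic": [0.0, 10.8, 5.5, 0.0, 9.1],
        "coarseagg": [971.8, 1080.8, 956.9, 932.0, 1047.4],
        "fineagg": [748.5, 796.2, 861.2, 670.0, 696.7],
        "age": [28, 14, 28, 28, 28],
        "strength": [29.89, 23.51, 29.22, 45.85, 18.29],
    }
)


def summarize(frame: pd.DataFrame, target: str) -> dict:
    """Hypothetical helper: collect a few facts worth checking before modeling."""
    return {
        "shape": frame.shape,
        "features": [col for col in frame.columns if col != target],
        "missing": int(frame.isna().sum().sum()),
        "all_numeric": all(pd.api.types.is_numeric_dtype(t) for t in frame.dtypes),
    }


report = summarize(sample, target="strength")
print(report)
```

Running the same helper on the full `df` would show whether any column needs cleaning before the later steps.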


---

<br>

##   3\. 📊 Initial Data Exploration: Box Plots

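The cell below, currently commented out, draws box plots of every feature. Box plots summarize the distribution of each feature and make potential outliers (extreme values) easy to spot. The cell after it removes rows that exceed fixed thresholds for `water`, `superplastic`, `fineagg`, and `age`.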

<br>

---

In [3]:
# plt.figure(figsize=(25, 20))
# plt.subplot(3, 3, 1)
# plt.title('Box plot of features')
# plt.xlabel('Features')
# plt.ylabel('Value')
# plt.xticks(rotation=45)
# plt.grid(axis='y')
# plt.boxplot([df[feature] for feature in features], labels=features, showfliers=True)
# plt.show()

# for feature in features:
# 	plt.figure(figsize=(10, 5))
# 	plt.boxplot(df[feature], showfliers=True, vert=False)
# 	plt.title(f'Box plot of {feature}')
# 	plt.show()

In [4]:
# Use only this block for outlier treatment; do not enable any other outlier-treatment code.
df = df[df['water'] < 240]
df = df[df['superplastic'] < 30]
df = df[df['fineagg'] < 960]
df = df[df['age'] < 300]
df


Unnamed: 0,cement,slag,ash,water,superplastic,coarseagg,fineagg,age,strength
0,141.3,212.0,0.0,203.5,0.0,971.8,748.5,28,29.89
1,168.9,42.2,124.3,158.3,10.8,1080.8,796.2,14,23.51
2,250.0,0.0,95.7,187.4,5.5,956.9,861.2,28,29.22
3,266.0,114.0,0.0,228.0,0.0,932.0,670.0,28,45.85
4,154.8,183.4,0.0,193.3,9.1,1047.4,696.7,28,18.29
...,...,...,...,...,...,...,...,...,...
1025,135.0,0.0,166.0,180.0,10.0,961.0,805.0,28,13.29
1026,531.3,0.0,0.0,141.8,28.2,852.1,893.7,3,41.30
1027,276.4,116.0,90.3,179.6,8.9,870.1,768.3,28,44.28
1028,342.0,38.0,0.0,228.0,0.0,932.0,670.0,270,55.06
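
The cutoffs in the filter above are fixed numbers. A box plot marks points as outliers when they fall more than 1.5 times the interquartile range (IQR) beyond the quartiles, so one way to derive comparable cutoffs is sketched below. The values are hypothetical, and `iqr_bounds` is an illustrative helper rather than the notebook's method.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) whisker limits used by a standard box plot."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


# Hypothetical measurements, not taken from the dataset
values = pd.Series([140.0, 155.0, 165.0, 180.0, 185.0, 195.0, 250.0])

low, high = iqr_bounds(values)
outliers = values[(values < low) | (values > high)]
print(low, high, outliers.tolist())
```

Comparing such limits with the hand-picked thresholds shows how aggressive each filter is for a given feature.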


---

<br>

##   4\. 📊 Post-Outlier Removal Visualization

This code would generate box plots *after* the outlier removal in the previous step. Comparing these plots to those in section 3 would show the impact of removing outliers.

<br>

---

In [5]:
# plt.figure(figsize=(25, 20))
# plt.subplot(3, 3, 1)
# plt.title('Box plot of features')
# plt.xlabel('Features')
# plt.ylabel('Value')
# plt.xticks(rotation=45)
# plt.grid(axis='y')
# plt.boxplot([df[feature] for feature in features], labels=features, showfliers=True)
# plt.show()

# for feature in features:
# 	plt.figure(figsize=(10, 5))
# 	plt.boxplot(df[feature], showfliers=True, vert=False)
# 	plt.title(f'Box plot of {feature}')
# 	plt.show()

---

<br>

##   5\. 🛠️ Feature Transformation and Data Splitting

This is a critical data preprocessing stage. First, we reduce skewness (asymmetry) by applying a `log1p` transformation to every feature whose absolute skewness exceeds 0.5. Then, we split the data into training (70%) and testing (30%) sets. The training set is used to fit the models, and the testing set is used to evaluate them on unseen data. Finally, we standardize the features with `StandardScaler`, fitted on the training set only. Standardization puts all features on a comparable scale, which matters most for distance- and kernel-based models such as KNN and SVR; tree-based models are largely unaffected by it.

<br>

---


In [6]:
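df = df.copy()  # avoid assigning into a filtered view of the original frame

# Flag features whose absolute skewness exceeds 0.5, then apply log(1 + x) to them
skewness = df[features].skew()
highly_skewed_features = skewness[skewness.abs() > 0.5].index

for feature in highly_skewed_features:
    df[feature] = log1p(df[feature])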

x_train, x_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.3, random_state=42)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
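
To see what these steps do, the sketch below runs them on synthetic data with one right-skewed column and one roughly symmetric column. It is a variation under stated assumptions, not the notebook's exact procedure: here the skewed columns are chosen from the training rows only, so no information from the test rows influences the choice. The column names and distributions are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: one right-skewed column and one roughly symmetric column
demo = pd.DataFrame(
    {
        "skewed": rng.lognormal(mean=3.0, sigma=1.0, size=500),
        "symmetric": rng.normal(loc=100.0, scale=10.0, size=500),
    }
)

train, test = train_test_split(demo, test_size=0.3, random_state=42)
train, test = train.copy(), test.copy()

# Choose the columns to transform using the training rows only
skew = train.skew()
to_log = skew[skew.abs() > 0.5].index.tolist()

skew_before = train["skewed"].skew()
train[to_log] = np.log1p(train[to_log])
test[to_log] = np.log1p(test[to_log])
skew_after = train["skewed"].skew()

# Fit the scaler on the training rows, then reuse its statistics for the test rows
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

print(to_log, round(skew_before, 2), round(skew_after, 2))
```

After scaling, each training column has mean 0 and unit variance; the test columns will be close to that but not exact, because they reuse the training statistics.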

---

<br>

##   6\. 📊 Post-Transformation Visualization (Commented Out)

This commented-out section would visualize the feature distributions *after* the log transformation, allowing us to see how the transformation affected the skewness.

<br>

---

In [7]:
# plt.figure(figsize=(25, 20))
# plt.subplot(3, 3, 1)
# plt.title('Box plot of features')
# plt.xlabel('Features')
# plt.ylabel('Value')
# plt.xticks(rotation=45)
# plt.grid(axis='y')
# plt.boxplot([df[feature] for feature in features], labels=features, showfliers=True)
# plt.show()

# for feature in features:
# 	plt.figure(figsize=(10, 5))
# 	plt.boxplot(df[feature], showfliers=True, vert=False)
# 	plt.title(f'Box plot of {feature}')
# 	plt.show()

---

<br>

##   8\. 🤖 Linear Regression Model

Here, we train a Linear Regression model. We then evaluate it with the R-squared (R²) metric, the proportion of variance in the target that the model explains, reported here as a percentage. We calculate R² on both the training and testing sets to check for overfitting (when the model performs much better on the training data than on the testing data).


<br>

---

In [8]:
LR = LinearRegression()

LR.fit(x_train, y_train)

y_pred = LR.predict(x_test)
y_train_pred = LR.predict(x_train)

training_r2 = LR.score(x_train, y_train) * 100
testing_r2 = LR.score(x_test, y_test) * 100

diff = abs(training_r2 - testing_r2)

# pd.DataFrame({'Model': 'Linear Regression', 'Training R2': training_r2, 'Testing R2': testing_r2, 'Training and Testing R2 Diff': diff}, index=[1])
scores.loc[1] = ['Linear Regression', training_r2, testing_r2, diff]
scores

Unnamed: 0,Model,Train R²,Test R²,|| Diff ||
1,Linear Regression,81.327857,81.82655,0.498694
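
The `score` method used above returns R², which the notebook multiplies by 100. As a reference, R² can be written as

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

and the minimal sketch below computes it by hand and compares it with scikit-learn's `r2_score`. The numbers are hypothetical, and `r_squared` is an illustrative helper.

```python
import numpy as np
from sklearn.metrics import r2_score


def r_squared(y_true, y_pred) -> float:
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical strengths and predictions, not taken from the dataset
y_true = [30.0, 24.0, 29.0, 46.0, 18.0]
y_pred = [28.0, 25.0, 31.0, 43.0, 21.0]

manual = r_squared(y_true, y_pred)
library = r2_score(y_true, y_pred)
print(manual, library)
```

A model that always predicts the mean of `y_true` scores 0, and a model that does worse than that scores below 0, so R² is not bounded below.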


---

<br>

##   9\. 🌳 Decision Tree Regressor

This section trains and evaluates a Decision Tree Regressor. <br>
Hyperparameters (parameters that control the model's structure) such as `ccp_alpha`, `max_depth`, `max_leaf_nodes`, `min_samples_split`, and `max_features` are set to limit the tree's complexity. No `random_state` is set, so with `max_features=4` the scores change slightly from run to run.

<br>

---

In [9]:
DTR = DecisionTreeRegressor(ccp_alpha=0.689, max_depth=80, max_leaf_nodes=87, min_samples_split=26, max_features=4)

DTR.fit(x_train, y_train)

y_pred = DTR.predict(x_test)
y_train_pred = DTR.predict(x_train)

training_r2 = DTR.score(x_train, y_train) * 100
testing_r2 = DTR.score(x_test, y_test) * 100

diff = abs(training_r2 - testing_r2)

scores.loc[2] = ['Decision Tree Regression', training_r2, testing_r2, diff]
scores

Unnamed: 0,Model,Train R²,Test R²,|| Diff ||
1,Linear Regression,81.327857,81.82655,0.498694
2,Decision Tree Regression,84.03822,75.545258,8.492962
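
To illustrate how two of these hyperparameters constrain a tree, the sketch below fits three trees on synthetic data and compares their leaf counts. The data comes from `make_regression`, and the `ccp_alpha` value is arbitrary and chosen for that data; only `max_leaf_nodes=87` is taken from the cell above.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data, not the concrete dataset
X, y = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=0)

# No limits: the tree keeps splitting until its leaves are pure
unpruned = DecisionTreeRegressor(random_state=0).fit(X, y)

# Cost-complexity pruning removes subtrees whose benefit is below ccp_alpha
pruned = DecisionTreeRegressor(ccp_alpha=50.0, random_state=0).fit(X, y)

# A hard cap on the number of leaves
capped = DecisionTreeRegressor(max_leaf_nodes=87, random_state=0).fit(X, y)

for name, tree in [("unpruned", unpruned), ("pruned", pruned), ("capped", capped)]:
    print(name, tree.get_n_leaves(), round(tree.score(X, y), 3))
```

Fewer leaves usually lowers the training score; whether the test score improves has to be checked on held-out data.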


---

<br>

##   10\. 🌲 Random Forest and XGBoost Regressors

This part explores ensemble methods, which combine many simpler models to make more accurate predictions. <br>
A Random Forest is trained and scored here. The XGBoost code is left commented out, so it does not appear in the results. Hyperparameters are again used to tune the model.

<br>

---

In [10]:
RFR = RandomForestRegressor(n_estimators=825, min_samples_split=8, min_samples_leaf=8, max_features=0.49, bootstrap=True, oob_score=True)

RFR.fit(x_train, y_train)

y_pred = RFR.predict(x_test)
y_train_pred = RFR.predict(x_train)

training_r2 = RFR.score(x_train, y_train) * 100
testing_r2 = RFR.score(x_test, y_test) * 100

diff = abs(training_r2 - testing_r2)

scores.loc[3] = ['Random Forest Regression', training_r2, testing_r2, diff]
scores

# XGB = XGBRegressor(n_estimators=354, learning_rate=0.009, max_depth=6, gamma=0.15, subsample=0.045, colsample_bytree=0.8)
# XGB.fit(x_train, y_train)

# y_pred = XGB.predict(x_test)
# y_train_pred = XGB.predict(x_train)
# training_r2 = XGB.score(x_train, y_train) * 100
# testing_r2 = XGB.score(x_test, y_test) * 100
# diff = abs(training_r2 - testing_r2)
# print(training_r2, testing_r2, diff, 1-diff)

Unnamed: 0,Model,Train R²,Test R²,|| Diff ||
1,Linear Regression,81.327857,81.82655,0.498694
2,Decision Tree Regression,84.03822,75.545258,8.492962
3,Random Forest Regression,88.078543,85.19993,2.878613
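
The forest above is created with `oob_score=True`, but the out-of-bag (OOB) estimate is never read. The sketch below shows where it lives after fitting, using synthetic data and a smaller forest; `random_state=0` is added here only to make the sketch repeatable.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data, not the concrete dataset
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

forest = RandomForestRegressor(
    n_estimators=200,
    min_samples_leaf=8,
    max_features=0.49,
    bootstrap=True,
    oob_score=True,  # score each sample using only trees that did not train on it
    random_state=0,
)
forest.fit(X, y)

oob_r2 = forest.oob_score_      # R² estimated from out-of-bag predictions
train_r2 = forest.score(X, y)   # R² on the data the forest was fitted on
print(round(oob_r2, 3), round(train_r2, 3))
```

Because each sample is scored only by trees that never saw it, the OOB value acts as a rough stand-in for held-out performance without needing a separate validation split.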


---

<br>

##   12\. ⚙️ Support Vector Regression (SVR)

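This part trains and evaluates a Support Vector Regression (SVR) model. With its default RBF kernel, SVR can model non-linear relationships between the features and strength. <br>
The hyperparameters `C`, `gamma`, and `epsilon` control the regularization strength, the kernel width, and the width of the error-insensitive tube around the predictions.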

<br>

---

In [11]:
SVM = SVR(C=410, gamma=0.01, epsilon=0.06)

SVM.fit(x_train, y_train)

y_pred = SVM.predict(x_test)
y_train_pred = SVM.predict(x_train)

training_r2 = SVM.score(x_train, y_train) * 100
testing_r2 = SVM.score(x_test, y_test) * 100

diff = abs(training_r2 - testing_r2)

scores.loc[4] = ['Support Vector Regression', training_r2, testing_r2, diff]
scores

Unnamed: 0,Model,Train R²,Test R²,|| Diff ||
1,Linear Regression,81.327857,81.82655,0.498694
2,Decision Tree Regression,84.03822,75.545258,8.492962
3,Random Forest Regression,88.078543,85.19993,2.878613
4,Support Vector Regression,88.213014,87.681997,0.531017
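
SVR depends on feature scale, which is why the standardization from section 5 matters here. One way to keep the scaler and the model together is a scikit-learn pipeline, sketched below on synthetic data. The SVR settings are copied from the cell above; everything else is an assumption made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data, not the concrete dataset
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on the training data and reuses it at predict time
model = make_pipeline(StandardScaler(), SVR(C=410, gamma=0.01, epsilon=0.06))
model.fit(X_train, y_train)

train_r2 = model.score(X_train, y_train) * 100
test_r2 = model.score(X_test, y_test) * 100
print(round(train_r2, 2), round(test_r2, 2), round(abs(train_r2 - test_r2), 2))
```

With a pipeline, the raw (unscaled) features can be passed to `fit`, `predict`, and `score`, so the scaling step cannot be skipped by accident.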


---

<br>

##   13\. 👯‍♀️ K-Nearest Neighbors (KNN)

- This final section trains and evaluates a K-Nearest Neighbors (KNN) regressor.
- Comparing the results across all models is essential for model selection.

<br>
<br>
<br>

---

In [12]:
KNN = KNeighborsRegressor(algorithm='auto', metric='minkowski', p=2, leaf_size=65, n_neighbors=15)

KNN.fit(x_train, y_train)

y_pred = KNN.predict(x_test)
y_train_pred = KNN.predict(x_train)

training_r2 = KNN.score(x_train, y_train) * 100
testing_r2 = KNN.score(x_test, y_test) * 100

diff = abs(training_r2 - testing_r2)

scores.loc[5] = ['K-Nearest Neighbors Regression', training_r2, testing_r2, diff]
scores

# XGB = XGBRegressor(n_estimators=354, learning_rate=0.0099, max_depth=6, gamma=0.12, subsample=0.045, colsample_bytree=0.8)
# XGB.fit(x_train, y_train)

# y_pred = XGB.predict(x_test)
# y_train_pred = XGB.predict(x_train)
# training_r2 = XGB.score(x_train, y_train) * 100
# testing_r2 = XGB.score(x_test, y_test) * 100
# diff = abs(training_r2 - testing_r2)
# print(training_r2, testing_r2, diff, 1-diff)

Unnamed: 0,Model,Train R²,Test R²,|| Diff ||
1,Linear Regression,81.327857,81.82655,0.498694
2,Decision Tree Regression,84.03822,75.545258,8.492962
3,Random Forest Regression,88.078543,85.19993,2.878613
4,Support Vector Regression,88.213014,87.681997,0.531017
5,K-Nearest Neighbors Regression,81.423613,80.158281,1.265333
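
All scores above come from a single 70/30 split. `KFold` is imported in section 1 but not used in the cells shown; the sketch below shows one way it could be used to get a score per fold instead of a single number. The data is synthetic, and the KNN settings are simplified to `n_neighbors=15`.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the concrete dataset
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Five shuffled folds; each fold serves once as the held-out set
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Scaling sits inside the pipeline, so it is refitted on each training fold
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=15))

fold_r2 = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(fold_r2.round(3), round(fold_r2.mean(), 3), round(fold_r2.std(), 3))
```

The spread across folds gives a sense of how much a single train/test difference could be due to the particular split.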


### 📊 **Model Evaluation Summary**

| Model | Train R² (%) | Test R² (%) | Absolute Difference |
|----------------------|----------|---------|------------|
| Linear Regression | 81.328 | 81.827 | 0.499 |
| Decision Tree Regression | 84.038 | 75.545 | 8.493 |
| Random Forest Regression | 88.079 | 85.200 | 2.879 |
| Support Vector Regression | 88.213 | 87.682 | 0.531 |
| K-Nearest Neighbors Regression | 81.424 | 80.158 | 1.265 |

Values are rounded from the run shown above. Decision Tree and Random Forest are fitted without a fixed `random_state`, so their scores shift slightly from run to run.

---

### 🧠 **Inference**

- 🔹 **Linear Regression** shows good generalization with a small gap between training and test R².
- 🌳 **Decision Tree** performs well on training data but has a larger drop on test data, indicating **overfitting**.
- 🌲 **Random Forest** balances high performance with improved generalization over a single tree — a **robust** model.
- 📈 **Support Vector Regressor (SVR)** maintains consistent R² scores, indicating **stable generalization**.
- 👥 **K-Nearest Neighbors** performs decently but might be sensitive to data scaling and choice of `k`.

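> 📌 **Conclusion**:  
> ✅ **SVR** has the highest test R² (87.68%) with a train/test gap of about 0.5 points. **Random Forest** follows at 85.20% with a gap of about 2.9 points and remains a robust alternative.  
> ⚠️ **Decision Tree** may need stronger pruning or further parameter tuning to reduce overfitting.  
> 💡 Consider **Linear Regression** if interpretability is key: its coefficients can be read directly, and its train/test gap is the smallest of the five models.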



## 🔍 Feature Characteristics & Insights

### 📈 Strong correlation expected:
- **Cement and Age** usually have positive correlation with strength.
- **Too much Water** typically shows negative correlation (weakens concrete).
- **Superplasticizer** often helps improve strength by reducing water demand.

### 🧮 Mostly numerical:
- All features are numeric, so they can be fed to regression models without any categorical encoding.

### 🔄 Some nonlinear relationships:
- e.g., the **Age vs Strength** curve tends to flatten over time, which is a likely reason models such as **SVR** and **Random Forest** do better than plain **Linear Regression**.

### 📉 Multicollinearity may exist:
- **Cement**, **slag**, and **fly ash** might substitute each other — linear models might suffer slightly unless regularization is used.
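
The expectations in this section can be checked with a correlation matrix. The sketch below does this on synthetic data built to mimic the described patterns (cement and slag substituting for each other, water lowering strength), so the signs it prints are a consequence of how the data was generated, not a finding about the real dataset. On the notebook's data, the same `corr()` call on `df` would give the actual values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data with assumed relationships, not the concrete dataset
cement = rng.uniform(100, 500, n)
# Assumed substitution effect: more cement tends to mean less slag
slag = np.clip(300 - 0.5 * cement + rng.normal(0, 30, n), 0, None)
water = rng.uniform(140, 230, n)
strength = 0.1 * cement + 0.05 * slag - 0.15 * water + rng.normal(0, 3, n)

demo = pd.DataFrame(
    {"cement": cement, "slag": slag, "water": water, "strength": strength}
)

corr = demo.corr()

# Correlation of each feature with the target
print(corr["strength"].drop("strength").round(2))

# Feature pairs with |correlation| above 0.7 hint at multicollinearity
features_only = corr.drop(index="strength", columns="strength")
strong_pairs = [
    (a, b, round(features_only.loc[a, b], 2))
    for i, a in enumerate(features_only.columns)
    for b in features_only.columns[i + 1:]
    if abs(features_only.loc[a, b]) > 0.7
]
print(strong_pairs)
```

The 0.7 cutoff is an arbitrary choice for the sketch; any threshold used on the real data should be treated as a judgment call.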

## 🤖 Why Models Performed as They Did
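- **SVR** and **Random Forest** performed best, likely because they can capture non-linear relationships and feature interactions.
- **Decision Tree** overfitted despite its pruning settings (`ccp_alpha`, `max_leaf_nodes`, `min_samples_split`), which is common for a single tree because of its high variance.
- **Linear Regression** did reasonably well because the dataset still holds some linear structure.
- **KNN** did fairly well. Because it is distance-based, it relies on the standardization from section 5 to keep large-valued features (e.g., Cement) from dominating small-valued ones (e.g., Superplasticizer).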
