##   🏡 Concrete Strength Prediction: From Exploration to Modeling 🏗️

<br>

---

<br>

##   1\. 🧐 Project Initialization: Library Imports

We begin by importing the necessary Python libraries. These tools will help us handle data, perform calculations, visualize information, build machine learning models, and evaluate their performance.

In [1]:
import pandas as pd
from numpy import log1p
# import matplotlib.pyplot as plt
# import seaborn as sns
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

---

<br>

##   2\. 📥 Data Loading and Inspection

In this step, we load the concrete strength dataset from a CSV file. We then identify the features (input variables) and the target variable ('strength', which is what we want to predict). Finally, we display the first few rows of the data to get an initial look at its contents.

<br>

---

In [None]:
# df = pd.read_csv('https://raw.githubusercontent.com/rahulinchal/SPPU/refs/heads/main/Data/concrete_Data.csv')
df = pd.read_csv('./Data/Concrete.csv')
features = df.columns[:-1].tolist()
target = 'strength'
df

| Unnamed: 0 | cement | slag | ash | water | superplastic | coarseagg | fineagg | age | strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 141.3 | 212.0 | 0.0 | 203.5 | 0.0 | 971.8 | 748.5 | 28 | 29.89 |
| 1 | 168.9 | 42.2 | 124.3 | 158.3 | 10.8 | 1080.8 | 796.2 | 14 | 23.51 |
| 2 | 250.0 | 0.0 | 95.7 | 187.4 | 5.5 | 956.9 | 861.2 | 28 | 29.22 |
| 3 | 266.0 | 114.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 28 | 45.85 |
| 4 | 154.8 | 183.4 | 0.0 | 193.3 | 9.1 | 1047.4 | 696.7 | 28 | 18.29 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1025 | 135.0 | 0.0 | 166.0 | 180.0 | 10.0 | 961.0 | 805.0 | 28 | 13.29 |
| 1026 | 531.3 | 0.0 | 0.0 | 141.8 | 28.2 | 852.1 | 893.7 | 3 | 41.30 |
| 1027 | 276.4 | 116.0 | 90.3 | 179.6 | 8.9 | 870.1 | 768.3 | 28 | 44.28 |
| 1028 | 342.0 | 38.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 270 | 55.06 |
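As a rough illustration of the loading step, the sketch below reads a few of the rows shown above from an in-memory CSV and separates features from the target by name rather than by position. It assumes the file stores the row index as an unnamed first column (hence `index_col=0`); the names `csv_text`, `sample`, `feature_names`, and `target_name` are made up for this example and do not appear in the notebook.

```python
import io

import pandas as pd

# First rows of the table shown above, reproduced as an in-memory CSV.
# The empty first header mimics a saved row index (an assumption about the file).
csv_text = """,cement,slag,ash,water,superplastic,coarseagg,fineagg,age,strength
0,141.3,212.0,0.0,203.5,0.0,971.8,748.5,28,29.89
1,168.9,42.2,124.3,158.3,10.8,1080.8,796.2,14,23.51
2,250.0,0.0,95.7,187.4,5.5,956.9,861.2,28,29.22
"""

# index_col=0 keeps the saved index out of the feature columns
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

target_name = "strength"
feature_names = [c for c in sample.columns if c != target_name]

# Basic sanity checks before any modeling
missing_per_column = sample.isna().sum()
print(feature_names)
print(sample.dtypes)
print(missing_per_column)
```

Selecting features by name guards against an index column silently ending up in the feature list when `df.columns[:-1]` is used.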


---

<br>

##   3\. 📊 Initial Data Exploration: Box Plots

The following code was designed to create box plots. Box plots are a useful way to visualize the distribution of data and identify potential outliers (extreme values).
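Box plots in matplotlib draw their whiskers at 1.5 times the interquartile range (IQR) by default, and points beyond the whiskers are shown as outliers. The same rule can be sketched numerically as follows; the helper `iqr_outlier_mask` and the `ages` values are hypothetical and only serve to illustrate the idea.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whisker rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Illustrative curing ages in days (not taken from the dataset)
ages = np.array([3, 7, 14, 28, 28, 28, 56, 90, 365])
mask = iqr_outlier_mask(ages)
print(ages[mask])  # values flagged as outliers
```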

<br>

---

In [3]:
# plt.figure(figsize=(25, 20))
# plt.subplot(3, 3, 1)
# plt.title('Box plot of features')
# plt.xlabel('Features')
# plt.ylabel('Value')
# plt.xticks(rotation=45)
# plt.grid(axis='y')
# plt.boxplot([df[feature] for feature in features], labels=features, showfliers=True)
# plt.show()

# for feature in features:
# 	plt.figure(figsize=(10, 5))
# 	plt.boxplot(df[feature], showfliers=True, vert=False)
# 	plt.title(f'Box plot of {feature}')
# 	plt.show()

In [4]:
# For outlier treatment, uncomment only this block; do not enable any other outlier-treatment code.
df = df[df['water'] < 240]
df = df[df['superplastic'] < 30]
df = df[df['fineagg'] < 960]
df = df[df['age'] < 300]
df


| Unnamed: 0 | cement | slag | ash | water | superplastic | coarseagg | fineagg | age | strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 141.3 | 212.0 | 0.0 | 203.5 | 0.0 | 971.8 | 748.5 | 28 | 29.89 |
| 1 | 168.9 | 42.2 | 124.3 | 158.3 | 10.8 | 1080.8 | 796.2 | 14 | 23.51 |
| 2 | 250.0 | 0.0 | 95.7 | 187.4 | 5.5 | 956.9 | 861.2 | 28 | 29.22 |
| 3 | 266.0 | 114.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 28 | 45.85 |
| 4 | 154.8 | 183.4 | 0.0 | 193.3 | 9.1 | 1047.4 | 696.7 | 28 | 18.29 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1025 | 135.0 | 0.0 | 166.0 | 180.0 | 10.0 | 961.0 | 805.0 | 28 | 13.29 |
| 1026 | 531.3 | 0.0 | 0.0 | 141.8 | 28.2 | 852.1 | 893.7 | 3 | 41.30 |
| 1027 | 276.4 | 116.0 | 90.3 | 179.6 | 8.9 | 870.1 | 768.3 | 28 | 44.28 |
| 1028 | 342.0 | 38.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 270 | 55.06 |
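The four threshold filters above can also be expressed as a single boolean mask, which makes it easy to report how many rows were dropped. The sketch below uses the same upper limits as the cell above on a small hypothetical frame (`toy`); taking an explicit `.copy()` is one way to avoid chained-assignment warnings when the filtered frame is modified later.

```python
import pandas as pd

# Hypothetical rows, chosen so that two of them violate a limit
toy = pd.DataFrame({
    "water": [203.5, 158.3, 247.0, 228.0],
    "superplastic": [0.0, 10.8, 5.5, 32.2],
    "fineagg": [748.5, 796.2, 861.2, 670.0],
    "age": [28, 14, 365, 28],
})

# Same upper limits as the filtering cell above
upper_limits = {"water": 240, "superplastic": 30, "fineagg": 960, "age": 300}

# Start with all rows kept, then AND in each per-column condition
keep = pd.Series(True, index=toy.index)
for column, limit in upper_limits.items():
    keep &= toy[column] < limit

filtered = toy.loc[keep].copy()
print(f"kept {len(filtered)} of {len(toy)} rows")
```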


---

<br>

##   4\. 📊 Post-Outlier Removal Visualization

This code would generate box plots *after* the outlier removal in the previous step. Comparing these plots to those in section 3 would show the impact of removing outliers.

<br>

---

In [5]:
# plt.figure(figsize=(25, 20))
# plt.subplot(3, 3, 1)
# plt.title('Box plot of features')
# plt.xlabel('Features')
# plt.ylabel('Value')
# plt.xticks(rotation=45)
# plt.grid(axis='y')
# plt.boxplot([df[feature] for feature in features], labels=features, showfliers=True)
# plt.show()

# for feature in features:
# 	plt.figure(figsize=(10, 5))
# 	plt.boxplot(df[feature], showfliers=True, vert=False)
# 	plt.title(f'Box plot of {feature}')
# 	plt.show()

---

<br>

##   5\. 🛠️ Feature Transformation and Data Splitting

This is a critical preprocessing stage. First, we reduce skewness (asymmetry) by applying a `log1p` transformation to every feature whose absolute skewness exceeds 0.5. Then we split the data into training (70%) and testing (30%) sets: the training set is used to fit the models, and the testing set is used to evaluate them on unseen data. Finally, we standardize the features with `StandardScaler`, fitted on the training set only, so that each feature has zero mean and unit variance on the training data. Standardization keeps features with large numeric ranges from dominating scale-sensitive models such as KNN and SVR.
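To see why a log transformation helps with right-skewed features, the following sketch compares skewness before and after `log1p` on synthetic log-normal data. The data and the variable names (`right_skewed`, `skew_before`, `skew_after`) are assumptions made for illustration; they are not drawn from the concrete dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal), standing in for a skewed feature
rng = np.random.default_rng(0)
right_skewed = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=1000))

skew_before = right_skewed.skew()
skew_after = np.log1p(right_skewed).skew()  # log1p(x) = log(1 + x), safe at x = 0

print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```

`log1p` is used rather than a plain logarithm because several features (for example `slag` and `ash`) contain zeros, where `log` would be undefined.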

<br>

---


In [6]:
highly_skewed_features = df[features].skew()[abs(df[features].skew()) > 0.5].index

for feature in highly_skewed_features:
    df[feature] = log1p(df[feature])

x_train, x_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.3, random_state=42)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
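The fit-on-train, transform-on-both pattern used above can be checked on synthetic data. In this sketch (all names and values are illustrative), the training set ends up with zero mean and unit variance per feature, while the test set is only approximately standardized because it reuses the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales
rng = np.random.default_rng(42)
X = rng.normal(loc=[200.0, 30.0], scale=[50.0, 10.0], size=(200, 2))

X_tr, X_te = train_test_split(X, test_size=0.3, random_state=42)

# Learn mean and standard deviation from the training split only
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print("train mean/std:", X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
print("test  mean/std:", X_te_scaled.mean(axis=0), X_te_scaled.std(axis=0))
```

Fitting the scaler on the training split alone keeps information about the test set from leaking into preprocessing.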

---

<br>

##   6\. 📊 Post-Transformation Visualization (Commented Out)

This commented-out section would visualize the feature distributions *after* the log transformation, allowing us to see how the transformation affected the skewness.

<br>

---

In [7]:
# plt.figure(figsize=(25, 20))
# plt.subplot(3, 3, 1)
# plt.title('Box plot of features')
# plt.xlabel('Features')
# plt.ylabel('Value')
# plt.xticks(rotation=45)
# plt.grid(axis='y')
# plt.boxplot([df[feature] for feature in features], labels=features, showfliers=True)
# plt.show()

# for feature in features:
# 	plt.figure(figsize=(10, 5))
# 	plt.boxplot(df[feature], showfliers=True, vert=False)
# 	plt.title(f'Box plot of {feature}')
# 	plt.show()

---

<br>

##   8\. 🤖 Linear Regression Model

Here, we train a Linear Regression model and evaluate it with the R-squared metric, the proportion of variance in the target that the model explains (reported here as a percentage). We compute R-squared on both the training and testing sets; a training score much higher than the testing score signals overfitting. The cell prints the training score, the testing score, the absolute gap between them in percentage points, and `1 - diff`.
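For reference, R-squared is defined as $1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the target from its mean. A minimal sketch of that formula follows; the helper `r_squared` and the predictions in `y_hat_demo` are hypothetical, while `y_true_demo` reuses the first five `strength` values shown earlier.

```python
import numpy as np


def r_squared(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot


y_true_demo = [29.89, 23.51, 29.22, 45.85, 18.29]  # first five strength values
y_hat_demo = [31.0, 22.0, 30.0, 43.0, 20.0]        # made-up predictions

score = r_squared(y_true_demo, y_hat_demo)
# Always predicting the mean gives R-squared of 0 by construction
baseline = r_squared(y_true_demo, [np.mean(y_true_demo)] * len(y_true_demo))

print(f"R-squared: {score * 100:.2f}%  baseline: {baseline * 100:.2f}%")
```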


<br>

---

In [8]:
LR = LinearRegression()
LR.fit(x_train, y_train)
y_pred = LR.predict(x_test)
y_train_pred = LR.predict(x_train)
training_r2 = LR.score(x_train, y_train) * 100
testing_r2 = LR.score(x_test, y_test) * 100
diff = abs(training_r2 - testing_r2)
print(training_r2, testing_r2, diff, 1-diff)

81.32785659503517 81.82655042523399 0.4986938301988175 0.5013061698011825


---

<br>

##   9\. 🌳 Decision Tree Regressor

This section trains and evaluates a Decision Tree Regressor. <br>
Hyperparameters (settings chosen before training that control the model's structure) such as `ccp_alpha`, `max_depth`, `max_leaf_nodes`, `min_samples_split`, and `max_features` constrain the tree's complexity.
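The effect of such constraints can be seen by comparing an unconstrained tree with a constrained one on synthetic data. This is only a sketch: the data, the variable names, and the specific limit `max_leaf_nodes=20` are assumptions chosen for illustration, not the notebook's tuned values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = np.sin(X_demo[:, 0]) * 10 + X_demo[:, 1] + rng.normal(0, 0.5, size=300)

# No limits: the tree keeps splitting until it memorizes the training data
unconstrained = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# Capped number of leaves and a minimum node size for splitting
constrained = DecisionTreeRegressor(
    max_leaf_nodes=20, min_samples_split=26, random_state=0
).fit(X_demo, y_demo)

print("leaves:", unconstrained.get_n_leaves(), "vs", constrained.get_n_leaves())
print("train R2:", unconstrained.score(X_demo, y_demo), "vs", constrained.score(X_demo, y_demo))
```

A lower training score for the constrained tree is expected; the aim of the constraints is a smaller gap between training and testing performance.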

<br>

---

In [9]:
DTR = DecisionTreeRegressor(ccp_alpha=0.689, max_depth=80, max_leaf_nodes=87, min_samples_split=26, max_features=4)
DTR.fit(x_train, y_train)
y_pred = DTR.predict(x_test)
y_train_pred = DTR.predict(x_train)

training_r2 = DTR.score(x_train, y_train) * 100
testing_r2 = DTR.score(x_test, y_test) * 100
diff = abs(training_r2 - testing_r2)
print(f'{training_r2} {testing_r2} {diff} {1-diff}')

82.42477563661568 73.78008052318323 8.644695113432448 -7.644695113432448

---

<br>

##   10\. 🌲 Random Forest and XGBoost Regressors

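This part explores more advanced models: Random Forest and XGBoost. <br>
These are ensemble methods, which combine many decision trees to make more accurate predictions. A random forest averages trees trained on bootstrap samples, while XGBoost builds trees sequentially, each one correcting the errors of the ones before it. Hyperparameters are again used to tune the models. The XGBoost code is commented out in the first cell; a grid search over the Random Forest hyperparameters follows.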
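To make the "combine multiple models" idea concrete, the sketch below checks that a random forest's regression prediction is the average of its individual trees' predictions, and reads the out-of-bag (OOB) score that becomes available with `bootstrap=True` and `oob_score=True`. The synthetic data and the small `n_estimators=50` are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

forest = RandomForestRegressor(
    n_estimators=50, bootstrap=True, oob_score=True, random_state=0
).fit(X_demo, y_demo)

# Predict with every tree separately, then average across trees
tree_predictions = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

print("matches forest.predict:", np.allclose(manual_average, forest.predict(X_demo)))
# R-squared estimated on samples each tree did not see during training
print("OOB R-squared:", forest.oob_score_)
```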

<br>

---

In [None]:
RFR = RandomForestRegressor(n_estimators=825, min_samples_split=8, min_samples_leaf=8, max_features=0.49, bootstrap=True, oob_score=True)
RFR.fit(x_train, y_train)
y_pred = RFR.predict(x_test)
y_train_pred = RFR.predict(x_train)

training_r2 = RFR.score(x_train, y_train) * 100
testing_r2 = RFR.score(x_test, y_test) * 100
diff = abs(training_r2 - testing_r2)
print(training_r2, testing_r2, diff, 1-diff)

# XGB = XGBRegressor(n_estimators=354, learning_rate=0.009, max_depth=6, gamma=0.15, subsample=0.045, colsample_bytree=0.8)
# XGB.fit(x_train, y_train)

# y_pred = XGB.predict(x_test)
# y_train_pred = XGB.predict(x_train)
# training_r2 = XGB.score(x_train, y_train) * 100
# testing_r2 = XGB.score(x_test, y_test) * 100
# diff = abs(training_r2 - testing_r2)
# print(training_r2, testing_r2, diff, 1-diff)

88.00463691522623 85.20503966832244 2.7995972469037866 -1.7995972469037866


```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
param_grid = {
    'n_estimators': [100, 300, 500, 800, 1200],  # Number of trees
    'max_depth': [None, 5, 10, 15],  # Maximum depth of the trees
    'min_samples_split': [2, 5, 10, 20],  # Minimum samples required to split an internal node
    'min_samples_leaf': [1, 5, 10, 20],  # Minimum number of samples required at a leaf node
    'min_impurity_decrease': [0.0, 0.05, 0.1],  # Minimum impurity decrease for a split
    'max_features': ['sqrt', 'log2', 0.6, 0.8],  # Number of features to consider at each split
    # Note: the estimator below sets oob_score=True, which is only valid with
    # bootstrap=True, so every bootstrap=False candidate fails to fit
    # (see the FitFailedWarning in the output).
    'bootstrap': [True, False],  # Whether to use bootstrap samples
}

RFR = RandomForestRegressor(oob_score=True, random_state=42)  # random_state for reproducibility
grid_search = GridSearchCV(
    estimator=RFR,
    param_grid=param_grid,
    cv=3,  # Number of cross-validation folds
    scoring='r2',  # Metric to optimize (R-squared)
    n_jobs=-1,  # Use all available cores
    verbose=2,  # Display progress
)

grid_search.fit(x_train, y_train)
print("Best hyperparameters:", grid_search.best_params_)
print("Best R-squared score:", grid_search.best_score_)
best_rfr = grid_search.best_estimator_
y_pred_tuned = best_rfr.predict(x_test)
y_train_pred_tuned = best_rfr.predict(x_train)
training_r2_tuned = best_rfr.score(x_train, y_train) * 100
testing_r2_tuned = best_rfr.score(x_test, y_test) * 100
diff_tuned = abs(training_r2_tuned - testing_r2_tuned)
print("Tuned Training R2:", training_r2_tuned)
print("Tuned Testing R2:", testing_r2_tuned)
print("Tuned Difference:", diff_tuned)
print("Tuned (1 - Difference):", 1 - diff_tuned)
```
### OUTPUT:
Fitting 3 folds for each of 7680 candidates, totalling 23040 fits

e:\Python312\Lib\site-packages\sklearn\model_selection\_validation.py:425: FitFailedWarning: 
11520 fits failed out of a total of 23040.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
11520 fits failed with the following error:
Traceback (most recent call last):
  File "e:\Python312\Lib\site-packages\sklearn\model_selection\_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "e:\Python312\Lib\site-packages\sklearn\base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "e:\Python312\Lib\site-packages\sklearn\ensemble\_forest.py", line 417, in fit
    raise ValueError("Out of bag estimation only available if bootstrap=True")
ValueError: Out of bag estimation only available if bootstrap=True

  warnings.warn(some_fits_failed_message, FitFailedWarning)
e:\Python312\Lib\site-packages\sklearn\model_selection\_search.py:979: UserWarning: One or more of the test scores are non-finite: [0.84728255 0.85030631 0.85021671 ...        nan        nan        nan]

Best hyperparameters: {'bootstrap': True, 'max_depth': None, 'max_features': 0.6, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 800}
Best R-squared score: 0.86545690159861
Tuned Training R2: 98.22626431289588
Tuned Testing R2: 93.0819510242481
Tuned Difference: 5.14431328864778
Tuned (1 - Difference): -4.14431328864778
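Half of the fits above failed because `oob_score=True` was fixed on the estimator while the grid also tried `bootstrap=False`. One possible way to avoid the wasted fits is to pass `param_grid` as a list of dictionaries so that `oob_score` is only enabled together with `bootstrap=True`. The sketch below shows this on a deliberately tiny, synthetic setup; the grid values and names such as `param_grid_demo` are illustrative and not the notebook's search space.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

# Each dictionary is searched separately, so invalid combinations never occur
param_grid_demo = [
    {"bootstrap": [True], "oob_score": [True], "max_features": ["sqrt", 0.6]},
    {"bootstrap": [False], "oob_score": [False], "max_features": ["sqrt", 0.6]},
]

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=30, random_state=42),
    param_grid=param_grid_demo,
    cv=3,
    scoring="r2",
    error_score="raise",  # fail loudly instead of silently recording NaN scores
)
search.fit(X_demo, y_demo)

mean_scores = search.cv_results_["mean_test_score"]
print("candidates:", len(mean_scores), "any NaN:", bool(np.isnan(mean_scores).any()))
print("best params:", search.best_params_)
```

A simpler alternative would be to drop `bootstrap=False` from the grid, since the best configuration found above used `bootstrap=True` anyway.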

---

<br>

##   12\. ⚙️ Support Vector Regression (SVR)

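This part demonstrates training and evaluating a Support Vector Regression (SVR) model, here with scikit-learn's default RBF kernel. SVR fits a function that ignores errors smaller than `epsilon` and penalizes larger deviations. <br>
Hyperparameters like `C` (penalty strength for errors outside the `epsilon` margin), `gamma` (RBF kernel width), and `epsilon` are tuned to control the model's behavior.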
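The role of `epsilon` can be illustrated with the epsilon-insensitive loss that SVR is built around: residuals smaller than `epsilon` cost nothing, and larger ones are penalized linearly. The helper name and the predictions below are hypothetical; only `epsilon=0.06` is taken from the cell that follows.

```python
import numpy as np


def epsilon_insensitive_loss(y_true, y_hat, epsilon):
    """Per-sample loss max(0, |y - y_hat| - epsilon)."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_hat, dtype=float))
    return np.maximum(0.0, residual - epsilon)


y_true_demo = np.array([29.89, 23.51, 45.85])
y_hat_demo = np.array([29.92, 24.51, 44.85])  # made-up predictions

losses = epsilon_insensitive_loss(y_true_demo, y_hat_demo, epsilon=0.06)
print(losses)  # the first residual (about 0.03) falls inside the margin
```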

<br>

---

In [11]:
SVM = SVR(C=410, gamma=0.01, epsilon=0.06)
SVM.fit(x_train, y_train)
y_pred = SVM.predict(x_test)
y_train_pred = SVM.predict(x_train)

training_r2 = SVM.score(x_train, y_train) * 100
testing_r2 = SVM.score(x_test, y_test) * 100
diff = abs(training_r2 - testing_r2)
print(training_r2, testing_r2, diff, 1-diff)

88.21301362638955 87.68199685877651 0.5310167676130391 0.46898323238696094


---

<br>

##   13\. 👯‍♀️ K-Nearest Neighbors (KNN) and XGBoost (Repeated)

<li>This final section repeats the training and evaluation of KNN and XGBoost.
<li>The XGBoost hyperparameters differ slightly from the commented-out version in section 10: `learning_rate` is 0.0099 instead of 0.009, and `gamma` is 0.12 instead of 0.15.
<li>Comparing training and testing R-squared across these runs is essential for model selection.
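As a reminder of what a KNN regressor computes, the sketch below predicts one query point by hand: find the `k` nearest training points by Euclidean distance (Minkowski with `p=2`) and average their targets, which is the default uniform weighting. The synthetic data and variable names are assumptions for illustration; `k = 15` mirrors `n_neighbors=15` in the cell below.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo[:, 0] * 3 + rng.normal(scale=0.1, size=100)
query = rng.normal(size=(1, 4))

k = 15

# Manual KNN: Euclidean distances, k closest points, mean of their targets
distances = np.linalg.norm(X_demo - query, axis=1)
nearest = np.argsort(distances)[:k]
manual_prediction = y_demo[nearest].mean()

knn_demo = KNeighborsRegressor(n_neighbors=k, metric="minkowski", p=2).fit(X_demo, y_demo)
library_prediction = knn_demo.predict(query)[0]

print(manual_prediction, library_prediction)
```

Because the prediction depends directly on distances, KNN is one of the models that benefits most from the standardization step in section 5.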

<br>
<br>
<br>

---

In [12]:
KNN = KNeighborsRegressor(algorithm='auto', metric='minkowski', p=2, leaf_size=65, n_neighbors=15)

KNN.fit(x_train, y_train)

y_pred = KNN.predict(x_test)
y_train_pred = KNN.predict(x_train)

training_r2 = KNN.score(x_train, y_train) * 100
testing_r2 = KNN.score(x_test, y_test) * 100

diff = abs(training_r2 - testing_r2)

print(training_r2, testing_r2, diff, 1-diff)

XGB = XGBRegressor(n_estimators=354, learning_rate=0.0099, max_depth=6, gamma=0.12, subsample=0.045, colsample_bytree=0.8)
XGB.fit(x_train, y_train)

y_pred = XGB.predict(x_test)
y_train_pred = XGB.predict(x_train)
training_r2 = XGB.score(x_train, y_train) * 100
testing_r2 = XGB.score(x_test, y_test) * 100
diff = abs(training_r2 - testing_r2)
print(training_r2, testing_r2, diff, 1-diff)

81.42361341220936 80.15828061920818 1.2653327930011784 -0.2653327930011784
82.987611701891 82.71176954659256 0.2758421552984345 0.7241578447015655
