# # Concrete Compressive Strength Regression Analysis
# 
# In this notebook, we explore the differences between linear and polynomial regression on the Concrete Compressive Strength Dataset. We will:
# 
# - Load and preprocess the data
# - Train and evaluate a linear regression model
# - Train and evaluate polynomial regression models (degrees 2, 3, and 4)
# - Visualize and compare the results
# - Discuss the bias-variance tradeoff observed in these models

# ## (a) Data Loading & Preprocessing

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
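# Download the dataset from the UCI repository.
# Note: reading legacy .xls files with pandas requires the xlrd package.
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls"
df = pd.read_excel(url)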

print("Dataset Head:")
print(df.head())

print("\nColumn Names:")
print(df.columns.tolist())

# Check for missing values
print("\nMissing Values per Column:")
print(df.isnull().sum())

# Features: every column except the last; target: the last column (compressive strength)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Split the dataset into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining set size:", X_train.shape)
print("Testing set size:", X_test.shape)

# ## (b) Implementing Linear Regression

In [None]:
# Create and train the linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred_lin = lin_reg.predict(X_test)

# Evaluate performance
mse_lin = mean_squared_error(y_test, y_pred_lin)
r2_lin = r2_score(y_test, y_pred_lin)

print("Linear Regression Performance:")
print("Mean Squared Error (MSE):", mse_lin)
print("Coefficient of Determination (R²):", r2_lin)

# Plot predicted vs. actual values for linear regression
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred_lin, alpha=0.6, label="Predicted vs. Actual")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', label="Perfect Prediction")
plt.xlabel("Actual Compressive Strength")
plt.ylabel("Predicted Compressive Strength")
plt.title("Linear Regression: Predicted vs. Actual")
plt.legend()
plt.show()
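
# MSE is expressed in squared target units, which makes it hard to read directly. Taking the square root gives the RMSE in the units of the target itself (MPa for this dataset, according to its UCI documentation). A minimal sketch, where `rmse_from_mse` is a hypothetical helper and `example_mse` is a made-up value rather than a result from this notebook:

```python
import math


def rmse_from_mse(mse: float) -> float:
    """Convert a mean squared error into a root mean squared error."""
    return math.sqrt(mse)


# Hypothetical value for illustration only; not a result from this notebook
example_mse = 100.0
print(f"RMSE: {rmse_from_mse(example_mse):.2f}")
```

# In the notebook, the same conversion could be applied to `mse_lin`.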

# ## (c) Implementing Polynomial Regression (Degrees 2, 3, 4)


In [None]:
# Define the degrees we want to test
degrees = [2, 3, 4]

# Dictionaries to store performance metrics and predictions for each polynomial degree
mse_poly = {}
r2_poly = {}
y_pred_poly = {}

plt.figure(figsize=(15, 4))

for i, degree in enumerate(degrees):
    # Transform the features for the current polynomial degree
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)
    
    # Train the model on the polynomial features
    poly_reg = LinearRegression()
    poly_reg.fit(X_train_poly, y_train)
    
    # Predict on the test set
    y_pred = poly_reg.predict(X_test_poly)
    y_pred_poly[degree] = y_pred
    
    # Evaluate performance
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    mse_poly[degree] = mse
    r2_poly[degree] = r2
    
    print(f"Polynomial Degree {degree} Performance:")
    print(f"Mean Squared Error (MSE): {mse}")
    print(f"Coefficient of Determination (R²): {r2}\n")
    
    # Plot predicted vs. actual values for current polynomial degree
    plt.subplot(1, len(degrees), i+1)
    plt.scatter(y_test, y_pred, alpha=0.6, label="Predicted vs. Actual")
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', label="Perfect Prediction")
    plt.xlabel("Actual")
    plt.ylabel("Predicted")
    plt.title(f"Poly Degree {degree}")
    plt.legend()

plt.tight_layout()
plt.show()
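
# The number of columns produced by `PolynomialFeatures` grows quickly with the degree, which is worth keeping in mind when comparing the models above. For `n` input features and degree `d`, the expansion contains every monomial of total degree at most `d`, i.e. C(n + d, d) terms including the constant. The sketch below counts them using only the standard library; `n_polynomial_terms` is a hypothetical helper, and `n_features = 8` assumes the eight input columns of the concrete dataset.

```python
from math import comb


def n_polynomial_terms(n_features: int, degree: int, include_bias: bool = False) -> int:
    """Count monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)
    # The constant (bias) term is dropped when include_bias=False
    return total if include_bias else total - 1


n_features = 8  # assumption: eight input columns, as in the concrete dataset
term_counts = {d: n_polynomial_terms(n_features, d) for d in (1, 2, 3, 4)}
for d, count in term_counts.items():
    print(f"degree {d}: {count} polynomial features")
```

# Under that assumption, degree 4 yields 494 features, so the degree-4 model estimates several hundred coefficients from the training split.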

# ## (d) Visualizing & Comparing Results

In [None]:
# Create a combined plot for comparison
plt.figure(figsize=(8, 6))
# Actual values plotted against themselves; these points lie on the perfect-prediction diagonal
plt.scatter(y_test, y_test, color="black", alpha=0.5, label="Actual Data")

# Plot linear regression predictions
plt.scatter(y_test, y_pred_lin, alpha=0.6, label="Linear Regression")

# Plot polynomial regression predictions for each degree
colors = {2: "blue", 3: "green", 4: "orange"}
for degree in degrees:
    plt.scatter(y_test, y_pred_poly[degree], alpha=0.6, label=f"Poly Degree {degree}", color=colors[degree])

plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', label="Perfect Prediction")
plt.xlabel("Actual Compressive Strength")
plt.ylabel("Predicted Compressive Strength")
plt.title("Model Comparison: Predicted vs. Actual")
plt.legend()
plt.show()

# Summarize the performance metrics
print("Performance Summary:")
print(f"Linear Regression -> MSE: {mse_lin:.4f}, R²: {r2_lin:.4f}")
for degree in degrees:
    print(f"Polynomial Degree {degree} -> MSE: {mse_poly[degree]:.4f}, R²: {r2_poly[degree]:.4f}")

# ## (e) Bias-Variance Tradeoff Analysis
# 
# **Discussion:**
# 
# - **High Bias / Low Variance:**  
#   The **linear regression** model is simple and might not capture the complex nonlinear relationships in the data. This simplicity typically results in high bias (underfitting) but low variance.
# 
# - **Low Bias / High Variance:**  
#   The **polynomial regression model with degree 4** is very flexible: with eight input features, `PolynomialFeatures(degree=4, include_bias=False)` generates 494 terms. It can follow the training data closely, capturing noise in addition to the underlying trend. This tends to give low bias but high variance (overfitting), which may hurt performance on unseen data.
# 
# - **Balanced Bias and Variance:**  
#   A **polynomial model with degree 2 or 3** may strike a better balance between bias and variance: flexible enough to capture nonlinear patterns, yet less prone to overfitting than a higher-degree polynomial. The test-set MSE and R² from part (d) indicate whether this holds for this particular split.
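# 
# A single train/test split gives a noisy picture of which degree balances bias and variance best. One common alternative is k-fold cross-validation. The sketch below is an illustration on synthetic stand-in data (`X_demo`, `y_demo` are assumptions, not the concrete dataset), and it adds a `StandardScaler` step that the notebook itself does not use:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): a noisy quadratic in two features
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(200, 2))
y_demo = (
    1.0
    + X_demo[:, 0] ** 2
    - X_demo[:, 0] * X_demo[:, 1]
    + rng.normal(0, 0.3, size=200)
)

cv_mse = {}
for degree in (1, 2, 3, 4):
    # Scaling and feature expansion live inside the pipeline,
    # so they are refit on the training part of every fold
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[degree] = -scores.mean()  # flip the sign back to a positive MSE
    print(f"degree {degree}: mean CV MSE = {cv_mse[degree]:.4f}")

best_degree = min(cv_mse, key=cv_mse.get)
print("Degree with the lowest mean CV MSE:", best_degree)
```

# To apply the same idea to the notebook's data, `X_demo` and `y_demo` could be replaced with `X_train` and `y_train`.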
# 
# **Why Higher-Degree Polynomials Tend to Overfit:**
# 
# Higher-degree polynomials have many parameters, which lets them fit the training data very closely, including its noise. This produces a very low training error, but the fitted function depends heavily on the particular training sample, so predictions on unseen data can fluctuate widely (high variance). This is overfitting: the model performs well on the training data but generalizes poorly to new data.
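# 
# The gap between training and test error is a practical way to check for this effect. The following sketch fits polynomials of increasing degree to a small synthetic one-dimensional problem (`X_demo`, `y_demo`, and the chosen degrees are assumptions for illustration, not the concrete dataset) and records both errors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): a noisy sine wave on [-1, 1]
rng = np.random.default_rng(42)
X_demo = rng.uniform(-1, 1, size=(60, 1))
y_demo = np.sin(3 * X_demo[:, 0]) + rng.normal(0, 0.3, size=60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0
)

train_mse, test_mse = {}, {}
for degree in (1, 3, 6, 12):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_tr, y_tr)
    train_mse[degree] = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse[degree] = mean_squared_error(y_te, model.predict(X_te))
    print(
        f"degree {degree:2d}: train MSE = {train_mse[degree]:.4f}, "
        f"test MSE = {test_mse[degree]:.4f}"
    )
```

# Because each lower-degree model is nested inside the higher-degree ones, the training MSE cannot increase with the degree. The test MSE has no such guarantee, and a widening gap between the two is the usual sign of overfitting.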