### Using the same dataset: Civil_Engineering_Regression_Dataset.csv

Part 3: Multiple Linear Regression
Fit a multiple linear regression model using Building Height, Material Quality, Labor Cost, Concrete Strength, and Foundation Depth as independent variables.
What is the equation of the multiple regression model?
Which independent variable has the highest impact on Construction Cost based on the regression coefficients?
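
A note on the last question: raw regression coefficients are expressed in cost per unit of each variable, so their sizes are not directly comparable when the variables sit on different scales (for example, Labor_Cost versus Material_Quality_Index). One common way to put them on a common footing is to rescale each coefficient by std(x) / std(y). The sketch below assumes that approach; the function name `standardized_coefficients` and the toy numbers are hypothetical and are not taken from the assignment or the dataset.

```python
from statistics import stdev


def standardized_coefficients(coefficients, columns, target):
    """Scale each raw coefficient by std(x) / std(y) to get a unit-free value.

    coefficients: dict mapping variable name -> raw regression coefficient
    columns:      dict mapping variable name -> sequence of observed values
    target:       sequence of observed values of the dependent variable
    """
    target_std = stdev(target)
    return {
        name: coef * stdev(columns[name]) / target_std
        for name, coef in coefficients.items()
    }


# Toy numbers for illustration only; they are NOT taken from the dataset.
toy_coefficients = {"height": 50.0, "depth": 200.0}
toy_columns = {
    "height": [10.0, 20.0, 30.0, 40.0, 50.0],
    "depth": [5.0, 6.0, 7.0, 8.0, 9.0],
}
toy_target = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]

std_coefs = standardized_coefficients(toy_coefficients, toy_columns, toy_target)

# Rank variables by the absolute size of their standardized coefficient.
ranking = sorted(std_coefs, key=lambda name: abs(std_coefs[name]), reverse=True)
print(ranking)
```

In this toy case the variable with the larger raw coefficient ("depth") ranks second once scale is taken into account. With the notebook's own variables, a call along the lines of `standardized_coefficients(coefficients, {v: X_train[v].tolist() for v in independent_variables}, y_train.tolist())` should work, assuming all columns are numeric.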

In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
df = pd.read_csv("Civil_Engineering_Regression_Dataset.csv")

# Trim whitespace from column names
df.columns = df.columns.str.strip()

# Display first few rows
print("First few rows of the dataset:")
print(df.head())

# Identify independent and dependent variables
dependent_variable = "Construction_Cost"
independent_variables = ["Building_Height", "Material_Quality_Index", "Labor_Cost", "Concrete_Strength", "Foundation_Depth"]

# Check if required columns exist
missing_columns = [col for col in independent_variables + [dependent_variable] if col not in df.columns]
if missing_columns:
    print("Error: The following required columns are missing:", missing_columns)
else:
    # Prepare data for regression
    X = df[independent_variables]  # Independent variables
    y = df[dependent_variable]  # Dependent variable
    
    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Fit multiple linear regression model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Get regression coefficients
    intercept = model.intercept_
    coefficients = dict(zip(independent_variables, model.coef_))
    print("\nEquation of multiple regression model:")
    print(f"Construction Cost = {intercept:.2f} " + " ".join([f"+ ({coeff:.2f} * {var})" for var, coeff in coefficients.items()]))
    
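    # Identify the most impactful variable by absolute coefficient size,
    # so that a large negative coefficient is not overlooked.
    # Note: raw coefficients depend on each variable's units, so this ranks
    # the effect per unit change rather than a scale-free importance.
    most_impactful_variable = max(coefficients, key=lambda var: abs(coefficients[var]))
    print(f"\nVariable with highest impact: {most_impactful_variable} ({coefficients[most_impactful_variable]:.2f} change in Construction Cost per unit increase)")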
    
    # Evaluate model performance
    y_pred = model.predict(X_test)
    r_squared = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    print("\nModel Performance:")
    print(f"R-squared: {r_squared:.4f}")
    print(f"Mean Squared Error: {mse:.4f}")

First few rows of the dataset:
   Project_ID  Building_Height  Material_Quality_Index  Labor_Cost  \
0           1        21.854305                       9   70.213332   
1           2        47.782144                       9  142.413614   
2           3        37.939727                       3  110.539985   
3           4        31.939632                       6  250.784939   
4           5        12.020839                       7  167.575159   

   Concrete_Strength  Foundation_Depth  Weather_Index  Construction_Cost  
0          45.326394          8.804790              4        2400.287931  
1          47.900505          6.727632              6        3705.461312  
2          22.112484          8.208544              8        2653.631004  
3          26.267562          7.094515              4        2534.099466  
4          40.134306          6.160303              6        1741.179333  

Equation of multiple regression model:
Construction Cost = -9.64 + (49.81 * Building_Height) + (1