4)	Implement a multiple linear regression model for the house.csv dataset

Regression is a statistical method for estimating the relationship between one or more
features and an outcome (response) variable. Multiple linear regression models the
relationship between two or more features and a response by fitting a linear equation to the
observed data. It is a direct extension of simple linear regression.
Simple Linear Regression: This is the simplest form of linear regression. It involves
only one independent variable and one dependent variable. The equation for simple linear
regression is: Y = β0 + β1·X,
where Y is the dependent variable, X is the independent variable, β0 is the intercept, and
β1 is the slope.
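As a minimal sketch of the equation above, the intercept and slope can be estimated with the standard least-squares formulas. The function name `fit_simple_linear_regression` and the demo data are made up for illustration and are not part of the house.csv exercise.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return (beta0, beta1) minimizing squared error for y ≈ beta0 + beta1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x
    beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# Made-up data lying exactly on the line y = 2 + 3x
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 + 3.0 * x_demo
beta0, beta1 = fit_simple_linear_regression(x_demo, y_demo)
print(f"intercept = {beta0:.2f}, slope = {beta1:.2f}")  # intercept = 2.00, slope = 3.00
```

Because the demo points lie exactly on a line, the fit recovers β0 = 2 and β1 = 3. With real, noisy data the estimates would only approximate the underlying relationship.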
Multiple Linear Regression: This involves more than one independent variable and one
dependent variable. The equation for multiple linear regression is:
Y = β0 + β1·X1 + β2·X2 + … + βn·Xn
where Y is the dependent variable, X1, X2, …, Xn are the independent variables, β0 is the
intercept, and β1, β2, …, βn are the coefficients (slopes) of the corresponding variables.
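The multiple-regression equation above can be fitted by ordinary least squares. The sketch below uses `np.linalg.lstsq` on a design matrix with a leading column of ones, so that the first coefficient is the intercept. The helper name `fit_multiple_linear_regression` and the demo data are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_multiple_linear_regression(X, y):
    """Return [beta0, beta1, ..., betan] from an ordinary least-squares fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the first coefficient is the intercept
    design = np.column_stack([np.ones(X.shape[0]), X])
    # Solve min ||design @ coefs - y||^2
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

# Made-up data generated exactly from Y = 1 + 2*X1 - 3*X2
X_demo = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 5.0], [5.0, 3.0]])
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]
betas = fit_multiple_linear_regression(X_demo, y_demo)
print(np.round(betas, 6))  # approximately [1, 2, -3]
```

This is the same fitting criterion that scikit-learn's `LinearRegression` uses, written out here only to make the role of each β explicit.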
Discussion-
• Simple (univariate) linear regression uses a single independent variable to predict a
dependent variable, fitting a straight line to the data.
• Multiple linear regression uses several independent variables to predict a single
dependent variable, which lets it model more complex relationships between the variables.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import matplotlib.pyplot as plt

# Load the data
data = pd.read_csv('house.csv')

# read_csv treats the first row as the header by default.
# Rename the columns to short, consistent names; this assumes house.csv
# has exactly these 8 columns, in this order.
column_names = ['size', 'bedrooms', 'bathrooms', 'year', 'lot_size', 'garage_size', 'neighborhood_quality', 'price']
data.columns = column_names

# Split features and target
X = data.drop('price', axis=1)
y = data['price']

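# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features: fit the scaler on the training set only, then apply the
# same transformation to the test set, so no test-set statistics leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)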

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")

# Visualizations
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual vs Predicted House Prices")
plt.tight_layout()
plt.show()

# Residuals vs. predictions: for a good fit they scatter randomly around zero
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, alpha=0.5)
plt.plot([y_pred.min(), y_pred.max()], [0, 0], 'r--', lw=2)
plt.xlabel("Predicted Price")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.tight_layout()
plt.show()

# Features were standardized, so absolute coefficient sizes are comparable across features
feature_importance = pd.DataFrame({'feature': X.columns, 'importance': np.abs(model.coef_)})
feature_importance = feature_importance.sort_values('importance', ascending=False)
plt.figure(figsize=(10, 6))
plt.bar(feature_importance['feature'], feature_importance['importance'])
plt.xlabel("Features")
plt.ylabel("Absolute Coefficient Value")
plt.title("Feature Importance")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
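The MAE, MSE and RMSE printed above are expressed in price units, so their size depends on the scale of the target. A scale-free complement is the coefficient of determination, R². The sketch below computes it by hand on made-up numbers. The function name `r2_score_manual` is hypothetical, and scikit-learn offers the same metric as `r2_score` in `sklearn.metrics`.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the share of target variance explained by the model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up actual and predicted prices, for illustration only
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 310.0, 390.0])
print(f"R^2 = {r2_score_manual(y_true_demo, y_pred_demo):.3f}")  # R^2 = 0.992
```

In the notebook above, the same function could be called as `r2_score_manual(y_test, y_pred)`. A value near 1 indicates that the model explains most of the variance in the test prices.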