# Multiple Linear Regression

**Multiple linear regression** extends simple linear regression to model the relationship between a dependent variable and **multiple** independent variables. It is used to predict a continuous outcome from several predictors. The model assumes that the dependent variable can be expressed as a linear combination of the independent variables plus an error term:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$$

where:

-   $y$ is the dependent variable (the outcome we want to predict).
-   $\beta_0$ is the intercept (the expected value of $y$ when all $x_i$ are 0).
-   $\beta_1, \beta_2, \dots, \beta_n$ are the coefficients for each independent variable $x_i$ (the expected change in $y$ for a one-unit change in $x_i$, holding the other predictors constant).
-   $x_1, x_2, \dots, x_n$ are the independent variables (the predictors).
-   $\epsilon$ is the error term (the part of $y$ that the linear combination of predictors does not explain; the residuals, observed minus predicted values, are its estimates).
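
As a minimal sketch of how the equation produces a prediction, the snippet below evaluates the linear part of the model with NumPy. The coefficient values and inputs are made up for illustration only.

```python
import numpy as np

# Hypothetical values chosen for illustration (not taken from any dataset)
beta_0 = 1.0                      # intercept
beta = np.array([2.0, -3.0])      # beta_1, beta_2
X_demo = np.array([[1.0, 2.0],    # each row is one observation (x_1, x_2)
                   [0.0, 1.0]])

# Linear part of the model: beta_0 + beta_1 * x_1 + beta_2 * x_2
y_hat_demo = beta_0 + X_demo @ beta
print(y_hat_demo)  # [-3. -2.]
```

The first row gives $1 + 2 \cdot 1 - 3 \cdot 2 = -3$ and the second gives $1 + 2 \cdot 0 - 3 \cdot 1 = -2$.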

## Assumptions of Multiple Linear Regression

The assumptions of multiple linear regression are similar to those of simple linear regression, but with some additional considerations due to the presence of multiple predictors:

1.  **Linearity**: The relationship between the dependent variable and each independent variable is linear.
2.  **Independence**: The observations are independent of each other.
3.  **Homoscedasticity**: The variance of the errors is constant across all levels of the independent variables.
4.  **Normality of errors**: The errors (residuals) are normally distributed.
5.  **No multicollinearity**: The independent variables are not too highly correlated with each other.
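
One common way to check the multicollinearity assumption is the variance inflation factor (VIF): regress each predictor on the remaining ones and compute $1 / (1 - R_j^2)$. The sketch below is one possible implementation using `scikit-learn`; the helper name `vif` and the synthetic columns are assumptions made for this illustration. Values near 1 suggest little collinearity, and a common rule of thumb treats values above roughly 5 to 10 as a warning sign.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Variance inflation factor of each column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)   # all predictors except column j
        # R^2 of predicting column j from the other columns
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)                  # generated independently of x1
x3 = x1 + 0.05 * rng.normal(size=200)      # nearly a copy of x1

vif_independent = vif(np.column_stack([x1, x2]))
vif_collinear = vif(np.column_stack([x1, x2, x3]))
print("Independent predictors:", vif_independent)
print("With a near-duplicate column:", vif_collinear)
```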

## Practical Demonstration

We will use `scikit-learn` to generate a synthetic dataset for linear regression. This lets us demonstrate the concepts without spending time on data preprocessing or feature engineering.

The `make_regression` function creates a regression dataset with a specified number of samples, features, and noise level, which is useful for testing and learning. With `coef=True` it also returns the true coefficients used to generate the data, so the fitted model can be checked against them.

-   Generate a synthetic dataset using the `make_regression` function from `scikit-learn`

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

# Generate a synthetic dataset; coef=True also returns the true coefficients
X, y, coef = make_regression(n_samples=100,
                             n_features=2,
                             bias=1, noise=10,
                             coef=True, random_state=42)
# Print the coefficients used to generate the data
print("True coefficients:", coef)

# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=['Feature 1', 'Feature 2'])
df['Target'] = y
print(df.head())

-   Explore the dataset and create a correlation matrix to visualize the relationships between the features and the target variable.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Features and Target')
plt.show()

-   Perform the train-test split to prepare the data for model training and evaluation.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

-   Fit a linear regression model to the training data and make predictions on the test data.

In [None]:
# Fit a linear regression model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
# Print the coefficients and intercept
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
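
Because the data are synthetic, the fitted parameters can be compared with the ones used to generate the data. The self-contained sketch below repeats the steps above and also solves the same least-squares problem directly with `np.linalg.lstsq`, as a sanity check that `LinearRegression` performs an ordinary least-squares fit. Names with the `_chk` suffix are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_chk, y_chk, coef_chk = make_regression(n_samples=100, n_features=2, bias=1,
                                         noise=10, coef=True, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_chk, y_chk,
                                          test_size=0.2, random_state=42)

model_chk = LinearRegression().fit(X_tr, y_tr)

# Same fit via least squares on a design matrix with a leading column of ones
A = np.column_stack([np.ones(len(X_tr)), X_tr])
params, *_ = np.linalg.lstsq(A, y_tr, rcond=None)  # [intercept, beta_1, beta_2]

print("True coefficients:", coef_chk, "bias: 1")
print("LinearRegression: ", model_chk.coef_, "intercept:", model_chk.intercept_)
print("np.linalg.lstsq:  ", params[1:], "intercept:", params[0])
```

The two fits should agree up to floating-point error, while both will differ somewhat from the true values because of the noise added to the target.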

-   Visualize the predictions vs actual values, the residuals, and the residuals vs the predicted values.

In [None]:
# Visualize the predictions vs actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
# Dashed diagonal marks perfect predictions (predicted == actual)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.show()

In [None]:
# Visualize the residuals
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True)
plt.title('Residuals Distribution')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Visualize the residuals vs predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, color='k', linestyle='--', lw=2)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.show()
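
The plots can be complemented with a couple of numeric checks. The sketch below is one possible approach, not a complete diagnostic procedure: it confirms that the training residuals average to zero (which holds for least squares with an intercept) and runs a Shapiro-Wilk normality test from `scipy.stats` on the test residuals. It rebuilds the data so that it runs on its own; names with the `_r` suffix are introduced only for this illustration.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_r, y_r = make_regression(n_samples=100, n_features=2, bias=1,
                           noise=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_r, y_r,
                                          test_size=0.2, random_state=42)
model_r = LinearRegression().fit(X_tr, y_tr)

train_resid = y_tr - model_r.predict(X_tr)
test_resid = y_te - model_r.predict(X_te)

# With an intercept, least-squares residuals on the training data average to zero
print("Mean training residual:", train_resid.mean())
print("Mean test residual:", test_resid.mean())

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
stat, p_value = stats.shapiro(test_resid)
print("Shapiro-Wilk statistic:", stat, "p-value:", p_value)
```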

-   Evaluate the model using Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and the R-squared ($R^2$) score.

In [None]:
# Evaluate the model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R^2 Score:", r2)
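
To make the definitions behind these metrics concrete, the sketch below computes them by hand with NumPy on a small made-up example and compares the results with the `scikit-learn` functions. The arrays `y_true_demo` and `y_pred_demo` are illustrative values, not output from the model above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Small made-up example so the arithmetic is easy to follow
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true_demo - y_pred_demo
mae_manual = np.mean(np.abs(errors))      # average absolute error
mse_manual = np.mean(errors ** 2)         # average squared error
rmse_manual = np.sqrt(mse_manual)         # same units as the target
ss_res = np.sum(errors ** 2)              # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot           # share of variance explained

print("MAE: ", mae_manual, mean_absolute_error(y_true_demo, y_pred_demo))
print("MSE: ", mse_manual, mean_squared_error(y_true_demo, y_pred_demo))
print("RMSE:", rmse_manual)
print("R^2: ", r2_manual, r2_score(y_true_demo, y_pred_demo))
```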

## Hands-on Exercises

-   Generate a synthetic dataset using the `make_regression` function from `scikit-learn`; use the following parameters:
    -   `n_samples` = 200
    -   `n_features` = 3
    -   `noise` = 10
    -   `random_state` = 42
    -   `bias` = 2
-   Explore the dataset and create a correlation matrix to visualize the relationships between the features and the target variable.
-   Choose two features from the dataset as independent variables and use the target variable as the dependent variable.
-   Split the dataset into training and testing sets (80% train, 20% test).
-   Fit a linear regression model to the training data.
-   Print the coefficients and intercept of the model.
-   Make predictions on the `test` dataset.
-   Calculate and print the Mean Squared Error (MSE) and R-squared ($R^2$) score of the model on the test set.
-   Visualize the predictions vs actual values, the residuals, and the residuals vs the predicted values.
-   Modify the synthetic dataset by introducing a non-linear relationship (e.g., quadratic or exponential), fit a linear regression model to the new dataset, and observe how the performance metrics change.