# Module 2: Linear Regression 2 Practice

## Introduction
In this notebook, you'll learn how to implement a multivariate Linear Regression model using scikit-learn.

## Initial Knowledge Check
1. Why might multivariate regression be more powerful than univariate?
2. What is multicollinearity and why can it be problematic?
3. How do you interpret regression coefficients in a multivariate context?

## 1. Load the Data
Load the dataset and separate the features from the target.

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv('./data/linear_regression_2.csv')
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']
df.head()

## 2. Exploratory Data Analysis
Visualize the pairwise relationships between the features and the correlation matrix.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, vars=['feature1','feature2','feature3'], diag_kind='kde')
plt.suptitle('Pairplot of Multivariate Regression Data', y=1.02)
plt.show()

# numeric_only=True avoids an error if the CSV contains non-numeric columns
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
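
A heatmap is easy to read with three features but harder with many. As a minimal sketch, the correlation matrix can also be scanned programmatically for strongly correlated feature pairs. The helper name `high_corr_pairs`, the 0.8 threshold, and the synthetic `demo` DataFrame are all assumptions made for illustration; they are not part of the course dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's DataFrame (illustrative values only)
rng = np.random.default_rng(0)
f1 = rng.normal(size=200)
demo = pd.DataFrame({
    'feature1': f1,
    'feature2': 0.9 * f1 + rng.normal(scale=0.3, size=200),  # built to correlate with feature1
    'feature3': rng.normal(size=200),                        # independent of the others
})

def high_corr_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, r) for each column pair with |r| >= threshold."""
    corr = frame.select_dtypes('number').corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

pairs = high_corr_pairs(demo)
print(pairs)
```

The threshold is a rule of thumb, not a fixed standard; pairs flagged here are candidates for the VIF check in the bonus exercise.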

## 3. Train a Multivariate Linear Regression Model
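We'll train a `LinearRegression` model on the full dataset and evaluate it using MSE and R². Because these metrics are computed on the same data used for fitting, they tend to be optimistic; the exercise in Section 4 adds a held-out test set.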


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Instantiate and train
lr = LinearRegression()
lr.fit(X, y)

# Predict and evaluate
y_pred = lr.predict(X)
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print("Intercept:", lr.intercept_)
print("Coefficients:", lr.coef_)
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
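
To make the fitted quantities less of a black box, the sketch below reproduces the intercept, coefficients, MSE, and R² from their definitions using NumPy's least-squares solver, then compares them with scikit-learn. The data is synthetic and the names `X_demo`, `y_demo`, and `beta` are hypothetical; this is an illustration of ordinary least squares, not a description of scikit-learn's internal implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data (illustrative): y = 2 + 1.5*x1 - 0.5*x2 + 3*x3 + noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = 2.0 + X_demo @ np.array([1.5, -0.5, 3.0]) + rng.normal(scale=0.1, size=100)

# Least-squares fit with an explicit column of ones for the intercept
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
y_hat = A @ beta

# Metrics computed from their definitions
mse_manual = np.mean((y_demo - y_hat) ** 2)
ss_res = np.sum((y_demo - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# Same quantities from scikit-learn
model = LinearRegression().fit(X_demo, y_demo)
y_sk = model.predict(X_demo)
print("Intercepts match:  ", np.isclose(beta[0], model.intercept_))
print("Coefficients match:", np.allclose(beta[1:], model.coef_))
print("MSE match:         ", np.isclose(mse_manual, mean_squared_error(y_demo, y_sk)))
print("R² match:          ", np.isclose(r2_manual, r2_score(y_demo, y_sk)))
```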

## 4. Exercise for the Student
**Task:**  
1. Split the data into train/test (75/25).  
2. Evaluate model performance on both sets.  
3. Use statsmodels to fit the same model and display a detailed summary (coefficients, p-values, etc.).  
4. **Bonus:** Detect multicollinearity using Variance Inflation Factor (VIF).


## 5. Solution
Below is one possible solution, including statsmodels and VIF calculation.


In [None]:
from sklearn.model_selection import train_test_split
from statsmodels.api import OLS, add_constant
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=24
)

# Train scikit-learn
lr2 = LinearRegression()
lr2.fit(X_train, y_train)
print('Sklearn Performance:')
print('Train R²:', r2_score(y_train, lr2.predict(X_train)))
print('Test R²:', r2_score(y_test, lr2.predict(X_test)))

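# Statsmodels summary
# Note: this fits on the full dataset, so the coefficients match `lr` from
# Section 3. Pass X_train / y_train instead to compare directly with lr2.
X_sm = add_constant(X)  # statsmodels does not add an intercept automatically
model_sm = OLS(y, X_sm).fit()
print(model_sm.summary())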

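# Calculate VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Include the intercept column in the design matrix; without it the
# auxiliary regressions have no intercept and the VIFs can be misleading.
X_vif = add_constant(X)
vif_data = pd.DataFrame()
vif_data['feature'] = X.columns
# Start at index 1 to skip the constant column itself
vif_data['VIF'] = [variance_inflation_factor(X_vif.values, i) for i in range(1, X_vif.shape[1])]
print(vif_data)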
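
To see what the VIF measures, it can be computed by hand: regress each feature on the remaining features (with an intercept) and take 1 / (1 - R²) of that auxiliary regression. The sketch below does this with scikit-learn on synthetic data. The helper `manual_vif` and the `demo` DataFrame are hypothetical, and the values should agree with statsmodels only when the statsmodels design matrix includes a constant column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative): feature2 is feature1 plus a little noise
rng = np.random.default_rng(1)
a = rng.normal(size=300)
demo = pd.DataFrame({
    'feature1': a,
    'feature2': a + rng.normal(scale=0.2, size=300),
    'feature3': rng.normal(size=300),
})

def manual_vif(frame):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest."""
    vifs = {}
    for col in frame.columns:
        others = frame.drop(columns=col)
        # LinearRegression fits an intercept by default, so R^2 is centered
        r2 = LinearRegression().fit(others, frame[col]).score(others, frame[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name='VIF')

vif = manual_vif(demo)
print(vif)
```

A VIF near 1 means a feature is barely explained by the others. Thresholds such as 5 or 10 are common rules of thumb for flagging multicollinearity rather than strict cutoffs.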

---
### Next Steps
- Explore regularization methods (Ridge, Lasso) to address multicollinearity.
- Review diagnostic plots: residuals, QQ plot, leverage.
- Prepare for Logistic Regression by considering binary targets and classification metrics.
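
As a preview of the first next step, the sketch below compares ordinary least squares with Ridge on two nearly identical synthetic features. The data, the variable names, and `alpha=1.0` are assumptions chosen for illustration; in practice the penalty strength would be tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (illustrative): x2 is almost a copy of x1
rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)  # L2 penalty on the coefficients

print('OLS coefficients:  ', ols.coef_)
print('Ridge coefficients:', ridge.coef_)
print('OLS coefficient norm:  ', np.linalg.norm(ols.coef_))
print('Ridge coefficient norm:', np.linalg.norm(ridge.coef_))
```

With collinear features, OLS can split the shared effect between the two columns in an unstable way. The Ridge penalty pulls the coefficient vector toward zero, which usually yields a more even and more stable split at the cost of some bias.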
