
# 🏡 Linear Regression and Multicollinearity Analysis on Housing Data

This section continues from where we split the dataset into training and testing sets. We will now apply linear regression with different feature combinations and analyze the multicollinearity in the dataset.
---

## 1. Linear Regression with One Feature (Area)

We will start by building a simple linear regression model using just the `area` feature to predict the `price`.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Use 'area' as the only feature
X_train_one = X_train[['area']]
X_test_one = X_test[['area']]

# Initialize and train the linear regression model
lr_one = LinearRegression()
lr_one.fit(X_train_one, y_train)

# Make predictions
y_pred_one = lr_one.predict(X_test_one)

# Evaluate the model on the held-out test set
print("One feature - MSE:", mean_squared_error(y_test, y_pred_one))
print("One feature - R^2:", r2_score(y_test, y_pred_one))
```
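
For a single feature, ordinary least squares has a closed form: the slope is the covariance of `area` and `price` divided by the variance of `area`, and the intercept makes the line pass through the two means. The sketch below checks this on synthetic data (made-up numbers, not the housing dataset) using only NumPy. On the real training data, `lr_one.coef_` and `lr_one.intercept_` should agree with the same formula.

```python
import numpy as np

# Synthetic stand-in data: the true slope (120) and intercept (50,000) are made up
rng = np.random.default_rng(1)
area = rng.uniform(500, 4000, 100)
price = 50_000 + 120 * area + rng.normal(0, 10_000, 100)

# Closed-form least-squares estimates for one predictor
slope = np.cov(area, price, ddof=1)[0, 1] / np.var(area, ddof=1)
intercept = price.mean() - slope * area.mean()

# np.polyfit with degree 1 solves the same least-squares problem
ref_slope, ref_intercept = np.polyfit(area, price, 1)

print(f"closed form: price ~ {intercept:.0f} + {slope:.2f} * area")
print(f"np.polyfit:  price ~ {ref_intercept:.0f} + {ref_slope:.2f} * area")
```

The slope reads as the expected change in price per additional unit of area, which is the main thing a one-feature model can tell you.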

---

## 2. Linear Regression with Three Features

Next, we will extend the model to use three features: `area`, `bedrooms`, and `bathrooms`.

```python
# Use 'area', 'bedrooms', and 'bathrooms' as features
X_train_three = X_train[['area', 'bedrooms', 'bathrooms']]
X_test_three = X_test[['area', 'bedrooms', 'bathrooms']]

# Initialize and train the linear regression model
lr_three = LinearRegression()
lr_three.fit(X_train_three, y_train)

# Make predictions
y_pred_three = lr_three.predict(X_test_three)

# Evaluate the model on the held-out test set
print("Three features - MSE:", mean_squared_error(y_test, y_pred_three))
print("Three features - R^2:", r2_score(y_test, y_pred_three))
```

---

## 3. Linear Regression with All Features

Finally, we will build a model using all available features in the dataset.

```python
# Use all available features (assumes every column in X_train is numeric or already encoded)
lr_all = LinearRegression()
lr_all.fit(X_train, y_train)

# Make predictions
y_pred_all = lr_all.predict(X_test)

# Evaluate the model on the held-out test set
print("All features - MSE:", mean_squared_error(y_test, y_pred_all))
print("All features - R^2:", r2_score(y_test, y_pred_all))
```
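
To compare models side by side, one option is to loop over named feature subsets and collect the metrics in a single table. The sketch below is self-contained and runs on a synthetic stand-in for the housing data: the `demo` frame, its coefficients, and the noise level are invented for illustration. With the real data you would pass `X_train`, `X_test`, `y_train`, and `y_test` instead and add an entry for all columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (values are made up)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "area": rng.uniform(500, 4000, n),
    "bedrooms": rng.integers(1, 6, n),
    "bathrooms": rng.integers(1, 4, n),
})
demo["price"] = (
    150 * demo["area"]
    + 20_000 * demo["bedrooms"]
    + 30_000 * demo["bathrooms"]
    + rng.normal(0, 25_000, n)
)

X_demo = demo.drop(columns="price")
y_demo = demo["price"]
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Feature subsets to compare, keyed by a readable label
feature_sets = {
    "one feature": ["area"],
    "three features": ["area", "bedrooms", "bathrooms"],
}

rows = []
for name, cols in feature_sets.items():
    model = LinearRegression().fit(Xtr[cols], ytr)
    pred = model.predict(Xte[cols])
    mse = mean_squared_error(yte, pred)
    rows.append({
        "model": name,
        "n_features": len(cols),
        "RMSE": np.sqrt(mse),  # same units as price, easier to read than MSE
        "R2": r2_score(yte, pred),
    })

comparison = pd.DataFrame(rows)
print(comparison)
```

Reporting RMSE alongside R^2 keeps the error in the same units as `price`, which makes differences between models easier to judge.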

---

## 4. Detecting Multicollinearity

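Multicollinearity occurs when two or more predictor variables are highly correlated, so that one can be largely predicted from the others. It makes individual coefficient estimates unstable and hard to interpret, even when overall predictions remain reasonable. We can detect it using the Variance Inflation Factor (VIF), which measures how much the variance of a coefficient estimate is inflated by that feature's correlation with the other features.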

```python
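import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# variance_inflation_factor does not add an intercept itself, so include a
# constant column; without it the VIF values can be misleadingly inflated.
# Assumes every column of X_train is numeric (encode categoricals beforehand).
X_vif = add_constant(X_train)
values = X_vif.values.astype(float)

# Calculate VIF for each column, then drop the row for the constant
vif_data = pd.DataFrame({
    "feature": X_vif.columns,
    "VIF": [variance_inflation_factor(values, i) for i in range(values.shape[1])],
})
vif_data = vif_data[vif_data["feature"] != "const"]

# Display the VIF values
print(vif_data)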
```

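- A VIF above 10 is a common rule of thumb for problematic multicollinearity (some practitioners use a stricter cutoff of 5); it means the feature is largely explained by the other features.
- We may consider removing, combining, or transforming features with high VIF values. This mainly stabilizes the coefficient estimates and makes them easier to interpret; it does not necessarily improve predictive accuracy.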
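
To see where these numbers come from, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 is the R^2 obtained by regressing feature j on all the other features (with an intercept). The sketch below computes this by hand with NumPy on synthetic data. The helper name `vif_by_hand` and the columns `area`, `rooms`, and `age` are hypothetical and chosen only for illustration: `rooms` is built to track `area` closely, while `age` is independent.

```python
import numpy as np

def vif_by_hand(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data (made up): rooms is strongly tied to area, age is independent
rng = np.random.default_rng(0)
n = 500
area = rng.normal(2000, 500, n)
rooms = area / 400 + rng.normal(0, 0.3, n)
age = rng.normal(30, 10, n)

vifs = vif_by_hand(np.column_stack([area, rooms, age]))
for name, v in zip(["area", "rooms", "age"], vifs):
    print(f"{name}: VIF = {v:.2f}")
```

Correlated features such as `area` and `rooms` end up with large VIFs together, while an unrelated feature stays close to 1, the minimum possible value.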

---

## Conclusion

We have built and evaluated linear regression models using different combinations of features, starting with one feature (`area`), then adding more features (`bedrooms` and `bathrooms`), and finally using all available features. We also analyzed multicollinearity using the Variance Inflation Factor (VIF) to identify potential issues with feature correlations.