# Multiple Linear Regression using Scikit-learn

This notebook demonstrates how to implement multiple linear regression with scikit-learn. Unlike simple linear regression, which uses a single feature, multiple linear regression uses several features to predict the target variable.

## What we'll cover:
1. Data preparation with multiple features
2. Train-test splitting
3. Model training and prediction
4. Model evaluation using R² and mean squared error (MSE)
5. Results visualization

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
%matplotlib inline

# Set random seed for reproducibility
np.random.seed(0)

## Generate Sample Data with Multiple Features

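We create synthetic data with a known relationship, so the learned parameters can be compared with the true ones:
- True relationship: y = 2 + 3x₁ + 1.5x₂ - 2x₃ + noise
- Three features (X₁, X₂, X₃) are drawn uniformly at random from [0, 1)
- Standard Gaussian noise (mean 0, standard deviation 1) is added for realism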

In [None]:
# Generate random data points with 3 features
n_samples = 100
X = np.random.rand(n_samples, 3)  # 100 samples, 3 features

# True relationship: y = 2 + 3x₁ + 1.5x₂ - 2x₃ + noise
y = 2 + 3 * X[:, 0] + 1.5 * X[:, 1] - 2 * X[:, 2] + np.random.randn(n_samples)
# Reshape y into a column vector; with a 2D target, intercept_ and coef_ are returned as arrays
y = y.reshape(-1, 1)

# Create feature names for better visualization
feature_names = ['Feature 1', 'Feature 2', 'Feature 3']

# Plot relationships between each feature and target
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle('Relationship between Features and Target Variable')

for i in range(3):
    axes[i].scatter(X[:, i], y, alpha=0.5)
    axes[i].set_xlabel(feature_names[i])
    axes[i].set_ylabel('Target Variable')
    axes[i].grid(True)

plt.tight_layout()
plt.show()

## Split Data into Training and Testing Sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Dataset Split:")
print(f"Training set size: {X_train.shape[0]} samples, {X_train.shape[1]} features")
print(f"Testing set size: {X_test.shape[0]} samples, {X_test.shape[1]} features")
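
The split is random, so fixing `random_state` is what makes it repeatable between runs. A minimal sketch of that behaviour, using a small made-up array (`X_demo`, `y_demo`, `split_a`, and `split_b` are illustrative names, not part of this notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two splits with the same random_state should select the same rows
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same)
print("Train/test sizes:", len(split_a[0]), len(split_a[1]))
```

With `test_size=0.2`, 20% of the samples are held out for testing and the rest are used for training.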

## Train the Multiple Linear Regression Model

In [None]:
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Print the learned parameters next to the true values used to generate the data
true_coefs = [3, 1.5, -2]
print("\nLearned Parameters:")
print("Intercept (bias):", model.intercept_[0].round(4), "(True value: 2)")
for name, coef, true_coef in zip(feature_names, model.coef_[0], true_coefs):
    print(f"{name} coefficient:", coef.round(4), f"(True value: {true_coef})")
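
As a sanity check on what `LinearRegression` computes, the same ordinary least squares fit can be reproduced with NumPy alone. The sketch below is self-contained and regenerates similar synthetic data with its own random generator, so the numbers will not match the cell above exactly; `X_demo`, `y_demo`, `sk_model`, `A`, and `beta` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following the same true relationship as in this notebook
rng = np.random.default_rng(0)
X_demo = rng.random((100, 3))
y_demo = (2 + 3 * X_demo[:, 0] + 1.5 * X_demo[:, 1] - 2 * X_demo[:, 2]
          + rng.standard_normal(100))

# Fit with scikit-learn
sk_model = LinearRegression().fit(X_demo, y_demo)

# Fit by least squares on a design matrix with a leading column of ones,
# so the first entry of beta plays the role of the intercept
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print("scikit-learn intercept and coefficients:", sk_model.intercept_, sk_model.coef_)
print("NumPy least squares solution:          ", beta)
```

The two results should agree up to floating-point error, which suggests that the fitted model is the least squares solution for the intercept and the three coefficients.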

## Model Evaluation

In [None]:
# Calculate performance metrics
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)

print("Model Performance:")
print(f"Training R² Score: {train_r2:.4f}")
print(f"Testing R² Score: {test_r2:.4f}")
print(f"Training MSE: {train_mse:.4f}")
print(f"Testing MSE: {test_mse:.4f}")
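
Both metrics follow directly from the residuals: MSE is the mean of the squared residuals, and R² is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the target. A small sketch with made-up values (`y_true` and `y_pred` are hypothetical, not results from this notebook) compares the manual formulas with the scikit-learn functions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 1.5, 4.0, 2.5])
y_pred = np.array([2.5, 2.0, 3.5, 3.0])

residuals = y_true - y_pred

# MSE: average squared residual
mse_manual = np.mean(residuals ** 2)

# R²: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE (manual, sklearn):", mse_manual, mean_squared_error(y_true, y_pred))
print("R²  (manual, sklearn):", r2_manual, r2_score(y_true, y_pred))
```

A lower MSE and an R² closer to 1 indicate a better fit. Comparing training and testing values gives a rough sense of how well the model generalizes.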

## Visualize Predictions vs Actual Values

In [None]:
plt.figure(figsize=(10, 6))

# Plot training and testing predictions against the actual values
plt.scatter(y_train, y_train_pred, color='blue', alpha=0.5, label='Training Data')
plt.scatter(y_test, y_test_pred, color='red', alpha=0.5, label='Testing Data')

# Plot perfect prediction line
min_val = min(y_train.min(), y_test.min())
max_val = max(y_train.max(), y_test.max())
plt.plot([min_val, max_val], [min_val, max_val], 'k--', label='Perfect Prediction')

plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Multiple Linear Regression: Predicted vs Actual Values')
plt.legend()
plt.grid(True)
plt.show()

## Key Differences from Simple Linear Regression

1. **Number of Features**:
   - Simple Linear Regression: One feature (X)
   - Multiple Linear Regression: Multiple features (X₁, X₂, X₃)

2. **Model Equation**:
   - Simple: y = b₀ + b₁x + ε
   - Multiple: y = b₀ + b₁x₁ + b₂x₂ + b₃x₃ + ε

3. **Visualization**:
   - Simple: Can plot in 2D (one feature vs target)
   - Multiple: Requires multiple plots or dimensionality reduction

4. **Interpretation**:
   - Simple: One coefficient represents the change in the target per one-unit change in the single feature
   - Multiple: Each coefficient represents the change in the target per one-unit change in its feature while the other features are held constant
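
The interpretation of a coefficient with the other features held fixed can be checked numerically: raising one feature by one unit should change the prediction by exactly that feature's coefficient. The sketch below is self-contained and uses its own synthetic data; `X_demo`, `y_demo`, `demo_model`, `base`, and `bumped` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following the same true relationship as in this notebook
rng = np.random.default_rng(0)
X_demo = rng.random((100, 3))
y_demo = (2 + 3 * X_demo[:, 0] + 1.5 * X_demo[:, 1] - 2 * X_demo[:, 2]
          + rng.standard_normal(100))

demo_model = LinearRegression().fit(X_demo, y_demo)

# Two inputs that differ only in the second feature, by exactly one unit
base = np.array([[0.5, 0.5, 0.5]])
bumped = base.copy()
bumped[0, 1] += 1.0

delta = demo_model.predict(bumped)[0] - demo_model.predict(base)[0]

print("Change in prediction:      ", delta)
print("Coefficient of 2nd feature:", demo_model.coef_[1])
```

This holds because the model is linear in its inputs. It describes the fitted model, not necessarily a causal effect in real data, where features are often correlated with each other.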