# Week 4: Multiple Linear Regression and Fine Tuning

## Learning Objectives:
- Be able to conceptually understand how linear regression works as more features are added
- Understand how Lasso, Ridge, and ElasticNet change the Linear Regression model
- Be able to make informed decisions about how to choose and fine-tune your model

## Topics Covered:
- Multiple Linear Regression
- Ridge Regression
- Lasso Regression
- ElasticNet

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_regression
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("Libraries imported successfully!")

## 1. Understanding Multiple Linear Regression

Multiple linear regression extends simple linear regression by using multiple features to predict a target variable. As we add more features, the model becomes more flexible but also more prone to overfitting.
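
To make the flexibility point concrete, here is a small sketch on synthetic data. The data, sizes, and variable names are illustrative assumptions, not this notebook's house dataset. Columns of pure noise are added next to one informative feature: training R² can only go up as columns are added to an ordinary least squares fit, while test R² typically does not benefit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_train, n_test, n_noise = 40, 200, 25

# Column 0 is informative; the remaining columns are pure noise (illustrative data)
X_all = rng.normal(size=(n_train + n_test, 1 + n_noise))
y_all = 3.0 * X_all[:, 0] + rng.normal(scale=1.0, size=n_train + n_test)

X_tr, X_te = X_all[:n_train], X_all[n_train:]
y_tr, y_te = y_all[:n_train], y_all[n_train:]

feature_counts = [1, 5, 10, 25]
train_scores, test_scores = [], []
for k in feature_counts:
    # Fit on the first k columns only, so each model nests the previous one
    model = LinearRegression().fit(X_tr[:, :k], y_tr)
    train_scores.append(model.score(X_tr[:, :k], y_tr))
    test_scores.append(model.score(X_te[:, :k], y_te))
    print(f"{k:2d} features: train R² = {train_scores[-1]:.3f}, test R² = {test_scores[-1]:.3f}")
```

The growing gap between training and test R² is the overfitting that the regularized models later in this notebook are meant to control.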

### Mathematical Foundation:
```
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
```

Where:
- y = target variable
- β₀ = intercept
- β₁, β₂, ..., βₙ = coefficients for each feature
- x₁, x₂, ..., xₙ = features
- ε = error term
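
As a minimal sketch of how the coefficients in this equation are estimated, the example below solves the least squares problem directly with NumPy and compares the result to `LinearRegression`. The toy data and the names `X_demo`, `y_demo`, and `beta_hat` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
# Assumed true relationship: y = 4 + 2*x1 - 3*x2 + noise
y_demo = 4.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so the first coefficient plays the role of β₀
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])
beta_hat, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

sk_model = LinearRegression().fit(X_demo, y_demo)
print("Least squares [β₀, β₁, β₂]:", np.round(beta_hat, 3))
print("scikit-learn intercept, coefficients:", round(sk_model.intercept_, 3), np.round(sk_model.coef_, 3))

# A prediction is the intercept plus the weighted sum of the features
x_new = np.array([1.0, -1.0])
manual_pred = beta_hat[0] + beta_hat[1:] @ x_new
print("Manual prediction:", round(manual_pred, 3))
```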

In [None]:
# Create a comprehensive dataset with multiple features
np.random.seed(42)
n_samples = 200

# Generate synthetic house price data with many features
house_size = np.random.normal(2000, 600, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
bathrooms = np.random.randint(1, 4, n_samples)
age = np.random.randint(0, 50, n_samples)
lot_size = np.random.normal(8000, 2000, n_samples)
garage = np.random.randint(0, 4, n_samples)
neighborhood_score = np.clip(np.random.normal(7, 2, n_samples), 1, 10)  # clipped to the 1-10 scale
distance_to_city = np.clip(np.random.normal(15, 8, n_samples), 0.5, None)  # miles; clipped so distances stay positive

# Create realistic relationships for house prices
price = (house_size * 100 + 
         bedrooms * 10000 + 
         bathrooms * 15000 + 
         -age * 800 +
         lot_size * 5 +
         garage * 8000 +
         neighborhood_score * 12000 +
         -distance_to_city * 2000 +
         np.random.normal(0, 25000, n_samples))

# Create DataFrame
house_data = pd.DataFrame({
    'House_Size': house_size,
    'Bedrooms': bedrooms,
    'Bathrooms': bathrooms,
    'Age': age,
    'Lot_Size': lot_size,
    'Garage': garage,
    'Neighborhood_Score': neighborhood_score,
    'Distance_to_City': distance_to_city,
    'Price': price
})

print("House Price Dataset:")
print(house_data.head())
print(f"\nDataset shape: {house_data.shape}")
print(f"\nBasic statistics:")
print(house_data.describe())

In [None]:
# Build baseline multiple linear regression model
print("=== BASELINE MULTIPLE LINEAR REGRESSION ===")

# Prepare features and target
feature_columns = ['House_Size', 'Bedrooms', 'Bathrooms', 'Age', 'Lot_Size', 
                  'Garage', 'Neighborhood_Score', 'Distance_to_City']
X = house_data[feature_columns]
y = house_data['Price']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_train = lr_model.predict(X_train_scaled)
y_pred_test = lr_model.predict(X_test_scaled)

# Evaluate performance
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)

print(f"Training RMSE: ${train_rmse:,.2f}")
print(f"Test RMSE: ${test_rmse:,.2f}")
print(f"Training R²: {train_r2:.4f}")
print(f"Test R²: {test_r2:.4f}")

# Display coefficients
print(f"\nModel coefficients:")
print(f"Intercept: ${lr_model.intercept_:,.2f}")
for feature, coef in zip(feature_columns, lr_model.coef_):
    print(f"{feature:<20}: {coef:10.2f}")

## 2. Ridge Regression (L2 Regularization)

Ridge regression adds a penalty term to the linear regression cost function to prevent overfitting. It shrinks the coefficients towards zero but doesn't eliminate them entirely.

### Mathematical Foundation:
```
Cost = MSE + α * Σ(βᵢ²)
```

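Where α (alpha) is the regularization parameter that controls the strength of the penalty: α = 0 recovers ordinary least squares, and larger values shrink the coefficients more strongly. The intercept β₀ is not penalized.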
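
The shrinkage effect can be sketched with the closed-form ridge solution. This is an illustrative example on assumed toy data, and `ridge_closed_form` is a hypothetical helper, not part of scikit-learn. It relies on scikit-learn's `Ridge` penalizing the sum (not the mean) of squared errors, so the same α can be plugged in directly.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(60, 4))
y_demo = X_demo @ np.array([5.0, -3.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=60)

def ridge_closed_form(X, y, alpha):
    """Ridge coefficients on centered data: solve (XᵀX + αI) β = Xᵀy."""
    Xc = X - X.mean(axis=0)   # centering handles the unpenalized intercept
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)

# The overall size of the coefficient vector shrinks as alpha grows
norms = []
for alpha in [0.0, 1.0, 10.0, 100.0]:
    coef = ridge_closed_form(X_demo, y_demo, alpha)
    norms.append(np.linalg.norm(coef))
    print(f"alpha = {alpha:6.1f}: coefficients = {np.round(coef, 3)}, L2 norm = {norms[-1]:.3f}")

# Compare against scikit-learn for one alpha
manual_coef = ridge_closed_form(X_demo, y_demo, 10.0)
sk_coef = Ridge(alpha=10.0).fit(X_demo, y_demo).coef_
print("Matches sklearn Ridge:", np.allclose(manual_coef, sk_coef))
```

Note that the coefficients get smaller but none of them becomes exactly zero, which is the key contrast with Lasso.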

In [None]:
# Ridge Regression Implementation
print("=== RIDGE REGRESSION (L2 REGULARIZATION) ===")

# Try different alpha values
alphas = [0.01, 0.1, 1, 10, 100, 1000]
ridge_results = []

for alpha in alphas:
    # Create Ridge model
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X_train_scaled, y_train)
    
    # Predictions
    y_pred_train_ridge = ridge_model.predict(X_train_scaled)
    y_pred_test_ridge = ridge_model.predict(X_test_scaled)
    
    # Evaluate
    train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train_ridge))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test_ridge))
    train_r2 = r2_score(y_train, y_pred_train_ridge)
    test_r2 = r2_score(y_test, y_pred_test_ridge)
    
    ridge_results.append({
        'alpha': alpha,
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'train_r2': train_r2,
        'test_r2': test_r2,
        'coefficients': ridge_model.coef_.copy()
    })
    
    print(f"Alpha {alpha:6.2f}: Train RMSE: ${train_rmse:8,.0f}, Test RMSE: ${test_rmse:8,.0f}, "
          f"Train R²: {train_r2:.4f}, Test R²: {test_r2:.4f}")

# Convert to DataFrame
ridge_df = pd.DataFrame(ridge_results)

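# Find best alpha with 5-fold cross-validation on the training set
# (choosing alpha by test R² would leak test information into model selection)
ridge_cv_r2 = [cross_val_score(Ridge(alpha=a), X_train_scaled, y_train, cv=5, scoring='r2').mean()
               for a in alphas]
best_alpha = alphas[int(np.argmax(ridge_cv_r2))]
print(f"\nBest alpha based on 5-fold CV R²: {best_alpha}")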

## 3. Lasso Regression (L1 Regularization)

Lasso regression also adds a penalty term, but it uses the absolute value of coefficients. This can drive some coefficients to exactly zero, effectively performing feature selection.

### Mathematical Foundation:
```
Cost = MSE + α * Σ|βᵢ|
```

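Where α (alpha) is the regularization parameter. Note that scikit-learn scales the error term differently in `Ridge` (sum of squared errors) and `Lasso` (sum of squared errors divided by 2n), so the same α value does not imply the same penalty strength in the two models.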
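
Why the absolute-value penalty produces exact zeros can be seen in a special case. The sketch below assumes centered, mutually orthogonal features scaled so that XᵀX = n·I. In that setting the Lasso solution, under scikit-learn's objective (1/(2n))·Σ(error²) + α·Σ|βᵢ|, is the least squares coefficient passed through a soft-thresholding step. `soft_threshold` and the toy data are hypothetical names and values introduced for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(3)
n = 100

# Build centered, mutually orthogonal columns scaled so that XᵀX = n·I
raw = rng.normal(size=(n, 3))
q, _ = np.linalg.qr(raw - raw.mean(axis=0))
X_demo = np.sqrt(n) * q
y_demo = X_demo @ np.array([2.0, 0.3, -1.0]) + rng.normal(scale=0.5, size=n)

alpha = 0.5
ols_coef = X_demo.T @ y_demo / n               # least squares solution for this special design
expected = soft_threshold(ols_coef, alpha)     # what Lasso should return here
lasso_coef = Lasso(alpha=alpha).fit(X_demo, y_demo).coef_

print("Least squares coefficients:", np.round(ols_coef, 3))
print("Soft-thresholded:          ", np.round(expected, 3))
print("sklearn Lasso:             ", np.round(lasso_coef, 3))
```

Any coefficient whose least squares value is smaller in magnitude than α is set to exactly zero, which is how Lasso ends up dropping weak features. With correlated features there is no such one-step formula, but the same thresholding idea drives the iterative solver.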

In [None]:
# Lasso Regression Implementation
print("=== LASSO REGRESSION (L1 REGULARIZATION) ===")

# Try different alpha values
lasso_results = []

for alpha in alphas:
    # Create Lasso model
    lasso_model = Lasso(alpha=alpha, max_iter=10000)
    lasso_model.fit(X_train_scaled, y_train)
    
    # Predictions
    y_pred_train_lasso = lasso_model.predict(X_train_scaled)
    y_pred_test_lasso = lasso_model.predict(X_test_scaled)
    
    # Evaluate
    train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train_lasso))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test_lasso))
    train_r2 = r2_score(y_train, y_pred_train_lasso)
    test_r2 = r2_score(y_test, y_pred_test_lasso)
    
    # Count non-zero coefficients
    n_nonzero = np.sum(lasso_model.coef_ != 0)
    
    lasso_results.append({
        'alpha': alpha,
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'train_r2': train_r2,
        'test_r2': test_r2,
        'n_nonzero': n_nonzero,
        'coefficients': lasso_model.coef_.copy()
    })
    
    print(f"Alpha {alpha:6.2f}: Train RMSE: ${train_rmse:8,.0f}, Test RMSE: ${test_rmse:8,.0f}, "
          f"Train R²: {train_r2:.4f}, Test R²: {test_r2:.4f}, Non-zero coefs: {n_nonzero}")

# Convert to DataFrame
lasso_df = pd.DataFrame(lasso_results)

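# Find best alpha with 5-fold cross-validation on the training set
# (choosing alpha by test R² would leak test information into model selection)
lasso_cv_r2 = [cross_val_score(Lasso(alpha=a, max_iter=10000), X_train_scaled, y_train,
                               cv=5, scoring='r2').mean()
               for a in alphas]
best_lasso_alpha = alphas[int(np.argmax(lasso_cv_r2))]
print(f"\nBest alpha based on 5-fold CV R²: {best_lasso_alpha}")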

## 4. ElasticNet Regression

ElasticNet combines both L1 (Lasso) and L2 (Ridge) regularization. It provides a balance between feature selection and coefficient shrinkage.

### Mathematical Foundation:
```
Cost = MSE + α * [l1_ratio * Σ|βᵢ| + (1 - l1_ratio) * Σ(βᵢ²)]
```

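Where:
- α = regularization strength
- l1_ratio = mixing parameter (0 = pure L2 penalty, as in Ridge; 1 = pure L1 penalty, as in Lasso)

Note: this is the conceptual form. scikit-learn's `ElasticNet` additionally multiplies the L2 term by 0.5 and scales the squared-error term by 1/(2n).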
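
To see how `l1_ratio` blends the two penalties, the sketch below evaluates the penalty term as scikit-learn's `ElasticNet` parameterizes it, including a factor of 0.5 on the L2 part that the conceptual formula omits. `elastic_net_penalty` is a hypothetical helper and the data are synthetic. The last lines check that `l1_ratio=1.0` reproduces `Lasso`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

def elastic_net_penalty(coef, alpha, l1_ratio):
    """Penalty term as parameterized by scikit-learn's ElasticNet."""
    l1 = np.sum(np.abs(coef))
    l2 = np.sum(coef ** 2)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2

coef_demo = np.array([2.0, -1.0, 0.0])   # Σ|β| = 3, Σβ² = 5
pure_l1 = elastic_net_penalty(coef_demo, alpha=1.0, l1_ratio=1.0)
pure_l2 = elastic_net_penalty(coef_demo, alpha=1.0, l1_ratio=0.0)
mixed = elastic_net_penalty(coef_demo, alpha=1.0, l1_ratio=0.5)
print(f"l1_ratio=1.0: {pure_l1}, l1_ratio=0.0: {pure_l2}, l1_ratio=0.5: {mixed}")

# With l1_ratio=1.0, ElasticNet and Lasso solve the same problem
X_demo, y_demo = make_regression(n_samples=80, n_features=5, noise=10.0, random_state=0)
enet_coef = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X_demo, y_demo).coef_
lasso_coef = Lasso(alpha=0.5).fit(X_demo, y_demo).coef_
print("ElasticNet(l1_ratio=1) matches Lasso:", np.allclose(enet_coef, lasso_coef))
```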

In [None]:
# ElasticNet Regression Implementation
print("=== ELASTICNET REGRESSION ===")

# Try different l1_ratio values
l1_ratios = [0.1, 0.3, 0.5, 0.7, 0.9]
elastic_results = []

for l1_ratio in l1_ratios:
    # Use alpha=1.0 for comparison
    elastic_model = ElasticNet(alpha=1.0, l1_ratio=l1_ratio, max_iter=10000)
    elastic_model.fit(X_train_scaled, y_train)
    
    # Predictions
    y_pred_train_elastic = elastic_model.predict(X_train_scaled)
    y_pred_test_elastic = elastic_model.predict(X_test_scaled)
    
    # Evaluate
    train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train_elastic))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test_elastic))
    train_r2 = r2_score(y_train, y_pred_train_elastic)
    test_r2 = r2_score(y_test, y_pred_test_elastic)
    
    # Count non-zero coefficients
    n_nonzero = np.sum(elastic_model.coef_ != 0)
    
    elastic_results.append({
        'l1_ratio': l1_ratio,
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'train_r2': train_r2,
        'test_r2': test_r2,
        'n_nonzero': n_nonzero,
        'coefficients': elastic_model.coef_.copy()
    })
    
    print(f"L1_ratio {l1_ratio:.1f}: Train RMSE: ${train_rmse:8,.0f}, Test RMSE: ${test_rmse:8,.0f}, "
          f"Train R²: {train_r2:.4f}, Test R²: {test_r2:.4f}, Non-zero coefs: {n_nonzero}")

# Convert to DataFrame
elastic_df = pd.DataFrame(elastic_results)

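# Find best l1_ratio with 5-fold cross-validation on the training set
# (choosing it by test R² would leak test information into model selection)
elastic_cv_r2 = [cross_val_score(ElasticNet(alpha=1.0, l1_ratio=r, max_iter=10000),
                                 X_train_scaled, y_train, cv=5, scoring='r2').mean()
                 for r in l1_ratios]
best_l1_ratio = l1_ratios[int(np.argmax(elastic_cv_r2))]
print(f"\nBest l1_ratio based on 5-fold CV R²: {best_l1_ratio}")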

## 5. Model Comparison and Selection

Let's compare all the models and use cross-validation to select the best one.

In [None]:
# Comprehensive model comparison
print("=== COMPREHENSIVE MODEL COMPARISON ===")

# Define models with their best parameters
models = {
    'Linear Regression': LinearRegression(),
    'Ridge': Ridge(alpha=best_alpha),
    'Lasso': Lasso(alpha=best_lasso_alpha, max_iter=10000),
    'ElasticNet': ElasticNet(alpha=1.0, l1_ratio=best_l1_ratio, max_iter=10000)
}

# Cross-validation comparison
cv_results = []

for name, model in models.items():
    # Perform 5-fold cross-validation
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='r2')
    cv_rmse_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='neg_mean_squared_error')
    cv_rmse_scores = np.sqrt(-cv_rmse_scores)
    
    # Train on full training set and test
    model.fit(X_train_scaled, y_train)
    y_pred_test = model.predict(X_test_scaled)
    test_r2 = r2_score(y_test, y_pred_test)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    
    cv_results.append({
        'Model': name,
        'CV_R2_Mean': cv_scores.mean(),
        'CV_R2_Std': cv_scores.std(),
        'CV_RMSE_Mean': cv_rmse_scores.mean(),
        'CV_RMSE_Std': cv_rmse_scores.std(),
        'Test_R2': test_r2,
        'Test_RMSE': test_rmse
    })

# Convert to DataFrame and display
cv_results_df = pd.DataFrame(cv_results)
print("\nModel Comparison Results:")
print(cv_results_df.round(4))

# Find best model by cross-validated R²; the test set is reserved for the final report
best_model_idx = cv_results_df['CV_R2_Mean'].idxmax()
best_model_name = cv_results_df.loc[best_model_idx, 'Model']
print(f"\nBest model based on CV R²: {best_model_name}")

## 6. Hyperparameter Tuning with GridSearchCV

Let's use GridSearchCV to find the optimal hyperparameters for each regularized model.
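
Conceptually, a grid search is a loop over candidate settings, each scored by cross-validation. The sketch below spells that loop out for `Ridge` on synthetic data and checks that `GridSearchCV` selects the same value. The dataset and the names `alpha_grid` and `manual_scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

X_demo, y_demo = make_regression(n_samples=120, n_features=8, noise=20.0, random_state=0)
alpha_grid = [0.01, 0.1, 1, 10, 100]

# Manual version: mean 5-fold CV R² for each candidate alpha
manual_scores = {
    a: cross_val_score(Ridge(alpha=a), X_demo, y_demo, cv=5, scoring='r2').mean()
    for a in alpha_grid
}
manual_best = max(manual_scores, key=manual_scores.get)

# GridSearchCV runs the same search and then refits the winner on all the data
grid = GridSearchCV(Ridge(), {'alpha': alpha_grid}, cv=5, scoring='r2')
grid.fit(X_demo, y_demo)

for a, score in manual_scores.items():
    print(f"alpha = {a:6.2f}: mean CV R² = {score:.4f}")
print("Manual best alpha:      ", manual_best)
print("GridSearchCV best alpha:", grid.best_params_['alpha'])
```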

In [None]:
# Grid search for best hyperparameters
print("=== GRID SEARCH FOR OPTIMAL HYPERPARAMETERS ===")

# Define parameter grids
param_grids = {
    'Ridge': {
        'alpha': [0.01, 0.1, 1, 10, 100, 1000]
    },
    'Lasso': {
        'alpha': [0.01, 0.1, 1, 10, 100, 1000]
    },
    'ElasticNet': {
        'alpha': [0.01, 0.1, 1, 10, 100],
        'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
    }
}

# Perform grid search
grid_search_results = {}

for model_name, param_grid in param_grids.items():
    print(f"\nGrid searching {model_name}...")
    
    if model_name == 'Ridge':
        model = Ridge()
    elif model_name == 'Lasso':
        model = Lasso(max_iter=10000)
    elif model_name == 'ElasticNet':
        model = ElasticNet(max_iter=10000)
    
    # Perform grid search
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='r2', n_jobs=-1)
    grid_search.fit(X_train_scaled, y_train)
    
    # Test the best model
    best_model = grid_search.best_estimator_
    y_pred_test = best_model.predict(X_test_scaled)
    test_r2 = r2_score(y_test, y_pred_test)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    
    grid_search_results[model_name] = {
        'best_params': grid_search.best_params_,
        'best_cv_score': grid_search.best_score_,
        'test_r2': test_r2,
        'test_rmse': test_rmse,
        'best_model': best_model
    }
    
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best CV R²: {grid_search.best_score_:.4f}")
    print(f"Test R²: {test_r2:.4f}")
    print(f"Test RMSE: ${test_rmse:,.2f}")

# Find overall best model by cross-validated R², then report its held-out test performance
best_overall = max(grid_search_results.items(), key=lambda x: x[1]['best_cv_score'])
print(f"\nOverall best model (by CV R²): {best_overall[0]}")
print(f"Parameters: {best_overall[1]['best_params']}")
print(f"Test R²: {best_overall[1]['test_r2']:.4f}")

## 7. Summary

Congratulations! You've completed your deep dive into multiple linear regression and regularization techniques. Here's what you learned:

### Key Concepts Mastered:
1. **Multiple Linear Regression**: Using multiple features to predict outcomes
2. **Ridge Regression (L2)**: Shrinks coefficients to prevent overfitting
3. **Lasso Regression (L1)**: Performs feature selection by driving some coefficients to exactly zero
4. **ElasticNet**: Combines Ridge and Lasso for balanced regularization
5. **Hyperparameter Tuning**: Using GridSearchCV to find optimal parameters

### Key Skills Acquired:
- Understanding when and why to use regularization
- Implementing Ridge, Lasso, and ElasticNet regression
- Comparing model performance across different regularization techniques
- Using cross-validation for model selection
- Making informed decisions about model complexity

### When to Use Each Technique:
- **Linear Regression**: When you have few features and no overfitting
- **Ridge Regression**: When you want to keep all features but prevent overfitting
- **Lasso Regression**: When you want automatic feature selection
- **ElasticNet**: When you want both regularization and some feature selection

### Best Practices to Remember:
- Always scale your features before applying regularization
- Use cross-validation to select hyperparameters
- Start with simple models before adding complexity
- Consider the interpretability vs accuracy trade-off
- Regularization helps prevent overfitting, especially with limited data
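
One way to combine the first two practices is to put the scaler and the model into a single `Pipeline` and tune that. The scaler is then re-fit inside every cross-validation fold, so the validation fold never influences the scaling statistics. This is a sketch on synthetic data, with assumed step names `'scaler'` and `'ridge'`, rather than a prescription for this notebook's dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scaling and regression are fitted together, fold by fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
alpha_grid = [0.01, 0.1, 1, 10, 100]
search = GridSearchCV(pipe, {'ridge__alpha': alpha_grid}, cv=5, scoring='r2')
search.fit(X_tr, y_tr)

# The test set is touched once, after the hyperparameter has been chosen
test_r2 = search.score(X_te, y_te)
print("Best alpha:", search.best_params_['ridge__alpha'])
print(f"CV R²: {search.best_score_:.4f}, Test R²: {test_r2:.4f}")
```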

### Next Steps:
Next week, we'll explore classification algorithms including logistic regression, decision trees, and random forests. You'll learn how to apply these concepts to classification problems and understand different evaluation metrics for classification models.