# Lab 4: Regression Analysis on Diabetes Dataset

**Name:** [Your Name]  
**Course:** MSCS 634  
**Lab Assignment:** Regression Analysis with Linear, Multiple, Polynomial, Ridge, and Lasso Regression


## Step 1: Data Preparation


In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations; fall back if a style name is unavailable
try:
    plt.style.use('seaborn-v0_8')
except OSError:
    try:
        plt.style.use('seaborn')
    except OSError:
        plt.style.use('default')
sns.set_palette("husl")


In [None]:
# Load the Diabetes dataset
diabetes = load_diabetes()
print("Dataset loaded successfully!")
print(f"Number of samples: {diabetes.data.shape[0]}")
print(f"Number of features: {diabetes.data.shape[1]}")
print(f"Feature names: {diabetes.feature_names}")


In [None]:
# Create a DataFrame for easier exploration
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target

# Display basic information about the dataset
print("Dataset Info:")
print(df.info())
print("\nFirst few rows:")
print(df.head())
print("\nDataset Statistics:")
print(df.describe())


In [None]:
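# Check for missing values
missing_per_column = df.isnull().sum()
print("Missing values per column:")
print(missing_per_column)
total_missing = missing_per_column.sum()
print(f"\nTotal missing values: {total_missing}")
# Only report a clean dataset when the count confirms it
print("\nNo missing values found." if total_missing == 0 else "\nMissing values present; handle them before modeling.")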


In [None]:
# Visualize data distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Target distribution
axes[0, 0].hist(df['target'], bins=30, edgecolor='black')
axes[0, 0].set_title('Distribution of Target Variable (Disease Progression)')
axes[0, 0].set_xlabel('Target Value')
axes[0, 0].set_ylabel('Frequency')

# Feature distributions
axes[0, 1].boxplot([df[col] for col in diabetes.feature_names[:5]])
axes[0, 1].set_xticklabels(diabetes.feature_names[:5])  # avoids the renamed 'labels' keyword in newer Matplotlib
axes[0, 1].set_title('Boxplot of First 5 Features')
axes[0, 1].tick_params(axis='x', rotation=45)

# Correlation heatmap
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', center=0, ax=axes[1, 0])
axes[1, 0].set_title('Correlation Heatmap')

# Target vs feature correlation (drop the target's own correlation of 1.0)
target_corr = correlation_matrix['target'].drop('target').sort_values(ascending=False)
axes[1, 1].barh(range(len(target_corr)), target_corr.values)
axes[1, 1].set_yticks(range(len(target_corr)))
axes[1, 1].set_yticklabels(target_corr.index)
axes[1, 1].set_title('Feature Correlation with Target')
axes[1, 1].set_xlabel('Correlation Coefficient')

plt.tight_layout()
plt.show()
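
The `df.describe()` output above shows feature values far smaller than raw clinical measurements. According to the scikit-learn documentation, `load_diabetes()` returns features that are mean-centered and scaled so that each column's sum of squares is 1. A quick check, sketched here on a fresh copy of the data (the names `column_means` and `column_sum_sq` are introduced for illustration only), makes that scaling visible:

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Load the scaled version of the dataset (the default for load_diabetes()).
X, y = load_diabetes(return_X_y=True)

# Per-column mean and sum of squares across all samples.
column_means = X.mean(axis=0)
column_sum_sq = (X ** 2).sum(axis=0)

print("Shape:", X.shape)
print("Largest |column mean|:", np.abs(column_means).max())
print("Column sums of squares:", np.round(column_sum_sq, 4))
print("Largest |feature value|:", np.abs(X).max())
```

This scaling matters in Step 5: Ridge and Lasso penalize coefficient size, and features with such small magnitudes need large coefficients, so a given `alpha` tends to regularize more strongly than it would on unit-variance features.
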


## Step 2: Linear Regression


In [None]:
# Prepare data for Simple Linear Regression
# Using 'bmi' (body mass index), one of the features most correlated with the target in Step 1
bmi_index = diabetes.feature_names.index('bmi')  # bmi is the 3rd feature (index 2)
X_simple = diabetes.data[:, bmi_index].reshape(-1, 1)
y = diabetes.target

# Split the data into training and testing sets
X_train_simple, X_test_simple, y_train, y_test = train_test_split(
    X_simple, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train_simple.shape[0]}")
print(f"Test set size: {X_test_simple.shape[0]}")


In [None]:
# Train Simple Linear Regression model
simple_lr = LinearRegression()
simple_lr.fit(X_train_simple, y_train)

# Make predictions
y_train_pred_simple = simple_lr.predict(X_train_simple)
y_test_pred_simple = simple_lr.predict(X_test_simple)

# Calculate evaluation metrics
train_mae_simple = mean_absolute_error(y_train, y_train_pred_simple)
train_mse_simple = mean_squared_error(y_train, y_train_pred_simple)
train_rmse_simple = np.sqrt(train_mse_simple)
train_r2_simple = r2_score(y_train, y_train_pred_simple)

test_mae_simple = mean_absolute_error(y_test, y_test_pred_simple)
test_mse_simple = mean_squared_error(y_test, y_test_pred_simple)
test_rmse_simple = np.sqrt(test_mse_simple)
test_r2_simple = r2_score(y_test, y_test_pred_simple)

print("Simple Linear Regression Results:")
print("=" * 50)
print("Training Set Metrics:")
print(f"  MAE:  {train_mae_simple:.4f}")
print(f"  MSE:  {train_mse_simple:.4f}")
print(f"  RMSE: {train_rmse_simple:.4f}")
print(f"  R²:   {train_r2_simple:.4f}")
print("\nTest Set Metrics:")
print(f"  MAE:  {test_mae_simple:.4f}")
print(f"  MSE:  {test_mse_simple:.4f}")
print(f"  RMSE: {test_rmse_simple:.4f}")
print(f"  R²:   {test_r2_simple:.4f}")
print(f"\nModel Equation: y = {simple_lr.coef_[0]:.4f} * x + {simple_lr.intercept_:.4f}")
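
The four metrics reported above have simple definitions. As a rough NumPy-only sketch (the helper `regression_metrics` is hypothetical and not part of scikit-learn), they can be computed directly from the residuals, which also shows how RMSE and R² relate to the squared errors:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for one set of predictions (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares (assumes y_true is not constant)

    return {
        "MAE": np.mean(np.abs(residuals)),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": 1.0 - ss_res / ss_tot,
    }

# Toy example: one of three predictions is off by 1.
toy = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(toy)
```

For the toy values, MAE and MSE are both 1/3 and R² is 0.5. A helper along these lines could also replace the repeated metric blocks in Steps 2 to 5.
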


In [None]:
# Visualize Simple Linear Regression
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Training set visualization
axes[0].scatter(X_train_simple, y_train, alpha=0.5, label='Actual', color='blue')
axes[0].plot(X_train_simple, y_train_pred_simple, color='red', linewidth=2, label='Predicted')
axes[0].set_xlabel('BMI (normalized)')
axes[0].set_ylabel('Disease Progression')
axes[0].set_title(f'Simple Linear Regression - Training Set\nR² = {train_r2_simple:.4f}')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Test set visualization
axes[1].scatter(X_test_simple, y_test, alpha=0.5, label='Actual', color='blue')
axes[1].plot(X_test_simple, y_test_pred_simple, color='red', linewidth=2, label='Predicted')
axes[1].set_xlabel('BMI (normalized)')
axes[1].set_ylabel('Disease Progression')
axes[1].set_title(f'Simple Linear Regression - Test Set\nR² = {test_r2_simple:.4f}')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


## Step 3: Multiple Regression


In [None]:
# Prepare data for Multiple Regression using all features
X_multiple = diabetes.data
y = diabetes.target

# Split the data into training and testing sets
X_train_multiple, X_test_multiple, y_train_multiple, y_test_multiple = train_test_split(
    X_multiple, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train_multiple.shape[0]}")
print(f"Test set size: {X_test_multiple.shape[0]}")
print(f"Number of features: {X_train_multiple.shape[1]}")


In [None]:
# Train Multiple Regression model
multiple_lr = LinearRegression()
multiple_lr.fit(X_train_multiple, y_train_multiple)

# Make predictions
y_train_pred_multiple = multiple_lr.predict(X_train_multiple)
y_test_pred_multiple = multiple_lr.predict(X_test_multiple)

# Calculate evaluation metrics
train_mae_multiple = mean_absolute_error(y_train_multiple, y_train_pred_multiple)
train_mse_multiple = mean_squared_error(y_train_multiple, y_train_pred_multiple)
train_rmse_multiple = np.sqrt(train_mse_multiple)
train_r2_multiple = r2_score(y_train_multiple, y_train_pred_multiple)

test_mae_multiple = mean_absolute_error(y_test_multiple, y_test_pred_multiple)
test_mse_multiple = mean_squared_error(y_test_multiple, y_test_pred_multiple)
test_rmse_multiple = np.sqrt(test_mse_multiple)
test_r2_multiple = r2_score(y_test_multiple, y_test_pred_multiple)

print("Multiple Regression Results:")
print("=" * 50)
print("Training Set Metrics:")
print(f"  MAE:  {train_mae_multiple:.4f}")
print(f"  MSE:  {train_mse_multiple:.4f}")
print(f"  RMSE: {train_rmse_multiple:.4f}")
print(f"  R²:   {train_r2_multiple:.4f}")
print("\nTest Set Metrics:")
print(f"  MAE:  {test_mae_multiple:.4f}")
print(f"  MSE:  {test_mse_multiple:.4f}")
print(f"  RMSE: {test_rmse_multiple:.4f}")
print(f"  R²:   {test_r2_multiple:.4f}")

# Display feature coefficients
print("\nFeature Coefficients:")
for i, feature_name in enumerate(diabetes.feature_names):
    print(f"  {feature_name}: {multiple_lr.coef_[i]:.4f}")
print(f"\nIntercept: {multiple_lr.intercept_:.4f}")


In [None]:
# Visualize Multiple Regression predictions
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Training set: Predicted vs Actual
axes[0].scatter(y_train_multiple, y_train_pred_multiple, alpha=0.5, color='blue')
min_val = min(y_train_multiple.min(), y_train_pred_multiple.min())
max_val = max(y_train_multiple.max(), y_train_pred_multiple.max())
axes[0].plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Values')
axes[0].set_ylabel('Predicted Values')
axes[0].set_title(f'Multiple Regression - Training Set\nR² = {train_r2_multiple:.4f}')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Test set: Predicted vs Actual
axes[1].scatter(y_test_multiple, y_test_pred_multiple, alpha=0.5, color='blue')
min_val = min(y_test_multiple.min(), y_test_pred_multiple.min())
max_val = max(y_test_multiple.max(), y_test_pred_multiple.max())
axes[1].plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect Prediction')
axes[1].set_xlabel('Actual Values')
axes[1].set_ylabel('Predicted Values')
axes[1].set_title(f'Multiple Regression - Test Set\nR² = {test_r2_multiple:.4f}')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


## Step 4: Polynomial Regression


In [None]:
# Prepare data for Polynomial Regression
# Using 'bmi' feature for polynomial regression
X_poly = diabetes.data[:, 2].reshape(-1, 1)  # bmi
y = diabetes.target

# Split the data
X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(
    X_poly, y, test_size=0.2, random_state=42
)

# Test different polynomial degrees
degrees = [1, 2, 3, 4, 5]
poly_results = {}

for degree in degrees:
    # Create polynomial features
    poly_features = PolynomialFeatures(degree=degree)
    X_train_poly_transformed = poly_features.fit_transform(X_train_poly)
    X_test_poly_transformed = poly_features.transform(X_test_poly)
    
    # Train model
    poly_model = LinearRegression()
    poly_model.fit(X_train_poly_transformed, y_train_poly)
    
    # Make predictions
    y_train_pred_poly = poly_model.predict(X_train_poly_transformed)
    y_test_pred_poly = poly_model.predict(X_test_poly_transformed)
    
    # Calculate metrics
    train_r2 = r2_score(y_train_poly, y_train_pred_poly)
    test_r2 = r2_score(y_test_poly, y_test_pred_poly)
    train_rmse = np.sqrt(mean_squared_error(y_train_poly, y_train_pred_poly))
    test_rmse = np.sqrt(mean_squared_error(y_test_poly, y_test_pred_poly))
    
    poly_results[degree] = {
        'train_r2': train_r2,
        'test_r2': test_r2,
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'model': poly_model,
        'poly_features': poly_features
    }
    
    print(f"Degree {degree}:")
    print(f"  Train R²: {train_r2:.4f}, Test R²: {test_r2:.4f}")
    print(f"  Train RMSE: {train_rmse:.4f}, Test RMSE: {test_rmse:.4f}")
    print()


In [None]:
# Use degree 2 for the detailed analysis (fixed here; compare it against the sweep above)
degree_selected = 2
poly_features_selected = poly_results[degree_selected]['poly_features']
poly_model_selected = poly_results[degree_selected]['model']

# The transformer was already fitted in the loop above, so only transform here
X_train_poly_transformed = poly_features_selected.transform(X_train_poly)
X_test_poly_transformed = poly_features_selected.transform(X_test_poly)

y_train_pred_poly = poly_model_selected.predict(X_train_poly_transformed)
y_test_pred_poly = poly_model_selected.predict(X_test_poly_transformed)

# Calculate detailed metrics
train_mae_poly = mean_absolute_error(y_train_poly, y_train_pred_poly)
train_mse_poly = mean_squared_error(y_train_poly, y_train_pred_poly)
train_rmse_poly = np.sqrt(train_mse_poly)
train_r2_poly = r2_score(y_train_poly, y_train_pred_poly)

test_mae_poly = mean_absolute_error(y_test_poly, y_test_pred_poly)
test_mse_poly = mean_squared_error(y_test_poly, y_test_pred_poly)
test_rmse_poly = np.sqrt(test_mse_poly)
test_r2_poly = r2_score(y_test_poly, y_test_pred_poly)

print(f"Polynomial Regression (Degree {degree_selected}) Results:")
print("=" * 50)
print("Training Set Metrics:")
print(f"  MAE:  {train_mae_poly:.4f}")
print(f"  MSE:  {train_mse_poly:.4f}")
print(f"  RMSE: {train_rmse_poly:.4f}")
print(f"  R²:   {train_r2_poly:.4f}")
print("\nTest Set Metrics:")
print(f"  MAE:  {test_mae_poly:.4f}")
print(f"  MSE:  {test_mse_poly:.4f}")
print(f"  RMSE: {test_rmse_poly:.4f}")
print(f"  R²:   {test_r2_poly:.4f}")


In [None]:
# Visualize Polynomial Regression with different degrees
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Plot for each degree tested in the sweep above
for idx, degree in enumerate(degrees):
    row = idx // 3
    col = idx % 3
    
    poly_features = poly_results[degree]['poly_features']
    model = poly_results[degree]['model']
    
    # Create smooth line for visualization
    X_plot = np.linspace(X_train_poly.min(), X_train_poly.max(), 300).reshape(-1, 1)
    X_plot_poly = poly_features.transform(X_plot)
    y_plot = model.predict(X_plot_poly)
    
    # Plot
    axes[row, col].scatter(X_train_poly, y_train_poly, alpha=0.3, color='blue', label='Training Data')
    axes[row, col].scatter(X_test_poly, y_test_poly, alpha=0.3, color='green', label='Test Data')
    axes[row, col].plot(X_plot, y_plot, color='red', linewidth=2, label='Polynomial Fit')
    axes[row, col].set_xlabel('BMI (normalized)')
    axes[row, col].set_ylabel('Disease Progression')
    axes[row, col].set_title(f'Degree {degree}\nTrain R²={poly_results[degree]["train_r2"]:.3f}, Test R²={poly_results[degree]["test_r2"]:.3f}')
    axes[row, col].legend()
    axes[row, col].grid(True, alpha=0.3)

# Hide the unused sixth subplot (2x3 grid, five degrees)
axes[1, 2].axis('off')

plt.tight_layout()
plt.show()


In [None]:
# Compare overfitting/underfitting across degrees
degrees_list = list(poly_results.keys())
train_r2_list = [poly_results[d]['train_r2'] for d in degrees_list]
test_r2_list = [poly_results[d]['test_r2'] for d in degrees_list]
train_rmse_list = [poly_results[d]['train_rmse'] for d in degrees_list]
test_rmse_list = [poly_results[d]['test_rmse'] for d in degrees_list]

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# R² comparison
axes[0].plot(degrees_list, train_r2_list, 'o-', label='Train R²', linewidth=2, markersize=8)
axes[0].plot(degrees_list, test_r2_list, 's-', label='Test R²', linewidth=2, markersize=8)
axes[0].set_xlabel('Polynomial Degree')
axes[0].set_ylabel('R² Score')
axes[0].set_title('R² Score vs Polynomial Degree')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# RMSE comparison
axes[1].plot(degrees_list, train_rmse_list, 'o-', label='Train RMSE', linewidth=2, markersize=8)
axes[1].plot(degrees_list, test_rmse_list, 's-', label='Test RMSE', linewidth=2, markersize=8)
axes[1].set_xlabel('Polynomial Degree')
axes[1].set_ylabel('RMSE')
axes[1].set_title('RMSE vs Polynomial Degree')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Observations (verify against the numbers and plots above):")
print("- Training R² cannot decrease as the degree grows, because each model nests the previous one")
print("- A widening gap between train and test scores at higher degrees suggests overfitting")
print("- Prefer the lowest degree whose test score is close to the best; degree 2 is used below")
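
Comparing degrees on a single train/test split can be noisy, and choosing the degree by its test score makes that score optimistic. One common alternative, sketched below under the assumption that 5-fold cross-validation on the training data is acceptable for this lab, scores each degree with `cross_val_score` on a `make_pipeline` of `PolynomialFeatures` and `LinearRegression`. The names `cv_r2_by_degree` and `best_degree` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same BMI-only setup and split as in the cells above.
X, y = load_diabetes(return_X_y=True)
X_bmi = X[:, [2]]
X_train, X_test, y_train, y_test = train_test_split(
    X_bmi, y, test_size=0.2, random_state=42
)

cv_r2_by_degree = {}
for degree in [1, 2, 3, 4, 5]:
    # The pipeline refits the polynomial expansion inside every fold.
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    cv_r2_by_degree[degree] = scores.mean()
    print(f"Degree {degree}: mean CV R² = {scores.mean():.4f} (std {scores.std():.4f})")

best_degree = max(cv_r2_by_degree, key=cv_r2_by_degree.get)
print(f"Degree with the highest mean CV R²: {best_degree}")
```

The test set is never touched during this comparison, so it remains available for a final, unbiased evaluation of the chosen degree.
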


## Step 5: Regularization with Ridge and Lasso Regression


In [None]:
# Prepare data for Ridge and Lasso Regression (using all features)
X_reg = diabetes.data
y = diabetes.target

# Split the data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y, test_size=0.2, random_state=42
)

# Test different alpha values
alphas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
ridge_results = {}
lasso_results = {}

print("Testing different alpha values for Ridge and Lasso Regression:")
print("=" * 70)

for alpha in alphas:
    # Ridge Regression
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X_train_reg, y_train_reg)
    y_train_pred_ridge = ridge_model.predict(X_train_reg)
    y_test_pred_ridge = ridge_model.predict(X_test_reg)
    
    ridge_results[alpha] = {
        'train_r2': r2_score(y_train_reg, y_train_pred_ridge),
        'test_r2': r2_score(y_test_reg, y_test_pred_ridge),
        'train_rmse': np.sqrt(mean_squared_error(y_train_reg, y_train_pred_ridge)),
        'test_rmse': np.sqrt(mean_squared_error(y_test_reg, y_test_pred_ridge)),
        'model': ridge_model
    }
    
    # Lasso Regression
    lasso_model = Lasso(alpha=alpha)
    lasso_model.fit(X_train_reg, y_train_reg)
    y_train_pred_lasso = lasso_model.predict(X_train_reg)
    y_test_pred_lasso = lasso_model.predict(X_test_reg)
    
    lasso_results[alpha] = {
        'train_r2': r2_score(y_train_reg, y_train_pred_lasso),
        'test_r2': r2_score(y_test_reg, y_test_pred_lasso),
        'train_rmse': np.sqrt(mean_squared_error(y_train_reg, y_train_pred_lasso)),
        'test_rmse': np.sqrt(mean_squared_error(y_test_reg, y_test_pred_lasso)),
        'model': lasso_model
    }
    
    print(f"Alpha = {alpha:6.2f} | Ridge Test R²: {ridge_results[alpha]['test_r2']:.4f} | "
          f"Lasso Test R²: {lasso_results[alpha]['test_r2']:.4f}")

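# Use alpha=1.0 (scikit-learn's default for Ridge and Lasso) for the detailed comparison.
# This value is fixed here; it is not chosen from the sweep above.
alpha_selected = 1.0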
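
The fixed `alpha_selected = 1.0` above is a convenience, not a tuned value. A possible way to pick alpha from the data, without consulting the test set, is cross-validation on the training split. The sketch below uses scikit-learn's `RidgeCV` and `LassoCV`; the grid `alpha_grid` (0.001 to 10, log-spaced), `cv=5`, and `max_iter=10000` are assumptions chosen for illustration, not values from the lab.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

# Same all-feature setup and split as in the cell above.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative grid; adjust the range to the scale of the features.
alpha_grid = np.logspace(-3, 1, 30)

# Each estimator picks the alpha with the best 5-fold CV score on the training data.
ridge_cv = RidgeCV(alphas=alpha_grid, cv=5).fit(X_train, y_train)
lasso_cv = LassoCV(alphas=alpha_grid, cv=5, max_iter=10000).fit(X_train, y_train)

print(f"Ridge: alpha = {ridge_cv.alpha_:.4f}, test R² = {ridge_cv.score(X_test, y_test):.4f}")
print(f"Lasso: alpha = {lasso_cv.alpha_:.4f}, test R² = {lasso_cv.score(X_test, y_test):.4f}")
```

The selected values for Ridge and Lasso need not match, since the two penalties act on different scales.
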


In [None]:
# Train Ridge and Lasso with selected alpha
ridge_model = Ridge(alpha=alpha_selected)
lasso_model = Lasso(alpha=alpha_selected)

ridge_model.fit(X_train_reg, y_train_reg)
lasso_model.fit(X_train_reg, y_train_reg)

# Make predictions
y_train_pred_ridge = ridge_model.predict(X_train_reg)
y_test_pred_ridge = ridge_model.predict(X_test_reg)
y_train_pred_lasso = lasso_model.predict(X_train_reg)
y_test_pred_lasso = lasso_model.predict(X_test_reg)

# Calculate metrics for Ridge
train_mae_ridge = mean_absolute_error(y_train_reg, y_train_pred_ridge)
train_mse_ridge = mean_squared_error(y_train_reg, y_train_pred_ridge)
train_rmse_ridge = np.sqrt(train_mse_ridge)
train_r2_ridge = r2_score(y_train_reg, y_train_pred_ridge)

test_mae_ridge = mean_absolute_error(y_test_reg, y_test_pred_ridge)
test_mse_ridge = mean_squared_error(y_test_reg, y_test_pred_ridge)
test_rmse_ridge = np.sqrt(test_mse_ridge)
test_r2_ridge = r2_score(y_test_reg, y_test_pred_ridge)

# Calculate metrics for Lasso
train_mae_lasso = mean_absolute_error(y_train_reg, y_train_pred_lasso)
train_mse_lasso = mean_squared_error(y_train_reg, y_train_pred_lasso)
train_rmse_lasso = np.sqrt(train_mse_lasso)
train_r2_lasso = r2_score(y_train_reg, y_train_pred_lasso)

test_mae_lasso = mean_absolute_error(y_test_reg, y_test_pred_lasso)
test_mse_lasso = mean_squared_error(y_test_reg, y_test_pred_lasso)
test_rmse_lasso = np.sqrt(test_mse_lasso)
test_r2_lasso = r2_score(y_test_reg, y_test_pred_lasso)

print(f"Ridge Regression Results (alpha = {alpha_selected}):")
print("=" * 50)
print("Training Set Metrics:")
print(f"  MAE:  {train_mae_ridge:.4f}")
print(f"  MSE:  {train_mse_ridge:.4f}")
print(f"  RMSE: {train_rmse_ridge:.4f}")
print(f"  R²:   {train_r2_ridge:.4f}")
print("\nTest Set Metrics:")
print(f"  MAE:  {test_mae_ridge:.4f}")
print(f"  MSE:  {test_mse_ridge:.4f}")
print(f"  RMSE: {test_rmse_ridge:.4f}")
print(f"  R²:   {test_r2_ridge:.4f}")

print("\n" + "=" * 50)
print(f"Lasso Regression Results (alpha = {alpha_selected}):")
print("=" * 50)
print("Training Set Metrics:")
print(f"  MAE:  {train_mae_lasso:.4f}")
print(f"  MSE:  {train_mse_lasso:.4f}")
print(f"  RMSE: {train_rmse_lasso:.4f}")
print(f"  R²:   {train_r2_lasso:.4f}")
print("\nTest Set Metrics:")
print(f"  MAE:  {test_mae_lasso:.4f}")
print(f"  MSE:  {test_mse_lasso:.4f}")
print(f"  RMSE: {test_rmse_lasso:.4f}")
print(f"  R²:   {test_r2_lasso:.4f}")


In [None]:
# Compare feature coefficients: Multiple Regression vs Ridge vs Lasso
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Multiple Regression coefficients
axes[0].barh(range(len(diabetes.feature_names)), multiple_lr.coef_)
axes[0].set_yticks(range(len(diabetes.feature_names)))
axes[0].set_yticklabels(diabetes.feature_names)
axes[0].set_xlabel('Coefficient Value')
axes[0].set_title('Multiple Regression\nCoefficients')
axes[0].grid(True, alpha=0.3, axis='x')

# Ridge coefficients
axes[1].barh(range(len(diabetes.feature_names)), ridge_model.coef_)
axes[1].set_yticks(range(len(diabetes.feature_names)))
axes[1].set_yticklabels(diabetes.feature_names)
axes[1].set_xlabel('Coefficient Value')
axes[1].set_title(f'Ridge Regression (α={alpha_selected})\nCoefficients (Shrunk)')
axes[1].grid(True, alpha=0.3, axis='x')

# Lasso coefficients
axes[2].barh(range(len(diabetes.feature_names)), lasso_model.coef_)
axes[2].set_yticks(range(len(diabetes.feature_names)))
axes[2].set_yticklabels(diabetes.feature_names)
axes[2].set_xlabel('Coefficient Value')
axes[2].set_title(f'Lasso Regression (α={alpha_selected})\nCoefficients (Some Zeroed)')
axes[2].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

# Count non-zero coefficients in Lasso
non_zero_lasso = np.sum(lasso_model.coef_ != 0)
print(f"\nLasso Regression: {non_zero_lasso} out of {len(diabetes.feature_names)} features have non-zero coefficients")
print("Features with zero coefficients (removed by Lasso):")
for i, name in enumerate(diabetes.feature_names):
    if lasso_model.coef_[i] == 0:
        print(f"  - {name}")


In [None]:
# Visualize how alpha affects Ridge and Lasso
alphas_list = list(ridge_results.keys())
ridge_test_r2 = [ridge_results[a]['test_r2'] for a in alphas_list]
lasso_test_r2 = [lasso_results[a]['test_r2'] for a in alphas_list]
ridge_test_rmse = [ridge_results[a]['test_rmse'] for a in alphas_list]
lasso_test_rmse = [lasso_results[a]['test_rmse'] for a in alphas_list]

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# R² comparison
axes[0].semilogx(alphas_list, ridge_test_r2, 'o-', label='Ridge', linewidth=2, markersize=8)
axes[0].semilogx(alphas_list, lasso_test_r2, 's-', label='Lasso', linewidth=2, markersize=8)
axes[0].set_xlabel('Alpha (Regularization Parameter)')
axes[0].set_ylabel('Test R² Score')
axes[0].set_title('Test R² vs Alpha (Regularization Parameter)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# RMSE comparison
axes[1].semilogx(alphas_list, ridge_test_rmse, 'o-', label='Ridge', linewidth=2, markersize=8)
axes[1].semilogx(alphas_list, lasso_test_rmse, 's-', label='Lasso', linewidth=2, markersize=8)
axes[1].set_xlabel('Alpha (Regularization Parameter)')
axes[1].set_ylabel('Test RMSE')
axes[1].set_title('Test RMSE vs Alpha (Regularization Parameter)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Observations about Regularization (verify against the table and plots above):")
print("- The features are scaled to very small magnitudes, so alpha acts strongly even at moderate values")
print("- Low alpha (0.01-0.1): Little regularization, models behave much like Multiple Regression")
print("- As alpha grows, coefficients shrink and test R² eventually drops (underfitting)")
print("- Lasso can zero out features (feature selection), while Ridge shrinks coefficients without zeroing them")
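
The feature-selection effect mentioned above can be made concrete by counting the non-zero Lasso coefficients at each alpha in the sweep. This is a minimal sketch that refits the models on the same split; `nonzero_by_alpha` is an illustrative name, and `max_iter=10000` is an assumption added to reduce convergence warnings at small alpha.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Same all-feature setup and split as in Step 5.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

nonzero_by_alpha = {}
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
    # Coefficients that Lasso sets exactly to zero drop the feature from the model.
    nonzero_by_alpha[alpha] = int(np.count_nonzero(lasso.coef_))
    print(f"alpha={alpha:>7}: {nonzero_by_alpha[alpha]} non-zero coefficients, "
          f"test R² = {lasso.score(X_test, y_test):.4f}")
```

When every coefficient is zero, Lasso predicts the training mean for all samples, so its test R² sits near zero or slightly below it.
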


In [None]:
# Visualize predictions: Ridge vs Lasso
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Ridge - Training
axes[0, 0].scatter(y_train_reg, y_train_pred_ridge, alpha=0.5, color='blue')
min_val = min(y_train_reg.min(), y_train_pred_ridge.min())
max_val = max(y_train_reg.max(), y_train_pred_ridge.max())
axes[0, 0].plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2)
axes[0, 0].set_xlabel('Actual Values')
axes[0, 0].set_ylabel('Predicted Values')
axes[0, 0].set_title(f'Ridge Regression - Training Set\nR² = {train_r2_ridge:.4f}')
axes[0, 0].grid(True, alpha=0.3)

# Ridge - Test
axes[0, 1].scatter(y_test_reg, y_test_pred_ridge, alpha=0.5, color='blue')
min_val = min(y_test_reg.min(), y_test_pred_ridge.min())
max_val = max(y_test_reg.max(), y_test_pred_ridge.max())
axes[0, 1].plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2)
axes[0, 1].set_xlabel('Actual Values')
axes[0, 1].set_ylabel('Predicted Values')
axes[0, 1].set_title(f'Ridge Regression - Test Set\nR² = {test_r2_ridge:.4f}')
axes[0, 1].grid(True, alpha=0.3)

# Lasso - Training
axes[1, 0].scatter(y_train_reg, y_train_pred_lasso, alpha=0.5, color='green')
min_val = min(y_train_reg.min(), y_train_pred_lasso.min())
max_val = max(y_train_reg.max(), y_train_pred_lasso.max())
axes[1, 0].plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2)
axes[1, 0].set_xlabel('Actual Values')
axes[1, 0].set_ylabel('Predicted Values')
axes[1, 0].set_title(f'Lasso Regression - Training Set\nR² = {train_r2_lasso:.4f}')
axes[1, 0].grid(True, alpha=0.3)

# Lasso - Test
axes[1, 1].scatter(y_test_reg, y_test_pred_lasso, alpha=0.5, color='green')
min_val = min(y_test_reg.min(), y_test_pred_lasso.min())
max_val = max(y_test_reg.max(), y_test_pred_lasso.max())
axes[1, 1].plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2)
axes[1, 1].set_xlabel('Actual Values')
axes[1, 1].set_ylabel('Predicted Values')
axes[1, 1].set_title(f'Lasso Regression - Test Set\nR² = {test_r2_lasso:.4f}')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


## Step 6: Model Comparison and Analysis


In [None]:
# Create a comprehensive comparison table
comparison_data = {
    'Model': [
        'Simple Linear Regression',
        'Multiple Regression',
        f'Polynomial Regression (deg={degree_selected})',
        f'Ridge Regression (α={alpha_selected})',
        f'Lasso Regression (α={alpha_selected})'
    ],
    'Train MAE': [
        train_mae_simple,
        train_mae_multiple,
        train_mae_poly,
        train_mae_ridge,
        train_mae_lasso
    ],
    'Test MAE': [
        test_mae_simple,
        test_mae_multiple,
        test_mae_poly,
        test_mae_ridge,
        test_mae_lasso
    ],
    'Train RMSE': [
        train_rmse_simple,
        train_rmse_multiple,
        train_rmse_poly,
        train_rmse_ridge,
        train_rmse_lasso
    ],
    'Test RMSE': [
        test_rmse_simple,
        test_rmse_multiple,
        test_rmse_poly,
        test_rmse_ridge,
        test_rmse_lasso
    ],
    'Train R²': [
        train_r2_simple,
        train_r2_multiple,
        train_r2_poly,
        train_r2_ridge,
        train_r2_lasso
    ],
    'Test R²': [
        test_r2_simple,
        test_r2_multiple,
        test_r2_poly,
        test_r2_ridge,
        test_r2_lasso
    ]
}

comparison_df = pd.DataFrame(comparison_data)
print("Model Performance Comparison:")
print("=" * 100)
print(comparison_df.to_string(index=False))


In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

models = comparison_df['Model'].values
x_pos = np.arange(len(models))

# Test R² comparison
axes[0, 0].bar(x_pos, comparison_df['Test R²'], color=['blue', 'green', 'orange', 'red', 'purple'], alpha=0.7)
axes[0, 0].set_xticks(x_pos)
axes[0, 0].set_xticklabels(models, rotation=45, ha='right')
axes[0, 0].set_ylabel('R² Score')
axes[0, 0].set_title('Test R² Score Comparison')
axes[0, 0].grid(True, alpha=0.3, axis='y')
axes[0, 0].set_ylim([0, max(comparison_df['Test R²']) * 1.1])

# Test RMSE comparison
axes[0, 1].bar(x_pos, comparison_df['Test RMSE'], color=['blue', 'green', 'orange', 'red', 'purple'], alpha=0.7)
axes[0, 1].set_xticks(x_pos)
axes[0, 1].set_xticklabels(models, rotation=45, ha='right')
axes[0, 1].set_ylabel('RMSE')
axes[0, 1].set_title('Test RMSE Comparison (Lower is Better)')
axes[0, 1].grid(True, alpha=0.3, axis='y')

# Train vs Test R²
axes[1, 0].bar(x_pos - 0.2, comparison_df['Train R²'], width=0.4, label='Train R²', alpha=0.7)
axes[1, 0].bar(x_pos + 0.2, comparison_df['Test R²'], width=0.4, label='Test R²', alpha=0.7)
axes[1, 0].set_xticks(x_pos)
axes[1, 0].set_xticklabels(models, rotation=45, ha='right')
axes[1, 0].set_ylabel('R² Score')
axes[1, 0].set_title('Train vs Test R² Comparison')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3, axis='y')

# Train vs Test RMSE
axes[1, 1].bar(x_pos - 0.2, comparison_df['Train RMSE'], width=0.4, label='Train RMSE', alpha=0.7)
axes[1, 1].bar(x_pos + 0.2, comparison_df['Test RMSE'], width=0.4, label='Test RMSE', alpha=0.7)
axes[1, 1].set_xticks(x_pos)
axes[1, 1].set_xticklabels(models, rotation=45, ha='right')
axes[1, 1].set_ylabel('RMSE')
axes[1, 1].set_title('Train vs Test RMSE Comparison')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()


### Key Observations and Analysis

#### 1. Model Performance Summary

**Simple Linear Regression:**
- Uses a single feature (BMI), making it the simplest model and a useful baseline
- Expected to have a lower R² than the models that use all 10 features, since one feature carries only part of the signal

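**Multiple Regression:**
- Uses all 10 features, so it can draw on more information than the BMI-only models
- Should show a clearly higher test R² and lower test RMSE than Simple Linear Regression; the comparison table gives the exact gap
- Overfitting risk is modest here (10 features, 353 training samples); a small train-test gap in the table would confirm this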

**Polynomial Regression:**
- Fitted on BMI only, so it models curvature in the BMI-target relationship, not interactions among features
- Degree 2 is used for the detailed results as a compromise between flexibility and generalization
- Higher degrees always fit the training data at least as well; a growing train-test gap signals overfitting
- Worth using only if the test score improves over the degree 1 fit

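**Ridge Regression:**
- Adds an L2 penalty on the coefficients to reduce overfitting
- Shrinks all coefficients toward zero but does not set any exactly to zero
- Stabilizes the estimates when features are correlated (multicollinearity), which the correlation heatmap can reveal
- Its performance relative to Multiple Regression depends on alpha; with this dataset's small feature scale, α=1.0 can already shrink the coefficients noticeably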

**Lasso Regression:**
- Adds an L1 penalty, which can set coefficients exactly to zero
- Acts as built-in feature selection; the count of non-zero coefficients is printed in Step 5
- Useful for interpretability when a few predictors dominate
- May underperform when many features carry signal, because zeroing them discards information
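
For reference, the objectives minimized by scikit-learn's `Ridge` and `Lasso`, as given in its documentation, are shown below. Here $n$ is the number of training samples, $w$ the coefficient vector, and $\alpha$ the regularization strength; the intercept is not penalized.

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Because Lasso's squared-error term is divided by $2n$ while Ridge's is not, the same numeric alpha does not represent the same strength of regularization in the two models.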

#### 2. Overfitting and Underfitting Analysis

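- **Simple Linear Regression**: Most likely to underfit, since one feature cannot capture the full signal
- **Multiple Regression**: Overfitting risk is limited with 10 features and 353 training samples; check the train-test gap in the table
- **Polynomial Regression**: Higher degrees increase flexibility and the risk of overfitting; the degree plots in Step 4 show whether a train-test gap opens up
- **Ridge Regression**: The L2 penalty reduces variance, but too large an alpha causes underfitting
- **Lasso Regression**: The L1 penalty reduces variance and removes features, but too large an alpha zeros every coefficient and underfits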

#### 3. Insights about the Diabetes Dataset

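- Several features contribute to predicting disease progression; using all of them outperforms BMI alone
- Any non-linearity in the BMI-target relationship shows up as a gain in test R² from degree 1 to degree 2 in Step 4; a negligible gain means a linear fit is adequate
- Regularization trades a little bias for more stable coefficient estimates
- Lasso's non-zero coefficients point to the strongest predictors at the chosen alpha
- The features are pre-scaled to very small magnitudes, so alpha values should be interpreted relative to that scale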

#### 4. Recommendations

1. **For best performance**: Use Multiple Regression, or Ridge Regression with an alpha tuned on the training data
2. **For interpretability**: Use Lasso Regression to identify key features
3. **For non-linear relationships**: Use Polynomial Regression with a low degree (2-3), and only if it improves the test score
4. **For simplicity**: Simple Linear Regression provides a baseline but limited accuracy

#### 5. Regularization Parameter (Alpha) Impact

- **Low alpha (0.01-0.1)**: Minimal regularization; both models behave much like Multiple Regression
- **Medium alpha (1.0-10.0)**: Already strong for this dataset because of the small feature scale; Ridge coefficients shrink substantially and Lasso can zero most or all features
- **High alpha (100-1000)**: Excessive regularization; predictions collapse toward the mean and the models underfit
- **Lasso vs Ridge**: Lasso performs feature selection by zeroing coefficients, while Ridge shrinks all coefficients without zeroing them
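
The shrinkage behaviour described above can be illustrated with the ridge normal equations. The sketch below is NumPy-only and uses synthetic, centered data (an assumption made so that no intercept is needed); `ridge_closed_form`, `true_w`, and the alpha values are illustrative and unrelated to the diabetes results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, centered data so that no intercept term is required.
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)
true_w = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.5, size=200)
y -= y.mean()

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y, the ridge normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.0, 0.01, 1.0, 100.0, 10000.0]
norms = [np.linalg.norm(ridge_closed_form(X, y, a)) for a in alphas]
for a, norm in zip(alphas, norms):
    print(f"alpha={a:>8}: ||w||_2 = {norm:.4f}")
```

With `alpha=0.0` this reduces to ordinary least squares; as alpha increases, the coefficient norm falls steadily toward zero, which is the underfitting regime noted for high alpha.
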
