# Diabetes Dataset - Comprehensive Regression Analysis

This notebook compares ten regression algorithms on the scikit-learn Diabetes dataset to predict disease progression.

## Dataset Overview
- **Target**: Quantitative measure of disease progression one year after baseline
- **Features**: 10 baseline variables (age, sex, BMI, blood pressure, and 6 blood serum measurements)
- **Samples**: 442 diabetes patients
- **Type**: Medical/Healthcare data
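
As a quick check of the overview above, the following sketch loads the dataset through scikit-learn's `load_diabetes` and inspects its shape and scaling. It assumes the default `scaled=True` behaviour, under which each feature column is mean-centered and scaled so that its sum of squares is 1 (so the columns are not unit-variance).

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Load features and target as NumPy arrays
X, y = load_diabetes(return_X_y=True)

n_samples, n_features = X.shape
col_means = X.mean(axis=0)           # expected to be ~0 for every column
col_sum_sq = (X ** 2).sum(axis=0)    # expected to be ~1 for every column

print(f"Samples: {n_samples}, features: {n_features}")
print(f"Largest |column mean|: {np.abs(col_means).max():.2e}")
print(f"Column sums of squares: {np.round(col_sum_sq, 3)}")
```

Because the columns have unit L2 norm rather than unit variance, applying `StandardScaler` later in the notebook still changes their scale.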

## Models to Compare:
1. Linear Regression
2. Ridge Regression
3. Lasso Regression
4. Decision Tree Regressor (DTR)
5. Random Forest Regressor (RFR)
6. Support Vector Regressor (SVR)
7. Gradient Boosting Regressor
8. XGBoost Regressor
9. LightGBM Regressor
10. Elastic Net


In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import xgboost as xgb
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")


In [None]:
# Load the Diabetes dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Create DataFrame for better visualization
feature_names = diabetes.feature_names
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y

print("Dataset Info:")
print(f"Shape: {df.shape}")
print(f"Features: {list(feature_names)}")
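print("Target: Disease progression measure")
# Take the first line of the description outside the f-string: a backslash inside
# an f-string expression is a SyntaxError before Python 3.12, and '\\n' would split
# on a literal backslash-n rather than on newlines.
descr_first_line = diabetes.DESCR.strip().splitlines()[0]
print(f"Description header: {descr_first_line}")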
print("\nFirst few rows:")
df.head()


In [None]:
# Dataset description and EDA
print("Dataset Description:")
print(diabetes.DESCR[:1000] + "...")

print("\n" + "="*50)
print("Dataset Statistics:")
print(df.describe())

print("\nMissing Values:")
print(df.isnull().sum())

print(f"\nTarget variable range: {y.min():.2f} to {y.max():.2f}")
print(f"Target variable mean: {y.mean():.2f}")
print(f"Target variable std: {y.std():.2f}")


In [None]:
# Target distribution analysis
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.hist(df['target'], bins=30, alpha=0.7, color='lightblue', edgecolor='black')
plt.title('Target Distribution (Disease Progression)')
plt.xlabel('Disease Progression Score')
plt.ylabel('Frequency')

plt.subplot(1, 3, 2)
plt.boxplot(df['target'])
plt.title('Target Boxplot')
plt.ylabel('Disease Progression Score')

plt.subplot(1, 3, 3)
df['target'].plot(kind='kde', alpha=0.7, color='darkblue')
plt.title('Target Density Plot')
plt.xlabel('Disease Progression Score')

plt.tight_layout()
plt.show()


In [None]:
# Correlation analysis
plt.figure(figsize=(12, 10))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=0.5, fmt='.3f')
plt.title('Correlation Matrix - Diabetes Dataset')
plt.tight_layout()
plt.show()

# Feature correlation with target
target_corr = correlation_matrix['target'].sort_values(ascending=False)
print("Feature correlation with disease progression:")
print(target_corr)

# Top correlated features
print(f"\nTop positive correlations:")
positive_corr = target_corr[target_corr > 0].drop('target')
for feature, corr in positive_corr.head(3).items():
    print(f"  {feature}: {corr:.3f}")

print("\nTop negative correlations:")
# Sort ascending so the most negative correlations come first
negative_corr = target_corr[target_corr < 0].sort_values()
for feature, corr in negative_corr.head(3).items():
    print(f"  {feature}: {corr:.3f}")


In [None]:
# Feature distributions
plt.figure(figsize=(15, 12))
for i, feature in enumerate(feature_names, 1):
    plt.subplot(3, 4, i)
    plt.hist(df[feature], bins=20, alpha=0.7, color='lightcoral', edgecolor='black')
    plt.title(f'{feature}')
    plt.xlabel('Value')
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()


In [None]:
# Data preparation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
print(f"Features: {len(feature_names)}")
print(f"Target range: {y.min():.2f} - {y.max():.2f}")

# Define evaluation function
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """Evaluate a regression model and return metrics"""
    model.fit(X_train, y_train)
    
    # Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Metrics
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    train_rmse = np.sqrt(train_mse)
    test_rmse = np.sqrt(test_mse)
    train_mae = mean_absolute_error(y_train, y_train_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
    cv_mean = cv_scores.mean()
    cv_std = cv_scores.std()
    
    return {
        'Model': model_name,
        'Train RMSE': train_rmse,
        'Test RMSE': test_rmse,
        'Train MAE': train_mae,
        'Test MAE': test_mae,
        'Train R²': train_r2,
        'Test R²': test_r2,
        'CV R² Mean': cv_mean,
        'CV R² Std': cv_std
    }, y_test_pred


In [None]:
# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0, random_state=42),
    'Lasso Regression': Lasso(alpha=1.0, random_state=42),
    'Elastic Net': ElasticNet(alpha=1.0, l1_ratio=0.5, random_state=42),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'XGBoost': xgb.XGBRegressor(random_state=42),
    'LightGBM': lgb.LGBMRegressor(random_state=42, verbose=-1),
    'SVR': SVR(kernel='rbf')
}

# Models fitted on scaled features (regularized linear models and SVR are scale-sensitive)
scaled_models = ['Linear Regression', 'Ridge Regression', 'Lasso Regression', 'Elastic Net', 'SVR']

print("Models to be evaluated:")
for i, model_name in enumerate(models.keys(), 1):
    print(f"{i:2d}. {model_name}")
    
print(f"\nNote: {len(scaled_models)} models will use scaled features")
print(f"      {len(models) - len(scaled_models)} models will use original features")


In [None]:
# Train and evaluate all models
results_list = []
predictions_dict = {}

print("Training and evaluating models...")
print("=" * 60)

for model_name, model in models.items():
    print(f"\nTraining {model_name}...")
    
    # Choose scaled or unscaled data
    if model_name in scaled_models:
        X_train_use = X_train_scaled
        X_test_use = X_test_scaled
    else:
        X_train_use = X_train
        X_test_use = X_test
    
    # Evaluate model
    results, y_pred = evaluate_model(model, X_train_use, X_test_use, y_train, y_test, model_name)
    results_list.append(results)
    predictions_dict[model_name] = y_pred
    
    print(f"✓ {model_name} - Test R²: {results['Test R²']:.4f}, Test RMSE: {results['Test RMSE']:.2f}")

print("\n" + "=" * 60)
print("All models trained successfully!")


In [None]:
# Results analysis
results_df = pd.DataFrame(results_list)
results_df = results_df.round(4)

# Sort by Test R² score (descending)
results_df = results_df.sort_values('Test R²', ascending=False)

print("Model Performance Comparison - Diabetes Dataset:")
print("=" * 90)
print(results_df.to_string(index=False))

# Best performing model
best_model = results_df.iloc[0]['Model']
best_r2 = results_df.iloc[0]['Test R²']
best_rmse = results_df.iloc[0]['Test RMSE']

print(f"\n🏆 Best performing model: {best_model}")
print(f"   Test R² = {best_r2:.4f}")
print(f"   Test RMSE = {best_rmse:.2f}")

# Model performance insights
print(f"\n📊 Performance Insights:")
print(f"   • Best R² Score: {results_df['Test R²'].max():.4f}")
print(f"   • Worst R² Score: {results_df['Test R²'].min():.4f}")
print(f"   • Performance Range: {results_df['Test R²'].max() - results_df['Test R²'].min():.4f}")
print(f"   • Average R² Score: {results_df['Test R²'].mean():.4f}")
print(f"   • Standard Deviation: {results_df['Test R²'].std():.4f}")


In [None]:
# Visualization of model performance
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# R² Score comparison
axes[0, 0].barh(results_df['Model'], results_df['Test R²'], color='lightblue')
axes[0, 0].set_xlabel('Test R² Score')
axes[0, 0].set_title('Model Performance Comparison (R² Score)')
axes[0, 0].grid(True, alpha=0.3)
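# R² can be negative for poorly fitting models, so extend the lower limit when needed
axes[0, 0].set_xlim(min(0, results_df['Test R²'].min()), 1)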

# RMSE comparison
axes[0, 1].barh(results_df['Model'], results_df['Test RMSE'], color='lightcoral')
axes[0, 1].set_xlabel('Test RMSE')
axes[0, 1].set_title('Model Performance Comparison (RMSE)')
axes[0, 1].grid(True, alpha=0.3)

# Cross-validation scores with error bars
cv_means = results_df['CV R² Mean']
cv_stds = results_df['CV R² Std']
axes[1, 0].barh(results_df['Model'], cv_means, color='lightgreen', 
                xerr=cv_stds, capsize=5)
axes[1, 0].set_xlabel('Cross-Validation R² Mean ± Std')
axes[1, 0].set_title('Cross-Validation Performance')
axes[1, 0].grid(True, alpha=0.3)

# MAE comparison
axes[1, 1].barh(results_df['Model'], results_df['Test MAE'], color='gold')
axes[1, 1].set_xlabel('Test MAE')
axes[1, 1].set_title('Model Performance Comparison (MAE)')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


In [None]:
# Prediction vs Actual plots for top 4 models
top_4_models = results_df.head(4)['Model'].tolist()

plt.figure(figsize=(16, 8))
for i, model_name in enumerate(top_4_models, 1):
    plt.subplot(2, 2, i)
    y_pred = predictions_dict[model_name]
    
    plt.scatter(y_test, y_pred, alpha=0.6, s=30)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
    plt.xlabel('Actual Disease Progression')
    plt.ylabel('Predicted Disease Progression')
    plt.title(f'{model_name}')
    
    # Add metrics to the plot
    r2 = results_df[results_df['Model'] == model_name]['Test R²'].iloc[0]
    rmse = results_df[results_df['Model'] == model_name]['Test RMSE'].iloc[0]
    # Use a real newline ('\n'); '\\n' would be drawn as a literal backslash-n
    plt.text(0.05, 0.95, f'R² = {r2:.4f}\nRMSE = {rmse:.2f}',
             transform=plt.gca().transAxes, verticalalignment='top',
             bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()


## Key Findings - Diabetes Dataset

### Model Performance Summary:
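- **Best Model**: The model ranked first by test R² in the comparison table is the strongest candidate for predicting disease progression; confirm the ranking against its cross-validation R², since the test split holds only 89 samples
- **Dataset Characteristics**: Small dataset (442 samples) with 10 features that scikit-learn ships mean-centered and scaled to unit L2 norm (not unit variance)
- **Feature Relationships**: BMI and blood pressure both correlate positively with the target (see the correlation analysis above)
- **Model Comparison**: Test R² differs across algorithm families; the performance range printed with the results table quantifies the spread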

### Medical Insights:
1. **Predictive Factors**: Baseline measurements differ in predictive strength; the correlation ranking above shows which ones relate most strongly to disease progression
2. **Model Interpretability**: Linear models provide clear coefficient interpretation for medical professionals
3. **Robustness**: Cross-validation helps ensure model reliability for medical applications
4. **Feature Standardization**: The features ship mean-centered and scaled to unit L2 norm, so the `StandardScaler` step only rescales them to unit variance
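
The interpretability and robustness points above can be sketched together as follows. This is a minimal illustration rather than the notebook's own procedure: it wraps `StandardScaler` and `Ridge` in a scikit-learn pipeline (the name `ridge_pipeline` is introduced here for illustration) so that each cross-validation fold fits its own scaler, then lists the fitted coefficients by absolute size.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Scaler and model live in one pipeline, so every CV fold fits its own scaler
ridge_pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

cv_scores = cross_val_score(ridge_pipeline, X_train, y_train, cv=5, scoring="r2")
print(f"CV R²: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Refit on the full training split and list coefficients by absolute size
ridge_pipeline.fit(X_train, y_train)
coefs = ridge_pipeline.named_steps["ridge"].coef_
ranked = sorted(zip(data.feature_names, coefs), key=lambda pair: abs(pair[1]), reverse=True)
for name, coef in ranked:
    print(f"{name:>4}: {coef:+.2f}")
```

Because the features are scaled to unit variance inside the pipeline, each coefficient reads as the change in the progression score per one standard deviation of that feature, which makes the magnitudes comparable across features.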

### Recommendations:
1. **Clinical Use**: Consider the top-performing models for disease progression prediction
2. **Interpretability vs Performance**: Balance between model accuracy and clinical interpretability
3. **Validation**: Extensive validation needed before clinical deployment
4. **Feature Engineering**: Consider interaction terms between biological markers
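
One way to try the interaction-term recommendation is scikit-learn's `PolynomialFeatures` with `interaction_only=True`. The sketch below is an assumption about how this might be set up, not a result from this notebook: it compares a plain Ridge pipeline with one that adds all pairwise feature products, using 5-fold cross-validated R².

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_diabetes(return_X_y=True)

# degree=2 with interaction_only=True adds pairwise products but no squared terms
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
n_expanded = interactions.fit_transform(X).shape[1]
print(f"Features after expansion: {n_expanded}")  # 10 originals + 45 pairwise products

baseline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
with_interactions = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)

for label, pipeline in [("baseline", baseline), ("with interactions", with_interactions)]:
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="r2")
    print(f"{label:>18}: CV R² = {scores.mean():.3f} ± {scores.std():.3f}")
```

With only 442 samples, expanding to 55 features raises the risk of overfitting, so whether the interaction terms help is something to read off the cross-validation scores rather than assume.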

### Next Steps:
- Hyperparameter optimization for best models
- Feature importance analysis for clinical insights
- Ensemble methods for improved robustness
- External validation on independent patient cohorts
- Integration with clinical decision support systems
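
The first two next steps can be sketched with `GridSearchCV` (imported at the top of this notebook but not yet used) and `permutation_importance` from `sklearn.inspection`. The model choice (Ridge) and the alpha grid are illustrative assumptions; the same pattern applies to any of the estimators compared above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

pipeline = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])

# Illustrative grid: candidate regularization strengths on a log scale
param_grid = {"ridge__alpha": np.logspace(-3, 3, 13)}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

print(f"Best alpha: {search.best_params_['ridge__alpha']:.4g}")
print(f"Best CV R²: {search.best_score_:.3f}")
print(f"Test R²:    {search.score(X_test, y_test):.3f}")

# Permutation importance: drop in test R² when one feature column is shuffled
importance = permutation_importance(
    search.best_estimator_, X_test, y_test, scoring="r2", n_repeats=20, random_state=42
)
order = np.argsort(importance.importances_mean)[::-1]
for idx in order:
    print(f"{data.feature_names[idx]:>4}: "
          f"{importance.importances_mean[idx]:.3f} ± {importance.importances_std[idx]:.3f}")
```

Permutation importance is model-agnostic, so the same call works for the tree-based models, where coefficients are not available.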
