# Predicting Diabetes Disease Progression from BMI using Linear Regression

## Phase 1: Decide on the Question

**Problem Statement:** Can we predict a patient's diabetes disease progression from their Body Mass Index (BMI)?

**Why This Matters:** Understanding the relationship between BMI and disease progression could help clinicians identify high-risk patients early and intervene proactively. This is a supervised regression problem: we model a linear relationship between one feature (BMI) and one continuous target (disease progression).

## Phase 2: Collect and Prepare Data

### 2.1 Load the Dataset
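We use the scikit-learn diabetes dataset, which contains 442 patient records with 10 baseline features (age, sex, BMI, average blood pressure, and six blood serum measurements). Each feature is already mean-centered and scaled, which is why the values printed below are small numbers around zero.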

In [1]:
# Import essential libraries for data manipulation, numerical computation, visualization, and machine learning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
from sklearn import datasets, linear_model, model_selection, metrics

# Force a reliable renderer
pio.renderers.default = "notebook_connected"

In [2]:
# Load the diabetes dataset from sklearn
X, y = datasets.load_diabetes(return_X_y=True)

In [3]:
# Explore the dataset structure
print(f"Dataset Shape: {X.shape}")
print(f"Number of patients: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"Target shape: {y.shape}")
print(f"\nFirst 5 records:")
print(X[:5])

Dataset Shape: (442, 10)
Number of patients: 442
Number of features: 10
Target shape: (442,)

First 5 records:
[[ 0.03807591  0.05068012  0.06169621  0.02187239 -0.0442235  -0.03482076
  -0.04340085 -0.00259226  0.01990749 -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 -0.02632753 -0.00844872 -0.01916334
   0.07441156 -0.03949338 -0.06833155 -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 -0.00567042 -0.04559945 -0.03419447
  -0.03235593 -0.00259226  0.00286131 -0.02593034]
 [-0.08906294 -0.04464164 -0.01159501 -0.03665608  0.01219057  0.02499059
  -0.03603757  0.03430886  0.02268774 -0.00936191]
 [ 0.00538306 -0.04464164 -0.03638469  0.02187239  0.00393485  0.01559614
   0.00814208 -0.00259226 -0.03198764 -0.04664087]]


In [4]:
# Visualize the distribution of disease progression target variable
# Prepare data
df = pd.DataFrame({'Disease Progression': y})

# Create interactive histogram
fig = px.histogram(
    df,
    x='Disease Progression',
    nbins=30,
    title='Distribution of Diabetes Disease Progression',
    opacity=0.7
)

fig.update_layout(
    xaxis_title='Disease Progression',
    yaxis_title='Frequency',
    bargap=0.03
)

fig.show()


### 2.2 Feature Selection and Visualization

We extract BMI (Body Mass Index), the third feature in the dataset (column index 2), as our single predictor variable.

In [5]:
# Extract the BMI column (index 2 - the third feature)
X = X[:, 2]
print(f"BMI data shape after extraction: {X.shape}")

# Reshape X to be a 2D array with one column (required by sklearn)
X = X.reshape(-1, 1)
print(f"BMI data shape after reshape: {X.shape}")

BMI data shape after extraction: (442,)
BMI data shape after reshape: (442, 1)
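
The BMI values are labeled "scaled" in the plots that follow because scikit-learn ships this dataset with every feature mean-centered and scaled so that each column's squared values sum to 1. The sketch below reproduces that transformation on a handful of made-up raw BMI values (`raw_bmi` is hypothetical, not taken from the dataset) to show why a scaled BMI of 0 corresponds to the average BMI.

```python
import math

# Hypothetical raw BMI values (kg/m^2), for illustration only
raw_bmi = [22.0, 27.5, 31.0, 24.5, 35.0]

# Step 1: subtract the mean so the column is centered on zero
mean_bmi = sum(raw_bmi) / len(raw_bmi)
centered = [b - mean_bmi for b in raw_bmi]

# Step 2: divide by the square root of the sum of squares (unit norm)
norm = math.sqrt(sum(c * c for c in centered))
scaled_bmi = [c / norm for c in centered]

print([round(s, 4) for s in scaled_bmi])
print(f"Sum of squares: {sum(s * s for s in scaled_bmi):.4f}")
```

Negative values therefore mean "below the average BMI" rather than a negative measurement.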


In [6]:
# Visualize the relationship between BMI and disease progression
# Convert data to DataFrame
df = pd.DataFrame({
    'Scaled BMI': X.flatten(),
    'Disease Progression': y
})

# Interactive scatter plot
fig = px.scatter(
    df,
    x='Scaled BMI',
    y='Disease Progression',
    opacity=0.6,
    title='Relationship Between BMI and Diabetes Disease Progression',
)

fig.update_traces(marker=dict(line=dict(width=0.5, color='black')))
fig.update_layout(
    xaxis_title='Scaled BMI',
    yaxis_title='Disease Progression'
)

fig.show()

# Compute and print correlation
corr = np.corrcoef(df['Scaled BMI'], df['Disease Progression'])[0, 1]
print(f"Correlation between BMI and Disease Progression: {corr:.4f}")

Correlation between BMI and Disease Progression: 0.5865
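
For reference, the Pearson correlation that `np.corrcoef` reports is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch on made-up numbers (`pearson_r` and the sample values are illustrative and not part of the notebook):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors cancel, so sums of products are enough
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only
xs = [-0.05, -0.02, 0.00, 0.03, 0.06]
ys = [90.0, 150.0, 120.0, 200.0, 210.0]
r = pearson_r(xs, ys)
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
```

For a one-feature least-squares fit with an intercept, R² measured on the data used for fitting equals r². A correlation of 0.5865 therefore implies that BMI alone can account for roughly 0.5865² ≈ 0.34 of the variance over the full dataset.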


### 2.3 Train-Test Split

We divide the data into training (67%) and testing (33%) sets so that the model is evaluated on records it did not see during fitting.

In [7]:
# Split the data into training and testing sets (67% train, 33% test)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.33, random_state=42)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

Training set size: 296 samples
Testing set size: 146 samples
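
The 296/146 split is slightly off an exact 67/33 division because sample counts must be whole numbers. Assuming `train_test_split` rounds a fractional `test_size` up and assigns the remaining samples to training (which matches the output above), the arithmetic is:

```python
import math

n_samples = 442
test_size = 0.33

# Assumption: the test count is rounded up and training gets the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"Train: {n_train}, Test: {n_test}")
print(f"Actual test fraction: {n_test / n_samples:.4f}")
```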


## Phase 3: Choose a Training Method

**Linear Regression** is appropriate for this task because:
- We're predicting a continuous numeric value (disease progression)
- We want to model a linear relationship between two variables
- The model is interpretable: the slope shows how much predicted disease progression changes with BMI
- It serves as a simple baseline against which more complex models can be compared

## Phase 4: Train the Model

In [8]:
# Create and train a linear regression model
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
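
With a single feature, ordinary least squares has a closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below applies those formulas to made-up data; `fit_simple_ols` is an illustrative helper and not necessarily what `LinearRegression` runs internally.

```python
def fit_simple_ols(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)
    return slope, intercept

# Illustrative data lying exactly on y = 3x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 4.0, 7.0, 10.0]
slope, intercept = fit_simple_ols(xs, ys)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```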

## Phase 5: Evaluate the Model

In [9]:
# Generate predictions on the test set
y_pred = model.predict(X_test)

In [10]:
# Calculate evaluation metrics
r2_score = metrics.r2_score(y_test, y_pred)
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print("Model Evaluation Metrics:")
print(f"- R² Score: {r2_score:.4f}")
print(f"  → The model explains {r2_score*100:.2f}% of the variance")
print(f"- Mean Absolute Error (MAE): {mae:.4f}")
print(f"  → On average, predictions are off by {mae:.2f} units")
print(f"- Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"  → Penalizes larger errors more heavily")

Model Evaluation Metrics:
- R² Score: 0.3164
  → The model explains 31.64% of the variance
- Mean Absolute Error (MAE): 50.9684
  → On average, predictions are off by 50.97 units
- Root Mean Squared Error (RMSE): 62.7221
  → Penalizes larger errors more heavily
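
All three metrics are simple functions of the residuals (actual minus predicted). A minimal sketch with made-up targets and predictions follows; the helper `regression_metrics` is illustrative, and the notebook itself relies on `sklearn.metrics`.

```python
import math

def regression_metrics(y_true, y_hat):
    """Return (r2, mae, rmse) for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_hat)]
    ss_res = sum(e * e for e in residuals)               # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(e) for e in residuals) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse

# Illustrative values only
y_true = [100.0, 150.0, 200.0, 250.0]
y_hat = [110.0, 140.0, 190.0, 270.0]
r2, mae, rmse = regression_metrics(y_true, y_hat)
print(f"R2 = {r2:.4f}, MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

RMSE is never smaller than MAE, so the gap between 62.72 and 50.97 in the output above suggests that some predictions miss by considerably more than the typical error.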


In [11]:
# Display model coefficients
print("\nModel Parameters:")
print(f"- Coefficient (Slope): {model.coef_[0]:.4f}")
print(f"  → Change in predicted disease progression per 1 unit of scaled BMI")
print(f"    (scaled BMI values stay close to 0, so real differences are much smaller)")
print(f"- Intercept: {model.intercept_:.4f}")
print(f"  → Predicted disease progression when scaled BMI = 0 (the average BMI)")


Model Parameters:
- Coefficient (Slope): 972.8763
  → Change in predicted disease progression per 1 unit of scaled BMI
    (scaled BMI values stay close to 0, so real differences are much smaller)
- Intercept: 150.2627
  → Predicted disease progression when scaled BMI = 0 (the average BMI)
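
Because scaled BMI values stay close to zero (the five rows printed earlier range from about -0.05 to 0.06), the slope is easier to read per 0.01 of scaled BMI than per whole unit. The sketch below plugs the fitted parameters reported above into the line equation; the example inputs are arbitrary.

```python
# Fitted parameters as printed above
slope = 972.8763
intercept = 150.2627

def predict_progression(scaled_bmi):
    """Predicted disease progression for a given scaled BMI value."""
    return intercept + slope * scaled_bmi

# A scaled BMI of 0 corresponds to the dataset's average BMI
for scaled_bmi in (-0.05, 0.0, 0.05):
    pred = predict_progression(scaled_bmi)
    print(f"scaled BMI {scaled_bmi:+.2f} -> predicted progression {pred:.1f}")

# Effect of a 0.01 increase in scaled BMI
print(f"Change per 0.01 of scaled BMI: {slope * 0.01:.2f}")
```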


## Phase 6: Predict and Visualize Results

In [12]:
# Sort test data for smooth regression line visualization
sort_indices = X_test.flatten().argsort()
X_test_sorted = X_test.flatten()[sort_indices]
y_pred_sorted = y_pred[sort_indices]

# Create dataframe for plotting
df_plot = pd.DataFrame({
    'Scaled BMI': X_test.flatten(),
    'Actual Progression': y_test,
    'Predicted Progression': y_pred
})

# Create scatter plot of actual data
fig = px.scatter(
    df_plot,
    x='Scaled BMI',
    y='Actual Progression',
    title='Linear Regression: Predicting Diabetes Disease Progression from BMI'
)

# Add regression line
fig.add_scatter(
    x=X_test_sorted,
    y=y_pred_sorted,
    mode='lines',
    name='Linear Regression Fit',
    line=dict(color='red', width=3)
)

fig.update_layout(
    xaxis_title='Scaled BMI',
    yaxis_title='Disease Progression'
)

fig.show()

## Summary


In [13]:
print("\nSummary:")
print(f"- The model captures a moderate positive linear relationship between BMI and disease progression (correlation of about 0.59)")
print(f"- With an R² score of {r2_score:.4f}, the model explains {r2_score*100:.2f}% of the variance in disease progression")
print(f"- The Mean Absolute Error of {mae:.4f} units indicates the typical prediction error magnitude")
print(f"- This model can serve as a baseline for more complex approaches or help clinicians identify risk patterns")
print(f"\nNext Steps for Improvement:")
print(f"- Validation and Tuning: Use k-fold cross-validation; plain linear regression has no hyperparameters, but regularized variants such as Ridge do")
print(f"- Feature Engineering: Include additional physiological measurements beyond BMI")
print(f"- Model Comparison: Test polynomial regression or other algorithms")
print(f"- Uncertainty Estimation: Calculate prediction intervals for clinical decision-making")


Summary:
- The model captures a moderate positive linear relationship between BMI and disease progression (correlation of about 0.59)
- With an R² score of 0.3164, the model explains 31.64% of the variance in disease progression
- The Mean Absolute Error of 50.9684 units indicates the typical prediction error magnitude
- This model can serve as a baseline for more complex approaches or help clinicians identify risk patterns

Next Steps for Improvement:
- Validation and Tuning: Use k-fold cross-validation; plain linear regression has no hyperparameters, but regularized variants such as Ridge do
- Feature Engineering: Include additional physiological measurements beyond BMI
- Model Comparison: Test polynomial regression or other algorithms
- Uncertainty Estimation: Calculate prediction intervals for clinical decision-making
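
The cross-validation step suggested above could look roughly like the following. This is a pure-Python sketch on synthetic data (the data generator, `k_fold_r2`, and all constants are made up for illustration); in this notebook one would more likely reach for scikit-learn's `model_selection.cross_val_score`.

```python
import random

def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def r2(ys, preds):
    """Coefficient of determination."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def k_fold_r2(xs, ys, k=5, seed=42):
    """Return the held-out R2 of a one-feature linear fit for each of k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, roughly equal folds
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [intercept + slope * xs[i] for i in fold]
        scores.append(r2([ys[i] for i in fold], preds))
    return scores

# Synthetic stand-in data: a noisy linear trend (not the diabetes dataset)
rng = random.Random(0)
xs = [rng.uniform(-0.1, 0.1) for _ in range(200)]
ys = [150 + 900 * x + rng.gauss(0, 50) for x in xs]

scores = k_fold_r2(xs, ys)
print("Fold R2 scores:", [round(s, 3) for s in scores])
print(f"Mean R2: {sum(scores) / len(scores):.3f}")
```

Averaging over several folds gives a steadier performance estimate than a single train/test split, whose result depends on which records happen to land in the test set.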
