# Diabetes Prediction Model Development 
This notebook documents the development of a machine learning pipeline for predicting diabetes. The workflow includes:

1. **Data Loading and Preparation**: Loading the engineered dataset, separating the features from the target variable, and making a stratified 75/25 train/test split.
2. **Model Definition**: Defining a variety of machine learning models, including:
    - Decision Tree
    - Random Forest
    - Gradient Boosting
    - Support Vector Machine (SVM)
    - K-Nearest Neighbors (KNN)
    - Naive Bayes
    - Logistic Regression
    - XGBoost
3. **Model Evaluation**: Using 5-fold stratified cross-validation on the training set to compare models on accuracy, precision, recall, and F1 score.
4. **Detailed Analysis**: Performing a detailed evaluation of the top 3 models, including confusion matrices, ROC curves, and precision-recall curves.
5. **Feature Importance**: Analyzing feature importance for the best-performing model.
6. **Summary and Next Steps**: Summarizing findings and outlining next steps for further development.

The goal is to identify the best-performing model for predicting diabetes and to provide insights into the most important features contributing to the predictions.


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc, precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
import joblib
import os

# Set styling for plots
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')

In [3]:
# 1. Load the Engineered Dataset
# -----------------------------------------------------

print("Loading engineered diabetes dataset...")
df = pd.read_csv('C:/Users/hp/Desktop/diabetes-analysis-project/outputs/features_engineered_scaled.csv')

Loading engineered diabetes dataset...
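
The cells in this notebook read and write absolute Windows paths, so they only run on the original machine, and `savefig` / `joblib.dump` fail if the target folders are missing. A minimal sketch of building the same paths relative to a project root, assuming the directory layout implied by the paths in the code (`outputs/`, `outputs/models/`, `visuals/static/`). `PROJECT_ROOT`, `project_paths`, `ensure_output_dirs`, and `artifact_name` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Hypothetical default: replace with the actual location of the project folder
PROJECT_ROOT = Path('diabetes-analysis-project')

def project_paths(root=PROJECT_ROOT):
    """Return the input file and output directories used by this notebook."""
    root = Path(root)
    return {
        'data': root / 'outputs' / 'features_engineered_scaled.csv',
        'models': root / 'outputs' / 'models',
        'figures': root / 'visuals' / 'static',
    }

def ensure_output_dirs(paths):
    """Create the output directories before any figure or model is saved."""
    for key in ('models', 'figures'):
        paths[key].mkdir(parents=True, exist_ok=True)

def artifact_name(model_name, suffix):
    """Mirror the notebook's naming: spaces to underscores, lower-cased."""
    return model_name.replace(' ', '_').lower() + suffix

paths = project_paths()
print(paths['data'].as_posix())
print((paths['models'] / artifact_name('Random Forest', '.pkl')).as_posix())
```

With helpers like these, `pd.read_csv(paths['data'])` and `joblib.dump(model, paths['models'] / artifact_name(model_name, '.pkl'))` would replace the hard-coded strings.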


In [4]:
# 2. Prepare Data for Modeling

print("\n Preparing Data for Modeling:")

# Split features and target
X = df.drop('Outcome', axis=1)
y = df['Outcome']

print(f"Features: {X.columns.tolist()}")
print(f"Target: Outcome (Diabetes = 1, No Diabetes = 0)")

# Split data into training and testing sets (stratified by outcome)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Testing set: {X_test.shape[0]} samples")

# Check class distribution
print("\nClass distribution:")
print(f"Training set: {pd.Series(y_train).value_counts(normalize=True).round(3) * 100}%")
print(f"Testing set: {pd.Series(y_test).value_counts(normalize=True).round(3) * 100}%")



 Preparing Data for Modeling:
Features: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Diabetes_Risk_Index', 'Insulin_Sensitivity', 'Age_BMI_Factor']
Target: Outcome (Diabetes = 1, No Diabetes = 0)
Training set: 741 samples
Testing set: 248 samples

Class distribution:
Training set: Outcome
1    50.1
0    49.9
Name: proportion, dtype: float64%
Testing set: Outcome
0    50.0
1    50.0
Name: proportion, dtype: float64%
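
The class-distribution output above ends in `float64%` because the print statements append `%` to the text representation of a whole pandas Series. A minimal sketch of a cleaner report using only the standard library. The label lists are reconstructed from the printed sizes and proportions (741 training rows at 50.1% / 49.9%, 248 test rows at 50% / 50%), so the per-class counts are an inference, not values read from the dataset. `class_percentages` is a hypothetical helper.

```python
from collections import Counter

def class_percentages(labels):
    """Return {class: percent of samples}, rounded to one decimal place."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: round(100 * n / total, 1) for cls, n in sorted(counts.items())}

# Counts inferred from the output above (assumption, not read from the CSV)
y_train_example = [1] * 371 + [0] * 370   # 741 training samples
y_test_example = [1] * 124 + [0] * 124    # 248 test samples

print("Training set:", class_percentages(y_train_example))
print("Testing set:", class_percentages(y_test_example))
```

The same dictionary can be produced from the real data with `class_percentages(y_train.tolist())`.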


In [6]:
# 3. Define Base Models

print("\nDefining Base Models:")

# Define a list of models to evaluate
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Naive Bayes': GaussianNB(),
    'XGBoost': XGBClassifier(random_state=42)
}

print(f"Evaluating {len(models)} base models:")
for name in models.keys():
    print(f"- {name}")



Defining Base Models:
Evaluating 8 base models:
- Logistic Regression
- Random Forest
- Gradient Boosting
- SVM
- KNN
- Decision Tree
- Naive Bayes
- XGBoost


In [7]:
# 4. Evaluate Base Models with Cross-Validation

print("\nEvaluating Base Models with Cross-Validation:")

# Define cross-validation strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Metrics to evaluate
metrics = {
    'Accuracy': [],
    'Precision': [],
    'Recall': [],
    'F1': []
}

# Collect one row of mean scores per model; the dataframe is built after the loop
rows = []

# Evaluate each model with cross-validation
for name, model in models.items():
    print(f"\nEvaluating {name}...")

    # Calculate cross-validation scores
    accuracy = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy')
    precision = cross_val_score(model, X_train, y_train, cv=cv, scoring='precision')
    recall = cross_val_score(model, X_train, y_train, cv=cv, scoring='recall')
    f1 = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1')

    # Print results
    print(f"Accuracy: {accuracy.mean():.4f} ± {accuracy.std():.4f}")
    print(f"Precision: {precision.mean():.4f} ± {precision.std():.4f}")
    print(f"Recall: {recall.mean():.4f} ± {recall.std():.4f}")
    print(f"F1 Score: {f1.mean():.4f} ± {f1.std():.4f}")

    # Record the mean scores for the results table
    rows.append({
        'Model': name,
        'Accuracy': accuracy.mean(),
        'Precision': precision.mean(),
        'Recall': recall.mean(),
        'F1': f1.mean(),
        'Std_Accuracy': accuracy.std()
    })

    # Store results for plotting
    metrics['Accuracy'].append(accuracy.mean())
    metrics['Precision'].append(precision.mean())
    metrics['Recall'].append(recall.mean())
    metrics['F1'].append(f1.mean())

# Build the results dataframe in one step. Concatenating onto an empty
# DataFrame inside the loop triggers a pandas FutureWarning.
results = pd.DataFrame(rows, columns=['Model', 'Accuracy', 'Precision', 'Recall', 'F1', 'Std_Accuracy'])
# Sort results by F1 score
results = results.sort_values('F1', ascending=False).reset_index(drop=True)
print("\nModel Rankings by F1 Score:")
print(results)

# Plot model comparison: one panel per metric
model_names = list(models.keys())
panels = [('Accuracy', 'Accuracy'), ('Precision', 'Precision'),
          ('Recall', 'Recall'), ('F1', 'F1 Score')]

plt.figure(figsize=(14, 10))

for i, (metric_key, label) in enumerate(panels, start=1):
    plt.subplot(2, 2, i)
    # Assign hue explicitly: passing palette without hue is deprecated in seaborn
    sns.barplot(x=model_names, y=metrics[metric_key], hue=model_names,
                palette='viridis', legend=False)
    plt.title(f'Model {label} Comparison', fontsize=14)
    plt.ylabel(label, fontsize=12)
    plt.xticks(rotation=45, ha='right')
    # Upper limit raised from 0.85 so bars for scores near 0.90 are not clipped
    plt.ylim(0.65, 0.95)

plt.tight_layout()
plt.savefig('C:/Users/hp/Desktop/diabetes-analysis-project/visuals/static/model_comparison.png')
plt.close()



Evaluating Base Models with Cross-Validation:

Evaluating Logistic Regression...
Accuracy: 0.8233 ± 0.0274
Precision: 0.8202 ± 0.0426
Recall: 0.8329 ± 0.0376
F1 Score: 0.8253 ± 0.0251

Evaluating Random Forest...


Accuracy: 0.8921 ± 0.0229
Precision: 0.8844 ± 0.0230
Recall: 0.9030 ± 0.0384
F1 Score: 0.8931 ± 0.0236

Evaluating Gradient Boosting...
Accuracy: 0.8853 ± 0.0191
Precision: 0.8847 ± 0.0204
Recall: 0.8868 ± 0.0290
F1 Score: 0.8855 ± 0.0195

Evaluating SVM...
Accuracy: 0.8637 ± 0.0172
Precision: 0.8502 ± 0.0213
Recall: 0.8841 ± 0.0248
F1 Score: 0.8666 ± 0.0170

Evaluating KNN...
Accuracy: 0.8502 ± 0.0237
Precision: 0.8282 ± 0.0144
Recall: 0.8843 ± 0.0451
F1 Score: 0.8549 ± 0.0255

Evaluating Decision Tree...
Accuracy: 0.8435 ± 0.0222
Precision: 0.8348 ± 0.0193
Recall: 0.8572 ± 0.0324
F1 Score: 0.8456 ± 0.0225

Evaluating Naive Bayes...
Accuracy: 0.7693 ± 0.0386
Precision: 0.7940 ± 0.0555
Recall: 0.7331 ± 0.0503
F1 Score: 0.7608 ± 0.0398

Evaluating XGBoost...
Accuracy: 0.8934 ± 0.0206
Precision: 0.8862 ± 0.0158
Recall: 0.9030 ± 0.0312
F1 Score: 0.8944 ± 0.0210

Model Rankings by F1 Score:
                 Model  Accuracy  Precision    Recall        F1  Std_Accuracy
0              XGBoost
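
The evaluation loop calls `cross_val_score` four times per model, once per metric, so every model is fitted 20 times. Because `cv` has a fixed `random_state`, all four calls see the same splits, and scikit-learn's `cross_validate` can score every metric from a single pass over the folds. A minimal sketch under stated assumptions: the data is synthetic (`make_classification`), standing in for `X_train` / `y_train`, and only Logistic Regression is shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (assumption): the notebook would pass X_train, y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ['accuracy', 'precision', 'recall', 'f1']

# One call fits the model once per fold and scores all four metrics on it
scores = cross_validate(
    LogisticRegression(max_iter=1000, random_state=42),
    X_demo, y_demo, cv=cv, scoring=scoring,
)

# Per-metric mean and standard deviation across the 5 folds
summary = {m: (scores[f'test_{m}'].mean(), scores[f'test_{m}'].std()) for m in scoring}
for m, (mean, std) in summary.items():
    print(f"{m}: {mean:.4f} ± {std:.4f}")
```

The result dictionary uses `test_<metric>` keys, so the loop body above would need only a small change to read from it instead of four separate arrays.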




In [8]:
# 5. Detailed Evaluation of Top 3 Models

print("\n🔍 Detailed Evaluation of Top 3 Models:")

# Select top 3 models based on F1 score
top_models = results.head(3)['Model'].tolist()
print(f"Top 3 models for detailed evaluation: {top_models}")

# Train and evaluate each top model
for model_name in top_models:
    print(f"\nDetailed evaluation of {model_name}:")
    model = models[model_name]
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on test set
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    print(f"Test Accuracy: {acc:.4f}")
    print(f"Test Precision: {prec:.4f}")
    print(f"Test Recall: {rec:.4f}")
    print(f"Test F1 Score: {f1:.4f}")
    
    # Classification Report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    
    # Plot confusion matrix
    plt.figure(figsize=(8, 6))
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title(f'Confusion Matrix - {model_name}', fontsize=14)
    plt.ylabel('True Label', fontsize=12)
    plt.xlabel('Predicted Label', fontsize=12)
    plt.savefig(f'C:/Users/hp/Desktop/diabetes-analysis-project/visuals/static/confusion_matrix_{model_name.replace(" ", "_").lower()}.png')
    plt.close()
    
    # Plot ROC curve
    plt.figure(figsize=(8, 6))
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate', fontsize=12)
    plt.ylabel('True Positive Rate', fontsize=12)
    plt.title(f'ROC Curve - {model_name}', fontsize=14)
    plt.legend(loc="lower right")
    plt.savefig(f'C:/Users/hp/Desktop/diabetes-analysis-project/visuals/static/roc_curve_{model_name.replace(" ", "_").lower()}.png')
    plt.close()
    
    # Plot Precision-Recall curve
    plt.figure(figsize=(8, 6))
    precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
    pr_auc = auc(recall, precision)
    
    plt.plot(recall, precision, color='darkorange', lw=2, label=f'PR curve (area = {pr_auc:.2f})')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('Recall', fontsize=12)
    plt.ylabel('Precision', fontsize=12)
    plt.title(f'Precision-Recall Curve - {model_name}', fontsize=14)
    plt.legend(loc="lower left")
    plt.savefig(f'C:/Users/hp/Desktop/diabetes-analysis-project/visuals/static/pr_curve_{model_name.replace(" ", "_").lower()}.png')
    plt.close()
    
    # Save the model
    joblib.dump(model, f'C:/Users/hp/Desktop/diabetes-analysis-project/outputs/models/{model_name.replace(" ", "_").lower()}.pkl')



🔍 Detailed Evaluation of Top 3 Models:
Top 3 models for detailed evaluation: ['XGBoost', 'Random Forest', 'Gradient Boosting']

Detailed evaluation of XGBoost:
Test Accuracy: 0.9032
Test Precision: 0.8571
Test Recall: 0.9677
Test F1 Score: 0.9091

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.84      0.90       124
           1       0.86      0.97      0.91       124

    accuracy                           0.90       248
   macro avg       0.91      0.90      0.90       248
weighted avg       0.91      0.90      0.90       248


Detailed evaluation of Random Forest:
Test Accuracy: 0.9073
Test Precision: 0.8741
Test Recall: 0.9516
Test F1 Score: 0.9112

Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.86      0.90       124
           1       0.87      0.95      0.91       124

    accuracy                           0.91       248
   macro avg       0.91      0.91
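
The confusion matrices are saved only as image files, but the underlying counts can be recovered from the printed test metrics and the class supports (124 per class). A minimal sketch that checks this arithmetic. The counts below are inferred from the output above, not read from the saved figures: for XGBoost, recall 0.9677 with 124 positives implies 120 true positives, and precision 0.8571 then implies 20 false positives; the Random Forest counts follow the same way. `metrics_from_counts` is a hypothetical helper.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'precision': precision,
        'recall': recall,
        'f1': 2 * precision * recall / (precision + recall),
    }

# Counts inferred from the printed test metrics (assumption)
xgb = metrics_from_counts(tn=104, fp=20, fn=4, tp=120)
rf = metrics_from_counts(tn=107, fp=17, fn=6, tp=118)

for name, m in (('XGBoost', xgb), ('Random Forest', rf)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Read this way, XGBoost misses 4 of 124 diabetic cases at the cost of 20 false alarms, while Random Forest misses 6 with 17 false alarms.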

In [9]:
# 6. Feature Importance for Top Model

print("\n🔍 Feature Importance Analysis:")
best_model_name = results.iloc[0]['Model']
best_model = models[best_model_name]

# Get feature importance if available
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'Feature': X.columns,
        'Importance': best_model.feature_importances_
    }).sort_values('Importance', ascending=False)
    
    print(f"\nFeature Importance for {best_model_name}:")
    print(feature_importance)
    
    # Plot feature importance
    plt.figure(figsize=(12, 8))
    sns.barplot(x='Importance', y='Feature', hue='Feature', data=feature_importance, palette='viridis', legend=False)
    plt.title(f'Feature Importance - {best_model_name}', fontsize=16)
    plt.xlabel('Importance', fontsize=14)
    plt.ylabel('Feature', fontsize=14)
    plt.tight_layout()
    plt.savefig('C:/Users/hp/Desktop/diabetes-analysis-project/visuals/static/feature_importance.png')
    plt.close()

elif best_model_name == 'Logistic Regression':
    # For logistic regression, use coefficients as importance
    feature_importance = pd.DataFrame({
        'Feature': X.columns,
        'Coefficient': best_model.coef_[0]
    }).sort_values('Coefficient', ascending=False)
    
    print(f"\nFeature Coefficients for {best_model_name}:")
    print(feature_importance)
    
    # Plot feature coefficients
    plt.figure(figsize=(12, 8))
    sns.barplot(x='Coefficient', y='Feature', hue='Feature', data=feature_importance, palette='viridis', legend=False)
    plt.title(f'Feature Coefficients - {best_model_name}', fontsize=16)
    plt.xlabel('Coefficient', fontsize=14)
    plt.ylabel('Feature', fontsize=14)
    plt.axvline(x=0, color='red', linestyle='--')
    plt.tight_layout()
    plt.savefig('C:/Users/hp/Desktop/diabetes-analysis-project/visuals/static/feature_coefficients.png')
    plt.close()



🔍 Feature Importance Analysis:

Feature Importance for XGBoost:
                     Feature  Importance
4                    Insulin    0.370307
10            Age_BMI_Factor    0.109198
8        Diabetes_Risk_Index    0.109029
3              SkinThickness    0.070951
7                        Age    0.065667
6   DiabetesPedigreeFunction    0.061496
5                        BMI    0.055311
9        Insulin_Sensitivity    0.049320
1                    Glucose    0.044270
0                Pregnancies    0.033671
2              BloodPressure    0.030780
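
The importances printed above sum to 1, so each value can be read as a share of the total. A minimal sketch that ranks the features and accumulates those shares, using the values copied from the output above. Nothing here calls XGBoost; it is plain arithmetic on the printed table.

```python
# Importances copied from the XGBoost output above
importances = {
    'Insulin': 0.370307,
    'Age_BMI_Factor': 0.109198,
    'Diabetes_Risk_Index': 0.109029,
    'SkinThickness': 0.070951,
    'Age': 0.065667,
    'DiabetesPedigreeFunction': 0.061496,
    'BMI': 0.055311,
    'Insulin_Sensitivity': 0.049320,
    'Glucose': 0.044270,
    'Pregnancies': 0.033671,
    'BloodPressure': 0.030780,
}

total = sum(importances.values())
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)

# Print each feature with its share and the running cumulative share
cumulative = 0.0
for feature, value in ranked:
    cumulative += value
    print(f"{feature:<26} {value / total:.3f}  cumulative {cumulative / total:.3f}")

top3_share = sum(value for _, value in ranked[:3]) / total
print(f"Top 3 features account for {top3_share:.1%} of total importance")
```

Insulin alone carries about 37% of the total and the top three features about 59%. Two of those three are engineered features, so their importance may overlap with the raw columns they were presumably built from.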





In [10]:
# 7. Model Development Summary
# -----------------------------------------------------

print("\n📝 Model Development Summary:")
best_model_metrics = results.iloc[0]
print(f"""
Model Development Summary:

1. Best Performing Model:
   - Name: {best_model_metrics['Model']}
   - Accuracy: {best_model_metrics['Accuracy']:.4f}
   - Precision: {best_model_metrics['Precision']:.4f}
   - Recall: {best_model_metrics['Recall']:.4f}
   - F1 Score: {best_model_metrics['F1']:.4f}

2. Top 3 Models:
   {results[['Model', 'F1']].head(3).to_string(index=False)}

3. All models have been saved to the '../../outputs/models/' directory

4. Key Findings:
   - {best_model_metrics['Model']} performs best among all models
   - Most important features: {feature_importance['Feature'].iloc[:3].tolist() if 'feature_importance' in locals() else 'N/A'}
   - Model evaluation metrics and visualizations created

Next Steps:
- Fine-tune the best model with hyperparameter optimization
- Evaluate model performance on different subsets of the data
- Create an ensemble of top-performing models
- Develop a prediction system for diabetes risk assessment
""")

print("\nModel development complete! Models saved to '../../outputs/models/' directory")


📝 Model Development Summary:

Model Development Summary:

1. Best Performing Model:
   - Name: XGBoost
   - Accuracy: 0.8934
   - Precision: 0.8862
   - Recall: 0.9030
   - F1 Score: 0.8944

2. Top 3 Models:
               Model       F1
          XGBoost 0.894367
    Random Forest 0.893136
Gradient Boosting 0.885493

3. All models have been saved to the '../../outputs/models/' directory

4. Key Findings:
   - XGBoost performs best among all models
   - Most important features: ['Insulin', 'Age_BMI_Factor', 'Diabetes_Risk_Index']
   - Model evaluation metrics and visualizations created

Next Steps:
- Fine-tune the best model with hyperparameter optimization
- Evaluate model performance on different subsets of the data
- Create an ensemble of top-performing models
- Develop a prediction system for diabetes risk assessment


Model development complete! Models saved to '../../outputs/models/' directory


# Model Development Summary:

## 1. Best Performing Model:
Scores are means over 5-fold stratified cross-validation on the training set.
   - Name: XGBoost
   - Accuracy: 0.8934
   - Precision: 0.8862
   - Recall: 0.9030
   - F1 Score: 0.8944


## 2. Top 3 Models:

| Model | F1 (cross-validated) |
|---|---|
| XGBoost | 0.894367 |
| Random Forest | 0.893136 |
| Gradient Boosting | 0.885493 |

## 3. The top 3 models have been saved to the '../../outputs/models/' directory

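## 4. Key Findings:
   - XGBoost has the highest cross-validated F1 (0.8944), narrowly ahead of Random Forest (0.8931). The gap is smaller than the fold-to-fold standard deviation (about 0.02), and on the held-out test set Random Forest scores slightly higher (F1 0.9112 vs. 0.9091), so the two are effectively tied.
   - Most important features: ['Insulin', 'Age_BMI_Factor', 'Diabetes_Risk_Index']
   - Confusion matrices, ROC curves, and precision-recall curves were generated for the top 3 models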
...
- Develop a prediction system for diabetes risk assessment
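
The hyperparameter-optimization step listed in the next steps could start from a small grid search. A minimal sketch under stated assumptions: it uses scikit-learn's `GridSearchCV` with `GradientBoostingClassifier` as a stand-in for XGBoost so the example needs no extra dependency, the data is synthetic, and the parameter grid is illustrative only, not a recommendation from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption): the notebook would pass X_train, y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=42)

# Illustrative grid (assumption): small enough to run in a few seconds
param_grid = {
    'n_estimators': [25, 50],
    'max_depth': [2, 3],
    'learning_rate': [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    scoring='f1',  # same ranking metric the notebook uses
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.4f}")
```

Tuning on the training split only, and touching the test set once at the end, keeps the test metrics an honest estimate.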
