# Diabetes Dataset Analysis & ML Model

This notebook **loads the provided `diabetes_dataset.csv`**, runs exploratory data analysis (EDA) and preprocessing, trains two classifiers (Logistic Regression and Random Forest), and includes **interactive plots** (Plotly) for exploration. It is designed to be copied and run as-is in your environment.

**File path used:** `/mnt/data/diabetes_dataset.csv`

**Generated:** 2025-08-11 14:08:57 UTC


In [None]:
# Standard imports (assumes these packages are already installed)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# scikit-learn for modeling
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, roc_curve, classification_report

# Plotly for interactive plots
import plotly.express as px
import plotly.graph_objects as go

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

print('Imports done')

In [None]:
# Load dataset
path = '/mnt/data/diabetes_dataset.csv'
df = pd.read_csv(path)
print('Dataset shape:', df.shape)
df.head()

In [None]:
# Quick info (df.info() prints directly and returns None, so it is not wrapped in display)
df.info()
display(df.describe().T)

In [None]:
# Check missing values and zeros that may represent missing
missing = df.isnull().sum()
zeros = (df == 0).sum()
print('Missing values:\n', missing)
print('\nZero counts (may be missing for some columns):\n', zeros)

**Note:** In many diabetes datasets, a zero in `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, or `BMI` is physiologically implausible and actually encodes a missing measurement. We'll replace zeros in these columns with NaN and impute each column with its median.
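One caveat: medians computed on the full dataset let test rows influence preprocessing. The sketch below shows the same zero-to-NaN-to-median idea fitted on the training split only, using scikit-learn's `SimpleImputer`. The values in `toy` and the names `toy` and `zero_cols` are made up for illustration and do not come from `diabetes_dataset.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame with made-up values; 0 stands in for "not measured"
toy = pd.DataFrame({
    'Glucose': [148, 85, 0, 89, 137, 116, 0, 115],
    'BMI': [33.6, 26.6, 23.3, 0.0, 43.1, 25.6, 31.0, 35.3],
    'Outcome': [1, 0, 1, 0, 1, 0, 1, 0],
})
zero_cols = ['Glucose', 'BMI']

# Mark the placeholder zeros as missing
for c in zero_cols:
    toy[c] = toy[c].replace(0, np.nan)

# Split first, so the test rows never contribute to the medians
train, test = train_test_split(toy, test_size=0.25, random_state=0)

# Learn the medians on the training rows only, then apply them to both splits
imputer = SimpleImputer(strategy='median')
train_filled = pd.DataFrame(imputer.fit_transform(train[zero_cols]),
                            columns=zero_cols, index=train.index)
test_filled = pd.DataFrame(imputer.transform(test[zero_cols]),
                           columns=zero_cols, index=test.index)

print('Medians learned from the training split:', imputer.statistics_)
print('Nulls left (train, test):',
      int(train_filled.isna().sum().sum()), int(test_filled.isna().sum().sum()))
```

For a dataset of this kind the difference from full-data medians is usually small, but fitting on the training split keeps the test metrics an honest estimate.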

In [None]:
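# Replace zeros with NaN for selected columns, then impute with the median
cols_zero_missing = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
# Keep only the columns that exist in this dataset (avoids a KeyError below)
cols_zero_missing = [c for c in cols_zero_missing if c in df.columns]
for c in cols_zero_missing:
    df[c] = df[c].replace(0, np.nan)

# Show missing counts after replacement
display(df[cols_zero_missing].isnull().sum())

# Impute with the median; assign back instead of calling fillna(inplace=True)
# on a column selection, which recent pandas versions deprecate
for c in cols_zero_missing:
    df[c] = df[c].fillna(df[c].median())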

print('Imputation done. Any nulls left?\n', df.isnull().sum().sum())

## Correlation matrix (interactive heatmap)

In [None]:
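# Interactive correlation heatmap using Plotly
# numeric_only=True skips any non-numeric columns, which would otherwise raise in pandas >= 2.0
corr = df.corr(numeric_only=True)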
fig = go.Figure(data=go.Heatmap(
    z=corr.values,
    x=corr.columns,
    y=corr.columns,
    colorbar=dict(title='corr')
))
fig.update_layout(title='Feature Correlation Heatmap', width=800, height=700)
fig

## Feature distributions (interactive)

In [None]:
# Interactive histograms for numerical columns (one by one)
num_cols = df.select_dtypes(include=np.number).columns.tolist()
for col in num_cols:
    fig = px.histogram(df, x=col, nbins=40, title=f'Distribution of {col}', marginal='box')
    fig.update_layout(width=800, height=450)
    display(fig)

## Pairwise scatterplots: top features vs Outcome

In [None]:
# Scatter plots of a few useful features against Outcome
target = 'Outcome' if 'Outcome' in df.columns else df.columns[-1]
plot_cols = ['Glucose','BMI','Age','Insulin']  # common useful features
plot_cols = [c for c in plot_cols if c in df.columns]
for c in plot_cols:
    fig = px.scatter(df, x=c, y=target, title=f'{c} vs {target}', marginal_y='violin', width=800, height=450)
    display(fig)

## Modeling: prepare dataset (features, target), split and scale

In [None]:
# Prepare X, y
target_col = 'Outcome' if 'Outcome' in df.columns else df.columns[-1]
X = df.drop(columns=[target_col])
y = df[target_col]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print('X_train shape:', X_train.shape, 'X_test shape:', X_test.shape)

## Train models: Logistic Regression & Random Forest

In [None]:
# Logistic Regression
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)
y_proba_lr = lr.predict_proba(X_test_scaled)[:,1]

# Random Forest
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
y_proba_rf = rf.predict_proba(X_test)[:,1]

print('Models trained')

## Model evaluation & comparison

In [None]:
# Evaluation function
def evaluate_model(name, y_test, y_pred, y_proba):
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc = roc_auc_score(y_test, y_proba)
    print(f'-- {name} --')
    print('Accuracy:', round(acc,4))
    print('Precision:', round(prec,4))
    print('Recall:', round(rec,4))
    print('F1-score:', round(f1,4))
    print('ROC AUC:', round(roc,4))
    print('\nClassification report:\n', classification_report(y_test, y_pred))
    cm = confusion_matrix(y_test, y_pred)
    return dict(acc=acc,prec=prec,rec=rec,f1=f1,roc=roc,cm=cm)

res_lr = evaluate_model('Logistic Regression', y_test, y_pred_lr, y_proba_lr)
res_rf = evaluate_model('Random Forest', y_test, y_pred_rf, y_proba_rf)

## Confusion matrix (interactive)

In [None]:
# Choose best model by F1-score
best_name = 'Random Forest' if res_rf['f1'] >= res_lr['f1'] else 'Logistic Regression'
best_pred = y_pred_rf if best_name=='Random Forest' else y_pred_lr
best_proba = y_proba_rf if best_name=='Random Forest' else y_proba_lr
best_cm = res_rf['cm'] if best_name=='Random Forest' else res_lr['cm']

labels = ['Negative','Positive']
z = best_cm.tolist()

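# confusion_matrix rows are actual classes and columns are predicted classes
fig = go.Figure(data=go.Heatmap(z=z, x=labels, y=labels, hoverongaps=False, showscale=False, text=z, texttemplate="%{text}"))
fig.update_layout(title=f'Confusion Matrix - {best_name}', xaxis_title='Predicted', yaxis_title='Actual',
                  yaxis_autorange='reversed', width=600, height=500)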
fig

## ROC Curve (interactive)

In [None]:
# ROC curves for both models
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_proba_lr)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_proba_rf)

fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr_lr, y=tpr_lr, mode='lines', name=f'Logistic Regression (AUC={res_lr["roc"]:.3f})'))
fig.add_trace(go.Scatter(x=fpr_rf, y=tpr_rf, mode='lines', name=f'Random Forest (AUC={res_rf["roc"]:.3f})'))
fig.add_trace(go.Scatter(x=[0,1], y=[0,1], mode='lines', name='Random', line=dict(dash='dash')))
fig.update_layout(title='ROC Curves', xaxis_title='False Positive Rate', yaxis_title='True Positive Rate', width=800, height=600)
fig

## Feature importance (Random Forest)

In [None]:
# Feature importance from Random Forest (if available)
if hasattr(rf, 'feature_importances_'):
    fi = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
    fig = px.bar(fi.reset_index().rename(columns={'index':'feature',0:'importance'}),
                 x='importance', y='feature', orientation='h', title='Feature Importance (Random Forest)')
    fig.update_layout(width=800, height=500)
    # A bare `fig` inside an if-block is not rendered; display it explicitly
    display(fig)
else:
    print('RandomForest feature importances not available')

## Save model (optional) & Next steps

- You can export the trained scaler and model using `joblib` or `pickle` for later inference.
- Next improvements: cross-validation, hyperparameter tuning (GridSearchCV), SHAP explanations, and deployment via FastAPI.
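
The first steps listed above (cross-validated tuning and export) could look roughly like the sketch below. It is an illustration under assumptions, not part of the notebook's pipeline: it uses synthetic data from `make_classification` as a stand-in for the diabetes features, the `C` grid is arbitrary, and the file name `diabetes_lr_pipeline.joblib` is hypothetical. Wrapping the scaler and the model in one `Pipeline` means a single object is tuned, saved, and reloaded.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features (8 columns, binary target)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Scaler and classifier travel together, so scaling is refit inside each CV fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validated search over an illustrative grid of regularization strengths
search = GridSearchCV(pipe, param_grid={'clf__C': [0.1, 1.0, 10.0]}, cv=5, scoring='f1')
search.fit(X_tr, y_tr)
print('Best params:', search.best_params_)
print('Best CV F1:', round(search.best_score_, 3))

# Persist the refitted best pipeline (hypothetical file name, temporary directory)
model_path = os.path.join(tempfile.mkdtemp(), 'diabetes_lr_pipeline.joblib')
joblib.dump(search.best_estimator_, model_path)

# Reload and confirm the restored pipeline reproduces the predictions
restored = joblib.load(model_path)
same = bool((restored.predict(X_te) == search.best_estimator_.predict(X_te)).all())
print('Restored pipeline reproduces predictions:', same)
```

Only load `joblib` or `pickle` files from sources you trust, since unpickling can execute arbitrary code.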

----

**End of notebook.**