### 03_Exploratory_Stats: Correlations & Logistic Regression

This notebook computes exploratory statistical tests on each of the three trial datasets (dfA, dfB_filled, dfC_filled):

- **Point‐Biserial** correlations between each feature and the binary Outcome.

- **Univariate logistic regression** → Odds Ratios and p‐values for each predictor.

- **Multivariate logistic regression** → adjusted coefficients, ORs, and p‐values.

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from scipy.stats import pointbiserialr
import statsmodels.api as sm

In [2]:
# Define Columns
missing_cols = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']

feature_cols = [
    'Pregnancies','Glucose','BloodPressure',
    'SkinThickness','Insulin','BMI',
    'DiabetesPedigreeFunction','Age'
]

target_col = 'Outcome'

In [3]:
# Load raw data and rebuild trial DataFrames (same code as in 02_Preprocessing.ipynb)
df = pd.read_csv('diabetes.csv')

In [6]:
# Trial A: drop zeros in missing_cols
dfA_mask = ~(df[missing_cols] == 0).any(axis=1)
dfA = df.loc[dfA_mask, feature_cols + [target_col]].reset_index(drop=True)

# Trial B: median impute zeros->NaN on feature_cols
dfB_filled = df.copy()
dfB_filled[missing_cols] = dfB_filled[missing_cols].replace(0, np.nan)
med_imp = SimpleImputer(strategy='median') 
dfB_filled[feature_cols] = med_imp.fit_transform(dfB_filled[feature_cols])

# Trial C: KNN impute zeros->NaN on feature_cols
dfC_filled = dfB_filled.copy()
knn_imp = KNNImputer(n_neighbors=5)
dfC_filled[feature_cols] = knn_imp.fit_transform(dfC_filled[feature_cols])

print('Shapes: dfA=', dfA.shape, 'dfB_filled=', dfB_filled.shape, 'dfC_filled=', dfC_filled.shape)

Shapes: dfA= (392, 9) dfB_filled= (768, 9) dfC_filled= (768, 9)


In [15]:
# Define explore() to run all tests on a given DataFrame
def explore(df_name, df_explore):
    print(f"\n***************** Exploratory stats for {df_name} *****************\n")
    # Point-Biserial
    for col in feature_cols:
        x = df_explore[col].dropna()
        y = df_explore.loc[df_explore[col].notnull(), 'Outcome']
        r, p = pointbiserialr(x, y)
        print(f"{col}: r_pb={r:.3f}, p={p:.3g}")
    print("----------------------------------------")

    # Univariate Logistic
    X_uni = sm.add_constant(df_explore[feature_cols])
    for col in feature_cols:
        df2 = df_explore[[col,'Outcome']].dropna()
        X2 = sm.add_constant(df2[col])
        y2 = df2['Outcome']
        m = sm.Logit(y2, X2).fit(disp=False)
        print(f"{col} - OR={np.exp(m.params[col]):.3f}, p={m.pvalues[col]:.3f}")
    print("----------------------------------------")

    # Multivariate Logistic
    df_mv = df_explore.dropna(subset=feature_cols)
    X_all = sm.add_constant(df_mv[feature_cols])
    y_all = df_mv['Outcome']
    m2 = sm.Logit(y_all, X_all).fit(disp=False)
    print(m2.summary2().tables[1])
    print("----------------------------------------")

explore("Trial A (drop zeros)", dfA)
explore("Trial B (median imputation)", dfB_filled)
explore("Trial C (KNN imputation)", dfC_filled)


***************** Exploratory stats for Trial A (drop zeros) *****************

Pregnancies: r_pb=0.257, p=2.61e-07
Glucose: r_pb=0.516, p=5.1e-28
BloodPressure: r_pb=0.193, p=0.000124
SkinThickness: r_pb=0.256, p=2.79e-07
Insulin: r_pb=0.301, p=1.12e-09
BMI: r_pb=0.270, p=5.56e-08
DiabetesPedigreeFunction: r_pb=0.209, p=2.95e-05
Age: r_pb=0.351, p=8.56e-13
----------------------------------------
Pregnancies - OR=1.181, p=0.000
Glucose - OR=1.043, p=0.000
BloodPressure - OR=1.035, p=0.000
SkinThickness - OR=1.056, p=0.000
Insulin - OR=1.006, p=0.000
BMI - OR=1.090, p=0.000
DiabetesPedigreeFunction - OR=3.600, p=0.000
Age - OR=1.077, p=0.000
----------------------------------------
                              Coef.  Std.Err.         z         P>|z|  \
const                    -10.040739  1.217675 -8.245826  1.640233e-16   
Pregnancies                0.082159  0.055426  1.482338  1.382504e-01   
Glucose                    0.038270  0.005768  6.635131  3.242151e-11   
BloodPressure   

In [16]:
# Creating a table based on stats results for better visualization and comparison
def compute_stats(df):
    rows = []
    # point-biserial & univariate logistic
    for col in feature_cols:
        # 1) point-biserial
        x = df[col].dropna()
        y = df.loc[df[col].notnull(), 'Outcome']
        r_pb, p_pb = pointbiserialr(x, y)
        # 2) univariate logistic
        df2 = df[[col,'Outcome']].dropna()
        X2 = sm.add_constant(df2[col])
        m1 = sm.Logit(df2['Outcome'], X2).fit(disp=False)
        or_uni = np.exp(m1.params[col])
        p_uni = m1.pvalues[col]
        rows.append((col, r_pb, p_pb, or_uni, p_uni))

    # multivariate Logistic
    df3 = df.dropna(subset=feature_cols)
    X3 = sm.add_constant(df3[feature_cols])
    m2 = sm.Logit(df3['Outcome'], X3).fit(disp=False)
    #attach multivariate results
    stats = []
    for col, r_pb, p_pb, or_uni, p_uni in rows:
        or_mv = np.exp(m2.params[col])
        p_mv = m2.pvalues[col]
        stats.append((col, r_pb, p_pb, or_uni, p_uni, or_mv, p_mv))
    return pd.DataFrame(
        stats,
        columns=['feature','r_pb','p_pb','OR_uni','p_uni','OR_mv','p_mv']
    ).set_index('feature')

statsA = compute_stats(dfA)             # complete-case
statsB = compute_stats(dfB_filled)      # median-imputed
statsC = compute_stats(dfC_filled)      # KNN-imputed

combined = pd.concat(
    [statsA, statsB, statsC],
    axis=1,
    keys=['Trial A', 'Trial B', 'Trial C']
)

# swap so metrics are outer, trials inner
combined = combined.swaplevel(0,1, axis=1).sort_index(axis=1, level=0)

# round for readability
combined = combined.round(3)

display(combined)

Unnamed: 0_level_0,OR_mv,OR_mv,OR_mv,OR_uni,OR_uni,OR_uni,p_mv,p_mv,p_mv,p_pb,p_pb,p_pb,p_uni,p_uni,p_uni,r_pb,r_pb,r_pb
Unnamed: 0_level_1,Trial A,Trial B,Trial C,Trial A,Trial B,Trial C,Trial A,Trial B,Trial C,Trial A,Trial B,Trial C,Trial A,Trial B,Trial C,Trial A,Trial B,Trial C
feature,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2
Pregnancies,1.086,1.133,1.133,1.181,1.147,1.147,0.138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.257,0.222,0.222
Glucose,1.039,1.039,1.039,1.043,1.041,1.041,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.516,0.493,0.493
BloodPressure,0.999,0.991,0.991,1.035,1.03,1.03,0.904,0.275,0.275,0.0,0.0,0.0,0.0,0.0,0.0,0.193,0.166,0.166
SkinThickness,1.011,1.003,1.003,1.056,1.056,1.056,0.511,0.793,0.793,0.0,0.0,0.0,0.0,0.0,0.0,0.256,0.215,0.215
Insulin,0.999,0.999,0.999,1.006,1.005,1.005,0.528,0.301,0.301,0.0,0.0,0.0,0.0,0.0,0.0,0.301,0.204,0.204
BMI,1.073,1.099,1.099,1.09,1.108,1.108,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.27,0.312,0.312
DiabetesPedigreeFunction,3.13,2.401,2.401,3.6,2.953,2.953,0.008,0.003,0.003,0.0,0.0,0.0,0.0,0.0,0.0,0.209,0.174,0.174
Age,1.035,1.013,1.013,1.077,1.043,1.043,0.065,0.171,0.171,0.0,0.0,0.0,0.0,0.0,0.0,0.351,0.238,0.238


### Key Takeaways from Exploratory Statistics

- **Glucose** is the single strongest predictor of diabetes in every trial.  
  - Univariate OR ≈ 1.04–1.05 (p < 0.001), point‐biserial r ≈ 0.50.  
  - Multivariate OR ≈ 1.04 (p < 0.001) even after adjusting for all other features.

- **BMI** and **Diabetes Pedigree Function** are the next most influential:  
  - **BMI:** uni‐ and multivariate OR ≈ 1.09–1.11 per unit (p < 0.01), r ≈ 0.27–0.31  
  - **Pedigree:** uni‐ and multivariate OR ≈ 2.4–3.6 (p < 0.01), r ≈ 0.17–0.21

- **Pregnancies** shows a small but consistent effect (uni OR ≈ 1.15, multivariate OR ≈ 1.08–1.13), with r ≈ 0.22–0.26 and p < 0.001 in most trials.

- **Age** is significant in univariate tests (OR ≈ 1.04–1.08, p < 0.05, r ≈ 0.24–0.35) but becomes borderline after full adjustment (p ≈ 0.06–0.17).

- **BloodPressure, SkinThickness, and Insulin** do not reach significance in the multivariate models (p > 0.05), despite weak univariate correlations (r ≈ 0.16–0.30).

- **Imputation method robustness:**  
  The ranking, OR magnitudes, and significance levels are highly consistent across all three trials (drop-zeros, median, KNN imputation), demonstrating that our core findings do not hinge on the missing-value strategy.

> **Conclusion:** Blood glucose, BMI, and genetic risk (pedigree) are the most reliable independent predictors of type-2 diabetes in the Pima dataset, with pregnancies and age playing secondary roles. Other measurements (blood pressure, skin fold, insulin) add little predictive value once these primary factors are accounted for.  
