# EDA – Correlations (Diabetes Capstone)

**Goal:** understand pairwise relationships among features and their association with `Outcome`.

## 📘 Overview
This notebook performs exploratory analysis on the **UCI Pima Indians Diabetes dataset**, combining initial data cleaning with feature-relationship exploration.
It computes and compares **Pearson**, **Spearman**, and **Kendall** correlations to identify which variables are most strongly associated with the diabetes `Outcome`, and visualizes those relationships in a JMP-style scatterplot matrix.

Key steps:
1) Load and clean the data (Pima "zero" values are treated as missing).
2) Summarize missingness and descriptive statistics.
3) Compute correlations across all numerical features.
4) Generate and save correlation tables and scatterplot figures.
5) Capture data-driven insights and next-step considerations for future modeling.

Artifacts are saved in:
* reports/tables/diabetes_corr_spearman_Outcome.csv (plus the matching `kendall` and `pearson` files)
* reports/figures/diabetes_scatter_matrix_spearman_Outcome.png (plus the matching `kendall` and `pearson` files)


# Setup

In [32]:
# ---- Setup for ALL Notebooks ---
from proj.setup import setup_notebook
setup_notebook() # sets up paths, autoreload, etc.

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
✅ Notebook setup complete.
Project root: /Users/b-rad.j.neiman/CODE/diabetes-capstone


In [33]:
# Core libraries, plus the reusable plotting package (installed with `pip install -e .`)
import numpy as np
import pandas as pd
from ds_viz import scatter_matrix_with_corr, quick_corr_plot

## Load data

In [None]:
assert DATA_PATH.exists(), f"Couldn't find {DATA_PATH}. Put your CSV at that path or update DATA_PATH."
df = pd.read_csv(DATA_PATH)
df.head()


## Quick structure & summary

In [None]:
df.shape, df.dtypes.to_frame("dtype")


In [None]:
df.describe(include="all").T


## Pima Indians “0 = missing” clean-up
In the classic dataset, zeros in these columns are invalid and should be treated as missing.

In [None]:
zero_as_na = ["Glucose","BloodPressure","SkinThickness","Insulin","BMI"]
df = df.copy()
df[zero_as_na] = df[zero_as_na].replace(0, np.nan)

missing = df.isna().mean().sort_values(ascending=False).to_frame("missing_frac")
missing
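
As a quick sanity check on the replacement above, the sketch below uses a toy frame (values are made up for illustration, not taken from the dataset) to show that `replace(0, np.nan)` touches only the listed columns and leaves legitimate zeros, such as a `Pregnancies` count of 0, intact. The name `cols_zero_invalid` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the Pima dataset).
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5, 1],        # zero is a valid count here
    "Glucose":     [148, 0, 183, 89],   # zero is not a plausible reading
    "Insulin":     [0, 0, 94, 168],
    "Outcome":     [1, 0, 1, 0],
})

# Only these columns get the zero-to-NaN treatment.
cols_zero_invalid = ["Glucose", "Insulin"]
toy_clean = toy.copy()
toy_clean[cols_zero_invalid] = toy_clean[cols_zero_invalid].replace(0, np.nan)

# Fraction of missing values per column after the replacement.
missing_frac = toy_clean.isna().mean()
print(missing_frac)
```

Restricting the replacement to an explicit column list is what keeps valid zeros in `Pregnancies` and `Outcome` from being turned into missing values.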


## (Optional) Simple median imputation for visualization
For EDA visuals, you can work with NaNs (the plotting function handles them).  
If you prefer a “filled” copy for some charts, use the median imputation below.

In [None]:
df_eda = df.copy()
for c in df_eda.select_dtypes(include=np.number):
    df_eda[c] = df_eda[c].fillna(df_eda[c].median())

df_eda.head()
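
One caveat worth keeping in mind when choosing between `df` and a median-filled copy: `DataFrame.corr` already skips NaNs pairwise, while imputed rows all share one value and so tend to pull correlations toward zero. The sketch below illustrates this on synthetic data (an assumption for illustration; the variable names and the 40% missing rate are hypothetical, not measured from the Pima data).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data (hypothetical): y depends linearly on x plus noise.
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)
demo = pd.DataFrame({"x": x, "y": y})

# Knock out roughly 40% of x at random to mimic heavy missingness.
mask = rng.random(n) < 0.4
demo.loc[mask, "x"] = np.nan

# DataFrame.corr drops NaNs pairwise, so only complete (x, y) rows are used.
r_pairwise = demo.corr(method="pearson").loc["x", "y"]

# Median-filled copy: the imputed x values carry no information about y.
filled = demo.fillna(demo.median())
r_filled = filled.corr(method="pearson").loc["x", "y"]

print(f"pairwise-complete r = {r_pairwise:.3f}")
print(f"median-imputed r    = {r_filled:.3f}")
```

For that reason it is probably safer to compute the correlation tables on the NaN-preserving frame and reserve the filled copy for charts that cannot handle gaps.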


## Rank features by correlation with `Outcome`

In [None]:
for m in ["pearson","spearman","kendall"]:
    s = df.corr(method=m)[TARGET].drop(TARGET).abs().sort_values(ascending=False)  # drop the target's self-correlation (1.0)
    print(f"\nTop correlations with {TARGET} by {m}:")
    display(s.to_frame(f"abs_{m}"))
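
To compare methods at a glance, the per-method rankings can also be collected into a single table. The helper below is a minimal sketch, not part of the project code: `corr_with_target` and the synthetic columns are hypothetical names. The demo call passes only Pearson and Spearman because pandas relies on SciPy for Kendall; with SciPy installed, the default `methods` tuple covers all three.

```python
import numpy as np
import pandas as pd

def corr_with_target(frame, target, methods=("pearson", "spearman", "kendall")):
    """Absolute correlation of each feature with `target`, one column per method."""
    cols = {
        m: frame.corr(method=m)[target].drop(target).abs()
        for m in methods
    }
    # Sort by the first requested method so the strongest features come first.
    return pd.DataFrame(cols).sort_values(methods[0], ascending=False)

# Synthetic stand-in data (hypothetical); in the notebook, pass `df` and `TARGET`.
rng = np.random.default_rng(42)
n = 300
strong = rng.normal(size=n)
weak = rng.normal(size=n)
outcome = (strong + 0.2 * weak + rng.normal(size=n) > 0).astype(int)
demo = pd.DataFrame({"strong": strong, "weak": weak, "Outcome": outcome})

table = corr_with_target(demo, "Outcome", methods=("pearson", "spearman"))
print(table.round(3))
```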


### 📊 Interpretation: Feature Correlations with Outcome
Across all three correlation methods (Pearson, Spearman, Kendall), **Glucose** consistently shows the **strongest relationship** with diabetes outcome (~0.48-0.49), making it the strongest single correlate of `Outcome` in this dataset.

Beyond Glucose, the **ranking of other features varies slightly** by method:
* **BMI**, **Insulin**, and **Age** form a *second tier* of moderately correlated features (|r| ≈ 0.25-0.37).
* **SkinThickness** and **Pregnancies** show weaker, but still positive associations.
* **BloodPressure** and **DiabetesPedigreeFunction** have the *lowest correlations* across all methods (<0.18).

The small shifts in ranking between Pearson, Spearman, and Kendall suggest **some non-linear but monotonic effects**, especially for **Insulin** and **Age**, which rank higher under the rank-based correlations.

Overall, the pattern aligns with clinical understanding: **Glucose metabolism and body composition (BMI, Insulin)** remain the most predictive domains for diabetes risk.
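
The point about rank-based methods can be made concrete with a small synthetic example (an assumption for illustration, not the Pima data): when a relationship is monotonic but strongly curved, Pearson understates it while Spearman, which works on ranks, does not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic example (hypothetical): y grows exponentially with x, so the
# relationship is perfectly monotonic but far from a straight line.
x = rng.uniform(0, 5, size=200)
y = np.exp(2 * x)
demo = pd.DataFrame({"x": x, "y": y})

pearson = demo.corr(method="pearson").loc["x", "y"]
spearman = demo.corr(method="spearman").loc["x", "y"]

print(f"Pearson  = {pearson:.3f}")   # noticeably below 1: the trend is curved
print(f"Spearman = {spearman:.3f}")  # ranks of x and y agree exactly
```

A feature that climbs in the ranking when moving from Pearson to Spearman is therefore a reasonable candidate for a transformation or a non-linear model term.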

### 🧠 Modeling Implications
* **Prioritize Glucose as a primary predictive feature** -- its consistent top correlation supports its central role in early modeling baselines.
* **Include BMI, Insulin, and Age as key secondary variables**, capturing body composition and metabolic response; consider interaction terms among these.
* **Expect non-linear effects** (per the rank-based correlations) -- tree-based models (e.g., Random Forest, XGBoost) may capture these relationships better than linear models, while a regularized linear model such as ElasticNet remains a useful baseline.
* **Low-correlation variables** like BloodPressure and DiabetesPedigreeFunction may still contribute **in a multivariate context**; avoid dropping them before testing feature importance.
* Use this correlation matrix as a **feature selection reference**, not a filter -- combine with domain context and model-based importance metrics in later notebooks.

## JMP-style Scatterplot Matrix (Spearman, target-based)
Saves a CSV correlation table and a PNG figure into `reports/`.

In [None]:
corr, fig = scatter_matrix_with_corr(
    df,                                # you can use df_eda as well; both are fine for visuals
    method="spearman",
    select_strategy="target",
    target=TARGET,
    max_vars=8,                        # adjust as desired
    standardize=True,                  # comparable axes (z-scores)
    diag_hist_sharey=True,
    diag_hist_density=True,
    save_table=str(OUT_DIR_TAB / "diabetes_corr_spearman_Outcome.csv"),
    save_fig=str(OUT_DIR_FIG / "diabetes_scatter_matrix_spearman_Outcome.png")
)
print("Saved:", OUT_DIR_TAB / "diabetes_corr_spearman_Outcome.csv")
print("Saved:", OUT_DIR_FIG / "diabetes_scatter_matrix_spearman_Outcome.png")
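
For readers without the `ds_viz` package, a rough stand-in can be put together with `pandas.plotting.scatter_matrix`. This is a swapped-in technique, not the `scatter_matrix_with_corr` implementation, and it does not reproduce its feature selection or file saving. The data below is synthetic (hypothetical values), and the annotation placement is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(7)
n = 200

# Synthetic stand-in data (hypothetical); in the notebook this would be `df`.
demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n),
    "BMI": rng.normal(32, 7, n),
    "Age": rng.integers(21, 80, n).astype(float),
})
demo.loc[rng.random(n) < 0.1, "BMI"] = np.nan  # sprinkle in missing values

# z-score each column so every panel shares a comparable scale.
z = (demo - demo.mean()) / demo.std()

# scatter_matrix masks NaNs per pair of columns, so no imputation is needed.
axes = scatter_matrix(z, diagonal="hist", alpha=0.5, figsize=(6, 6))
rho = z.corr(method="spearman")

# Annotate each off-diagonal panel with its Spearman coefficient.
for i, row in enumerate(rho.index):
    for j, col in enumerate(rho.columns):
        if i != j:
            axes[i, j].annotate(
                f"{rho.loc[row, col]:.2f}",
                xy=(0.05, 0.9),
                xycoords="axes fraction",
                fontsize=8,
            )

fig = axes[0, 0].figure  # handle for fig.savefig(...) if a file is wanted
```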


### (Optional) One-liner wrapper instead of the cell above
Uncomment to run the compact version. Produces method/target-coded filenames automatically.

In [None]:
# quick_corr_plot(str(DATA_PATH), target=TARGET, max_vars=8, method="spearman", standardize=True)


## (Optional) Also generate Pearson & Kendall artifacts (no display)

In [None]:
_ = scatter_matrix_with_corr(
    df, method="pearson",
    select_strategy="target", target=TARGET, max_vars=8,
    standardize=True, show=False,
    save_table=str(OUT_DIR_TAB / "diabetes_corr_pearson_Outcome.csv"),
    save_fig=str(OUT_DIR_FIG / "diabetes_scatter_matrix_pearson_Outcome.png")
)

_ = scatter_matrix_with_corr(
    df, method="kendall",
    select_strategy="target", target=TARGET, max_vars=8,
    standardize=True, show=False,
    save_table=str(OUT_DIR_TAB / "diabetes_corr_kendall_Outcome.csv"),
    save_fig=str(OUT_DIR_FIG / "diabetes_scatter_matrix_kendall_Outcome.png")
)

print("Also saved Pearson & Kendall artifacts.")


## Peek at the saved Spearman correlation table

In [None]:
corr_path = OUT_DIR_TAB / "diabetes_corr_spearman_Outcome.csv"
if corr_path.exists():
    corr_df = pd.read_csv(corr_path, index_col=0)
    display(corr_df)  # a bare expression inside an `if` block is not rendered
else:
    print("Spearman corr table not found (skip if you used the one-liner instead).")


## 🧩 Notes & takeaways

* **Top correlations**: *Glucose* shows the strongest association with *diabetes outcome* across all three methods (|r| ≈ 0.48-0.49), followed by *Insulin, BMI, and Age*.
* **Direction & strength**: These features exhibit moderate positive correlations, consistent with clinical intuition -- higher glucose, insulin resistance, and body mass are associated with a greater likelihood of diabetes.
* **Data caveats**: Pima dataset "zero" entries were treated as missing. Correlations indicate association only and should not be interpreted as causal.
* **Next steps**
  * **Phase 1 -- Current focus:**
    1. Capture findings in the portfolio summary or README as evidence of EDA and interpretation skills.
    2. Highlight how exploratory analysis leads to insights.
    3. Preserve outputs (reports/tables/..., reports/figures/...) for future modeling work.
  * **Phase 2 -- Future analysis:**
    1. Build baseline models (e.g., Logistic Regression, Random Forest) using the saved data artifacts to quantify predictive power.
    2. Assess multicollinearity and feature redundancy.
    3. Explore potential interaction effects among *Glucose*, *BMI*, *Insulin*, and *Age* to validate relationships suggested by the correlations.
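
For the multicollinearity check listed under Phase 2, one common starting point is the variance inflation factor (VIF), which equals the diagonal of the inverse Pearson correlation matrix of the predictors. The sketch below is one possible approach on synthetic data, not a result from this notebook; `vif_from_corr` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

def vif_from_corr(features: pd.DataFrame) -> pd.Series:
    """Variance inflation factors: the diagonal of the inverse correlation matrix."""
    complete = features.dropna()  # use complete rows so the matrix is consistent
    corr = complete.corr(method="pearson").to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="VIF")

# Synthetic data (hypothetical): `b` is nearly a copy of `a`, `c` is independent.
rng = np.random.default_rng(3)
a = rng.normal(size=400)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.3, size=400),
    "c": rng.normal(size=400),
})

vif = vif_from_corr(demo)
print(vif.round(2))
```

A VIF near 1 indicates little redundancy; values above roughly 5 to 10 are often treated as a sign that a feature is largely explained by the others. In the notebook this would be applied to the feature columns only, with `Outcome` left out.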

## Final Summary
This analysis establishes a clear, data-driven foundation for understanding which physiological factors are most strongly associated with the diabetes `Outcome`, setting the stage for future predictive modeling and feature-importance exploration.


## Environment info (for reproducibility)

In [None]:
import sys, matplotlib as mpl
print(f"Python:    {sys.version.split()[0]}")
print(f"pandas:    {pd.__version__}")
print(f"numpy:     {np.__version__}")
print(f"matplotlib:{mpl.__version__}")
