# EDA – Correlations (Diabetes Capstone)

**Goal:** understand pairwise relationships among features and their association with `Outcome`.

This notebook:
1) Loads & cleans data (Pima adjustments),
2) Summarizes missingness and basic stats,
3) Computes Pearson/Spearman/Kendall correlations,
4) Produces a JMP-style scatterplot matrix and saves artifacts,
5) Notes takeaways for modeling.


In [None]:
# Setup
from pathlib import Path
import numpy as np
import pandas as pd

# auto-reload (handy while you edit your ds_viz package)
%load_ext autoreload
%autoreload 2

DATA_PATH = Path("data/diabetes.csv")  # adjust if needed
TARGET    = "Outcome"

OUT_DIR_FIG = Path("reports/figures"); OUT_DIR_FIG.mkdir(parents=True, exist_ok=True)
OUT_DIR_TAB = Path("reports/tables");  OUT_DIR_TAB.mkdir(parents=True, exist_ok=True)

pd.set_option("display.max_columns", 100)


In [None]:
# Import reusable plotting package (installed with `pip install -e .`)
# Fallback: if not installed, try adding ../src to path (repo layout)
try:
    from ds_viz import scatter_matrix_with_corr, quick_corr_plot
except ImportError:
    # Not installed: fall back to ../src (assumes the notebook sits one level below the repo root)
    import sys, pathlib
    sys.path.append(str(pathlib.Path.cwd().parent / "src"))
    try:
        from ds_viz import scatter_matrix_with_corr, quick_corr_plot
    except ImportError as err:
        raise ImportError(
            "Couldn't import ds_viz. Make sure you've run `pip install -e .` "
            "from the repo root, and that src/ds_viz exists."
        ) from err


## Load data

In [None]:
assert DATA_PATH.exists(), f"Couldn't find {DATA_PATH}. Put your CSV at that path or update DATA_PATH."
df = pd.read_csv(DATA_PATH)
df.head()


## Quick structure & summary

In [None]:
df.shape, df.dtypes.to_frame("dtype")


In [None]:
df.describe(include="all").T


## Pima Indians “0 = missing” clean-up
In the classic dataset, zeros in these columns are invalid and should be treated as missing.

In [None]:
zero_as_na = ["Glucose","BloodPressure","SkinThickness","Insulin","BMI"]
df = df.copy()
df[zero_as_na] = df[zero_as_na].replace(0, np.nan)

missing = df.isna().mean().sort_values(ascending=False).to_frame("missing_frac")
missing
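
As a small illustration of why the replacement is restricted to the listed columns, the sketch below uses made-up rows (not values from the real CSV). It assumes, as in the cell above, that a zero means "not recorded" only for the measurement columns, while a zero in `Pregnancies` or `Outcome` is a legitimate value.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values, not taken from the real CSV).
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5, 1],               # 0 is a legitimate count
    "Glucose":     [148, 0, 183, 89],          # 0 is physiologically implausible
    "BMI":         [33.6, 26.6, 0.0, 28.1],    # 0 is physiologically implausible
    "Outcome":     [1, 0, 1, 0],               # 0 is a valid class label
})

cols = ["Glucose", "BMI"]                      # only columns where 0 means "missing"
clean = toy.copy()
clean[cols] = clean[cols].replace(0, np.nan)

# Fraction of missing values per column after the replacement.
missing_frac = clean.isna().mean()
print(missing_frac)
```

Replacing zeros across the whole frame would also blank out valid `Pregnancies` and `Outcome` values, which is why the column list is explicit.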


## (Optional) Simple median imputation for visualization
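The EDA visuals can work with NaNs directly (the plotting function handles them).  
If you prefer a “filled” copy for some charts, use the median imputation below. Keep using `df` for the correlation tables: filling gaps with a constant shrinks each column's spread and can bias the coefficients.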

In [None]:
df_eda = df.copy()
num_cols = df_eda.select_dtypes(include=np.number).columns
df_eda[num_cols] = df_eda[num_cols].fillna(df_eda[num_cols].median())

df_eda.head()
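
To see what median filling can do to a correlation, here is a minimal sketch on made-up data in which `y` is exactly `2 * x` wherever it is observed. The column names and values are illustrative assumptions, not part of the diabetes data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, 4.0, np.nan, 8.0, np.nan, 12.0],   # y = 2x where observed
})

# median() skips NaNs, so the gaps in y are filled with the observed median.
filled = toy.fillna(toy.median())

r_observed = toy.corr().loc["x", "y"]      # uses only rows where both are present
r_filled = filled.corr().loc["x", "y"]     # uses all rows, including filled ones
sd_observed = toy["y"].std()
sd_filled = filled["y"].std()

print(f"Pearson r: observed={r_observed:.3f}, filled={r_filled:.3f}")
print(f"std of y:  observed={sd_observed:.3f}, filled={sd_filled:.3f}")
```

In this toy case the filled copy shows both a weaker correlation and a smaller standard deviation than the observed rows, which is the reason to treat the imputed frame as a convenience for charts rather than a basis for the correlation tables.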


## Rank features by correlation with `Outcome`

In [None]:
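for m in ["pearson", "spearman", "kendall"]:
    # corr() skips NaNs pairwise; drop the target's trivial self-correlation of 1.0
    s = df.corr(method=m)[TARGET].drop(TARGET).abs().sort_values(ascending=False)
    print(f"\nFeatures ranked by |{m}| correlation with {TARGET}:")
    display(s.to_frame(f"abs_{m}"))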
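
The three methods can disagree when a relationship is monotone but not linear. The sketch below, on made-up data, shows Pearson falling short of 1 while Spearman reaches it, and checks that Spearman matches Pearson computed on ranks. It also counts jointly observed rows per column pair, since `corr()` excludes NaNs pair by pair and each coefficient may rest on a different sample size. All names here (`toy`, `gappy`, `n_pairs`) are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up monotone but nonlinear data: y grows as the cube of x.
toy = pd.DataFrame({"x": np.arange(1, 11, dtype=float)})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
# Spearman is Pearson computed on the ranks of each column.
rank_pearson = toy.rank().corr(method="pearson").loc["x", "y"]
print(f"pearson={pearson:.3f} spearman={spearman:.3f} rank-pearson={rank_pearson:.3f}")

# With NaNs present, each coefficient uses only the rows where both
# columns are observed, so the effective n can differ from pair to pair.
gappy = toy.copy()
gappy.loc[[0, 1, 2], "y"] = np.nan
observed = gappy.notna().astype(int)
n_pairs = observed.T @ observed        # counts of jointly observed rows
print(n_pairs)
```

Running the same count on `df` is a quick way to see which coefficients in the ranking above rest on far fewer rows than the rest.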


## JMP-style Scatterplot Matrix (Spearman, target-based)
Saves the correlation table as a CSV in `reports/tables/` and the figure as a PNG in `reports/figures/`.

In [None]:
corr, fig = scatter_matrix_with_corr(
    df,                                # you can use df_eda as well; both are fine for visuals
    method="spearman",
    select_strategy="target",
    target=TARGET,
    max_vars=8,                        # adjust as desired
    standardize=True,                  # comparable axes (z-scores)
    diag_hist_sharey=True,
    diag_hist_density=True,
    save_table=str(OUT_DIR_TAB / "diabetes_corr_spearman_Outcome.csv"),
    save_fig=str(OUT_DIR_FIG / "diabetes_scatter_matrix_spearman_Outcome.png")
)
print("Saved:", OUT_DIR_TAB / "diabetes_corr_spearman_Outcome.csv")
print("Saved:", OUT_DIR_FIG / "diabetes_scatter_matrix_spearman_Outcome.png")
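
For intuition about the `select_strategy="target"`, `max_vars`, and `standardize` arguments, the sketch below shows one plausible reading: keep the features with the largest absolute correlation against the target, then z-score the columns. This is an assumption about the idea, not a description of how `ds_viz` implements it; `top_k_by_target` is a hypothetical helper and the data is made up.

```python
import numpy as np
import pandas as pd

def top_k_by_target(frame, target, k, method="spearman"):
    """Hypothetical helper: the k columns with the largest |corr| against target."""
    strength = frame.corr(method=method)[target].drop(target).abs()
    return strength.sort_values(ascending=False).head(k).index.tolist()

# Small made-up frame with a strong, a moderate, and a weak rank association.
toy = pd.DataFrame({
    "target": [1, 2, 3, 4, 5, 6, 7, 8],
    "strong": [1, 4, 9, 16, 25, 36, 49, 64],   # monotone in target
    "medium": [2, 1, 4, 3, 6, 5, 8, 7],        # neighbours swapped
    "weak":   [1, 8, 2, 7, 3, 6, 4, 5],        # mostly scrambled
})

picked = top_k_by_target(toy, "target", k=2)
print(picked)

# Z-scoring rescales each column but preserves order, so the
# Spearman matrix is the same before and after standardizing.
z = (toy - toy.mean()) / toy.std()
same = np.allclose(z.corr(method="spearman"), toy.corr(method="spearman"))
print(same)
```

Because standardizing leaves the coefficients unchanged, it only affects how the scatter panels are scaled, not which variables a correlation-based ranking would select.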


### (Optional) One-liner wrapper instead of the cell above
Uncomment to run the compact version. Produces method/target-coded filenames automatically.

In [None]:
# quick_corr_plot(str(DATA_PATH), target=TARGET, max_vars=8, method="spearman", standardize=True)


## (Optional) Also generate Pearson & Kendall artifacts (no display)

In [None]:
_ = scatter_matrix_with_corr(
    df, method="pearson",
    select_strategy="target", target=TARGET, max_vars=8,
    standardize=True, show=False,
    save_table=str(OUT_DIR_TAB / "diabetes_corr_pearson_Outcome.csv"),
    save_fig=str(OUT_DIR_FIG / "diabetes_scatter_matrix_pearson_Outcome.png")
)

_ = scatter_matrix_with_corr(
    df, method="kendall",
    select_strategy="target", target=TARGET, max_vars=8,
    standardize=True, show=False,
    save_table=str(OUT_DIR_TAB / "diabetes_corr_kendall_Outcome.csv"),
    save_fig=str(OUT_DIR_FIG / "diabetes_scatter_matrix_kendall_Outcome.png")
)

print("Also saved Pearson & Kendall artifacts.")


## Peek at the saved Spearman correlation table

In [None]:
corr_path = OUT_DIR_TAB / "diabetes_corr_spearman_Outcome.csv"
if corr_path.exists():
    corr_df = pd.read_csv(corr_path, index_col=0)
    display(corr_df)  # a bare expression inside an if-block is not rendered
else:
    print("Spearman corr table not found (skip if you used the one-liner instead).")


## Notes & takeaways (fill me in)

- Highest |Spearman ρ| with `Outcome`: (inspect the table above)
- Direction & strength: interpret top 3–5 features.
- Data caveats: Pima zeros treated as missing; correlations are associations, not causation.
- Next steps: try logistic regression / tree-based models; check multicollinearity; consider interactions.
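
As a starting point for the multicollinearity check mentioned in the next steps, the sketch below lists feature pairs whose absolute correlation exceeds a threshold. `high_corr_pairs`, the 0.8 cutoff, and the toy columns are illustrative assumptions; on the real data you would pass `df.drop(columns=TARGET)`.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8, method="spearman"):
    """Hypothetical helper: feature pairs with |corr| above threshold."""
    corr = frame.corr(method=method).abs()
    # Keep the strict upper triangle so each pair appears once
    # and the diagonal of self-correlations is ignored.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # The comparison also filters out any NaN entries left by the mask.
    return pairs[pairs > threshold].sort_values(ascending=False)

# Made-up features: b is a monotone function of a, c is mostly unrelated.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [1, 4, 9, 16, 25, 36, 49, 64],
    "c": [1, 8, 2, 7, 3, 6, 4, 5],
})

flagged = high_corr_pairs(toy, threshold=0.8)
print(flagged)
```

Pairs flagged this way are candidates for dropping, combining, or regularizing before fitting a logistic regression; tree-based models are generally less sensitive to them.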


## Environment info (for reproducibility)

In [None]:
import sys, matplotlib as mpl
print(f"Python:    {sys.version.split()[0]}")
print(f"pandas:    {pd.__version__}")
print(f"numpy:     {np.__version__}")
print(f"matplotlib:{mpl.__version__}")
