# 📈 4.3 Correlation Analysis — From Basics to Better Practice

Correlation measures the **strength and direction** of association between variables. In nutrition research, it can help you see whether higher vitamin D levels are associated with outcomes, time-on-study, body weight, etc.

---
## 🎯 Objectives
- Compute **Pearson** (linear), **Spearman** (rank/monotonic), and (optionally) **Kendall** correlations.
- Compare correlations **with and without log transforms** for skewed data.
- Visualise correlation structures with **scatter/regression**, **pair plots**, and **heatmaps**.
- Estimate **confidence intervals** and **p-values**; run a simple **permutation test**.
- Understand **caveats**: outliers, nonlinearity, confounding, and multiple comparisons.

<details><summary>What correlation is (and isn’t)</summary>
Correlation quantifies association, not causation. A strong correlation might be driven by a confounder or a few outliers. Always pair correlation with plots and domain knowledge.
</details>
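To make the outlier point concrete, here is a minimal sketch on synthetic data (not the trial dataset; the seed, sample size, and outlier position are arbitrary assumptions). It shows how a single extreme point can create a sizeable Pearson correlation between two variables that are otherwise unrelated.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

# 30 points with no real relationship between x and y (synthetic)
x = rng.normal(size=30)
y = rng.normal(size=30)
r_clean = pearsonr(x, y)[0]

# Add a single extreme point far away from the cloud
x_out = np.append(x, 10.0)
y_out = np.append(y, 10.0)
r_outlier = pearsonr(x_out, y_out)[0]
rho_outlier = spearmanr(x_out, y_out)[0]

print(f"Pearson without outlier: {r_clean:.2f}")
print(f"Pearson with outlier:    {r_outlier:.2f}")
print(f"Spearman with outlier:   {rho_outlier:.2f}")
```

The rank-based Spearman coefficient moves far less than Pearson here, which is one reason to report both and to look at the scatter plot before interpreting either.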

In [None]:
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

MODULE = '04_data_analysis'
DATASET = 'vitamin_trial.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    print('Attempting to clone repository...')
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    print('Setting working directory...')
    os.chdir(MODULE_PATH)
    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} ✅')
    else:
        raise FileNotFoundError('Dataset missing after clone.')
except Exception as e:
    print(f'Cloning failed: {e}')
    print('Falling back to manual upload...')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please upload {DATASET}.')

In [None]:
%pip install -q pandas numpy seaborn matplotlib scipy statsmodels

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr, kendalltau, pointbiserialr
import statsmodels.api as sm

pd.set_option('display.max_columns', 60)
sns.set_theme()
print('Correlation environment ready.')

## 1) Load & Inspect
We’ll use `vitamin_trial.csv`. Typical columns might include `ID`, `Group` (Control/Treatment), `Vitamin_D` (µg), `Time` (weeks), `Outcome` (categorical).

In [None]:
df = pd.read_csv('data/vitamin_trial.csv')
print('Shape:', df.shape)
print('\nDtypes:')
print(df.dtypes)
display(df.head())

# Identify numeric columns we can correlate directly
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print('\nNumeric columns:', num_cols)

## 2) Pearson vs Spearman (and Kendall)
- **Pearson r**: measures **linear** association; sensitive to outliers.
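- **Spearman ρ**: measures **monotonic** (rank-based) association; less sensitive to outliers and able to capture monotonic but nonlinear relationships.
- **Kendall τ**: another rank-based measure, based on concordant vs discordant pairs; its values are usually smaller in magnitude than Spearman ρ for the same data.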

<details><summary>Which to use?</summary>
- Use Pearson for approximately linear relationships with near-normal noise.
- Use Spearman when the relationship is monotonic but not linear, or when outliers / skew exist.
- Use Kendall for small samples or as a robustness check.
</details>
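As an illustration of the "monotonic but not linear" case, the sketch below uses a synthetic exponential curve (an assumption chosen for clarity, not data from the trial). The rank-based measures see a perfect monotonic relationship, while Pearson is pulled down by the curvature.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Strictly increasing but strongly curved relationship (synthetic)
x = np.linspace(0, 5, 50)
y = np.exp(x)

r = pearsonr(x, y)[0]
rho = spearmanr(x, y)[0]
tau = kendalltau(x, y)[0]

print(f"Pearson r  = {r:.3f}")    # noticeably below 1 because the curve is not a straight line
print(f"Spearman ρ = {rho:.3f}")  # 1.000: the ranks agree perfectly
print(f"Kendall τ  = {tau:.3f}")  # 1.000: every pair is concordant
```

A gap like this between Pearson and the rank-based coefficients is a hint to check the scatter plot for curvature or to consider a transformation.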

In [None]:
# Example pair: Vitamin_D vs Time (if present)
if {'Vitamin_D','Time'}.issubset(df.columns):
    x = df['Vitamin_D'].astype(float)
    y = df['Time'].astype(float)
    mask = x.notna() & y.notna()
    x, y = x[mask], y[mask]

    pr, ppr = pearsonr(x, y)
    sr, psr = spearmanr(x, y)
    kr, pkr = kendalltau(x, y)

    print(f'Pearson r = {pr:.3f} (p={ppr:.2e})')
    print(f'Spearman ρ = {sr:.3f} (p={psr:.2e})')
    print(f'Kendall τ = {kr:.3f} (p={pkr:.2e})')
else:
    print('No numeric pair (Vitamin_D, Time) found for demonstration.')

## 3) Correlation with and without Log Transform
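Right-skewed variables (like intakes) often benefit from log transformation. Compare **raw** vs **log** correlations to see whether scale influences conclusions. Expect only Pearson r to change: Spearman ρ depends on ranks alone, and a strictly increasing transform such as the log leaves the ranks untouched.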

<details><summary>Note</summary>
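If zeros are possible, use `np.log1p(x)` or add a tiny epsilon: `np.log(x + 1e-6)`. Be aware that a tiny epsilon maps zeros to very negative values (`np.log(1e-6)` ≈ -13.8), which can behave like outliers for Pearson r.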
</details>
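The following sketch compares the two zero-handling options on synthetic, right-skewed data with a few exact zeros (the distribution, seed, and number of zeros are assumptions for illustration only). It also checks that Spearman is unaffected by either transform.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(42)

# Synthetic right-skewed "intake" with a few exact zeros (illustrative only)
x = rng.lognormal(mean=2.0, sigma=0.8, size=200)
x[:5] = 0.0
y = np.log1p(x) + rng.normal(scale=0.5, size=200)

x_log1p = np.log1p(x)       # zeros map to 0
x_eps = np.log(x + 1e-6)    # zeros map to about -13.8

print("min of log1p(x):    ", x_log1p.min())
print("min of log(x + eps):", x_eps.min())

for label, xv in [("raw", x), ("log1p", x_log1p), ("log+eps", x_eps)]:
    r = pearsonr(xv, y)[0]
    rho = spearmanr(xv, y)[0]
    print(f"{label:8s} Pearson r={r:.3f}  Spearman rho={rho:.3f}")
```

If the two log variants give clearly different Pearson values, the zeros are driving the result and deserve a closer look.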

In [None]:
if {'Vitamin_D','Time'}.issubset(df.columns):
    eps = 1e-6
    x_raw = df['Vitamin_D'].astype(float)
    x_log = np.log(df['Vitamin_D'].astype(float) + eps)
    y = df['Time'].astype(float)
    mask = x_raw.notna() & x_log.notna() & y.notna()

    pr_raw, p_raw = pearsonr(x_raw[mask], y[mask])
    pr_log, p_log = pearsonr(x_log[mask], y[mask])
    sr_raw, ps_raw = spearmanr(x_raw[mask], y[mask])
    sr_log, ps_log = spearmanr(x_log[mask], y[mask])

    print('Pearson raw vs log:')
    print(f'  raw: r={pr_raw:.3f}, p={p_raw:.2e}     log: r={pr_log:.3f}, p={p_log:.2e}')
    print('Spearman raw vs log:')
    print(f'  raw: ρ={sr_raw:.3f}, p={ps_raw:.2e}     log: ρ={sr_log:.3f}, p={ps_log:.2e}')

    # Visual comparison
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.regplot(x=y[mask], y=x_raw[mask], ax=axes[0])
    axes[0].set_title('Time vs Vitamin_D (raw)')
    axes[0].set_xlabel('Time'); axes[0].set_ylabel('Vitamin_D (µg)')
    sns.regplot(x=y[mask], y=x_log[mask], ax=axes[1])
    axes[1].set_title('Time vs log(Vitamin_D)')
    axes[1].set_xlabel('Time'); axes[1].set_ylabel('log Vitamin_D')
    plt.tight_layout(); plt.show()
else:
    print('No numeric pair (Vitamin_D, Time) found for demonstration.')

## 4) Correlation Matrices & Heatmaps
For many variables, examine the overall structure with heatmaps. Compare **Pearson** and **Spearman**.

<details><summary>Tip</summary>
If you have many variables, consider **clustering** the correlation matrix (advanced) or focusing on a subset (e.g., biomarkers only).
</details>
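One possible way to cluster a correlation matrix is sketched below, using hierarchical clustering from SciPy on synthetic data. The variable names (`a1`, `b1`, ...), the `1 - |r|` distance, and average linkage are all assumptions made for this example; other distances and linkage methods are equally defensible.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

rng = np.random.default_rng(3)

# Synthetic data: two blocks of related variables, deliberately interleaved
n = 300
f1, f2 = rng.normal(size=n), rng.normal(size=n)
demo = pd.DataFrame({
    "a1": f1 + 0.3 * rng.normal(size=n),
    "b1": f2 + 0.3 * rng.normal(size=n),
    "a2": f1 + 0.3 * rng.normal(size=n),
    "b2": f2 + 0.3 * rng.normal(size=n),
})

corr = demo.corr(method="spearman")

# Turn correlation into a distance: strongly (anti)correlated variables are "close"
dist = 1.0 - corr.abs().to_numpy()
np.fill_diagonal(dist, 0.0)
condensed = squareform(dist, checks=False)

# Hierarchical clustering gives a leaf order that groups similar variables
order = leaves_list(linkage(condensed, method="average"))
corr_sorted = corr.iloc[order, order]
print(corr_sorted.round(2))
```

Passing `corr_sorted` to `sns.heatmap` instead of the unsorted matrix places related variables next to each other, so blocks of association are easier to spot.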

In [None]:
num = df.select_dtypes(include=[np.number]).copy()
if num.shape[1] >= 2:
    corr_pear = num.corr(method='pearson')
    corr_spea = num.corr(method='spearman')

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.heatmap(corr_pear, annot=True, fmt='.2f', vmin=-1, vmax=1, cmap='coolwarm', ax=axes[0])
    axes[0].set_title('Pearson correlation')
    sns.heatmap(corr_spea, annot=True, fmt='.2f', vmin=-1, vmax=1, cmap='coolwarm', ax=axes[1])
    axes[1].set_title('Spearman correlation')
    plt.tight_layout(); plt.show()
else:
    print('Not enough numeric variables for heatmaps.')

## 5) Correlation with Binary Outcomes (Point-Biserial)
If you have a **binary** outcome (e.g., `Outcome` with two levels like *Normal* vs *Low*), use the **point-biserial** correlation between a numeric predictor and a binary-coded outcome.

<details><summary>How to code</summary>
Map categories to 0/1 (e.g., `{'Normal': 0, 'Low': 1}`) and apply `scipy.stats.pointbiserialr`.
</details>
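A small sketch of the coding step on synthetic data follows (the labels match the example mapping above; the effect size and seed are assumptions). It also illustrates two properties worth knowing: the point-biserial coefficient is numerically Pearson's r on the 0/1 coding, and swapping which label is coded as 1 only flips the sign.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, pointbiserialr

rng = np.random.default_rng(7)

# Synthetic example: 'Low' outcomes tend to have lower vitamin D (illustrative only)
outcome = pd.Series(rng.choice(["Normal", "Low"], size=120))
vit_d = np.where(outcome == "Low", 8.0, 12.0) + rng.normal(scale=3.0, size=120)

# Explicit mapping makes the sign of the result interpretable
y_bin = outcome.map({"Normal": 0, "Low": 1}).to_numpy()

r_pb, p_pb = pointbiserialr(y_bin, vit_d)
r_pe, p_pe = pearsonr(y_bin, vit_d)
r_flipped = pointbiserialr(1 - y_bin, vit_d)[0]

print(f"Point-biserial r = {r_pb:.3f}")
print(f"Pearson on 0/1   = {r_pe:.3f}")
print(f"Flipped coding   = {r_flipped:.3f}")
```

Because the sign depends entirely on the coding, it helps to state the mapping explicitly rather than rely on the order in which labels happen to appear in the data.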

In [None]:
if 'Outcome' in df.columns and 'Vitamin_D' in df.columns:
    # Check if exactly two unique non-null labels exist
    cats = df['Outcome'].dropna().unique()
    if len(cats) == 2:
        mapping = {cats[0]: 0, cats[1]: 1}
        y_bin = df['Outcome'].map(mapping)
        x = df['Vitamin_D']
        mask = x.notna() & y_bin.notna()
        r_pb, p_pb = pointbiserialr(y_bin[mask].astype(int), x[mask].astype(float))
        print(f'Point-biserial r = {r_pb:.3f} (p={p_pb:.2e})   [coding: {mapping}]')
    else:
        print('Outcome is not binary (or has missing labels); skipping point-biserial.')
else:
    print('No Outcome/Vitamin_D found for point-biserial demo.')

## 6) Uncertainty: Confidence Intervals & Permutation Test
A p-value tells you how extreme the observed correlation would be *if the true correlation were zero*, whereas a confidence interval (CI) conveys the **precision of the estimate**. Permutation tests give a non-parametric check on the p-value.

<details><summary>Permutation idea</summary>
Shuffle `y` many times, recompute correlation, and see where the observed value sits in this null distribution.
</details>

In [None]:
rng = np.random.default_rng(1)

def bootstrap_ci_corr(x, y, method='pearson', B=2000):
    """Percentile bootstrap CI: resample (x, y) pairs with replacement."""
    x = np.asarray(x, float); y = np.asarray(y, float)
    mask = np.isfinite(x) & np.isfinite(y)
    x, y = x[mask], y[mask]
    if len(x) < 3:
        return np.nan, (np.nan, np.nan)
    corr_fn = {'pearson': pearsonr, 'spearman': spearmanr}[method]
    r_obs = corr_fn(x, y)[0]
    rs = np.empty(B)
    n = len(x)
    for b in range(B):
        idx = rng.integers(0, n, n)
        rs[b] = corr_fn(x[idx], y[idx])[0]
    lo, hi = np.percentile(rs, [2.5, 97.5])
    return r_obs, (lo, hi)

def permutation_pvalue(x, y, method='pearson', B=5000):
    """Two-sided permutation p-value: shuffle y to break any association."""
    x = np.asarray(x, float); y = np.asarray(y, float)
    mask = np.isfinite(x) & np.isfinite(y)
    x, y = x[mask], y[mask]
    corr_fn = {'pearson': pearsonr, 'spearman': spearmanr}[method]
    r_obs = corr_fn(x, y)[0]
    cnt = 0
    for _ in range(B):
        y_perm = rng.permutation(y)
        r_perm = corr_fn(x, y_perm)[0]
        if abs(r_perm) >= abs(r_obs):
            cnt += 1
    # Add-one correction so the estimated p-value is never exactly 0
    return r_obs, (cnt + 1) / (B + 1)

if {'Vitamin_D','Time'}.issubset(df.columns):
    x = df['Vitamin_D']; y = df['Time']
    r_p, (lo_p, hi_p) = bootstrap_ci_corr(x, y, method='pearson', B=1000)
    r_s, (lo_s, hi_s) = bootstrap_ci_corr(x, y, method='spearman', B=1000)
    print(f'Bootstrap 95% CI (Pearson): r={r_p:.3f}, CI=({lo_p:.3f}, {hi_p:.3f})')
    print(f'Bootstrap 95% CI (Spearman): ρ={r_s:.3f}, CI=({lo_s:.3f}, {hi_s:.3f})')
    r_obs, p_perm = permutation_pvalue(x, y, method='pearson', B=2000)
    print(f'Permutation test (Pearson): r_obs={r_obs:.3f}, p≈{p_perm:.4f}')
else:
    print('Skipping CI/permutation: variables not found.')
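As a parametric cross-check on a bootstrap interval, an approximate CI for Pearson r can be obtained from Fisher's z-transform. This is a different technique from the bootstrap used in this section, and it assumes roughly bivariate-normal data; `fisher_ci` is a hypothetical helper name, and the data below are synthetic.

```python
import numpy as np
from scipy.stats import norm, pearsonr

def fisher_ci(r, n, alpha=0.05):
    """Approximate CI for Pearson r via Fisher's z-transform (assumes bivariate normality)."""
    z = np.arctanh(r)               # transform r to the z scale
    se = 1.0 / np.sqrt(n - 3)       # approximate standard error on the z scale
    zcrit = norm.ppf(1 - alpha / 2)
    # Back-transform the interval endpoints to the r scale
    return np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)

# Synthetic illustration
rng = np.random.default_rng(11)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

r = pearsonr(x, y)[0]
lo, hi = fisher_ci(r, len(x))
print(f"r = {r:.3f}, Fisher 95% CI = ({lo:.3f}, {hi:.3f})")
```

If the Fisher and bootstrap intervals disagree noticeably, that is often a sign of skew or outliers, in which case the bootstrap (or a rank-based measure) is usually the safer summary.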

## 7) Pair Plots for a Quick Multivariate Glance
Pair plots show bivariate scatter plots and univariate histograms/KDEs for a set of numeric variables. Use a sample if the dataset is large.

In [None]:
num = df.select_dtypes(include=[np.number]).copy()
if 'Vitamin_D' in num.columns:
    num['Vitamin_D_log'] = np.log(num['Vitamin_D'] + 1e-6)

if num.shape[1] >= 2:
    sns.pairplot(num.sample(min(400, len(num)), random_state=2))
    plt.suptitle('Pair plot (sampled)', y=1.02); plt.show()
else:
    print('Not enough numeric variables for a pair plot.')

## 8) Caveats & Good Practice
- **Outliers** can dominate Pearson r. Always **plot** first; consider Spearman.
- **Nonlinearity** (e.g., U-shaped relationships) can make Pearson r ≈ 0 despite a strong association; inspect scatter plots.
- **Confounding**: a third variable can induce correlation. Consider **partial correlation** (advanced) or stratified analysis.
- **Multiple testing**: testing many pairs inflates false positives, so control the false discovery rate (FDR) or pre-register hypotheses.
- **Missing data**: dropping rows changes sample size; check patterns of missingness.
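For the multiple-testing caveat, one common option is the Benjamini-Hochberg procedure, available as `multipletests(..., method="fdr_bh")` in statsmodels. The sketch below applies it to all pairwise Pearson tests on synthetic data (ten noise variables plus one genuinely related pair; the variable names and sizes are assumptions for illustration).

```python
import numpy as np
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(5)

# Synthetic data: 10 unrelated noise variables plus one variable truly related to v0
n = 80
data = {f"v{i}": rng.normal(size=n) for i in range(10)}
data["signal"] = data["v0"] + 0.5 * rng.normal(size=n)
names = list(data)

# Test every pair of variables
pairs, pvals = [], []
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        pairs.append((names[i], names[j]))
        pvals.append(pearsonr(data[names[i]], data[names[j]])[1])

# Benjamini-Hochberg adjustment controls the false discovery rate
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print(f"{len(pairs)} pairs tested")
print("raw p < 0.05:        ", int(np.sum(np.array(pvals) < 0.05)))
print("significant after BH:", int(reject.sum()))
for (a, b), p, keep in zip(pairs, p_adj, reject):
    if keep:
        print(f"  {a} ~ {b}: adjusted p = {p:.2e}")
```

With dozens of pairs, a few raw p-values below 0.05 are expected by chance alone; the adjusted p-values are the ones to report.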

## 🧪 Exercises
1) **Raw vs Log**  
   - Compute Pearson and Spearman correlations of `Vitamin_D` with `Time` **before and after** log transform.  
   - Which looks more linear in scatter plots?

2) **Heatmaps**  
   - Build Pearson and Spearman heatmaps for all numeric variables.  
   - Identify the top 3 strongest associations and hypothesise why.

3) **Binary Outcome**  
   - If `Outcome` is binary, compute the point-biserial correlation with `Vitamin_D` and interpret the sign.

4) **Permutation**  
   - Run the permutation test for `Vitamin_D` vs `Time` and compare the p-value to `pearsonr()`.

5) **Partial correlation (optional)**  
   - Regress `Vitamin_D` and `Time` each on `Group` (as dummies), then correlate residuals (a rough partial correlation controlling for `Group`).

## ✅ Conclusion
You computed multiple correlation measures, compared raw vs log scales, visualised with scatter/heatmaps, and assessed uncertainty. These steps make your claims about relationships **more trustworthy**.

👉 Next: **4.4 Statistical Testing** — formal comparisons of groups and effects.

<details><summary>Further reading</summary>
- SciPy stats (correlation): <https://docs.scipy.org/doc/scipy/reference/stats.html>
- Seaborn heatmaps: <https://seaborn.pydata.org/generated/seaborn.heatmap.html>
- Statsmodels graphics (Q–Q): <https://www.statsmodels.org/stable/graphics.html>
</details>