# 📊 4.2 Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is where we *look* before we *leap*. We’ll use `vitamin_trial.csv` to summarise, visualise, and question our data so later modelling steps are grounded in reality.

---
## 🎯 Objectives
- Summarise data with descriptive statistics and tidy tables.
- Visualise distributions (histograms, KDE/density, ECDF, box/violin).
- Explore relationships (scatter/regression, pair plots, correlations).
- Compare **raw vs log** views when data are skewed.
- Apply EDA to `vitamin_trial.csv` and write brief, evidence-based observations.

<details><summary>Why EDA matters</summary>
EDA reveals data quality issues (missingness, outliers, coding quirks), structure (groups, clusters), and plausible transformations. It informs the rest of your analysis pipeline.
</details>

In [None]:
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

MODULE = '04_data_analysis'
DATASET = 'vitamin_trial.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    print('Attempting to clone repository...')
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    print('Setting working directory...')
    os.chdir(MODULE_PATH)
    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} ✅')
    else:
        raise FileNotFoundError('Dataset missing after clone.')
except Exception as e:
    print(f'Cloning failed: {e}')
    print('Falling back to manual upload...')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please upload {DATASET}.')

In [None]:
%pip install -q pandas numpy matplotlib seaborn statsmodels scipy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import shapiro, pearsonr, spearmanr

pd.set_option('display.max_columns', 60)
sns.set_theme()
print('EDA environment ready.')

## 1) Load & Quick Inspect
We start with a **structural** and **statistical** glance at the dataset.

In [None]:
df = pd.read_csv('data/vitamin_trial.csv')
print('Shape:', df.shape)
print('\nColumn dtypes:')
print(df.dtypes)

print('\nPeek at the first rows:')
display(df.head())

print('\nSummary (numeric):')
display(df.describe())

print('\nSummary (including categorical):')
display(df.describe(include='all'))

## 2) Missingness & Basic Data Quality
Before plotting, ensure we know where the gaps are. If missingness clusters by group/time, it may bias results.
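
One way to check whether missingness clusters by group is to cross-tabulate a missing/observed indicator against the grouping column and run a chi-square test. The sketch below uses a small synthetic table, not `vitamin_trial.csv`; the `demo` frame, its group sizes, and the number of knocked-out values are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in for the trial data (NOT vitamin_trial.csv)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Group': np.repeat(['Control', 'Treatment'], 100),
    'Vitamin_D': rng.normal(10, 2, size=200),
})
# Knock out 5 Control values and 30 Treatment values so missingness differs by group
demo.loc[demo.index[:5], 'Vitamin_D'] = np.nan
demo.loc[demo.index[100:130], 'Vitamin_D'] = np.nan

# Cross-tabulate group against a missing/observed indicator
miss_table = pd.crosstab(demo['Group'], demo['Vitamin_D'].isna())
miss_table.columns = ['observed', 'missing']  # crosstab orders False before True
print(miss_table)

# Chi-square test of independence between group and missingness
chi2, p_value, dof, expected = chi2_contingency(miss_table)
print(f'chi2={chi2:.2f}, p={p_value:.3g}')
```

A small p-value here only says that missingness is unevenly spread across groups; it does not say why, so it is a prompt to investigate rather than a conclusion.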

In [None]:
print('Missing values per column:')
print(df.isna().sum().sort_values(ascending=False))

# Missingness by group (if columns exist)
if {'Group','Vitamin_D'}.issubset(df.columns):
    miss_rate = df.groupby('Group')['Vitamin_D'].apply(lambda s: s.isna().mean())
    print('\nProportion missing in Vitamin_D by Group:')
    display(miss_rate)

## 3) Distributions
We’ll examine **Vitamin_D** as a worked example. Look for skewness and heavy tails; a long right tail often suggests a **log transform**.

<details><summary>Distribution tools</summary>
- Histogram: frequency bars.
- KDE (density): smooth estimate of the distribution.
- ECDF: empirical cumulative distribution ← great for comparing the whole distribution.
</details>
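
Alongside the plots, skewness and excess kurtosis give a rough numeric handle on asymmetry and tail weight. The following is a minimal sketch on synthetic log-normal data (an assumption, chosen because it is right-skewed by construction); the variable names are illustrative and it does not use the trial dataset.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Synthetic right-skewed sample (log-normal), used only for illustration
rng = np.random.default_rng(42)
sample = rng.lognormal(mean=2.0, sigma=0.6, size=500)

skew_raw = skew(sample)
skew_log = skew(np.log(sample))
kurt_raw = kurtosis(sample)          # excess kurtosis: 0 for a normal distribution
kurt_log = kurtosis(np.log(sample))

print(f'raw: skew={skew_raw:.2f}, excess kurtosis={kurt_raw:.2f}')
print(f'log: skew={skew_log:.2f}, excess kurtosis={kurt_log:.2f}')
```

Skewness near 0 after the log transform is consistent with the transform having removed most of the asymmetry; the plots remain the primary evidence.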

In [None]:
x = df['Vitamin_D'].dropna()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Histogram
axes[0].hist(x, bins=20, edgecolor='black')
axes[0].set_title('Histogram: Vitamin_D')
axes[0].set_xlabel('Vitamin_D (µg)')

# KDE
sns.kdeplot(x=x, ax=axes[1])
axes[1].set_title('KDE / Density: Vitamin_D')
axes[1].set_xlabel('Vitamin_D (µg)')

# ECDF
xs = np.sort(x.values)
ys = np.arange(1, len(xs)+1) / len(xs)
axes[2].plot(xs, ys, marker='.', linestyle='none')
axes[2].set_title('ECDF: Vitamin_D')
axes[2].set_xlabel('Vitamin_D (µg)')
axes[2].set_ylabel('Proportion ≤ x')

plt.tight_layout(); plt.show()

### 3.1 Compare across groups
Use **box/violin** to compare distributions by `Group` (e.g., Control vs Treatment) and optionally `Outcome`.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.boxplot(data=df, x='Group', y='Vitamin_D', ax=axes[0])
axes[0].set_title('Boxplot by Group')
axes[0].set_ylabel('Vitamin_D (µg)')

sns.violinplot(data=df, x='Group', y='Vitamin_D', inner='quartile', ax=axes[1])
axes[1].set_title('Violin by Group (quartiles)')
axes[1].set_ylabel('Vitamin_D (µg)')
plt.tight_layout(); plt.show()

# Optional: stratify by Outcome if present
if 'Outcome' in df.columns:
    plt.figure(figsize=(8,4))
    sns.boxplot(data=df, x='Group', y='Vitamin_D', hue='Outcome')
    plt.title('Vitamin_D by Group × Outcome')
    plt.ylabel('Vitamin_D (µg)')
    plt.tight_layout(); plt.show()

## 4) Normality Checks (Raw vs Log)
Some models assume approximately normal residuals. As a first look, we **inspect** the variable itself with Q–Q plots and **test** it with Shapiro–Wilk. For right-skewed data, a log transform can help.

<details><summary>Notes</summary>
- Q–Q: points on line → close to normal.
- Shapiro–Wilk: p < 0.05 → reject normality (sensitive with large n).
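- If zeros exist, prefer `np.log1p`; adding a tiny ε (such as `eps = 1e-6` in the cell below) maps each zero to about −13.8 on the log scale, which creates an artificial extreme value.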
</details>
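
To see why the choice between `log1p` and a small ε matters when zeros are present, the sketch below applies both to a handful of hypothetical values (made up for illustration, not taken from the trial data).

```python
import numpy as np

# Hypothetical measurements that include a zero
values = np.array([0.0, 0.5, 2.0, 8.0, 20.0])

log_eps = np.log(values + 1e-6)  # the zero becomes log(1e-6), roughly -13.8
log_1p = np.log1p(values)        # log(1 + x): the zero stays at 0

print('log(x + eps):', np.round(log_eps, 2))
print('log1p(x):    ', np.round(log_1p, 2))
```

With ε the zero lands far from every other transformed value and would dominate a Q–Q plot; with `log1p` it stays on the same scale as the rest, at the cost of a slightly different transform for small values.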

In [None]:
y = df['Vitamin_D'].dropna()
eps = 1e-6
y_log = np.log(y + eps)

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sm.qqplot(y, line='s', ax=axes[0])
axes[0].set_title('Q–Q: raw Vitamin_D')
sm.qqplot(y_log, line='s', ax=axes[1])
axes[1].set_title('Q–Q: log Vitamin_D')
plt.tight_layout(); plt.show()

stat_raw, p_raw = shapiro(y)
stat_log, p_log = shapiro(y_log)
print(f'Shapiro–Wilk raw: stat={stat_raw:.3f}, p={p_raw:.3g}')
print(f'Shapiro–Wilk log:  stat={stat_log:.3f}, p={p_log:.3g}')

## 5) Relationships
Start with scatter plots and trend lines, then move to matrices and heatmaps for many variables.

In [None]:
if 'Time' in df.columns and pd.api.types.is_numeric_dtype(df['Time']):
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.regplot(data=df, x='Time', y='Vitamin_D', ax=axes[0])
    axes[0].set_title('Time vs Vitamin_D (raw)')
    axes[0].set_ylabel('Vitamin_D (µg)')

    # log(Vitamin_D) to reduce right-skew before fitting the trend line
    df_tmp = df.assign(Vitamin_D_log=np.log(df['Vitamin_D'] + 1e-6))
    sns.regplot(data=df_tmp, x='Time', y='Vitamin_D_log', ax=axes[1])
    axes[1].set_title('Time vs log(Vitamin_D)')
    axes[1].set_ylabel('log Vitamin_D')

    plt.tight_layout(); plt.show()
else:
    print('No numeric Time column found for scatter examples.')

### 5.1 Pair Plots & Correlation Heatmaps
For multiple numeric columns, pair plots and heatmaps give fast overviews.

<details><summary>Tip</summary>
Use **Spearman** for monotonic (rank-based) relationships and **Pearson** for linear relationships.
</details>
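
The difference between the two coefficients is easiest to see on a relationship that is strictly increasing but curved. This sketch uses synthetic data (an exponential curve, assumed for illustration) and the `pearsonr` / `spearmanr` functions imported earlier from `scipy.stats`.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic monotonic but non-linear relationship
x_demo = np.linspace(0, 5, 50)
y_demo = np.exp(x_demo)

r_pearson, _ = pearsonr(x_demo, y_demo)
r_spearman, _ = spearmanr(x_demo, y_demo)

print(f'Pearson  r   = {r_pearson:.3f}')
print(f'Spearman rho = {r_spearman:.3f}')
```

Spearman is 1 because the ranks agree perfectly, while Pearson is lower because the points do not fall on a straight line. A gap like this between the two heatmaps hints at curvature or at a few influential points.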

In [None]:
num = df.select_dtypes(include=[np.number]).copy()
if 'Vitamin_D' in num.columns:
    num['Vitamin_D_log'] = np.log(num['Vitamin_D'] + 1e-6)

if num.shape[1] >= 2:
    # Pair plot (can be slow with many rows; sample if needed)
    sns.pairplot(num.sample(min(300, len(num)), random_state=1))
    plt.suptitle('Pair Plot (sampled)', y=1.02)
    plt.show()

    # Correlation heatmaps
    corr_pear = num.corr(numeric_only=True, method='pearson')
    corr_spea = num.corr(numeric_only=True, method='spearman')

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.heatmap(corr_pear, annot=True, fmt='.2f', ax=axes[0])
    axes[0].set_title('Pearson correlation')
    sns.heatmap(corr_spea, annot=True, fmt='.2f', ax=axes[1])
    axes[1].set_title('Spearman correlation')
    plt.tight_layout(); plt.show()
else:
    print('Not enough numeric variables for pair plot / heatmap.')

## 6) Grouped Descriptives
Summarise by key factors (e.g., `Group`, `Outcome`, `Time`). This clarifies patterns before modelling.

<details><summary>Common recipes</summary>
- Mean/median/count per group.
- Within-group standard deviation (variability).
- Confidence intervals for group means (basic normal approximation).
</details>
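
The normal approximation (multiplier 1.96) is slightly too narrow for small groups. As a hedged alternative, the sketch below compares it with a t-based interval using `scipy.stats.t`; the eight values are hypothetical and the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import t

# Hypothetical small group of measurements
small_group = np.array([12.1, 9.8, 11.4, 10.6, 13.0, 9.5, 10.9, 11.7])
n_obs = small_group.size
mean_obs = small_group.mean()
se_obs = small_group.std(ddof=1) / np.sqrt(n_obs)

z_crit = 1.96
t_crit = t.ppf(0.975, df=n_obs - 1)  # two-sided 95% critical value, n - 1 degrees of freedom

ci_normal = (mean_obs - z_crit * se_obs, mean_obs + z_crit * se_obs)
ci_t = (mean_obs - t_crit * se_obs, mean_obs + t_crit * se_obs)

print(f't critical value: {t_crit:.3f}')
print(f'normal CI: ({ci_normal[0]:.2f}, {ci_normal[1]:.2f})')
print(f't CI:      ({ci_t[0]:.2f}, {ci_t[1]:.2f})')
```

For large groups the two intervals are nearly identical, so the 1.96 shortcut is reasonable there; the difference matters mainly when a group has only a handful of observations.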

In [None]:
def mean_se_ci(s):
    s = pd.Series(s).dropna()
    n = s.size
    if n == 0:
        return pd.Series({'mean': np.nan, 'se': np.nan, 'low': np.nan, 'high': np.nan, 'n': 0})
    m = s.mean()
    # SE is undefined for a single observation, so report NaN rather than 0
    se = s.std(ddof=1) / np.sqrt(n) if n > 1 else np.nan
    low = m - 1.96 * se
    high = m + 1.96 * se
    return pd.Series({'mean': m, 'se': se, 'low': low, 'high': high, 'n': n})

if {'Group','Vitamin_D'}.issubset(df.columns):
    grp = (
        df.groupby('Group', observed=True)['Vitamin_D']
          .apply(mean_se_ci)
          .unstack()       # one row per group, one column per statistic
          .reset_index()
    )
    display(grp.head(10))

    # Pivot a compact summary of means by group (if helpful)
    grp_means = df.groupby('Group', observed=True)['Vitamin_D'].mean().reset_index(name='mean_VitD')
    display(grp_means)

## 7) Outliers: Identify & Reflect
Boxplots flag as outliers any points lying more than 1.5×IQR below Q1 or above Q3. Outliers can be data errors, rare but valid observations, or signals of subgroups. **Investigate** before removing.

<details><summary>Quick IQR rule</summary>
Q1 = 25th percentile, Q3 = 75th. IQR = Q3−Q1. Outliers ≈ values < Q1−1.5×IQR or > Q3+1.5×IQR.
</details>
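
The rule can be worked through by hand on a tiny example. The ten values below are hypothetical, and the `_demo` names are used only to avoid clashing with the notebook's own variables.

```python
import numpy as np

# Hypothetical values with one obviously large observation
values_demo = np.array([4, 5, 5, 6, 6, 7, 7, 8, 9, 25], dtype=float)

q1_demo, q3_demo = np.percentile(values_demo, [25, 75])
iqr_demo = q3_demo - q1_demo

# Fences: 1.5 x IQR below Q1 and above Q3
low_demo = q1_demo - 1.5 * iqr_demo
high_demo = q3_demo + 1.5 * iqr_demo

flagged = values_demo[(values_demo < low_demo) | (values_demo > high_demo)]
print(f'Q1={q1_demo}, Q3={q3_demo}, IQR={iqr_demo}')
print(f'Fences: [{low_demo}, {high_demo}]')
print('Flagged:', flagged)
```

With these numbers the fences come out at 1.5 and 11.5, so only the value 25 is flagged.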

In [None]:
q1, q3 = np.percentile(df['Vitamin_D'].dropna(), [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5*iqr, q3 + 1.5*iqr
mask_out = (df['Vitamin_D'] < low) | (df['Vitamin_D'] > high)
print(f'Potential outliers (IQR rule): {int(mask_out.sum())}')
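# Show only the identifying columns that are actually present
cols = [c for c in ['ID', 'Group', 'Vitamin_D'] if c in df.columns]
display(df.loc[mask_out, cols].head(10))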

## 🧪 Exercises
1) **Distribution by Group**  
   - Make overlaid histograms (`hue='Group'`) for `Vitamin_D`.  
   - Comment on differences in spread/centres.

2) **Normality**  
   - Produce Q–Q plots of `Vitamin_D` by group (facet by `Group`).  
   - Where does log transform help most?

3) **Correlations**  
   - If `Time` exists, compute Pearson & Spearman correlations with `Vitamin_D` (raw and log).  
   - Interpret: linear vs monotonic patterns.

4) **Grouped Summary**  
   - Create a table of mean, median, n of `Vitamin_D` by `Group × Outcome`.  
   - Which subgroup has the highest average?

5) **Outliers**  
   - Identify outliers by IQR rule within each `Group`.  
   - Suggest a plan (verify source, winsorise, robust stats) and justify briefly.

## ✅ Conclusion
You’ve conducted a principled EDA: profiling, distributions, group comparisons, normality checks, relationships, correlations, and outlier scanning. These findings should inform your next steps (transformations, model choices, and validation rules).

👉 Next: **4.3 Correlation & Association**—formalise the relationships you saw here.

<details><summary>Further reading</summary>
- Seaborn: <https://seaborn.pydata.org/>
- Matplotlib: <https://matplotlib.org/>
- Statsmodels graphics (Q–Q): <https://www.statsmodels.org/stable/graphics.html>
- Scipy stats: <https://docs.scipy.org/doc/scipy/reference/stats.html>
</details>