# 📊 4.1 Data Distributions and Visualisation

In this notebook we’ll explore the **shape** of our data and how to visualise it clearly. You’ll learn to create histograms, density plots, box/violin plots, Q–Q plots, and correlation visuals—and see when a **log transform** helps.

We’ll use `vitamin_trial.csv` (a simulated trial dataset with vitamin D measurements).

---
## 🎯 Objectives
- Visualise distributions (histogram, KDE/density, ECDF), and compare groups.
- Use box/violin plots to summarise variation and outliers.
- Check normality with **Q–Q plots** and **Shapiro–Wilk**.
- Compare correlation **with and without log transforms**.
- Interpret plots in the context of nutrition data.

<details><summary>Useful references</summary>

- [Seaborn: Distributions](https://seaborn.pydata.org/tutorial/distributions.html)
- [Matplotlib: Hist](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)
- [statsmodels Q–Q plot](https://www.statsmodels.org/stable/generated/statsmodels.graphics.gofplots.qqplot.html)
- [scipy.stats.shapiro](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html)
</details>

In [None]:
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

MODULE = '04_data_analysis'
DATASET = 'vitamin_trial.csv'
BASE_PATH = '/content/data-analysis-projects'
MODULE_PATH = os.path.join(BASE_PATH, 'notebooks', MODULE)
DATASET_PATH = os.path.join('data', DATASET)

try:
    print('Attempting to clone repository...')
    if not os.path.exists(BASE_PATH):
        !git clone https://github.com/ggkuhnle/data-analysis-projects.git
    print('Setting working directory...')
    os.chdir(MODULE_PATH)
    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} ✅')
    else:
        raise FileNotFoundError('Dataset missing after clone.')
except Exception as e:
    print(f'Cloning failed: {e}')
    print('Falling back to manual upload...')
    os.makedirs('data', exist_ok=True)
    uploaded = files.upload()
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} ✅')
    else:
        raise FileNotFoundError(f'Upload failed. Please upload {DATASET}.')

In [None]:
%pip install -q pandas numpy matplotlib seaborn statsmodels scipy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import shapiro, pearsonr, spearmanr

pd.set_option('display.max_columns', 50)
sns.set_theme()
print('Visualisation environment ready.')

## 🔧 Load the Data and Inspect
We’ll load `vitamin_trial.csv` and take a quick look at structure and the first few rows.

In [None]:
df = pd.read_csv('data/vitamin_trial.csv')
print('Shape:', df.shape)
display(df.head())
display(df.describe(include='all'))

## 📦 Distribution Plots
We start with a single numeric variable, e.g. **Vitamin_D**.

### Why distributions?
- They reveal skewness, heavy tails, and outliers.
- They inform transformations (e.g., **log** for right-skew).
- They guide the choice of statistical model.
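
Skewness can also be put into a number. The sketch below uses `scipy.stats.skew` on simulated right-skewed values (an assumption for illustration, not the trial data) and suggests how a log transform pulls the skewness towards zero.

```python
import numpy as np
from scipy.stats import skew

# Simulated right-skewed sample (log-normal); NOT the trial data
rng = np.random.default_rng(42)
sample = rng.lognormal(mean=3.0, sigma=0.6, size=500)

skew_raw = skew(sample)          # clearly positive for right-skewed data
skew_log = skew(np.log(sample))  # close to 0 once the skew is removed

print(f'Skewness raw: {skew_raw:.2f}')
print(f'Skewness log: {skew_log:.2f}')
```

A skewness near 0 indicates a roughly symmetric distribution; the same two calls could be applied to `df['Vitamin_D'].dropna()` once the data are loaded.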

In [None]:
x = df['Vitamin_D'].dropna()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1) Histogram
axes[0].hist(x, bins=20, edgecolor='black')
axes[0].set_title('Histogram: Vitamin_D')
axes[0].set_xlabel('Vitamin_D (µg)')

# 2) KDE / density
sns.kdeplot(x=x, ax=axes[1])
axes[1].set_title('KDE / Density: Vitamin_D')
axes[1].set_xlabel('Vitamin_D (µg)')

# 3) ECDF (empirical CDF)
xs = np.sort(x.values)
ys = np.arange(1, len(xs)+1) / len(xs)
axes[2].plot(xs, ys, marker='.', linestyle='none')
axes[2].set_title('ECDF: Vitamin_D')
axes[2].set_xlabel('Vitamin_D (µg)')
axes[2].set_ylabel('Proportion ≤ x')

plt.tight_layout(); plt.show()

## 🧪 Compare Groups (Box/Violin)
Box and violin plots summarise spread and potential outliers; violin also shows the density shape. Let’s compare by **Group** (e.g. Control vs Treatment).
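
The numbers behind a boxplot can be tabulated directly. This is a minimal sketch on a made-up frame: the column names mirror the notebook, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical mini-dataset; column names mirror the notebook, values are made up
demo = pd.DataFrame({
    'Group': ['Control'] * 5 + ['Treatment'] * 5,
    'Vitamin_D': [8.0, 9.5, 10.0, 11.0, 12.5, 11.0, 13.0, 14.5, 16.0, 30.0],
})

# Quartiles per group: the box spans Q1 to Q3 and the line inside is the median
summary = demo.groupby('Group')['Vitamin_D'].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ['Q1', 'median', 'Q3']
summary['IQR'] = summary['Q3'] - summary['Q1']
print(summary)
```

Reading the table alongside the plot makes it easier to say whether groups differ in location (median), in spread (IQR), or both.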

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.boxplot(data=df, x='Group', y='Vitamin_D', ax=axes[0])
axes[0].set_title('Boxplot by Group')
axes[0].set_ylabel('Vitamin_D (µg)')

sns.violinplot(data=df, x='Group', y='Vitamin_D', inner='quartile', ax=axes[1])
axes[1].set_title('Violin by Group (with quartiles)')
axes[1].set_ylabel('Vitamin_D (µg)')

plt.tight_layout(); plt.show()

## 🌀 Normality Checks
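Some methods (e.g., classic t-tests, OLS regression) assume approximately normal data within groups or, for OLS, normal residuals rather than normal raw variables. Always **look at the data first**:
- **Q–Q plots**: compare quantiles of your data to those of a normal distribution; points close to the reference line suggest approximate normality.
- **Shapiro–Wilk test**: formal test; with large n it rejects for even trivial departures, and with small n it has little power.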
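
To make the Q–Q idea concrete, the sketch below uses `scipy.stats.probplot`, which returns the Q–Q points and a least-squares line through them. The two samples are simulated (an assumption for illustration); the correlation `r` of the fitted line is used here only as a rough measure of how straight the points are, not as a formal test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=50, scale=10, size=300)        # simulated, roughly normal
skewed_sample = rng.lognormal(mean=3.0, sigma=1.0, size=300)  # simulated, right-skewed

# probplot returns (theoretical quantiles, ordered sample values)
# and (slope, intercept, r) of the least-squares line through those points
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist='norm')
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist='norm')

print(f'r (normal sample): {r_normal:.3f}')
print(f'r (skewed sample): {r_skewed:.3f}')
```

The closer `r` is to 1, the closer the points lie to a straight line; the skewed sample should give a visibly lower value.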

<details><summary>Learn more (click)</summary>

- [statsmodels Q–Q plot](https://www.statsmodels.org/stable/generated/statsmodels.graphics.gofplots.qqplot.html)
- [scipy.stats.shapiro](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html)
</details>

In [None]:
y = df['Vitamin_D'].dropna()

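# Q–Q plot (raw); pass an Axes so qqplot draws here instead of opening a second, empty figure
fig, ax = plt.subplots(figsize=(5, 5))
sm.qqplot(y, line='s', ax=ax)
ax.set_title('Q–Q Plot (raw Vitamin_D)')
plt.show()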

# Shapiro–Wilk
stat, p = shapiro(y)
print(f'Shapiro–Wilk (raw): statistic={stat:.3f}, p={p:.3g}')
print('Interpretation: p<0.05 → evidence against normality; p≥0.05 → no evidence against normality (which is not proof of it).')

### Log Transform (and Q–Q side-by-side)
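Right-skewed data often benefit from a log transform. The log is undefined at zero, so if zeros are possible, add a small constant `ε` or shift the data first. Be aware that a very small `ε` maps each zero to a large negative value that can dominate plots and tests. Here we add `ε = 1e-6`, which changes the result negligibly as long as Vitamin_D contains no zeros.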
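
The effect of the constant is easy to check on a few made-up values that include a zero. `np.log1p` (that is, `log(1 + x)`) is shown as one possible alternative; it is not what this notebook uses, and whether it suits your data depends on the measurement scale.

```python
import numpy as np

values = np.array([0.0, 5.0, 20.0, 80.0])  # made-up values including a zero

log_eps = np.log(values + 1e-6)  # the zero becomes log(1e-6), a large negative value
log_1p = np.log1p(values)        # log(1 + x): the zero stays at 0

print('log(x + 1e-6):', np.round(log_eps, 2))
print('log1p(x):     ', np.round(log_1p, 2))
```

For the non-zero values the tiny `ε` makes almost no difference, while `log1p` shifts small values noticeably, so neither choice is free of trade-offs.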

In [None]:
eps = 1e-6
y_log = np.log(y + eps)

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sm.qqplot(y, line='s', ax=axes[0])
axes[0].set_title('Q–Q: raw Vitamin_D')
sm.qqplot(y_log, line='s', ax=axes[1])
axes[1].set_title('Q–Q: log(Vitamin_D)')
plt.tight_layout(); plt.show()

stat_l, p_l = shapiro(y_log)
print(f'Shapiro–Wilk (log): statistic={stat_l:.3f}, p={p_l:.3g}')

## 🔗 Correlations (with and without log transform)
Correlations describe **linear** association (Pearson) or **monotonic** association (Spearman). Log transforms can straighten curved relationships and stabilise variance.
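
A noise-free toy example illustrates the difference. The variables `t` and `conc` below are hypothetical and follow an exact exponential curve, chosen only to make the effect obvious.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Made-up, noise-free exponential relationship (for illustration only)
t = np.linspace(0, 10, 50)
conc = np.exp(0.5 * t)

r_raw, _ = pearsonr(t, conc)             # below 1: the relationship is curved
r_log, _ = pearsonr(t, np.log(conc))     # about 1: log makes it a straight line
rho_raw, _ = spearmanr(t, conc)          # 1: perfectly monotonic
rho_log, _ = spearmanr(t, np.log(conc))  # unchanged: the ranks are the same

print(f'Pearson  raw={r_raw:.3f}  log={r_log:.3f}')
print(f'Spearman raw={rho_raw:.3f}  log={rho_log:.3f}')
```

Because Spearman works on ranks, any strictly increasing transform leaves it unchanged; only Pearson responds to the log.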

We’ll build a small numeric frame and compare raw vs log correlations. (If your dataset has other numeric variables like `Time`, we’ll include them automatically.)

In [None]:
# Select numeric columns
num = df.select_dtypes(include=[np.number]).copy()
if 'Vitamin_D' in num.columns:
    num['Vitamin_D_log'] = np.log(num['Vitamin_D'] + 1e-6)

print('Numeric columns:', num.columns.tolist())

# Pearson and Spearman correlation matrices
corr_raw = num.corr(numeric_only=True, method='pearson')
corr_spr = num.corr(numeric_only=True, method='spearman')

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.heatmap(corr_raw, annot=True, fmt='.2f', ax=axes[0])
axes[0].set_title('Pearson correlation (raw/log)')
sns.heatmap(corr_spr, annot=True, fmt='.2f', ax=axes[1])
axes[1].set_title('Spearman correlation (raw/log)')
plt.tight_layout(); plt.show()

### Scatter: Raw vs Log
If you have another numeric variable (e.g., `Time`), compare the raw scatter with the log-transformed vitamin D to see whether a relationship becomes clearer.

In [None]:
if 'Time' in df.columns and pd.api.types.is_numeric_dtype(df['Time']):
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.regplot(data=df, x='Time', y='Vitamin_D', ax=axes[0])
    axes[0].set_title('Scatter (raw) with regression')
    axes[0].set_ylabel('Vitamin_D (µg)')

    sns.regplot(data=df.assign(Vitamin_D_log=np.log(df['Vitamin_D'] + 1e-6)),
                x='Time', y='Vitamin_D_log', ax=axes[1])
    axes[1].set_title('Scatter (log Vitamin_D) with regression')
    axes[1].set_ylabel('log Vitamin_D')
    plt.tight_layout(); plt.show()
else:
    print('No numeric Time variable found for scatter comparison.')

## 📝 Interpretation Notes
- **Hist/KDE/ECDF**: inspect skewness and tails. Heavy right tails often justify log transforms.
- **Box/Violin**: compare groups; note outliers and differences in spread.
- **Q–Q + Shapiro**: use plots first; tests can be overly sensitive with large samples.
- **Correlations**: Pearson (linear) vs Spearman (rank/monotonic). Log transforms can reveal hidden linearity in Pearson; Spearman is rank-based, so a monotonic transform such as log leaves it unchanged.
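
The sample-size sensitivity of Shapiro–Wilk can be seen in a small simulation. This is a sketch on simulated, mildly skewed gamma data (an assumption, not the trial data); the exact p-values depend on the random draw, so treat the pattern rather than the numbers as the point.

```python
import numpy as np
from scipy.stats import shapiro

# Simulated, mildly right-skewed values (gamma); NOT the trial data
rng = np.random.default_rng(1)
draw = rng.gamma(shape=20.0, scale=1.0, size=4000)

small = draw[:40]  # small sample from the same distribution
large = draw       # large sample

_, p_small = shapiro(small)
_, p_large = shapiro(large)

print(f'n={small.size}: p={p_small:.3g}')
print(f'n={large.size}: p={p_large:.3g}')
```

With thousands of observations the test will typically reject even this mild skew, whereas a small sample from the same distribution often will not, which is why the plot should come first.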

## 🧪 Exercises
1. **Distribution by group**: Plot histograms (or KDEs) of Vitamin_D **by Group** (use `hue='Group'`), and describe differences.
2. **Normality**: Produce Q–Q plots for each Group separately. Where is log helpful?
3. **Correlation**: If `Time` exists, compute Pearson and Spearman correlations with Vitamin_D (raw vs log). Interpret differences.
4. **Outliers**: Identify potential outliers (e.g., values more than 1.5×IQR below Q1 or above Q3, which boxplots draw as individual points). How would you handle them (justify, don’t just drop)?
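
As a starting point for the outlier exercise, the 1.5×IQR rule can be sketched as follows. `iqr_outlier_mask` is a hypothetical helper written for this example (not part of any library), and it assumes the input has no missing values.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values more than k*IQR below Q1 or above Q3 (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

demo_values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # made-up data
mask = iqr_outlier_mask(demo_values)
print(demo_values[mask])  # the flagged values
```

With the notebook's data, something like `iqr_outlier_mask(df['Vitamin_D'].dropna())` would flag candidates; flagged values are prompts for investigation, not automatic deletions.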

## ✅ Conclusion
You’ve learned how to characterise distributions and assess normality using visual and formal tools, and how log transforms can clarify structure. You also compared correlations with and without log scaling.

👉 Next: **4.2 Exploratory Data Analysis**—bring these tools together to form and test hypotheses.

<details><summary>More reading</summary>

- [Seaborn: Statistical estimation & plots](https://seaborn.pydata.org/)
- [Matplotlib: Gallery](https://matplotlib.org/stable/gallery/index.html)
- [statsmodels: Graphics GOF](https://www.statsmodels.org/stable/graphics.html)
- [Scipy: Statistical functions](https://docs.scipy.org/doc/scipy/reference/stats.html)
</details>