# 📐 Deeper into Stats – Distributions, Transformations & Regression
This session builds on the previous sessions and introduces a more realistic, simulated dataset modelled on the UK National Diet and Nutrition Survey (NDNS).

**Topics:**
- Working with more complex data
- Checking distributions
- Z-score and log transformations
- Visualising regression fit and uncertainty

## 📥 Load Data and Tools

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Load simulated NDNS data
df = pd.read_csv('https://raw.githubusercontent.com/ggkuhnle/data-analysis/main/data/ndns_simulated.csv')
df.head()

## 📊 Check Distributions

In [None]:
sns.histplot(df['Cholesterol_mmol_L'], kde=True, bins=20)
plt.title('Distribution of Cholesterol (mmol/L)')
plt.show()
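
A histogram is easier to judge alongside a number or two. The sketch below is one possible check, not part of the original analysis: it uses a synthetic right-skewed series (a hypothetical stand-in, not the NDNS file) and reports the sample skewness and the gap between mean and median. Both being positive suggests a longer right tail.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for a biomarker (assumption: not the NDNS data)
rng = np.random.default_rng(42)
values = pd.Series(rng.lognormal(mean=1.6, sigma=0.5, size=2000), name="biomarker")

# Sample skewness: roughly 0 for symmetric data, > 0 for a longer right tail
skewness = values.skew()

# In right-skewed data the mean is pulled above the median
mean_minus_median = values.mean() - values.median()

print(f"skewness: {skewness:.2f}")
print(f"mean - median: {mean_minus_median:.2f}")
```

The same two calls can be applied to `df['Cholesterol_mmol_L']` once the data are loaded.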

## 🧮 Z-Transformation

In [None]:
# Z-score: subtract the mean, then divide by the sample standard deviation (pandas .std() uses ddof=1)
chol = df['Cholesterol_mmol_L']
df['Cholesterol_z'] = (chol - chol.mean()) / chol.std()
sns.histplot(df['Cholesterol_z'], kde=True)
plt.title('Z-score Standardised Cholesterol')
plt.show()
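
The z-transformation rescales a variable as $z = (x - \bar{x}) / s$, so the result has mean 0 and standard deviation 1, but the shape of the distribution stays the same. The sketch below checks this on synthetic data (an assumption made so the example runs without the course dataset).

```python
import numpy as np
import pandas as pd

# Synthetic, mildly skewed values (assumption: illustrative only)
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=1.6, sigma=0.3, size=1000))

# Same formula as in the cell above; pandas .std() is the sample SD (ddof=1)
z = (x - x.mean()) / x.std()

print(f"mean of z: {z.mean():.6f}")
print(f"SD of z:   {z.std():.6f}")

# Skewness is unchanged: z-scoring shifts and rescales, it does not reshape
print("skewness unchanged:", np.isclose(x.skew(), z.skew()))
```

A skewed variable therefore remains skewed after z-scoring; only the units change.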

## 🔢 Log Transformation

In [None]:
# Natural log; requires strictly positive values (zero gives -inf, negatives give NaN)
df['Cholesterol_log'] = np.log(df['Cholesterol_mmol_L'])
sns.histplot(df['Cholesterol_log'], kde=True)
plt.title('Log-Transformed Cholesterol')
plt.show()
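
Unlike z-scoring, the log transformation changes the shape of a distribution: it compresses large values more than small ones, which typically reduces right skew. A minimal sketch on synthetic log-normal data (an assumption, not the course dataset) shows the effect and how to back-transform the mean of the logged values into a geometric mean.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (assumption: illustrative only)
rng = np.random.default_rng(1)
x = pd.Series(rng.lognormal(mean=1.6, sigma=0.5, size=2000))

# Guard: the log is only defined for strictly positive values
if (x <= 0).any():
    raise ValueError("np.log needs strictly positive values")

x_log = np.log(x)

# Back-transforming the mean of the logs gives the geometric mean
geometric_mean = np.exp(x_log.mean())

print(f"skewness before: {x.skew():.2f}, after: {x_log.skew():.2f}")
print(f"arithmetic mean: {x.mean():.2f}, geometric mean: {geometric_mean:.2f}")
```

Results computed on the log scale need to be back-transformed with `np.exp` before they are reported in the original units.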

## 📈 Regression – Saturated Fat and Cholesterol

In [None]:
model = smf.ols('Cholesterol_mmol_L ~ SaturatedFat_g + Fibre_g + Age + C(Sex)', data=df).fit()
print(model.summary())
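
The full summary table is dense, so it can help to pull out just the coefficients and their 95% confidence intervals. The sketch below does this on a small synthetic dataset whose column names mirror the ones used above; the data-generating values (0.03, 0.01, 0.2) and the names `toy`, `toy_model` and `coef_table` are made up for illustration and say nothing about the real NDNS data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data with known "true" effects (assumption: illustrative only)
rng = np.random.default_rng(7)
n = 500
toy = pd.DataFrame({
    "SaturatedFat_g": rng.normal(30, 8, n),
    "Age": rng.integers(19, 80, n),
    "Sex": rng.choice(["F", "M"], n),
})
toy["Cholesterol_mmol_L"] = (
    3.5
    + 0.03 * toy["SaturatedFat_g"]
    + 0.01 * toy["Age"]
    + 0.2 * (toy["Sex"] == "M")
    + rng.normal(0, 0.5, n)
)

toy_model = smf.ols(
    "Cholesterol_mmol_L ~ SaturatedFat_g + Age + C(Sex)", data=toy
).fit()

# One row per coefficient: point estimate plus 95% confidence interval
coef_table = pd.concat([toy_model.params, toy_model.conf_int()], axis=1)
coef_table.columns = ["estimate", "ci_low", "ci_high"]
print(coef_table.round(3))
```

Each slope is the expected change in cholesterol (mmol/L) for a one-unit increase in that predictor with the others held fixed. The row `C(Sex)[T.M]` is the difference between the `M` group and the reference level `F`.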

In [None]:
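# Fitted values and the 95% confidence interval for the mean prediction
df['pred'] = model.fittedvalues
# conf_int() returns an (n, 2) array; transpose to unpack lower and upper bounds
df['ci_low'], df['ci_high'] = model.get_prediction().conf_int().T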

plt.figure(figsize=(8,5))
sns.scatterplot(x='pred', y='Cholesterol_mmol_L', data=df)
# Identity line: points on it are predicted perfectly
lims = [df['pred'].min(), df['pred'].max()]
plt.plot(lims, lims, ls='--', color='red')
plt.title('Observed vs Predicted Cholesterol')
plt.xlabel('Predicted cholesterol (mmol/L)')
plt.ylabel('Observed cholesterol (mmol/L)')
plt.show()
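
The interval returned by `conf_int()` on a prediction object describes uncertainty in the mean prediction, which is narrower than the interval for an individual observation. The sketch below compares the two using `summary_frame()` on a toy one-predictor model; the data and the names `toy`, `fit` and `bands` are hypothetical and only there to make the example runnable on its own.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data (assumption: illustrative only)
rng = np.random.default_rng(3)
toy = pd.DataFrame({"x": rng.uniform(10, 50, 200)})
toy["y"] = 4.0 + 0.03 * toy["x"] + rng.normal(0, 0.5, 200)

fit = smf.ols("y ~ x", data=toy).fit()

# summary_frame holds the fitted mean, the CI of the mean (mean_ci_*)
# and the prediction interval for a single new observation (obs_ci_*)
bands = fit.get_prediction().summary_frame(alpha=0.05)

mean_width = (bands["mean_ci_upper"] - bands["mean_ci_lower"]).mean()
obs_width = (bands["obs_ci_upper"] - bands["obs_ci_lower"]).mean()

print(f"average width, CI of the mean:      {mean_width:.2f}")
print(f"average width, prediction interval: {obs_width:.2f}")
```

When plotting uncertainty it is worth stating which of the two intervals is shown, since the prediction interval also includes the residual scatter around the fitted line.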

## ✅ Summary
- Used a more realistic NDNS-like dataset
- Explored variable distributions and transformations
- Built a regression model with multiple predictors
- Interpreted results visually and statistically

You're now well equipped to dive into complex data and interpret it like a pro. 🧠🦛