# 02 – Exploring and Transforming Data

In this notebook, we will:
- Visualise variable distributions
- Apply data transformations (z-score, quantiles, log)
- Examine relationships between variables using correlation
- Compare groups (Table 1-style summary)
- Run a simple linear regression

Work through the previous notebook first if you have not yet seen the dataset. The setup cell below reloads it from GitHub, so this notebook also runs on its own.


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

sns.set_theme(style="whitegrid")  # sns.set() is an older alias for set_theme()

df = pd.read_csv('https://raw.githubusercontent.com/ggkuhnle/FB2NEP_datascience/main/data/fb2nep_data.csv')

## 📈 Distribution of Numeric Variables

In [None]:
# Histogram of age
sns.histplot(df['age'], kde=True)
plt.title("Age Distribution")
plt.show()

## 🔁 Transformations

In [None]:
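# Z-score: centre on the mean and scale by the sample standard deviation
# (pandas .std() uses ddof=1 by default)
df['age_z'] = (df['age'] - df['age'].mean()) / df['age'].std()

# Quantiles: split age into four equal-sized groups (quartiles)
df['age_q'] = pd.qcut(df['age'], q=4)

# Log transform example
# log1p(x) computes log(1 + x), which avoids log(0) for zero intakes
df['log_energy'] = np.log1p(df['energy_kcal'])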

sns.histplot(df['log_energy'], kde=True)
plt.title("Log-transformed Energy Intake")
plt.show()
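
As a quick sanity check on the transformations above, the sketch below applies the same three steps to a synthetic, right-skewed series. The values, the random seed and the `Q1` to `Q4` labels are made up for illustration and are not taken from the dataset. It checks that a z-scored variable has mean 0 and standard deviation 1, that quartile groups are equal in size, and that the log transform reduces skewness.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed values standing in for a variable such as energy_kcal
# (assumption: illustrative data only, not the course dataset)
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=7.5, sigma=0.4, size=200))

# Z-score: pandas .std() uses the sample standard deviation (ddof=1)
z = (x - x.mean()) / x.std()

# Quartiles with short labels instead of the default interval labels
q = pd.qcut(x, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# log1p(x) computes log(1 + x)
log_x = np.log1p(x)

print("z mean and SD:", round(z.mean(), 6), round(z.std(), 6))
print("group sizes:", q.value_counts().sort_index().to_dict())
print("skewness before and after log:", round(x.skew(), 2), round(log_x.skew(), 2))
```

Passing `labels=` to `pd.qcut` is optional. The default interval labels show the actual cut points, which is often useful when reporting results.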

## 🔗 Correlation Matrix

In [None]:
corr_vars = ['age', 'bmi', 'energy_kcal', 'sbp', 'dbp']
corr = df[corr_vars].corr()
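# Fix the colour scale to [-1, 1] so that white corresponds to zero correlation
sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm', vmin=-1, vmax=1)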
plt.title("Correlation Matrix")
plt.show()
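
`DataFrame.corr()` computes Pearson correlation by default, which measures linear association and excludes missing values pairwise. For skewed variables or non-linear but monotonic relationships, a rank-based (Spearman) correlation can be a useful complement. The sketch below uses made-up data, not the course dataset, to show how the two can differ.

```python
import numpy as np
import pandas as pd

# Synthetic example: y increases with x, but not linearly
# (assumption: illustrative data only)
rng = np.random.default_rng(1)
x = rng.normal(size=100)
demo = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures linear association; Spearman works on ranks
pearson = demo.corr(method="pearson").loc["x", "y"]
spearman = demo.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

Because `y` is a strictly increasing function of `x`, the ranks agree perfectly and the Spearman coefficient is 1, while the Pearson coefficient is lower.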

## 📋 Table 1: Compare Groups

In [None]:
# Compare age, energy intake and BMI between males and females
df.groupby('sex')[['age', 'energy_kcal', 'bmi']].agg(['mean', 'std', 'count'])
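
Published Table 1 summaries usually report each variable as "mean (SD)" with one column per group. One possible way to produce that layout is sketched below on a tiny made-up data frame. The helper `mean_sd` is a hypothetical name introduced here, not part of pandas.

```python
import pandas as pd

# Tiny made-up data frame (assumption: illustrative values only)
demo = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "age": [30, 40, 50, 35, 45, 55],
    "bmi": [22.0, 24.0, 26.0, 25.0, 27.0, 29.0],
})

def mean_sd(series):
    """Format a numeric series as 'mean (SD)' with one decimal place."""
    return f"{series.mean():.1f} ({series.std():.1f})"

# One row per variable, one column per group
table1 = demo.groupby("sex")[["age", "bmi"]].agg(mean_sd).T

# Add the group sizes as a final row
table1.loc["n"] = demo.groupby("sex").size()

print(table1)
```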

In [None]:
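# Independent two-sample t-test: energy intake in females vs. males
# (SciPy assumes equal variances by default; missing values are omitted)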
stats.ttest_ind(df[df['sex'] == 'female']['energy_kcal'],
                df[df['sex'] == 'male']['energy_kcal'],
                nan_policy='omit')
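
The t-test above uses SciPy's default setting, which assumes that both groups have the same variance. When group variances or sizes differ, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic groups. The group names, means and spreads are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Two synthetic groups with different spreads and sizes
# (assumption: illustrative data only)
rng = np.random.default_rng(2)
group_a = rng.normal(loc=2000, scale=300, size=60)
group_b = rng.normal(loc=2300, scale=600, size=40)

# Student's t-test (pooled variance) vs. Welch's t-test (separate variances)
student = stats.ttest_ind(group_a, group_b)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t = {student.statistic:.2f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.2f}, p = {welch.pvalue:.4f}")
```

With real data, comparing the group standard deviations in the summary table is a simple first check on whether the equal-variance assumption looks plausible.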

## 📉 Simple Linear Regression

In [None]:
# Predict systolic blood pressure (sbp) from age
model = smf.ols('sbp ~ age', data=df).fit()
print(model.summary())
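
To see what the fitted coefficients mean, it can help to compute them by hand. In simple linear regression, the least-squares slope equals the covariance of predictor and outcome divided by the variance of the predictor, and R-squared equals the squared Pearson correlation. The sketch below uses NumPy on made-up data; the variable names mirror the model above, but the numbers are assumptions and this is not how `statsmodels` is implemented internally.

```python
import numpy as np

# Made-up data with a known linear relationship plus noise
# (assumption: illustrative values only, not the course dataset)
rng = np.random.default_rng(3)
age = rng.uniform(20, 80, size=150)
sbp = 100 + 0.5 * age + rng.normal(scale=10, size=150)

# Least-squares estimates for sbp = intercept + slope * age
slope = np.cov(age, sbp, ddof=1)[0, 1] / np.var(age, ddof=1)
intercept = sbp.mean() - slope * age.mean()

# R-squared: proportion of the variance in sbp explained by the fitted line
residuals = sbp - (intercept + slope * age)
r_squared = 1 - np.sum(residuals**2) / np.sum((sbp - sbp.mean())**2)

print(f"slope = {slope:.3f}, intercept = {intercept:.2f}, R-squared = {r_squared:.3f}")
```

In the `model.summary()` output, the same quantities appear as the `age` coefficient, the `Intercept` coefficient and `R-squared`.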

## 🧠 Exercise

Choose a numeric outcome (e.g. `dbp`, `bmi`, `energy_kcal`) and a plausible predictor.

1. Create a plot to visualise the relationship.
2. Fit a simple regression model.
3. Comment on the strengths and limitations of your model.

✍️ Add your comments in a text cell below.

## 🧪 Playground – try your own code below

In [None]:
# Write your own code here!