# Epidemiological Analysis: Cross-Sectional and Prospective Studies in Nutrition 🥗📊

Welcome to this Jupyter notebook on epidemiological analysis in nutrition science! We’ll explore a large simulated study (n=25,000, age range 45-80) with cross-sectional and prospective designs, focusing on a continuous endpoint (BMI) and a survival endpoint (CVD incidence). The dataset includes baseline and follow-up data (at 2, 4 and 6 years) on smoking, sex, physical activity, social class (UK social grades A, B, C1, C2, D, E), BMI, blood pressure, sugar intake, SFA intake, and CVD incidence, with randomly missing values.

In this notebook, we’ll:
- **Summarise baseline characteristics** with Table 1 🧩
- **Analyse missing data** to understand patterns
- **Perform cross-sectional analysis** using Frequentist and Bayesian regression (baseline BMI)
- **Conduct survival analysis** for CVD incidence (Frequentist and Bayesian)
- **Analyse prospective changes** in BMI and CVD incidence (Frequentist and Bayesian regression) 📈

Let’s dive in and explore this epidemiological dataset!

## Step 1: Load the Dataset and Libraries 📦

First, let’s load the necessary libraries and the simulated dataset.

In [None]:
# Import libraries for analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from lifelines import CoxPHFitter
import pymc as pm
import arviz as az
from statsmodels.formula.api import mixedlm

# Set seaborn style for clean visuals
sns.set_style("whitegrid")

# Load the dataset
data = pd.read_csv('data/epidemiological_study.csv')

# Display the first few rows
data.head()

## Step 2: Table 1 - Baseline Characteristics 📊

Let’s create Table 1 to summarise the baseline characteristics of the study population, including means (SD) for continuous variables and counts (%) for categorical variables.

In [None]:
# Continuous variables: Mean (SD)
continuous_vars = ['Age', 'BMI_Baseline', 'BP_Baseline', 'Sugar_Intake', 'SFA_Intake']
continuous_summary = data[continuous_vars].agg(['mean', 'std']).round(2).T
continuous_summary.columns = ['Mean', 'SD']
continuous_summary['Mean (SD)'] = continuous_summary['Mean'].astype(str) + ' (' + continuous_summary['SD'].astype(str) + ')'

# Categorical variables: Counts (%)
categorical_vars = ['Sex', 'Smoking', 'Physical_Activity', 'Social_Class']
categorical_summary = {}
for var in categorical_vars:
    counts = data[var].value_counts(dropna=False)
    percents = (counts / counts.sum() * 100).round(2)
    categorical_summary[var] = pd.DataFrame({
        'Count (%)': [f"{count} ({percent}%)" for count, percent in zip(counts, percents)]
    }, index=counts.index)

# Display Table 1
print("Table 1: Baseline Characteristics")
print("\nContinuous Variables:")
print(continuous_summary[['Mean (SD)']])
print("\nCategorical Variables:")
for var in categorical_vars:
    print(f"\n{var}:")
    print(categorical_summary[var])

## Step 3: Analysis of Missing Data 🔎

Let’s assess the extent and pattern of missing data in the dataset to understand potential biases.

In [None]:
# Calculate percentage of missing data for each variable
missing_data = data.isna().mean() * 100
missing_summary = pd.DataFrame({
    'Missing (%)': missing_data.round(2)
})

# Display missing data summary
print("Missing Data Analysis:")
print(missing_summary[missing_summary['Missing (%)'] > 0])

# Visualize missing data patterns
plt.figure(figsize=(10, 6))
sns.heatmap(data.isna(), cbar=False, cmap='viridis')
plt.title('Missing Data Patterns (Yellow = Missing) 📉')
plt.tight_layout()
plt.show()
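
To probe whether missingness might bias the analysis, one option is to compare a fully observed variable between rows with and without a missing value. The sketch below uses a small made-up data frame (`toy` and its values are illustrative assumptions, not the study data) and a hypothetical helper, `compare_by_missingness`.

```python
import numpy as np
import pandas as pd

def compare_by_missingness(df, target, by):
    """Mean of `by` among rows where `target` is missing vs observed."""
    flag = df[target].isna().map({True: 'missing', False: 'observed'})
    return df.groupby(flag)[by].mean()

# Toy data (illustrative values only)
toy = pd.DataFrame({
    'Age': [50, 60, 70, 80],
    'Sugar_Intake': [40.0, np.nan, 55.0, np.nan],
})
result = compare_by_missingness(toy, target='Sugar_Intake', by='Age')
print(result)
```

A clear difference between the two groups would suggest that the values are not missing completely at random, in which case mean imputation is more likely to bias the results.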

## Step 4: Cross-Sectional Analysis - Baseline BMI 🧮

Let’s perform a cross-sectional analysis of baseline BMI, using Frequentist (linear regression) and Bayesian regression, with predictors: age, sex, smoking, physical activity, social class, sugar intake, and SFA intake.

### Data Preparation
First, we’ll preprocess the data, encoding categorical variables and handling missing data (simple imputation for this example).

In [None]:
# Prepare data for cross-sectional analysis
cross_sectional_data = data[['BMI_Baseline', 'Age', 'Sex', 'Smoking', 'Physical_Activity', 'Social_Class', 'Sugar_Intake', 'SFA_Intake']].copy()

# Encode categorical variables.
# Fill missing categories with the mode first: astype(str) would otherwise
# turn NaN into a spurious 'nan' category with its own integer code.
le = LabelEncoder()
for col in ['Sex', 'Smoking']:
    filled = cross_sectional_data[col].fillna(cross_sectional_data[col].mode()[0])
    cross_sectional_data[col] = le.fit_transform(filled.astype(str))
# Ordinal codes; missing values are assigned to 'Low' activity and class 'C1' (a simplification)
cross_sectional_data['Physical_Activity'] = cross_sectional_data['Physical_Activity'].map({'Low': 0, 'Medium': 1, 'High': 2, np.nan: 0})
cross_sectional_data['Social_Class'] = cross_sectional_data['Social_Class'].map({'A': 1, 'B': 2, 'C1': 3, 'C2': 4, 'D': 5, 'E': 6, np.nan: 3})

# Impute remaining missing numeric values with the column mean for simplicity
cross_sectional_data = cross_sectional_data.fillna(cross_sectional_data.mean())

# Define predictors and outcome
X_cross = cross_sectional_data.drop('BMI_Baseline', axis=1)
y_cross = cross_sectional_data['BMI_Baseline']
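
The integer codes used for `Social_Class` treat the grades as an equally spaced numeric scale. If that assumption seems too strong, one alternative is indicator (dummy) coding. This is a minimal sketch on made-up values, with `toy` and the `Class` prefix chosen purely for illustration:

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({'Social_Class': ['A', 'C1', 'E', 'C1']})

# One indicator column per grade, dropping the first grade as the reference category
dummies = pd.get_dummies(toy['Social_Class'], prefix='Class', drop_first=True, dtype=int)
print(dummies)
```

Each coefficient on an indicator column is then a contrast against the reference grade (here `A`).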

### Frequentist Linear Regression
We’ll use scikit-learn’s `LinearRegression` to model baseline BMI.

In [None]:
# Frequentist linear regression
freq_model = LinearRegression()
freq_model.fit(X_cross, y_cross)

# Coefficients and intercept
freq_coefs = pd.DataFrame({
    'Predictor': X_cross.columns,
    'Coefficient': freq_model.coef_
})
print("Frequentist Linear Regression Results:")
print(f"Intercept: {freq_model.intercept_:.2f}")
print(freq_coefs)
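
`LinearRegression` reports point estimates only. As a rough sketch of how standard errors and approximate 95% confidence intervals follow from the ordinary least squares formulae, the example below uses simulated data and a hypothetical helper, `ols_with_ci`. It uses the normal quantile 1.96 rather than a t quantile, which is a reasonable approximation only for large samples.

```python
import numpy as np

def ols_with_ci(X, y, z=1.96):
    """OLS estimates with standard errors and approximate 95% confidence intervals."""
    X1 = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    dof = X1.shape[0] - X1.shape[1]              # residual degrees of freedom
    sigma2 = resid @ resid / dof                 # residual variance
    cov = sigma2 * np.linalg.inv(X1.T @ X1)      # covariance matrix of the estimates
    se = np.sqrt(np.diag(cov))
    return beta, se, beta - z * se, beta + z * se

# Simulated data (illustrative only): intercept 25, slopes 0.5 and -0.3
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 25.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=1.0, size=500)

beta, se, lower, upper = ols_with_ci(X, y)
print(np.column_stack([beta, se, lower, upper]).round(3))
```

In practice, a library such as `statsmodels` provides the same quantities directly from its fitted model summary.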

### Bayesian Linear Regression
We’ll use PyMC to model the same relationship, with weakly informative priors.

In [None]:
# Bayesian linear regression
with pm.Model() as bayes_model:
    # Priors
    intercept = pm.Normal('Intercept', mu=0, sigma=10)
    beta_age = pm.Normal('Age', mu=0, sigma=1)
    beta_sex = pm.Normal('Sex', mu=0, sigma=1)
    beta_smoking = pm.Normal('Smoking', mu=0, sigma=1)
    beta_activity = pm.Normal('Physical_Activity', mu=0, sigma=1)
    beta_class = pm.Normal('Social_Class', mu=0, sigma=1)
    beta_sugar = pm.Normal('Sugar_Intake', mu=0, sigma=1)
    beta_sfa = pm.Normal('SFA_Intake', mu=0, sigma=1)
    sigma = pm.HalfNormal('sigma', sigma=1)

    # Linear model
    mu = (intercept + beta_age * X_cross['Age'] + beta_sex * X_cross['Sex'] +
          beta_smoking * X_cross['Smoking'] + beta_activity * X_cross['Physical_Activity'] +
          beta_class * X_cross['Social_Class'] + beta_sugar * X_cross['Sugar_Intake'] +
          beta_sfa * X_cross['SFA_Intake'])

    # Likelihood
    bmi_obs = pm.Normal('bmi_obs', mu=mu, sigma=sigma, observed=y_cross)

    # Sample from posterior
    trace_cross = pm.sample(1000, tune=1000, return_inferencedata=True)

# Summary of posterior
print("Bayesian Linear Regression Results:")
print(az.summary(trace_cross, var_names=['Intercept', 'Age', 'Sex', 'Smoking', 'Physical_Activity', 'Social_Class', 'Sugar_Intake', 'SFA_Intake']))
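
The `Normal(0, 1)` priors above are placed on coefficients for predictors measured on very different scales (years of age, grams of intake, integer codes). One common way to make such priors comparable is to standardise the predictors before fitting. This is a minimal sketch with a hypothetical helper, `standardise`, applied to made-up values:

```python
import pandas as pd

def standardise(df):
    """Centre each column at 0 and scale it to unit standard deviation."""
    return (df - df.mean()) / df.std(ddof=0)

# Toy data (illustrative values only)
toy = pd.DataFrame({'Age': [45.0, 60.0, 75.0], 'Sugar_Intake': [30.0, 50.0, 100.0]})
toy_z = standardise(toy)
print(toy_z.round(3))
```

Coefficients fitted on standardised predictors are then interpreted per standard deviation rather than per original unit.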

## Step 5: Survival Analysis - CVD Incidence 🕰️

Let’s perform survival analysis for CVD incidence, using Frequentist (Cox proportional hazards) and Bayesian survival regression, with the same predictors.

### Data Preparation
We’ll use the same predictors, ensuring proper encoding and imputation.

In [None]:
# Prepare data for survival analysis
survival_data = data[['Time_to_CVD', 'CVD_Incidence', 'Age', 'Sex', 'Smoking', 'Physical_Activity', 'Social_Class', 'Sugar_Intake', 'SFA_Intake']].copy()
# Drop rows with a missing outcome: mean-imputing the event indicator would give non-binary values
survival_data = survival_data.dropna(subset=['Time_to_CVD', 'CVD_Incidence'])
# Fill missing categories with the mode so that NaN does not become a 'nan' category
for col in ['Sex', 'Smoking']:
    filled = survival_data[col].fillna(survival_data[col].mode()[0])
    survival_data[col] = le.fit_transform(filled.astype(str))
survival_data['Physical_Activity'] = survival_data['Physical_Activity'].map({'Low': 0, 'Medium': 1, 'High': 2, np.nan: 0})
survival_data['Social_Class'] = survival_data['Social_Class'].map({'A': 1, 'B': 2, 'C1': 3, 'C2': 4, 'D': 5, 'E': 6, np.nan: 3})
# Mean-impute the remaining covariates
survival_data = survival_data.fillna(survival_data.mean())

### Frequentist Cox Proportional Hazards
We’ll use `lifelines` to fit a Cox model.

In [None]:
# Frequentist Cox model
cox_model = CoxPHFitter()
cox_model.fit(survival_data, duration_col='Time_to_CVD', event_col='CVD_Incidence')

# Display results
print("Frequentist Cox Proportional Hazards Results:")
cox_model.print_summary()
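
Cox coefficients are log hazard ratios, so exponentiating a coefficient and its confidence limits gives the hazard ratio per one-unit increase in the predictor. This is a small sketch with made-up coefficients and standard errors, using a hypothetical helper, `hazard_ratios`:

```python
import numpy as np
import pandas as pd

def hazard_ratios(coef, se, z=1.96):
    """Convert log hazard ratios and their standard errors to HRs with 95% CIs."""
    coef = np.asarray(coef, dtype=float)
    se = np.asarray(se, dtype=float)
    return pd.DataFrame({
        'HR': np.exp(coef),
        'CI_lower': np.exp(coef - z * se),
        'CI_upper': np.exp(coef + z * se),
    })

# Illustrative inputs: a null effect and a doubling of the hazard
hr_table = hazard_ratios(coef=[0.0, np.log(2.0)], se=[0.1, 0.1])
print(hr_table.round(3))
```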

### Bayesian Survival Regression
We’ll use PyMC to fit a Weibull survival model, a common choice for Bayesian survival analysis.

In [None]:
# Bayesian survival regression (Weibull accelerated failure time model)
# Right-censoring bound: +inf for observed events, the follow-up time for censored participants
event = survival_data['CVD_Incidence'].to_numpy() == 1
time = survival_data['Time_to_CVD'].to_numpy()
upper = np.where(event, np.inf, time)

with pm.Model() as bayes_survival:
    # Priors for coefficients
    intercept = pm.Normal('Intercept', mu=0, sigma=10)
    beta_age = pm.Normal('Age', mu=0, sigma=1)
    beta_sex = pm.Normal('Sex', mu=0, sigma=1)
    beta_smoking = pm.Normal('Smoking', mu=0, sigma=1)
    beta_activity = pm.Normal('Physical_Activity', mu=0, sigma=1)
    beta_class = pm.Normal('Social_Class', mu=0, sigma=1)
    beta_sugar = pm.Normal('Sugar_Intake', mu=0, sigma=1)
    beta_sfa = pm.Normal('SFA_Intake', mu=0, sigma=1)
    alpha = pm.HalfNormal('alpha', sigma=2)  # Weibull shape parameter; must be positive

    # Linear predictor for the log scale (lambda).
    # Positive coefficients imply a longer time to CVD, the opposite sign to a Cox log hazard ratio.
    log_lambda = (intercept + beta_age * survival_data['Age'] + beta_sex * survival_data['Sex'] +
                  beta_smoking * survival_data['Smoking'] + beta_activity * survival_data['Physical_Activity'] +
                  beta_class * survival_data['Social_Class'] + beta_sugar * survival_data['Sugar_Intake'] +
                  beta_sfa * survival_data['SFA_Intake'])
    lambda_ = pm.math.exp(log_lambda)

    # Likelihood: Weibull with right-censoring for participants without a CVD event
    time_obs = pm.Censored('time_obs', pm.Weibull.dist(alpha=alpha, beta=lambda_),
                           lower=None, upper=upper, observed=time)

    # Sample from posterior
    trace_survival = pm.sample(1000, tune=1000, return_inferencedata=True)

# Summary of posterior
print("Bayesian Survival Regression Results:")
print(az.summary(trace_survival, var_names=['Age', 'Sex', 'Smoking', 'Physical_Activity', 'Social_Class', 'Sugar_Intake', 'SFA_Intake']))
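
Censoring matters in a parametric survival likelihood: participants who had a CVD event contribute the density at their event time, while those still event-free at the end of follow-up contribute only the probability of surviving that long. This is a minimal NumPy sketch of that log-likelihood for a Weibull model (the helper `weibull_loglik` and the inputs are illustrative assumptions):

```python
import numpy as np

def weibull_loglik(time, event, alpha, lam):
    """Log-likelihood of a Weibull(shape=alpha, scale=lam) model with right-censoring.

    Events contribute the log density; censored observations contribute the
    log survival function, log S(t) = -(t / lam) ** alpha.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    log_surv = -(time / lam) ** alpha
    log_dens = np.log(alpha / lam) + (alpha - 1) * np.log(time / lam) + log_surv
    return np.sum(np.where(event, log_dens, log_surv))

# Illustrative inputs: one event at t=1 and one participant censored at t=3
ll = weibull_loglik(time=[1.0, 3.0], event=[1, 0], alpha=1.0, lam=2.0)
ll_ignoring_censoring = weibull_loglik(time=[1.0, 3.0], event=[1, 1], alpha=1.0, lam=2.0)
print(ll, ll_ignoring_censoring)
```

Treating censored times as events, as in the second call, gives a different likelihood and generally understates survival.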

## Step 6: Prospective Analysis - BMI Change and CVD Incidence 🔄

Let’s analyse prospective changes in BMI over time and their association with CVD incidence, using Frequentist (mixed-effects model) and Bayesian regression.

### Data Preparation
We’ll reshape the data into long format for prospective analysis.

In [None]:
# Reshape data into long format for prospective analysis
long_data = pd.melt(data, 
                    id_vars=['ID', 'Age', 'Sex', 'Smoking', 'Physical_Activity', 'Social_Class', 'Sugar_Intake', 'SFA_Intake', 'CVD_Incidence'],
                    value_vars=['BMI_Baseline', 'BMI_Year2', 'BMI_Year4', 'BMI_Year6'],
                    var_name='Time', value_name='BMI')
long_data['Time'] = long_data['Time'].map({'BMI_Baseline': 0, 'BMI_Year2': 2, 'BMI_Year4': 4, 'BMI_Year6': 6})

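# Encode categorical variables (fill missing categories with the mode first,
# so that NaN does not become a spurious 'nan' category)
for col in ['Sex', 'Smoking']:
    filled = long_data[col].fillna(long_data[col].mode()[0])
    long_data[col] = le.fit_transform(filled.astype(str))
long_data['Physical_Activity'] = long_data['Physical_Activity'].map({'Low': 0, 'Medium': 1, 'High': 2, np.nan: 0})
long_data['Social_Class'] = long_data['Social_Class'].map({'A': 1, 'B': 2, 'C1': 3, 'C2': 4, 'D': 5, 'E': 6, np.nan: 3})

# Mean-impute covariates only; imputing ID, Time or CVD_Incidence would corrupt them
covariate_cols = ['Age', 'Sex', 'Smoking', 'Physical_Activity', 'Social_Class', 'Sugar_Intake', 'SFA_Intake']
long_data[covariate_cols] = long_data[covariate_cols].fillna(long_data[covariate_cols].mean())

# Drop visits with no BMI measurement; mixed models accommodate unbalanced follow-up
long_data = long_data.dropna(subset=['BMI']).reset_index(drop=True)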
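
To make the wide-to-long reshape concrete, here is a toy example with two participants and two visits (values are made up; the column names mirror those used above):

```python
import pandas as pd

# Wide format: one row per participant, one BMI column per visit
wide = pd.DataFrame({
    'ID': [1, 2],
    'BMI_Baseline': [24.0, 30.0],
    'BMI_Year2': [25.0, 29.5],
})

# Long format: one row per participant-visit
long = wide.melt(id_vars='ID', var_name='Time', value_name='BMI')
long['Time'] = long['Time'].map({'BMI_Baseline': 0, 'BMI_Year2': 2})
long = long.sort_values(['ID', 'Time']).reset_index(drop=True)
print(long)
```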

### Frequentist Mixed-Effects Model for BMI Change
We’ll use `statsmodels` to fit a mixed-effects model for BMI over time.

In [None]:
# Frequentist mixed-effects model for BMI change
freq_mixed_model = mixedlm("BMI ~ Time + Age + Sex + Smoking + Physical_Activity + Social_Class + Sugar_Intake + SFA_Intake", 
                           long_data, groups=long_data['ID'])
freq_mixed_result = freq_mixed_model.fit()

# Display results
print("Frequentist Mixed-Effects Model for BMI Change:")
print(freq_mixed_result.summary())
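
Assuming the formula above with `groups` set to the participant ID and no additional random slopes, the fitted model is a random-intercept model. In our own notation, for participant $i$ at visit $j$, with $\mathbf{x}_i$ collecting the baseline covariates:

```latex
\begin{aligned}
\mathrm{BMI}_{ij} &= \beta_0 + u_i + \beta_1\,\mathrm{Time}_{ij} + \mathbf{x}_i^{\top}\boldsymbol{\gamma} + \varepsilon_{ij},\\
u_i &\sim \mathcal{N}(0, \sigma_u^2), \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2).
\end{aligned}
```

Here $\beta_1$ is the average yearly change in BMI, and $u_i$ shifts each participant’s trajectory up or down relative to the population average.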

### Bayesian Mixed-Effects Model for BMI Change
We’ll use PyMC to fit a Bayesian mixed-effects model.

In [None]:
# Bayesian mixed-effects model for BMI change
# Integer index linking each observation to its participant
id_idx, id_levels = pd.factorize(long_data['ID'])

with pm.Model() as bayes_mixed:
    # Fixed effects
    intercept = pm.Normal('Intercept', mu=0, sigma=10)
    slope = pm.Normal('Slope', mu=0, sigma=1)
    beta_age = pm.Normal('Age', mu=0, sigma=1)
    beta_sex = pm.Normal('Sex', mu=0, sigma=1)
    beta_smoking = pm.Normal('Smoking', mu=0, sigma=1)
    beta_activity = pm.Normal('Physical_Activity', mu=0, sigma=1)
    beta_class = pm.Normal('Social_Class', mu=0, sigma=1)
    beta_sugar = pm.Normal('Sugar_Intake', mu=0, sigma=1)
    beta_sfa = pm.Normal('SFA_Intake', mu=0, sigma=1)
    sigma = pm.HalfNormal('sigma', sigma=1)

    # Random intercepts for each participant (one per ID, so sampling can be slow)
    sigma_id = pm.HalfNormal('sigma_id', sigma=5)
    u = pm.Normal('u', mu=0, sigma=sigma_id, shape=len(id_levels))

    # Linear model
    mu = (intercept + u[id_idx] + slope * long_data['Time'] + beta_age * long_data['Age'] +
          beta_sex * long_data['Sex'] + beta_smoking * long_data['Smoking'] +
          beta_activity * long_data['Physical_Activity'] + beta_class * long_data['Social_Class'] +
          beta_sugar * long_data['Sugar_Intake'] + beta_sfa * long_data['SFA_Intake'])

    # Likelihood
    bmi_obs = pm.Normal('bmi_obs', mu=mu, sigma=sigma, observed=long_data['BMI'])

    # Sample from posterior
    trace_mixed = pm.sample(1000, tune=1000, return_inferencedata=True)

# Summary of posterior
print("Bayesian Mixed-Effects Model for BMI Change:")
print(az.summary(trace_mixed, var_names=['Intercept', 'Slope', 'Age', 'Sex', 'Smoking', 'Physical_Activity', 'Social_Class', 'Sugar_Intake', 'SFA_Intake']))

### Frequentist Logistic Regression for CVD Incidence (Prospective)
We’ll use logistic regression to assess prospective predictors of CVD incidence.

In [None]:
# Frequentist logistic regression for CVD incidence (prospective)
from sklearn.linear_model import LogisticRegression

# Prepare data (use baseline predictors and average BMI over time)
prospective_data = long_data.groupby('ID').agg({
    'Age': 'first', 'Sex': 'first', 'Smoking': 'first', 'Physical_Activity': 'first',
    'Social_Class': 'first', 'Sugar_Intake': 'first', 'SFA_Intake': 'first', 'CVD_Incidence': 'first',
    'BMI': 'mean'
}).reset_index()

# Keep participants with a valid binary outcome (excludes missing or non-binary values)
prospective_data = prospective_data[prospective_data['CVD_Incidence'].isin([0, 1])]

# Fit logistic regression
X_prosp = prospective_data[['Age', 'Sex', 'Smoking', 'Physical_Activity', 'Social_Class', 'Sugar_Intake', 'SFA_Intake', 'BMI']]
y_prosp = prospective_data['CVD_Incidence'].astype(int)
# Note: scikit-learn applies an L2 penalty (C=1.0) by default, so coefficients are
# slightly shrunk relative to an unpenalised maximum-likelihood fit
freq_logistic = LogisticRegression(max_iter=1000)
freq_logistic.fit(X_prosp, y_prosp)

# Coefficients
freq_logistic_coefs = pd.DataFrame({
    'Predictor': X_prosp.columns,
    'Coefficient': freq_logistic.coef_[0]
})
print("Frequentist Logistic Regression for CVD Incidence:")
print(freq_logistic_coefs)
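
Logistic regression coefficients are on the log-odds scale, which can be hard to read directly. One way to interpret them is to turn the linear predictor for a hypothetical participant into a predicted probability with the inverse-logit function. The helper `predicted_probability` and all numbers below are made up for illustration:

```python
import numpy as np

def predicted_probability(intercept, coefs, x):
    """Inverse-logit of the linear predictor intercept + coefs . x."""
    eta = intercept + np.dot(coefs, x)
    return 1.0 / (1.0 + np.exp(-eta))

# A linear predictor of 0 corresponds to a probability of 0.5
p_zero = predicted_probability(0.0, [0.5, -0.5], [1.0, 1.0])

# Illustrative single-predictor case: eta = -2.0 + 0.04 * 60 = 0.4
p_example = predicted_probability(-2.0, [0.04], [60.0])
print(p_zero, round(p_example, 3))
```

With the fitted scikit-learn model, the corresponding inputs would be `freq_logistic.intercept_[0]` and `freq_logistic.coef_[0]`.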

### Bayesian Logistic Regression for CVD Incidence (Prospective)
We’ll use PyMC to fit a Bayesian logistic regression model.

In [None]:
# Bayesian logistic regression for CVD incidence
with pm.Model() as bayes_logistic:
    # Priors (the intercept captures the baseline log-odds of CVD)
    intercept = pm.Normal('Intercept', mu=0, sigma=10)
    beta_age = pm.Normal('Age', mu=0, sigma=1)
    beta_sex = pm.Normal('Sex', mu=0, sigma=1)
    beta_smoking = pm.Normal('Smoking', mu=0, sigma=1)
    beta_activity = pm.Normal('Physical_Activity', mu=0, sigma=1)
    beta_class = pm.Normal('Social_Class', mu=0, sigma=1)
    beta_sugar = pm.Normal('Sugar_Intake', mu=0, sigma=1)
    beta_sfa = pm.Normal('SFA_Intake', mu=0, sigma=1)
    beta_bmi = pm.Normal('BMI', mu=0, sigma=1)

    # Linear predictor
    logits = (intercept + beta_age * X_prosp['Age'] + beta_sex * X_prosp['Sex'] +
              beta_smoking * X_prosp['Smoking'] + beta_activity * X_prosp['Physical_Activity'] +
              beta_class * X_prosp['Social_Class'] + beta_sugar * X_prosp['Sugar_Intake'] +
              beta_sfa * X_prosp['SFA_Intake'] + beta_bmi * X_prosp['BMI'])

    # Likelihood
    cvd_obs = pm.Bernoulli('cvd_obs', logit_p=logits, observed=y_prosp)

    # Sample from posterior
    trace_logistic = pm.sample(1000, tune=1000, return_inferencedata=True)

# Summary of posterior
print("Bayesian Logistic Regression for CVD Incidence:")
print(az.summary(trace_logistic, var_names=['Age', 'Sex', 'Smoking', 'Physical_Activity', 'Social_Class', 'Sugar_Intake', 'SFA_Intake', 'BMI']))

## Step 7: Learning Points and Next Steps 🎓

### Learning Points
- **Table 1**: Summarised baseline characteristics, providing a clear overview of the study population.
- **Missing Data**: Identified patterns of missingness (~8% per variable), which should be considered in analysis (e.g., imputation strategies).
- **Cross-Sectional Analysis**: Frequentist and Bayesian regression showed similar predictors of baseline BMI, with Bayesian providing uncertainty quantification.
- **Survival Analysis**: Cox and Bayesian survival models highlighted SFA intake as a key predictor of CVD incidence, consistent with the simulated association.
- **Prospective Analysis**: Mixed-effects models confirmed sugar intake’s association with BMI increase, and logistic regression identified predictors of CVD incidence, with Bayesian models offering probabilistic insights.

### Next Steps
- **Advanced Imputation**: Use multiple imputation for missing data to reduce bias.
- **Interaction Terms**: Explore interactions (e.g., age × SFA intake) in survival models.
- **Sensitivity Analysis**: Test the impact of different priors in Bayesian models.
- **Further Outcomes**: Analyse other outcomes, like blood pressure changes over time.
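
As a pointer for the multiple imputation step, estimates from several imputed datasets are usually combined with Rubin’s rules: the pooled estimate is the mean across imputations, and its variance adds the average within-imputation variance to an inflated between-imputation variance. This is a minimal sketch for a single coefficient, with a hypothetical helper `pool_rubin` and made-up inputs:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                  # pooled estimate
    within = u.mean()                 # average within-imputation variance
    between = q.var(ddof=1)           # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, np.sqrt(total)

# Illustrative inputs: the same coefficient estimated on three imputed datasets
pooled, pooled_se = pool_rubin([0.10, 0.12, 0.14], [0.0004, 0.0004, 0.0004])
print(round(pooled, 3), round(pooled_se, 4))
```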

*Keep exploring epidemiological methods to uncover insights in nutrition science! 🥕📉*

---

### Setup Requirements
1. **Install Libraries**:
   ```bash
   source ~/Documents/data-analysis-projects/venv/bin/activate
   pip install numpy pandas matplotlib seaborn scipy pymc arviz scikit-learn lifelines statsmodels
   ```
2. **Environment**: Python 3.9, compatible with Apple Silicon (MPS).
3. **Dataset**: Ensure `data/epidemiological_study.csv` is available (generated by the simulation script).

### Expected Output
- **Table 1**: Descriptive statistics for baseline characteristics.
- **Missing Data Plot**: Heatmap showing missing data patterns.
- **Cross-Sectional Results**: Coefficients from Frequentist and Bayesian regression for baseline BMI.
- **Survival Results**: Hazard ratios (Frequentist) and posterior summaries (Bayesian) for CVD incidence.
- **Prospective Results**: Coefficients for BMI change and CVD incidence from Frequentist and Bayesian models.