# Handling Missing Data in Nutrition Science 

🧩📊 Welcome to this Jupyter notebook on handling missing data in nutrition science! Missing data is a common challenge in epidemiological studies, especially in nutrition research where variables like dietary intake or biomarkers may be incomplete. In this mini-project, we’ll use the epidemiological dataset from our previous work (n=25,000, age range 45-80) to explore and address missing data. We’ll:

- **Explore missing data patterns** to understand their extent and impact 🕵️
- **Apply common techniques**: listwise deletion, mean/mode imputation, multiple imputation, and a Bayesian approach 🌐
- **Compare their impact** on a simple analysis (linear regression for baseline BMI) 📈

Let’s dive in and learn how to handle missing data effectively in nutrition science!

## Step 1: Load the Dataset and Libraries 📦

Let’s load the epidemiological dataset and the libraries we’ll need for this analysis. The dataset includes variables like age, sex, smoking, physical activity, social class, BMI, blood pressure, sugar intake, SFA intake, and CVD incidence, with ~8% missing data per variable.

In [None]:
# Import libraries for analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import pymc as pm
import arviz as az

# Set seaborn style for clean visuals
sns.set_style("whitegrid")

# Load the dataset
data = pd.read_csv('data/epidemiological_study.csv')

# Display the first few rows
data.head()

## Step 2: Explore Missing Data Patterns 🔎

Let’s start by examining the extent and patterns of missing data in our dataset. This will help us understand which variables are most affected and whether the missingness appears random or systematic.
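One informal way to probe whether missingness looks systematic is to compare an observed variable between rows where another variable is missing and rows where it is present. The sketch below runs on a small synthetic frame: the column names `Age` and `Sugar_Intake` match the dataset, but the values and the age-dependent missingness are invented for illustration, so it demonstrates the check rather than any property of the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Synthetic example data (NOT the study dataset)
toy = pd.DataFrame({
    "Age": rng.uniform(45, 80, n),
    "Sugar_Intake": rng.normal(50, 10, n),
})

# Assumption for the demo: sugar intake is more often missing in participants over 65
p_missing = np.where(toy["Age"] > 65, 0.15, 0.03)
toy.loc[rng.random(n) < p_missing, "Sugar_Intake"] = np.nan

# Compare mean age between rows with and without a sugar intake value
sugar_missing = toy["Sugar_Intake"].isna()
mean_age_missing = toy.loc[sugar_missing, "Age"].mean()
mean_age_observed = toy.loc[~sugar_missing, "Age"].mean()

print(f"Mean age, sugar intake missing:  {mean_age_missing:.1f}")
print(f"Mean age, sugar intake observed: {mean_age_observed:.1f}")
```

A clear gap between the two means suggests that missingness depends on age, which would argue against treating the data as missing completely at random. A gap near zero is compatible with random missingness but does not prove it.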

In [None]:
# Calculate percentage of missing data for each variable
missing_data = data.isna().mean() * 100
missing_summary = pd.DataFrame({
    'Missing (%)': missing_data.round(2)
})

# Display missing data summary
print("Missing Data Summary:")
print(missing_summary[missing_summary['Missing (%)'] > 0])

# Visualize missing data patterns
plt.figure(figsize=(12, 8))
sns.heatmap(data.isna(), cbar=False, cmap='viridis')
plt.title('Missing Data Patterns (Yellow = Missing) 📉')
plt.xlabel('Variables')
plt.ylabel('Participants')
plt.tight_layout()
plt.show()

## Step 3: Prepare the Data for Analysis 🛠️

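We’ll prepare the data for a simple analysis: predicting baseline BMI using age, sex, smoking, physical activity, social class, sugar intake, and SFA intake. First, we select the relevant columns and encode the categorical variables. Missing categorical values are filled at this stage (mode for sex and smoking, a default level for physical activity and social class), so the techniques in the following steps only have to deal with missing numerical values.

In [None]:
# Select relevant columns for analysis
analysis_data = data[['BMI_Baseline', 'Age', 'Sex', 'Smoking', 'Physical_Activity', 'Social_Class', 'Sugar_Intake', 'SFA_Intake']].copy()

# Encode categorical variables
# Fill missing Sex and Smoking with the mode first; otherwise astype(str) turns NaN
# into the string 'nan', which LabelEncoder would treat as a real category
le = LabelEncoder()
for col in ['Sex', 'Smoking']:
    filled = analysis_data[col].fillna(analysis_data[col].mode().iloc[0])
    analysis_data[col] = le.fit_transform(filled.astype(str))

# Ordinal encoding; the np.nan keys assign a default level to missing values
analysis_data['Physical_Activity'] = analysis_data['Physical_Activity'].map({'Low': 0, 'Medium': 1, 'High': 2, np.nan: 0})
analysis_data['Social_Class'] = analysis_data['Social_Class'].map({'A': 1, 'B': 2, 'C1': 3, 'C2': 4, 'D': 5, 'E': 6, np.nan: 3})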

# Define predictors and outcome
X = analysis_data.drop('BMI_Baseline', axis=1)
y = analysis_data['BMI_Baseline']

# Display the first few rows of the prepared data
X.head()

## Step 4: Technique 1 - Listwise Deletion 🚮

The simplest approach to handling missing data is **listwise deletion**, where we remove any row with at least one missing value. This method is easy but can lead to loss of data and potential bias if the missingness is not completely random.
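To get a feel for how much data listwise deletion can discard, the sketch below computes the expected share of complete rows under a simplifying assumption: every variable is missing independently with the same probability. The 8% rate and the 25,000 rows come from the dataset description, while the independence assumption and the helper name `expected_complete_share` are introduced here for illustration only.

```python
def expected_complete_share(p_missing: float, n_vars: int) -> float:
    """Expected share of rows with no missing value, assuming each variable
    is missing independently with probability p_missing."""
    return (1 - p_missing) ** n_vars


n_rows = 25_000  # sample size stated in the introduction

for n_vars in (4, 8):
    share = expected_complete_share(0.08, n_vars)
    print(f"{n_vars} variables: {share:.1%} of rows complete "
          f"(about {round(n_rows * share):,} rows)")
```

Under these assumptions the loss compounds quickly: the more variables enter the model, the fewer rows survive, even when each variable on its own is only slightly incomplete.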

In [None]:
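# Apply listwise deletion: drop rows with a missing predictor or a missing outcome
# (LinearRegression cannot be fitted if y still contains NaN)
complete_cases = X.join(y).dropna()
X_listwise = complete_cases.drop('BMI_Baseline', axis=1)
y_listwise = complete_cases['BMI_Baseline']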

# Check the number of rows remaining
print(f"Original dataset size: {X.shape[0]} rows")
print(f"After listwise deletion: {X_listwise.shape[0]} rows")

# Fit a linear regression model
model_listwise = LinearRegression()
model_listwise.fit(X_listwise, y_listwise)

# Display coefficients
coeffs_listwise = pd.DataFrame({
    'Predictor': X_listwise.columns,
    'Coefficient': model_listwise.coef_
})
print("\nLinear Regression Results (Listwise Deletion):")
print(coeffs_listwise)

## Step 5: Technique 2 - Mean/Mode Imputation 📏

Another common method is **mean/mode imputation**, where we replace missing values with the mean (for numerical variables) or mode (for categorical variables). This preserves the sample size but can underestimate variability.
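The loss of variability can be seen directly on a toy example. The sketch below uses synthetic values (not the study data) with 8% removed completely at random, then compares the standard deviation before and after mean imputation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic "sugar intake" values (illustrative only, not the study data)
sugar = pd.Series(rng.normal(50, 10, 10_000))

# Remove 8% of the values completely at random
sugar_missing = sugar.copy()
sugar_missing.loc[rng.random(len(sugar)) < 0.08] = np.nan

# Replace every missing value with the observed mean
sugar_imputed = sugar_missing.fillna(sugar_missing.mean())

std_observed = sugar_missing.std()   # NaN values are skipped
std_imputed = sugar_imputed.std()

print(f"SD of observed values:    {std_observed:.2f}")
print(f"SD after mean imputation: {std_imputed:.2f}")
```

Every imputed value sits exactly at the mean, so the imputed series is always less spread out than the observed values, and standard errors computed from it will be too small.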

In [None]:
# Create a copy of the data for mean/mode imputation
X_mean_mode = X.copy()

# Impute numerical variables with mean
numerical_vars = ['Age', 'Sugar_Intake', 'SFA_Intake']
X_mean_mode[numerical_vars] = X_mean_mode[numerical_vars].fillna(X_mean_mode[numerical_vars].mean())

# Impute categorical variables with mode
categorical_vars = ['Sex', 'Smoking', 'Physical_Activity', 'Social_Class']
X_mean_mode[categorical_vars] = X_mean_mode[categorical_vars].fillna(X_mean_mode[categorical_vars].mode().iloc[0])

# Impute the outcome variable (BMI_Baseline)
y_mean_mode = y.fillna(y.mean())

# Fit a linear regression model
model_mean_mode = LinearRegression()
model_mean_mode.fit(X_mean_mode, y_mean_mode)

# Display coefficients
coeffs_mean_mode = pd.DataFrame({
    'Predictor': X_mean_mode.columns,
    'Coefficient': model_mean_mode.coef_
})
print("Linear Regression Results (Mean/Mode Imputation):")
print(coeffs_mean_mode)

## Step 6: Technique 3 - Multiple Imputation 🔄

A more sophisticated approach is **multiple imputation**, which creates multiple plausible datasets by imputing missing values, then combines the results. We’ll use `IterativeImputer` from scikit-learn, which models each variable with missing values as a function of the others.
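The "combine" step is commonly done with Rubin's rules: the pooled estimate is the mean across imputed datasets, and its variance adds the average within-imputation variance to the between-imputation variance. The sketch below shows the arithmetic for one coefficient. The helper `pool_rubin` and the input numbers are hypothetical, and since scikit-learn's `LinearRegression` does not report standard errors, the variances would have to come from another source.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: point estimates, one per imputed dataset
    variances: squared standard errors, one per imputed dataset
    Returns the pooled estimate and its pooled standard error.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)

    pooled = estimates.mean()
    within = variances.mean()            # average sampling variance
    between = estimates.var(ddof=1)      # variance caused by the imputation
    total = within + (1 + 1 / m) * between
    return pooled, np.sqrt(total)


# Hypothetical results from m = 5 imputed datasets (made-up numbers)
est = [0.10, 0.12, 0.11, 0.09, 0.13]
var = [0.0004, 0.0004, 0.0005, 0.0004, 0.0005]

pooled_est, pooled_se = pool_rubin(est, var)
print(f"Pooled estimate: {pooled_est:.3f}, pooled SE: {pooled_se:.4f}")
```

The between-imputation term is what a single imputed dataset cannot provide: it reflects how much the estimate changes depending on which plausible values were filled in.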

In [None]:
# Combine predictors and outcome for imputation
combined_data = X.copy()
combined_data['BMI_Baseline'] = y

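# Apply multiple imputation: create several imputed datasets and fit the model on each.
# sample_posterior=True draws imputed values from the predictive distribution,
# so the datasets differ from one another instead of repeating one deterministic fill
n_imputations = 5
coef_draws = []
for i in range(n_imputations):
    imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=11088 + i)
    imputed_data = pd.DataFrame(imputer.fit_transform(combined_data), columns=combined_data.columns)

    # Separate predictors and outcome after imputation
    X_multiple = imputed_data.drop('BMI_Baseline', axis=1)
    y_multiple = imputed_data['BMI_Baseline']

    # Fit a linear regression model on this imputed dataset
    model_multiple = LinearRegression()
    model_multiple.fit(X_multiple, y_multiple)
    coef_draws.append(model_multiple.coef_)

# Pool the point estimates by averaging across the imputed datasets
coeffs_multiple = pd.DataFrame({
    'Predictor': X_multiple.columns,
    'Coefficient': np.mean(coef_draws, axis=0)
})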
print("Linear Regression Results (Multiple Imputation):")
print(coeffs_multiple)

## Step 7: Technique 4 - Bayesian Imputation 🌐

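Finally, let’s use a **Bayesian approach** to impute missing data. We’ll model the data with PyMC, treating missing values as parameters to be estimated, and then use the imputed dataset for regression. When a variable passed as `observed` contains `NaN` entries, PyMC treats those entries as unknowns and samples them along with the other parameters. Note that the model below gives each variable its own independent normal distribution, so the posterior mean of every missing value ends up close to that variable’s observed mean.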

In [None]:
# Create a copy of the data for Bayesian imputation
X_bayesian = X.copy()
y_bayesian = y.copy()

# Identify missing values
missing_mask_X = X_bayesian.isna()
missing_mask_y = y_bayesian.isna()

# Bayesian imputation model
with pm.Model() as bayesian_imputation:
    # Priors for observed data means
    mu_age = pm.Normal('mu_age', mu=60, sigma=10)
    mu_sugar = pm.Normal('mu_sugar', mu=50, sigma=10)
    mu_sfa = pm.Normal('mu_sfa', mu=30, sigma=10)
    mu_bmi = pm.Normal('mu_bmi', mu=27, sigma=5)

    # Priors for standard deviations
    sigma_age = pm.HalfNormal('sigma_age', sigma=5)
    sigma_sugar = pm.HalfNormal('sigma_sugar', sigma=5)
    sigma_sfa = pm.HalfNormal('sigma_sfa', sigma=5)
    sigma_bmi = pm.HalfNormal('sigma_bmi', sigma=2)

    # Impute missing numerical variables
    age_imputed = pm.Normal('age_imputed', mu=mu_age, sigma=sigma_age, shape=X_bayesian.shape[0], observed=X_bayesian['Age'])
    sugar_imputed = pm.Normal('sugar_imputed', mu=mu_sugar, sigma=sigma_sugar, shape=X_bayesian.shape[0], observed=X_bayesian['Sugar_Intake'])
    sfa_imputed = pm.Normal('sfa_imputed', mu=mu_sfa, sigma=sigma_sfa, shape=X_bayesian.shape[0], observed=X_bayesian['SFA_Intake'])
    bmi_imputed = pm.Normal('bmi_imputed', mu=mu_bmi, sigma=sigma_bmi, shape=y_bayesian.shape[0], observed=y_bayesian)

    # Sample from posterior
    trace = pm.sample(1000, tune=1000, random_seed=11088, return_inferencedata=True)

# Extract imputed values (use the mean of the posterior samples)
X_bayesian['Age'] = trace.posterior['age_imputed'].mean(dim=['chain', 'draw']).values
X_bayesian['Sugar_Intake'] = trace.posterior['sugar_imputed'].mean(dim=['chain', 'draw']).values
X_bayesian['SFA_Intake'] = trace.posterior['sfa_imputed'].mean(dim=['chain', 'draw']).values
y_bayesian = trace.posterior['bmi_imputed'].mean(dim=['chain', 'draw']).values

# Categorical variables contain no NaN at this point (missing values were handled during preparation in Step 3)

# Fit a linear regression model
model_bayesian = LinearRegression()
model_bayesian.fit(X_bayesian, y_bayesian)

# Display coefficients
coeffs_bayesian = pd.DataFrame({
    'Predictor': X_bayesian.columns,
    'Coefficient': model_bayesian.coef_
})
print("Linear Regression Results (Bayesian Imputation):")
print(coeffs_bayesian)

## Step 8: Compare the Results 🔍

Let’s compare the regression coefficients from each method to see how the choice of imputation technique affects the results.
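A simple way to summarise such a comparison is the spread of each coefficient across methods (largest minus smallest value). The sketch below applies this to a table shaped like the comparison built in this step. The coefficient values are made up for illustration and are not results from the dataset.

```python
import pandas as pd

# Hypothetical coefficients, shaped like the comparison table (values are made up)
toy_comparison = pd.DataFrame({
    "Predictor": ["Age", "Sugar_Intake"],
    "Listwise Deletion": [0.050, 0.020],
    "Mean/Mode Imputation": [0.045, 0.018],
    "Multiple Imputation": [0.052, 0.021],
    "Bayesian Imputation": [0.046, 0.018],
}).set_index("Predictor")

# Spread of each coefficient across methods (max - min)
spread = toy_comparison.max(axis=1) - toy_comparison.min(axis=1)
print(spread.round(3))
```

Predictors with a large spread relative to the size of their coefficient are the ones whose interpretation depends most on how missing data was handled.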

In [None]:
# Combine coefficients from all methods
comparison = pd.DataFrame({
    'Predictor': X.columns,
    'Listwise Deletion': coeffs_listwise['Coefficient'],
    'Mean/Mode Imputation': coeffs_mean_mode['Coefficient'],
    'Multiple Imputation': coeffs_multiple['Coefficient'],
    'Bayesian Imputation': coeffs_bayesian['Coefficient']
})

# Display the comparison
print("Comparison of Regression Coefficients Across Methods:")
print(comparison)

# Visualize the comparison
comparison.set_index('Predictor').plot(kind='bar', figsize=(12, 6))
plt.title('Comparison of Regression Coefficients by Imputation Method 📊')
plt.ylabel('Coefficient Value')
plt.xlabel('Predictor')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## Step 9: Learning Points and Next Steps 🎓

### Learning Points

- **Missing Data Patterns**: Visualizing and quantifying missing data helps us understand its extent and potential bias. In our dataset, ~8% of values were missing per variable, distributed randomly.
- **Listwise Deletion**: Simple but reduces sample size, because a row is lost as soon as any one of its variables is missing. With ~8% missing in each of several variables, a substantial share of the 25,000 rows can be dropped, and the estimates can be biased if missingness is not completely random.
- **Mean/Mode Imputation**: Preserves sample size but underestimates variability, which can lead to overly confident estimates.
- **Multiple Imputation**: A more robust method that accounts for uncertainty by creating multiple datasets, often yielding more reliable results.
- **Bayesian Imputation**: Treats missing values as parameters, providing a probabilistic approach that captures uncertainty about them. Capturing relationships between variables requires a joint model; the independent normal distributions used in Step 7 do not include any, so the imputed values there stay close to each variable’s mean.
- **Impact on Analysis**: Different methods led to slight variations in regression coefficients, highlighting the importance of choosing an appropriate technique.
 
### Next Steps
 
- **Explore Missingness Mechanisms**: Test if the missing data is Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR).
- **Advanced Bayesian Models**: Use more complex Bayesian models to impute missing data, incorporating relationships between variables (e.g., hierarchical models).
- **Sensitivity Analysis**: Compare results with different imputation methods to assess robustness.
- **Apply to Other Analyses**: Use these techniques in other mini-projects (e.g., epidemiology case study) to improve data quality.

*Keep exploring data handling techniques to ensure robust analyses in nutrition science! 🥕📉*
 
---

### Setup Requirements

1. **Install Libraries**:

   ```bash
   source ~/Documents/data-analysis-toolkit-FNS/venv/bin/activate
   pip install numpy==1.26.4 pandas==2.2.3 matplotlib==3.9.2 seaborn==0.13.2 scipy==1.12.0 pymc==5.16.2 arviz==0.19.0 scikit-learn==1.5.2
   ```
   
2. **Environment**: Python 3.9, compatible with Apple Silicon (MPS).
3. **Dataset**: Ensure `data/epidemiological_study.csv` is available (generated by `create_epi_data.py`).
 
### Expected Output

- **Missing Data Summary**: Table and heatmap showing the extent of missing data.
- **Regression Results**: Coefficients from linear regression using each imputation method.
- **Comparison Plot**: Bar chart comparing coefficients across methods.