# Complete Analysis Workflow: Impute, Fit, and Pool

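This notebook walks through a complete multiple-imputation analysis on the NHANES example data: impute the missing values with MICE, fit a regression model to each imputed dataset, and pool the results with Rubin's rules.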

## Overview

Multiple imputation is not just about filling in missing values; it is about carrying the uncertainty caused by the missing data through to the final statistical inference. This notebook covers:

1. Imputing missing data
2. Fitting statistical models to each imputed dataset
3. Pooling results using Rubin's rules
4. Interpreting pooled estimates 


## 1. Import Libraries and Load Data


In [182]:
import pandas as pd
import numpy as np
import sys
sys.path.append('..')

from imputation.MICE import MICE
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load NHANES dataset
df = pd.read_csv('data/nhanes.csv')
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nMissing values:")
print(df.isnull().sum())


Dataset shape: (25, 4)

First few rows:
   age   bmi  hyp    chl
0    1   NaN  NaN    NaN
1    2  22.7  1.0  187.0
2    1   NaN  1.0  187.0
3    3   NaN  NaN    NaN
4    1  20.4  1.0  113.0

Missing values:
age     0
bmi     9
hyp     8
chl    10
dtype: int64


## 2. Step 1: Impute Missing Data

First, we perform multiple imputation to create complete datasets.


In [183]:
# Initialize MICE and perform imputation
mice = MICE(df)
mice.impute(
    n_imputations=10,  # Use 10 imputations for stable inference
    maxit=10,
    method='pmm'       # Predictive mean matching: imputed values are borrowed from observed donors
)

print(f"Created {len(mice.imputed_datasets)} imputed datasets")
print(f"Each dataset has shape: {mice.imputed_datasets[0].shape}")


Created 10 imputed datasets
Each dataset has shape: (25, 4)
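
To make the `method='pmm'` choice more concrete, here is a minimal sketch of predictive mean matching for a single variable. It is a simplified illustration, not the `imputation.MICE` implementation: the function name `pmm_impute`, its `k` parameter, and the synthetic data are assumptions made for this example, and a full PMM step would also draw the regression coefficients from their posterior before matching.

```python
import numpy as np

def pmm_impute(y, X, k=5, rng=None):
    """Fill NaNs in y by (simplified) predictive mean matching on predictors X."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), X])    # add an intercept column
    obs = ~np.isnan(y)

    # Fit a linear model on the observed rows, then predict for every row.
    beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    pred = X @ beta

    filled = y.copy()
    donors_y, donors_pred = y[obs], pred[obs]
    for i in np.flatnonzero(~obs):
        # The k observed rows whose predictions are closest to this row's prediction.
        nearest = np.argsort(np.abs(donors_pred - pred[i]))[:k]
        # Borrow the observed value of one randomly chosen donor.
        filled[i] = donors_y[rng.choice(nearest)]
    return filled

# Synthetic illustration (hypothetical data, not the NHANES dataset).
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=30)
y[rng.choice(30, size=8, replace=False)] = np.nan
completed = pmm_impute(y, x, k=3, rng=rng)
```

Because every imputed value is copied from an observed case, PMM never produces values outside the observed range, which is one reason it is a common default.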


## 3. Step 2: Fit Statistical Model

Now we fit a regression model to each imputed dataset. We'll model BMI as a function of age, hypertension status, and cholesterol.

The formula syntax follows statsmodels/R conventions: `outcome ~ predictor1 + predictor2 + ...`
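
As a rough picture of what fitting "to each imputed dataset" involves, the sketch below runs ordinary least squares separately on several completed datasets and collects the coefficients and covariance matrices. It uses plain NumPy and synthetic data; the helper `fit_ols` and the generated datasets are assumptions for illustration and are not how `mice.fit` is necessarily implemented.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares: return coefficients and their covariance matrix."""
    X = np.column_stack([np.ones(len(y)), X])        # a formula adds the intercept implicitly
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])   # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, cov

# Synthetic stand-ins for m completed datasets (hypothetical data, not NHANES).
rng = np.random.default_rng(1)
m, n = 5, 25
fits = []
for _ in range(m):
    predictors = rng.normal(size=(n, 3))             # plays the role of age, hyp, chl
    outcome = 1.0 + predictors @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=n)
    fits.append(fit_ols(predictors, outcome))

params = np.array([beta for beta, _ in fits])        # shape (m, 4): one row per dataset
covs = np.array([cov for _, cov in fits])            # shape (m, 4, 4)
```

The two stacked arrays, one set of coefficients and one covariance matrix per dataset, are exactly the inputs that the pooling step needs.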


In [184]:
# Fit regression model: BMI ~ age + hyp + chl
formula = 'bmi ~ age + hyp + chl'
mice.fit(formula)

print(f"Fitted model to {len(mice.model_results)} datasets")
print("\nExample: Results from first imputed dataset:")
print(mice.model_results[0].summary())


Fitted model to 10 datasets

Example: Results from first imputed dataset:
                            OLS Regression Results                            
Dep. Variable:                    bmi   R-squared:                       0.414
Model:                            OLS   Adj. R-squared:                  0.330
Method:                 Least Squares   F-statistic:                     4.937
Date:                Sun, 19 Oct 2025   Prob (F-statistic):            0.00949
Time:                        17:30:37   Log-Likelihood:                -64.625
No. Observations:                  25   AIC:                             137.3
Df Residuals:                      21   BIC:                             142.1
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------

## 4. Step 3: Pool Results Using Rubin's Rules

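Pooling combines the estimates from all imputed datasets into a single set of results. Rubin's rules average the point estimates and build the total variance from two parts: the average within-imputation variance and the between-imputation variance of the estimates. The pooled standard errors therefore reflect both sampling uncertainty and the extra uncertainty due to the missing data.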
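
The arithmetic behind Rubin's rules is short enough to write out. The sketch below is an illustration in plain NumPy, not the body of `mice.pool()`: the function name `pool_rubin` is hypothetical, and the FMI line uses a simple large-sample approximation (the share of total variance attributable to missingness), whereas a library may apply a degrees-of-freedom correction.

```python
import numpy as np

def pool_rubin(params, covs):
    """Pool m sets of estimates (m, p) and covariance matrices (m, p, p)."""
    params = np.asarray(params, dtype=float)
    covs = np.asarray(covs, dtype=float)
    m = params.shape[0]

    pooled = params.mean(axis=0)                       # pooled point estimates
    within = covs.mean(axis=0)                         # average within-imputation covariance
    between = np.atleast_2d(np.cov(params, rowvar=False, ddof=1))  # between-imputation covariance
    total = within + (1 + 1 / m) * between             # total covariance

    # Approximate fraction of missing information per parameter (assumed formula).
    fmi = (1 + 1 / m) * np.diag(between) / np.diag(total)
    return pooled, total, fmi

# Tiny worked example: one parameter, two imputations.
pooled, total, fmi = pool_rubin(params=[[1.0], [3.0]], covs=[[[1.0]], [[1.0]]])
se = np.sqrt(np.diag(total))
```

In the worked example the within-imputation variance is 1 and the between-imputation variance is 2, so the total variance is 1 + (1 + 1/2) × 2 = 4 and the pooled standard error is 2.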


In [185]:
# Pool results
pooled_results = mice.pool()

print("Pooled Results:")
print("="*60)
print("\nParameter estimates:")
for i, param_name in enumerate(pooled_results['parameter_names']):
    estimate = pooled_results['pooled_params'][i]
    se = np.sqrt(pooled_results['pooled_covariance'][i, i])
    fmi = pooled_results['fraction_missing_info'][i]
    print(f"{param_name:12s}: estimate={estimate:7.4f}, SE={se:7.4f}, FMI={fmi:5.3f}")


Pooled Results:
============================================================

Parameter estimates:
Intercept   : estimate=22.5586, SE= 5.0247, FMI=0.369
age         : estimate=-2.5717, SE= 1.3975, FMI=0.334
hyp         : estimate= 1.8681, SE= 2.7983, FMI=0.398
chl         : estimate= 0.0322, SE= 0.0269, FMI=0.497


In [186]:
# Get summary table
summary = mice.pool(summ=True)
print("\nPooled Summary Table:")
print(summary)



Pooled Summary Table:
                         Results: MICE
Dependent variable:            bmi          Scale          15.30
Sample size:                   25           M              10   
----------------------------------------------------------------
           Coef.  Std.Err.    t    P>|t|   [0.025  0.975]  FMI  
----------------------------------------------------------------
Intercept 22.5586   5.0247  4.4896 0.0000 12.7105 32.4067 0.3692
age       -2.5717   1.3975 -1.8402 0.0657 -5.3108  0.1674 0.3342
hyp        1.8681   2.7983  0.6676 0.5044 -3.6164  7.3527 0.3978
chl        0.0322   0.0269  1.1984 0.2308 -0.0205  0.0848 0.4972
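
To interpret the table, it helps to see how the test statistic, p-value, and interval follow from an estimate and its pooled standard error. The sketch below recomputes them for the `age` row using only the standard library. It assumes a normal reference distribution, which appears consistent with the printed values but is not confirmed by the library; the helper name `wald_summary` is hypothetical.

```python
from statistics import NormalDist

def wald_summary(estimate, se, level=0.95):
    """Return the test statistic, two-sided p-value, and confidence interval."""
    z = NormalDist()                          # standard normal reference (assumption)
    stat = estimate / se
    p_value = 2 * (1 - z.cdf(abs(stat)))
    crit = z.inv_cdf(0.5 + level / 2)         # about 1.96 for a 95% interval
    return stat, p_value, (estimate - crit * se, estimate + crit * se)

# Pooled estimate and standard error for `age`, taken from the table above.
stat, p_value, (lower, upper) = wald_summary(-2.5717, 1.3975)
print(f"t={stat:.4f}, p={p_value:.4f}, 95% CI=({lower:.4f}, {upper:.4f})")
```

Read this way, the interval for `age` (-5.3108 to 0.1674) includes zero, and none of the three slopes differs from zero at the 5% level. The FMI values of roughly 0.33 to 0.50 suggest that a substantial share of the information about each coefficient is lost to missingness, which is unsurprising with 25 rows and 8 to 10 missing values per variable.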



## Summary

In this notebook, we completed the full analysis workflow:

1. **Imputed** missing data using MICE
2. **Fitted** regression models using `.fit(formula)`
3. **Pooled** results using `.pool()` with Rubin's rules
4. **Interpreted** pooled estimates and FMI from the summary table

**Key takeaways**:
- Multiple imputation properly accounts for uncertainty from missing data
- The `.fit()` and `.pool()` methods provide a complete workflow for statistical inference
- Pooled results include parameter estimates, standard errors, confidence intervals, and FMI

**Next steps**:
- See `05_diagnostic_plots.ipynb` for visual assessment of imputation quality
- See `03_advanced_parameters.ipynb` for tuning imputation parameters
- See `02_imputation_methods.ipynb` for comparing different imputation methods
