## Stroke Work Pipeline - Main Notebook
Author: Daniel Maina Nderitu<br>
Project: MADIVA<br>
Purpose: Stroke Work Analysis<br>

### Run Pipeline

In [10]:
import os

# Set current working directory to where the notebook and run_pipeline.py are
os.chdir("D:/APHRC/GoogleDrive_ii/stata_do_files/madiva/stroke_work")

# Verify
print("Current directory:", os.getcwd())

# Then run the pipeline
%run run_pipeline.py


Current directory: D:\APHRC\GoogleDrive_ii\stata_do_files\madiva\stroke_work
Starting Stroke Analysis Pipeline...

Running: 01_setup_and_imports.ipynb


Executing:   0%|          | 0/9 [00:00<?, ?cell/s]


Running: 02_data_cleaning_and_harmonization.ipynb


Executing:   0%|          | 0/8 [00:00<?, ?cell/s]


Running: 03_preprocessing_and_covariates.ipynb


Executing:   0%|          | 0/25 [00:00<?, ?cell/s]


Running: 04_person_time_and_events.ipynb


Executing:   0%|          | 0/14 [00:00<?, ?cell/s]


Running: 05_exploratory_and_diagnostics.ipynb


Executing:   0%|          | 0/13 [00:00<?, ?cell/s]


Running: 06_models_logistic.ipynb


Executing:   0%|          | 0/15 [00:00<?, ?cell/s]


Running: 07_models_poisson_nb.ipynb


Executing:   0%|          | 0/26 [00:00<?, ?cell/s]


Pipeline completed successfully!
Start time: 2026-01-28 14:06:39.568997
End time  : 2026-01-28 14:07:48.926707
Duration  : 0:01:09.357710
Executed notebooks saved to: D:\APHRC\GoogleDrive_ii\stata_do_files\madiva\stroke_work\notebooks_executed
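
The log above is consistent with a runner that executes each notebook in order and then reports timing. The contents of `run_pipeline.py` are not shown here, so the following is only a minimal sketch under stated assumptions: the function name `run_pipeline`, its parameters, and the injected `execute` callable are hypothetical and are not taken from the project.

```python
from datetime import datetime
from pathlib import Path
from typing import Callable, Iterable, List


def run_pipeline(
    notebooks: Iterable[str],
    execute: Callable[[Path, Path], None],
    notebooks_dir: Path,
    executed_dir: Path,
) -> List[Path]:
    """Run notebooks in order and return the paths of the executed copies.

    `execute` is a hypothetical callable taking (source, destination);
    the real project may run notebooks differently.
    """
    start = datetime.now()
    print("Starting Stroke Analysis Pipeline...")
    outputs = []
    for name in notebooks:
        print(f"Running: {name}")
        src, dst = notebooks_dir / name, executed_dir / name
        execute(src, dst)  # an exception here stops the run at the failing notebook
        outputs.append(dst)
    end = datetime.now()
    print("Pipeline completed successfully!")
    print(f"Start time: {start}")
    print(f"End time  : {end}")
    print(f"Duration  : {end - start}")
    print(f"Executed notebooks saved to: {executed_dir}")
    return outputs
```

Passing the executor in as an argument keeps the ordering and timing logic testable without running real notebooks. If papermill is installed, one possible executor is `lambda src, dst: papermill.execute_notebook(str(src), str(dst))`. The progress-bar format in the log resembles papermill's, but that is an assumption.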


### Notes for the project

<b>Que. 1.</b> Why are we using pickle?
    Pickle is Python-only, but it preserves dtypes exactly and is typically fast to read and write. The other options are CSV (best for interoperability with other tools) and Parquet (well suited to large numeric datasets).
<br><b>Que. 2.</b> Are we saving any datasets and models? Yes, because saved intermediates make the following easier: <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a) model comparison, <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b) reproducibility, <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;c) sensitivity analyses, and <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;d) paper revisions.
<br><b>Que. 3.</b> 
<br><b>Que. 4.</b> 
<br><b>Que. 5.</b> 
<br><b>Que. 6.</b> 
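
On Que. 1, the type-preservation point can be illustrated with a small round trip. This is a standard-library sketch only; the record below is made up for illustration and is not project data.

```python
import csv
import io
import pickle
from datetime import date

# Hypothetical record, for illustration only (not project data).
rows = [{"person_id": 1, "event": True, "entry_date": date(2020, 1, 15)}]

# Pickle round trip: Python types come back unchanged.
from_pickle = pickle.loads(pickle.dumps(rows))

# CSV round trip: every value comes back as a string.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)
from_csv = list(csv.DictReader(buffer))

print(type(from_pickle[0]["entry_date"]).__name__)  # date
print(type(from_csv[0]["entry_date"]).__name__)     # str
```

The same pattern holds for pandas: `DataFrame.to_pickle` and `pd.read_pickle` keep dtypes such as categoricals and datetimes, whereas `pd.read_csv` re-infers types unless they are specified. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.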

In [11]:
from src.utils.helpers import load_paths

paths = load_paths()
print(paths.keys())

dict_keys(['BASE_DIR', 'DATA_DIR', 'OUT_DIR', 'FIG_DIR', 'NOTEBOOKS_DIR', 'NOTEBOOKS_EXECUTED_DIR'])


In [12]:
from src.utils.helpers import load_paths

paths = load_paths()
for k, v in paths.items():
    print(k, ":", v)

BASE_DIR : D:\APHRC\GoogleDrive_ii\stata_do_files\madiva\stroke_work
DATA_DIR : D:\APHRC\GoogleDrive_ii\stata_do_files\madiva\stroke_work\data
OUT_DIR : D:\APHRC\GoogleDrive_ii\stata_do_files\madiva\stroke_work\model_output
FIG_DIR : D:\APHRC\GoogleDrive_ii\stata_do_files\madiva\stroke_work\visualization
NOTEBOOKS_DIR : D:\APHRC\GoogleDrive_ii\stata_do_files\madiva\stroke_work\notebooks
NOTEBOOKS_EXECUTED_DIR : D:\APHRC\GoogleDrive_ii\stata_do_files\madiva\stroke_work\notebooks_executed
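
The source of `load_paths` is not shown, but the output above suggests that every folder hangs off a single base directory. A minimal sketch of that idea follows, assuming the subfolder names printed above. `build_paths` and `SUBDIRS` are hypothetical names, not the project's actual helper.

```python
from pathlib import Path
from typing import Dict, Optional

# Subfolder names copied from the printed output above.
SUBDIRS = {
    "DATA_DIR": "data",
    "OUT_DIR": "model_output",
    "FIG_DIR": "visualization",
    "NOTEBOOKS_DIR": "notebooks",
    "NOTEBOOKS_EXECUTED_DIR": "notebooks_executed",
}


def build_paths(base_dir: Optional[Path] = None) -> Dict[str, Path]:
    """Hypothetical stand-in for load_paths(): derive every folder from one base."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    paths = {"BASE_DIR": base}
    paths.update({key: base / sub for key, sub in SUBDIRS.items()})
    return paths


# Example with a relative base; the default is the current working directory.
example_paths = build_paths(Path("stroke_work"))
for key, value in example_paths.items():
    print(key, ":", value)
```

Deriving all locations from one base means only that base has to change when the project moves to another machine, which reduces reliance on hard-coded absolute paths.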


### Check Outputs

In [13]:
import pandas as pd
import os

# Example: list saved files
print(os.listdir("notebooks_executed"))
print(os.listdir("model_output"))
print(os.listdir("visualization"))

['01_setup_and_imports.ipynb', '02_data_cleaning_and_harmonization.ipynb', '03_preprocessing_and_covariates.ipynb', '04_person_time_and_events.ipynb', '05_exploratory_and_diagnostics.ipynb', '06_models_logistic.ipynb', '07_models_poisson_nb.ipynb']
['.ipynb_checkpoints', 'df_step02_processed.pkl', 'df_step03_processed.pkl', 'df_step04_processed.pkl', 'df_step05_processed.pkl', 'df_step06_processed.pkl', 'df_step07_processed.pkl', 'IRR_results_se.xlsx', 'poisson_model_results.csv', 'poisson_model_results_main.csv', 'poisson_model_results_se.csv', 'statsmodels', 'stroke_model_results_comparison.xlsx', 'stroke_model_results_comparison_main.xlsx', 'stroke_model_results_comparison_se.xlsx', 'X_step06_model_matrix.pkl', 'X_step07_model_matrix.pkl', 'y_step06_event.pkl', 'y_step07_event.pkl']
['condition_combinations_bar_chart.png', 'condition_combinations_bar_chart_with_gender.png', 'missingness_barchart_full_dataset.png', 'missingness_dendrogram_full_dataset.png', 'missingness_dendrogram_fu