# Notebook 6 · Feature Mart & Label Engineering

This notebook consolidates Apple Health and Libre glucose features into reusable daily and event-level datasets that downstream wellness, cardio, strength, and stress recommendations can build on.

**Objectives**
- Reuse existing SQLite tables and processed CSV exports without re-parsing source XML/CSV files
- Load daily Apple Health aggregates and glucose metrics into a unified "feature mart"
- Prepare scaffolding for event-centric features (meals, cardio, strength, stress/readiness)
- Persist interim datasets for downstream modeling notebooks

**Notebook roadmap**
1. Configure environment and helper utilities
2. Inspect SQLite inventory and load core tables
3. Load processed CSV feature exports
4. Build the first pass of the daily feature mart and persist to Parquet
5. Scaffold event-centric feature pipelines and TODOs

In [48]:
# Imports and display defaults
from __future__ import annotations

from pathlib import Path
import sqlite3

import numpy as np
import pandas as pd
from IPython.display import Markdown, display

pd.set_option("display.max_columns", None)
pd.set_option("display.precision", 3)
print("✅ Libraries imported")

✅ Libraries imported


In [49]:
# Project paths and database connection helpers
PROJECT_ROOT = Path("..").resolve()
DATA_DIR = PROJECT_ROOT / "data"
PROCESSED_DIR = DATA_DIR / "processed"
DB_PATH = DATA_DIR / "database" / "health_glucose.db"

assert DB_PATH.exists(), f"SQLite database missing at {DB_PATH}"


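def get_connection(db_path: Path) -> sqlite3.Connection:
    """Create a read-only SQLite connection, falling back to read-write."""
    # as_uri() percent-encodes spaces and other special characters in the path.
    uri = f"{db_path.resolve().as_uri()}?mode=ro"
    try:
        return sqlite3.connect(uri, uri=True)
    except sqlite3.OperationalError:
        print("⚠️ Read-only open failed; falling back to a read-write connection")
        return sqlite3.connect(db_path)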


conn = get_connection(DB_PATH)
print(f"📀 Connected to SQLite at {DB_PATH}")

📀 Connected to SQLite at /Users/george/Library/Mobile Documents/com~apple~CloudDocs/Programming Projects/Apple-Health-DS/data/database/health_glucose.db
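
The read-only intent of `get_connection` can be checked on a throwaway database. The sketch below is an illustration rather than part of the pipeline: it assumes a temporary file and a hypothetical `demo` table, builds the URI with `Path.as_uri()` so that spaces in the path (as in the iCloud path above) are percent-encoded, and confirms that a write attempt is rejected.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(db_path: Path) -> sqlite3.Connection:
    """Open an existing SQLite file read-only via a percent-encoded file: URI."""
    uri = f"{db_path.resolve().as_uri()}?mode=ro"
    return sqlite3.connect(uri, uri=True)


# Throwaway database with a hypothetical table, used only for this demonstration.
demo_path = Path(tempfile.mkdtemp()) / "demo with spaces.db"
writer = sqlite3.connect(demo_path)
writer.execute("CREATE TABLE demo (x INTEGER)")
writer.execute("INSERT INTO demo VALUES (1)")
writer.commit()
writer.close()

ro_conn = open_read_only(demo_path)
row_count = ro_conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]

try:
    ro_conn.execute("INSERT INTO demo VALUES (2)")
    write_blocked = False
except sqlite3.OperationalError:
    # SQLite refuses the write because the connection was opened with mode=ro.
    write_blocked = True
finally:
    ro_conn.close()

print(f"rows={row_count}, write_blocked={write_blocked}")
```

If the read-only open itself fails, the notebook's helper falls back to a normal connection, so a check like this is mainly useful for confirming which mode is actually in effect.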


In [50]:
# Inspect available tables and basic row counts
def list_tables(connection: sqlite3.Connection) -> pd.DataFrame:
    query = "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"
    return pd.read_sql(query, connection)


def table_counts(
    connection: sqlite3.Connection, table_names: list[str]
) -> pd.DataFrame:
    counts = []
    for name in table_names:
        try:
            # Quote the identifier so table names with unusual characters still work.
            row_count = pd.read_sql(
                f'SELECT COUNT(*) AS row_count FROM "{name}";', connection
            )["row_count"].iat[0]
        except Exception as exc:
            row_count = np.nan
            print(f"⚠️ Could not count table {name}: {exc}")
        counts.append({"table": name, "row_count": row_count})
    return pd.DataFrame(counts)


table_inventory = list_tables(conn)
display(Markdown("**SQLite tables available:**"))
display(table_inventory)

core_tables = ["glucose_readings", "apple_health_records", "windowed_glucose_features"]
existing_tables = [t for t in core_tables if t in table_inventory["name"].tolist()]
display(Markdown("**Row counts for core tables:**"))
display(table_counts(conn, existing_tables))

**SQLite tables available:**

Unnamed: 0,name
0,apple_health_records
1,data_quality_log
2,glucose_readings
3,glucose_statistics
4,merged_health_data
5,sqlite_sequence
6,workout_records


**Row counts for core tables:**

Unnamed: 0,table,row_count
0,glucose_readings,4151
1,apple_health_records,114252
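
Because `table_counts` interpolates table names into SQL text, one defensive option is to check each name against `sqlite_master` and quote it before building the query. The following is a minimal sketch under stated assumptions: it uses an in-memory database, a made-up table name containing a space, and hypothetical helper names (`quote_identifier`, `safe_table_counts`) that do not exist elsewhere in this project.

```python
from __future__ import annotations

import sqlite3


def quote_identifier(name: str) -> str:
    """Double-quote an SQLite identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def safe_table_counts(
    connection: sqlite3.Connection, table_names: list[str]
) -> dict[str, int | None]:
    """Count rows for tables that exist; names not in the schema map to None."""
    known = {
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    counts: dict[str, int | None] = {}
    for name in table_names:
        if name not in known:
            counts[name] = None
            continue
        query = f"SELECT COUNT(*) FROM {quote_identifier(name)}"
        counts[name] = connection.execute(query).fetchone()[0]
    return counts


# In-memory demonstration with a hypothetical table name.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute('CREATE TABLE "glucose readings" (value REAL)')
demo_conn.executemany(
    'INSERT INTO "glucose readings" VALUES (?)', [(90.0,), (110.0,)]
)
demo_counts = safe_table_counts(
    demo_conn, ["glucose readings", "windowed_glucose_features"]
)
demo_conn.close()
print(demo_counts)
```

Returning `None` for missing tables mirrors the notebook's behaviour of continuing when `windowed_glucose_features` is absent, without relying on a broad `except`.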


In [51]:
# Load core SQLite tables into DataFrames (glucose tables are subset to the needed columns; Apple Health records are loaded in full)
GLUCOSE_COLUMNS = [
    "timestamp",
    "glucose_value",
    "glucose_range",
    "glucose_rate_change",
    "glucose_trend",
]
WINDOW_COLUMNS = [
    "window_start",
    "window_end",
    "mean_glucose",
    "std_glucose",
    "min_glucose",
    "max_glucose",
]

glucose_query = f"SELECT {', '.join(GLUCOSE_COLUMNS)} FROM glucose_readings;"
glucose_df = pd.read_sql(glucose_query, conn, parse_dates=["timestamp"])
print(f"✅ Loaded glucose_readings: {len(glucose_df):,} rows")

apple_records_df = pd.read_sql("SELECT * FROM apple_health_records;", conn)
date_like_cols = [
    col
    for col in apple_records_df.columns
    if col.lower().endswith("date") or col.lower().endswith("time")
]
for col in date_like_cols:
    apple_records_df[col] = pd.to_datetime(apple_records_df[col], errors="coerce")
print(f"✅ Loaded apple_health_records: {len(apple_records_df):,} rows")

if "windowed_glucose_features" in existing_tables:
    window_query = f"SELECT {', '.join(WINDOW_COLUMNS)} FROM windowed_glucose_features;"
    windowed_glucose_df = pd.read_sql(
        window_query, conn, parse_dates=["window_start", "window_end"]
    )
    print(f"✅ Loaded windowed_glucose_features: {len(windowed_glucose_df):,} rows")
else:
    windowed_glucose_df = pd.DataFrame()
    print("ℹ️ Table windowed_glucose_features not found; continuing without it.")

✅ Loaded glucose_readings: 4,151 rows
✅ Loaded apple_health_records: 114,252 rows
ℹ️ Table windowed_glucose_features not found; continuing without it.


In [52]:
# Load processed CSV feature exports for Apple Health daily metrics
APPLE_DAILY_PATH = PROCESSED_DIR / "apple_health_daily_features.csv"
APPLE_SUMMARY_PATH = PROCESSED_DIR / "apple_health_type_summary.csv"
MERGED_HEALTH_GLUCOSE_PATH = PROCESSED_DIR / "merged_health_glucose.csv"


def safe_read_csv(
    path: Path,
    parse_dates: list[str] | None = None,
    infer_datetime: bool = True,
    **kwargs,
) -> pd.DataFrame:
    if not path.exists():
        print(f"⚠️ Missing CSV: {path.name}")
        return pd.DataFrame()
    df = pd.read_csv(path, **kwargs)
    candidate_cols: set[str] = set(parse_dates or [])
    if infer_datetime:
        inferred = {
            col
            for col in df.columns
            if "timestamp" in col.lower() or col.lower().endswith("date")
        }
        candidate_cols.update(inferred)
    for col in sorted(candidate_cols):
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")
        else:
            print(
                f"⚠️ Column '{col}' not found in {path.name}; skipping datetime parsing.",
            )
    print(f"📄 Loaded {path.name}: {len(df):,} rows")
    return df


apple_daily_df = safe_read_csv(
    APPLE_DAILY_PATH, parse_dates=["event_date"], infer_datetime=True
)
apple_summary_df = safe_read_csv(APPLE_SUMMARY_PATH, infer_datetime=False)
merged_health_glucose_df = safe_read_csv(
    MERGED_HEALTH_GLUCOSE_PATH,
    parse_dates=["glucose_timestamp", "health_timestamp", "creationDate", "endDate"],
    infer_datetime=True,
)

📄 Loaded apple_health_daily_features.csv: 15 rows
📄 Loaded apple_health_type_summary.csv: 40 rows
📄 Loaded merged_health_glucose.csv: 114,252 rows


In [53]:
# Inspect Apple daily and merged health schemas to guide feature engineering
display(Markdown("**Apple daily features schema preview**"))
display(apple_daily_df.head())
print("Apple daily dtypes:")
print(apple_daily_df.dtypes.head())

display(Markdown("**Merged health + glucose schema preview**"))
display(merged_health_glucose_df.head())
print("Merged health + glucose dtypes (subset):")
print(
    merged_health_glucose_df[
        ["glucose_timestamp", "health_timestamp", "creationDate", "endDate"]
    ].dtypes
)

**Apple daily features schema preview**

Unnamed: 0,event_date,mean_ActiveEnergyBurned,mean_AppleExerciseTime,mean_AppleStandTime,mean_AppleWalkingSteadiness,mean_BasalEnergyBurned,mean_BloodGlucose,mean_BodyFatPercentage,mean_BodyMass,mean_BodyMassIndex,mean_DietaryCarbohydrates,mean_DietaryCholesterol,mean_DietaryEnergyConsumed,mean_DietaryFatSaturated,mean_DietaryFatTotal,mean_DietaryFiber,mean_DietaryPotassium,mean_DietaryProtein,mean_DietarySodium,mean_DietarySugar,mean_DietaryWater,mean_DistanceCycling,mean_DistanceWalkingRunning,mean_EnvironmentalAudioExposure,mean_EnvironmentalSoundReduction,mean_FlightsClimbed,mean_HeadphoneAudioExposure,mean_HeartRate,mean_HeartRateVariabilitySDNN,mean_LeanBodyMass,mean_PhysicalEffort,mean_RespiratoryRate,mean_RestingHeartRate,mean_SixMinuteWalkTestDistance,mean_StepCount,mean_VO2Max,mean_WalkingAsymmetryPercentage,mean_WalkingDoubleSupportPercentage,mean_WalkingHeartRateAverage,mean_WalkingSpeed,mean_WalkingStepLength,sum_ActiveEnergyBurned,sum_AppleExerciseTime,sum_AppleStandTime,sum_AppleWalkingSteadiness,sum_BasalEnergyBurned,sum_BloodGlucose,sum_BodyFatPercentage,sum_BodyMass,sum_BodyMassIndex,sum_DietaryCarbohydrates,sum_DietaryCholesterol,sum_DietaryEnergyConsumed,sum_DietaryFatSaturated,sum_DietaryFatTotal,sum_DietaryFiber,sum_DietaryPotassium,sum_DietaryProtein,sum_DietarySodium,sum_DietarySugar,sum_DietaryWater,sum_DistanceCycling,sum_DistanceWalkingRunning,sum_EnvironmentalAudioExposure,sum_EnvironmentalSoundReduction,sum_FlightsClimbed,sum_HeadphoneAudioExposure,sum_HeartRate,sum_HeartRateVariabilitySDNN,sum_LeanBodyMass,sum_PhysicalEffort,sum_RespiratoryRate,sum_RestingHeartRate,sum_SixMinuteWalkTestDistance,sum_StepCount,sum_VO2Max,sum_WalkingAsymmetryPercentage,sum_WalkingDoubleSupportPercentage,sum_WalkingHeartRateAverage,sum_WalkingSpeed,sum_WalkingStepLength,count_ActiveEnergyBurned,count_AppleExerciseTime,count_AppleStandTime,count_AppleWalkingSteadiness,count_BasalEnergyBurned,count_BloodGlucose,count_BodyFatPercentage,count_BodyMass,count_BodyMassIndex,count_DietaryCarbohydrates,count_DietaryCholesterol,count_DietaryEnergyConsumed,count_DietaryFatSaturated,count_DietaryFatTotal,count_DietaryFiber,count_DietaryPotassium,count_DietaryProtein,count_DietarySodium,count_DietarySugar,count_DietaryWater,count_DistanceCycling,count_DistanceWalkingRunning,count_EnvironmentalAudioExposure,count_EnvironmentalSoundReduction,count_FlightsClimbed,count_HeadphoneAudioExposure,count_HeartRate,count_HeartRateVariabilitySDNN,count_LeanBodyMass,count_PhysicalEffort,count_RespiratoryRate,count_RestingHeartRate,count_SixMinuteWalkTestDistance,count_StepCount,count_VO2Max,count_WalkingAsymmetryPercentage,count_WalkingDoubleSupportPercentage,count_WalkingHeartRateAverage,count_WalkingSpeed,count_WalkingStepLength
0,2025-08-16,0.486,1.0,2.132,,0.576,103.898,,,,6.5,79.101,121.143,1.981,6.4,2.695,723.265,19.0,73.824,3.587,,,0.061,60.936,16.526,2.556,71.563,97.021,31.129,,3.753,,,,86.107,,0.015,0.271,,2.877,29.497,917.858,153.0,145.0,,857.592,13299.0,,,,26.0,237.304,848.0,5.943,32.0,8.085,3616.327,95.0,442.946,14.349,,,5.871,974.969,66.102,23.0,1216.569,136216.911,124.516,,1460.1,,,,12055.0,,0.16,5.965,,74.803,766.93,1887.0,153.0,68.0,,1489.0,128.0,,,,4.0,3.0,7.0,3.0,5.0,3.0,5.0,5.0,6.0,4.0,,,97.0,16.0,4.0,9.0,17.0,1404.0,4.0,,389.0,,,,140.0,,11.0,22.0,,26.0,26.0
1,2025-08-17,0.655,1.0,2.348,,2.134,91.795,0.143,153.102,22.654,17.444,122.688,118.9,2.087,8.667,6.771,321.571,8.667,21.079,2.644,0.0,,0.047,50.569,16.856,2.667,76.45,82.268,36.536,131.087,3.85,14.327,58.0,500.0,124.781,,0.001,0.275,79.5,2.78,29.209,869.902,59.0,155.0,,1813.948,13402.0,0.143,153.102,45.309,157.0,245.375,1189.0,14.611,52.0,33.857,1286.285,52.0,147.551,23.793,0.0,,6.973,2022.749,50.569,16.0,229.351,184362.608,328.822,131.087,1594.1,7865.5,58.0,500.0,14225.0,,0.01,5.216,79.5,58.384,613.386,1329.0,59.0,66.0,,850.0,146.0,1.0,1.0,2.0,9.0,2.0,10.0,7.0,6.0,5.0,4.0,6.0,7.0,9.0,1.0,,147.0,40.0,3.0,6.0,3.0,2241.0,9.0,1.0,414.0,549.0,1.0,1.0,114.0,,8.0,19.0,1.0,21.0,21.0
2,2025-08-18,1.038,1.0,2.4,,12.152,,0.145,154.002,22.771,6.0,31.837,91.0,2.31,5.429,3.374,220.722,13.5,303.104,3.349,0.0,,0.067,50.597,13.902,2.222,66.385,70.017,40.048,131.638,3.695,14.126,58.0,,177.919,,0.003,0.278,88.0,2.587,27.804,724.797,33.0,144.0,,1774.17,,0.145,154.002,45.542,48.0,127.35,819.0,13.861,38.0,20.243,1324.335,81.0,2121.73,16.743,0.0,,8.361,1973.279,97.312,20.0,2456.251,120639.921,360.432,131.638,1562.8,10411.0,58.0,,17614.0,,0.06,11.133,88.0,137.102,1473.623,698.0,33.0,60.0,,146.0,,1.0,1.0,2.0,8.0,4.0,9.0,6.0,7.0,6.0,6.0,6.0,7.0,5.0,1.0,,125.0,39.0,7.0,9.0,37.0,1723.0,9.0,1.0,423.0,737.0,1.0,,99.0,,18.0,40.0,1.0,53.0,53.0
3,2025-08-19,0.305,1.0,2.088,,0.549,87.476,0.14,152.201,22.488,7.6,17.668,112.154,2.138,8.444,2.373,275.902,14.0,165.292,4.173,0.0,0.003,0.042,51.627,16.488,3.0,67.553,84.843,37.852,130.778,3.511,13.54,61.0,,85.54,,0.008,0.275,90.0,2.792,29.215,1132.204,260.0,190.0,,1813.203,3674.0,0.14,152.201,44.976,76.0,70.67,1458.0,17.108,76.0,16.612,1655.415,126.0,1652.915,25.038,0.0,7.225,8.424,1961.832,115.417,15.0,6282.395,259790.195,302.82,130.778,1790.8,12443.0,61.0,,16937.0,,0.12,7.141,90.0,80.955,847.245,3711.0,260.0,91.0,,3305.0,42.0,1.0,1.0,2.0,10.0,4.0,13.0,8.0,9.0,7.0,6.0,9.0,10.0,6.0,1.0,2218.0,199.0,38.0,7.0,5.0,93.0,3062.0,8.0,1.0,510.0,919.0,1.0,,198.0,,15.0,26.0,1.0,29.0,29.0
4,2025-08-20,0.837,1.0,2.19,,7.024,88.131,0.142,152.401,22.553,17.75,26.177,133.35,1.992,5.181,3.158,111.281,7.389,73.605,4.326,0.0,,0.041,51.834,16.82,3.25,68.212,70.13,48.037,130.712,3.824,13.77,57.5,,152.924,,0.031,0.271,77.0,2.858,29.528,765.855,35.0,138.0,,1833.229,20799.0,0.142,152.401,45.106,355.0,78.53,2667.0,15.936,82.9,31.575,667.686,140.4,662.444,43.258,0.0,,7.607,2177.008,33.64,39.0,1705.294,142925.415,432.329,130.712,1690.4,7504.5,115.0,,16057.0,,0.4,5.421,77.0,65.744,679.134,915.0,35.0,63.0,,261.0,236.0,1.0,1.0,2.0,20.0,3.0,20.0,8.0,16.0,10.0,6.0,19.0,9.0,10.0,1.0,,185.0,42.0,2.0,12.0,25.0,2038.0,9.0,1.0,442.0,545.0,2.0,,105.0,,13.0,20.0,1.0,23.0,23.0


Apple daily dtypes:
event_date                     datetime64[ns]
mean_ActiveEnergyBurned               float64
mean_AppleExerciseTime                float64
mean_AppleStandTime                   float64
mean_AppleWalkingSteadiness           float64
dtype: object


**Merged health + glucose schema preview**

Unnamed: 0,type,sourceName,value,unit,creationDate,health_timestamp,endDate,device,serial_number,glucose_timestamp,record_type,glucose_mg_dl,scan_glucose_mg_dl,Non-numeric Rapid-Acting Insulin,rapid_acting_insulin_units,Non-numeric Food,carbohydrates_grams,Carbohydrates (servings),Non-numeric Long-Acting Insulin,long_acting_insulin_units,notes,strip_glucose_mg_dl,ketone_mmol_l,Meal Insulin (units),Correction Insulin (units),User Change Insulin (units),glucose_value,glucose_source,glucose_rate_change,glucose_trend,hour,day_of_week,is_weekend,glucose_range,time_diff_minutes,is_night,is_morning,is_afternoon,is_evening,likely_meal_time
0,HKQuantityTypeIdentifierHeartRate,Zepp,78.0,count/min,2025-08-17 00:48:15,2025-08-16 13:20:00,2025-08-16 13:20:59,FreeStyle Libre 3,03EE5610-75AC-4A63-860F-A3A70A27AD2F,2025-08-16 13:20:00,0,125.0,,,,,,,,,,,,,,,125.0,historic,,stable,13,5,True,normal,0.0,False,False,True,False,True
1,HKQuantityTypeIdentifierAppleStandTime,George’s Apple Watch,3.0,min,2025-08-16 13:25:35,2025-08-16 13:20:00,2025-08-16 13:25:00,FreeStyle Libre 3,03EE5610-75AC-4A63-860F-A3A70A27AD2F,2025-08-16 13:20:00,0,125.0,,,,,,,,,,,,,,,125.0,historic,,stable,13,5,True,normal,0.0,False,False,True,False,True
2,HKQuantityTypeIdentifierActiveEnergyBurned,Zepp,1.0,Cal,2025-09-19 22:50:35,2025-08-16 13:20:00,2025-08-16 13:29:59,FreeStyle Libre 3,03EE5610-75AC-4A63-860F-A3A70A27AD2F,2025-08-16 13:20:00,0,125.0,,,,,,,,,,,,,,,125.0,historic,,stable,13,5,True,normal,0.0,False,False,True,False,True
3,HKQuantityTypeIdentifierBloodGlucose,Bevel,125.0,mg/dL,2025-08-17 00:38:00,2025-08-16 13:20:01,2025-08-16 13:20:01,FreeStyle Libre 3,03EE5610-75AC-4A63-860F-A3A70A27AD2F,2025-08-16 13:20:00,0,125.0,,,,,,,,,,,,,,,125.0,historic,,stable,13,5,True,normal,0.017,False,False,True,False,True
4,HKQuantityTypeIdentifierActiveEnergyBurned,George’s Apple Watch,0.465,Cal,2025-08-16 13:21:13,2025-08-16 13:20:12,2025-08-16 13:20:53,FreeStyle Libre 3,03EE5610-75AC-4A63-860F-A3A70A27AD2F,2025-08-16 13:20:00,0,125.0,,,,,,,,,,,,,,,125.0,historic,,stable,13,5,True,normal,0.2,False,False,True,False,True


Merged health + glucose dtypes (subset):
glucose_timestamp    datetime64[ns]
health_timestamp     datetime64[ns]
creationDate         datetime64[ns]
endDate              datetime64[ns]
dtype: object


In [54]:
merged_health_glucose_df.head()

Unnamed: 0,type,sourceName,value,unit,creationDate,health_timestamp,endDate,device,serial_number,glucose_timestamp,record_type,glucose_mg_dl,scan_glucose_mg_dl,Non-numeric Rapid-Acting Insulin,rapid_acting_insulin_units,Non-numeric Food,carbohydrates_grams,Carbohydrates (servings),Non-numeric Long-Acting Insulin,long_acting_insulin_units,notes,strip_glucose_mg_dl,ketone_mmol_l,Meal Insulin (units),Correction Insulin (units),User Change Insulin (units),glucose_value,glucose_source,glucose_rate_change,glucose_trend,hour,day_of_week,is_weekend,glucose_range,time_diff_minutes,is_night,is_morning,is_afternoon,is_evening,likely_meal_time
0,HKQuantityTypeIdentifierHeartRate,Zepp,78.0,count/min,2025-08-17 00:48:15,2025-08-16 13:20:00,2025-08-16 13:20:59,FreeStyle Libre 3,03EE5610-75AC-4A63-860F-A3A70A27AD2F,2025-08-16 13:20:00,0,125.0,,,,,,,,,,,,,,,125.0,historic,,stable,13,5,True,normal,0.0,False,False,True,False,True
1,HKQuantityTypeIdentifierAppleStandTime,George’s Apple Watch,3.0,min,2025-08-16 13:25:35,2025-08-16 13:20:00,2025-08-16 13:25:00,FreeStyle Libre 3,03EE5610-75AC-4A63-860F-A3A70A27AD2F,2025-08-16 13:20:00,0,125.0,,,,,,,,,,,,,,,125.0,historic,,stable,13,5,True,normal,0.0,False,False,True,False,True
2,HKQuantityTypeIdentifierActiveEnergyBurned,Zepp,1.0,Cal,2025-09-19 22:50:35,2025-08-16 13:20:00,2025-08-16 13:29:59,FreeStyle Libre 3,03EE5610-75AC-4A63-860F-A3A70A27AD2F,2025-08-16 13:20:00,0,125.0,,,,,,,,,,,,,,,125.0,historic,,stable,13,5,True,normal,0.0,False,False,True,False,True
3,HKQuantityTypeIdentifierBloodGlucose,Bevel,125.0,mg/dL,2025-08-17 00:38:00,2025-08-16 13:20:01,2025-08-16 13:20:01,FreeStyle Libre 3,03EE5610-75AC-4A63-860F-A3A70A27AD2F,2025-08-16 13:20:00,0,125.0,,,,,,,,,,,,,,,125.0,historic,,stable,13,5,True,normal,0.017,False,False,True,False,True
4,HKQuantityTypeIdentifierActiveEnergyBurned,George’s Apple Watch,0.465,Cal,2025-08-16 13:21:13,2025-08-16 13:20:12,2025-08-16 13:20:53,FreeStyle Libre 3,03EE5610-75AC-4A63-860F-A3A70A27AD2F,2025-08-16 13:20:00,0,125.0,,,,,,,,,,,,,,,125.0,historic,,stable,13,5,True,normal,0.2,False,False,True,False,True


In [55]:
def compute_sleep_features(apple_records: pd.DataFrame) -> pd.DataFrame:
    """Aggregate Apple Sleep Analysis records into daily sleep metrics."""
    columns = [
        "event_date",
        "sleep_duration_hours",
        "sleep_asleep_hours",
        "sleep_inbed_hours",
        "sleep_sessions",
        "sleep_efficiency_pct",
        "sleep_onset_time",
        "sleep_wake_time",
        "sleep_midpoint_time",
        "sleep_latency_minutes",
        "sleep_restful_ratio_pct",
    ]
    if apple_records.empty or "type" not in apple_records.columns:
        return pd.DataFrame(columns=columns)
    sleep_df = apple_records[
        apple_records["type"].str.contains("sleep", case=False, na=False)
    ].copy()
    if sleep_df.empty:
        return pd.DataFrame(columns=columns)

    def _resolve_column(df: pd.DataFrame, candidates: list[str]) -> str | None:
        lower_map = {col.lower(): col for col in df.columns}
        for candidate in candidates:
            if candidate in df.columns:
                return candidate
            lowered = candidate.lower()
            if lowered in lower_map:
                return lower_map[lowered]
        for col in df.columns:
            if any(candidate.lower() in col.lower() for candidate in candidates):
                return col
        return None

    start_col = _resolve_column(
        sleep_df, ["start_date", "startDate", "start_time", "startTime"]
    )
    end_col = _resolve_column(sleep_df, ["end_date", "endDate", "end_time", "endTime"])
    if start_col is None or end_col is None:
        return pd.DataFrame(columns=columns)

    sleep_df["start_ts"] = pd.to_datetime(sleep_df[start_col], errors="coerce")
    sleep_df["end_ts"] = pd.to_datetime(sleep_df[end_col], errors="coerce")
    sleep_df = sleep_df.dropna(subset=["start_ts", "end_ts"]).sort_values("start_ts")
    if sleep_df.empty:
        return pd.DataFrame(columns=columns)

    value_col = _resolve_column(
        sleep_df, ["value", "sleep_state", "sleepStage", "sleep_stage"]
    )
    if value_col:
        sleep_df["value_str"] = sleep_df[value_col].astype(str).str.lower()
    else:
        sleep_df["value_str"] = ""

    sleep_df["duration_hours"] = (
        sleep_df["end_ts"] - sleep_df["start_ts"]
    ).dt.total_seconds() / 3600
    sleep_df["event_date"] = sleep_df["end_ts"].dt.floor("D")

    def _summarize(group: pd.DataFrame) -> pd.Series:
        event_date = pd.Timestamp(group.name).floor("D")
        asleep_mask = group["value_str"].str.contains(
            "asleep|core|deep|rem", regex=True, na=False
        )
        inbed_mask = group["value_str"].str.contains(
            "inbed|in bed|bed", regex=True, na=False
        )
        restful_mask = group["value_str"].str.contains("deep|rem", regex=True, na=False)
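        # Raw sums: overlapping records (several sources, or in-bed plus stage
        # segments) are counted more than once, so totals can exceed 24 hours.
        total_hours = group["duration_hours"].sum()
        asleep_hours = group.loc[asleep_mask, "duration_hours"].sum()
        inbed_hours = group.loc[inbed_mask, "duration_hours"].sum()
        # Fall back to the total when no record carries the label; if both
        # fallbacks trigger, sleep_efficiency_pct evaluates to 100.
        if asleep_hours == 0:
            asleep_hours = total_hours
        if inbed_hours == 0:
            inbed_hours = total_hours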
        sleep_onset = group["start_ts"].min()
        wake_time = group["end_ts"].max()
        midpoint = (
            sleep_onset + (wake_time - sleep_onset) / 2
            if pd.notna(sleep_onset) and pd.notna(wake_time)
            else pd.NaT
        )
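        # Despite the column name, this is the offset in minutes from midnight of
        # event_date to the first sleep record, not the time taken to fall asleep.
        sleep_latency = (
            (sleep_onset - event_date).total_seconds() / 60
            if pd.notna(sleep_onset)
            else np.nan
        )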
        restful_hours = group.loc[restful_mask, "duration_hours"].sum()
        return pd.Series(
            {
                "event_date": event_date,
                "sleep_duration_hours": total_hours,
                "sleep_asleep_hours": asleep_hours,
                "sleep_inbed_hours": inbed_hours,
                "sleep_sessions": group.shape[0],
                "sleep_efficiency_pct": (
                    (asleep_hours / inbed_hours * 100) if inbed_hours else np.nan
                ),
                "sleep_onset_time": sleep_onset,
                "sleep_wake_time": wake_time,
                "sleep_midpoint_time": midpoint,
                "sleep_latency_minutes": sleep_latency,
                "sleep_restful_ratio_pct": (
                    (restful_hours / asleep_hours * 100) if asleep_hours else np.nan
                ),
            }
        )

    summary = (
        sleep_df.groupby("event_date", group_keys=False)
        .apply(_summarize, include_groups=False)
        .reset_index(drop=True)
    )
    summary["event_date"] = pd.to_datetime(summary["event_date"]).dt.tz_localize(None)
    return summary[columns]


def _add_baseline_columns(
    df: pd.DataFrame, column: str, baseline_col: str, delta_col: str, window: int
) -> None:
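    if column in df.columns:
        # The rolling window includes the current day, so each delta is measured
        # against a mean that already contains the value being compared.
        df[baseline_col] = df[column].rolling(window=window, min_periods=3).mean()
        df[delta_col] = df[column] - df[baseline_col]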


def add_trailing_baselines(daily_df: pd.DataFrame) -> pd.DataFrame:
    """Compute trailing baselines and deltas for key wellness metrics."""
    if daily_df.empty or "date" not in daily_df.columns:
        return daily_df
    df = daily_df.sort_values("date").copy()
    _add_baseline_columns(
        df,
        "mean_RestingHeartRate",
        "resting_hr_baseline_14d",
        "resting_hr_delta",
        window=14,
    )
    _add_baseline_columns(
        df, "mean_HeartRateVariabilitySDNN", "hrv_baseline_14d", "hrv_delta", window=14
    )
    _add_baseline_columns(
        df, "mean_VO2Max", "vo2max_baseline_28d", "vo2max_delta", window=28
    )
    _add_baseline_columns(
        df,
        "mean_PhysicalEffort",
        "physical_effort_baseline_14d",
        "physical_effort_delta",
        window=14,
    )
    _add_baseline_columns(
        df,
        "sleep_duration_hours",
        "sleep_duration_baseline_14d",
        "sleep_duration_delta",
        window=14,
    )
    return df


def _normalize_series(series: pd.Series) -> pd.Series:
    if series.empty or series.dropna().empty:
        return pd.Series(np.nan, index=series.index)
    std = series.std(ddof=0)
    if std == 0 or np.isnan(std):
        return pd.Series(0.0, index=series.index)
    return (series - series.mean()) / std


def compute_readiness_components(daily_df: pd.DataFrame) -> pd.DataFrame:
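    """Derive a composite readiness score and a categorical risk level.

    Components are z-scored over the whole frame; a higher score means better
    readiness and therefore a lower risk level.
    """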
    if daily_df.empty or "date" not in daily_df.columns:
        return daily_df
    df = daily_df.copy()
    component_weights: list[tuple[str, float]] = []
    df["readiness_component_hrv"] = np.nan
    df["readiness_component_rhr"] = np.nan
    df["readiness_component_sleep"] = np.nan
    if "hrv_delta" in df.columns:
        df["readiness_component_hrv"] = _normalize_series(df["hrv_delta"])
        component_weights.append(("readiness_component_hrv", 0.4))
    if "resting_hr_delta" in df.columns:
        df["readiness_component_rhr"] = -_normalize_series(df["resting_hr_delta"])
        component_weights.append(("readiness_component_rhr", 0.3))
    sleep_basis = None
    if (
        "sleep_efficiency_pct" in df.columns
        and df["sleep_efficiency_pct"].notna().any()
    ):
        sleep_basis = df["sleep_efficiency_pct"]
    elif "sleep_duration_delta" in df.columns:
        sleep_basis = df["sleep_duration_delta"]
    if sleep_basis is not None:
        df["readiness_component_sleep"] = _normalize_series(sleep_basis)
        component_weights.append(("readiness_component_sleep", 0.3))
    if component_weights:
        total_weight = sum(weight for _, weight in component_weights)
        # NaN in any component propagates, so days missing one input get no score.
        weighted_sum = sum(df[col] * weight for col, weight in component_weights)
        df["readiness_risk_score"] = weighted_sum / total_weight
    else:
        df["readiness_risk_score"] = np.nan
    # Higher scores mean better readiness, so the lowest bin maps to "high" risk.
    df["readiness_risk_level"] = pd.cut(
        df["readiness_risk_score"],
        bins=[-np.inf, -0.5, 0.5, np.inf],
        labels=["high", "moderate", "low"],
    )
    return df
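
`_add_baseline_columns` computes its rolling mean over a window that includes the current day. If a baseline built strictly from prior days is preferred, one option is to shift the series by one row before rolling. The sketch below is a variant for comparison, not the notebook's method; the helper name `add_prior_baseline` and the resting heart rate values are made up for illustration.

```python
import pandas as pd


def add_prior_baseline(
    df: pd.DataFrame, column: str, window: int, min_periods: int = 3
) -> pd.DataFrame:
    """Add a baseline from the previous `window` rows and the delta against it."""
    out = df.copy()
    # shift(1) drops the current row from its own baseline.
    prior = out[column].shift(1)
    out[f"{column}_baseline"] = prior.rolling(
        window=window, min_periods=min_periods
    ).mean()
    out[f"{column}_delta"] = out[column] - out[f"{column}_baseline"]
    return out


# Toy daily series (values are illustrative, not taken from the dataset).
toy = pd.DataFrame(
    {
        "date": pd.date_range("2025-01-01", periods=5, freq="D"),
        "resting_hr": [58.0, 60.0, 59.0, 57.0, 66.0],
    }
)
result = add_prior_baseline(toy, "resting_hr", window=3)
print(result)
```

With `min_periods=3`, the first three rows have no baseline because fewer than three prior values exist; the fourth row is compared against the mean of the first three days only.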

In [61]:
# Enrich Apple daily aggregates with derived sleep features (if available)
sleep_features_df = compute_sleep_features(apple_records_df)
if not sleep_features_df.empty:
    apple_daily_df = apple_daily_df.merge(
        sleep_features_df, on="event_date", how="left"
    )
else:
    for col in [
        "sleep_duration_hours",
        "sleep_asleep_hours",
        "sleep_inbed_hours",
        "sleep_sessions",
        "sleep_efficiency_pct",
    ]:
        if col not in apple_daily_df.columns:
            apple_daily_df[col] = np.nan
apple_daily_df = apple_daily_df.sort_values("event_date").reset_index(drop=True)
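
Summing raw sleep record durations can count the same minutes more than once when several sources or overlapping stage records cover the same night; the daily mart output in this notebook reports a `sleep_duration_hours` of 49.043 for a single day. A minimal sketch of merging overlapping intervals before summing is shown here, assuming made-up timestamps and a hypothetical helper name (`merged_duration_hours`). It uses only the standard library so the idea is visible independently of pandas.

```python
from datetime import datetime


def merged_duration_hours(intervals: list[tuple[datetime, datetime]]) -> float:
    """Return the hours covered by the union of (start, end) intervals."""
    total_seconds = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the running block and start a new one.
            if current_end is not None:
                total_seconds += (current_end - current_start).total_seconds()
            current_start, current_end = start, end
        else:
            # Overlapping or touching interval: extend the running block.
            current_end = max(current_end, end)
    if current_end is not None:
        total_seconds += (current_end - current_start).total_seconds()
    return total_seconds / 3600


# Illustrative records for one night plus a nap (not real data).
night = [
    (datetime(2025, 8, 26, 0, 0), datetime(2025, 8, 26, 7, 0)),     # in bed
    (datetime(2025, 8, 26, 0, 30), datetime(2025, 8, 26, 3, 0)),    # asleep
    (datetime(2025, 8, 26, 3, 0), datetime(2025, 8, 26, 6, 30)),    # asleep
    (datetime(2025, 8, 26, 13, 0), datetime(2025, 8, 26, 13, 30)),  # nap
]
naive_hours = sum((end - start).total_seconds() for start, end in night) / 3600
union_hours = merged_duration_hours(night)
print(f"naive sum: {naive_hours} h, merged: {union_hours} h")
```

In practice the same merge would likely be applied separately to in-bed and asleep records, so that efficiency is computed from two de-duplicated totals.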

In [57]:
# Derive glucose daily aggregates directly from the readings DataFrame
glucose_df["timestamp"] = pd.to_datetime(glucose_df["timestamp"]).dt.tz_localize(None)
glucose_df["glucose_rate_change"] = pd.to_numeric(
    glucose_df["glucose_rate_change"], errors="coerce"
)
glucose_df["glucose_range"] = glucose_df["glucose_range"].astype("string").str.lower()

glucose_daily_df = (
    glucose_df.assign(date=glucose_df["timestamp"].dt.floor("D"))
    .groupby("date")
    .agg(
        glucose_mean_mg_dl=("glucose_value", "mean"),
        glucose_median_mg_dl=("glucose_value", "median"),
        glucose_min_mg_dl=("glucose_value", "min"),
        glucose_max_mg_dl=("glucose_value", "max"),
        glucose_std_mg_dl=("glucose_value", "std"),
        glucose_rate_change_avg=("glucose_rate_change", "mean"),
        readings_count=("glucose_value", "count"),
    )
    .reset_index()
)

tir_counts = (
    glucose_df.assign(date=glucose_df["timestamp"].dt.floor("D"))
    .dropna(subset=["glucose_range"])
    .groupby(["date", "glucose_range"])
    .size()
    .unstack(fill_value=0)
)
for col in ["normal", "low", "high", "very_low", "very_high"]:
    if col not in tir_counts.columns:
        tir_counts[col] = 0
tir_pct = tir_counts.div(tir_counts.sum(axis=1), axis=0).mul(100)
tir_pct = tir_pct.rename(
    columns={
        "normal": "tir_normal_pct",
        "low": "tir_low_pct",
        "high": "tir_high_pct",
        "very_low": "tir_very_low_pct",
        "very_high": "tir_very_high_pct",
    }
)

glucose_daily_df = glucose_daily_df.merge(
    tir_pct[[col for col in tir_pct.columns if col.startswith("tir_")]],
    left_on="date",
    right_index=True,
    how="left",
)
glucose_daily_df["date"] = pd.to_datetime(glucose_daily_df["date"])
print(f"📈 Glucose daily aggregates generated: {len(glucose_daily_df):,} days")
display(glucose_daily_df.tail())

📈 Glucose daily aggregates generated: 15 days


Unnamed: 0,date,glucose_mean_mg_dl,glucose_median_mg_dl,glucose_min_mg_dl,glucose_max_mg_dl,glucose_std_mg_dl,glucose_rate_change_avg,readings_count,tir_low_pct,tir_normal_pct,tir_very_low_pct,tir_high_pct,tir_very_high_pct
10,2025-08-26,84.608,81.0,69.0,126.0,10.576,-0.064,291,0.344,99.656,0.0,0.0,0.0
11,2025-08-27,90.211,87.0,67.0,138.0,15.852,-0.006,289,0.346,99.654,0.0,0.0,0.0
12,2025-08-28,91.212,91.0,73.0,127.0,9.617,-0.007,288,0.0,100.0,0.0,0.0,0.0
13,2025-08-29,91.764,90.0,59.0,132.0,9.679,-0.032,292,1.712,98.288,0.0,0.0,0.0
14,2025-08-30,95.87,92.0,64.0,124.0,9.49,-0.031,215,0.465,99.535,0.0,0.0,0.0
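
The time-in-range percentages above rely on the precomputed `glucose_range` labels stored in the database. When those labels are unavailable, they can be derived from `glucose_value` directly. The sketch below assumes integer mg/dL readings and the commonly used CGM cut-offs (below 54, 54 to 69, 70 to 180, 181 to 250, above 250); whether the upstream pipeline used the same thresholds is not confirmed by this notebook, and `time_in_range_pct` is a hypothetical helper.

```python
import pandas as pd

# Assumed cut-offs for integer mg/dL readings; bins are closed on the left.
RANGE_BINS = [-float("inf"), 54, 70, 181, 251, float("inf")]
RANGE_LABELS = ["very_low", "low", "normal", "high", "very_high"]


def time_in_range_pct(readings: pd.DataFrame) -> pd.DataFrame:
    """Return per-day percentages of readings in each glucose range."""
    labels = pd.cut(
        readings["glucose_value"],
        bins=RANGE_BINS,
        labels=RANGE_LABELS,
        right=False,
    )
    counts = (
        readings.assign(
            date=readings["timestamp"].dt.floor("D"),
            glucose_range=labels.astype(object),
        )
        .dropna(subset=["glucose_range"])
        .groupby(["date", "glucose_range"])
        .size()
        .unstack(fill_value=0)
        # Guarantee one column per range, even when a range never occurs.
        .reindex(columns=RANGE_LABELS, fill_value=0)
    )
    pct = counts.div(counts.sum(axis=1), axis=0).mul(100)
    pct.columns = [f"tir_{label}_pct" for label in RANGE_LABELS]
    return pct


# Toy readings for a single day (illustrative values only).
toy_readings = pd.DataFrame(
    {
        "timestamp": pd.date_range("2025-08-26", periods=8, freq="60min"),
        "glucose_value": [50, 65, 100, 120, 180, 181, 260, 90],
    }
)
tir = time_in_range_pct(toy_readings)
print(tir)
```

Reindexing against `RANGE_LABELS` plays the same role as the notebook's loop that adds missing range columns with zeros.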


In [58]:
# Build first-pass daily feature mart merging Apple Health and glucose metrics
def normalize_daily_columns(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    if df.empty:
        return df
    if "date" not in df.columns:
        for candidate in [
            "event_date",
            "day",
            "start_date",
            "end_date",
            "creationDate",
        ]:
            if candidate in df.columns:
                df["date"] = pd.to_datetime(df[candidate], errors="coerce")
                break
    if "date" not in df.columns:
        raise ValueError(
            "Could not infer a 'date' column from the daily features DataFrame."
        )
    date_series = pd.to_datetime(df["date"], errors="coerce")
    if getattr(date_series.dt, "tz", None) is not None:
        date_series = date_series.dt.tz_localize(None)
    df["date"] = date_series
    df = df.drop_duplicates(subset=["date"]).sort_values("date").reset_index(drop=True)
    return df


apple_daily_clean = (
    normalize_daily_columns(apple_daily_df)
    if not apple_daily_df.empty
    else pd.DataFrame()
)
if not apple_daily_clean.empty:
    apple_daily_clean = add_trailing_baselines(apple_daily_clean)
    apple_daily_clean = compute_readiness_components(apple_daily_clean)

if apple_daily_clean.empty and glucose_daily_df.empty:
    daily_feature_mart = pd.DataFrame()
elif apple_daily_clean.empty:
    daily_feature_mart = glucose_daily_df.copy()
elif glucose_daily_df.empty:
    daily_feature_mart = apple_daily_clean.copy()
else:
    daily_feature_mart = (
        apple_daily_clean.merge(
            glucose_daily_df, on="date", how="outer", suffixes=("_apple", "_glucose")
        )
        .sort_values("date")
        .reset_index(drop=True)
    )

print(f"🧩 Daily feature mart shape: {daily_feature_mart.shape}")
display(daily_feature_mart.tail())

🧩 Daily feature mart shape: (15, 159)


Unnamed: 0,event_date,mean_ActiveEnergyBurned,mean_AppleExerciseTime,mean_AppleStandTime,mean_AppleWalkingSteadiness,mean_BasalEnergyBurned,mean_BloodGlucose,mean_BodyFatPercentage,mean_BodyMass,mean_BodyMassIndex,mean_DietaryCarbohydrates,mean_DietaryCholesterol,mean_DietaryEnergyConsumed,mean_DietaryFatSaturated,mean_DietaryFatTotal,mean_DietaryFiber,mean_DietaryPotassium,mean_DietaryProtein,mean_DietarySodium,mean_DietarySugar,mean_DietaryWater,mean_DistanceCycling,mean_DistanceWalkingRunning,mean_EnvironmentalAudioExposure,mean_EnvironmentalSoundReduction,mean_FlightsClimbed,mean_HeadphoneAudioExposure,mean_HeartRate,mean_HeartRateVariabilitySDNN,mean_LeanBodyMass,mean_PhysicalEffort,mean_RespiratoryRate,mean_RestingHeartRate,mean_SixMinuteWalkTestDistance,mean_StepCount,mean_VO2Max,mean_WalkingAsymmetryPercentage,mean_WalkingDoubleSupportPercentage,mean_WalkingHeartRateAverage,mean_WalkingSpeed,mean_WalkingStepLength,sum_ActiveEnergyBurned,sum_AppleExerciseTime,sum_AppleStandTime,sum_AppleWalkingSteadiness,sum_BasalEnergyBurned,sum_BloodGlucose,sum_BodyFatPercentage,sum_BodyMass,sum_BodyMassIndex,sum_DietaryCarbohydrates,sum_DietaryCholesterol,sum_DietaryEnergyConsumed,sum_DietaryFatSaturated,sum_DietaryFatTotal,sum_DietaryFiber,sum_DietaryPotassium,sum_DietaryProtein,sum_DietarySodium,sum_DietarySugar,sum_DietaryWater,sum_DistanceCycling,sum_DistanceWalkingRunning,sum_EnvironmentalAudioExposure,sum_EnvironmentalSoundReduction,sum_FlightsClimbed,sum_HeadphoneAudioExposure,sum_HeartRate,sum_HeartRateVariabilitySDNN,sum_LeanBodyMass,sum_PhysicalEffort,sum_RespiratoryRate,sum_RestingHeartRate,sum_SixMinuteWalkTestDistance,sum_StepCount,sum_VO2Max,sum_WalkingAsymmetryPercentage,sum_WalkingDoubleSupportPercentage,sum_WalkingHeartRateAverage,sum_WalkingSpeed,sum_WalkingStepLength,count_ActiveEnergyBurned,count_AppleExerciseTime,count_AppleStandTime,count_AppleWalkingSteadiness,count_BasalEnergyBurned,count_BloodGlucose,count_BodyFatPercentage,count_BodyMass,count_BodyMassIndex,count_DietaryCarbohydrates,count_DietaryCholesterol,count_DietaryEnergyConsumed,count_DietaryFatSaturated,count_DietaryFatTotal,count_DietaryFiber,count_DietaryPotassium,count_DietaryProtein,count_DietarySodium,count_DietarySugar,count_DietaryWater,count_DistanceCycling,count_DistanceWalkingRunning,count_EnvironmentalAudioExposure,count_EnvironmentalSoundReduction,count_FlightsClimbed,count_HeadphoneAudioExposure,count_HeartRate,count_HeartRateVariabilitySDNN,count_LeanBodyMass,count_PhysicalEffort,count_RespiratoryRate,count_RestingHeartRate,count_SixMinuteWalkTestDistance,count_StepCount,count_VO2Max,count_WalkingAsymmetryPercentage,count_WalkingDoubleSupportPercentage,count_WalkingHeartRateAverage,count_WalkingSpeed,count_WalkingStepLength,sleep_duration_hours,sleep_asleep_hours,sleep_inbed_hours,sleep_sessions,sleep_efficiency_pct,sleep_onset_time,sleep_wake_time,sleep_midpoint_time,sleep_latency_minutes,sleep_restful_ratio_pct,date,resting_hr_baseline_14d,resting_hr_delta,hrv_baseline_14d,hrv_delta,vo2max_baseline_28d,vo2max_delta,physical_effort_baseline_14d,physical_effort_delta,sleep_duration_baseline_14d,sleep_duration_delta,readiness_component_hrv,readiness_component_rhr,readiness_component_sleep,readiness_risk_score,readiness_risk_level,glucose_mean_mg_dl,glucose_median_mg_dl,glucose_min_mg_dl,glucose_max_mg_dl,glucose_std_mg_dl,glucose_rate_change_avg,readings_count,tir_low_pct,tir_normal_pct,tir_very_low_pct,tir_high_pct,tir_very_high_pct
10,2025-08-26,0.492,1.0,2.097,,1.053,,0.139,151.301,22.371,6.5,7.5,57.5,1.75,3.0,0.5,10.0,2.0,67.5,2.0,,,0.06,51.182,13.809,2.0,67.702,84.692,34.251,130.227,3.581,14.115,57.0,,134.872,,0.005,0.28,101.0,2.672,28.231,1097.437,152.0,151.0,,1812.585,,0.139,151.301,44.743,13.0,7.5,115.0,1.75,6.0,0.5,10.0,2.0,67.5,2.0,,,7.498,2098.471,82.857,10.0,3249.71,203769.18,342.511,130.227,1701.2,7862.0,114.0,,15780.0,,0.09,8.95,101.0,90.842,959.843,2232.0,152.0,72.0,,1721.0,,1.0,1.0,2.0,2.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,,,126.0,41.0,6.0,5.0,48.0,2406.0,10.0,1.0,475.0,557.0,2.0,,117.0,,18.0,32.0,1.0,34.0,34.0,49.043,49.043,49.043,141.0,100.0,2025-08-26 00:20:29,2025-08-26 23:49:39,2025-08-26 12:05:04.000,20.483,0.0,2025-08-26,57.75,-0.75,37.36,-3.109,,,3.741,-0.159,40.502,8.541,-0.834,-0.113,0.0,-0.368,moderate,84.608,81.0,69.0,126.0,10.576,-0.064,291,0.344,99.656,0.0,0.0,0.0
11,2025-08-27,0.399,1.0,2.392,,0.669,,0.139,151.502,22.386,11.7,13.996,100.2,0.935,5.545,2.04,151.052,8.625,61.422,5.197,,0.007,0.064,51.809,16.079,2.875,64.824,82.639,38.78,130.425,4.191,13.954,56.5,,219.843,,0.039,0.279,101.5,2.719,28.929,1162.636,237.0,232.0,,1795.397,,0.139,151.502,44.772,234.0,69.981,2004.0,9.351,61.0,26.526,2265.785,138.0,921.331,77.954,,8.419,10.657,1865.139,96.475,23.0,5121.097,255686.329,349.016,130.425,2699.2,13423.5,113.0,,22424.0,,0.85,11.709,101.5,130.503,1388.584,2911.0,237.0,97.0,,2685.0,,1.0,1.0,2.0,20.0,5.0,20.0,10.0,11.0,13.0,15.0,16.0,15.0,15.0,,1260.0,167.0,36.0,6.0,8.0,79.0,3094.0,9.0,1.0,644.0,962.0,2.0,,102.0,,22.0,42.0,1.0,48.0,48.0,48.845,48.845,48.845,144.0,100.0,2025-08-26 23:11:09,2025-08-27 23:59:38,2025-08-27 11:35:23.500,-48.85,0.0,2025-08-27,57.636,-1.136,37.479,1.301,,,3.778,0.413,41.261,7.584,0.016,0.178,0.0,0.06,moderate,90.211,87.0,67.0,138.0,15.852,-0.006,289,0.346,99.654,0.0,0.0,0.0
12,2025-08-28,0.768,1.0,2.161,0.956,12.969,,0.143,153.001,22.647,7.4,16.6,51.0,1.298,3.667,1.708,142.8,1.0,33.725,5.973,,,0.056,51.114,15.045,3.0,28.705,66.438,39.651,131.021,3.609,14.0,57.0,,122.495,,0.007,0.281,82.0,2.383,26.284,616.526,20.0,121.0,0.956,1789.754,,0.143,153.001,45.294,37.0,16.6,255.0,5.193,11.0,6.83,285.6,2.0,134.9,23.89,,,5.906,2197.887,75.224,27.0,28.705,109889.207,475.81,131.021,1519.3,8316.0,114.0,,12372.0,,0.05,5.064,82.0,50.04,551.969,803.0,20.0,56.0,1.0,138.0,,1.0,1.0,2.0,5.0,1.0,5.0,4.0,3.0,4.0,2.0,2.0,4.0,4.0,,,105.0,43.0,5.0,9.0,1.0,1654.0,12.0,1.0,421.0,594.0,2.0,,101.0,,7.0,18.0,1.0,21.0,21.0,51.271,51.271,51.271,164.0,100.0,2025-08-27 23:25:00,2025-08-28 23:57:53,2025-08-28 11:41:26.500,-35.0,0.0,2025-08-28,57.583,-0.583,37.646,2.005,,,3.765,-0.156,42.095,9.176,0.152,-0.239,0.0,-0.011,moderate,91.212,91.0,73.0,127.0,9.617,-0.007,288,0.0,100.0,0.0,0.0,0.0
13,2025-08-29,0.53,1.0,2.241,,2.032,,0.142,152.802,22.582,11.8,38.744,173.636,3.303,11.333,3.527,280.838,12.6,258.454,5.688,,,0.059,52.9,14.3,1.353,73.037,70.084,47.858,131.087,3.457,13.746,55.0,,125.472,,0.073,0.297,80.0,2.18,25.089,889.107,86.0,186.0,,1848.93,,0.142,152.802,45.165,118.0,193.722,1910.0,26.422,102.0,21.163,1685.025,126.0,2326.09,39.819,,,7.589,2274.72,14.3,23.0,219.11,141009.435,430.719,131.087,1680.3,6584.5,110.0,,15935.0,,1.47,9.498,80.0,95.92,1103.937,1677.0,86.0,83.0,,910.0,,1.0,1.0,2.0,10.0,5.0,11.0,8.0,9.0,6.0,6.0,10.0,9.0,7.0,,,128.0,43.0,1.0,17.0,3.0,2012.0,9.0,1.0,486.0,479.0,2.0,,127.0,,20.0,32.0,1.0,44.0,44.0,43.445,43.445,43.445,170.0,100.0,2025-08-28 22:36:23,2025-08-29 23:59:59,2025-08-29 11:18:11.000,-83.617,0.0,2025-08-29,57.385,-2.385,38.375,9.483,,,3.743,-0.286,42.199,1.246,1.594,1.12,0.0,0.974,low,91.764,90.0,59.0,132.0,9.679,-0.032,292,1.712,98.288,0.0,0.0,0.0
14,2025-08-30,1.055,1.0,2.667,,16.212,,,,,9.667,10.0,69.75,1.544,3.333,2.193,259.29,4.5,350.0,5.926,,,0.072,47.777,,9.0,,65.072,36.001,,3.528,13.649,56.5,,138.236,,0.192,0.302,85.0,2.05,24.38,628.554,26.0,144.0,,1361.78,,,,,29.0,10.0,279.0,4.632,10.0,2.193,259.29,18.0,350.0,17.778,,,7.27,1528.861,,27.0,,81470.577,252.005,,1488.7,8353.0,113.0,,15206.0,,4.99,12.379,85.0,108.648,1292.125,596.0,26.0,54.0,,84.0,,,,,3.0,1.0,4.0,3.0,3.0,1.0,1.0,4.0,1.0,3.0,,,101.0,32.0,,3.0,,1252.0,7.0,,422.0,612.0,2.0,,110.0,,26.0,41.0,1.0,53.0,53.0,54.779,54.779,54.779,180.0,100.0,2025-08-29 22:09:12,2025-08-30 09:25:59,2025-08-30 03:47:35.500,-110.8,0.0,2025-08-30,57.321,-0.821,38.723,-2.722,,,3.727,-0.199,43.097,11.682,-0.76,-0.06,0.0,-0.322,moderate,95.87,92.0,64.0,124.0,9.49,-0.031,215,0.465,99.535,0.0,0.0,0.0


In [59]:
# Persist the daily feature mart
DAILY_FEATURE_MART_PATH = PROCESSED_DIR / "feature_mart_daily.parquet"
if daily_feature_mart.empty:
    print("⚠️ Daily feature mart is empty; skipping Parquet export for now.")
else:
    daily_feature_mart.to_parquet(DAILY_FEATURE_MART_PATH, index=False)
    print(f"💾 Saved daily feature mart to {DAILY_FEATURE_MART_PATH}")

💾 Saved daily feature mart to /Users/george/Library/Mobile Documents/com~apple~CloudDocs/Programming Projects/Apple-Health-DS/data/processed/feature_mart_daily.parquet
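
Before downstream notebooks rely on the saved file, a few range checks may be worthwhile: the preview rows above show `sleep_duration_hours` near 49 and a negative `sleep_latency_minutes`, which look implausible for a single day. The sketch below flags such values on two rows copied from that preview. The bounds in `RANGE_CHECKS` and the helper `flag_out_of_range` are assumptions for illustration, not validated thresholds.

```python
import pandas as pd

# Hypothetical plausibility bounds (inclusive); tune these to the data at hand.
RANGE_CHECKS = {
    "sleep_duration_hours": (0.0, 24.0),
    "sleep_latency_minutes": (0.0, 240.0),
    "glucose_mean_mg_dl": (40.0, 400.0),
}


def flag_out_of_range(
    df: pd.DataFrame, checks: dict[str, tuple[float, float]]
) -> pd.DataFrame:
    """Return one row per (date, column) whose value falls outside its bounds."""
    issues = []
    for column, (low, high) in checks.items():
        if column not in df.columns:
            continue
        values = df[column]
        # NaN compares False on both sides, so missing values are not flagged here.
        bad = df.loc[(values < low) | (values > high), ["date", column]]
        for date, value in bad.itertuples(index=False, name=None):
            issues.append({"date": date, "column": column, "value": float(value)})
    return pd.DataFrame(issues, columns=["date", "column", "value"])


# Values copied from the 2025-08-26 and 2025-08-27 rows of the preview.
sample = pd.DataFrame(
    {
        "date": ["2025-08-26", "2025-08-27"],
        "sleep_duration_hours": [49.043, 48.845],
        "sleep_latency_minutes": [20.483, -48.85],
        "glucose_mean_mg_dl": [84.608, 90.211],
    }
)
issues = flag_out_of_range(sample, RANGE_CHECKS)
print(issues)
```

Run against the full `daily_feature_mart`, the same helper would list every date that needs a closer look at the upstream sleep aggregation.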


In [60]:
# Event feature mart generation
EVENT_WINDOW_SPEC = {
    "meal": {
        "description": "Identify low-variability glucose windows suitable for meals and track 2h post-event response.",
        "targets": ["glucose_auc_2h", "glucose_peak_2h", "time_above_140_2h_minutes"],
    },
    "cardio": {
        "description": "Exercise blocks linked to next-day recovery (ΔHRV, ΔRHR) and glycemic stability.",
        "targets": [
            "delta_hrv_next_day",
            "delta_rhr_next_day",
            "sleep_efficiency_same_night",
            "next_day_glucose_std_mg_dl",
        ],
    },
    "strength": {
        "description": "Strength sessions balancing effort, next-day readiness, and sleep protection.",
        "targets": ["delta_hrv_next_day", "perceived_effort_balance"],
    },
    "stress": {
        "description": "Readiness risk classification leveraging HRV, RHR, and sleep duration/efficiency.",
        "targets": ["readiness_risk_score"],
    },
}


def build_event_features(
    glucose: pd.DataFrame,
    apple_records: pd.DataFrame,
    windowed_glucose: pd.DataFrame,
    merged_health_glucose: pd.DataFrame,
    daily_feature_mart: pd.DataFrame,
    glucose_daily: pd.DataFrame,
    event_spec: dict[str, dict[str, object]],
) -> pd.DataFrame:
    """Construct event-centric feature mart covering meal, cardio, strength, and stress contexts.

    Note: ``windowed_glucose`` is currently unused.
    """
    events: list[dict[str, object]] = []
    glucose_temp = glucose.copy()
    if not glucose_temp.empty and "timestamp" in glucose_temp.columns:
        glucose_temp["timestamp"] = pd.to_datetime(
            glucose_temp["timestamp"], errors="coerce"
        ).dt.tz_localize(None)
        glucose_temp = glucose_temp.dropna(subset=["timestamp"]).sort_values(
            "timestamp"
        )
    if glucose_temp.empty:
        glucose_series = pd.Series(dtype=float)
        rate_change_series = pd.Series(dtype=float)
    else:
        glucose_series = (
            glucose_temp.set_index("timestamp")["glucose_value"]
            .astype(float)
            .resample("5min")
            .mean()
            .interpolate(limit=12, limit_direction="both")
        )
        if "glucose_rate_change" in glucose_temp.columns:
            rate_change_series = (
                glucose_temp.set_index("timestamp")["glucose_rate_change"]
                .astype(float)
                .resample("5min")
                .mean()
                .interpolate(limit=12, limit_direction="both")
            )
        else:
            rate_change_series = pd.Series(dtype=float)

    def compute_glucose_window_metrics(
        start_ts: pd.Timestamp, hours: int = 2
    ) -> dict[str, float]:
        if pd.isna(start_ts):
            return {
                "pre": np.nan,
                "auc": np.nan,
                "peak": np.nan,
                "std": np.nan,
                "time_above_140": np.nan,
                "mean_rate_change": np.nan,
            }
        start_ts = pd.Timestamp(start_ts).tz_localize(None)
        window_end = start_ts + pd.Timedelta(hours=hours)
        if glucose_series.empty:
            pre_value = np.nan
            window = pd.Series(dtype=float)
        else:
            window = glucose_series.loc[start_ts:window_end]
            # The series is known to be non-empty in this branch.
            pre_value = glucose_series.asof(start_ts)
        if window.empty:
            return {
                "pre": float(pre_value) if pd.notna(pre_value) else np.nan,
                "auc": np.nan,
                "peak": np.nan,
                "std": np.nan,
                "time_above_140": np.nan,
                "mean_rate_change": np.nan,
            }
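        # Samples sit on a 5-minute grid, so integrate with dx in hours (AUC in mg/dL·h).
        dx_hours = 5 / 60
        auc = float(np.trapezoid(window.values, dx=dx_hours))
        peak = float(window.max())
        std_val = float(window.std())
        # Each sample above 140 mg/dL counts as 5 minutes.
        time_above_140 = float((window > 140).sum() * 5)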
        if rate_change_series.empty:
            mean_rate_change = np.nan
        else:
            rate_window = rate_change_series.loc[start_ts:window_end]
            mean_rate_change = (
                float(rate_window.mean()) if not rate_window.empty else np.nan
            )
        return {
            "pre": float(pre_value) if pd.notna(pre_value) else np.nan,
            "auc": auc,
            "peak": peak,
            "std": std_val,
            "time_above_140": time_above_140,
            "mean_rate_change": mean_rate_change,
        }

    def prepare_lookup(df: pd.DataFrame) -> pd.DataFrame:
        if df.empty or "date" not in df.columns:
            return pd.DataFrame()
        lookup = df.copy()
        lookup["date"] = pd.to_datetime(lookup["date"], errors="coerce").dt.floor("D")
        lookup = lookup.dropna(subset=["date"]).drop_duplicates(
            subset=["date"], keep="last"
        )
        return lookup.set_index("date").sort_index()

    daily_lookup = prepare_lookup(daily_feature_mart)
    glucose_daily_lookup = prepare_lookup(glucose_daily)

    def get_metric(lookup: pd.DataFrame, date: pd.Timestamp, column: str) -> float:
        if lookup.empty or column not in lookup.columns or pd.isna(date):
            return np.nan
        key = pd.Timestamp(date).floor("D")
        if key not in lookup.index:
            return np.nan
        value = lookup.at[key, column]
        return float(value) if pd.notna(value) else np.nan

    merged_copy = (
        merged_health_glucose.copy()
        if not merged_health_glucose.empty
        else pd.DataFrame()
    )
    if "meal" in event_spec and not merged_copy.empty:
        meal_mask = merged_copy.get("likely_meal_time")
        if meal_mask is not None:
            if meal_mask.dtype == bool:
                merged_copy = merged_copy[meal_mask]
            else:
                merged_copy = merged_copy[
                    meal_mask.astype(str).str.lower().isin(["true", "1", "yes"])
                ]
        else:
            merged_copy = merged_copy.iloc[0:0]
        if not merged_copy.empty and "glucose_timestamp" in merged_copy.columns:
            merged_copy["glucose_timestamp"] = pd.to_datetime(
                merged_copy["glucose_timestamp"], errors="coerce"
            ).dt.tz_localize(None)
            merged_copy = merged_copy.dropna(subset=["glucose_timestamp"])
            merged_copy = merged_copy.sort_values("glucose_timestamp")
            merged_copy["event_bin"] = merged_copy["glucose_timestamp"].dt.floor(
                "30min"
            )
            meal_candidates = merged_copy.drop_duplicates(subset=["event_bin"])
            for row in meal_candidates.itertuples():
                start_ts = getattr(row, "glucose_timestamp", None)
                metrics = compute_glucose_window_metrics(start_ts)
                events.append(
                    {
                        "event_type": "meal",
                        "event_start": start_ts,
                        "event_end": (
                            start_ts + pd.Timedelta(hours=2)
                            if start_ts is not None
                            else np.nan
                        ),
                        "source": "merged_health_glucose",
                        "pre_glucose_mg_dl": metrics["pre"],
                        "glucose_auc_2h": metrics["auc"],
                        "glucose_peak_2h": metrics["peak"],
                        "time_above_140_2h_minutes": metrics["time_above_140"],
                        "post_window_std_mg_dl": metrics["std"],
                        "post_window_mean_rate_change": metrics["mean_rate_change"],
                        "glucose_trend_label": getattr(row, "glucose_trend", np.nan),
                    }
                )

    workouts = apple_records.copy() if not apple_records.empty else pd.DataFrame()
    if not workouts.empty and {"start_date", "end_date", "type"} <= set(
        workouts.columns
    ):
        for col in ["start_date", "end_date"]:
            workouts[col] = pd.to_datetime(
                workouts[col], errors="coerce"
            ).dt.tz_localize(None)
        workouts = workouts.dropna(subset=["start_date", "end_date"])
        workouts["type_lower"] = workouts["type"].astype(str).str.lower()
        strength_pattern = r"strength|resistance|weight"
        cardio_pattern = r"run|cycle|cardio|aerobic|walk|hiking|swim|row"
        strength_mask = workouts["type_lower"].str.contains(
            strength_pattern, regex=True, na=False
        )
        cardio_mask = (
            workouts["type_lower"].str.contains(cardio_pattern, regex=True, na=False)
            & ~strength_mask
        )

        def create_workout_events(subset: pd.DataFrame, label: str) -> None:
            for row in subset.itertuples():
                start_ts = getattr(row, "start_date", None)
                end_ts = getattr(row, "end_date", None)
                if pd.isna(start_ts) or pd.isna(end_ts):
                    continue
                start_ts = pd.Timestamp(start_ts)
                end_ts = pd.Timestamp(end_ts)
                metrics = compute_glucose_window_metrics(end_ts)
                event_date = start_ts.floor("D")
                next_day = event_date + pd.Timedelta(days=1)
                event_record: dict[str, object] = {
                    "event_type": label,
                    "event_start": start_ts,
                    "event_end": end_ts,
                    "source": getattr(row, "type", np.nan),
                    "duration_minutes": (end_ts - start_ts).total_seconds() / 60,
                    "delta_hrv_next_day": get_metric(
                        daily_lookup, next_day, "hrv_delta"
                    ),
                    "delta_rhr_next_day": get_metric(
                        daily_lookup, next_day, "resting_hr_delta"
                    ),
                    "sleep_efficiency_same_night": get_metric(
                        daily_lookup, next_day, "sleep_efficiency_pct"
                    ),
                    "next_day_glucose_std_mg_dl": get_metric(
                        glucose_daily_lookup, next_day, "glucose_std_mg_dl"
                    ),
                    "glucose_auc_post_2h": metrics["auc"],
                    "glucose_peak_post_2h": metrics["peak"],
                    "glucose_std_post_2h": metrics["std"],
                    "glucose_mean_rate_change_post_2h": metrics["mean_rate_change"],
                    "perceived_effort_balance": np.nan,
                }
                if label == "strength":
                    physical_effort_today = get_metric(
                        daily_lookup, event_date, "mean_PhysicalEffort"
                    )
                    physical_effort_baseline = get_metric(
                        daily_lookup, event_date, "physical_effort_baseline_14d"
                    )
                    if pd.notna(physical_effort_today) and pd.notna(
                        physical_effort_baseline
                    ):
                        event_record["perceived_effort_balance"] = (
                            physical_effort_today - physical_effort_baseline
                        )
                events.append(event_record)

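        # Respect event_spec for workouts, as the meal and stress branches do.
        if "cardio" in event_spec and cardio_mask.any():
            create_workout_events(workouts[cardio_mask], "cardio")
        if "strength" in event_spec and strength_mask.any():
            create_workout_events(workouts[strength_mask], "strength")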

    if "stress" in event_spec and not daily_feature_mart.empty:
        stress_df = daily_feature_mart.copy()
        columns_needed = [
            "date",
            "readiness_risk_score",
            "readiness_risk_level",
            "readiness_component_hrv",
            "readiness_component_rhr",
            "readiness_component_sleep",
            "sleep_efficiency_pct",
            "resting_hr_delta",
            "hrv_delta",
        ]
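        available_cols = [col for col in columns_needed if col in stress_df.columns]
        # Without a "date" column there is nothing to anchor stress events on,
        # and dropna(subset=["date"]) would raise a KeyError.
        stress_df = (
            stress_df[available_cols].dropna(subset=["date"])
            if "date" in available_cols
            else pd.DataFrame()
        )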
        for row in stress_df.itertuples():
            start_ts = pd.Timestamp(getattr(row, "date")).tz_localize(None)
            events.append(
                {
                    "event_type": "stress",
                    "event_start": start_ts,
                    "event_end": start_ts + pd.Timedelta(days=1),
                    "source": "daily_readiness",
                    "readiness_risk_score": getattr(
                        row, "readiness_risk_score", np.nan
                    ),
                    "readiness_risk_level": str(
                        getattr(row, "readiness_risk_level", "")
                    ),
                    "hrv_delta": getattr(row, "hrv_delta", np.nan),
                    "resting_hr_delta": getattr(row, "resting_hr_delta", np.nan),
                    "sleep_efficiency_pct": getattr(
                        row, "sleep_efficiency_pct", np.nan
                    ),
                    "hrv_component": getattr(row, "readiness_component_hrv", np.nan),
                    "resting_hr_component": getattr(
                        row, "readiness_component_rhr", np.nan
                    ),
                    "sleep_component": getattr(
                        row, "readiness_component_sleep", np.nan
                    ),
                }
            )

    event_df = pd.DataFrame(events)
    if event_df.empty:
        return event_df
    event_df = event_df.sort_values("event_start").reset_index(drop=True)
    event_df["event_id"] = np.arange(1, len(event_df) + 1)
    preferred_order = [
        "event_id",
        "event_type",
        "event_start",
        "event_end",
        "source",
        "pre_glucose_mg_dl",
        "glucose_auc_2h",
        "glucose_peak_2h",
        "time_above_140_2h_minutes",
        "post_window_std_mg_dl",
        "post_window_mean_rate_change",
        "duration_minutes",
        "delta_hrv_next_day",
        "delta_rhr_next_day",
        "sleep_efficiency_same_night",
        "next_day_glucose_std_mg_dl",
        "glucose_auc_post_2h",
        "glucose_peak_post_2h",
        "glucose_std_post_2h",
        "glucose_mean_rate_change_post_2h",
        "perceived_effort_balance",
        "readiness_risk_score",
        "readiness_risk_level",
        "hrv_component",
        "resting_hr_component",
        "sleep_component",
    ]
    existing_cols = [col for col in preferred_order if col in event_df.columns]
    event_df = event_df[
        existing_cols + [col for col in event_df.columns if col not in existing_cols]
    ]
    return event_df


event_feature_mart = build_event_features(
    glucose=glucose_df,
    apple_records=apple_records_df,
    windowed_glucose=windowed_glucose_df,
    merged_health_glucose=merged_health_glucose_df,
    daily_feature_mart=daily_feature_mart,
    glucose_daily=glucose_daily_df,
    event_spec=EVENT_WINDOW_SPEC,
)
display(Markdown("**Event feature mart preview**"))
display(event_feature_mart.head())
EVENT_FEATURE_MART_PATH = PROCESSED_DIR / "feature_mart_events.parquet"
if event_feature_mart.empty:
    print("⚠️ Event feature mart is empty; skipping Parquet export for now.")
else:
    event_feature_mart.to_parquet(EVENT_FEATURE_MART_PATH, index=False)
    print(f"💾 Saved event feature mart to {EVENT_FEATURE_MART_PATH}")

**Event feature mart preview**

Unnamed: 0,event_id,event_type,event_start,event_end,source,pre_glucose_mg_dl,glucose_auc_2h,glucose_peak_2h,time_above_140_2h_minutes,post_window_std_mg_dl,post_window_mean_rate_change,readiness_risk_score,readiness_risk_level,hrv_component,resting_hr_component,sleep_component,glucose_trend_label,hrv_delta,resting_hr_delta,sleep_efficiency_pct
0,1,stress,2025-08-16 00:00:00,2025-08-17 00:00:00,daily_readiness,,,,,,,,,,,0.0,,,,
1,2,meal,2025-08-16 13:20:00,2025-08-16 15:20:00,merged_health_glucose,127.0,204.5,127.0,0.0,13.058,0.357,,,,,,stable,,,
2,3,meal,2025-08-16 13:30:00,2025-08-16 15:30:00,merged_health_glucose,127.0,202.75,127.0,0.0,11.926,0.247,,,,,,stable,,,
3,4,meal,2025-08-16 14:00:00,2025-08-16 16:00:00,merged_health_glucose,87.0,204.271,117.5,0.0,9.945,-0.02,,,,,,stable,,,
4,5,meal,2025-08-16 14:30:00,2025-08-16 16:30:00,merged_health_glucose,92.0,213.854,117.5,0.0,6.384,-0.028,,,,,,stable,,,


💾 Saved event feature mart to /Users/george/Library/Mobile Documents/com~apple~CloudDocs/Programming Projects/Apple-Health-DS/data/processed/feature_mart_events.parquet
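
The `glucose_auc_2h` values in the preview come from trapezoidal integration over the 5-minute grid with `dx` expressed in hours, so their unit is mg/dL·h: a flat 100 mg/dL trace over 2 hours gives 200. A minimal sketch on a synthetic trace follows. The values are made up for illustration, `window_auc` is a hypothetical helper, and the `np.trapz` fallback is an assumption for NumPy versions older than 2.0, where `np.trapezoid` does not exist.

```python
import numpy as np
import pandas as pd

# np.trapezoid was added in NumPy 2.0; fall back to np.trapz on older versions.
_trapezoid = getattr(np, "trapezoid", None) or np.trapz


def window_auc(series: pd.Series, start: pd.Timestamp, hours: int = 2) -> float:
    """Trapezoidal AUC (mg/dL·h) over [start, start + hours] on a 5-minute grid."""
    window = series.loc[start : start + pd.Timedelta(hours=hours)]
    return float(_trapezoid(window.to_numpy(), dx=5 / 60))


# Synthetic flat trace at 100 mg/dL, sampled every 5 minutes for 3 hours.
index = pd.date_range("2025-08-16 13:00", periods=37, freq="5min")
flat = pd.Series(100.0, index=index)

flat_auc = window_auc(flat, index[0])

# Label-based slicing includes both endpoints, so a 2-hour window holds 25 samples.
n_samples = len(flat.loc[index[0] : index[0] + pd.Timedelta(hours=2)])
print(flat_auc, n_samples)
```

Because the slice is inclusive at both ends, `time_above_140_2h_minutes`, computed as `(window > 140).sum() * 5`, can reach 125 minutes for a 2-hour window.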

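`build_event_features` resamples readings to a 5-minute grid and interpolates with `limit=12, limit_direction="both"`, which fills up to 12 grid points (one hour) inward from each side of a gap. Gaps of up to about two hours are therefore bridged completely, while the middle of a longer gap stays `NaN`. A small sketch with two synthetic readings three hours apart (the values are illustrative, not real data):

```python
import pandas as pd

# Two synthetic readings three hours apart (illustrative values).
raw = pd.Series(
    [100.0, 136.0],
    index=pd.to_datetime(["2025-08-16 00:00", "2025-08-16 03:00"]),
)

# Same resampling and interpolation settings as build_event_features.
grid = raw.resample("5min").mean()
filled = grid.interpolate(limit=12, limit_direction="both")

gap_points = int(grid.isna().sum())  # empty 5-minute slots between the readings
still_missing = int(filled.isna().sum())  # slots more than one hour from either reading
print(gap_points, still_missing)
```

Here 35 empty slots separate the two readings; 12 are filled from each side and the 11 in the middle remain `NaN`, so event windows that overlap long sensor gaps can still contain missing values.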

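Meal candidates are deduplicated by flooring `glucose_timestamp` to fixed 30-minute bins, which is why the preview keeps both the 13:20 and 13:30 events: they fall into the 13:00 and 13:30 bins even though they are only 10 minutes apart. One possible alternative, sketched here as an assumption rather than the notebook's method, keeps a candidate only when it starts at least a minimum gap after the previously kept one. `dedupe_by_gap` and `min_gap` are hypothetical names.

```python
import pandas as pd


def dedupe_by_gap(timestamps: pd.Series, min_gap: pd.Timedelta) -> pd.Series:
    """Keep a timestamp only if it is at least min_gap after the last kept one."""
    kept = []
    for ts in timestamps.sort_values():
        if not kept or ts - kept[-1] >= min_gap:
            kept.append(ts)
    return pd.Series(kept, dtype="datetime64[ns]")


# Event start times taken from the preview above.
candidates = pd.Series(
    pd.to_datetime(
        [
            "2025-08-16 13:20",
            "2025-08-16 13:30",
            "2025-08-16 14:00",
            "2025-08-16 14:30",
        ]
    )
)

# Fixed bins (the notebook's approach): every timestamp lands in its own bin.
fixed_bins = candidates.dt.floor("30min").drop_duplicates()

# Gap-based alternative: 13:30 is dropped because it is only 10 minutes after 13:20.
gap_based = dedupe_by_gap(candidates, pd.Timedelta(minutes=30))
print(len(fixed_bins), len(gap_based))
```

Overlapping 2-hour windows remain in either case; the gap-based variant only reduces near-duplicate starts.
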
### Next steps
- Continue expanding the Apple Health daily features (sleep duration/efficiency, VO₂Max trends, moving baselines); `vo2max_baseline_28d` and `vo2max_delta` are empty in the preview rows above
- Extend `build_event_features` beyond this first pass: the event preview contains no cardio or strength columns, so no workout events were generated in this run
- Add quality checks on `feature_mart_events.parquet` before downstream Notebook 7 consumes it
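
For the cardio and strength windows, `build_event_features` assigns labels by matching unanchored substring patterns against the lowercased `type` column. Such patterns also match Apple Health quantity record types that merely contain "walk" or "run", so restricting `apple_records` to workout rows first may be worth considering. The sketch below applies the same two patterns to a few illustrative type strings; whether `apple_records` stores these exact values is an assumption, and `label_types` is a hypothetical helper.

```python
import pandas as pd

# Patterns copied from build_event_features.
STRENGTH_PATTERN = r"strength|resistance|weight"
CARDIO_PATTERN = r"run|cycle|cardio|aerobic|walk|hiking|swim|row"


def label_types(types: pd.Series) -> pd.Series:
    """Label each type string as 'strength', 'cardio', or 'other'."""
    lowered = types.astype(str).str.lower()
    strength = lowered.str.contains(STRENGTH_PATTERN, regex=True, na=False)
    # Strength takes precedence, mirroring the `& ~strength_mask` in the notebook.
    cardio = lowered.str.contains(CARDIO_PATTERN, regex=True, na=False) & ~strength
    labels = pd.Series("other", index=types.index)
    labels.loc[cardio] = "cardio"
    labels.loc[strength] = "strength"
    return labels


# Illustrative type strings (assumed, not read from the database).
types = pd.Series(
    [
        "HKWorkoutActivityTypeRunning",
        "HKWorkoutActivityTypeTraditionalStrengthTraining",
        "HKQuantityTypeIdentifierWalkingSpeed",  # a quantity record, not a workout
        "HKQuantityTypeIdentifierBodyMass",
    ]
)
labels = label_types(types)
print(labels.tolist())
```

The third entry is labelled as cardio although it is a quantity record, which illustrates why a workout-only filter ahead of the pattern match could matter.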