# MD+ Datathon
### Neuroncdocs Team

# Predictive Modeling
### Inputs
* `user_id`: User ID. Values repeat across rows; there are 42283 unique users. No missing values.

* `age`: User's age. The summary statistics show anomalies: the minimum age is -196691 and the maximum is 2018, so further processing or filtering is required. There are also 309226 missing values, which could be replaced with the mean or median.
  
* `sex`: User's sex. There are 132135 missing values, which should be replaced with the value `unknown`. Requires conversion from a categorical variable to a numeric variable. The column has 4 unique values, and it is important to check why.
  
* `country`: User's country. There are 297985 missing values, which should be replaced with the value `unknown`. Requires conversion from a categorical variable to a numeric variable.
  
* `checkin_date`: Tracking date. Should be converted to datetime format for ease of use. No missing values.
  
* `trackable_id`: ID of the event being tracked. 264603 unique values. The column can probably be dropped, but this is not yet confirmed. No missing values.

* `trackable_type`: The type of event being tracked. 7 unique types, all of which should be explored. No missing values. The type determines how `trackable_name` is interpreted (`trackable_type` --> `trackable_name`):
  - `Condition` --> `condition_keyword_groups`
  - `Symptom` --> `symptom_keyword_groups`
  - `Food` --> `food_keyword_groups`
  - `Tag` --> `tag_keyword_groups`
  - `Weather` --> `icon`, `temperature_min`, `temperature_max`, `precip_intensity`, `pressure`, `humidity`
  - `HBI`
    
* `trackable_name`: Name of the event being tracked, such as a description of a symptom. 117214 unique values. There are 4 missing values.
  
* `trackable_value`: The value of the tracked event. 15960 unique values. For some conditions this is a severity score (0-4).


| **Domain**    | **Description** |
|----------------|-----------------|
| **Tags** | Self-reported contextual or psychosocial states (e.g., stress, fatigue, sleep quality). |
| **Foods** | Dietary items categorized via hierarchical clusters (e.g., vegetables, processed foods, caffeine). |
| **Treatments** | Medications, supplements, and therapeutic interventions (e.g., NSAIDs, SSRIs, biologics). |
| **Weather** | Daily environmental data including maximum/minimum temperature, humidity, barometric pressure, and precipitation. |
| **Symptoms** | Intensity scores and qualitative reports of user-experienced symptoms across multiple domains. |
| **Conditions** | Known or self-reported disease diagnoses (e.g., Major Depressive Disorder, Rheumatoid Arthritis, POTS). |



### Outputs

* **Conditions:**  Predictive models estimating the probability of new or worsening disease diagnosis **within 30 days** of a given user check-in.
* **Symptoms:**  Short-term forecasting models estimating daily symptom fluctuations or flare probability.


| Condition | Prediction Horizon | Description |
|------------|--------------------|--------------|
| **Postural Orthostatic Tachycardia Syndrome (POTS)** | 30 days | Predicts initial or recurrent episodes of orthostatic intolerance (tachycardia, dizziness, lightheadedness) within the next 30 days. |
| **Epilepsy** | 30 days | Predicts the first onset or reporting of epilepsy symptoms or diagnosis within the next 30 days. |
| **Depression** | 30 days | Forecasts whether a user will report depressive symptoms or a depression-related condition within 30 days. |
| **Anxiety** | 30 days | Estimates the likelihood of anxiety-related symptom onset or diagnosis within the next 30 days. |

| Symptom | Prediction Horizon | Description |
|----------|--------------------|--------------|
| **Inflammatory / Rheumatoid Arthritis flares** | Next day | Predicts short-term symptom worsening (pain, swelling, stiffness) indicative of an inflammatory arthritis flare the following day. |

---

### Model

**Light Gradient Boosting Classifier (LightGBM)**  
A high-performance, tree-based gradient boosting algorithm optimized for structured and tabular health data.

**Key Features**
- Handles **missing values**, **categorical encoding**, and **imbalanced data** efficiently.  
- Integrates **rolling temporal features** (7-day and 30-day symptom, treatment, and environmental aggregates).  
- Provides **feature importance rankings** for interpretability and clinical insight.  
- Tuned for each prediction target (condition or symptom) using class weighting, learning rate adjustment, and ensemble depth optimization.
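
The rolling temporal features can be illustrated on a toy series. This sketch uses made-up dates and a hypothetical `flag` column, and relies on a time-based pandas window with `closed="left"` so that each row aggregates only earlier days and never the current one.

```python
import pandas as pd

# Toy daily flags for a single user (hypothetical values)
toy = pd.DataFrame(
    {"flag": [1, 1, 0, 1]},
    index=pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-05", "2020-01-13"]),
)

# The window covers [t - 7 days, t): the current day is excluded
toy["flag_sum_7d"] = toy["flag"].rolling("7D", closed="left").sum()

# Rows with no history inside the window come back as NaN; treat them as 0
toy["flag_sum_7d"] = toy["flag_sum_7d"].fillna(0)
print(toy)
```

Excluding the current day is what keeps a same-day report from leaking into its own features.
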

**Model configuration**
```python
lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.03,
    num_leaves=64,
    subsample=0.8,
    colsample_bytree=0.8,
    class_weight="balanced",
    random_state=42
)
```

In [3]:
# Setup
import re
import warnings
from collections import defaultdict
from datetime import timedelta

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
import lightgbm as lgb
from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    confusion_matrix, precision_recall_curve
)
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

warnings.filterwarnings("ignore")


### Model Evaluation function

In [5]:
def evaluate_and_store_results(model_name, y_true, y_prob, feature_cols, model, results_path="model_results.csv", threshold=0.5):
    """
    Evaluate model performance, extract metrics, feature importance,
    and append results to a centralized CSV for later comparison.
    """

    # --- Probabilistic metrics ---
    auc = roc_auc_score(y_true, y_prob)
    auprc = average_precision_score(y_true, y_prob)

    # --- Binary metrics ---
    y_pred = (y_prob >= threshold).astype(int)
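    # labels=[0, 1] keeps the matrix 2x2 even if only one class occurs in y_true or y_pred
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()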
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else np.nan
    specificity = tn / (tn + fp) if (tn + fp) > 0 else np.nan

    # --- Store feature importance (if available) ---
    if hasattr(model, "feature_importances_"):
        feature_importances = pd.Series(model.feature_importances_, index=feature_cols).sort_values(ascending=False)
        top_features = ", ".join(feature_importances.head(10).index)
    else:
        top_features = "N/A"

    # --- Package results ---
    # Scalar metrics go to the CSV; the raw labels and scores are only returned,
    # since whole arrays written into a CSV cell end up as truncated strings.
    metrics = {
        "Model": model_name,
        "AUC": round(auc, 3),
        "AUPRC": round(auprc, 3),
        "Sensitivity": round(sensitivity, 3),
        "Specificity": round(specificity, 3),
        "Top_Features": top_features
    }
    result = {"Y True": y_true, "Y Prob": y_prob, **metrics}

    # --- Append to master results CSV ---
    row = pd.DataFrame([metrics])
    try:
        existing = pd.read_csv(results_path)
        updated = pd.concat([existing, row], ignore_index=True)
    except FileNotFoundError:
        updated = row

    updated.to_csv(results_path, index=False)
    print(f"✅ Results saved to {results_path}")
    print(f"\n{row}")

    return result
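
To make the binary metrics concrete, here is a small worked example on made-up labels and scores (not results from this dataset). It recomputes sensitivity and specificity by hand with NumPy at the default threshold of 0.5; the variable names are chosen for this illustration only.

```python
import numpy as np

# Hypothetical labels and predicted probabilities
y_true = np.array([1, 1, 0, 0, 0])
y_prob = np.array([0.9, 0.3, 0.6, 0.2, 0.1])
y_pred = (y_prob >= 0.5).astype(int)

# Confusion-matrix cells counted directly
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
tn = int(((y_pred == 0) & (y_true == 0)).sum())

sensitivity = tp / (tp + fn)  # share of true positives that were caught
specificity = tn / (tn + fp)  # share of true negatives that were cleared
print(tp, fn, fp, tn, sensitivity, round(specificity, 3))
```
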


### Training + Testing Data Split function

In [7]:
def make_train_test_balanced(
    df,
    label_col,
    date_col="checkin_date",
    user_col="user_id",
    drop_cols=None,
    pos_to_neg_ratio=3,
    cutoff_quantile=0.8,
    random_state=42
):
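    """
    Temporal split (train/test) followed by random undersampling of negatives
    in the training set only; the test set keeps its natural class balance.
    Despite its name, `pos_to_neg_ratio` is the number of negatives kept per
    positive (e.g. 3 keeps up to 3 negatives for every positive).
    Works for any model version (Epilepsy, POTS, RA, etc.).
    """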

    # Sort and temporal split
    df = df.sort_values([user_col, date_col]).copy()
    cutoff = df[date_col].quantile(cutoff_quantile)
    train = df[df[date_col] <= cutoff].copy()
    test  = df[df[date_col] >  cutoff].copy()

    # Drop unwanted or leaky columns
    drop_cols = drop_cols or []
    feature_cols = [c for c in df.columns if c not in drop_cols and c != label_col]

    # Undersample negatives in train set
    pos = train[train[label_col] == 1]
    neg = train[train[label_col] == 0]

    # Sample negatives at pos_to_neg_ratio × positives (but not more than available)
    n_neg = min(len(neg), pos_to_neg_ratio * max(len(pos), 1))
    neg_sampled = resample(neg, replace=False, n_samples=n_neg, random_state=random_state)
    train_bal = pd.concat([pos, neg_sampled]).sample(frac=1, random_state=random_state)

    # Final matrices
    X_train = train_bal[feature_cols]
    y_train = train_bal[label_col].astype(int)
    X_test  = test[feature_cols]
    y_test  = test[label_col].astype(int)

    print(
        f"✅ Train/Test split complete: "
        f"{len(train_bal)} train ({len(pos)} pos, {n_neg} neg), {len(test)} test. "
        f"Pos rate train={y_train.mean():.3f}, test={y_test.mean():.3f}"
    )

    return X_train, X_test, y_train, y_test, feature_cols
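
The effect of the undersampling step on the training class balance follows from simple arithmetic. The helper below is a hypothetical illustration (`undersampled_counts` is not part of the notebook) that mirrors the `n_neg` rule in the function above. The positive count of 3284 is taken from the epilepsy run log in this notebook; the number of available negatives is a placeholder, since only the sampled count is reported.

```python
def undersampled_counts(n_pos, n_neg_available, neg_per_pos):
    """Mirror the n_neg rule: keep at most neg_per_pos negatives per positive."""
    n_neg = min(n_neg_available, neg_per_pos * max(n_pos, 1))
    total = n_pos + n_neg
    pos_rate = n_pos / total if total else float("nan")
    return n_neg, pos_rate


# n_neg_available is a placeholder large enough not to bind
n_neg, pos_rate = undersampled_counts(n_pos=3284, n_neg_available=1_000_000, neg_per_pos=4)
print(n_neg, round(pos_rate, 3))
```

With 4 negatives per positive the training positive rate is 1 / (1 + 4) = 0.2 whenever enough negatives are available, regardless of the rate in the untouched test set.
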


---
# Inputs
### Loading clusters or labeled data

In [9]:
# Load all CSVs
conditions_df = pd.read_csv("conditions_clusters.csv")
symptoms_df = pd.read_csv("symptoms_clusters.csv")
food_df = pd.read_csv("foods_clusters.csv")
tags_df = pd.read_csv("tags_clusters.csv")
treatments_df = pd.read_csv("treatments_clusters.csv")

In [10]:
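def build_keyword_groups(df):
    """Map each cluster domain (`best_domain`) to the list of raw terms (`term_original`) assigned to it."""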
    groups = defaultdict(list)
    for _, row in df.iterrows():
        term = str(row["term_original"]).strip()
        domain = str(row["best_domain"]).strip()
        if term and domain and domain.lower() != "nan":
            groups[domain].append(term)
            # Optionally, include capitalized version for robust matching
            groups[domain].append(term.capitalize())
    return dict(groups)
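
The expected shape of a clusters file and of the resulting mapping can be shown on a toy table. The rows and the `headache` domain below are invented for illustration; only the column names `term_original` and `best_domain` come from the function above. The sketch uses a pandas `groupby` as a compact alternative that produces a similar domain-to-terms mapping, without the capitalized duplicates.

```python
import pandas as pd

# Toy clusters table (hypothetical rows)
toy_clusters = pd.DataFrame({
    "term_original": ["seizure", "epilepsy", "migraine", "unmapped term"],
    "best_domain": ["epilepsy_seizure", "epilepsy_seizure", "headache", None],
})

# Drop terms without a domain, then collect terms per domain
mapped = toy_clusters.dropna(subset=["best_domain"])
groups = mapped.groupby("best_domain")["term_original"].apply(list).to_dict()
print(groups)
```
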

In [11]:
condition_keyword_groups = build_keyword_groups(conditions_df)
symptom_keyword_groups = build_keyword_groups(symptoms_df)
food_keyword_groups = build_keyword_groups(food_df)
tag_keyword_groups = build_keyword_groups(tags_df)
treatment_keyword_groups = build_keyword_groups(treatments_df)

---
# Gradient Boosting Models

### Predicting Disease in Individuals
* POTS, Epilepsy, Depression, Anxiety, Inflammatory Arthritis
* Cluster keys: `['pots_dysautonomia', 'epilepsy_seizure', 'depression', 'anxiety', 'inflammatory_arthritis']`

In [14]:
# ========= 1) LOAD & CLEAN =========
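# Load the full long-format table from disk
path = "/Users/cristybanuelos/Downloads/Chronic_Illness_Dataset.csv"
df = pd.read_csv(path)

# NOTE: keeping only Condition rows here means the Symptom/Food/Tag/Treatment/Weather
# subsets built in step 4 are empty, so those feature families contribute no columns.
df = df[df["trackable_type"] == "Condition"].copy()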
df["condition_clean"] = (
    df["trackable_name"].str.lower()
    .str.strip()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
)

# Parse date
df["checkin_date"] = pd.to_datetime(df["checkin_date"], errors="coerce")
df = df.dropna(subset=["checkin_date"])  # keep only rows with a date

# Basic user-level fields
# Age cleaning: set out-of-range values to NaN, then impute (per-user median, then global median)
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df.loc[(df["age"] < 0) | (df["age"] > 110), "age"] = np.nan  # outside a biologically plausible range
df["age"] = df.groupby("user_id")["age"].transform(lambda s: s.fillna(s.median()))
df["age"] = df["age"].fillna(df["age"].median())

# Sex cleaning: normalize categories
def norm_sex(x):
    x = str(x).strip().lower()
    if x in {"male", "m"}: return "male"
    if x in {"female", "f"}: return "female"
    if x in {"nan", "none", "", "unknown"}: return "unknown"
    return "other"
df["sex"] = df["sex"].apply(norm_sex)

# Country cleaning
def norm_country(x):
    x = str(x).strip()
    return "unknown" if (x == "" or x.lower() == "nan") else x
df["country"] = df["country"].apply(norm_country)

# ========= 2) TEXT NORMALIZATION (for matching) =========
# Create a clean text column for matching on trackable_name
df["name_clean"] = (
    df["trackable_name"]
    .fillna("")
    .astype(str)
    .str.lower()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# ========= 3) HELPERS TO FLAG KEYWORDS =========
def add_keyword_flags(sub_df, groups_dict, prefix):
    """
    For a subset of df (e.g., only Symptoms, only Food...), add binary columns
    indicating whether trackable_name matches any keyword group for that row.
    We then aggregate to daily features later.
    """
    out = sub_df.copy()
    for group, kws in groups_dict.items():
        # Prebuild a regex OR pattern for speed; escape non-alnum safely
        pattern = r"(" + "|".join([re.escape(k.lower()) for k in kws]) + r")"
        col = f"{prefix}__{group}"
        out[col] = out["name_clean"].str.contains(pattern, regex=True).astype(int)
    return out

# ========= 4) PER-TYPE KEYWORD FLAGS =========
# NOTE: Requires condition_keyword_groups (incl. "epilepsy_seizure"),
#       symptom_keyword_groups, food_keyword_groups, tag_keyword_groups, treatment_keyword_groups

# Conditions (Condition rows only)
cond_rows = df[df["trackable_type"] == "Condition"]
cond_rows = add_keyword_flags(cond_rows, condition_keyword_groups, "cond")

# Symptoms
sym_rows = df[df["trackable_type"] == "Symptom"]
sym_rows = add_keyword_flags(sym_rows, symptom_keyword_groups, "sym")

# Food
food_rows = df[df["trackable_type"] == "Food"]
food_rows = add_keyword_flags(food_rows, food_keyword_groups, "food")

# Tags (triggers)
tag_rows = df[df["trackable_type"] == "Tag"]
tag_rows = add_keyword_flags(tag_rows, tag_keyword_groups, "tag")

# Treatments
trt_rows = df[df["trackable_type"] == "Treatment"]
trt_rows = add_keyword_flags(trt_rows, treatment_keyword_groups, "trt")

# Weather (already numeric columns; keep as-is if present)
weather_rows = df[df["trackable_type"] == "Weather"].copy()
# Example expected numeric weather columns (adjust to your schema if needed)
for col in ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]:
    if col in weather_rows.columns:
        weather_rows[col] = pd.to_numeric(weather_rows[col], errors="coerce")

# ========= 5) DAILY AGGREGATION (no leakage) =========
# We build daily features per user, then later roll 7/30d windows that use ONLY past data.

def daily_agg_flags(sub, prefix):
    # get only the flag columns
    flag_cols = [c for c in sub.columns if c.startswith(prefix + "__")]
    if not flag_cols:
        return pd.DataFrame(columns=["user_id", "checkin_date"])
    # include severity if available (0-4); we’ll take max per day
    if "trackable_value" in sub.columns:
        sub["severity_val"] = pd.to_numeric(sub["trackable_value"], errors="coerce")
    else:
        sub["severity_val"] = np.nan

    agg = (
        sub.groupby(["user_id", "checkin_date"])
           .agg({**{c: "max" for c in flag_cols}, "severity_val": "max"})
           .reset_index()
    )
    # rename severity
    if "severity_val" in agg.columns:
        agg = agg.rename(columns={"severity_val": f"{prefix}__max_severity"})
    return agg

daily_cond = daily_agg_flags(cond_rows,  "cond")
daily_sym  = daily_agg_flags(sym_rows,   "sym")
daily_food = daily_agg_flags(food_rows,  "food")
daily_tag  = daily_agg_flags(tag_rows,   "tag")
daily_trt  = daily_agg_flags(trt_rows,   "trt")

# Daily weather (mean if multiple in same day)
if not weather_rows.empty:
    wcols = ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]
    wcols = [c for c in wcols if c in weather_rows.columns]
    daily_wx = (
        weather_rows.groupby(["user_id", "checkin_date"])[wcols].mean().reset_index()
    )
else:
    daily_wx = pd.DataFrame(columns=["user_id", "checkin_date"])


# ========= Combine into one daily table =========
# Start from all user-day keys present in any table
parts = [daily_cond, daily_sym, daily_food, daily_tag, daily_trt, daily_wx]
daily = None
for p in parts:
    if p is None or p.empty: 
        continue
    daily = p if daily is None else pd.merge(daily, p, on=["user_id", "checkin_date"], how="outer")

if daily is None:
    raise ValueError("No daily features built. Check your inputs.")

daily = daily.sort_values(["user_id", "checkin_date"]).reset_index(drop=True)
daily = daily.fillna(0)  # for flags; numeric weather stays 0 if missing (fine for tree models)


# ========= 6) ROLLING (PAST-ONLY) FEATURES =========
# For each user, compute 7d/30d rolling sums of flags + rolling means of severities & weather.
def add_rollups(g):
    g = g.set_index("checkin_date").sort_index()
    # rolling windows (closed='left' to use ONLY past)
    win_defs = {"7d":"7D", "30d":"30D"}
    for col in g.columns:
        if col.startswith(("cond__", "sym__", "food__", "tag__", "trt__")) and not col.endswith("__max_severity"):
            for k, win in win_defs.items():
                g[f"{col}__sum_{k}"] = g[col].rolling(win, closed="left").sum()
        # severities & weather: rolling mean
        if col.endswith("__max_severity") or col in ["temperature_min","temperature_max","precip_intensity","pressure","humidity"]:
            for k, win in win_defs.items():
                g[f"{col}__mean_{k}"] = g[col].rolling(win, closed="left").mean()
    return g.reset_index()

daily = daily.groupby("user_id", group_keys=False).apply(add_rollups)
# Fill remaining NaNs from leading window edges
daily = daily.fillna(0)

# ========= 7) BUILD THE TARGET: WHO WILL EVER DEVELOP THE CONDITION =========
# (e.g., Epilepsy, POTS, Depression, etc.)

# Identify all condition-specific flag columns
target_condition = "epilepsy_seizure"  # <-- CHANGE THIS PER MODEL
target_cols = [c for c in daily.columns if c.startswith(f"cond__{target_condition}")]
if not target_cols:
    raise ValueError(f"No columns found for {target_condition}. Check your keyword mapping!")

# Determine which users ever reported the condition
target_users = daily.loc[daily[target_cols].max(axis=1) > 0, "user_id"].unique()

# Label every record from those users as 1 (ever developed condition), others as 0
label_col = f"label_future_{target_condition}"
daily[label_col] = daily["user_id"].isin(target_users).astype(int)

print(f"✅ {label_col}: {daily[label_col].mean():.3%} positive rate")

# ========= 8) ADD DEMOGRAPHICS (static) =========
# Build a static per-user table (age/sex/country) at any row; then merge with daily
demo = df.drop_duplicates("user_id")[["user_id","age","sex","country"]].copy()
daily = daily.merge(demo, on="user_id", how="left")

# One-hot encode sex & country (country can be many; consider top-K and bucket rest as 'other')
# Keep top 20 countries to control dimensionality
top_countries = df["country"].value_counts().head(20).index
daily["country_top"] = daily["country"].where(daily["country"].isin(top_countries), "other")

X_cat = pd.get_dummies(daily[["sex","country_top"]], drop_first=False, dtype=int)
daily = pd.concat([daily.drop(columns=["sex","country","country_top"]), X_cat], axis=1)

# ========= 9) TRAIN / TEST TEMPORAL SPLIT =========
# Choose a cutoff date (e.g., last year as test). Adjust to your range.
cutoff = daily["checkin_date"].quantile(0.8)  # 80% oldest for train, 20% most recent for test
train = daily[daily["checkin_date"] <= cutoff].copy()
test  = daily[daily["checkin_date"] >  cutoff].copy()

# Features: use all engineered columns except identifiers and leakage columns
label_col = f"label_future_{target_condition}"

leaky_cols = target_cols + [
    "cond__max_severity__mean_30d",
    "cond__max_severity__mean_7d", 
    "cond__max_severity"
]
drop_cols = {
    "user_id", "checkin_date",
    "has_epilepsy_today"
}.union(leaky_cols)

feature_cols = [c for c in daily.columns if c not in drop_cols]

X_train, X_test, y_train, y_test, feature_cols = make_train_test_balanced(
    df=daily,
    label_col=label_col,
    drop_cols=drop_cols,
    pos_to_neg_ratio=4,   # keep up to 4 negatives per positive in the training set
)

# Optional: scale numeric continuous columns (LightGBM doesn’t require it)

# ========= 10) TRAIN LIGHTGBM =========
clf = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.03,
    num_leaves=64,
    subsample=0.8,
    colsample_bytree=0.8,
    class_weight=None,
    random_state=42
)
clf.fit(X_train, y_train)

# ========= 11) EVALUATE =========
p_test = clf.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, p_test)
auprc = average_precision_score(y_test, p_test)

print(f"AUC  : {auc:.3f}")
print(f"AUPRC: {auprc:.3f}")

# Optional: show precision/recall at a few thresholds
prec, rec, thr = precision_recall_curve(y_test, p_test)
for t in [0.05, 0.10, 0.20]:
    # nearest threshold
    idx = (np.abs(thr - t)).argmin() if len(thr) else -1
    if idx >= 0 and idx < len(prec):
        print(f"Threshold~{t:0.2f}: Precision={prec[idx]:.3f}  Recall={rec[idx]:.3f}")

# Evaluate and store results
model_name = "Epilepsy_Future_LGBM"  # or "Epilepsy_30d", "RA_NextDay", etc.
evaluate_and_store_results(model_name, y_test, p_test, feature_cols, clf, results_path="model_results.csv")

# ========= 12) FEATURE IMPORTANCE =========
imp = pd.Series(clf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("\nTop 25 features:\n", imp.head(25))


✅ label_future_epilepsy_seizure: 1.400% positive rate
✅ Train/Test split complete: 16420 train (3284 pos, 13136 neg), 56263 test. Pos rate train=0.200, test=0.012
[LightGBM] [Info] Number of positive: 3284, number of negative: 13136
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.010852 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2435
[LightGBM] [Info] Number of data points in the train set: 16420, number of used features: 207
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.200000 -> initscore=-1.386294
[LightGBM] [Info] Start training from score -1.386294
AUC  : 0.912
AUPRC: 0.628
Threshold~0.05: Precision=0.089  Recall=0.746
Threshold~0.10: Precision=0.144  Recall=0.704
Threshold~0.20: Precision=0.224  Recall=0.679
✅ Results saved to model_results.csv


### Predicting Depression 

In [16]:
# ========= 1) LOAD & CLEAN =========
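# Load the full long-format table from disk
path = "/Users/cristybanuelos/Downloads/Chronic_Illness_Dataset.csv"
df = pd.read_csv(path)

# NOTE: keeping only Condition rows here means the Symptom/Food/Tag/Treatment/Weather
# subsets built in step 4 are empty, so those feature families contribute no columns.
df = df[df["trackable_type"] == "Condition"].copy()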
df["condition_clean"] = (
    df["trackable_name"].str.lower()
    .str.strip()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
)

# Parse date
df["checkin_date"] = pd.to_datetime(df["checkin_date"], errors="coerce")
df = df.dropna(subset=["checkin_date"])  # keep only rows with a date

# Basic user-level fields
# Age cleaning: set out-of-range values to NaN, then impute (per-user median, then global median)
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df.loc[(df["age"] < 0) | (df["age"] > 110), "age"] = np.nan  # outside a biologically plausible range
df["age"] = df.groupby("user_id")["age"].transform(lambda s: s.fillna(s.median()))
df["age"] = df["age"].fillna(df["age"].median())

# Sex cleaning: normalize categories
def norm_sex(x):
    x = str(x).strip().lower()
    if x in {"male", "m"}: return "male"
    if x in {"female", "f"}: return "female"
    if x in {"nan", "none", "", "unknown"}: return "unknown"
    return "other"
df["sex"] = df["sex"].apply(norm_sex)

# Country cleaning
def norm_country(x):
    x = str(x).strip()
    return "unknown" if (x == "" or x.lower() == "nan") else x
df["country"] = df["country"].apply(norm_country)

# ========= 2) TEXT NORMALIZATION (for matching) =========
# Create a clean text column for matching on trackable_name
df["name_clean"] = (
    df["trackable_name"]
    .fillna("")
    .astype(str)
    .str.lower()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# ========= 3) HELPERS TO FLAG KEYWORDS =========
def add_keyword_flags(sub_df, groups_dict, prefix):
    """
    For a subset of df (e.g., only Symptoms, only Food...), add binary columns
    indicating whether trackable_name matches any keyword group for that row.
    We then aggregate to daily features later.
    """
    out = sub_df.copy()
    for group, kws in groups_dict.items():
        # Prebuild a regex OR pattern for speed; escape non-alnum safely
        pattern = r"(" + "|".join([re.escape(k.lower()) for k in kws]) + r")"
        col = f"{prefix}__{group}"
        out[col] = out["name_clean"].str.contains(pattern, regex=True).astype(int)
    return out

# ========= 4) PER-TYPE KEYWORD FLAGS =========
# NOTE: Requires condition_keyword_groups (incl. "depression"),
#       symptom_keyword_groups, food_keyword_groups, tag_keyword_groups, treatment_keyword_groups

# Conditions (Condition rows only)
cond_rows = df[df["trackable_type"] == "Condition"]
cond_rows = add_keyword_flags(cond_rows, condition_keyword_groups, "cond")

# Symptoms
sym_rows = df[df["trackable_type"] == "Symptom"]
sym_rows = add_keyword_flags(sym_rows, symptom_keyword_groups, "sym")

# Food
food_rows = df[df["trackable_type"] == "Food"]
food_rows = add_keyword_flags(food_rows, food_keyword_groups, "food")

# Tags (triggers)
tag_rows = df[df["trackable_type"] == "Tag"]
tag_rows = add_keyword_flags(tag_rows, tag_keyword_groups, "tag")

# Treatments
trt_rows = df[df["trackable_type"] == "Treatment"]
trt_rows = add_keyword_flags(trt_rows, treatment_keyword_groups, "trt")

# Weather (already numeric columns; keep as-is if present)
weather_rows = df[df["trackable_type"] == "Weather"].copy()
# Example expected numeric weather columns (adjust to your schema if needed)
for col in ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]:
    if col in weather_rows.columns:
        weather_rows[col] = pd.to_numeric(weather_rows[col], errors="coerce")

# ========= 5) DAILY AGGREGATION (no leakage) =========
# We build daily features per user, then later roll 7/30d windows that use ONLY past data.

def daily_agg_flags(sub, prefix):
    # get only the flag columns
    flag_cols = [c for c in sub.columns if c.startswith(prefix + "__")]
    if not flag_cols:
        return pd.DataFrame(columns=["user_id", "checkin_date"])
    # include severity if available (0-4); we’ll take max per day
    if "trackable_value" in sub.columns:
        sub["severity_val"] = pd.to_numeric(sub["trackable_value"], errors="coerce")
    else:
        sub["severity_val"] = np.nan

    agg = (
        sub.groupby(["user_id", "checkin_date"])
           .agg({**{c: "max" for c in flag_cols}, "severity_val": "max"})
           .reset_index()
    )
    # rename severity
    if "severity_val" in agg.columns:
        agg = agg.rename(columns={"severity_val": f"{prefix}__max_severity"})
    return agg

daily_cond = daily_agg_flags(cond_rows,  "cond")
daily_sym  = daily_agg_flags(sym_rows,   "sym")
daily_food = daily_agg_flags(food_rows,  "food")
daily_tag  = daily_agg_flags(tag_rows,   "tag")
daily_trt  = daily_agg_flags(trt_rows,   "trt")

# Daily weather (mean if multiple in same day)
if not weather_rows.empty:
    wcols = ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]
    wcols = [c for c in wcols if c in weather_rows.columns]
    daily_wx = (
        weather_rows.groupby(["user_id", "checkin_date"])[wcols].mean().reset_index()
    )
else:
    daily_wx = pd.DataFrame(columns=["user_id", "checkin_date"])

# ========= Combine into one daily table =========
# Start from all user-day keys present in any table
parts = [daily_cond, daily_sym, daily_food, daily_tag, daily_trt, daily_wx]
daily = None
for p in parts:
    if p is None or p.empty: 
        continue
    daily = p if daily is None else pd.merge(daily, p, on=["user_id", "checkin_date"], how="outer")

if daily is None:
    raise ValueError("No daily features built. Check your inputs.")

daily = daily.sort_values(["user_id", "checkin_date"]).reset_index(drop=True)
daily = daily.fillna(0)  # for flags; numeric weather stays 0 if missing (fine for tree models)

# ========= 6) ROLLING (PAST-ONLY) FEATURES =========
# For each user, compute 7d/30d rolling sums of flags + rolling means of severities & weather.
def add_rollups(g):
    g = g.set_index("checkin_date").sort_index()
    # rolling windows (closed='left' to use ONLY past)
    win_defs = {"7d":"7D", "30d":"30D"}
    for col in g.columns:
        if col.startswith(("cond__", "sym__", "food__", "tag__", "trt__")) and not col.endswith("__max_severity"):
            for k, win in win_defs.items():
                g[f"{col}__sum_{k}"] = g[col].rolling(win, closed="left").sum()
        # severities & weather: rolling mean
        if col.endswith("__max_severity") or col in ["temperature_min","temperature_max","precip_intensity","pressure","humidity"]:
            for k, win in win_defs.items():
                g[f"{col}__mean_{k}"] = g[col].rolling(win, closed="left").mean()
    return g.reset_index()

daily = daily.groupby("user_id", group_keys=False).apply(add_rollups)
# Fill remaining NaNs from leading window edges
daily = daily.fillna(0)

# ========= 7) BUILD THE TARGET: WHO WILL EVER DEVELOP THE CONDITION =========
# Example here is Depression — change target_condition per script.

target_condition = "depression"  # <-- CHANGE THIS for each model
target_cols = [c for c in daily.columns if c.startswith(f"cond__{target_condition}")]
if not target_cols:
    raise ValueError(f"No columns found for {target_condition}. Check your keyword mapping!")

# Determine which users ever reported this condition
target_users = daily.loc[daily[target_cols].max(axis=1) > 0, "user_id"].unique()

# Label every record from those users as 1 (ever developed condition), others as 0
label_col = f"label_future_{target_condition}"
daily[label_col] = daily["user_id"].isin(target_users).astype(int)

print(f"✅ {label_col}: {daily[label_col].mean():.3%} positive rate")


# ========= 8) ADD DEMOGRAPHICS (static) =========
# Build a static per-user table (age/sex/country) at any row; then merge with daily
demo = df.drop_duplicates("user_id")[["user_id","age","sex","country"]].copy()
daily = daily.merge(demo, on="user_id", how="left")

# One-hot encode sex & country (country can be many; consider top-K and bucket rest as 'other')
# Keep top 20 countries to control dimensionality
top_countries = df["country"].value_counts().head(20).index
daily["country_top"] = daily["country"].where(daily["country"].isin(top_countries), "other")

X_cat = pd.get_dummies(daily[["sex","country_top"]], drop_first=False, dtype=int)
daily = pd.concat([daily.drop(columns=["sex","country","country_top"]), X_cat], axis=1)

# ========= 9) TRAIN / TEST TEMPORAL SPLIT =========
# Choose a cutoff date (e.g., last year as test). Adjust to your range.
cutoff = daily["checkin_date"].quantile(0.8)  # 80% oldest for train, 20% most recent for test
train = daily[daily["checkin_date"] <= cutoff].copy()
test  = daily[daily["checkin_date"] >  cutoff].copy()

# Features: use all engineered columns except identifiers and leakage columns
label_col = f"label_future_{target_condition}"

leaky_cols = target_cols + [
    "cond__max_severity__mean_30d",
    "cond__max_severity__mean_7d", 
    "cond__max_severity"
]

drop_cols = {
    "user_id", "checkin_date", label_col
}.union(leaky_cols)

feature_cols = [c for c in daily.columns if c not in drop_cols]

X_train, X_test, y_train, y_test, feature_cols = make_train_test_balanced(
    df=daily,
    label_col=label_col,
    drop_cols=drop_cols,
    pos_to_neg_ratio=4,   # keep up to 4 negatives per positive in the training set
)

# Optional: scale numeric continuous columns (LightGBM doesn’t require it)

# ========= 10) TRAIN LIGHTGBM =========
clf = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.03,
    num_leaves=64,
    subsample=0.8,
    colsample_bytree=0.8,
    class_weight=None,
    random_state=42
)
clf.fit(X_train, y_train)

# ========= 11) EVALUATE =========
p_test = clf.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, p_test)
auprc = average_precision_score(y_test, p_test)

print(f"AUC  : {auc:.3f}")
print(f"AUPRC: {auprc:.3f}")

# Optional: show precision/recall at a few thresholds
prec, rec, thr = precision_recall_curve(y_test, p_test)
for t in [0.05, 0.10, 0.20]:
    # nearest threshold
    idx = (np.abs(thr - t)).argmin() if len(thr) else -1
    if idx >= 0 and idx < len(prec):
        print(f"Threshold~{t:0.2f}: Precision={prec[idx]:.3f}  Recall={rec[idx]:.3f}")

# Evaluate and store results
model_name = "Depression_Future_LGBM"  # or "Epilepsy_30d", "RA_NextDay", etc.
evaluate_and_store_results(model_name, y_test, p_test, feature_cols, clf, results_path="model_results.csv")

# ========= 12) FEATURE IMPORTANCE =========
imp = pd.Series(clf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("\nTop 25 features:\n", imp.head(25))


✅ label_future_depression: 26.076% positive rate
✅ Train/Test split complete: 225219 train (59438 pos, 165781 neg), 56263 test. Pos rate train=0.264, test=0.248
[LightGBM] [Info] Number of positive: 59438, number of negative: 165781
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.029573 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2637
[LightGBM] [Info] Number of data points in the train set: 225219, number of used features: 210
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.263912 -> initscore=-1.025734
[LightGBM] [Info] Start training from score -1.025734
AUC  : 0.907
AUPRC: 0.819
Threshold~0.05: Precision=0.371  Recall=0.957
Threshold~0.10: Precision=0.514  Recall=0.882
Threshold~0.20: Precision=0.671  Recall=0.809
✅ Results saved to model_results.csv
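
The `Threshold~` lines above report precision and recall at the curve point whose threshold is closest to each requested value. As a minimal sketch of what those two numbers mean, the following computes them directly from labels and scores in plain Python. The labels and scores are made up for illustration and are not taken from the dataset, and `precision_recall_at` is a hypothetical helper rather than a scikit-learn function.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    # Conventions for empty denominators: no predicted positives -> precision 1.0,
    # no actual positives -> recall 0.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (illustrative only)
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.02, 0.07, 0.08, 0.15, 0.30, 0.55, 0.60, 0.90]

for t in (0.05, 0.10, 0.20):
    p, r = precision_recall_at(y_true, scores, t)
    print(f"Threshold={t:.2f}: Precision={p:.3f}  Recall={r:.3f}")
```

Lowering the threshold marks more rows as positive, which can only keep or raise recall and usually lowers precision. The depression run above shows the same trade-off: moving from 0.20 to 0.05 raised recall from 0.809 to 0.957 while precision fell from 0.671 to 0.371.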


### Predicting Anxiety

In [18]:
# ========= 1) LOAD & CLEAN =========
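# Load the full long-format table (columns as described under Inputs)
path = "/Users/cristybanuelos/Downloads/Chronic_Illness_Dataset.csv"
df = pd.read_csv(path)

# NOTE: this filter keeps only Condition rows, so the Symptom/Food/Tag/Treatment/Weather
# subsets built in step 4 come out empty; drop it if those feature families should be used.
df = df[df["trackable_type"] == "Condition"].copy()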
df["condition_clean"] = (
    df["trackable_name"].str.lower()
    .str.strip()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
)

# Parse date
df["checkin_date"] = pd.to_datetime(df["checkin_date"], errors="coerce")
df = df.dropna(subset=["checkin_date"])  # keep only rows with a date

# Basic user-level fields
# Age cleaning: set implausible ages to NaN, then impute (per-user median, then global median)
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df.loc[(df["age"] < 0) | (df["age"] > 110), "age"] = np.nan  # ages outside 0-110 are treated as missing
df["age"] = df.groupby("user_id")["age"].transform(lambda s: s.fillna(s.median()))
df["age"] = df["age"].fillna(df["age"].median())

# Sex cleaning: normalize categories
def norm_sex(x):
    x = str(x).strip().lower()
    if x in {"male", "m"}: return "male"
    if x in {"female", "f"}: return "female"
    if x in {"nan", "none", "", "unknown"}: return "unknown"
    return "other"
df["sex"] = df["sex"].apply(norm_sex)

# Country cleaning
def norm_country(x):
    x = str(x).strip()
    return "unknown" if (x == "" or x.lower() == "nan") else x
df["country"] = df["country"].apply(norm_country)

# ========= 2) TEXT NORMALIZATION (for matching) =========
# Create a clean text column for matching on trackable_name
df["name_clean"] = (
    df["trackable_name"]
    .fillna("")
    .astype(str)
    .str.lower()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# ========= 3) HELPERS TO FLAG KEYWORDS =========
def add_keyword_flags(sub_df, groups_dict, prefix):
    """
    For a subset of df (e.g., only Symptoms, only Food...), add binary columns
    indicating whether trackable_name matches any keyword group for that row.
    We then aggregate to daily features later.
    """
    out = sub_df.copy()
    for group, kws in groups_dict.items():
        # Build one regex alternation per group; re.escape makes each keyword literal,
        # and the non-capturing group avoids pandas' match-group warning in str.contains
        pattern = r"(?:" + "|".join([re.escape(k.lower()) for k in kws]) + r")"
        col = f"{prefix}__{group}"
        out[col] = out["name_clean"].str.contains(pattern, regex=True).astype(int)
    return out

# ========= 4) PER-TYPE KEYWORD FLAGS =========
# NOTE: You already have: condition_keyword_groups (conditions incl. "epilepsy_seizure"),
#       symptom_keyword_groups, food_keyword_groups, tag_keyword_groups, treatment_keyword_groups

# Conditions (Condition rows only)
cond_rows = df[df["trackable_type"] == "Condition"]
cond_rows = add_keyword_flags(cond_rows, condition_keyword_groups, "cond")

# Symptoms
sym_rows = df[df["trackable_type"] == "Symptom"]
sym_rows = add_keyword_flags(sym_rows, symptom_keyword_groups, "sym")

# Food
food_rows = df[df["trackable_type"] == "Food"]
food_rows = add_keyword_flags(food_rows, food_keyword_groups, "food")

# Tags (triggers)
tag_rows = df[df["trackable_type"] == "Tag"]
tag_rows = add_keyword_flags(tag_rows, tag_keyword_groups, "tag")

# Treatments
trt_rows = df[df["trackable_type"] == "Treatment"]
trt_rows = add_keyword_flags(trt_rows, treatment_keyword_groups, "trt")

# Weather (already numeric columns; keep as-is if present)
weather_rows = df[df["trackable_type"] == "Weather"].copy()
# Example expected numeric weather columns (adjust to your schema if needed)
for col in ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]:
    if col in weather_rows.columns:
        weather_rows[col] = pd.to_numeric(weather_rows[col], errors="coerce")

# ========= 5) DAILY AGGREGATION (no leakage) =========
# We build daily features per user, then later roll 7/30d windows that use ONLY past data.

def daily_agg_flags(sub, prefix):
    # get only the flag columns
    flag_cols = [c for c in sub.columns if c.startswith(prefix + "__")]
    if not flag_cols:
        return pd.DataFrame(columns=["user_id", "checkin_date"])
    # include severity if available (0-4); we’ll take max per day
    if "trackable_value" in sub.columns:
        sub["severity_val"] = pd.to_numeric(sub["trackable_value"], errors="coerce")
    else:
        sub["severity_val"] = np.nan

    agg = (
        sub.groupby(["user_id", "checkin_date"])
           .agg({**{c: "max" for c in flag_cols}, "severity_val": "max"})
           .reset_index()
    )
    # rename severity
    if "severity_val" in agg.columns:
        agg = agg.rename(columns={"severity_val": f"{prefix}__max_severity"})
    return agg

daily_cond = daily_agg_flags(cond_rows,  "cond")
daily_sym  = daily_agg_flags(sym_rows,   "sym")
daily_food = daily_agg_flags(food_rows,  "food")
daily_tag  = daily_agg_flags(tag_rows,   "tag")
daily_trt  = daily_agg_flags(trt_rows,   "trt")

# Daily weather (mean if multiple in same day)
if not weather_rows.empty:
    wcols = ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]
    wcols = [c for c in wcols if c in weather_rows.columns]
    daily_wx = (
        weather_rows.groupby(["user_id", "checkin_date"])[wcols].mean().reset_index()
    )
else:
    daily_wx = pd.DataFrame(columns=["user_id", "checkin_date"])

# ========= Combine into one daily table =========
# Start from all user-day keys present in any table
parts = [daily_cond, daily_sym, daily_food, daily_tag, daily_trt, daily_wx]
daily = None
for p in parts:
    if p is None or p.empty: 
        continue
    daily = p if daily is None else pd.merge(daily, p, on=["user_id", "checkin_date"], how="outer")

if daily is None:
    raise ValueError("No daily features built. Check your inputs.")

daily = daily.sort_values(["user_id", "checkin_date"]).reset_index(drop=True)
daily = daily.fillna(0)  # for flags; numeric weather stays 0 if missing (fine for tree models)

# ========= 6) ROLLING (PAST-ONLY) FEATURES =========
# For each user, compute 7d/30d rolling sums of flags + rolling means of severities & weather.
def add_rollups(g):
    g = g.set_index("checkin_date").sort_index()
    # rolling windows (closed='left' to use ONLY past)
    win_defs = {"7d":"7D", "30d":"30D"}
    for col in list(g.columns):  # snapshot, because new columns are added inside the loop
        if col.startswith(("cond__", "sym__", "food__", "tag__", "trt__")) and not col.endswith("__max_severity"):
            for k, win in win_defs.items():
                g[f"{col}__sum_{k}"] = g[col].rolling(win, closed="left").sum()
        # severities & weather: rolling mean
        if col.endswith("__max_severity") or col in ["temperature_min","temperature_max","precip_intensity","pressure","humidity"]:
            for k, win in win_defs.items():
                g[f"{col}__mean_{k}"] = g[col].rolling(win, closed="left").mean()
    return g.reset_index()

daily = daily.groupby("user_id", group_keys=False).apply(add_rollups)
# Fill remaining NaNs from leading window edges
daily = daily.fillna(0)

# ========= 7) BUILD THE TARGET: WHO WILL EVER DEVELOP THE CONDITION =========
# Target here is anxiety; change target_condition per model.

target_condition = "anxiety"  # <-- CHANGE THIS for each model
target_cols = [c for c in daily.columns if c.startswith(f"cond__{target_condition}")]
if not target_cols:
    raise ValueError(f"No columns found for {target_condition}. Check your keyword mapping!")

# Determine which users ever reported this condition
target_users = daily.loc[daily[target_cols].max(axis=1) > 0, "user_id"].unique()

# Label every record from those users as 1 (ever developed condition), others as 0
label_col = f"label_future_{target_condition}"
daily[label_col] = daily["user_id"].isin(target_users).astype(int)

print(f"✅ {label_col}: {daily[label_col].mean():.3%} positive rate")



# ========= 8) ADD DEMOGRAPHICS (static) =========
# Build a static per-user table (age/sex/country) at any row; then merge with daily
demo = df.drop_duplicates("user_id")[["user_id","age","sex","country"]].copy()
daily = daily.merge(demo, on="user_id", how="left")

# One-hot encode sex & country (country can be many; consider top-K and bucket rest as 'other')
# Keep top 20 countries to control dimensionality
top_countries = df["country"].value_counts().head(20).index
daily["country_top"] = daily["country"].where(daily["country"].isin(top_countries), "other")

X_cat = pd.get_dummies(daily[["sex","country_top"]], drop_first=False, dtype=int)
daily = pd.concat([daily.drop(columns=["sex","country","country_top"]), X_cat], axis=1)

# ========= 9) TRAIN / TEST TEMPORAL SPLIT =========
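# Choose a cutoff date (e.g., last year as test). Adjust to your range.
# NOTE: `train` and `test` are not referenced again in this cell; the split used
# for modeling comes from make_train_test_balanced below.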
cutoff = daily["checkin_date"].quantile(0.8)  # 80% oldest for train, 20% most recent for test
train = daily[daily["checkin_date"] <= cutoff].copy()
test  = daily[daily["checkin_date"] >  cutoff].copy()

# Features: use all engineered columns except identifiers and leakage columns
label_col = f"label_future_{target_condition}"

leaky_cols = target_cols + [
    "cond__max_severity__mean_30d",
    "cond__max_severity__mean_7d", 
    "cond__max_severity"
]
drop_cols = {
    "user_id", "checkin_date", label_col
}.union(leaky_cols)

feature_cols = [c for c in daily.columns if c not in drop_cols]

X_train, X_test, y_train, y_test, feature_cols = make_train_test_balanced(
    df=daily,
    label_col=label_col,
    drop_cols=drop_cols,
    pos_to_neg_ratio=4,   # read as neg:pos (up to 4 negatives per positive), despite the parameter name
)

# Optional: scale numeric continuous columns (LightGBM doesn’t require it)

# ========= 10) TRAIN LIGHTGBM =========
clf = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.03,
    num_leaves=64,
    subsample=0.8,
    colsample_bytree=0.8,
    class_weight="balanced",
    random_state=42
)
clf.fit(X_train, y_train)

# ========= 11) EVALUATE =========
p_test = clf.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, p_test)
auprc = average_precision_score(y_test, p_test)

print(f"AUC  : {auc:.3f}")
print(f"AUPRC: {auprc:.3f}")

# Optional: show precision/recall at a few thresholds
prec, rec, thr = precision_recall_curve(y_test, p_test)
for t in [0.05, 0.10, 0.20]:
    # nearest threshold
    idx = (np.abs(thr - t)).argmin() if len(thr) else -1
    if idx >= 0 and idx < len(prec):
        print(f"Threshold~{t:0.2f}: Precision={prec[idx]:.3f}  Recall={rec[idx]:.3f}")

# Evaluate and store results
model_name = "Anxiety_Future_LGBM"  # or "Epilepsy_30d", "RA_NextDay", etc.
evaluate_and_store_results(model_name, y_test, p_test, feature_cols, clf, results_path="model_results.csv")

# ========= 12) FEATURE IMPORTANCE =========
imp = pd.Series(clf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("\nTop 25 features:\n", imp.head(25))


✅ label_future_anxiety: 27.095% positive rate
✅ Train/Test split complete: 225219 train (61297 pos, 163922 neg), 56263 test. Pos rate train=0.272, test=0.266
[LightGBM] [Info] Number of positive: 61297, number of negative: 163922
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.029417 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2636
[LightGBM] [Info] Number of data points in the train set: 225219, number of used features: 210
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
AUC  : 0.894
AUPRC: 0.817
Threshold~0.05: Precision=0.358  Recall=0.944
Threshold~0.10: Precision=0.404  Recall=0.927
Threshold~0.20: Precision=0.514  Recall=0.874
✅ Results saved to model_results.csv
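
The LightGBM log above reports `pavg=0.500000` and `initscore=0.000000` even though the training set is about 27% positive, whereas the depression run (with `class_weight=None`) logged `pavg=0.263912` and `initscore=-1.025734`. That is consistent with `class_weight="balanced"` reweighting the classes to equal total weight. A small arithmetic sketch follows, assuming the scikit-learn "balanced" convention `n_samples / (n_classes * class_count)` and assuming the initial score is the log-odds of the (weighted) positive fraction. The helper names are hypothetical.

```python
import math


def balanced_class_weights(n_pos, n_neg):
    """Per-class weights under the 'balanced' heuristic: n_samples / (n_classes * class_count)."""
    n = n_pos + n_neg
    return n / (2 * n_pos), n / (2 * n_neg)


def log_odds(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))


# Anxiety run: counts from the training log above, with balanced weights
n_pos, n_neg = 61297, 163922
w_pos, w_neg = balanced_class_weights(n_pos, n_neg)
weighted_pavg = (w_pos * n_pos) / (w_pos * n_pos + w_neg * n_neg)
print(f"w_pos={w_pos:.4f}  w_neg={w_neg:.4f}  weighted pavg={weighted_pavg:.6f}")

# Depression run: counts from its training log, no class weights
dep_pavg = 59438 / (59438 + 165781)
print(f"depression pavg={dep_pavg:.6f}  initscore={log_odds(dep_pavg):.6f}")
```

Under these assumptions both classes carry the same total weight, so the weighted positive fraction is 0.5 and its log-odds is 0. One practical consequence is that predicted probabilities from the balanced model are shifted upward relative to an unweighted model, so a fixed threshold such as 0.05 or 0.10 is not directly comparable between the two runs.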


### Predicting POTS

In [20]:
# ========= 1) LOAD & CLEAN =========
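# Load the full long-format table (columns as described under Inputs)
path = "/Users/cristybanuelos/Downloads/Chronic_Illness_Dataset.csv"
df = pd.read_csv(path)

# NOTE: this filter keeps only Condition rows, so the Symptom/Food/Tag/Treatment/Weather
# subsets built in step 4 come out empty; drop it if those feature families should be used.
df = df[df["trackable_type"] == "Condition"].copy()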
df["condition_clean"] = (
    df["trackable_name"].str.lower()
    .str.strip()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
)

# Parse date
df["checkin_date"] = pd.to_datetime(df["checkin_date"], errors="coerce")
df = df.dropna(subset=["checkin_date"])  # keep only rows with a date

# Basic user-level fields
# Age cleaning: set implausible ages to NaN, then impute (per-user median, then global median)
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df.loc[(df["age"] < 0) | (df["age"] > 110), "age"] = np.nan  # ages outside 0-110 are treated as missing
df["age"] = df.groupby("user_id")["age"].transform(lambda s: s.fillna(s.median()))
df["age"] = df["age"].fillna(df["age"].median())

# Sex cleaning: normalize categories
def norm_sex(x):
    x = str(x).strip().lower()
    if x in {"male", "m"}: return "male"
    if x in {"female", "f"}: return "female"
    if x in {"nan", "none", "", "unknown"}: return "unknown"
    return "other"
df["sex"] = df["sex"].apply(norm_sex)

# Country cleaning
def norm_country(x):
    x = str(x).strip()
    return "unknown" if (x == "" or x.lower() == "nan") else x
df["country"] = df["country"].apply(norm_country)

# ========= 2) TEXT NORMALIZATION (for matching) =========
# Create a clean text column for matching on trackable_name
df["name_clean"] = (
    df["trackable_name"]
    .fillna("")
    .astype(str)
    .str.lower()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# ========= 3) HELPERS TO FLAG KEYWORDS =========
def add_keyword_flags(sub_df, groups_dict, prefix):
    """
    For a subset of df (e.g., only Symptoms, only Food...), add binary columns
    indicating whether trackable_name matches any keyword group for that row.
    We then aggregate to daily features later.
    """
    out = sub_df.copy()
    for group, kws in groups_dict.items():
        # Build one regex alternation per group; re.escape makes each keyword literal,
        # and the non-capturing group avoids pandas' match-group warning in str.contains
        pattern = r"(?:" + "|".join([re.escape(k.lower()) for k in kws]) + r")"
        col = f"{prefix}__{group}"
        out[col] = out["name_clean"].str.contains(pattern, regex=True).astype(int)
    return out

# ========= 4) PER-TYPE KEYWORD FLAGS =========
# NOTE: You already have: condition_keyword_groups (conditions incl. "epilepsy_seizure"),
#       symptom_keyword_groups, food_keyword_groups, tag_keyword_groups, treatment_keyword_groups

# Conditions (Condition rows only)
cond_rows = df[df["trackable_type"] == "Condition"]
cond_rows = add_keyword_flags(cond_rows, condition_keyword_groups, "cond")

# Symptoms
sym_rows = df[df["trackable_type"] == "Symptom"]
sym_rows = add_keyword_flags(sym_rows, symptom_keyword_groups, "sym")

# Food
food_rows = df[df["trackable_type"] == "Food"]
food_rows = add_keyword_flags(food_rows, food_keyword_groups, "food")

# Tags (triggers)
tag_rows = df[df["trackable_type"] == "Tag"]
tag_rows = add_keyword_flags(tag_rows, tag_keyword_groups, "tag")

# Treatments
trt_rows = df[df["trackable_type"] == "Treatment"]
trt_rows = add_keyword_flags(trt_rows, treatment_keyword_groups, "trt")

# Weather (already numeric columns; keep as-is if present)
weather_rows = df[df["trackable_type"] == "Weather"].copy()
# Example expected numeric weather columns (adjust to your schema if needed)
for col in ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]:
    if col in weather_rows.columns:
        weather_rows[col] = pd.to_numeric(weather_rows[col], errors="coerce")

# ========= 5) DAILY AGGREGATION (no leakage) =========
# We build daily features per user, then later roll 7/30d windows that use ONLY past data.

def daily_agg_flags(sub, prefix):
    # get only the flag columns
    flag_cols = [c for c in sub.columns if c.startswith(prefix + "__")]
    if not flag_cols:
        return pd.DataFrame(columns=["user_id", "checkin_date"])
    # include severity if available (0-4); we’ll take max per day
    if "trackable_value" in sub.columns:
        sub["severity_val"] = pd.to_numeric(sub["trackable_value"], errors="coerce")
    else:
        sub["severity_val"] = np.nan

    agg = (
        sub.groupby(["user_id", "checkin_date"])
           .agg({**{c: "max" for c in flag_cols}, "severity_val": "max"})
           .reset_index()
    )
    # rename severity
    if "severity_val" in agg.columns:
        agg = agg.rename(columns={"severity_val": f"{prefix}__max_severity"})
    return agg

daily_cond = daily_agg_flags(cond_rows,  "cond")
daily_sym  = daily_agg_flags(sym_rows,   "sym")
daily_food = daily_agg_flags(food_rows,  "food")
daily_tag  = daily_agg_flags(tag_rows,   "tag")
daily_trt  = daily_agg_flags(trt_rows,   "trt")

# Daily weather (mean if multiple in same day)
if not weather_rows.empty:
    wcols = ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]
    wcols = [c for c in wcols if c in weather_rows.columns]
    daily_wx = (
        weather_rows.groupby(["user_id", "checkin_date"])[wcols].mean().reset_index()
    )
else:
    daily_wx = pd.DataFrame(columns=["user_id", "checkin_date"])

# ========= Combine into one daily table =========
# Start from all user-day keys present in any table
parts = [daily_cond, daily_sym, daily_food, daily_tag, daily_trt, daily_wx]
daily = None
for p in parts:
    if p is None or p.empty: 
        continue
    daily = p if daily is None else pd.merge(daily, p, on=["user_id", "checkin_date"], how="outer")

if daily is None:
    raise ValueError("No daily features built. Check your inputs.")

daily = daily.sort_values(["user_id", "checkin_date"]).reset_index(drop=True)
daily = daily.fillna(0)  # for flags; numeric weather stays 0 if missing (fine for tree models)

# ========= 6) ROLLING (PAST-ONLY) FEATURES =========
# For each user, compute 7d/30d rolling sums of flags + rolling means of severities & weather.
def add_rollups(g):
    g = g.set_index("checkin_date").sort_index()
    # rolling windows (closed='left' to use ONLY past)
    win_defs = {"7d":"7D", "30d":"30D"}
    for col in list(g.columns):  # snapshot, because new columns are added inside the loop
        if col.startswith(("cond__", "sym__", "food__", "tag__", "trt__")) and not col.endswith("__max_severity"):
            for k, win in win_defs.items():
                g[f"{col}__sum_{k}"] = g[col].rolling(win, closed="left").sum()
        # severities & weather: rolling mean
        if col.endswith("__max_severity") or col in ["temperature_min","temperature_max","precip_intensity","pressure","humidity"]:
            for k, win in win_defs.items():
                g[f"{col}__mean_{k}"] = g[col].rolling(win, closed="left").mean()
    return g.reset_index()

daily = daily.groupby("user_id", group_keys=False).apply(add_rollups)
# Fill remaining NaNs from leading window edges
daily = daily.fillna(0)

# ========= 7) BUILD THE TARGET: WHO WILL EVER DEVELOP THE CONDITION =========
# Target here is POTS/dysautonomia; change target_condition per model.

target_condition = "pots_dysautonomia"  # <-- CHANGE THIS for each model
target_cols = [c for c in daily.columns if c.startswith(f"cond__{target_condition}")]
if not target_cols:
    raise ValueError(f"No columns found for {target_condition}. Check your keyword mapping!")

# Determine which users ever reported this condition
target_users = daily.loc[daily[target_cols].max(axis=1) > 0, "user_id"].unique()

# Label every record from those users as 1 (ever developed condition), others as 0
label_col = f"label_future_{target_condition}"
daily[label_col] = daily["user_id"].isin(target_users).astype(int)

print(f"✅ {label_col}: {daily[label_col].mean():.3%} positive rate")

# ========= 8) ADD DEMOGRAPHICS (static) =========
demo = df.drop_duplicates("user_id")[["user_id","age","sex","country"]].copy()
daily = daily.merge(demo, on="user_id", how="left")

# One-hot encode sex & country (country can be many; consider top-K and bucket rest as 'other')
top_countries = df["country"].value_counts().head(20).index
daily["country_top"] = daily["country"].where(daily["country"].isin(top_countries), "other")

X_cat = pd.get_dummies(daily[["sex","country_top"]], drop_first=False, dtype=int)
daily = pd.concat([daily.drop(columns=["sex","country","country_top"]), X_cat], axis=1)

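# ========= 9) TRAIN / TEST TEMPORAL SPLIT =========
# NOTE: `train` and `test` are not referenced again in this cell; the split used
# for modeling comes from make_train_test_balanced below.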
cutoff = daily["checkin_date"].quantile(0.8)  # 80% oldest for train, 20% most recent for test
train = daily[daily["checkin_date"] <= cutoff].copy()
test  = daily[daily["checkin_date"] >  cutoff].copy()

# Features: exclude identifiers & leakage columns
label_col = f"label_future_{target_condition}"
leaky_cols = target_cols + [
    "cond__max_severity__mean_30d",
    "cond__max_severity__mean_7d", 
    "cond__max_severity"
]
drop_cols = {
    "user_id", "checkin_date", label_col
}.union(leaky_cols)

# Every remaining column is a candidate feature
feature_cols = [c for c in daily.columns if c not in drop_cols]

X_train, X_test, y_train, y_test, feature_cols = make_train_test_balanced(
    df=daily,
    label_col=label_col,
    drop_cols=drop_cols,
    pos_to_neg_ratio=4,   # read as neg:pos (4 negatives per positive, matching the split log), despite the parameter name
)


# ========= 10) TRAIN LIGHTGBM (unchanged) =========
clf = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.03,
    num_leaves=64,
    subsample=0.8,
    colsample_bytree=0.8,
    class_weight=None,
    random_state=42
)
clf.fit(X_train, y_train)

# ========= 11) EVALUATE =========
p_test = clf.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, p_test)
auprc = average_precision_score(y_test, p_test)

print(f"AUC  : {auc:.3f}")
print(f"AUPRC: {auprc:.3f}")

from sklearn.metrics import precision_recall_curve
prec, rec, thr = precision_recall_curve(y_test, p_test)
for t in [0.05, 0.10, 0.20]:
    idx = (np.abs(thr - t)).argmin() if len(thr) else -1
    if idx >= 0 and idx < len(prec):
        print(f"Threshold~{t:0.2f}: Precision={prec[idx]:.3f}  Recall={rec[idx]:.3f}")

# Evaluate and store results
model_name = "POTS_Future_LGBM"  # or "Epilepsy_30d", "RA_NextDay", etc.
evaluate_and_store_results(model_name, y_test, p_test, feature_cols, clf, results_path="model_results.csv")


# ========= 12) FEATURE IMPORTANCE =========
imp = pd.Series(clf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("\nTop 25 features:\n", imp.head(25))


✅ label_future_pots_dysautonomia: 6.413% positive rate
✅ Train/Test split complete: 71950 train (14390 pos, 57560 neg), 56263 test. Pos rate train=0.200, test=0.065
[LightGBM] [Info] Number of positive: 14390, number of negative: 57560
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.017598 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2590
[LightGBM] [Info] Number of data points in the train set: 71950, number of used features: 209
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.200000 -> initscore=-1.386294
[LightGBM] [Info] Start training from score -1.386294
AUC  : 0.990
AUPRC: 0.950
Threshold~0.05: Precision=0.594  Recall=0.960
Threshold~0.10: Precision=0.739  Recall=0.937
Threshold~0.20: Precision=0.824  Recall=0.931
✅ Results saved to model_results.csv
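
The rolling features built in step 6 of the cell above use `rolling(win, closed="left")` on a date index so that each day's 7-day or 30-day aggregate excludes the current day. The plain-Python sketch below illustrates that window rule, assuming the window for a day `t` covers `[t - window, t)`. It is an illustration of the semantics, not the pandas implementation; the check-in dates are made up and `past_window_sum` is a hypothetical helper.

```python
from datetime import date, timedelta


def past_window_sum(days, values, window_days):
    """For each day t, sum the values observed in [t - window_days, t).

    The current day is excluded, so the feature only sees the past.
    Days with no earlier observation inside the window get None.
    """
    out = []
    for t in days:
        start = t - timedelta(days=window_days)
        past = [v for d, v in zip(days, values) if start <= d < t]
        out.append(sum(past) if past else None)
    return out


# Made-up check-in days for one user, with a binary flag per day
days = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 5), date(2020, 1, 9)]
flags = [1, 0, 1, 1]

for d, s in zip(days, past_window_sum(days, flags, window_days=7)):
    print(d.isoformat(), s)
```

The first day has no history, which corresponds to the NaN values that the notebook fills with 0 after the rollups. In the 7-day case, the entry for 2020-01-09 no longer counts the flag from 2020-01-01, because that day falls before the window start of 2020-01-02.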


### Predicting Inflammatory Arthritis

In [21]:
# ========= 1) LOAD & CLEAN =========
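# Load the full long-format table (columns as described under Inputs)
path = "/Users/cristybanuelos/Downloads/Chronic_Illness_Dataset.csv"
df = pd.read_csv(path)

# NOTE: this filter keeps only Condition rows, so the Symptom/Food/Tag/Treatment/Weather
# subsets built in step 4 come out empty; drop it if those feature families should be used.
df = df[df["trackable_type"] == "Condition"].copy()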
df["condition_clean"] = (
    df["trackable_name"].str.lower()
    .str.strip()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
)

# Parse date
df["checkin_date"] = pd.to_datetime(df["checkin_date"], errors="coerce")
df = df.dropna(subset=["checkin_date"])  # keep only rows with a date

# Basic user-level fields
# Age cleaning: set implausible ages to NaN, then impute (per-user median, then global median)
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df.loc[(df["age"] < 0) | (df["age"] > 110), "age"] = np.nan  # ages outside 0-110 are treated as missing
df["age"] = df.groupby("user_id")["age"].transform(lambda s: s.fillna(s.median()))
df["age"] = df["age"].fillna(df["age"].median())

# Sex cleaning: normalize categories
def norm_sex(x):
    x = str(x).strip().lower()
    if x in {"male", "m"}: return "male"
    if x in {"female", "f"}: return "female"
    if x in {"nan", "none", "", "unknown"}: return "unknown"
    return "other"
df["sex"] = df["sex"].apply(norm_sex)

# Country cleaning
def norm_country(x):
    x = str(x).strip()
    return "unknown" if (x == "" or x.lower() == "nan") else x
df["country"] = df["country"].apply(norm_country)

# ========= 2) TEXT NORMALIZATION (for matching) =========
# Create a clean text column for matching on trackable_name
df["name_clean"] = (
    df["trackable_name"]
    .fillna("")
    .astype(str)
    .str.lower()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# ========= 3) HELPERS TO FLAG KEYWORDS =========
def add_keyword_flags(sub_df, groups_dict, prefix):
    """
    For a subset of df (e.g., only Symptoms, only Food...), add binary columns
    indicating whether trackable_name matches any keyword group for that row.
    We then aggregate to daily features later.
    """
    out = sub_df.copy()
    for group, kws in groups_dict.items():
        # Build one regex alternation per group; re.escape makes each keyword literal,
        # and the non-capturing group avoids pandas' match-group warning in str.contains
        pattern = r"(?:" + "|".join([re.escape(k.lower()) for k in kws]) + r")"
        col = f"{prefix}__{group}"
        out[col] = out["name_clean"].str.contains(pattern, regex=True).astype(int)
    return out

# ========= 4) PER-TYPE KEYWORD FLAGS =========
# NOTE: You already have: condition_keyword_groups (conditions incl. "epilepsy_seizure"),
#       symptom_keyword_groups, food_keyword_groups, tag_keyword_groups, treatment_keyword_groups

# Conditions (Condition rows only)
cond_rows = df[df["trackable_type"] == "Condition"]
cond_rows = add_keyword_flags(cond_rows, condition_keyword_groups, "cond")

# Symptoms
sym_rows = df[df["trackable_type"] == "Symptom"]
sym_rows = add_keyword_flags(sym_rows, symptom_keyword_groups, "sym")

# Food
food_rows = df[df["trackable_type"] == "Food"]
food_rows = add_keyword_flags(food_rows, food_keyword_groups, "food")

# Tags (triggers)
tag_rows = df[df["trackable_type"] == "Tag"]
tag_rows = add_keyword_flags(tag_rows, tag_keyword_groups, "tag")

# Treatments
trt_rows = df[df["trackable_type"] == "Treatment"]
trt_rows = add_keyword_flags(trt_rows, treatment_keyword_groups, "trt")

# Weather (already numeric columns; keep as-is if present)
weather_rows = df[df["trackable_type"] == "Weather"].copy()
# Example expected numeric weather columns (adjust to your schema if needed)
for col in ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]:
    if col in weather_rows.columns:
        weather_rows[col] = pd.to_numeric(weather_rows[col], errors="coerce")

# ========= 5) DAILY AGGREGATION (no leakage) =========
# We build daily features per user, then later roll 7/30d windows that use ONLY past data.

def daily_agg_flags(sub, prefix):
    # get only the flag columns
    flag_cols = [c for c in sub.columns if c.startswith(prefix + "__")]
    if not flag_cols:
        return pd.DataFrame(columns=["user_id", "checkin_date"])
    # include severity if available (0-4); we’ll take max per day
    if "trackable_value" in sub.columns:
        sub["severity_val"] = pd.to_numeric(sub["trackable_value"], errors="coerce")
    else:
        sub["severity_val"] = np.nan

    agg = (
        sub.groupby(["user_id", "checkin_date"])
           .agg({**{c: "max" for c in flag_cols}, "severity_val": "max"})
           .reset_index()
    )
    # rename severity
    if "severity_val" in agg.columns:
        agg = agg.rename(columns={"severity_val": f"{prefix}__max_severity"})
    return agg

daily_cond = daily_agg_flags(cond_rows,  "cond")
daily_sym  = daily_agg_flags(sym_rows,   "sym")
daily_food = daily_agg_flags(food_rows,  "food")
daily_tag  = daily_agg_flags(tag_rows,   "tag")
daily_trt  = daily_agg_flags(trt_rows,   "trt")

# Daily weather (mean if multiple in same day)
if not weather_rows.empty:
    wcols = ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]
    wcols = [c for c in wcols if c in weather_rows.columns]
    daily_wx = (
        weather_rows.groupby(["user_id", "checkin_date"])[wcols].mean().reset_index()
    )
else:
    daily_wx = pd.DataFrame(columns=["user_id", "checkin_date"])


# ========= Combine into one daily table =========
# Start from all user-day keys present in any table
parts = [daily_cond, daily_sym, daily_food, daily_tag, daily_trt, daily_wx]
daily = None
for p in parts:
    if p is None or p.empty: 
        continue
    daily = p if daily is None else pd.merge(daily, p, on=["user_id", "checkin_date"], how="outer")

if daily is None:
    raise ValueError("No daily features built. Check your inputs.")

daily = daily.sort_values(["user_id", "checkin_date"]).reset_index(drop=True)
daily = daily.fillna(0)  # for flags; numeric weather stays 0 if missing (fine for tree models)


# ========= 6) ROLLING (PAST-ONLY) FEATURES =========
# For each user, compute 7d/30d rolling sums of flags + rolling means of severities & weather.
def add_rollups(g):
    g = g.set_index("checkin_date").sort_index()
    # rolling windows (closed='left' to use ONLY past)
    win_defs = {"7d":"7D", "30d":"30D"}
    for col in list(g.columns):  # snapshot, because new columns are added inside the loop
        if col.startswith(("cond__", "sym__", "food__", "tag__", "trt__")) and not col.endswith("__max_severity"):
            for k, win in win_defs.items():
                g[f"{col}__sum_{k}"] = g[col].rolling(win, closed="left").sum()
        # severities & weather: rolling mean
        if col.endswith("__max_severity") or col in ["temperature_min","temperature_max","precip_intensity","pressure","humidity"]:
            for k, win in win_defs.items():
                g[f"{col}__mean_{k}"] = g[col].rolling(win, closed="left").mean()
    return g.reset_index()

daily = daily.groupby("user_id", group_keys=False).apply(add_rollups)
# Fill remaining NaNs from leading window edges
daily = daily.fillna(0)

# ========= 7) BUILD THE TARGET: WHO WILL EVER DEVELOP THE CONDITION =========
# (e.g., Epilepsy, POTS, Depression, etc.)

# Identify all condition-specific flag columns
target_condition = "inflammatory_arthritis"  # <-- CHANGE THIS PER MODEL
target_cols = [c for c in daily.columns if c.startswith(f"cond__{target_condition}")]
if not target_cols:
    raise ValueError(f"No columns found for {target_condition}. Check your keyword mapping!")

# Determine which users ever reported the condition
target_users = daily.loc[daily[target_cols].max(axis=1) > 0, "user_id"].unique()

# Label every record from those users as 1 (ever developed condition), others as 0
label_col = f"label_future_{target_condition}"
daily[label_col] = daily["user_id"].isin(target_users).astype(int)

print(f"✅ {label_col}: {daily[label_col].mean():.3%} positive rate")

# ========= 8) ADD DEMOGRAPHICS (static) =========
# Build a static per-user table (age/sex/country) from each user's first row, then merge with daily
demo = df.drop_duplicates("user_id")[["user_id","age","sex","country"]].copy()
daily = daily.merge(demo, on="user_id", how="left")

# One-hot encode sex & country (country can be many; consider top-K and bucket rest as 'other')
# Keep top 20 countries to control dimensionality
top_countries = df["country"].value_counts().head(20).index
daily["country_top"] = daily["country"].where(daily["country"].isin(top_countries), "other")

X_cat = pd.get_dummies(daily[["sex","country_top"]], drop_first=False, dtype=int)
daily = pd.concat([daily.drop(columns=["sex","country","country_top"]), X_cat], axis=1)

# ========= 9) TRAIN / TEST TEMPORAL SPLIT =========
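# Choose a cutoff date (here the 80th percentile of check-in dates). Adjust to your range.
# NOTE: `train`/`test` are not used further down; X_train/X_test come from make_train_test_balanced(df=daily, ...).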
cutoff = daily["checkin_date"].quantile(0.8)  # 80% oldest for train, 20% most recent for test
train = daily[daily["checkin_date"] <= cutoff].copy()
test  = daily[daily["checkin_date"] >  cutoff].copy()

# Features: use all engineered columns except identifiers and leakage columns
label_col = f"label_future_{target_condition}"

leaky_cols = target_cols + [
    "cond__max_severity__mean_30d",
    "cond__max_severity__mean_7d", 
    "cond__max_severity"
]
drop_cols = {
    "user_id", "checkin_date", label_col
}.union(leaky_cols)

feature_cols = [c for c in daily.columns if c not in drop_cols]

X_train, X_test, y_train, y_test, feature_cols = make_train_test_balanced(
    df=daily,
    label_col=label_col,
    drop_cols=drop_cols,
    pos_to_neg_ratio=4,   # 4:1 neg:pos is usually good
)

# Optional: scale numeric continuous columns (LightGBM doesn’t require it)

# ========= 10) TRAIN LIGHTGBM =========
clf = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.03,
    num_leaves=64,
    subsample=0.8,
    colsample_bytree=0.8,
    class_weight=None,
    random_state=42
)
clf.fit(X_train, y_train)

# ========= 11) EVALUATE =========
p_test = clf.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, p_test)
auprc = average_precision_score(y_test, p_test)

print(f"AUC  : {auc:.3f}")
print(f"AUPRC: {auprc:.3f}")

# Optional: show precision/recall at a few thresholds
prec, rec, thr = precision_recall_curve(y_test, p_test)
for t in [0.05, 0.10, 0.20]:
    # nearest threshold
    idx = (np.abs(thr - t)).argmin() if len(thr) else -1
    if idx >= 0 and idx < len(prec):
        print(f"Threshold~{t:0.2f}: Precision={prec[idx]:.3f}  Recall={rec[idx]:.3f}")

# Evaluate and store results
model_name = "IA_Future_LGBM"  # or "Epilepsy_30d", "RA_NextDay", etc.
evaluate_and_store_results(model_name, y_test, p_test, feature_cols, clf, results_path="model_results.csv")

# ========= 12) FEATURE IMPORTANCE =========
imp = pd.Series(clf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("\nTop 25 features:\n", imp.head(25))


✅ label_future_inflammatory_arthritis: 47.534% positive rate
✅ Train/Test split complete: 225219 train (108018 pos, 117201 neg), 56263 test. Pos rate train=0.480, test=0.458
[LightGBM] [Info] Number of positive: 108018, number of negative: 117201
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.029308 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2635
[LightGBM] [Info] Number of data points in the train set: 225219, number of used features: 210
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.479613 -> initscore=-0.081593
[LightGBM] [Info] Start training from score -0.081593
AUC  : 0.887
AUPRC: 0.867
Threshold~0.05: Precision=0.542  Recall=0.987
Threshold~0.10: Precision=0.569  Recall=0.975
Threshold~0.20: Precision=0.659  Recall=0.922
✅ Results saved to model_results.csv

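The rolling features in the cell above rely on `closed="left"` so that each row only sees earlier days. A minimal sketch on toy data, assuming a hypothetical helper `past_only_rolling_sum` (not part of the notebook), shows which rows such a window picks up:

```python
import pandas as pd

def past_only_rolling_sum(series: pd.Series, window: str = "7D") -> pd.Series:
    """Sum over the time window [t - window, t), i.e. excluding the row at t itself."""
    return series.sort_index().rolling(window, closed="left").sum()

# Toy daily flag for one user; the last check-in comes after a gap
dates = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-10"])
flag = pd.Series([1, 1, 0, 1], index=dates)

rolled = past_only_rolling_sum(flag, "7D")
print(rolled)
```

The first row has no earlier data, so its value is NaN, which the notebook then fills with 0. The last row only reaches back to 2020-01-03, so the two earlier flags no longer count.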

### Predicting Next Day Inflammatory Arthritis Flares (Worsening)

In [23]:
# ========= 1) LOAD & CLEAN =========
# Assumes your full long-format table is in df with the columns you listed
path = "/Users/cristybanuelos/Downloads/Chronic_Illness_Dataset.csv"
df = pd.read_csv(path)

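# NOTE: this keeps Condition rows only, so the Symptom/Food/Tag/Treatment/Weather
# subsets selected in step 4 come out empty; drop this filter to build those features.
df = df[df["trackable_type"] == "Condition"].copy()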
df["condition_clean"] = (
    df["trackable_name"].str.lower()
    .str.strip()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
)

# Parse date
df["checkin_date"] = pd.to_datetime(df["checkin_date"], errors="coerce")
df = df.dropna(subset=["checkin_date"])  # keep only rows with a date

# Basic user-level fields
# Age cleaning: set out-of-range values to NaN, then impute
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df.loc[(df["age"] < 0) | (df["age"] > 110), "age"] = np.nan  # blank out implausible ages (kept range: 0-110)
df["age"] = df.groupby("user_id")["age"].transform(lambda s: s.fillna(s.median()))
df["age"] = df["age"].fillna(df["age"].median())

# Sex cleaning: normalize categories
def norm_sex(x):
    x = str(x).strip().lower()
    if x in {"male", "m"}: return "male"
    if x in {"female", "f"}: return "female"
    if x in {"nan", "none", "", "unknown"}: return "unknown"
    return "other"
df["sex"] = df["sex"].apply(norm_sex)

# Country cleaning
def norm_country(x):
    x = str(x).strip()
    return "unknown" if (x == "" or x.lower() == "nan") else x
df["country"] = df["country"].apply(norm_country)

# ========= 2) TEXT NORMALIZATION (for matching) =========
# Create a clean text column for matching on trackable_name
df["name_clean"] = (
    df["trackable_name"]
    .fillna("")
    .astype(str)
    .str.lower()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# ========= 3) HELPERS TO FLAG KEYWORDS =========
def add_keyword_flags(sub_df, groups_dict, prefix):
    """
    For a subset of df (e.g., only Symptoms, only Food...), add binary columns
    indicating whether trackable_name matches any keyword group for that row.
    We then aggregate to daily features later.
    """
    out = sub_df.copy()
    for group, kws in groups_dict.items():
        # Prebuild a regex OR pattern for speed; keywords are escaped, and the group is
        # non-capturing so str.contains does not warn about match groups
        pattern = r"(?:" + "|".join(re.escape(k.lower()) for k in kws) + r")"
        col = f"{prefix}__{group}"
        out[col] = out["name_clean"].str.contains(pattern, regex=True).astype(int)
    return out

# ========= 4) PER-TYPE KEYWORD FLAGS =========
# NOTE: Assumes these dicts already exist: condition_keyword_groups (conditions incl. "epilepsy_seizure"),
#       symptom_keyword_groups, food_keyword_groups, tag_keyword_groups, treatment_keyword_groups

# Conditions (Condition rows only)
cond_rows = df[df["trackable_type"] == "Condition"]
cond_rows = add_keyword_flags(cond_rows, condition_keyword_groups, "cond")

# Symptoms
sym_rows = df[df["trackable_type"] == "Symptom"]
sym_rows = add_keyword_flags(sym_rows, symptom_keyword_groups, "sym")

# Food
food_rows = df[df["trackable_type"] == "Food"]
food_rows = add_keyword_flags(food_rows, food_keyword_groups, "food")

# Tags (triggers)
tag_rows = df[df["trackable_type"] == "Tag"]
tag_rows = add_keyword_flags(tag_rows, tag_keyword_groups, "tag")

# Treatments
trt_rows = df[df["trackable_type"] == "Treatment"]
trt_rows = add_keyword_flags(trt_rows, treatment_keyword_groups, "trt")

# Weather (already numeric columns; keep as-is if present)
weather_rows = df[df["trackable_type"] == "Weather"].copy()
# Example expected numeric weather columns (adjust to your schema if needed)
for col in ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]:
    if col in weather_rows.columns:
        weather_rows[col] = pd.to_numeric(weather_rows[col], errors="coerce")

# ========= 5) DAILY AGGREGATION (no leakage) =========
# We build daily features per user, then later roll 7/30d windows that use ONLY past data.

def daily_agg_flags(sub, prefix):
    # get only the flag columns
    flag_cols = [c for c in sub.columns if c.startswith(prefix + "__")]
    if not flag_cols:
        return pd.DataFrame(columns=["user_id", "checkin_date"])
    # include severity if available (0-4); we’ll take max per day
    if "trackable_value" in sub.columns:
        sub["severity_val"] = pd.to_numeric(sub["trackable_value"], errors="coerce")
    else:
        sub["severity_val"] = np.nan

    agg = (
        sub.groupby(["user_id", "checkin_date"])
           .agg({**{c: "max" for c in flag_cols}, "severity_val": "max"})
           .reset_index()
    )
    # rename severity
    if "severity_val" in agg.columns:
        agg = agg.rename(columns={"severity_val": f"{prefix}__max_severity"})
    return agg

daily_cond = daily_agg_flags(cond_rows,  "cond")
daily_sym  = daily_agg_flags(sym_rows,   "sym")
daily_food = daily_agg_flags(food_rows,  "food")
daily_tag  = daily_agg_flags(tag_rows,   "tag")
daily_trt  = daily_agg_flags(trt_rows,   "trt")

# Daily weather (mean if multiple in same day)
if not weather_rows.empty:
    wcols = ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]
    wcols = [c for c in wcols if c in weather_rows.columns]
    daily_wx = (
        weather_rows.groupby(["user_id", "checkin_date"])[wcols].mean().reset_index()
    )
else:
    daily_wx = pd.DataFrame(columns=["user_id", "checkin_date"])

# ========= Combine into one daily table =========
# Start from all user-day keys present in any table
parts = [daily_cond, daily_sym, daily_food, daily_tag, daily_trt, daily_wx]
daily = None
for p in parts:
    if p is None or p.empty: 
        continue
    daily = p if daily is None else pd.merge(daily, p, on=["user_id", "checkin_date"], how="outer")

if daily is None:
    raise ValueError("No daily features built. Check your inputs.")

daily = daily.sort_values(["user_id", "checkin_date"]).reset_index(drop=True)
daily = daily.fillna(0)  # flags: absent -> 0; caution: missing weather also becomes 0, which looks like a real reading (LightGBM can take NaN directly)

# ========= 6) ROLLING (PAST-ONLY) FEATURES =========
# For each user, compute 7d/30d rolling sums of flags + rolling means of severities & weather.
def add_rollups(g):
    g = g.set_index("checkin_date").sort_index()
    # rolling windows (closed='left' to use ONLY past)
    win_defs = {"7d":"7D", "30d":"30D"}
    for col in g.columns:
        if col.startswith(("cond__", "sym__", "food__", "tag__", "trt__")) and not col.endswith("__max_severity"):
            for k, win in win_defs.items():
                g[f"{col}__sum_{k}"] = g[col].rolling(win, closed="left").sum()
        # severities & weather: rolling mean
        if col.endswith("__max_severity") or col in ["temperature_min","temperature_max","precip_intensity","pressure","humidity"]:
            for k, win in win_defs.items():
                g[f"{col}__mean_{k}"] = g[col].rolling(win, closed="left").mean()
    return g.reset_index()

daily = daily.groupby("user_id", group_keys=False).apply(add_rollups)
# Fill remaining NaNs from leading window edges
daily = daily.fillna(0)

# ========= 7) BUILD THE TARGET: NEXT-DAY SYMPTOM WORSENING (FLARE) =========
import re

# 1️⃣ Identify relevant symptom columns
sym_cols = [c for c in daily.columns if re.search(r"(joint|pain|swelling|stiffness)", c, re.I)]
if not sym_cols:
    raise ValueError("No joint/pain/stiffness symptom columns found. Check your keyword mapping.")

print(f"✅ Found {len(sym_cols)} symptom columns used for flare labeling.")

# 2️⃣ Compute per-user daily mean across these symptoms
daily["symptom_pain_mean"] = daily[sym_cols].mean(axis=1)

# 3️⃣ Compute day-to-day change
daily["symptom_delta"] = daily.groupby("user_id")["symptom_pain_mean"].diff().fillna(0)

# 4️⃣ Define threshold more leniently (for binary data)
flare_threshold = 0.1  # or 0.05 if still too strict

daily["label_nextday_flare"] = (
    daily.groupby("user_id")["symptom_delta"].shift(-1).fillna(0) > flare_threshold
).astype(int)

# 5️⃣ Require at least two symptom days per user to detect deltas
valid_users = daily.groupby("user_id")["symptom_pain_mean"].transform("count") > 1
daily = daily[valid_users]

# 6️⃣ Optional: require some history
daily["has_7d_history"] = daily.groupby("user_id")["checkin_date"].transform(
    lambda s: (s - s.min()) >= pd.Timedelta(days=7)
)
daily = daily[daily["has_7d_history"]]

print("✅ Flare labeling complete. Positive rate:",
      daily["label_nextday_flare"].mean().round(3))

print(daily["label_nextday_flare"].value_counts(normalize=True))


# ========= 8) ADD DEMOGRAPHICS (static) =========
demo = df.drop_duplicates("user_id")[["user_id","age","sex","country"]].copy()
daily = daily.merge(demo, on="user_id", how="left")

# One-hot encode sex & country (country can be many; consider top-K and bucket rest as 'other')
top_countries = df["country"].value_counts().head(20).index
daily["country_top"] = daily["country"].where(daily["country"].isin(top_countries), "other")

X_cat = pd.get_dummies(daily[["sex","country_top"]], drop_first=False, dtype=int)
daily = pd.concat([daily.drop(columns=["sex","country","country_top"]), X_cat], axis=1)

# ========= FILTER TO AUTOIMMUNE ARTHRITIS USERS =========
import re

# 1️⃣ Identify inflammatory / autoimmune arthritis condition columns
cond_flag_cols = [c for c in daily.columns if c.startswith("cond__")]
arth_pattern = re.compile(
    r"(inflammatory[_\-\s]*arthritis|rheumatoid[_\-\s]*arthritis|psoriatic[_\-\s]*arthritis|ankylosing[_\-\s]*spondylitis)",
    re.I,
)
arth_flag_cols = [c for c in cond_flag_cols if arth_pattern.search(c.replace("cond__", ""))]

if not arth_flag_cols:
    raise ValueError("No autoimmune arthritis condition columns found in your dataset.")

# 2️⃣ Find users who have ever logged any of these conditions
arth_users = daily.loc[daily[arth_flag_cols].sum(axis=1) > 0, "user_id"].unique()
print(f"✅ Autoimmune arthritis users identified: {len(arth_users)}")

# 3️⃣ Restrict dataset to only those users
daily = daily[daily["user_id"].isin(arth_users)].copy()

print("✅ Dataset filtered to autoimmune arthritis users only.")
print("Remaining records:", len(daily))
print("Flare positive rate:", daily["label_nextday_flare"].mean().round(3))

# ========= 9) TRAIN / TEST TEMPORAL SPLIT =========
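# NOTE: `train`/`test` are not used further down; X_train/X_test come from make_train_test_balanced(df=daily, ...).
cutoff = daily["checkin_date"].quantile(0.8)  # 80% oldest for train, 20% most recent for test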
train = daily[daily["checkin_date"] <= cutoff].copy()
test  = daily[daily["checkin_date"] >  cutoff].copy()

# Features: exclude identifiers & leakage columns
label_col = "label_nextday_flare"

leaky_cols = [c for c in daily.columns if "symptom" in c and "pain_mean" in c]

drop_cols = {"user_id", "checkin_date", "label_nextday_flare"}.union(leaky_cols)

feature_cols = [c for c in daily.columns if c not in drop_cols]

X_train, X_test, y_train, y_test, feature_cols = make_train_test_balanced(
    df=daily,
    label_col=label_col,
    drop_cols=drop_cols,
    pos_to_neg_ratio=4,   # 4:1 neg:pos is usually good
)

# ========= 10) TRAIN LIGHTGBM (same settings, but class_weight="balanced") =========
clf = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.03,
    num_leaves=64,
    subsample=0.8,
    colsample_bytree=0.8,
    class_weight="balanced",
    random_state=42
)
clf.fit(X_train, y_train)

# ========= 11) EVALUATE =========
p_test = clf.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, p_test)
auprc = average_precision_score(y_test, p_test)

print(f"AUC  : {auc:.3f}")
print(f"AUPRC: {auprc:.3f}")

from sklearn.metrics import precision_recall_curve
prec, rec, thr = precision_recall_curve(y_test, p_test)
for t in [0.05, 0.10, 0.20]:
    idx = (np.abs(thr - t)).argmin() if len(thr) else -1
    if idx >= 0 and idx < len(prec):
        print(f"Threshold~{t:0.2f}: Precision={prec[idx]:.3f}  Recall={rec[idx]:.3f}")

# Evaluate and store results
model_name = "RA_NextDay_LGBM"  # or "Epilepsy_30d", "RA_NextDay", etc.
evaluate_and_store_results(model_name, y_test, p_test, feature_cols, clf, results_path="model_results.csv")


# ========= 12) FEATURE IMPORTANCE =========
imp = pd.Series(clf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("\nTop 25 features:\n", imp.head(25))


✅ Found 15 symptom columns used for flare labeling.
✅ Flare labeling complete. Positive rate: 0.078
label_nextday_flare
0    0.921578
1    0.078422
Name: proportion, dtype: float64
✅ Autoimmune arthritis users identified: 5332
✅ Dataset filtered to autoimmune arthritis users only.
Remaining records: 99881
Flare positive rate: 0.098
✅ Train/Test split complete: 38855 train (7771 pos, 31084 neg), 19914 test. Pos rate train=0.200, test=0.102
[LightGBM] [Info] Number of positive: 7771, number of negative: 31084
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.014056 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2999
[LightGBM] [Info] Number of data points in the train set: 38855, number of used features: 215
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
AUC  : 0.951
AUPRC: 0.619
Threshold~0.05: Precision=0.
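
The flare label in this section is built from a per-user day-to-day change that is then shifted back by one row. A minimal sketch on toy data, assuming a hypothetical helper `next_day_flare_label` and a single `symptom_mean` column, walks through that logic. Note that `shift(-1)` points at the user's next logged row, which is the next calendar day only when check-ins are daily.

```python
import pandas as pd

def next_day_flare_label(df: pd.DataFrame, value_col: str = "symptom_mean",
                         threshold: float = 0.1) -> pd.DataFrame:
    """Label a row 1 when the user's NEXT row rises by more than `threshold`."""
    df = df.sort_values(["user_id", "checkin_date"]).copy()
    # Change versus the user's previous row (first row per user has no change)
    delta = df.groupby("user_id")[value_col].diff().fillna(0)
    # Pull the next row's change back onto the current row, per user
    next_delta = delta.groupby(df["user_id"]).shift(-1).fillna(0)
    df["label_nextday_flare"] = (next_delta > threshold).astype(int)
    return df

toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "checkin_date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-02"]
    ),
    "symptom_mean": [0.0, 0.5, 0.5, 1.0, 0.0],
})

labeled = next_day_flare_label(toy)
print(labeled[["user_id", "checkin_date", "label_nextday_flare"]])
```

Only user 1's first row is labeled positive here, because the rise from 0.0 to 0.5 happens on that user's following row. Each user's last row is always 0, since there is no later row to compare against.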

### Predicting Flares in Epilepsy

In [25]:
# ========= 1) LOAD & CLEAN =========
# Assumes your full long-format table is in df with the columns you listed
path = "/Users/cristybanuelos/Downloads/Chronic_Illness_Dataset.csv"
df = pd.read_csv(path)

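# NOTE: this keeps Condition rows only, so the Symptom/Food/Tag/Treatment/Weather
# subsets selected in step 4 come out empty; drop this filter to build those features.
df = df[df["trackable_type"] == "Condition"].copy()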
df["condition_clean"] = (
    df["trackable_name"].str.lower()
    .str.strip()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
)

# Parse date
df["checkin_date"] = pd.to_datetime(df["checkin_date"], errors="coerce")
df = df.dropna(subset=["checkin_date"])  # keep only rows with a date

# Basic user-level fields
# Age cleaning: set out-of-range values to NaN, then impute
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df.loc[(df["age"] < 0) | (df["age"] > 110), "age"] = np.nan  # blank out implausible ages (kept range: 0-110)
df["age"] = df.groupby("user_id")["age"].transform(lambda s: s.fillna(s.median()))
df["age"] = df["age"].fillna(df["age"].median())

# Sex cleaning: normalize categories
def norm_sex(x):
    x = str(x).strip().lower()
    if x in {"male", "m"}: return "male"
    if x in {"female", "f"}: return "female"
    if x in {"nan", "none", "", "unknown"}: return "unknown"
    return "other"
df["sex"] = df["sex"].apply(norm_sex)

# Country cleaning
def norm_country(x):
    x = str(x).strip()
    return "unknown" if (x == "" or x.lower() == "nan") else x
df["country"] = df["country"].apply(norm_country)

# ========= 2) TEXT NORMALIZATION (for matching) =========
# Create a clean text column for matching on trackable_name
df["name_clean"] = (
    df["trackable_name"]
    .fillna("")
    .astype(str)
    .str.lower()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# ========= 3) HELPERS TO FLAG KEYWORDS =========
def add_keyword_flags(sub_df, groups_dict, prefix):
    """
    For a subset of df (e.g., only Symptoms, only Food...), add binary columns
    indicating whether trackable_name matches any keyword group for that row.
    We then aggregate to daily features later.
    """
    out = sub_df.copy()
    for group, kws in groups_dict.items():
        # Prebuild a regex OR pattern for speed; keywords are escaped, and the group is
        # non-capturing so str.contains does not warn about match groups
        pattern = r"(?:" + "|".join(re.escape(k.lower()) for k in kws) + r")"
        col = f"{prefix}__{group}"
        out[col] = out["name_clean"].str.contains(pattern, regex=True).astype(int)
    return out

# ========= 4) PER-TYPE KEYWORD FLAGS =========
# NOTE: Assumes these dicts already exist: condition_keyword_groups (conditions incl. "epilepsy_seizure"),
#       symptom_keyword_groups, food_keyword_groups, tag_keyword_groups, treatment_keyword_groups

# Conditions (Condition rows only)
cond_rows = df[df["trackable_type"] == "Condition"]
cond_rows = add_keyword_flags(cond_rows, condition_keyword_groups, "cond")

# Symptoms
sym_rows = df[df["trackable_type"] == "Symptom"]
sym_rows = add_keyword_flags(sym_rows, symptom_keyword_groups, "sym")

# Food
food_rows = df[df["trackable_type"] == "Food"]
food_rows = add_keyword_flags(food_rows, food_keyword_groups, "food")

# Tags (triggers)
tag_rows = df[df["trackable_type"] == "Tag"]
tag_rows = add_keyword_flags(tag_rows, tag_keyword_groups, "tag")

# Treatments
trt_rows = df[df["trackable_type"] == "Treatment"]
trt_rows = add_keyword_flags(trt_rows, treatment_keyword_groups, "trt")

# Weather (already numeric columns; keep as-is if present)
weather_rows = df[df["trackable_type"] == "Weather"].copy()
# Example expected numeric weather columns (adjust to your schema if needed)
for col in ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]:
    if col in weather_rows.columns:
        weather_rows[col] = pd.to_numeric(weather_rows[col], errors="coerce")

# ========= 5) DAILY AGGREGATION (no leakage) =========
# We build daily features per user, then later roll 7/30d windows that use ONLY past data.

def daily_agg_flags(sub, prefix):
    # get only the flag columns
    flag_cols = [c for c in sub.columns if c.startswith(prefix + "__")]
    if not flag_cols:
        return pd.DataFrame(columns=["user_id", "checkin_date"])
    # include severity if available (0-4); we’ll take max per day
    if "trackable_value" in sub.columns:
        sub["severity_val"] = pd.to_numeric(sub["trackable_value"], errors="coerce")
    else:
        sub["severity_val"] = np.nan

    agg = (
        sub.groupby(["user_id", "checkin_date"])
           .agg({**{c: "max" for c in flag_cols}, "severity_val": "max"})
           .reset_index()
    )
    # rename severity
    if "severity_val" in agg.columns:
        agg = agg.rename(columns={"severity_val": f"{prefix}__max_severity"})
    return agg

daily_cond = daily_agg_flags(cond_rows,  "cond")
daily_sym  = daily_agg_flags(sym_rows,   "sym")
daily_food = daily_agg_flags(food_rows,  "food")
daily_tag  = daily_agg_flags(tag_rows,   "tag")
daily_trt  = daily_agg_flags(trt_rows,   "trt")

# Daily weather (mean if multiple in same day)
if not weather_rows.empty:
    wcols = ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]
    wcols = [c for c in wcols if c in weather_rows.columns]
    daily_wx = (
        weather_rows.groupby(["user_id", "checkin_date"])[wcols].mean().reset_index()
    )
else:
    daily_wx = pd.DataFrame(columns=["user_id", "checkin_date"])

# ========= Combine into one daily table =========
# Start from all user-day keys present in any table
parts = [daily_cond, daily_sym, daily_food, daily_tag, daily_trt, daily_wx]
daily = None
for p in parts:
    if p is None or p.empty: 
        continue
    daily = p if daily is None else pd.merge(daily, p, on=["user_id", "checkin_date"], how="outer")

if daily is None:
    raise ValueError("No daily features built. Check your inputs.")

daily = daily.sort_values(["user_id", "checkin_date"]).reset_index(drop=True)
daily = daily.fillna(0)  # flags: absent -> 0; caution: missing weather also becomes 0, which looks like a real reading (LightGBM can take NaN directly)

# ========= 6) ROLLING (PAST-ONLY) FEATURES =========
# For each user, compute 7d/30d rolling sums of flags + rolling means of severities & weather.
def add_rollups(g):
    g = g.set_index("checkin_date").sort_index()
    # rolling windows (closed='left' to use ONLY past)
    win_defs = {"7d":"7D", "30d":"30D"}
    for col in g.columns:
        if col.startswith(("cond__", "sym__", "food__", "tag__", "trt__")) and not col.endswith("__max_severity"):
            for k, win in win_defs.items():
                g[f"{col}__sum_{k}"] = g[col].rolling(win, closed="left").sum()
        # severities & weather: rolling mean
        if col.endswith("__max_severity") or col in ["temperature_min","temperature_max","precip_intensity","pressure","humidity"]:
            for k, win in win_defs.items():
                g[f"{col}__mean_{k}"] = g[col].rolling(win, closed="left").mean()
    return g.reset_index()

daily = daily.groupby("user_id", group_keys=False).apply(add_rollups)
# Fill remaining NaNs from leading window edges
daily = daily.fillna(0)

# ========= 7) BUILD THE TARGET: NEXT-DAY FLARE (cluster-based) =========
target_condition = "epilepsy_seizure"   # or "pots_dysautonomia" / "anxiety" / "depression"

# choose cluster sets
flare_clusters = {
    "pots_dysautonomia": ["fatigue_exhaustion", "cardiovascular_symptoms",
             "lightheaded_dizziness", "sleep_symptoms", "headache_migraine"],
    "epilepsy_seizure": ["neurologic_other", "sleep_symptoms",
                 "fatigue_exhaustion", "cognitive_symptoms"],
    "anxiety": ["anxiety_fear_panic", "stress_tension",
                "sleep_symptoms", "cardiovascular_symptoms"],
    "depression": ["negative_affect", "fatigue_exhaustion",
                   "sleep_symptoms", "cognitive_symptoms"]
}[target_condition]

sym_cols = [c for c in daily.columns
            if c.startswith("sym__") and any(k in c for k in flare_clusters)]
print(f"✅ {len(sym_cols)} symptom cluster columns for {target_condition}")

daily["symptom_mean"] = daily[sym_cols].mean(axis=1)
daily["symptom_delta"] = daily.groupby("user_id")["symptom_mean"].diff().fillna(0)
flare_threshold = 0.1
daily["label_nextday_flare"] = (
    daily.groupby("user_id")["symptom_delta"].shift(-1).fillna(0) > flare_threshold
).astype(int)


# ========= 8) ADD DEMOGRAPHICS (static) =========
demo = df.drop_duplicates("user_id")[["user_id","age","sex","country"]].copy()
daily = daily.merge(demo, on="user_id", how="left")

# One-hot encode sex & country (country can be many; consider top-K and bucket rest as 'other')
top_countries = df["country"].value_counts().head(20).index
daily["country_top"] = daily["country"].where(daily["country"].isin(top_countries), "other")

X_cat = pd.get_dummies(daily[["sex","country_top"]], drop_first=False, dtype=int)
daily = pd.concat([daily.drop(columns=["sex","country","country_top"]), X_cat], axis=1)

# ========= FILTER TO EPILEPSY USERS =========
import re

# 1️⃣ Identify condition columns related to epilepsy / seizure disorders
cond_flag_cols = [c for c in daily.columns if c.startswith("cond__")]

# Pattern matches common epilepsy and seizure cluster names
epilepsy_pattern = re.compile(
    r"(epilepsy|seizure|convulsion|temporal[_\-\s]*lobe|tonic[_\-\s]*clonic|absence[_\-\s]*seizure)",
    re.I,
)

epilepsy_flag_cols = [
    c for c in cond_flag_cols if epilepsy_pattern.search(c.replace("cond__", ""))
]

if not epilepsy_flag_cols:
    raise ValueError("No epilepsy condition columns found in your dataset.")

# 2️⃣ Find users who have ever logged any epilepsy-related conditions
epilepsy_users = daily.loc[
    daily[epilepsy_flag_cols].sum(axis=1) > 0, "user_id"
].unique()
print(f"✅ Epilepsy users identified: {len(epilepsy_users)}")

# 3️⃣ Restrict dataset to only those users
daily = daily[daily["user_id"].isin(epilepsy_users)].copy()

print("✅ Dataset filtered to epilepsy users only.")
print("Remaining records:", len(daily))
print("Flare positive rate:", daily['label_nextday_flare'].mean().round(3))


# ========= 9) TRAIN / TEST TEMPORAL SPLIT =========
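# NOTE: `train`/`test` are not used further down; X_train/X_test come from make_train_test_balanced(df=daily, ...).
cutoff = daily["checkin_date"].quantile(0.8)  # 80% oldest for train, 20% most recent for test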
train = daily[daily["checkin_date"] <= cutoff].copy()
test  = daily[daily["checkin_date"] >  cutoff].copy()

# Features: exclude identifiers & leakage columns
label_col = "label_nextday_flare"

# Drop the helper column used to build the label (this cell names it symptom_mean, not symptom_pain_mean)
leaky_cols = [c for c in daily.columns if c == "symptom_mean"]

drop_cols = {"user_id", "checkin_date", "label_nextday_flare"}.union(leaky_cols)

feature_cols = [c for c in daily.columns if c not in drop_cols]

X_train, X_test, y_train, y_test, feature_cols = make_train_test_balanced(
    df=daily,
    label_col=label_col,
    drop_cols=drop_cols,
    pos_to_neg_ratio=4,   # 4:1 neg:pos is usually good
)

# ========= 10) TRAIN LIGHTGBM (unchanged) =========
clf = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.03,
    num_leaves=64,
    subsample=0.8,
    colsample_bytree=0.8,
    class_weight="balanced",
    random_state=42
)
clf.fit(X_train, y_train)

# ========= 11) EVALUATE =========
p_test = clf.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, p_test)
auprc = average_precision_score(y_test, p_test)

print(f"AUC  : {auc:.3f}")
print(f"AUPRC: {auprc:.3f}")

from sklearn.metrics import precision_recall_curve
prec, rec, thr = precision_recall_curve(y_test, p_test)
for t in [0.05, 0.10, 0.20]:
    idx = (np.abs(thr - t)).argmin() if len(thr) else -1
    if idx >= 0 and idx < len(prec):
        print(f"Threshold~{t:0.2f}: Precision={prec[idx]:.3f}  Recall={rec[idx]:.3f}")

# Evaluate and store results
model_name = "Epi_Flare_LGBM"  # or "Epilepsy_30d", "RA_NextDay", etc.
evaluate_and_store_results(model_name, y_test, p_test, feature_cols, clf, results_path="model_results.csv")


# ========= 12) FEATURE IMPORTANCE =========
imp = pd.Series(clf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("\nTop 25 features:\n", imp.head(25))


✅ 0 symptom cluster columns for epilepsy_seizure
✅ Epilepsy users identified: 410
✅ Dataset filtered to epilepsy users only.
Remaining records: 3969
Flare positive rate: 0.0
✅ Train/Test split complete: 4 train (0 pos, 4 neg), 794 test. Pos rate train=0.000, test=0.000
[LightGBM] [Info] Number of positive: 0, number of negative: 4
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 4, number of used features: 0
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.000000 -> initscore=-34.538776
[LightGBM] [Info] Start training from score -34.538776


ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
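
The failure above follows from the first output line, `0 symptom cluster columns`. The load step keeps only Condition rows, so no `sym__` columns exist, and with no columns to average the label most likely ends up 0 on every row, which matches the printed positive rate of 0.0 and leaves AUC undefined. A minimal sketch of an early check, assuming a hypothetical helper `check_flare_inputs` that would be called right after the label is built:

```python
def check_flare_inputs(sym_cols, labels):
    """Raise early if the flare label cannot be learned from these inputs."""
    if len(sym_cols) == 0:
        raise ValueError(
            "No symptom cluster columns matched; the flare label would be all zeros."
        )
    classes = set(int(v) for v in labels)
    if len(classes) < 2:
        raise ValueError(
            f"Label has a single class {sorted(classes)}; AUC is undefined."
        )
    return True

# Toy usage mirroring the failed run: no symptom columns, all-zero labels
try:
    check_flare_inputs([], [0, 0, 0])
except ValueError as err:
    message = str(err)
    print("Stopped early:", message)
```

Failing at this point keeps the cell from training on 4 rows and surfaces the missing symptom features instead of a metric error.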

In [None]:
# ========= 1) LOAD & CLEAN =========
# Assumes your full long-format table is in df with the columns you listed
path = "/Users/cristybanuelos/Downloads/Chronic_Illness_Dataset.csv"
df = pd.read_csv(path)

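# NOTE: this keeps Condition rows only, so the Symptom/Food/Tag/Treatment/Weather
# subsets selected in step 4 come out empty; drop this filter to build those features.
df = df[df["trackable_type"] == "Condition"].copy()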
df["condition_clean"] = (
    df["trackable_name"].str.lower()
    .str.strip()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
)

# Parse date
df["checkin_date"] = pd.to_datetime(df["checkin_date"], errors="coerce")
df = df.dropna(subset=["checkin_date"])  # keep only rows with a date

# Basic user-level fields
# Age cleaning: set out-of-range values to NaN, then impute
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df.loc[(df["age"] < 0) | (df["age"] > 110), "age"] = np.nan  # blank out implausible ages (kept range: 0-110)
df["age"] = df.groupby("user_id")["age"].transform(lambda s: s.fillna(s.median()))
df["age"] = df["age"].fillna(df["age"].median())

# Sex cleaning: normalize categories
def norm_sex(x):
    x = str(x).strip().lower()
    if x in {"male", "m"}: return "male"
    if x in {"female", "f"}: return "female"
    if x in {"nan", "none", "", "unknown"}: return "unknown"
    return "other"
df["sex"] = df["sex"].apply(norm_sex)

# Country cleaning
def norm_country(x):
    x = str(x).strip()
    return "unknown" if (x == "" or x.lower() == "nan") else x
df["country"] = df["country"].apply(norm_country)

# ========= 2) TEXT NORMALIZATION (for matching) =========
# Create a clean text column for matching on trackable_name
df["name_clean"] = (
    df["trackable_name"]
    .fillna("")
    .astype(str)
    .str.lower()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# ========= 3) HELPERS TO FLAG KEYWORDS =========
def add_keyword_flags(sub_df, groups_dict, prefix):
    """
    For a subset of df (e.g., only Symptoms, only Food...), add one binary column per
    keyword group, set to 1 when the row's name_clean contains any keyword in that group.
    The flags are aggregated to daily features later.
    """
    out = sub_df.copy()
    for group, kws in groups_dict.items():
        col = f"{prefix}__{group}"
        if not kws:
            out[col] = 0  # an empty group would otherwise build a pattern that matches every row
            continue
        # One alternation per group; keywords are escaped so they match literally.
        # The non-capturing group avoids the pandas "match groups" UserWarning.
        pattern = r"(?:" + "|".join(re.escape(k.lower()) for k in kws) + r")"
        out[col] = out["name_clean"].str.contains(pattern, regex=True).astype(int)
    return out

# ========= 4) PER-TYPE KEYWORD FLAGS =========
# NOTE: assumes these dicts are defined earlier in the notebook: condition_keyword_groups
#       (conditions incl. "epilepsy_seizure"), symptom_keyword_groups, food_keyword_groups,
#       tag_keyword_groups, treatment_keyword_groups

# Conditions (Condition rows only)
cond_rows = df[df["trackable_type"] == "Condition"]
cond_rows = add_keyword_flags(cond_rows, condition_keyword_groups, "cond")

# Symptoms
sym_rows = df[df["trackable_type"] == "Symptom"]
sym_rows = add_keyword_flags(sym_rows, symptom_keyword_groups, "sym")

# Food
food_rows = df[df["trackable_type"] == "Food"]
food_rows = add_keyword_flags(food_rows, food_keyword_groups, "food")

# Tags (triggers)
tag_rows = df[df["trackable_type"] == "Tag"]
tag_rows = add_keyword_flags(tag_rows, tag_keyword_groups, "tag")

# Treatments
trt_rows = df[df["trackable_type"] == "Treatment"]
trt_rows = add_keyword_flags(trt_rows, treatment_keyword_groups, "trt")

# Weather (already numeric columns; keep as-is if present)
weather_rows = df[df["trackable_type"] == "Weather"].copy()
# Example expected numeric weather columns (adjust to your schema if needed)
for col in ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]:
    if col in weather_rows.columns:
        weather_rows[col] = pd.to_numeric(weather_rows[col], errors="coerce")

# ========= 5) DAILY AGGREGATION (no leakage) =========
# We build daily features per user, then later roll 7/30d windows that use ONLY past data.

def daily_agg_flags(sub, prefix):
    # get only the flag columns
    flag_cols = [c for c in sub.columns if c.startswith(prefix + "__")]
    if not flag_cols:
        return pd.DataFrame(columns=["user_id", "checkin_date"])
    # include severity if available (0-4); we’ll take max per day
    if "trackable_value" in sub.columns:
        sub["severity_val"] = pd.to_numeric(sub["trackable_value"], errors="coerce")
    else:
        sub["severity_val"] = np.nan

    agg = (
        sub.groupby(["user_id", "checkin_date"])
           .agg({**{c: "max" for c in flag_cols}, "severity_val": "max"})
           .reset_index()
    )
    # rename severity
    if "severity_val" in agg.columns:
        agg = agg.rename(columns={"severity_val": f"{prefix}__max_severity"})
    return agg

daily_cond = daily_agg_flags(cond_rows,  "cond")
daily_sym  = daily_agg_flags(sym_rows,   "sym")
daily_food = daily_agg_flags(food_rows,  "food")
daily_tag  = daily_agg_flags(tag_rows,   "tag")
daily_trt  = daily_agg_flags(trt_rows,   "trt")

# Daily weather (mean if multiple in same day)
if not weather_rows.empty:
    wcols = ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]
    wcols = [c for c in wcols if c in weather_rows.columns]
    daily_wx = (
        weather_rows.groupby(["user_id", "checkin_date"])[wcols].mean().reset_index()
    )
else:
    daily_wx = pd.DataFrame(columns=["user_id", "checkin_date"])

# ========= Combine into one daily table =========
# Start from all user-day keys present in any table
parts = [daily_cond, daily_sym, daily_food, daily_tag, daily_trt, daily_wx]
daily = None
for p in parts:
    if p is None or p.empty: 
        continue
    daily = p if daily is None else pd.merge(daily, p, on=["user_id", "checkin_date"], how="outer")

if daily is None:
    raise ValueError("No daily features built. Check your inputs.")

daily = daily.sort_values(["user_id", "checkin_date"]).reset_index(drop=True)
daily = daily.fillna(0)  # for flags; numeric weather stays 0 if missing (fine for tree models)

# ========= 6) ROLLING (PAST-ONLY) FEATURES =========
# For each user, compute 7d/30d rolling sums of flags + rolling means of severities & weather.
def add_rollups(g):
    g = g.set_index("checkin_date").sort_index()
    # rolling windows (closed='left' to use ONLY past)
    win_defs = {"7d":"7D", "30d":"30D"}
    for col in g.columns:
        if col.startswith(("cond__", "sym__", "food__", "tag__", "trt__")) and not col.endswith("__max_severity"):
            for k, win in win_defs.items():
                g[f"{col}__sum_{k}"] = g[col].rolling(win, closed="left").sum()
        # severities & weather: rolling mean
        if col.endswith("__max_severity") or col in ["temperature_min","temperature_max","precip_intensity","pressure","humidity"]:
            for k, win in win_defs.items():
                g[f"{col}__mean_{k}"] = g[col].rolling(win, closed="left").mean()
    return g.reset_index()

daily = daily.groupby("user_id", group_keys=False).apply(add_rollups)
# Fill remaining NaNs from leading window edges
daily = daily.fillna(0)

# ========= 7) BUILD THE TARGET: NEXT-DAY FLARE (cluster-based) =========
target_condition = "depression"   # one of: pots_dysautonomia / epilepsy_seizure / anxiety / depression

# choose cluster sets
flare_clusters = {
    "pots_dysautonomia": ["fatigue_exhaustion", "cardiovascular_symptoms",
             "lightheaded_dizziness", "sleep_symptoms", "headache_migraine"],
    "epilepsy_seizure": ["neurologic_other", "sleep_symptoms",
                 "fatigue_exhaustion", "cognitive_symptoms"],
    "anxiety": ["anxiety_fear_panic", "stress_tension",
                "sleep_symptoms", "cardiovascular_symptoms"],
    "depression": ["negative_affect", "fatigue_exhaustion",
                   "sleep_symptoms", "cognitive_symptoms"]
}[target_condition]

sym_cols = [c for c in daily.columns
            if c.startswith("sym__") and any(k in c for k in flare_clusters)]
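print(f"✅ {len(sym_cols)} symptom cluster columns for {target_condition}")
if not sym_cols:
    # Without sym__ columns, symptom_mean is all NaN and every label silently becomes 0
    raise ValueError(f"No sym__ columns found for {target_condition}; the flare label would be all zeros.")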

daily["symptom_mean"] = daily[sym_cols].mean(axis=1)
daily["symptom_delta"] = daily.groupby("user_id")["symptom_mean"].diff().fillna(0)
flare_threshold = 0.1
daily["label_nextday_flare"] = (
    daily.groupby("user_id")["symptom_delta"].shift(-1).fillna(0) > flare_threshold
).astype(int)


# ========= 8) ADD DEMOGRAPHICS (static) =========
demo = df.drop_duplicates("user_id")[["user_id","age","sex","country"]].copy()
daily = daily.merge(demo, on="user_id", how="left")

# One-hot encode sex & country (country can be many; consider top-K and bucket rest as 'other')
top_countries = df["country"].value_counts().head(20).index
daily["country_top"] = daily["country"].where(daily["country"].isin(top_countries), "other")

X_cat = pd.get_dummies(daily[["sex","country_top"]], drop_first=False, dtype=int)
daily = pd.concat([daily.drop(columns=["sex","country","country_top"]), X_cat], axis=1)

# ========= FILTER TO DEPRESSION USERS =========
import re

# 1️⃣ Identify condition columns related to depression or mood disorders
cond_flag_cols = [c for c in daily.columns if c.startswith("cond__")]

# Pattern matches depression-related cluster names
depression_pattern = re.compile(
    r"(depression|depressive|mdd|major[_\-\s]*depressive|dysthymia|low[_\-\s]*mood)",
    re.I,
)

depression_flag_cols = [
    c for c in cond_flag_cols if depression_pattern.search(c.replace("cond__", ""))
]

if not depression_flag_cols:
    raise ValueError("No depression condition columns found in your dataset.")

# 2️⃣ Find users who have ever logged any depression-related conditions
depression_users = daily.loc[
    daily[depression_flag_cols].sum(axis=1) > 0, "user_id"
].unique()
print(f"✅ Depression users identified: {len(depression_users)}")

# 3️⃣ Restrict dataset to only those users
daily = daily[daily["user_id"].isin(depression_users)].copy()

print("✅ Dataset filtered to depression users only.")
print("Remaining records:", len(daily))
print("Flare positive rate:", daily['label_nextday_flare'].mean().round(3))


# ========= 9) TRAIN / TEST TEMPORAL SPLIT =========
cutoff = daily["checkin_date"].quantile(0.8)  # 80% oldest for train, 20% most recent for test
train = daily[daily["checkin_date"] <= cutoff].copy()
test  = daily[daily["checkin_date"] >  cutoff].copy()

# Features: exclude identifiers & leakage columns
label_col = "label_nextday_flare"

leaky_cols = [c for c in daily.columns if "symptom" in c and "pain_mean" in c]

drop_cols = {"user_id", "checkin_date", "label_nextday_flare"}.union(leaky_cols)

feature_cols = [c for c in daily.columns if c not in drop_cols]

X_train, X_test, y_train, y_test, feature_cols = make_train_test_balanced(
    df=daily,
    label_col=label_col,
    drop_cols=drop_cols,
    pos_to_neg_ratio=4,   # 4:1 neg:pos is usually good
)

# ========= 10) TRAIN LIGHTGBM (unchanged) =========
clf = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.03,
    num_leaves=64,
    subsample=0.8,
    colsample_bytree=0.8,
    class_weight="balanced",
    random_state=42
)
clf.fit(X_train, y_train)

# ========= 11) EVALUATE =========
p_test = clf.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, p_test)
auprc = average_precision_score(y_test, p_test)

print(f"AUC  : {auc:.3f}")
print(f"AUPRC: {auprc:.3f}")

from sklearn.metrics import precision_recall_curve
prec, rec, thr = precision_recall_curve(y_test, p_test)
for t in [0.05, 0.10, 0.20]:
    idx = (np.abs(thr - t)).argmin() if len(thr) else -1
    if idx >= 0 and idx < len(prec):
        print(f"Threshold~{t:0.2f}: Precision={prec[idx]:.3f}  Recall={rec[idx]:.3f}")

# Evaluate and store results
model_name = "Dep_Flare_LGBM"  # or "Epilepsy_30d", "RA_NextDay", etc.
evaluate_and_store_results(model_name, y_test, p_test, feature_cols, clf, results_path="model_results.csv")


# ========= 12) FEATURE IMPORTANCE =========
imp = pd.Series(clf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("\nTop 25 features:\n", imp.head(25))
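
Section 6 relies on `closed="left"` so that each rolling feature sees only earlier days. The sketch below shows the effect on toy data (the dates and values are made up for illustration). With a 2-day window the current day's value is excluded, and the first row has no history at all.

```python
import pandas as pd

# Toy daily flag for one user; values are illustrative only.
s = pd.Series(
    [1, 1, 0, 1],
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]),
)

# closed="left" uses the window [t - 2 days, t): the current day is left out.
past_only = s.rolling("2D", closed="left").sum()

# The default for offset windows is closed="right": (t - 2 days, t], current day included.
with_today = s.rolling("2D").sum()

print(pd.DataFrame({"flag": s, "past_only": past_only, "with_today": with_today}))
```

The first `past_only` value is NaN because no earlier observation exists. The notebook fills those leading edges with 0 after the rollups.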


In [None]:
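# ========= 1) LOAD & CLEAN =========
# Load the full long-format table (all trackable types) from CSV
import re

path = "/Users/cristybanuelos/Downloads/Chronic_Illness_Dataset.csv"
df = pd.read_csv(path)

# Keep every trackable_type here: sections 4-5 need the Symptom, Food, Tag, Treatment
# and Weather rows. Filtering df to Condition rows at this point leaves those subsets
# empty, so no sym__ columns are built and every flare label comes out 0.
is_condition = df["trackable_type"] == "Condition"
df.loc[is_condition, "condition_clean"] = (
    df.loc[is_condition, "trackable_name"].str.lower()
    .str.strip()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
)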

# Parse date
df["checkin_date"] = pd.to_datetime(df["checkin_date"], errors="coerce")
df = df.dropna(subset=["checkin_date"])  # keep only rows with a date

# Basic user-level fields
# Age cleaning: set out-of-range values to NaN, then impute (per-user median, then global median)
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df.loc[(df["age"] < 0) | (df["age"] > 110), "age"] = np.nan  # null out biologically implausible ages
df["age"] = df.groupby("user_id")["age"].transform(lambda s: s.fillna(s.median()))
df["age"] = df["age"].fillna(df["age"].median())

# Sex cleaning: normalize categories
def norm_sex(x):
    x = str(x).strip().lower()
    if x in {"male", "m"}: return "male"
    if x in {"female", "f"}: return "female"
    if x in {"nan", "none", "", "unknown"}: return "unknown"
    return "other"
df["sex"] = df["sex"].apply(norm_sex)

# Country cleaning
def norm_country(x):
    x = str(x).strip()
    return "unknown" if (x == "" or x.lower() == "nan") else x
df["country"] = df["country"].apply(norm_country)

# ========= 2) TEXT NORMALIZATION (for matching) =========
# Create a clean text column for matching on trackable_name
df["name_clean"] = (
    df["trackable_name"]
    .fillna("")
    .astype(str)
    .str.lower()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# ========= 3) HELPERS TO FLAG KEYWORDS =========
def add_keyword_flags(sub_df, groups_dict, prefix):
    """
    For a subset of df (e.g., only Symptoms, only Food...), add one binary column per
    keyword group, set to 1 when the row's name_clean contains any keyword in that group.
    The flags are aggregated to daily features later.
    """
    out = sub_df.copy()
    for group, kws in groups_dict.items():
        col = f"{prefix}__{group}"
        if not kws:
            out[col] = 0  # an empty group would otherwise build a pattern that matches every row
            continue
        # One alternation per group; keywords are escaped so they match literally.
        # The non-capturing group avoids the pandas "match groups" UserWarning.
        pattern = r"(?:" + "|".join(re.escape(k.lower()) for k in kws) + r")"
        out[col] = out["name_clean"].str.contains(pattern, regex=True).astype(int)
    return out

# ========= 4) PER-TYPE KEYWORD FLAGS =========
# NOTE: assumes these dicts are defined earlier in the notebook: condition_keyword_groups
#       (conditions incl. "epilepsy_seizure"), symptom_keyword_groups, food_keyword_groups,
#       tag_keyword_groups, treatment_keyword_groups

# Conditions (Condition rows only)
cond_rows = df[df["trackable_type"] == "Condition"]
cond_rows = add_keyword_flags(cond_rows, condition_keyword_groups, "cond")

# Symptoms
sym_rows = df[df["trackable_type"] == "Symptom"]
sym_rows = add_keyword_flags(sym_rows, symptom_keyword_groups, "sym")

# Food
food_rows = df[df["trackable_type"] == "Food"]
food_rows = add_keyword_flags(food_rows, food_keyword_groups, "food")

# Tags (triggers)
tag_rows = df[df["trackable_type"] == "Tag"]
tag_rows = add_keyword_flags(tag_rows, tag_keyword_groups, "tag")

# Treatments
trt_rows = df[df["trackable_type"] == "Treatment"]
trt_rows = add_keyword_flags(trt_rows, treatment_keyword_groups, "trt")

# Weather (already numeric columns; keep as-is if present)
weather_rows = df[df["trackable_type"] == "Weather"].copy()
# Example expected numeric weather columns (adjust to your schema if needed)
for col in ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]:
    if col in weather_rows.columns:
        weather_rows[col] = pd.to_numeric(weather_rows[col], errors="coerce")

# ========= 5) DAILY AGGREGATION (no leakage) =========
# We build daily features per user, then later roll 7/30d windows that use ONLY past data.

def daily_agg_flags(sub, prefix):
    # get only the flag columns
    flag_cols = [c for c in sub.columns if c.startswith(prefix + "__")]
    if not flag_cols:
        return pd.DataFrame(columns=["user_id", "checkin_date"])
    # include severity if available (0-4); we’ll take max per day
    if "trackable_value" in sub.columns:
        sub["severity_val"] = pd.to_numeric(sub["trackable_value"], errors="coerce")
    else:
        sub["severity_val"] = np.nan

    agg = (
        sub.groupby(["user_id", "checkin_date"])
           .agg({**{c: "max" for c in flag_cols}, "severity_val": "max"})
           .reset_index()
    )
    # rename severity
    if "severity_val" in agg.columns:
        agg = agg.rename(columns={"severity_val": f"{prefix}__max_severity"})
    return agg

daily_cond = daily_agg_flags(cond_rows,  "cond")
daily_sym  = daily_agg_flags(sym_rows,   "sym")
daily_food = daily_agg_flags(food_rows,  "food")
daily_tag  = daily_agg_flags(tag_rows,   "tag")
daily_trt  = daily_agg_flags(trt_rows,   "trt")

# Daily weather (mean if multiple in same day)
if not weather_rows.empty:
    wcols = ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]
    wcols = [c for c in wcols if c in weather_rows.columns]
    daily_wx = (
        weather_rows.groupby(["user_id", "checkin_date"])[wcols].mean().reset_index()
    )
else:
    daily_wx = pd.DataFrame(columns=["user_id", "checkin_date"])

# ========= Combine into one daily table =========
# Start from all user-day keys present in any table
parts = [daily_cond, daily_sym, daily_food, daily_tag, daily_trt, daily_wx]
daily = None
for p in parts:
    if p is None or p.empty: 
        continue
    daily = p if daily is None else pd.merge(daily, p, on=["user_id", "checkin_date"], how="outer")

if daily is None:
    raise ValueError("No daily features built. Check your inputs.")

daily = daily.sort_values(["user_id", "checkin_date"]).reset_index(drop=True)
daily = daily.fillna(0)  # for flags; numeric weather stays 0 if missing (fine for tree models)

# ========= 6) ROLLING (PAST-ONLY) FEATURES =========
# For each user, compute 7d/30d rolling sums of flags + rolling means of severities & weather.
def add_rollups(g):
    g = g.set_index("checkin_date").sort_index()
    # rolling windows (closed='left' to use ONLY past)
    win_defs = {"7d":"7D", "30d":"30D"}
    for col in g.columns:
        if col.startswith(("cond__", "sym__", "food__", "tag__", "trt__")) and not col.endswith("__max_severity"):
            for k, win in win_defs.items():
                g[f"{col}__sum_{k}"] = g[col].rolling(win, closed="left").sum()
        # severities & weather: rolling mean
        if col.endswith("__max_severity") or col in ["temperature_min","temperature_max","precip_intensity","pressure","humidity"]:
            for k, win in win_defs.items():
                g[f"{col}__mean_{k}"] = g[col].rolling(win, closed="left").mean()
    return g.reset_index()

daily = daily.groupby("user_id", group_keys=False).apply(add_rollups)
# Fill remaining NaNs from leading window edges
daily = daily.fillna(0)

# ========= 7) BUILD THE TARGET: NEXT-DAY FLARE (cluster-based) =========
target_condition = "anxiety"   # one of: pots_dysautonomia / epilepsy_seizure / anxiety / depression

# choose cluster sets
flare_clusters = {
    "pots_dysautonomia": ["fatigue_exhaustion", "cardiovascular_symptoms",
             "lightheaded_dizziness", "sleep_symptoms", "headache_migraine"],
    "epilepsy_seizure": ["neurologic_other", "sleep_symptoms",
                 "fatigue_exhaustion", "cognitive_symptoms"],
    "anxiety": ["anxiety_fear_panic", "stress_tension",
                "sleep_symptoms", "cardiovascular_symptoms"],
    "depression": ["negative_affect", "fatigue_exhaustion",
                   "sleep_symptoms", "cognitive_symptoms"]
}[target_condition]

sym_cols = [c for c in daily.columns
            if c.startswith("sym__") and any(k in c for k in flare_clusters)]
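print(f"✅ {len(sym_cols)} symptom cluster columns for {target_condition}")
if not sym_cols:
    # Without sym__ columns, symptom_mean is all NaN and every label silently becomes 0
    raise ValueError(f"No sym__ columns found for {target_condition}; the flare label would be all zeros.")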

daily["symptom_mean"] = daily[sym_cols].mean(axis=1)
daily["symptom_delta"] = daily.groupby("user_id")["symptom_mean"].diff().fillna(0)
flare_threshold = 0.1
daily["label_nextday_flare"] = (
    daily.groupby("user_id")["symptom_delta"].shift(-1).fillna(0) > flare_threshold
).astype(int)


# ========= 8) ADD DEMOGRAPHICS (static) =========
demo = df.drop_duplicates("user_id")[["user_id","age","sex","country"]].copy()
daily = daily.merge(demo, on="user_id", how="left")

# One-hot encode sex & country (country can be many; consider top-K and bucket rest as 'other')
top_countries = df["country"].value_counts().head(20).index
daily["country_top"] = daily["country"].where(daily["country"].isin(top_countries), "other")

X_cat = pd.get_dummies(daily[["sex","country_top"]], drop_first=False, dtype=int)
daily = pd.concat([daily.drop(columns=["sex","country","country_top"]), X_cat], axis=1)

# ========= FILTER TO ANXIETY USERS =========
import re

# 1️⃣ Identify condition columns related to anxiety or panic disorders
cond_flag_cols = [c for c in daily.columns if c.startswith("cond__")]

# Pattern matches anxiety-related condition cluster names
anxiety_pattern = re.compile(
    r"(anxiety|panic|gad|generalized[_\-\s]*anxiety|social[_\-\s]*anxiety|phobia|ocd|worry)",
    re.I,
)

anxiety_flag_cols = [
    c for c in cond_flag_cols if anxiety_pattern.search(c.replace("cond__", ""))
]

if not anxiety_flag_cols:
    raise ValueError("No anxiety condition columns found in your dataset.")

# 2️⃣ Find users who have ever logged any anxiety-related conditions
anxiety_users = daily.loc[
    daily[anxiety_flag_cols].sum(axis=1) > 0, "user_id"
].unique()
print(f"✅ Anxiety users identified: {len(anxiety_users)}")

# 3️⃣ Restrict dataset to only those users
daily = daily[daily["user_id"].isin(anxiety_users)].copy()

print("✅ Dataset filtered to anxiety users only.")
print("Remaining records:", len(daily))
print("Flare positive rate:", daily['label_nextday_flare'].mean().round(3))

# ========= 9) TRAIN / TEST TEMPORAL SPLIT =========
cutoff = daily["checkin_date"].quantile(0.8)  # 80% oldest for train, 20% most recent for test
train = daily[daily["checkin_date"] <= cutoff].copy()
test  = daily[daily["checkin_date"] >  cutoff].copy()

# Features: exclude identifiers & leakage columns
label_col = "label_nextday_flare"

leaky_cols = [c for c in daily.columns if "symptom" in c and "pain_mean" in c]

drop_cols = {"user_id", "checkin_date", "label_nextday_flare"}.union(leaky_cols)

feature_cols = [c for c in daily.columns if c not in drop_cols]

X_train, X_test, y_train, y_test, feature_cols = make_train_test_balanced(
    df=daily,
    label_col=label_col,
    drop_cols=drop_cols,
    pos_to_neg_ratio=4,   # 4:1 neg:pos is usually good
)

# ========= 10) TRAIN LIGHTGBM (unchanged) =========
clf = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.03,
    num_leaves=64,
    subsample=0.8,
    colsample_bytree=0.8,
    class_weight="balanced",
    random_state=42
)
clf.fit(X_train, y_train)

# ========= 11) EVALUATE =========
p_test = clf.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, p_test)
auprc = average_precision_score(y_test, p_test)

print(f"AUC  : {auc:.3f}")
print(f"AUPRC: {auprc:.3f}")

from sklearn.metrics import precision_recall_curve
prec, rec, thr = precision_recall_curve(y_test, p_test)
for t in [0.05, 0.10, 0.20]:
    idx = (np.abs(thr - t)).argmin() if len(thr) else -1
    if idx >= 0 and idx < len(prec):
        print(f"Threshold~{t:0.2f}: Precision={prec[idx]:.3f}  Recall={rec[idx]:.3f}")

# Evaluate and store results
model_name = "Anx_Flare_LGBM"  # or "Epilepsy_30d", "RA_NextDay", etc.
evaluate_and_store_results(model_name, y_test, p_test, feature_cols, clf, results_path="model_results.csv")


# ========= 12) FEATURE IMPORTANCE =========
imp = pd.Series(clf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("\nTop 25 features:\n", imp.head(25))
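
The depression and anxiety cells above differ only in `target_condition`, the condition regex, and `model_name`. One way to avoid copy-paste drift is to keep those values in a single lookup. The sketch below is an assumption about how that could look, not the notebook's code. `CONDITION_CONFIG` and `select_condition_columns` are hypothetical names and the column list is illustrative; the patterns and model names are copied from the two cells above.

```python
import re

# Hypothetical lookup: one entry per target condition.
CONDITION_CONFIG = {
    "depression": {
        "pattern": r"(depression|depressive|mdd|major[_\-\s]*depressive|dysthymia|low[_\-\s]*mood)",
        "model_name": "Dep_Flare_LGBM",
    },
    "anxiety": {
        "pattern": r"(anxiety|panic|gad|generalized[_\-\s]*anxiety|social[_\-\s]*anxiety|phobia|ocd|worry)",
        "model_name": "Anx_Flare_LGBM",
    },
}

def select_condition_columns(columns, condition):
    """Return the cond__ columns whose cluster name matches the condition's pattern."""
    pattern = re.compile(CONDITION_CONFIG[condition]["pattern"], re.I)
    return [
        c for c in columns
        if c.startswith("cond__") and pattern.search(c[len("cond__"):])
    ]

# Illustrative column names only.
cols = ["cond__depression", "cond__anxiety", "cond__pots_dysautonomia", "sym__sleep_symptoms"]
print(select_condition_columns(cols, "anxiety"))
print(CONDITION_CONFIG["depression"]["model_name"])
```

With a lookup like this, the filter step and the results label both come from the same key, so a cell cannot train on one condition and store its results under another condition's name.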


In [None]:
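# ========= 1) LOAD & CLEAN =========
# Load the full long-format table (all trackable types) from CSV
import re

path = "/Users/cristybanuelos/Downloads/Chronic_Illness_Dataset.csv"
df = pd.read_csv(path)

# Keep every trackable_type here: sections 4-5 need the Symptom, Food, Tag, Treatment
# and Weather rows. Filtering df to Condition rows at this point leaves those subsets
# empty, so no sym__ columns are built and every flare label comes out 0.
is_condition = df["trackable_type"] == "Condition"
df.loc[is_condition, "condition_clean"] = (
    df.loc[is_condition, "trackable_name"].str.lower()
    .str.strip()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
)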

# Parse date
df["checkin_date"] = pd.to_datetime(df["checkin_date"], errors="coerce")
df = df.dropna(subset=["checkin_date"])  # keep only rows with a date

# Basic user-level fields
# Age cleaning: set out-of-range values to NaN, then impute (per-user median, then global median)
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df.loc[(df["age"] < 0) | (df["age"] > 110), "age"] = np.nan  # null out biologically implausible ages
df["age"] = df.groupby("user_id")["age"].transform(lambda s: s.fillna(s.median()))
df["age"] = df["age"].fillna(df["age"].median())

# Sex cleaning: normalize categories
def norm_sex(x):
    x = str(x).strip().lower()
    if x in {"male", "m"}: return "male"
    if x in {"female", "f"}: return "female"
    if x in {"nan", "none", "", "unknown"}: return "unknown"
    return "other"
df["sex"] = df["sex"].apply(norm_sex)

# Country cleaning
def norm_country(x):
    x = str(x).strip()
    return "unknown" if (x == "" or x.lower() == "nan") else x
df["country"] = df["country"].apply(norm_country)

# ========= 2) TEXT NORMALIZATION (for matching) =========
# Create a clean text column for matching on trackable_name
df["name_clean"] = (
    df["trackable_name"]
    .fillna("")
    .astype(str)
    .str.lower()
    .str.replace(r"[^a-z0-9\s\-']", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# ========= 3) HELPERS TO FLAG KEYWORDS =========
def add_keyword_flags(sub_df, groups_dict, prefix):
    """
    For a subset of df (e.g., only Symptoms, only Food...), add one binary column per
    keyword group, set to 1 when the row's name_clean contains any keyword in that group.
    The flags are aggregated to daily features later.
    """
    out = sub_df.copy()
    for group, kws in groups_dict.items():
        col = f"{prefix}__{group}"
        if not kws:
            out[col] = 0  # an empty group would otherwise build a pattern that matches every row
            continue
        # One alternation per group; keywords are escaped so they match literally.
        # The non-capturing group avoids the pandas "match groups" UserWarning.
        pattern = r"(?:" + "|".join(re.escape(k.lower()) for k in kws) + r")"
        out[col] = out["name_clean"].str.contains(pattern, regex=True).astype(int)
    return out

# ========= 4) PER-TYPE KEYWORD FLAGS =========
# NOTE: assumes these dicts are defined earlier in the notebook: condition_keyword_groups
#       (conditions incl. "epilepsy_seizure"), symptom_keyword_groups, food_keyword_groups,
#       tag_keyword_groups, treatment_keyword_groups

# Conditions (Condition rows only)
cond_rows = df[df["trackable_type"] == "Condition"]
cond_rows = add_keyword_flags(cond_rows, condition_keyword_groups, "cond")

# Symptoms
sym_rows = df[df["trackable_type"] == "Symptom"]
sym_rows = add_keyword_flags(sym_rows, symptom_keyword_groups, "sym")

# Food
food_rows = df[df["trackable_type"] == "Food"]
food_rows = add_keyword_flags(food_rows, food_keyword_groups, "food")

# Tags (triggers)
tag_rows = df[df["trackable_type"] == "Tag"]
tag_rows = add_keyword_flags(tag_rows, tag_keyword_groups, "tag")

# Treatments
trt_rows = df[df["trackable_type"] == "Treatment"]
trt_rows = add_keyword_flags(trt_rows, treatment_keyword_groups, "trt")

# Weather (already numeric columns; keep as-is if present)
weather_rows = df[df["trackable_type"] == "Weather"].copy()
# Example expected numeric weather columns (adjust to your schema if needed)
for col in ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]:
    if col in weather_rows.columns:
        weather_rows[col] = pd.to_numeric(weather_rows[col], errors="coerce")

# ========= 5) DAILY AGGREGATION (no leakage) =========
# We build daily features per user, then later roll 7/30d windows that use ONLY past data.

def daily_agg_flags(sub, prefix):
    # get only the flag columns
    flag_cols = [c for c in sub.columns if c.startswith(prefix + "__")]
    if not flag_cols:
        return pd.DataFrame(columns=["user_id", "checkin_date"])
    # include severity if available (0-4); we’ll take max per day
    if "trackable_value" in sub.columns:
        sub["severity_val"] = pd.to_numeric(sub["trackable_value"], errors="coerce")
    else:
        sub["severity_val"] = np.nan

    agg = (
        sub.groupby(["user_id", "checkin_date"])
           .agg({**{c: "max" for c in flag_cols}, "severity_val": "max"})
           .reset_index()
    )
    # rename severity
    if "severity_val" in agg.columns:
        agg = agg.rename(columns={"severity_val": f"{prefix}__max_severity"})
    return agg

daily_cond = daily_agg_flags(cond_rows,  "cond")
daily_sym  = daily_agg_flags(sym_rows,   "sym")
daily_food = daily_agg_flags(food_rows,  "food")
daily_tag  = daily_agg_flags(tag_rows,   "tag")
daily_trt  = daily_agg_flags(trt_rows,   "trt")

# Daily weather (mean if multiple in same day)
if not weather_rows.empty:
    wcols = ["temperature_min", "temperature_max", "precip_intensity", "pressure", "humidity"]
    wcols = [c for c in wcols if c in weather_rows.columns]
    daily_wx = (
        weather_rows.groupby(["user_id", "checkin_date"])[wcols].mean().reset_index()
    )
else:
    daily_wx = pd.DataFrame(columns=["user_id", "checkin_date"])

# ========= Combine into one daily table =========
# Start from all user-day keys present in any table
parts = [daily_cond, daily_sym, daily_food, daily_tag, daily_trt, daily_wx]
daily = None
for p in parts:
    if p is None or p.empty: 
        continue
    daily = p if daily is None else pd.merge(daily, p, on=["user_id", "checkin_date"], how="outer")

if daily is None:
    raise ValueError("No daily features built. Check your inputs.")

daily = daily.sort_values(["user_id", "checkin_date"]).reset_index(drop=True)
daily = daily.fillna(0)  # for flags; numeric weather stays 0 if missing (fine for tree models)

# ========= 6) ROLLING (PAST-ONLY) FEATURES =========
# For each user, compute 7d/30d rolling sums of flags + rolling means of severities & weather.
def add_rollups(g):
    g = g.set_index("checkin_date").sort_index()
    # rolling windows (closed='left' to use ONLY past)
    win_defs = {"7d":"7D", "30d":"30D"}
    for col in g.columns:
        if col.startswith(("cond__", "sym__", "food__", "tag__", "trt__")) and not col.endswith("__max_severity"):
            for k, win in win_defs.items():
                g[f"{col}__sum_{k}"] = g[col].rolling(win, closed="left").sum()
        # severities & weather: rolling mean
        if col.endswith("__max_severity") or col in ["temperature_min","temperature_max","precip_intensity","pressure","humidity"]:
            for k, win in win_defs.items():
                g[f"{col}__mean_{k}"] = g[col].rolling(win, closed="left").mean()
    return g.reset_index()

daily = daily.groupby("user_id", group_keys=False).apply(add_rollups)
# Fill remaining NaNs from leading window edges
daily = daily.fillna(0)

# ========= 7) BUILD THE TARGET: NEXT-DAY FLARE (cluster-based) =========
target_condition = "pots_dysautonomia"   # one of: pots_dysautonomia / epilepsy_seizure / anxiety / depression

# choose cluster sets
flare_clusters = {
    "pots_dysautonomia": ["fatigue_exhaustion", "cardiovascular_symptoms",
             "lightheaded_dizziness", "sleep_symptoms", "headache_migraine"],
    "epilepsy_seizure": ["neurologic_other", "sleep_symptoms",
                 "fatigue_exhaustion", "cognitive_symptoms"],
    "anxiety": ["anxiety_fear_panic", "stress_tension",
                "sleep_symptoms", "cardiovascular_symptoms"],
    "depression": ["negative_affect", "fatigue_exhaustion",
                   "sleep_symptoms", "cognitive_symptoms"]
}[target_condition]

sym_cols = [c for c in daily.columns
            if c.startswith("sym__") and any(k in c for k in flare_clusters)]
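print(f"✅ {len(sym_cols)} symptom cluster columns for {target_condition}")
if not sym_cols:
    # Without sym__ columns, symptom_mean is all NaN and every label silently becomes 0
    raise ValueError(f"No sym__ columns found for {target_condition}; the flare label would be all zeros.")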

daily["symptom_mean"] = daily[sym_cols].mean(axis=1)
daily["symptom_delta"] = daily.groupby("user_id")["symptom_mean"].diff().fillna(0)
flare_threshold = 0.1
daily["label_nextday_flare"] = (
    daily.groupby("user_id")["symptom_delta"].shift(-1).fillna(0) > flare_threshold
).astype(int)


# ========= 8) ADD DEMOGRAPHICS (static) =========
demo = df.drop_duplicates("user_id")[["user_id","age","sex","country"]].copy()
daily = daily.merge(demo, on="user_id", how="left")

# One-hot encode sex & country (country can be many; consider top-K and bucket rest as 'other')
top_countries = df["country"].value_counts().head(20).index
daily["country_top"] = daily["country"].where(daily["country"].isin(top_countries), "other")

X_cat = pd.get_dummies(daily[["sex","country_top"]], drop_first=False, dtype=int)
daily = pd.concat([daily.drop(columns=["sex","country","country_top"]), X_cat], axis=1)

# ========= FILTER TO POTS USERS =========
import re

# 1️⃣ Identify condition columns related to POTS / Dysautonomia
cond_flag_cols = [c for c in daily.columns if c.startswith("cond__")]

# Pattern matches typical POTS and autonomic dysfunction cluster names
pots_pattern = re.compile(
    r"(pots|postural[_\-\s]*orthostatic[_\-\s]*tachycardia|dysautonomia|autonomic[_\-\s]*dysfunction)",
    re.I,
)

pots_flag_cols = [
    c for c in cond_flag_cols if pots_pattern.search(c.replace("cond__", ""))
]

if not pots_flag_cols:
    raise ValueError("No POTS condition columns found in your dataset.")

# 2️⃣ Find users who have ever logged any POTS-related conditions
pots_users = daily.loc[
    daily[pots_flag_cols].sum(axis=1) > 0, "user_id"
].unique()
print(f"✅ POTS users identified: {len(pots_users)}")

# 3️⃣ Restrict dataset to only those users
daily = daily[daily["user_id"].isin(pots_users)].copy()

print("✅ Dataset filtered to POTS users only.")
print("Remaining records:", len(daily))
print("Flare positive rate:", daily['label_nextday_flare'].mean().round(3))


# ========= 9) TRAIN / TEST TEMPORAL SPLIT =========
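cutoff = daily["checkin_date"].quantile(0.8)  # 80% oldest for train, 20% most recent for test
train = daily[daily["checkin_date"] <= cutoff].copy()
test  = daily[daily["checkin_date"] >  cutoff].copy()
# NOTE: `train` / `test` are not referenced below; make_train_test_balanced receives the full `daily` frame.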

# Features: exclude identifiers & leakage columns
label_col = "label_nextday_flare"

leaky_cols = [c for c in daily.columns if "symptom" in c and "pain_mean" in c]

drop_cols = {"user_id", "checkin_date", "label_nextday_flare"}.union(leaky_cols)

feature_cols = [c for c in daily.columns if c not in drop_cols]

X_train, X_test, y_train, y_test, feature_cols = make_train_test_balanced(
    df=daily,
    label_col=label_col,
    drop_cols=drop_cols,
    pos_to_neg_ratio=4,   # intended as 4 negatives per positive (note the parameter name reads pos-to-neg)
)

# ========= 10) TRAIN LIGHTGBM (unchanged) =========
clf = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.03,
    num_leaves=64,
    subsample=0.8,        # row bagging only takes effect when subsample_freq > 0 (LightGBM default is 0)
    colsample_bytree=0.8,
    class_weight="balanced",
    random_state=42
)
clf.fit(X_train, y_train)

# ========= 11) EVALUATE =========
p_test = clf.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, p_test)
auprc = average_precision_score(y_test, p_test)

print(f"AUC  : {auc:.3f}")
print(f"AUPRC: {auprc:.3f}")

from sklearn.metrics import precision_recall_curve
prec, rec, thr = precision_recall_curve(y_test, p_test)
for t in [0.05, 0.10, 0.20]:
    idx = (np.abs(thr - t)).argmin() if len(thr) else -1
    if idx >= 0 and idx < len(prec):
        print(f"Threshold~{t:0.2f}: Precision={prec[idx]:.3f}  Recall={rec[idx]:.3f}")

# Evaluate and store results
model_name = "POTS_Flare_LGBM"  # matches the POTS filter above; or "Epilepsy_30d", "RA_NextDay", etc.
evaluate_and_store_results(model_name, y_test, p_test, feature_cols, clf, results_path="model_results.csv")


# ========= 12) FEATURE IMPORTANCE =========
imp = pd.Series(clf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("\nTop 25 features:\n", imp.head(25))
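
The next-day label built earlier in this cell can be checked on a toy frame. The sketch below is only an illustration under stated assumptions: the `toy` data and the `has_next_day` column are made up here and are not part of the pipeline, and the rows are assumed to be one per user per day, already sorted by date (the `shift(-1)` moves by row, not by calendar day). It shows that each user's last row has no following day, so `fillna(0)` labels it as a non-flare.

```python
import pandas as pd

# Toy data, invented for illustration: two users, consecutive daily rows
toy = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 2, 2, 2],
    "symptom_mean": [1.0, 1.05, 1.5, 1.2, 2.0, 2.0, 2.3],
})

# Same construction as in the pipeline: per-user day-over-day change
toy["symptom_delta"] = toy.groupby("user_id")["symptom_mean"].diff().fillna(0)

flare_threshold = 0.1
# Tomorrow's delta, aligned to today's row; NaN on each user's last row
next_delta = toy.groupby("user_id")["symptom_delta"].shift(-1)

toy["label_nextday_flare"] = (next_delta.fillna(0) > flare_threshold).astype(int)
# Hypothetical helper column: False where the label is a fill-in, not an observation
toy["has_next_day"] = next_delta.notna()

print(toy)
```

If those fill-in rows matter for a given model, one option is to drop rows where `has_next_day` is `False` before training, rather than treating them as negatives.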


---
### Modeling Results
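
The comparison plots in this section read `Sensitivity` and `Specificity` columns from `model_results.csv`. The helper that writes them (`evaluate_and_store_results`) is not shown here, so the following is only a minimal sketch of how such values are commonly derived from predicted probabilities at a single threshold. The function name `sensitivity_specificity`, the 0.5 threshold, and the toy data are assumptions for illustration, not the helper's actual implementation.

```python
import numpy as np

def sensitivity_specificity(y_true, p_pred, threshold=0.5):
    """Hypothetical helper: sensitivity and specificity at one probability threshold."""
    y_true = np.asarray(y_true).astype(int)
    y_hat = (np.asarray(p_pred) >= threshold).astype(int)

    # Confusion-matrix cells counted directly
    tp = int(((y_true == 1) & (y_hat == 1)).sum())
    fn = int(((y_true == 1) & (y_hat == 0)).sum())
    tn = int(((y_true == 0) & (y_hat == 0)).sum())
    fp = int(((y_true == 0) & (y_hat == 1)).sum())

    # Guard against empty classes
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

# Toy labels and scores, made up for illustration only
y_demo = [0, 0, 0, 1, 1, 0, 1, 0]
p_demo = [0.10, 0.60, 0.20, 0.80, 0.40, 0.30, 0.90, 0.05]
sens, spec = sensitivity_specificity(y_demo, p_demo, threshold=0.5)
print(f"Sensitivity={sens:.3f}  Specificity={spec:.3f}")
```

Both numbers depend on the chosen threshold, so models are only comparable on the scatter plot if they were scored at the same cut-off.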

In [None]:
pd.set_option('display.max_colwidth', None)
results_df = pd.read_csv("model_results.csv")
results_df

In [None]:
results_df = pd.read_csv("model_results.csv")

# --- Barplot for AUC & AUPRC ---
plt.figure(figsize=(8, 5))

sns.barplot(data=results_df.melt(id_vars="Model", value_vars=["AUC", "AUPRC"]),
            x="Model", y="value", hue="variable")
plt.title("Model Comparison: AUC and AUPRC")
plt.ylabel("Score")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

# --- Sensitivity vs Specificity ---
plt.figure(figsize=(8, 5))
sns.scatterplot(data=results_df, x="Specificity", y="Sensitivity", hue="Model", s=100)
plt.title("Sensitivity vs Specificity across Models")
plt.xlim(0, 1); plt.ylim(0, 1)
plt.grid(True, linestyle="--", alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
import os
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(y_test, (p_test >= 0.5).astype(int))
plt.title(f"{model_name} Confusion Matrix")
os.makedirs("plots", exist_ok=True)  # savefig fails if the output directory is missing
plt.savefig(f"plots/{model_name}_confusion.png")
