# 04 — Build Labels (Churn = inactive in next 30 days)

This notebook builds the binary **training label** for churn modeling, joins it to the snapshot features, and writes the training dataset.

## Definition (30-day window)
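- Choose a snapshot time: `snapshot_time = max(event_time) - 30 days`
- Look at the *future window*: `(snapshot_time, snapshot_time + 30 days]`, which is open at the start, closed at the end, and therefore ends at `max(event_time)`
- Only customers with at least one event at or before `snapshot_time` receive a label
- A customer has `active_next_30d = 1` if they have **at least one strong-signal event** in the future window, otherwise `0`
- The churn label is the complement:
  - `churn_label = 1 - active_next_30d`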

## No leakage
- Features must be computed using only **events ≤ snapshot_time**
- Labels use only **events > snapshot_time** (the future window), so no event contributes to both a feature and the label
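
To make the window boundaries concrete, here is a small self-contained sketch on made-up events. It assumes every toy event counts as strong-signal (the real rule lives in `filter_strong_events` and is not reproduced here). The customer keys and dates are illustrative only.

```python
import pandas as pd

# Toy event log: column names mirror the notebook, values are made up.
# Assumption: every event here counts as a strong-signal event.
events = pd.DataFrame(
    {
        "external_customerkey": ["a", "a", "b", "c", "c", "d"],
        "event_time": pd.to_datetime(
            [
                "2024-01-05",  # a: history
                "2024-02-10",  # a: inside the future window
                "2024-01-20",  # b: history only
                "2024-01-31",  # c: exactly at snapshot_time, counts as history
                "2024-03-01",  # c: exactly at future_end, counts as future
                "2024-02-15",  # d: first seen after the snapshot
            ]
        ),
    }
)

window = pd.Timedelta(days=30)
snapshot_time = events["event_time"].max() - window
future_end = snapshot_time + window

# History is closed at the snapshot; the future window is (snapshot, end].
hist = events[events["event_time"] <= snapshot_time]
future = events[
    (events["event_time"] > snapshot_time) & (events["event_time"] <= future_end)
]

# Only customers with history get a label, so "d" is excluded.
labels = pd.DataFrame(
    {"external_customerkey": sorted(hist["external_customerkey"].unique())}
)
labels["active_next_30d"] = (
    labels["external_customerkey"].isin(future["external_customerkey"]).astype(int)
)
labels["churn_label"] = 1 - labels["active_next_30d"]
print(labels)
```

Customer `b` is the only one labeled as churned. Customer `d` has no history at the snapshot and so never enters the labeled population.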


In [None]:
from pathlib import Path
import sys
import pandas as pd
import numpy as np

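# Assumes the notebook runs from a direct subdirectory of the project root.
PROJECT_ROOT = Path.cwd().parent
# Make the project's `src` package importable.
sys.path.append(str(PROJECT_ROOT))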

from src.features.build_features import filter_strong_events

DATA_PATH = PROJECT_ROOT / "data" / "processed" / "interactions.parquet"
FEATURES_PATH = PROJECT_ROOT / "data" / "features" / "customer_features_snapshot_30d.parquet"
OUT_PATH = PROJECT_ROOT / "data" / "processed" / "model_dataset_label_30d.parquet"

print("DATA_PATH    :", DATA_PATH)
print("FEATURES_PATH:", FEATURES_PATH)
print("OUT_PATH     :", OUT_PATH)


## Load interactions (processed)

In [None]:
df = pd.read_parquet(DATA_PATH)
# Unparseable timestamps become NaT and are dropped with other incomplete rows.
df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce")
df = df.dropna(subset=["external_customerkey", "event_time", "interaction_type"]).copy()

df.shape


## Choose snapshot time and label window

In [None]:
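CHURN_WINDOW_DAYS = 30
max_time = df["event_time"].max()
# Step back one full window so the future window is fully observed.
snapshot_time = max_time - pd.Timedelta(days=CHURN_WINDOW_DAYS)
# By construction, future_end equals max_time.
future_end = snapshot_time + pd.Timedelta(days=CHURN_WINDOW_DAYS)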

max_time, snapshot_time, future_end
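
If the event log spans less than one window, the snapshot falls before the first event and no customer has any history. A minimal guard for that case might look like the sketch below. `label_window` is a hypothetical helper, not part of this project.

```python
import pandas as pd


def label_window(event_times: pd.Series, window_days: int = 30):
    """Return (snapshot_time, future_end); raise if no history precedes the snapshot.

    Hypothetical helper for illustration only.
    """
    window = pd.Timedelta(days=window_days)
    min_time, max_time = event_times.min(), event_times.max()
    snapshot_time = max_time - window
    if snapshot_time < min_time:
        raise ValueError(
            f"Data spans {max_time - min_time}, shorter than the {window} label window."
        )
    return snapshot_time, snapshot_time + window


# Made-up timestamps covering about three months.
times = pd.Series(pd.to_datetime(["2024-01-01", "2024-02-15", "2024-03-31"]))
snapshot_time, future_end = label_window(times)
print(snapshot_time, future_end)
```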


## Build customer list (history up to snapshot)

In [None]:
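# Only customers seen at or before the snapshot are labeled; customers whose
# first event falls in the future window are excluded.
df_hist = df[df["event_time"] <= snapshot_time].copy()
customers = df_hist[["external_customerkey"]].drop_duplicates().reset_index(drop=True)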

customers.shape


## Compute churn label using strong-signal events in the future window

In [None]:
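# Keep only strong-signal events (rule defined in src.features.build_features).
df_strong = filter_strong_events(df)

# Customers with at least one strong event in (snapshot_time, future_end].
active_in_future = (
    df_strong[
        (df_strong["event_time"] > snapshot_time) &
        (df_strong["event_time"] <= future_end)
    ]["external_customerkey"]
    .drop_duplicates()
)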

labels = customers.copy()
labels["active_next_30d"] = labels["external_customerkey"].isin(active_in_future).astype(int)
labels["churn_label"] = (1 - labels["active_next_30d"]).astype(int)

labels["churn_label"].value_counts(), float(labels["churn_label"].mean())
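
Before joining, it can help to confirm that the label table is well formed: one row per customer, binary labels, and `churn_label` as the exact complement of `active_next_30d`. The sketch below shows one way to do this. `validate_labels` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd


def validate_labels(labels: pd.DataFrame) -> float:
    """Check basic label invariants and return the churn rate.

    Hypothetical helper for illustration only.
    """
    if not labels["external_customerkey"].is_unique:
        raise ValueError("Duplicate customers in label table.")
    if not labels["churn_label"].isin([0, 1]).all():
        raise ValueError("churn_label must be 0 or 1.")
    if not (labels["churn_label"] + labels["active_next_30d"]).eq(1).all():
        raise ValueError("churn_label must equal 1 - active_next_30d.")
    return float(labels["churn_label"].mean())


# Made-up label table with the same columns as `labels` in this notebook.
toy = pd.DataFrame(
    {
        "external_customerkey": ["a", "b", "c", "d"],
        "active_next_30d": [1, 0, 1, 0],
    }
)
toy["churn_label"] = 1 - toy["active_next_30d"]

churn_rate = validate_labels(toy)
print("churn rate:", churn_rate)
```

A churn rate of exactly 0 or 1 would mean the label has a single class, which usually points to a problem with the window or the strong-event filter rather than a real signal.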


## Join labels with features and write training dataset

In [None]:
features = pd.read_parquet(FEATURES_PATH)

# Inner join: customers missing from either table are dropped silently.
dataset = features.merge(
    labels[["external_customerkey", "churn_label"]],
    on="external_customerkey",
    how="inner",
)

dataset["churn_label"] = dataset["churn_label"].astype(int)

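# Replace infinities with NaN, then zero-fill numeric columns.
# Zero-filling treats a missing feature value as 0, which may not suit every feature.
dataset = dataset.replace([np.inf, -np.inf], np.nan)
num_cols = dataset.select_dtypes(include=[np.number]).columns
dataset[num_cols] = dataset[num_cols].fillna(0)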

OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
dataset.to_parquet(OUT_PATH, index=False)

print("Wrote:", OUT_PATH, "rows:", len(dataset), "cols:", dataset.shape[1])
print("label mean:", float(dataset["churn_label"].mean()))
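
Because the join above is an inner join, customers present in only one of the two tables disappear without warning. The sketch below shows one way to audit that coverage with the `indicator` and `validate` parameters of `DataFrame.merge`. The toy frames and the `n_events_30d` feature column are made up for illustration.

```python
import pandas as pd

# Made-up stand-ins for the notebook's `features` and `labels` frames.
# `n_events_30d` is a hypothetical feature column.
features = pd.DataFrame(
    {"external_customerkey": ["a", "b", "d"], "n_events_30d": [3, 1, 7]}
)
labels = pd.DataFrame(
    {"external_customerkey": ["a", "b", "c"], "churn_label": [0, 1, 0]}
)

# Outer join with an indicator column shows where each customer came from;
# validate="one_to_one" raises if either side has duplicate keys.
audit = features.merge(
    labels,
    on="external_customerkey",
    how="outer",
    validate="one_to_one",
    indicator=True,
)

coverage = audit["_merge"].astype(str).value_counts().to_dict()
print(coverage)
```

Here `both` counts the rows an inner join would keep, `left_only` counts customers with features but no label, and `right_only` counts labeled customers with no features.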
