# 02 — Strong events exploration (EDA)

**Goal**
- Understand what "strong signal" events look like in the interaction log.
- Validate that the chosen `STRONG_SIGNAL_EVENTS` make sense (coverage + behavior).
- Produce quick EDA stats (gaps, coverage, per-event distribution) to guide feature engineering.

**Inputs**
- `data/processed/interactions.parquet` (created by pipeline: `python -m src.pipelines.01_make_processed`)

**Notes**
- This notebook is *EDA only*. Anything needed for production should live in `src/`.


In [None]:
from pathlib import Path
import sys
import numpy as np
import pandas as pd

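# Project root = repo folder. This assumes the notebook is started from a
# direct subfolder of the repo, so the parent of the working directory is the root.
PROJECT_ROOT = Path.cwd().parent
if str(PROJECT_ROOT) not in sys.path:  # avoid duplicate entries when the cell is re-run
    sys.path.append(str(PROJECT_ROOT))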

DATA_PATH = PROJECT_ROOT / "data" / "processed" / "interactions.parquet"
df = pd.read_parquet(DATA_PATH).copy()

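# Basic cleanup used across the notebook.
# errors="coerce" turns unparseable timestamps into NaT; those rows are dropped below.
n_raw = len(df)
df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce")
df = df.dropna(subset=["external_customerkey", "event_time", "interaction_type"]).copy()
print("Rows dropped (missing key/time/type):", n_raw - len(df))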

print("Loaded:", DATA_PATH)
print("Shape :", df.shape)
print("Min/Max event_time:", df["event_time"].min(), "→", df["event_time"].max())


## 1) Strong-signal event list

This list is imported from `src/features/strong_events.py`, the same definition the feature pipeline uses.
To change it, edit that module (or the code that generates it) rather than overriding it in this notebook.
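
Before trusting the list, it can help to confirm that every name in it actually occurs in the log and that it has no duplicates. The sketch below uses hypothetical event names (`wishlist_add` and `page_view` are made up for illustration); in this notebook the inputs would be `STRONG_SIGNAL_EVENTS` and `df["interaction_type"].unique()`.

```python
# Hypothetical event names, for illustration only.
strong_events = ["order", "cart_update", "wishlist_add"]
observed_types = {"order", "cart_update", "page_view"}

# Names in the list that never occur in the log (typos, renamed events).
missing_from_log = sorted(set(strong_events) - observed_types)

# Duplicates inflate len() and usually point to a list-maintenance mistake.
duplicates = sorted({e for e in strong_events if strong_events.count(e) > 1})

print("Missing from log:", missing_from_log)
print("Duplicates:", duplicates)
```

A non-empty `missing_from_log` means the corresponding filter in the next section silently matches nothing for those names.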


In [None]:
from src.features.strong_events import STRONG_SIGNAL_EVENTS

print("n STRONG_SIGNAL_EVENTS:", len(STRONG_SIGNAL_EVENTS))
STRONG_SIGNAL_EVENTS


## 2) Time gaps between strong events (per customer)

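We compute the gap, in whole days, between consecutive strong events of the same customer. Events less than 24 hours apart count as 0, and each customer's first strong event has no gap (NaN, dropped before the stats).
The distribution helps choose recency windows and shows how frequent strong engagement is.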
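
As a small illustration on made-up data, the sketch below compares the whole-day gap from `.dt.days` (used in the cell that follows) with a fractional gap from `.dt.total_seconds()` (used later in the per-event stats). The toy rows and the `gap_days_floor` / `gap_days_frac` column names are assumptions for this example only.

```python
import pandas as pd

# Toy log: two customers, timestamps chosen to show the truncation effect.
toy = pd.DataFrame({
    "external_customerkey": ["a", "a", "a", "b", "b"],
    "event_time": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 20:00", "2024-01-11 20:00",
        "2024-01-05 00:00", "2024-01-06 12:00",
    ]),
}).sort_values(["external_customerkey", "event_time"])

# Time since the previous event of the same customer (NaT for the first event).
delta = toy.groupby("external_customerkey")["event_time"].diff()

toy["gap_days_floor"] = delta.dt.days                     # whole days only
toy["gap_days_frac"] = delta.dt.total_seconds() / 86_400  # fractional days

print(toy)
```

Here the 12-hour gap shows up as 0 whole days but 0.5 fractional days, and the 36-hour gap as 1 versus 1.5. Which one is appropriate depends on whether same-day repeats should count as distinct engagement.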


In [None]:
df_strong = df[df["interaction_type"].isin(STRONG_SIGNAL_EVENTS)].copy()
df_strong = df_strong.sort_values(["external_customerkey", "event_time"])

df_strong["gap_days"] = (
    df_strong.groupby("external_customerkey")["event_time"].diff().dt.days
)

gaps = df_strong["gap_days"].dropna()

print("Strong events rows:", len(df_strong))
print("Avg gap (days):", float(gaps.mean()))
print("Median gap (days):", float(gaps.median()))
gaps.quantile([0.5, 0.75, 0.9, 0.95, 0.99])


## 3) Per-customer average gap (optional)

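This averages the gaps per customer, so each customer counts once. The raw distribution above is dominated by highly active customers, who contribute many short gaps. Customers with a single strong event have no gap and are excluded.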
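
A toy example, with made-up numbers, of how far the two views can diverge: one frequent customer with four 1-day gaps and one infrequent customer with a single 30-day gap.

```python
import pandas as pd

# Made-up gaps: customer "a" is very active, customer "b" is not.
gaps = pd.DataFrame({
    "external_customerkey": ["a"] * 4 + ["b"],
    "gap_days": [1, 1, 1, 1, 30],
})

# Raw view: every gap counts once, so "a" dominates.
raw_mean = gaps["gap_days"].mean()

# Per-customer view: every customer counts once.
per_customer = gaps.groupby("external_customerkey")["gap_days"].mean()
mean_of_means = per_customer.mean()

print("raw mean gap:", raw_mean)
print("mean of per-customer means:", mean_of_means)
```

The raw mean is 6.8 days while the per-customer mean is 15.5 days, so a recency window tuned on the raw distribution would be too short for the typical customer in this toy case.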


In [None]:
customer_avg_gap = (
    df_strong.groupby("external_customerkey")["gap_days"]
    .mean()
    .dropna()
)

customer_avg_gap.describe(percentiles=[0.5, 0.75, 0.9, 0.95, 0.99])


## 4) Coverage: customers with/without strong events (ever)

This checks how many customers ever exhibit any strong-signal behavior.
If too many customers have *no* strong events, you may need fallback features (or a different event list).
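
An alternative sketch of the same coverage check that stays in pandas instead of building Python sets: flag each row as strong or not, then ask per customer whether any row is flagged. The toy data and event names are hypothetical.

```python
import pandas as pd

# Toy log with hypothetical event names.
toy = pd.DataFrame({
    "external_customerkey": ["a", "a", "b", "c", "c", "d"],
    "interaction_type": [
        "page_view", "order", "page_view", "cart_update", "page_view", "page_view",
    ],
})
strong_events = ["order", "cart_update"]

# One boolean per customer: did they ever trigger a strong event?
has_strong = (
    toy["interaction_type"].isin(strong_events)
    .groupby(toy["external_customerkey"])
    .any()
)

# The mean of a boolean Series is the share of True values.
pct_with_strong = has_strong.mean() * 100
print(f"Customers WITH >=1 strong event: {pct_with_strong:.2f}%")
```

Keeping `has_strong` as a per-customer Series also makes it easy to join the flag back onto other customer-level tables.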


In [None]:
all_customers = set(df["external_customerkey"].unique())
customers_with_strong = set(df_strong["external_customerkey"].unique())
customers_without_strong = all_customers - customers_with_strong

pct_without_strong = len(customers_without_strong) / len(all_customers) * 100
pct_with_strong = 100 - pct_without_strong

print(f"Total customers: {len(all_customers):,}")
print(f"Customers WITH ≥1 strong event: {pct_with_strong:.2f}%")
print(f"Customers with NO strong events: {pct_without_strong:.2f}%")


## 5) Per-event EDA (counts, unique customers, median gaps, diversity)

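These are quick stats per event type, computed over all event types (not only the strong ones):
- total events
- unique customers
- average events per customer (among customers with at least one such event)
- median gap in days between consecutive events of the same type for the same customer
- average event diversity (number of distinct event types) of the customers doing the event
- average number of distinct calendar months in which a customer does the event
- share of events contributed by the top 10% most-active customers (concentration)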
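
To make the concentration metric concrete, here is a worked toy case using the same steps as `top_pct_share` in the cell below. The customer keys are made up: one heavy user with 11 events and nine customers with 1 event each.

```python
import numpy as np
import pandas as pd

# One row per event; the value is the customer who produced it.
keys = pd.Series(["heavy"] * 11 + [f"c{i}" for i in range(9)])

# Events per customer, sorted from most to least active.
counts = keys.value_counts()

# Size of the top 10% group, at least one customer.
k = max(int(np.ceil(len(counts) * 0.10)), 1)

# Share of all events produced by that group.
share = counts.head(k).sum() / counts.sum()
print(f"top {k} of {len(counts)} customers -> {share:.0%} of events")
```

With 10 customers the top 10% is a single customer, who accounts for 11 of 20 events (55%). Values near the group's population share (10%) indicate evenly spread behavior; values near 1 indicate that a few customers drive the event.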


In [None]:
IT = "interaction_type"
CK = "external_customerkey"
TS = "event_time"

df_sorted = df.sort_values([CK, IT, TS]).copy()

total_events = df_sorted[IT].value_counts().rename("total_events")

unique_customers = (
    df_sorted.groupby(IT)[CK]
    .nunique()
    .rename("unique_customers")
)

cust_event_counts = (
    df_sorted.groupby([IT, CK]).size().rename("cnt").reset_index()
)

avg_events_per_customer = (
    cust_event_counts.groupby(IT)["cnt"]
    .mean()
    .rename("avg_events_per_customer")
)

tmp = df_sorted[[IT, CK, TS]].copy()
tmp["gap_days"] = (
    tmp.groupby([IT, CK])[TS]
       .diff()
       .dt.total_seconds() / (60 * 60 * 24)
)

median_gap_days = (
    tmp.groupby(IT)["gap_days"]
    .median()
    .rename("median_gap_days")
)

event_diversity = (
    df_sorted.groupby(CK)[IT]
      .nunique()
      .rename("event_diversity")
)

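# One row per (customer, event type), so each customer counts once per event type
# instead of once per event occurrence.
df_div = df_sorted[[CK, IT]].drop_duplicates().merge(event_diversity, on=CK, how="left")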

avg_customer_event_diversity = (
    df_div.groupby(IT)["event_diversity"]
          .mean()
          .rename("avg_customer_event_diversity")
)

df_month = df_sorted[[IT, CK, TS]].copy()
df_month["event_month"] = df_month[TS].dt.to_period("M")

active_months_per_customer = (
    df_month.groupby([IT, CK])["event_month"]
            .nunique()
            .rename("n_months")
            .reset_index()
)

avg_active_months = (
    active_months_per_customer.groupby(IT)["n_months"]
                              .mean()
                              .rename("avg_active_months")
)

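def top_pct_share(keys: pd.Series, top_pct: float = 0.10) -> float:
    """Share of rows contributed by the top `top_pct` most frequent keys."""
    counts = keys.value_counts()  # sorted from most to least frequent
    if len(counts) == 0:
        return np.nan
    k = max(int(np.ceil(len(counts) * top_pct)), 1)  # at least one key
    return counts.head(k).sum() / counts.sum()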

top_10pct_customer_event_share = (
    df_sorted.groupby(IT)[CK]
      .apply(lambda s: top_pct_share(s, top_pct=0.10))
      .rename("top_10pct_customer_event_share")
)

summary = pd.concat(
    [
        total_events,
        unique_customers,
        avg_events_per_customer,
        median_gap_days,
        avg_customer_event_diversity,
        avg_active_months,
        top_10pct_customer_event_share,
    ],
    axis=1
).sort_values("total_events", ascending=False)

summary.head(30)


## 6) Optional: "events within 7 days of an order"

If orders exist, this estimates, per event type, the share of events that are followed by an order from the same customer within 7 days.
Useful for feature ideas (e.g., cart_update within N days).
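
An alternative sketch, not the approach used in the cell below: `pd.merge_asof` with `direction="forward"` attaches to each event only the next order of the same customer within a tolerance, which avoids a many-to-many join and counts each event at most once. The toy data is made up, and `order_time` is a helper column introduced for this example.

```python
import pandas as pd

CK, IT, TS = "external_customerkey", "interaction_type", "event_time"

# Toy log: customer "a" has two orders on consecutive days, "b" has none.
toy = pd.DataFrame({
    CK: ["a", "a", "a", "a", "b"],
    IT: ["cart_update", "cart_update", "order", "order", "cart_update"],
    TS: pd.to_datetime(
        ["2024-01-01", "2024-01-08", "2024-01-09", "2024-01-10", "2024-01-01"]
    ),
})

# merge_asof requires both frames to be sorted by the time key.
events = toy[toy[IT] != "order"].sort_values(TS)
orders = toy.loc[toy[IT] == "order", [CK, TS]].sort_values(TS)
orders["order_time"] = orders[TS]  # keep the order timestamp after the merge

# For each event: the next order of the same customer, at most 7 days later.
matched = pd.merge_asof(
    events, orders, on=TS, by=CK,
    direction="forward", tolerance=pd.Timedelta(days=7),
)

# Share of events, per type, that have such an order.
share = matched["order_time"].notna().groupby(matched[IT]).mean()
print(share)
```

In this toy case only the `cart_update` on 2024-01-08 is followed by an order within 7 days, so the share is 1/3. A plain join on customer would pair that event with both orders and report 2/3 unless the pairs are deduplicated.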


In [None]:
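has_orders = (df_sorted[IT] == "order").any()

if has_orders:
    orders = df_sorted.loc[df_sorted[IT] == "order", [CK, TS]].rename(columns={TS: "order_time"})

    # Tag each event with an id so it is counted at most once after the join.
    events = df_sorted[[CK, IT, TS]].copy()
    events["event_id"] = np.arange(len(events))

    # Many-to-many join: one row per (event, order) pair of the same customer.
    # This can get large for customers with many events and many orders.
    merged = events.merge(orders, on=CK, how="inner")
    merged["days_to_order"] = (merged["order_time"] - merged[TS]).dt.total_seconds() / (60 * 60 * 24)

    # Keep events with an order 0 to 7 days later, then deduplicate events
    # that precede several orders.
    within_7d = merged[(merged["days_to_order"] >= 0) & (merged["days_to_order"] <= 7)]
    within_7d = within_7d.drop_duplicates("event_id")

    # Fraction between 0 and 1 despite the "pct" name; types with no match get 0.
    pct_events_within_7d_of_order = (
        (within_7d.groupby(IT).size() / events.groupby(IT).size())
        .fillna(0)
        .drop("order", errors="ignore")  # every order trivially matches itself
        .rename("pct_events_within_7d_of_order")
    )

    # Inside an if-block the last expression is not auto-displayed, so print it.
    print(pct_events_within_7d_of_order.sort_values(ascending=False).head(30))
else:
    print("No orders found in data (interaction_type == 'order').")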
