# 02 — Strong events exploration (EDA)

**Goal**
- Understand what "strong signal" events look like in the interaction log.
- Validate that the chosen `STRONG_SIGNAL_EVENTS` make sense (coverage + behavior).
- Produce quick EDA stats (gaps, coverage, per-event distribution) to guide feature engineering.

**Inputs**
- `data/processed/interactions.parquet` (created by pipeline: `python -m src.pipelines.01_make_processed`)

**Notes**
- This notebook is *EDA only*. Anything needed for production should live in `src/`.


In [1]:
from pathlib import Path
import sys
import numpy as np
import pandas as pd

# Project root = parent of the current working directory
# (assumes the notebook is run from a direct subfolder of the repo)
PROJECT_ROOT = Path.cwd().parent
sys.path.append(str(PROJECT_ROOT))

DATA_PATH = PROJECT_ROOT / "data" / "processed" / "interactions.parquet"
df = pd.read_parquet(DATA_PATH).copy()

# Basic cleanup used across the notebook
df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce")
df = df.dropna(subset=["external_customerkey", "event_time", "interaction_type"]).copy()

print("Loaded:", DATA_PATH)
print("Shape :", df.shape)
print("Min/Max event_time:", df["event_time"].min(), "→", df["event_time"].max())


Loaded: /Users/ahmedelgantiry/Documents/churn 1/data/processed/interactions.parquet
Shape : (4765402, 7)
Min/Max event_time: 2025-01-18 00:01:46 → 2026-01-18 03:01:46


## 1) Strong-signal event list

This is the list used by the feature pipeline, imported from `src/features/strong_events.py`.
To change it, edit that module (or whatever generates it), not this notebook.


In [2]:
from src.features.strong_events import STRONG_SIGNAL_EVENTS

print("n STRONG_SIGNAL_EVENTS:", len(STRONG_SIGNAL_EVENTS))
STRONG_SIGNAL_EVENTS


n STRONG_SIGNAL_EVENTS: 8


['emarsys_sessions_content_category',
 'emarsys_sessions_content_url',
 'emarsys_sessions_content_tag',
 'emarsys_sessions_cart_update',
 'emarsys_open',
 'emarsys_sessions_purchase',
 'emarsys_webchannel_click',
 'emarsys_sessions_view']

## 2) Time gaps between strong events (per customer)

We compute the gap between consecutive strong events per customer, truncated to whole days (`.dt.days`), so two events less than 24 hours apart have a gap of 0.
This helps you decide recency windows and how "frequent" strong engagement is.


In [3]:
df_strong = df[df["interaction_type"].isin(STRONG_SIGNAL_EVENTS)].copy()
df_strong = df_strong.sort_values(["external_customerkey", "event_time"])

df_strong["gap_days"] = (
    df_strong.groupby("external_customerkey")["event_time"].diff().dt.days
)

gaps = df_strong["gap_days"].dropna()

print("Strong events rows:", len(df_strong))
print("Avg gap (days):", float(gaps.mean()))
print("Median gap (days):", float(gaps.median()))
gaps.quantile([0.5, 0.75, 0.9, 0.95, 0.99])


Strong events rows: 4422751
Avg gap (days): 2.149894570719097
Median gap (days): 0.0


0.50     0.0
0.75     1.0
0.90     3.0
0.95     7.0
0.99    36.0
Name: gap_days, dtype: float64
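
The median of 0.0 above partly reflects how the gap is measured: `.dt.days` drops the fractional part, while section 5 uses fractional days. A minimal sketch on toy data (the timestamps are made up for illustration) shows the difference between the two conventions:

```python
import pandas as pd

# Toy data (hypothetical): one customer, three events
toy = pd.DataFrame(
    {
        "external_customerkey": ["c1", "c1", "c1"],
        "event_time": pd.to_datetime(
            ["2025-01-01 08:00", "2025-01-01 20:00", "2025-01-03 14:00"]
        ),
    }
)

delta = toy.groupby("external_customerkey")["event_time"].diff()

# Whole days: the 12-hour gap becomes 0, the 42-hour gap becomes 1
toy["gap_whole_days"] = delta.dt.days

# Fractional days: the same gaps become 0.5 and 1.75
toy["gap_fractional_days"] = delta.dt.total_seconds() / 86400

print(toy)
```

Either convention can be reasonable; what matters is using the same one when comparing numbers across sections.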

## 3) Per-customer average gap (optional)

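This averages the whole-day gaps per customer, which can be more interpretable than the raw distribution because heavy users no longer dominate. Customers with a single strong event have no gap and are dropped.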


In [4]:
customer_avg_gap = (
    df_strong.groupby("external_customerkey")["gap_days"]
    .mean()
    .dropna()
)

customer_avg_gap.describe(percentiles=[0.5, 0.75, 0.9, 0.95, 0.99])


count    59310.000000
mean        14.187245
std         33.370785
min          0.000000
50%          1.933333
75%         10.866667
90%         39.833333
95%         73.000000
99%        173.000000
max        363.000000
Name: gap_days, dtype: float64
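
One way to turn this distribution into a recency-window decision is to check what share of customers has an average gap inside each candidate window. The sketch below uses made-up gap values and assumed windows of 7, 30 and 90 days; in the notebook, `customer_avg_gap` from the cell above would take the place of the toy series.

```python
import pandas as pd

# Hypothetical per-customer average gaps in days (illustration only)
toy_avg_gap = pd.Series([0.5, 2.0, 9.0, 40.0, 120.0], name="gap_days")

# Candidate recency windows in days (assumed, not taken from the pipeline)
windows = [7, 30, 90]

# Share of customers whose average gap fits inside each window
share_within_window = pd.Series(
    {w: (toy_avg_gap <= w).mean() for w in windows},
    name="share_of_customers",
)

print(share_within_window)
```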

## 4) Coverage: customers with/without strong events (ever)

This checks how many customers ever exhibit any strong-signal behavior.
If too many customers have *no* strong events, you may need fallback features (or a different event list).


In [5]:
all_customers = set(df["external_customerkey"].unique())
customers_with_strong = set(df_strong["external_customerkey"].unique())
customers_without_strong = all_customers - customers_with_strong

pct_without_strong = len(customers_without_strong) / len(all_customers) * 100
pct_with_strong = 100 - pct_without_strong

print(f"Total customers: {len(all_customers):,}")
print(f"Customers WITH ≥1 strong event: {pct_with_strong:.2f}%")
print(f"Customers with NO strong events: {pct_without_strong:.2f}%")


Total customers: 74,067
Customers WITH ≥1 strong event: 88.84%
Customers with NO strong events: 11.16%
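
If fallback features turn out to be needed, a per-customer flag plus an all-events count is one simple starting point. This is a rough sketch on toy data; the event list, the column names `n_events`, `n_strong_events` and `has_strong_event`, and the log itself are hypothetical.

```python
import pandas as pd

# Hypothetical strong-event list and toy log, for illustration only
strong_events = ["emarsys_open", "emarsys_sessions_purchase"]
log = pd.DataFrame(
    {
        "external_customerkey": ["a", "a", "b", "c", "c"],
        "interaction_type": [
            "emarsys_open",
            "emarsys_cancel",
            "emarsys_cancel",
            "emarsys_sessions_purchase",
            "emarsys_open",
        ],
    }
)

log["is_strong"] = log["interaction_type"].isin(strong_events)

# One row per customer: total activity (fallback) and strong activity
per_customer = log.groupby("external_customerkey").agg(
    n_events=("interaction_type", "size"),
    n_strong_events=("is_strong", "sum"),
)
per_customer["has_strong_event"] = per_customer["n_strong_events"] > 0

# Coverage = share of customers with at least one strong event
coverage = per_customer["has_strong_event"].mean()

print(per_customer)
print(f"Coverage: {coverage:.2%}")
```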


## 5) Per-event EDA (counts, unique customers, median gaps, diversity)

These are quick stats per event type:
- total events
- unique customers
- average events per customer
- median gap in days between consecutive events of the same type for the same customer
- average event diversity (number of distinct event types) of the customers doing the event, weighted by event count
- average number of calendar months in which a customer has the event
- share of events contributed by the top 10% most-active customers (concentration)
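
The concentration metric in the last bullet is simple arithmetic: count events per customer, take the top 10% of customers by count (at least one), and divide their events by the total. A small worked sketch with invented customer keys:

```python
import numpy as np
import pandas as pd

# Hypothetical keys, one entry per event: c0 has 19 events, c1..c9 have 1 each
keys = pd.Series(["c0"] * 19 + [f"c{i}" for i in range(1, 10)])

counts = keys.value_counts()  # events per customer, sorted descending

# Number of customers in the top 10%, never fewer than one
k = max(int(np.ceil(len(counts) * 0.10)), 1)

# Share of all events that those k customers account for
share = counts.head(k).sum() / counts.sum()

print(k, round(share, 4))
```

Here one customer out of ten holds 19 of the 28 events, so the share is 19/28. Values near 0.10 would indicate evenly spread activity; values near 1 indicate that a few customers drive the event type.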


In [6]:
IT = "interaction_type"
CK = "external_customerkey"
TS = "event_time"

df_sorted = df.sort_values([CK, IT, TS]).copy()

total_events = df_sorted[IT].value_counts().rename("total_events")

unique_customers = (
    df_sorted.groupby(IT)[CK]
    .nunique()
    .rename("unique_customers")
)

cust_event_counts = (
    df_sorted.groupby([IT, CK]).size().rename("cnt").reset_index()
)

avg_events_per_customer = (
    cust_event_counts.groupby(IT)["cnt"]
    .mean()
    .rename("avg_events_per_customer")
)

tmp = df_sorted[[IT, CK, TS]].copy()
tmp["gap_days"] = (
    tmp.groupby([IT, CK])[TS]
       .diff()
       .dt.total_seconds() / (60 * 60 * 24)
)

median_gap_days = (
    tmp.groupby(IT)["gap_days"]
    .median()
    .rename("median_gap_days")
)

event_diversity = (
    df_sorted.groupby(CK)[IT]
      .nunique()
      .rename("event_diversity")
)

df_div = df_sorted[[CK, IT]].merge(event_diversity, on=CK, how="left")

avg_customer_event_diversity = (
    df_div.groupby(IT)["event_diversity"]
          .mean()
          .rename("avg_customer_event_diversity")
)

df_month = df_sorted[[IT, CK, TS]].copy()
df_month["event_month"] = df_month[TS].dt.to_period("M")

active_months_per_customer = (
    df_month.groupby([IT, CK])["event_month"]
            .nunique()
            .rename("n_months")
            .reset_index()
)

avg_active_months = (
    active_months_per_customer.groupby(IT)["n_months"]
                              .mean()
                              .rename("avg_active_months")
)

def top_pct_share(keys: pd.Series, top_pct: float = 0.10) -> float:
    counts = keys.value_counts()
    if len(counts) == 0:
        return np.nan
    k = max(int(np.ceil(len(counts) * top_pct)), 1)
    return counts.head(k).sum() / counts.sum()

top_10pct_customer_event_share = (
    df_sorted.groupby(IT)[CK]
      .apply(lambda s: top_pct_share(s, top_pct=0.10))
      .rename("top_10pct_customer_event_share")
)

summary = pd.concat(
    [
        total_events,
        unique_customers,
        avg_events_per_customer,
        median_gap_days,
        avg_customer_event_diversity,
        avg_active_months,
        top_10pct_customer_event_share,
    ],
    axis=1
).sort_values("total_events", ascending=False)

summary.head(30)


| interaction_type | total_events | unique_customers | avg_events_per_customer | median_gap_days | avg_customer_event_diversity | avg_active_months | top_10pct_customer_event_share |
|---|---|---|---|---|---|---|---|
| emarsys_open | 2669813 | 53793 | 49.631235 | 1.35581 | 4.044652 | 5.744186 | 0.505466 |
| emarsys_sessions_content_url | 733879 | 33770 | 21.731685 | 0.000521 | 8.12864 | 1.763784 | 0.47065 |
| emarsys_sessions_content_category | 732270 | 33648 | 21.76266 | 0.000521 | 8.137008 | 1.766227 | 0.470811 |
| emarsys_cancel | 199787 | 26129 | 7.646179 | 2.001296 | 3.798455 | 2.287152 | 0.789831 |
| emarsys_sessions_content_tag | 146504 | 25254 | 5.80122 | 0.007546 | 8.28903 | 1.65942 | 0.481175 |
| emarsys_click | 77053 | 22257 | 3.461967 | 3.135463 | 7.842148 | 1.759761 | 0.429146 |
| emarsys_sessions_view | 65785 | 19437 | 3.384524 | 0.180532 | 8.189222 | 1.439677 | 0.443779 |
| emarsys_sessions_purchase | 58115 | 20615 | 2.819064 | 0.0 | 7.918231 | 1.232307 | 0.394889 |
| order | 33052 | 23314 | 1.417689 | 19.034208 | 7.26298 | 1.241743 | 0.273539 |
| emarsys_sessions_cart_update | 15802 | 4363 | 3.62182 | 0.0 | 8.794583 | 1.241348 | 0.476459 |
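
Because `df_div` keeps one row per event, `avg_customer_event_diversity` weights each customer by how many events of that type they have. The sketch below, on toy data with invented event names, contrasts that event-weighted mean with a customer-weighted one that counts each (customer, event type) pair once. It is an alternative to consider, not what the cell above computes.

```python
import pandas as pd

# Toy log (hypothetical): customer "a" is a heavy user, "b" has one event
toy = pd.DataFrame(
    {
        "external_customerkey": ["a", "a", "a", "a", "b"],
        "interaction_type": ["open", "open", "open", "click", "open"],
    }
)

# Distinct event types per customer: a -> 2, b -> 1
diversity = (
    toy.groupby("external_customerkey")["interaction_type"]
    .nunique()
    .rename("event_diversity")
    .reset_index()
)

# Event-weighted: every event row carries its customer's diversity
event_rows = toy.merge(diversity, on="external_customerkey", how="left")
event_weighted = event_rows.groupby("interaction_type")["event_diversity"].mean()

# Customer-weighted: keep one row per (customer, event type) before averaging
pairs = toy.drop_duplicates(["external_customerkey", "interaction_type"])
pair_rows = pairs.merge(diversity, on="external_customerkey", how="left")
customer_weighted = pair_rows.groupby("interaction_type")["event_diversity"].mean()

print(pd.concat(
    [event_weighted.rename("event_weighted"),
     customer_weighted.rename("customer_weighted")],
    axis=1,
))
```

For `open`, the event-weighted mean is (2 + 2 + 2 + 1) / 4 = 1.75, while the customer-weighted mean is (2 + 1) / 2 = 1.5. The gap between the two grows with how concentrated an event type is among heavy users.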


## 6) Events followed by an order within 7 days

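If the log contains `order` rows, this computes, for each event type, the share of events that are followed by an order from the same customer within 7 days. Values are fractions between 0 and 1, despite the `pct_` prefix in the column name.
Useful for feature ideas (e.g., cart_update within N days).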


In [10]:
import pandas as pd
from IPython.display import display

# Base tables
events = df_sorted[[CK, IT, TS]].copy()
orders = df_sorted.loc[df_sorted[IT] == "order", [CK, TS]].rename(columns={TS: "order_time"}).copy()

# Clean nulls (merge_asof rejects null keys)
events = events.dropna(subset=[CK, TS])
orders = orders.dropna(subset=[CK, "order_time"])

# (Optional) exclude orders from the "event types" you're evaluating
events = events[events[IT] != "order"]

if orders.empty:
    print("No orders found in data (interaction_type == 'order').")
else:
    # IMPORTANT: sort by time first for merge_asof
    events = events.sort_values([TS, CK]).reset_index(drop=True)
    orders = orders.sort_values(["order_time", CK]).reset_index(drop=True)

    merged = pd.merge_asof(
        events,
        orders,
        by=CK,
        left_on=TS,
        right_on="order_time",
        direction="forward",
        allow_exact_matches=True,
    )

    merged["days_to_next_order"] = (merged["order_time"] - merged[TS]).dt.total_seconds() / 86400.0

    within_7d = merged[(merged["days_to_next_order"] >= 0) & (merged["days_to_next_order"] <= 7)]

    denom = events.groupby(IT).size()
    numer = within_7d.groupby(IT).size()

    out = (numer / denom).rename("pct_events_followed_by_order_within_7d") \
                         .sort_values(ascending=False) \
                         .head(30) \
                         .to_frame()

    display(out)




| interaction_type | pct_events_followed_by_order_within_7d |
|---|---|
| emarsys_sessions_purchase | 0.888359 |
| emarsys_sessions_content_category | 0.515848 |
| emarsys_sessions_content_url | 0.51472 |
| emarsys_sessions_view | 0.465516 |
| emarsys_sessions_content_tag | 0.402911 |
| emarsys_sessions_category_view | 0.343953 |
| emarsys_sessions_cart_update | 0.341223 |
| emarsys_webchannel_show | 0.272817 |
| emarsys_webchannel_click | 0.154374 |
| emarsys_click | 0.147691 |
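
To see how the forward match behaves, here is a minimal sketch on toy data. It uses the `tolerance` argument of `pd.merge_asof` to apply the 7-day limit inside the merge, as an alternative to filtering on `days_to_next_order` afterwards. The event names, dates and the `order_within_7d` column are invented for illustration.

```python
import pandas as pd

# Toy events (hypothetical), sorted by time as merge_asof requires
events = pd.DataFrame(
    {
        "external_customerkey": ["a", "a", "b"],
        "interaction_type": ["cart_update", "view", "view"],
        "event_time": pd.to_datetime(["2025-03-01", "2025-03-10", "2025-03-02"]),
    }
).sort_values("event_time")

# Toy orders (hypothetical), also sorted by time
orders = pd.DataFrame(
    {
        "external_customerkey": ["a", "b"],
        "order_time": pd.to_datetime(["2025-03-04", "2025-03-20"]),
    }
).sort_values("order_time")

# For each event, find the same customer's next order, but only within 7 days
matched = pd.merge_asof(
    events,
    orders,
    by="external_customerkey",
    left_on="event_time",
    right_on="order_time",
    direction="forward",
    tolerance=pd.Timedelta(days=7),
)

# Unmatched events get NaT in order_time
matched["order_within_7d"] = matched["order_time"].notna()
share = matched.groupby("interaction_type")["order_within_7d"].mean()

print(matched)
print(share)
```

Customer `a`'s `cart_update` is matched to the order three days later. Their later `view` has no subsequent order, and customer `b`'s `view` is 18 days before the order, so both stay unmatched.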
