# Master Donor Feature Schema for DonorsChoose

## Overview

This notebook implements a comprehensive, reusable donor feature engineering pipeline that:

1. **Re-uses existing feature engineering work** while consolidating near-duplicate concepts
2. **Separates features by time scope**:
   - `STATIC`: Does not depend on reference time T (or is treated as fixed once per donor)
   - `AS_OF_T`: Uses cumulative data up to time T
   - `WINDOWED`: Uses specific lookback windows relative to T
   - `LABEL`: Natural target variables for various use cases

3. **Pulls from 5 main data sources**:
   - Donor Project Records (gifts/donations)
   - Email Events (12-month summary)
   - Site Events (FY25-26 activity)
   - Monthly Donation Program (all-time)
   - Share Events (all-time)
   - Plus: ZIP-level ACS demographics

## Feature Categories

The schema includes 150+ features across these domains:

1. **Identity & Demographics** (ZIP-based ACS)
2. **Lifetime Giving Behavior** (tenure, amounts, patterns)
3. **Windowed Giving** (3m, 12m, 36m activity + velocity)
4. **Channel/Payment Mix** (DAF, green, gift cards, etc.)
5. **Monthly Program Dynamics** (subscription behavior)
6. **Teacher/School/Content Preferences** (loyalty vs diversification)
7. **Seasonality** (back-to-school, year-end, etc.)
8. **Email Engagement** (opens, clicks, velocity)
9. **Site Behavior** (sessions, pages, devices)
10. **Share Activity** (social sharing patterns)
11. **Project Outcomes** (fully-funded rates, matching)
12. **Labels** (optional, for various prediction tasks)

## Key Design Principles

- **No duplicate concepts**: Single distance metric, single velocity calculation pattern
- **Consistent windowing**: 3-month (short), 12-month (mid), 36-month (long)
- **Defensive coding**: Small epsilon values prevent division by zero
- **Separation of concerns**: Feature building ≠ imputation/encoding
- **Time-relative**: All features parameterized by reference time T
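
The consistent-windowing and time-relative principles can be sketched as follows. This is a minimal illustration on made-up data: `slice_window` is a hypothetical helper (it is not defined elsewhere in this notebook), and the half-open interval `[T - months, T)` is an assumption chosen so that a gift dated exactly at T never leaks into the features.

```python
import pandas as pd

def slice_window(df, T, months, date_col="payment_date"):
    """Hypothetical helper: rows with T - months <= date_col < T."""
    start = T - pd.DateOffset(months=months)
    return df[(df[date_col] >= start) & (df[date_col] < T)]

# Toy donation records (values are invented for illustration)
T = pd.Timestamp("2025-07-01")
gifts = pd.DataFrame({
    "donor_id": [1, 1, 2, 2],
    "payment_date": pd.to_datetime(
        ["2025-06-15", "2024-09-01", "2023-01-10", "2025-07-01"]
    ),
    "payment_amount": [25.0, 50.0, 100.0, 10.0],
})

# The same three lookbacks used throughout the schema: 3m, 12m, 36m
windows = {f"{m}m": slice_window(gifts, T, m) for m in (3, 12, 36)}
window_sizes = {name: len(w) for name, w in windows.items()}
```

Each longer window contains the shorter ones, and the gift dated on T itself is excluded from all three.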


## Setup and Imports

In [1]:
import numpy as np
import pandas as pd
from pyproj import Geod
from typing import Optional, Tuple
import warnings
warnings.filterwarnings('ignore')

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)

## 1. Helper Functions

These utility functions support the main feature engineering pipeline.

In [2]:
def haversine_miles(lat1, lon1, lat2, lon2):
    """
    Compute great-circle distance between two points in miles using Haversine formula.
    
    All arguments can be scalars or pandas Series (for vectorized computation).
    
    Parameters
    ----------
    lat1, lon1 : float or pd.Series
        Latitude and longitude of first point(s) in decimal degrees
    lat2, lon2 : float or pd.Series
        Latitude and longitude of second point(s) in decimal degrees
        
    Returns
    -------
    distance : float or pd.Series
        Great-circle distance in miles
    """
    # Earth's radius in miles
    R = 3958.8
    
    # Convert all coordinates from degrees to radians
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    
    # Calculate differences
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    
    # Haversine formula
    a = np.sin(dlat / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    
    return R * c

def entropy_vectorized(df, group_col, cat_col):
    """
    Calculate Shannon entropy (natural log) of cat_col within each group, fully vectorized.
        
    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing the data
    group_col : str
        Column to group by (e.g., 'donor_id')
    cat_col : str
        Categorical column to calculate entropy for (e.g., 'teacher_id')
    
    Returns
    -------
    entropy : pd.Series
        Entropy value for each group, indexed by group_col
    """
    if df.empty:
        return pd.Series(dtype=float, name='entropy')
    
    # Count occurrences of each category within each group
    counts = df.groupby([group_col, cat_col]).size()
    
    # Calculate probabilities (normalize within each group, no Python-level lambda)
    probs = counts / counts.groupby(level=0).transform('sum')
    
    # Calculate entropy: -sum(p * log(p)) for each group
    entropy = -(probs * np.log(probs + 1e-9)).groupby(level=0).sum()
    
    return entropy


def pct_amount(df, flag_col, amount_col='payment_amount'):
    """
    Calculate the fraction of total donation amount coming from flagged rows.
    
    Example: If flag_col='daf_payment', returns what % of each donor's
    total giving came through DAF donations.
    
    Parameters
    ----------
    df : pd.DataFrame
        Donation records with donor_id column
    flag_col : str
        Name of binary indicator column (1 = flagged, 0 = not flagged)
    amount_col : str, default='payment_amount'
        Name of amount column to aggregate
        
    Returns
    -------
    pct : pd.Series
        Fraction of amount from flagged rows, indexed by donor_id
    """

    if df.empty:
        return pd.Series(dtype=float)
    
    # Filter THEN group (vectorized, no lambda)
    flagged_df = df[df[flag_col] == 1]
    amt_flag = flagged_df.groupby('donor_id')[amount_col].sum()
    amt_total = df.groupby('donor_id')[amount_col].sum()
    amt_flag = amt_flag.reindex(amt_total.index, fill_value=0)
    
    # Return fraction (with small epsilon to avoid division by zero)
    return amt_flag / (amt_total + 1e-6)


def pct_count(df, flag_col):
    """
    Calculate the fraction of total gift count coming from flagged rows.
    
    Example: If flag_col='gift_card_purchase', returns what % of each donor's
    gifts were gift card purchases.
    
    Parameters
    ----------
    df : pd.DataFrame
        Donation records with donor_id column
    flag_col : str
        Name of binary indicator column
        
    Returns
    -------
    pct : pd.Series
        Fraction of gifts that are flagged, indexed by donor_id
    """
    if df.empty:
        return pd.Series(dtype=float)
        
    # Filter THEN group (vectorized)
    flagged_df = df[df[flag_col] == 1]
    cnt_flag = flagged_df.groupby('donor_id').size()
    cnt_total = df.groupby('donor_id').size()
    cnt_flag = cnt_flag.reindex(cnt_total.index, fill_value=0)
    
    return cnt_flag / (cnt_total + 1e-6)
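
As a small worked example of the entropy calculation, the sketch below applies the same count, normalize, and sum pattern to invented data. It does not call `entropy_vectorized`, so it runs on its own. It also omits the `1e-9` offset, on the assumption that every probability is strictly positive because only observed (donor, teacher) pairs are counted.

```python
import numpy as np
import pandas as pd

# Toy data: donor 1 splits gifts 2:1 across two teachers, donor 2 is fully loyal
gifts = pd.DataFrame({
    "donor_id":   [1, 1, 1, 2, 2],
    "teacher_id": ["a", "a", "b", "c", "c"],
})

# Count gifts per (donor, teacher), then normalize within each donor
counts = gifts.groupby(["donor_id", "teacher_id"]).size()
probs = counts / counts.groupby(level=0).transform("sum")

# Shannon entropy in nats: -sum(p * ln p) per donor
entropy = -(probs * np.log(probs)).groupby(level=0).sum()
```

Donor 1 comes out at about 0.637 nats (probabilities 2/3 and 1/3), while donor 2, who supports a single teacher, has entropy 0.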

## 2. Feature Group Functions

Each function builds a logical group of related features.

### 2.1 Identity & Demographics

In [3]:
def _identity_and_zip_features(dpr_pre_T, df_zip_acs, md_pre_T=None):
    """
    Build identity features and ZIP-level demographics.
    
    Parameters
    ----------
    dpr_pre_T : pd.DataFrame
        Donation records before T
    df_zip_acs : pd.DataFrame
        ZIP-level ACS demographics
    md_pre_T : pd.DataFrame, optional
        Monthly donation data (for ZIP fallback for monthly-only donors)
    """
    # Create base index from all possible donors
    all_donor_ids = dpr_pre_T['donor_id'].unique()
    if md_pre_T is not None:
        all_donor_ids = pd.Index(
            pd.unique(
                pd.concat([
                    pd.Series(dpr_pre_T['donor_id'].unique()),
                    pd.Series(md_pre_T['donor_id'].unique())
                ])
            )
        )
    
    if dpr_pre_T.empty and (md_pre_T is None or md_pre_T.empty):
        return pd.DataFrame(index=pd.Index([], name='donor_id'))

    # Get ZIP from PROJECT donations (if available)
    if not dpr_pre_T.empty:
        donor_zip5_from_dpr = (
            dpr_pre_T
            .sort_values('payment_date')
            .groupby('donor_id')['donor_zip']
            .first()
        )
    else:
        donor_zip5_from_dpr = pd.Series(dtype=object, name='donor_zip')
    
    # Get ZIP from MONTHLY data as fallback
    if md_pre_T is not None and not md_pre_T.empty and 'donor_zip' in md_pre_T.columns:
        donor_zip5_from_monthly = (
            md_pre_T
            .sort_values('monthly_subscription_joined_date')
            .groupby('donor_id')['donor_zip']
            .first()
        )
        # Combine: use dpr ZIP if available, otherwise monthly ZIP
        donor_zip5_raw = donor_zip5_from_dpr.combine_first(donor_zip5_from_monthly)
    else:
        donor_zip5_raw = donor_zip5_from_dpr

    # Binary status flags (from dpr_pre_T)
    if not dpr_pre_T.empty:
        is_teacher = dpr_pre_T.groupby('donor_id')['is_teacher'].max()
        is_teacher_referred = dpr_pre_T.groupby('donor_id')['is_teacher_referred'].max()
        is_marketing_subscribed = dpr_pre_T.groupby('donor_id')['subscribed_to_marketing_emails'].max()
        is_major_gift_donor = dpr_pre_T.groupby('donor_id')['major_gift_donor'].max()
        
        # Account credit
        ever_used_account_credit = (
            dpr_pre_T.groupby('donor_id')['account_credit_balance']
            .max()
            .gt(0)
            .astype(int)
        )
        current_account_credit_balance = (
            dpr_pre_T
            .sort_values('payment_date')
            .groupby('donor_id')['account_credit_balance']
            .last()
        )
    else:
        # Create empty series if no dpr data
        is_teacher = pd.Series(dtype=float)
        is_teacher_referred = pd.Series(dtype=float)
        is_marketing_subscribed = pd.Series(dtype=float)
        is_major_gift_donor = pd.Series(dtype=float)
        ever_used_account_credit = pd.Series(dtype=int)
        current_account_credit_balance = pd.Series(dtype=float)

    # --- Donor Type Flags (always create for all donors) ---
    # Determine which donors appear in each data source
    project_donor_ids = set(dpr_pre_T['donor_id'].unique()) if not dpr_pre_T.empty else set()
    monthly_donor_ids = set(md_pre_T['donor_id'].unique()) if md_pre_T is not None and not md_pre_T.empty else set()
    
    # Create flags for all donors in our universe
    has_project_history = pd.Series([d in project_donor_ids for d in all_donor_ids], index=all_donor_ids, dtype=int)
    has_monthly_history = pd.Series([d in monthly_donor_ids for d in all_donor_ids], index=all_donor_ids, dtype=int)
    
    # Composite flags for donor segmentation
    is_monthly_only_donor = ((has_monthly_history == 1) & (has_project_history == 0)).astype(int)
    is_project_only_donor = ((has_project_history == 1) & (has_monthly_history == 0)).astype(int)
    is_hybrid_donor = ((has_project_history == 1) & (has_monthly_history == 1)).astype(int)

    # --- ZIP / ACS JOIN (dtype-safe) ------------------------------------
    donor_zip5 = pd.to_numeric(donor_zip5_raw, errors="coerce")

    df_zip_acs = df_zip_acs.copy()
    if 'ZIP5' in df_zip_acs.columns:
        df_zip_acs = df_zip_acs.set_index('ZIP5')
    df_zip_acs.index = pd.to_numeric(df_zip_acs.index, errors="coerce")

    acs = donor_zip5.to_frame('donor_zip5').join(
        df_zip_acs,
        on='donor_zip5',
        how='left'
    )

    # Assemble output
    out = pd.DataFrame(index=donor_zip5.index)
    out['donor_zip5'] = donor_zip5
    out['has_project_history'] = has_project_history
    out['has_monthly_history'] = has_monthly_history
    out['is_monthly_only_donor'] = is_monthly_only_donor
    out['is_project_only_donor'] = is_project_only_donor
    out['is_hybrid_donor'] = is_hybrid_donor
    out['is_teacher'] = is_teacher
    out['is_teacher_referred'] = is_teacher_referred
    out['is_marketing_subscribed'] = is_marketing_subscribed
    out['is_major_gift_donor'] = is_major_gift_donor
    out['ever_used_account_credit'] = ever_used_account_credit
    out['current_account_credit_balance'] = current_account_credit_balance

    # ACS features
    out['zip_pct_households_with_children'] = acs['pct_households_with_children']
    out['zip_pct_in_labor_force'] = acs['pct_in_labor_force']
    out['zip_unemployment_rate'] = acs['unemployment_rate']
    out['zip_pct_single_parent'] = acs['pct_single_parent']
    out['zip_pct_minority'] = acs['pct_minority']
    out['zip_avg_household_size'] = acs['avg_household_size']
    out['zip_median_age'] = acs['median_age']
    out['zip_median_home_value'] = acs['median_home_value']
    
    if 'total_population' in acs.columns:
        out['zip_log_total_population'] = np.log1p(acs['total_population'])
    elif 'log_total_population' in acs.columns:
        out['zip_log_total_population'] = acs['log_total_population']
    else:
        out['zip_log_total_population'] = np.nan
        
    return out
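
The dtype-safe ZIP join can be illustrated in isolation. The sketch below assumes donor ZIPs arrive as strings (possibly with leading zeros or junk values) while the ACS table stores `ZIP5` as integers. The demographic values are invented and `median_age` stands in for any ACS column.

```python
import pandas as pd

# Donor ZIPs as strings, including a leading zero and an unparseable value
donor_zip = pd.Series(
    ["02139", "94110", "not-a-zip"],
    index=pd.Index([1, 2, 3], name="donor_id"),
)

# ACS table keyed by integer ZIP5 (made-up values)
acs = pd.DataFrame({"ZIP5": [2139, 94110], "median_age": [31.2, 37.5]})

# Coerce both sides to numeric so '02139' and 2139 refer to the same key
donor_zip5 = pd.to_numeric(donor_zip, errors="coerce")
acs_indexed = acs.set_index("ZIP5")
acs_indexed.index = pd.to_numeric(acs_indexed.index, errors="coerce")

joined = donor_zip5.to_frame("donor_zip5").join(
    acs_indexed, on="donor_zip5", how="left"
)
```

The unparseable ZIP becomes NaN and simply receives no ACS values, which leaves imputation to a later step.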

### 2.2 Lifetime Giving Behavior

In [4]:
def _lifetime_giving_features(dpr_pre_T, T):
    """
    Build cumulative giving features using all donations before T.
    
    CORRECTED VERSION with:
    - Tenure relative to T (not last_donation_date)
    - Gap calculations only for last 730 days
    
    Features created:
    - first_donation_date, last_donation_date: Temporal boundaries
    - tenure_days, tenure_years, tenure_bucket: How long donor has been active (relative to T)
    - lifetime_gift_count, lifetime_amount: Total volume
    - lifetime_median/max/cv_gift_amount: Distribution of gift sizes
    - mean/cv_gap_between_gifts_days_last2yr: Giving rhythm/consistency (last 730 days only)
    - max_donation_sequence_number: How many gifts total
    - pct_early_gifts_in_lifetime: Concentration in first 3 gifts
    
    Parameters
    ----------
    dpr_pre_T : pd.DataFrame
        Donor Project Records filtered to payment_date < T
    T : pd.Timestamp
        Reference time (end of training period)
        
    Returns
    -------
    features : pd.DataFrame
        Feature matrix indexed by donor_id
    """
    if dpr_pre_T.empty:
        return pd.DataFrame(index=pd.Index([], name='donor_id'))
    
    dpr = dpr_pre_T.copy()
    dpr = dpr.sort_values(['donor_id', 'payment_date'])
    
    # --- Temporal boundaries ---
    first_donation_date = dpr.groupby('donor_id')['payment_date'].min()
    last_donation_date = dpr.groupby('donor_id')['payment_date'].max()
    
    # CORRECTED: Tenure relative to T (end of training period), not last donation
    tenure_days = (T - first_donation_date).dt.days
    tenure_years = tenure_days / 365.25
    
    # Bucket tenure for non-linear patterns (categorical)
    tenure_bucket_cat = pd.cut(
        tenure_years,
        bins=[0, 1, 3, 5, 100],
        labels=['<1y', '1-3y', '3-5y', '5y+'],
        right=False
    )
    
    # Map bucket to lower-bound in YEARS as float64
    tenure_bucket = tenure_bucket_cat.map({
        '<1y': 0.0,
        '1-3y': 1.0,
        '3-5y': 3.0,
        '5y+': 5.0,
    }).astype(float)
    
    # --- Volume metrics ---
    lifetime_gift_count = dpr.groupby('donor_id')['payment_amount'].size()
    lifetime_amount = dpr.groupby('donor_id')['payment_amount'].sum()
    
    # --- Distribution of gift sizes ---
    gifts_by_donor = dpr.groupby('donor_id')['payment_amount']
    med = gifts_by_donor.median()
    mx = gifts_by_donor.max()
    mean = gifts_by_donor.mean()
    std = gifts_by_donor.std()
    # Coefficient of variation: std/mean
    # High CV = erratic giving amounts, Low CV = consistent amounts
    cv = std / (mean + 1e-9)
    
    # --- Inter-gift timing patterns (LAST 730 DAYS ONLY) ---
    # CORRECTED: Only use donations in last 730 days for gap calculations
    cutoff_730 = T - pd.Timedelta(days=730)
    dpr_730 = dpr[dpr['payment_date'] >= cutoff_730].copy()
    
    if not dpr_730.empty:
        # Calculate days between consecutive gifts
        dpr_730['prev_payment_date'] = dpr_730.groupby('donor_id')['payment_date'].shift(1)
        dpr_730['gap_days'] = (dpr_730['payment_date'] - dpr_730['prev_payment_date']).dt.days
        
        gap_by_donor = dpr_730.groupby('donor_id')['gap_days']
        gap_mean = gap_by_donor.mean()
        gap_std = gap_by_donor.std()
        # CV of gaps: High = irregular giving, Low = regular/predictable
        cv_gap = gap_std / (gap_mean + 1e-9)
    else:
        # No donations in last 730 days
        gap_mean = pd.Series(dtype=float)
        cv_gap = pd.Series(dtype=float)
    
    # --- Gift sequence metrics ---
    # donation_n = sequential gift number for this donor
    max_donation_sequence_number = dpr.groupby('donor_id')['donation_n'].max()
    
    # What fraction of gifts were in the "early stage" (first 3 gifts)?
    # High concentration here might indicate acquisition success but poor retention
    dpr['is_early_gift'] = dpr['donation_n'] <= 3
    pct_early_gifts = dpr.groupby('donor_id')['is_early_gift'].mean()
    
    # --- Assemble output ---
    out = pd.DataFrame(index=lifetime_amount.index)
    
    out['first_donation_date'] = first_donation_date
    out['last_donation_date'] = last_donation_date
    out['tenure_days'] = tenure_days
    out['tenure_years'] = tenure_years
    out['tenure_bucket'] = tenure_bucket
    
    out['lifetime_gift_count'] = lifetime_gift_count
    out['lifetime_amount'] = lifetime_amount
    out['lifetime_median_gift_amount'] = med
    out['lifetime_max_gift_amount'] = mx
    out['lifetime_cv_gift_amount'] = cv
    
    out['mean_gap_between_gifts_days_last2yr'] = gap_mean
    out['cv_gap_between_gifts_days_last2yr'] = cv_gap
    
    out['max_donation_sequence_number'] = max_donation_sequence_number
    out['pct_early_gifts_in_lifetime'] = pct_early_gifts
    
    return out
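
To make the gap statistics concrete, here is a small worked example on invented dates. It uses `groupby(...).diff()`, which should be equivalent to subtracting the shifted previous date, and pandas' default sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Toy data: donor 1 gives three times, donor 2 only once
gifts = pd.DataFrame({
    "donor_id": [1, 1, 1, 2],
    "payment_date": pd.to_datetime(
        ["2025-01-01", "2025-01-31", "2025-04-01", "2025-03-01"]
    ),
}).sort_values(["donor_id", "payment_date"])

# Days between consecutive gifts within each donor (first gift has no gap)
gifts["gap_days"] = gifts.groupby("donor_id")["payment_date"].diff().dt.days

gap = gifts.groupby("donor_id")["gap_days"]
gap_mean = gap.mean()
cv_gap = gap.std() / (gap_mean + 1e-9)
```

Donor 1 has gaps of 30 and 60 days, so the mean is 45 and the CV is roughly 0.47. Donor 2 has a single gift, so both statistics are NaN rather than zero.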

### 2.3 Windowed Giving & Velocity

These features capture recent activity and trends.

In [5]:
def _windowed_giving_features(dpr_pre_T, dpr_3m, dpr_12m, dpr_36m, dpr_3to12m, dpr_12to36m, T):
    """
    Build windowed giving features and velocity metrics.
    
    Features created:
    - gift_count/amount/median_amount for 3m, 12m, 36m windows
    - days_since_last_gift, days_since_second_to_last_gift: Recency
    - amount_velocity_0to3_vs_3to12: Is giving accelerating or decelerating?
    - amount_velocity_0to12_vs_12to36: Longer-term trend
    - count_velocity_0to12_vs_12to36: Frequency trend
    
    Velocity > 1 indicates acceleration (more recent activity)
    Velocity < 1 indicates deceleration (declining activity)
    
    Parameters
    ----------
    dpr_pre_T : pd.DataFrame
        All donations before T
    dpr_3m, dpr_12m, dpr_36m : pd.DataFrame
        Donations in respective windows
    dpr_3to12m, dpr_12to36m : pd.DataFrame
        Intermediate periods for velocity calculations
    T : pd.Timestamp
        Reference time
        
    Returns
    -------
    features : pd.DataFrame
        Feature matrix indexed by donor_id
    """
    out_index = dpr_pre_T['donor_id'].unique()
    out = pd.DataFrame(index=out_index)
    
    # --- 3-month window (recent/short-term) ---
    if not dpr_3m.empty:
        by_3m = dpr_3m.groupby('donor_id')['payment_amount']
        out['gift_count_3m'] = by_3m.size()
        out['gift_amount_3m'] = by_3m.sum()
        out['median_gift_amount_3m'] = by_3m.median()

        # Split by green vs non-green payments
        if 'is_green_payment' in dpr_3m.columns:
            # Handle 't'/'f' as well as 1/0/True/False
            green_mask_3m = dpr_3m['is_green_payment'].isin(['t', 'T', True, 1])
            nongreen_mask_3m = dpr_3m['is_green_payment'].isin(['f', 'F', False, 0])

            dpr_3m_green = dpr_3m[green_mask_3m]
            dpr_3m_nongreen = dpr_3m[nongreen_mask_3m]

            by_3m_green = dpr_3m_green.groupby('donor_id')['payment_amount']
            by_3m_nongreen = dpr_3m_nongreen.groupby('donor_id')['payment_amount']

            out['gift_count_green_3m'] = by_3m_green.size()
            out['gift_amount_green_3m'] = by_3m_green.sum()
            out['gift_count_nongreen_3m'] = by_3m_nongreen.size()
            out['gift_amount_nongreen_3m'] = by_3m_nongreen.sum()
    
    # --- 12-month window (mid-term) ---
    if not dpr_12m.empty:
        by_12m = dpr_12m.groupby('donor_id')['payment_amount']
        out['gift_count_12m'] = by_12m.size()
        out['gift_amount_12m'] = by_12m.sum()
        out['median_gift_amount_12m'] = by_12m.median()

        if 'is_green_payment' in dpr_12m.columns:
            green_mask_12m = dpr_12m['is_green_payment'].isin(['t', 'T', True, 1])
            nongreen_mask_12m = dpr_12m['is_green_payment'].isin(['f', 'F', False, 0])

            dpr_12m_green = dpr_12m[green_mask_12m]
            dpr_12m_nongreen = dpr_12m[nongreen_mask_12m]

            by_12m_green = dpr_12m_green.groupby('donor_id')['payment_amount']
            by_12m_nongreen = dpr_12m_nongreen.groupby('donor_id')['payment_amount']

            out['gift_count_green_12m'] = by_12m_green.size()
            out['gift_amount_green_12m'] = by_12m_green.sum()
            out['gift_count_nongreen_12m'] = by_12m_nongreen.size()
            out['gift_amount_nongreen_12m'] = by_12m_nongreen.sum()
    
    # --- 36-month window (long-term) ---
    if not dpr_36m.empty:
        by_36m = dpr_36m.groupby('donor_id')['payment_amount']
        out['gift_count_36m'] = by_36m.size()
        out['gift_amount_36m'] = by_36m.sum()
        out['median_gift_amount_36m'] = by_36m.median()

        if 'is_green_payment' in dpr_36m.columns:
            green_mask_36m = dpr_36m['is_green_payment'].isin(['t', 'T', True, 1])
            nongreen_mask_36m = dpr_36m['is_green_payment'].isin(['f', 'F', False, 0])

            dpr_36m_green = dpr_36m[green_mask_36m]
            dpr_36m_nongreen = dpr_36m[nongreen_mask_36m]

            by_36m_green = dpr_36m_green.groupby('donor_id')['payment_amount']
            by_36m_nongreen = dpr_36m_nongreen.groupby('donor_id')['payment_amount']

            out['gift_count_green_36m'] = by_36m_green.size()
            out['gift_amount_green_36m'] = by_36m_green.sum()
            out['gift_count_nongreen_36m'] = by_36m_nongreen.size()
            out['gift_amount_nongreen_36m'] = by_36m_nongreen.sum()
    
    # --- Recency: days since last activity ---
    last_gift_date = dpr_pre_T.groupby('donor_id')['payment_date'].max()
    out['days_since_last_gift'] = (T - last_gift_date).dt.days
    
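    # Second-to-last gift: helps detect if donor is becoming less frequent.
    # Rank each donor's gifts from most recent (0) backwards and keep rank 1.
    # groupby.nth returns rows under their original index in pandas >= 2.0,
    # so select via cumcount and index by donor_id explicitly instead.
    dpr_desc = dpr_pre_T.sort_values(['donor_id', 'payment_date'], ascending=[True, False])
    recency_rank = dpr_desc.groupby('donor_id').cumcount()
    second_last = (
        dpr_desc[(recency_rank == 1).to_numpy()]
        .set_index('donor_id')['payment_date']
    )  # donors with fewer than 2 gifts are absent and align to NaN below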
    
    out['days_since_second_to_last_gift'] = (T - second_last).dt.days
    
    # --- Velocity metrics: are donations accelerating or decelerating? ---
    
    # Recent 3m vs prior 9m (months 3-12)
    amt_0_3 = out.get('gift_amount_3m', pd.Series(0, index=out.index))
    amt_3_12 = (
        dpr_3to12m.groupby('donor_id')['payment_amount'].sum()
        if not dpr_3to12m.empty
        else pd.Series(0, index=out.index)
    )
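    # Ratio > 1 means recent giving exceeds prior period.
    # When both windows are non-empty, donors absent from either one come out
    # as NaN after index alignment.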
    out['amount_velocity_0to3_vs_3to12'] = amt_0_3 / (amt_3_12 + 1e-6)
    
    # Recent 12m vs prior 24m (months 12-36)
    amt_0_12 = out.get('gift_amount_12m', pd.Series(0, index=out.index))
    amt_12_36 = (
        dpr_12to36m.groupby('donor_id')['payment_amount'].sum()
        if not dpr_12to36m.empty
        else pd.Series(0, index=out.index)
    )
    out['amount_velocity_0to12_vs_12to36'] = amt_0_12 / (amt_12_36 + 1e-6)
    
    # Gift frequency velocity (count-based)
    cnt_0_12 = out.get('gift_count_12m', pd.Series(0, index=out.index))
    cnt_12_36 = (
        dpr_12to36m.groupby('donor_id')['payment_amount'].size()
        if not dpr_12to36m.empty
        else pd.Series(0, index=out.index)
    )
    out['count_velocity_0to12_vs_12to36'] = cnt_0_12 / (cnt_12_36 + 1e-6)
    
    return out
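
The behaviour of the velocity ratios at the edges is easier to see on a toy case. The sketch below uses invented amounts and compares the ratio with missing donors left as NaN against a variant that fills missing amounts with zero first. The zero-fill variant is an assumption shown only for contrast, not something the function above does.

```python
import pandas as pd

donors = pd.Index([1, 2, 3], name="donor_id")

# Donor 3 has no gifts in the recent window; donor 2 has none in the prior one
amt_0_3 = pd.Series({1: 60.0, 2: 10.0}).reindex(donors)
amt_3_12 = pd.Series({1: 30.0, 3: 90.0}).reindex(donors)

# Missing donors propagate as NaN through the division
velocity_raw = amt_0_3 / (amt_3_12 + 1e-6)

# Hypothetical alternative: treat missing as zero before dividing
velocity_filled = amt_0_3.fillna(0) / (amt_3_12.fillna(0) + 1e-6)
```

With NaN propagation only donor 1 gets a ratio (about 2.0). With zero-filling, donor 3 gets exactly 0 and donor 2 gets a very large value because the epsilon is the entire denominator, so a downstream cap or log transform would likely be needed in that case.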

### 2.4 Channel & Payment Type Mix

How donors give: DAF, green payments, gift cards, etc.

In [6]:
def _channel_mix_features(dpr_pre_T, dpr_12m):
    """
    Build features describing how donors give (payment methods, channels).
    
    For each channel, we compute both lifetime and 12-month versions:
    - pct_amount_*: What fraction of $ came through this channel?
    - pct_count_*: What fraction of gifts came through this channel?
    
    Channels covered:
    - DAF (Donor Advised Fund) payments
    - Green payments (environmental offset donations)
    - Gift card purchases
    - Big event donations (e.g., giving days)
    - Optional donation behavior
    - Anonymous gifts
    - Classroom essentials projects
    
    Parameters
    ----------
    dpr_pre_T : pd.DataFrame
        All donations before T (for lifetime metrics)
    dpr_12m : pd.DataFrame
        Donations in last 12 months (for recent behavior)
        
    Returns
    -------
    features : pd.DataFrame
        Feature matrix indexed by donor_id
    """
    out_index = dpr_pre_T['donor_id'].unique()
    out = pd.DataFrame(index=out_index)
    
    # --- DAF donations ---
    # DAF = Donor Advised Fund, typically indicates sophisticated/high-capacity donors
    out['pct_amount_daf_lifetime'] = pct_amount(dpr_pre_T, 'daf_payment')
    out['pct_count_daf_lifetime'] = pct_count(dpr_pre_T, 'daf_payment')
    out['pct_amount_daf_12m'] = pct_amount(dpr_12m, 'daf_payment')
    out['pct_count_daf_12m'] = pct_count(dpr_12m, 'daf_payment')
    
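    # --- Green payments ---
    # Optional environmental offset donations
    # Note: pct_count matches rows where the flag == 1, so is_green_payment must
    # be numeric/boolean here (not 't'/'f' strings)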
    if 'green_payment_amount' in dpr_pre_T.columns:
        by_life = dpr_pre_T.groupby('donor_id')
        by_12 = dpr_12m.groupby('donor_id')
        
        out['pct_amount_green_lifetime'] = (
            by_life['green_payment_amount'].sum()
            / (by_life['payment_amount'].sum() + 1e-6)
        )
        out['pct_amount_green_12m'] = (
            by_12['green_payment_amount'].sum()
            / (by_12['payment_amount'].sum() + 1e-6)
        )
        out['pct_count_green_lifetime'] = pct_count(dpr_pre_T, 'is_green_payment')
        out['pct_count_green_12m'] = pct_count(dpr_12m, 'is_green_payment')
    
    # --- Gift card purchases ---
    # Donors buying gift cards to give to others
    out['pct_gifts_gift_card_lifetime'] = pct_count(dpr_pre_T, 'gift_card_purchase')
    out['pct_amount_gift_card_lifetime'] = pct_amount(dpr_pre_T, 'gift_card_purchase')
    out['pct_gifts_gift_card_12m'] = pct_count(dpr_12m, 'gift_card_purchase')
    out['pct_amount_gift_card_12m'] = pct_amount(dpr_12m, 'gift_card_purchase')
    
    # --- Big event donations ---
    # Giving days, campaigns, etc.
    out['pct_amount_big_event_lifetime'] = pct_amount(dpr_pre_T, 'payment_on_big_event')
    out['pct_count_big_event_lifetime'] = pct_count(dpr_pre_T, 'payment_on_big_event')
    out['pct_amount_big_event_12m'] = pct_amount(dpr_12m, 'payment_on_big_event')
    out['pct_count_big_event_12m'] = pct_count(dpr_12m, 'payment_on_big_event')
    
    # --- Optional donation rate ---
    # When donors can add optional amounts (e.g., to cover fees)
    out['avg_optional_donation_rate_lifetime'] = (
        dpr_pre_T.groupby('donor_id')['optional_donation_rate'].mean()
    )
    out['avg_optional_donation_rate_12m'] = (
        dpr_12m.groupby('donor_id')['optional_donation_rate'].mean()
    )
    
    # --- Anonymous donations ---
    # Privacy-conscious or humble donors
    out['pct_gifts_anonymous_lifetime'] = pct_count(dpr_pre_T, 'donation_is_anonymous')
    out['pct_gifts_anonymous_12m'] = pct_count(dpr_12m, 'donation_is_anonymous')
    
    # --- Classroom essentials ---
    # Donations to specific project type (basic supplies)
    out['pct_amount_classroom_essentials_lifetime'] = pct_amount(
        dpr_pre_T, 'is_classroom_essentials_list'
    )
    out['pct_amount_classroom_essentials_12m'] = pct_amount(
        dpr_12m, 'is_classroom_essentials_list'
    )
    
    return out
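
The share-of-amount pattern behind the `pct_amount_*` columns can be shown on a few invented rows. This sketch repeats the filter, group, and reindex steps inline rather than calling `pct_amount`, so it runs on its own.

```python
import pandas as pd

# Toy data: donor 1 gives $75 via DAF and $25 otherwise; donor 2 never uses DAF
gifts = pd.DataFrame({
    "donor_id": [1, 1, 2],
    "payment_amount": [75.0, 25.0, 40.0],
    "daf_payment": [1, 0, 0],
})

amt_total = gifts.groupby("donor_id")["payment_amount"].sum()

# Sum flagged rows only, then reindex so donors with no flagged gifts get 0
amt_flag = (
    gifts[gifts["daf_payment"] == 1]
    .groupby("donor_id")["payment_amount"]
    .sum()
    .reindex(amt_total.index, fill_value=0)
)

pct_amount_daf = amt_flag / (amt_total + 1e-6)
```

Donor 1 ends up at roughly 0.75 and donor 2 at exactly 0. Without the `reindex`, donor 2 would be NaN instead.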

### 2.5 Monthly Giving Program

Subscription/recurring donation features.

In [7]:
def _monthly_features(md_pre_T, dpr_pre_T, dpr_12m, T):
    """
    Build monthly subscription program features.
    
    Features created:
    - is_monthly_donor_current: Currently subscribed?
    - monthly_lifetime_amount, monthly_amount_12m: How much via subscription
    - monthly_median_gift_amount: Typical subscription size
    - pct_amount_monthly_*: What fraction of total giving is subscription
    - months_on_program, months_since_last_monthly_charge: Tenure metrics
    - monthly_longest_streak_months: Longest uninterrupted run
    - monthly_joined_before_first_project_gift: Acquisition source indicator
    
    Parameters
    ----------
    md_pre_T : pd.DataFrame
        Monthly subscription records where join date < T
    dpr_pre_T : pd.DataFrame
        All donations before T (for computing fractions)
    dpr_12m : pd.DataFrame
        Donations in last 12 months
    T : pd.Timestamp
        Reference time
        
    Returns
    -------
    features : pd.DataFrame
        Feature matrix indexed by donor_id
    """
    if md_pre_T.empty:
        return pd.DataFrame(index=dpr_pre_T['donor_id'].unique())
    
    out_index = pd.Index(md_pre_T['donor_id'].unique(), name='donor_id')
    out = pd.DataFrame(index=out_index)
    
    # --- Current subscription status ---
    # A subscription is active at T if it was joined on or before T and is
    # either not retired or retired after T. Computed as a local Series
    # (vectorized) so the caller's DataFrame is not mutated.
    is_active = (
        (md_pre_T['monthly_subscription_joined_date'] <= T) &
        (md_pre_T['monthly_subscription_retired_date'].isna() | 
         (md_pre_T['monthly_subscription_retired_date'] > T))
    )
    
    out['is_monthly_donor_current'] = is_active.groupby(md_pre_T['donor_id']).max().astype(float)
    
    # --- Lifetime monthly amounts ---
    by_life = md_pre_T.groupby('donor_id')
    out['monthly_lifetime_amount'] = by_life['monthly_subscription_payment_amount'].sum()
    out['monthly_median_gift_amount'] = by_life['monthly_subscription_payment_amount'].median()
    
    # --- 12-month window ---
    if 'charge_date' in md_pre_T.columns:
        md_12m = md_pre_T[
            (md_pre_T['charge_date'] >= T - pd.DateOffset(months=12)) &
            (md_pre_T['charge_date'] < T)
        ]
        out['monthly_amount_12m'] = (
            md_12m.groupby('donor_id')['monthly_subscription_payment_amount'].sum()
        )
    
    # --- Fraction of total giving that's monthly ---
    # This shows how dependent a donor is on subscription vs one-time gifts
    total_life = dpr_pre_T.groupby('donor_id')['payment_amount'].sum()
    total_12 = dpr_12m.groupby('donor_id')['payment_amount'].sum()
    
    out['pct_amount_monthly_lifetime'] = (
        out['monthly_lifetime_amount'] / (total_life + 1e-6)
    )
    out['pct_amount_monthly_12m'] = (
        out.get('monthly_amount_12m', 0) / (total_12 + 1e-6)
    )
    
    # --- Months on program ---
    # How long has donor been (or was) subscribed?
    # If still active, use T; otherwise use the retirement date (capped at T)
    join_dates = md_pre_T.groupby('donor_id')['monthly_subscription_joined_date'].min()
    retire_dates = md_pre_T.groupby('donor_id')['monthly_subscription_retired_date'].min()
    end_dates = retire_dates.fillna(T).clip(upper=T)
    # Calendar-month difference via vectorized year/month arithmetic
    out['months_on_program'] = (
        (end_dates.dt.year - join_dates.dt.year) * 12
        + (end_dates.dt.month - join_dates.dt.month)
    )

    # --- Recency of last charge ---
    if 'charge_date' in md_pre_T.columns:
        last_charge = md_pre_T.groupby('donor_id')['charge_date'].max()
        # Calendar-month difference between T and the last charge
        # (stays NaN when a donor has no charge date)
        out['months_since_last_monthly_charge'] = (
            (T.year - last_charge.dt.year) * 12
            + (T.month - last_charge.dt.month)
        )
    
    # --- Longest streak ---
    # Maximum consecutive months of successful charges
    out['monthly_longest_streak_months'] = (
        md_pre_T.groupby('donor_id')['monthly_subscription_longest_streak'].max()
    )
    
    # --- Acquisition indicator ---
    # Did donor join monthly program BEFORE making first project donation?
    first_monthly_join = (
        md_pre_T.groupby('donor_id')['monthly_subscription_joined_date'].min()
    )
    first_donation_date = (
        dpr_pre_T.groupby('donor_id')['payment_date'].min()
    )

    # Align on union of donor_ids
    idx_union = first_monthly_join.index.union(first_donation_date.index)
    fmj = first_monthly_join.reindex(idx_union)
    fdd = first_donation_date.reindex(idx_union)

    joined_before = (fmj < fdd)
    joined_before = joined_before.fillna(False).astype(int)

    out['monthly_joined_before_first_project_gift'] = (
        joined_before.reindex(out.index).fillna(0).astype(int)
    )

    return out
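
The month-count features above subtract monthly Periods, which returns offset objects rather than integers. As a minimal sketch of an alternative (the helper name `months_between` and the toy data are illustrative assumptions, not part of the pipeline), the same whole-month difference can be computed with plain year/month arithmetic, which also propagates missing dates as NaN:

```python
import pandas as pd

def months_between(later, earlier):
    """Whole calendar months from each date in `earlier` (datetime Series)
    to the single Timestamp `later`. Missing dates come back as NaN."""
    return (later.year - earlier.dt.year) * 12 + (later.month - earlier.dt.month)

# Hypothetical reference time and last-charge dates
T = pd.Timestamp("2025-01-15")
last_charge = pd.Series(
    pd.to_datetime(["2024-10-03", "2024-12-28", None]),
    index=pd.Index(["d1", "d2", "d3"], name="donor_id"),
)

months_since = months_between(T, last_charge)
print(months_since)
```

Like the Period-based version, this counts calendar-month boundaries crossed, so a charge on December 28 is one month before a January 15 reference date.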

### 2.6 Teacher, School & Content Preferences

Loyalty vs diversification patterns.

In [8]:
def _teacher_school_features(dpr_pre_T):
    """
    Build features describing donor loyalty vs diversification patterns.
    
    CORRECTED VERSION with:
    - Entropy features set to null/NaN if only one donation
    - New features for pct amount to first/last project gifts
    
    Features measure:
    - Concentration: How focused is giving on specific teachers/schools/categories?
    - Diversification: How many different entities has donor supported?
    - Entropy: Information-theoretic measure of spread
    - Teacher quality: Average metrics of teachers supported
    - Project position: First/last gift preferences
    
    Parameters
    ----------
    dpr_pre_T : pd.DataFrame
        Donor Project Records filtered to payment_date < T
        
    Returns
    -------
    features : pd.DataFrame
        Feature matrix indexed by donor_id
    """
    if dpr_pre_T.empty:
        return pd.DataFrame(index=pd.Index([], name='donor_id'))
    
    dpr = dpr_pre_T.copy()
    out_index = dpr['donor_id'].unique()
    out = pd.DataFrame(index=out_index)
    
    # --- Count unique entities ---
    out['num_unique_teachers'] = dpr.groupby('donor_id')['teacher_id'].nunique()
    out['num_unique_schools'] = dpr.groupby('donor_id')['school_id'].nunique()
    out['num_unique_categories'] = dpr.groupby('donor_id')['project_category'].nunique()
    out['num_unique_grades'] = dpr.groupby('donor_id')['project_grade'].nunique()
    
    # School ZIP diversity (if available)
    if 'school_zip' in dpr.columns:
        out['num_unique_school_zips'] = dpr.groupby('donor_id')['school_zip'].nunique()
    
    # --- Entropy (diversity) measures ---
    # CORRECTED: Set to NaN if donor only has one donation
    
    # Get gift count per donor
    gift_counts = dpr.groupby('donor_id').size()
    single_gift_donors = gift_counts[gift_counts == 1].index
    
    # Calculate entropy for each dimension
    entropy_teacher = entropy_vectorized(dpr, 'donor_id', 'teacher_id')
    entropy_school = entropy_vectorized(dpr, 'donor_id', 'school_id')
    entropy_category = entropy_vectorized(dpr, 'donor_id', 'project_category')
    entropy_grade = entropy_vectorized(dpr, 'donor_id', 'project_grade')
    
    # Set to NaN for single-gift donors
    out['entropy_teacher'] = entropy_teacher
    out.loc[single_gift_donors, 'entropy_teacher'] = np.nan
    
    out['entropy_school'] = entropy_school
    out.loc[single_gift_donors, 'entropy_school'] = np.nan
    
    out['entropy_category'] = entropy_category
    out.loc[single_gift_donors, 'entropy_category'] = np.nan
    
    out['entropy_grade'] = entropy_grade
    out.loc[single_gift_donors, 'entropy_grade'] = np.nan
    
    if 'school_zip' in dpr.columns:
        entropy_zip = entropy_vectorized(dpr, 'donor_id', 'school_zip')
        out['entropy_zip'] = entropy_zip
        out.loc[single_gift_donors, 'entropy_zip'] = np.nan
    
    # --- Concentration metrics: % of $ going to top entity ---
    
    # Top teacher
    teacher_amounts = dpr.groupby(['donor_id', 'teacher_id'])['payment_amount'].sum()
    top_teacher_amt = teacher_amounts.groupby('donor_id').max()
    total_amt = dpr.groupby('donor_id')['payment_amount'].sum()
    out['pct_amount_to_top_teacher'] = top_teacher_amt / (total_amt + 1e-9)
    
    # Top school
    school_amounts = dpr.groupby(['donor_id', 'school_id'])['payment_amount'].sum()
    top_school_amt = school_amounts.groupby('donor_id').max()
    out['pct_amount_to_top_school'] = top_school_amt / (total_amt + 1e-9)
    
    # Top category
    category_amounts = dpr.groupby(['donor_id', 'project_category'])['payment_amount'].sum()
    top_category_amt = category_amounts.groupby('donor_id').max()
    out['pct_amount_to_top_category'] = top_category_amt / (total_amt + 1e-9)
    
    # Top grade level
    grade_amounts = dpr.groupby(['donor_id', 'project_grade'])['payment_amount'].sum()
    top_grade_amt = grade_amounts.groupby('donor_id').max()
    out['pct_amount_to_top_grade'] = top_grade_amt / (total_amt + 1e-9)
    
    # --- Teacher quality metrics ---
    # Average lifetime projects fully funded by teachers this donor supports
    if 'teacher_lifetime_projects_fully_funded' in dpr.columns:
        out['mean_teacher_lifetime_projects_fully_funded'] = (
            dpr.groupby('donor_id')['teacher_lifetime_projects_fully_funded'].mean()
        )
    
    # Average lifetime donations received by teachers this donor supports
    if 'teacher_lifetime_donations' in dpr.columns:
        out['mean_teacher_lifetime_donations'] = (
            dpr.groupby('donor_id')['teacher_lifetime_donations'].mean()
        )
    
    # --- NEW: Project position preferences ---
    # Share of gifts (by count) and of lifetime amount where the gift was the
    # project's first (gift_is_projects_first = 1)
    if 'gift_is_projects_first' in dpr.columns:
        out['pct_gifts_first_project'] = pct_count(dpr, 'gift_is_projects_first')
        out['pct_amount_first_project'] = pct_amount(dpr, 'gift_is_projects_first')
    
    # Share of gifts (by count) and of lifetime amount where the gift was the
    # project's last (gift_is_projects_last = 1)
    if 'gift_is_projects_last' in dpr.columns:
        out['pct_gifts_last_project'] = pct_count(dpr, 'gift_is_projects_last')
        out['pct_amount_last_project'] = pct_amount(dpr, 'gift_is_projects_last')
    
    return out
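
The entropy features rely on the `entropy_vectorized` helper, which is defined elsewhere in the notebook. To illustrate what such a measure captures, here is a minimal sketch of a count-based Shannon entropy per donor. The function name `shannon_entropy_by_group`, the natural-log base, and the use of gift counts rather than amounts are all assumptions made for illustration and may differ from the actual helper.

```python
import numpy as np
import pandas as pd

def shannon_entropy_by_group(df, group_col, value_col):
    """Shannon entropy (natural log) of value_col's distribution within each group."""
    counts = df.groupby([group_col, value_col]).size()
    probs = counts / counts.groupby(level=0).transform("sum")
    return -(probs * np.log(probs)).groupby(level=0).sum()

# Toy gifts: donor a splits evenly across two teachers, b is loyal to one,
# and c has a single gift
gifts = pd.DataFrame({
    "donor_id":   ["a", "a", "a", "a", "b", "b", "c"],
    "teacher_id": ["t1", "t2", "t1", "t2", "t1", "t1", "t9"],
})

ent = shannon_entropy_by_group(gifts, "donor_id", "teacher_id")

# Mirror the masking step above: one gift carries no information about spread
gift_counts = gifts.groupby("donor_id").size()
ent.loc[gift_counts[gift_counts == 1].index] = np.nan
print(ent)
```

Without the mask, donors b and c would both score 0, even though b has shown loyalty across repeated gifts and c has simply not had the chance to diversify. Setting single-gift donors to NaN keeps those two cases distinguishable.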

### 2.7 Seasonality Features

In [9]:
def _seasonality_features(dpr_pre_T):
    """
    Build features capturing temporal patterns in giving behavior.
    
    CORRECTED VERSION with:
    - entropy_gift_month set to null/NaN if only one donation
    
    Features created:
    - first_donation_month/quarter: When did they start?
    - first_donation_dow_sin/cos: Day-of-week cyclic encoding
    - pct_amount_in_back_to_school: Aug-Sep giving (7-10% of annual budget)
    - pct_amount_in_final_week_of_year: Dec 24-31 giving (tax planning)
    - pct_amount_on_weekends: Saturday/Sunday behavior
    - pct_amount_in_top_month/quarter: Concentration in favorite period
    - entropy_gift_month: How spread out is giving across months?
    
    Parameters
    ----------
    dpr_pre_T : pd.DataFrame
        Donor Project Records filtered to payment_date < T
        
    Returns
    -------
    features : pd.DataFrame
        Feature matrix indexed by donor_id
    """
    if dpr_pre_T.empty:
        return pd.DataFrame(index=pd.Index([], name='donor_id'))
    
    dpr = dpr_pre_T.copy()
    out_index = dpr['donor_id'].unique()
    out = pd.DataFrame(index=out_index)
    
    # Extract temporal components
    dpr['gift_month'] = dpr['payment_date'].dt.month
    dpr['gift_quarter'] = dpr['payment_date'].dt.quarter
    dpr['gift_dow'] = dpr['payment_date'].dt.dayofweek  # Monday=0, Sunday=6
    dpr['gift_day'] = dpr['payment_date'].dt.day
    
    # --- First donation temporal features ---
    first_donation = dpr.sort_values(['donor_id', 'payment_date']).groupby('donor_id').first()
    
    out['first_donation_month'] = first_donation['gift_month']
    out['first_donation_quarter'] = first_donation['gift_quarter']
    
    # Cyclic encoding of day-of-week (keeps Sunday=6 adjacent to Monday=0)
    out['first_donation_dow_sin'] = np.sin(2 * np.pi * first_donation['gift_dow'] / 7)
    out['first_donation_dow_cos'] = np.cos(2 * np.pi * first_donation['gift_dow'] / 7)
    
    # --- Seasonal concentration patterns ---
    
    # Back-to-school season (August-September)
    dpr['is_back_to_school'] = dpr['gift_month'].isin([8, 9])
    out['pct_amount_in_back_to_school'] = pct_amount(dpr, 'is_back_to_school')
    
    # Final week of year (tax planning, year-end giving)
    dpr['is_final_week'] = (dpr['gift_month'] == 12) & (dpr['gift_day'] >= 24)
    out['pct_amount_in_final_week_of_year'] = pct_amount(dpr, 'is_final_week')
    
    # Weekend giving (different behavior than weekday)
    dpr['is_weekend'] = dpr['gift_dow'].isin([5, 6])  # Saturday=5, Sunday=6
    out['pct_amount_on_weekends'] = pct_amount(dpr, 'is_weekend')
    
    # --- Peak period concentration ---
    # What % of giving happens in donor's most active month?
    month_amounts = dpr.groupby(['donor_id', 'gift_month'])['payment_amount'].sum()
    top_month_amt = month_amounts.groupby('donor_id').max()
    total_amt = dpr.groupby('donor_id')['payment_amount'].sum()
    out['pct_amount_in_top_month'] = top_month_amt / (total_amt + 1e-9)
    
    # What % of giving happens in donor's most active quarter?
    quarter_amounts = dpr.groupby(['donor_id', 'gift_quarter'])['payment_amount'].sum()
    top_quarter_amt = quarter_amounts.groupby('donor_id').max()
    out['pct_amount_in_top_quarter'] = top_quarter_amt / (total_amt + 1e-9)
    
    # --- Entropy of giving across months ---
    # CORRECTED: Set to NaN if donor only has one donation
    gift_counts = dpr.groupby('donor_id').size()
    single_gift_donors = gift_counts[gift_counts == 1].index
    
    entropy_month = entropy_vectorized(dpr, 'donor_id', 'gift_month')
    out['entropy_gift_month'] = entropy_month
    out.loc[single_gift_donors, 'entropy_gift_month'] = np.nan
    
    return out
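
The sine/cosine encoding of day-of-week places the seven days on a circle, so the distance between two days depends only on how many steps apart they are. A small sketch (the helper names `encode_dow` and `encoded_distance` are illustrative) shows that Sunday and Monday end up as close as Monday and Tuesday, which a raw 0-6 integer would not give:

```python
import numpy as np

def encode_dow(dow):
    """Map day-of-week (Monday=0 ... Sunday=6) to a point on the unit circle."""
    angle = 2 * np.pi * np.asarray(dow) / 7
    return np.sin(angle), np.cos(angle)

def encoded_distance(a, b):
    """Euclidean distance between two days in the (sin, cos) plane."""
    sin_a, cos_a = encode_dow(a)
    sin_b, cos_b = encode_dow(b)
    return float(np.hypot(sin_a - sin_b, cos_a - cos_b))

sun_mon = encoded_distance(6, 0)  # adjacent across the week boundary
mon_tue = encoded_distance(0, 1)  # adjacent within the week
mon_thu = encoded_distance(0, 3)  # three days apart

print(sun_mon, mon_tue, mon_thu)
```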

### 2.8 Email Engagement

In [10]:
def _email_features(email_pre_T, email_3m, email_12m, T):
    """
    Build email engagement features from 12-month email summary data.
    
    Features created:
    - emails_sent/opened/clicked for 3m and 12m windows
    - open_rate, click_rate (clicks per email sent)
    - email_open_rate_velocity: recent vs longer-term trend
    - days_since_last_email_sent: recency
    
    Note: This uses monthly aggregated data, so recency is approximate
    
    Parameters
    ----------
    email_pre_T : pd.DataFrame
        All email events before T
    email_3m, email_12m : pd.DataFrame
        Email events in respective windows
    T : pd.Timestamp
        Reference time
        
    Returns
    -------
    features : pd.DataFrame
        Feature matrix indexed by donor_id
    """
    if email_pre_T.empty:
        return pd.DataFrame(index=pd.Index([], name='donor_id'))
    
    def email_agg(df):
        """Aggregate email metrics for a given window"""
        if df.empty:
            idx = email_pre_T['donor_id'].unique()
            zero = pd.Series(0, index=idx)
            return zero, zero, zero, zero, zero
        
        by = df.groupby('donor_id')
        sent = by['email_sent_count'].sum()
        opened = by['email_open_count'].sum()
        clicked = by['email_click_count'].sum()
        
        # Rates: opens/clicks per email sent
        open_rate = opened / (sent + 1e-6)
        click_rate = clicked / (sent + 1e-6)
        
        return sent, opened, clicked, open_rate, click_rate
    
    # Aggregate for both windows
    sent_3, open_3, click_3, or_3, cr_3 = email_agg(email_3m)
    sent_12, open_12, click_12, or_12, cr_12 = email_agg(email_12m)
    
    idx = email_pre_T['donor_id'].unique()
    out = pd.DataFrame(index=idx)
    
    # 3-month window
    out['emails_sent_3m'] = sent_3
    out['emails_opened_3m'] = open_3
    out['emails_clicked_3m'] = click_3
    out['email_open_rate_3m'] = or_3
    out['email_click_rate_3m'] = cr_3
    
    # 12-month window
    out['emails_sent_12m'] = sent_12
    out['emails_opened_12m'] = open_12
    out['emails_clicked_12m'] = click_12
    out['email_open_rate_12m'] = or_12
    out['email_click_rate_12m'] = cr_12
    
    # Velocity: is engagement improving or declining?
    # Positive = recent engagement higher than long-term average
    out['email_open_rate_velocity_3m_vs_12m'] = or_3 - or_12
    
    # Recency (approximate, since data is monthly)
    last_email_month = email_pre_T.groupby('donor_id')['email_month_start'].max()
    out['days_since_last_email_sent'] = (T - last_email_month).dt.days
    
    # Note: email type mix features would go here if you have
    # a mapping from email_type to type_group (appeal, newsletter, etc.)
    
    return out
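
To make the velocity feature concrete, the sketch below computes open rates for a 3-month and a 12-month window on toy data and takes their difference. The window construction (`window` helper, half-open interval ending at T) is an assumption for illustration; the notebook builds `email_3m` and `email_12m` elsewhere.

```python
import pandas as pd

T = pd.Timestamp("2025-01-01")

# Toy monthly email summary (column names follow the function above)
email = pd.DataFrame({
    "donor_id": ["d1", "d1", "d2"],
    "email_month_start": pd.to_datetime(["2024-03-01", "2024-11-01", "2024-04-01"]),
    "email_sent_count": [10, 10, 5],
    "email_open_count": [2, 8, 1],
})

def window(df, months):
    """Rows with email_month_start in [T - months, T)."""
    start = T - pd.DateOffset(months=months)
    return df[(df["email_month_start"] >= start) & (df["email_month_start"] < T)]

def open_rate(df):
    by = df.groupby("donor_id")
    return by["email_open_count"].sum() / (by["email_sent_count"].sum() + 1e-6)

# Positive velocity means recent engagement exceeds the 12-month average
velocity = open_rate(window(email, 3)) - open_rate(window(email, 12))
print(velocity)
```

Donor d2 has no rows in the 3-month window, so index alignment leaves the velocity as NaN rather than 0. Whether that should later be imputed is a downstream decision, in line with the notebook's separation of feature building from imputation.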

### 2.9 Site Behavior

In [11]:
def _site_features(site_pre_T, site_3m, T):
    """
    Build on-site browsing and engagement features.
    
    Features created:
    - days_with_any_site_activity_3m: Active days count
    - avg_sessions_per_active_day_3m: Session frequency
    - avg_session_duration_min_3m: Session length
    - checkout_intent_min_per_session_3m: Cart visits per session (cart engagement)
    - days_since_last_cart_visit: Browse-to-buy recency (relative to T)
    - campaign_session_share_3m: Attribution
    - share_*_page_session_pct_3m: Page type mix
    - device_share_*_3m: Device usage profile
    
    Parameters
    ----------
    site_pre_T : pd.DataFrame
        Site events with activity_date < T
    site_3m : pd.DataFrame
        Site events with activity_date in [T-3m, T)
    T : pd.Timestamp
        As-of timestamp
    """
    if site_pre_T.empty:
        return pd.DataFrame(index=pd.Index([], name='donor_id'))

    idx = site_pre_T['donor_id'].unique()
    out = pd.DataFrame(index=idx)

    # --- Recent activity metrics (3m window) ---
    if not site_3m.empty:
        # Days with any activity
        activity_by_day = (
            site_3m
            .assign(any_activity=1)
            .groupby(['donor_id', 'activity_date'])['any_activity']
            .max()
            .reset_index()
        )
        days_with_any = activity_by_day.groupby('donor_id')['activity_date'].nunique()
        
        # Session counts
        sessions_by_donor = site_3m.groupby('donor_id').size()
        
        out['days_with_any_site_activity_3m'] = days_with_any
        out['avg_sessions_per_active_day_3m'] = (
            sessions_by_donor / (days_with_any + 1e-6)
        )
        
        # Session duration (if available)
        if 'session_duration_min' in site_3m.columns:
            out['avg_session_duration_min_3m'] = (
                site_3m.groupby('donor_id')['session_duration_min'].mean()
            )
        
        # Checkout intent: cart visits per session
        if 'cart_visits_day' in site_3m.columns:
            out['checkout_intent_min_per_session_3m'] = (
                site_3m.groupby('donor_id')['cart_visits_day'].sum()
                / (sessions_by_donor + 1e-6)
            )
        
        # Campaign attribution
        if 'came_from_campaign' in site_3m.columns:
            out['campaign_session_share_3m'] = (
                site_3m.groupby('donor_id')['came_from_campaign'].mean()
            )
        
        # --- Page type mix ---
        # Ensure required columns exist with default 0.
        # assign() returns a new frame, so the caller's DataFrame is not modified.
        for col in ['project_page_visits_day', 'teacher_page_visits_day', 'search_visits_day']:
            if col not in site_3m.columns:
                site_3m = site_3m.assign(**{col: 0})
        
        page_counts = site_3m.groupby('donor_id').agg({
            'project_page_visits_day': 'sum',
            'teacher_page_visits_day': 'sum',
            'search_visits_day'      : 'sum'
        })
        total_page_visits = page_counts.sum(axis=1) + 1e-6
        
        out['share_project_page_session_pct_3m'] = (
            page_counts['project_page_visits_day'] / total_page_visits
        )
        out['share_teacher_page_session_pct_3m'] = (
            page_counts['teacher_page_visits_day'] / total_page_visits
        )
        out['share_search_page_session_pct_3m'] = (
            page_counts['search_visits_day'] / total_page_visits
        )
        
        # --- Device profile ---
        # Mobile-first, desktop-only, or mixed?
        if 'device_type' in site_3m.columns:
            # Normalize device_type to lowercase to match ['mobile', 'desktop', 'tablet']
            dev_df = site_3m.copy()
            dev_df['device_type'] = dev_df['device_type'].astype(str).str.lower()
            
            device_counts = (
                dev_df.groupby(['donor_id', 'device_type'])
                .size()
                .unstack(fill_value=0)
            )
            total_device = device_counts.sum(axis=1) + 1e-6
            
            for dev in ['mobile', 'desktop', 'tablet']:
                colname = f'device_share_{dev}_3m'
                if dev in device_counts.columns:
                    out[colname] = device_counts[dev] / total_device
                else:
                    out[colname] = 0.0

    # --- Cart recency (using full history, relative to T) ---
    if 'cart_visits_day' in site_pre_T.columns:
        has_cart = site_pre_T[site_pre_T['cart_visits_day'] > 0]
        if not has_cart.empty:
            last_cart_date = has_cart.groupby('donor_id')['activity_date'].max()
            out['days_since_last_cart_visit'] = (
                (T - last_cart_date).dt.days
            )
    
    return out
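
The device profile step pivots per-donor device counts into columns and normalizes each row. A compact sketch of the same idea on toy data follows; it uses `reindex` to guarantee that all three expected device columns exist, which is a stylistic variation on the explicit loop above rather than the notebook's own code.

```python
import pandas as pd

# Toy site activity with inconsistent casing in device_type
site = pd.DataFrame({
    "donor_id": [1, 1, 1, 2],
    "device_type": ["Mobile", "desktop", "MOBILE", "Desktop"],
})
site["device_type"] = site["device_type"].astype(str).str.lower()

# Rows: donors, columns: device types, values: row counts
counts = (
    site.groupby(["donor_id", "device_type"])
    .size()
    .unstack(fill_value=0)
)

# Normalize each row, then force the expected columns (missing ones become 0.0)
shares = (
    counts.div(counts.sum(axis=1), axis=0)
    .reindex(columns=["mobile", "desktop", "tablet"], fill_value=0.0)
)
print(shares)
```

One caveat worth keeping in mind: `astype(str)` turns missing values into the literal string `"nan"`, so rows with an unknown device still count toward the denominator under that label.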

### 2.10 Share Events

In [12]:
def _share_features(share_pre_T, share_12m, T):
    """
    Build social sharing behavior features.
    
    Features created:
    - share_events_lifetime, share_events_12m: Volume
    - share_active_months_12m: Frequency
    - share_gap_mean/cv_days: Sharing rhythm
    - share_month_coverage_ratio: Consistency over tenure
    
    Parameters
    ----------
    share_pre_T : pd.DataFrame
        All share events before T
    share_12m : pd.DataFrame
        Share events in last 12 months
    T : pd.Timestamp
        Reference time
        
    Returns
    -------
    features : pd.DataFrame
        Feature matrix indexed by donor_id
    """
    if share_pre_T.empty:
        return pd.DataFrame(index=pd.Index([], name='donor_id'))
    
    out_index = share_pre_T['donor_id'].unique()
    out = pd.DataFrame(index=out_index)
    
    # --- Volume metrics ---
    out['share_events_lifetime'] = (
        share_pre_T.groupby('donor_id')['share_event_count'].sum()
    )
    out['share_events_12m'] = (
        share_12m.groupby('donor_id')['share_event_count'].sum()
        if not share_12m.empty else 0
    )
    
    # --- Active months (consistency) ---
    if not share_12m.empty:
        active_months_12m = (
            share_12m[share_12m['share_event_count'] > 0]
            .groupby('donor_id')['share_month_start']
            .nunique()
        )
        out['share_active_months_12m'] = active_months_12m
    
    # --- Sharing rhythm: gaps between share months ---
    def gap_stats(s):
        if s.shape[0] < 2:
            return pd.Series({
                'share_gap_mean_days': np.nan,
                'share_gap_cv_days': np.nan
            })
        # Sort by time, then diff
        gaps = s.sort_values().diff().dropna().dt.days
        mean = gaps.mean()
        cv = gaps.std() / (mean + 1e-6)
        return pd.Series({
            'share_gap_mean_days': mean,
            'share_gap_cv_days': cv
        })

    gap_df = (
        share_pre_T.groupby('donor_id')['share_month_start']
        .apply(gap_stats)
        .unstack()          # columns: share_gap_mean_days, share_gap_cv_days
    )

    out = out.join(gap_df, how='left')

    # --- Coverage ratio ---
    first_share = share_pre_T.groupby('donor_id')['share_month_start'].min()
    last_share = share_pre_T.groupby('donor_id')['share_month_start'].max()
    
    # Difference of two Periods is a DateOffset (e.g., <MonthEnd>); use .n to get month count
    tenure_offsets = last_share.dt.to_period('M') - first_share.dt.to_period('M')
    tenure_months = tenure_offsets.apply(lambda x: x.n if pd.notnull(x) else 0)
    tenure_months = tenure_months.clip(lower=1)  # at least 1 month of tenure
    
    out['share_month_coverage_ratio'] = (
        out.get('share_active_months_12m', 0) / (tenure_months + 1e-6)
    )

    return out
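
As a worked example of the gap statistics, the sketch below applies the same sort, diff, and coefficient-of-variation steps to two toy donors (the helper name `gap_mean_and_cv` is illustrative). It also shows an edge case: with exactly two share months there is only one gap, so the sample standard deviation, and therefore the CV, is NaN.

```python
import pandas as pd

def gap_mean_and_cv(month_starts):
    """Mean gap in days and coefficient of variation between share months."""
    gaps = month_starts.sort_values().diff().dropna().dt.days
    mean = gaps.mean()
    return gaps, mean, gaps.std() / (mean + 1e-6)

# Donor with three share months: gaps of 31 and 90 days
three_months = pd.Series(pd.to_datetime(["2024-05-01", "2024-01-01", "2024-02-01"]))
gaps3, mean3, cv3 = gap_mean_and_cv(three_months)

# Donor with two share months: a single gap, so std (ddof=1) is undefined
two_months = pd.Series(pd.to_datetime(["2024-01-01", "2024-02-01"]))
gaps2, mean2, cv2 = gap_mean_and_cv(two_months)

print(list(gaps3), mean3, cv3)
print(list(gaps2), mean2, cv2)
```

A low CV indicates a regular sharing rhythm, while a high CV indicates bursts separated by long quiet periods.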

### 2.11 Same School & Teacher Available Flags

In [13]:
def _future_opportunity_features(dpr_pre_T, df_project_dates, T, H):
    """
    Future opportunity features based on the donor's TOP (most-funded) school and teacher,
    not just their first one.

    For each donor:
      - Find the school where they have given the most (by payment_amount).
      - Find the teacher where they have given the most (by payment_amount).
      - Flag whether that school/teacher has ANY project active in [T, T+H].

    This approximates "did the donor's favorite school/teacher have opportunities
    to receive more gifts during the horizon".
    """
    # If we don't have project data or donations, return zeros
    if df_project_dates is None or df_project_dates.empty or dpr_pre_T.empty:
        idx = dpr_pre_T['donor_id'].unique()
        return pd.DataFrame({
            'school_still_available_during_range': 0,
            'teacher_still_available_during_range': 0
        }, index=pd.Index(idx, name='donor_id'))

    # Work on copies
    proj = df_project_dates.copy()
    dpr = dpr_pre_T.copy()

    # Ensure project date columns are datetime
    for c in ['project_last_posted_date', 'project_funded_date', 'project_expiration_date']:
        if c in proj.columns:
            proj[c] = pd.to_datetime(proj[c], errors='coerce')

    # Project end date = earlier of funded or expired
    proj['end_date'] = proj[['project_funded_date', 'project_expiration_date']].min(axis=1)

    # Normalize H
    if isinstance(H, int):
        H = pd.Timedelta(days=H)
    elif H is None:
        H = pd.Timedelta(days=365)

    win_start = T
    win_end = T + H

    # ----------------------------------------------------------------------
    # 1) Identify TOP (most-funded) school and teacher per donor
    # ----------------------------------------------------------------------
    donor_index = dpr['donor_id'].unique()
    out = pd.DataFrame(index=pd.Index(donor_index, name='donor_id'))

    # Top school by total payment_amount
    if {'school_id', 'payment_amount'}.issubset(dpr.columns):
        school_agg = (
            dpr.groupby(['donor_id', 'school_id'])['payment_amount']
            .sum()
            .reset_index()
        )
        top_school = (
            school_agg
            .sort_values(['donor_id', 'payment_amount'], ascending=[True, False])
            .groupby('donor_id')
            .first()
            .reset_index()[['donor_id', 'school_id']]
        )
    else:
        top_school = pd.DataFrame(columns=['donor_id', 'school_id'])

    # Top teacher by total payment_amount
    if {'teacher_id', 'payment_amount'}.issubset(dpr.columns):
        teacher_agg = (
            dpr.groupby(['donor_id', 'teacher_id'])['payment_amount']
            .sum()
            .reset_index()
        )
        top_teacher = (
            teacher_agg
            .sort_values(['donor_id', 'payment_amount'], ascending=[True, False])
            .groupby('donor_id')
            .first()
            .reset_index()[['donor_id', 'teacher_id']]
        )
    else:
        top_teacher = pd.DataFrame(columns=['donor_id', 'teacher_id'])

    # If a donor has only one gift, the top school/teacher is simply that gift's school/teacher.

    # ----------------------------------------------------------------------
    # 2) Precompute which schools/teachers have projects active in horizon
    # ----------------------------------------------------------------------
    proj['active_in_horizon'] = (
        (proj['project_last_posted_date'] <= win_end) &
        (proj['end_date'] >= win_start)
    )

    # Guard against missing columns (both are expected to be present in the project data)
    if 'school_id' in proj.columns:
        school_active = (
            proj.groupby('school_id')['active_in_horizon']
            .any()
            .astype(int)
        )
    else:
        school_active = pd.Series(dtype=int)

    if 'teacher_id' in proj.columns:
        teacher_active = (
            proj.groupby('teacher_id')['active_in_horizon']
            .any()
            .astype(int)
        )
    else:
        teacher_active = pd.Series(dtype=int)

    # ----------------------------------------------------------------------
    # 3) Map top school/teacher to these activity flags
    # ----------------------------------------------------------------------
    # School: donor -> top school_id -> activity flag
    if not top_school.empty:
        school_flag = (
            top_school
            .set_index('donor_id')['school_id']
            .map(school_active)
            .reindex(donor_index)
            .fillna(0)
            .astype(int)
        )
    else:
        school_flag = pd.Series(0, index=donor_index)

    # Teacher: donor -> top teacher_id -> activity flag
    if not top_teacher.empty:
        teacher_flag = (
            top_teacher
            .set_index('donor_id')['teacher_id']
            .map(teacher_active)
            .reindex(donor_index)
            .fillna(0)
            .astype(int)
        )
    else:
        teacher_flag = pd.Series(0, index=donor_index)

    out['school_still_available_during_range'] = school_flag
    out['teacher_still_available_during_range'] = teacher_flag

    return out
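
The availability flag rests on a standard interval-overlap test: a project overlaps the horizon [T, T+H] when it was posted on or before the end of the horizon and ends on or after its start. The toy example below (all dates and IDs are made up) walks through the cases, including a project with no end date.

```python
import pandas as pd

T = pd.Timestamp("2025-01-01")
H = pd.Timedelta(days=90)
win_start, win_end = T, T + H

proj = pd.DataFrame({
    "teacher_id": ["t1", "t1", "t2", "t3", "t4"],
    "project_last_posted_date": pd.to_datetime(
        ["2024-06-01", "2025-02-15", "2024-09-01", "2025-05-01", "2024-12-01"]
    ),
    # end_date = earlier of funded/expired; None means neither date is known
    "end_date": pd.to_datetime(
        ["2024-10-01", "2025-06-01", "2025-01-10", "2025-08-01", None]
    ),
})

# Overlap test: posted before the window closes AND ends after it opens
proj["active_in_horizon"] = (
    (proj["project_last_posted_date"] <= win_end)
    & (proj["end_date"] >= win_start)
)

teacher_active = proj.groupby("teacher_id")["active_in_horizon"].any().astype(int)
print(teacher_active)
```

Teacher t4 is flagged as inactive because comparisons against NaT evaluate to False. If projects without a funded or expiration date should count as still open, the end date would need to be filled (for example with a far-future date) before this comparison. Whether that applies to the actual project data is not something this sketch can confirm.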

### 2.11 Project Outcomes & Matching

In [14]:
def _project_outcome_features(dpr_pre_T):
    """
    Build features describing project success and matching behavior.
    
    CORRECTED VERSION with:
    - mean_match_multiplier: subtract 1 from all values, impute 0 for records with optional_donation_rate
    
    Features created:
    - pct_projects_fully_funded: Success rate of projects supported
    - pct_gifts_with_match: How often donations are matched
    - mean_match_multiplier: Average EXCESS match (1.5 becomes 0.5), with 0s for optional donations
    - mean/median_project_total_cost: Scale of projects supported
    - median_donor_to_project_distance_mi: Geographic preference
    - pct_gifts_within_15mi: Local giving behavior
    - is_local_donor: Predominantly supports nearby schools
    
    Parameters
    ----------
    dpr_pre_T : pd.DataFrame
        Donor Project Records filtered to payment_date < T
        
    Returns
    -------
    features : pd.DataFrame
        Feature matrix indexed by donor_id
    """
    if dpr_pre_T.empty:
        return pd.DataFrame(index=pd.Index([], name='donor_id'))
    
    dpr = dpr_pre_T.copy()
    out_index = dpr['donor_id'].unique()
    out = pd.DataFrame(index=out_index)
    
    # --- Project success metrics ---
    if 'project_got_fully_funded' in dpr.columns:
        out['pct_projects_fully_funded'] = (
            dpr.groupby('donor_id')['project_got_fully_funded'].mean()
        )
    elif 'project_fully_funded' in dpr.columns:
        out['pct_projects_fully_funded'] = (
            dpr.groupby('donor_id')['project_fully_funded'].mean()
        )
    
    # --- Matching behavior ---
    # 1. Subtract 1 from all match_xyi_multiplier values (so 1.5 becomes 0.5)
    # 2. Impute 0 for records that have a value in optional_donation_rate
    
    if 'match_xyi_multiplier' in dpr.columns:
        # Step 1: Filter to rows with optional_donation_rate present
        dpr_with_optional = dpr[dpr['optional_donation_rate'].notna()].copy()
        
        if not dpr_with_optional.empty:
            # Step 2: Calculate match_excess (subtract 1)
            dpr_with_optional['match_excess'] = dpr_with_optional['match_xyi_multiplier'] - 1.0
            
            # Step 3: Fill nulls in match_excess with 0
            dpr_with_optional['match_excess'] = dpr_with_optional['match_excess'].fillna(0)
            
            # Step 4: Calculate mean per donor
            out['mean_match_multiplier'] = (
                dpr_with_optional.groupby('donor_id')['match_excess'].mean()
            )
        else:
            out['mean_match_multiplier'] = np.nan
        
        # Percent of gifts that received any match (before adjustments)
        had_match = dpr['match_xyi_multiplier'] > 1.0
        out['pct_gifts_with_match'] = had_match.groupby(dpr['donor_id']).mean()
    
    # --- Project cost metrics ---
    out['mean_project_total_cost'] = dpr.groupby('donor_id')['project_total_cost'].mean()
    out['median_project_total_cost'] = dpr.groupby('donor_id')['project_total_cost'].median()
    
    # --- Geographic patterns ---
    if 'distance_mi' in dpr.columns:
        out['median_donor_to_project_distance_mi'] = (
            dpr.groupby('donor_id')['distance_mi'].median()
        )
        
        # Local giving: within 15 miles
        dpr['is_within_15mi'] = dpr['distance_mi'] <= 15
        out['pct_gifts_within_15mi'] = (
            dpr.groupby('donor_id')['is_within_15mi'].mean()
        )
        
        # Predominantly local donor (>75% of gifts within 15 miles)
        out['is_local_donor'] = (out['pct_gifts_within_15mi'] > 0.75).astype(int)
    
    return out
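
The matching logic is easier to follow with numbers. The sketch below reproduces the excess-match steps on toy data: keep rows where `optional_donation_rate` is present, subtract 1 from the multiplier, treat a missing multiplier as no excess match, and average per donor. The data values are invented for illustration.

```python
import numpy as np
import pandas as pd

gifts = pd.DataFrame({
    "donor_id": ["a", "a", "a", "b"],
    "match_xyi_multiplier": [1.5, np.nan, 2.0, np.nan],
    "optional_donation_rate": [0.15, 0.15, np.nan, 0.10],
})

# Only rows with an optional_donation_rate contribute to the mean
eligible = gifts[gifts["optional_donation_rate"].notna()].copy()
eligible["match_excess"] = (eligible["match_xyi_multiplier"] - 1.0).fillna(0)
mean_excess = eligible.groupby("donor_id")["match_excess"].mean()

# Share of all gifts with a multiplier above 1 (missing multipliers count as unmatched)
had_match = gifts["match_xyi_multiplier"] > 1.0
pct_with_match = had_match.groupby(gifts["donor_id"]).mean()

print(mean_excess)
print(pct_with_match)
```

Note that donor a's 2.0x gift is excluded from the mean because its `optional_donation_rate` is missing, yet it still counts toward `pct_with_match`. The two features are therefore computed over different sets of gifts.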

### 2.12 Latest Donation

In [15]:
def _latest_donation_features(dpr_pre_T, df_share, T):
    """
    Extract features from each donor's most recent donation before time T.
    
    This captures the "state" of the donor at their last interaction,
    which can be highly predictive of near-term behavior.
    
    Features include:
    - Amount split by green vs non-green
    - Project characteristics (cost, category, grade, fully funded status)
    - Teacher metrics
    - Payment type and optional donation behavior
    - Geographic distance
    - Referral channel (how they arrived)
    - Social sharing behavior in the month of latest donation
    
    Parameters
    ----------
    dpr_pre_T : pd.DataFrame
        Donor Project Records filtered to payment_date < T
        Expected columns: donor_id, payment_date, payment_amount, 
        is_green_payment, project_total_cost, project_got_fully_funded,
        teacher_lifetime_projects_fully_funded, gift_is_projects_first,
        gift_is_projects_last, optional_donation_rate, payment_type,
        project_category, project_grade, referral_source, 
        referral_medium, donor_lat_long, school_lat_long
    df_share : pd.DataFrame
        Share events with columns: donor_id, share_sent_month
        share_sent_month format: "YYYY-MM" (e.g., "2022-07")
    T : pd.Timestamp
        Reference time (as-of date)
        
    Returns
    -------
    features : pd.DataFrame
        Feature matrix indexed by donor_id with 'latest_*' columns
    """
    
    if dpr_pre_T.empty:
        return pd.DataFrame(index=pd.Index([], name='donor_id'))
    
    # =========================================================================
    # Get most recent donation per donor
    # =========================================================================
    
    latest = (
        dpr_pre_T
        .sort_values(['donor_id', 'payment_date'])
        .groupby('donor_id', as_index=False)
        .last()
        .set_index('donor_id')
    )
    
    out = pd.DataFrame(index=latest.index)
    
    # =========================================================================
    # Amount features: split by green vs non-green
    # =========================================================================
    
    # is_green_payment is string "t" or "f"
    is_green = latest['is_green_payment'].isin(['t', 'T', True, 1])
    
    out['latest_gift_amount_green'] = np.where(
        is_green, 
        latest['payment_amount'], 
        0
    )
    
    out['latest_gift_amount_nongreen'] = np.where(
        ~is_green,
        latest['payment_amount'],
        0
    )
    
    # =========================================================================
    # Project characteristics
    # =========================================================================
    
    out['latest_match_xyi_multiplier'] = latest.get('match_xyi_multiplier', np.nan)
    out['latest_project_total_cost'] = latest['project_total_cost']
    out['latest_project_got_fully_funded'] = latest['project_got_fully_funded'].astype(int)
    
    # Teacher metrics at time of latest donation
    out['latest_teacher_lifetime_projects_fully_funded'] = latest.get(
        'teacher_lifetime_projects_fully_funded', np.nan
    )
    
    # Position in project funding sequence
    out['latest_donation_is_projects_first'] = latest['gift_is_projects_first'].astype(int)
    out['latest_donation_is_projects_last'] = latest['gift_is_projects_last'].astype(int)
    
    # =========================================================================
    # Payment type and behavior
    # =========================================================================
    
    out['latest_payment_type_is_green'] = is_green.astype(int)
    
    # Optional donation as percentage of total
    out['latest_optional_donation_percent'] = latest.get('optional_donation_rate', 0)
    
    # Gift card purchase flag
    if 'gift_card_purchase' in latest.columns:
        out['latest_is_giftcard_purchase'] = (
            latest['gift_card_purchase'].fillna(0).astype(int)
        )
    else:
        out['latest_is_giftcard_purchase'] = 0
    
    # =========================================================================
    # Categorical features (for one-hot encoding downstream)
    # =========================================================================
    
    out['latest_project_grade'] = latest['project_grade']
    out['latest_project_category'] = latest['project_category']
    
    # =========================================================================
    # Referral channel (same logic as Repeat Donor Behaviors)
    # =========================================================================
    
    valid_media = {
        'email', 'directlink', 'facebook', 'nextdoor',
        'sharetray', 'mobilesharetray', 'page', 'ig', 'sendfriend'
    }
    valid_sources = {'dc', 'google'}
    
    # Use referral_medium when it is one of valid_media; otherwise fall back to
    # referral_source, collapsing anything other than dc/google (including
    # missing values) to "Oth"
    referral_source_clean = latest.get(
        'referral_source', pd.Series(index=latest.index, dtype=object)
    )
    referral_source_clean = referral_source_clean.where(
        referral_source_clean.isin(valid_sources),
        'Oth'  # Replace non-dc/google sources with "Oth"
    )
    
    referral_medium = latest.get(
        'referral_medium', pd.Series(index=latest.index, dtype=object)
    )
    
    out['latest_referral_channel'] = np.where(
        referral_medium.isin(valid_media),
        referral_medium,
        referral_source_clean
    )
    
    # =========================================================================
    # Distance to school
    # =========================================================================
    
    if 'distance_mi' in latest.columns:
        out['latest_distance_mi'] = latest['distance_mi']
    else:
        out['latest_distance_mi'] = np.nan
    
    # =========================================================================
    # Social sharing in month of latest donation
    # =========================================================================
    
    if df_share is not None and not df_share.empty:
        # Extract year-month from latest donation date
        latest_month = pd.to_datetime(latest['payment_date']).dt.to_period('M').astype(str)
        
        # Create a mapping of donor_id -> set of months they shared
        share_months = (
            df_share
            .groupby('donor_id')['share_sent_month']
            .apply(set)
            .to_dict()
        )
        
        # For each donor, check if they shared in their latest donation month
        out['latest_shared_any'] = out.index.map(
            lambda donor_id: int(
                latest_month.loc[donor_id] in share_months.get(donor_id, set())
                if donor_id in latest_month.index
                else 0
            )
        )
    else:
        out['latest_shared_any'] = 0
    
    return out
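
A pandas detail that matters when picking the most recent row per donor: `GroupBy.last()` skips nulls column by column, while `drop_duplicates(keep='last')` keeps the newest row intact. The toy frame below is made up for illustration and is not part of the pipeline; it is only a sketch of the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: donor 1 has two gifts, the newer one lacks a referral_medium
gifts = pd.DataFrame({
    "donor_id": [1, 1, 2],
    "payment_date": pd.to_datetime(["2024-01-05", "2024-03-10", "2024-02-01"]),
    "payment_amount": [25.0, 50.0, 10.0],
    "referral_medium": ["email", np.nan, "facebook"],
})

ordered = gifts.sort_values(["donor_id", "payment_date"])

# Last non-null value per column: referral_medium comes from the older gift
via_last = ordered.groupby("donor_id").last()

# Whole newest row per donor: referral_medium stays missing
via_row = ordered.drop_duplicates("donor_id", keep="last").set_index("donor_id")

print(via_last.loc[1, "referral_medium"])
print(via_row.loc[1, "referral_medium"])
```

Which behavior is preferable depends on whether a `latest_*` feature should describe strictly the newest gift or the most recent known value.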

### 2.13 Labels

In [16]:
def _build_labels(df_dpr, df_monthly, df_share, T, H):
    """
    Build label/target variables for future horizon H.
    
    This is a basic implementation covering repeat giving labels.
    You can extend this for:
    - Monthly program labels (became_monthly, churned_monthly)
    - Share labels (shared_in_H)
    - Upgrade/downgrade labels
    - High-value donor labels
    
    Parameters
    ----------
    df_dpr : pd.DataFrame
        Complete donation records (not pre-filtered)
    df_monthly : pd.DataFrame
        Monthly subscription records
    df_share : pd.DataFrame
        Share events
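    T : pd.Timestamp
        Start of label window (inclusive)
    H : pd.Timedelta or int
        Length of label window; an int is interpreted as days
        (e.g., 365 for a 12-month prediction)
        
    Returns
    -------
    labels : pd.DataFrame
        Label matrix indexed by donor_id. Only donors with at least one gift
        in [T, T+H) appear; everyone else is absent from the index.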
    """
    T = pd.to_datetime(T)
    if isinstance(H, int):
        H = pd.Timedelta(days=H)
    
    start = T
    end = T + H
    
    # --- Repeat giving labels ---
    df_dpr = df_dpr.copy()
    df_dpr['payment_date'] = pd.to_datetime(df_dpr['payment_date'])
    
    # Filter to label window
    dpr_label = df_dpr[
        (df_dpr['payment_date'] >= start) &
        (df_dpr['payment_date'] < end)
    ]
    
    by_label = dpr_label.groupby('donor_id')['payment_amount']
    
    # Basic repeat giving labels
    gave_any_in_H = by_label.size().gt(0).astype(int)
    gift_count_in_H = by_label.size()
    gift_amount_in_H = by_label.sum()
    median_gift_in_H = by_label.median()
    
    labels = pd.DataFrame(index=gave_any_in_H.index)
    labels['gave_any_in_H'] = gave_any_in_H
    labels['gift_count_in_H'] = gift_count_in_H
    labels['gift_amount_in_H'] = gift_amount_in_H
    labels['median_gift_amount_in_H'] = median_gift_in_H
    
    # Could add more labels here:
    # - became_monthly_in_H (from df_monthly)
    # - churned_monthly_in_H
    # - shared_in_H (from df_share)
    # - upgrade_in_H (median_gift_in_H > median_gift_12m)
    
    return labels
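
The label window is half-open, `[T, T+H)`, and a groupby over that window only returns donors who gave in it. The sketch below uses made-up gifts and a hypothetical `donor_universe` index to show one way to turn "absent" into an explicit zero; it is an illustration, not the notebook's own label logic.

```python
import pandas as pd

T = pd.Timestamp("2024-07-01")
H = pd.Timedelta(days=90)  # window end is 2024-09-29, exclusive

# Hypothetical gifts around the window edges
gifts = pd.DataFrame({
    "donor_id": [1, 1, 2, 3],
    "payment_date": pd.to_datetime(
        ["2024-06-30", "2024-07-01", "2024-09-29", "2024-08-15"]
    ),
    "payment_amount": [20.0, 30.0, 15.0, 40.0],
})

in_window = gifts[(gifts["payment_date"] >= T) & (gifts["payment_date"] < T + H)]

per_donor = in_window.groupby("donor_id")["payment_amount"].agg(
    gift_count_in_H="count",
    gift_amount_in_H="sum",
)

# Hypothetical modeling population; donors 2 and 4 have no gift in the window
donor_universe = pd.Index([1, 2, 3, 4], name="donor_id")
labels = per_donor.reindex(donor_universe, fill_value=0)
labels["gave_any_in_H"] = (labels["gift_count_in_H"] > 0).astype(int)

print(labels)
```

Donor 1's gift on the day before `T` and donor 2's gift exactly at `T+H` both fall outside the window, which is the behavior the `>=` / `<` pair is meant to give.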

## 3. Main Construction Functions

### 3.1 Build Features

In [17]:
def build_features(
    df_dpr,
    df_email,
    df_site,
    df_monthly,
    df_share,
    df_zip_acs,
    df_project_dates,
    E,
    T,
    H=None
):
    """
    Build complete donor-level feature matrix as of time T.
    
    This is the main entry point for feature engineering. It:
    1. Normalizes all timestamps
    2. Optionally restricts donors to those active in the E months before T
    3. Filters data to events before T
    4. Creates windowed subsets (3m, 12m, 36m)
    5. Calls helper functions for each feature group
    6. Optionally builds labels for horizon H
    
    IMPORTANT: This does NOT do final imputation or encoding.
    You should call finalize_features() after this to handle:
    - Missing value imputation
    - Categorical encoding
    - Feature scaling (if desired)
    
    Parameters
    ----------
    df_dpr : pd.DataFrame
        Donor Project Records with columns:
        - donor_id, payment_date, payment_amount, donation_n
        - donor_zip, is_teacher, is_teacher_referred
        - teacher_id, school_id, project_id, project_category, etc.
    df_email : pd.DataFrame
        Email Events 12mo with columns:
        - donor_id, email_sent_month
        - email_sent_count, email_open_count, email_click_count
    df_site : pd.DataFrame
        Site Events with columns:
        - donor_id, activity_date
        - device_type, came_from_campaign
        - project_page_visits_day, teacher_page_visits_day, etc.
    df_monthly : pd.DataFrame
        Monthly DonationLevel with columns:
        - donor_id, monthly_subscription_joined_date
        - monthly_subscription_retired_date
        - monthly_subscription_payment_amount, charge_date
    df_share : pd.DataFrame
        Share Events with columns:
        - donor_id, share_sent_month, share_event_count
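    df_zip_acs : pd.DataFrame
        ZIP-level ACS demographics with columns:
        - ZIP5 (index), pct_households_with_children, unemployment_rate, etc.
    df_project_dates : pd.DataFrame or None
        Project date records used for the school/teacher availability
        features. If None or empty, both availability flags are set to 0.
    E : int or None
        Eligibility window in months. If provided, only donors with a project
        donation in [T-E months, T), or a monthly subscription joined before T
        and not retired before T-E months, are included. If None, all donors
        in df_dpr and df_monthly are included.
    T : pd.Timestamp or str
        Reference time (as-of date) for feature computation
        All features use only data strictly before this timestamp
    H : pd.Timedelta, int, or None
        Label horizon (e.g., pd.Timedelta(days=365) or 365)
        If provided, labels will be computed for window [T, T+H)
        If None, the availability features use a 365-day horizon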
        
    Returns
    -------
    features : pd.DataFrame
        Donor-level feature matrix indexed by donor_id
        Contains ~150 columns across 13 feature groups, plus labels if H is given
        May contain NaN values that should be imputed
        
    Example
    -------
    >>> T = pd.Timestamp('2024-01-01')
    >>> H = pd.Timedelta(days=365)  # 12-month prediction
    >>> features = build_features(
    ...     df_dpr, df_email, df_site, df_monthly, df_share, df_zip_acs,
    ...     df_project_dates, E=12, T=T, H=H
    ... )
    >>> print(features.shape)
    (100000, 152)  # 100k donors, 152 features
    """
    # =====================================================================
    # SETUP: Convert dates and normalize T
    # =====================================================================
    
    T = pd.to_datetime(T)
    
    # Convert all date columns upfront (before eligibility filtering)
    df_dpr = df_dpr.copy()
    df_dpr['payment_date'] = pd.to_datetime(df_dpr['payment_date'])
    
    df_monthly = df_monthly.copy()
    df_monthly['monthly_subscription_joined_date'] = pd.to_datetime(
        df_monthly['monthly_subscription_joined_date']
    )
    df_monthly['monthly_subscription_retired_date'] = pd.to_datetime(
        df_monthly['monthly_subscription_retired_date']
    )
    if 'charge_date' in df_monthly.columns:
        df_monthly['charge_date'] = pd.to_datetime(df_monthly['charge_date'])
    
    # =====================================================================
    # ELIGIBILITY FILTERING (if requested)
    # =====================================================================
    
    if E is not None:
        eligibility_start = T - pd.DateOffset(months=E)
        
        # Donors with project donations in eligibility window
        active_project_donors = df_dpr[
            (df_dpr['payment_date'] >= eligibility_start) &
            (df_dpr['payment_date'] < T)
        ]['donor_id'].unique()
        
        # Donors active in monthly program during eligibility window
        active_monthly_donors = df_monthly[
            (df_monthly['monthly_subscription_joined_date'] < T) &
            (df_monthly['monthly_subscription_retired_date'].isna() | 
             (df_monthly['monthly_subscription_retired_date'] >= eligibility_start))
        ]['donor_id'].unique()
        
        # Union of active donors
        base_donor_ids = pd.Index(
            pd.unique(
                pd.concat([
                    pd.Series(active_project_donors),
                    pd.Series(active_monthly_donors)
                ], ignore_index=True)
            ),
            name='donor_id'
        )        

    else:
        # No eligibility filter: include every donor found in either source
        base_donor_ids = pd.Index(
            pd.unique(
                pd.concat([
                    df_dpr['donor_id'],
                    df_monthly['donor_id']
                ], ignore_index=True)
            ),
            name='donor_id'
        )
    
    # Create master feature dataframe
    features = pd.DataFrame(index=base_donor_ids)
    
    # Restrict all event tables to the base donor universe
    df_dpr = df_dpr[df_dpr['donor_id'].isin(base_donor_ids)].copy()
    df_email = df_email[df_email['donor_id'].isin(base_donor_ids)].copy()
    df_site = df_site[df_site['donor_id'].isin(base_donor_ids)].copy()
    df_monthly = df_monthly[df_monthly['donor_id'].isin(base_donor_ids)].copy()
    df_share = df_share[df_share['donor_id'].isin(base_donor_ids)].copy()

    # =====================================================================
    # NORMALIZE DATES & FILTER PRE-T
    # =====================================================================
    
    # DPR: main donation records (payment_date was already converted in SETUP)
    dpr_pre_T = df_dpr[df_dpr['payment_date'] < T].copy()


    # Precompute distances once using direct lat/lon → miles
    if {'donor_lat_long', 'school_lat_long'}.issubset(dpr_pre_T.columns):

        # Parse "lat,lon" strings into numeric columns; values that cannot be
        # parsed become NaN instead of raising
        dpr_pre_T['donor_lat_long'] = dpr_pre_T['donor_lat_long'].astype(str)
        dpr_pre_T['school_lat_long'] = dpr_pre_T['school_lat_long'].astype(str)

        # n=1 splits on the first comma; reindex guarantees both columns exist
        donor_lat_lon = (
            dpr_pre_T['donor_lat_long']
            .str.split(',', n=1, expand=True)
            .reindex(columns=[0, 1])
        )
        school_lat_lon = (
            dpr_pre_T['school_lat_long']
            .str.split(',', n=1, expand=True)
            .reindex(columns=[0, 1])
        )

        dpr_pre_T['donor_lat'] = pd.to_numeric(donor_lat_lon[0], errors='coerce')
        dpr_pre_T['donor_lon'] = pd.to_numeric(donor_lat_lon[1], errors='coerce')
        dpr_pre_T['school_lat'] = pd.to_numeric(school_lat_lon[0], errors='coerce')
        dpr_pre_T['school_lon'] = pd.to_numeric(school_lat_lon[1], errors='coerce')

        # Direct haversine distance from donor to project
        dpr_pre_T['distance_mi'] = haversine_miles(
            dpr_pre_T['donor_lat'],
            dpr_pre_T['donor_lon'],
            dpr_pre_T['school_lat'],
            dpr_pre_T['school_lon'],
        )
    else:
        dpr_pre_T['distance_mi'] = np.nan
    
    # Site events
    df_site = df_site.copy()
    df_site['activity_date'] = pd.to_datetime(df_site['activity_date'])
    site_pre_T = df_site[df_site['activity_date'] < T].copy()
    
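    # Share events (monthly aggregates). A month is kept when its first day is
    # before T, so pick T on a month boundary to avoid counting activity after T.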
    df_share = df_share.copy()
    df_share['share_month_start'] = pd.to_datetime(
        df_share['share_sent_month']
    ).dt.to_period('M').dt.to_timestamp()
    share_pre_T = df_share[df_share['share_month_start'] < T].copy()
    
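    # Email events (monthly aggregates). A month is kept when its first day is
    # before T, so pick T on a month boundary to avoid counting activity after T.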
    df_email = df_email.copy()
    df_email['email_month_start'] = pd.to_datetime(
        df_email['email_sent_month']
    ).dt.to_period('M').dt.to_timestamp()
    email_pre_T = df_email[df_email['email_month_start'] < T].copy()
    
    # Monthly subscriptions (date columns were already converted in SETUP)
    md_pre_T = df_monthly[
        df_monthly['monthly_subscription_joined_date'] < T
    ].copy()
    
    # =====================================================================
    # WINDOW BOUNDARIES
    # =====================================================================
    # These define our lookback periods for windowed features
    
    W_short_start = T - pd.DateOffset(months=3)   # [T-3m, T)
    W_mid_start = T - pd.DateOffset(months=12)    # [T-12m, T)
    W_long_start = T - pd.DateOffset(months=36)   # [T-36m, T)
    
    # Create windowed DPR subsets for counts, amounts, velocities
    dpr_3m = dpr_pre_T[dpr_pre_T['payment_date'] >= W_short_start]
    dpr_12m = dpr_pre_T[dpr_pre_T['payment_date'] >= W_mid_start]
    dpr_36m = dpr_pre_T[dpr_pre_T['payment_date'] >= W_long_start]
    
    # Intermediate periods for velocity calculations
    dpr_3to12m = dpr_pre_T[
        (dpr_pre_T['payment_date'] >= W_mid_start) &
        (dpr_pre_T['payment_date'] < W_short_start)
    ]
    dpr_12to36m = dpr_pre_T[
        (dpr_pre_T['payment_date'] >= W_long_start) &
        (dpr_pre_T['payment_date'] < W_mid_start)
    ]
    
    # Email and site windows
    email_12m = email_pre_T[email_pre_T['email_month_start'] >= W_mid_start]
    email_3m = email_pre_T[email_pre_T['email_month_start'] >= W_short_start]
    site_3m = site_pre_T[site_pre_T['activity_date'] >= W_short_start]
    share_12m = share_pre_T[share_pre_T['share_month_start'] >= W_mid_start]
    
    # =====================================================================
    # BUILD FEATURE GROUPS
    # =====================================================================
    # Each join adds a group of related features
    # Using left joins to preserve all donor_ids
    
    # 1. Identity & ZIP / ACS demographics
    features = features.join(
        _identity_and_zip_features(dpr_pre_T, df_zip_acs, md_pre_T),
        how='left'
    )
    
    # 2. Lifetime giving behavior & tenure
    features = features.join(
        _lifetime_giving_features(dpr_pre_T, T),
        how='left'
    )
    
    # 3. Windowed giving & velocity trends
    features = features.join(
        _windowed_giving_features(
            dpr_pre_T, dpr_3m, dpr_12m, dpr_36m, dpr_3to12m, dpr_12to36m, T
        ),
        how='left'
    )
    
    # 4. Channel/payment type mix
    features = features.join(
        _channel_mix_features(dpr_pre_T, dpr_12m),
        how='left'
    )
    
    # 5. Monthly subscription program
    features = features.join(
        _monthly_features(md_pre_T, dpr_pre_T, dpr_12m, T),
        how='left'
    )
    
    # 6. Teacher/school/content preferences
    features = features.join(
        _teacher_school_features(dpr_pre_T),
        how='left'
    )
    
    # 7. Seasonality & rhythm patterns
    features = features.join(
        _seasonality_features(dpr_pre_T),
        how='left'
    )
    
    # 8. Email engagement
    features = features.join(
        _email_features(email_pre_T, email_3m, email_12m, T),
        how='left'
    )
    
    # 9. Site behavior
    features = features.join(
        _site_features(site_pre_T, site_3m, T),
        how='left'
    )
    
    # 10. Share events
    features = features.join(
        _share_features(share_pre_T, share_12m, T),
        how='left'
    )
    
    # 11. Project outcomes & matching
    features = features.join(
        _project_outcome_features(dpr_pre_T),
        how='left'
    )

    # 12. School/teacher availability (uses horizon H, or 365 days if H is None)
    if df_project_dates is not None and not df_project_dates.empty:
        horizon = H if H is not None else pd.Timedelta(days=365)
        features = features.join(
            _future_opportunity_features(dpr_pre_T, df_project_dates, T, horizon),
            how='left'
        )
    else:
        features['school_still_available_during_range'] = 0
        features['teacher_still_available_during_range'] = 0

    # 13. Latest donation
    f_latest = _latest_donation_features(
        dpr_pre_T=dpr_pre_T,
        df_share=df_share,
        T=T
    )
    features = features.join(f_latest, how='left')
    
    # =====================================================================
    # OPTIONAL: BUILD LABELS
    # =====================================================================
    # If H is provided, create labels for prediction horizon
    
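    if H is not None:
        labels = _build_labels(df_dpr, df_monthly, df_share, T, H)
        features = features.join(labels, how='left')
        # Donors with no gift in [T, T+H) are absent from `labels`; after the
        # left join they are NaN, so set their flag, count and amount to 0.
        # median_gift_amount_in_H stays NaN because it is undefined without gifts.
        zero_fill_cols = ['gave_any_in_H', 'gift_count_in_H', 'gift_amount_in_H']
        features[zero_fill_cols] = features[zero_fill_cols].fillna(0)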
    
    return features
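
The window boundaries in `build_features` are calendar offsets from `T`, and every window is half-open on the right. As a minimal sketch (toy dates, hypothetical `window_label` helper), this is how individual gift dates fall into the disjoint 0-3m, 3-12m and 12-36m bands used for the velocity features:

```python
import pandas as pd

T = pd.Timestamp("2024-07-01")
short_start = T - pd.DateOffset(months=3)   # 2024-04-01
mid_start = T - pd.DateOffset(months=12)    # 2023-07-01
long_start = T - pd.DateOffset(months=36)   # 2021-07-01

# Hypothetical gift dates chosen to sit exactly on the boundaries
dates = pd.Series(pd.to_datetime(
    ["2021-06-30", "2021-07-01", "2023-07-01", "2024-04-01", "2024-07-01"]
))

def window_label(d):
    """Assign a date to the narrowest band it belongs to (illustrative helper)."""
    if d >= T:
        return "at_or_after_T"   # excluded from features
    if d >= short_start:
        return "0-3m"
    if d >= mid_start:
        return "3-12m"
    if d >= long_start:
        return "12-36m"
    return "older"

bands = dates.map(window_label)
print(pd.DataFrame({"payment_date": dates, "band": bands}))
```

A gift dated exactly on a window start belongs to that window, and a gift dated exactly `T` belongs to none of them, matching the `>= start` and `< T` comparisons in the function.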

### 3.2 Filter Cohorts

In [18]:
def apply_cohort_filters(features, cohort_definitions):
    """
    Filter feature dataframes by cohort definitions.
    
    Parameters
    ----------
    features : pd.DataFrame
        Feature matrix from build_features()
    cohort_definitions : list of dict
        Each dict defines a cohort with keys:
        - 'name': str - cohort identifier (used in output keys)
        - 'filters': dict - {column_name: filter_spec}
        
        Filter specs can be:
        - scalar: exact match (e.g., {'is_teacher': 1})
        - tuple ('op', value): comparison operation
          Supported ops: '>', '>=', '<', '<=', '==', '!='
          (e.g., {'lifetime_gift_count': ('>', 3)})
        - tuple ('between', low, high): inclusive range
          (e.g., {'tenure_years': ('between', 1, 5)})
        - list: isin check (e.g., {'state': ['CA', 'NY', 'TX']})
    
    Returns
    -------
    dict of pd.DataFrame
        Keys are cohort names, values are filtered feature dataframes
    
    Examples
    --------
    >>> cohorts = [
    ...     {
    ...         'name': 'teachers_3plus',
    ...         'filters': {
    ...             'is_teacher': 1,
    ...             'lifetime_gift_count': ('>', 3)
    ...         }
    ...     },
    ...     {
    ...         'name': 'high_value_1to3yrs',
    ...         'filters': {
    ...             'lifetime_amount': ('>=', 500),
    ...             'tenure_years': ('between', 1, 3)
    ...         }
    ...     },
    ...     {
    ...         'name': 'all_donors',
    ...         'filters': {}  # No filtering
    ...     }
    ... ]
    >>> 
    >>> result = apply_cohort_filters(train_features, cohorts)
    >>> print(result.keys())
    dict_keys(['teachers_3plus', 'high_value_1to3yrs', 'all_donors'])
    >>> print(result['teachers_3plus'].shape)
    (12543, 153)
    """
    results = {}
    
    for cohort in cohort_definitions:
        name = cohort['name']
        filters = cohort.get('filters', {})
        
        # Start with all rows
        mask = pd.Series(True, index=features.index)
        
        # Apply each filter
        for col, spec in filters.items():
            if col not in features.columns:
                raise ValueError(f"Column '{col}' not found in features")
            
            col_data = features[col]
            
            # Handle different filter specifications
            if isinstance(spec, tuple):
                op = spec[0]
                
                if op == 'between' and len(spec) == 3:
                    # Range filter
                    low, high = spec[1], spec[2]
                    mask &= (col_data >= low) & (col_data <= high)
                
                elif op in ('>', '>=', '<', '<=', '==', '!='):
                    # Comparison filter
                    value = spec[1]
                    if op == '>':
                        mask &= col_data > value
                    elif op == '>=':
                        mask &= col_data >= value
                    elif op == '<':
                        mask &= col_data < value
                    elif op == '<=':
                        mask &= col_data <= value
                    elif op == '==':
                        mask &= col_data == value
                    elif op == '!=':
                        mask &= col_data != value
                else:
                    raise ValueError(f"Unknown operator: {op}")
            
            elif isinstance(spec, list):
                # isin filter
                mask &= col_data.isin(spec)
            
            else:
                # Exact match (scalar)
                mask &= col_data == spec
        
        # Apply mask and store
        results[name] = features[mask].copy()
        
        # mask.mean() is the share of rows kept; guard against an empty frame
        pct = mask.mean() * 100 if len(features) else 0.0
        print(f"Cohort '{name}': {mask.sum():,} donors ({pct:.1f}%)")
    
    return results
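
To make the filter-spec grammar concrete, here is a compact standalone sketch that turns one spec into a boolean mask. The helper name `spec_to_mask`, the operator table and the toy data are assumptions for illustration; this is an alternative formulation, not the function defined above.

```python
import operator

import numpy as np
import pandas as pd

# Map the supported comparison strings to functions from the standard library
_OPS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}

def spec_to_mask(col, spec):
    """Return a boolean mask for one column and one filter spec (hypothetical helper)."""
    if isinstance(spec, tuple):
        if spec[0] == "between" and len(spec) == 3:
            return col.between(spec[1], spec[2])  # inclusive on both ends
        if spec[0] in _OPS and len(spec) == 2:
            return _OPS[spec[0]](col, spec[1])
        raise ValueError(f"Unknown or malformed filter spec: {spec!r}")
    if isinstance(spec, list):
        return col.isin(spec)
    return col == spec  # scalar: exact match

toy = pd.DataFrame(
    {
        "is_teacher": [1, 0, 1, 1],
        "lifetime_gift_count": [5, 10, 2, np.nan],
        "tenure_years": [0.5, 1.0, 5.0, 7.2],
    },
    index=pd.Index([101, 102, 103, 104], name="donor_id"),
)

teachers_3plus = (
    spec_to_mask(toy["is_teacher"], 1)
    & spec_to_mask(toy["lifetime_gift_count"], (">", 3))
)
mid_tenure = spec_to_mask(toy["tenure_years"], ("between", 1, 5))

print(toy[teachers_3plus].index.tolist())
print(toy[mid_tenure].index.tolist())
```

Note that a missing value fails every comparison except `!=`, so donors with NaN in a filtered column drop out of most cohorts unless the column is imputed first.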

## 4. Usage

Here's how to use the pipeline with your data.

In [19]:
'''# Example workflow (uncomment and adapt to your data)

# 1. Load your data
df_dpr = pd.read_csv('/Users/matt.fritz/Desktop/DonorProjectRecords_251118.csv')
df_email = pd.read_csv('/Users/matt.fritz/Desktop/Email Events 36mo.csv')
df_site = pd.read_csv('/Users/matt.fritz/Desktop/Site Events FY25-26.csv')
df_monthly = pd.read_csv('/Users/matt.fritz/Desktop/Monthly Donation Data All Time.csv')
df_share = pd.read_csv('/Users/matt.fritz/Desktop/Share Events All Time.csv')
df_project_dates = pd.read_csv('/Users/matt.fritz/Desktop/Project Dates FY22-26.csv')
df_zip_acs = pd.read_csv('/Users/matt.fritz/Desktop/Merged_Zip_ACS_Demographics.csv')

# 2. Define reference time, horizon, and eligibility window
E = 12                              # Only donors who gave in E months prior to T are eligible
T = pd.Timestamp('2025-11-13')      # Training features end date
H = pd.Timedelta(days=90)           # Label prediction horizon
O = pd.Timedelta(days=365)          # Out-of-time lag to apply to T

# 3. Build features
train_features = build_features(
    df_dpr=df_dpr,
    df_email=df_email,
    df_site=df_site,
    df_monthly=df_monthly,
    df_share=df_share,
    df_project_dates=df_project_dates,
    df_zip_acs=df_zip_acs,
    E=E,
    T=T,
    H=H
)
oot_features = build_features(
    df_dpr=df_dpr,
    df_email=df_email,
    df_site=df_site,
    df_monthly=df_monthly,
    df_share=df_share,
    df_project_dates=df_project_dates,
    df_zip_acs=df_zip_acs,
    E=E,
    T=T+O,
    H=H
)

# 4. Finalize for modeling
# X, y = finalize_features(
#     train_features,
#     label_cols=['gave_any_in_H', 'gift_amount_in_H'],
#     numeric_impute_strategy='median'
# )'''


## 5. Cohort Filtering

In [20]:
# 1. Load your data
df_dpr = pd.read_csv('/Users/matt.fritz/Desktop/DonorProjectRecords_251118.csv')
df_email = pd.read_csv('/Users/matt.fritz/Desktop/Email Events 36mo.csv')
df_site = pd.read_csv('/Users/matt.fritz/Desktop/Site Events FY25-26.csv')
df_monthly = pd.read_csv('/Users/matt.fritz/Desktop/Monthly Donation Data All Time.csv')
df_share = pd.read_csv('/Users/matt.fritz/Desktop/Share Events All Time.csv')
df_project_dates = pd.read_csv('/Users/matt.fritz/Desktop/Project Dates FY22-26.csv')
df_zip_acs = pd.read_csv('/Users/matt.fritz/Desktop/Merged_Zip_ACS_Demographics.csv')

# 2. Define reference time, horizon, and eligibility window
E = 12                              # Only donors who gave in E months prior to T are eligible
T = pd.Timestamp('2024-07-01')      # Training features end date 
H = pd.Timedelta(days= 90)          # Label prediction horizon
O = pd.Timedelta(days=365)          # Out-of-time lag to apply to T

'''
                                    ELIGIBILITY
                              T-E | ----------- |
                                  
                              TRAINING FEATURES       TRAINING LABEL
  | ALL HISTORY ------------------------------- | T | -------------- | T+H
                                                      
                                                      OOT LAG
                                                    | -- O -- |
                                                    
                                                 OOT FEATURES           OOT LABEL
  | ALL HISTORY --------------------------------------------- | T+O | -------------- | T+O+H
'''

# 3. Define your cohorts
cohorts = [
    {
        'name': 'all_donors',
        'filters': {}
    },
    {
        'name': 'teachers_active',
        'filters': {
            'is_teacher': 1,
            'lifetime_gift_count': ('>', 3)
        }
    },
    {
        'name': 'high_ltv_mid_tenure',
        'filters': {
            'lifetime_amount': ('>=', 500),
            'tenure_years': ('between', 1, 5)
        }
    },
    {
        'name': 'monthly_only_subscribers',
        'filters': {
            'is_monthly_only_donor': 1
        }
    }
]

# 4. Build features
train_features = build_features(
    df_dpr=df_dpr,
    df_email=df_email,
    df_site=df_site,
    df_monthly=df_monthly,
    df_share=df_share,
    df_project_dates=df_project_dates,
    df_zip_acs=df_zip_acs,
    E=E,
    T=T,
    H=H
)
oot_features = build_features(
    df_dpr=df_dpr,
    df_email=df_email,
    df_site=df_site,
    df_monthly=df_monthly,
    df_share=df_share,
    df_project_dates=df_project_dates,
    df_zip_acs=df_zip_acs,
    E=E,
    T=T+O,
    H=H
)

# 5. Generate cohorts
train_cohorts = apply_cohort_filters(train_features, cohorts)
oot_cohorts = apply_cohort_filters(oot_features, cohorts)

# train_cohorts and oot_cohorts are dictionaries of dataframes keyed by cohort name
for cohort in cohorts:
    name = cohort['name']
    print(f"\n{name}:")
    print(f"  Train: {train_cohorts[name].shape}")
    print(f"  OOT:   {oot_cohorts[name].shape}")

Cohort 'all_donors': 25,332 donors (100.0%)
Cohort 'teachers_active': 2,375 donors (9.4%)
Cohort 'high_ltv_mid_tenure': 3,746 donors (14.8%)
Cohort 'monthly_only_subscribers': 10,792 donors (42.6%)
Cohort 'all_donors': 30,325 donors (100.0%)
Cohort 'teachers_active': 3,360 donors (11.1%)
Cohort 'high_ltv_mid_tenure': 4,968 donors (16.4%)
Cohort 'monthly_only_subscribers': 9,447 donors (31.2%)

all_donors:
  Train: (25332, 169)
  OOT:   (30325, 180)

teachers_active:
  Train: (2375, 169)
  OOT:   (3360, 180)

high_ltv_mid_tenure:
  Train: (3746, 169)
  OOT:   (4968, 180)

monthly_only_subscribers:
  Train: (10792, 169)
  OOT:   (9447, 180)
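
The shapes above show 169 columns for the training frames and 180 for the OOT frames. The cause is not established here, but a model fit on one column set cannot score the other directly, so it is worth listing the differences and aligning the OOT frame to the training schema. The frames below are tiny stand-ins with invented columns, intended only as a sketch of that check.

```python
import pandas as pd

# Hypothetical stand-ins for train_features and oot_features
train = pd.DataFrame(
    {"lifetime_amount": [100.0, 250.0], "gift_count_12m": [2, 5]},
    index=pd.Index([1, 2], name="donor_id"),
)
oot = pd.DataFrame(
    {"gift_count_12m": [1, 3], "lifetime_amount": [80.0, 400.0], "extra_feature": [0.1, 0.9]},
    index=pd.Index([3, 4], name="donor_id"),
)

# Columns present on only one side
only_in_oot = oot.columns.difference(train.columns)
only_in_train = train.columns.difference(oot.columns)
print("Only in OOT:  ", list(only_in_oot))
print("Only in train:", list(only_in_train))

# Align OOT to the training schema: extra columns are dropped,
# columns missing from OOT would be created and filled with NaN
oot_aligned = oot.reindex(columns=train.columns)
print(oot_aligned.shape)
```

Inspecting `only_in_oot` on the real frames should indicate which feature group produces columns that depend on the data available at each reference time.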


## 6. Feature Inventory

Quick reference of all features by category:

### Identity & Demographics (17 features)
- donor_zip5, is_teacher, is_teacher_referred
- is_marketing_subscribed, is_major_gift_donor
- ever_used_account_credit, current_account_credit_balance
- zip_* (9 ACS features)

### Lifetime Giving (14 features)
- first_donation_date, last_donation_date
- tenure_days, tenure_years, tenure_bucket
- lifetime_gift_count, lifetime_amount
- lifetime_median/max/cv_gift_amount
- mean/cv_gap_between_gifts_days
- max_donation_sequence_number
- pct_early_gifts_in_lifetime

### Windowed Giving (12 features)
- gift_count/amount/median_amount for 3m, 12m, 36m
- days_since_last/second_to_last_gift
- amount_velocity_0to3_vs_3to12
- amount/count_velocity_0to12_vs_12to36

### Channel Mix (26 features)
- pct_amount/count_daf (lifetime, 12m)
- pct_amount/count_green (lifetime, 12m)
- pct_gifts/amount_gift_card (lifetime, 12m)
- pct_amount/count_big_event (lifetime, 12m)
- avg_optional_donation_rate (lifetime, 12m)
- pct_gifts_anonymous (lifetime, 12m)
- pct_amount_classroom_essentials (lifetime, 12m)

### Monthly Program (10 features)
- is_monthly_donor_current
- monthly_lifetime_amount, monthly_amount_12m
- monthly_median_gift_amount
- pct_amount_monthly (lifetime, 12m)
- months_on_program, months_since_last_monthly_charge
- monthly_longest_streak_months
- monthly_joined_before_first_project_gift

### Teacher/School Preferences (17 features)
- entropy_* (5: teacher, school, zip, category, grade)
- num_unique_* (5)
- pct_amount_to_top_* (4)
- pct_gifts_first/last_project
- mean_teacher_lifetime_projects_fully_funded
- mean_teacher_lifetime_donations

### Seasonality (10 features)
- pct_amount_in_back_to_school
- pct_amount_in_final_week_of_year
- pct_amount_on_weekends
- entropy_gift_month
- pct_amount_in_top_month/quarter
- first_donation_month/quarter
- first_donation_dow_sin/cos

### Email Engagement (11 features)
- emails_sent/opened/clicked (3m, 12m)
- email_open/click_rate (3m, 12m)
- email_open_rate_velocity_3m_vs_12m
- days_since_last_email_sent

### Site Behavior (12 features)
- days_with_any_site_activity_3m
- avg_sessions_per_active_day_3m
- avg_session_duration_min_3m
- checkout_intent_min_per_session_3m
- days_since_last_cart_visit
- campaign_session_share_3m
- share_*_page_session_pct_3m (3)
- device_share_*_3m (3)

### Share Events (6 features)
- share_events_lifetime/12m
- share_active_months_12m
- share_gap_mean/cv_days
- share_month_coverage_ratio

### Project Outcomes (8 features)
- pct_projects_fully_funded
- mean/median_project_total_cost
- mean_match_multiplier
- pct_gifts_with_match
- median_donor_to_project_distance_mi
- pct_gifts_within_15mi
- is_local_donor
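
This summary does not say how donor-to-project distance is computed. As an illustration only, the sketch below assumes a great-circle (haversine) distance between two latitude/longitude points, such as ZIP centroids. The function name and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles


def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))


# Hypothetical donor and project coordinates (roughly New York City and Philadelphia)
d = haversine_mi(40.7128, -74.0060, 39.9526, -75.1652)
print(d < 15)  # False: this gift would not count toward pct_gifts_within_15mi
```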

### Labels (4 features, if H provided)
- gave_any_in_H
- gift_count_in_H
- gift_amount_in_H
- median_gift_amount_in_H

**Total: ~145+ base features** (before one-hot encoding categoricals)

## 7. Notes & Best Practices

### Time Scope Summary

- **STATIC**: Does not depend on T
  - ZIP/ACS features
  - Some "once ever" flags (is_major_gift_donor, etc.)
  - First donation seasonality

- **AS_OF_T**: Cumulative through time T
  - Lifetime giving metrics
  - Distance/locality
  - Channel mix (lifetime)
  - Teacher/school concentration
  - Overall seasonality patterns

- **WINDOWED**: Specific lookback relative to T
  - 3m/12m/36m counts & amounts
  - Velocity metrics
  - Email engagement (3m/12m)
  - Site behavior (3m)
  - Share activity (12m)

- **LABEL**: Future horizon [T, T+H)
  - Repeat giving targets
  - Monthly program changes
  - Upgrade/downgrade indicators
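
The scopes that depend on T differ only in which slice of a donor's history they read. A minimal sketch with pandas, assuming a gifts table with hypothetical columns `donor_id`, `donation_date`, and `amount`, a 12-month lookback, and a 6-month label horizon H:

```python
import pandas as pd

gifts = pd.DataFrame({
    "donor_id": [1, 1, 1, 2, 2],
    "donation_date": pd.to_datetime(
        ["2022-05-01", "2024-03-15", "2025-02-01", "2023-11-20", "2024-12-31"]
    ),
    "amount": [25.0, 50.0, 40.0, 100.0, 10.0],
})

T = pd.Timestamp("2025-01-01")   # reference time
H = pd.DateOffset(months=6)      # label horizon (assumed length)
dates = gifts["donation_date"]
window_start = T - pd.DateOffset(months=12)

as_of_t = gifts[dates < T]                                # AS_OF_T: everything before T
windowed = gifts[(dates >= window_start) & (dates < T)]   # WINDOWED: [T-12m, T)
future = gifts[(dates >= T) & (dates < T + H)]            # LABEL: [T, T+H)

donors = pd.Index(sorted(gifts["donor_id"].unique()), name="donor_id")
features = pd.DataFrame({
    "lifetime_amount": as_of_t.groupby("donor_id")["amount"].sum(),
    "gift_amount_12m": windowed.groupby("donor_id")["amount"].sum(),
    "gift_amount_in_H": future.groupby("donor_id")["amount"].sum(),
}).reindex(donors).fillna(0.0)  # no gifts in a slice is a true zero, not a missing value
features["gave_any_in_H"] = (features["gift_amount_in_H"] > 0).astype(int)
print(features)
```

Features use strict `< T` while labels use `>= T`, so no single gift can land in both a feature and a label.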

### Key Design Decisions

1. **No duplicate concepts**: Single distance metric, single velocity pattern
2. **Consistent windows**: 3m (short), 12m (mid), 36m (long)
3. **Defensive epsilon**: A small constant (1e-6) in ratio denominators prevents division by zero
4. **Separation of concerns**: Feature engineering ≠ imputation
5. **Time-relative**: All features parameterized by T

### Recommended Workflow

1. **Build features** for each training example, using several reference times T
2. **Finalize features** with one consistent imputation strategy
3. **Select features** based on importance and correlation
4. **Train models** with cross-validation
5. **Score in production** using the latest available T
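
Step 1 can be sketched as stacking one feature snapshot per reference time. The `build_features` function below is a hypothetical stand-in for the notebook's builder that computes only two lifetime features; the data and snapshot dates are made up.

```python
import pandas as pd

gifts = pd.DataFrame({
    "donor_id": [1, 1, 2, 2, 2],
    "donation_date": pd.to_datetime(
        ["2023-02-10", "2024-08-05", "2023-06-01", "2024-01-15", "2024-09-30"]
    ),
    "amount": [20.0, 35.0, 50.0, 50.0, 75.0],
})


def build_features(gifts, T):
    """Hypothetical stand-in for the notebook's builder: lifetime totals as of T."""
    past = gifts[gifts["donation_date"] < T]
    out = past.groupby("donor_id")["amount"].agg(
        lifetime_gift_count="count", lifetime_amount="sum"
    )
    out["T"] = T  # keep the snapshot date so rows can be split by time later
    return out.reset_index()


snapshots = [pd.Timestamp("2024-01-01"), pd.Timestamp("2024-07-01"), pd.Timestamp("2025-01-01")]
training = pd.concat([build_features(gifts, T) for T in snapshots], ignore_index=True)
print(training)
```

Because the same donor appears once per snapshot, the rows are not independent. Splitting folds by `T` or by `donor_id` rather than at random is one way to keep validation estimates honest.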

### Common Pitfalls to Avoid

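- **Leakage**: Build features only from events before T; nothing in the label horizon [T, T+H) may feed a feature
- **Inconsistent windows**: Use the same T-relative boundaries for training and scoring
- **Missing joins**: Ensure every source table carries donor_id before merging
- **Scale mismatch**: Scale features for scale-sensitive models (e.g., regularized linear models, k-NN, neural networks); tree-based models generally do not need it
- **Category explosion**: Monitor how many columns one-hot encoding produces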
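
One common way to limit category explosion is to pool rare categories into a single bucket before one-hot encoding. A minimal, dependency-free sketch; the helper name `collapse_rare`, the threshold, and the sample ZIP codes are all illustrative assumptions:

```python
from collections import Counter


def collapse_rare(values, min_count, other="OTHER"):
    """Replace categories seen fewer than `min_count` times with a single bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


donor_zip5 = ["10001", "10001", "94110", "94110", "94110", "60614", "73301"]
collapsed = collapse_rare(donor_zip5, min_count=2)
print(sorted(set(collapsed)))  # ['10001', '94110', 'OTHER']
```

The counts should come from the training data only, and the same mapping should then be applied at scoring time so that the encoded columns stay aligned.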

### Performance Tips

- For large datasets, consider building features in chunks
- Cache intermediate results (windowed subsets)
- Use categorical dtypes to save memory
- Consider feature hashing for high-cardinality categoricals
- Profile your code to identify bottlenecks
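
The categorical-dtype tip is easy to check on a single column. A small sketch with a made-up low-cardinality column; actual savings depend on cardinality, string length, and pandas version:

```python
import pandas as pd

channel = pd.Series(["web", "email", "mobile"] * 10_000)  # hypothetical low-cardinality column
as_category = channel.astype("category")

before = channel.memory_usage(deep=True)   # deep=True counts the string payloads too
after = as_category.memory_usage(deep=True)
print(f"strings: {before:,} bytes, category: {after:,} bytes")
```

The saving comes from storing each distinct value once and keeping small integer codes per row, so it shrinks as the number of distinct values approaches the number of rows.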

### Extensions

This schema can be extended with:
- Interaction features (e.g., tenure × amount)
- Polynomial features for non-linear patterns
- Embedding features from text (project descriptions)
- Time-series features (rolling statistics)
- Graph features (donor networks)
- External data (economic indicators, events)