# RILA EDA: Feature Engineering - REFACTORED

**Refactored:** 2026-01-28  
**Original:** notebooks/rila/03_EDA_RILA_feature_engineering.ipynb  

**Changes:**
- Migrated from helpers.* to src.* imports
- Added canonical sys.path auto-detection
- Improved cell structure and documentation
- Preserved exploratory flexibility
- Added validation checkpoints

**Purpose:** Exploratory feature engineering combining sales and competitive rate data. Creates CPI-adjusted features, weekly aggregations, lag features, and spreads for correlation analysis.

**Dependencies:** Notebooks 1 and 2 (sales and rates data outputs)

**Note:** This is an EDA notebook: mathematical equivalence with the original is not required, and exploratory flexibility is preserved.

## Table of Contents
* [Section 1: Load Preprocessed Data](#sec1:Load)
* [Section 2: CPI Adjustment and Data Integration](#sec2:cpi)
* [Section 3: Weekly Aggregation and Lag Features](#sec3:weekly)
* [Section 4: Spread Calculation and Visualization](#sec4:spread)

In [None]:
%%capture
# =============================================================================
# STANDARD SETUP CELL - Clean Dependency Pattern
# =============================================================================

# Standard library imports
import sys
import os
from pathlib import Path
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

# Suppress warnings for clean output
warnings.filterwarnings("ignore")

# Canonical sys.path setup (auto-detect project root)
# Auto-detect project root (handles actual directory structure)
cwd = os.getcwd()

# Check for actual directory structure
if 'notebooks/production/rila' in cwd:
    project_root = Path(cwd).parents[2]
elif 'notebooks/production/fia' in cwd:
    project_root = Path(cwd).parents[2]
elif 'notebooks/eda/rila' in cwd:
    project_root = Path(cwd).parents[2]
elif 'notebooks/archive' in cwd:
    project_root = Path(cwd).parents[2]
elif os.path.basename(cwd) == 'notebooks':
    project_root = Path(cwd).parent
else:
    project_root = Path(cwd)

project_root = str(project_root)

# IMPORTANT: Verify import will work
if not os.path.exists(os.path.join(project_root, 'src')):
    raise RuntimeError(
        f"sys.path setup failed: 'src' package not found at {project_root}/src\n"
        f"Current directory: {cwd}\n"
        "This indicates the sys.path detection logic needs adjustment."
    )

sys.path.insert(0, project_root)

# Refactored imports (src.* pattern)
from src.data import extraction as ext
from src.data import pipelines
from src.data.dvc_manager import save_dataset, load_dataset

# Visualization theme
sns.set_theme(style="whitegrid", palette="deep")

print("✓ Dependencies loaded successfully")
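
The setup cell finds the project root by matching known directory names in the working directory. A possible alternative, sketched here under the assumption that the root is the nearest ancestor containing a `src` directory, is to walk up the parent chain. The helper name `find_project_root` is hypothetical and not part of this project.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "src") -> Optional[Path]:
    """Return the nearest ancestor of `start` (inclusive) that contains a `marker` directory."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None


# Demonstration on a throwaway directory tree (not the real project layout)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "src").mkdir()
    nested = root / "notebooks" / "eda" / "rila"
    nested.mkdir(parents=True)
    found = find_project_root(nested)
    print(found == root)
```

This approach does not depend on the notebook's folder name, so moving a notebook between `notebooks/` subdirectories would not require a new branch in the detection logic.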

In [None]:
# =============================================================================
# AWS CONFIGURATION - Reuse from 00_data_pipeline pattern
# =============================================================================

aws_config = {
    'xid': "x259830",
    'role_arn': "arn:aws:iam::159058241883:role/isg-usbie-annuity-CA-s3-sharing",
    'sts_endpoint_url': "https://sts.us-east-1.amazonaws.com",
    'source_bucket_name': "pruvpcaws031-east-isg-ie-lake",
    'output_bucket_name': "cdo-annuity-364524684987-bucket",
    'output_base_path': "ANN_Price_Elasticity_Data_Science"
}

# Product parameters
version = "v2_0"

# Date parameters
current_time = datetime.now()
current_date = current_time.strftime("%Y-%m-%d")
current_date_of_mature_data = (current_time - timedelta(days=0)).strftime("%Y-%m-%d")

# Analysis parameters
start_date = "2021-01-01"
end_date = current_time.strftime("%Y-%m-%d")

print(f"✓ Configuration loaded")
print(f"  Version: {version}")
print(f"  Analysis period: {start_date} to {end_date}")

## Section 1: Load Preprocessed Data <a id="sec1:Load"></a>

**Business Purpose**: Load preprocessed sales, rates, and economic indicator data from previous EDA notebooks and production pipeline outputs

**Data Sources**: 
- WINK competitive rates (from notebook 2)
- Sales time series (from notebook 1)
- Economic indicators (DGS5, VIX, CPI) from S3

In [None]:
# =============================================================================
# LOAD PREPROCESSED DATA
# =============================================================================

# Define file paths
file_name_dgs5 = "MACRO_ECONOMIC_DATA/DGS5_index"
file_name_vixcls = "MACRO_ECONOMIC_DATA/VIXCLS_index"
file_name_cpi = "MACRO_ECONOMIC_DATA/cpi_scaled"
file_name_sales = f"RILA_{version}/FlexGuard_Sales"
file_name_contract = f"RILA_{version}/FlexGuard_Sales_contract"

# Load economic data from S3
df_dgs5 = ext.download_s3_parquet_with_optional_date_suffix(
    aws_config['output_bucket_name'], 
    f"{aws_config['output_base_path']}/{file_name_dgs5}", 
    None
)
df_vixcls = ext.download_s3_parquet_with_optional_date_suffix(
    aws_config['output_bucket_name'], 
    f"{aws_config['output_base_path']}/{file_name_vixcls}", 
    None
)
df_cpi = ext.download_s3_parquet_with_optional_date_suffix(
    aws_config['output_bucket_name'], 
    f"{aws_config['output_base_path']}/{file_name_cpi}", 
    None
)

# Load sales data from S3
df_sales = ext.download_s3_parquet_with_optional_date_suffix(
    aws_config['output_bucket_name'], 
    f"{aws_config['output_base_path']}/{file_name_sales}", 
    current_date
)
df_sales_contract = ext.download_s3_parquet_with_optional_date_suffix(
    aws_config['output_bucket_name'], 
    f"{aws_config['output_base_path']}/{file_name_contract}", 
    current_date
)

# Load rates from DVC-tracked EDA output (from notebook 2)
try:
    df_rates = load_dataset("WINK_competitive_landscape_1Y10_EDA")
except Exception:
    # Fallback to S3 if DVC not available
    print("  Note: Loading from S3 fallback (DVC not available)")
    date_path = f"year={current_time.year}/month={current_time.month:02}/day={current_time.day:02}"
    file_path = f"WINK_rate_features_archive/RILA_{version}/{date_path}"
    file_name = "WINK_competitive_landscape_1Y10"
    df_rates = ext.download_s3_parquet_with_optional_date_suffix(
        aws_config['output_bucket_name'], 
        f"{aws_config['output_base_path']}/{file_path}/{file_name}", 
        None
    )

print(f"✓ All data loaded")
print(f"  Rates: {df_rates.shape}")
print(f"  Sales: {df_sales.shape}")
print(f"  Economic indicators: DGS5({df_dgs5.shape}), VIX({df_vixcls.shape}), CPI({df_cpi.shape})")

In [None]:
# =============================================================================
# VALIDATION CHECKPOINT
# =============================================================================

# Validate data quality
assert not df_rates.empty, "Rates DataFrame is empty"
assert not df_sales.empty, "Sales DataFrame is empty"
assert 'date' in df_rates.columns, "Missing date column in rates"
assert 'date' in df_sales.columns, "Missing date column in sales"

print(f"✓ Data validation passed")
print(f"  Rates records: {len(df_rates):,}")
print(f"  Sales records: {len(df_sales):,}")
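
The checkpoints in this notebook use `assert`, which Python skips when run with the `-O` flag. If that matters, one option is a small helper that raises explicitly. The sketch below is illustrative only; `validate_frame` is a hypothetical name, not a function from `src`.

```python
import pandas as pd


def validate_frame(df: pd.DataFrame, name: str, required_columns=("date",), min_rows: int = 1) -> None:
    """Raise ValueError if `df` has too few rows or lacks a required column."""
    if len(df) < min_rows:
        raise ValueError(f"{name}: expected at least {min_rows} rows, got {len(df)}")
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"{name}: missing columns {missing}")


# Toy frame standing in for df_sales
good = pd.DataFrame({"date": pd.to_datetime(["2021-01-01", "2021-01-08"]), "sales": [1.0, 2.0]})
validate_frame(good, "sales", required_columns=("date", "sales"), min_rows=2)
print("validation passed")
```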

## Section 2: CPI Adjustment and Data Integration <a id="sec2:cpi"></a>

Integrate sales, rates, and economic indicators with CPI adjustment
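
The internals of `pipelines.apply_data_integration` are not shown in this notebook. As a rough illustration of what a CPI adjustment can look like, the sketch below forward-fills a monthly index onto sales dates and rescales sales to the latest index level. The column names (`sales`, `cpi`, `sales_cpi_adj`) and the choice of the latest observation as the base period are assumptions; the real `cpi_scaled` file may already hold a ratio.

```python
import pandas as pd

# Toy inputs: semi-monthly sales and a monthly CPI index (hypothetical values)
sales = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-15", "2021-02-01", "2021-02-15"]),
    "sales": [100.0, 100.0, 100.0, 100.0],
})
cpi = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-02-01"]),
    "cpi": [100.0, 102.0],
})

# Align CPI to sales dates and carry the last known index value forward
merged = sales.merge(cpi, on="date", how="left").sort_values("date")
merged["cpi"] = merged["cpi"].ffill()

# Express every sales figure in the price level of the latest observation
base_cpi = merged["cpi"].iloc[-1]
merged["sales_cpi_adj"] = merged["sales"] * base_cpi / merged["cpi"]

print(merged)
```

With these toy numbers, January sales are scaled up by 2% while February sales are unchanged, because February is the base period.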

In [None]:
# =============================================================================
# DATA INTEGRATION - Using refactored pipeline
# =============================================================================

# Configure data integration (following 00_data_pipeline pattern)
data_sources = {
    'sales': df_sales,
    'sales_contract': df_sales_contract,
    'dgs5': df_dgs5,
    'vixcls': df_vixcls,
    'cpi': df_cpi
}

data_integration_config = {
    'start_date': start_date,
    'end_date': end_date,
    'merge_how': 'outer',
    'fill_method': 'ffill'
}

# Apply data integration (creates CPI-adjusted sales)
df_sales_cpi_adj = pipelines.apply_data_integration(
    df_rates, 
    data_sources, 
    data_integration_config
)

# Merge with rates
df_ts = df_rates.merge(df_sales_cpi_adj, on="date").drop_duplicates()

print(f"✓ Data integration complete")
print(f"  Integrated dataset: {df_ts.shape}")
print(f"  Date range: {df_ts['date'].min()} to {df_ts['date'].max()}")

In [None]:
# =============================================================================
# VALIDATION CHECKPOINT
# =============================================================================

assert not df_ts.empty, "Integrated DataFrame is empty"
assert df_ts.shape[0] > 100, f"Expected >100 rows, got {df_ts.shape[0]}"
assert 'date' in df_ts.columns, "Missing date column"

print(f"✓ Integration validation passed")
print(f"  Integrated records: {len(df_ts):,}")
print(f"  Columns: {df_ts.shape[1]}")

## Section 3: Weekly Aggregation and Lag Features <a id="sec3:weekly"></a>

Aggregate to weekly frequency and create lag features for time series analysis
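
For intuition, weekly aggregation and lagging can be sketched with plain pandas. This is not the implementation behind `pipelines.apply_weekly_aggregation` or `pipelines.apply_lag_and_polynomial_features`, which is not shown here; the `rate` column and the `rate_lag_N` naming are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Four full Monday-to-Sunday weeks of daily toy data
dates = pd.date_range("2021-01-04", periods=28, freq="D")
daily = pd.DataFrame({"date": dates, "rate": np.arange(28, dtype=float)})

# Weekly mean; "W" bins end on Sunday and are labelled by that Sunday
weekly = daily.set_index("date").resample("W").mean().reset_index()

# Lag k holds the value observed k weeks earlier (lag 0 is the current week)
for lag in [0, 1, 2]:
    weekly[f"rate_lag_{lag}"] = weekly["rate"].shift(lag)

print(weekly)
```

The first `k` rows of each lag-`k` column are `NaN`, which is one reason to trim the start of the series after lagging.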

In [None]:
# =============================================================================
# WEEKLY AGGREGATION - Using refactored pipeline
# =============================================================================

# Configure weekly aggregation
weekly_agg_config = {
    'freq': 'W',
    'how': 'last',  # Use last value in week
    'rolling': 1,   # No additional smoothing
    'numeric_agg': 'mean'  # For numeric columns
}

# Apply weekly aggregation
df_ts_weekly = pipelines.apply_weekly_aggregation(df_ts, weekly_agg_config)

print(f"✓ Weekly aggregation complete")
print(f"  Weekly records: {len(df_ts_weekly):,}")
print(f"  Date range: {df_ts_weekly['date'].min()} to {df_ts_weekly['date'].max()}")

In [None]:
# =============================================================================
# LAG FEATURES - Using refactored pipeline
# =============================================================================

# Configure lag features (following 00_data_pipeline pattern)
lag_config = {
    'lag_periods': [0, 1, 2, 4, 7],  # Common lags for EDA
    'create_polynomials': False,  # Keep simple for EDA
    'semantic_naming': True
}

# Apply lag features
df_ts_weekly_w_lag = pipelines.apply_lag_and_polynomial_features(
    df_ts_weekly, 
    lag_config
)

# Filter to analysis period
mask_time = df_ts_weekly_w_lag["date"] > pd.to_datetime("2021-02-01")
df_weekly = df_ts_weekly_w_lag[mask_time]
df_weekly = df_weekly[sorted(df_weekly.columns)].reset_index(drop=True)

# Add temporal features
df_weekly["day_of_year"] = df_weekly.date.dt.day_of_year
mask_holiday = (df_weekly["day_of_year"] < 13) | (df_weekly["day_of_year"] > 359)
df_weekly["holiday"] = mask_holiday.astype("int")

print(f"✓ Lag features created")
print(f"  Final records: {len(df_weekly):,}")
print(f"  Total features: {df_weekly.shape[1]}")
print(f"  Date range: {df_weekly['date'].min()} to {df_weekly['date'].max()}")
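
The holiday flag above depends only on the day of the year. The small check below, using hand-picked dates, shows which dates fall inside the window. Because leap years shift late-December day numbers by one, 25 December is flagged in 2024 but not in 2023.

```python
import pandas as pd

# Hand-picked dates around the window boundaries
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2023-01-12", "2023-01-13", "2023-12-25", "2023-12-26", "2024-12-25",
    ])
})

# Same rule as the notebook: first 12 days of the year, or day 360 onward
df["day_of_year"] = df["date"].dt.day_of_year
df["holiday"] = ((df["day_of_year"] < 13) | (df["day_of_year"] > 359)).astype(int)

print(df)
```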

In [None]:
# =============================================================================
# VALIDATION CHECKPOINT
# =============================================================================

assert not df_weekly.empty, "Weekly DataFrame is empty"
assert df_weekly.shape[0] > 50, f"Expected >50 weekly records, got {df_weekly.shape[0]}"
assert 'date' in df_weekly.columns, "Missing date column"
assert 'holiday' in df_weekly.columns, "Missing holiday feature"

print(f"✓ Weekly data validation passed")
print(f"  Weekly records: {len(df_weekly):,}")
print(f"  Features: {df_weekly.shape[1]}")

In [None]:
# =============================================================================
# SAVE ENGINEERED FEATURES - Using DVC for tracking
# =============================================================================

# Save for downstream notebooks
save_dataset(df_weekly, "RILA_engineered_features_EDA")

print(f"✓ Engineered features saved")

## Section 4: Spread Calculation and Visualization <a id="sec4:spread"></a>

Calculate competitive spread and visualize relationship with sales
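
As a minimal sketch of the calculation, the spread is the own rate minus the competitor rate, and its linear relationship with sales can be summarised with a Pearson correlation, which is the default of `Series.corr`. The values below are made up; only the column names `P_lag_0` and `C_lag_0` come from this notebook.

```python
import pandas as pd

# Toy weekly data (hypothetical values)
df = pd.DataFrame({
    "P_lag_0": [15.0, 16.0, 15.5, 17.0],  # own cap rate
    "C_lag_0": [15.0, 15.5, 15.5, 16.0],  # competitor cap rate
    "sales": [100.0, 120.0, 105.0, 150.0],
})

# Positive spread means the own rate is above the competitor rate
df["Spread"] = df["P_lag_0"] - df["C_lag_0"]

# Pearson correlation between sales and spread
correlation = df["sales"].corr(df["Spread"])
print(f"Correlation (Sales vs Spread): {correlation:.3f}")
```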

In [None]:
# =============================================================================
# SPREAD CALCULATION (exploratory - multiple variants)
# =============================================================================

# Filter to mature, non-holiday periods
mask_time = df_weekly["date"] <= pd.to_datetime(current_date_of_mature_data)
df_weekly_filtered = df_weekly[mask_time & (df_weekly["holiday"] == 0)].copy()

# Calculate different spread variants (exploratory)
if 'P_lag_0' in df_weekly_filtered.columns and 'C_lag_0' in df_weekly_filtered.columns:
    df_weekly_filtered["Spread"] = df_weekly_filtered["P_lag_0"] - df_weekly_filtered["C_lag_0"]
    # NOTE: Spread_2 is currently identical to Spread
    df_weekly_filtered["Spread_2"] = df_weekly_filtered["P_lag_0"] - df_weekly_filtered["C_lag_0"]
    
    print(f"✓ Spread calculated")
    print(f"  Spread range: {df_weekly_filtered['Spread'].min():.2f} to {df_weekly_filtered['Spread'].max():.2f}")
else:
    print("  Warning: Lag columns not found, using alternative approach")
    # Fallback: calculate spread from raw columns
    if 'Prudential' in df_weekly_filtered.columns and 'C_weighted_mean' in df_weekly_filtered.columns:
        df_weekly_filtered["Spread"] = df_weekly_filtered["Prudential"] - df_weekly_filtered["C_weighted_mean"]
        print(f"✓ Spread calculated (fallback method)")

In [None]:
# =============================================================================
# CORRELATION VISUALIZATION
# =============================================================================

# Determine sales column (flexible for different data versions)
sales_col = None
for col in ["sales_forward_0", "sales", "sales_target_contract_current"]:
    if col in df_weekly_filtered.columns:
        sales_col = col
        break

if sales_col and 'Spread' in df_weekly_filtered.columns:
    figure, axes = plt.subplots(2, 1, sharex=False, sharey=False, figsize=(12, 10))
    
    figure.suptitle(f"FLEXGUARD Feature Engineering EDA")
    axes[0].set_title(
        "CPI Adjusted FlexGuard Sales (Black) and Cap Rate Distance to Mean Rate (Orange)"
    )
    axes[0].set_xlabel("Date")
    axes[0].set_ylabel("CPI Adjusted FlexGuard Sales (dollars)")
    
    # Plot sales
    sns.lineplot(
        df_weekly_filtered[:-1],
        x="date",
        y=sales_col,
        ax=axes[0],
        color="k",
        linewidth=5,
        errorbar=("pi", 95),
    )
    
    # Plot spread on secondary axis
    ax0 = axes[0].twinx()
    ax0.set_ylabel("Cap Rate Distance to Mean Rate (bps)")
    ax0.grid(False)
    sns.lineplot(
        data=df_weekly_filtered, 
        x="date", 
        y="Spread", 
        ax=ax0, 
        color="tab:orange", 
        linewidth=5
    )
    
    # Scatter plot
    sns.regplot(
        df_weekly_filtered, 
        x="Spread", 
        y=sales_col, 
        ax=axes[1], 
        ci=100, 
        order=1
    )
    
    axes[1].set_title("CPI Adjusted FlexGuard Sales vs. Cap Rate Distance to Mean Rate")
    axes[1].set_xlabel("Cap Rate Distance to Mean Rate (bps)")
    axes[1].set_ylabel("CPI Adjusted FlexGuard Sales (dollars)")
    
    plt.tight_layout()
    plt.savefig("feature_engineering_correlation.png")
    plt.show()
    
    # Calculate correlation
    correlation = df_weekly_filtered[sales_col].corr(df_weekly_filtered["Spread"])
    print(f"\n✓ Visualization complete")
    print(f"  Correlation (Sales vs Spread): {correlation:.3f}")
    print(f"  Saved to: feature_engineering_correlation.png")
else:
    print("  Warning: Required columns not found for visualization")

---

## EDA Complete

**Key Findings:**
- Sales and competitive rates integrated with economic indicators
- CPI-adjusted features created
- Weekly aggregation reduces noise while preserving trends
- Lag features capture temporal dependencies
- Spread (Prudential rate minus competitor rate) is compared against sales; the correlation coefficient is printed by the visualization cell

**Engineered Features:**
- CPI-adjusted sales
- Weekly aggregated time series
- Lag features (0, 1, 2, 4, 7 periods)
- Holiday indicators
- Competitive spread

**Next Steps:** 
- Proceed to 04_RILA_feature_selection.ipynb for model-based feature selection
- Use engineered features for AIC-based selection