# RILA EDA: FlexGuard Sales Analysis - REFACTORED

**Refactored:** 2026-01-28  
**Original:** notebooks/rila/01_EDA_sales_RILA.ipynb  

**Changes:**
- Migrated from helpers.* to src.* imports
- Added canonical sys.path auto-detection
- Improved cell structure and documentation
- Preserved exploratory flexibility
- Added validation checkpoints

**Purpose:** Exploratory analysis of RILA FlexGuard sales data including sales distribution by firm, date gap analysis between application and contract issue dates, and time series patterns.

**Dependencies:** None (first EDA notebook in sequence)

**Note:** This is an EDA notebook, so its outputs are not required to be mathematically equivalent to those of the original notebook; exploratory flexibility is preserved.

## Table of Contents
* [Section 1: Load RILA Sales Data](#sec1:Load)
* [Section 2: Sales Distribution Analysis](#sec2:distribution)
* [Section 3: Sales Distribution by Firm and Duration](#sec3:sales_by_policy_and_firm)
* [Section 4: Application to Contract Date Gap Analysis by Firm](#sec4:contract_issue_date_gap_by_firm)
* [Section 5: Sales By Contract Issue Date vs Application Signed Date](#sec5:sales_by_date_types)

In [None]:
%%capture
# =============================================================================
# STANDARD SETUP CELL - Clean Dependency Pattern
# =============================================================================

# Standard library imports
import sys
import os
from pathlib import Path
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

# Suppress warnings for clean output
warnings.filterwarnings("ignore")

# Canonical sys.path setup (auto-detect project root)
# Auto-detect project root (handles actual directory structure)
cwd = os.getcwd()

# Check for actual directory structure
if 'notebooks/production/rila' in cwd:
    project_root = Path(cwd).parents[2]
elif 'notebooks/production/fia' in cwd:
    project_root = Path(cwd).parents[2]
elif 'notebooks/eda/rila' in cwd:
    project_root = Path(cwd).parents[2]
elif 'notebooks/archive' in cwd:
    project_root = Path(cwd).parents[2]
elif os.path.basename(cwd) == 'notebooks':
    project_root = Path(cwd).parent
else:
    project_root = Path(cwd)

project_root = str(project_root)

# IMPORTANT: Verify import will work
if not os.path.exists(os.path.join(project_root, 'src')):
    raise RuntimeError(
        f"sys.path setup failed: 'src' package not found at {project_root}/src\n"
        f"Current directory: {cwd}\n"
        "This indicates the sys.path detection logic needs adjustment."
    )

sys.path.insert(0, project_root)

# Refactored imports (src.* pattern)
from src.data import extraction as ext
from src.data import pipelines
from src.data.dvc_manager import save_dataset

# Visualization theme
sns.set_theme(style="whitegrid", palette="deep")

print("✓ Dependencies loaded successfully")
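
The branch-per-directory detection above has to be extended whenever a new notebook folder is added. As an alternative, here is a minimal sketch that walks upward from the working directory until it finds a folder containing `src`. The helper name `find_project_root` and its `marker` parameter are hypothetical and not part of this project; the demo builds a throwaway directory tree rather than touching the real repository.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_project_root(start: Optional[Path] = None, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path.cwd() if start is None else Path(start)
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise RuntimeError(f"No '{marker}' directory found at or above {start}")


# Demo on a temporary tree that mimics <root>/notebooks/eda/rila
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "src").mkdir()
    nested = root / "notebooks" / "eda" / "rila"
    nested.mkdir(parents=True)
    found = find_project_root(nested)
    matches = found == root
```

Because the search keys on the `src` folder itself, it does not depend on how deeply the notebook is nested.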

In [None]:
# =============================================================================
# AWS CONFIGURATION - Reuse from 00_data_pipeline pattern
# =============================================================================

aws_config = {
    'xid': "x259830",
    'role_arn': "arn:aws:iam::159058241883:role/isg-usbie-annuity-CA-s3-sharing",
    'sts_endpoint_url': "https://sts.us-east-1.amazonaws.com",
    'source_bucket_name': "pruvpcaws031-east-isg-ie-lake",
    'output_bucket_name': "cdo-annuity-364524684987-bucket",
    'output_base_path': "ANN_Price_Elasticity_Data_Science"
}

# Product parameters
product_name = "FlexGuard indexed variable annuity"
version = "v2_0"

# Date parameters
current_time = datetime.now()
current_date = current_time.strftime("%Y-%m-%d")
# Cutoff for "mature" data; the lag is currently 0 days, so the cutoff equals today
current_date_of_mature_data = (current_time - timedelta(days=0)).strftime("%Y-%m-%d")

print(f"✓ Configuration loaded: {product_name}")
print(f"  Version: {version}")
print(f"  Current date: {current_date}")

## Section 1: Load RILA Sales Data <a id="sec1:Load"></a>

**Business Purpose**: Load all available FlexGuard sales data from Prudential's data lake for exploratory analysis

**Data Source**: S3 parquet files from `access/ierpt/tde_sales_by_product_by_fund/`

In [None]:
# =============================================================================
# AWS CONNECTION SETUP
# =============================================================================

# AWS Connection using refactored extraction module
sts_client = ext.setup_aws_sts_client_with_validation(aws_config)
assumed_role = ext.assume_iam_role_with_validation(sts_client, aws_config)
s3_resource, bucket = ext.setup_s3_resource_with_validation(
    assumed_role['Credentials'],
    aws_config['source_bucket_name']
)

print("✓ AWS connection established")
print(f"  XID: {aws_config['xid']}")
print(f"  Bucket: {aws_config['source_bucket_name']}")

In [None]:
# =============================================================================
# LOAD SALES DATA - Using refactored extraction
# =============================================================================

# Load all FlexGuard sales data
df_FlexGuard_full = ext.discover_and_load_sales_data(
    bucket, 
    s3_resource, 
    aws_config['source_bucket_name']
)

print(f"✓ Sales data loaded")
print(f"  Total records: {len(df_FlexGuard_full):,}")
print(f"  Total columns: {df_FlexGuard_full.shape[1]}")

In [None]:
# =============================================================================
# SALES DATA CLEANUP - Using refactored pipelines
# =============================================================================

# Configure cleanup (inline config for EDA flexibility)
cleanup_config = {
    'remove_nulls': ['contract_initial_premium_amount'],
    'date_columns': ['application_signed_date', 'contract_issue_date'],
    'dedup_columns': ['contract_number', 'application_signed_date']
}

# Apply cleanup using refactored pipeline
df_FlexGuard = pipelines.apply_sales_data_cleanup(df_FlexGuard_full, cleanup_config)

# Filter out specific fund (original notebook logic preserved)
df_FlexGuard = df_FlexGuard[df_FlexGuard["fund_group_name"] != "S&P 500 - 6 Yr Cap B20"]

print(f"✓ Data cleanup completed")
print(f"  Clean records: {len(df_FlexGuard):,}")
print(f"  Records removed: {len(df_FlexGuard_full) - len(df_FlexGuard):,}")
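
The keys in `cleanup_config` suggest three steps: parse the date columns, drop rows with a null premium, and de-duplicate on contract number plus application date. The sketch below shows one plausible pandas reading of those steps on toy data. It is an assumption for illustration only: `cleanup_sketch` is a hypothetical helper, not the actual `pipelines.apply_sales_data_cleanup` implementation, which may order or define the steps differently.

```python
import pandas as pd


def cleanup_sketch(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """Hypothetical stand-in for the cleanup step (not the src.data.pipelines code)."""
    out = df.copy()
    # Parse date columns; unparseable values become NaT
    for col in config.get("date_columns", []):
        out[col] = pd.to_datetime(out[col], errors="coerce")
    # Drop rows where any of the listed columns is null
    null_cols = config.get("remove_nulls", [])
    if null_cols:
        out = out.dropna(subset=null_cols)
    # Keep the first row for each combination of the dedup columns
    dedup_cols = config.get("dedup_columns", [])
    if dedup_cols:
        out = out.drop_duplicates(subset=dedup_cols)
    return out.reset_index(drop=True)


config = {
    "remove_nulls": ["contract_initial_premium_amount"],
    "date_columns": ["application_signed_date", "contract_issue_date"],
    "dedup_columns": ["contract_number", "application_signed_date"],
}
raw = pd.DataFrame({
    "contract_number": ["A1", "A1", "B2", "C3"],
    "application_signed_date": ["2023-01-05", "2023-01-05", "2023-02-10", "2023-03-01"],
    "contract_issue_date": ["2023-01-12", "2023-01-12", "2023-02-20", "2023-03-15"],
    "contract_initial_premium_amount": [100000.0, 100000.0, None, 50000.0],
})
clean = cleanup_sketch(raw, config)
```

In this toy input, one row is dropped for a missing premium and one as a duplicate, leaving two records.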

In [None]:
# =============================================================================
# VALIDATION CHECKPOINT
# =============================================================================

# Validate data quality
assert not df_FlexGuard.empty, "DataFrame is empty"
assert len(df_FlexGuard) > 100, f"Expected >100 rows, got {len(df_FlexGuard)}"
assert 'application_signed_date' in df_FlexGuard.columns, "Missing application_signed_date column"
assert 'contract_issue_date' in df_FlexGuard.columns, "Missing contract_issue_date column"
assert 'contract_initial_premium_amount' in df_FlexGuard.columns, "Missing premium amount column"

# Calculate date difference for gap analysis
df_FlexGuard['difference'] = (df_FlexGuard['contract_issue_date'] - df_FlexGuard['application_signed_date']).dt.days

print(f"✓ Data validation passed")
print(f"  Records: {len(df_FlexGuard):,}")
print(f"  Columns: {df_FlexGuard.shape[1]}")
print(f"  Date range: {df_FlexGuard['application_signed_date'].min()} to {df_FlexGuard['application_signed_date'].max()}")
print(f"  Missing values: {df_FlexGuard.isna().sum().sum()}")
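
The column assertions above stop at the first missing column, and `assert` statements are skipped when Python runs with the `-O` flag. A minimal sketch of an alternative check that reports every missing column at once is shown here; `require_columns` is a hypothetical helper, not part of the `src` package.

```python
import pandas as pd


def require_columns(df: pd.DataFrame, required: list) -> None:
    """Raise a ValueError that names every missing column at once."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")


# Toy frame that lacks the premium column
toy = pd.DataFrame({"application_signed_date": [], "contract_issue_date": []})
try:
    require_columns(
        toy,
        ["application_signed_date", "contract_issue_date", "contract_initial_premium_amount"],
    )
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```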

In [None]:
# Column inspection (exploratory)
sorted(df_FlexGuard_full.columns)

## Section 2: Sales Distribution Analysis <a id="sec2:distribution"></a>

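Analyze the distribution of contract initial premium amounts, excluding values at or above the 99th percentile so the histogram is not dominated by outliers.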
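
To make the trimming step concrete, the sketch below applies the same quantile cut to a synthetic series of 100 premiums. The values are invented for illustration and are not FlexGuard data.

```python
import numpy as np
import pandas as pd

# Synthetic premiums: 1,000, 2,000, ..., 100,000 (illustrative only)
premiums = pd.Series(np.arange(1, 101) * 1000.0, name="contract_initial_premium_amount")

# 99th percentile with NumPy's default linear interpolation
threshold_top = np.quantile(premiums, 0.99)

# Strict less-than, so values at or above the threshold are excluded
trimmed = premiums[premiums < threshold_top]
```

Only the single largest value is removed here. On real data the cut drops roughly the top 1% of contracts.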

In [None]:
# Premium amount distribution (remove outliers for visualization)
FlexGuard = df_FlexGuard[df_FlexGuard.contract_initial_premium_amount.notna()]
threshold_top = np.quantile(FlexGuard.contract_initial_premium_amount, 0.99)
FlexGuard = FlexGuard[FlexGuard["contract_initial_premium_amount"] < threshold_top]

sns.histplot(
    data=FlexGuard, 
    x="contract_initial_premium_amount", 
    stat="density", 
    kde=True
)
plt.title("Distribution of Contract Initial Premium Amount (below 99th percentile)")
plt.xlabel("Premium Amount ($)")
plt.show()

## Section 3: Sales Distribution by Firm and Duration <a id="sec3:sales_by_policy_and_firm"></a>

Analyze how total premium is distributed across firms, reporting only firms that account for more than 2% of total premium.
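
The share calculation is a grouped sum divided by the overall total. A minimal sketch on toy data follows; the firm names and amounts are hypothetical.

```python
import pandas as pd

# Toy sales records (firm names and amounts are made up)
sales = pd.DataFrame({
    "firm_name": ["Firm A", "Firm A", "Firm B", "Firm C"],
    "contract_initial_premium_amount": [600.0, 200.0, 190.0, 10.0],
})

# Percentage of total premium contributed by each firm
share_pct = (
    sales.groupby("firm_name")["contract_initial_premium_amount"]
    .sum()
    .div(sales["contract_initial_premium_amount"].sum())
    .mul(100)
    .round(2)
)

# Keep only firms above the 2% threshold
major = share_pct[share_pct > 2]
```

With these numbers, Firm C holds 1% of premium and falls below the threshold.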

In [None]:
# Calculate sales proportion by firm
numerator_firm = (
    100 * df_FlexGuard.groupby("firm_name").contract_initial_premium_amount.sum()
)
denominator = df_FlexGuard.contract_initial_premium_amount.sum()

ratio_by_firm = np.round(numerator_firm / denominator, 2)
ratio_by_firm = ratio_by_firm.reset_index()

print("\nProportion by Firm (>2% market share)")
sub_ratio = ratio_by_firm[ratio_by_firm["contract_initial_premium_amount"] > 2]
print(sub_ratio.to_markdown(tablefmt="fancy_grid"))

all_products = sorted(sub_ratio.firm_name.unique())

In [None]:
# Date gap quantile analysis (days from application signed to contract issue)
print("Date gap analysis (application to contract):")
for label, q in [("95th", 0.95), ("97.5th", 0.975), ("99th", 0.99)]:
    # Series.quantile skips missing gaps, whereas np.quantile would return NaN
    days = int(np.round(df_FlexGuard['difference'].quantile(q)))
    print(f"{label} percentile: {days} days")

## Section 4: Application to Contract Date Gap Analysis by Firm <a id="sec4:contract_issue_date_gap_by_firm"></a>

Analyze processing time, measured in days from application signed date to contract issue date, for each firm above the 2% share threshold.
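
The per-firm statistics can also be computed in one pass with a grouped aggregation instead of a loop. The sketch below uses invented dates and firm names. The column name `difference` follows the notebook, while `gap_summary`, `mean_days`, and `p95_days` are names chosen for illustration.

```python
import pandas as pd

# Toy applications (dates and firms are made up)
apps = pd.DataFrame({
    "firm_name": ["Firm A"] * 3 + ["Firm B"] * 2,
    "application_signed_date": pd.to_datetime(
        ["2023-01-01", "2023-01-10", "2023-02-01", "2023-03-01", "2023-04-01"]
    ),
    "contract_issue_date": pd.to_datetime(
        ["2023-01-06", "2023-01-20", "2023-02-16", "2023-03-21", "2023-05-11"]
    ),
})

# Gap in whole days between signing and issue
apps["difference"] = (
    apps["contract_issue_date"] - apps["application_signed_date"]
).dt.days

# Mean and 95th percentile of the gap for each firm
gap_summary = apps.groupby("firm_name")["difference"].agg(
    mean_days="mean",
    p95_days=lambda s: s.quantile(0.95),
)
```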

In [None]:
print("---------------------------------------------------------------")
for firm in all_products:
    subset = df_FlexGuard[df_FlexGuard["firm_name"] == firm]
    sales = subset.contract_initial_premium_amount.sum()
    max_sale = subset.contract_initial_premium_amount.max()
    mean_diff = int(np.round(subset["difference"].mean()))
    top = int(np.round(np.quantile(subset["difference"], 0.95)))
    
    print(f"{firm} :")
    print(f"Total Sales: ${sales:,.2f}")
    print(f"Largest Sale: ${max_sale:,.2f}")
    print(f"First Sale Record (contract issue date): {subset.contract_issue_date.min().strftime('%Y-%m-%d')}")
    print(f"Final Sale Record (application signed date): {subset.application_signed_date.max().strftime('%Y-%m-%d')}")
    print(f"On average there are {mean_diff:d} days between application signed date and contract issue date, with a 95th percentile of {top:d} days")
    print("---------------------------------------------------------------\n")

In [None]:
# Date gap distribution visualization
df = df_FlexGuard.melt(id_vars="firm_name", value_vars="difference")
# Shift by one day so same-day issues (gap of 0) remain plottable on a log axis
df["value"] = 1 + df["value"]

# Log scale distribution
sns.displot(data=df, x="value", log_scale=True, kde=True)
plt.xlabel("Days between Signature and Contract Issue, plus 1 (log scale)")
plt.title("Distribution of Processing Time (Log Scale)")
plt.show()

# Linear scale distribution
sns.displot(data=df, x="value", log_scale=False, kde=True)
plt.xlabel("Days between Signature and Contract Issue, plus 1")
plt.title("Distribution of Processing Time (Linear Scale)")
plt.show()

## Section 5: Sales By Contract Issue Date vs Application Signed Date <a id="sec5:sales_by_date_types"></a>

Compare daily sales time series aggregated by application signed date versus contract issue date, each smoothed with a rolling average.
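
The preparation pattern used in this section is: sum premium per calendar day, then smooth with a rolling mean. A minimal sketch on toy transactions is shown here; the 2-day window is chosen only to keep the example small.

```python
import pandas as pd

# Toy transactions (values are made up); 2023-01-02 has no sales
tx = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-03", "2023-01-04"]),
    "sales": [100.0, 50.0, 30.0, 60.0],
})

# Daily totals; days with no transactions appear with a sum of 0
daily = tx.groupby(pd.Grouper(key="date", freq="D"))["sales"].sum().reset_index()

# Rolling mean; the first window - 1 rows are NaN
daily["sales_smoothed"] = daily["sales"].rolling(2).mean()
```

Days without sales enter the average as zeros, and the leading rows of the smoothed series are missing until a full window is available.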

In [None]:
# =============================================================================
# TIME SERIES PREPARATION - Application Date
# =============================================================================

roll = 1 * 7  # 1 week rolling average

time_series_FlexGuard_daily_application_signed_date = (
    df_FlexGuard[["application_signed_date", "contract_initial_premium_amount"]]
    .sort_values(by="application_signed_date")
    .rename(
        columns={
            "contract_initial_premium_amount": "sales",
            "application_signed_date": "date",
        }
    )
    .groupby(pd.Grouper(key="date", freq="D"))
    .sum()
    .reset_index()
)

time_series_FlexGuard_daily_application_signed_date["sales"] = (
    time_series_FlexGuard_daily_application_signed_date.sales.rolling(roll).mean()
)

print(f"✓ Application date time series created: {time_series_FlexGuard_daily_application_signed_date.shape}")

In [None]:
# =============================================================================
# TIME SERIES PREPARATION - Contract Date
# =============================================================================

time_series_FlexGuard_daily_contract_issue_date = (
    df_FlexGuard[["contract_issue_date", "contract_initial_premium_amount"]]
    .sort_values(by="contract_issue_date")
    .rename(
        columns={
            "contract_initial_premium_amount": "sales",
            "contract_issue_date": "date",
        }
    )
    .groupby(pd.Grouper(key="date", freq="D"))
    .sum()
    .reset_index()
)

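# Note: 4 * roll is a 28-day (4-week) window, wider than the 1-week window used for the application-date series
time_series_FlexGuard_daily_contract_issue_date["sales"] = (
    time_series_FlexGuard_daily_contract_issue_date.sales.rolling(4 * roll).mean()
)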

print(f"✓ Contract date time series created: {time_series_FlexGuard_daily_contract_issue_date.shape}")

In [None]:
# =============================================================================
# SAVE TIME SERIES - Using DVC for tracking (optional for EDA)
# =============================================================================

# Save application date time series
save_dataset(time_series_FlexGuard_daily_application_signed_date, "FlexGuard_Sales_EDA")

# Save contract date time series
save_dataset(
    time_series_FlexGuard_daily_contract_issue_date.rename(
        columns={"sales": "sales_by_contract_date"}
    ), 
    "FlexGuard_Sales_contract_EDA"
)

print(f"✓ Time series data saved")

In [None]:
# =============================================================================
# FINAL VISUALIZATION - Application vs Contract Date Comparison
# =============================================================================

figure, axes = plt.subplots(1, 1, sharex=True, sharey=False, figsize=(20, 6))
roll = 4 * 7  # 4 week rolling average

# Recalculate with 4-week rolling for better smoothing
time_series_FlexGuard_daily_application_signed_date = (
    df_FlexGuard[["application_signed_date", "contract_initial_premium_amount"]]
    .sort_values(by="application_signed_date")
    .rename(
        columns={
            "contract_initial_premium_amount": "sales",
            "application_signed_date": "date",
        }
    )
    .groupby(pd.Grouper(key="date", freq="D"))
    .sum()
    .reset_index()
)

time_series_FlexGuard_daily_application_signed_date["sales"] = (
    time_series_FlexGuard_daily_application_signed_date.sales.rolling(roll).mean()
)

time_series_FlexGuard_daily_contract_issue_date = (
    df_FlexGuard[["contract_issue_date", "contract_initial_premium_amount"]]
    .sort_values(by="contract_issue_date")
    .rename(
        columns={
            "contract_initial_premium_amount": "sales",
            "contract_issue_date": "date",
        }
    )
    .groupby(pd.Grouper(key="date", freq="D"))
    .sum()
    .reset_index()
)

time_series_FlexGuard_daily_contract_issue_date["sales"] = (
    time_series_FlexGuard_daily_contract_issue_date.sales.rolling(roll).mean()
)

axes.set_title("Smoothed Daily Sales by Application Signed Date vs Contract Issue Date for FlexGuard (4-week rolling average)")

# Filter to dates after 2021-01-01 and before the mature-data cutoff for cleaner visualization
mask_date_app = (
    time_series_FlexGuard_daily_application_signed_date["date"] > "2021-01-01"
) & (
    time_series_FlexGuard_daily_application_signed_date["date"]
    < current_date_of_mature_data
)
mask_date_con = (
    time_series_FlexGuard_daily_contract_issue_date["date"] > "2021-01-01"
) & (
    time_series_FlexGuard_daily_contract_issue_date["date"]
    < current_date_of_mature_data
)

sns.lineplot(
    data=time_series_FlexGuard_daily_application_signed_date[mask_date_app],
    x="date",
    y="sales",
    ax=axes,
    color="tab:blue",
    linewidth=3,
    label="Application Signed Date",
)
sns.lineplot(
    data=time_series_FlexGuard_daily_contract_issue_date[mask_date_con],
    x="date",
    y="sales",
    ax=axes,
    color="tab:orange",
    linewidth=3,
    label="Contract Issue Date",
)

plt.xlabel("Date")
plt.ylabel("Sales ($)")
plt.tight_layout()
plt.savefig("application_signed_date.png")
plt.show()

print("✓ Visualization complete - saved to application_signed_date.png")

---

## EDA Complete

**Key Findings:**
- Sales distribution across major firms analyzed
- Processing time gaps between application and contract issue quantified
- Time series patterns compared for both date types

**Next Steps:** 
- Proceed to 02_EDA_rates_RILA.ipynb for competitive rate analysis
- Combine sales insights with rate analysis in feature engineering notebook