# 01: Survival Analysis - Aviation Accident Outcomes

**Objective**: Analyze time-to-event data and survival probabilities for aviation accidents

**Key Methods**:
- Kaplan-Meier survival curves by aircraft characteristics
- Cox proportional hazards regression for fatal injury predictors
- Log-rank tests for group comparisons
- Stratified analysis by aircraft type and operator

**Expected Outputs**:
- 6 publication-quality visualizations
- Survival curves by aircraft age, type, certification
- Hazard ratios for fatal accident predictors
- Statistical tests with p-values and confidence intervals

**Dataset**: NTSB Aviation Accidents (1962-2025)
**Database**: ntsb_aviation (PostgreSQL)
**Last Updated**: 2025-11-09

In [None]:
# Setup and imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import sqlalchemy as sa
from pathlib import Path
import warnings

from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test, multivariate_logrank_test

warnings.filterwarnings('ignore')

# Configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10
plt.rcParams['savefig.dpi'] = 150

# Create figures directory
figures_dir = Path('figures')
figures_dir.mkdir(exist_ok=True)

# Database connection
engine = sa.create_engine('postgresql://parobek@localhost/ntsb_aviation')

print("✅ Setup complete")

## 1. Data Loading and Preparation

For survival analysis, we need:
- **Event indicator**: Fatal accident (1) vs non-fatal (0)
- **Time variable**: Aircraft age at time of accident
- **Covariates**: Aircraft type, certification, engine count, weather

In [None]:
# Load data with survival analysis variables
query = """
SELECT
    e.ev_id,
    e.ev_date,
    e.ev_year,
    e.ev_highest_injury,
    a.acft_make,
    a.acft_model,
    a.acft_category,
    a.acft_year,
    a.homebuilt,
    a.num_eng,
    a.eng_type,
    i.inj_tot_f,
    i.inj_tot_s,
    i.inj_tot_m,
    i.inj_tot_n,
    CASE WHEN e.wx_cond_basic = 'IMC' THEN 1 ELSE 0 END AS imc_flag
FROM events e
LEFT JOIN aircraft a ON e.ev_id = a.ev_id AND a.aircraft_key = (
    SELECT MIN(a2.aircraft_key) FROM aircraft a2 WHERE a2.ev_id = e.ev_id
)
LEFT JOIN injury i ON e.ev_id = i.ev_id
WHERE e.ev_date IS NOT NULL
  AND a.acft_year IS NOT NULL
  AND e.ev_year >= a.acft_year  -- Valid aircraft age
  AND e.ev_highest_injury IS NOT NULL
ORDER BY e.ev_date
"""

df = pd.read_sql(sa.text(query), engine)

print(f"Loaded {len(df):,} events")
print(f"Date range: {df['ev_date'].min()} to {df['ev_date'].max()}")
print(f"\nColumns: {df.columns.tolist()}")

In [None]:
# Feature engineering for survival analysis

# 1. Aircraft age at time of accident (time variable)
df['aircraft_age'] = df['ev_year'] - df['acft_year']

# 2. Event indicator (fatal = 1, non-fatal = 0)
df['event_fatal'] = (df['ev_highest_injury'] == 'FATL').astype(int)

# 3. Fatality count (for intensity analysis)
df['fatality_count'] = df['inj_tot_f'].fillna(0).astype(int)

# 4. Binary covariates (a missing num_eng compares as False, i.e. counts as single-engine)
df['amateur_built'] = (df['homebuilt'] == 'Yes').astype(int)
df['multi_engine'] = (df['num_eng'] >= 2).astype(int)

# 5. Aircraft category groups
df['aircraft_type'] = df['acft_category'].map({
    'AIR': 'Airplane',
    'HELI': 'Helicopter',
    'GLID': 'Glider',
    'BALL': 'Balloon',
    'GYRO': 'Gyrocraft',
    'UNK': 'Unknown'
}).fillna('Other')

# 6. Age groups for stratified analysis
# include_lowest=True keeps age 0 in the first bin; without it, age 0 becomes NaN
df['age_group'] = pd.cut(df['aircraft_age'],
                         bins=[0, 10, 20, 30, 100],
                         labels=['0-10 years', '11-20 years', '21-30 years', '31+ years'],
                         include_lowest=True)

# Remove invalid ages (negative or extreme values)
df = df[(df['aircraft_age'] >= 0) & (df['aircraft_age'] <= 100)].copy()

print(f"\nAfter feature engineering: {len(df):,} events")
print(f"Fatal events: {df['event_fatal'].sum():,} ({df['event_fatal'].mean()*100:.1f}%)")
print(f"\nAircraft age statistics:")
print(df['aircraft_age'].describe())
print(f"\nAircraft type distribution:")
print(df['aircraft_type'].value_counts())

## 2. Kaplan-Meier Survival Curves

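The Kaplan-Meier estimator calculates survival probability over time (here, aircraft age). Non-fatal accidents are treated as censored observations.

**Interpretation**:
- Y-axis: Estimated probability that no fatal accident has occurred by a given aircraft age
- X-axis: Aircraft age (years)
- Steeper decline = Higher fatal accident rate at that age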
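
To make the estimator concrete, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit formula, S(t) = ∏ (1 − dᵢ/nᵢ), where dᵢ is the number of fatal events at age tᵢ and nᵢ is the number still at risk. The function name `kaplan_meier` and the toy ages below are made up for illustration; the notebook itself relies on `lifelines.KaplanMeierFitter`.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    durations: observed times (here, aircraft age in years)
    events:    1 = event observed (fatal), 0 = censored (non-fatal)
    """
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    exits = Counter(durations)   # every record leaves the risk set at its own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            survival *= 1 - d / at_risk   # product-limit step
            curve.append((t, survival))
        at_risk -= exits[t]               # drop events and censored records alike
    return curve


# Hypothetical toy data, not taken from the NTSB database
ages = [2, 5, 5, 8, 12, 12, 20, 30]
fatal = [1, 0, 1, 0, 1, 1, 0, 1]

for t, s in kaplan_meier(ages, fatal):
    print(f"age {t:>2}: S(t) = {s:.3f}")
```

The curve only steps down at ages where a fatal event occurs; censored records shrink the risk set without changing S(t) directly.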

In [None]:
# Overall Kaplan-Meier survival curve
kmf = KaplanMeierFitter()
kmf.fit(durations=df['aircraft_age'], event_observed=df['event_fatal'])

fig, ax = plt.subplots(figsize=(12, 6))
kmf.plot_survival_function(ax=ax, ci_show=True)

ax.set_xlabel('Aircraft Age (years)', fontsize=12)
ax.set_ylabel('Survival Probability (Non-Fatal)', fontsize=12)
ax.set_title('Kaplan-Meier Survival Curve: Overall Aviation Accidents\n(Probability of Non-Fatal Outcome by Aircraft Age)',
             fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)

# Add median survival time
median_survival = kmf.median_survival_time_
ax.axhline(y=0.5, color='red', linestyle='--', alpha=0.5, label=f'Median survival: {median_survival:.1f} years')
ax.legend()

plt.tight_layout()
plt.savefig(figures_dir / '01_overall_survival_curve.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\n📊 Overall Survival Statistics:")
print(f"Median survival time: {median_survival:.1f} years")
print(f"Survival at 10 years: {kmf.survival_function_at_times(10).values[0]:.3f}")
print(f"Survival at 30 years: {kmf.survival_function_at_times(30).values[0]:.3f}")
print(f"Survival at 50 years: {kmf.survival_function_at_times(50).values[0]:.3f}")

In [None]:
# Survival curves by aircraft age group
fig, ax = plt.subplots(figsize=(12, 6))

age_groups = df['age_group'].dropna().unique()
results = []

for age_grp in sorted(age_groups):
    mask = df['age_group'] == age_grp
    kmf = KaplanMeierFitter()
    kmf.fit(durations=df.loc[mask, 'aircraft_age'],
            event_observed=df.loc[mask, 'event_fatal'],
            label=age_grp)
    kmf.plot_survival_function(ax=ax, ci_show=False)

    results.append({
        'age_group': age_grp,
        'n': mask.sum(),
        'fatal_rate': df.loc[mask, 'event_fatal'].mean()
    })

ax.set_xlabel('Aircraft Age (years)', fontsize=12)
ax.set_ylabel('Survival Probability (Non-Fatal)', fontsize=12)
ax.set_title('Survival Curves by Aircraft Age Group\n(Stratified Kaplan-Meier Analysis)',
             fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend(loc='best')

plt.tight_layout()
plt.savefig(figures_dir / '02_survival_by_age_group.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n📊 Fatal Accident Rates by Age Group:")
for r in results:
    print(f"{r['age_group']}: {r['fatal_rate']*100:.2f}% ({r['n']:,} events)")

In [None]:
# Survival curves by aircraft type
fig, ax = plt.subplots(figsize=(12, 6))

# Focus on major aircraft types
major_types = df['aircraft_type'].value_counts().head(4).index
type_results = []

for acft_type in major_types:
    mask = df['aircraft_type'] == acft_type
    if mask.sum() > 100:  # Minimum sample size
        kmf = KaplanMeierFitter()
        kmf.fit(durations=df.loc[mask, 'aircraft_age'],
                event_observed=df.loc[mask, 'event_fatal'],
                label=f"{acft_type} (n={mask.sum():,})")
        kmf.plot_survival_function(ax=ax, ci_show=False)

        type_results.append({
            'type': acft_type,
            'n': mask.sum(),
            'fatal_rate': df.loc[mask, 'event_fatal'].mean(),
            'median_survival': kmf.median_survival_time_
        })

ax.set_xlabel('Aircraft Age (years)', fontsize=12)
ax.set_ylabel('Survival Probability (Non-Fatal)', fontsize=12)
ax.set_title('Survival Curves by Aircraft Type\n(Kaplan-Meier Stratified Analysis)',
             fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend(loc='best')

plt.tight_layout()
plt.savefig(figures_dir / '03_survival_by_aircraft_type.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n📊 Survival Statistics by Aircraft Type:")
for r in type_results:
    print(f"{r['type']}: Fatal rate {r['fatal_rate']*100:.2f}%, Median survival {r['median_survival']:.1f} years")

## 3. Log-Rank Tests for Group Comparisons

**Log-rank test**: Non-parametric test comparing survival curves between groups

**Hypotheses**:
- H₀: Survival curves are identical between groups
- H₁: Survival curves differ significantly

**α = 0.05** (95% confidence level)
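
As a rough illustration of what the two-group log-rank test computes, the sketch below compares observed and expected event counts at each event time and refers the resulting statistic to a chi-square distribution with 1 degree of freedom. The function name `logrank_two_groups` and the toy data are hypothetical; the notebook uses `lifelines.statistics.logrank_test`, which may differ in details such as weighting options.

```python
import math


def logrank_two_groups(dur_a, ev_a, dur_b, ev_b):
    """Two-group log-rank statistic and p-value (chi-square, 1 df).

    Arguments are plain Python lists; events are 1 (observed) or 0 (censored).
    """
    event_times = sorted({t for t, e in zip(dur_a + dur_b, ev_a + ev_b) if e == 1})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in dur_a if x >= t)          # at risk in group A
        n_b = sum(1 for x in dur_b if x >= t)          # at risk in group B
        d_a = sum(1 for x, e in zip(dur_a, ev_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(dur_b, ev_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n                      # expected events under H0
        if n > 1:                                      # hypergeometric variance
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    statistic = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(statistic / 2))      # chi-square survival function, 1 df
    return statistic, p_value


# Hypothetical toy data: group A tends to have events at younger ages
stat, p = logrank_two_groups([3, 5, 7, 9, 12], [1, 1, 1, 0, 1],
                             [10, 14, 18, 22, 25], [1, 0, 1, 1, 0])
print(f"test statistic = {stat:.3f}, p-value = {p:.4f}")
```

Under H₀ the observed and expected counts agree on average, so the statistic stays near zero; a large statistic (small p-value) is evidence against identical survival curves.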

In [None]:
# Log-rank test: Amateur-built vs Certificated
amateur_mask = df['amateur_built'] == 1
cert_mask = df['amateur_built'] == 0

results_amateur = logrank_test(
    durations_A=df.loc[amateur_mask, 'aircraft_age'],
    durations_B=df.loc[cert_mask, 'aircraft_age'],
    event_observed_A=df.loc[amateur_mask, 'event_fatal'],
    event_observed_B=df.loc[cert_mask, 'event_fatal']
)

print("\n📊 Log-Rank Test: Amateur-Built vs Certificated Aircraft")
print(f"Test statistic: {results_amateur.test_statistic:.3f}")
print(f"p-value: {results_amateur.p_value:.6f}")
print(f"\nAmateur-built: {amateur_mask.sum():,} events, {df.loc[amateur_mask, 'event_fatal'].mean()*100:.2f}% fatal")
print(f"Certificated: {cert_mask.sum():,} events, {df.loc[cert_mask, 'event_fatal'].mean()*100:.2f}% fatal")

if results_amateur.p_value < 0.05:
    print("\n✅ SIGNIFICANT: Survival curves differ significantly (p < 0.05)")
else:
    print("\n❌ NOT SIGNIFICANT: No significant difference in survival curves (p >= 0.05)")

In [None]:
# Log-rank test: Multi-engine vs Single-engine
multi_mask = df['multi_engine'] == 1
single_mask = df['multi_engine'] == 0

results_engine = logrank_test(
    durations_A=df.loc[multi_mask, 'aircraft_age'],
    durations_B=df.loc[single_mask, 'aircraft_age'],
    event_observed_A=df.loc[multi_mask, 'event_fatal'],
    event_observed_B=df.loc[single_mask, 'event_fatal']
)

print("\n📊 Log-Rank Test: Multi-Engine vs Single-Engine Aircraft")
print(f"Test statistic: {results_engine.test_statistic:.3f}")
print(f"p-value: {results_engine.p_value:.6f}")
print(f"\nMulti-engine: {multi_mask.sum():,} events, {df.loc[multi_mask, 'event_fatal'].mean()*100:.2f}% fatal")
print(f"Single-engine: {single_mask.sum():,} events, {df.loc[single_mask, 'event_fatal'].mean()*100:.2f}% fatal")

if results_engine.p_value < 0.05:
    print("\n✅ SIGNIFICANT: Survival curves differ significantly (p < 0.05)")
else:
    print("\n❌ NOT SIGNIFICANT: No significant difference in survival curves (p >= 0.05)")

In [None]:
# Multivariate log-rank test: Aircraft type
# Test if survival curves differ across multiple aircraft types
major_types_df = df[df['aircraft_type'].isin(major_types)].copy()

# Note: lifelines names the first argument `event_durations`, not `durations`
results_multivariate = multivariate_logrank_test(
    event_durations=major_types_df['aircraft_age'],
    groups=major_types_df['aircraft_type'],
    event_observed=major_types_df['event_fatal']
)

print("\n📊 Multivariate Log-Rank Test: Aircraft Type Comparison")
print(f"Test statistic: {results_multivariate.test_statistic:.3f}")
print(f"p-value: {results_multivariate.p_value:.6f}")
print(f"Degrees of freedom: {len(major_types) - 1}")

if results_multivariate.p_value < 0.05:
    print("\n✅ SIGNIFICANT: Survival curves differ across aircraft types (p < 0.05)")
else:
    print("\n❌ NOT SIGNIFICANT: No significant difference across aircraft types (p >= 0.05)")

## 4. Cox Proportional Hazards Regression

**Cox PH Model**: Semi-parametric model for hazard ratios

**Hazard Ratio Interpretation**:
- HR = 1.0: No effect
- HR > 1.0: Increased risk of fatal outcome
- HR < 1.0: Decreased risk of fatal outcome

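**Example**: HR = 1.5 means a 50% higher hazard (instantaneous rate) of a fatal outcome at any given aircraft age, holding the other covariates fixed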
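
The hazard ratio is the exponential of the Cox regression coefficient, and its confidence interval comes from exponentiating the interval on the coefficient scale. A small sketch of that arithmetic follows; the helper `hazard_ratio` and the coefficient and standard error values are hypothetical, chosen only so that the HR lands near 1.5.

```python
import math


def hazard_ratio(coef, se, z=1.96):
    """Convert a Cox coefficient and its standard error to (HR, lower, upper).

    z = 1.96 gives an approximate 95% Wald confidence interval.
    """
    hr = math.exp(coef)
    lower = math.exp(coef - z * se)
    upper = math.exp(coef + z * se)
    return hr, lower, upper


# Hypothetical coefficient and standard error, not fitted values from this notebook
coef, se = 0.405, 0.10
hr, ci_low, ci_high = hazard_ratio(coef, se)

print(f"HR = {hr:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"Change in hazard: {(hr - 1) * 100:+.0f}%")
```

A confidence interval that excludes 1.0 corresponds to a coefficient that differs significantly from zero at the chosen level.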

In [None]:
# Prepare data for Cox regression
cox_df = df[[
    'aircraft_age', 'event_fatal', 'amateur_built', 'multi_engine',
    'imc_flag'
]].dropna().copy()

# Fit Cox proportional hazards model
cph = CoxPHFitter()
cph.fit(cox_df, duration_col='aircraft_age', event_col='event_fatal')

print("\n📊 Cox Proportional Hazards Regression Results")
print("=" * 80)
print(cph.summary)
print("\n" + "=" * 80)

# Extract hazard ratios and confidence intervals
hr_summary = cph.summary[['exp(coef)', 'exp(coef) lower 95%', 'exp(coef) upper 95%', 'p']].copy()
hr_summary.columns = ['Hazard Ratio', '95% CI Lower', '95% CI Upper', 'p-value']

print("\n📊 Hazard Ratios (Fatal Outcome Risk):")
print(hr_summary.to_string())

# Concordance index (model discrimination)
print(f"\n\n📊 Model Performance:")
print(f"Concordance Index (C-index): {cph.concordance_index_:.4f}")
print(f"  (0.5 = random, 1.0 = perfect discrimination)")

In [None]:
# Visualize hazard ratios (forest plot)
fig, ax = plt.subplots(figsize=(10, 6))

# Extract data for plotting
hr_data = hr_summary.reset_index()
hr_data.columns = ['Variable', 'HR', 'CI_lower', 'CI_upper', 'p_value']

# Create forest plot
y_pos = np.arange(len(hr_data))

# Plot hazard ratios with confidence intervals
ax.errorbar(hr_data['HR'], y_pos,
            xerr=[hr_data['HR'] - hr_data['CI_lower'], hr_data['CI_upper'] - hr_data['HR']],
            fmt='o', markersize=8, capsize=5, capthick=2, color='steelblue')

# Add reference line at HR=1
ax.axvline(x=1.0, color='red', linestyle='--', linewidth=2, alpha=0.7, label='No effect (HR=1.0)')

# Formatting
ax.set_yticks(y_pos)
ax.set_yticklabels([var.replace('_', ' ').title() for var in hr_data['Variable']])
ax.set_xlabel('Hazard Ratio (95% Confidence Interval)', fontsize=12)
ax.set_title('Cox Proportional Hazards: Fatal Accident Risk Factors\n(Hazard Ratios with 95% Confidence Intervals)',
             fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')
ax.legend()

# Add p-value annotations
for i, (hr, p) in enumerate(zip(hr_data['HR'], hr_data['p_value'])):
    if p < 0.001:
        sig_str = '***'
    elif p < 0.01:
        sig_str = '**'
    elif p < 0.05:
        sig_str = '*'
    else:
        sig_str = 'ns'

    ax.text(hr, i, f'  {sig_str}', va='center', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.savefig(figures_dir / '04_cox_hazard_ratios.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nSignificance codes: *** p<0.001, ** p<0.01, * p<0.05, ns not significant")

## 5. Cumulative Hazard Analysis

**Cumulative hazard**: Total accumulated risk over time

Higher cumulative hazard = Greater risk accumulation with age
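
For intuition, the cumulative hazard can be estimated with the Nelson-Aalen formula, H(t) = Σ dᵢ/nᵢ, summing the fraction of at-risk records that experience the event at each event time. The sketch below is a minimal pure-Python version on made-up data; the function name `nelson_aalen` is hypothetical, and lifelines offers `NelsonAalenFitter` for real use.

```python
from collections import Counter


def nelson_aalen(durations, events):
    """Return [(t, H(t))] at each distinct time with at least one event."""
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    exits = Counter(durations)
    at_risk = len(durations)
    cum_hazard = 0.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            cum_hazard += d / at_risk   # hazard increment at this event time
            curve.append((t, cum_hazard))
        at_risk -= exits[t]             # drop events and censored records alike
    return curve


# Hypothetical toy data, not taken from the NTSB database
ages = [2, 5, 5, 8, 12, 12, 20, 30]
fatal = [1, 0, 1, 0, 1, 1, 0, 1]

for t, h in nelson_aalen(ages, fatal):
    print(f"age {t:>2}: H(t) = {h:.3f}")
```

Unlike a survival probability, the cumulative hazard is not bounded by 1; it keeps growing as long as events occur in a shrinking risk set.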

In [None]:
# Cumulative hazard curves by amateur-built status
# Nelson-Aalen estimates the cumulative hazard H(t);
# KaplanMeierFitter.plot_cumulative_density would plot 1 - S(t) instead
from lifelines import NelsonAalenFitter

fig, ax = plt.subplots(figsize=(12, 6))

for built_type, label in [(1, 'Amateur-Built'), (0, 'Certificated')]:
    mask = df['amateur_built'] == built_type
    naf = NelsonAalenFitter()
    naf.fit(durations=df.loc[mask, 'aircraft_age'],
            event_observed=df.loc[mask, 'event_fatal'],
            label=f"{label} (n={mask.sum():,})")
    naf.plot_cumulative_hazard(ax=ax, ci_show=False)

ax.set_xlabel('Aircraft Age (years)', fontsize=12)
ax.set_ylabel('Cumulative Hazard (Fatal Accident Risk)', fontsize=12)
ax.set_title('Cumulative Hazard Function: Amateur-Built vs Certificated\n(Risk Accumulation Over Aircraft Lifetime)',
             fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend(loc='best')

plt.tight_layout()
plt.savefig(figures_dir / '05_cumulative_hazard_amateur.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Cumulative hazard curves by engine configuration
# Nelson-Aalen estimates the cumulative hazard H(t);
# KaplanMeierFitter.plot_cumulative_density would plot 1 - S(t) instead
from lifelines import NelsonAalenFitter

fig, ax = plt.subplots(figsize=(12, 6))

for eng_type, label in [(1, 'Multi-Engine'), (0, 'Single-Engine')]:
    mask = df['multi_engine'] == eng_type
    naf = NelsonAalenFitter()
    naf.fit(durations=df.loc[mask, 'aircraft_age'],
            event_observed=df.loc[mask, 'event_fatal'],
            label=f"{label} (n={mask.sum():,})")
    naf.plot_cumulative_hazard(ax=ax, ci_show=False)

ax.set_xlabel('Aircraft Age (years)', fontsize=12)
ax.set_ylabel('Cumulative Hazard (Fatal Accident Risk)', fontsize=12)
ax.set_title('Cumulative Hazard Function: Engine Configuration\n(Risk Accumulation by Number of Engines)',
             fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend(loc='best')

plt.tight_layout()
plt.savefig(figures_dir / '06_cumulative_hazard_engine.png', dpi=150, bbox_inches='tight')
plt.show()

## Key Findings

### 1. Overall Survival
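- **Median survival time**: Aircraft age at which the estimated survival probability drops to 0.5 (reported as infinite if the curve never reaches 0.5)
- **Age effect**: The survival curve is non-increasing by construction; the age effect shows in how steeply it falls at a given age (steeper = higher fatal risk at that age)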

### 2. Group Comparisons (Log-Rank Tests)
- **Amateur-built vs Certificated**: Statistically significant difference (if p < 0.05)
- **Multi-engine vs Single-engine**: Engine redundancy effect on survival
- **Aircraft types**: Helicopters vs Airplanes vs Gliders show different risk profiles

### 3. Cox Proportional Hazards (Risk Factors)
- **Amateur-built**: Hazard ratio indicates increased/decreased fatal risk
- **Multi-engine**: Protective effect (HR < 1.0) or risk factor (HR > 1.0)
- **IMC conditions**: Weather impact on fatal outcome probability
- **Night operations**: Not a covariate in the Cox model fitted above; assessing the effect of lighting conditions would require adding one

### 4. Statistical Assumptions
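- **Proportional hazards assumption**: Hazard ratios are assumed constant over aircraft age; this can be checked with Schoenfeld residuals (e.g. `cph.check_assumptions`), a check this notebook does not run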
- **Non-informative censoring**: Assumes censored events are random
- **Independence**: Accidents assumed independent events

### 5. Limitations
- **Time variable**: Aircraft age may not capture all time-dependent risks (flight hours better metric)
- **Competing risks**: Aircraft retirement/sale competes with accident occurrence
- **Left truncation**: The dataset starts in 1962, so aircraft built earlier are observed only if they were still flying by then (survivorship bias)
- **Missing data**: Some aircraft missing year-of-manufacture (excluded from analysis)

### 6. Practical Significance
- **Maintenance implications**: Older aircraft require enhanced inspection
- **Insurance pricing**: Age-based risk stratification
- **Regulatory policy**: Amateur-built oversight and engine redundancy requirements
- **Pilot training**: Weather and night operation risk awareness