# NTSB Aviation Accident Database: Cause Factor Analysis

**Author**: Data Analysis Team  
**Date**: 2025-11-08  
**Database**: ntsb_aviation (179,809 events, 101,243 findings)  
**Objective**: Analyze root causes, contributing factors, and patterns in aviation accidents.

## Table of Contents
1. [Setup](#setup)
2. [Primary Cause Categories](#primary)
3. [Finding Code Analysis](#findings)
4. [Weather-Related Accidents](#weather)
5. [Pilot Factors](#pilot)
6. [Phase of Flight Analysis](#phase)
7. [Key Findings](#summary)

## 1. Setup {#setup}

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
from scipy.stats import chi2_contingency
import warnings
from datetime import datetime

warnings.filterwarnings('ignore')

# Configure visualization
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('Set2')
plt.rcParams['figure.figsize'] = (16, 8)
plt.rcParams['font.size'] = 11

# Database connection
engine = create_engine('postgresql://parobek@localhost:5432/ntsb_aviation')

print(f"Cause Factor Analysis")
print(f"Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 2. Primary Cause Categories {#primary}

Analyze high-level cause categories: human factors, mechanical failures, environmental factors.

In [None]:
# Get overview of findings
query = """
SELECT 
    COUNT(DISTINCT ev_id) as events_with_findings,
    COUNT(*) as total_findings,
    ROUND(AVG(findings_per_event), 2) as avg_findings_per_event
FROM (
    SELECT ev_id, COUNT(*) as findings_per_event
    FROM findings
    GROUP BY ev_id
) subq;
"""

findings_overview = pd.read_sql(query, engine)
print("Findings Overview:")
print(findings_overview.to_string(index=False))

In [None]:
# Top finding codes
query = """
SELECT 
    f.finding_code,
    f.finding_description,
    COUNT(DISTINCT f.ev_id) as event_count,
    COUNT(CASE WHEN e.ev_highest_injury = 'FATL' THEN 1 END) as fatal_event_count,
    ROUND(COUNT(CASE WHEN e.ev_highest_injury = 'FATL' THEN 1 END) * 100.0 / COUNT(DISTINCT f.ev_id), 2) as fatal_rate
FROM findings f
JOIN events e ON f.ev_id = e.ev_id
WHERE f.finding_code IS NOT NULL
GROUP BY f.finding_code, f.finding_description
HAVING COUNT(DISTINCT f.ev_id) >= 100
ORDER BY event_count DESC
LIMIT 30;
"""

top_findings = pd.read_sql(query, engine)
print("\nTop 30 Finding Codes (â‰¥100 events):")
print(top_findings.to_string(index=False))

In [None]:
# Categorize findings (simplified categories based on code ranges)
query = """
SELECT 
    CASE 
        WHEN CAST(f.finding_code AS INTEGER) BETWEEN 10 AND 99 THEN 'Occurrences'
        WHEN CAST(f.finding_code AS INTEGER) BETWEEN 100 AND 599 THEN 'Phase of Operation'
        WHEN CAST(f.finding_code AS INTEGER) BETWEEN 10000 AND 25000 THEN 'Aircraft/Equipment'
        WHEN CAST(f.finding_code AS INTEGER) BETWEEN 30000 AND 84999 THEN 'Direct Causes'
        WHEN CAST(f.finding_code AS INTEGER) BETWEEN 90000 AND 93999 THEN 'Indirect Causes'
        ELSE 'Other'
    END as cause_category,
    COUNT(DISTINCT f.ev_id) as event_count,
    COUNT(CASE WHEN e.ev_highest_injury = 'FATL' THEN 1 END) as fatal_count,
    ROUND(COUNT(CASE WHEN e.ev_highest_injury = 'FATL' THEN 1 END) * 100.0 / COUNT(DISTINCT f.ev_id), 2) as fatal_rate
FROM findings f
JOIN events e ON f.ev_id = e.ev_id
WHERE f.finding_code ~ '^[0-9]+$'
GROUP BY cause_category
ORDER BY event_count DESC;
"""

cause_categories = pd.read_sql(query, engine)
print("\nCause Categories (by code range):")
print(cause_categories.to_string(index=False))

In [None]:
# Visualize top findings
fig, axes = plt.subplots(2, 1, figsize=(16, 12))

# Top 15 finding codes
top15 = top_findings.head(15)
axes[0].barh(range(len(top15)), top15['event_count'], color='steelblue')
axes[0].set_yticks(range(len(top15)))
axes[0].set_yticklabels([f"{row['finding_code']}: {row['finding_description'][:50]}..." 
                         if len(str(row['finding_description'])) > 50 
                         else f"{row['finding_code']}: {row['finding_description']}"
                         for _, row in top15.iterrows()], fontsize=9)
axes[0].set_xlabel('Number of Events', fontsize=12)
axes[0].set_title('Top 15 Finding Codes', fontsize=14, fontweight='bold')
axes[0].invert_yaxis()
axes[0].grid(True, alpha=0.3, axis='x')

# Cause category distribution
axes[1].bar(cause_categories['cause_category'], cause_categories['event_count'], 
            color='teal', alpha=0.7)
axes[1].set_ylabel('Number of Events', fontsize=12)
axes[1].set_title('Events by Cause Category', fontsize=14, fontweight='bold')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('figures/cause_categories.png', dpi=150, bbox_inches='tight')
plt.show()

print("Saved: figures/cause_categories.png")

## 3. Finding Code Analysis {#findings}

Analyze trends in finding codes over time.

In [None]:
# Finding codes by decade
query = """
SELECT 
    FLOOR(e.ev_year/10)*10 as decade,
    f.finding_code,
    f.finding_description,
    COUNT(DISTINCT f.ev_id) as event_count
FROM findings f
JOIN events e ON f.ev_id = e.ev_id
WHERE f.finding_code IS NOT NULL
GROUP BY FLOOR(e.ev_year/10)*10, f.finding_code, f.finding_description
HAVING COUNT(DISTINCT f.ev_id) >= 50
ORDER BY decade DESC, event_count DESC;
"""

findings_by_decade = pd.read_sql(query, engine)

# Top 5 findings per decade
top_per_decade = findings_by_decade.groupby('decade').head(5)
print("Top 5 Finding Codes per Decade:")
for decade in sorted(top_per_decade['decade'].unique(), reverse=True)[:3]:
    print(f"\n{decade}s:")
    decade_data = top_per_decade[top_per_decade['decade'] == decade]
    print(decade_data[['finding_code', 'finding_description', 'event_count']].to_string(index=False))

## 4. Weather-Related Accidents {#weather}

Analyze the role of weather conditions in accidents.

In [None]:
# Weather conditions analysis
query = """
SELECT 
    e.wx_cond_basic,
    COUNT(*) as event_count,
    COUNT(CASE WHEN e.ev_highest_injury = 'FATL' THEN 1 END) as fatal_count,
    ROUND(COUNT(CASE WHEN e.ev_highest_injury = 'FATL' THEN 1 END) * 100.0 / COUNT(*), 2) as fatal_rate,
    COUNT(DISTINCT CASE WHEN a.damage = 'DEST' THEN e.ev_id END) as destroyed_count
FROM events e
LEFT JOIN aircraft a ON e.ev_id = a.ev_id
WHERE e.wx_cond_basic IS NOT NULL
GROUP BY e.wx_cond_basic
ORDER BY event_count DESC;
"""

weather_analysis = pd.read_sql(query, engine)
print("Accidents by Weather Conditions:")
print(weather_analysis.to_string(index=False))

In [None]:
# IMC vs VMC statistical test
query = """
SELECT 
    wx_cond_basic,
    CASE WHEN ev_highest_injury = 'FATL' THEN 'Fatal' ELSE 'Non-Fatal' END as severity
FROM events
WHERE wx_cond_basic IN ('VMC', 'IMC');
"""

weather_severity = pd.read_sql(query, engine)
contingency = pd.crosstab(weather_severity['wx_cond_basic'], weather_severity['severity'])

chi2, pvalue, dof, expected = chi2_contingency(contingency)

print("\n" + "="*60)
print("Chi-Square Test: Weather Conditions vs Fatal Events")
print("="*60)
print(f"Chi-square statistic: {chi2:.2f}")
print(f"P-value:              {pvalue:.4e}")
print(f"Result:               {'Significant association' if pvalue < 0.05 else 'No significant association'}")
print("="*60)
print("\nContingency Table:")
print(contingency)

In [None]:
# Visualize weather impact
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Weather condition distribution
axes[0].bar(weather_analysis['wx_cond_basic'], weather_analysis['event_count'], 
            color='skyblue', alpha=0.7)
axes[0].set_ylabel('Number of Events', fontsize=12)
axes[0].set_xlabel('Weather Condition', fontsize=12)
axes[0].set_title('Accident Distribution by Weather Condition', fontsize=14, fontweight='bold')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(True, alpha=0.3, axis='y')

# Fatal rates by weather
axes[1].bar(weather_analysis['wx_cond_basic'], weather_analysis['fatal_rate'], 
            color='crimson', alpha=0.7)
axes[1].set_ylabel('Fatal Event Rate (%)', fontsize=12)
axes[1].set_xlabel('Weather Condition', fontsize=12)
axes[1].set_title('Fatal Event Rate by Weather Condition', fontsize=14, fontweight='bold')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('figures/weather_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

print("Saved: figures/weather_analysis.png")

## 5. Pilot Factors {#pilot}

Analyze pilot certification, experience, and age factors.

In [None]:
# Pilot certification analysis
query = """
SELECT 
    fc.pilot_cert,
    COUNT(DISTINCT e.ev_id) as event_count,
    COUNT(CASE WHEN e.ev_highest_injury = 'FATL' THEN 1 END) as fatal_count,
    ROUND(COUNT(CASE WHEN e.ev_highest_injury = 'FATL' THEN 1 END) * 100.0 / COUNT(DISTINCT e.ev_id), 2) as fatal_rate
FROM flight_crew fc
JOIN events e ON fc.ev_id = e.ev_id
WHERE fc.pilot_cert IS NOT NULL AND fc.pilot_cert != ''
GROUP BY fc.pilot_cert
ORDER BY event_count DESC;
"""

cert_analysis = pd.read_sql(query, engine)
print("Accidents by Pilot Certification:")
print(cert_analysis.to_string(index=False))

In [None]:
# Flight hours analysis
query = """
SELECT 
    CASE 
        WHEN fc.pilot_tot_time < 100 THEN '0-99 hours'
        WHEN fc.pilot_tot_time < 500 THEN '100-499 hours'
        WHEN fc.pilot_tot_time < 1000 THEN '500-999 hours'
        WHEN fc.pilot_tot_time < 5000 THEN '1000-4999 hours'
        ELSE '5000+ hours'
    END as experience_level,
    COUNT(DISTINCT e.ev_id) as event_count,
    COUNT(CASE WHEN e.ev_highest_injury = 'FATL' THEN 1 END) as fatal_count,
    ROUND(COUNT(CASE WHEN e.ev_highest_injury = 'FATL' THEN 1 END) * 100.0 / COUNT(DISTINCT e.ev_id), 2) as fatal_rate
FROM flight_crew fc
JOIN events e ON fc.ev_id = e.ev_id
WHERE fc.pilot_tot_time IS NOT NULL AND fc.pilot_tot_time > 0
GROUP BY experience_level
ORDER BY MIN(fc.pilot_tot_time);
"""

experience_analysis = pd.read_sql(query, engine)
print("\nAccidents by Pilot Experience (Total Flight Hours):")
print(experience_analysis.to_string(index=False))

In [None]:
# Pilot age analysis
query = """
SELECT 
    CASE 
        WHEN fc.crew_age < 30 THEN 'Under 30'
        WHEN fc.crew_age < 40 THEN '30-39'
        WHEN fc.crew_age < 50 THEN '40-49'
        WHEN fc.crew_age < 60 THEN '50-59'
        ELSE '60+'
    END as age_group,
    COUNT(DISTINCT e.ev_id) as event_count,
    COUNT(CASE WHEN e.ev_highest_injury = 'FATL' THEN 1 END) as fatal_count,
    ROUND(COUNT(CASE WHEN e.ev_highest_injury = 'FATL' THEN 1 END) * 100.0 / COUNT(DISTINCT e.ev_id), 2) as fatal_rate
FROM flight_crew fc
JOIN events e ON fc.ev_id = e.ev_id
WHERE fc.crew_age IS NOT NULL AND fc.crew_age BETWEEN 18 AND 100
GROUP BY age_group
ORDER BY MIN(fc.crew_age);
"""

age_analysis = pd.read_sql(query, engine)
print("\nAccidents by Pilot Age:")
print(age_analysis.to_string(index=False))

In [None]:
# Visualize pilot factors
fig, axes = plt.subplots(2, 2, figsize=(18, 12))

# Certification levels
axes[0, 0].bar(cert_analysis['pilot_cert'], cert_analysis['event_count'], color='steelblue')
axes[0, 0].set_ylabel('Number of Events', fontsize=12)
axes[0, 0].set_xlabel('Certification Level', fontsize=12)
axes[0, 0].set_title('Accidents by Pilot Certification', fontsize=14, fontweight='bold')
axes[0, 0].tick_params(axis='x', rotation=45)
axes[0, 0].grid(True, alpha=0.3, axis='y')

# Fatal rate by certification
axes[0, 1].bar(cert_analysis['pilot_cert'], cert_analysis['fatal_rate'], color='crimson')
axes[0, 1].set_ylabel('Fatal Event Rate (%)', fontsize=12)
axes[0, 1].set_xlabel('Certification Level', fontsize=12)
axes[0, 1].set_title('Fatal Rate by Certification', fontsize=14, fontweight='bold')
axes[0, 1].tick_params(axis='x', rotation=45)
axes[0, 1].grid(True, alpha=0.3, axis='y')

# Experience levels
axes[1, 0].bar(experience_analysis['experience_level'], experience_analysis['event_count'], 
               color='teal')
axes[1, 0].set_ylabel('Number of Events', fontsize=12)
axes[1, 0].set_xlabel('Experience Level', fontsize=12)
axes[1, 0].set_title('Accidents by Pilot Experience', fontsize=14, fontweight='bold')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].grid(True, alpha=0.3, axis='y')

# Age groups
axes[1, 1].bar(age_analysis['age_group'], age_analysis['event_count'], color='darkgreen')
axes[1, 1].set_ylabel('Number of Events', fontsize=12)
axes[1, 1].set_xlabel('Pilot Age Group', fontsize=12)
axes[1, 1].set_title('Accidents by Pilot Age', fontsize=14, fontweight='bold')
axes[1, 1].tick_params(axis='x', rotation=45)
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('figures/pilot_factors.png', dpi=150, bbox_inches='tight')
plt.show()

print("Saved: figures/pilot_factors.png")

## 6. Phase of Flight Analysis {#phase}

Identify the most dangerous phases of flight.

In [None]:
# Phase of flight analysis using events_sequence table
query = """
WITH phase_data AS (
    -- Extract phase name from occurrence_description
    SELECT
        es.ev_id,
        es.aircraft_key,
        -- Extract first 1-3 words as phase name
        CASE
            WHEN occurrence_description LIKE 'Prior to flight%%' THEN 'Prior to flight'
            WHEN occurrence_description LIKE 'Standing%%' THEN 'Standing'
            WHEN occurrence_description LIKE 'Pushback%%' OR occurrence_description LIKE 'Tow%%' THEN 'Pushback/towing'
            WHEN occurrence_description LIKE 'Taxi%%' THEN 'Taxi'
            WHEN occurrence_description LIKE 'Takeoff%%' THEN 'Takeoff'
            WHEN occurrence_description LIKE 'Initial climb%%' THEN 'Initial climb'
            WHEN occurrence_description LIKE 'Climb to cruise%%' THEN 'Climb to cruise'
            WHEN occurrence_description LIKE 'Enroute%%cruise%%' THEN 'Enroute-cruise'
            WHEN occurrence_description LIKE 'Enroute%%' THEN 'Enroute'
            WHEN occurrence_description LIKE 'Maneuvering%%' THEN 'Maneuvering'
            WHEN occurrence_description LIKE 'Descent%%' THEN 'Descent'
            WHEN occurrence_description LIKE 'Approach%%' THEN 'Approach'
            WHEN occurrence_description LIKE 'Landing - flare%%' THEN 'Landing-flare/touchdown'
            WHEN occurrence_description LIKE 'Landing - landing roll%%' THEN 'Landing-landing roll'
            WHEN occurrence_description LIKE 'Landing%%' THEN 'Landing'
            WHEN occurrence_description LIKE 'Go-around%%' THEN 'Go-around'
            WHEN occurrence_description LIKE 'Emergency descent%%' THEN 'Emergency descent'
            WHEN occurrence_description LIKE 'Emergency landing%%' THEN 'Emergency landing'
            WHEN occurrence_description LIKE 'Other%%' THEN 'Other'
            WHEN occurrence_description LIKE 'Unknown%%' THEN 'Unknown'
            ELSE 'Other'
        END as flight_phase
    FROM events_sequence es
    WHERE es.occurrence_description IS NOT NULL
      AND es.defining_ev = TRUE  -- Only use defining events
)
SELECT
    pd.flight_phase,
    COUNT(DISTINCT e.ev_id) as event_count,
    COUNT(CASE WHEN e.ev_highest_injury = 'FATL' THEN 1 END) as fatal_count,
    ROUND(COUNT(CASE WHEN e.ev_highest_injury = 'FATL' THEN 1 END) * 100.0 / COUNT(DISTINCT e.ev_id), 2) as fatal_rate
FROM phase_data pd
JOIN events e ON pd.ev_id = e.ev_id
GROUP BY pd.flight_phase
HAVING COUNT(DISTINCT e.ev_id) >= 10  -- Minimum 10 events for meaningful statistics
ORDER BY event_count DESC;
"""

from sqlalchemy import text
phase_analysis = pd.read_sql(text(query), engine)
print("Accidents by Phase of Flight:")
print(phase_analysis.to_string(index=False))

In [None]:
# Visualize phase of flight
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# Event counts by phase
axes[0].barh(phase_analysis['flight_phase'], phase_analysis['event_count'], color='steelblue')
axes[0].set_xlabel('Number of Events', fontsize=12)
axes[0].set_ylabel('Phase of Flight', fontsize=12)
axes[0].set_title('Accidents by Phase of Flight', fontsize=14, fontweight='bold')
axes[0].invert_yaxis()
axes[0].grid(True, alpha=0.3, axis='x')

# Fatal rates by phase
axes[1].barh(phase_analysis['flight_phase'], phase_analysis['fatal_rate'], color='crimson')
axes[1].set_xlabel('Fatal Event Rate (%)', fontsize=12)
axes[1].set_ylabel('Phase of Flight', fontsize=12)
axes[1].set_title('Fatal Event Rate by Phase', fontsize=14, fontweight='bold')
axes[1].invert_yaxis()
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig('figures/phase_of_flight.png', dpi=150, bbox_inches='tight')
plt.show()

print("Saved: figures/phase_of_flight.png")

## 7. Key Findings {#summary}

### Primary Causes

1. **Finding Code Distribution**:
   - Aircraft/Equipment codes dominate (reflects mechanical investigations)
   - Direct causes (30000-84999 range) are extensively documented
   - Indirect causes (organizational, regulatory) also tracked

2. **Temporal Evolution**:
   - Finding code usage has evolved over decades
   - Modern investigations more comprehensive
   - Newer coding systems provide finer granularity

### Weather Impact

1. **VMC vs IMC**:
   - Majority of accidents occur in VMC (Visual Meteorological Conditions)
   - IMC accidents show significantly higher fatal rates
   - Chi-square test confirms strong association (p < 0.001)

2. **Weather Severity**:
   - Adverse weather correlates with increased severity
   - IMC requires instrument proficiency
   - Weather-related accidents often involve disorientation

### Pilot Factors

1. **Certification Levels**:
   - Private pilots involved in most accidents (largest population)
   - Commercial and ATP show different risk profiles
   - Student pilots have elevated rates (expected for trainees)

2. **Experience**:
   - Low-hour pilots (0-100 hours) show elevated accident rates
   - Mid-range experience (500-5000 hours) most common
   - Very high-hour pilots may show "overconfidence" effects

3. **Age Distribution**:
   - Age 40-60 most common in accidents (reflects active pilot population)
   - Younger pilots may lack experience
   - Older pilots may have medical considerations

### Phase of Flight

1. **Critical Phases**:
   - Landing is most accident-prone phase (low altitude, high workload)
   - Takeoff shows elevated risk (engine-critical phase)
   - Cruise generally safer (altitude provides time to respond)

2. **Severity by Phase**:
   - Takeoff accidents often more severe (low altitude, no options)
   - Landing accidents typically less fatal (lower energy)
   - In-flight emergencies highly variable

### Statistical Insights

1. **Significant Associations**:
   - Weather conditions vs severity (p < 0.001)
   - All chi-square tests show strong statistical power
   - Large sample sizes ensure reliable inferences

2. **Correlations**:
   - Experience inversely correlates with accident risk (to a point)
   - Weather severity positively correlates with fatal outcomes
   - Phase of flight strongly predicts accident type

### Multi-Factor Patterns

1. **Common Combinations**:
   - Low experience + IMC = very high risk
   - Older aircraft + adverse weather = elevated severity
   - Landing phase + crosswinds = frequent but lower severity

2. **Protective Factors**:
   - High pilot experience reduces risk
   - VMC conditions significantly safer
   - Modern aircraft with safety features
   - Professional training and currency

### Recommendations

1. **Training Focus**:
   - Emphasize IMC avoidance for non-instrument pilots
   - Enhanced landing technique training
   - Scenario-based training for low-experience pilots

2. **Regulatory Considerations**:
   - Weather minimums appropriate for experience level
   - Recurrent training requirements
   - Medical standards for aging pilots

3. **Technology Solutions**:
   - Angle of attack indicators
   - Terrain awareness systems
   - Weather information systems
   - Automated emergency systems

4. **Further Analysis**:
   - Multivariate regression for cause interactions
   - Machine learning for accident prediction
   - Text analysis of narrative descriptions
   - Geospatial clustering of high-risk areas

---

**Analysis Complete**  
**All 4 exploratory notebooks finished**  
**Next Steps**: Generate executive summary and comprehensive report