# Summative Lab: Data Analysis — Jet Airline Safety (EDA & Recommendations)

This notebook covers **Part Two** of the assessment: exploratory data analysis (EDA) and evidence-backed recommendations.

**What this notebook does:**

- Loads the cleaned CSV from Part One.
- Defines small vs. large aircraft groups (threshold: **20 seats/occupants**).
- Computes group-level safety metrics (serious/fatal injury rate, destruction rate) for **make/model** with a minimum sample size.
- Explores at least **two additional factors** from **Weather Condition**, **Engine Type**, **Number of Engines**, **Phase of Flight**, and **Purpose of Flight**, and visualizes their relationships to outcomes.
- Provides **clear, evidence-based recommendations** for **small** and **large** aircraft.

> Notes:
> - If the dataset lacks an explicit capacity field, we use the best available estimate, `total_aboard_est` (from Part One), to separate the small and large groups. Where `total_aboard_est` is missing, we fall back to `number_of_engines` as a crude proxy (more than two engines counts as large). Rows with neither value default to small.


In [None]:
# Imports & load
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(context="talk", style="whitegrid")

possible_cleaned = ["data/aviation_cleaned.csv","aviation_cleaned.csv"]
df = None
for p in possible_cleaned:
    if os.path.exists(p):
        df = pd.read_csv(p)
        source_path = p
        break
if df is None:
    raise FileNotFoundError("Cleaned file not found. Expected one of: " + ", ".join(possible_cleaned))

print(f"Loaded cleaned data from {source_path}. Shape = {df.shape}")
df.head(3)

In [None]:
# Define small vs. large by total_aboard_est threshold (>=20 is large)
aboard_col = "total_aboard_est" if "total_aboard_est" in df.columns else None
eng_col = "number_of_engines" if "number_of_engines" in df.columns else None

def is_large(row):
    if aboard_col and pd.notna(row.get(aboard_col, np.nan)):
        return row[aboard_col] >= 20
    if eng_col and pd.notna(row.get(eng_col, np.nan)):
        return row[eng_col] > 2
    return False

df["is_large"] = df.apply(is_large, axis=1)
df["size_group"] = np.where(df["is_large"], "Large (>=20)", "Small (<20)")
df["year"] = pd.to_datetime(df["event_date_parsed"], errors="coerce").dt.year

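# Coerce to a clean boolean flag. A bare astype(bool) would turn NaN into True,
# so missing values are first treated as "not destroyed" (an assumption).
df["destroyed"] = df["destroyed"].fillna(False).astype(bool)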

df.head(3)
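
The row-wise rule above can also be written without `apply`. The following is a minimal sketch on made-up rows, assuming both `total_aboard_est` and `number_of_engines` exist as columns; the helper name `classify_size` and the toy values are hypothetical and not part of the Part One output.

```python
import numpy as np
import pandas as pd

# Toy rows (made up for illustration): aboard estimate first, engine count as fallback
toy = pd.DataFrame({
    "total_aboard_est": [150.0, 4.0, np.nan, np.nan, np.nan],
    "number_of_engines": [2.0, 1.0, 4.0, 1.0, np.nan],
})

def classify_size(frame, threshold=20):
    """Return a boolean Series: True where the aircraft counts as large."""
    aboard = frame["total_aboard_est"]
    engines = frame["number_of_engines"]
    # Primary rule, applied where an aboard estimate exists
    by_aboard = aboard >= threshold
    # Fallback rule; comparisons against NaN evaluate to False, so rows
    # with neither value end up in the small group
    by_engines = engines > 2
    # Keep the primary result where aboard is known, else use the fallback
    return by_aboard.where(aboard.notna(), by_engines)

toy["is_large"] = classify_size(toy)
toy["size_group"] = np.where(toy["is_large"], "Large (>=20)", "Small (<20)")
print(toy)
```

On a large table this avoids a Python-level loop over rows while keeping the same precedence: aboard estimate first, engine count second, small by default.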

## Group-level safety metrics (by make/model)

In [None]:
# Group metrics by make_model_key within each size group
min_n = 30

metrics = (df
    .groupby(["size_group","make_model_key"], dropna=False)
    .agg(
        n=("make_model_key","size"),
        serious_fatal_rate_mean=("serious_fatal_rate","mean"),
        destroyed_rate=("destroyed","mean")
    )
    .reset_index()
)

metrics = metrics[metrics["n"] >= min_n].sort_values(["size_group","serious_fatal_rate_mean","destroyed_rate"])

print("Top safer candidates (lower injury rate + lower destruction rate):")
display(metrics.groupby("size_group").head(10))
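
To see why the minimum-sample filter matters, here is a small sketch on made-up events. It mirrors the aggregation above inside a hypothetical helper, `group_metrics`; the make/model keys and rates are invented for illustration only.

```python
import pandas as pd

# Made-up events: model "A|1" has five records, model "B|2" has a single clean record
toy = pd.DataFrame({
    "size_group": ["Small (<20)"] * 6,
    "make_model_key": ["A|1"] * 5 + ["B|2"],
    "serious_fatal_rate": [0.0, 0.5, 0.0, 1.0, 0.0, 0.0],
    "destroyed": [False, True, False, True, False, False],
})

def group_metrics(frame, min_n):
    """Aggregate per size group and make/model, keeping groups with at least min_n events."""
    out = (
        frame.groupby(["size_group", "make_model_key"], dropna=False)
        .agg(
            n=("make_model_key", "size"),
            serious_fatal_rate_mean=("serious_fatal_rate", "mean"),
            destroyed_rate=("destroyed", "mean"),
        )
        .reset_index()
    )
    out = out[out["n"] >= min_n]
    return out.sort_values(["size_group", "serious_fatal_rate_mean", "destroyed_rate"])

unfiltered = group_metrics(toy, min_n=1)  # the single-event model ranks first
filtered = group_metrics(toy, min_n=5)    # only the model with enough events remains
print(unfiltered)
print(filtered)
```

Without the filter, a model with one uneventful record looks perfectly safe. With the filter, only rates backed by enough events are ranked.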

In [None]:
# Visual summaries
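# sharey=False: each panel lists different make/model categories, so a shared
# categorical y-axis would attach the wrong labels to one panel's bars
fig, axes = plt.subplots(1, 2, figsize=(16, 6), sharey=False)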
for ax, grp in zip(axes, ["Small (<20)","Large (>=20)"]):
    sub = metrics[metrics["size_group"]==grp].nsmallest(10, ["serious_fatal_rate_mean","destroyed_rate"])
    sns.barplot(data=sub, x="serious_fatal_rate_mean", y="make_model_key", ax=ax)
    ax.set_title(f"{grp}: Lowest Serious/Fatal Injury Rates (Top 10)")
    ax.set_xlabel("Mean Serious/Fatal Injury Rate")
    ax.set_ylabel("Make | Model")
plt.tight_layout()

## Explore Other Factors

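This section examines how **Weather Condition**, **Engine Type**, **Number of Engines**, **Phase of Flight**, and **Purpose of Flight** relate to outcomes. The cell below summarizes the first three of these columns that are present in the cleaned data.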


In [None]:
factors = []
for c in ["weather_condition","engine_type","number_of_engines","broad_phase_of_flight","purpose_of_flight"]:
    if c in df.columns:
        factors.append(c)

def factor_summary_plot(col, outcome="serious_fatal_rate"):
    temp = (df.groupby(col)
              .agg(n=(outcome,"count"),
                   mean_rate=(outcome,"mean"),
                   destroyed_rate=("destroyed","mean"))
              .reset_index()
              .sort_values("mean_rate"))
    display(temp.head(10))
    plt.figure(figsize=(12,6))
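    # orient="h" keeps the bars horizontal even when col is numeric (e.g., number_of_engines)
    sns.barplot(data=temp, x="mean_rate", y=col, orient="h")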
    plt.title(f"{col}: Mean {outcome} by Category")
    plt.xlabel(f"Mean {outcome}")
    plt.ylabel(col)
    plt.tight_layout()
    plt.show()
    return temp

# Summarize the first three available factors; widen the slice to cover more of them
summaries = {}
for col in factors[:3]:
    summaries[col] = factor_summary_plot(col, outcome="serious_fatal_rate")
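
Because the recommendations are split by size group, it may help to break a factor down by `size_group` as well. The following is a minimal sketch on made-up rows; the helper `factor_by_size` is hypothetical, and the weather codes `VMC` and `IMC` are assumed category values that may not match the cleaned data.

```python
import pandas as pd

# Made-up events for illustration only
toy = pd.DataFrame({
    "size_group": ["Small (<20)"] * 3 + ["Large (>=20)"] * 3,
    "weather_condition": ["VMC", "VMC", "IMC", "VMC", "IMC", "IMC"],
    "serious_fatal_rate": [0.0, 0.5, 1.0, 0.0, 0.2, 0.4],
})

def factor_by_size(frame, col, outcome="serious_fatal_rate"):
    """Mean outcome per category of `col`, with one column per size group."""
    return frame.pivot_table(index=col, columns="size_group", values=outcome, aggfunc="mean")

table = factor_by_size(toy, "weather_condition")
print(table.round(2))
```

Cells built from only a handful of events deserve the same caution as the make/model table, so it is worth checking the per-cell counts (for example with `aggfunc="count"`) before drawing conclusions.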

## Recommendations

- Provide separate shortlists for **Small (<20)** and **Large (>=20)** groups using tables/plots above.
- Highlight models with **n ≥ 30**, **low serious/fatal rate**, and **low destroyed_rate**.
- Discuss how the examined factors (e.g., weather, engines, phase of flight) influence outcomes and what actions insurers or airlines could take.
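
One way to turn the metrics table into per-group shortlists for the write-up is sketched below. The helper `shortlist`, its `top_k` parameter, and the toy rows are hypothetical; the ranking simply reuses the ordering used earlier (injury rate first, destruction rate as tie-breaker), which is one reasonable choice rather than the only one.

```python
import pandas as pd

# Made-up rows in the same shape as the `metrics` table
toy_metrics = pd.DataFrame({
    "size_group": ["Small (<20)"] * 3 + ["Large (>=20)"] * 2,
    "make_model_key": ["A|1", "B|2", "C|3", "D|4", "E|5"],
    "n": [120, 45, 12, 80, 33],
    "serious_fatal_rate_mean": [0.10, 0.05, 0.00, 0.02, 0.08],
    "destroyed_rate": [0.20, 0.10, 0.00, 0.01, 0.05],
})

def shortlist(metrics, min_n=30, top_k=2):
    """Return one shortlist table per size group, best-ranked models first."""
    eligible = metrics[metrics["n"] >= min_n]
    ranked = eligible.sort_values(["serious_fatal_rate_mean", "destroyed_rate"])
    return {
        group: rows.head(top_k).reset_index(drop=True)
        for group, rows in ranked.groupby("size_group")
    }

picks = shortlist(toy_metrics)
for group, rows in picks.items():
    print(group)
    print(rows[["make_model_key", "n", "serious_fatal_rate_mean", "destroyed_rate"]])
```

In this toy table, model `C|3` has the lowest rates but only 12 events, so it is excluded from the small-aircraft shortlist.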
