# Aviation Accidents Analysis

You are part of a consulting firm tasked with analyzing commercial and passenger jet airline safety. The client (an airline/airplane insurer) wants to know which types of aircraft (makes/models) exhibit low rates of total destruction and a low likelihood of fatal or serious passenger injuries in the event of an accident. They are also interested in any general variables or conditions that might be at play. Your analysis will be based on aviation accident data accumulated from 1948 to 2023.

Our client is only interested in airplane makes/models that are professional builds and could potentially still be active. Assume a maximum lifetime of 40 years before a make/model is retired, and filter your data accordingly (i.e. from 1983 onwards). They would also like separate recommendations for small aircraft vs. larger passenger models. **In addition, make sure that any claims you make are statistically robust and that you have enough samples when making comparisons between groups.**


In this summative assessment you will demonstrate your ability to:
- **Use Pandas to load, inspect, and clean the dataset appropriately.**
- **Transform relevant columns to create measures that address the problem at hand.**
- Conduct EDA: use visualization and statistical measures to systematically understand the structure of the data.
- Recommend a set of airplane makes and models conforming to the client's request and identify at least *two* factors contributing to airplane safety. You must provide supporting evidence (visuals, summary statistics, tables) for each claim you make.

### Make relevant library imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Loading and Inspection

### Load in data from the relevant directory and inspect the dataframe.
- inspect NaNs, datatypes, and summary statistics

In [30]:
aviation_df = pd.read_csv('./data/AviationData.csv', encoding='windows-1252')

aviation_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50132 non-null  object 
 9   Airport.Name            52704 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87507 non-null  object 
 14  Make                    88826 non-null

## Data Cleaning

### Filtering aircraft and events

We want to filter the dataset down to the aircraft the client wants analyzed:
- inspect relevant columns
- figure out any reasonable imputations
- filter the dataset

In [None]:
# Inspect the Aircraft.Category column
print(aviation_df['Aircraft.Category'].value_counts(dropna=False))
# Keep airplanes only; .copy() avoids SettingWithCopyWarning on the assignments below
aviation_df = aviation_df[aviation_df['Aircraft.Category'] == 'Airplane'].copy()

# Inspect the Amateur.Built column
print(aviation_df['Amateur.Built'].value_counts(dropna=False))
# Imputation: treat NaNs as 'No' (assumption: unlabelled aircraft are professional builds)
aviation_df['Amateur.Built'] = aviation_df['Amateur.Built'].fillna('No')
# Keep professional builds only
aviation_df = aviation_df[aviation_df['Amateur.Built'] == 'No'].copy()

# Inspect the Event.Date column
print(aviation_df['Event.Date'].dtype)
print(aviation_df['Event.Date'].head())
# Convert Event.Date to datetime; unparseable values become NaT
aviation_df['Event.Date'] = pd.to_datetime(aviation_df['Event.Date'], errors='coerce')
# Check whether any dates were lost in the conversion
print(aviation_df['Event.Date'].isna().sum())
# Keep only events from the last 40 years, measured from the day the notebook is run
# (note: this is a rolling cutoff, not the fixed 1983 start year named in the brief)
cutoff = pd.Timestamp.today() - pd.DateOffset(years=40)
aviation_df = aviation_df[aviation_df['Event.Date'] >= cutoff].copy()
print(aviation_df.shape)


Aircraft.Category
NaN                  56602
Airplane             27617
Helicopter            3440
Glider                 508
Balloon                231
Gyrocraft              173
Weight-Shift           161
Powered Parachute       91
Ultralight              30
Unknown                 14
WSFT                     9
Powered-Lift             5
Blimp                    4
UNK                      2
Rocket                   1
ULTR                     1
Name: count, dtype: int64
Amateur.Built
No     24417
Yes     3183
NaN       17
Name: count, dtype: int64
object
5     1979-09-17
7     1982-01-01
8     1982-01-01
12    1982-01-02
13    1982-01-02
Name: Event.Date, dtype: object
0
21440
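
The brief asks for events from 1983 onwards, whereas the cell above measures 40 years back from the run date, so its row count will drift over time. A minimal sketch of a fixed cutoff follows, run on a toy frame with made-up dates; `toy_df`, `recent_df`, and `CUTOFF` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for aviation_df; the dates are made up for illustration.
toy_df = pd.DataFrame({
    'Event.Date': ['1979-09-17', '1982-12-31', '1983-01-01', '2001-06-15', 'not a date'],
})
toy_df['Event.Date'] = pd.to_datetime(toy_df['Event.Date'], errors='coerce')

# Fixed cutoff taken from the brief ("from 1983 onwards"); the constant name is hypothetical.
CUTOFF = pd.Timestamp('1983-01-01')

# Comparisons against NaT evaluate to False, so unparseable dates are dropped as well.
recent_df = toy_df[toy_df['Event.Date'] >= CUTOFF].copy()
print(recent_df)
```

A fixed constant keeps the filtered row count reproducible between runs, which makes downstream numbers easier to compare.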


### Cleaning and constructing Key Measurables

Injuries and robustness to destruction are key points of interest for the client. Clean and impute the relevant columns, then create derived fields that best quantify what the client wishes to track. **Use comments or markdown to explain any cleaning assumptions as well as any derived columns you create.**

**Construct metric for fatal/serious injuries**

*Hint:* Estimate the total number of passengers on each flight. The likelihood of serious / fatal injury can be estimated as a fraction from this.

In [34]:
print(aviation_df.columns.tolist())
# Inspect relevant columns
injury_cols = ['Total.Fatal.Injuries', 'Total.Serious.Injuries', 
               'Total.Minor.Injuries', 'Total.Uninjured']
print(aviation_df[injury_cols].isna().sum())
print(aviation_df[injury_cols].describe())

# Imputation: fill NaNs with 0
# Assumption: if injury counts are missing, we can assume no injuries were recorded in that category, so no need to drop the row
aviation_df[injury_cols] = aviation_df[injury_cols].fillna(0)

# Estimate total passengers on each flight by summing all people from above list
aviation_df['Total.Passengers'] = (
    aviation_df['Total.Fatal.Injuries'] +
    aviation_df['Total.Serious.Injuries'] +
    aviation_df['Total.Minor.Injuries'] +
    aviation_df['Total.Uninjured']
)

# Calculate injury rate: proportion of passengers with fatal or serious injuries
# Rows where Total.Passengers == 0 would cause division by zero, so use np.where to assign NaN
aviation_df['Serious.Fatal.Injury.Rate'] = np.where(
    aviation_df['Total.Passengers'] > 0,
    (aviation_df['Total.Fatal.Injuries'] + aviation_df['Total.Serious.Injuries']) / aviation_df['Total.Passengers'],
    np.nan
)

print(aviation_df['Serious.Fatal.Injury.Rate'].describe())

['Event.Id', 'Investigation.Type', 'Accident.Number', 'Event.Date', 'Location', 'Country', 'Latitude', 'Longitude', 'Airport.Code', 'Airport.Name', 'Injury.Severity', 'Aircraft.damage', 'Aircraft.Category', 'Registration.Number', 'Make', 'Model', 'Amateur.Built', 'Number.of.Engines', 'Engine.Type', 'FAR.Description', 'Schedule', 'Purpose.of.flight', 'Air.carrier', 'Total.Fatal.Injuries', 'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured', 'Weather.Condition', 'Broad.phase.of.flight', 'Report.Status', 'Publication.Date']
Total.Fatal.Injuries      2748
Total.Serious.Injuries    2825
Total.Minor.Injuries      2538
Total.Uninjured            714
dtype: int64
       Total.Fatal.Injuries  Total.Serious.Injuries  Total.Minor.Injuries  \
count          18692.000000            18615.000000          18902.000000   
mean               0.740049                0.323127              0.230452   
std                6.770447                2.378603              1.752958   
min         

**Aircraft.damage**
- identify and execute any cleaning tasks
- construct a derived column tracking whether an aircraft was destroyed or not.

In [44]:
print(aviation_df['Aircraft.damage'].value_counts(dropna=False))
# No NaNs present, no imputation needed

# Was the aircraft destroyed?
# 'Destroyed' is a specific value in this column; all other values (Substantial, Minor, Unknown) are treated as not destroyed
aviation_df['Was.Destroyed'] = aviation_df['Aircraft.damage'] == 'Destroyed'

print(aviation_df['Was.Destroyed'].value_counts())
print(aviation_df['Aircraft.damage'].value_counts())

Aircraft.damage
Substantial    16985
Destroyed       2311
Unknown         1326
Minor            818
Name: count, dtype: int64
Was.Destroyed
False    19129
True      2311
Name: count, dtype: int64
Aircraft.damage
Substantial    16985
Destroyed       2311
Unknown         1326
Minor            818
Name: count, dtype: int64
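
Because `Was.Destroyed` is Boolean, its mean within a group is that group's destruction rate. The sketch below pairs each rate with its sample size so that thinly sampled groups can be excluded before any comparison. It runs on made-up toy data, and `MIN_SAMPLES`, `summary`, and `robust` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy_df = pd.DataFrame({
    'Make': ['CESSNA'] * 4 + ['PIPER'] * 3 + ['BOEING'],
    'Was.Destroyed': [True, False, False, False, True, True, False, False],
})

MIN_SAMPLES = 3  # hypothetical threshold; pick one suited to the real data

# Per-group event count and destruction rate (mean of a Boolean column).
summary = (
    toy_df.groupby('Make')['Was.Destroyed']
    .agg(n_events='size', destroyed_rate='mean')
)

# Only compare groups that have enough events behind their rate.
robust = summary[summary['n_events'] >= MIN_SAMPLES].sort_values('destroyed_rate')
print(robust)
```

Reporting `n_events` next to every rate makes it easy to see which comparisons rest on enough data.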


### Investigate the *Make* column
- Identify cleaning tasks here
- List cleaning tasks clearly in markdown
- Execute the cleaning tasks
- For your analysis, keep only Makes with a reasonable number of events (a threshold of 50 is suggested, though a lower one could work as well)
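
One possible shape for these steps is sketched below on toy data. It assumes the main cleaning task is inconsistent casing and stray whitespace in `Make`, which should be confirmed by inspecting the real column first. `MIN_EVENTS`, `make_counts`, and `filtered_df` are hypothetical names.

```python
import pandas as pd

# Toy data, made up for illustration; real Make values should be inspected first.
toy_df = pd.DataFrame({
    'Make': ['Cessna', 'CESSNA', ' cessna ', 'Piper', 'PIPER', 'Boeing', None],
})

# Normalize whitespace and case so spelling variants collapse into one label.
toy_df['Make'] = toy_df['Make'].str.strip().str.upper()

# Rows without a make cannot be attributed to any manufacturer.
toy_df = toy_df.dropna(subset=['Make'])

# Hypothetical threshold for this toy frame; the exercise suggests around 50.
MIN_EVENTS = 2
make_counts = toy_df['Make'].value_counts()
common_makes = make_counts[make_counts >= MIN_EVENTS].index
filtered_df = toy_df[toy_df['Make'].isin(common_makes)].copy()
print(filtered_df['Make'].value_counts())
```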

### Inspect Model column
- Get rid of any NaNs.
- Inspect the column and counts for each model/make. Are model labels unique to each make?
- If not, create a derived column that is a unique identifier for a given plane type.
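
A minimal sketch of the uniqueness check and the derived identifier follows. The toy rows are made up, and the column name `Make.Model` is a hypothetical choice rather than something the dataset provides.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy_df = pd.DataFrame({
    'Make': ['CESSNA', 'PIPER', 'CESSNA', 'BOEING'],
    'Model': ['172', '172', None, '737'],
})

# Drop rows with no model label, then normalize the remaining labels.
toy_df = toy_df.dropna(subset=['Model']).copy()
toy_df['Model'] = toy_df['Model'].str.strip().str.upper()

# A model label is ambiguous if it appears under more than one make.
makes_per_model = toy_df.groupby('Model')['Make'].nunique()
shared_models = makes_per_model[makes_per_model > 1].index.tolist()
print(shared_models)

# Combining make and model gives an identifier that is unique per plane type.
toy_df['Make.Model'] = toy_df['Make'] + ' ' + toy_df['Model']
print(toy_df['Make.Model'].tolist())
```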

### Cleaning other columns
Other columns contain data that might be related to the outcome of an accident. A few are listed here:
- Engine.Type
- Weather.Condition
- Number.of.Engines
- Purpose.of.flight
- Broad.phase.of.flight

Inspect and identify potential cleaning tasks in each of the above columns. Execute those cleaning tasks. 

**Note**: You do not necessarily need to impute or drop NaNs here.
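
As a sketch of what such cleaning might look like, the block below normalizes casing and maps placeholder labels to NaN instead of imputing them. The category values in the toy frame are assumptions made for illustration; the real columns should be inspected with `value_counts(dropna=False)` first.

```python
import pandas as pd

# Toy data; the category labels are assumed, not taken from the real dataset.
toy_df = pd.DataFrame({
    'Weather.Condition': ['VMC', 'IMC', 'Unk', 'UNK', None],
    'Engine.Type': ['Reciprocating', 'Turbo Fan', 'Unknown', None, 'Turbo Prop'],
})

# Collapse casing variants of the same label.
toy_df['Weather.Condition'] = toy_df['Weather.Condition'].str.strip().str.upper()

# Map placeholder labels to NaN so "unknown" is not counted as a real category.
placeholders = ['UNK', 'UNKNOWN']  # hypothetical list of placeholder spellings
for col in ['Weather.Condition', 'Engine.Type']:
    is_placeholder = toy_df[col].str.upper().isin(placeholders)
    toy_df[col] = toy_df[col].mask(is_placeholder)

print(toy_df['Weather.Condition'].value_counts(dropna=False))
print(toy_df['Engine.Type'].value_counts(dropna=False))
```

Keeping the NaNs, in line with the note above, lets each later comparison decide for itself whether to exclude unknowns.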

### Column Removal
- inspect the dataframe and drop any columns that have too many NaNs
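
One way to make "too many NaNs" concrete is to compute the missing fraction per column and drop everything above a cutoff. The sketch below uses toy data, and the 50% value of `MAX_NAN_FRACTION` is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only.
toy_df = pd.DataFrame({
    'Make': ['CESSNA', 'PIPER', 'BOEING', 'CESSNA'],
    'Latitude': [np.nan, np.nan, np.nan, '34.1'],
    'Schedule': [np.nan, 'NSCH', np.nan, np.nan],
})

MAX_NAN_FRACTION = 0.5  # hypothetical cutoff; tune after inspecting the real data

# Fraction of missing values per column, worst first.
nan_fraction = toy_df.isna().mean().sort_values(ascending=False)
print(nan_fraction)

# Drop the columns whose missing fraction exceeds the cutoff.
cols_to_drop = nan_fraction[nan_fraction > MAX_NAN_FRACTION].index.tolist()
trimmed_df = toy_df.drop(columns=cols_to_drop)
print(trimmed_df.columns.tolist())
```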

### Save DataFrame to csv
- it's generally useful to save data to a file or server once it is in a sufficiently cleaned or intermediate state
- the data can then be loaded directly in another notebook for further analysis
- this helps keep your notebooks and workflow readable, clean and modularized
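
A minimal sketch of the save-and-reload round trip is shown below. The file name `aviation_cleaned.csv` and the temporary directory are hypothetical; in the notebook the target would more likely sit alongside the raw data.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned dataframe.
toy_df = pd.DataFrame({
    'Event.Date': pd.to_datetime(['1990-05-01', '2005-11-23']),
    'Make.Model': ['CESSNA 172', 'BOEING 737'],
    'Was.Destroyed': [False, True],
})

# Hypothetical output path inside a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'aviation_cleaned.csv')

# index=False avoids writing the row index as an extra unnamed column.
toy_df.to_csv(out_path, index=False)

# CSV does not store dtypes, so dates must be re-parsed on load.
reloaded = pd.read_csv(out_path, parse_dates=['Event.Date'])
print(reloaded.dtypes)
```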