# Aviation Accidents Analysis

You are part of a consulting firm tasked with analyzing commercial and passenger jet airline safety. The client (an airline/airplane insurer) wants to know which types of aircraft (makes/models) exhibit low rates of total destruction and a low likelihood of fatal or serious passenger injuries in the event of an accident. They are also interested in any general variables or conditions that might be at play. Your analysis will be based on aviation accident data accumulated from 1948 to 2023.

Our client is only interested in airplane makes/models that are professional builds and could potentially still be active. Assume a make/model is retired after a maximum lifetime of 40 years, and filter your data accordingly (i.e. keep events from 1983 onwards). They would also like separate recommendations for small aircraft vs. larger passenger models. **In addition, make sure that the claims you make are statistically robust and that you have enough samples when making comparisons between groups.**


In this summative assessment you will demonstrate your ability to:
- **Use Pandas to load, inspect, and clean the dataset appropriately.**
- **Transform relevant columns to create measures that address the problem at hand.**
- Conduct EDA: use visualization and statistical measures to systematically understand the structure of the data.
- Recommend a set of airplane makes/models conforming to the client's request and identify at least *two* factors contributing to airplane safety. You must provide supporting evidence (visuals, summary statistics, tables) for each claim you make.
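
One way to back a rate-based claim with a sample-size check is a confidence interval for the proportion. The sketch below is only an illustration, not part of the assignment: `wilson_interval` is a hypothetical helper, the counts are made up, and it assumes Python 3.9+ for the `tuple[...]` annotation. It shows that the same 10% destroyed rate is far less certain with 50 accidents than with 500.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n == 0:
        # No observations: the rate is undefined
        return (float("nan"), float("nan"))
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return (centre - half_width, centre + half_width)

# Hypothetical counts: 5 destroyed out of 50 accidents vs. 50 out of 500
small = wilson_interval(5, 50)
large = wilson_interval(50, 500)
print("n=50 :", small)
print("n=500:", large)
```

If two makes have overlapping intervals, the data may not support ranking one above the other.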

### Make relevant library imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Loading and Inspection

### Load in data from the relevant directory and inspect the dataframe.
- inspect NaNs, datatypes, and summary statistics
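
As a rough sketch of the inspection step, the share of missing values per column can be read off with `isna().mean()`. The frame below is a toy stand-in for `aviation_df` with made-up values, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for aviation_df (values are invented for illustration)
toy_df = pd.DataFrame({
    "Make": ["Cessna", None, "Piper", "Boeing"],
    "Total.Fatal.Injuries": [0.0, np.nan, 2.0, np.nan],
    "Latitude": [None, None, None, "36.92"],
})

# Fraction of missing values per column, worst first
missing_frac = toy_df.isna().mean().sort_values(ascending=False)
print(missing_frac)

# Datatypes and summary statistics for the numeric columns
print(toy_df.dtypes)
print(toy_df.describe())
```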

In [2]:
# Load the data as aviation_df
aviation_df = pd.read_csv(
    "data/AviationData.csv",
    encoding="latin1",
    low_memory=False
)

aviation_shape = aviation_df.shape
aviation_shape

(88889, 31)

In [3]:
aviation_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50132 non-null  object 
 9   Airport.Name            52704 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87507 non-null  object 
 14  Make                    88826 non-null

In [4]:
pd.set_option('display.max_columns', None)
aviation_df.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Aircraft.Category,Registration.Number,Make,Model,Amateur.Built,Number.of.Engines,Engine.Type,FAR.Description,Schedule,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,Fatal(2),Destroyed,,NC6404,Stinson,108-3,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,Fatal(4),Destroyed,,N5069P,Piper,PA24-180,No,1.0,Reciprocating,,,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.922223,-81.878056,,,Fatal(3),Destroyed,,N5142R,Cessna,172M,No,1.0,Reciprocating,,,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,Fatal(2),Destroyed,,N1168J,Rockwell,112,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,Fatal(1),Destroyed,,N15NY,Cessna,501,No,,,,,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


## Data Cleaning

### Filtering aircraft and events

We want to filter the dataset down to the aircraft the client wants analyzed:
- inspect relevant columns
- figure out any reasonable imputations
- filter the dataset

In [5]:

#Convert Event.Date to datetime
aviation_df['Event.Date'] = pd.to_datetime(
    aviation_df['Event.Date'], errors="coerce"
)

In [6]:
#Add an Event.Year column derived from Event.Date
aviation_df['Event.Year'] = aviation_df['Event.Date'].dt.year
aviation_df.head() 

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Aircraft.Category,Registration.Number,Make,Model,Amateur.Built,Number.of.Engines,Engine.Type,FAR.Description,Schedule,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date,Event.Year
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,Fatal(2),Destroyed,,NC6404,Stinson,108-3,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,,1948
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,Fatal(4),Destroyed,,N5069P,Piper,PA24-180,No,1.0,Reciprocating,,,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996,1962
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.922223,-81.878056,,,Fatal(3),Destroyed,,N5142R,Cessna,172M,No,1.0,Reciprocating,,,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007,1974
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,Fatal(2),Destroyed,,N1168J,Rockwell,112,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000,1977
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,Fatal(1),Destroyed,,N15NY,Cessna,501,No,,,,,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980,1979


In [7]:
#Filter out events before 1983
aviation_filter_year_df = aviation_df[aviation_df['Event.Year'] >= 1983]
aviation_filter_year_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 85289 entries, 3600 to 88888
Data columns (total 32 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Event.Id                85289 non-null  object        
 1   Investigation.Type      85289 non-null  object        
 2   Accident.Number         85289 non-null  object        
 3   Event.Date              85289 non-null  datetime64[ns]
 4   Location                85237 non-null  object        
 5   Country                 85073 non-null  object        
 6   Latitude                34379 non-null  object        
 7   Longitude               34370 non-null  object        
 8   Airport.Code            48435 non-null  object        
 9   Airport.Name            50514 non-null  object        
 10  Injury.Severity         84289 non-null  object        
 11  Aircraft.damage         82151 non-null  object        
 12  Aircraft.Category       28723 non-null  object  

In [8]:
#Filter out amateur-built aircraft
aviation_filtered_df = aviation_filter_year_df[aviation_filter_year_df['Amateur.Built'] == "No"].copy()
aviation_filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 76960 entries, 3600 to 88888
Data columns (total 32 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Event.Id                76960 non-null  object        
 1   Investigation.Type      76960 non-null  object        
 2   Accident.Number         76960 non-null  object        
 3   Event.Date              76960 non-null  datetime64[ns]
 4   Location                76913 non-null  object        
 5   Country                 76750 non-null  object        
 6   Latitude                30167 non-null  object        
 7   Longitude               30161 non-null  object        
 8   Airport.Code            43375 non-null  object        
 9   Airport.Name            45285 non-null  object        
 10  Injury.Severity         75961 non-null  object        
 11  Aircraft.damage         73868 non-null  object        
 12  Aircraft.Category       25405 non-null  object  

In [9]:
#Filter to only airplanes
aviation_filtered_df = aviation_filtered_df[aviation_filtered_df['Aircraft.Category'] == 'Airplane']
aviation_filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21447 entries, 4149 to 88886
Data columns (total 32 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Event.Id                21447 non-null  object        
 1   Investigation.Type      21447 non-null  object        
 2   Accident.Number         21447 non-null  object        
 3   Event.Date              21447 non-null  datetime64[ns]
 4   Location                21441 non-null  object        
 5   Country                 21446 non-null  object        
 6   Latitude                19169 non-null  object        
 7   Longitude               19163 non-null  object        
 8   Airport.Code            13983 non-null  object        
 9   Airport.Name            14070 non-null  object        
 10  Injury.Severity         20634 non-null  object        
 11  Aircraft.damage         20220 non-null  object        
 12  Aircraft.Category       21447 non-null  object  
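
The `info()` outputs above show `Aircraft.Category` with only 25405 non-null values out of 76960 rows before the category filter, so an equality test against `'Airplane'` also discards every row whose category is missing. A small sketch on toy data (values invented) of how the dropped rows could be broken down before committing to the filter:

```python
import numpy as np
import pandas as pd

# Toy column standing in for aviation_filtered_df['Aircraft.Category']
toy_df = pd.DataFrame({
    "Aircraft.Category": ["Airplane", np.nan, "Helicopter", np.nan, "Airplane"],
})

is_airplane = toy_df["Aircraft.Category"] == "Airplane"  # NaN compares as False
is_missing = toy_df["Aircraft.Category"].isna()

summary = pd.Series({
    "kept (Airplane)": int(is_airplane.sum()),
    "dropped (other category)": int((~is_airplane & ~is_missing).sum()),
    "dropped (category missing)": int(is_missing.sum()),
})
print(summary)
```

Whether rows with a missing category should be dropped or imputed (for example from Make/Model) is a judgment call; the notebook as written drops them.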

### Cleaning and constructing Key Measurables

Injuries and robustness to destruction are key points of interest for the client. Clean and impute relevant columns and then create derived fields that best quantify what the client wishes to track. **Use commenting or markdown to explain any cleaning assumptions as well as any derived columns you create.**

**Construct metric for fatal/serious injuries**

*Hint:* Estimate the total number of passengers on each flight. The likelihood of serious / fatal injury can be estimated as a fraction from this.
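
A minimal worked example of the hint on toy data (values invented): the head-count is the sum of the four injury columns, and the fatal/serious fraction is undefined when that sum is zero. Treating missing counts as 0 is an assumption, the same one used in the cells that follow.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "Total.Fatal.Injuries":   [2.0, 0.0, np.nan, 0.0],
    "Total.Serious.Injuries": [0.0, 1.0, np.nan, 0.0],
    "Total.Minor.Injuries":   [0.0, 0.0, np.nan, 0.0],
    "Total.Uninjured":        [0.0, 1.0, np.nan, 3.0],
})

# Assumption: a missing count means none were reported
filled = toy_df.fillna(0)
occupants = filled.sum(axis=1)

# where() turns a zero head-count into NaN, so the fraction is NaN instead of 0/0
frac = (
    filled["Total.Fatal.Injuries"] + filled["Total.Serious.Injuries"]
) / occupants.where(occupants > 0)
print(frac.tolist())
```

The third row has no recorded occupants at all, so its fraction stays NaN rather than being counted as a safe outcome.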

In [10]:
#Fill missing injury counts
#Assumption: a missing count in any of the four injury columns most likely means 0 were reported
aviation_filtered_df["Total.Fatal.Injuries"] = aviation_filtered_df["Total.Fatal.Injuries"].fillna(0)
aviation_filtered_df["Total.Serious.Injuries"] = aviation_filtered_df["Total.Serious.Injuries"].fillna(0)
aviation_filtered_df["Total.Minor.Injuries"] = aviation_filtered_df["Total.Minor.Injuries"].fillna(0)
aviation_filtered_df["Total.Uninjured"] = aviation_filtered_df["Total.Uninjured"].fillna(0)

aviation_filtered_df[['Event.Id','Total.Fatal.Injuries','Total.Serious.Injuries','Total.Minor.Injuries','Total.Uninjured']].info()

<class 'pandas.core.frame.DataFrame'>
Index: 21447 entries, 4149 to 88886
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                21447 non-null  object 
 1   Total.Fatal.Injuries    21447 non-null  float64
 2   Total.Serious.Injuries  21447 non-null  float64
 3   Total.Minor.Injuries    21447 non-null  float64
 4   Total.Uninjured         21447 non-null  float64
dtypes: float64(4), object(1)
memory usage: 1005.3+ KB


In [11]:
#Estimate total passengers on each flight
#Total Passengers = Fatal + Serious + Minor + Uninjured
aviation_filtered_df['Total.Passengers'] = (
    aviation_filtered_df['Total.Fatal.Injuries'] +
    aviation_filtered_df['Total.Serious.Injuries'] +
    aviation_filtered_df['Total.Minor.Injuries'] +
    aviation_filtered_df['Total.Uninjured']
)

aviation_filtered_df.tail(10)

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Aircraft.Category,Registration.Number,Make,Model,Amateur.Built,Number.of.Engines,Engine.Type,FAR.Description,Schedule,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date,Event.Year,Total.Passengers
88857,20221213106447,Accident,ERA23LA086,2022-12-08,"Covington, GA",United States,332754N,0835049W,,,Non-Fatal,Substantial,Airplane,N900AW,BEECH,A36,No,1.0,,091,,Personal,,0.0,0.0,0.0,2.0,VMC,,,28-12-2022,2022,2.0
88858,20221211106438,Accident,ERA23LA083,2022-12-09,"Hawkinsville, GA",United States,321814N,0832534W,51A,HAWKINSVILLE-PULASKI COUNTY,Minor,Substantial,Airplane,N160PT,PIPER,PA-44,No,2.0,,091,,Personal,,0.0,1.0,0.0,0.0,VMC,,,15-12-2022,2022,1.0
88859,20221212106443,Accident,WPR23LA064,2022-12-09,"Casa Grande, AZ",United States,325736N,1114536W,CGZ,Casa Grande Municipal Airport,Non-Fatal,Substantial,Airplane,190DK,ARADO-FLUGZEUGWERKE GMBH,FW190 A-5,No,1.0,,091,,Personal,,0.0,0.0,0.0,1.0,VMC,,,13-12-2022,2022,1.0
88861,20221215106460,Accident,ERA23LA088,2022-12-10,"Alabaster, AL",United States,331040N,0086470W,EET,,Non-Fatal,Substantial,Airplane,N5301G,CESSNA,305A,No,1.0,,091,,Personal,,0.0,0.0,0.0,2.0,,,,19-12-2022,2022,2.0
88865,20221212106444,Accident,ERA23LA085,2022-12-12,"Knoxville, TN",United States,355745N,0835218W,DKX,KNOXVILLE DOWNTOWN ISLAND,Non-Fatal,Substantial,Airplane,N783SF,CESSNA,172,No,1.0,,091,,Instructional,Knoxville Flight Training Academy,0.0,0.0,0.0,1.0,VMC,,,15-12-2022,2022,1.0
88869,20221213106455,Accident,WPR23LA065,2022-12-13,"Lewistown, MT",United States,047257N,0109280W,KLWT,Lewiston Municipal Airport,Non-Fatal,Substantial,Airplane,C-GZPU,PIPER,PA42,No,2.0,,NUSC,,,,0.0,0.0,0.0,1.0,,,,14-12-2022,2022,1.0
88873,20221215106463,Accident,ERA23LA090,2022-12-14,"San Juan, PR",United States,182724N,0066554W,SIG,FERNANDO LUIS RIBAS DOMINICCI,Non-Fatal,Substantial,Airplane,N416PC,CIRRUS DESIGN CORP,SR22,No,1.0,,091,,Personal,SKY WEST AVIATION INC TRUSTEE,0.0,0.0,0.0,1.0,VMC,,,27-12-2022,2022,1.0
88876,20221219106475,Accident,WPR23LA069,2022-12-15,"Wichita, KS",United States,373829N,0972635W,ICT,WICHITA DWIGHT D EISENHOWER NT,Non-Fatal,Substantial,Airplane,N398KL,SWEARINGEN,SA226TC,No,2.0,,135,SCHD,,,0.0,0.0,0.0,1.0,,,,19-12-2022,2022,1.0
88877,20221219106470,Accident,ERA23LA091,2022-12-16,"Brooksville, FL",United States,282825N,0822719W,BKV,BROOKSVILLE-TAMPA BAY RGNL,Minor,Substantial,Airplane,N5405V,CESSNA,R172K,No,1.0,,091,,Personal,GERBER RICHARD E,0.0,1.0,0.0,0.0,VMC,,,23-12-2022,2022,1.0
88886,20221227106497,Accident,WPR23LA075,2022-12-26,"Payson, AZ",United States,341525N,1112021W,PAN,PAYSON,Non-Fatal,Substantial,Airplane,N749PJ,AMERICAN CHAMPION AIRCRAFT,8GCBC,No,1.0,,091,,Personal,,0.0,0.0,0.0,1.0,VMC,,,27-12-2022,2022,1.0


In [12]:
#Compute the fraction of passengers fatally or seriously injured
#Assumption: If Total.Passengers = 0, we set the fraction to NaN to avoid divide-by-zero
aviation_filtered_df['Frac.Fatal.Serious'] = (
    (aviation_filtered_df['Total.Fatal.Injuries'] + aviation_filtered_df['Total.Serious.Injuries']) /
    aviation_filtered_df['Total.Passengers']
).replace([np.inf, -np.inf], np.nan)
aviation_filtered_df[['Event.Id','Total.Fatal.Injuries','Total.Serious.Injuries','Total.Minor.Injuries','Total.Uninjured','Total.Passengers','Frac.Fatal.Serious']].info()

<class 'pandas.core.frame.DataFrame'>
Index: 21447 entries, 4149 to 88886
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                21447 non-null  object 
 1   Total.Fatal.Injuries    21447 non-null  float64
 2   Total.Serious.Injuries  21447 non-null  float64
 3   Total.Minor.Injuries    21447 non-null  float64
 4   Total.Uninjured         21447 non-null  float64
 5   Total.Passengers        21447 non-null  float64
 6   Frac.Fatal.Serious      20543 non-null  float64
dtypes: float64(6), object(1)
memory usage: 1.3+ MB


In [14]:
## Create binary flags for fatal and serious-injury accidents
aviation_filtered_df['fatal_accident'] = aviation_filtered_df['Total.Fatal.Injuries'] > 0
aviation_filtered_df['serious_accident'] = aviation_filtered_df['Total.Serious.Injuries'] > 0
aviation_filtered_df['fatal_or_serious'] = aviation_filtered_df['fatal_accident'] | aviation_filtered_df['serious_accident']

aviation_filtered_df[[
    "Total.Passengers", "Total.Fatal.Injuries",
    "Total.Serious.Injuries", "Frac.Fatal.Serious", "fatal_or_serious"
]].head(10)

Unnamed: 0,Total.Passengers,Total.Fatal.Injuries,Total.Serious.Injuries,Frac.Fatal.Serious,fatal_or_serious
3600,2.0,0.0,1.0,0.5,True
3601,4.0,0.0,0.0,0.0,False
3602,2.0,0.0,0.0,0.0,False
3603,1.0,0.0,0.0,0.0,False
3604,2.0,0.0,0.0,0.0,False
3605,2.0,0.0,0.0,0.0,False
3606,2.0,0.0,1.0,0.5,True
3607,4.0,0.0,0.0,0.0,False
3608,2.0,2.0,0.0,1.0,True
3609,3.0,3.0,0.0,1.0,True


**Aircraft.damage**
- identify and execute any cleaning tasks
- construct a derived column tracking whether an aircraft was destroyed or not.
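
How unknown damage is handled affects the destroyed rate. The sketch below, on invented values, compares two conventions: counting unknown/missing damage as "not destroyed" versus excluding those rows from the denominator. Neither is prescribed by the assignment; it only illustrates the size of the difference.

```python
import numpy as np
import pandas as pd

# Toy column standing in for aviation_filtered_df['Aircraft.damage']
damage = pd.Series(["Destroyed", "Substantial", np.nan, "Unknown", "Minor", "Destroyed"])

# Convention A: unknown or missing damage counts as "not destroyed"
rate_unknown_as_false = damage.eq("Destroyed").mean()

# Convention B: only rows with a known damage level enter the rate
known_damage = damage[damage.notna() & (damage != "Unknown")]
rate_known_only = known_damage.eq("Destroyed").mean()

print(rate_unknown_as_false, rate_known_only)
```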

In [13]:
# See unique values in column
aviation_filtered_df['Aircraft.damage'].value_counts(dropna=False)

Aircraft.damage
Substantial    16990
Destroyed       2316
NaN             1227
Minor            817
Unknown           97
Name: count, dtype: int64

In [14]:
# Fill NaN with 'Unknown'
aviation_filtered_df['Aircraft.damage'] = aviation_filtered_df['Aircraft.damage'].fillna('Unknown')
aviation_filtered_df['Aircraft.damage'].value_counts(dropna=False)

Aircraft.damage
Substantial    16990
Destroyed       2316
Unknown         1324
Minor            817
Name: count, dtype: int64

In [15]:
## Create a binary flag: True if Destroyed, False otherwise
aviation_filtered_df['destroyed'] = aviation_filtered_df['Aircraft.damage'] == 'Destroyed'
aviation_filtered_df[['Aircraft.damage','destroyed']].value_counts()

Aircraft.damage  destroyed
Substantial      False        16990
Destroyed        True          2316
Unknown          False         1324
Minor            False          817
Name: count, dtype: int64

### Investigate the *Make* column
- Identify cleaning tasks here
- List cleaning tasks clearly in markdown
- Execute the cleaning tasks
- For your analysis, keep Makes with a reasonable number of records (a threshold of 50 works, though a lower one could work as well)

### Inspect the Make column
- Look for:
    - Typos / inconsistent capitalization
    - Missing values (NaN)
    - Rare makes with very few accidents

In [17]:
#Check unique makes and top occurrences
aviation_filtered_df['Make'].value_counts(dropna=False).head(20)

Make
CESSNA                4867
PIPER                 2803
Cessna                2279
Piper                 1186
BOEING                1037
BEECH                 1018
Beech                  413
MOONEY                 238
Boeing                 227
CIRRUS DESIGN CORP     218
AIR TRACTOR INC        217
AIRBUS                 215
BELLANCA               158
AERONCA                149
MAULE                  144
Mooney                 125
EMBRAER                123
Air Tractor            117
LUSCOMBE                95
CHAMPION                91
Name: count, dtype: int64

### Identify Cleaning Tasks
Proposed cleaning tasks:
1. Standardize capitalization
    - Convert all entries to title case (e.g. BOEING to Boeing)
2. Remove missing values
    - Remove rows where Make is NaN, since we cannot assign safety metrics to unknown aircraft
3. Remove rare Makes
    - Keep only Makes with >= 50 occurrences to ensure statistical robustness
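
The order of these steps matters when one manufacturer appears under several spellings. A sketch on toy data, assuming a hypothetical `alias_map` and a small `min_count` standing in for the threshold of 50: merging aliases before counting keeps a make whose variants would each fall below the threshold on their own.

```python
import pandas as pd

# Toy Make column (invented rows)
makes = pd.Series(
    ["CESSNA"] * 3 + ["Cessna"] * 2
    + ["Cirrus Design Corp"] * 2 + ["CIRRUS"] * 2
    + ["Stinson", None]
)

# Hypothetical alias table; a real one would come from inspecting value_counts()
alias_map = {"Cirrus Design Corp": "Cirrus"}
min_count = 4  # stands in for the threshold of 50 on the full dataset

# 1) drop NaN, 2) normalize case, 3) merge aliases, then 4) apply the threshold
cleaned = makes.dropna().str.strip().str.title().replace(alias_map)
counts = cleaned.value_counts()
kept = cleaned[cleaned.isin(counts[counts >= min_count].index)]
print(kept.value_counts())
```

Here each Cirrus spelling has only 2 rows, so thresholding before merging would have removed both.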

In [18]:
# Standardize capitalization
aviation_filtered_df['Make'] = aviation_filtered_df['Make'].str.title()
aviation_filtered_df['Make'].value_counts().head(30)

Make
Cessna                       7146
Piper                        3989
Beech                        1431
Boeing                       1264
Mooney                        363
Airbus                        243
Cirrus Design Corp            220
Air Tractor Inc               219
Bellanca                      219
Maule                         215
Air Tractor                   206
Aeronca                       200
Champion                      158
Embraer                       153
Grumman                       147
Luscombe                      141
Cirrus                        137
Stinson                       129
Mcdonnell Douglas             108
North American                106
Dehavilland                    95
Taylorcraft                    93
Aero Commander                 90
Aviat Aircraft Inc             76
Socata                         75
Diamond Aircraft Ind Inc       74
De Havilland                   73
Aviat                          70
Bombardier Inc                 65
Raytheon 

In [19]:
# Drop rows with missing Make
aviation_filtered_df = aviation_filtered_df.dropna(subset=['Make'])
aviation_filtered_df[['Event.Id','Make']].info()

<class 'pandas.core.frame.DataFrame'>
Index: 21444 entries, 4149 to 88886
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Event.Id  21444 non-null  object
 1   Make      21444 non-null  object
dtypes: object(2)
memory usage: 502.6+ KB


In [20]:
# Filter to makes with at least 50 occurrences
make_counts = aviation_filtered_df['Make'].value_counts()
valid_makes = make_counts[make_counts >= 50].index
aviation_filtered_make_df = aviation_filtered_df[aviation_filtered_df['Make'].isin(valid_makes)].copy()
aviation_filtered_make_df['Make'].value_counts()

Make
Cessna                            7146
Piper                             3989
Beech                             1431
Boeing                            1264
Mooney                             363
Airbus                             243
Cirrus Design Corp                 220
Bellanca                           219
Air Tractor Inc                    219
Maule                              215
Air Tractor                        206
Aeronca                            200
Champion                           158
Embraer                            153
Grumman                            147
Luscombe                           141
Cirrus                             137
Stinson                            129
Mcdonnell Douglas                  108
North American                     106
Dehavilland                         95
Taylorcraft                         93
Aero Commander                      90
Aviat Aircraft Inc                  76
Socata                              75
Diamond Aircraft Ind

In [21]:
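# Mapping dictionary for Make standardization.
# Note: this runs after the >= 50 filter above, so only spellings that
# individually passed the threshold are still present to be merged here.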
make_mapping = {
    "Robinson Helicopter": "Robinson",
    "Robinson Helicopter Company": "Robinson",
    "Robinson": "Robinson",
    "Airbus Industrie": "Airbus",
    "Cirrus Design Corp": "Cirrus",
    "Cirrus Design Corp.": "Cirrus",
    "Cirrus": "Cirrus",
    "McDonnell Douglas": "McDonnell Douglas",
    "Douglas": "McDonnell Douglas",
    "Dehavilland": "De Havilland",
    "De Havilland": "De Havilland",
    "Grumman": "Grumman",
    "Grumman American": "Grumman",
    "Grumman American Avn. Corp.": "Grumman",
    "Boeing Stearman": "Boeing",
    "Aerospatiale": "Aerospatiale",
    "Air Tractor Inc": "Air Tractor",
    "Ercoupe (Eng & Research Corp.)": "Ercoupe",
    "Rockwell International": "Rockwell",
    "Raytheon Aircraft Company": "Raytheon Aircraft",
    "Mcdonnell Douglas": "McDonnell Douglas",
    "Grumman-Schweizer": "Grumman",
    "Grumman Acft Eng Cor-Schweizer": "Grumman",
    "Aviat Aircraft Inc": "Aviat",
    "Bombardier Inc": "Bombardier",
    "Smith, Ted Aerostar": "Aerostar",
    "Flight Design Gmbh": "Flight Design",
    "Let": "Let Aircraft"
}
aviation_filtered_make_df['Make'] = aviation_filtered_make_df['Make'].replace(make_mapping)

In [24]:
make_counts_2 = aviation_filtered_make_df['Make'].value_counts()
valid_makes_2 = make_counts_2.index
valid_makes_2.sort_values()

Index(['Aero Commander', 'Aeronca', 'Air Tractor', 'Airbus',
       'American Champion Aircraft', 'Aviat', 'Ayres', 'Beech', 'Bellanca',
       'Boeing', 'Bombardier', 'Cessna', 'Champion', 'Cirrus', 'De Havilland',
       'Diamond Aircraft Ind Inc', 'Embraer', 'Ercoupe', 'Grumman', 'Luscombe',
       'Maule', 'McDonnell Douglas', 'Mooney', 'North American', 'Piper',
       'Raytheon Aircraft', 'Rockwell', 'Socata', 'Stinson', 'Taylorcraft'],
      dtype='object', name='Make')

In [25]:
aviation_filtered_make_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17892 entries, 4150 to 88886
Data columns (total 35 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Event.Id                17892 non-null  object        
 1   Investigation.Type      17892 non-null  object        
 2   Accident.Number         17892 non-null  object        
 3   Event.Date              17892 non-null  datetime64[ns]
 4   Location                17888 non-null  object        
 5   Country                 17891 non-null  object        
 6   Latitude                15990 non-null  object        
 7   Longitude               15987 non-null  object        
 8   Airport.Code            11654 non-null  object        
 9   Airport.Name            11762 non-null  object        
 10  Injury.Severity         17173 non-null  object        
 11  Aircraft.damage         17892 non-null  object        
 12  Aircraft.Category       17892 non-null  object  

### Inspect Model column
- Get rid of any NaNs.
- Inspect the column and counts for each model/make. Are model labels unique to each make?
- If not, create a derived column that is a unique identifier for a given plane type.
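
Model labels can also differ only in punctuation (the counts further down list `MD-82` and `MD82`, and `S 2R`, `S-2R` and `S2R`). One possible normalization, sketched on toy rows, strips hyphens and whitespace before building the identifier. This assumes those characters never distinguish two genuinely different models, which should be checked against the data; the column name `PlaneType.Key` is made up for this example.

```python
import pandas as pd

# Toy rows (Make/Model pairings are illustrative)
toy_df = pd.DataFrame({
    "Make":  ["Mcdonnell Douglas", "Mcdonnell Douglas", "Ayres", "Ayres", "Ayres", "Piper"],
    "Model": ["MD-82", "MD82", "S 2R", "S-2R", "S2R", "pa-28-140"],
})

# Assumption: hyphens and whitespace carry no meaning in a model label
model_key = toy_df["Model"].str.upper().str.replace(r"[\s\-]+", "", regex=True)

# Make + normalized Model gives one identifier per plane type
toy_df["PlaneType.Key"] = toy_df["Make"] + " " + model_key
print(toy_df["PlaneType.Key"].value_counts())
```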

In [26]:
aviation_filtered_make_df['Model'].isna().sum()

13

In [27]:
# Drop rows where Model is NaN (only 13)
aviation_filtered_make_df = aviation_filtered_make_df.dropna(subset=['Model'])

aviation_filtered_make_df['Model'].isna().sum()

0

In [28]:
# Count top 20 most common models
aviation_filtered_make_df['Model'].value_counts().head(20)

Model
172          769
737          403
152          316
182          304
172S         276
PA28         273
172N         249
SR22         240
180          213
A36          181
172M         180
150          179
PA-18-150    175
PA-28-140    169
172P         143
140          117
172R         109
170B         107
PA-28-180    105
PA-28-161    102
Name: count, dtype: int64

In [29]:
# Number of unique Makes per Model
model_make_counts = aviation_filtered_make_df.groupby('Model')['Make'].nunique()

# Models that appear under more than one Make
non_unique_models = model_make_counts[model_make_counts > 1]
non_unique_models

Model
100          2
112          2
112A         2
140          2
190          2
1900D        2
200          2
320          2
350          2
390          2
400          3
400A         2
401B         2
402A         2
402B         2
500          3
500 S        2
560          2
58           2
60           2
690A         2
7AC          3
7ACA         2
7BCM         2
7EC          3
7ECA         3
7GCAA        3
7GCB         2
7GCBC        3
7KCAB        2
8GCBC        3
8KCAB        3
A36          2
AT           2
B200         2
B300         2
B36TC        2
C90A         2
DC-10        2
DC-10-30     2
DHC-8        2
DHC-8-102    2
DHC-8-202    2
DHC8         2
G36          2
MD           2
MD-11        2
MD-11F       2
MD-82        2
MD-88        2
MD-90        2
MD82         2
MD83         2
S 2R         2
S-2R         2
S2R          3
Name: Make, dtype: int64

In [30]:
# Create a unique identifier for plane type.
# Combine Make + Model to create a unique plane type.
aviation_filtered_make_df['PlaneType'] = (
    aviation_filtered_make_df['Make'] + ' ' + aviation_filtered_make_df['Model']
)

aviation_filtered_make_df[['Make', 'Model', 'PlaneType']].head(10)

Unnamed: 0,Make,Model,PlaneType
4150,Boeing,747,Boeing 747
4171,Piper,PA-28-140,Piper PA-28-140
4285,De Havilland,DHC-6,De Havilland DHC-6
6760,Boeing,727-200,Boeing 727-200
6806,Beech,C35,Beech C35
7084,Cessna,180K,Cessna 180K
7708,Beech,99,Beech 99
8585,Piper,PA-23-250,Piper PA-23-250
8865,Piper,PA-18-150,Piper PA-18-150
10247,Cessna,R172E,Cessna R172E


In [31]:
aviation_filtered_make_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17879 entries, 4150 to 88886
Data columns (total 36 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Event.Id                17879 non-null  object        
 1   Investigation.Type      17879 non-null  object        
 2   Accident.Number         17879 non-null  object        
 3   Event.Date              17879 non-null  datetime64[ns]
 4   Location                17875 non-null  object        
 5   Country                 17878 non-null  object        
 6   Latitude                15981 non-null  object        
 7   Longitude               15978 non-null  object        
 8   Airport.Code            11648 non-null  object        
 9   Airport.Name            11754 non-null  object        
 10  Injury.Severity         17162 non-null  object        
 11  Aircraft.damage         17879 non-null  object        
 12  Aircraft.Category       17879 non-null  object  

### Cleaning other columns
Other columns contain data that might be related to the outcome of an accident. We list a few here:
- Engine.Type
- Weather.Condition
- Number.of.Engines
- Purpose.of.flight
- Broad.phase.of.flight

Inspect and identify potential cleaning tasks in each of the above columns. Execute those cleaning tasks. 

**Note**: You do not necessarily need to impute or drop NaNs here.
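
Several of these columns follow the same pattern: a few spellings of the same label plus NaNs. A small reusable helper is one way to keep the cleaning consistent. `clean_category` below is a hypothetical name, and labelling NaN as `'Unknown'` is a choice rather than a requirement (per the note above, the NaNs could also be left in place).

```python
import numpy as np
import pandas as pd

def clean_category(col: pd.Series, mapping: dict, unknown_label: str = "Unknown") -> pd.Series:
    """Map raw labels to canonical ones and label missing values explicitly."""
    return col.replace(mapping).fillna(unknown_label)

# Toy column standing in for Weather.Condition
weather = pd.Series(["VMC", "IMC", "UNK", "Unk", np.nan])
weather_clean = clean_category(weather, {"UNK": "Unknown", "Unk": "Unknown"})
print(weather_clean.value_counts())
```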

### Clean Engine.Type

In [32]:
# Work on a copy of the filtered dataframe
aviation_clean_df = aviation_filtered_make_df.copy()

In [33]:
# Show all unique Engine.Type values
aviation_clean_df['Engine.Type'].value_counts(dropna=False)

Engine.Type
Reciprocating      12835
NaN                 3214
Turbo Prop           931
Turbo Fan            701
Unknown              105
Turbo Jet             71
Geared Turbofan       12
Turbo Shaft            9
UNK                    1
Name: count, dtype: int64

In [34]:
engine_mapping = {
    'Reciprocating': 'Reciprocating',
    'Turbo Shaft': 'Turboshaft',
    'Turbo Prop': 'Turboprop',
    'Turbo Fan': 'Turbofan',
    'Turbo Jet': 'Turbojet',
    'Geared Turbofan': 'Turbofan',
    'Unknown': 'Unknown',
    'UNK': 'Unknown',   # now Unknown
    'NONE': 'Unknown'   # now Unknown
}

# Apply mapping
aviation_clean_df['Engine.Type.Clean'] = aviation_clean_df['Engine.Type'].replace(engine_mapping)

# Fill any remaining NaNs with 'Unknown'
aviation_clean_df['Engine.Type.Clean'] = aviation_clean_df['Engine.Type.Clean'].fillna('Unknown')

aviation_clean_df['Engine.Type.Clean'].value_counts()

Engine.Type.Clean
Reciprocating    12835
Unknown           3320
Turboprop          931
Turbofan           713
Turbojet            71
Turboshaft           9
Name: count, dtype: int64

### Clean Weather.Condition

In [35]:
# Show all unique Weather.Condition values and counts
aviation_clean_df['Weather.Condition'].value_counts(dropna=False)

Weather.Condition
VMC    14295
NaN     2417
IMC      905
Unk      186
UNK       76
Name: count, dtype: int64

In [36]:
weather_mapping = {
    'VMC': 'VMC',
    'IMC': 'IMC',
    'UNK': 'Unknown',
    'Unk': 'Unknown'
}

# Apply mapping and fill NaNs with 'Unknown'
aviation_clean_df['Weather.Condition.Clean'] = aviation_clean_df['Weather.Condition'].replace(weather_mapping)
aviation_clean_df['Weather.Condition.Clean'] = aviation_clean_df['Weather.Condition.Clean'].fillna('Unknown')

aviation_clean_df['Weather.Condition.Clean'].value_counts()

Weather.Condition.Clean
VMC        14295
Unknown     2679
IMC          905
Name: count, dtype: int64

In [37]:
aviation_clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17879 entries, 4150 to 88886
Data columns (total 38 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Event.Id                 17879 non-null  object        
 1   Investigation.Type       17879 non-null  object        
 2   Accident.Number          17879 non-null  object        
 3   Event.Date               17879 non-null  datetime64[ns]
 4   Location                 17875 non-null  object        
 5   Country                  17878 non-null  object        
 6   Latitude                 15981 non-null  object        
 7   Longitude                15978 non-null  object        
 8   Airport.Code             11648 non-null  object        
 9   Airport.Name             11754 non-null  object        
 10  Injury.Severity          17162 non-null  object        
 11  Aircraft.damage          17879 non-null  object        
 12  Aircraft.Category        17879 non

### Number.of.Engines

In [38]:
# Show unique values and counts
aviation_clean_df['Number.of.Engines'].value_counts(dropna=False)

Number.of.Engines
1.0    13222
2.0     2470
NaN     2089
4.0       67
3.0       26
0.0        5
Name: count, dtype: int64
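
The notebook leaves this column as-is. If it were to be cleaned, one option is sketched below on toy data. It rests on an assumption that is not in the original analysis: that a value of `0.0` is a data-entry gap rather than a real engine count for the airplanes kept here. Under that assumption, zeros are treated as missing and the column is stored as a nullable integer.

```python
import pandas as pd

# Toy values mirroring the categories seen in Number.of.Engines
engines = pd.Series([1.0, 2.0, None, 4.0, 0.0, 1.0], name="Number.of.Engines")

# ASSUMPTION: zero engines is implausible for powered airplanes, so mask it.
# NaN == 0 evaluates to False, so existing NaNs are left untouched here.
engines_no_zero = engines.mask(engines == 0)

# Nullable Int64 keeps whole-number counts while still allowing missing values
engines_clean = engines_no_zero.astype("Int64")
print(engines_clean.value_counts(dropna=False))
```

If the zero-engine records turn out to be legitimate (for example, unpowered aircraft that survived earlier filters), the masking step should be skipped.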

### Purpose.of.flight

In [39]:
# Show all unique values and counts
aviation_clean_df['Purpose.of.flight'].value_counts(dropna=False)

Purpose.of.flight
Personal                     9844
NaN                          3047
Instructional                2410
Aerial Application            724
Business                      409
Unknown                       303
Positioning                   269
Skydiving                     157
Aerial Observation            147
Other Work Use                121
Banner Tow                     86
Flight Test                    73
Ferry                          72
Executive/corporate            65
Glider Tow                     29
Public Aircraft - Federal      28
Public Aircraft                27
Public Aircraft - State        21
Air Race show                  15
Firefighting                   12
Public Aircraft - Local        11
PUBS                            3
Air Race/show                   2
Air Drop                        2
ASHO                            2
Name: count, dtype: int64

In [72]:
# Create a new column based on Purpose.of.flight
aviation_clean_df['Purpose.of.flight.Clean'] = aviation_clean_df['Purpose.of.flight'].replace({
    'Air Race/show': 'Air Race show'
})

# Fill NaNs with string 'NaN'
aviation_clean_df['Purpose.of.flight.Clean'] = aviation_clean_df['Purpose.of.flight.Clean'].fillna('NaN')

aviation_clean_df['Purpose.of.flight.Clean'].value_counts()

Purpose.of.flight.Clean
Personal                     36907
Instructional                 9392
Unknown                       5504
NaN                           5328
Aerial Application            4134
Business                      3430
Positioning                   1446
Other Work Use                1061
Aerial Observation             708
Public Aircraft                643
Ferry                          640
Executive/corporate            424
Skydiving                      175
Flight Test                    158
Banner Tow                      97
External Load                   85
Public Aircraft - Federal       75
Public Aircraft - State         57
Public Aircraft - Local         52
Air Race show                   45
Glider Tow                      41
Firefighting                    33
Air Drop                        10
PUBS                             3
ASHO                             3
PUBL                             1
Name: count, dtype: int64

### Broad.phase.of.flight

In [43]:
# Show unique values and counts
aviation_clean_df['Broad.phase.of.flight'].value_counts(dropna=False)

Broad.phase.of.flight
NaN            15427
Landing         1110
Takeoff          425
Cruise           238
Approach         210
Maneuvering      127
Taxi              99
Go-around         81
Descent           62
Climb             52
Standing          35
Unknown           11
Other              2
Name: count, dtype: int64

In [41]:
# Minimal cleaning with a new column
aviation_clean_df['Phase.Clean'] = aviation_clean_df['Broad.phase.of.flight'].replace({
    'Approach': 'Landing',
    'Go-around': 'Landing',
    'Take-off Run': 'Takeoff',
    'Takeoff Run': 'Takeoff',
    'Initial climb': 'Climb',
    'Climb-out': 'Climb'
})

# Fill NaNs with string 'NaN'
aviation_clean_df['Phase.Clean'] = aviation_clean_df['Phase.Clean'].fillna('NaN')

# Check result
aviation_clean_df['Phase.Clean'].value_counts()

Phase.Clean
NaN            15427
Landing         1401
Takeoff          425
Cruise           238
Maneuvering      127
Taxi              99
Descent           62
Climb             52
Standing          35
Unknown           11
Other              2
Name: count, dtype: int64

### Column Removal
- inspect the dataframe and drop any columns that have too many NaNs

In [47]:
pd.set_option('display.max_columns', None)
aviation_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 32 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Event.Id                88889 non-null  object        
 1   Investigation.Type      88889 non-null  object        
 2   Accident.Number         88889 non-null  object        
 3   Event.Date              88889 non-null  datetime64[ns]
 4   Location                88837 non-null  object        
 5   Country                 88663 non-null  object        
 6   Latitude                34382 non-null  object        
 7   Longitude               34373 non-null  object        
 8   Airport.Code            50132 non-null  object        
 9   Airport.Name            52704 non-null  object        
 10  Injury.Severity         87889 non-null  object        
 11  Aircraft.damage         85695 non-null  object        
 12  Aircraft.Category       32287 non-null  object

In [42]:
# Show fraction of missing values per column, sorted descending
aviation_clean_df.isna().mean().sort_values(ascending=False)

Schedule                   0.880362
Broad.phase.of.flight      0.862856
Air.carrier                0.527490
Airport.Code               0.348509
Airport.Name               0.342581
Report.Status              0.211701
Engine.Type                0.179764
Purpose.of.flight          0.170423
Weather.Condition          0.135187
Number.of.Engines          0.116841
Longitude                  0.106326
Latitude                   0.106158
Publication.Date           0.044074
Frac.Fatal.Serious         0.043403
Injury.Severity            0.040103
FAR.Description            0.019240
Registration.Number        0.009173
Location                   0.000224
Country                    0.000056
Event.Year                 0.000000
Total.Passengers           0.000000
Total.Uninjured            0.000000
destroyed                  0.000000
PlaneType                  0.000000
Engine.Type.Clean          0.000000
Weather.Condition.Clean    0.000000
Event.Id                   0.000000
Total.Minor.Injuries       0

In [44]:
# Drop sparse columns (>50% missing); Broad.phase.of.flight is also above this threshold but is kept
cols_to_drop = ['Schedule', 'Air.carrier']

aviation_clean_df = aviation_clean_df.drop(columns=cols_to_drop)

aviation_clean_df.isna().mean().sort_values(ascending=False)

Broad.phase.of.flight      0.862856
Airport.Code               0.348509
Airport.Name               0.342581
Report.Status              0.211701
Engine.Type                0.179764
Purpose.of.flight          0.170423
Weather.Condition          0.135187
Number.of.Engines          0.116841
Longitude                  0.106326
Latitude                   0.106158
Publication.Date           0.044074
Frac.Fatal.Serious         0.043403
Injury.Severity            0.040103
FAR.Description            0.019240
Registration.Number        0.009173
Location                   0.000224
Country                    0.000056
Weather.Condition.Clean    0.000000
Engine.Type.Clean          0.000000
Total.Passengers           0.000000
Event.Year                 0.000000
Total.Uninjured            0.000000
PlaneType                  0.000000
destroyed                  0.000000
Event.Id                   0.000000
Total.Minor.Injuries       0.000000
Total.Serious.Injuries     0.000000
Total.Fatal.Injuries       0
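
The drop list above is written by hand. A minimal sketch of deriving it from the missing-value fractions is shown below on toy data. The helper `sparse_columns`, its `threshold` default, and its `keep` argument are hypothetical names introduced for illustration only.

```python
import numpy as np
import pandas as pd

def sparse_columns(df, threshold=0.5, keep=()):
    """Hypothetical helper: columns whose missing fraction exceeds threshold."""
    missing_frac = df.isna().mean()
    too_sparse = missing_frac[missing_frac > threshold].index
    # Columns listed in `keep` are retained even if they are sparse
    return [col for col in too_sparse if col not in keep]

# Toy frame: two columns are 75% missing, one is 25% missing
toy_df = pd.DataFrame({
    "Schedule": [np.nan, np.nan, np.nan, "SCHD"],
    "Broad.phase.of.flight": [np.nan, np.nan, np.nan, "Landing"],
    "Weather.Condition": ["VMC", "IMC", np.nan, "VMC"],
})

cols_to_drop = sparse_columns(toy_df, threshold=0.5, keep=("Broad.phase.of.flight",))
toy_trimmed = toy_df.drop(columns=cols_to_drop)
print(cols_to_drop)
```

Making the retained columns explicit through `keep` documents the exceptions to the threshold instead of leaving them implicit in a hand-written list.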

### Save DataFrame to csv
- it's generally useful to save data to a file/server once it's in a sufficiently cleaned or intermediate state
- the data can then be loaded directly in another notebook for further analysis
- this helps keep your notebooks and workflow readable, clean and modularized

In [45]:
aviation_clean_df.to_csv('data/AviationData_Cleaned.csv', index=False, encoding='utf-8')
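
One thing to check when reloading this file in another notebook: by default `pd.read_csv` parses the text `NaN` as a missing value, so the literal `'NaN'` strings written into `Purpose.of.flight.Clean` and `Phase.Clean` would come back as real NaNs. Date columns also come back as plain text unless `parse_dates` is passed. The sketch below shows the round trip on toy data in a temporary directory (the path is a stand-in for `data/AviationData_Cleaned.csv`).

```python
import os
import tempfile
import pandas as pd

# Toy frame with a datetime column and a literal 'NaN' placeholder string
toy_df = pd.DataFrame({
    "Event.Date": pd.to_datetime(["1983-01-01", "2001-06-15"]),
    "Purpose.of.flight.Clean": ["Personal", "NaN"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "AviationData_Cleaned.csv")
    toy_df.to_csv(csv_path, index=False, encoding="utf-8")

    # Default parsing: the text 'NaN' is read back as a missing value
    reloaded_default = pd.read_csv(csv_path, parse_dates=["Event.Date"])

    # Keep 'NaN' as a literal label; only empty fields count as missing
    reloaded_literal = pd.read_csv(
        csv_path,
        parse_dates=["Event.Date"],
        keep_default_na=False,
        na_values=[""],
    )

print(reloaded_default["Purpose.of.flight.Clean"].isna().tolist())
print(reloaded_literal["Purpose.of.flight.Clean"].tolist())
```

Using a placeholder such as `'Unknown'` or `'Missing'` instead of `'NaN'` would avoid the issue without special `read_csv` arguments.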