# Aviation Accidents Analysis

You are part of a consulting firm that is tasked to do an analysis of commercial and passenger jet airline safety. The client (an airline/airplane insurer) is interested in knowing what types of aircraft (makes/models) exhibit low rates of total destruction and low likelihood of fatal or serious passenger injuries in the event of an accident. They are also interested in any general variables/conditions that might be at play. Your analysis will be based off of aviation accident data accumulated from the years 1948-2023. 

Our client is only interested in airplane makes/models that are professional builds and could potentially still be active. Assume a max lifetime of 40 years for a make/model retirement and make sure to filter your data accordingly (i.e. from 1983 onwards). They would also like separate recommendations for small aircraft vs. larger passenger models. **In addition, make sure that claims that you make are statistically robust and that you have enough samples when making comparisons between groups.**


In this summative assessment you will demonstrate your ability to:
- **Use Pandas to load, inspect, and clean the dataset appropriately.**
- **Transform relevant columns to create measures that address the problem at hand.**
- conduct EDA: visualization and statistical measures to systematically understand the structure of the data
- recommend a set of airplanes and makes conforming to the client's request and identify at least *two* factors contributing to airplane safety. You must provide supporting evidence (visuals, summary statistics, tables) for each claim you make.

### Make relevant library imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Loading and Inspection

### Load in data from the relevant directory and inspect the dataframe.
- inspect NaNs, datatypes, and summary statistics

In [2]:
# Reading in csv
accidents_df = pd.read_csv("AviationData.csv", encoding="latin-1", low_memory=False)

# Checking Data
accidents_df.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.922223,-81.878056,,,...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


In [3]:
# Getting DF info
accidents_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50132 non-null  object 
 9   Airport.Name            52704 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87507 non-null  object 
 14  Make                    88826 non-null

In [4]:
accidents_df.describe(include = "object")

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Amateur.Built,Engine.Type,FAR.Description,Schedule,Purpose.of.flight,Air.carrier,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
count,88889,88889,88889,88889,88837,88663,34382,34373,50132,52704,...,88787,81793,32023,12582,82697,16648,84397,61724,82505,75118
unique,87951,2,88863,14782,27758,219,25589,27154,10374,24870,...,2,12,31,3,26,13590,4,12,17074,2924
top,20001212X19172,Accident,CEN22LA149,1984-06-30,"ANCHORAGE, AK",United States,332739N,0112457W,NONE,Private,...,No,Reciprocating,91,NSCH,Personal,Pilot,VMC,Landing,Probable Cause,25-09-2020
freq,3,85015,2,25,434,82248,19,24,1488,240,...,80312,69530,18221,4474,49448,258,77303,15428,61754,17019


## Data Cleaning

### Filtering aircrafts and events

We want to filter the dataset to include aircraft that the client is interested in an analysis of:
- inspect relevant columns
- figure out any reasonable imputations
- filter the dataset

In [5]:
# converting Event.Dates to Datetime dtype
accidents_df["Event.Date"] = pd.to_datetime(accidents_df["Event.Date"])

# Remove data from greater than 40 years prior
date_cleaned_df = accidents_df.loc[accidents_df["Event.Date"] > "1983-01-01"] 

In [6]:
#Checking for duplicate entries
date_cleaned_df["Event.Id"].value_counts()

Event.Id
20001212X19172    3
20001214X45071    3
20001213X28363    2
20001213X32577    2
20001213X29734    2
                 ..
20001211X12036    1
20001211X11969    1
20001211X12014    1
20001211X12027    1
20221230106513    1
Name: count, Length: 84391, dtype: int64

In [7]:
# Removing repeated entries
no_duplicated_events = date_cleaned_df.drop_duplicates(subset="Event.Id")

In [8]:
#Recheck
no_duplicated_events["Event.Id"].value_counts()

Event.Id
20001214X42064    1
20060607X00697    1
20060618X00761    1
20060525X00622    1
20060618X00764    1
                 ..
20001211X11796    1
20001211X11767    1
20001211X11800    1
20001211X11864    1
20221230106513    1
Name: count, Length: 84391, dtype: int64

In [9]:
# Rechecking nulls
no_duplicated_events.isnull().sum()

Event.Id                      0
Investigation.Type            0
Accident.Number               0
Event.Date                    0
Location                     52
Country                     212
Latitude                  50182
Longitude                 50191
Airport.Code              36580
Airport.Name              34525
Injury.Severity             990
Aircraft.damage            3046
Aircraft.Category         55728
Registration.Number        1350
Make                         57
Model                        78
Amateur.Built                99
Number.of.Engines          6025
Engine.Type                7042
FAR.Description           55994
Schedule                  72560
Purpose.of.flight          6117
Air.carrier               68089
Total.Fatal.Injuries      11242
Total.Serious.Injuries    12293
Total.Minor.Injuries      11732
Total.Uninjured            5854
Weather.Condition          4472
Broad.phase.of.flight     27112
Report.Status              6364
Publication.Date          13591
dtype: i

In [10]:
# Dropping na in Amateur.Built
no_duplicated_events.dropna(subset=["Amateur.Built"], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  no_duplicated_events.dropna(subset=["Amateur.Built"], inplace=True)


In [11]:
# Removing records of Amateur Built aircraft
professional_builds = no_duplicated_events[no_duplicated_events["Amateur.Built"] != "Yes"]
professional_builds = professional_builds.drop(["Amateur.Built"], axis=1)

In [12]:
# Removing craft listed as non-airplane
professional_builds.fillna({"Aircraft.Category":"Airplane"}, inplace=True)
professional_airplanes1 = professional_builds.loc[professional_builds["Aircraft.Category"]=="Airplane"]

In [13]:
#Removing "Balloon Works" aircraft as non-airplane
professional_airplanes = professional_airplanes1.loc[professional_airplanes1["Make"]!="Balloon Works"]

## Data Cleaning

### Cleaning and constructing Key Measurables

Injuries and robustness to destruction are a key interest point for the client. Clean and impute relevant columns and then create derived fields that best quantifies what the client wishes to track. **Use commenting or markdown to explain any cleaning assumptions as well as any derived columns you create.**

**Construct metric for fatal/serious injuries**

*Hint:* Estimate the total number of passengers on each flight. The likelihood of serious / fatal injury can be estimated as a fraction from this.

In [14]:
# Checking statistics on injury data
professional_airplanes.describe()

Unnamed: 0,Event.Date,Number.of.Engines,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured
count,72042,67159.0,62538.0,61611.0,62089.0,67370.0
mean,1999-07-27 03:39:40.311485056,1.171667,0.667322,0.278716,0.362931,5.88909
min,1983-01-02 00:00:00,0.0,0.0,0.0,0.0,0.0
25%,1989-06-22 00:00:00,1.0,0.0,0.0,0.0,0.0
50%,1998-02-11 00:00:00,1.0,0.0,0.0,0.0,1.0
75%,2008-07-05 00:00:00,1.0,0.0,0.0,0.0,2.0
max,2022-12-29 00:00:00,8.0,349.0,161.0,200.0,699.0
std,,0.458691,5.767482,1.67733,1.900831,29.023902


In [15]:
# Replacing nulls with 0 for injury columns with medians of 0 and means less than 1
professional_airplanes["Total.Fatal.Injuries"] = professional_airplanes["Total.Fatal.Injuries"].fillna(0)
professional_airplanes["Total.Serious.Injuries"] = professional_airplanes["Total.Fatal.Injuries"].fillna(0)
professional_airplanes["Total.Minor.Injuries"] = professional_airplanes["Total.Fatal.Injuries"].fillna(0)

# Replacing uninjured with the median value of 1
professional_airplanes["Total.Uninjured"] = professional_airplanes["Total.Uninjured"].fillna(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  professional_airplanes["Total.Fatal.Injuries"] = professional_airplanes["Total.Fatal.Injuries"].fillna(0)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  professional_airplanes["Total.Serious.Injuries"] = professional_airplanes["Total.Fatal.Injuries"].fillna(0)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-

In [16]:
# Renaming dataframe for ease of coding
airplane_df = professional_airplanes

In [17]:
# Creating total passenger column
def total_passenger(row):
    return row["Total.Fatal.Injuries"]+row["Total.Serious.Injuries"]+row["Total.Minor.Injuries"]+row["Total.Uninjured"]

airplane_df["Total.Passengers"] = airplane_df.apply(total_passenger, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_df["Total.Passengers"] = airplane_df.apply(total_passenger, axis=1)


In [18]:
# Filtering out any results with Total Passengers equal to zero
airplane_filtered = airplane_df[airplane_df["Total.Passengers"] != 0]

In [19]:
# Creating ratio columns for fatal and serious unjuries
def fatal_ratio(row):
    return row["Total.Fatal.Injuries"]/row["Total.Passengers"]

def serious_ratio(row):
    return row["Total.Serious.Injuries"]/row["Total.Passengers"]

def fatal_and_serious_ratio(row):
    return row["Fatality Ratio"]+row["Serious Injury Ratio"]
    
airplane_filtered["Fatality Ratio"] = airplane_filtered.apply(fatal_ratio, axis=1)
airplane_filtered["Serious Injury Ratio"] = airplane_filtered.apply(serious_ratio, axis=1)
airplane_filtered["Fatality/Serious Injury Ratio"] = airplane_filtered.apply(fatal_and_serious_ratio, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_filtered["Fatality Ratio"] = airplane_filtered.apply(fatal_ratio, axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_filtered["Serious Injury Ratio"] = airplane_filtered.apply(serious_ratio, axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_filtered["Fatality/Ser

**Aircraft.Damage**
- identify and execute any cleaning tasks
- construct a derived column tracking whether an aircraft was destroyed or not.

In [20]:
# Checking Aircraft Damage collumn
airplane_filtered["Aircraft.damage"].value_counts()

Aircraft.damage
Substantial    43529
Destroyed      12202
Minor           2071
Unknown           61
Name: count, dtype: int64

In [21]:
# Filtering Aircraft damage for unknown and nan
airplane_filtered["Aircraft.damage"] = airplane_filtered["Aircraft.damage"].replace({"Unknown":np.nan})
airplane_filtered.dropna(subset=["Aircraft.damage"], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_filtered["Aircraft.damage"] = airplane_filtered["Aircraft.damage"].replace({"Unknown":np.nan})
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_filtered.dropna(subset=["Aircraft.damage"], inplace=True)


In [22]:
# Boolean column created for True if Aircraft listed as destroyed
airplane_filtered["Aircraft Destroyed?"] = airplane_filtered["Aircraft.damage"]=="Destroyed"

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_filtered["Aircraft Destroyed?"] = airplane_filtered["Aircraft.damage"]=="Destroyed"


### Investigate the *Make* column
- Identify cleaning tasks here
- List cleaning tasks clearly in markdown
- Execute the cleaning tasks
- For your analysis, keep Makes with a reasonable number (you can put the threshold at 50 though lower could work as well)

In [23]:
# Dropping nulls from make
airplane_filtered = airplane_filtered.dropna(subset=["Make"])

In [24]:
# Checking Make values
airplane_filtered["Make"].value_counts().head(50)

Make
Cessna                            17332
Piper                              9341
CESSNA                             3941
Beech                              3485
PIPER                              2260
Bell                               1347
Boeing                              875
Mooney                              847
BEECH                               832
Grumman                             782
Bellanca                            655
Robinson                            574
Hughes                              533
Air Tractor                         466
Schweizer                           419
Maule                               394
BOEING                              354
Aeronca                             347
Mcdonnell Douglas                   335
Champion                            333
De Havilland                        299
Aero Commander                      280
Stinson                             266
Rockwell                            251
North American                     

In [25]:
# Stripping and capitalizing Make Names
airplane_filtered["Make"] = airplane_filtered["Make"].str.strip()
airplane_filtered["Make"] = airplane_filtered["Make"].str.capitalize()

# Standardizing Airbus, Mcdonnel douglas, Bombardier, Aviat, and de Havilland
airplane_filtered["Make"] = airplane_filtered["Make"].replace("Airbus industrie", "Airbus")
airplane_filtered["Make"] = airplane_filtered["Make"].replace("Dehavilland", "De havilland")
airplane_filtered["Make"] = airplane_filtered["Make"].replace("Douglas", "Mcdonnell douglas")
airplane_filtered["Make"] = airplane_filtered["Make"].replace("Bombardier inc", "Bombardier")
airplane_filtered["Make"] = airplane_filtered["Make"].replace("Aviat aircraft inc", "Aviat")

In [26]:
# Recheck
airplane_filtered["Make"].value_counts().head(50)

Make
Cessna                            21273
Piper                             11601
Beech                              4317
Bell                               1353
Boeing                             1229
Mooney                             1031
Grumman                             836
Bellanca                            780
Robinson                            583
Mcdonnell douglas                   565
Hughes                              533
Air tractor                         533
Maule                               519
Aeronca                             461
Schweizer                           423
De havilland                        413
Champion                            406
Stinson                             336
Aero commander                      328
North american                      308
Luscombe                            307
Rockwell                            272
Taylorcraft                         266
Aerospatiale                        213
Hiller                             

In [27]:
# Removing Makes with fewer than 50 mentions
value_counts = airplane_filtered["Make"].value_counts()
restricted_value_counts = value_counts>=50
restricted_value_counts = restricted_value_counts[restricted_value_counts == False]
drop_list = restricted_value_counts.index

airplane_makes_cut = airplane_filtered[~airplane_filtered["Make"].isin(drop_list)]

In [28]:
# Check
airplane_makes_cut["Make"].value_counts()

Make
Cessna               21273
Piper                11601
Beech                 4317
Bell                  1353
Boeing                1229
                     ...  
British aerospace       55
Gates learjet           55
Ercoupe                 54
Gulfstream              53
Canadair                50
Name: count, Length: 65, dtype: int64

### Inspect Model column
- Get rid of any NaNs.
- Inspect the column and counts for each model/make. Are model labels unique to each make?
- If not, create a derived column that is a unique identifier for a given plane type.

In [29]:
# Drop Model Nulls

airplane_model_cut = airplane_makes_cut.dropna(subset=["Model"])

In [30]:
# Check
airplane_model_cut.info()

<class 'pandas.core.frame.DataFrame'>
Index: 53104 entries, 3608 to 88876
Data columns (total 35 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   Event.Id                       53104 non-null  object        
 1   Investigation.Type             53104 non-null  object        
 2   Accident.Number                53104 non-null  object        
 3   Event.Date                     53104 non-null  datetime64[ns]
 4   Location                       53071 non-null  object        
 5   Country                        52970 non-null  object        
 6   Latitude                       19742 non-null  object        
 7   Longitude                      19737 non-null  object        
 8   Airport.Code                   31540 non-null  object        
 9   Airport.Name                   33040 non-null  object        
 10  Injury.Severity                53104 non-null  object        
 11  Aircraft.damage  

In [31]:
# Stripping Model strings and checking counts
airplane_model_cut["Model"] = airplane_model_cut["Model"].str.strip()
airplane_model_cut["Model"].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_model_cut["Model"] = airplane_model_cut["Model"].str.strip()


Model
152          1859
172          1346
172N          910
PA-28-140     638
172M          617
             ... 
BAE3201         1
767-251         1
747-269BC       1
TU206D          1
P63             1
Name: count, Length: 4660, dtype: int64

In [32]:
# Creating combined "Make - Model" column

airplane_model_cut["Make - Model"] = airplane_model_cut["Make"] + " - " + airplane_model_cut["Model"]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_model_cut["Make - Model"] = airplane_model_cut["Make"] + " - " + airplane_model_cut["Model"]


In [33]:
airplane_model_cut["Make - Model"].value_counts()

Make - Model
Cessna - 152                      1859
Cessna - 172                      1346
Cessna - 172N                      910
Piper - PA-28-140                  638
Cessna - 172M                      617
                                  ... 
Cessna - CE-550                      1
Bell - B-3                           1
Bell - 47G2-M                        1
Burkhart grob - SPEED ASTIR II       1
Bell - P63                           1
Name: count, Length: 4986, dtype: int64

### Cleaning other columns
- there are other columns containing data that might be related to the outcome of an accident. We list a few here:
- Engine.Type
- Weather.Condition
- Number.of.Engines
- Purpose.of.flight
- Broad.phase.of.flight

Inspect and identify potential cleaning tasks in each of the above columns. Execute those cleaning tasks. 

**Note**: You do not necessarily need to impute or drop NaNs here.

In [34]:
# Checking engine types
airplane_model_cut["Engine.Type"].value_counts()

Engine.Type
Reciprocating      44066
Turbo Prop          2273
Turbo Shaft         1542
Turbo Fan           1225
Unknown              846
Turbo Jet            355
Geared Turbofan        1
Name: count, dtype: int64

In [35]:
# Checking Weather conditions
airplane_model_cut["Weather.Condition"].value_counts()

Weather.Condition
VMC    46363
IMC     4358
UNK      564
Unk      129
Name: count, dtype: int64

In [36]:
# Taking Weather conditions to all caps
airplane_model_cut["Weather.Condition"] = airplane_model_cut["Weather.Condition"].str.upper()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_model_cut["Weather.Condition"] = airplane_model_cut["Weather.Condition"].str.upper()


In [37]:
# Weather recheck
airplane_model_cut["Weather.Condition"].value_counts()

Weather.Condition
VMC    46363
IMC     4358
UNK      693
Name: count, dtype: int64

In [38]:
# number of engines check
airplane_model_cut["Number.of.Engines"].value_counts()

Number.of.Engines
1.0    42430
2.0     7284
0.0      282
4.0      273
3.0      265
Name: count, dtype: int64

In [39]:
# Removing rows with 0 engines listed
airplane_model_cut = airplane_model_cut[airplane_model_cut["Number.of.Engines"] != 0]

In [40]:
# recheck
airplane_model_cut["Number.of.Engines"].value_counts()

Number.of.Engines
1.0    42430
2.0     7284
4.0      273
3.0      265
Name: count, dtype: int64

In [41]:
# Checking purpose of flight:
airplane_model_cut["Purpose.of.flight"].value_counts()

Purpose.of.flight
Personal                     29209
Instructional                 7446
Unknown                       3861
Aerial Application            2927
Business                      2720
Positioning                   1045
Other Work Use                 671
Ferry                          467
Public Aircraft                454
Aerial Observation             430
Executive/corporate            340
Skydiving                      153
Flight Test                    111
Banner Tow                      70
Public Aircraft - Federal       42
Glider Tow                      25
Public Aircraft - State         25
Firefighting                    20
Public Aircraft - Local         14
Air Race/show                   13
External Load                   12
Air Race show                   11
Air Drop                         5
PUBS                             2
ASHO                             2
Name: count, dtype: int64

In [42]:
# Checking phase of flight:
airplane_model_cut["Broad.phase.of.flight"].value_counts()

Broad.phase.of.flight
Landing        11516
Takeoff         7641
Cruise          6473
Maneuvering     4694
Approach        3786
Taxi            1376
Climb           1201
Descent         1125
Go-around        884
Standing         516
Unknown          357
Other             66
Name: count, dtype: int64

### Column Removal
- inspect the dataframe and drop any columns that have too many NaNs

In [43]:
airplane_model_cut.isnull().sum()

Event.Id                             0
Investigation.Type                   0
Accident.Number                      0
Event.Date                           0
Location                            31
Country                            132
Latitude                         33107
Longitude                        33112
Airport.Code                     21376
Airport.Name                     19885
Injury.Severity                      0
Aircraft.damage                      0
Aircraft.Category                    0
Registration.Number                848
Make                                 0
Model                                0
Number.of.Engines                 2570
Engine.Type                       2762
FAR.Description                  38454
Schedule                         45640
Purpose.of.flight                 2747
Air.carrier                      44395
Total.Fatal.Injuries                 0
Total.Serious.Injuries               0
Total.Minor.Injuries                 0
Total.Uninjured          

In [44]:
# Longitude, Latitude, Airport.Code, Airport.Name, FAR.Decription, Schedule, Air.Carrier removed
# Reason: Greater than 20,000 null and not inherently relevant

column_cut_df = airplane_model_cut.drop(["Longitude", "Latitude", "Airport.Code", "Airport.Name", "FAR.Description", "Schedule", "Air.carrier"], axis = 1)

In [45]:
# Recheck
column_cut_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 52822 entries, 3608 to 88876
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   Event.Id                       52822 non-null  object        
 1   Investigation.Type             52822 non-null  object        
 2   Accident.Number                52822 non-null  object        
 3   Event.Date                     52822 non-null  datetime64[ns]
 4   Location                       52791 non-null  object        
 5   Country                        52690 non-null  object        
 6   Injury.Severity                52822 non-null  object        
 7   Aircraft.damage                52822 non-null  object        
 8   Aircraft.Category              52822 non-null  object        
 9   Registration.Number            51974 non-null  object        
 10  Make                           52822 non-null  object        
 11  Model            

### Save DataFrame to csv
- its generally useful to save data to file/server after its in a sufficiently cleaned or intermediate state
- the data can then be loaded directly in another notebook for further analysis
- this helps keep your notebooks and workflow readable, clean and modularized

In [46]:
column_cut_df.to_csv("Cleaned_Accident_Data.csv", index=False)