# Aviation Accidents Analysis

You are part of a consulting firm that is tasked to do an analysis of commercial and passenger jet airline safety. The client (an airline/airplane insurer) is interested in knowing what types of aircraft (makes/models) exhibit low rates of total destruction and low likelihood of fatal or serious passenger injuries in the event of an accident. They are also interested in any general variables/conditions that might be at play. Your analysis will be based off of aviation accident data accumulated from the years 1948-2023. 

Our client is only interested in airplane makes/models that are professional builds and could potentially still be active. Assume a max lifetime of 40 years for a make/model retirement and make sure to filter your data accordingly (i.e. from 1983 onwards). They would also like separate recommendations for small aircraft vs. larger passenger models. **In addition, make sure that claims that you make are statistically robust and that you have enough samples when making comparisons between groups.**


In this summative assessment you will demonstrate your ability to:
- **Use Pandas to load, inspect, and clean the dataset appropriately.**
- **Transform relevant columns to create measures that address the problem at hand.**
- conduct EDA: visualization and statistical measures to systematically understand the structure of the data
- recommend a set of airplanes and makes conforming to the client's request and identify at least *two* factors contributing to airplane safety. You must provide supporting evidence (visuals, summary statistics, tables) for each claim you make.

### Make relevant library imports

In [64]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Loading and Inspection

### Load in data from the relevant directory and inspect the dataframe.
- inspect NaNs, datatypes, and summary statistics

In [65]:
# Reading in csv
accidents_df = pd.read_csv("AviationData.csv", encoding="1252")

accidents_df.head()

  accidents_df = pd.read_csv("AviationData.csv", encoding="1252")


Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.922223,-81.878056,,,...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


In [66]:
accidents_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50132 non-null  object 
 9   Airport.Name            52704 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87507 non-null  object 
 14  Make                    88826 non-null

## Data Cleaning

### Filtering aircrafts and events

We want to filter the dataset to include aircraft that the client is interested in an analysis of:
- inspect relevant columns
- figure out any reasonable imputations
- filter the dataset

In [67]:
# converting Event.Dates to Datetime dtype
accidents_df["Event.Date"] = pd.to_datetime(accidents_df["Event.Date"])

# Remove data from greater than 40 years prior
date_cleaned_df = accidents_df.loc[accidents_df["Event.Date"] > "1983-01-01"] 

In [68]:
# Removing records of Amateur Built aircraft
professional_builds = date_cleaned_df[date_cleaned_df["Amateur.Built"] != "Yes"]
professional_builds = professional_builds.drop(["Amateur.Built"], axis=1)

In [69]:
# Removing craft listed as non-airplane
professional_builds.fillna({"Aircraft.Category":"Airplane"}, inplace=True)
professional_airplanes = professional_builds.loc[professional_builds["Aircraft.Category"]=="Airplane"]

In [86]:
professional_airplanes.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date,Total.Passengers
3606,20001214X42064,Accident,MKC83LA051,1983-01-02,"INDIANOLA, IA",United States,,,,,...,,0.0,0.0,0.0,1.0,VMC,Landing,Probable Cause,,1.0
3607,20001214X42010,Accident,LAX83FA064,1983-01-02,"MONTEREY, CA",United States,,,,,...,,0.0,0.0,0.0,0.0,VMC,Takeoff,Probable Cause,,0.0
3608,20001214X41937,Accident,CHI83FA069,1983-01-02,"GENOA CITY, WI",United States,,,64C,VINCENT,...,,2.0,2.0,2.0,0.0,VMC,Maneuvering,Probable Cause,,6.0
3609,20001214X41919,Accident,ATL83FA081,1983-01-02,"BEAUFORT, SC",United States,,,,,...,,3.0,3.0,3.0,0.0,IMC,Cruise,Probable Cause,,9.0
3610,20001214X42051,Accident,MIA83LA056,1983-01-02,"NEAR VERO BEACH, FL",United States,,,,,...,,0.0,0.0,0.0,2.0,VMC,Maneuvering,Probable Cause,,2.0


## Data Cleaning

### Cleaning and constructing Key Measurables

Injuries and robustness to destruction are a key interest point for the client. Clean and impute relevant columns and then create derived fields that best quantifies what the client wishes to track. **Use commenting or markdown to explain any cleaning assumptions as well as any derived columns you create.**

**Construct metric for fatal/serious injuries**

*Hint:* Estimate the total number of passengers on each flight. The likelihood of serious / fatal injury can be estimated as a fraction from this.

In [74]:
# Calculating and printing mean and median for injury columns
fatal_mean = professional_airplanes["Total.Fatal.Injuries"].mean()
serious_mean = professional_airplanes["Total.Serious.Injuries"].mean()
minor_mean = professional_airplanes["Total.Minor.Injuries"].mean()
uninjured_mean = professional_airplanes["Total.Uninjured"].mean()

fatal_median = professional_airplanes["Total.Fatal.Injuries"].median()
serious_median = professional_airplanes["Total.Serious.Injuries"].median()
minor_median = professional_airplanes["Total.Minor.Injuries"].median()
uninjured_median = professional_airplanes["Total.Uninjured"].median()

print("Injury Means:")
print(fatal_mean, serious_mean, minor_mean, uninjured_mean)
print("Injury Medians:")
print(fatal_median, serious_median, minor_median, uninjured_median)

Injury Means:
0.597376125201784 0.597376125201784 0.597376125201784 6.182222742709919
Injury Medians:
0.0 0.0 0.0 1.0


In [75]:
# Replacing nulls with 0 for injury columns with medians of 0 and means less than 1
professional_airplanes["Total.Fatal.Injuries"] = professional_airplanes["Total.Fatal.Injuries"].fillna(0)
professional_airplanes["Total.Serious.Injuries"] = professional_airplanes["Total.Fatal.Injuries"].fillna(0)
professional_airplanes["Total.Minor.Injuries"] = professional_airplanes["Total.Fatal.Injuries"].fillna(0)

# Replacing uninjured with the median value of 1
professional_airplanes["Total.Uninjured"] = professional_airplanes["Total.Uninjured"].fillna(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  professional_airplanes["Total.Fatal.Injuries"] = professional_airplanes["Total.Fatal.Injuries"].fillna(0)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  professional_airplanes["Total.Serious.Injuries"] = professional_airplanes["Total.Fatal.Injuries"].fillna(0)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-

In [79]:
# Renaming dataframe for ease of coding
airplane_df = professional_airplanes

<class 'pandas.core.frame.DataFrame'>
Index: 73098 entries, 3606 to Total.Minor.Injuries
Data columns (total 30 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Event.Id                73095 non-null  object        
 1   Investigation.Type      73095 non-null  object        
 2   Accident.Number         73095 non-null  object        
 3   Event.Date              73095 non-null  datetime64[ns]
 4   Location                73045 non-null  object        
 5   Country                 72886 non-null  object        
 6   Latitude                26637 non-null  object        
 7   Longitude               26632 non-null  object        
 8   Airport.Code            41834 non-null  object        
 9   Airport.Name            43717 non-null  object        
 10  Injury.Severity         72166 non-null  object        
 11  Aircraft.damage         70152 non-null  object        
 12  Aircraft.Category       73095 non

In [82]:
# Creating total passenger column
def total_passenger(row):
    return row["Total.Fatal.Injuries"]+row["Total.Serious.Injuries"]+row["Total.Minor.Injuries"]+row["Total.Uninjured"]

airplane_df["Total.Passengers"] = airplane_df.apply(total_passenger, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_df["Total.Passengers"] = airplane_df.apply(total_passenger, axis=1)


In [83]:
airplane_df.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date,Total.Passengers
3606,20001214X42064,Accident,MKC83LA051,1983-01-02,"INDIANOLA, IA",United States,,,,,...,,0.0,0.0,0.0,1.0,VMC,Landing,Probable Cause,,1.0
3607,20001214X42010,Accident,LAX83FA064,1983-01-02,"MONTEREY, CA",United States,,,,,...,,0.0,0.0,0.0,0.0,VMC,Takeoff,Probable Cause,,0.0
3608,20001214X41937,Accident,CHI83FA069,1983-01-02,"GENOA CITY, WI",United States,,,64C,VINCENT,...,,2.0,2.0,2.0,0.0,VMC,Maneuvering,Probable Cause,,6.0
3609,20001214X41919,Accident,ATL83FA081,1983-01-02,"BEAUFORT, SC",United States,,,,,...,,3.0,3.0,3.0,0.0,IMC,Cruise,Probable Cause,,9.0
3610,20001214X42051,Accident,MIA83LA056,1983-01-02,"NEAR VERO BEACH, FL",United States,,,,,...,,0.0,0.0,0.0,2.0,VMC,Maneuvering,Probable Cause,,2.0


In [119]:
# Filtering out any results with Total Passengers equal to zero
airplane_filtered = airplane_df[airplane_df["Total.Passengers"] != 0]

In [120]:
# Creating ratio columns for fatal and serious unjuries
def fatal_ratio(row):
    return row["Total.Fatal.Injuries"]/row["Total.Passengers"]

def serious_ratio(row):
    return row["Total.Serious.Injuries"]/row["Total.Passengers"]
    
airplane_filtered["Fatality Ratio"] = airplane_filtered.apply(fatal_ratio, axis=1)
airplane_filtered["Serious Injury Ratio"] = airplane_filtered.apply(serious_ratio, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_filtered["Fatality Ratio"] = airplane_filtered.apply(fatal_ratio, axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_filtered["Serious Injury Ratio"] = airplane_filtered.apply(serious_ratio, axis=1)


In [101]:
airplane_filtered.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date,Total.Passengers,Fatality Ratio,Serious Injury Ratio
3606,20001214X42064,Accident,MKC83LA051,1983-01-02,"INDIANOLA, IA",United States,,,,,...,0.0,0.0,1.0,VMC,Landing,Probable Cause,,1.0,0.0,0.0
3608,20001214X41937,Accident,CHI83FA069,1983-01-02,"GENOA CITY, WI",United States,,,64C,VINCENT,...,2.0,2.0,0.0,VMC,Maneuvering,Probable Cause,,6.0,0.333333,0.333333
3609,20001214X41919,Accident,ATL83FA081,1983-01-02,"BEAUFORT, SC",United States,,,,,...,3.0,3.0,0.0,IMC,Cruise,Probable Cause,,9.0,0.333333,0.333333
3610,20001214X42051,Accident,MIA83LA056,1983-01-02,"NEAR VERO BEACH, FL",United States,,,,,...,0.0,0.0,2.0,VMC,Maneuvering,Probable Cause,,2.0,0.0,0.0
3611,20001214X41994,Accident,FTW83LA073,1983-01-02,"BIG SPRING, TX",United States,,,21XS,BIG SPRING MUNICIPAL,...,0.0,0.0,3.0,VMC,Takeoff,Probable Cause,,3.0,0.0,0.0


**Aircraft.Damage**
- identify and execute any cleaning tasks
- construct a derived column tracking whether an aircraft was destroyed or not.

In [121]:
# Boolean column created for True if Aircraft listed as destroyed
airplane_filtered["Aircraft Destroyed?"] = airplane_filtered["Aircraft.damage"]=="Destroyed"

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_filtered["Aircraft Destroyed?"] = airplane_filtered["Aircraft.damage"]=="Destroyed"


### Investigate the *Make* column
- Identify cleaning tasks here
- List cleaning tasks clearly in markdown
- Execute the cleaning tasks
- For your analysis, keep Makes with a reasonable number (you can put the threshold at 50 though lower could work as well)

In [122]:
# Dropping nulls from make
airplane_filtered = airplane_filtered.dropna(subset=["Make"])

In [123]:
# Checking Make values
airplane_filtered["Make"].value_counts()

Make
Cessna                    17744
Piper                      9522
CESSNA                     4057
Beech                      3563
PIPER                      2308
                          ...  
NAVAL AIRCRAFT FACTORY        1
Carlson Aircraft              1
PARTENAVIA SPA                1
Wsk                           1
JAMES R DERNOVSEK             1
Name: count, Length: 1610, dtype: int64

In [124]:
# Stripping and capitalizing Make Names
airplane_filtered["Make"] = airplane_filtered["Make"].str.strip()
airplane_filtered["Make"] = airplane_filtered["Make"].str.capitalize()

In [125]:
# Recheck
airplane_filtered["Make"].value_counts()

Make
Cessna                            21801
Piper                             11830
Beech                              4417
Boeing                             2108
Bell                               1405
                                  ...  
Briegleb                              1
Bell-moore                            1
Extra flugzeugproduktions-gmbh        1
Columbia aircraft mfg.                1
James r dernovsek                     1
Name: count, Length: 1323, dtype: int64

In [138]:
# Removing Makes with fewer than 50 mentions
value_counts = airplane_filtered["Make"].value_counts()
restricted_value_counts = value_counts>=50
restricted_value_counts = restricted_value_counts[restricted_value_counts == False]
drop_list = restricted_value_counts.index

airplane_makes_cut = airplane_filtered[~airplane_filtered["Make"].isin(drop_list)]

In [140]:
# Check
airplane_makes_cut["Make"].value_counts()

Make
Cessna         21801
Piper          11830
Beech           4417
Boeing          2108
Bell            1405
               ...  
Schleicher        54
Ercoupe           54
Pilatus           51
Mbb               50
Great lakes       50
Name: count, Length: 78, dtype: int64

### Inspect Model column
- Get rid of any NaNs.
- Inspect the column and counts for each model/make. Are model labels unique to each make?
- If not, create a derived column that is a unique identifier for a given plane type.

In [142]:
# Drop Model Nulls

airplane_model_cut = airplane_makes_cut.dropna(subset=["Model"])

In [143]:
# Check

airplane_model_cut.info()

<class 'pandas.core.frame.DataFrame'>
Index: 56320 entries, 3606 to 88888
Data columns (total 34 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Event.Id                56320 non-null  object        
 1   Investigation.Type      56320 non-null  object        
 2   Accident.Number         56320 non-null  object        
 3   Event.Date              56320 non-null  datetime64[ns]
 4   Location                56280 non-null  object        
 5   Country                 56164 non-null  object        
 6   Latitude                20638 non-null  object        
 7   Longitude               20634 non-null  object        
 8   Airport.Code            33015 non-null  object        
 9   Airport.Name            34544 non-null  object        
 10  Injury.Severity         56320 non-null  object        
 11  Aircraft.damage         54257 non-null  object        
 12  Aircraft.Category       56320 non-null  object  

In [148]:
# Stripping Model strings and checking counts
airplane_model_cut["Model"] = airplane_model_cut["Model"].str.strip()
airplane_model_cut["Model"].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_model_cut["Model"] = airplane_model_cut["Model"].str.strip()


Model
152          1920
172          1378
172N          936
PA-28-140     651
172M          637
             ... 
TC19            1
DW-1            1
146-300A        1
305B            1
P63             1
Name: count, Length: 5191, dtype: int64

In [150]:
# Creating combined "Make - Model" column

airplane_model_cut["Make - Model"] = airplane_model_cut["Make"] + " - " + airplane_model_cut["Model"]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_model_cut["Make - Model"] = airplane_model_cut["Make"] + " - " + airplane_model_cut["Model"]


In [151]:
airplane_model_cut["Make - Model"].value_counts()

Make - Model
Cessna - 152             1920
Cessna - 172             1378
Cessna - 172N             936
Piper - PA-28-140         651
Cessna - 172M             637
                         ... 
Stinson - ST-108-2          1
Schleicher - ASW 20 B       1
Weatherly - 201             1
Piper - J2                  1
Bell - P63                  1
Name: count, Length: 5641, dtype: int64

### Cleaning other columns
- there are other columns containing data that might be related to the outcome of an accident. We list a few here:
- Engine.Type
- Weather.Condition
- Number.of.Engines
- Purpose.of.flight
- Broad.phase.of.flight

Inspect and identify potential cleaning tasks in each of the above columns. Execute those cleaning tasks. 

**Note**: You do not necessarily need to impute or drop NaNs here.

In [152]:
# Checking engine types
airplane_model_cut["Engine.Type"].value_counts()

Engine.Type
Reciprocating      44893
Turbo Prop          2521
Turbo Fan           2028
Turbo Shaft         1663
Unknown             1152
Turbo Jet            519
Geared Turbofan        1
Name: count, dtype: int64

In [153]:
# Checking Weather conditions
airplane_model_cut["Weather.Condition"].value_counts()

Weather.Condition
VMC    48654
IMC     4574
UNK      620
Unk      166
Name: count, dtype: int64

In [154]:
# Taking Weather conditions to all caps
airplane_model_cut["Weather.Condition"] = airplane_model_cut["Weather.Condition"].str.upper()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airplane_model_cut["Weather.Condition"] = airplane_model_cut["Weather.Condition"].str.upper()


In [155]:
# Weather recheck
airplane_model_cut["Weather.Condition"].value_counts()

Weather.Condition
VMC    48654
IMC     4574
UNK      786
Name: count, dtype: int64

In [156]:
# number of engines check
airplane_model_cut["Number.of.Engines"].value_counts()

Number.of.Engines
1.0    43296
2.0     8449
3.0      422
0.0      404
4.0      379
Name: count, dtype: int64

In [157]:
# Removing rows with 0 engines listed
airplane_model_cut = airplane_model_cut[airplane_model_cut["Number.of.Engines"] != 0]

In [158]:
# recheck
airplane_model_cut["Number.of.Engines"].value_counts()

Number.of.Engines
1.0    43296
2.0     8449
3.0      422
4.0      379
Name: count, dtype: int64

In [159]:
# Checking purpose of flight:
airplane_model_cut["Purpose.of.flight"].value_counts()

Purpose.of.flight
Personal                     29821
Instructional                 7642
Unknown                       4730
Aerial Application            2960
Business                      2823
Positioning                   1106
Other Work Use                 713
Ferry                          477
Public Aircraft                470
Aerial Observation             444
Executive/corporate            374
Skydiving                      160
Flight Test                    117
Banner Tow                      70
Public Aircraft - Federal       44
Public Aircraft - State         26
Glider Tow                      25
Firefighting                    21
Air Race/show                   15
Public Aircraft - Local         14
Air Race show                   13
External Load                   12
Air Drop                         5
ASHO                             3
PUBS                             2
Name: count, dtype: int64

In [160]:
# Checking phase of flight:
airplane_model_cut["Broad.phase.of.flight"].value_counts()

Broad.phase.of.flight
Landing        11777
Takeoff         7855
Cruise          6885
Maneuvering     4808
Approach        4014
Taxi            1601
Climb           1345
Descent         1269
Go-around        897
Standing         751
Unknown          360
Other             71
Name: count, dtype: int64

### Column Removal
- inspect the dataframe and drop any columns that have too many NaNs

In [161]:
airplane_model_cut.info()

<class 'pandas.core.frame.DataFrame'>
Index: 55917 entries, 3608 to Model
Data columns (total 35 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Event.Id                55916 non-null  object        
 1   Investigation.Type      55916 non-null  object        
 2   Accident.Number         55916 non-null  object        
 3   Event.Date              55916 non-null  datetime64[ns]
 4   Location                55878 non-null  object        
 5   Country                 55762 non-null  object        
 6   Latitude                20587 non-null  object        
 7   Longitude               20583 non-null  object        
 8   Airport.Code            32896 non-null  object        
 9   Airport.Name            34418 non-null  object        
 10  Injury.Severity         55916 non-null  object        
 11  Aircraft.damage         53893 non-null  object        
 12  Aircraft.Category       55916 non-null  object  

In [162]:
# Longitude, Latitude, Airport.Code, Airport.Name, FAR.Decription, Schedule, Air.Carrier removed
# Reason: Greater than 20,000 null and not inherently relevant

column_cut_df = airplane_model_cut.drop(["Longitude", "Latitude", "Airport.Code", "Airport.Name", "FAR.Description", "Schedule", "Air.carrier"], axis = 1)

In [163]:
column_cut_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 55917 entries, 3608 to Model
Data columns (total 28 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Event.Id                55916 non-null  object        
 1   Investigation.Type      55916 non-null  object        
 2   Accident.Number         55916 non-null  object        
 3   Event.Date              55916 non-null  datetime64[ns]
 4   Location                55878 non-null  object        
 5   Country                 55762 non-null  object        
 6   Injury.Severity         55916 non-null  object        
 7   Aircraft.damage         53893 non-null  object        
 8   Aircraft.Category       55916 non-null  object        
 9   Registration.Number     54864 non-null  object        
 10  Make                    55916 non-null  object        
 11  Model                   55916 non-null  object        
 12  Number.of.Engines       52546 non-null  float64 

### Save DataFrame to csv
- its generally useful to save data to file/server after its in a sufficiently cleaned or intermediate state
- the data can then be loaded directly in another notebook for further analysis
- this helps keep your notebooks and workflow readable, clean and modularized

In [164]:
column_cut_df.to_csv("Cleaned_Accident_Data.csv", index=False)