# Business Understanding

### Background: 
My company is planning to diversify its portfolio by expanding into new markets. Specifically, it is looking to invest in airplanes to support that expansion.

### Business goals: 

This project focuses on determining the risks associated with different aircraft. Specifically, its goal is to identify which aircraft pose the least potential risk. This information will be used to provide three recommendations to my company regarding the safest aircraft to purchase.

### Business success criteria: 

The success criterion for this project is to provide three recommendations about the safest aircraft that my company should invest in. For this project, the term "safest" refers to types of aircraft with the fewest crashes and the fewest casualties.
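
One way to make this definition concrete is to count accident records and sum casualties for each make and model. The sketch below does this on a small made-up table. The column names mirror the NTSB dataset used later, but the rows, the `safety_summary` helper, the choice to count fatal plus serious injuries as casualties, and the treatment of missing counts as zero are all illustrative assumptions rather than the method used in this project.

```python
import pandas as pd

def safety_summary(records: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: accident records and casualties per Make/Model."""
    # Assumption: casualties = fatal + serious injuries, with missing counts as 0
    casualties = (
        records["Total.Fatal.Injuries"].fillna(0)
        + records["Total.Serious.Injuries"].fillna(0)
    )
    return (
        records.assign(casualties=casualties)
        .groupby(["Make", "Model"])
        .agg(n_records=("casualties", "size"),
             total_casualties=("casualties", "sum"))
        # Fewest records first, ties broken by fewest casualties
        .sort_values(["n_records", "total_casualties"])
    )

# Invented rows for illustration only
toy = pd.DataFrame({
    "Make": ["A", "A", "B", "B", "B", "C"],
    "Model": ["1", "1", "2", "2", "2", "3"],
    "Total.Fatal.Injuries": [0, 2, 1, None, 0, 0],
    "Total.Serious.Injuries": [1, 0, 0, 0, None, 0],
})
summary = safety_summary(toy)
print(summary)
```

Raw counts like these favor models that are rarely flown, so a real ranking would likely need some measure of exposure in addition to the counts.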

# Data Understanding

The National Transportation Safety Board (NTSB) collects data on aviation accidents and incidents that occur in the United States (including its territories) and in international waters.

Each entry in the dataset represents an aircraft involved in an accident (or incident). Each entry carries an ID that identifies the specific accident (or incident) the aircraft was involved in. Additional information is included for each entry, such as the accident (or incident) date, location, and number of injuries, as well as characteristics of the aircraft, such as the make, model, and number of engines.

## Data Preparation

To prepare the data for analysis I began by examining the data type and number of NaN values in each column. 

* The `Aircraft.Category` column included categories besides airplanes. I removed all rows labeled with a non-airplane category, since the company is specifically looking to invest in airplanes; rows with a missing category were kept. Because the column then contained only `Airplane` or NaN, I removed it to avoid redundancy.

* The `Amateur.Built` column indicates whether each aircraft was built by an amateur. For safety purposes, I decided that only professionally built airplanes should be included in this analysis, so I removed all rows for amateur-built airplanes. Then, to avoid redundancy, I removed the `Amateur.Built` column, since it then only included professionally built airplanes.

* I examined the total number of NaN values in each column using `df.isna().sum()`. Using this information, I removed the following columns because they contained around 50% or more NaN values: `Latitude`, `Longitude`, `FAR.Description`, `Air.carrier`, and `Schedule`. `Aircraft.Category` also exceeded this threshold, but I kept it until the non-airplane rows had been filtered out.
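
The column-dropping rule in the last bullet can also be written as a fraction of rows rather than an absolute count, so that it keeps working after rows are filtered out. Below is a minimal sketch on a made-up frame; `columns_mostly_missing` and its `threshold` parameter are hypothetical names, and 0.5 is used only because the text says "around 50% or more".

```python
import numpy as np
import pandas as pd

def columns_mostly_missing(frame: pd.DataFrame, threshold: float = 0.5) -> list:
    """Return names of columns whose share of NaN values is at least `threshold`."""
    nan_share = frame.isna().mean()  # fraction of NaN values per column
    return list(nan_share[nan_share >= threshold].index)

# Invented rows for illustration only
toy = pd.DataFrame({
    "Latitude": [np.nan, np.nan, 36.9, np.nan],   # 75% missing
    "Make": ["Cessna", "Piper", None, "Cessna"],  # 25% missing
    "Model": ["172", "PA-28", "152", "180"],      # 0% missing
})
mostly_missing = columns_mostly_missing(toy)
print(mostly_missing)

# Drop the flagged columns in one step
trimmed = toy.drop(columns=mostly_missing)
```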

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df = pd.read_csv('AviationData.csv', encoding = 'latin-1')


In [3]:
df.head(10)

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.9222,-81.8781,,,...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980
5,20170710X52551,Accident,NYC79AA106,1979-09-17,"BOSTON, MA",United States,42.4453,-70.7583,,,...,,Air Canada,,,1.0,44.0,VMC,Climb,Probable Cause,19-09-2017
6,20001218X45446,Accident,CHI81LA106,1981-08-01,"COTTON, MN",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,IMC,Unknown,Probable Cause,06-11-2001
7,20020909X01562,Accident,SEA82DA022,1982-01-01,"PULLMAN, WA",United States,,,,BLACKBURN AG STRIP,...,Personal,,0.0,0.0,0.0,2.0,VMC,Takeoff,Probable Cause,01-01-1982
8,20020909X01561,Accident,NYC82DA015,1982-01-01,"EAST HANOVER, NJ",United States,,,N58,HANOVER,...,Business,,0.0,0.0,0.0,2.0,IMC,Landing,Probable Cause,01-01-1982
9,20020909X01560,Accident,MIA82DA029,1982-01-01,"JACKSONVILLE, FL",United States,,,JAX,JACKSONVILLE INTL,...,Personal,,0.0,0.0,3.0,0.0,IMC,Cruise,Probable Cause,01-01-1982


In [4]:
#Displays the dataframe's .info() summary and the NaN count for each column
def display_df_information(dataframe):
    dataframe.info()  #.info() prints directly and returns None
    print()
    print('--- Number of NaNs per Column ---')
    print(dataframe.isna().sum())
    print()

In [5]:
display_df_information(df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50249 non-null  object 
 9   Airport.Name            52790 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87572 non-null  object 
 14  Make                    88826 non-null

In [6]:
#Creates a list of columns that have more than 45,000 NaN values (about half of the 88,889 rows)
half_nan_cols = list(df.columns[df.isna().sum() > 45000])
half_nan_cols

['Latitude',
 'Longitude',
 'Aircraft.Category',
 'FAR.Description',
 'Schedule',
 'Air.carrier']

In [7]:
#function uses a for loop to print the column name and .value_counts() for column names in a list
def multiple_value_counts(dataframe,list_of_columns):
    for col in list_of_columns:
        print("----" + col + "----")
        print()
        print(dataframe[col].value_counts())
        print()
        print("There are " + str(dataframe[col].nunique()) + ' unique entries in ' + col)
        print()

In [8]:
#Prints the .value_counts() for columns in half_nan_cols
multiple_value_counts(df, half_nan_cols)

----Latitude----

332739N      19
335219N      18
334118N      17
32.815556    17
324934N      16
             ..
003354N       1
37.148611     1
35.886111     1
003955N       1
373235N       1
Name: Latitude, Length: 25592, dtype: int64

There are 25592 unique entries in Latitude

----Longitude----

0112457W       24
1114342W       18
1151140W       17
-104.673056    17
1114840W       16
               ..
-159.3          1
-75.782223      1
0732919W        1
1115445W        1
0094026W        1
Name: Longitude, Length: 27156, dtype: int64

There are 27156 unique entries in Longitude

----Aircraft.Category----

Airplane             27617
Helicopter            3440
Glider                 508
Balloon                231
Gyrocraft              173
Weight-Shift           161
Powered Parachute       91
Ultralight              30
Unknown                 14
WSFT                     9
Powered-Lift             5
Blimp                    4
UNK                      2
ULTR                     1
Rock

In [9]:
#Examines the given features more closely by printing
#the first ten entries and their Python types for each feature

def examine_features(dataframe, list_of_features):
    #Accepts either a single column name or a list of column names
    if not isinstance(list_of_features, list):
        list_of_features = [list_of_features]
    for feature in list_of_features:
        print("---" + feature + ' First 10 entries' + '---')
        print(dataframe[feature].head(10))
        print()
        print("---" + feature + ' First 10 data types' + '---')
        for entry in dataframe[feature][:10]:
            print(type(entry))
        print('\n')

In [10]:
examine_features(df, half_nan_cols)

---Latitude First 10 entries---
0        NaN
1        NaN
2    36.9222
3        NaN
4        NaN
5    42.4453
6        NaN
7        NaN
8        NaN
9        NaN
Name: Latitude, dtype: object

---Latitude First 10 data types---
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>


---Longitude First 10 entries---
0        NaN
1        NaN
2   -81.8781
3        NaN
4        NaN
5   -70.7583
6        NaN
7        NaN
8        NaN
9        NaN
Name: Longitude, dtype: object

---Longitude First 10 data types---
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>


---Aircraft.Category First 10 entries---
0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
5    Airplane
6         NaN
7    Airplane
8    Airplane
9         NaN
Name: Aircraft.Category, d

In [11]:
#Removes the columns Longitude, Latitude, FAR.Description, Schedule, and Air.carrier;
#Aircraft.Category is kept for now so that non-airplane rows can be filtered out
df_clean = df.drop(['Longitude', 'Latitude', 'FAR.Description', 'Schedule', 'Air.carrier'], axis = 1)
df_clean.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,...,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,Fatal(2),Destroyed,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,Fatal(4),Destroyed,...,Reciprocating,Personal,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,,,Fatal(3),Destroyed,...,Reciprocating,Personal,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,Fatal(2),Destroyed,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,Fatal(1),Destroyed,...,,Personal,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


In [12]:
display_df_information(df_clean)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 26 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Airport.Code            50249 non-null  object 
 7   Airport.Name            52790 non-null  object 
 8   Injury.Severity         87889 non-null  object 
 9   Aircraft.damage         85695 non-null  object 
 10  Aircraft.Category       32287 non-null  object 
 11  Registration.Number     87572 non-null  object 
 12  Make                    88826 non-null  object 
 13  Model                   88797 non-null  object 
 14  Amateur.Built           88787 non-null

In [13]:
#Creates a list of every aircraft category other than 'Airplane'
non_airplanes = list(df['Aircraft.Category'].value_counts().index.drop('Airplane'))
non_airplanes

['Helicopter',
 'Glider',
 'Balloon',
 'Gyrocraft',
 'Weight-Shift',
 'Powered Parachute',
 'Ultralight',
 'Unknown',
 'WSFT',
 'Powered-Lift',
 'Blimp',
 'UNK',
 'ULTR',
 'Rocket']

In [14]:
#Uses a boolean mask to filter out rows labeled with a non-airplane category;
#rows with a missing Aircraft.Category are kept because .isin() is False for NaN
df_clean = df_clean[~df_clean['Aircraft.Category'].isin(non_airplanes)]
df_clean.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,...,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,Fatal(2),Destroyed,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,Fatal(4),Destroyed,...,Reciprocating,Personal,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,,,Fatal(3),Destroyed,...,Reciprocating,Personal,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,Fatal(2),Destroyed,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,Fatal(1),Destroyed,...,,Personal,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


In [15]:
display_df_information(df_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84219 entries, 0 to 88888
Data columns (total 26 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                84219 non-null  object 
 1   Investigation.Type      84219 non-null  object 
 2   Accident.Number         84219 non-null  object 
 3   Event.Date              84219 non-null  object 
 4   Location                84169 non-null  object 
 5   Country                 83998 non-null  object 
 6   Airport.Code            48404 non-null  object 
 7   Airport.Name            50867 non-null  object 
 8   Injury.Severity         83289 non-null  object 
 9   Aircraft.damage         81201 non-null  object 
 10  Aircraft.Category       27617 non-null  object 
 11  Registration.Number     82949 non-null  object 
 12  Make                    84159 non-null  object 
 13  Model                   84129 non-null  object 
 14  Amateur.Built           84119 non-null
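
The counts above show 84,219 rows remaining but only 27,617 with a non-null `Aircraft.Category`, which means the filter kept rows whose category is missing: `isin` returns `False` for NaN, so negating it keeps those rows. The toy example below illustrates that behavior and contrasts it with a stricter filter. The data are invented, and the stricter variant is shown only for comparison, not as the approach taken here.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "Aircraft.Category": ["Airplane", "Helicopter", np.nan, "Glider", np.nan],
    "Make": ["Cessna", "Bell", "Piper", "Schleicher", "Beech"],
})
non_airplanes = ["Helicopter", "Glider"]

# Exclusion filter: drops the listed categories, keeps 'Airplane' and missing values
kept_with_nan = toy[~toy["Aircraft.Category"].isin(non_airplanes)]

# Stricter alternative: keeps only rows explicitly labeled 'Airplane'
airplanes_only = toy[toy["Aircraft.Category"] == "Airplane"]

print(len(kept_with_nan), len(airplanes_only))
```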

In [16]:
#Drops the Aircraft.Category column since every remaining value is either 'Airplane' or NaN
df_clean = df_clean.drop(['Aircraft.Category'], axis = 1)
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84219 entries, 0 to 88888
Data columns (total 25 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                84219 non-null  object 
 1   Investigation.Type      84219 non-null  object 
 2   Accident.Number         84219 non-null  object 
 3   Event.Date              84219 non-null  object 
 4   Location                84169 non-null  object 
 5   Country                 83998 non-null  object 
 6   Airport.Code            48404 non-null  object 
 7   Airport.Name            50867 non-null  object 
 8   Injury.Severity         83289 non-null  object 
 9   Aircraft.damage         81201 non-null  object 
 10  Registration.Number     82949 non-null  object 
 11  Make                    84159 non-null  object 
 12  Model                   84129 non-null  object 
 13  Amateur.Built           84119 non-null  object 
 14  Number.of.Engines       78838 non-null

In [17]:
df_clean['Amateur.Built'].value_counts()

No     76008
Yes     8111
Name: Amateur.Built, dtype: int64

In [18]:
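#Keeps only professionally built airplanes (rows with a missing Amateur.Built value are also dropped);
#.copy() avoids SettingWithCopyWarning on later column assignments
df_clean = df_clean[df_clean['Amateur.Built'] == 'No'].copy()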
df_clean.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,...,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,Fatal(2),Destroyed,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,Fatal(4),Destroyed,...,Reciprocating,Personal,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,,,Fatal(3),Destroyed,...,Reciprocating,Personal,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,Fatal(2),Destroyed,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,Fatal(1),Destroyed,...,,Personal,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


In [19]:
display_df_information(df_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 76008 entries, 0 to 88888
Data columns (total 25 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                76008 non-null  object 
 1   Investigation.Type      76008 non-null  object 
 2   Accident.Number         76008 non-null  object 
 3   Event.Date              76008 non-null  object 
 4   Location                75962 non-null  object 
 5   Country                 75793 non-null  object 
 6   Airport.Code            43433 non-null  object 
 7   Airport.Name            45695 non-null  object 
 8   Injury.Severity         75079 non-null  object 
 9   Aircraft.damage         73037 non-null  object 
 10  Registration.Number     74839 non-null  object 
 11  Make                    75963 non-null  object 
 12  Model                   75941 non-null  object 
 13  Amateur.Built           76008 non-null  object 
 14  Number.of.Engines       70984 non-null

In [20]:
df_clean['Amateur.Built'].value_counts()

No    76008
Name: Amateur.Built, dtype: int64

In [21]:
#Drops the Amateur.Built column since every remaining value is 'No'
df_clean = df_clean.drop(['Amateur.Built'], axis = 1)
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 76008 entries, 0 to 88888
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                76008 non-null  object 
 1   Investigation.Type      76008 non-null  object 
 2   Accident.Number         76008 non-null  object 
 3   Event.Date              76008 non-null  object 
 4   Location                75962 non-null  object 
 5   Country                 75793 non-null  object 
 6   Airport.Code            43433 non-null  object 
 7   Airport.Name            45695 non-null  object 
 8   Injury.Severity         75079 non-null  object 
 9   Aircraft.damage         73037 non-null  object 
 10  Registration.Number     74839 non-null  object 
 11  Make                    75963 non-null  object 
 12  Model                   75941 non-null  object 
 13  Number.of.Engines       70984 non-null  float64
 14  Engine.Type             70611 non-null

In [22]:
df_clean.isna().sum()

Event.Id                      0
Investigation.Type            0
Accident.Number               0
Event.Date                    0
Location                     46
Country                     215
Airport.Code              32575
Airport.Name              30313
Injury.Severity             929
Aircraft.damage            2971
Registration.Number        1169
Make                         45
Model                        67
Number.of.Engines          5024
Engine.Type                5397
Purpose.of.flight          5371
Total.Fatal.Injuries       9658
Total.Serious.Injuries    10632
Total.Minor.Injuries      10141
Total.Uninjured            4722
Weather.Condition          3755
Broad.phase.of.flight     20743
Report.Status              5083
Publication.Date          12662
dtype: int64

In [23]:
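#Creates a list of columns with more than 30,000 NaN values (roughly 40% of the 76,008 remaining rows)
cols_half_nans_clean_df = list(df_clean.columns[df_clean.isna().sum() > 30000])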
cols_half_nans_clean_df

['Airport.Code', 'Airport.Name']

In [24]:
#Gets the value counts for Airport.Code and Airport.Name
multiple_value_counts(df_clean, cols_half_nans_clean_df)

----Airport.Code----

NONE    1285
PVT      365
ORD      149
APA      146
MRI      132
        ... 
VG24       1
MO2        1
MRIA       1
39Z        1
5Z9        1
Name: Airport.Code, Length: 9545, dtype: int64

There are 9545 unique entries in Airport.Code

----Airport.Name----

PRIVATE                           199
Private                           184
NONE                              119
Private Airstrip                  118
PRIVATE STRIP                     101
                                 ... 
LYNDEN MUNICIPAL AIRPORT JANSE      1
MARTIN FIELD AIRPORT                1
MT. STERLING- MONTGOMERY            1
CANTON-PLYMOTH-METTETAL             1
OXFORD WATERBURY                    1
Name: Airport.Name, Length: 22392, dtype: int64

There are 22392 unique entries in Airport.Name



In [25]:
examine_features(df_clean, cols_half_nans_clean_df)

---Airport.Code First 10 entries---
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    N58
9    JAX
Name: Airport.Code, dtype: object

---Airport.Code First 10 data types---
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'str'>
<class 'str'>


---Airport.Name First 10 entries---
0                   NaN
1                   NaN
2                   NaN
3                   NaN
4                   NaN
5                   NaN
6                   NaN
7    BLACKBURN AG STRIP
8               HANOVER
9     JACKSONVILLE INTL
Name: Airport.Name, dtype: object

---Airport.Name First 10 data types---
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'str'>
<class 'str'>
<class 'str'>




In [26]:
#Drops Airport.Code and Airport.Name
df_clean = df_clean.drop(cols_half_nans_clean_df, axis = 1)
df_clean.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Injury.Severity,Aircraft.damage,Registration.Number,Make,...,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,Fatal(2),Destroyed,NC6404,Stinson,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,Fatal(4),Destroyed,N5069P,Piper,...,Reciprocating,Personal,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,Fatal(3),Destroyed,N5142R,Cessna,...,Reciprocating,Personal,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,Fatal(2),Destroyed,N1168J,Rockwell,...,Reciprocating,Personal,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,Fatal(1),Destroyed,N15NY,Cessna,...,,Personal,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


In [27]:
display_df_information(df_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 76008 entries, 0 to 88888
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                76008 non-null  object 
 1   Investigation.Type      76008 non-null  object 
 2   Accident.Number         76008 non-null  object 
 3   Event.Date              76008 non-null  object 
 4   Location                75962 non-null  object 
 5   Country                 75793 non-null  object 
 6   Injury.Severity         75079 non-null  object 
 7   Aircraft.damage         73037 non-null  object 
 8   Registration.Number     74839 non-null  object 
 9   Make                    75963 non-null  object 
 10  Model                   75941 non-null  object 
 11  Number.of.Engines       70984 non-null  float64
 12  Engine.Type             70611 non-null  object 
 13  Purpose.of.flight       70637 non-null  object 
 14  Total.Fatal.Injuries    66350 non-null

In [28]:
injuries_and_fatalities = ['Injury.Severity', 'Total.Fatal.Injuries', 'Total.Serious.Injuries', 'Total.Minor.Injuries']

In [29]:
multiple_value_counts(df_clean, injuries_and_fatalities)

----Injury.Severity----

Non-Fatal     58094
Fatal(1)       4895
Fatal          3728
Fatal(2)       3289
Incident       2140
              ...  
Fatal(43)         1
Fatal(72)         1
Fatal(121)        1
Fatal(144)        1
Fatal(169)        1
Name: Injury.Severity, Length: 106, dtype: int64

There are 106 unique entries in Injury.Severity

----Total.Fatal.Injuries----

0.0      51857
1.0       6690
2.0       4361
3.0       1437
4.0       1026
         ...  
162.0        1
169.0        1
150.0        1
31.0         1
156.0        1
Name: Total.Fatal.Injuries, Length: 122, dtype: int64

There are 122 unique entries in Total.Fatal.Injuries

----Total.Serious.Injuries----

0.0      54823
1.0       7169
2.0       2363
3.0        545
4.0        243
5.0         63
6.0         34
7.0         25
9.0         15
10.0        12
8.0         12
13.0         8
11.0         6
26.0         5
12.0         5
14.0         4
25.0         3
20.0         3
28.0         3
59.0         2
17.0         2
50.0 

In [30]:
examine_features(df_clean, injuries_and_fatalities)

---Injury.Severity First 10 entries---
0     Fatal(2)
1     Fatal(4)
2     Fatal(3)
3     Fatal(2)
4     Fatal(1)
5    Non-Fatal
6     Fatal(4)
7    Non-Fatal
8    Non-Fatal
9    Non-Fatal
Name: Injury.Severity, dtype: object

---Injury.Severity First 10 data types---
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>


---Total.Fatal.Injuries First 10 entries---
0    2.0
1    4.0
2    3.0
3    2.0
4    1.0
5    NaN
6    4.0
7    0.0
8    0.0
9    0.0
Name: Total.Fatal.Injuries, dtype: float64

---Total.Fatal.Injuries First 10 data types---
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>


---Total.Serious.Injuries First 10 entries---
0    0.0
1    0.0
2    NaN
3    0.0
4    2.0
5    NaN
6    0.0
7    0.0
8    0.0
9    0.0
Name: Total.Serious.Injuries, dtype: float64

---Total.S

In [31]:
df_clean['Injury.Severity'].value_counts()

Non-Fatal     58094
Fatal(1)       4895
Fatal          3728
Fatal(2)       3289
Incident       2140
              ...  
Fatal(43)         1
Fatal(72)         1
Fatal(121)        1
Fatal(144)        1
Fatal(169)        1
Name: Injury.Severity, Length: 106, dtype: int64

In [32]:
df_clean['Total.Fatal.Injuries'].value_counts()

0.0      51857
1.0       6690
2.0       4361
3.0       1437
4.0       1026
         ...  
162.0        1
169.0        1
150.0        1
31.0         1
156.0        1
Name: Total.Fatal.Injuries, Length: 122, dtype: int64

In [33]:
#For Injury.Severity, group all fatal injuries into the same category named "Fatal" without
#specifying the number of fatalities since the Total.Fatal.Injuries column includes this information.
#Note that str() turns missing values into the string 'nan', which is relabeled in a later cell

df_clean['Injury.Severity'] = df_clean['Injury.Severity'].map(lambda entry: str(entry).split('(')[0])
df_clean['Injury.Severity'].value_counts()

Non-Fatal      58094
Fatal          14495
Incident        2140
nan              929
Minor            157
Serious          116
Unavailable       77
Name: Injury.Severity, dtype: int64

In [34]:
#Removes rows where Injury.Severity is 'Incident' (left commented out, so these rows are kept)
#df_clean = df_clean[df_clean['Injury.Severity'] != 'Incident']
df_clean['Injury.Severity'].value_counts()

Non-Fatal      58094
Fatal          14495
Incident        2140
nan              929
Minor            157
Serious          116
Unavailable       77
Name: Injury.Severity, dtype: int64

In [35]:
#Grouping minor injuries into the Non-Fatal category
#(assigning the result back instead of using inplace=True on a column selection)
df_clean['Injury.Severity'] = df_clean['Injury.Severity'].replace({'Minor': 'Non-Fatal'})
df_clean['Injury.Severity'].value_counts()  

Non-Fatal      58251
Fatal          14495
Incident        2140
nan              929
Serious          116
Unavailable       77
Name: Injury.Severity, dtype: int64

In [36]:
#Grouping 'nan' strings (originally missing values) into the Unavailable category
df_clean['Injury.Severity'] = df_clean['Injury.Severity'].replace({'nan': 'Unavailable'})
df_clean['Injury.Severity'].value_counts()

Non-Fatal      58251
Fatal          14495
Incident        2140
Unavailable     1006
Serious          116
Name: Injury.Severity, dtype: int64
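
The relabeling steps above could also be collapsed into one helper. This is a sketch under the same grouping choices (fatality counts stripped, `Minor` folded into `Non-Fatal`, missing values labeled `Unavailable`); `simplify_severity` is a hypothetical name and the sample values are invented. Unlike the cells above, it uses the `.str` accessor, so missing values stay NaN until they are filled explicitly.

```python
import numpy as np
import pandas as pd

def simplify_severity(severity: pd.Series) -> pd.Series:
    """Collapse Injury.Severity labels into a small set of categories."""
    # 'Fatal(2)' -> 'Fatal'; NaN stays NaN with the .str accessor
    cleaned = severity.str.split("(").str[0]
    # Fold minor injuries into the non-fatal group
    cleaned = cleaned.replace({"Minor": "Non-Fatal"})
    # Label missing values explicitly
    return cleaned.fillna("Unavailable")

# Invented values for illustration only
sample = pd.Series(["Fatal(2)", "Non-Fatal", np.nan, "Minor", "Fatal", "Incident"])
simplified = simplify_severity(sample)
print(simplified.value_counts())
```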

In [37]:
df_clean.isna().sum()

Event.Id                      0
Investigation.Type            0
Accident.Number               0
Event.Date                    0
Location                     46
Country                     215
Injury.Severity               0
Aircraft.damage            2971
Registration.Number        1169
Make                         45
Model                        67
Number.of.Engines          5024
Engine.Type                5397
Purpose.of.flight          5371
Total.Fatal.Injuries       9658
Total.Serious.Injuries    10632
Total.Minor.Injuries      10141
Total.Uninjured            4722
Weather.Condition          3755
Broad.phase.of.flight     20743
Report.Status              5083
Publication.Date          12662
dtype: int64

In [38]:
df_clean['Publication.Date'].value_counts()

25-09-2020    12557
26-09-2020     1370
03-11-2020      938
31-03-1993      418
25-11-2003      341
              ...  
04-01-2006        1
05-10-1998        1
09-04-1996        1
29-03-2010        1
24-08-1993        1
Name: Publication.Date, Length: 2800, dtype: int64

In [39]:
df_clean = df_clean.drop(['Publication.Date'], axis = 1)
df_clean.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Injury.Severity,Aircraft.damage,Registration.Number,Make,...,Number.of.Engines,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,Fatal,Destroyed,NC6404,Stinson,...,1.0,Reciprocating,Personal,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,Fatal,Destroyed,N5069P,Piper,...,1.0,Reciprocating,Personal,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,Fatal,Destroyed,N5142R,Cessna,...,1.0,Reciprocating,Personal,3.0,,,,IMC,Cruise,Probable Cause
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,Fatal,Destroyed,N1168J,Rockwell,...,1.0,Reciprocating,Personal,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,Fatal,Destroyed,N15NY,Cessna,...,,,Personal,1.0,2.0,,0.0,VMC,Approach,Probable Cause


In [40]:
display_df_information(df_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 76008 entries, 0 to 88888
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                76008 non-null  object 
 1   Investigation.Type      76008 non-null  object 
 2   Accident.Number         76008 non-null  object 
 3   Event.Date              76008 non-null  object 
 4   Location                75962 non-null  object 
 5   Country                 75793 non-null  object 
 6   Injury.Severity         76008 non-null  object 
 7   Aircraft.damage         73037 non-null  object 
 8   Registration.Number     74839 non-null  object 
 9   Make                    75963 non-null  object 
 10  Model                   75941 non-null  object 
 11  Number.of.Engines       70984 non-null  float64
 12  Engine.Type             70611 non-null  object 
 13  Purpose.of.flight       70637 non-null  object 
 14  Total.Fatal.Injuries    66350 non-null

In [41]:
df_clean['Report.Status'].value_counts()

Probable Cause                                                                                                                                         55291
Foreign                                                                                                                                                 1802
<br /><br />                                                                                                                                             136
Factual                                                                                                                                                  131
The pilot's failure to maintain directional control during the landing roll.                                                                              50
                                                                                                                                                       ...  
The pilots improper landing flare in gusting wind conditi

In [42]:
df_clean['Registration.Number'].value_counts()

NONE      80
UNREG     17
None      15
USAF       9
N20752     8
          ..
N143E      1
N6116U     1
N4784N     1
N3BP       1
N66322     1
Name: Registration.Number, Length: 68226, dtype: int64

In [43]:
df_clean = df_clean.drop(['Registration.Number', 'Report.Status'], axis = 1)
df_clean.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Injury.Severity,Aircraft.damage,Make,Model,Number.of.Engines,Engine.Type,Purpose.of.flight,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,Fatal,Destroyed,Stinson,108-3,1.0,Reciprocating,Personal,2.0,0.0,0.0,0.0,UNK,Cruise
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,Fatal,Destroyed,Piper,PA24-180,1.0,Reciprocating,Personal,4.0,0.0,0.0,0.0,UNK,Unknown
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,Fatal,Destroyed,Cessna,172M,1.0,Reciprocating,Personal,3.0,,,,IMC,Cruise
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,Fatal,Destroyed,Rockwell,112,1.0,Reciprocating,Personal,2.0,0.0,0.0,0.0,IMC,Cruise
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,Fatal,Destroyed,Cessna,501,,,Personal,1.0,2.0,,0.0,VMC,Approach


In [44]:
display_df_information(df_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 76008 entries, 0 to 88888
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                76008 non-null  object 
 1   Investigation.Type      76008 non-null  object 
 2   Accident.Number         76008 non-null  object 
 3   Event.Date              76008 non-null  object 
 4   Location                75962 non-null  object 
 5   Country                 75793 non-null  object 
 6   Injury.Severity         76008 non-null  object 
 7   Aircraft.damage         73037 non-null  object 
 8   Make                    75963 non-null  object 
 9   Model                   75941 non-null  object 
 10  Number.of.Engines       70984 non-null  float64
 11  Engine.Type             70611 non-null  object 
 12  Purpose.of.flight       70637 non-null  object 
 13  Total.Fatal.Injuries    66350 non-null  float64
 14  Total.Serious.Injuries  65376 non-null

In [45]:
multiple_value_counts(df_clean, ['Number.of.Engines', 'Engine.Type'])

----Number.of.Engines----

1.0    58635
2.0    10758
0.0      684
3.0      481
4.0      424
6.0        1
8.0        1
Name: Number.of.Engines, dtype: int64

There are 7 unique entries in Number.of.Engines

----Engine.Type----

Reciprocating      60048
Turbo Prop          3311
Turbo Fan           2467
Turbo Shaft         2275
Unknown             1810
Turbo Jet            682
Geared Turbofan       12
Electric               5
UNK                    1
Name: Engine.Type, dtype: int64

There are 9 unique entries in Engine.Type



In [46]:
examine_features(df_clean, ['Number.of.Engines', 'Engine.Type'])

---Number.of.Engines First 10 entries---
0    1.0
1    1.0
2    1.0
3    1.0
4    NaN
5    2.0
6    1.0
7    1.0
8    2.0
9    1.0
Name: Number.of.Engines, dtype: float64

---Number.of.Engines First 10 data types---
None
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>


---Engine.Type First 10 entries---
0    Reciprocating
1    Reciprocating
2    Reciprocating
3    Reciprocating
4              NaN
5        Turbo Fan
6    Reciprocating
7    Reciprocating
8    Reciprocating
9    Reciprocating
Name: Engine.Type, dtype: object

---Engine.Type First 10 data types---
None
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'float'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>




In [47]:
#Grouping UNK values into the Unknown category
#Assigning the result back avoids relying on inplace replacement of a column selection
df_clean['Engine.Type'] = df_clean['Engine.Type'].replace({'UNK': 'Unknown'})
df_clean['Engine.Type'].value_counts()

Reciprocating      60048
Turbo Prop          3311
Turbo Fan           2467
Turbo Shaft         2275
Unknown             1811
Turbo Jet            682
Geared Turbofan       12
Electric               5
Name: Engine.Type, dtype: int64

In [48]:
#Converting all entries in Engine.Type to string data type (NaN values become the string 'nan')
df_clean['Engine.Type'] = df_clean['Engine.Type'].astype(str)

In [49]:
examine_features(df_clean, ['Engine.Type'])

---Engine.Type First 10 entries---
0    Reciprocating
1    Reciprocating
2    Reciprocating
3    Reciprocating
4              nan
5        Turbo Fan
6    Reciprocating
7    Reciprocating
8    Reciprocating
9    Reciprocating
Name: Engine.Type, dtype: object

---Engine.Type First 10 data types---
None
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
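
The output above shows that row 4, previously `NaN`, now holds the text `nan`, which is why those rows have to be filtered out by string comparison later on. A minimal sketch of an alternative, assuming the goal is simply to fold missing engine types into the existing `Unknown` category (the toy Series below is hypothetical and only stands in for `df_clean['Engine.Type']`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df_clean['Engine.Type']
engine_type = pd.Series(['Reciprocating', np.nan, 'Turbo Fan', 'UNK'])

# Map the 'UNK' placeholder to 'Unknown', then fill true missing values explicitly
normalized = engine_type.replace({'UNK': 'Unknown'}).fillna('Unknown')

print(normalized.tolist())
print(normalized.isna().sum())
```

Whether missing engine types should be kept as `Unknown` or dropped is a judgment call for the analysis; this sketch only shows the first option.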




In [50]:
display_df_information(df_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 76008 entries, 0 to 88888
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                76008 non-null  object 
 1   Investigation.Type      76008 non-null  object 
 2   Accident.Number         76008 non-null  object 
 3   Event.Date              76008 non-null  object 
 4   Location                75962 non-null  object 
 5   Country                 75793 non-null  object 
 6   Injury.Severity         76008 non-null  object 
 7   Aircraft.damage         73037 non-null  object 
 8   Make                    75963 non-null  object 
 9   Model                   75941 non-null  object 
 10  Number.of.Engines       70984 non-null  float64
 11  Engine.Type             76008 non-null  object 
 12  Purpose.of.flight       70637 non-null  object 
 13  Total.Fatal.Injuries    66350 non-null  float64
 14  Total.Serious.Injuries  65376 non-null

In [51]:
cols_to_examine = ['Aircraft.damage', 'Purpose.of.flight', 'Broad.phase.of.flight', 'Weather.Condition', 'Investigation.Type', 'Accident.Number']

In [52]:
multiple_value_counts(df_clean, cols_to_examine)

----Aircraft.damage----

Substantial    54555
Destroyed      15802
Minor           2583
Unknown           97
Name: Aircraft.damage, dtype: int64

There are 4 unique entries in Aircraft.damage

----Purpose.of.flight----

Personal                     40711
Instructional                 9706
Unknown                       6235
Aerial Application            4380
Business                      3789
Positioning                   1431
Other Work Use                1016
Ferry                          755
Public Aircraft                679
Aerial Observation             623
Executive/corporate            523
Flight Test                    199
Skydiving                      181
Banner Tow                     101
Public Aircraft - Federal       74
Air Race show                   49
Glider Tow                      42
Air Race/show                   35
Public Aircraft - State         33
Firefighting                    23
Public Aircraft - Local         20
External Load                   18
Air Drop  

In [53]:
df_clean = df_clean.drop(['Investigation.Type', 'Purpose.of.flight'], axis = 1)
df_clean.head()

Unnamed: 0,Event.Id,Accident.Number,Event.Date,Location,Country,Injury.Severity,Aircraft.damage,Make,Model,Number.of.Engines,Engine.Type,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight
0,20001218X45444,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,Fatal,Destroyed,Stinson,108-3,1.0,Reciprocating,2.0,0.0,0.0,0.0,UNK,Cruise
1,20001218X45447,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,Fatal,Destroyed,Piper,PA24-180,1.0,Reciprocating,4.0,0.0,0.0,0.0,UNK,Unknown
2,20061025X01555,NYC07LA005,1974-08-30,"Saltville, VA",United States,Fatal,Destroyed,Cessna,172M,1.0,Reciprocating,3.0,,,,IMC,Cruise
3,20001218X45448,LAX96LA321,1977-06-19,"EUREKA, CA",United States,Fatal,Destroyed,Rockwell,112,1.0,Reciprocating,2.0,0.0,0.0,0.0,IMC,Cruise
4,20041105X01764,CHI79FA064,1979-08-02,"Canton, OH",United States,Fatal,Destroyed,Cessna,501,,,1.0,2.0,,0.0,VMC,Approach


In [54]:
display_df_information(df_clean)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 76008 entries, 0 to 88888
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                76008 non-null  object 
 1   Accident.Number         76008 non-null  object 
 2   Event.Date              76008 non-null  object 
 3   Location                75962 non-null  object 
 4   Country                 75793 non-null  object 
 5   Injury.Severity         76008 non-null  object 
 6   Aircraft.damage         73037 non-null  object 
 7   Make                    75963 non-null  object 
 8   Model                   75941 non-null  object 
 9   Number.of.Engines       70984 non-null  float64
 10  Engine.Type             76008 non-null  object 
 11  Total.Fatal.Injuries    66350 non-null  float64
 12  Total.Serious.Injuries  65376 non-null  float64
 13  Total.Minor.Injuries    65867 non-null  float64
 14  Total.Uninjured         71286 non-null

In [55]:
examine_features(df_clean, ['Make', 'Model'])

---Make First 10 entries---
0              Stinson
1                Piper
2               Cessna
3             Rockwell
4               Cessna
5    Mcdonnell Douglas
6               Cessna
7               Cessna
8               Cessna
9       North American
Name: Make, dtype: object

---Make First 10 data types---
None
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>


---Model First 10 entries---
0           108-3
1        PA24-180
2            172M
3             112
4             501
5             DC9
6             180
7             140
8            401B
9    NAVION L-17B
Name: Model, dtype: object

---Model First 10 data types---
None
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>




In [56]:
multiple_value_counts(df_clean, ['Make', 'Model'])

----Make----

Cessna                22125
Piper                 11978
CESSNA                 4922
Beech                  4310
PIPER                  2839
                      ...  
NORTHROP                  1
D-Fly                     1
GROB AIRCRAFT AG          1
Aero Design Eleven        1
BEECH AIRCRAFT CO.        1
Name: Make, Length: 1908, dtype: int64

There are 1908 unique entries in Make

----Model----

152           2363
172           1750
172N          1160
PA-28-140      923
150            828
              ... 
H500C            1
S2R T660         1
ME 209           1
DUO DISCUS       1
M-20L            1
Name: Model, Length: 7996, dtype: int64

There are 7996 unique entries in Model



In [57]:
df_clean['Make'][4]

'Cessna'

In [58]:
#Converting all entries in Make to string data type so that string methods can be applied
df_clean['Make'] = df_clean['Make'].astype(str)

In [59]:
#Standardizing capitalization so that entries such as 'CESSNA' and 'Cessna' are counted together
df_clean['Make'] = df_clean['Make'].map(lambda x: x.title())

In [60]:
df_clean['Make'].value_counts()

Cessna                27047
Piper                 14817
Beech                  5352
Boeing                 2695
Bell                   1797
                      ...  
Scan                      1
Bearhawk                  1
Thunder & Colt Ltd        1
Spitfire                  1
Eagle Aircraft Co         1
Name: Make, Length: 1582, dtype: int64
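
Title-casing reduced the number of unique makes from 1908 to 1582. As a rough sketch of the same idea using the pandas `.str` accessor, which also strips stray whitespace and leaves missing values missing, the example below uses a hypothetical toy Series rather than the real `Make` column:

```python
import pandas as pd

# Hypothetical toy column standing in for df_clean['Make']
makes = pd.Series(['Cessna', 'CESSNA', ' piper', 'Piper ', None])

# Strip surrounding whitespace, then standardize capitalization
cleaned = makes.str.strip().str.title()

# Case and whitespace variants now fall into the same category
counts = cleaned.value_counts()
print(counts)
```

Whether the real column contains whitespace variants is an assumption; checking `value_counts()` before and after would confirm it.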

In [61]:
df_clean.isna().sum()

Event.Id                      0
Accident.Number               0
Event.Date                    0
Location                     46
Country                     215
Injury.Severity               0
Aircraft.damage            2971
Make                          0
Model                        67
Number.of.Engines          5024
Engine.Type                   0
Total.Fatal.Injuries       9658
Total.Serious.Injuries    10632
Total.Minor.Injuries      10141
Total.Uninjured            4722
Weather.Condition          3755
Broad.phase.of.flight     20743
dtype: int64

In [62]:
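#Filling all remaining NaN values with 'Unknown' (this also converts the numeric columns to object dtype)
df_clean = df_clean.fillna('Unknown')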
df_clean.isna().sum()

Event.Id                  0
Accident.Number           0
Event.Date                0
Location                  0
Country                   0
Injury.Severity           0
Aircraft.damage           0
Make                      0
Model                     0
Number.of.Engines         0
Engine.Type               0
Total.Fatal.Injuries      0
Total.Serious.Injuries    0
Total.Minor.Injuries      0
Total.Uninjured           0
Weather.Condition         0
Broad.phase.of.flight     0
dtype: int64
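
Filling every column with the string `'Unknown'` also places text into the numeric columns, such as the injury counts. A minimal sketch of a dtype-aware fill, assuming the numeric columns should stay numeric for later aggregation (the toy frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one text column and one numeric column
toy = pd.DataFrame({
    'Model': ['172M', None, '501'],
    'Total.Fatal.Injuries': [3.0, np.nan, 1.0],
})

# Select only the non-numeric columns and fill those with 'Unknown'
text_cols = toy.select_dtypes(exclude='number').columns
toy[text_cols] = toy[text_cols].fillna('Unknown')

# The numeric column keeps its float dtype and its NaN marker
print(toy.dtypes)
print(toy)
```

How to treat the remaining numeric gaps (leave as `NaN`, fill with 0, or drop) depends on what a missing injury count means in the source data, which this sketch does not decide.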

In [63]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 76008 entries, 0 to 88888
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Event.Id                76008 non-null  object
 1   Accident.Number         76008 non-null  object
 2   Event.Date              76008 non-null  object
 3   Location                76008 non-null  object
 4   Country                 76008 non-null  object
 5   Injury.Severity         76008 non-null  object
 6   Aircraft.damage         76008 non-null  object
 7   Make                    76008 non-null  object
 8   Model                   76008 non-null  object
 9   Number.of.Engines       76008 non-null  object
 10  Engine.Type             76008 non-null  object
 11  Total.Fatal.Injuries    76008 non-null  object
 12  Total.Serious.Injuries  76008 non-null  object
 13  Total.Minor.Injuries    76008 non-null  object
 14  Total.Uninjured         76008 non-null  object
 15  We

In [64]:
examine_features(df_clean, list(df_clean.columns))

---Event.Id First 10 entries---
0    20001218X45444
1    20001218X45447
2    20061025X01555
3    20001218X45448
4    20041105X01764
5    20170710X52551
6    20001218X45446
7    20020909X01562
8    20020909X01561
9    20020909X01560
Name: Event.Id, dtype: object

---Event.Id First 10 data types---
None
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>


---Accident.Number First 10 entries---
0    SEA87LA080
1    LAX94LA336
2    NYC07LA005
3    LAX96LA321
4    CHI79FA064
5    NYC79AA106
6    CHI81LA106
7    SEA82DA022
8    NYC82DA015
9    MIA82DA029
Name: Accident.Number, dtype: object

---Accident.Number First 10 data types---
None
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>


---Event.Date First 10 entries---
0    1948-10-24
1    1962-07-19
2    1974-08-30
3    1977-06-19
4    1979-08-02
5    1979-09-17

In [65]:
multiple_value_counts(df_clean, list(df_clean.columns))

----Event.Id----

20001214X45071    3
20001212X19172    3
20001211X11993    2
20001208X06211    2
20001213X29069    2
                 ..
20001213X26255    1
20001214X35976    1
20001212X20660    1
20001211X15325    1
20001212X18249    1
Name: Event.Id, Length: 75192, dtype: int64

There are 75192 unique entries in Event.Id

----Accident.Number----

ERA22LA379    2
ERA22LA103    2
CEN23MA034    2
ERA22FA318    2
DCA22WA089    2
             ..
NYC86LA226    1
LAX00FA160    1
NYC84FA074    1
CHI82FA110    1
MKC85LA208    1
Name: Accident.Number, Length: 75987, dtype: int64

There are 75987 unique entries in Accident.Number

----Event.Date----

2000-07-08    25
1986-05-17    24
1982-05-16    22
1983-05-28    22
1983-10-01    22
              ..
2009-11-13     1
2015-09-01     1
2003-03-25     1
2020-02-14     1
1996-03-06     1
Name: Event.Date, Length: 14583, dtype: int64

There are 14583 unique entries in Event.Date

----Location----

ANCHORAGE, AK                         421
MIAMI, FL

In [66]:
#Function that replaces the 'Unk' and 'UNK' placeholders with 'Unknown' in multiple columns
def multiple_fill_nan(dataframe, list_of_cols):
    for col in list_of_cols:
        #Assign the result back instead of relying on inplace replacement of a column selection
        dataframe[col] = dataframe[col].replace({'Unk': 'Unknown', 'UNK': 'Unknown'})

In [67]:
multiple_fill_nan(df_clean, df_clean.columns)
df_clean['Weather.Condition'].value_counts()

VMC        65546
IMC         5711
Unknown     4751
Name: Weather.Condition, dtype: int64

In [68]:
df_clean['Engine.Type'].value_counts()

Reciprocating      60048
nan                 5397
Turbo Prop          3311
Turbo Fan           2467
Turbo Shaft         2275
Unknown             1811
Turbo Jet            682
Geared Turbofan       12
Electric               5
Name: Engine.Type, dtype: int64

In [69]:
#Removes all rows where Engine.Type is 'nan'
df_clean = df_clean[df_clean['Engine.Type'] != 'nan']
df_clean['Engine.Type'].value_counts()

Reciprocating      60048
Turbo Prop          3311
Turbo Fan           2467
Turbo Shaft         2275
Unknown             1811
Turbo Jet            682
Geared Turbofan       12
Electric               5
Name: Engine.Type, dtype: int64

In [70]:
type(df_clean['Event.Date'][2])

str

In [71]:
#Convert event dates from string to pandas datetime datatype
df_clean['Event.Date'] = pd.to_datetime(df_clean['Event.Date'])

In [72]:
type(df_clean['Event.Date'][2])

pandas._libs.tslibs.timestamps.Timestamp

In [73]:
#Creating a column for the day of the week for each event
df_clean['Event.Day'] = df_clean['Event.Date'].dt.day_name()
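
The date conversion and day-of-week step can be sketched on a small example. The toy Series below is hypothetical, and `errors='coerce'` is an added assumption for handling malformed dates; the cells above do not use it.

```python
import pandas as pd

# Hypothetical toy column standing in for df_clean['Event.Date']
dates = pd.Series(['1948-10-24', '1962-07-19', 'not a date'])

# Unparseable strings become NaT instead of raising an error
parsed = pd.to_datetime(dates, errors='coerce')

# day_name() returns the weekday name for each valid timestamp
day_names = parsed.dt.day_name()
print(day_names.tolist())
```

Counting `parsed.isna().sum()` afterwards shows how many dates failed to parse.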

In [74]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70611 entries, 0 to 88767
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Event.Id                70611 non-null  object        
 1   Accident.Number         70611 non-null  object        
 2   Event.Date              70611 non-null  datetime64[ns]
 3   Location                70611 non-null  object        
 4   Country                 70611 non-null  object        
 5   Injury.Severity         70611 non-null  object        
 6   Aircraft.damage         70611 non-null  object        
 7   Make                    70611 non-null  object        
 8   Model                   70611 non-null  object        
 9   Number.of.Engines       70611 non-null  object        
 10  Engine.Type             70611 non-null  object        
 11  Total.Fatal.Injuries    70611 non-null  object        
 12  Total.Serious.Injuries  70611 non-null  object

In [75]:
df_clean.isna().sum()

Event.Id                  0
Accident.Number           0
Event.Date                0
Location                  0
Country                   0
Injury.Severity           0
Aircraft.damage           0
Make                      0
Model                     0
Number.of.Engines         0
Engine.Type               0
Total.Fatal.Injuries      0
Total.Serious.Injuries    0
Total.Minor.Injuries      0
Total.Uninjured           0
Weather.Condition         0
Broad.phase.of.flight     0
Event.Day                 0
dtype: int64

# Exploratory Data Analysis

In [76]:
#df_clean.to_csv('aircraft_safety_cleaned.csv')

In [77]:
df_clean.describe()


Unnamed: 0,Event.Id,Accident.Number,Event.Date,Location,Country,Injury.Severity,Aircraft.damage,Make,Model,Number.of.Engines,Engine.Type,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Event.Day
count,70611,70611,70611,70611,70611,70611,70611,70611,70611.0,70611.0,70611,70611.0,70611.0,70611.0,70611.0,70611,70611,70611
unique,69849,70607,14044,21570,170,5,4,1356,7328.0,7.0,8,85.0,40.0,54.0,358.0,3,12,7
top,20001212X19172,WPR22LA143,2000-07-08 00:00:00,"ANCHORAGE, AK",United States,Non-Fatal,Substantial,Cessna,152.0,1.0,Reciprocating,0.0,0.0,0.0,0.0,VMC,Unknown,Saturday
freq,3,2,25,421,68351,55631,52023,25797,2322.0,57238.0,60048,48854.0,51069.0,49231.0,21539.0,63447,16067,12939
first,,,1948-10-24 00:00:00,,,,,,,,,,,,,,,
last,,,2022-11-09 00:00:00,,,,,,,,,,,,,,,


# Conclusions

## Limitations

## Recommendations

## Next Steps