# Aviation Venture Risk EDA

## Introduction

ACME Co. is interested in purchasing and operating airplanes for commercial and private enterprises. This Exploratory Data Analysis (EDA) utilizes data from the National Transportation Safety Board to determine which aircraft have the lowest risk. The analysis contains actionable insights for the head of the new Aviation Division, Scott Fly.

## Import Python Libraries

In [2]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Load the Data

In [3]:
!ls data

AviationData.csv  USState_Codes.csv


In [36]:
# Load the CSV files used for the rest of the project.

# A full load makes pandas warn that columns 6, 7, and 28 have mixed data types,
# so we read those columns as strings to avoid errors later.
# Latitude and Longitude appear in two formats in the file:
#   - degrees, minutes, seconds with a hemisphere suffix such as N
#   - decimal degrees, which parse as floats

# latin1 is required because the file will not load as utf-8.
# Load a single row first just to get the column names, then build a dtype
# mapping that tells pandas to read the three problem columns as strings.
aviation_data = pd.read_csv("data/AviationData.csv",encoding="latin1", nrows=1)
col_list = list(aviation_data.columns)
dtype_spec = {
    col_list[6]: 'str', #Latitude
    col_list[7]: 'str', #Longitude
    col_list[28]: 'str' #Broad.phase.of.flight
}

#now load it in full without warnings
aviation_data = pd.read_csv("data/AviationData.csv",encoding="latin1", dtype=dtype_spec)
uscode_data = pd.read_csv("data/USState_Codes.csv")

### Mixed Data Type Issue with Longitude, Latitude

The Latitude and Longitude columns mix two formats. Some values are in degrees, minutes, and seconds with a hemisphere suffix such as 'N'. Others are in decimal degrees, which are easier to work with mathematically.

In [45]:
# Before passing str types for columns 6, 7, and 28, we needed to understand what those columns contain.
# value_counts shows whether one format dominates and lets us inspect some sample values.
aviation_data['Latitude'].value_counts()

332739N      19
335219N      18
334118N      17
32.815556    17
324934N      16
             ..
039613N       1
342034N       1
433113N       1
343255N       1
373829N       1
Name: Latitude, Length: 25589, dtype: int64

In [46]:
aviation_data['Longitude'].value_counts()

0112457W       24
1114342W       18
1151140W       17
-104.673056    17
-112.0825      16
               ..
0843135W        1
0101957W        1
1064131W        1
1114414W        1
0121410W        1
Name: Longitude, Length: 27154, dtype: int64
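Both columns could be brought to a single numeric format before any mapping or distance work. The sketch below is one possible approach, not a method the data source prescribes. It assumes the suffixed strings are laid out as `DDMMSS` plus `N`/`S` for latitude and `DDDMMSS` plus `E`/`W` for longitude, which is inferred from the samples above rather than documented. `to_decimal_degrees` is a hypothetical helper name. Values with minutes or seconds of 60 or more (such as `039613N` above) are treated as unparseable and become NaN.

```python
import math
import re

# Assumed layout (inferred from the samples, not confirmed by the data source):
# 2 or 3 degree digits, 2 minute digits, 2 second digits, then a hemisphere letter.
_DMS_PATTERN = re.compile(r"^(\d{2,3})(\d{2})(\d{2})([NSEW])$")


def to_decimal_degrees(value):
    """Convert a coordinate string to decimal degrees, or NaN if it cannot be parsed."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return math.nan
    text = str(value).strip()
    match = _DMS_PATTERN.match(text)
    if match:
        degrees, minutes, seconds, hemisphere = match.groups()
        minutes, seconds = int(minutes), int(seconds)
        # Reject impossible minute/second values instead of guessing their meaning.
        if minutes >= 60 or seconds >= 60:
            return math.nan
        decimal = int(degrees) + minutes / 60 + seconds / 3600
        # South and West are negative by convention.
        return -decimal if hemisphere in "SW" else decimal
    try:
        # Values already in decimal degrees parse directly as floats.
        return float(text)
    except ValueError:
        return math.nan


for sample in ["332739N", "0112457W", "32.815556", "039613N"]:
    print(sample, "->", to_decimal_degrees(sample))
```

If the assumption about the layout holds, applying it to the real data would look like `aviation_data['Latitude'].map(to_decimal_degrees)`, ideally into a new column so the raw strings stay available for comparison.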

### Mixed Data Type for Broad.phase.of.flight
This is likely due to NaN values mixed in with the strings, but it needs more investigation.

In [40]:
aviation_data['Broad.phase.of.flight'].head()

0      Cruise
1     Unknown
2      Cruise
3      Cruise
4    Approach
Name: Broad.phase.of.flight, dtype: object

In [41]:
aviation_data['Broad.phase.of.flight'].value_counts()

Landing        15428
Takeoff        12493
Cruise         10269
Maneuvering     8144
Approach        6546
Climb           2034
Taxi            1958
Descent         1887
Go-around       1353
Standing         945
Unknown          548
Other            119
Name: Broad.phase.of.flight, dtype: int64

In [44]:
aviation_data['Broad.phase.of.flight'].isna().sum()

27165

In [43]:
aviation_data[aviation_data['Broad.phase.of.flight'].isna()]['Broad.phase.of.flight'].head()

3030    NaN
3550    NaN
3637    NaN
4032    NaN
5505    NaN
Name: Broad.phase.of.flight, dtype: object
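The checks above point to NaN as the source of the mixed types, since pandas stores missing values in an object column as float NaN next to the strings. One quick way to confirm this kind of suspicion is to tally the Python type of every value in the column. The sketch below uses a small made-up Series rather than the real data, and `type_counts` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def type_counts(series):
    """Count how many values of each Python type a Series holds."""
    return series.map(lambda value: type(value).__name__).value_counts()


# Toy stand-in for an object column with missing entries.
toy = pd.Series(["Cruise", "Unknown", np.nan, "Approach", np.nan], dtype=object)
counts = type_counts(toy)
print(counts)
```

If the real column shows only `str` and `float`, and the `float` count equals the `isna().sum()` result, then NaN accounts for all of the non-string values.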

## Checking for Missing Data
First, we will run some checks on aviation_data to see how much data is missing.

In [59]:
aviation_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50249 non-null  object 
 9   Airport.Name            52790 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87572 non-null  object 
 14  Make                    88826 non-null

We are missing a lot of data, so we will need a plan for each affected column.
The next code block shows the percentage of missing values per column.

In [64]:
# Calculate the percentage of missing values for each column
missing_perc = aviation_data.isna().mean() * 100
missing_perc

Event.Id                   0.000000
Investigation.Type         0.000000
Accident.Number            0.000000
Event.Date                 0.000000
Location                   0.058500
Country                    0.254250
Latitude                  61.320298
Longitude                 61.330423
Airport.Code              43.469946
Airport.Name              40.611324
Injury.Severity            1.124999
Aircraft.damage            3.593246
Aircraft.Category         63.677170
Registration.Number        1.481623
Make                       0.070875
Model                      0.103500
Amateur.Built              0.114750
Number.of.Engines          6.844491
Engine.Type                7.961615
FAR.Description           63.974170
Schedule                  85.845268
Purpose.of.flight          6.965991
Air.carrier               81.271023
Total.Fatal.Injuries      12.826109
Total.Serious.Injuries    14.073732
Total.Minor.Injuries      13.424608
Total.Uninjured            6.650992
Weather.Condition          5

A common rule of thumb is to drop columns where more than 50% of the values are missing, unless the column is very important to the analysis.
These columns are candidates to consider dropping:

In [69]:
missing_perc[missing_perc > 50]

Latitude             61.320298
Longitude            61.330423
Aircraft.Category    63.677170
FAR.Description      63.974170
Schedule             85.845268
Air.carrier          81.271023
dtype: float64
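If we do decide to drop sparse columns later, the threshold rule can be wrapped in a small function that also accepts a list of columns to keep regardless of how sparse they are. This is a minimal sketch on a made-up DataFrame. `drop_sparse_columns`, its `threshold` and `keep` parameters, and the toy values are all illustrative assumptions, not part of the analysis so far.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=50.0, keep=()):
    """Drop columns whose missing percentage exceeds threshold, except those in keep."""
    missing_perc = df.isna().mean() * 100
    to_drop = [
        column
        for column in missing_perc[missing_perc > threshold].index
        if column not in keep
    ]
    return df.drop(columns=to_drop), to_drop


# Toy data: Schedule and Latitude are both 75% missing.
toy = pd.DataFrame({
    "Make": ["Cessna", "Piper", "Boeing", "Bell"],
    "Schedule": [np.nan, np.nan, np.nan, "NSCH"],
    "Latitude": [np.nan, "332739N", np.nan, np.nan],
})

# Keep Latitude even though it is sparse, because we may still want to fill it.
trimmed, dropped = drop_sparse_columns(toy, keep=("Latitude",))
print(dropped)
print(list(trimmed.columns))
```

Returning the list of dropped columns alongside the trimmed frame makes the decision easy to log and review.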

Let's see what Location contains. It may help us decide what to do with Latitude and Longitude.

In [71]:
aviation_data['Location'].value_counts()

ANCHORAGE, AK          434
MIAMI, FL              200
ALBUQUERQUE, NM        196
HOUSTON, TX            193
CHICAGO, IL            184
                      ... 
Corona De Tucso, AZ      1
Lithonia, GA             1
BONANZA, OR              1
NEWPORT, PA              1
Brasnorte,               1
Name: Location, Length: 27758, dtype: int64

We now have options: we can drop Latitude and Longitude, or we can use some sort of API to fill in the missing coordinates based on Location. We don't have to decide now, so let's let further exploration guide our choices.
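A third option that needs no external service is to borrow coordinates from other rows with the same Location. The sketch below is an illustration of that swapped-in idea, not the API approach mentioned above. It assumes the coordinates have already been converted to numeric decimal degrees in hypothetical columns `lat_dd` and `lon_dd`, and it uses made-up approximate values rather than the real data.

```python
import numpy as np
import pandas as pd


def fill_from_location(df, coord_cols=("lat_dd", "lon_dd")):
    """Fill missing coordinates with the median of rows sharing the same Location."""
    filled = df.copy()
    for column in coord_cols:
        # transform keeps the original row alignment; groups with no known
        # coordinates produce NaN, so those rows simply stay missing.
        location_median = filled.groupby("Location")[column].transform("median")
        filled[column] = filled[column].fillna(location_median)
    return filled


# Toy data with illustrative coordinates (hypothetical lat_dd / lon_dd columns).
toy = pd.DataFrame({
    "Location": ["ANCHORAGE, AK", "ANCHORAGE, AK", "MIAMI, FL", "MIAMI, FL", "NEWPORT, PA"],
    "lat_dd": [61.2, np.nan, 25.8, np.nan, np.nan],
    "lon_dd": [-149.9, np.nan, -80.2, np.nan, np.nan],
})

filled = fill_from_location(toy)
print(filled)
```

Locations that never appear with coordinates, like the last toy row, remain NaN and would still need a geocoding service or a drop decision.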