# Aviation Venture Risk EDA

## Introduction

ACME Co. is interested in purchasing and operating airplanes for commercial and private enterprises. This Exploratory Data Analysis (EDA) utilizes data from the National Transportation Safety Board to determine which aircraft have the lowest risk. The analysis contains actionable insights for the head of the new Aviation Division, Scott Fly.

## Import Python Libraries

In [96]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Load the Data

In [97]:
!ls data

AviationData.csv  USState_Codes.csv


In [98]:
#load the CSV files for the rest of the project

#pandas warns that columns 6, 7, and 28 have mixed data types, so we load them as strings to avoid errors later
#Latitude and Longitude appear in two formats in the file:
#  - degrees, minutes, seconds with a hemisphere suffix such as N
#  - decimal degrees, stored as a float

#latin1 is required as utf-8 will not load
#load 1 row just for the column names, then tell pandas to load columns 6, 7, 28 as strings
aviation_data = pd.read_csv("data/AviationData.csv",encoding="latin1", nrows=1)
col_list = list(aviation_data.columns)
dtype_spec = {
    col_list[6]: 'str', #Latitude
    col_list[7]: 'str', #Longitude
    col_list[28]: 'str' #Broad.phase.of.flight
}

#now load it in full without warnings
aviation_data = pd.read_csv("data/AviationData.csv",encoding="latin1", dtype=dtype_spec)
uscode_data = pd.read_csv("data/USState_Codes.csv")

# Standardize Column Names

In [99]:
#regex=False treats '.' as a literal dot rather than a regex wildcard
aviation_data.columns = aviation_data.columns.str.lower().str.replace(' ', '_', regex=False).str.replace('.', '_', regex=False)



In [100]:
#initial missing data check before cleaning/standardization
aviation_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   event_id                88889 non-null  object 
 1   investigation_type      88889 non-null  object 
 2   accident_number         88889 non-null  object 
 3   event_date              88889 non-null  object 
 4   location                88837 non-null  object 
 5   country                 88663 non-null  object 
 6   latitude                34382 non-null  object 
 7   longitude               34373 non-null  object 
 8   airport_code            50249 non-null  object 
 9   airport_name            52790 non-null  object 
 10  injury_severity         87889 non-null  object 
 11  aircraft_damage         85695 non-null  object 
 12  aircraft_category       32287 non-null  object 
 13  registration_number     87572 non-null  object 
 14  make                    88826 non-null

### Mixed Data Type Issue with Longitude, Latitude

Latitude and longitude mix two formats. Some values are in degrees, minutes, and seconds with a hemisphere suffix such as 'N'. Others are in decimal degrees, which are easier to work with mathematically.

In [101]:
#To know that columns 6, 7 and 28 should be read as str, we first had to inspect them
#value_counts shows the most common values and whether one format dominates
aviation_data['latitude'].value_counts()

332739N      19
335219N      18
334118N      17
32.815556    17
324934N      16
             ..
039613N       1
342034N       1
433113N       1
343255N       1
373829N       1
Name: latitude, Length: 25589, dtype: int64

In [102]:
aviation_data['longitude'].value_counts()

0112457W       24
1114342W       18
1151140W       17
-104.673056    17
-112.0825      16
               ..
0843135W        1
0101957W        1
1064131W        1
1114414W        1
0121410W        1
Name: longitude, Length: 27154, dtype: int64
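
If we keep these columns, the two formats could be unified along these lines. This is only a sketch: `to_decimal_degrees` is a hypothetical helper, and it assumes the suffixed strings are laid out as DDMMSS (latitude) or DDDMMSS (longitude) followed by a hemisphere letter, which is how the values above appear to be laid out.

```python
import math
import re

# Assumed layout: 2-3 degree digits, 2 minute digits, 2 second digits, hemisphere letter.
_DMS_PATTERN = re.compile(r"^(\d{2,3})(\d{2})(\d{2})([NSEW])$", re.IGNORECASE)

def to_decimal_degrees(value):
    """Return decimal degrees as a float, or NaN if the value cannot be parsed."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return math.nan
    text = str(value).strip()
    match = _DMS_PATTERN.match(text)
    if match:
        degrees, minutes, seconds = (int(g) for g in match.groups()[:3])
        hemisphere = match.group(4).upper()
        if minutes >= 60 or seconds >= 60:
            return math.nan  # e.g. '039613N' has 96 minutes, which is not valid
        decimal = degrees + minutes / 60 + seconds / 3600
        # South and West are negative in decimal degrees
        return -decimal if hemisphere in ("S", "W") else decimal
    try:
        return float(text)  # already decimal degrees, e.g. '32.815556'
    except ValueError:
        return math.nan
```

It could then be applied with something like `aviation_data['latitude'].map(to_decimal_degrees)`. Values that fit neither format come back as NaN rather than raising, so they can be counted and inspected separately.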

### Mixed Data Type for Broad.phase.of.flight
The mixed-type warning here is likely caused by NaN values, which pandas treats as floats, sitting among the strings. This needs more investigation.

In [103]:
aviation_data['broad_phase_of_flight'].head()

0      Cruise
1     Unknown
2      Cruise
3      Cruise
4    Approach
Name: broad_phase_of_flight, dtype: object

In [104]:
aviation_data['broad_phase_of_flight'].value_counts()

Landing        15428
Takeoff        12493
Cruise         10269
Maneuvering     8144
Approach        6546
Climb           2034
Taxi            1958
Descent         1887
Go-around       1353
Standing         945
Unknown          548
Other            119
Name: broad_phase_of_flight, dtype: int64

In [105]:
aviation_data['broad_phase_of_flight'].isna().sum()

27165

In [106]:
aviation_data[aviation_data['broad_phase_of_flight'].isna()]['broad_phase_of_flight'].head()

3030    NaN
3550    NaN
3637    NaN
4032    NaN
5505    NaN
Name: broad_phase_of_flight, dtype: object
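
One way to check the NaN theory is to count the Python type of each element. The sketch below uses a small made-up Series (`phase`) as a stand-in for the real column, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the column: strings mixed with missing values.
phase = pd.Series(["Cruise", "Unknown", np.nan, "Approach", np.nan], dtype="object")

# Count the Python type of each element; NaN shows up as float.
type_counts = phase.map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

If the real column shows only `str` and `float`, and the number of floats equals the NaN count, then missing values are the only non-string entries.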

# Standardize Data

Let's tackle duplicate labels first. We checked ahead and found many labels duplicated for several reasons, such as inconsistent capitalization and stray whitespace. This step attempts to remove those duplicates.

In [107]:
#lowercase and strip whitespace, this removes a lot of duplicate labels
def standardize_string(s):
    if pd.isna(s):
        return s
    if isinstance(s,str):
        s = s.strip().lower()
    return s

#for some columns treating 'none' as NaN helps us do analysis
def none_to_nan(s):
    if s == 'none':
        return np.nan
    return s

#for some columns treating 'unknown' as NaN helps us do analysis
def unknown_to_nan(s):
    if s == 'unknown':
        return np.nan
    return s

#for some columns treating 'unk' as NaN helps us do analysis
def unk_to_nan(s):
    if s == 'unk':
        return np.nan
    return s

#apply func to every cell of the given columns (named func to avoid shadowing the builtin map)
def replace_in_cols(cols,func):
    aviation_data[cols] = aviation_data[cols].applymap(func)

display(aviation_data.head())
aviation_data = aviation_data.applymap(standardize_string)

replace_in_cols(['airport_code','airport_name','registration_number'], none_to_nan)
replace_in_cols(['aircraft_category','registration_number','engine_type','far_description'], unk_to_nan)
replace_in_cols(['engine_type','aircraft_damage','registration_number'], unknown_to_nan)

display(aviation_data.head())
aviation_data['airport_code'].value_counts()

Unnamed: 0,event_id,investigation_type,accident_number,event_date,location,country,latitude,longitude,airport_code,airport_name,...,purpose_of_flight,air_carrier,total_fatal_injuries,total_serious_injuries,total_minor_injuries,total_uninjured,weather_condition,broad_phase_of_flight,report_status,publication_date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.922223,-81.878056,,,...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


Unnamed: 0,event_id,investigation_type,accident_number,event_date,location,country,latitude,longitude,airport_code,airport_name,...,purpose_of_flight,air_carrier,total_fatal_injuries,total_serious_injuries,total_minor_injuries,total_uninjured,weather_condition,broad_phase_of_flight,report_status,publication_date
0,20001218x45444,accident,sea87la080,1948-10-24,"moose creek, id",united states,,,,,...,personal,,2.0,0.0,0.0,0.0,unk,cruise,probable cause,
1,20001218x45447,accident,lax94la336,1962-07-19,"bridgeport, ca",united states,,,,,...,personal,,4.0,0.0,0.0,0.0,unk,unknown,probable cause,19-09-1996
2,20061025x01555,accident,nyc07la005,1974-08-30,"saltville, va",united states,36.922223,-81.878056,,,...,personal,,3.0,,,,imc,cruise,probable cause,26-02-2007
3,20001218x45448,accident,lax96la321,1977-06-19,"eureka, ca",united states,,,,,...,personal,,2.0,0.0,0.0,0.0,imc,cruise,probable cause,12-09-2000
4,20041105x01764,accident,chi79fa064,1979-08-02,"canton, oh",united states,,,,,...,personal,,1.0,2.0,,0.0,vmc,approach,probable cause,16-04-1980


pvt     497
apa     160
ord     149
mri     137
den     115
       ... 
78oh      1
0v6       1
8c3       1
56m       1
eikh      1
Name: airport_code, Length: 10345, dtype: int64

In [108]:
#missing data check after standardization
aviation_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   event_id                88889 non-null  object 
 1   investigation_type      88889 non-null  object 
 2   accident_number         88889 non-null  object 
 3   event_date              88889 non-null  object 
 4   location                88837 non-null  object 
 5   country                 88663 non-null  object 
 6   latitude                34382 non-null  object 
 7   longitude               34373 non-null  object 
 8   airport_code            48637 non-null  object 
 9   airport_name            52558 non-null  object 
 10  injury_severity         87889 non-null  object 
 11  aircraft_damage         85576 non-null  object 
 12  aircraft_category       32285 non-null  object 
 13  registration_number     87137 non-null  object 
 14  make                    88826 non-null

# Checking for Missing Data
First we will run some checks on aviation_data to see how much data is missing.

In [109]:
aviation_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   event_id                88889 non-null  object 
 1   investigation_type      88889 non-null  object 
 2   accident_number         88889 non-null  object 
 3   event_date              88889 non-null  object 
 4   location                88837 non-null  object 
 5   country                 88663 non-null  object 
 6   latitude                34382 non-null  object 
 7   longitude               34373 non-null  object 
 8   airport_code            48637 non-null  object 
 9   airport_name            52558 non-null  object 
 10  injury_severity         87889 non-null  object 
 11  aircraft_damage         85576 non-null  object 
 12  aircraft_category       32285 non-null  object 
 13  registration_number     87137 non-null  object 
 14  make                    88826 non-null

A lot of data is missing, so we will need a plan for each affected column.
The next code block shows the percentage of missing values per column.

## Missing Percent

In [110]:
# Calculate the percentage of missing values for each column
missing_perc = aviation_data.isna().mean() * 100
missing_perc

event_id                   0.000000
investigation_type         0.000000
accident_number            0.000000
event_date                 0.000000
location                   0.058500
country                    0.254250
latitude                  61.320298
longitude                 61.330423
airport_code              45.283443
airport_name              40.872324
injury_severity            1.124999
aircraft_damage            3.727120
aircraft_category         63.679420
registration_number        1.970998
make                       0.070875
model                      0.103500
amateur_built              0.114750
number_of_engines          6.844491
engine_type               10.270112
far_description           64.391545
schedule                  85.845268
purpose_of_flight          6.965991
air_carrier               81.271023
total_fatal_injuries      12.826109
total_serious_injuries    14.073732
total_minor_injuries      13.424608
total_uninjured            6.650992
weather_condition          5

A common rule of thumb is to drop columns where more than 50% of the values are missing, unless the column is very important to the analysis.
These columns are candidates to consider dropping:

In [111]:
missing_perc[missing_perc > 50]

latitude             61.320298
longitude            61.330423
aircraft_category    63.679420
far_description      64.391545
schedule             85.845268
air_carrier          81.271023
dtype: float64
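
If we do apply the 50% rule later, it could look roughly like this. The DataFrame `df` below is toy data, and the `protected` list is a hypothetical way to mark columns we judge too important to drop; neither is taken from the analysis.

```python
import numpy as np
import pandas as pd

# Toy data: 'schedule' and 'aircraft_category' are each 75% missing.
df = pd.DataFrame({
    "make": ["cessna", "piper", "beech", "bell"],
    "schedule": [np.nan, np.nan, np.nan, "nsch"],
    "aircraft_category": [np.nan, np.nan, "airplane", np.nan],
})

missing_perc = df.isna().mean() * 100

# Hypothetical list of columns to keep despite the gaps.
protected = ["aircraft_category"]

# Drop columns above the threshold unless they are protected.
to_drop = [col for col in missing_perc[missing_perc > 50].index if col not in protected]
trimmed = df.drop(columns=to_drop)
```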

## location

Let's see what location contains; it may help us decide what to do with latitude and longitude.

In [112]:
aviation_data['location'].value_counts()

anchorage, ak       548
miami, fl           275
houston, tx         271
albuquerque, nm     265
chicago, il         256
                   ... 
medina, mn            1
circle pines, mn      1
pine island, fl       1
churchtown, oh        1
brasnorte,            1
Name: location, Length: 21977, dtype: int64

We have options now: we can drop latitude and longitude, or we can use a geocoding API to fill them in from the location when they are missing. We don't have to decide now; let's let further exploration guide our choice.

# Exploring More Columns

## investigation_type

In [113]:
aviation_data['investigation_type'].value_counts()

accident    85015
incident     3874
Name: investigation_type, dtype: int64

- Accident: caused personal injury or aircraft damage.
- Incident: could have caused harm but did not necessarily do so.

Weighting these differently is a good idea.
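
A minimal sketch of what such a weighting could look like is below. The weights in `INVESTIGATION_WEIGHTS` are placeholders chosen for illustration, not values derived from the data, and `events` is toy data.

```python
import pandas as pd

# Hypothetical weights: placeholders, not values from the analysis.
INVESTIGATION_WEIGHTS = {"accident": 1.0, "incident": 0.25}

events = pd.DataFrame({
    "make": ["cessna", "cessna", "piper", "piper"],
    "investigation_type": ["accident", "incident", "accident", "accident"],
})

# Map each event to its weight, then total the weights per make.
events["risk_weight"] = events["investigation_type"].map(INVESTIGATION_WEIGHTS)
weighted_events = events.groupby("make")["risk_weight"].sum()
```

The actual weights would need to be justified, for example by how much more costly an accident is to the business than an incident.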

## country

In [114]:
aviation_data['country'].value_counts()

united states     82248
brazil              374
canada              359
mexico              358
united kingdom      344
                  ...  
chad                  1
ivory coast           1
cambodia              1
yemen                 1
benin                 1
Name: country, Length: 215, dtype: int64

The data is heavily skewed toward the United States, so our conclusions will be most reliable for US operations. Uncontrolled factors, such as regulatory differences between countries, limit how far they generalize.

## airport_code and airport_name

In [115]:
aviation_data['airport_code'].value_counts()

pvt     497
apa     160
ord     149
mri     137
den     115
       ... 
78oh      1
0v6       1
8c3       1
56m       1
eikh      1
Name: airport_code, Length: 10345, dtype: int64

In [116]:
aviation_data['airport_name'].value_counts()

private                           471
private airstrip                  266
private strip                     161
merrill field                     109
centennial                        102
                                 ... 
lambert-st. louis int'l             1
williamson mingo cty                1
sanona creek airstrip               1
penns cave                          1
wichita dwight d eisenhower nt      1
Name: airport_name, Length: 21565, dtype: int64

We will have to clean the names quite a bit, especially around private airports.
The codes have a lot of missing data. Could that be because private airports often have no code?

In [117]:
aviation_data[(aviation_data['airport_name']=='private') & (aviation_data['airport_code'].isna())][['airport_name','airport_code']]

Unnamed: 0,airport_name,airport_code
974,private,
1057,private,
1076,private,
1186,private,
1658,private,
...,...,...
85962,private,
86046,private,
86494,private,
88425,private,


In [118]:
aviation_data[(aviation_data['airport_name']=='private') & (aviation_data['airport_code'].notna())][['airport_code']].value_counts()

airport_code
pvt             120
priv              5
rla               2
0co6              1
172               1
4nc5              1
mm20              1
my99              1
unk               1
xxx               1
dtype: int64

In [119]:
aviation_data[aviation_data['airport_name']=='private']['airport_name'].count()

471

So 337 of the 471 rows named private have a NaN code. Of the remaining 134, most use pvt, and priv means the same thing, so we can merge them.
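
The fix could be sketched as follows on toy data (`airports`). Filling the missing codes of private airports with pvt is an assumption on our part, not something the data confirms.

```python
import numpy as np
import pandas as pd

airports = pd.DataFrame({
    "airport_name": ["private", "private", "private", "merrill field"],
    "airport_code": [np.nan, "priv", "pvt", "mri"],
})

is_private = airports["airport_name"] == "private"

# Treat 'priv' as a spelling variant of 'pvt'.
airports["airport_code"] = airports["airport_code"].replace({"priv": "pvt"})

# Assumption: a private airport with no code can be labeled 'pvt' as well.
missing_private = is_private & airports["airport_code"].isna()
airports.loc[missing_private, "airport_code"] = "pvt"
```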

## injury_severity

In [120]:
aviation_data['injury_severity'].value_counts()

non-fatal     67357
fatal(1)       6167
fatal          5262
fatal(2)       3711
incident       2219
              ...  
fatal(270)        1
fatal(60)         1
fatal(43)         1
fatal(143)        1
fatal(230)        1
Name: injury_severity, Length: 109, dtype: int64

We will need to separate the fatal label from the count in parentheses so the field makes more sense in our charts and analysis.
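
One possible way to do the split is with `str.extract`, shown here on a toy Series. The column names `injury_level` and `fatal_count` are hypothetical.

```python
import pandas as pd

severity = pd.Series(["non-fatal", "fatal(1)", "fatal", "fatal(270)", "incident", None])

# First group: the label up to any '('. Second (optional) group: the digits in parentheses.
parts = severity.str.extract(r"^([^(]+)(?:\((\d+)\))?$")
parts.columns = ["injury_level", "fatal_count"]

# Nullable Int64 keeps the counts as integers while allowing missing values.
parts["fatal_count"] = pd.to_numeric(parts["fatal_count"]).astype("Int64")
```

Values without a count, such as non-fatal or plain fatal, end up with a missing `fatal_count`.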

## aircraft_damage

In [121]:
aviation_data['aircraft_damage'].value_counts()

substantial    64148
destroyed      18623
minor           2805
Name: aircraft_damage, dtype: int64

Finally, a field with few issues: just some missing values.

## Aircraft Specs
Let's examine the group of fields that might let us identify features of the airplanes. These are some of the most important data.

### aircraft_category

In [122]:
aviation_data['aircraft_category'].value_counts()

airplane             27617
helicopter            3440
glider                 508
balloon                231
gyrocraft              173
weight-shift           161
powered parachute       91
ultralight              30
unknown                 14
wsft                     9
powered-lift             5
blimp                    4
rocket                   1
ultr                     1
Name: aircraft_category, dtype: int64

We can probably drop every category that isn't airplane, since ACME is interested in airplanes.
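
Because most rows have no category at all, there are at least two ways to filter. The sketch below shows both on toy data; keeping the unlabeled rows is an assumption that would need checking against make and model.

```python
import numpy as np
import pandas as pd

aircraft = pd.DataFrame({
    "aircraft_category": ["airplane", "helicopter", np.nan, "glider", "airplane"],
    "make": ["cessna", "bell", "piper", "example glider co", "boeing"],
})

# Strict filter: keep only rows explicitly labeled as airplanes.
airplanes_only = aircraft[aircraft["aircraft_category"] == "airplane"]

# Looser filter (an assumption): also keep rows with no category recorded.
airplanes_or_unlabeled = aircraft[
    (aircraft["aircraft_category"] == "airplane") | aircraft["aircraft_category"].isna()
]
```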

### registration_number

In [123]:
aviation_data['registration_number'].value_counts()

unreg     131
usaf        9
n20752      8
n53893      6
n11vh       6
         ... 
n62951      1
n1013e      1
n8266r      1
n65737      1
n9026p      1
Name: registration_number, Length: 79091, dtype: int64

Before cleaning, this field had lots of NONE values, and it still contains placeholders such as unreg, so it might not be very useful. However, if individual planes had repeated problems, that would be worth keeping in mind when interpreting the data.
### make

In [124]:
aviation_data['make'].value_counts()

cessna           27149
piper            14870
beech             5372
boeing            2745
bell              2722
                 ...  
cohen                1
kitchens             1
lutes                1
izatt                1
royse ralph l        1
Name: make, Length: 7587, dtype: int64

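We will need to clean this data for sure. In the raw file, Cessna, for example, is listed twice with different capitalization. Lowercasing fixes that case, but with 7,587 distinct values the field still needs more cleaning.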
### model

In [125]:
aviation_data['model'].value_counts()

152                 2367
172                 1756
172n                1164
pa-28-140            932
150                  829
                    ... 
e75nl                  1
747-273c               1
watcha-mccall-it       1
md-520n                1
m-8 eagle              1
Name: model, Length: 11646, dtype: int64

Make and model together form a string that identifies a type of aircraft, which is useful. (The same model name can appear under different makes, even if that is rare.) This pair will be used heavily when looking for correlations in the data.
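
Building the combined key could look like this. The `make_model` column name is hypothetical and the rows are toy data.

```python
import pandas as pd

aircraft = pd.DataFrame({
    "make": ["cessna", "cessna", "piper", None],
    "model": ["152", "172n", "pa-28-140", "150"],
})

# Combine make and model into one key; rows missing either part stay missing.
aircraft["make_model"] = aircraft["make"].str.cat(aircraft["model"], sep=" ")
```

Leaving the key missing when either part is missing avoids grouping unrelated aircraft under a partial name.
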
## amateur_built

In [126]:
aviation_data['amateur_built'].value_counts()

no     80312
yes     8475
Name: amateur_built, dtype: int64

Since we are a business, we might simply remove rows for amateur-built aircraft. However, some of them may have properties we'd be interested in.

In [79]:
aviation_data['number_of_engines'].value_counts()

1.0    69582
2.0    11079
0.0     1226
3.0      483
4.0      431
8.0        3
6.0        1
Name: number_of_engines, dtype: int64

What has zero engines? Gliders, perhaps? Let's find out below.

In [127]:
aviation_data[aviation_data['number_of_engines']==0][['engine_type']].value_counts()

engine_type  
none             10
reciprocating     2
dtype: int64

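Only 12 of the 1,226 zero-engine rows have an engine type at all, so it's mostly missing data (value_counts leaves out NaN by default). Our missing-data markers will also need to be standardized to one value.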
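
To make the missing rows visible, and to test the glider guess, the counts can be repeated with `dropna=False` and against the aircraft category. The sketch below uses toy data, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

engines = pd.DataFrame({
    "number_of_engines": [0, 0, 0, 1, 0],
    "engine_type": ["none", np.nan, np.nan, "reciprocating", np.nan],
    "aircraft_category": ["glider", "balloon", np.nan, "airplane", "glider"],
})

zero_engine = engines[engines["number_of_engines"] == 0]

# dropna=False keeps the missing rows in the counts instead of silently dropping them.
engine_type_counts = zero_engine["engine_type"].value_counts(dropna=False)
category_counts = zero_engine["aircraft_category"].value_counts(dropna=False)
```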

In [128]:
aviation_data['engine_type'].value_counts()

reciprocating      69530
turbo shaft         3609
turbo prop          3391
turbo fan           2481
turbo jet            703
none                  21
geared turbofan       12
electric              10
lr                     2
hybrid rocket          1
Name: engine_type, dtype: int64

## far_description
FAR stands for Federal Aviation Regulations. These regulations are a comprehensive set of rules and guidelines established by the Federal Aviation Administration (FAA) to ensure the safety and efficiency of civil aviation within the United States.

Some examples:
- Part 23: Airworthiness Standards for Normal, Utility, Acrobatic, and Commuter Category Airplanes
- Part 61: Certification: Pilots, Flight Instructors, and Ground Instructors
- Part 91: General Operating and Flight Rules
- Part 121: Operating Requirements: Domestic, Flag, and Supplemental Operations

In [129]:
aviation_data['far_description'].value_counts()

091                               18221
part 91: general aviation          6486
nusn                               1584
nusc                               1013
137                                1010
135                                 746
121                                 679
part 137: agricultural              437
part 135: air taxi & commuter       298
pubu                                253
129                                 246
part 121: air carrier               165
133                                 107
part 129: foreign                   100
non-u.s., non-commercial             97
non-u.s., commercial                 93
part 133: rotorcraft ext. load       32
unknown                              22
public use                           19
091k                                 14
armf                                  8
part 125: 20+ pax,6000+ lbs           5
125                                   5
107                                   4
public aircraft                       2


The data for this field will need to be standardized, as seen with 091 and "part 91: general aviation", which describe the same regulation.
For background on Part 91 and the others, see https://pilotinstitute.com/part-91-vs-121-vs-135/

NUSN: Non-U.S., Non-Commercial

"NUSN" appears to be the short code for the spelled-out label "non-u.s., non-commercial" that also occurs in this column. It identifies non-commercial events involving aircraft that are not registered in the United States, which helps distinguish domestic from international events for regulatory and statistical purposes.

NUSC: Non-U.S., Commercial

"NUSC" appears to be the short code for "non-u.s., commercial". It identifies commercial operations, such as airlines and charter services, that are not registered in the United States.

... Understanding all the codes is going to be important if we use this field.
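
A rough sketch of how the two label styles could be collapsed is below. `standardize_far` is a hypothetical helper, and the `LABEL_TO_CODE` mapping is our guess from the value counts above; it should be checked against NTSB documentation before use.

```python
import re

import pandas as pd

# Assumed mapping from spelled-out labels to short codes (a guess, not confirmed).
LABEL_TO_CODE = {
    "non-u.s., non-commercial": "nusn",
    "non-u.s., commercial": "nusc",
    "public use": "pubu",
}

def standardize_far(value):
    """Collapse 'part 91: general aviation' and '091' to the same code."""
    if pd.isna(value):
        return value
    text = str(value).strip().lower()
    part = re.match(r"^part (\d+)", text)
    if part:
        return part.group(1).zfill(3)  # 'part 91: ...' -> '091'
    # Fall back to the assumed label mapping, or leave the value unchanged.
    return LABEL_TO_CODE.get(text, text)
```

It could be applied with something like `aviation_data['far_description'].map(standardize_far)`. Codes it does not recognize, such as 091k, pass through unchanged.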

## schedule

In [130]:
aviation_data['schedule'].value_counts()

nsch    4474
unk     4099
schd    4009
Name: schedule, dtype: int64

NSCH (Non-Scheduled): Refers to flights that do not operate on a regular schedule. These can include charter flights, air taxi operations, private flights, and other ad-hoc operations.

UNK (Unknown): Indicates that the schedule type of the flight operation is unknown. This can occur when the information is not available or not recorded in the safety database.

SCHD (Scheduled): Refers to flights that operate on a regular, published schedule. These are typically commercial airline flights that follow a fixed timetable.

## purpose_of_flight

In [131]:
aviation_data['purpose_of_flight'].value_counts()

personal                     49448
instructional                10601
unknown                       6802
aerial application            4712
business                      4018
positioning                   1646
other work use                1264
ferry                          812
aerial observation             794
public aircraft                720
executive/corporate            553
flight test                    405
skydiving                      182
external load                  123
public aircraft - federal      105
banner tow                     101
air race show                   99
public aircraft - local         74
public aircraft - state         64
air race/show                   59
glider tow                      53
firefighting                    40
air drop                        11
asho                             6
pubs                             4
publ                             1
Name: purpose_of_flight, dtype: int64

This field is very informative and has few missing values, although some labels are duplicated (air race show and air race/show). Here is the Skydiving Joseph mentioned.

## air_carrier

In [132]:
aviation_data['air_carrier'].value_counts()

pilot                         258
american airlines              90
united airlines                89
delta air lines                53
delta air lines inc            44
                             ... 
frank w. scooley                1
richard l. mcglashan            1
inflight pilot traning llc      1
mills & daughters inc           1
mc cessna 210n llc              1
Name: air_carrier, Length: 13208, dtype: int64

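About 81% of rows have no air carrier, but the carrier could have a big impact on safety. The names also need cleaning (for example, delta air lines and delta air lines inc).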

## total_fatal_injuries

In [133]:
aviation_data['total_fatal_injuries'].value_counts()

0.0      59675
1.0       8883
2.0       5173
3.0       1589
4.0       1103
         ...  
156.0        1
68.0         1
31.0         1
115.0        1
176.0        1
Name: total_fatal_injuries, Length: 125, dtype: int64

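The counts are stored as floats because pandas uses float64 for numeric columns that contain NaN. We should still confirm there are no fractional values, since a "partial fatal injury" would make no sense.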
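
That check, plus a conversion to whole numbers, could be sketched like this on a toy Series. Using the nullable Int64 dtype is one option among several, not a decision made in this analysis.

```python
import numpy as np
import pandas as pd

fatal = pd.Series([2.0, 0.0, np.nan, 4.0], name="total_fatal_injuries")

# Check that every non-missing value is a whole number.
non_missing = fatal.dropna()
all_whole = bool((non_missing % 1 == 0).all())

# If so, the nullable Int64 dtype stores integers and keeps missing values as pd.NA.
if all_whole:
    fatal = fatal.astype("Int64")
```

The same check applies to the serious, minor, and uninjured counts.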

## total_serious_injuries

In [134]:
aviation_data['total_serious_injuries'].value_counts()

0.0      63289
1.0       9125
2.0       2815
3.0        629
4.0        258
5.0         78
6.0         41
7.0         27
9.0         16
10.0        13
8.0         13
13.0         9
11.0         6
26.0         5
14.0         5
12.0         5
20.0         3
25.0         3
28.0         3
21.0         2
59.0         2
50.0         2
17.0         2
47.0         2
137.0        1
19.0         1
161.0        1
27.0         1
35.0         1
67.0         1
33.0         1
88.0         1
125.0        1
53.0         1
34.0         1
41.0         1
18.0         1
63.0         1
55.0         1
23.0         1
43.0         1
39.0         1
45.0         1
44.0         1
16.0         1
60.0         1
106.0        1
81.0         1
15.0         1
22.0         1
Name: total_serious_injuries, dtype: int64

It is in floats again
## total_minor_injuries

In [135]:
aviation_data['total_minor_injuries'].value_counts()

0.0      61454
1.0      10320
2.0       3576
3.0        784
4.0        372
5.0        129
6.0         67
7.0         59
9.0         22
8.0         20
13.0        14
10.0        11
12.0        11
14.0        10
11.0         9
17.0         8
19.0         6
18.0         6
24.0         5
22.0         5
25.0         4
16.0         4
15.0         4
33.0         4
20.0         3
21.0         3
26.0         3
23.0         3
32.0         3
27.0         3
50.0         2
30.0         2
36.0         2
31.0         2
28.0         2
42.0         2
38.0         2
57.0         1
65.0         1
84.0         1
43.0         1
35.0         1
380.0        1
47.0         1
68.0         1
200.0        1
71.0         1
58.0         1
171.0        1
39.0         1
96.0         1
29.0         1
69.0         1
62.0         1
45.0         1
125.0        1
40.0         1
Name: total_minor_injuries, dtype: int64

It is in floats again
## total_uninjured

In [136]:
aviation_data['total_uninjured'].value_counts()

0.0      29879
1.0      25101
2.0      15988
3.0       4313
4.0       2662
         ...  
558.0        1
412.0        1
338.0        1
401.0        1
455.0        1
Name: total_uninjured, Length: 379, dtype: int64

also in floats
## weather_condition

In [137]:
aviation_data['weather_condition'].value_counts()

vmc    77303
imc     5976
unk     1118
Name: weather_condition, dtype: int64

In the raw data, UNK and Unk were not standardized; lowercasing merged them into unk.

Conditions are codes:

**VMC (Visual Meteorological Conditions):**

Weather conditions that allow visual flight rules (VFR) operations. Pilots can navigate and control the aircraft using visual references outside the cockpit.

**IMC (Instrument Meteorological Conditions):**

Weather conditions that require instrument flight rules (IFR) operations. Visibility and cloud cover are such that pilots must rely on cockpit instruments for navigation and control.

**UNK (Unknown):**

The meteorological conditions at the time of the event are unknown or were not recorded.


## broad_phase_of_flight

In [138]:
aviation_data['broad_phase_of_flight'].value_counts()

landing        15428
takeoff        12493
cruise         10269
maneuvering     8144
approach        6546
climb           2034
taxi            1958
descent         1887
go-around       1353
standing         945
unknown          548
other            119
Name: broad_phase_of_flight, dtype: int64

Landing appears to be the phase most prone to incidents and accidents, which is notable because it is also a relatively short phase of flight.
This data can help us find planes' weak points and strong points, as well as high-risk areas.

## report_status

In [139]:
aviation_data['report_status'].value_counts()

probable cause                                                                                                                                                                                                                                      61754
foreign                                                                                                                                                                                                                                              1999
<br /><br />                                                                                                                                                                                                                                          167
factual                                                                                                                                                                                                                                               145


Most of the values are probable cause.

**Probable Cause**: This term is used to describe the findings of an investigation that identify the factors or events that most likely led to the incident or accident. When a report reaches the "probable cause" status, the investigating authority, such as the National Transportation Safety Board (NTSB) in the United States, has completed its analysis and has determined the primary reasons behind the occurrence.


## publication_date

In [140]:
aviation_data['publication_date'].value_counts()

25-09-2020    17019
26-09-2020     1769
03-11-2020     1155
31-03-1993      452
25-11-2003      396
              ...  
29-11-2004        1
29-08-2001        1
18-11-2004        1
17-12-1996        1
29-12-2022        1
Name: publication_date, Length: 2924, dtype: int64

I'm not sure why so many of the publication dates are 25-09-2020; perhaps the reports were published in batches.
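
To look into this further, the strings first need to become real dates. The values appear to be day-month-year (31-03-1993 can only be read that way), so a sketch on toy data might be:

```python
import pandas as pd

dates = pd.Series(["25-09-2020", "26-09-2020", "31-03-1993", None])

# The strings are day-month-year; errors='coerce' turns unparseable values into NaT.
parsed = pd.to_datetime(dates, format="%d-%m-%Y", errors="coerce")

# Count publications per year to see whether one year dominates.
per_year = parsed.dt.year.value_counts()
```

Comparing the parsed publication dates with event_date could then show whether the 2020 dates cover events from many different years.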