# Goal: Determine which aircraft (both commercial and private) carry the lowest risk, to guide stakeholders toward purchasing the safest aircraft

In [13]:
import pandas as pd
import numpy as np

pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 31)

In [18]:
df = pd.read_csv('data/Aviation_Data.csv')
df.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,Injury.Severity,Aircraft.damage,Aircraft.Category,Registration.Number,Make,Model,Amateur.Built,Number.of.Engines,Engine.Type,FAR.Description,Schedule,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,Fatal(2),Destroyed,,NC6404,Stinson,108-3,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,Fatal(4),Destroyed,,N5069P,Piper,PA24-180,No,1.0,Reciprocating,,,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.9222,-81.8781,,,Fatal(3),Destroyed,,N5142R,Cessna,172M,No,1.0,Reciprocating,,,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,Fatal(2),Destroyed,,N1168J,Rockwell,112,No,1.0,Reciprocating,,,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,Fatal(1),Destroyed,,N15NY,Cessna,501,No,,,,,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980
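The notes and cells that follow refer to columns by snake_case names (`event_id`, `total_fatal_injuries`, and so on), while the raw file uses names like `Event.Id`. The renaming step is not shown here, so the following is only a minimal sketch of one way it could be done, using a small stand-in frame named `sample` (a hypothetical name) instead of the real `df`:

```python
import pandas as pd

# Stand-in frame (hypothetical) with three of the raw column names shown above.
sample = pd.DataFrame(columns=["Event.Id", "Total.Fatal.Injuries", "Broad.phase.of.flight"])

# Lowercase every name and swap '.' for '_' (literal replacement, not a regex).
sample.columns = sample.columns.str.lower().str.replace(".", "_", regex=False)

print(list(sample.columns))
```

Applying the same two string operations to `df.columns` would produce the names used in the rest of this notebook.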


# most important cols:
These columns are essentially the recommendation that we give (specifically make/model). The engine columns are simply features of the plane.

- make
- model

Features of make/model:
- number_of_engines
- engine_type

## second most important:
Variables that should be taken into account regarding the safety of a specific model.

Features that affect the chance of an accident:
- weather_condition

# duplicate event_id values indicate a collision between 2 aircraft
- indicated by 'accident_number' values that are identical except for an A/B suffix
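A quick way to surface these rows is to flag every record whose `event_id` appears more than once. The sketch below uses made-up rows in a hypothetical frame named `events`; the IDs and accident numbers are illustrative, not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows: two reports share an event_id, and their
# accident numbers differ only by the trailing A/B.
events = pd.DataFrame({
    "event_id": ["E1", "E1", "E2"],
    "accident_number": ["XYZ01A", "XYZ01B", "XYZ02"],
})

# keep=False marks every member of a duplicated group, not just the repeats.
collisions = events[events.duplicated(subset="event_id", keep=False)]

print(collisions)
```

Running the same filter on `df` would list both aircraft for each shared event, which makes it easy to check the A/B pattern by eye.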

# investigation_type
- Accident -> An event that results in significant harm (death or serious injury) or substantial damage to the aircraft.
- Incident -> An event that usually does not lead to significant harm, but reveals potential risks (near miss with other aircraft, equipment malfunction)

# weather_conditions

- IMC -> "Instrument Meteorological Conditions". Reduced visibility (cloud, fog). The pilot must fly by instruments rather than by visual reference alone. These crashes should be weighted less (weather is a more understandable contributing factor).
- VMC -> "Visual Meteorological Conditions". Weather is generally clear (good visibility). These crashes should be weighted more heavily (the crash was likely due to something other than weather conditions).
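The weighting idea above could be expressed as a per-row multiplier. This is only a sketch: the notes say IMC should count less and VMC more, but give no numbers, so the values in `WEATHER_WEIGHTS` and the neutral fallback of 0.75 are assumptions chosen for illustration.

```python
import pandas as pd

# Assumed, illustrative weights; only their ordering (IMC < VMC) comes from the notes.
WEATHER_WEIGHTS = {"IMC": 0.5, "VMC": 1.0}

# Hypothetical sample covering both known codes, an unknown code, and a missing value.
weather = pd.DataFrame({"weather_condition": ["IMC", "VMC", "UNK", None]})

# Codes that are not in the mapping (UNK, missing) become NaN after map(),
# then fall back to a neutral weight, which is also an assumption.
weather["weather_weight"] = (
    weather["weather_condition"].map(WEATHER_WEIGHTS).fillna(0.75)
)

print(weather)
```

A column like this could later multiply injury counts when scoring a make/model, so that clear-weather crashes contribute more to its risk score.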

In [41]:
# for numerical columns -> replace 'Unknown' and NaN with 0, then convert strings -> floats -> ints
for col in ['total_fatal_injuries', 'total_serious_injuries', 'total_minor_injuries', 'total_uninjured', 'number_of_engines']:
    df[col] = df[col].replace(['Unknown', np.nan], 0)
    df[col] = df[col].astype(float).astype(int)

In [53]:
print("96% of investigation_type is classified as 'Accident'.")

df['investigation_type'].value_counts()

96% of investigation_type is classified as 'Accident'.


Accident    84919
Incident     3874
Name: investigation_type, dtype: int64
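The 96% figure can be derived from these counts instead of being typed by hand. The sketch below rebuilds the two counts shown above in a stand-in Series named `counts` (a hypothetical name) so that it runs on its own:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"Accident": 84919, "Incident": 3874}, name="investigation_type")

# Share of all investigations that are classified as accidents.
accident_share = counts["Accident"] / counts.sum()

print(f"{accident_share:.1%} of investigation_type is classified as 'Accident'.")
```

In the notebook itself, `df['investigation_type'].value_counts(normalize=True)` returns the same proportions directly.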