# Project To Determine Low-Risk Aircraft

## Project Goal

This project aims to identify the lowest-risk aircraft, to guide my company's decision on which aircraft to purchase and operate.

## Data Source and Data Exploration

The data was downloaded from this [GitHub repository](https://github.com/learn-co-curriculum/dsc-phase-1-project-v3/blob/1c3ea4c2ac868f4467e6f55304b5713d40314c35/data/Aviation_Data.csv). It covers civil aviation accidents and selected incidents in the United States and international waters from 1962 to 2023.

I used 14 columns in my analysis, including one calculated column. They cover:
- Aircraft category, make and model
- Injuries caused by the accident and injury severity
- Aircraft damage after the incident
- Weather conditions at the time of the incident

I dropped rows with missing values and 463 duplicate rows from the selected columns.
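
The calculated column is not defined in this section. As a minimal sketch of how such a column could be derived, assuming it combines the three injury counts, the snippet below adds a hypothetical `Total.Injuries` column to a toy DataFrame. The column name, the formula and the toy values are illustrative assumptions, not the definition used in the analysis.

```python
import pandas as pd

# Toy rows using three of the dataset's injury columns; the values are made up.
toy = pd.DataFrame({
    "Make": ["Cessna", "Piper", "Stinson"],
    "Total.Fatal.Injuries": [0.0, 1.0, 2.0],
    "Total.Serious.Injuries": [1.0, 0.0, 0.0],
    "Total.Minor.Injuries": [2.0, 0.0, 0.0],
})

injury_columns = [
    "Total.Fatal.Injuries",
    "Total.Serious.Injuries",
    "Total.Minor.Injuries",
]

# Hypothetical calculated column: number of injured people per event.
toy["Total.Injuries"] = toy[injury_columns].sum(axis=1)
print(toy[["Make", "Total.Injuries"]])
```

Summing across columns with `axis=1` keeps one value per event, so the new column can be grouped or compared like any of the original ones.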

In [28]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [29]:
# Load the aviation dataset from the CSV file into a pandas DataFrame
df = pd.read_csv("Aviation_Data.csv")
df.head()  # display the first 5 rows of the DataFrame


| | Event.Id | Investigation.Type | Accident.Number | Event.Date | Location | Country | Latitude | Longitude | Airport.Code | Airport.Name | ... | Purpose.of.flight | Air.carrier | Total.Fatal.Injuries | Total.Serious.Injuries | Total.Minor.Injuries | Total.Uninjured | Weather.Condition | Broad.phase.of.flight | Report.Status | Publication.Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20001218X45444 | Accident | SEA87LA080 | 1948-10-24 | MOOSE CREEK, ID | United States | NaN | NaN | NaN | NaN | ... | Personal | NaN | 2.0 | 0.0 | 0.0 | 0.0 | UNK | Cruise | Probable Cause | NaN |
| 1 | 20001218X45447 | Accident | LAX94LA336 | 1962-07-19 | BRIDGEPORT, CA | United States | NaN | NaN | NaN | NaN | ... | Personal | NaN | 4.0 | 0.0 | 0.0 | 0.0 | UNK | Unknown | Probable Cause | 19-09-1996 |
| 2 | 20061025X01555 | Accident | NYC07LA005 | 1974-08-30 | Saltville, VA | United States | 36.922223 | -81.878056 | NaN | NaN | ... | Personal | NaN | 3.0 | NaN | NaN | NaN | IMC | Cruise | Probable Cause | 26-02-2007 |
| 3 | 20001218X45448 | Accident | LAX96LA321 | 1977-06-19 | EUREKA, CA | United States | NaN | NaN | NaN | NaN | ... | Personal | NaN | 2.0 | 0.0 | 0.0 | 0.0 | IMC | Cruise | Probable Cause | 12-09-2000 |
| 4 | 20041105X01764 | Accident | CHI79FA064 | 1979-08-02 | Canton, OH | United States | NaN | NaN | NaN | NaN | ... | Personal | NaN | 1.0 | 2.0 | NaN | 0.0 | VMC | Approach | Probable Cause | 16-04-1980 |
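
The first row displayed above is dated 1948, earlier than the 1962 start given in the data description, so it may be worth checking the date range directly. A minimal sketch, assuming a two-row in-memory sample: `sample_csv` stands in for the real file and reuses the first two event IDs and dates from the output above, and `low_memory=False` is an optional setting that the original cell does not use.

```python
import io

import pandas as pd

# Two rows shaped like the file, cut down to three columns for illustration.
sample_csv = io.StringIO(
    "Event.Id,Event.Date,Total.Fatal.Injuries\n"
    "20001218X45444,1948-10-24,2.0\n"
    "20001218X45447,1962-07-19,4.0\n"
)

# low_memory=False lets pandas infer each column's dtype from the whole file
# instead of chunk by chunk; parse_dates converts Event.Date to datetime64.
sample = pd.read_csv(sample_csv, low_memory=False, parse_dates=["Event.Date"])

first_year = sample["Event.Date"].dt.year.min()
last_year = sample["Event.Date"].dt.year.max()
print(first_year, last_year)
```

On the full dataset, the same two lines would report the earliest and latest event years, which can then be compared with the stated 1962 to 2023 range.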


In [30]:
# Displaying the column names of the DataFrame
df.columns 

Index(['Event.Id', 'Investigation.Type', 'Accident.Number', 'Event.Date',
       'Location', 'Country', 'Latitude', 'Longitude', 'Airport.Code',
       'Airport.Name', 'Injury.Severity', 'Aircraft.damage',
       'Aircraft.Category', 'Registration.Number', 'Make', 'Model',
       'Amateur.Built', 'Number.of.Engines', 'Engine.Type', 'FAR.Description',
       'Schedule', 'Purpose.of.flight', 'Air.carrier', 'Total.Fatal.Injuries',
       'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured',
       'Weather.Condition', 'Broad.phase.of.flight', 'Report.Status',
       'Publication.Date'],
      dtype='object')

In [31]:
# Select the columns relevant to the analysis
relevant_columns = ["Aircraft.Category", "Make", "Model", "Injury.Severity", "Total.Fatal.Injuries",
                    "Total.Serious.Injuries", "Total.Minor.Injuries", "Total.Uninjured", "Aircraft.damage",
                    "Number.of.Engines", "Engine.Type", "Weather.Condition", "Broad.phase.of.flight"]
df_relevant = df[relevant_columns]
df_relevant.head()

| | Aircraft.Category | Make | Model | Injury.Severity | Total.Fatal.Injuries | Total.Serious.Injuries | Total.Minor.Injuries | Total.Uninjured | Aircraft.damage | Number.of.Engines | Engine.Type | Weather.Condition | Broad.phase.of.flight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | Stinson | 108-3 | Fatal(2) | 2.0 | 0.0 | 0.0 | 0.0 | Destroyed | 1.0 | Reciprocating | UNK | Cruise |
| 1 | NaN | Piper | PA24-180 | Fatal(4) | 4.0 | 0.0 | 0.0 | 0.0 | Destroyed | 1.0 | Reciprocating | UNK | Unknown |
| 2 | NaN | Cessna | 172M | Fatal(3) | 3.0 | NaN | NaN | NaN | Destroyed | 1.0 | Reciprocating | IMC | Cruise |
| 3 | NaN | Rockwell | 112 | Fatal(2) | 2.0 | 0.0 | 0.0 | 0.0 | Destroyed | 1.0 | Reciprocating | IMC | Cruise |
| 4 | NaN | Cessna | 501 | Fatal(1) | 1.0 | 2.0 | NaN | 0.0 | Destroyed | NaN | NaN | VMC | Approach |


In [38]:
# Drop every row that has a missing value in any of the selected columns
df_relevant = df_relevant.dropna(how='any', axis=0)
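
Dropping every row with any missing value can remove a large share of the data, so it may help to see which columns drive the loss before committing to it. A minimal sketch on a toy DataFrame, where the values and the names `toy`, `missing_per_column` and `complete_rows` are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "Aircraft.Category": [np.nan, "Airplane", "Helicopter", np.nan],
    "Make": ["Stinson", "Cessna", "Bell", "Piper"],
    "Number.of.Engines": [1.0, 1.0, np.nan, 1.0],
})

# Missing values per column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# Rows that survive dropping any row with at least one missing value.
complete_rows = toy.dropna()
print(f"kept {len(complete_rows)} of {len(toy)} rows")
```

If a single column accounts for most of the gaps, an alternative is to pass `subset=[...]` to `dropna` so that only the columns the analysis depends on decide which rows are removed.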

In [39]:
# Count the duplicate rows
df_relevant.duplicated().sum()

np.int64(463)
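
Because the selected columns leave out identifiers such as `Event.Id`, two separate events with identical values in every selected column are counted as duplicates. The sketch below, on a toy DataFrame with made-up values, shows how `duplicated` counts repeats and how `keep=False` can be used to look at every member of a duplicate group before dropping anything.

```python
import pandas as pd

# Toy frame in which rows 0 and 1 are identical; the values are illustrative.
toy = pd.DataFrame({
    "Make": ["Cessna", "Cessna", "Piper", "Cessna"],
    "Model": ["172", "172", "PA24-180", "172M"],
    "Weather.Condition": ["VMC", "VMC", "UNK", "IMC"],
})

# duplicated() flags the second and later copies (keep="first" is the default).
n_duplicates = int(toy.duplicated().sum())

# keep=False flags every member of a duplicate group, useful for inspection.
duplicate_groups = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first copy of each group.
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(duplicate_groups), len(deduplicated))
```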

In [41]:
# Remove the duplicate rows, then inspect the cleaned DataFrame
df_relevant = df_relevant.drop_duplicates()
df_relevant.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3122 entries, 7 to 63908
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Aircraft.Category       3122 non-null   object 
 1   Make                    3122 non-null   object 
 2   Model                   3122 non-null   object 
 3   Injury.Severity         3122 non-null   object 
 4   Total.Fatal.Injuries    3122 non-null   float64
 5   Total.Serious.Injuries  3122 non-null   float64
 6   Total.Minor.Injuries    3122 non-null   float64
 7   Total.Uninjured         3122 non-null   float64
 8   Aircraft.damage         3122 non-null   object 
 9   Number.of.Engines       3122 non-null   float64
 10  Engine.Type             3122 non-null   object 
 11  Weather.Condition       3122 non-null   object 
 12  Broad.phase.of.flight   3122 non-null   object 
dtypes: float64(5), object(8)
memory usage: 341.5+ KB


In [42]:
# Statistical summary of the relevant DataFrame
df_relevant.describe(include='all')

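| | Aircraft.Category | Make | Model | Injury.Severity | Total.Fatal.Injuries | Total.Serious.Injuries | Total.Minor.Injuries | Total.Uninjured | Aircraft.damage | Number.of.Engines | Engine.Type | Weather.Condition | Broad.phase.of.flight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3122 | 3122 | 3122 | 3122 | 3122.0 | 3122.0 | 3122.0 | 3122.0 | 3122 | 3122.0 | 3122 | 3122 | 3122 |
| unique | 7 | 302 | 1091 | 12 | NaN | NaN | NaN | NaN | 3 | NaN | 6 | 3 | 12 |
| top | Airplane | Cessna | 172 | Non-Fatal | NaN | NaN | NaN | NaN | Substantial | NaN | Reciprocating | VMC | Landing |
| freq | 2733 | 1042 | 55 | 2360 | NaN | NaN | NaN | NaN | 1987 | NaN | 2776 | 2748 | 698 |
| mean | NaN | NaN | NaN | NaN | 0.442024 | 0.227418 | 0.310058 | 2.262332 | NaN | 1.146381 | NaN | NaN | NaN |
| std | NaN | NaN | NaN | NaN | 1.154794 | 0.689216 | 1.163895 | 13.92839 | NaN | 0.439227 | NaN | NaN | NaN |
| min | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN |
| 25% | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 1.0 | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 1.0 | NaN | 1.0 | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 2.0 | NaN | 1.0 | NaN | NaN | NaN |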
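
With `include='all'`, numeric and categorical statistics share one table, so each column is missing in about half of the rows. As a possible alternative, the summary can be split into two smaller tables. A minimal sketch on a toy DataFrame, where the values and the names `numeric_summary` and `categorical_summary` are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy frame mixing categorical and numeric columns; the values are made up.
toy = pd.DataFrame({
    "Make": ["Cessna", "Cessna", "Piper"],
    "Weather.Condition": ["VMC", "IMC", "VMC"],
    "Total.Fatal.Injuries": [0.0, 2.0, 1.0],
    "Number.of.Engines": [1.0, 1.0, 2.0],
})

# Default describe() summarises numeric columns only (count, mean, std, quartiles, max).
numeric_summary = toy.describe()

# Excluding numeric dtypes leaves the categorical summary (count, unique, top, freq).
categorical_summary = toy.describe(exclude=[np.number])

print(numeric_summary)
print(categorical_summary)
```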
