<a id='About'></a>
## About ##
The National Transportation Safety Board (NTSB) aviation accident database contains civil aviation accidents and selected incidents that occurred from 1962 to the present within the United States, its territories and possessions, and in international waters. Foreign investigations in which the NTSB participated as an accredited representative are also listed.  
Data description:
- Event.Id: ID of the aviation accident
- Investigation.Type: 
- Accident.Number: Number of the aviation accident
- Event.Date: Date when the accident occurred
- Location: Place and State where the accident occurred
- Country: Country where the accident occurred
- Latitude: Latitude where the accident occurred
- Longitude: Longitude where the accident occurred
- Airport.Code: 
- Airport.Name: 
- Injury.Severity: 
- Aircraft.damage: 
- Aircraft.Category: 
- Registration.Number: Registration number of the aircraft
- Make: Make of the aircraft
- Model: Model of the aircraft
- Amateur.Built: 
- Number.of.Engines: Number of engines of the aircraft
- Engine.Type: Type of engine of the aircraft
- FAR.Description: Federal Aviation Regulations description
- Schedule:
- Purpose.of.flight: 
- Air.carrier: 
- Total.Fatal.Injuries: Total number of fatal injuries
- Total.Serious.Injuries: Total number of serious injuries
- Total.Minor.Injuries: Total number of minor injuries
- Total.Uninjured: Total number of uninjured
- Weather.Condition: Condition of the weather when the accident occurred
- Broad.phase.of.flight: 
- Report.Status: 
- Publication.Date: Date when the information about the accident was published


<a id='Business Problem'></a>
## Business Problem
The client is expanding into new industries to diversify its portfolio. Specifically, they are interested in purchasing and operating airplanes for commercial and private enterprises, but they do not know anything about the potential risks of aircraft.  
Our task is to determine which aircraft pose the lowest risk for the company as it starts this new business endeavor.  
The client needs three concrete recommendations that translate our findings into actionable insights, which the head of the new aviation division can use to decide which aircraft to purchase.

## Data Understanding
TBD

**Table of contents**  
1. [About](#about)
2. [Business problem & Data understanding](#business-problem)
3. [Data preprocessing](#data-preprocessing)
- 3.1. [Exploring data](#exploring-data)
- 3.2. [Data preparation](#data-preparation)
4. [Exploratory data analysis](#exploratory-data-analysis)
- 4.1. [General data analysis](#general-data-analysis)
- 4.2. [Data analysis for recommendations](#data-analysis-for-recommendations)
5. [Summary](#summary)

## Data Preprocessing

### Exploring Data

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter("ignore")


In [2]:
df = pd.read_csv('data/AviationData.csv', encoding='latin-1')
df.head()



| | Event.Id | Investigation.Type | Accident.Number | Event.Date | Location | Country | Latitude | Longitude | Airport.Code | Airport.Name | ... | Purpose.of.flight | Air.carrier | Total.Fatal.Injuries | Total.Serious.Injuries | Total.Minor.Injuries | Total.Uninjured | Weather.Condition | Broad.phase.of.flight | Report.Status | Publication.Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20001218X45444 | Accident | SEA87LA080 | 1948-10-24 | MOOSE CREEK, ID | United States | | | | | ... | Personal | | 2.0 | 0.0 | 0.0 | 0.0 | UNK | Cruise | Probable Cause | |
| 1 | 20001218X45447 | Accident | LAX94LA336 | 1962-07-19 | BRIDGEPORT, CA | United States | | | | | ... | Personal | | 4.0 | 0.0 | 0.0 | 0.0 | UNK | Unknown | Probable Cause | 19-09-1996 |
| 2 | 20061025X01555 | Accident | NYC07LA005 | 1974-08-30 | Saltville, VA | United States | 36.9222 | -81.8781 | | | ... | Personal | | 3.0 | | | | IMC | Cruise | Probable Cause | 26-02-2007 |
| 3 | 20001218X45448 | Accident | LAX96LA321 | 1977-06-19 | EUREKA, CA | United States | | | | | ... | Personal | | 2.0 | 0.0 | 0.0 | 0.0 | IMC | Cruise | Probable Cause | 12-09-2000 |
| 4 | 20041105X01764 | Accident | CHI79FA064 | 1979-08-02 | Canton, OH | United States | | | | | ... | Personal | | 1.0 | 2.0 | | 0.0 | VMC | Approach | Probable Cause | 16-04-1980 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                88889 non-null  object 
 1   Investigation.Type      88889 non-null  object 
 2   Accident.Number         88889 non-null  object 
 3   Event.Date              88889 non-null  object 
 4   Location                88837 non-null  object 
 5   Country                 88663 non-null  object 
 6   Latitude                34382 non-null  object 
 7   Longitude               34373 non-null  object 
 8   Airport.Code            50248 non-null  object 
 9   Airport.Name            52790 non-null  object 
 10  Injury.Severity         87889 non-null  object 
 11  Aircraft.damage         85695 non-null  object 
 12  Aircraft.Category       32287 non-null  object 
 13  Registration.Number     87572 non-null  object 
 14  Make                    88826 non-null

In [15]:
df.isna().sum().sort_values(ascending=False)

Schedule                  76307
Air.carrier               72241
FAR.Description           56866
Aircraft.Category         56602
Longitude                 54516
Latitude                  54507
Airport.Code              38641
Airport.Name              36099
Broad.phase.of.flight     27165
Publication.Date          13771
Total.Serious.Injuries    12510
Total.Minor.Injuries      11933
Total.Fatal.Injuries      11401
Engine.Type                7077
Report.Status              6381
Purpose.of.flight          6192
Number.of.Engines          6084
Total.Uninjured            5912
Weather.Condition          4492
Aircraft.damage            3194
Registration.Number        1317
Injury.Severity            1000
Country                     226
Amateur.Built               102
Model                        92
Make                         63
Location                     52
Event.Date                    0
Accident.Number               0
Investigation.Type            0
Event.Id                      0
dtype: i

In [10]:
round((df.isna().sum().sort_values(ascending = False) / df.shape[0] * 100), 2).to_frame("%")

| Column | % |
|---|---|
| Schedule | 85.85 |
| Air.carrier | 81.27 |
| FAR.Description | 63.97 |
| Aircraft.Category | 63.68 |
| Longitude | 61.33 |
| Latitude | 61.32 |
| Airport.Code | 43.47 |
| Airport.Name | 40.61 |
| Broad.phase.of.flight | 30.56 |
| Publication.Date | 15.49 |
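
The two missing-value cells above can be folded into one table of counts and percentages. The following is a minimal sketch, not the notebook's own code: `missing_summary` is a hypothetical helper name, and the small frame is made-up sample data used only to show the shape of the result.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first."""
    counts = frame.isna().sum()
    # isna().mean() is the share of missing rows per column; scale it to percent.
    percent = (frame.isna().mean() * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "%": percent})
    return summary.sort_values("missing", ascending=False)


# Made-up sample data (not rows from AviationData.csv).
toy = pd.DataFrame(
    {
        "Schedule": [np.nan, np.nan, np.nan, "SCHD"],
        "Make": ["Cessna", "Piper", np.nan, "Boeing"],
        "Event.Id": ["a", "b", "c", "d"],
    }
)
summary = missing_summary(toy)
print(summary)
```

On the real data the call would be `missing_summary(df)`, which should reproduce the counts and percentages shown above in a single frame.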


In [9]:
df.duplicated().sum()

0
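
`df.duplicated()` compares entire rows, so a result of 0 does not rule out repeated values in a key column. The sketch below uses made-up rows and assumes `Event.Id` is the key of interest (an assumption; the source does not say which column should be unique) to show how a subset check differs from the full-row check.

```python
import pandas as pd

# Made-up rows: two records share an Event.Id but differ in another field.
toy = pd.DataFrame(
    {
        "Event.Id": ["A1", "A1", "B2"],
        "Accident.Number": ["X001", "X002", "X003"],
    }
)

# Full-row comparison: no two rows are identical in every column.
full_row_dupes = toy.duplicated().sum()

# keep=False flags every member of a repeated group, not only the later ones.
repeated_events = toy[toy.duplicated(subset=["Event.Id"], keep=False)]

print(full_row_dupes)
print(repeated_events)
```

Whether repeated key values are errors or legitimate records has to be decided from the data itself.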

1. The dataset has 88889 rows and 31 columns.
2. 5 columns are of type float64 and 26 are of type object.
3. No duplicate rows were found.
4. String values in the DataFrame are to be converted to lowercase.
5. Column names are to be converted to snake_case.
6. Columns Event.Date and Publication.Date are to be converted to datetime.
7. Columns Longitude and Latitude are to be converted to float.
8. Columns Number.of.Engines, Total.Fatal.Injuries, Total.Serious.Injuries, and Total.Uninjured are to be converted to integer.
9. Column Amateur.Built is to be converted to boolean.
10. Column Location is to be split into two columns holding Place and State values.
11. All columns except Event.Date, Accident.Number, Investigation.Type, and Event.Id have missing values.
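
Items 5 to 9 in the list above could be sketched as follows. This is an illustration on made-up rows, not the notebook's final preparation code. It assumes that `Amateur.Built` holds the strings "Yes" and "No" (not shown in the source), and it uses the day-first format seen in the `Publication.Date` preview. Because the count columns contain missing values, the sketch uses pandas' nullable `Int64` dtype; a plain `int` column cannot hold NaN.

```python
import pandas as pd

# Made-up rows that mimic the raw columns; not taken from the real file.
raw = pd.DataFrame(
    {
        "Event.Date": ["1979-08-02", "1982-01-01"],
        "Publication.Date": ["16-04-1980", None],
        "Latitude": ["36.9222", None],
        "Total.Fatal.Injuries": [1.0, None],
        "Amateur.Built": ["No", "Yes"],
    }
)

clean = raw.copy()

# Item 5: snake_case column names ("Event.Date" -> "event_date").
clean.columns = clean.columns.str.replace(".", "_", regex=False).str.lower()

# Item 6: dates. Values that do not match the format become NaT.
clean["event_date"] = pd.to_datetime(
    clean["event_date"], format="%Y-%m-%d", errors="coerce"
)
clean["publication_date"] = pd.to_datetime(
    clean["publication_date"], format="%d-%m-%Y", errors="coerce"
)

# Item 7: coordinates to float. Strings that cannot be parsed become NaN.
clean["latitude"] = pd.to_numeric(clean["latitude"], errors="coerce")

# Item 8: counts to nullable Int64, which keeps missing values as <NA>.
clean["total_fatal_injuries"] = clean["total_fatal_injuries"].astype("Int64")

# Item 9: "Yes"/"No" (assumed values) to nullable boolean.
clean["amateur_built"] = (
    clean["amateur_built"].map({"Yes": True, "No": False}).astype("boolean")
)

print(clean.dtypes)
```

Using `errors="coerce"` silently turns unexpected values into missing ones, so it is worth comparing missing-value counts before and after the conversion.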


### Data Preparation
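
One preparation step noted during exploration is splitting `Location` into place and state. A minimal sketch on made-up values in the "PLACE, ST" shape seen in the data preview follows. The output column names `place` and `state` are hypothetical, and rows without a comma are assumed to have no state.

```python
import pandas as pd

# Made-up sample values; None stands in for a missing Location.
locations = pd.Series(
    ["MOOSE CREEK, ID", "Saltville, VA", None, "ATLANTIC OCEAN"],
    name="location",
)

# Split on the last comma only, so place names that contain commas stay intact.
parts = locations.str.rsplit(",", n=1, expand=True)

split = pd.DataFrame(
    {
        # Lowercase both parts, matching the plan to lowercase string data.
        "place": parts[0].str.strip().str.lower(),
        "state": parts[1].str.strip().str.lower(),
    }
)

print(split)
```

Rows with no comma keep their full text in `place` and get a missing `state`, which makes them easy to inspect separately.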

## Exploratory Data Analysis

### General Data Analysis

### Data Analysis for Recommendations

## Summary