## Problem Statement
Over the relatively short 118-year history of powered human flight, the human race has achieved the unachievable time and again. Aviation progressed from the first slow, short hops of the Wright brothers in 1903 to supersonic jet flight only 44 years later, and to the moon only 22 years after that, and untold numbers of people perished along the way in the advancement of aviation technology. These days, we think almost nothing of boarding a machine that can fly us at altitudes approaching 40,000 feet at almost the speed of sound to get us from one side of the country to the other in mere hours. While the technology is proven and safe enough that we do not worry about the minuscule chance of a serious problem resulting in a devastating outcome, those problems do still arise. There are many causes for aviation mishaps and accidents, and whenever one occurs, investigators gather every possible data point that could lead to a better understanding of what happened and how to prevent it in the future.


## Data Sets
For this exercise, I will be analyzing aviation mishap data sets provided by the Federal Aviation Administration (FAA) Aviation Safety Information Analysis and Sharing (ASIAS) system. This system links to multiple source databases from the FAA, the National Transportation Safety Board (NTSB), the Bureau of Transportation Statistics (BTS), and the National Aeronautics and Space Administration (NASA), among others, that track various safety issues as well as incident and accident reports.

## Sample Questions I Seek to Answer
I will begin by exploring the data and producing summary statistics.

Are there significant clusters of primary and secondary contributing factors in aviation mishaps when looking at different types of mishaps? For example, do stall or spin mishap types occur more frequently when there is a specific combination of contributing factors (e.g., meteorological, physiological, or mechanical, etc.)?

Are there specific contributing factors (or groups of factors) that, if focused on via training or other resources, would have a more significant impact on aviation mishap rates for the most common mishap types?

Overall analysis:
What is the most common accident type? Incident type? Both combined?
What is the deadliest accident type (by fatalities)? Incident type? Both combined?

For these most common types, which combinations of contributing factors are most prevalent among the accidents and incidents?
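The overall-analysis questions map naturally onto counting and grouping operations in pandas. The following is only a minimal sketch on toy data: the column names (`mishap_type`, `primary_factor`, `fatalities`) and all values are hypothetical placeholders, not fields from the FAA file.

```python
import pandas as pd

# Toy records; every column name and value here is hypothetical.
toy = pd.DataFrame({
    "event_type": ["A", "A", "I", "A", "I", "A"],
    "mishap_type": ["stall", "stall", "gear-up landing", "stall",
                    "gear-up landing", "collision"],
    "primary_factor": ["weather", "weather", "mechanical", "pilot",
                       "mechanical", "weather"],
    "fatalities": [1, 0, 0, 2, 0, 4],
})

# Most common mishap type overall (counts per type, largest first).
most_common = toy["mishap_type"].value_counts()

# Deadliest mishap type by total fatalities.
deadliest = (
    toy.groupby("mishap_type")["fatalities"].sum().sort_values(ascending=False)
)

# How often each (mishap type, primary factor) combination occurs.
combos = (
    toy.groupby(["mishap_type", "primary_factor"])
    .size()
    .sort_values(ascending=False)
)

print(most_common.head(3))
print(deadliest.head(3))
print(combos.head(3))
```

Filtering on `event_type` before grouping would answer the same questions for accidents or incidents separately.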

Questions for follow-up:
Pilot analysis:
Does the total flight time (flight hours) of the pilot make a difference?
Does the total number of flight hours in the make/model make a difference?
Does the number of flight hours in the preceding 90 days make a difference?

Aircraft Analysis:
Does the aircraft type make a difference?
Does the engine make a difference?


## Challenges
1. Multiple organizations provide several data sets, each of which is very large. This required a thorough study of each data set to determine the best combination of data and questions to ask.

2. The analysis is further complicated by the fact that the main data set refers to multiple other tables that decode the various codes recorded in the main table.
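The second challenge, decoding coded columns through separate lookup tables, is typically handled with a left merge. This is a minimal sketch under stated assumptions: the table names, the `gear_code` column, and the single-letter codes are all hypothetical; only the description strings resemble values that appear in the FAA data.

```python
import pandas as pd

# Hypothetical main table with a coded column.
events = pd.DataFrame({"event_id": [1, 2, 3], "gear_code": ["T", "C", "X"]})

# Hypothetical decode table mapping each code to a description.
gear_lookup = pd.DataFrame({
    "gear_code": ["T", "C", "S"],
    "gear_desc": ["Wheeled-Tricycle", "Wheeled-Conventional", "Skids"],
})

# A left merge keeps every event; codes missing from the lookup get NaN.
# validate="many_to_one" raises if the lookup contains duplicate codes.
decoded = events.merge(
    gear_lookup, on="gear_code", how="left", validate="many_to_one"
)

# Rows whose code could not be decoded deserve a closer look.
unmatched = decoded[decoded["gear_desc"].isna()]
print(decoded)
print(len(unmatched))
```

Checking `unmatched` after each merge helps catch codes that the decode table does not cover.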


## Project Execution
This project required me to look at many different data sets from the NTSB, FAA, BTS, and NASA. I selected a data set from the FAA's Accident and Incident Data System (AIDS). This database is very thorough and spans many years (pre-1975, then each year through 2015). I began by looking at the data download site at http://av-info.faa.gov/dd_sublevel.asp?Folder=%5CAID.

The data is broken down by year groups, with one tab-delimited text file per five-year period. The resulting data set is massive. I began by choosing the most recent year group (2015-2019), which includes incidents up to July 29th, 2017. This one file contains 5,775 rows and 180 columns.

There are several ways to approach this. I considered combining all data files into one DataFrame for analysis, but for the scope of this project, I chose to keep my analysis to the most recent year group. In the future, it would be interesting to conduct the same analysis on the combined data set to look at the trends of accidents and incidents over time.
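For reference, combining all year-group files into one DataFrame could look roughly like the sketch below. It assumes every file shares the same tab-delimited layout; `load_year_groups` is a hypothetical helper, the file name `A2010_14.txt` is an assumed name for another year group, and the demo files contain made-up stand-in rows.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_year_groups(paths):
    """Read each tab-delimited file and stack them into one DataFrame."""
    frames = []
    for path in paths:
        # dtype=str preserves codes with leading zeros such as '091'.
        frame = pd.read_csv(path, sep="\t", dtype=str)
        # Remember which year-group file each row came from.
        frame["source_file"] = Path(path).name
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Demo with two tiny stand-in files; names and contents are illustrative.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "A2010_14.txt"
    second = Path(tmp) / "A2015_19.txt"
    first.write_text("c5\tc1\n20100101000019A\tA\n")
    second.write_text("c5\tc1\n20150101000039A\tA\n20150102002209I\tI\n")
    combined = load_year_groups([first, second])

print(combined.shape)
```

Keeping a `source_file` column makes it easy to group by year group later when looking at trends over time.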

In [20]:
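# Import sample data file from the FAA Accident and Incident Data System (AIDS):
import pandas as pd
url = './data/A2015_19.txt'
# A few rows have 181 fields instead of 180, so malformed lines must be skipped for the file to load.
# on_bad_lines='warn' replaces error_bad_lines=False, which was removed in pandas 2.0.
faa = pd.read_table(url, sep='\t', on_bad_lines='warn')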

Skipping line 1095: expected 180 fields, saw 181
Skipping line 1798: expected 180 fields, saw 181
Skipping line 2414: expected 180 fields, saw 181
Skipping line 2938: expected 180 fields, saw 181
Skipping line 3990: expected 180 fields, saw 181
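The messages above show that a handful of rows carry 181 fields instead of 180. Before discarding such rows, it can help to locate them and inspect what went wrong. The following is a minimal sketch using only the standard library; `find_bad_lines` is a hypothetical helper, the sample text is made up, and its line numbering is not guaranteed to match the numbering pandas reports.

```python
import csv
import io


def find_bad_lines(handle, delimiter="\t"):
    """Return (line_number, field_count) for rows wider or narrower than the header."""
    # QUOTE_NONE treats quote characters as ordinary text, so one row is one line.
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    expected = len(next(reader))  # the header defines the expected width
    return [
        (reader.line_num, len(row)) for row in reader if len(row) != expected
    ]


# Made-up sample: the third line has one extra field.
sample = io.StringIO("a\tb\tc\n1\t2\t3\n4\t5\t6\t7\n8\t9\t10\n")
bad = find_bad_lines(sample)
print(bad)
```

For the real file, the same function could be called on `open('./data/A2015_19.txt', newline='')` to list the offending lines for manual review.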



## Data Munging Phase
In this phase I examined basic information about the data set. After I pulled the data into my notebook, I found that I did not need many of the 180 columns. I imported the data dictionary from a text file into a spreadsheet (./data/Data Dictionary.xlsx) and reviewed the columns to determine which ones I could drop from the main data set. I annotated the spreadsheet with my keep/drop decision. Because some of the questions I wanted to examine need features that others do not, I kept the 'faa' data set as complete as possible and split off subsets of the data as needed. I also renamed the columns so they would make more sense during the analysis phase.
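The keep/drop and rename steps described above can be sketched as a column selection followed by a rename. This is illustrative only: the stand-in rows are made up, and the readable names in `rename_map` are guesses at what the coded columns mean rather than the names used in the actual analysis.

```python
import pandas as pd

# Stand-in for a few columns of the raw table; values are illustrative.
raw = pd.DataFrame({
    "c5": ["20150101000039A", "20150102002209I"],
    "c1": ["A", "I"],
    "c158": ["Wheeled-Tricycle", "Wheeled-Conventional"],
    "end_of_record": [None, None],
})

# Hypothetical keep list with readable names; columns not listed are dropped.
rename_map = {"c5": "report_id", "c1": "event_type", "c158": "landing_gear"}

# Selecting and renaming returns a new frame, so `raw` stays whole.
subset = raw[list(rename_map)].rename(columns=rename_map)
print(subset.columns.tolist())
```

Because the selection produces a copy, different subsets can be split off for different questions without altering the full table.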

In [72]:
#Data Munging Phase

#Get basic info on the data set (Python 3 syntax; faa.info() prints its own summary and returns None):
#faa.info()
#print("Shape: ", faa.shape, '\n')
#print("Columns: ", faa.columns, '\n')

#Based on the Data Dictionary (./data/Data Dictionary.xlsx), many of the 180 features can be dropped from the main data set.

#There should only be two event types: A or I (accident or incident)


In [56]:
faa.loc[faa['c1'].isin(['A', 'I'])]

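| Unnamed: 0 | c5 | c1 | c2 | c3 | c4 | c6 | c7 | c8 | c9 | c10 | ... | c154 | c156 | c158 | c161 | c163 | c183 | c191 | c229 | c230 | end_of_record |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 20150101000039A | A | 091 | | | 2015 | 01 | 01 | 20150101 | 2130 | ... | | | Wheeled-Tricycle | | | | | 19571231.0 | | |
| 1 | 20150101000239A | A | 091 | | | 2015 | 01 | 01 | 20150101 | 1628 | ... | | | Wheeled-Conventional | | | | | 19840924.0 | | |
| 2 | 20150102000479A | A | 091 | | | 2015 | 01 | 02 | 20150102 | 1730 | ... | | | Wheeled-Tricycle | | | | | 19580106.0 | | |
| 3 | 20150102000709A | A | 091 | | | 2015 | 01 | 02 | 20150102 | 1755 | ... | | | Wheeled-Tricycle | | | | | 19661113.0 | | |
| 4 | 20150102001609A | A | 091 | | | 2015 | 01 | 02 | 20150102 | 1640 | ... | | | Skids | | | | | 19640718.0 | | |
| 5 | 20150102002209I | I | 091 | | | 2015 | 01 | 02 | 20150102 | 1547 | ... | | | Wheeled-Conventional | | | | | 19781026.0 | | |
| 6 | 20150103000029I | I | 091 | | | 2015 | 01 | 03 | 20150103 | 1630 | ... | | | Wheeled-Tricycle | | | | | 19510404.0 | | |
| 7 | 20150103000109A | A | 091 | | | 2015 | 01 | 03 | 20150103 | 1020 | ... | | | Wheeled-Tricycle | | | | | 19671124.0 | | |
| 8 | 20150103000299A | A | 091 | | | 2015 | 01 | 03 | 20150103 | 1700 | ... | | | Wheeled-Tricycle | | | | | 19740228.0 | | |
| 9 | 20150103005989A | A | 091 | | | 2015 | 01 | 03 | 20150103 | 1345 | ... | | | Wheeled-Conventional | | | | | 19320706.0 | | |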
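
The assumption that the event-type column holds only A or I can be checked directly rather than by eye. The sketch below uses a toy frame in place of `faa`; the stray values (`None` and `"X"`) are invented to show what an unexpected entry would look like.

```python
import pandas as pd

# Toy stand-in for the event-type column; None and "X" are invented anomalies.
toy = pd.DataFrame({"c1": ["A", "I", "A", None, "X"]})

# Tally every distinct value, including missing ones.
counts = toy["c1"].value_counts(dropna=False)

# Flag rows whose event type is neither A nor I (missing values count as invalid).
valid = toy["c1"].isin(["A", "I"])
unexpected = toy.loc[~valid]

print(counts)
print(len(unexpected))
```

If `unexpected` is empty on the real data, the filter on `c1` keeps every row and the A/I assumption holds.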
