# Exploratory Data Analysis of Aviation Accident Data


## Description

This Jupyter Notebook performs an exploratory data analysis (EDA) on a dataset containing information about aviation accidents. The dataset includes details about different aviation accidents, such as accident type, location, date, and factors contributing to the accidents.

The purpose of this analysis is to gain insights into the dataset and answer several key questions:

1. What are the most common causes of accidents among Part 121 air carriers?
2. Are there any seasonal or time-related patterns in accidents?
3. How has the safety record of Part 121 air carriers evolved over the years?
4. Are there any correlations between specific factors (e.g., weather conditions, aircraft types) and accident severity?

By exploring and visualizing the data, the aim is to provide a better understanding of aviation accidents and identify trends and patterns that may be useful for improving aviation safety.

---

**Note**: This notebook is for educational and analytical purposes only. It does not cover any specific incident or accident investigation.


# Imports
## Libraries

In [15]:
# Standard Library Imports
import os
import sys
from datetime import datetime

# Third-Party Library Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns




# Print a timestamp to indicate when the imports were executed
print(f"Imports executed at {datetime.now()}")

Imports executed at 2023-12-29 11:58:50.729940


## Data

Load the case data from a static URL pinned to a specific commit

In [16]:
# Load data from JSON (the source file is JSON, so use read_json rather than read_csv)
df = pd.read_json('https://raw.githubusercontent.com/flyguy221/NTSB-Aviation-Investigation-Analysis/aea43e7658cff1ec3ce3364d9e619191f0bff7a5/data/raw/cases2023-12-29_06-03/cases2023-12-29_06-03.json')

# Print a timestamp to indicate when the import was executed
print(f"Data Import Completed at {datetime.now()}")

Error: ('01000', "[01000] [unixODBC][Driver Manager]Can't open lib 'Microsoft Access Driver (*.mdb, *.accdb)' : file not found (0) (SQLDriverConnect)")

# Data Overview

In [12]:
# Display the first few rows of the dataset
print(df.head())

# Get basic info about the dataset (info() prints directly and returns None)
df.info()

# Generate summary statistics
print(df.describe())

# Check the dimensions of the DataFrame
print(df.shape)

       NtsbNo EventType    Mkey             EventDate            City  \
0  DCA23LA445       ACC  193031  2023-09-07T02:32:00Z          Hebron   
1  DCA23LA382       ACC  192742  2023-07-28T08:16:00Z    Myrtle Beach   
2  DCA23LA383       ACC  192743  2023-07-24T15:30:00Z  East Palestine   
3  DCA23WA394       ACC  192790  2023-07-24T11:00:00Z       Fuimicino   
4  DCA23WA363       ACC  192633  2023-07-17T01:30:00Z        Tel Aviv   

            State        Country ReportNo              N# SerialNumber  ...  \
0        Kentucky  United States      NaN          N37531        66119  ...   
1  Atlantic Ocean  United States      NaN          N77431        32833  ...   
2            Ohio  United States      NaN          N279WN        32532  ...   
3             NaN          Italy      NaN          N189DN        25990  ...   
4   Other Foreign         Israel      NaN  OE-LRG, N918FD        24290  ...   

         FAR      AirCraftDamage WeatherCondition  \
0        121                 NaN 

**The dataset has 41 columns and 1286 rows, with a mix of data types, including integers, floats, booleans, and objects (strings).**


# Data Cleaning

Before diving into the analysis, the following steps will be completed:

* Handle missing values: Identify and handle columns with missing values.
* Data type conversion: Convert columns to appropriate data types.
* Remove unnecessary columns: Drop columns that are not relevant to the analysis.
* Exploratory Data Analysis (EDA): Once the data is clean, explore it to understand its characteristics and to identify potential patterns or relationships. This will entail creating visualizations and summary statistics for different columns.
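The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual cleaning code: the `clean` helper and its `drop_cols` default are hypothetical, the sample rows are synthetic, and only the column names (`EventDate`, `FatalInjuryCount`, `ReportNo`) come from the dataset preview.

```python
import pandas as pd

# Synthetic rows for illustration only; column names follow the dataset preview.
sample = pd.DataFrame({
    "NtsbNo": ["A1", "A2", "A3"],
    "EventDate": ["2023-09-07T02:32:00Z", "2023-07-28T08:16:00Z", "not a date"],
    "FatalInjuryCount": [0.0, None, 2.0],
    "ReportNo": [None, None, None],
})


def clean(df, drop_cols=("ReportNo",)):
    """Hypothetical helper: drop columns, parse dates, normalize counts."""
    # Remove columns judged irrelevant (the default choice is an assumption).
    out = df.drop(columns=[c for c in drop_cols if c in df.columns])
    # Unparseable timestamps become NaT instead of raising an error.
    out["EventDate"] = pd.to_datetime(out["EventDate"], errors="coerce", utc=True)
    # The nullable integer dtype keeps missing counts as <NA> rather than guessing 0.
    out["FatalInjuryCount"] = out["FatalInjuryCount"].astype("Int64")
    return out


cleaned = clean(sample)
print(cleaned.dtypes)
```

Keeping missing injury counts as `<NA>` instead of filling them with zero is a deliberate choice in this sketch: a zero would claim that no one was injured, which the data does not actually say.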

## Handle Missing Values

Check the number of missing values in each column

In [13]:
# Count missing values per column (the result is a Series, not a DataFrame)
missing_data = df.isna().sum()


print(missing_data)

NtsbNo                            0
EventType                         0
Mkey                              0
EventDate                         0
City                              3
State                           137
Country                          35
ReportNo                       1205
N#                                0
SerialNumber                     94
HasSafetyRec                      0
Mode                              0
ReportType                      104
OriginalPublishedDate           161
DocketOriginalPublishedDate     835
HighestInjuryLevel              478
FatalInjuryCount                  3
SeriousInjuryCount                3
MinorInjuryCount                  3
ProbableCause                   244
Findings                        892
EventID                        1284
Latitude                        304
Longitude                       304
Make                              0
Model                             0
AirCraftCategory                 22
AirportID                   

In [14]:
# Express the missing counts as a percentage of all rows
percentage_missing_data = (missing_data / len(df)) * 100
print(percentage_missing_data)

NtsbNo                          0.000000
EventType                       0.000000
Mkey                            0.000000
EventDate                       0.000000
City                            0.233281
State                          10.653188
Country                         2.721617
ReportNo                       93.701400
N#                              0.000000
SerialNumber                    7.309487
HasSafetyRec                    0.000000
Mode                            0.000000
ReportType                      8.087092
OriginalPublishedDate          12.519440
DocketOriginalPublishedDate    64.930016
HighestInjuryLevel             37.169518
FatalInjuryCount                0.233281
SeriousInjuryCount              0.233281
MinorInjuryCount                0.233281
ProbableCause                  18.973561
Findings                       69.362364
EventID                        99.844479
Latitude                       23.639191
Longitude                      23.639191
Make            
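One way to act on these percentages is to flag columns whose share of missing values exceeds a cutoff. The sketch below assumes a 60% threshold and a hypothetical `sparse_columns` helper, and runs on synthetic rows; with that cutoff, the output above suggests `ReportNo`, `DocketOriginalPublishedDate`, `Findings`, and `EventID` would be flagged.

```python
import pandas as pd


def sparse_columns(df, threshold=0.6):
    """Hypothetical helper: return the missing share of columns above `threshold`."""
    # isna().mean() gives the fraction of missing values per column.
    share = df.isna().mean()
    return share[share > threshold].sort_values(ascending=False)


# Synthetic rows for illustration only.
sample = pd.DataFrame({
    "NtsbNo": ["A1", "A2", "A3", "A4"],
    "ReportNo": [None, None, None, "R1"],
    "City": ["Hebron", None, "Tel Aviv", "Myrtle Beach"],
})

flagged = sparse_columns(sample)
print(flagged)
```

Whether a flagged column should be dropped still depends on the question being asked; the threshold only narrows down the candidates.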

# Questions

## What are the most common causes of accidents among Part 121 air carriers?

Answering Specific Questions:

* Most common causes: analyze the 'ProbableCause' column.
* Seasonal or time-related patterns: explore the 'EventDate' column and consider creating time-based visualizations.
* Evolution of the safety record: analyze the 'EventDate' and 'cm_completionStatus' columns over time.
* Correlations: consider using correlation matrices and visualizations to identify relationships between factors and accident severity.

Data Visualization: Visualize your findings using libraries like Matplotlib and Seaborn. This will help you communicate your results effectively.
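For the time-related questions, a first pass could simply count events per year and per calendar month from the `EventDate` column. The sketch below is an assumption about how that might look, using a few illustrative timestamps rather than the real dataset.

```python
import pandas as pd

# Illustrative timestamps only, in the same ISO 8601 format as EventDate.
sample = pd.DataFrame({
    "EventDate": [
        "2023-09-07T02:32:00Z",
        "2023-07-28T08:16:00Z",
        "2023-07-24T15:30:00Z",
        "2022-07-17T01:30:00Z",
    ],
})

# Parse the strings into timezone-aware datetimes.
dates = pd.to_datetime(sample["EventDate"], utc=True)

# Events per year: a rough view of how the record evolves over time.
per_year = dates.dt.year.value_counts().sort_index()

# Events per calendar month, pooled across years: a rough view of seasonality.
per_month = dates.dt.month.value_counts().sort_index()

print(per_year)
print(per_month)
```

Either Series can be passed to `plot(kind="bar")` to produce a Matplotlib bar chart. Raw counts do not account for changes in traffic volume, so they show how many accidents occurred, not how accident rates changed.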

Hypothesis Testing: If your analysis involves hypothesis testing, you can perform statistical tests using libraries such as SciPy.
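As an example of what such a test involves, the sketch below cross-tabulates two categorical columns and computes the Pearson chi-square statistic by hand with pandas and NumPy. The records and category labels (`VMC`, `IMC`, `Minor`, `Fatal`) are synthetic assumptions; only the column names come from the dataset. In practice, `scipy.stats.chi2_contingency` computes the statistic together with a p-value (and applies a continuity correction to 2x2 tables by default, so its statistic would differ from the uncorrected one here).

```python
import numpy as np
import pandas as pd

# Synthetic records for illustration only.
records = pd.DataFrame({
    "WeatherCondition": ["VMC"] * 30 + ["IMC"] * 30,
    "HighestInjuryLevel": ["Minor"] * 20 + ["Fatal"] * 10
                          + ["Minor"] * 10 + ["Fatal"] * 20,
})

# Observed counts for each weather/severity combination.
observed = pd.crosstab(records["WeatherCondition"], records["HighestInjuryLevel"])

# Expected counts under independence: row total * column total / grand total.
obs = observed.to_numpy()
row_totals = obs.sum(axis=1)
col_totals = obs.sum(axis=0)
expected = np.outer(row_totals, col_totals) / obs.sum()

# Pearson chi-square statistic, without continuity correction.
chi2 = ((obs - expected) ** 2 / expected).sum()

print(observed)
print(f"chi-square statistic: {chi2:.3f}")
```

A large statistic relative to the chi-square distribution with (rows - 1) x (columns - 1) degrees of freedom suggests the two factors are not independent, but it says nothing about causation.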

Documentation: Make sure to document your code and analysis steps clearly in your Jupyter Notebook. This will help you and others understand the analysis process.

Further Analysis: Depending on your findings, you may need to perform more in-depth analyses, such as time series analysis or regression modeling.