# Exploratory Data Analysis of Aviation Accident Data


## Description

This Jupyter Notebook performs an exploratory data analysis (EDA) of a dataset of aviation accidents. Each record describes one accident, including details such as accident type, location, date, and the factors that contributed to it.

The purpose of this analysis is to gain insights into the dataset and answer several key questions:

1. What are the most common causes of accidents among Part 121 air carriers?
2. Are there any seasonal or time-related patterns in accidents?
3. How has the safety record of Part 121 air carriers evolved over the years?
4. Are there any correlations between specific factors (e.g., weather conditions, aircraft types) and accident severity?

Exploring and visualizing the data should provide a better understanding of aviation accidents and reveal trends and patterns that may be useful for improving aviation safety.

---

**Note**: This notebook is for educational and analytical purposes only. It does not cover any specific incident or accident investigation.


# Imports
## Libraries

In [4]:
# Standard Library Imports
import os
import sys
from datetime import datetime

# Third-Party Library Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Print a timestamp to indicate when the imports were executed
print(f"Imports executed at {datetime.now()}")

Imports executed at 2023-12-29 10:49:16.815928


## Data

In [5]:
# Load data from JSON
df = pd.read_json('https://raw.githubusercontent.com/flyguy221/NTSB-Aviation-Investigation-Analysis/aea43e7658cff1ec3ce3364d9e619191f0bff7a5/data/raw/cases2023-12-29_06-03/cases2023-12-29_06-03.json')

# Print a timestamp to indicate when the import was executed
print(f"Data Import Completed at {datetime.now()}")

Data Import Completed at 2023-12-29 10:49:18.453913


# Data Overview

In [6]:
# Display the first few rows of the dataset
print(df.head())

# Get basic info about the dataset
# (info() prints its report directly and returns None, so it is not wrapped in print())
df.info()

# Generate summary statistics
print(df.describe())

# Check the dimensions of the DataFrame
print(df.shape)

   cm_mkey airportId airportName  cm_closed cm_completionStatus  \
0   193031      None        None       True           Completed   
1   192742      None        None       True           Completed   
2   192743      None        None       True           Completed   
3   192790      None        None       True                 N/A   
4   192633      None        None       True                 N/A   

   cm_hasSafetyRec cm_highestInjury  cm_isStudy   cm_mode  cm_ntsbNum  ...  \
0            False          Serious         0.0  Aviation  DCA23LA445  ...   
1            False          Serious         0.0  Aviation  DCA23LA382  ...   
2            False          Serious         0.0  Aviation  DCA23LA383  ...   
3            False             None         0.0  Aviation  DCA23WA394  ...   
4            False             None         0.0  Aviation  DCA23WA363  ...   

  factualNarrative prelimNarrative cm_fatalInjuryCount cm_minorInjuryCount  \
0             None            None                

**The dataset has 41 columns and 1286 rows, with a mix of data types, including integers, floats, booleans, and objects (strings).**
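The mix of data types can also be checked directly instead of being read off the `info()` report. The following is a minimal sketch on a tiny made-up frame, not the real dataset; only the column names are borrowed from the output above.

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset; values are illustrative only.
sample = pd.DataFrame({
    "cm_mkey": [193031, 192742, 192743],
    "cm_closed": [True, True, False],
    "cm_isStudy": [0.0, 0.0, 0.0],
    "cm_mode": ["Aviation", "Aviation", "Aviation"],
})

# Count how many columns fall under each dtype.
dtype_counts = sample.dtypes.astype(str).value_counts()
print(dtype_counts)

# Rows and columns, as reported by `shape`.
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```

In the notebook itself, the same two calls would be applied to `df` rather than to `sample`.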


# Data Cleaning

Before diving into the analysis, the following data cleaning steps will be completed:

* Handle missing values: Identify and handle columns with missing values.
* Data type conversion: Convert columns to appropriate data types.
* Remove unnecessary columns: Drop columns that are not relevant to the analysis.

Exploratory Data Analysis (EDA) then begins: the data is explored to better understand its characteristics and to identify potential patterns or relationships, using visualizations and summary statistics for different columns.
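The data type conversion and column removal steps listed above could be sketched roughly as follows. The rows are made up, the date format of `cm_eventDate` is an assumption, and the choice of which columns to drop is purely illustrative; `not_a_real_column` is a hypothetical name included only to show how absent columns are handled.

```python
import pandas as pd

# Made-up rows for illustration; the real notebook would operate on the loaded `df`.
sample = pd.DataFrame({
    "cm_mkey": [1, 2, 3],
    "cm_eventDate": ["2023-01-15T10:00:00Z", "2023-07-04T18:30:00Z", "not a date"],
    "cm_isStudy": [0.0, 0.0, 1.0],
    "cm_mode": ["Aviation", "Aviation", "Aviation"],
})

# Parse date strings; entries that cannot be parsed become NaT instead of raising.
sample["cm_eventDate"] = pd.to_datetime(
    sample["cm_eventDate"], errors="coerce", utc=True
)

# A 0.0/1.0 flag reads more naturally as a boolean.
# Assumes the column has no missing values, since NaN would be converted to True.
sample["cm_isStudy"] = sample["cm_isStudy"].astype(bool)

# Drop columns judged irrelevant; errors="ignore" skips names that are absent.
columns_to_drop = ["cm_mode", "not_a_real_column"]
sample = sample.drop(columns=columns_to_drop, errors="ignore")

print(sample.dtypes)
```

Using `errors="coerce"` trades strictness for robustness: malformed dates are not reported, so it is worth counting the resulting `NaT` values afterwards.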

## Handle Missing Values
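
One possible way to approach this step is to measure how much of each column is missing, drop columns that are entirely empty, and fill the rest with explicit placeholders. The sketch below uses made-up rows, and the `"Unknown"` label is a hypothetical choice rather than something taken from the dataset.

```python
import pandas as pd

# Made-up rows; column names mirror those shown in the data overview.
sample = pd.DataFrame({
    "cm_mkey": [1, 2, 3, 4],
    "airportName": [None, None, None, None],
    "cm_highestInjury": ["Serious", None, "Serious", None],
    "cm_fatalInjuryCount": [0.0, None, 2.0, 0.0],
})

# Share of missing values per column, largest first.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)

# Drop columns that contain no data at all.
cleaned = sample.dropna(axis="columns", how="all")

# Make missing categories explicit (a judgment call, not a rule).
# Numeric counts are left untouched so that "missing" is not confused with zero.
cleaned = cleaned.fillna({"cm_highestInjury": "Unknown"})
print(cleaned)
```

Whether a missing `cm_highestInjury` means "no injury" or "not recorded" cannot be decided from the code alone, so the placeholder should be revisited once the data source's conventions are known.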

# Questions

Answering Specific Questions:

* Most common causes of accidents: analyze the 'cm_probableCause' column.
* Seasonal or time-related patterns: explore the 'cm_eventDate' column and consider creating time-based visualizations.
* Evolution of safety records: analyze the 'cm_eventDate' and 'cm_completionStatus' columns over time.
* Correlations: consider using correlation matrices and visualizations to identify relationships between factors and accident severity.

Data Visualization: Visualize your findings using libraries like Matplotlib and Seaborn. This will help you communicate your results effectively.
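
As a rough starting point for the first three questions, the counts behind such visualizations might be computed as below. The rows are made up, and the sketch assumes `cm_eventDate` has already been converted to a datetime type and that identical `cm_probableCause` texts can be counted as-is; real free-text causes would probably need normalizing first.

```python
import pandas as pd

# Made-up records; not taken from the real dataset.
sample = pd.DataFrame({
    "cm_eventDate": pd.to_datetime([
        "2021-01-10", "2021-07-22", "2022-07-03", "2022-12-15", "2023-07-30",
    ]),
    "cm_probableCause": [
        "Encounter with turbulence",
        "Encounter with turbulence",
        "Ground collision",
        None,
        "Encounter with turbulence",
    ],
})

# Most frequent probable-cause texts (missing values are skipped by default).
cause_counts = sample["cm_probableCause"].value_counts()

# Accidents per calendar month, to look for seasonality.
per_month = sample["cm_eventDate"].dt.month.value_counts().sort_index()

# Accidents per year, a rough view of the trend over time.
per_year = sample["cm_eventDate"].dt.year.value_counts().sort_index()

print(cause_counts, per_month, per_year, sep="\n\n")
```

Each of these Series can be passed straight to a plotting call, for example `per_month.plot(kind="bar")` with Matplotlib installed.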

Hypothesis Testing: If your analysis involves hypothesis testing, you can perform statistical tests using libraries such as SciPy.
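
For example, whether accident severity is associated with a categorical factor could be examined with a chi-square test of independence. The sketch below is illustrative only: the records are made up, and `weather` is a hypothetical column name that has not been confirmed to exist in the dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up records; "weather" is a hypothetical column, "cm_highestInjury" appears in the overview.
sample = pd.DataFrame({
    "weather": ["VMC", "VMC", "IMC", "IMC", "VMC", "IMC", "VMC", "IMC"],
    "cm_highestInjury": [
        "Minor", "Minor", "Serious", "Serious",
        "Minor", "Serious", "Serious", "Minor",
    ],
})

# Contingency table: rows are weather categories, columns are injury levels.
table = pd.crosstab(sample["weather"], sample["cm_highestInjury"])
print(table)

# Chi-square test of independence between the two categorical variables.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

With counts this small the chi-square approximation is unreliable, so the result here means nothing statistically; the point is only the shape of the workflow (cross-tabulate, then test).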

Documentation: Make sure to document your code and analysis steps clearly in your Jupyter Notebook. This will help you and others understand the analysis process.

Further Analysis: Depending on your findings, you may need to perform more in-depth analyses, such as time series analysis or regression modeling.
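
As one simple instance of regression modeling, a straight line can be fitted to yearly accident counts to estimate the average change per year. The counts below are made up; in the notebook they would come from grouping `cm_eventDate` by year.

```python
import numpy as np
import pandas as pd

# Made-up yearly accident counts, indexed by year.
yearly = pd.Series([30, 27, 25, 22, 20], index=[2019, 2020, 2021, 2022, 2023])

# Fit count = slope * year + intercept by ordinary least squares.
years = yearly.index.to_numpy(dtype=float)
counts = yearly.to_numpy(dtype=float)
slope, intercept = np.polyfit(years, counts, deg=1)

print(f"Estimated change per year: {slope:.2f}")
```

A linear fit ignores exposure (for instance, the number of flights per year), so a falling count does not by itself show an improving safety record.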