# Exploratory Data Analysis of Aviation Accident Data


## Description

This Jupyter Notebook performs an exploratory data analysis (EDA) on a dataset containing information about aviation accidents. The dataset includes details about different aviation accidents, such as accident type, location, date, and factors contributing to the accidents.

The purpose of this analysis is to gain insights into the dataset and answer several key questions:

1. What are the most common causes of accidents among Part 121 air carriers?
2. Are there any seasonal or time-related patterns in accidents?
3. How has the safety record of Part 121 air carriers evolved over the years?
4. Are there any correlations between specific factors (e.g., weather conditions, aircraft types) and accident severity?

By exploring and visualizing the data, the aim is to provide a better understanding of aviation accidents and identify trends and patterns that may be useful for improving aviation safety.

---

**Note**: This notebook is for educational and analytical purposes only. It does not cover any specific incident or accident investigation.


# Imports
## Libraries

In [1]:
# Standard Library Imports
import os
import sys
from datetime import datetime

# Third-Party Library Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Custom function import
from scripts.load_data import load_data

# Print a timestamp to indicate when the imports were executed
print(f"Imports executed at {datetime.now()}")

Imports executed at 2023-12-30 12:36:26.890801


## Data

Please see [Data Loading](http://localhost:8888/notebooks/MyGitRepo/NTSB-Aviation-Investigation-Analysis/code/scripts/load_data.ipynb) for more information on the data loading process.


In [2]:
# Load all tables
loaded_data = load_data(load_all=True)

# Print a timestamp to indicate when the import was executed
print(f"Data Import Completed at {datetime.now()}")

Data Import Completed at 2023-12-30 12:36:38.558547


In [3]:
loaded_data

{'narratives':                 ev_id  Aircraft_Key  \
 0      20080211X00175             1   
 1      20080107X00026             1   
 2      20080107X00026             2   
 3      20080109X00036             1   
 4      20080107X00027             1   
 ...               ...           ...   
 25243  20231114193384             1   
 25244  20231114193385             1   
 25245  20231114193387             1   
 25246  20231115193389             1   
 25247  20231120193406             1   
 
                                            narr_complete  
 0                                              import     
 1      On January 1, 2008, about 1430 Pacific standar...  
 2      On January 1, 2008, about 1430 Pacific standar...  
 3      The private pilot was conducting a touch-and-g...  
 4      On January 3, 2008, approximately 0225 central...  
 ...                                                  ...  
 25243  On November 13, 2023, about 0800 Pacific stand...  
 25244  On November 14, 

Error Handling: Implement try-except blocks to handle potential errors during data loading. This could include handling file not found errors, incorrect format issues, or network-related problems when fetching from URLs.
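
A minimal sketch of this kind of error handling, assuming the tables are read from CSV files. The helper `safe_read_csv` and the file name `does_not_exist.csv` are hypothetical and are not part of `scripts.load_data`.

```python
import io

import pandas as pd


def safe_read_csv(source, **kwargs):
    """Return a DataFrame, or None if the source cannot be read (hypothetical helper)."""
    try:
        return pd.read_csv(source, **kwargs)
    except FileNotFoundError:
        print(f"File not found: {source}")
    except pd.errors.EmptyDataError:
        print(f"No data in: {source}")
    except pd.errors.ParserError as exc:
        print(f"Could not parse {source}: {exc}")
    return None


# A missing file is reported instead of stopping the notebook
missing = safe_read_csv("does_not_exist.csv")

# A well-formed in-memory CSV loads normally
ok = safe_read_csv(io.StringIO("ev_id,Aircraft_Key\n20080211X00175,1\n"))
print(ok)
```

Returning `None` keeps the notebook running, but the caller then has to check for it; re-raising with a clearer message is an equally reasonable choice.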

Data Type Specification: During data loading, you may specify data types for each column if you have this information. This can help in managing memory usage and ensuring data consistency.

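Memory Optimization: If the datasets are large, consider passing dtype, usecols, or chunksize to pd.read_csv to reduce memory usage. Note that low_memory=False does not save memory: it makes pandas read each file in a single pass so that column types are inferred consistently, at the cost of higher memory use while parsing.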
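
As a rough illustration of explicit data types, assuming CSV input: the sample rows below are made up, the `ev_state` column is hypothetical, and only the `ev_id` and `Aircraft_Key` names come from the output above.

```python
import io

import pandas as pd

# Made-up sample rows; `ev_state` is a hypothetical column used for illustration
csv_text = (
    "ev_id,Aircraft_Key,ev_state\n"
    "20080211X00175,1,CA\n"
    "20080107X00026,1,TX\n"
    "20080107X00026,2,TX\n"
)

# Let pandas infer the types
default_df = pd.read_csv(io.StringIO(csv_text))

# Declare the types up front
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"ev_id": "string", "Aircraft_Key": "int8", "ev_state": "category"},
)

print(default_df.dtypes)
print(typed_df.dtypes)

# Per-column memory footprint in bytes
print(typed_df.memory_usage(deep=True))
```

Small integer and categorical types tend to pay off on columns with few distinct values; on a three-row sample the difference is negligible.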

Data Dictionary Utilization: Use the data dictionary to understand each column and its data type. This will help in cleaning and transforming the data accurately. Also, it's good to cross-reference the dictionary while loading the data to ensure all columns are correctly interpreted.

Dataframe Inspection: After loading each dataframe, it's good practice to inspect the first few rows using df.head() and check the dataframe's shape and size using df.shape and df.info() to confirm that the data is loaded correctly.
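
Because `load_data(load_all=True)` appears to return a dictionary of DataFrames keyed by table name, the inspection can be done in a loop. The sketch below uses a small stand-in dictionary named `tables` (a hypothetical name) rather than the real `loaded_data`.

```python
import pandas as pd

# Stand-in for the dictionary returned by load_data(load_all=True)
tables = {
    "narratives": pd.DataFrame(
        {"ev_id": ["20080211X00175", "20080107X00026"], "Aircraft_Key": [1, 1]}
    ),
}

shapes = {}
for name, table in tables.items():
    print(f"--- {name} ---")
    print(table.head())
    table.info()  # prints its own summary and returns None
    shapes[name] = table.shape

print(shapes)
```

Collecting the shapes in a dictionary gives a compact overview of all tables at the end.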

Duplication and Consistency Checks: Perform initial checks for duplicate records or inconsistent data after loading.
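
One possible way to run these checks is sketched below on made-up rows. It assumes that `ev_id` together with `Aircraft_Key` should identify a row, which the output above suggests (the same `ev_id` appears with `Aircraft_Key` 1 and 2) but does not confirm.

```python
import pandas as pd

# Made-up rows: the last two are exact duplicates of each other
narratives = pd.DataFrame(
    {
        "ev_id": ["20080211X00175", "20080107X00026", "20080107X00026", "20080107X00026"],
        "Aircraft_Key": [1, 1, 2, 2],
    }
)

# Rows that repeat an earlier row in every column
full_dupes = narratives.duplicated().sum()

# All rows that share the assumed key with another row
key_dupes = narratives.duplicated(subset=["ev_id", "Aircraft_Key"], keep=False)

# Keep the first row for each key
deduped = narratives.drop_duplicates(subset=["ev_id", "Aircraft_Key"])

print(full_dupes, key_dupes.sum(), len(deduped))
```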

Linking Tables: Once the data is loaded and initial inspections are done, you should start linking the tables using primary and foreign keys. This will enable more complex analyses like joining different aspects of the data, which is essential for in-depth analysis.
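
A sketch of such a join, assuming `ev_id` and `Aircraft_Key` form the shared key. The `aircraft` table and its `make` column are hypothetical stand-ins, and the rows are made up.

```python
import pandas as pd

narratives = pd.DataFrame(
    {
        "ev_id": ["20080211X00175", "20080107X00026", "20080107X00026"],
        "Aircraft_Key": [1, 1, 2],
        "narr_complete": ["text a", "text b", "text c"],
    }
)

# Hypothetical second table sharing the same key columns
aircraft = pd.DataFrame(
    {
        "ev_id": ["20080211X00175", "20080107X00026", "20080107X00026"],
        "Aircraft_Key": [1, 1, 2],
        "make": ["A", "B", "C"],
    }
)

linked = narratives.merge(
    aircraft,
    on=["ev_id", "Aircraft_Key"],
    how="left",
    validate="one_to_one",  # raises MergeError if the key is not unique on either side
    indicator=True,         # adds a _merge column showing where each row matched
)
print(linked)
```

The `validate` argument turns a silent row explosion into an immediate error, which is useful when the key assumption has not been verified yet.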

Dataframe Naming: Ensure dataframe names are descriptive to maintain readability and ease of understanding in your code.

Timestamps and Logging: Your use of a timestamp to log when the import was executed is good practice. You might also want to log other steps in your data processing for better tracking.
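
One way to extend the timestamp idea is the standard `logging` module, which adds the time to each message automatically. The logger name `eda` and the helper `log_step` are hypothetical.

```python
import logging

# Each record is prefixed with a timestamp and the level name
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("eda")


def log_step(step, rows):
    """Log a processing step with its row count and return the message (hypothetical helper)."""
    message = f"{step}: {rows} rows"
    logger.info(message)
    return message


msg = log_step("Loaded narratives", 3)
```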

Documentation: Document each step of your data loading process. Explain why each step is necessary and how it contributes to your overall analysis.



# Data Overview

In [None]:
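# Assumes `df` has been assigned to one of the tables in `loaded_data`

# Display the first few rows of the dataset
print(df.head())

# Get basic info about the dataset (info() prints its own output and returns None)
df.info()

# Generate summary statistics
print(df.describe())

# Check the dimensions of the DataFrame
print(df.shape)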

**The dataset has 41 columns and 1286 rows, with a mix of data types, including integers, floats, booleans, and objects (strings).**


# Data Cleaning

Before diving into the analysis, the following data cleaning steps will be completed:

* Handle missing values: Identify and handle columns with missing values.
* Data type conversion: Convert columns to appropriate data types.
* Remove unnecessary columns: Drop columns that are not relevant to the analysis.
* Exploratory Data Analysis (EDA): Begin by exploring the data to better understand its characteristics and to identify potential patterns or relationships. This will entail creating visualizations and summary statistics for different columns.

## Handle Missing Values

Check the quantity of missing data in each column.

In [None]:
# Count the missing values in each column (returns a Series indexed by column name)
missing_data = df.isna().sum()

print(missing_data)

In [None]:
# Express the missing counts as a percentage of the total number of rows
percentage_missing_data = (missing_data / len(df)) * 100
print(percentage_missing_data)
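
The counts and percentages can also be combined into one sorted table, which makes the worst-affected columns easier to spot. This is a sketch on a made-up DataFrame; the column names `a`, `b`, and `c` are placeholders.

```python
import pandas as pd

# Made-up data with missing values in two of three columns
df = pd.DataFrame(
    {
        "a": [1, None, 3, None],
        "b": ["x", "y", None, "z"],
        "c": [1, 2, 3, 4],
    }
)

missing_summary = pd.DataFrame(
    {
        "missing": df.isna().sum(),
        "percent": df.isna().mean() * 100,  # mean of booleans = fraction missing
    }
).sort_values("percent", ascending=False)

print(missing_summary)
```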

# Questions

## What are the most common causes of accidents among Part 121 air carriers?

Answering Specific Questions:

* For the question about the most common causes of accidents, you can analyze the 'cm_probableCause' column.
* For seasonal or time-related patterns, explore the 'cm_eventDate' column and consider creating time-based visualizations.
* To understand the evolution of safety records, analyze the 'cm_eventDate' and 'cm_completionStatus' columns over time.
* For correlations, consider using correlation matrices and visualizations to identify relationships between factors and accident severity.

Data Visualization: Visualize your findings using libraries like Matplotlib and Seaborn. This will help you communicate your results effectively.
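
A first pass at the cause and time questions might look like the sketch below. The rows are made up, and it assumes 'cm_probableCause' holds short category-like labels; if the real column contains free text, it would need to be grouped or coded before counting.

```python
import pandas as pd

# Made-up rows using the column names mentioned above
df = pd.DataFrame(
    {
        "cm_eventDate": ["2019-01-15", "2019-07-04", "2020-07-20", "2021-12-01"],
        "cm_probableCause": ["Turbulence", "Ground collision", "Turbulence", "Turbulence"],
    }
)
df["cm_eventDate"] = pd.to_datetime(df["cm_eventDate"])

# Most common causes, most frequent first
cause_counts = df["cm_probableCause"].value_counts()

# Accidents per calendar month (seasonality) and per year (trend)
by_month = df["cm_eventDate"].dt.month.value_counts().sort_index()
by_year = df["cm_eventDate"].dt.year.value_counts().sort_index()

print(cause_counts)
print(by_month)
print(by_year)
```

Each of these Series can be passed directly to a bar plot, for example with `by_month.plot(kind="bar")`.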

Hypothesis Testing: If your analysis involves hypothesis testing, you can perform statistical tests using libraries such as SciPy.
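
As one example, a chi-square test of independence can check whether two categorical factors are associated. The sketch below uses made-up data, and the `weather` and `severity` columns are hypothetical names, not columns confirmed to exist in the dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: 10 accidents in each weather condition
df = pd.DataFrame(
    {
        "weather": ["VMC"] * 10 + ["IMC"] * 10,
        "severity": ["minor"] * 8 + ["serious"] * 2 + ["minor"] * 2 + ["serious"] * 8,
    }
)

# Contingency table of weather condition against severity
table = pd.crosstab(df["weather"], df["severity"])

# Test the null hypothesis that the two factors are independent
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}")
```

The chi-square approximation is usually considered unreliable when expected cell counts are small; `scipy.stats.fisher_exact` is a common alternative for 2x2 tables.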

Documentation: Make sure to document your code and analysis steps clearly in your Jupyter Notebook. This will help you and others understand the analysis process.

Further Analysis: Depending on your findings, you may need to perform more in-depth analyses, such as time series analysis or regression modeling.