# 1. Business Understanding
## 1.1 Overview
To expand its market footprint and diversify its investment portfolio, the company is preparing to enter the aviation sector. This initiative involves acquiring aircraft to operate within both commercial and private aviation markets. Given the significant risks associated with air travel—ranging from mechanical failure and environmental hazards to pilot error and maintenance practices—it is imperative that the company make data-informed decisions regarding aircraft selection.

## 1.2 Business Problem
The organization currently lacks knowledge about the risk profiles of different aircraft models. Without a clear understanding of historical accident patterns, the company could unknowingly invest in aircraft with poor safety records, resulting in potential financial loss, reputational damage, and regulatory complications. Therefore, leadership has tasked the data team with identifying the **lowest-risk aircraft models** to guide strategic procurement.

## 1.3 Project Objective
The objective of this project is to analyze historical aviation accident data to evaluate and compare the safety performance of different aircraft models. The goal is to translate this analysis into **three clear business recommendations** that will inform the Aviation Division’s purchasing decisions.

This involves:
- Identifying aircraft models with consistently low accident frequencies or severities.
- Understanding trends across aircraft manufacturers, types, and use cases (commercial vs. private).
- Assessing the impact of contributing factors such as pilot error, equipment failure, weather conditions, or operational mismanagement.

## 1.4 Business Goals
- **Minimize Risk**: Recommend aircraft with the lowest historical accident rates to reduce the risk exposure for the business.
- **Support Procurement**: Provide a ranked list or categorical insights on safe aircraft for commercial and private deployment.
- **Enable Strategic Planning**: Use historical data trends to anticipate long-term implications of choosing particular aircraft.

## 1.5 Success Criteria
- Delivery of **three actionable and evidence-based business recommendations** supported by visual insights.
- Development of an **interactive dashboard** that allows business stakeholders to explore aircraft risk profiles.
- A **non-technical presentation** and a **well-documented Jupyter Notebook** that together communicate the methodology, findings, and value of the analysis.

In [1]:
# Import relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
# Forward slashes keep the relative path portable across Windows, macOS, and Linux
aviation_data = pd.read_csv('Data/AviationAccidentDataset/AviationData.csv', encoding='ISO-8859-1')

aviation_data.head()


Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Location,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20001218X45444,Accident,SEA87LA080,1948-10-24,"MOOSE CREEK, ID",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,UNK,Cruise,Probable Cause,
1,20001218X45447,Accident,LAX94LA336,1962-07-19,"BRIDGEPORT, CA",United States,,,,,...,Personal,,4.0,0.0,0.0,0.0,UNK,Unknown,Probable Cause,19-09-1996
2,20061025X01555,Accident,NYC07LA005,1974-08-30,"Saltville, VA",United States,36.9222,-81.8781,,,...,Personal,,3.0,,,,IMC,Cruise,Probable Cause,26-02-2007
3,20001218X45448,Accident,LAX96LA321,1977-06-19,"EUREKA, CA",United States,,,,,...,Personal,,2.0,0.0,0.0,0.0,IMC,Cruise,Probable Cause,12-09-2000
4,20041105X01764,Accident,CHI79FA064,1979-08-02,"Canton, OH",United States,,,,,...,Personal,,1.0,2.0,,0.0,VMC,Approach,Probable Cause,16-04-1980


In [2]:
# Initial data exploration

# Finding out how many rows and columns are available
print(f"The dataset contains {aviation_data.shape[0]} rows and {aviation_data.shape[1]} columns\n")

# Find out column names to know if they need standardisation and renaming
print("Column Names:\n", aviation_data.columns, "\n")

The dataset contains 88889 rows and 31 columns

Column Names:
 Index(['Event.Id', 'Investigation.Type', 'Accident.Number', 'Event.Date',
       'Location', 'Country', 'Latitude', 'Longitude', 'Airport.Code',
       'Airport.Name', 'Injury.Severity', 'Aircraft.damage',
       'Aircraft.Category', 'Registration.Number', 'Make', 'Model',
       'Amateur.Built', 'Number.of.Engines', 'Engine.Type', 'FAR.Description',
       'Schedule', 'Purpose.of.flight', 'Air.carrier', 'Total.Fatal.Injuries',
       'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured',
       'Weather.Condition', 'Broad.phase.of.flight', 'Report.Status',
       'Publication.Date'],
      dtype='object') 



In [3]:
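# Standardize column names for readability: trim whitespace, lowercase, dots to underscores.
# regex=False treats "." as a literal dot rather than the regex wildcard.
aviation_data.columns = aviation_data.columns.str.strip().str.lower().str.replace(".", "_", regex=False)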

aviation_data.sample(2)

Unnamed: 0,event_id,investigation_type,accident_number,event_date,location,country,latitude,longitude,airport_code,airport_name,...,purpose_of_flight,air_carrier,total_fatal_injuries,total_serious_injuries,total_minor_injuries,total_uninjured,weather_condition,broad_phase_of_flight,report_status,publication_date
10345,20001214X41558,Accident,DEN85LA030,1984-11-22,"OAKS, ND",United States,,,,PRIVATE AIRSTRIP,...,Personal,,0.0,0.0,0.0,1.0,VMC,Landing,Probable Cause,
29640,20001212X18582,Accident,DEN92FA020,1991-12-25,"MONTE VISTA, CO",United States,,,,,...,Personal,,3.0,0.0,0.0,0.0,VMC,Takeoff,Probable Cause,23-04-1993


In [4]:
# Get metadata
aviation_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88889 entries, 0 to 88888
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   event_id                88889 non-null  object 
 1   investigation_type      88889 non-null  object 
 2   accident_number         88889 non-null  object 
 3   event_date              88889 non-null  object 
 4   location                88837 non-null  object 
 5   country                 88663 non-null  object 
 6   latitude                34382 non-null  object 
 7   longitude               34373 non-null  object 
 8   airport_code            50249 non-null  object 
 9   airport_name            52790 non-null  object 
 10  injury_severity         87889 non-null  object 
 11  aircraft_damage         85695 non-null  object 
 12  aircraft_category       32287 non-null  object 
 13  registration_number     87572 non-null  object 
 14  make                    88826 non-null

In [7]:
# Descriptive analysis
# Statistical summary of the object (string) columns
aviation_data.describe(include="O").T

Unnamed: 0,count,unique,top,freq
event_id,88889,87951,20001212X19172,3
investigation_type,88889,2,Accident,85015
accident_number,88889,88863,CEN22FA424,2
event_date,88889,14782,1982-05-16,25
location,88837,27758,"ANCHORAGE, AK",434
country,88663,219,United States,82248
latitude,34382,25592,332739N,19
longitude,34373,27156,0112457W,24
airport_code,50249,10375,NONE,1488
airport_name,52790,24871,Private,240


In [5]:
# Check for duplicated values
print("Duplicates:", aviation_data.duplicated().sum())

# Check for nulls and compute their percentage to inform the choice between imputing and dropping
null_counts = aviation_data.isna().sum()
null_percent = (null_counts / len(aviation_data)) * 100
null_summary = pd.DataFrame({'Null Count': null_counts, 'Null %': null_percent.round(2)})

print("\nNull Values Summary:\n", null_summary)

Duplicates: 0

Null Values Summary:
                         Null Count  Null %
event_id                         0    0.00
investigation_type               0    0.00
accident_number                  0    0.00
event_date                       0    0.00
location                        52    0.06
country                        226    0.25
latitude                     54507   61.32
longitude                    54516   61.33
airport_code                 38640   43.47
airport_name                 36099   40.61
injury_severity               1000    1.12
aircraft_damage               3194    3.59
aircraft_category            56602   63.68
registration_number           1317    1.48
make                            63    0.07
model                           92    0.10
amateur_built                  102    0.11
number_of_engines             6084    6.84
engine_type                   7077    7.96
far_description              56866   63.97
schedule                     76307   85.85
purpose_of_flight
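
The null percentages above can be turned into a first-pass triage of columns. The following is a minimal sketch on a toy frame, not part of the original notebook; the function name `triage_nulls` and the 60% cut-off are hypothetical and should be adjusted to the analysis.

```python
import pandas as pd

# Hypothetical cut-off: columns with a larger null share become drop candidates.
DROP_ABOVE = 60.0


def triage_nulls(df, drop_above=DROP_ABOVE):
    """Label each column 'keep', 'impute', or 'drop candidate' by its null share."""
    null_pct = df.isna().mean() * 100
    action = pd.Series("impute", index=df.columns)
    action[null_pct == 0] = "keep"
    action[null_pct > drop_above] = "drop candidate"
    return pd.DataFrame({"null_pct": null_pct.round(2), "action": action})


# Toy frame standing in for aviation_data (values are illustrative only)
toy = pd.DataFrame({
    "event_id": ["a", "b", "c", "d", "e"],
    "make": ["Cessna", None, "Piper", "Cessna", "Boeing"],
    "schedule": [None, None, None, None, "SCHD"],
})

triage = triage_nulls(toy)
print(triage)
```

A column flagged as a drop candidate may still be worth keeping if it is central to the business question, so the labels are a starting point rather than a rule.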

# 2. Data Understanding

## 2.1 Data Source

The dataset used in this analysis is sourced from the **National Transportation Safety Board (NTSB)** and contains records of civil aviation accidents and selected incidents, nominally from **1962 to 2023**; the first rows displayed above include events dated as early as 1948, so some earlier records are also present. The data covers both U.S.-based incidents and those that occurred in international waters involving U.S.-registered aircraft.

- **File Type**: CSV
- **Size**: 88,889 rows and 31 columns
- **Period Covered**: 1962–2023
- **Scope**: Includes variables on aircraft make/model, accident severity, causes, weather conditions, flight purpose, location, and fatalities/injuries.
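
The stated period can be checked against the data itself. The sketch below is illustrative only: it parses a toy `event_date` column (the first three dates are copied from the `head()` output, the fourth is a deliberately invalid value) and reports the year range. Running the same steps on `aviation_data["event_date"]` is assumed to work because the displayed dates follow the `YYYY-MM-DD` pattern.

```python
import pandas as pd

# Toy rows standing in for aviation_data
toy = pd.DataFrame(
    {"event_date": ["1948-10-24", "1962-07-19", "1974-08-30", "not a date"]}
)

# errors="coerce" turns unparseable strings into NaT instead of raising
event_date = pd.to_datetime(toy["event_date"], format="%Y-%m-%d", errors="coerce")

first_year = int(event_date.dt.year.min())
last_year = int(event_date.dt.year.max())
pre_1962 = int((event_date.dt.year < 1962).sum())  # records before the nominal start
unparsed = int(event_date.isna().sum())            # values that failed to parse

print(f"Years covered: {first_year} to {last_year}")
print(f"Records before 1962: {pre_1962}, unparsed dates: {unparsed}")
```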

## 2.2 Data Structure

Key columns in the dataset include:

| Column Name | Description |
|-------------|-------------|
| `event_date` | Date of the accident |
| `location` | City and state (or international location) of the accident |
| `make` | Manufacturer of the aircraft |
| `model` | Model of the aircraft |
| `aircraft_damage` | Severity of aircraft damage (e.g., Destroyed, Substantial, Minor) |
| `injury_severity` | Level of injury/fatalities (e.g., Fatal, Non-Fatal, None) |
| `purpose_of_flight` | Reason for flight (e.g., Personal, Business, Commercial, Instructional) |
| `broad_phase_of_flight` | Flight phase during which the incident occurred (e.g., Takeoff, Landing) |
| `weather_condition` | Weather during the incident (e.g., VMC - Visual Meteorological Conditions) |
| `total_fatal_injuries` | Number of fatalities resulting from the incident |
| `total_serious_injuries`, `total_minor_injuries`, `total_uninjured` | Counts of seriously injured, slightly injured, and uninjured persons; the loaded file has no total-aboard column, so the number of persons on board would have to be derived from these counts |
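
The column list printed earlier contains no total-aboard field, so any persons-on-board figure has to be derived. A minimal sketch, assuming that a missing count can be read as zero whenever at least one of the four counts is present (this assumption is not confirmed by the source, and `total_aboard` is a hypothetical column name); the toy values mirror three rows of the `head()` output:

```python
import pandas as pd

count_cols = [
    "total_fatal_injuries",
    "total_serious_injuries",
    "total_minor_injuries",
    "total_uninjured",
]

# Toy rows standing in for aviation_data
toy = pd.DataFrame({
    "total_fatal_injuries": [2.0, 3.0, 1.0],
    "total_serious_injuries": [0.0, None, 2.0],
    "total_minor_injuries": [0.0, None, None],
    "total_uninjured": [0.0, None, 0.0],
})

# sum() skips NaN by default; min_count=1 leaves the result as NaN
# only when all four counts in a row are missing
toy["total_aboard"] = toy[count_cols].sum(axis=1, min_count=1)
print(toy)
```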

## 2.3 Initial Observations

- The dataset spans **over six decades**, making time-based trend analysis highly feasible.
- Certain fields, such as `injury_severity` and `aircraft_damage`, are categorical and may require normalization or encoding.
- Missing values are widespread. In the null summary above, the sparsest columns are `schedule` (85.85%), `far_description` (63.97%), `aircraft_category` (63.68%), and `latitude`/`longitude` (about 61%); fields such as `weather_condition`, `broad_phase_of_flight`, and `total_fatal_injuries` also need to be checked.
- The `make` and `model` columns are critical for our analysis since they relate directly to the business question on **aircraft risk assessment**.

## 2.4 Data Quality Issues

- **Missing Values**: Several fields contain nulls or blank strings. These must be investigated for relevance and either imputed, ignored, or used as-is depending on the column.
- **Inconsistent Labeling**: Categorical values such as `aircraft_damage` or `purpose_of_flight` may be inconsistent (e.g., "business" vs "Business") and will require standardization.
- **Outliers**: Outliers may exist in count fields such as `total_fatal_injuries` and `total_uninjured` and must be handled with care, especially in ratio-based analyses.
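
The labeling and blank-string issues above could be handled along these lines. This is a sketch on toy values, not the notebook's actual cleaning step. Title-casing suits a free-text label like flight purpose, but it would be the wrong choice for coded fields such as weather condition (`VMC`, `IMC`), where upper-casing is more appropriate.

```python
import pandas as pd

# Toy values standing in for aviation_data["purpose_of_flight"]
toy = pd.DataFrame(
    {"purpose_of_flight": ["Business", "business ", " ", "Personal", None]}
)

purpose = toy["purpose_of_flight"].str.strip()  # trim stray whitespace
purpose = purpose.mask(purpose == "")           # blank strings become missing
purpose = purpose.str.title()                   # "business" -> "Business"

print(purpose.value_counts(dropna=False))
```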

## 2.5 Next Steps

- Perform data cleaning and preprocessing.
- Explore distributions of key fields (e.g., damage, injury, make/model frequency).
- Create visual summaries to better understand relationships between aircraft types and safety metrics.
- Begin framing metrics for risk analysis (e.g., accident rate per model, severity index).
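
One possible starting point for the risk metrics mentioned above is the share of a model's recorded accidents that were fatal. The sketch below uses toy rows, and the names `is_fatal` and `fatal_share` are hypothetical. Note that the dataset records accidents only, not fleet sizes or flight hours, so these are counts and shares rather than true accident rates.

```python
import pandas as pd

# Toy rows standing in for the cleaned aviation_data
toy = pd.DataFrame({
    "make": ["Cessna", "Cessna", "Cessna", "Piper", "Piper"],
    "model": ["172", "172", "172", "PA-28", "PA-28"],
    "total_fatal_injuries": [0.0, 2.0, 0.0, 1.0, 1.0],
})

# Assumption: a missing fatality count is treated as no fatalities
toy["is_fatal"] = toy["total_fatal_injuries"].fillna(0) > 0

risk = (
    toy.groupby(["make", "model"])
    .agg(accidents=("is_fatal", "size"), fatal_accidents=("is_fatal", "sum"))
    .assign(fatal_share=lambda d: d["fatal_accidents"] / d["accidents"])
    .sort_values("fatal_share")
)
print(risk)
```

On the full dataset, models with only a handful of records would give unstable shares, so a minimum-accident filter is likely needed before ranking.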

