# Aviation Safety Analysis: Identifying Low-Risk Aircraft for Business Expansion
## Overview

This project analyzes aviation accident data from the National Transportation Safety Board (NTSB) to provide actionable insights for a company expanding into the aviation industry. By examining historical accident data from 1962 to 2023, we identify the lowest-risk aircraft for commercial and private operations to inform strategic purchasing decisions.

## Goals and Objectives:


### Goal 1: Data Understanding, Preparation & Cleaning
* Import the aviation dataset and thoroughly examine its structure
* Handle missing values using appropriate imputation techniques
* Create derived features that help identify aircraft risk factors (e.g., accident rate per flight hours)
* Normalize data by aircraft make/model to enable fair comparisons

### Goal 2: Analysis & Visualization
* Analyze trends in accidents by aircraft type over time
* Compare safety records across different manufacturers and models
* Examine how factors like aircraft age, maintenance history, and operating conditions affect safety
* Create visualizations that clearly show which aircraft categories have the lowest accident rates


### Goal 3: Key Insights from Visualizations
### Outlier Detection:

* The boxplots clearly show extreme values in injury counts that need investigation

* Rare categories in flight phases and weather conditions are identified
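
As a rough illustration of how boxplot outliers can be listed for investigation, the sketch below applies the 1.5 × IQR rule that a default boxplot uses for its whiskers. The injury counts are made-up toy values, not NTSB data, and the variable names are hypothetical.

```python
import pandas as pd

# Toy injury counts (made-up values, not taken from the NTSB data)
injuries = pd.Series([0, 0, 1, 1, 2, 2, 3, 4, 150], name="total_injuries")

# Quartiles and interquartile range
q1 = injuries.quantile(0.25)
q3 = injuries.quantile(0.75)
iqr = q3 - q1

# Values further than 1.5 * IQR from the quartiles are treated as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = injuries[(injuries < lower) | (injuries > upper)]
print(outliers)
```

Listing the flagged rows this way makes it easier to check whether an extreme value is a genuine mass-casualty event or a data-entry problem.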

### Time Trends:

* The time series with anomaly detection highlights unusual years that deviate from the 5-year moving average

* Fatalities trend shows correlation with accident frequency but with some notable exceptions
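
One possible way to flag years that deviate from a 5-year moving average is sketched below. The yearly counts are invented for illustration, and the threshold of two standard deviations of the residuals is an assumption, not necessarily the rule used for the chart described above.

```python
import pandas as pd

# Hypothetical yearly accident counts (illustrative values only)
counts = pd.Series(
    [100, 102, 98, 101, 99, 100, 160, 101, 97, 103],
    index=range(2000, 2010),
    name="accidents",
)

# Centered 5-year moving average; min_periods=1 keeps the edge years
moving_avg = counts.rolling(window=5, center=True, min_periods=1).mean()

# Flag years whose distance from the moving average exceeds
# two standard deviations of the residuals (assumed threshold)
residual = counts - moving_avg
threshold = 2 * residual.std()
anomalies = counts[residual.abs() > threshold]
print(anomalies)
```

A centered window pulls the average towards a spike, so neighbouring years also show sizeable residuals; the threshold may need tuning on real data.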

### Phase/Category Heatmap:

* Reveals which combinations of aircraft type and flight phase are most dangerous

* Shows normalized percentages for fair comparison between different aircraft categories
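
The row-normalized percentages behind such a heatmap can be computed with `pd.crosstab`. This is a minimal sketch on toy records; the column names `category` and `phase` and their values are assumptions made for the example.

```python
import pandas as pd

# Toy records; the category and phase labels are illustrative only
records = pd.DataFrame({
    "category": ["Airplane", "Airplane", "Airplane", "Airplane",
                 "Helicopter", "Helicopter"],
    "phase": ["LANDING", "LANDING", "TAKEOFF", "CRUISE",
              "LANDING", "MANEUVERING"],
})

# normalize="index" divides each row by its total, so every
# aircraft category sums to 100 percent regardless of fleet size
pct = pd.crosstab(records["category"], records["phase"], normalize="index") * 100
print(pct.round(1))
```

The resulting table can be passed to `sns.heatmap(pct, annot=True)` to draw the heatmap.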

### Weather Impact:

* The stacked bar chart quantifies how different weather conditions affect accident severity

* Danger score provides a simple metric for comparing risk across conditions

### Geographic Patterns:

* The map visualization shows clusters of accidents and their relative severity

* Highlights regions requiring special operational considerations

### Goal 4: Business Recommendations
* Identify specific aircraft models with the best safety records
* Recommend optimal aircraft age ranges for purchase
* Suggest operational guidelines to minimize risk
* Enhance predictive maintenance protocols
* Develop comprehensive weather avoidance strategies
* Invest in advanced materials research
* Implement robust communication systems
* Provide continuous pilot training for extreme conditions



In [14]:
import pandas as pd
df= pd.read_csv("Aviation_Data.csv")
df.head(50)

FileNotFoundError: [Errno 2] No such file or directory: 'Aviation_Data.csv'

### Goal 1: Data Understanding, Preparation & Cleaning

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
df = pd.read_csv('Aviation_Data.csv')
df


FileNotFoundError: [Errno 2] No such file or directory: 'Aviation_Data.csv'

In [None]:
# dtypes is an attribute of the DataFrame, not a method
df.dtypes

NameError: name 'df' is not defined

## Getting summary of the dataframe

In [16]:

df.info()

NameError: name 'df' is not defined

## Getting a summary of the statistics of the dataframe

In [None]:
df.describe()

## Handling missing values in the Dataset

In [None]:
df.isnull().sum()

NameError: name 'df' is not defined

In [None]:
df.isnull().sum().sum()

## Filling Null values

In [None]:
df2 = df.fillna(value = 0)
df2

NameError: name 'df' is not defined

In [None]:
df2.isnull().sum().sum()
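
Filling every gap with 0 also writes the number 0 into text columns and can understate numeric averages. A possible alternative, sketched on a toy frame with hypothetical column names, is to impute by column type: the median for numeric columns and an explicit placeholder for text columns. Whether the median is appropriate depends on the column, so treat this as a starting point only.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the column names are hypothetical
toy = pd.DataFrame({
    "injuries": [0.0, np.nan, 2.0, 4.0],
    "weather": ["VMC", None, "IMC", "VMC"],
})

filled = toy.copy()

# Numeric columns: fill gaps with the column median
num_cols = filled.select_dtypes(include="number").columns
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())

# Text columns: fill gaps with an explicit placeholder
text_cols = filled.select_dtypes(exclude="number").columns
filled[text_cols] = filled[text_cols].fillna("Unknown")

print(filled)
```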

## Handling Duplicates in this Dataset

In [15]:
df_clean = df.copy()
# Drop rows that repeat the same combination of identifying fields,
# keeping the first occurrence
df_clean = df_clean.drop_duplicates(
    subset=[
        'Event.Id',           # Should be unique per event
        'Accident.Number',    # Should be unique per accident
        'Event.Date',         # Combined with other fields
        'Location',           # Helps identify unique events
        'Aircraft.Category',  # Part of unique identification
    ],
    keep='first',
)


NameError: name 'df' is not defined

# Filtering the dataset

## Analysing accidents: Discarding incidents

In [None]:
gby_ev_type = events.groupby('ev_type')
gby_ev_type.groups.keys()

* Accident: An occurrence associated with the operation of an aircraft that takes place between the time any person boards the aircraft with the intention of flight and the time all such persons have disembarked, in which:
  * a person is fatally or seriously injured,
  * the aircraft sustains damage or structural failure, or
  * the aircraft is missing or completely inaccessible.

* Incident: An occurrence other than an accident, associated with the operation of an aircraft, which affects or could affect the safety of operation.

In [None]:
events = gby_ev_type.get_group('ACC')
events['ev_id'].count()

# Focus on Commercial Flights

In [None]:
# Keep only the aircraft records that belong to accident events
cond = aircraft['ev_id'].isin(events['ev_id'])
aircraft = aircraft[cond]

In [None]:
# Group by the FAR part under which each flight was operated
gby_far_part = aircraft.groupby('far_part')
gby_far_part.groups.keys()

In [17]:
desired_far_parts = ['NUSC', # Non-U.S.Commercial
                       '121'] # Air Carrier
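
The cell that applies `desired_far_parts` to the aircraft table is not shown in this notebook. One way to do it, assuming the column is named `far_part`, is a boolean mask built with `isin`. The sketch uses a toy stand-in table with invented values.

```python
import pandas as pd

# Toy stand-in for the aircraft table; values are illustrative only
aircraft_toy = pd.DataFrame({
    "ev_id": ["A1", "A2", "A3", "A4"],
    "far_part": ["121", "091", "NUSC", "135"],  # assumed column name
})

desired_far_parts = ["NUSC",  # Non-U.S. Commercial
                     "121"]   # Air Carrier

# Keep only the rows whose FAR part is in the desired list
commercial = aircraft_toy[aircraft_toy["far_part"].isin(desired_far_parts)]
print(commercial)
```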

#### Checking for duplicates

In [None]:
ev_ids_for_desired_criteria = aircraft['ev_id'].drop_duplicates()
# How many distinct events remain after filtering?
print('Number of events:', ev_ids_for_desired_criteria.count())

# Keep only the events that still have a matching aircraft record
mask = events['ev_id'].isin(ev_ids_for_desired_criteria.values)
events = events[mask]


# Goal 2: Analysis & Visualization

* `events` and `aircraft` now contain only accident data from flights whose FAR part is listed in `desired_far_parts`.

## Where do accidents occur?
* What information about location is given?

In [None]:
events[['ev_Airportname', 'ev_Airportcode', 'ev_country', 'latitude', 'longitude']].head(10)

In [None]:
# Parsing latitude: blank strings become NaN, then keep only values
# made of six digits followed by N or S
events['latitude'] = events['latitude'].replace(' ', np.nan)
lat = events['latitude'].dropna()

mask = lat.str.contains(r'^[0-9]{6}[NnSs]$')
# convert_lat and convert_lon are helper functions that must be defined
# before this cell runs; they are not shown in this notebook
events['latitude_num'] = lat[mask].apply(convert_lat)

# Parsing longitude: same approach, seven digits followed by E or W
events['longitude'] = events['longitude'].replace(' ', np.nan)
lon = events['longitude'].dropna()

mask = lon.str.contains(r'^[0-9]{7}[EeWw]$')
events['longitude_num'] = lon[mask].apply(convert_lon)

events[['longitude_num', 'latitude_num']].head()
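
The helpers `convert_lat` and `convert_lon` are not defined anywhere in this notebook. The sketch below is one hypothetical implementation. It assumes the strings matched by the regular expressions encode degrees, minutes and seconds (`DDMMSS` plus `N`/`S` for latitude, `DDDMMSS` plus `E`/`W` for longitude); that layout is an assumption and should be checked against the data dictionary.

```python
def _dms_to_decimal(digits, hemisphere):
    """Convert a DDMMSS or DDDMMSS digit string to signed decimal degrees."""
    seconds = int(digits[-2:])
    minutes = int(digits[-4:-2])
    degrees = int(digits[:-4])
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention
    return -value if hemisphere.upper() in ("S", "W") else value


def convert_lat(text):
    # Hypothetical helper: e.g. "453000N" means 45 deg 30 min 00 s North
    return _dms_to_decimal(text[:-1], text[-1])


def convert_lon(text):
    # Hypothetical helper: e.g. "1221500W" means 122 deg 15 min 00 s West
    return _dms_to_decimal(text[:-1], text[-1])


print(convert_lat("453000N"))   # 45.5
print(convert_lon("1221500W"))  # -122.25
```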


### Plotting accident locations with Basemap

In [None]:
lon_ = events['longitude_num'].values
lat_ = events['latitude_num'].values

In [None]:
from mpl_toolkits.basemap import Basemap

fig=plt.figure(figsize=(15,15))

# Miller cylindrical projection centred on longitude 0, latitude 0
m1 = Basemap(projection='mill', lon_0=0, lat_0=0)
m1.bluemarble()
m1.drawcoastlines()
m1.scatter(lon_,lat_,latlon=True,marker='.',color='r')

### Determining where the airports are 

In [None]:
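# usecols must select exactly as many columns as there are names;
# the column positions below are assumed from the four names given
header = ['name', 'country', 'lat', 'lon']
airports = pd.read_csv('Aviation_data.csv', usecols=(1, 3, 6, 7), names=header)
airports.head()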

In [None]:
fig, ax = plt.subplots(2, 1, sharex=True, sharey=True)

fig.set_size_inches(15, 15)

ax[0].set_title('airports')
m1 = Basemap(projection='mill', lon_0=0, lat_0=0, ax=ax[0])
m1.drawcoastlines()
m1.bluemarble()
# Project longitude/latitude to map coordinates before plotting
x, y = m1(airports['lon'].values, airports['lat'].values)

m1.scatter(x, y, latlon=False, marker='.', color='g')

ax[1].set_title('accidents')
m2 = Basemap(projection='mill', lon_0=0, lat_0=0, ax=ax[1])
m2.drawcoastlines()
m2.bluemarble()
# lon_ and lat_ are in degrees, so let Basemap project them
m2.scatter(lon_, lat_, latlon=True, marker='.', color='r')
# Conclusion: comparing the airport and accident maps indicates that the dataset is predominantly American.

## When do accidents occur, and which broad flight phase is safest?

In [None]:
detailed_phase = [501, 502, 503, 504, 505, 511, 512, 513, 514, 521, 522, 523, 531, 541, 551, 552, 553, 562, 563, 564, 565, 566, 567, 568, 569, 571, 572, 573, 574, 575, 576, 581, 582, 583, 591, 592, 542]

# Map each detailed code to its broad phase (e.g. 512 -> 510) with floor division
general_phase = [(ii // 10) * 10 for ii in detailed_phase[:-1]]
# The last code, 542, is assigned to 580 (MANEUVERING)
general_phase.append(580)

aircraft['phase_gen'] = aircraft['phase_flt_spec'].replace(to_replace=detailed_phase, value=general_phase)
occurrences_series = aircraft['phase_gen'].value_counts()

# only the ten most common ones
occurrences_series.iloc[0:10]

In [None]:
phases_dict = {500: 'STANDING', 510: 'TAXI', 520: 'TAKEOFF', 530: 'CLIMB', 540: 'CRUISE', 550: 'DESCENT', 560: 'APPROACH', 570: 'LANDING', 580: 'MANEUVERING', 600: 'OTHER', 610: 'UNKNOWN'}

# Replace the numeric phase codes in the index with readable labels
occurrences_series = occurrences_series.rename(index=phases_dict)

occurrences_series.plot.barh(stacked=True)
plt.xlabel('number of accidents')
# Many of the accidents occur during landing.
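
Grouping detailed phase codes into broad phases relies on integer arithmetic: floor division by 10 drops the last digit, and multiplying by 10 restores the scale. In Python 3 the `//` operator is required for this, because `/` performs true division and returns a float. A small sketch with a handful of codes and a shortened, illustrative label dictionary:

```python
# A few detailed phase codes (three-digit integers)
detailed = [501, 512, 523, 565, 574]

# Floor division drops the last digit: 512 // 10 * 10 -> 510
general = [(code // 10) * 10 for code in detailed]

# Shortened label dictionary, for illustration only
labels = {500: "STANDING", 510: "TAXI", 520: "TAKEOFF",
          560: "APPROACH", 570: "LANDING"}
named = [labels[code] for code in general]

print(general)
print(named)
```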

### Injuries and aircraft damage by flight phase

In [None]:
group = aircraft.groupby(['phase_gen', 'damage'])

phases_list = phases_dict.keys()
damage_list = ['NONE','SUBS','DEST','MINR','UNK']
injured = ['injured','fatalities']
phases = pd.DataFrame(columns=damage_list+injured)

phases

In [None]:
for phase in phases_list:
    sum_inj = 0
    sum_fat = 0
    for dam in damage_list:
        try:
            gg = group.get_group((phase, dam))
            # Number of accidents for this phase/damage combination
            phases.loc[phase, dam] = gg['ev_id'].count()
            mask = events['ev_id'].isin(gg['ev_id'])
            inj_m = events.loc[mask, 'inj_tot_m'].sum()
            inj_s = events.loc[mask, 'inj_tot_s'].sum()
            fat = events.loc[mask, 'inj_tot_f'].sum()
            if not np.isnan(inj_m):
                sum_inj += inj_m
            if not np.isnan(inj_s):
                sum_inj += inj_s
            if not np.isnan(fat):
                sum_fat += fat
        except KeyError:
            # No records for this phase/damage combination
            pass

    phases.loc[phase, 'injured'] = sum_inj
    phases.loc[phase, 'fatalities'] = sum_fat

# Replace the numeric phase codes in the index with readable labels
phases.rename(index=phases_dict, inplace=True)
phases

In [None]:
phases['TOTAL'] = phases[['NONE', 'MINR', 'SUBS', 'DEST']].sum(axis=1)
phases = phases.sort_values(by='TOTAL', ascending=False)

phases[['NONE', 'MINR', 'SUBS', 'DEST']].plot.barh(stacked=True)
# Landing accounts for a large number of events, and the aircraft is often substantially damaged.

In [None]:
phases[['injured','fatalities']].plot.barh()

In [None]:
phases = phases.sort_values(by='fatalities', ascending=False)
phases[['injured','fatalities']].plot.barh()

## Which are the main accident occurrences?

In [None]:
occurrences = pd.read_csv('Aviation_Data.csv', sep=',')
# Keep only the occurrences that belong to the selected accident events
occurrences = occurrences[occurrences['ev_id'].isin(events['ev_id'])]