In [1]:
import pandas as pd
import numpy as np

## Non-Major Incidents Data

The safety dataset consists of individual records of minor safety incidents from 2008 to 2024. The dataset is described at https://data.transportation.gov/Public-Transit/Non-Major-Safety-and-Security-Events/urir-txqm/about_data.

In [None]:
raw_dat = pd.read_csv("../../raw_data/NonMajor_Safety_Raw.csv", sep=',')
raw_dat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97032 entries, 0 to 97031
Data columns (total 20 columns):
 #   Column                                                                 Non-Null Count  Dtype 
---  ------                                                                 --------------  ----- 
 0   5 Digit NTD ID                                                         97032 non-null  int64 
 1   Agency                                                                 97032 non-null  object
 2   Mode                                                                   97032 non-null  object
 3   Type of Service                                                        97032 non-null  object
 4   Month                                                                  97032 non-null  object
 5   Year                                                                   97032 non-null  int64 
 6   Safety/Security                                                        97032 non-null  object



`Additional Assault Information` is free-form text, which cannot be aggregated, so we drop it. We also only have capital expenditure data for 2016 to 2022, so we keep only those years.

In [3]:
raw_dat = raw_dat.drop(columns=['Additional Assault Information'])
raw_dat = raw_dat[(2016 <= raw_dat['Year']) & (raw_dat['Year'] <= 2022)]

In [4]:
raw_dat.info()

<class 'pandas.core.frame.DataFrame'>
Index: 31838 entries, 11824 to 43661
Data columns (total 19 columns):
 #   Column                                                                 Non-Null Count  Dtype 
---  ------                                                                 --------------  ----- 
 0   5 Digit NTD ID                                                         31838 non-null  int64 
 1   Agency                                                                 31838 non-null  object
 2   Mode                                                                   31838 non-null  object
 3   Type of Service                                                        31838 non-null  object
 4   Month                                                                  31838 non-null  object
 5   Year                                                                   31838 non-null  int64 
 6   Safety/Security                                                        31838 non-null  object
 

We are aggregating over full years, so we can ignore the `Month` field. We can also ignore `Type of Service`, since we are looking at the transit agencies themselves, not the specifics of the services they offer. Finally, we drop the `Location` column, since `Location Group` already captures the broad category of each incident location.

In [5]:
raw_dat = raw_dat.drop(columns=['Month', 'Type of Service', 'Location'])
raw_dat.info()

<class 'pandas.core.frame.DataFrame'>
Index: 31838 entries, 11824 to 43661
Data columns (total 16 columns):
 #   Column                                                                 Non-Null Count  Dtype 
---  ------                                                                 --------------  ----- 
 0   5 Digit NTD ID                                                         31838 non-null  int64 
 1   Agency                                                                 31838 non-null  object
 2   Mode                                                                   31838 non-null  object
 3   Year                                                                   31838 non-null  int64 
 4   Safety/Security                                                        31838 non-null  object
 5   Event Type                                                             31838 non-null  object
 6   Location Group                                                         31838 non-null  object
 

For the analysis we want to run alongside the NTD data, `Event Type`, `Location Group`, and `Safety/Security` are not needed, since we want aggregate counts per agency. We also drop the `Agency` column, since the NTD ID already identifies the agency.

In [6]:
raw_dat = raw_dat.drop(columns=['Safety/Security', 'Event Type', 'Location Group', 'Agency'])

These entries are recorded by month, so we need to aggregate them by agency, mode, and year. The agencies have names, but even better, they have unique numeric identifiers, which we use as the grouping key.

In [7]:
dat_grouped = raw_dat.groupby(by = ['5 Digit NTD ID', 'Mode', 'Year'])

In [8]:
agg_safety = dat_grouped.sum().reset_index()
agg_safety.sample(7)

Unnamed: 0,5 Digit NTD ID,Mode,Year,Physical Assaults on Operators (Security Events Only),Non-Physical Assaults on Operators (Security Events Only),Physical Assaults on Other Transit Workers (Security Events Only),Non-Physical Assaults on Other Transit Workers (Security Events Only),Total Events,Customer Injuries (Safety Events Only),Worker Injuries (Safety Events Only),Other Injuries (Safety Events Only),Total Injuries (Safety Events Only)
2430,50518,DR,2019,0,0,0,0,1,1,0,0,1
991,30030,MB,2016,0,0,0,0,121,109,10,1,120
1626,40053,MB,2019,0,0,0,0,1,1,0,0,1
1705,40087,MB,2019,0,0,0,0,31,31,0,0,31
1487,40032,MB,2021,0,0,0,0,15,14,0,1,15
1320,40015,DR,2016,0,0,0,0,1,0,0,0,0
1988,50016,MB,2016,0,0,0,0,34,34,0,0,34


In [9]:
agg_safety.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3677 entries, 0 to 3676
Data columns (total 12 columns):
 #   Column                                                                 Non-Null Count  Dtype 
---  ------                                                                 --------------  ----- 
 0   5 Digit NTD ID                                                         3677 non-null   int64 
 1   Mode                                                                   3677 non-null   object
 2   Year                                                                   3677 non-null   int64 
 3   Physical Assaults on Operators (Security Events Only)                  3677 non-null   int64 
 4   Non-Physical Assaults on Operators (Security Events Only)              3677 non-null   int64 
 5   Physical Assaults on Other Transit Workers (Security Events Only)      3677 non-null   int64 
 6   Non-Physical Assaults on Other Transit Workers (Security Events Only)  3677 non-null   int64 
 7

One can verify that the "Security Events" columns are all zero, and so we will remove them from the data to be analyzed, since they provide no information.
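A quick way to run that check is sketched below. The small frame `demo` is a hypothetical stand-in for `agg_safety`; on the real data, the same expression would be applied to the columns whose names contain "Security Events Only".

```python
import pandas as pd

# Hypothetical stand-in for agg_safety with two of the security columns.
demo = pd.DataFrame({
    "5 Digit NTD ID": [1, 1, 2],
    "Physical Assaults on Operators (Security Events Only)": [0, 0, 0],
    "Non-Physical Assaults on Operators (Security Events Only)": [0, 0, 0],
    "Total Events": [3, 1, 4],
})

# Select the security columns by name rather than by position.
security_cols = [c for c in demo.columns if "Security Events Only" in c]

# True only if every value in every security column equals zero.
all_zero = bool((demo[security_cols] == 0).all().all())
print(all_zero)
```

Selecting by name also avoids depending on the column order, which the positional slice in the next cell assumes.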

In [10]:
agg_safety.drop(columns=agg_safety.columns[3:7], inplace=True)
agg_safety.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3677 entries, 0 to 3676
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype 
---  ------                                  --------------  ----- 
 0   5 Digit NTD ID                          3677 non-null   int64 
 1   Mode                                    3677 non-null   object
 2   Year                                    3677 non-null   int64 
 3   Total Events                            3677 non-null   object
 4   Customer Injuries (Safety Events Only)  3677 non-null   int64 
 5   Worker Injuries (Safety Events Only)    3677 non-null   int64 
 6   Other Injuries (Safety Events Only)     3677 non-null   int64 
 7   Total Injuries (Safety Events Only)     3677 non-null   int64 
dtypes: int64(6), object(2)
memory usage: 229.9+ KB
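
`Total Events` comes out of the sum with dtype `object`, which suggests the raw column is not stored as plain integers. If any of its values are strings, a grouped sum concatenates them instead of adding them. The following is a minimal sketch of coercing the column before aggregating, using a hypothetical frame `demo` (not the real data) in which the values are assumed to be strings:

```python
import pandas as pd

# Hypothetical data: Total Events stored as strings in an object column.
demo = pd.DataFrame({
    "NTD ID": [1, 1, 2],
    "Total Events": pd.Series(["1", "12", "3"], dtype=object),
})

# Summing an object column of strings concatenates them within each group.
concatenated = demo.groupby("NTD ID")["Total Events"].sum()

# Coerce to numbers first; entries that cannot be parsed become NaN.
demo["Total Events"] = pd.to_numeric(demo["Total Events"], errors="coerce")
summed = demo.groupby("NTD ID")["Total Events"].sum()

print(concatenated.loc[1], summed.loc[1])
```

Whether the real column contains strings is worth checking, for example with `raw_dat['Total Events'].map(type).value_counts()`, before trusting the later `astype('int64')` conversion.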


In [11]:
# Rename the columns for easier access.
agg_safety.columns = ['NTD ID', 'Mode', 'Year', 'Total Minor Incidents', 'Customer Injuries',
                      'Worker Injuries', 'Other Injuries', 'Total Injuries']
agg_safety['Total Minor Incidents'] = agg_safety['Total Minor Incidents'].astype('int64')
agg_safety['NTD ID'] = [f"{num:05d}" for num in agg_safety['NTD ID']]
agg_safety.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3677 entries, 0 to 3676
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   NTD ID                 3677 non-null   object
 1   Mode                   3677 non-null   object
 2   Year                   3677 non-null   int64 
 3   Total Minor Incidents  3677 non-null   int64 
 4   Customer Injuries      3677 non-null   int64 
 5   Worker Injuries        3677 non-null   int64 
 6   Other Injuries         3677 non-null   int64 
 7   Total Injuries         3677 non-null   int64 
dtypes: int64(6), object(2)
memory usage: 229.9+ KB
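
The zero-padding matters because NTD IDs are five-digit identifiers, and reading them as integers drops any leading zeros. The list comprehension above can also be written with pandas string methods, as in this small sketch (the values in `ids` are made up for illustration):

```python
import pandas as pd

# Illustrative IDs, including one that would lose its leading zeros as an int.
ids = pd.Series([1, 40032, 50518])

# Convert to string and left-pad with zeros to a width of five characters.
padded = ids.astype(str).str.zfill(5)
print(padded.tolist())
```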


In [12]:
agg_safety.to_csv('../data/NTD_NonMajor_Safety_Incidents.csv', index=False)

## Major Incidents Data

We next process the major incidents data for the given systems. The dataset is described at https://data.transportation.gov/Public-Transit/Major-Safety-Events/9ivb-8ae9/about_data.

In [None]:
major_raw = pd.read_csv("../../raw_data/Major_Safety_Raw.csv")


In [14]:
# Too many columns to print with info, so we put them in a text file for reference.
cols = np.array(major_raw.columns)
cols.tofile('major_cols.txt', sep='\n')

The dataset catalogs individual incidents across transit agencies, but we want aggregated numbers for each year by agency and transit mode. This means removing information that cannot be summed, which in practice means removing the non-numerical fields from the dataset. Admittedly, this discards information we would need if we were interested in the details of the accidents, but for the questions we want to address, we are primarily interested in the general safety records of the agencies. There are a lot of columns, so let's get to pruning.
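
As an alternative to picking columns by position, the numeric columns can be selected by dtype. This is only a sketch on a hypothetical frame `demo`, not the approach used in the cells that follow; the column `Event Description` is invented for illustration.

```python
import pandas as pd

# Hypothetical incident records with one free-text column.
demo = pd.DataFrame({
    "NTD ID": [1, 1],
    "Mode": ["MB", "MB"],
    "Year": [2016, 2016],
    "Event Description": ["collision", "derailment"],
    "Total Injuries": [2, 0],
})

keys = ["NTD ID", "Mode", "Year"]

# Keep the grouping keys plus every numeric column that is not itself a key.
numeric_cols = demo.select_dtypes(include="number").columns.difference(keys)
pruned = demo[keys + list(numeric_cols)]
print(list(pruned.columns))
```

Note that numerically coded identifiers (such as a UZA code) would survive a dtype-based selection and would still need to be dropped by name.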

In [15]:
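# Positional indices of the columns to drop (see major_cols.txt for the column order).
idx_to_rm = [np.arange(start=9, stop=23), [24], [27], np.arange(start=32, stop=68),
             np.arange(start=69, stop=72), [112]]
# Flatten the mixed list of arrays and lists into a single list of indices.
rm_flat = []
for sublist in idx_to_rm:
    rm_flat.extend(sublist)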

In [16]:
major_raw.drop(columns=major_raw.columns[rm_flat], inplace=True)

In [17]:
major_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94961 entries, 0 to 94960
Data columns (total 57 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   NTD ID                                            94961 non-null  int64  
 1   Agency                                            94961 non-null  object 
 2   Primary UZA UACE Code                             94961 non-null  int64  
 3   Rail/Bus/Ferry                                    94961 non-null  object 
 4   Mode Name                                         94961 non-null  object 
 5   Mode                                              94961 non-null  object 
 6   Type of Service                                   94961 non-null  object 
 7   Fixed Route Flag                                  94961 non-null  bool   
 8   Year                                              94961 non-null  int64  
 9   Property Damage  

In [18]:
# Drop the remaining descriptive columns, which cannot be aggregated.
major_raw.drop(columns=['Agency', 'Primary UZA UACE Code', 'Rail/Bus/Ferry', 'Mode Name',
                        'Type of Service', 'Fixed Route Flag'], inplace=True)
major_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94961 entries, 0 to 94960
Data columns (total 51 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   NTD ID                                            94961 non-null  int64  
 1   Mode                                              94961 non-null  object 
 2   Year                                              94961 non-null  int64  
 3   Property Damage                                   74925 non-null  float64
 4   Total Injuries                                    94961 non-null  int64  
 5   Total Fatalities                                  94961 non-null  int64  
 6   Number of Transit Vehicles Involved               94961 non-null  int64  
 7   Number of Vehicles Involved                       94961 non-null  int64  
 8   Number of Cars on Involved Transit Vehicles       94961 non-null  int64  
 9   Number of Deraile

In [19]:
# Select only incidents occurring between 2016 and 2022.
major_raw = major_raw[(major_raw['Year'] >= 2016) & (major_raw['Year'] <= 2022)]
major_raw.shape

(62790, 51)

In [20]:
# Create the grouped object on which we will perform the operations.
major_grouped = major_raw.groupby(by=['NTD ID', 'Mode', 'Year'])

In [21]:
# The counts of vehicles involved in an accident are better summarized by the median than by the sum,
# since the median tells us the usual severity of the major incidents. If these columns turn out not to
# give much information, we can drop them. Vehicle speed could be a useful proxy, though since it is the
# speed at the time of the accident, it might not tell us much about how fast the vehicles normally move.
med_cols = major_grouped[['Number of Transit Vehicles Involved', 'Number of Vehicles Involved',
              'Number of Cars on Involved Transit Vehicles', 'Number of Derailed Vehicles', 'Vehicle Speed']].median()
med_cols = med_cols.reset_index()
med_cols.head()

Unnamed: 0,NTD ID,Mode,Year,Number of Transit Vehicles Involved,Number of Vehicles Involved,Number of Cars on Involved Transit Vehicles,Number of Derailed Vehicles,Vehicle Speed
0,1,MB,2016,1.0,0.0,0.0,,15.0
1,1,MB,2017,1.0,0.0,0.0,,15.0
2,1,MB,2018,1.0,1.0,0.0,,5.0
3,1,MB,2019,1.0,0.0,0.0,,15.0
4,1,MB,2020,1.0,1.0,0.0,,10.0


In [22]:
sum_cols = major_raw.drop(columns=['Number of Transit Vehicles Involved', 'Number of Vehicles Involved',
                                   'Number of Cars on Involved Transit Vehicles', 'Number of Derailed Vehicles',
                                   'Vehicle Speed'])
sum_cols = sum_cols.groupby(by=['NTD ID', 'Mode', 'Year'])
sum_cols = sum_cols.sum().reset_index()
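
The two aggregations above (median for some columns, sum for the rest) could also be expressed in a single `agg` call with a per-column mapping. The sketch below uses a hypothetical frame `demo` with just one column of each kind; it illustrates the idea and is not a drop-in replacement for the cells above.

```python
import pandas as pd

# Hypothetical incident-level records.
demo = pd.DataFrame({
    "NTD ID": [1, 1, 1, 2],
    "Mode": ["MB", "MB", "MB", "LR"],
    "Year": [2016, 2016, 2016, 2016],
    "Vehicle Speed": [5.0, 15.0, 40.0, 10.0],
    "Total Injuries": [1, 0, 3, 2],
})

# Median for the severity-style column, sum for the count-style column.
combined = (
    demo.groupby(["NTD ID", "Mode", "Year"])
        .agg({"Vehicle Speed": "median", "Total Injuries": "sum"})
        .reset_index()
)
print(combined)
```

Doing both in one pass would make the later merge unnecessary, at the cost of having to list an aggregation for every column.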

We've completed the aggregation on both groups. Before merging, I want to check the median columns to see if they're actually telling us anything.

In [23]:
print(np.count_nonzero(med_cols.fillna(0), axis=0))

[3725 3725 3725 3452 3096  298   71 3276]


It looks like the vast majority of values in the `Number of Cars on Involved Transit Vehicles` and `Number of Derailed Vehicles` columns are either NA or 0. That's not ideal, but maybe it's a bit better for the rail-specific modes?

In [24]:
import re
np.count_nonzero(med_cols[[re.match(r"[A-Z]R", s) is not None for s in med_cols['Mode']]].fillna(0),
                 axis=0)

array([1376, 1376, 1376, 1248, 1084,  286,   69, 1226])

It's really not. We can get rid of those two columns.
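
One caveat on the rail filter: the pattern `[A-Z]R` also matches `DR`, which, as far as I know, NTD uses for demand response rather than a rail mode. A sketch of filtering on an explicit list instead is shown below; the codes in `RAIL_MODES` are an assumption and should be checked against the NTD glossary, and `demo` is a made-up frame.

```python
import pandas as pd

# Assumed rail mode codes; verify against the NTD glossary before relying on them.
RAIL_MODES = ["CR", "HR", "LR", "SR", "YR"]

# Made-up frame with a mix of rail and non-rail mode codes.
demo = pd.DataFrame({"Mode": ["HR", "DR", "MB", "LR"], "Value": [1, 2, 3, 4]})

# The regex keeps any two-letter code ending in R, including DR.
by_regex = demo[demo["Mode"].str.match(r"[A-Z]R")]

# The explicit list keeps only the codes named in RAIL_MODES.
by_list = demo[demo["Mode"].isin(RAIL_MODES)]

print(by_regex["Mode"].tolist(), by_list["Mode"].tolist())
```

Given how sparse the two columns are, this is unlikely to change the conclusion, but it would change the counts.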

In [25]:
med_cols.drop(columns=['Number of Derailed Vehicles',
                       'Number of Cars on Involved Transit Vehicles'], inplace=True)

In [26]:
med_cols.columns = ['NTD ID', 'Mode', 'Year', 'Median Transit Vehicles Involved', 'Median Vehicles Involved', 'Median Vehicle Speed']
med_cols.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3725 entries, 0 to 3724
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   NTD ID                            3725 non-null   int64  
 1   Mode                              3725 non-null   object 
 2   Year                              3725 non-null   int64  
 3   Median Transit Vehicles Involved  3725 non-null   float64
 4   Median Vehicles Involved          3725 non-null   float64
 5   Median Vehicle Speed              3591 non-null   float64
dtypes: float64(3), int64(2), object(1)
memory usage: 174.7+ KB


Now we are ready to merge the two data frames into one dataset, which we can then export.

In [27]:
print(med_cols.shape)
print(sum_cols.shape)

(3725, 6)
(3725, 46)


In [28]:
major_data = sum_cols.merge(med_cols, on=['NTD ID', 'Mode', 'Year'])

In [29]:
major_data.shape

(3725, 49)
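
The matching row counts suggest the merge is one-to-one on the keys. pandas can enforce this directly through the `validate` argument of `merge`, which raises a `MergeError` if either side has duplicate keys. A minimal sketch on hypothetical frames `left` and `right`:

```python
import pandas as pd

keys = ["NTD ID", "Mode", "Year"]

# Hypothetical aggregated frames sharing the same key combinations.
left = pd.DataFrame({
    "NTD ID": [1, 2], "Mode": ["MB", "LR"], "Year": [2016, 2016],
    "Total Injuries": [4, 2],
})
right = pd.DataFrame({
    "NTD ID": [1, 2], "Mode": ["MB", "LR"], "Year": [2016, 2016],
    "Median Vehicle Speed": [15.0, 10.0],
})

# validate="one_to_one" checks that the keys are unique on both sides.
merged = left.merge(right, on=keys, how="inner", validate="one_to_one")
print(merged.shape)
```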

In [30]:
major_data['NTD ID'] = [f"{num:05d}" for num in major_data['NTD ID']]
major_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3725 entries, 0 to 3724
Data columns (total 49 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   NTD ID                                            3725 non-null   object 
 1   Mode                                              3725 non-null   object 
 2   Year                                              3725 non-null   int64  
 3   Property Damage                                   3725 non-null   float64
 4   Total Injuries                                    3725 non-null   int64  
 5   Total Fatalities                                  3725 non-null   int64  
 6   Transit Vehicle Rider Fatalities                  3725 non-null   int64  
 7   People Waiting or Leaving Fatalities              3725 non-null   int64  
 8   Transit Vehicle Operator Fatalities               3725 non-null   int64  
 9   Transit Employee Fa

In [None]:
major_data.to_csv('../../data/NTD_Major_Safety_Incidents.csv', sep=',', index=False)