In [41]:
import pandas as pd
import numpy as np

## Non-Major Incidents Data

The safety dataset consists of individual records of minor safety incidents from 2008 to 2024. The dataset is described at https://data.transportation.gov/Public-Transit/Non-Major-Safety-and-Security-Events/urir-txqm/about_data.

In [None]:
raw_dat = pd.read_csv("../data dump/NonMajor_Safety_Raw.csv", sep=',')
raw_dat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97032 entries, 0 to 97031
Data columns (total 20 columns):
 #   Column                                                                 Non-Null Count  Dtype 
---  ------                                                                 --------------  ----- 
 0   5 Digit NTD ID                                                         97032 non-null  int64 
 1   Agency                                                                 97032 non-null  object
 2   Mode                                                                   97032 non-null  object
 3   Type of Service                                                        97032 non-null  object
 4   Month                                                                  97032 non-null  object
 5   Year                                                                   97032 non-null  int64 
 6   Safety/Security                                                        97032 non-null  object




`Additional Assault Information` is free-form input, which is not useful for this analysis, so we drop it. We also only have capital expenditure data for 2016 to 2022, so we keep only those years.

In [43]:
raw_dat = raw_dat.drop(columns=['Additional Assault Information'])
raw_dat = raw_dat[(2016 <= raw_dat['Year']) & (raw_dat['Year'] <= 2022)]

In [44]:
raw_dat.info()

<class 'pandas.core.frame.DataFrame'>
Index: 31838 entries, 11824 to 43661
Data columns (total 19 columns):
 #   Column                                                                 Non-Null Count  Dtype 
---  ------                                                                 --------------  ----- 
 0   5 Digit NTD ID                                                         31838 non-null  int64 
 1   Agency                                                                 31838 non-null  object
 2   Mode                                                                   31838 non-null  object
 3   Type of Service                                                        31838 non-null  object
 4   Month                                                                  31838 non-null  object
 5   Year                                                                   31838 non-null  int64 
 6   Safety/Security                                                        31838 non-null  object
 

Because we are aggregating over full years, we can drop the `Month` field. We can also drop `Type of Service`, since we are looking at the transit agencies themselves, not the specifics of the services they offer. Finally, we drop the `Location` column, since `Location Group` already captures the broad category of each incident location.

In [45]:
raw_dat = raw_dat.drop(columns=['Month', 'Type of Service', 'Location'])
raw_dat.info()

<class 'pandas.core.frame.DataFrame'>
Index: 31838 entries, 11824 to 43661
Data columns (total 16 columns):
 #   Column                                                                 Non-Null Count  Dtype 
---  ------                                                                 --------------  ----- 
 0   5 Digit NTD ID                                                         31838 non-null  int64 
 1   Agency                                                                 31838 non-null  object
 2   Mode                                                                   31838 non-null  object
 3   Year                                                                   31838 non-null  int64 
 4   Safety/Security                                                        31838 non-null  object
 5   Event Type                                                             31838 non-null  object
 6   Location Group                                                         31838 non-null  object
 

Given how we want to combine this data with the NTD data, the `Event Type`, `Location Group`, and `Safety/Security` columns just get in the way, since we want to aggregate the bulk numbers across the agencies. We also drop the `Agency` column, since the NTD ID already identifies each agency.

In [46]:
raw_dat = raw_dat.drop(columns=['Safety/Security', 'Event Type', 'Location Group', 'Agency'])

These entries are by month, so we need to aggregate them by transit agency, mode, and year. The agencies have names, but even better, they have unique numeric identifiers, so we group on the `5 Digit NTD ID`.
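
As a minimal sketch of what this aggregation does, the toy frame below collapses several monthly rows into one row per (agency, mode, year). The IDs and counts here are made up for illustration and are not taken from the dataset; only the column names mirror the real ones.

```python
import pandas as pd

# Toy monthly records; the values are invented for illustration only.
monthly = pd.DataFrame({
    "5 Digit NTD ID": [10004, 10004, 10004, 90175],
    "Mode": ["MB", "MB", "DR", "MB"],
    "Year": [2018, 2018, 2018, 2017],
    "Total Injuries (Safety Events Only)": [3, 1, 2, 1],
})

# One row per (agency, mode, year); the remaining column is summed within each group.
yearly = (
    monthly
    .groupby(["5 Digit NTD ID", "Mode", "Year"])
    .sum()
    .reset_index()
)
print(yearly)
```

The two `MB` rows for agency 10004 in 2018 are merged into a single row, while the `DR` row for the same agency stays separate because `Mode` is part of the grouping key.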

In [47]:
dat_grouped = raw_dat.groupby(by=['5 Digit NTD ID', 'Mode', 'Year'])

In [48]:
agg_safety = dat_grouped.sum().reset_index()
agg_safety.sample(7)

Unnamed: 0,5 Digit NTD ID,Mode,Year,Physical Assaults on Operators (Security Events Only),Non-Physical Assaults on Operators (Security Events Only),Physical Assaults on Other Transit Workers (Security Events Only),Non-Physical Assaults on Other Transit Workers (Security Events Only),Total Events,Customer Injuries (Safety Events Only),Worker Injuries (Safety Events Only),Other Injuries (Safety Events Only),Total Injuries (Safety Events Only)
3550,90175,MB,2017,0,0,0,0,1,1,0,0,1
1224,40003,MB,2020,0,0,0,0,12,11,0,0,11
3625,90233,MB,2022,0,0,0,0,8,3,5,0,8
2864,70014,DR,2017,0,0,0,0,1,1,0,0,1
339,10004,MB,2018,0,0,0,0,4,3,1,0,4
3413,90090,MB,2018,0,0,0,0,5,5,0,0,5
531,10128,MB,2020,0,0,0,0,3,3,0,0,3


In [49]:
agg_safety.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3677 entries, 0 to 3676
Data columns (total 12 columns):
 #   Column                                                                 Non-Null Count  Dtype 
---  ------                                                                 --------------  ----- 
 0   5 Digit NTD ID                                                         3677 non-null   int64 
 1   Mode                                                                   3677 non-null   object
 2   Year                                                                   3677 non-null   int64 
 3   Physical Assaults on Operators (Security Events Only)                  3677 non-null   int64 
 4   Non-Physical Assaults on Operators (Security Events Only)              3677 non-null   int64 
 5   Physical Assaults on Other Transit Workers (Security Events Only)      3677 non-null   int64 
 6   Non-Physical Assaults on Other Transit Workers (Security Events Only)  3677 non-null   int64 
 7

One can verify that the four "(Security Events Only)" columns are all zero, so we remove them from the data to be analyzed, since they provide no information.
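
One way to run that check is sketched below. The helper `all_zero_columns` is a hypothetical name introduced here, not part of the original notebook, and the toy frame stands in for `agg_safety`.

```python
import pandas as pd

def all_zero_columns(df, columns):
    """Return the subset of `columns` whose values in `df` are all zero."""
    return [col for col in columns if (df[col] == 0).all()]

# Toy stand-in for agg_safety; the values are invented for illustration only.
toy = pd.DataFrame({
    "Physical Assaults on Operators (Security Events Only)": [0, 0, 0],
    "Total Injuries (Safety Events Only)": [1, 11, 8],
})
zero_cols = all_zero_columns(toy, toy.columns)
print(zero_cols)
```

In the notebook, a call along the lines of `all_zero_columns(agg_safety, agg_safety.columns[3:7])` would be expected to return all four security column names if the claim holds.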

In [51]:
# Columns 3 through 6 are the four "(Security Events Only)" assault columns.
agg_safety.drop(columns=agg_safety.columns[3:7], inplace=True)
agg_safety.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3677 entries, 0 to 3676
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype 
---  ------                                  --------------  ----- 
 0   5 Digit NTD ID                          3677 non-null   int64 
 1   Mode                                    3677 non-null   object
 2   Year                                    3677 non-null   int64 
 3   Total Events                            3677 non-null   object
 4   Customer Injuries (Safety Events Only)  3677 non-null   int64 
 5   Worker Injuries (Safety Events Only)    3677 non-null   int64 
 6   Other Injuries (Safety Events Only)     3677 non-null   int64 
 7   Total Injuries (Safety Events Only)     3677 non-null   int64 
dtypes: int64(6), object(2)
memory usage: 229.9+ KB


In [53]:
# Rename the columns for easier access.
agg_safety.columns = ['NTD ID', 'Mode', 'Year', 'Total Minor Incidents', 'Customer Injuries',
                      'Worker Injuries', 'Other Injuries', 'Total Injuries']
# 'Total Events' came through with object dtype, so cast it to an integer type.
agg_safety['Total Minor Incidents'] = agg_safety['Total Minor Incidents'].astype('int64')
agg_safety.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3677 entries, 0 to 3676
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   NTD ID                 3677 non-null   int64 
 1   Mode                   3677 non-null   object
 2   Year                   3677 non-null   int64 
 3   Total Minor Incidents  3677 non-null   int64 
 4   Customer Injuries      3677 non-null   int64 
 5   Worker Injuries        3677 non-null   int64 
 6   Other Injuries         3677 non-null   int64 
 7   Total Injuries         3677 non-null   int64 
dtypes: int64(7), object(1)
memory usage: 229.9+ KB
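
Before writing the file, a quick consistency check may be worthwhile. The sketch below assumes that `Total Injuries` is meant to equal the sum of the customer, worker, and other injury counts, which matches the sampled rows above but is not confirmed by the dataset documentation here. The helper `injuries_consistent` is a hypothetical name, and the third toy row is deliberately inconsistent.

```python
import pandas as pd

def injuries_consistent(df):
    """Boolean Series: True where the three injury counts sum to 'Total Injuries'."""
    parts = df[["Customer Injuries", "Worker Injuries", "Other Injuries"]].sum(axis=1)
    return parts.eq(df["Total Injuries"])

# Toy stand-in for agg_safety; the last row is invented and intentionally wrong.
toy = pd.DataFrame({
    "Customer Injuries": [1, 3, 2],
    "Worker Injuries": [0, 5, 0],
    "Other Injuries": [0, 0, 0],
    "Total Injuries": [1, 8, 3],
})
mismatches = toy[~injuries_consistent(toy)]
print(mismatches)
```

Applied to `agg_safety`, an empty `mismatches` frame would suggest the totals are internally consistent under that assumption.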


In [54]:
agg_safety.to_csv('../data/NTD_NonMajor_Safety_Incidents.csv', index=False)