Thomas Dougherty

Probability and Statistics for Computer Science


### Analysis of the On-Time Performance (OTP) of New Jersey Transit Commuter Rail<br>
On-time performance measures how closely a service adheres to its published schedule. In this example I'll be exploring the on-time performance of NJT commuter rail from March 2018 to March 2020. On-time performance data can be used for schedule planning, passenger information systems, and comparison with weather data.
    

#### Data cleanup

In [None]:

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from datetime import datetime
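# The cells below call helper.<function>, so bind the module to the name `helper`
# (assumes the helpers are module-level functions in helper_functions.py)
import helper_functions as helper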

#all_services = helper.combine_csvs("data\\services\\")
print("Reading CSV....")
all_services = pd.read_csv('data\\all_services.csv')
all_services = helper.format_services(all_services)
all_services.head(10)

## Visualization of On Time Performance
New Jersey Transit defines "On Time" as arriving within 6 minutes of the published schedule. A helper function, `helper.categorize_lateness`, iterates through the data frame and classifies each arrival into one of four categories, keeping the counts in a four-element array.
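
The helper itself is not shown in this notebook. As a rough sketch of the counting logic described above, assuming each arrival carries a delay in minutes, it might look like the following. The input name `delay_minutes` and the exact bucket boundaries are assumptions read off the chart labels, not the author's implementation.

```python
def categorize_lateness(delay_minutes):
    """Count arrivals in four buckets: on time, 6-10, 10-15, and more than 15 minutes late.

    `delay_minutes` is any iterable of delays in minutes (hypothetical input;
    the notebook's helper takes a DataFrame instead).
    """
    counts = [0, 0, 0, 0]
    for delay in delay_minutes:
        if delay < 6:        # assumed: less than 6 minutes late counts as on time
            counts[0] += 1
        elif delay < 10:     # 6 up to (not including) 10 minutes late
            counts[1] += 1
        elif delay <= 15:    # 10 to 15 minutes late
            counts[2] += 1
        else:                # more than 15 minutes late
            counts[3] += 1
    return counts


print(categorize_lateness([0.0, 2.5, 6.0, 9.9, 12.0, 15.0, 40.0]))  # prints [2, 2, 2, 1]
```

The returned list lines up with the four pie-chart labels used in the cells below, in the same order.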

#### All Services by Year

In [None]:
#To Do: automate this
# Get the data for each subplot
data_2018 = helper.categorize_lateness(all_services[(all_services['date'].dt.year == 2018)])
data_2019 = helper.categorize_lateness(all_services[(all_services['date'].dt.year == 2019)])
data_2020 = helper.categorize_lateness(all_services[(all_services['date'].dt.year == 2020)])
late_labels = ['On Time', '6-10 Minutes Late', '10-15 Minutes Late', 'More Than 15 Minutes Late']
late_colors = ['green', 'yellow', 'orange', 'red']

fig, (ax1, ax2, ax3) = plt.subplots(1,3)

ax1.pie(data_2018, colors=late_colors, radius = 1.2, autopct = "%0.2f%%", startangle=270)
ax1.title.set_text('On Time Performance in 2018')
ax2.pie(data_2019, colors=late_colors, radius = 1.2, autopct = "%0.2f%%", startangle=270)
ax2.title.set_text('On Time Performance in 2019')
ax3.pie(data_2020, colors=late_colors, radius = 1.2, autopct = "%0.2f%%", startangle=270)
ax3.title.set_text('On Time Performance in 2020*')
plt.text(1.0, -1.3, "*Data collection ended in May 2020", fontsize=10)
fig.set_figwidth(12)
plt.legend(bbox_to_anchor = (1.0, 1.0), labels = late_labels)
plt.show()

#### All Services by Season

Using the date column, arrivals are divided into four seasonal groups. Each group is then sub-categorized by minutes late and plotted as a pie chart to show on-time performance by season. The month ranges share their boundary months, so March, June, September, and December each count toward two seasons.

In [None]:
# TO DO: Automate this
data_fall = helper.categorize_lateness(all_services[(all_services['date'].dt.month >= 9) & (all_services['date'].dt.month <= 12)])
# Winter wraps around the year boundary, so the two month ranges are combined with | (element-wise OR)
month = all_services['date'].dt.month
data_winter = helper.categorize_lateness(all_services[(month <= 3) | (month >= 12)])
data_spring = helper.categorize_lateness(all_services[(all_services['date'].dt.month >= 3) & (all_services['date'].dt.month <=6)])
data_summer = helper.categorize_lateness(all_services[(all_services['date'].dt.month >= 6) & (all_services['date'].dt.month <=9)])

fig, (ax1, ax2, ax3, ax4) = plt.subplots(1,4)

# TO DO: automate this as well
ax1.pie(data_fall, colors=late_colors, radius = 1.2, autopct = "%0.2f%%", startangle=270)
ax1.title.set_text('September - December')
ax2.pie(data_winter, colors=late_colors, radius = 1.2, autopct = "%0.2f%%", startangle=270)
ax2.title.set_text('December - March')
ax3.pie(data_spring, colors=late_colors, radius = 1.2, autopct = "%0.2f%%", startangle=270)
ax3.title.set_text('March - June')
ax4.pie(data_summer, colors=late_colors, radius = 1.2, autopct = "%0.2f%%", startangle=270)
ax4.title.set_text('June - September')

fig.set_figwidth(15)
fig.set_figheight(5)

plt.legend(bbox_to_anchor = (1.0, 1.0), labels = late_labels)
plt.show()
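
One way the season split could be automated is with a month-to-season lookup. The sketch below is not the notebook's method: it assumes non-overlapping meteorological seasons (December to February as winter, and so on), which differs from the overlapping ranges used in the cell above. The names `SEASON_BY_MONTH` and `season_of` are hypothetical.

```python
from collections import Counter
from datetime import date

# Assumed, non-overlapping meteorological seasons: each month belongs to exactly one season.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def season_of(day):
    """Return the season name for a date (or anything with a .month attribute)."""
    return SEASON_BY_MONTH[day.month]


# Made-up sample dates, only to show the grouping
sample_dates = [
    date(2018, 3, 1),
    date(2018, 12, 25),
    date(2019, 1, 15),
    date(2019, 7, 4),
    date(2019, 10, 31),
]
arrivals_per_season = Counter(season_of(d) for d in sample_dates)
print(arrivals_per_season)
```

With pandas, the same dictionary could be applied as `all_services['date'].dt.month.map(SEASON_BY_MONTH)` to produce a season column, which would replace the four hand-written filters.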

In autumn, fallen leaves tend to accumulate on the rails. As the leaves are crushed under the weight of passing trains, the wheels and rails become coated in a low-friction residue. The buildup is incremental with each passing train, so unlike snow, it cannot be prevented by running trains regularly. These conditions make it difficult for trains to accelerate, decelerate, and maintain safe speeds, which leads to delays and other service disruptions, so we can expect lower on-time performance during the autumn months.

#### OTP by Line


In [None]:
njt_lines = all_services['line'].unique()

line_labels = ['Northeast Corridor', 'North Jersey Coast', 'Main Line', 'Morristown Line', 'Gladstone Branch',
               'Raritan Valley', 'Bergen County Line', 'Atlantic City Line', 'Montclair-Boonton',
               'Princeton Shuttle', 'Pascack Valley']

#otp_by_line_2018 = helper.get_otp_data(all_services,'line', njt_lines, 2018)
#otp_by_line_2019 = helper.get_otp_data(all_services,'line', njt_lines, 2019)
#helper.chart_subplots(otp_by_line_2018,line_labels, late_labels, late_colors, "On Time Performance In 2018")
#helper.chart_subplots(otp_by_line_2019,line_labels, late_labels, late_colors, "On Time Performance In 2019")
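
The commented-out `helper.get_otp_data` call is not defined in this notebook. As a rough illustration of what a per-line on-time percentage involves, here is a sketch that works on plain `(line, delay)` pairs. The function name `otp_by_group`, the input shape, and the 6-minute cutoff constant are all assumptions.

```python
from collections import defaultdict

ON_TIME_LIMIT = 6  # minutes; assumed cutoff, taken from the "On Time" definition above


def otp_by_group(records):
    """Return {group: percent of arrivals on time} for (group, delay_minutes) pairs."""
    totals = defaultdict(int)
    on_time = defaultdict(int)
    for group, delay in records:
        totals[group] += 1
        if delay < ON_TIME_LIMIT:
            on_time[group] += 1
    return {group: round(100 * on_time[group] / totals[group], 2) for group in totals}


# Made-up sample arrivals, only to show the calculation
sample = [
    ("Northeast Corridor", 1.0),
    ("Northeast Corridor", 8.0),
    ("Main Line", 0.0),
    ("Main Line", 3.0),
    ("Main Line", 20.0),
    ("Main Line", 2.0),
]
print(otp_by_group(sample))  # prints {'Northeast Corridor': 50.0, 'Main Line': 75.0}
```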


## Reporting to Final Destination On Time
To Do: Systemwide, By Line, New York Penn Station <br>
bar charts?

In [35]:
trains_to_nyp = all_services[(all_services['to'] == 'New York Penn Station')]
trains_from_nyp = all_services[(all_services['from'] == 'New York Penn Station')]
trains_to_nyp.dtypes
am_start = pd.to_datetime("06:00:00")
am_end = pd.to_datetime("09:30:00")
pm_start = pd.to_datetime("16:00:00")
pm_end = pd.to_datetime("19:00:00")
# AM-PM Weekday Rush Hours
# Filter with boolean masks and take explicit copies; calling drop(..., inplace=True)
# on a slice of another DataFrame triggers pandas' SettingWithCopyWarning.
to_time = trains_to_nyp['scheduled_time'].dt.time
from_time = trains_from_nyp['scheduled_time'].dt.time
to_weekday = trains_to_nyp['date'].dt.weekday <= 4      # Monday=0 ... Friday=4
from_weekday = trains_from_nyp['date'].dt.weekday <= 4

am_peak_nyp = trains_to_nyp[(to_time >= am_start.time()) & (to_time <= am_end.time()) & to_weekday].copy()
pm_peak_nyp = trains_from_nyp[(from_time >= pm_start.time()) & (from_time <= pm_end.time()) & from_weekday].copy()

#Off Peak: weekday trains to New York Penn Station outside both rush-hour windows
off_peak_mask = ((to_time < am_start.time())
                 | ((to_time > am_end.time()) & (to_time < pm_start.time()))
                 | (to_time > pm_end.time()))
off_peak_nyp = trains_to_nyp[off_peak_mask & to_weekday].copy()

lines_to_nyp = trains_to_nyp['line'].unique()
categories = ['AM Peak', 'PM Peak', 'Weekday', 'Weekend']


print(helper.calculate_otp(am_peak_nyp[(am_peak_nyp['date'].dt.month == 10) & (am_peak_nyp['date'].dt.year == 2019)]))
print(helper.calculate_otp(pm_peak_nyp[(pm_peak_nyp['date'].dt.month == 10) & (pm_peak_nyp['date'].dt.year == 2019)]))
print(helper.calculate_otp(off_peak_nyp[(off_peak_nyp['date'].dt.month == 10) & (off_peak_nyp['date'].dt.year == 2019)]))


81.16
73.56
86.49
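
The rush-hour split used in the cell above can also be written as a single classification function. This is a simplified sketch, not the notebook's logic: it reuses the 06:00-09:30 and 16:00-19:00 windows and the `categories` labels, but it ignores direction of travel, whereas the notebook applies the AM window to trains bound for New York Penn Station and the PM window to trains leaving it. The name `service_period` is hypothetical.

```python
from datetime import datetime, time

AM_START, AM_END = time(6, 0), time(9, 30)
PM_START, PM_END = time(16, 0), time(19, 0)


def service_period(scheduled):
    """Classify a scheduled datetime as 'AM Peak', 'PM Peak', 'Weekday', or 'Weekend'."""
    if scheduled.weekday() > 4:      # Saturday=5, Sunday=6
        return "Weekend"
    t = scheduled.time()
    if AM_START <= t <= AM_END:
        return "AM Peak"
    if PM_START <= t <= PM_END:
        return "PM Peak"
    return "Weekday"                 # weekday service outside both rush-hour windows


print(service_period(datetime(2019, 10, 7, 8, 15)))   # a Monday morning: prints AM Peak
print(service_period(datetime(2019, 10, 12, 8, 15)))  # a Saturday morning: prints Weekend
```

Assigning one label per arrival would let a single group-by produce all four categories, including the weekend group that the cell above does not compute.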
