In [1]:
import pandas as pd
import seaborn as sns
from pathlib import Path
import matplotlib.pyplot as plt
import holoviews as hv
import hvplot.pandas  # noqa

pd.options.plotting.backend = "holoviews"

import warnings

warnings.filterwarnings("ignore")

data_folder = Path().absolute() / "../data/processed/"
file_name = "processed_df.csv"

# check file size
file_size_bytes = (data_folder / file_name).stat().st_size
print(f"The file size in bytes is: {file_size_bytes: 1G}")

The file size in bytes is:  2.89454E+06


In [2]:
df = pd.read_csv(
    data_folder / file_name,
    parse_dates=["DateTime"],
)
df.head(10)

Unnamed: 0,Day,Station,Code,Min Delay,Min Gap,Bound,Line,Vehicle,DateTime,Year,Month,Hour,meridian
0,Saturday,LAWRENCE EAST STATION,SRDP,0,0,N,SRT,3023,2022-01-01 15:59:00,2022,1,15,PM
1,Saturday,SPADINA BD STATION,MUIS,0,0,,BD,0,2022-01-01 02:23:00,2022,1,2,AM
2,Saturday,KENNEDY SRT STATION TO,MRO,0,0,,SRT,0,2022-01-01 22:00:00,2022,1,22,PM
3,Saturday,VAUGHAN MC STATION,MUIS,0,0,,YU,0,2022-01-01 02:28:00,2022,1,2,AM
4,Saturday,EGLINTON STATION,MUATC,0,0,S,YU,5981,2022-01-01 02:34:00,2022,1,2,AM
5,Saturday,QUEEN STATION,MUNCA,0,0,,YU,0,2022-01-01 05:40:00,2022,1,5,AM
6,Saturday,DAVISVILLE STATION,MUNCA,0,0,,YU,0,2022-01-01 06:56:00,2022,1,6,AM
7,Saturday,ST PATRICK STATION,MUNCA,0,0,,YU,0,2022-01-01 06:58:00,2022,1,6,AM
8,Saturday,PAPE STATION,MUNCA,0,0,,BD,0,2022-01-01 07:01:00,2022,1,7,AM
9,Saturday,WILSON STATION,TUATC,10,0,S,YU,5896,2022-01-01 07:43:00,2022,1,7,AM


In [3]:
def calculate_metrics(sub_category: pd.DataFrame) -> pd.Series:
    """Function used to calculate subway delay metrics.

    Args:
        sub_category (pd.DataFrame): Sub category from a given column.

    Returns:
        pd.Series: Series of metrics used for data analysis.
    """
    metrics = {}
    no_delay_count = sub_category[sub_category["Min Delay"] == 0].shape[0]
    total_service_count = sub_category.shape[0]

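    # Percentage of services with zero delay. A group with no on-time services
    # already yields 0.0; only an empty group needs a guard against dividing by zero.
    if total_service_count == 0:
        on_time_perc = 0.0
    else:
        on_time_perc = no_delay_count * 100 / total_service_count

    metrics["on_time_perc_performance"] = on_time_perc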
    metrics["total_service_count"] = total_service_count
    metrics["total_delay"] = sub_category["Min Delay"].sum()

    return pd.Series(metrics)


### Calculate each category and flag categories of concern

In [4]:
categories = ["Station", "Code", "Bound", "Line", "Vehicle", "Month", "Year", "Hour", "meridian"]
all_delayed_categories = {}
plots = []
for category in categories:
    temp = df.groupby(category).apply(calculate_metrics)
    temp["mean_delay"] = temp["total_delay"] / temp["total_service_count"]
    temp = temp.reset_index()

    all_delayed_categories[category] = temp
    plot = temp.hvplot.scatter(
        y="total_delay",
        x="total_service_count",
        color="mean_delay",
        label=category,
        hover_cols=["total_service_count", "total_delay", category],
        xlim=(0, None),
        ylim=(0, None),
    )
    plots.append(plot)

In [5]:
hv.Layout(plots).cols(1).opts(shared_axes=False)


Points in the top-left corner, i.e. sub-categories with a low service count but a high total delay, should be investigated.

For example, the worst-performing sub-category in each category appears to be:
- Vehicle number 5796
- Line SRT
- Westbound trains
- Code MUPR1
- McCowan station

All of these (amongst many others) deviate from the general trend.
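
The visual check above can also be done programmatically. The following is a minimal sketch, not the method used in this notebook: it flags sub-categories whose mean delay per service is well above the overall mean. The helper `flag_outliers`, the `ratio` threshold, and the sample numbers are all hypothetical and only stand in for one of the frames stored in `all_delayed_categories`.

```python
import pandas as pd

# Hypothetical metrics, shaped like one frame from `all_delayed_categories`.
# The numbers are made up for illustration and are not taken from the dataset.
metrics = pd.DataFrame(
    {
        "Line": ["YU", "BD", "SHP", "SRT"],
        "total_service_count": [1000, 800, 100, 50],
        "total_delay": [2000, 1500, 150, 400],
    }
)
metrics["mean_delay"] = metrics["total_delay"] / metrics["total_service_count"]


def flag_outliers(frame: pd.DataFrame, ratio: float = 2.0) -> pd.DataFrame:
    """Flag rows whose mean delay exceeds `ratio` times the overall mean delay.

    `ratio` is an arbitrary, assumed threshold.
    """
    overall_mean = frame["total_delay"].sum() / frame["total_service_count"].sum()
    return frame.assign(flagged=frame["mean_delay"] > ratio * overall_mean)


flagged = flag_outliers(metrics)
print(flagged.sort_values("mean_delay", ascending=False))
```

A ratio against the overall mean is only one possible rule; a quantile cut-off on `mean_delay` would be an equally reasonable starting point.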

The `mean_delay` hue encodes an important metric: the `average minutes delay per service`, i.e. `total_delay / total_service_count`. Points highlighted in a darker shade are alarming and mark sub-categories that need further investigation.


Regarding the date and time of the services, the meridian information showed that PM services often had more delays, which fits the expectation that delays are concentrated at peak times. The yearly information was also in line with expectations: 2020 and 2021 showed reduced service and lower total delays. However, 2021 was the worst-performing year in terms of `average minutes delay per service`.


# Conclusion

From these plots we can see that the total delay increases as the total number of services increases, which is what we would expect!

Apart from this general trend, we could use the categorical information along with the timestamp to predict the `total_delay`. This would mean the project involves a regression model to determine the delay in minutes.

**However**, a more useful model would predict a **delay before it occurs**, for which the exact total delay in minutes is not relevant. We would bucket the delays in minutes into 3 categories [`<2`, `<10`, `>11`]. A delay of `>11` would be considered a major problem. *n.b. these are arbitrary values; the exact buckets should be coordinated with Toronto Subway*.
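
As a rough illustration of the bucketing, the sketch below uses `pd.cut` on a `Min Delay` series. The bin edges `[0, 2)`, `[2, 11)`, `[11, inf)` and the label names `minor`, `moderate`, `major` are assumptions made here so that every delay value falls into exactly one bucket; they are placeholders, not agreed thresholds.

```python
import pandas as pd

# Assumed, placeholder bin edges: [0, 2), [2, 11), [11, inf).
DELAY_BINS = [0, 2, 11, float("inf")]
# Hypothetical label names for the three buckets.
DELAY_LABELS = ["minor", "moderate", "major"]


def bucket_delays(min_delay: pd.Series) -> pd.Series:
    """Map delays in minutes to ordered categorical buckets."""
    # right=False makes each interval closed on the left and open on the right.
    return pd.cut(min_delay, bins=DELAY_BINS, labels=DELAY_LABELS, right=False)


sample = pd.Series([0, 1, 2, 10, 11, 45], name="Min Delay")
buckets = bucket_delays(sample)
print(pd.concat([sample, buckets.rename("delay_bucket")], axis=1))
```

On the real data this would be applied as `bucket_delays(df["Min Delay"])` to build the classification target.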

Thus, a classification model would be more beneficial for this task.

In production, we would be able to classify which journeys are likely to face issues. We would want a model with *high recall* for the `>11` class: if we misclassify journeys from this category into the lower classes, the result is unexpected delays. We would rather predict a greater delay and have a smaller delay occur in reality.
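
To make the recall requirement concrete, here is a small sketch that computes recall for a single class from true and predicted labels. The helper `recall_for_class`, the label names, and the example predictions are all hypothetical; in practice a library metric such as scikit-learn's `recall_score` would likely be used instead.

```python
def recall_for_class(y_true, y_pred, positive):
    """Recall = true positives / all actual positives for the `positive` class."""
    true_positive = sum(
        1 for truth, pred in zip(y_true, y_pred) if truth == positive and pred == positive
    )
    actual_positive = sum(1 for truth in y_true if truth == positive)
    if actual_positive == 0:
        # Recall is undefined when the class never occurs in y_true.
        return float("nan")
    return true_positive / actual_positive


# Made-up labels: three journeys were truly "major", two of them were caught.
y_true = ["major", "major", "minor", "moderate", "major", "minor"]
y_pred = ["major", "moderate", "minor", "major", "major", "minor"]

major_recall = recall_for_class(y_true, y_pred, "major")
print(f"Recall for the major class: {major_recall:.2f}")
```

Note that the `moderate` journey predicted as `major` does not reduce recall; it only costs precision, which is the trade-off preferred above.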

However, having this model would not allow us to take measures to prevent delays from occurring in the first place, as we do not have the delay `code`.


If we wanted to prevent trains from being delayed, we would change this into a classification problem that predicts the delay code (and remove the delay in minutes field due to target leakage). As the number of codes is much larger, this model may not be able to predict the specific codes accurately because of class imbalance. I would suggest reducing the number of codes or increasing the data used for training.
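
One way to reduce the number of codes is to collapse rarely seen codes into a single catch-all class. The sketch below assumes a hypothetical `min_count` threshold and an `OTHER` label, neither of which comes from the dataset; grouping codes by their operational meaning would be an alternative.

```python
import pandas as pd


def collapse_rare_codes(codes: pd.Series, min_count: int = 2, other: str = "OTHER") -> pd.Series:
    """Replace codes seen fewer than `min_count` times with a catch-all label."""
    counts = codes.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with `other`.
    return codes.where(~codes.isin(rare), other)


# Small made-up sample of delay codes.
codes = pd.Series(["MUIS", "MUIS", "MUNCA", "MUNCA", "MUNCA", "SRDP", "MRO"], name="Code")
reduced = collapse_rare_codes(codes, min_count=2)
print(reduced.value_counts())
```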