# System Design
This section proposes a real-time system to predict bus arrival times. We describe it in layman's terms for general audiences but also link to detailed sub-sections [[⚙️]](#) for technical audiences who want to learn more.


----
## Problem Framing

We treat this as a regression task in which the system predicts the next bus arrival deviation time (in minutes) for a given stop in the transit network [[⚙️12]](#). Learning is implicitly supervised from historical route trips. Because routes are sequences of stops whose arrival times are influenced by those of previous stops, we model the input as a sequence problem, as shown in the figure below [[⚙️3]](#).

<img src="../images/problem_framing.png" width="100%">

To predict the arrival time for any given stop at any moment (orange highlighted target in the figure), the system uses 2 context windows as input. The spatial window tracks the status times of previous stops in the current route's trip with respect to the bus GPS location & is meant to represent the local &/or recent state of the environment. Meanwhile, the temporal window tracks the status times of the target stop in previous route trips & is meant to represent lingering or accumulated global congestion.

To forecast future stops (right of target in the figure), the system recursively predicts the next stop's arrival time, treating each previous prediction as if it were accurate.
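
The recursive forecast can be sketched as a simple rollout loop. This is a minimal illustration, not the proposed model: `predict_next`, `toy_model` & the `window` parameter are hypothetical names, and for brevity only the spatial window is fed back while the temporal window is ignored.

```python
from typing import Callable, List, Sequence


def rollout(predict_next: Callable[[Sequence[float]], float],
            observed: Sequence[float],
            n_future: int,
            window: int) -> List[float]:
    """Recursively predict deviations (minutes) for the next n_future stops."""
    history = list(observed)
    predictions = []
    for _ in range(n_future):
        context = history[-window:]   # most recent stops only
        pred = predict_next(context)
        predictions.append(pred)
        history.append(pred)          # treat the prediction as if it were observed
    return predictions


def toy_model(context: Sequence[float]) -> float:
    # Hypothetical stand-in for a trained estimator: delay halves at each stop.
    return 0.5 * context[-1]


future = rollout(toy_model, observed=[1.0, 2.0, 4.0], n_future=3, window=2)
print(future)
```

Because each step consumes earlier predictions, errors can compound the further ahead the rollout goes.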

----
## Dataset Preparation

To build & evaluate our system, we need to create representative open `train`, `validate` & blind `test` datasets. Each of these must not only preserve the time-series chronology of the data but also respect each route's sequence of stops [[⚙️7]](#). We achieve this in 2 steps.

### Triage Step
Employ a non-overlapping moving window split strategy as shown next.
That is, sampling partitions the time-series into chronological fixed-size chunks (say a `3 day` period with a `1 day` gap [[⚙️8]](#)), taking all available route data within each period. [[⚙️9]](#)
<img src="../images/dataset_sampling.png" width="100%">
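
As a rough sketch of the chunking described above, the function below partitions a date range into fixed-size periods separated by gaps. The function name, the example dates & the choice to drop an incomplete trailing chunk are assumptions for illustration; how chunks are assigned to `train`, `validate` & `test` follows the figure & is not shown here.

```python
from datetime import date, timedelta
from typing import List


def chunk_days(start: date, end: date, period: int, gap: int) -> List[List[date]]:
    """Split [start, end] into consecutive `period`-day chunks, skipping `gap` days between them."""
    chunks = []
    cursor = start
    # Keep only complete chunks; an incomplete trailing chunk is dropped.
    while cursor + timedelta(days=period - 1) <= end:
        chunks.append([cursor + timedelta(days=i) for i in range(period)])
        cursor += timedelta(days=period + gap)
    return chunks


chunks = chunk_days(date(2024, 1, 1), date(2024, 1, 12), period=3, gap=1)
for chunk in chunks:
    print(chunk[0], "to", chunk[-1])
```

All route data observed on the days of a chunk would then be kept together, so no trip is cut across datasets.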

### Contextualize Step
For each dataset, we need to aggregate the individual stops of a directional route into a time-stamped sequence. We then slide our spatial & temporal windows (say of `S` & `T` respective sizes) over it to get 2 sequences of contextual data to learn from, as well as the observed deviation time label to predict.

$$
\vec{spatial\_window} = \left [ 
stop\_id^{\hspace{0.1cm}s=-S}_{\hspace{0.1cm}t=0},
\cdots ,
stop\_id^{\hspace{0.1cm}s=-2}_{\hspace{0.1cm}t=0},
stop\_id^{\hspace{0.1cm}s=-1}_{\hspace{0.1cm}t=0}
\right ]
$$
$$
\vec{temporal\_window} = \left [ 
stop\_id^{\hspace{0.1cm}s=0}_{\hspace{0.1cm}t=-T},
\cdots ,
stop\_id^{\hspace{0.1cm}s=0}_{\hspace{0.1cm}t=-2},
stop\_id^{\hspace{0.1cm}s=0}_{\hspace{0.1cm}t=-1}
\right ]
$$
$$
predict\_label = stop\_id^{\hspace{0.1cm}s=0}_{\hspace{0.1cm}t=0}
$$
[[⚙️10]](#) [[⚙️11]](#)
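
The window construction can be sketched as follows. This is an illustrative reading of the formulas above, assuming each observation is keyed by a hypothetical `(trip, stop)` index pair & that `None` plays the role of the out-of-vocabulary padding id.

```python
from typing import List, Optional, Tuple

Key = Optional[Tuple[int, int]]  # (trip index, stop index); None marks padding
PAD: Key = None


def context_windows(trip: int, stop: int, S: int, T: int) -> Tuple[List[Key], List[Key], Key]:
    """Return the spatial window, temporal window & label key for one target."""
    # Previous S stops of the same trip (s = -S .. -1, t = 0).
    spatial = [(trip, stop - k) if stop - k >= 0 else PAD for k in range(S, 0, -1)]
    # Same stop in the previous T trips (s = 0, t = -T .. -1).
    temporal = [(trip - k, stop) if trip - k >= 0 else PAD for k in range(T, 0, -1)]
    label = (trip, stop)
    return spatial, temporal, label


spatial, temporal, label = context_windows(trip=1, stop=2, S=3, T=2)
print(spatial)   # padded on the left because the route has only 2 earlier stops
print(temporal)  # padded on the left because there is only 1 earlier trip
```

The keys would later be resolved to feature vectors & deviation times through a lookup, in line with the stop-id encoding noted above.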


----
## Feature Engineering

### Temporal Signals
Anything that describes the moment in time at which the prediction needs to be made.
  - current & scheduled second in minute [[⚙️1]](#)
  - current & scheduled minute in hour [[⚙️1]](#)
  - current & scheduled time of day [[⚙️1]](#)
  - current & scheduled day of week [[⚙️1]](#)
  - current & scheduled month in year [[⚙️1]](#)
  - current holiday flag (ie: Christmas, Canada Day, Louis Riel Day, ..)

### Spatial Signals
Anything that describes the static geographic conditions of the stops.
  - stop location (ie: parent & neighboring areas [[⚙️2]](#))
  - stop intersection type (ie: lights, stop-sign, mid-road, ..)
  - speed limits between stops
  - distance between stops
  - % road types between stops (ie: street, ave, freeway, ..)
  - road attributes between stops (ie: tunnels, bridges)
  - number of routes served by this stop
  - zone (ie: commercial, residential, industrial)
  - population density around stop (ie: within various 50-500m radii)
  - distance to special POIs (ie: fire station, hospital, school, train stations, ..)

### Contextual Signals
Anything that describes the dynamic state of the transit network.
  - next bus GPS coordinates
  - next bus passenger volume
  - passengers waiting at previous `n` stops along route
  - current route bus status at previous `n` stops (ie: early, delayed, on-time, pending)
  - previous route bus status at target stop (ie: early, delayed, on-time, pending)
  - weather conditions (ie: rain, snow, wind, fog, ..)
  - road closures (ie: construction, incidents, festivities, ..)
  - current traffic levels between stops (ie: low, medium, high)


----
## System Architecture

The proposed high-level architecture is shown below. Whenever a bus arrival time prediction is requested for known stops in the transit network, the system must be given the conditions of the route window along with the contextual data described in the previous sections.

<img src="../images/system_arch.png" width="100%">

The first part of the architecture *transforms* inputs into values for the final *estimator* part to interpret & produce a time deviation result.

Since the data is heavily skewed, the transformer is also responsible for balancing (at learning time) the proportions of ``delayed``, ``early`` & ``on-time`` examples the estimator sees [[⚙️4]](#).

The estimator itself consists of a couple of sub-models. The spatial sub-model learns to predict the arrival time given the status of the previous stops of the current route, while the temporal sub-model learns to predict the arrival time given the status of the previous routes for the given stop [[⚙️3]](#) [[⚙️5]](#). Since the data has several outlier spikes, we further propose an anomaly detection sub-model to help learn how these propagate through the transit network [[⚙️6]](#).

----
## Tuning & Evaluation

* hyperparameters (ie: spatial & temporal window sizes, feature encodings)
* probability calibration
* temporal k-fold validation
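
One common reading of temporal k-fold validation is forward chaining, where each fold trains on everything before its validation block. The sketch below assumes that reading; the function name & the `gap` parameter are illustrative, & the indices are assumed to refer to chronologically ordered chunks of route data rather than individual rows.

```python
from typing import List, Tuple


def forward_chaining_folds(n_samples: int, k: int, gap: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Build k (train, validate) index splits where validation always follows training."""
    fold_size = n_samples // (k + 1)
    if fold_size == 0 or gap >= fold_size:
        raise ValueError("not enough samples for the requested folds & gap")
    folds = []
    for i in range(1, k + 1):
        train_end = i * fold_size
        train = list(range(0, train_end))
        # Skip `gap` samples after training to reduce temporal correlation.
        validate = list(range(train_end + gap, train_end + fold_size))
        folds.append((train, validate))
    return folds


folds = forward_chaining_folds(n_samples=12, k=3, gap=1)
for train, validate in folds:
    print(train, validate)
```

Hyperparameters such as the window sizes would then be scored on each fold & averaged.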

----
## System Integration

### Training Time
* Retrain the system whenever:
  - route or stop locations are added, removed or changed.
  - route schedules are updated
  - annually, to account for behavior changes (due to demographics, roads, POIs, price, ..)

### Inference Time
* Use schedule times at the start of the route.
* Trigger the system whenever a bus arrives at or departs from a stop.
* Also trigger at periodic intervals in case the bus is unexpectedly stuck in traffic.
* On each trigger, predict the arrival time for the next stop & use that estimate to simulate predictions for all subsequent stops in the route.

----
## Technical Details


\[⚙️1]: To avoid the curse of high dimensionality from one-hot-encoding categorical features that are cyclical in nature, we can instead trigonometrically encode these with only 2 features per signal (its sine & cosine).
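
A minimal sketch of the trigonometric encoding, assuming each cyclical signal is given as a number together with its period (the function name is illustrative):

```python
import math
from typing import Tuple


def cyclical_encode(value: float, period: float) -> Tuple[float, float]:
    """Map a cyclical value (e.g. hour of day, period=24) onto the unit circle."""
    angle = 2 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


midnight = cyclical_encode(0, 24)
late_evening = cyclical_encode(23, 24)
noon = cyclical_encode(12, 24)

# 23:00 ends up close to 00:00, unlike with a raw 0-23 integer feature.
print(math.dist(late_evening, midnight), math.dist(noon, midnight))
```

The same function applies to second in minute, minute in hour, day of week & month in year by changing `period`.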

\[⚙️2]: ``H3`` or ``Geohash`` are geo-indexing schemes that let you navigate to nested & neighbor areas quickly. They map nearby *lat,lng* points to the same hash area, which helps the model transfer what it learns across an area.
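
To illustrate the nesting property, here is a small pure-Python sketch of the standard Geohash encoding (``H3`` is not shown). In practice an existing library would likely be used instead; the function name & default precision are illustrative choices.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat: float, lng: float, precision: int = 7) -> str:
    """Encode a lat,lng point by repeatedly halving the longitude & latitude ranges."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars = []
    bits, value, use_lng = 0, 0, True  # bits alternate lng, lat, lng, ...
    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                value, lng_lo = value * 2 + 1, mid
            else:
                value, lng_hi = value * 2, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value, lat_lo = value * 2 + 1, mid
            else:
                value, lat_hi = value * 2, mid
        use_lng = not use_lng
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


fine = geohash(57.64911, 10.40744, precision=6)
coarse = geohash(57.64911, 10.40744, precision=4)
print(fine, coarse)
```

Truncating a hash gives its parent area, so a stop can be described at several resolutions by taking prefixes of one hash.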

\[⚙️3]: Use of ``LSTM`` or ``Transformer`` architectures is recommended. Unlike classical machine learning algorithms like ``RandomForestRegressor`` or ``XGBRegressor``, these are deep learning sequence models (recurrent for ``LSTM``, attention-based for ``Transformer``) that leverage the sequential structure of the data. Care must be taken to model the data only in the forward direction to avoid *data leakage* from learning with future data that is unavailable at production time.

\[⚙️4]: Since data is plentiful, we recommend ``NearMiss`` down-sampling of the majority classes, which discards easy-to-predict examples. However, we can also employ ``SMOTE`` techniques to synthesize artificial datapoints for the minority classes.

\[⚙️5]: Scaling the collection of signals before learning puts all features on a comparable range, so that none dominates merely because of its units or magnitude.
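
As one possible form of scaling, the sketch below standardizes a feature column to zero mean & unit variance. Standardization is an assumed choice here (the text does not name a specific scaler), & the sample distances are made up for illustration.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_scaler(column: List[float]) -> Tuple[float, float]:
    """Learn the mean & standard deviation from the training data only."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        sigma = 1.0  # constant column: avoid dividing by zero
    return mu, sigma


def transform(column: List[float], mu: float, sigma: float) -> List[float]:
    """Apply previously fitted statistics to any dataset."""
    return [(x - mu) / sigma for x in column]


train_distance_m = [200.0, 400.0, 600.0]  # hypothetical distances between stops
mu, sigma = fit_scaler(train_distance_m)
scaled = transform(train_distance_m, mu, sigma)
print(scaled)
```

Fitting on the training set & reusing the same statistics for validation & test data keeps information from the evaluation sets out of the learning step.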

\[⚙️6]: Suitable techniques for anomaly detection include ``OneClassSVM``, ``IsolationForest`` or ``DBSCAN`` clustering, which involve learning what is common & flagging the outliers.

\[⚙️7]: This is important to avoid *data leakage*. We don't want the model to train on end-of-route data when asked to predict start-of-route arrival times. Likewise, we don't want the model to train on end-of-day data when asked to predict start-of-day arrival times. Both cases lead the model to "memorize" in the lab an answer that it will not have in production. So we must ensure that spatial or temporal splits of route data do not occur.

\[⚙️8]: Gap sampling is used as a technique to further guard against *data leakage*. By ignoring this in-between data, we ensure the evaluation sets are less temporally correlated to the learning sets.

\[⚙️9]: Care must be taken when choosing sampling period & gap sizes, as some combinations might entirely exclude certain days of the week from the evaluation set, creating a learning bias.
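
A quick way to check for this bias is to list which weekdays a given period & gap ever sample. The sketch below is illustrative (function name & start date are assumptions) & looks at all chunks together; the same check could be run on only the chunks assigned to the evaluation set.

```python
from datetime import date, timedelta
from typing import Set


def weekdays_sampled(start: date, period: int, gap: int, n_chunks: int) -> Set[int]:
    """Return the weekdays (0=Monday .. 6=Sunday) covered by the sampled chunks."""
    covered = set()
    for c in range(n_chunks):
        chunk_start = start + timedelta(days=c * (period + gap))
        for i in range(period):
            covered.add((chunk_start + timedelta(days=i)).weekday())
    return covered


start = date(2024, 1, 1)  # a Monday
biased = weekdays_sampled(start, period=3, gap=4, n_chunks=10)
balanced = weekdays_sampled(start, period=3, gap=1, n_chunks=10)
print(sorted(biased), sorted(balanced))
```

With a `3 day` period & `4 day` gap the stride is exactly one week, so every chunk lands on the same three weekdays, whereas a `1 day` gap gives a 4 day stride that eventually rotates through the whole week.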

\[⚙️10]: To properly learn arrival times for stops at the start & end of routes, we need to pad the input lists with an out-of-vocabulary id.

\[⚙️11]: Notice that we encode our dataset with stop-ids rather than the arrival time deviation. This is a `hashing-trick` to save memory & decouple the information vector from the reference of stops.

\[⚙️12]: All stops are assumed to be uniquely identifiable by `Route Number`, `Route Direction` & `Stop Number`. 

----