# System Design
This section proposes a real-time system to predict bus arrival times. We describe it in layman's terms for general audiences, and provide detailed sub-sections [[⚙️]](#) for technical audiences who want to learn more.


----
## Problem Framing

* regression problem
* implicitly supervised learning from historical data
* leverage sequential structure of routes [[⚙️3]](#)

<img src="../images/problem_framing.png" width="100%">

----
## Dataset Preparation


- Since this project deals with time series forecasting, we need to prepare our learning & 

* Windows train/validation/test split
* Route Status Padding
* Train Data Balancer

<img src="../images/dataset_sampling.png" width="100%">
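
A time-ordered split can be sketched as follows. This is only an illustration: the record layout (`timestamp`, `deviation_s`) and the 70/15/15 proportions are assumptions, not values taken from this project. The key point is that records are sorted by time and never shuffled, so validation and test data always come after the training data.

```python
from datetime import datetime

def chronological_split(records, train_frac=0.7, val_frac=0.15):
    """Split time-stamped records into train/validation/test without shuffling."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    n = len(ordered)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]

# Hypothetical arrival records: one observation per day over 20 days.
records = [
    {"timestamp": datetime(2023, 1, day), "deviation_s": day * 3}
    for day in range(1, 21)
]
train, val, test = chronological_split(records)
print(len(train), len(val), len(test))
```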

----
## Feature Engineering

### Temporal Signals
Anything that describes the moment in time at which the prediction needs to be made.
  - current & scheduled second in minute [[⚙️1]](#)
  - current & scheduled minute in hour [[⚙️1]](#)
  - current & scheduled time of day [[⚙️1]](#)
  - current & scheduled day of week [[⚙️1]](#)
  - current & scheduled month in year [[⚙️1]](#)
  - current holiday flag (ie: x-mas, Canada Day, Louis Riel Day, ..)

### Spatial Signals
Anything that describes the static geographic conditions of the stops.
  - stop location (ie: parent & neighbouring areas [[⚙️2]](#))
  - stop intersection type (ie: lights, stop-sign, mid-road, ..)
  - speed limits between stops
  - distance between stops
  - % road types between stops (ie: street, ave, freeway, ..)
  - road attributes between stops (ie: tunnels, bridges)
  - number of routes served by this stop
  - zone (ie: commercial, residential, industrial)
  - population density around stop (ie: within various 50-500m radii)
  - distance to special POIs (ie: fire station, hospital, school, train stations, ..)
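
As an illustration of the distance signals above, the great-circle (haversine) distance between two stops can be computed from their coordinates. This is a sketch under assumptions: it gives the straight-line distance, which is only a lower bound on the distance along the road, and the coordinates used are made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Two hypothetical stops a few blocks apart.
print(haversine_km(49.8951, -97.1384, 49.8990, -97.1370))
```

The same function could also serve for the "distance to special POIs" signal, given POI coordinates.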

### Contextual Signals
Anything that describes the dynamic state of the transit network.
  - next bus GPS coordinates
  - next bus passenger volume
  - passengers waiting at previous `n` stops along route
  - current route bus status at previous `n` stops (ie: early, delayed, on-time, pending)
  - previous route bus status at target stop (ie: early, delayed, on-time, pending)
  - weather conditions (ie: rain, snow, wind, fog, ..)
  - road closures (ie: construction, incidents, festivities, ..)
  - current traffic levels between stops (ie: low, medium, high)
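
The "status at previous `n` stops" signal needs a fixed-length window even near the start of a route, where fewer than `n` stops exist. One possible way to handle this, sketched below, is to left-pad the window. Using `pending` as the padding value is an assumption made for this example, as is the function name.

```python
def previous_stop_statuses(observed, stop_index, n, pad_value="pending"):
    """Return the statuses of the n stops before stop_index, left-padded to length n."""
    start = max(0, stop_index - n)
    window = list(observed[start:stop_index])
    # Pad on the left so the most recent stop is always the last element.
    return [pad_value] * (n - len(window)) + window

# Hypothetical statuses observed so far along one route.
observed = ["on-time", "delayed", "delayed", "early"]
print(previous_stop_statuses(observed, stop_index=1, n=3))
print(previous_stop_statuses(observed, stop_index=4, n=3))
```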


----
## System Architecture

The proposed high-level architecture is shown below. Whenever a bus arrival time prediction is requested for known stops in the transit network, the system must be given the conditions of the route window along with the contextual data described in the previous sections.

<img src="../images/system_arch.png" width="100%">

The first part of the architecture *transforms* inputs into values for the final *estimator* part to interpret & produce the time deviation result.

Since the data is heavily skewed, the transformer is also responsible for balancing (at learning time) the proportions of ``delayed``, ``early`` & ``on-time`` examples the estimator sees [[⚙️4]](#).

The estimator itself consists of training a couple of sub-models. The spatial sub-model learns to predict the arrival time given the status of the previous stops of the current route, while the temporal sub-model learns to predict the arrival time given the status of the previous routes for the given stop [[⚙️3]](#) [[⚙️5]](#). Since the data has several outlier spikes, we further propose an anomaly detection sub-model to help learn how these propagate through the transit network [[⚙️6]](#).

----
## Tuning & Evaluation

* hyperparameters (ie: spatial & temporal window sizes, feature encodings)
* probability calibration
* temporal k-fold validation
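
Temporal k-fold validation differs from ordinary k-fold in that each validation fold must come strictly after its training data. A minimal expanding-window sketch is shown below, assuming samples are already sorted chronologically. The function name and fold sizing are illustrative choices; scikit-learn's `TimeSeriesSplit` offers comparable behaviour.

```python
def expanding_window_splits(n_samples, n_folds):
    """Yield (train_indices, val_indices) pairs where validation always follows training."""
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = fold_size * k
        # The last fold absorbs any leftover samples.
        val_end = train_end + fold_size if k < n_folds else n_samples
        yield list(range(train_end)), list(range(train_end, val_end))

for train_idx, val_idx in expanding_window_splits(n_samples=10, n_folds=3):
    print(train_idx, val_idx)
```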

----
## System Integration

### Training Time
* Retrain the system whenever:
  - route or stop locations are added, removed or changed.
  - route schedules are updated
  - annually to account for behaviour changes (due to demographics, roads, POIs, price, ..)

### Inference Time
* use schedule times at the start of the route.
* trigger the system whenever a bus arrives at or departs from a stop.
* also trigger at periodic intervals, in case the bus gets unexpectedly stuck in traffic.
* on each trigger, predict the arrival time for the next stop & use that estimate to simulate predictions for all subsequent stops in the route.
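
The roll-forward step in the last bullet can be sketched as a loop that feeds each one-step prediction into the next. Everything named here is hypothetical: `predict_next_deviation` stands in for the trained estimator, `toy_model` is a placeholder that simply adds 10 seconds of delay per stop, and the stop fields are invented for the example.

```python
def simulate_route(current_deviation_s, remaining_stops, predict_next_deviation):
    """Chain one-step predictions down the remaining stops of a route."""
    estimates = []
    deviation = current_deviation_s
    for stop in remaining_stops:
        # The previous estimate becomes the input for the next stop.
        deviation = predict_next_deviation(stop, deviation)
        estimates.append((stop["stop_id"], stop["scheduled_s"] + deviation))
    return estimates

def toy_model(stop, previous_deviation_s):
    """Placeholder for the real estimator: delay grows by 10 s per stop."""
    return previous_deviation_s + 10

stops = [
    {"stop_id": "A", "scheduled_s": 100},
    {"stop_id": "B", "scheduled_s": 200},
    {"stop_id": "C", "scheduled_s": 300},
]
print(simulate_route(30, stops, toy_model))
```

Because errors compound along the chain, estimates for distant stops are expected to be less reliable than the estimate for the next stop.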

----
## Technical Details

\[⚙️1]: To avoid the curse of dimensionality from one-hot encoding categorical features that are cyclical in nature, we can instead trigonometrically encode these with only 2 features per signal (a sine and a cosine component).
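
A minimal sketch of this encoding is shown below. The function name is illustrative. Each value is mapped to a point on the unit circle, so that the end of a cycle (hour 23) lands next to its start (hour 0), which a plain integer or one-hot encoding does not capture.

```python
import math

def cyclical_encode(value, period):
    """Map a cyclical value (e.g. hour of day) to a (sin, cos) pair on the unit circle."""
    angle = 2 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)

hour_23 = cyclical_encode(23, 24)
hour_0 = cyclical_encode(0, 24)
hour_12 = cyclical_encode(12, 24)

# Hour 23 is much closer to hour 0 than hour 12 is.
print(math.dist(hour_23, hour_0), math.dist(hour_12, hour_0))
```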

\[⚙️2]: ``H3`` or ``Geohash`` are geo-indexing schemes that let you navigate to nested & neighbouring areas quickly. They map nearby *lat,lng* points to the same hash area, which helps the model transfer what it learns between points in the same area.
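
To make the nesting property concrete, here is a plain-Python sketch of the standard Geohash encoding. It is for illustration only; a production system would more likely rely on an existing Geohash or H3 library. A shorter hash is always a prefix of a longer one for the same point, so truncating a hash yields its parent area.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lng, precision=7):
    """Encode a lat/lng pair as a Geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars, bits, value = [], 0, 0
    use_lng = True  # bits alternate between longitude and latitude
    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            bit = 1 if lng >= mid else 0
            lng_lo, lng_hi = (mid, lng_hi) if bit else (lng_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = 1 if lat >= mid else 0
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = (value << 1) | bit
        use_lng = not use_lng
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)

# Hypothetical stop location; the 5-character hash is the parent of the 7-character one.
stop_hash = geohash_encode(49.8951, -97.1384, precision=7)
parent_hash = geohash_encode(49.8951, -97.1384, precision=5)
print(stop_hash, parent_hash)
```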

\[⚙️3]: Use of ``LSTM`` or ``Transformer`` architectures is recommended. Unlike classical machine learning algorithms like ``RandomForestRegressor`` or ``XGBRegressor``, these are deep learning sequence models (recurrent and attention-based, respectively) that leverage the sequential structure of the data. Care must be taken to model the data in the forward direction only, to avoid *data leakage* from learning with future data that is unavailable at production time.
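
For the attention-based option, one common way to enforce this forward-only constraint is a causal (lower-triangular) mask, where position `i` may only attend to positions `0..i`. The sketch below builds such a mask with plain lists. The helper names are illustrative, and a real model would pass an equivalent tensor to its attention layers.

```python
def causal_mask(seq_len):
    """Row i marks which positions j are visible from position i (only j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def visible_inputs(sequence, position):
    """Return the items of the sequence that a given position is allowed to see."""
    mask_row = causal_mask(len(sequence))[position]
    return [item for item, allowed in zip(sequence, mask_row) if allowed]

# Hypothetical statuses at four consecutive stops.
statuses = ["on-time", "delayed", "delayed", "early"]
print(visible_inputs(statuses, position=1))
```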

\[⚙️4]: Since data is plentiful, we recommend ``NearMiss`` down-sampling of the majority classes, which ignores easy-to-predict data. However, we can also employ ``SMOTE`` techniques to synthesize artificial datapoints for the minority classes.
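
The balancing idea can be illustrated with plain random down-sampling, shown below. Note that this is a simpler stand-in and not ``NearMiss`` itself: ``NearMiss`` chooses which majority examples to keep based on their distance to minority examples, whereas this sketch picks them at random. The `status` field and the class counts are assumptions for the example.

```python
import random
from collections import Counter, defaultdict

def undersample(examples, label_key="status", seed=0):
    """Randomly down-sample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[label_key]].append(example)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    return balanced

# Hypothetical skewed training set: mostly on-time arrivals.
examples = (
    [{"status": "on-time"}] * 80
    + [{"status": "delayed"}] * 15
    + [{"status": "early"}] * 5
)
balanced = undersample(examples)
print(Counter(example["status"] for example in balanced))
```

Balancing should only be applied to the training split, so that validation and test sets keep the real class proportions.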

\[⚙️5]: Scaling the collection of signals before learning ensures all features are weighted equally despite their units or magnitude ranges.
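
One common form of scaling is standardization (zero mean, unit variance), sketched below with the standard library only. The function names and sample values are illustrative. The statistics are fitted on training data and then reused unchanged on validation, test and production inputs.

```python
from statistics import mean, pstdev

def fit_scaler(train_column):
    """Compute the mean and standard deviation of a feature from training data."""
    mu = mean(train_column)
    sigma = pstdev(train_column)
    if sigma == 0:
        sigma = 1.0  # constant feature: avoid division by zero
    return mu, sigma

def scale(column, mu, sigma):
    """Apply previously fitted statistics to any split of the data."""
    return [(x - mu) / sigma for x in column]

# Hypothetical feature: distance between stops in metres.
train_distances = [100.0, 200.0, 300.0]
mu, sigma = fit_scaler(train_distances)
print(scale(train_distances, mu, sigma))
print(scale([250.0], mu, sigma))  # new data reuses the training statistics
```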

\[⚙️6]: Suitable techniques for anomaly detection include ``OneClassSVM``, ``IsolationForest`` or ``DBSCAN`` which in

----