# System Design
This section proposes a real-time system for predicting bus arrival times. It is described in layman's terms for a general audience, with detailed technical notes [[⚙️]](#) for readers who want to learn more.


----
## Problem Framing

* regression problem
* implicitly supervised learning from historical data
* leverage sequential structure of routes [[⚙️3]](#)

<img src="../images/problem_framing.png" width="100%">

----
## Dataset Preparation

* Windowed train/validation/test split
* Route Status Padding
* Train Data Balancer

<img src="../images/dataset_sampling.png" width="100%">

----
## Feature Engineering

### Temporal Signals
Anything that describes the moment in time at which the prediction needs to be made.
  - current & scheduled minute in hour [[⚙️1]](#)
  - current & scheduled time of day [[⚙️1]](#)
  - current & scheduled day of week [[⚙️1]](#)
  - current & scheduled month in year [[⚙️1]](#)
  - holiday flag (e.g. Christmas, Louis Riel Day, ..)

### Spatial Signals
Anything that describes the static geographic conditions of the stops.
  - stop location (i.e. parent & neighbouring areas [[⚙️2]](#))
  - stop intersection type (e.g. lights, stop-sign, mid-road, ..)
  - speed limits between stops
  - distance between stops
  - % road types between stops (e.g. street, ave, freeway, ..)
  - road attributes between stops (e.g. tunnels, bridges)
  - number of routes served by this stop
  - zone (e.g. commercial, residential, industrial)
  - population density around stop (e.g. within various radii from 50 to 500 m)
  - distance to special POIs (e.g. fire station, hospital, school, train stations, ..)

### Contextual Signals
Anything that describes the dynamic state of the transit network.
  - next bus GPS coordinates
  - next bus passenger volume
  - passengers waiting at previous `n` stops along route
  - bus status at previous `n` stops (e.g. early, delayed, on-time, pending)
  - weather conditions (e.g. rain, snow, wind, fog, ..)
  - road closures (e.g. construction, incidents, festivities, ..)
  - traffic levels (e.g. low, medium, high)


----
## Model Architecture

----
## Tuning & Evaluation


----
## System Integration


----
## Technical Details

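\[⚙️1]: One-hot encoding categorical features that are cyclical in nature (minute, time of day, day of week, month) adds one dimension per category and hides the fact that the last value neighbours the first. To avoid this curse of dimensionality, we can instead encode each signal trigonometrically as a sine/cosine pair, which needs only 2 features per signal.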
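
As a minimal sketch of the trigonometric encoding above (the helper name `cyclical_encode` and the sample hours are illustrative assumptions, not part of the proposed system), each cyclical value is mapped to a point on the unit circle:

```python
import math

def cyclical_encode(value, period):
    """Map a cyclical value (e.g. hour 0-23, period 24) to a (sin, cos) pair."""
    angle = 2 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)

# Illustrative values: hour of day with a period of 24.
late_evening = cyclical_encode(23, 24)
midnight = cyclical_encode(0, 24)
noon = cyclical_encode(12, 24)

# 23:00 sits right next to 00:00 on the circle, while 12:00 sits opposite.
print(math.dist(late_evening, midnight))
print(math.dist(midnight, noon))
```

The same helper would apply to minute in hour (period 60), day of week (period 7) and month in year (period 12), so every temporal signal costs 2 features regardless of how many distinct values it takes.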

\[⚙️2]: ``H3`` and ``Geohash`` are geo-indexing schemes that make it fast to look up nested (parent) and neighbouring areas. They map nearby *lat,lng* points to the same cell, which lets the model share what it learns across stops in the same area.
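
To illustrate the nesting property, here is a small pure-Python Geohash encoder. It is a sketch written for this note rather than the API of any particular library, and the stop coordinates are made up; a production system would more likely rely on an existing ``H3`` or ``Geohash`` package.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lng, precision=7):
    """Encode a lat,lng point as a Geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars, bits, value = [], 0, 0
    use_lng = True  # bits alternate between longitude and latitude
    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            bit = lng >= mid
            lng_lo, lng_hi = (mid, lng_hi) if bit else (lng_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = (value << 1) | int(bit)
        use_lng = not use_lng
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)

# Two hypothetical stops a few metres apart.
stop_a = geohash_encode(57.64911, 10.40744, precision=5)
stop_b = geohash_encode(57.64920, 10.40750, precision=5)

# Truncating a hash yields its parent (coarser) cell.
parent_of_a = stop_a[:4]
```

Because a shorter prefix denotes a larger enclosing cell, the hash can be fed to the model at several precisions, giving it both fine-grained and area-level location features.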

\[⚙️3]: Use of ``LSTM`` or ``Transformer`` architectures is recommended. Unlike classical machine-learning models such as ``RandomForestRegressor`` or ``XGBRegressor``, these are deep-learning sequence models (recurrent in the case of ``LSTM``, attention-based in the case of ``Transformer``) that leverage the sequential structure of the data. Care must be taken to model the sequence in the forward direction only, to avoid *data leakage* from training on future data that will be unavailable at production time.
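
One common way to enforce forward-only modelling in an attention-based model is a causal (lower-triangular) mask; for an ``LSTM`` the equivalent is simply to avoid a bidirectional configuration. The sketch below is framework-free and purely illustrative: the function names and the delay values are assumptions, not part of the proposed system.

```python
def causal_mask(n_steps):
    """mask[i][j] is True when step i is allowed to see step j (only j <= i)."""
    return [[j <= i for j in range(n_steps)] for i in range(n_steps)]

def visible_history(sequence, mask_row):
    """Return the elements of `sequence` that one step is allowed to see."""
    return [x for x, allowed in zip(sequence, mask_row) if allowed]

# Hypothetical delays (in seconds) observed at 4 consecutive stops.
delays = [30, 45, 60, 20]
mask = causal_mask(len(delays))

# When predicting at the third stop, only the first three delays are visible.
print(visible_history(delays, mask[2]))
```

Applying the same mask during training and inference keeps the model from relying on observations from stops the bus has not reached yet.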

----