# Hourly traffic volume prediction on Interstate 94

### Multivariate time series prediction with getML

In this tutorial, we demonstrate a time series application of getML. We predict the hourly traffic volume on I-94 westbound from Minneapolis-St Paul. 
We benchmark our results against [Facebook's Prophet](https://facebook.github.io/prophet/). getML's relational learning algorithms outperform Prophet's classical time series approach by ~15%.

Summary:

- Prediction type: __Regression model__
- Domain: __Transportation__
- Prediction target: __Hourly traffic volume__ 
- Source data: __Multivariate time series, 5 components__
- Population size: __24096__

_Author: Sören Nikolaus_

# Background

....

# Analysis

Let's get started with the analysis and set up the session:

In [1]:
import datetime
import os
import time

import numpy as np
import pandas as pd

from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline  

import getml

from utils import FTTimeSeriesBuilder, TSFreshBuilder

print(f"getML API version: {getml.__version__}\n")

getml.engine.set_project('interstate94')

getML API version: 0.16.0


Connected to project 'interstate94'


For various technical reasons, we want to keep our MyBinder notebook short. That is why we pre-store the features for Prophet and tsfresh. However, you are very welcome to try this at home and fully reproduce our results. To do so, just set the two constants to `True`.

## 1. Loading data

### 1.1 Download from source

We begin by downloading the data from the UC Irvine Machine Learning repository:

In [2]:
data = getml.datasets.load_interstate94(roles=True, units=True)
traffic_test, traffic_train, _ = data.values()
traffic = getml.data.concat("traffic", [traffic_train, traffic_test])

In [3]:
for data_frame in [traffic_test, traffic_train, traffic]:
    data_frame.set_role(data_frame.categorical_names, getml.data.roles.unused_string)

In [4]:
traffic

| Name | ds | join_key | Traffic_Volume | lower_window | upper_window | ds_day | holiday | hour | weekday | day | month | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Role | time_stamp | join_key | target | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
| Units | time stamp, comparison only | | | | | | | hour | weekday | day | month | year |
| 0 | 2016-01-01 | 1 | 1513 | -1 | 0 | 2016-01-01 00:00:00 | New Years Day | 0 | 4 | 1 | 1 | 2016 |
| 1 | 2016-01-01 01:00:00 | 1 | 1550 | -1 | 0 | 2016-01-01 00:00:00 | New Years Day | 1 | 4 | 1 | 1 | 2016 |
| 2 | 2016-01-01 02:00:00 | 1 | 993 | -1 | 0 | 2016-01-01 00:00:00 | New Years Day | 2 | 4 | 1 | 1 | 2016 |
| 3 | 2016-01-01 03:00:00 | 1 | 719 | -1 | 0 | 2016-01-01 00:00:00 | New Years Day | 3 | 4 | 1 | 1 | 2016 |
| 4 | 2016-01-01 04:00:00 | 1 | 533 | -1 | 0 | 2016-01-01 00:00:00 | New Years Day | 4 | 4 | 1 | 1 | 2016 |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24091 | 2018-09-30 19:00:00 | 1 | 3543 | | | 2018-09-30 00:00:00 | No holiday | 19 | 6 | 30 | 9 | 2018 |
| 24092 | 2018-09-30 20:00:00 | 1 | 2781 | | | 2018-09-30 00:00:00 | No holiday | 20 | 6 | 30 | 9 | 2018 |
| 24093 | 2018-09-30 21:00:00 | 1 | 2159 | | | 2018-09-30 00:00:00 | No holiday | 21 | 6 | 30 | 9 | 2018 |
| 24094 | 2018-09-30 22:00:00 | 1 | 1450 | | | 2018-09-30 00:00:00 | No holiday | 22 | 6 | 30 | 9 | 2018 |


### 1.2 Define relational model

To start with relational learning, we need to specify the data model. We manually replicate the appropriate time series structure by setting the join conditions that are specific to time series (`horizon`, `memory` and `allow_lagged_targets`). This is done abstractly using [Placeholders](https://docs.getml.com/latest/user_guide/data_model/data_model.html#placeholders).

The data model consists of two tables:
* __Population table__ `traffic_{test/train}`: holds the target and the time-based components that are available at prediction time
* __Peripheral table__ `traffic`: the same table as the population table
* The join between both placeholders specifies a `horizon`, which prevents leaks, and a `memory`, which keeps the computations feasible
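
To make the roles of `horizon` and `memory` concrete, the following minimal pandas sketch selects the peripheral rows that are visible when predicting for a single time stamp. It is an illustration only, not getML's implementation: the helper `visible_window` is hypothetical, and the treatment of the window boundaries (upper bound inclusive, lower bound exclusive) is an assumption.

```python
import pandas as pd


def visible_window(peripheral, t, horizon, memory, time_stamp="ds"):
    """Return the rows of `peripheral` a prediction for time `t` may use.

    Hypothetical helper: rows newer than `t - horizon` are hidden to avoid
    leaks, rows older than `t - horizon - memory` are dropped to bound the
    amount of work. The boundary handling is an assumption.
    """
    ts = peripheral[time_stamp]
    newest = t - horizon
    oldest = newest - memory
    return peripheral[(ts > oldest) & (ts <= newest)]


# Toy hourly series standing in for the traffic table.
toy = pd.DataFrame(
    {"ds": pd.date_range("2016-01-01", periods=72, freq=pd.Timedelta(hours=1))}
)
toy["Traffic_Volume"] = range(len(toy))

window = visible_window(
    toy,
    t=pd.Timestamp("2016-01-03 00:00:00"),
    horizon=pd.Timedelta(hours=1),
    memory=pd.Timedelta(hours=24),
)
```

With a horizon of one hour and a memory of 24 hours, the prediction for 2016-01-03 00:00 sees the 24 hourly rows of the previous day and nothing newer.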

In [5]:
population = getml.data.Placeholder('population')

peripheral = getml.data.Placeholder('peripheral')

# 1. The horizon is 1 hour (we predict the next hour). 
# 2. The memory is 24 hours, so we allow the algorithm to 
#    use information from up to one day back. 
# 3. We allow lagged targets. Thus the algorithm can 
#    identify autoregressive processes.

population.join(
    peripheral,
    join_key='join_key',
    time_stamp='ds',
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.hours(24),
    allow_lagged_targets=True
)

population

## 2. Predictive modeling

We loaded the data, defined the roles, units and the abstract data model. Next, we create a getML pipeline for relational learning.

### 2.1 Propositionalization with getML's FastProp
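
Propositionalization turns the rows that are visible through the join into a flat feature table by applying aggregations to them. The sketch below shows the idea with plain pandas on a toy hourly series. It is not FastProp itself, and the column names and the choice of aggregations are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy hourly series standing in for Traffic_Volume.
index = pd.date_range("2016-01-01", periods=72, freq=pd.Timedelta(hours=1))
volume = pd.Series(np.arange(72, dtype=float), index=index)

# Horizon of one hour: the row at time t may only see values up to t - 1h.
lagged = volume.shift(1)

# Memory of 24 hours: aggregate over the 24 most recent visible values.
features = pd.DataFrame(
    {
        "last_value": lagged,
        "mean_24h": lagged.rolling(24).mean(),
        "min_24h": lagged.rolling(24).min(),
        "max_24h": lagged.rolling(24).max(),
    }
)
```

The first 24 rows have no complete window, so their rolling aggregates are missing. Each later row summarizes the preceding day in a handful of numbers that a predictor such as XGBoost can consume.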

In [6]:
seasonal = getml.preprocessors.Seasonal()

fast_prop = getml.feature_learning.FastPropModel(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
)

__Build the pipeline__

In [7]:
pipe_fp_fl = getml.pipeline.Pipeline(
    preprocessors=[seasonal],
    feature_learners=[fast_prop],
    population=population,
    tags=["feature learning", "fastprop"],
)

pipe_fp_fl

In [8]:
pipe_fp_fl.check(traffic_train, [traffic])

Checking data model...
OK.


In [9]:
begin = time.time()

pipe_fp_fl.fit(traffic_train, [traffic])

fastprop_train = pipe_fp_fl.transform(
    traffic_train, [traffic], df_name="fastprop_train")

end = time.time()

fastprop_runtime = datetime.timedelta(seconds=end - begin)

Checking data model...
OK.

Preprocessing...

FastProp: Trying 337 features...

Trained pipeline.
Time taken: 0h:0m:36.652234


Preprocessing...

FastProp: Building features...



In [10]:
fastprop_test = pipe_fp_fl.transform(
    traffic_test, [traffic_test], df_name="fastprop_test")


Preprocessing...

FastProp: Building features...



In [11]:
predictor = getml.predictors.XGBoostRegressor()

pipe_fp_pr = getml.pipeline.Pipeline(
    tags=["prediction", "fastprop"], predictors=[predictor]
)

In [12]:
pipe_fp_pr.fit(fastprop_train)

Checking data model...
OK.

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:0m:7.177519



In [13]:
pipe_fp_pr.score(fastprop_test)




| | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2021-05-18 12:55:06 | fastprop_train | Traffic_Volume | 203.3306 | 295.1301 | 0.9775 |
| 1 | 2021-05-18 12:55:06 | fastprop_test | Traffic_Volume | 190.5532 | 282.126 | 0.9799 |
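
The three columns reported by `score` are standard regression metrics. As a rough sketch of what they measure, the snippet below computes them with NumPy on made-up numbers. It assumes that `rsquared` is the squared Pearson correlation between predictions and targets; check the getML documentation for the exact definition used by the engine. The helper `regression_scores` is hypothetical.

```python
import numpy as np


def regression_scores(y_true, y_pred):
    """Compute MAE, RMSE and squared correlation (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    return {
        "mae": float(np.mean(np.abs(errors))),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        # Assumption: R-squared as the squared Pearson correlation.
        "rsquared": float(np.corrcoef(y_true, y_pred)[0, 1] ** 2),
    }


# Made-up hourly volumes, not taken from the data set.
scores = regression_scores([1500, 1000, 700, 500], [1400, 1100, 700, 600])
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and reacts more strongly to a few large misses, which is why both are reported.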


### 2.2 Propositionalization with featuretools

In [14]:
# Columns that are not passed on to featuretools.
drop_columns = [
    "holiday", "hour", "day", "weekday", "month", "year",
    "ds_day", "lower_window", "upper_window",
]

dfs_pandas = {}

for df in [traffic_train, traffic_test, traffic]:
    df_pandas = df.to_pandas().drop(columns=drop_columns)
    df_pandas["join_key"] = 1
    dfs_pandas[df.name] = df_pandas

In [15]:
ft_builder = FTTimeSeriesBuilder(
    num_features=200,
    horizon=pd.Timedelta(hours=1),
    memory=pd.Timedelta(hours=24),
    column_id="join_key",
    time_stamp="ds",
    target="Traffic_Volume",
    allow_lagged_targets=True,
)

In [16]:
featuretools_train = ft_builder.fit(dfs_pandas["traffic_train"])
featuretools_test = ft_builder.transform(dfs_pandas["traffic_test"])

featuretools: Trying features...


  agg_primitives: ['all', 'any', 'entropy', 'num_true', 'percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.


Selecting the best out of 77 features...
Time taken: 0h:4m:6.73582



  agg_primitives: ['all', 'any', 'entropy', 'num_true', 'percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.


In [17]:
dfs_pandas["traffic_train"]

| | join_key | Traffic_Volume | ds |
|---|---|---|---|
| 0 | 1 | 1513.0 | 2016-01-01 00:00:00 |
| 1 | 1 | 1550.0 | 2016-01-01 01:00:00 |
| 2 | 1 | 993.0 | 2016-01-01 02:00:00 |
| 3 | 1 | 719.0 | 2016-01-01 03:00:00 |
| 4 | 1 | 533.0 | 2016-01-01 04:00:00 |
| ... | ... | ... | ... |
| 19271 | 1 | 1287.0 | 2018-03-13 23:00:00 |
| 19272 | 1 | 665.0 | 2018-03-14 00:00:00 |
| 19273 | 1 | 340.0 | 2018-03-14 01:00:00 |
| 19274 | 1 | 285.0 | 2018-03-14 02:00:00 |


In [18]:
roles = {
    getml.data.roles.join_key: ["join_key"],
    getml.data.roles.target: ["Traffic_Volume"],
    getml.data.roles.time_stamp: ["ds"],
}

df_featuretools_train = getml.data.DataFrame.from_pandas(
    featuretools_train, name="featuretools_train", roles=roles
)

df_featuretools_test = getml.data.DataFrame.from_pandas(
    featuretools_test, name="featuretools_test", roles=roles
)

In [19]:
df_featuretools_train.set_role(
    df_featuretools_train.unused_names, getml.data.roles.numerical
)

df_featuretools_test.set_role(
    df_featuretools_test.unused_names, getml.data.roles.numerical
)

In [20]:
predictor = getml.predictors.XGBoostRegressor()

pipe_ft_pr = getml.pipeline.Pipeline(
    tags=["prediction", "featuretools"], predictors=[predictor]
)

pipe_ft_pr

In [21]:
pipe_ft_pr.check(df_featuretools_train)

Checking data model...
OK.


In [22]:
pipe_ft_pr.fit(df_featuretools_train)

Checking data model...
OK.

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:0m:3.141464



In [23]:
pipe_ft_pr.score(df_featuretools_test)




| | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2021-05-18 13:00:36 | featuretools_train | Traffic_Volume | 250.4943 | 363.6149 | 0.9658 |
| 1 | 2021-05-18 13:00:36 | featuretools_test | Traffic_Volume | 241.0873 | 349.6071 | 0.9692 |


### 2.3 Propositionalization with tsfresh

tsfresh failed to run through due to an apparent bug in the tsfresh library and is therefore excluded from this analysis.

### 2.4 Comparison

In [24]:
num_features = dict(
    fastprop=607,
    featuretools=77,
)

runtime_per_feature = [
    fastprop_runtime / num_features['fastprop'],
    ft_builder.runtime / num_features['featuretools'],
]

features_per_second = [1.0/r.total_seconds() for r in runtime_per_feature]

speedup_per_feature = [r/runtime_per_feature[0] for r in runtime_per_feature]

comparison = pd.DataFrame(
    dict(
        runtime=[fastprop_runtime, ft_builder.runtime],
        num_features=list(num_features.values()),
        features_per_second=features_per_second,
        speedup=[1, ft_builder.runtime/fastprop_runtime],
        speedup_per_feature=speedup_per_feature,
        rsquared=[pipe_fp_pr.rsquared, pipe_ft_pr.rsquared],
        rmse=[pipe_fp_pr.rmse, pipe_ft_pr.rmse],
        mae=[pipe_fp_pr.mae, pipe_ft_pr.mae],
    )
)

comparison.index = ["getML: FastProp", "featuretools"]

In [25]:
comparison

| | runtime | num_features | features_per_second | speedup | speedup_per_feature | rsquared | rmse | mae |
|---|---|---|---|---|---|---|---|---|
| getML: FastProp | 0 days 00:00:55.543907 | 607 | 10.928245 | 1.0 | 1.0 | 0.979916 | 282.126003 | 190.553229 |
| featuretools | 0 days 00:04:06.735820 | 77 | 0.312075 | 4.442176 | 35.018043 | 0.969201 | 349.607098 | 241.087284 |
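
The derived columns follow directly from the runtimes and feature counts. The short check below redoes the arithmetic from the values printed in the table, using `datetime.timedelta` as in the comparison cell. The variable names are chosen for illustration only.

```python
import datetime

# Runtimes and feature counts copied from the comparison table.
fastprop_runtime = datetime.timedelta(seconds=55.543907)
featuretools_runtime = datetime.timedelta(seconds=246.735820)
num_fastprop, num_featuretools = 607, 77

# Time spent per feature (timedelta division rounds to microseconds).
fastprop_per_feature = fastprop_runtime / num_fastprop
featuretools_per_feature = featuretools_runtime / num_featuretools

# Features generated per second by FastProp.
features_per_second = 1.0 / fastprop_per_feature.total_seconds()

# How much longer featuretools takes, in total and per feature.
speedup = featuretools_runtime / fastprop_runtime
speedup_per_feature = featuretools_per_feature / fastprop_per_feature

print(round(features_per_second, 2))  # 10.93
print(round(speedup, 2))              # 4.44
print(round(speedup_per_feature, 2))  # 35.02
```

Note that the `speedup` columns are expressed relative to FastProp, which is why its own row reads 1.0.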


In [26]:
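# Make sure the output directory exists before writing the results.
os.makedirs("comparisons", exist_ok=True)
comparison.to_csv("comparisons/interstate94.csv")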

## 3. Conclusion

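With the same XGBoost predictor, the features generated by getML's FastProp yield a more accurate model than the featuretools features: the test R-squared is 0.980 compared to 0.969, and the test RMSE is about 282 compared to about 350. FastProp is also faster. It finishes in under a minute, whereas featuretools needs about four minutes, which corresponds to a speedup of about 4.4 overall and about 35 per feature. tsfresh could not be evaluated.

You are encouraged to reproduce these results. To do so, you need getML (https://getml.com/product), which you can download for free.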

# Next Steps

This tutorial showcased another time series application of getML and benchmarked getML against popular time series libraries.

If you are interested in further real-world applications of getML, head back to the [notebook overview](welcome.md) and choose one of the remaining examples.

Here is some additional material from our [documentation](https://docs.getml.com/latest/) if you want to learn more about getML:
* [Feature learning with Multirel](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#multirel)
* [Feature learning with Relboost](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relboost)

# Get in contact

If you have any questions, schedule a [call with Alex](https://go.getml.com/meetings/alexander-uhlig/getml-demo), the co-founder of getML, or write us an [email](mailto:team@getml.com). Prefer a private demo of getML? Just contact us to make an appointment.