# Propositionalization: Interstate 94

In this notebook, we compare getML's FastProp against the well-known feature engineering libraries featuretools and tsfresh.

Summary:

- Prediction type: __Regression model__
- Domain: __Transportation__
- Prediction target: __Hourly traffic volume__ 
- Source data: __Multivariate time series, 5 components__
- Population size: __24096__

_Author: Sören Nikolaus_

# Background

A common approach to feature engineering is to generate attribute-value representations from relational data by applying a fixed set of aggregations to columns of interest and then performing feature selection on the (possibly large) set of generated features. In academia, this approach is called _propositionalization_.
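To make the idea concrete, here is a minimal sketch of propositionalization in plain Python. The `purchases` table, its column names, and the `propositionalize` helper are hypothetical and exist only for illustration; they are not part of getML, featuretools, or tsfresh.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical peripheral table: many rows per key (one-to-many relation).
purchases = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]

# The fixed set of aggregations applied to every column of interest.
AGGREGATIONS = {"count": len, "sum": sum, "avg": mean, "min": min, "max": max}


def propositionalize(rows, key, column):
    """Flatten a one-to-many relation into one feature row per key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[column])
    return {
        k: {f"{name}_{column}": agg(values) for name, agg in AGGREGATIONS.items()}
        for k, values in grouped.items()
    }


features = propositionalize(purchases, key="customer_id", column="amount")
print(features[1])
```

Each key ends up with one flat row of candidate features. A feature selection step would then keep only the most useful ones.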

getML's [FastProp](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#fastprop) is an implementation of this propositionalization approach that has been optimized for speed and memory efficiency. In this notebook, we want to demonstrate how – well – fast FastProp is. To this end, we will benchmark FastProp against the popular feature engineering libraries [featuretools](https://www.featuretools.com/) and [tsfresh](https://tsfresh.readthedocs.io/en/latest/). Both of these libraries use propositionalization approaches for feature engineering.

In this notebook, we predict the hourly traffic volume on I-94 westbound from Minneapolis-St Paul. The analysis is built on top of a dataset provided by the [MN Department of Transportation](https://www.dot.state.mn.us), with some data preparation done by [John Hogue](https://github.com/dreyco676/Anomaly_Detection_A_to_Z/). For further details about the data set, refer to [the full notebook](../interstate94.ipynb).

# Analysis

Let's get started with the analysis and set up the session:

In [1]:
import datetime
import os
import time

import numpy as np
import pandas as pd

from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline  

import getml

from utils import FTTimeSeriesBuilder, TSFreshBuilder

print(f"getML API version: {getml.__version__}\n")

getml.engine.set_project('interstate94')

getML API version: 1.0.0




Connected to project 'interstate94'


## 1. Loading data

### 1.1 Download from source

We begin by downloading the data from the UC Irvine Machine Learning repository:

In [2]:
traffic = getml.datasets.load_interstate94(roles=True, units=True)


Loading traffic...


In [3]:
traffic.set_role(traffic.roles.categorical, getml.data.roles.unused_string)

In [4]:
traffic

| name | ds | traffic_volume | holiday | day | month | weekday | hour | year |
|---|---|---|---|---|---|---|---|---|
| role | time_stamp | target | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
| unit | time stamp | | | day | month | weekday | hour | year |
| 0 | 2016-01-01 | 1513 | New Years Day | 1 | 1 | 4 | 0 | 2016 |
| 1 | 2016-01-01 01:00:00 | 1550 | New Years Day | 1 | 1 | 4 | 1 | 2016 |
| 2 | 2016-01-01 02:00:00 | 993 | New Years Day | 1 | 1 | 4 | 2 | 2016 |
| 3 | 2016-01-01 03:00:00 | 719 | New Years Day | 1 | 1 | 4 | 3 | 2016 |
| 4 | 2016-01-01 04:00:00 | 533 | New Years Day | 1 | 1 | 4 | 4 | 2016 |
| | ... | ... | ... | ... | ... | ... | ... | ... |
| 24091 | 2018-09-30 19:00:00 | 3543 | No holiday | 30 | 9 | 6 | 19 | 2018 |
| 24092 | 2018-09-30 20:00:00 | 2781 | No holiday | 30 | 9 | 6 | 20 | 2018 |
| 24093 | 2018-09-30 21:00:00 | 2159 | No holiday | 30 | 9 | 6 | 21 | 2018 |
| 24094 | 2018-09-30 22:00:00 | 1450 | No holiday | 30 | 9 | 6 | 22 | 2018 |


### 1.2 Define relational model


In [5]:
split = getml.data.split.time(traffic, "ds", test=getml.data.time.datetime(2018, 3, 15))
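As a rough picture of what a time-based split does, the sketch below rebuilds the train/test partition in plain Python. It assumes the series has exactly one row per hour with no gaps and that rows at or after the `test` cutoff go to the test set; both are assumptions for illustration, not a description of getML's implementation.

```python
from datetime import datetime, timedelta

# Synthetic stand-in for the `ds` column: one stamp per hour, no gaps (assumption).
start = datetime(2016, 1, 1)
stamps = [start + timedelta(hours=i) for i in range(24096)]

# Same cutoff as in the call above.
test_start = datetime(2018, 3, 15)

# Everything strictly before the cutoff is training data, the rest is test data.
train = [ts for ts in stamps if ts < test_start]
test = [ts for ts in stamps if ts >= test_start]

print(len(train), len(test))
```

Splitting by time rather than at random keeps future observations out of the training set, which matters for any forecasting task.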

In [6]:
time_series = getml.data.TimeSeries(
    population=traffic,
    split=split,
    alias="traffic",
    time_stamps='ds',
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.hours(24),
    lagged_targets=True,
)

time_series

| | data frames | staging table |
|---|---|---|
| 0 | traffic | TRAFFIC__STAGING_TABLE_1 |
| 1 | traffic | TRAFFIC__STAGING_TABLE_2 |

| | subset | name | rows | type |
|---|---|---|---|---|
| 0 | test | traffic | 4800 | View |
| 1 | train | traffic | 19296 | View |

| | name | rows | type |
|---|---|---|---|
| 0 | traffic | 24096 | DataFrame |
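The `horizon` and `memory` arguments determine which past rows may be used when predicting a given hour. The sketch below illustrates that window in plain Python. The helper `visible_window` is hypothetical, and the exact boundary handling (newest row inclusive, oldest row exclusive) is an assumption rather than a statement about getML's internals.

```python
from datetime import datetime, timedelta

horizon = timedelta(hours=1)   # how far ahead we predict
memory = timedelta(hours=24)   # how far back we are allowed to look


def visible_window(now, stamps):
    """Return the past time stamps usable for a prediction made for `now`."""
    newest = now - horizon           # the last `horizon` is not yet known
    oldest = now - horizon - memory  # anything older is forgotten
    return [ts for ts in stamps if oldest < ts <= newest]


# Three days of synthetic hourly stamps.
start = datetime(2016, 1, 1)
stamps = [start + timedelta(hours=i) for i in range(72)]

window = visible_window(datetime(2016, 1, 3), stamps)
print(len(window), min(window), max(window))
```

With hourly data, a 24-hour memory, and a 1-hour horizon, each prediction sees the 24 observations ending one hour before the target hour.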


## 2. Predictive modeling

We loaded the data, defined the roles, units and the abstract data model. Next, we create a getML pipeline for relational learning.

### 2.1 Propositionalization with getML's FastProp

In [7]:
seasonal = getml.preprocessors.Seasonal()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
)
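Conceptually, a seasonal preprocessor derives calendar components from a time stamp so that downstream learners can pick up daily or weekly patterns. The following sketch shows the idea with the standard library. The helper `seasonal_components` and the chosen set of components are illustrative assumptions, not the actual output of `getml.preprocessors.Seasonal`.

```python
from datetime import datetime


def seasonal_components(ts):
    """Derive calendar features from a single time stamp."""
    return {
        "year": ts.year,
        "month": ts.month,
        "weekday": ts.weekday(),  # Monday == 0 ... Sunday == 6
        "hour": ts.hour,
    }


print(seasonal_components(datetime(2016, 1, 1, 1)))
```

For the first rows of the traffic table, this yields the same `weekday` and `hour` values as the corresponding columns shipped with the data set.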

__Build the pipeline__

In [8]:
pipe_fp_fl = getml.pipeline.Pipeline(
    preprocessors=[seasonal],
    feature_learners=[fast_prop],
    data_model=time_series.data_model,
    tags=["feature learning", "fastprop"],
)

pipe_fp_fl

In [9]:
pipe_fp_fl.check(time_series.train)

Checking data model...


Staging...

Preprocessing...

Checking...


OK.


In [10]:
begin = time.time()

pipe_fp_fl.fit(time_series.train)

fastprop_train = pipe_fp_fl.transform(time_series.train, df_name="fastprop_train")

end = time.time()

fastprop_runtime = datetime.timedelta(seconds=end - begin)

Checking data model...


Staging...


OK.


Staging...

Preprocessing...

FastProp: Trying 461 features...


Trained pipeline.
Time taken: 0h:0m:18.427982



Staging...

Preprocessing...

FastProp: Building features...




In [11]:
fastprop_test = pipe_fp_fl.transform(time_series.test, df_name="fastprop_test")



Staging...

Preprocessing...

FastProp: Building features...




In [12]:
predictor = getml.predictors.XGBoostRegressor()

pipe_fp_pr = getml.pipeline.Pipeline(
    tags=["prediction", "fastprop"], predictors=[predictor]
)

In [13]:
pipe_fp_pr.fit(fastprop_train)

Checking data model...


Staging...

Checking...


OK.


Staging...

XGBoost: Training as predictor...


Trained pipeline.
Time taken: 0h:0m:11.804915



In [14]:
pipe_fp_pr.score(fastprop_test)



Staging...




| | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2021-08-24 16:32:39 | fastprop_train | traffic_volume | 200.374 | 292.5448 | 0.9779 |
| 1 | 2021-08-24 16:32:39 | fastprop_test | traffic_volume | 185.2662 | 266.3327 | 0.9823 |
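For readers less familiar with the three reported metrics, here is a small sketch of how they can be computed. The toy values are made up. `rsquared` is computed here as the squared Pearson correlation between targets and predictions, which is an assumption about the reported metric; another common definition is 1 - SS_res / SS_tot.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rsquared(y_true, y_pred):
    """Squared Pearson correlation between targets and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)


# Made-up hourly volumes and predictions, for illustration only.
y_true = [1500.0, 1000.0, 700.0, 500.0]
y_pred = [1400.0, 1100.0, 700.0, 600.0]

print(mae(y_true, y_pred), rmse(y_true, y_pred), rsquared(y_true, y_pred))
```

MAE and RMSE are in the unit of the target (vehicles per hour), so lower is better, while R² lies between 0 and 1 and higher is better.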


### 2.2 Propositionalization with featuretools

In [15]:
traffic_train = time_series.train.population
traffic_test = time_series.test.population

In [16]:
dfs_pandas = {}

for df in [traffic_train, traffic_test, traffic]:
    dfs_pandas[df.name] = df.drop(df.roles.unused).to_pandas()
    dfs_pandas[df.name]["join_key"] = 1

In [17]:
ft_builder = FTTimeSeriesBuilder(
    num_features=200,
    horizon=pd.Timedelta(hours=1),
    memory=pd.Timedelta(hours=24),
    column_id="join_key",
    time_stamp="ds",
    target="traffic_volume",
    allow_lagged_targets=True,
)

In [18]:
featuretools_train = ft_builder.fit(dfs_pandas["train"])
featuretools_test = ft_builder.transform(dfs_pandas["test"])

featuretools: Trying features...


  agg_primitives: ['all', 'any', 'entropy', 'num_true', 'percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.


Selecting the best out of 77 features...
Time taken: 0h:4m:38.852938



  agg_primitives: ['all', 'any', 'entropy', 'num_true', 'percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.


In [19]:
dfs_pandas["train"]

| | ds | traffic_volume | join_key |
|---|---|---|---|
| 0 | 2016-01-01 00:00:00 | 1513.0 | 1 |
| 1 | 2016-01-01 01:00:00 | 1550.0 | 1 |
| 2 | 2016-01-01 02:00:00 | 993.0 | 1 |
| 3 | 2016-01-01 03:00:00 | 719.0 | 1 |
| 4 | 2016-01-01 04:00:00 | 533.0 | 1 |
| ... | ... | ... | ... |
| 19291 | 2018-03-14 19:00:00 | 3426.0 | 1 |
| 19292 | 2018-03-14 20:00:00 | 3115.0 | 1 |
| 19293 | 2018-03-14 21:00:00 | 2507.0 | 1 |
| 19294 | 2018-03-14 22:00:00 | 1810.0 | 1 |


In [20]:
roles={
    getml.data.roles.join_key: ["join_key"],
    getml.data.roles.target: ["traffic_volume"],
    getml.data.roles.time_stamp: ["ds"],
}

df_featuretools_train = getml.data.DataFrame.from_pandas(
    featuretools_train, name="featuretools_train", roles=roles
)

df_featuretools_test = getml.data.DataFrame.from_pandas(
    featuretools_test, name="featuretools_test", roles=roles
)

In [21]:
df_featuretools_train.set_role(
    df_featuretools_train.roles.unused, getml.data.roles.numerical
)

df_featuretools_test.set_role(
    df_featuretools_test.roles.unused, getml.data.roles.numerical
)

In [22]:
predictor = getml.predictors.XGBoostRegressor()

pipe_ft_pr = getml.pipeline.Pipeline(
    tags=["prediction", "featuretools"], predictors=[predictor]
)

pipe_ft_pr

In [23]:
pipe_ft_pr.check(df_featuretools_train)

Checking data model...


Staging...

Checking...


OK.


In [24]:
pipe_ft_pr.fit(df_featuretools_train)

Checking data model...


Staging...


OK.


Staging...

XGBoost: Training as predictor...


Trained pipeline.
Time taken: 0h:0m:4.583803



In [25]:
pipe_ft_pr.score(df_featuretools_test)



Staging...




| | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2021-08-24 16:38:36 | featuretools_train | traffic_volume | 248.089 | 361.5149 | 0.9662 |
| 1 | 2021-08-24 16:38:36 | featuretools_test | traffic_volume | 238.8621 | 353.285 | 0.9687 |


### 2.3 Propositionalization with tsfresh

tsfresh failed to run through due to an apparent bug in the tsfresh library and is therefore excluded from this analysis.

### 2.4 Comparison

In [26]:
num_features = dict(
    fastprop=607,
    featuretools=77,
)

runtime_per_feature = [
    fastprop_runtime / num_features['fastprop'],
    ft_builder.runtime / num_features['featuretools'],
]

features_per_second = [1.0/r.total_seconds() for r in runtime_per_feature]

speedup_per_feature = [r/runtime_per_feature[0] for r in runtime_per_feature]

comparison = pd.DataFrame(
    dict(
        runtime=[fastprop_runtime, ft_builder.runtime],
        num_features=num_features.values(),
        features_per_second=features_per_second,
        speedup=[1, ft_builder.runtime/fastprop_runtime],
        speedup_per_feature=speedup_per_feature,
        rsquared=[pipe_fp_pr.rsquared, pipe_ft_pr.rsquared],
        rmse=[pipe_fp_pr.rmse, pipe_ft_pr.rmse],
        mae=[pipe_fp_pr.mae, pipe_ft_pr.mae],
    )
)

comparison.index = ["getML: FastProp", "featuretools"]

In [27]:
comparison

| | runtime | num_features | features_per_second | speedup | speedup_per_feature | rsquared | rmse | mae |
|---|---|---|---|---|---|---|---|---|
| getML: FastProp | 0 days 00:00:26.578271 | 607 | 22.83835 | 1.0 | 1.0 | 0.982251 | 266.332662 | 185.266245 |
| featuretools | 0 days 00:04:38.852938 | 77 | 0.276131 | 10.491764 | 82.708331 | 0.968665 | 353.284989 | 238.862052 |
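The derived columns can be checked by hand from the runtimes and feature counts in the table. The sketch below redoes that arithmetic with the standard library only. The trailing digits may differ slightly from the table, since it works with raw seconds rather than with intermediate `timedelta` values.

```python
from datetime import timedelta

# Runtimes and feature counts copied from the comparison table.
runtime = {
    "fastprop": timedelta(seconds=26.578271),
    "featuretools": timedelta(minutes=4, seconds=38.852938),
}
num_features = {"fastprop": 607, "featuretools": 77}

seconds = {name: rt.total_seconds() for name, rt in runtime.items()}

# Throughput: how many features each library produces per second.
features_per_second = {name: num_features[name] / seconds[name] for name in seconds}

# Overall speedup: ratio of total runtimes.
speedup = seconds["featuretools"] / seconds["fastprop"]

# Per-feature speedup: ratio of runtimes normalized by the number of features.
speedup_per_feature = (seconds["featuretools"] / num_features["featuretools"]) / (
    seconds["fastprop"] / num_features["fastprop"]
)

print(features_per_second, speedup, speedup_per_feature)
```

The per-feature speedup is much larger than the overall speedup because FastProp produces roughly eight times as many features in a tenth of the time.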


In [28]:
comparison.to_csv("comparisons/interstate94.csv")

### Why is FastProp so fast?

First, FastProp benefits greatly from getML's custom-built, C++-native in-memory database engine. The engine is highly optimized for relational data structures and uses information about the relational structure of the data to store it and carry out computations on it efficiently. This matters in particular for time series, where we [relate the current observation to a certain number of observations from the past](https://docs.getml.com/latest/user_guide/data_model/data_model.html#time-series): other libraries have to deal explicitly with this inherent structure of (multivariate) time series, and such explicit transformations are costly in terms of both memory and computation.

In addition, all operations on data stored in getML's engine benefit from implementations in modern C++. We also take advantage of functional design patterns in which all column-based operations are evaluated lazily. For example, aggregations are carried out only on rows that matter (taking into account even complex conditions that might span multiple tables in the relational model). Duplicate operations are reduced to a bare minimum by keeping track of the relational data model.

Beyond raw speed, FastProp also has an edge in memory consumption. This follows from the abstract database design, the reliance on efficient storage patterns (pointers and indices) for concrete data, and the use of functional design patterns and lazy computation. Together, these allow working with data sets of substantial size without falling back to distributed computing models.
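The benefit of lazy evaluation can be illustrated with Python generators. This is only an analogy for the general principle. It does not reflect how getML's C++ engine is implemented, and `traced_square` is a hypothetical stand-in for an expensive column operation.

```python
from itertools import islice

evaluated = []  # records which inputs were actually processed


def traced_square(x):
    """Stand-in for an expensive column operation."""
    evaluated.append(x)
    return x * x


volumes = range(1_000_000)

# Building the pipeline does no work yet: the filter and the operation
# are only applied when a result is requested.
lazy = (traced_square(v) for v in volumes if v % 2 == 0)

# Only the rows needed for the requested result are ever touched.
first_three = list(islice(lazy, 3))

print(first_three, evaluated)
```

Although the input has a million rows, the operation runs on just three of them, because only rows that pass the condition and are actually requested get evaluated.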

## 3. Conclusion

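On this data set, FastProp took about 27 seconds for 607 features, while featuretools needed about 4 minutes 39 seconds for 77 features. That is roughly a 10x speedup overall and an 83x speedup per feature, and the FastProp features also yield the better fit on the test set (R² of 0.982 vs. 0.969).

You are encouraged to reproduce these results. You will need getML (https://getml.com/product) to do so. You can download it for free.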

# Next Steps

This tutorial showcased another time series application of getML and benchmarked getML against popular time series libraries.

If you are interested in further real-world applications of getML, head back to the [notebook overview](welcome.md) and choose one of the remaining examples.

Here is some additional material from our [documentation](https://docs.getml.com/latest/) if you want to learn more about getML:
* [Feature learning with Multirel](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#multirel)
* [Feature learning with Relboost](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relboost)

# Get in contact

If you have any questions, schedule a [call with Alex](https://go.getml.com/meetings/alexander-uhlig/getml-demo), the co-founder of getML, or write us an [email](mailto:team@getml.com). Prefer a private demo of getML? Just contact us to make an appointment.