# Propositionalization: Interstate 94

In this notebook, we compare getML's FastProp against the well-known feature engineering libraries featuretools and tsfresh.

Summary:

- Prediction type: __Regression model__
- Domain: __Transportation__
- Prediction target: __Hourly traffic volume__ 
- Source data: __Multivariate time series, 5 components__
- Population size: __24096__

<a target="_blank" href="https://colab.research.google.com/github/getml/getml-demo/blob/master/fastprop_benchmark/interstate94_prop.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Background

A common approach to feature engineering is to generate attribute-value representations from relational data by applying a fixed set of aggregations to columns of interest and then performing feature selection on the (possibly large) set of generated features. In academia, this approach is called _propositionalization_.
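The idea can be illustrated with a minimal sketch in plain Python. This is not how FastProp, featuretools or tsfresh are implemented; the toy readings, the `AGGREGATIONS` dictionary and the `propositionalize` helper are hypothetical names chosen for illustration, and the subsequent feature selection step is left out.

```python
from statistics import mean

# Hypothetical toy data: (hour, traffic_volume) readings, oldest first.
readings = [(0, 1513), (1, 1550), (2, 993), (3, 719), (4, 533)]

# A fixed set of aggregations that is applied to every window of past values.
AGGREGATIONS = {"min": min, "max": max, "mean": mean, "count": len, "sum": sum}


def propositionalize(rows, window):
    """Build one flat feature row per reading from up to `window` preceding values."""
    table = []
    for i, (hour, _) in enumerate(rows):
        # Only values strictly before the current row are used.
        past = [value for _, value in rows[max(0, i - window):i]]
        if not past:
            continue  # no history yet, nothing to aggregate
        features = {
            f"{name}_last_{window}": agg(past) for name, agg in AGGREGATIONS.items()
        }
        table.append({"hour": hour, **features})
    return table


feature_table = propositionalize(readings, window=3)
for row in feature_table:
    print(row)
```

Each row of `feature_table` is an attribute-value representation of the recent history, which is the kind of flat table a standard predictor can be trained on.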

getML's [FastProp](https://getml.com/latest/user_guide/concepts/feature_engineering/#feature-engineering-algorithms-fastprop) is an implementation of this propositionalization approach that has been optimized for speed and memory efficiency. In this notebook, we want to demonstrate how – well – fast FastProp is. To this end, we will benchmark FastProp against the popular feature engineering libraries [featuretools](https://www.featuretools.com/) and [tsfresh](https://tsfresh.readthedocs.io/en/latest/). Both of these libraries use propositionalization approaches for feature engineering.

In this notebook, we predict the hourly traffic volume on I-94 westbound from Minneapolis-St. Paul. The analysis is built on top of a dataset provided by the [MN Department of Transportation](https://www.dot.state.mn.us), with some data preparation done by [John Hogue](https://github.com/dreyco676/Anomaly_Detection_A_to_Z/). For further details about the dataset, refer to [the full notebook](https://getml.com/latest/examples/enterprise-notebooks/interstate94).

## Analysis

1. [Loading data](#1.-Loading-data)
2. [Predictive modeling](#2.-Predictive-modeling)
3. [Comparison](#3.-Comparison)

Let's get started with the analysis and set up the session:

In [1]:
%pip install -q "getml==1.5.0" "featuretools==1.31.0"

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import sys

os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
from pathlib import Path

import pandas as pd
import getml

print(f"getML API version: {getml.__version__}\n")

getML API version: 1.5.0



In [3]:
getml.engine.launch(allow_remote_ips=True, token="token")
getml.engine.set_project("interstate94")

getML Engine is already running.


In [4]:
# If we are in Colab, we need to fetch the utils folder from the repository
if os.getenv("COLAB_RELEASE_TAG"):
    !curl -L https://api.github.com/repos/getml/getml-demo/tarball/master | tar --wildcards --strip-components=1 -xz '*utils*'

In [5]:
parent = Path(os.getcwd()).parent.as_posix()

if parent not in sys.path:
    sys.path.append(parent)

from utils import Benchmark, FTTimeSeriesBuilder

### 1. Loading data

#### 1.1 Download from source

We begin by downloading the data from the UC Irvine Machine Learning repository:

In [6]:
traffic = getml.datasets.load_interstate94(roles=True, units=True)

Downloading traffic... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 1.2/1.2 MB • 00:00

In [7]:
traffic.set_role(traffic.roles.categorical, getml.data.roles.unused_string)

In [8]:
traffic

| name | ds | traffic_volume | holiday | day | month | weekday | hour | year |
|---|---|---|---|---|---|---|---|---|
| role | time_stamp | target | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
| unit | time stamp, comparison only | | | day | month | weekday | hour | year |
| 0 | 2016-01-01 | 1513 | New Years Day | 1 | 1 | 4 | 0 | 2016 |
| 1 | 2016-01-01 01:00:00 | 1550 | New Years Day | 1 | 1 | 4 | 1 | 2016 |
| 2 | 2016-01-01 02:00:00 | 993 | New Years Day | 1 | 1 | 4 | 2 | 2016 |
| 3 | 2016-01-01 03:00:00 | 719 | New Years Day | 1 | 1 | 4 | 3 | 2016 |
| 4 | 2016-01-01 04:00:00 | 533 | New Years Day | 1 | 1 | 4 | 4 | 2016 |
| | ... | ... | ... | ... | ... | ... | ... | ... |
| 24091 | 2018-09-30 19:00:00 | 3543 | No holiday | 30 | 9 | 6 | 19 | 2018 |
| 24092 | 2018-09-30 20:00:00 | 2781 | No holiday | 30 | 9 | 6 | 20 | 2018 |
| 24093 | 2018-09-30 21:00:00 | 2159 | No holiday | 30 | 9 | 6 | 21 | 2018 |
| 24094 | 2018-09-30 22:00:00 | 1450 | No holiday | 30 | 9 | 6 | 22 | 2018 |


#### 1.2 Define relational model


In [9]:
split = getml.data.split.time(traffic, "ds", test=getml.data.time.datetime(2018, 3, 15))

In [10]:
time_series = getml.data.TimeSeries(
    population=traffic,
    split=split,
    alias="traffic",
    time_stamps="ds",
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.hours(24),
    lagged_targets=True,
)

time_series

| | data frames | staging table |
|---|---|---|
| 0 | traffic | TRAFFIC__STAGING_TABLE_1 |
| 1 | traffic | TRAFFIC__STAGING_TABLE_2 |

| | subset | name | rows | type |
|---|---|---|---|---|
| 0 | test | traffic | unknown | View |
| 1 | train | traffic | unknown | View |

| | name | rows | type |
|---|---|---|---|
| 0 | traffic | 24096 | DataFrame |
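To make the `horizon` and `memory` settings of the time series more concrete, here is a small sketch of which past rows would be visible when predicting a given hour. It is only an illustration in plain Python: the helper `is_visible` is hypothetical, and the exact boundary handling (inclusive versus exclusive) is an assumption that should be checked against the getML documentation.

```python
from datetime import datetime, timedelta

HORIZON = timedelta(hours=1)   # how far ahead we predict
MEMORY = timedelta(hours=24)   # how far back we are allowed to look


def is_visible(ts, target_time, horizon=HORIZON, memory=MEMORY):
    """Return True if a row stamped `ts` may be used to predict `target_time`.

    Assumed semantics: ts + horizon <= target_time < ts + horizon + memory.
    """
    return ts + horizon <= target_time < ts + horizon + memory


target = datetime(2018, 3, 15, 12)

# One hour before the target: the most recent row that is still usable.
print(is_visible(datetime(2018, 3, 15, 11), target))      # True
# Half an hour before the target: too recent, inside the horizon.
print(is_visible(datetime(2018, 3, 15, 11, 30), target))  # False
# 25 hours before the target: just outside the 24-hour memory.
print(is_visible(datetime(2018, 3, 14, 11), target))      # False
```

Under these assumptions, each prediction only sees the 24 hourly rows ending one hour before the target time, which is what prevents the lagged targets from leaking the value being predicted.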


### 2. Predictive modeling

We loaded the data, defined the roles, units and the abstract data model. Next, we create a getML pipeline for relational learning.

#### 2.1 Propositionalization with getML's FastProp

In [11]:
seasonal = getml.preprocessors.Seasonal()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
)

__Build the pipeline__

In [12]:
pipe_fp_fl = getml.pipeline.Pipeline(
    preprocessors=[seasonal],
    feature_learners=[fast_prop],
    data_model=time_series.data_model,
    tags=["feature learning", "fastprop"],
)

pipe_fp_fl

In [13]:
pipe_fp_fl.check(time_series.train)

Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

In [14]:
benchmark = Benchmark()

In [15]:
with benchmark("fastprop"):
    pipe_fp_fl.fit(time_series.train)
    fastprop_train = pipe_fp_fl.transform(time_series.train, df_name="fastprop_train")

Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
FastProp: Trying 365 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03

Time taken: 0:00:03.058378.

Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01

In [16]:
fastprop_test = pipe_fp_fl.transform(time_series.test, df_name="fastprop_test")

Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

In [17]:
predictor = getml.predictors.XGBoostRegressor()

pipe_fp_pr = getml.pipeline.Pipeline(
    tags=["prediction", "fastprop"], predictors=[predictor]
)

In [18]:
pipe_fp_pr.fit(fastprop_train)

Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:05

Time taken: 0:00:05.192145.



In [19]:
pipe_fp_pr.score(fastprop_test)

Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

| | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2024-09-13 13:17:10 | fastprop_train | traffic_volume | 198.9482 | 292.2493 | 0.9779 |
| 1 | 2024-09-13 13:17:10 | fastprop_test | traffic_volume | 180.4867 | 261.9389 | 0.9827 |
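For readers unfamiliar with the three metrics reported by `score`, the following sketch computes them from their usual textbook definitions on made-up numbers. These are not getML's internal implementations; in particular, how getML defines `rsquared` is not stated in this notebook, so the coefficient-of-determination form used here is an assumption.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical hourly traffic volumes and predictions, not taken from the dataset.
y_true = [1000, 2000, 3000, 4000]
y_pred = [1100, 1900, 3200, 3900]

print(mae(y_true, y_pred), rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

Lower `mae` and `rmse` and a higher `rsquared` indicate a better fit, which is how the scores are read in the comparison later in this notebook.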


#### 2.2 Propositionalization with featuretools

In [20]:
traffic_train = time_series.train.population
traffic_test = time_series.test.population

In [21]:
dfs_pandas = {}

for df in [traffic_train, traffic_test, traffic]:
    dfs_pandas[df.name] = df.drop(df.roles.unused).to_pandas()
    dfs_pandas[df.name]["join_key"] = 1

In [22]:
ft_builder = FTTimeSeriesBuilder(
    num_features=200,
    horizon=pd.Timedelta(hours=1),
    memory=pd.Timedelta(hours=24),
    column_id="join_key",
    time_stamp="ds",
    target="traffic_volume",
    allow_lagged_targets=True,
)

In [23]:
with benchmark("featuretools"):
    featuretools_train = ft_builder.fit(dfs_pandas["train"])

featuretools_test = ft_builder.transform(dfs_pandas["test"])

featuretools: Trying features...
Selecting the best out of 118 features...
Time taken: 0h:4m:27.008254



In [24]:
roles = {
    getml.data.roles.join_key: ["join_key"],
    getml.data.roles.target: ["traffic_volume"],
    getml.data.roles.time_stamp: ["ds"],
}

df_featuretools_train = getml.data.DataFrame.from_pandas(
    featuretools_train, name="featuretools_train", roles=roles
)

df_featuretools_test = getml.data.DataFrame.from_pandas(
    featuretools_test, name="featuretools_test", roles=roles
)

In [25]:
df_featuretools_train.set_role(
    df_featuretools_train.roles.unused, getml.data.roles.numerical
)

df_featuretools_test.set_role(
    df_featuretools_test.roles.unused, getml.data.roles.numerical
)

In [26]:
predictor = getml.predictors.XGBoostRegressor()

pipe_ft_pr = getml.pipeline.Pipeline(
    tags=["prediction", "featuretools"], predictors=[predictor]
)

pipe_ft_pr

In [27]:
pipe_ft_pr.check(df_featuretools_train)

Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

In [28]:
pipe_ft_pr.fit(df_featuretools_train)

Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01

Time taken: 0:00:01.955919.



In [29]:
pipe_ft_pr.score(df_featuretools_test)

Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

| | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2024-09-13 13:22:48 | featuretools_train | traffic_volume | 220.4023 | 321.1657 | 0.9734 |
| 1 | 2024-09-13 13:22:48 | featuretools_test | traffic_volume | 210.1988 | 317.52 | 0.9746 |


#### 2.3 Propositionalization with tsfresh

tsfresh failed to run to completion on this dataset, apparently because of a bug in the tsfresh library, and is therefore excluded from this analysis.

### 3. Comparison

In [30]:
num_features = dict(
    fastprop=461,
    featuretools=59,
)

runtime_per_feature = [
    benchmark.runtimes["fastprop"] / num_features["fastprop"],
    benchmark.runtimes["featuretools"] / num_features["featuretools"],
]

features_per_second = [1.0 / r.total_seconds() for r in runtime_per_feature]

normalized_runtime_per_feature = [
    r / runtime_per_feature[0] for r in runtime_per_feature
]

comparison = pd.DataFrame(
    dict(
        runtime=[benchmark.runtimes["fastprop"], benchmark.runtimes["featuretools"]],
        num_features=num_features.values(),
        features_per_second=features_per_second,
        normalized_runtime=[
            1,
            benchmark.runtimes["featuretools"] / benchmark.runtimes["fastprop"],
        ],
        normalized_runtime_per_feature=normalized_runtime_per_feature,
        rsquared=[pipe_fp_pr.rsquared, pipe_ft_pr.rsquared],
        rmse=[pipe_fp_pr.rmse, pipe_ft_pr.rmse],
        mae=[pipe_fp_pr.mae, pipe_ft_pr.mae],
    )
)

comparison.index = ["getML: FastProp", "featuretools"]

In [31]:
comparison

| | runtime | num_features | features_per_second | normalized_runtime | normalized_runtime_per_feature | rsquared | rmse | mae |
|---|---|---|---|---|---|---|---|---|
| getML: FastProp | 0 days 00:00:04.806504 | 461 | 95.914061 | 1.0 | 1.0 | 0.982678 | 261.938873 | 180.486734 |
| featuretools | 0 days 00:04:27.009351 | 59 | 0.220966 | 55.551676 | 434.066948 | 0.974582 | 317.519976 | 210.198793 |
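The derived columns can be retraced by hand from the two runtimes and the feature counts. The sketch below does so with the standard library only, assuming that `benchmark.runtimes` holds `datetime.timedelta` values (an assumption about the `Benchmark` helper, which is not shown here). Dividing a `timedelta` by an integer rounds to whole microseconds, which would explain why `features_per_second` is not exactly `num_features / runtime`.

```python
from datetime import timedelta

# Runtimes and feature counts copied from the comparison table above.
runtimes = {
    "fastprop": timedelta(seconds=4, microseconds=806504),
    "featuretools": timedelta(minutes=4, seconds=27, microseconds=9351),
}
num_features = {"fastprop": 461, "featuretools": 59}

# timedelta / int yields a timedelta rounded to the nearest microsecond.
runtime_per_feature = {k: runtimes[k] / num_features[k] for k in runtimes}

features_per_second = {
    k: 1.0 / v.total_seconds() for k, v in runtime_per_feature.items()
}

# timedelta / timedelta yields a plain float ratio.
normalized_runtime = runtimes["featuretools"] / runtimes["fastprop"]
normalized_runtime_per_feature = (
    runtime_per_feature["featuretools"] / runtime_per_feature["fastprop"]
)

print(features_per_second)
print(normalized_runtime, normalized_runtime_per_feature)
```

Read this way, featuretools took about 56 times as long as FastProp in total and, because it also produced fewer features, about 434 times as long per feature.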


In [32]:
comparison.to_csv("comparisons/interstate94.csv")

In [None]:
getml.engine.shutdown()