# Propositionalization: Traffic near Dodgers' stadium

In this notebook, we compare getML's FastProp against the well-known feature engineering libraries featuretools and tsfresh.

Summary:

- Prediction type: __Regression model__
- Domain: __Transportation__
- Prediction target: __traffic volume__ 
- Source data: __Univariate time series__
- Population size: __47497__

# Background

A common approach to feature engineering is to generate attribute-value representations from relational data by applying a fixed set of aggregations to columns of interest and then performing feature selection on the (possibly large) set of generated features. In academia, this approach is called _propositionalization._
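
To make the idea concrete, here is a minimal pandas sketch of propositionalization on a hypothetical toy data set. The tables, column names and the chosen aggregations are assumptions for illustration only; they are unrelated to the Dodgers data and do not reflect how any of the benchmarked libraries implement this step. The subsequent feature selection is omitted.

```python
import pandas as pd

# Hypothetical toy tables: one row per customer, many rows per purchase.
population = pd.DataFrame({"customer_id": [1, 2, 3]})
peripheral = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2, 2],
        "amount": [10.0, 30.0, 5.0, 15.0, 40.0],
    }
)

# Step 1: apply a fixed set of aggregations to the column of interest.
aggregations = ["mean", "min", "max", "count"]
features = peripheral.groupby("customer_id")["amount"].agg(aggregations)
features.columns = [f"amount_{name}" for name in aggregations]

# Step 2: attach the aggregates to the population table. The result is a
# flat attribute-value table with one row per customer. Customer 3 has no
# purchases, so all of its aggregates are NaN.
flat = population.merge(
    features, left_on="customer_id", right_index=True, how="left"
)

print(flat)
```

Each aggregated column is one candidate feature; with more columns and more aggregations, the candidate set grows quickly, which is why a selection step usually follows.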

getML's [FastProp](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#fastprop) is an implementation of this propositionalization approach that has been optimized for speed and memory efficiency. In this notebook, we want to demonstrate how – well – fast FastProp is. To this end, we will benchmark FastProp against the popular feature engineering libraries [featuretools](https://www.featuretools.com/) and [tsfresh](https://tsfresh.readthedocs.io/en/latest/). Both of these libraries use propositionalization approaches for feature engineering.

In this notebook, we use traffic data that was collected for the Glendale on-ramp for the 101 North freeway in Los Angeles. For further details about the data set, refer to [the full notebook](../dodgers.ipynb).

### A web frontend for getML

The getML monitor is a frontend built to support your work with getML. It displays information such as the imported data frames and trained pipelines, and it allows easy data and feature exploration. You can launch the getML monitor [here](http://localhost:1709).

# Analysis

## Table of contents

1. [Loading data](#1.-Loading-data)
2. [Predictive modeling](#2.-Predictive-modeling)
3. [Comparison](#3.-Comparison)

Let's get started with the analysis and set up your session:

In [1]:
import datetime
import gc
import os
import sys
import time
from urllib import request

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
from IPython.display import Image
from scipy.stats import pearsonr

plt.style.use("seaborn")
%matplotlib inline

In [2]:
sys.path.append(os.path.join(sys.path[0], ".."))

from utils import Benchmark, FTTimeSeriesBuilder, TSFreshBuilder

In [3]:
import getml

getml.engine.launch()
getml.engine.set_project("dodgers")

getML engine is already running.



Connected to project 'dodgers'
http://localhost:1709/#/listprojects/dodgers/


## 1. Loading data

### 1.1 Download from source

We begin by downloading the data from the UC Irvine Machine Learning repository:

In [4]:
fname = "Dodgers.data"

if not os.path.exists(fname):
    fname, res = request.urlretrieve(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/event-detection/"
        + fname,
        fname,
    )

data_full_pandas = pd.read_csv(fname, header=None)
data_full_pandas.columns = ["ds", "y"]

In [5]:
data_full_pandas["ds"] = [
    datetime.datetime.strptime(dt, "%m/%d/%Y %H:%M") for dt in data_full_pandas["ds"]
]

In [6]:
data_full_pandas

Unnamed: 0,ds,y
0,2005-04-10 00:00:00,-1
1,2005-04-10 00:05:00,-1
2,2005-04-10 00:10:00,-1
3,2005-04-10 00:15:00,-1
4,2005-04-10 00:20:00,-1
...,...,...
50395,2005-10-01 23:35:00,-1
50396,2005-10-01 23:40:00,-1
50397,2005-10-01 23:45:00,-1
50398,2005-10-01 23:50:00,-1


### 1.2 Prepare data for getML

In [7]:
data_full = getml.data.DataFrame.from_pandas(data_full_pandas, "data_full")

In [8]:
data_full.set_role("y", getml.data.roles.target)
data_full.set_role("ds", getml.data.roles.time_stamp)

In [9]:
data_full

name,ds,y
role,time_stamp,target
unit,"time stamp, comparison only",
0.0,2005-04-10,-1
1.0,2005-04-10 00:05:00,-1
2.0,2005-04-10 00:10:00,-1
3.0,2005-04-10 00:15:00,-1
4.0,2005-04-10 00:20:00,-1
,...,...
50395.0,2005-10-01 23:35:00,-1
50396.0,2005-10-01 23:40:00,-1
50397.0,2005-10-01 23:45:00,-1
50398.0,2005-10-01 23:50:00,-1


In [10]:
split = getml.data.split.time(
    population=data_full, time_stamp="ds", test=getml.data.time.datetime(2005, 8, 20)
)
split

Unnamed: 0,Unnamed: 1
0.0,train
1.0,train
2.0,train
3.0,train
4.0,train
,...


### 1.3 Define relational model

To start with relational learning, we need to specify the data model. We manually replicate the appropriate time series structure by setting time-series-related join conditions (`horizon`, `memory` and `lagged_targets`). This is done abstractly using [Placeholders](https://docs.getml.com/latest/user_guide/data_model/data_model.html#placeholders).

The data model consists of two tables and the join between them:
* __Population table__ `population`: holds the target and the contemporaneously available time-based components
* __Peripheral table__ `data_full`: the same table as the population table
* __Join__ between both placeholders: specifies the `horizon`, which prevents leaks, and the `memory`, which keeps the computations feasible
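
The effect of `horizon` and `memory` can be illustrated with a small pandas sketch. It is not getML's implementation: the helper `visible_rows`, the toy series and the exact boundary handling (exclusive lower bound, inclusive upper bound) are assumptions made for illustration.

```python
import pandas as pd

# Toy series at 30-minute intervals (the real data uses 5-minute intervals).
series = pd.DataFrame(
    {
        "ds": pd.date_range("2005-04-10 00:00", periods=9, freq="30min"),
        "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    }
)

horizon = pd.Timedelta(hours=1)
memory = pd.Timedelta(hours=2)


def visible_rows(series, now, horizon, memory):
    """Return the rows assumed to be usable when predicting at `now`.

    Rows newer than `now - horizon` are excluded to prevent leaks;
    rows older than `now - horizon - memory` are excluded to bound the work.
    """
    upper = now - horizon
    lower = now - horizon - memory
    mask = (series["ds"] > lower) & (series["ds"] <= upper)
    return series[mask]


now = pd.Timestamp("2005-04-10 04:00")
window = visible_rows(series, now, horizon, memory)
print(window)
```

For a prediction at 04:00, only the observations between 01:30 and 03:00 are visible in this sketch. Aggregations over such a window, including aggregations of the lagged target `y`, are what the feature learners operate on.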

In [11]:
# 1. The horizon is 1 hour (we predict the traffic volume in one hour).
# 2. The memory is 2 hours, so we allow the algorithm to
#    use information from up to 2 hours ago.
# 3. We allow lagged targets. Thus, the algorithm can
#    identify autoregressive processes.

time_series = getml.data.TimeSeries(
    population=data_full,
    alias="population",
    split=split,
    time_stamps="ds",
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.hours(2),
    lagged_targets=True,
)

time_series

Unnamed: 0,data frames,staging table
0,population,POPULATION__STAGING_TABLE_1
1,data_full,DATA_FULL__STAGING_TABLE_2

Unnamed: 0,subset,name,rows,type
0,test,data_full,12384,View
1,train,data_full,38016,View

Unnamed: 0,name,rows,type
0,data_full,50400,DataFrame


## 2. Predictive modeling

We loaded the data, defined the roles, units and the abstract data model. Next, we create a getML pipeline for relational learning.

### 2.1 Propositionalization with getML's FastProp

In [12]:
seasonal = getml.preprocessors.Seasonal()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
)

__Build the pipeline__

In [13]:
pipe_fp_fl = getml.pipeline.Pipeline(
    preprocessors=[seasonal],
    feature_learners=[fast_prop],
    data_model=time_series.data_model,
    tags=["feature learning", "fastprop"],
)

pipe_fp_fl

In [14]:
pipe_fp_fl.check(time_series.train)

Checking data model...


Staging...

Preprocessing...

Checking...


OK.


In [15]:
benchmark = Benchmark()

In [16]:
with benchmark("fastprop"):
    pipe_fp_fl.fit(time_series.train)
    fastprop_train = pipe_fp_fl.transform(time_series.train, df_name="fastprop_train")

Checking data model...


Staging...


OK.


Staging...

Preprocessing...

FastProp: Trying 526 features...


Trained pipeline.
Time taken: 0h:0m:34.680276



Staging...

Preprocessing...

FastProp: Building features...




In [17]:
fastprop_test = pipe_fp_fl.transform(time_series.test, df_name="fastprop_test")



Staging...

Preprocessing...

FastProp: Building features...




In [18]:
predictor = getml.predictors.XGBoostRegressor()

pipe_fp_pr = getml.pipeline.Pipeline(
    tags=["prediction", "fastprop"], predictors=[predictor]
)

In [19]:
pipe_fp_pr.fit(fastprop_train)

Checking data model...


Staging...

Checking...


OK.


Staging...

XGBoost: Training as predictor...


Trained pipeline.
Time taken: 0h:0m:17.37808



In [20]:
pipe_fp_pr.score(fastprop_test)



Staging...




| | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2022-03-30 00:28:52 | fastprop_train | y | 5.4188 | 7.5347 | 0.699 |
| 1 | 2022-03-30 00:28:52 | fastprop_test | y | 5.6151 | 7.8243 | 0.6747 |


### 2.2 Propositionalization with featuretools

In [21]:
data_train = time_series.train.population.to_df("data_train")
data_test = time_series.test.population.to_df("data_test")

In [22]:
dfs_pandas = {}

for df in getml.project.data_frames:
    dfs_pandas[df.name] = df.to_pandas()
    dfs_pandas[df.name]["id"] = 1

In [23]:
ft_builder = FTTimeSeriesBuilder(
    num_features=200,
    horizon=pd.Timedelta(hours=1),
    memory=pd.Timedelta(hours=2),
    column_id="id",
    time_stamp="ds",
    target="y",
    allow_lagged_targets=True,
)

In [24]:
with benchmark("featuretools"):
    featuretools_train = ft_builder.fit(dfs_pandas["data_train"])

featuretools_test = ft_builder.transform(dfs_pandas["data_test"])

featuretools: Trying features...


  agg_primitives: ['all', 'any', 'count', 'num_true', 'percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data.


Selecting the best out of 59 features...
Time taken: 0h:5m:58.182095



  agg_primitives: ['all', 'any', 'count', 'num_true', 'percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data.


In [25]:
df_featuretools_train = getml.data.DataFrame.from_pandas(
    featuretools_train, name="featuretools_train", roles=data_train.roles
)

df_featuretools_test = getml.data.DataFrame.from_pandas(
    featuretools_test, name="featuretools_test", roles=data_train.roles
)

In [26]:
df_featuretools_train.set_role(
    df_featuretools_train.roles.unused, getml.data.roles.numerical
)

df_featuretools_test.set_role(
    df_featuretools_test.roles.unused, getml.data.roles.numerical
)

In [27]:
predictor = getml.predictors.XGBoostRegressor()

pipe_ft_pr = getml.pipeline.Pipeline(
    tags=["prediction", "featuretools"], predictors=[predictor]
)

pipe_ft_pr

In [28]:
pipe_ft_pr.check(df_featuretools_train)

Checking data model...


Staging...

Checking...




In [29]:
pipe_ft_pr.fit(df_featuretools_train)

Checking data model...


Staging...




Staging...

XGBoost: Training as predictor...


Trained pipeline.
Time taken: 0h:0m:3.437791



In [30]:
pipe_ft_pr.score(df_featuretools_test)



Staging...




| | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2022-03-30 00:36:53 | featuretools_train | y | 5.47 | 7.603 | 0.6934 |
| 1 | 2022-03-30 00:36:53 | featuretools_test | y | 6.2846 | 8.6711 | 0.65 |


### 2.3 Propositionalization with tsfresh

In [31]:
tsfresh_builder = TSFreshBuilder(
    num_features=200,
    horizon=20,
    memory=60,
    column_id="id",
    time_stamp="ds",
    target="y",
    allow_lagged_targets=True,
)

In [32]:
with benchmark("tsfresh"):
    tsfresh_train = tsfresh_builder.fit(dfs_pandas["data_train"])

tsfresh_test = tsfresh_builder.transform(dfs_pandas["data_test"])

Rolling: 100%|██████████| 60/60 [00:24<00:00,  2.42it/s]
Feature Extraction: 100%|██████████| 60/60 [00:22<00:00,  2.64it/s]
Feature Extraction: 100%|██████████| 60/60 [00:23<00:00,  2.51it/s]


Selecting the best out of 13 features...
Time taken: 0h:1m:26.650208



Rolling: 100%|██████████| 60/60 [00:07<00:00,  7.84it/s]
Feature Extraction: 100%|██████████| 60/60 [00:07<00:00,  7.84it/s]
Feature Extraction: 100%|██████████| 60/60 [00:08<00:00,  7.30it/s]


In [33]:
df_tsfresh_train = getml.data.DataFrame.from_pandas(
    tsfresh_train, name="tsfresh_train", roles=data_train.roles
)

df_tsfresh_test = getml.data.DataFrame.from_pandas(
    tsfresh_test, name="tsfresh_test", roles=data_train.roles
)

In [34]:
df_tsfresh_train.set_role(df_tsfresh_train.roles.unused, getml.data.roles.numerical)

df_tsfresh_test.set_role(df_tsfresh_test.roles.unused, getml.data.roles.numerical)

In [35]:
pipe_tsf_pr = getml.pipeline.Pipeline(
    tags=["prediction", "tsfresh"], predictors=[predictor]
)

pipe_tsf_pr

In [36]:
pipe_tsf_pr.fit(df_tsfresh_train)

Checking data model...


Staging...

Checking...




Staging...

XGBoost: Training as predictor...


Trained pipeline.
Time taken: 0h:0m:3.127005



In [37]:
pipe_tsf_pr.score(df_tsfresh_test)



Staging...




| | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2022-03-30 00:38:51 | tsfresh_train | y | 6.3146 | 8.2348 | 0.6418 |
| 1 | 2022-03-30 00:38:52 | tsfresh_test | y | 6.7886 | 8.9134 | 0.5778 |


## 3. Comparison

In [38]:
num_features = dict(
    fastprop=526,
    featuretools=59,
    tsfresh=12,
)

runtime_per_feature = [
    benchmark.runtimes["fastprop"] / num_features["fastprop"],
    benchmark.runtimes["featuretools"] / num_features["featuretools"],
    benchmark.runtimes["tsfresh"] / num_features["tsfresh"],
]

features_per_second = [1.0 / r.total_seconds() for r in runtime_per_feature]

normalized_runtime_per_feature = [
    r / runtime_per_feature[0] for r in runtime_per_feature
]

comparison = pd.DataFrame(
    dict(
        runtime=[
            benchmark.runtimes["fastprop"],
            benchmark.runtimes["featuretools"],
            benchmark.runtimes["tsfresh"],
        ],
        num_features=num_features.values(),
        features_per_second=features_per_second,
        normalized_runtime=[
            1,
            benchmark.runtimes["featuretools"] / benchmark.runtimes["fastprop"],
            benchmark.runtimes["tsfresh"] / benchmark.runtimes["fastprop"],
        ],
        normalized_runtime_per_feature=normalized_runtime_per_feature,
        rsquared=[pipe_fp_pr.rsquared, pipe_ft_pr.rsquared, pipe_tsf_pr.rsquared],
        rmse=[pipe_fp_pr.rmse, pipe_ft_pr.rmse, pipe_tsf_pr.rmse],
        mae=[pipe_fp_pr.mae, pipe_ft_pr.mae, pipe_tsf_pr.mae],
    )
)

comparison.index = ["getML: FastProp", "featuretools", "tsfresh"]

In [39]:
comparison

| | runtime | num_features | features_per_second | normalized_runtime | normalized_runtime_per_feature | rsquared | rmse | mae |
|---|---|---|---|---|---|---|---|---|
| getML: FastProp | 0 days 00:00:49.212908 | 526 | 10.688214 | 1.0 | 1.0 | 0.67474 | 7.824273 | 5.615138 |
| featuretools | 0 days 00:05:58.182900 | 59 | 0.16472 | 7.278231 | 64.887047 | 0.650041 | 8.671105 | 6.284556 |
| tsfresh | 0 days 00:01:26.650363 | 12 | 0.138488 | 1.760724 | 77.17814 | 0.577811 | 8.913408 | 6.78861 |
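
The derived columns can be checked by hand. The following sketch recomputes `features_per_second` and `normalized_runtime_per_feature` from the runtimes and feature counts reported in the table above; the plain-float dictionaries are a simplification of the `timedelta` arithmetic used in the notebook, so the last digits may differ slightly due to rounding.

```python
# Runtimes in seconds and feature counts, copied from the comparison table.
runtimes = {
    "fastprop": 49.212908,
    "featuretools": 358.182900,
    "tsfresh": 86.650363,
}
num_features = {"fastprop": 526, "featuretools": 59, "tsfresh": 12}

# Throughput: how many features are generated per second of runtime.
features_per_second = {k: num_features[k] / runtimes[k] for k in runtimes}

# Cost per feature, then expressed as a multiple of FastProp's cost.
seconds_per_feature = {k: runtimes[k] / num_features[k] for k in runtimes}
normalized_runtime_per_feature = {
    k: seconds_per_feature[k] / seconds_per_feature["fastprop"] for k in runtimes
}

for name in runtimes:
    print(
        f"{name}: {features_per_second[name]:.3f} features/s, "
        f"{normalized_runtime_per_feature[name]:.1f}x FastProp's cost per feature"
    )
```

In other words, per generated feature, featuretools takes roughly 65 times and tsfresh roughly 77 times as long as FastProp in this run.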


In [40]:
# export for further use
comparison.to_csv("comparisons/dodgers.csv")

## Why is FastProp so fast?

FastProp benefits greatly from getML's custom-built, C++-native in-memory database engine. The engine is highly optimized for working with relational data structures and uses information about the relational structure of the data to store it and carry out computations on it efficiently. This matters in particular for time series, where we [relate the current observation to a certain number of observations from the past](https://docs.getml.com/latest/user_guide/data_model/data_model.html#time-series): other libraries have to deal explicitly with this inherent structure of (multivariate) time series, and such explicit transformations are costly in terms of both memory and computational resources.

All operations on data stored in getML's engine benefit from implementations in modern C++. Further, we take advantage of functional design patterns in which all column-based operations are evaluated lazily. For example, aggregations are carried out only on rows that matter (taking into account even complex conditions that might span multiple tables in the relational model). Duplicate operations are reduced to a bare minimum by keeping track of the relational data model.

In addition to the advantage in speed, FastProp also has an edge in memory consumption. This stems from the abstract database design, the reliance on efficient storage patterns (utilizing pointers and indices) for concrete data, and the use of functional design patterns and lazy computations. This allows working with data sets of substantial size without falling back to distributed computing models.
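
The general idea behind lazy, column-based evaluation can be sketched in a few lines of Python. This is only a conceptual illustration and says nothing about how getML's C++ engine is actually implemented; the class `LazyColumn` and its `calls` counter are hypothetical.

```python
class LazyColumn:
    """A derived column whose values are computed only when they are read."""

    def __init__(self, source, fn):
        self.source = source  # underlying raw values
        self.fn = fn          # element-wise transformation, applied on access
        self.calls = 0        # counts how many elements were actually computed

    def __getitem__(self, index):
        self.calls += 1
        return self.fn(self.source[index])


# Defining the derived column does not compute anything yet.
column = LazyColumn(list(range(1000)), lambda value: value * value)

# Only the rows that satisfy the join conditions are aggregated.
rows_that_matter = range(10, 15)
total = sum(column[i] for i in rows_that_matter)

print(total, column.calls)
```

Although the column has 1000 rows, only the 5 rows inside the window are ever transformed. An eager implementation would have computed all 1000 values before aggregating.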

# Next Steps

If you are interested in further real-world applications of getML, visit the [notebook section on getml.com](https://notebooks.getml.com/). If you want to gain a deeper understanding of our notebooks' contents or download the code behind the notebooks, have a look at the [getml-demo repository](https://github.com/getml/getml-demo/). Here, you can also find [further benchmarks of getML](https://github.com/getml/getml-demo/#benchmarks).

Want to try it out without much hassle? Just head to [try.getml.com](https://try.getml.com) to launch an instance of getML directly in your browser.

Further, here is some additional material from our [documentation](https://docs.getml.com/latest/) if you want to learn more about getML:
* [Annotating data within getML's data frames](https://docs.getml.com/latest/user_guide/annotating_data/annotating_data.html),
* [Defining your relational structure through getML's abstract data model](https://docs.getml.com/latest/user_guide/data_model/data_model.html), or
* [An introduction to feature learning](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html).

# Get in contact

If you have any questions, just write us an [email](https://getml.com/contact/lets-talk/). Prefer a private demo of getML for your team? Just [contact us](https://getml.com/contact/lets-talk/) to arrange an introduction to getML.