# Why feature learning is better than simple propositionalization

**NOTE: Due to the memory requirements of featuretools and tsfresh, this notebook will not run on MyBinder when RUN_FEATURETOOLS=True and RUN_TSFRESH=True.**

In this notebook, we compare getML to featuretools and tsfresh, both of which are open-source libraries for feature engineering. We find that the advanced algorithms featured in getML yield significantly better predictions on this dataset. We then discuss why that is.

Summary:

- Prediction type: __Regression model__
- Domain: __Air pollution__
- Prediction target: __pm 2.5 concentration__ 
- Source data: __Multivariate time series__
- Population size: __41757__

*Author: Dr. Patrick Urbanke*

## Background

Many data scientists and AutoML tools use propositionalization methods for feature engineering. These propositionalization methods usually work as follows:

- Generate a large number of hard-coded features
- Use feature selection to pick a percentage of these features
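
The two steps above can be sketched in a few lines of pandas. This is a minimal illustration on made-up data, not the implementation used by featuretools, tsfresh or getML. The helper names `propositionalize` and `select_top`, the window sizes and the 25% share are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def propositionalize(df, columns, windows=(3, 6), aggs=("mean", "min", "max")):
    """Generate one candidate feature per (column, window, aggregation) combination."""
    feats = {}
    for col in columns:
        for window in windows:
            rolled = df[col].rolling(window, min_periods=1)
            for agg in aggs:
                feats[f"{agg}_{col}_{window}"] = getattr(rolled, agg)()
    return pd.DataFrame(feats, index=df.index)


def select_top(features, target, share=0.25):
    """Keep the given share of features, ranked by absolute correlation with the target."""
    corr = features.corrwith(target).abs().fillna(0.0)
    k = max(1, int(len(corr) * share))
    return features[corr.sort_values(ascending=False).index[:k]]


# Toy data: the target is (almost) the 3-step rolling mean of x; z is pure noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=200), "z": rng.normal(size=200)})
toy["y"] = toy["x"].rolling(3, min_periods=1).mean() + 0.01 * rng.normal(size=200)

candidates = propositionalize(toy, ["x", "z"])  # 2 columns * 2 windows * 3 aggregations
selected = select_top(candidates, toy["y"])     # keeps 25% of the candidates
```

The set of candidate features is fixed in advance. The target only enters in the selection step, which is the main difference to feature learning.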

By contrast, getML (https://getml.com/product) contains approaches for feature learning: Feature learning adapts machine learning approaches such as decision trees or gradient boosting to the problem of extracting features from relational data and time series.

In this notebook, we will benchmark getML (https://getml.com/product) against featuretools (https://www.featuretools.com/) and tsfresh (https://tsfresh.readthedocs.io/en/latest/). Both of these libraries use propositionalization approaches for feature engineering.

As our example dataset, we use a publicly available dataset on air pollution in Beijing, China (https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data). The data set was originally used in the following study:

Liang, X., Zou, T., Guo, B., Li, S., Zhang, H., Zhang, S., Huang, H. and Chen, S. X. (2015). Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating. Proceedings of the Royal Society A, 471, 20150257.

We find that getML significantly outperforms featuretools and tsfresh in terms of predictive accuracy (**R-squared of 62.3%** vs **R-squared of 50.4%**).

Our findings indicate that getML's feature learning algorithms are better at adapting to data sets and are also more scalable due to their lower memory requirement.

## A web frontend for getML

The getML monitor is a frontend built to support your work with getML. It displays information such as the imported data frames and trained pipelines, and it allows easy data and feature exploration. You can launch the getML monitor [here](http://localhost:1709).

In [1]:
import os
from urllib import request

import getml
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

import matplotlib.pyplot as plt
%matplotlib inline  

In [2]:
RUN_FEATURETOOLS = True
RUN_TSFRESH = True

if RUN_FEATURETOOLS:
    from utils import FTTimeSeriesBuilder

if RUN_TSFRESH:
    from utils import TSFreshBuilder

## 1. Loading data

### 1.1 Download from source

We begin by downloading the data from the UCI Machine Learning repository.

In [3]:
FEATURETOOLS_FILES = [
    "featuretools_training.csv",
    "featuretools_test.csv"
]

if not RUN_FEATURETOOLS:
    for fname in FEATURETOOLS_FILES:
        if not os.path.exists(fname):
            fname, res = request.urlretrieve(
                "https://static.getml.com/datasets/air_pollution/featuretools/" + fname, 
                fname
            )

In [4]:
TSFRESH_FILES = [
    "tsfresh_training.csv",
    "tsfresh_test.csv"
]

if not RUN_TSFRESH:
    for fname in TSFRESH_FILES:
        if not os.path.exists(fname):
            fname, res = request.urlretrieve(
                "https://static.getml.com/datasets/air_pollution/tsfresh/" + fname, 
                fname
            )

In [5]:
fname = "PRSA_data_2010.1.1-2014.12.31.csv"

if not os.path.exists(fname):
    fname, res = request.urlretrieve(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/00381/" + fname, 
        fname
    )


### 1.2 Prepare data for tsfresh and getML

Our goal is to predict the pm2.5 concentration from factors such as weather or time of day. However, there are some **missing entries** for pm2.5, so we get rid of them.

In [6]:
data_full_pandas = pd.read_csv(fname)

data_full_pandas = data_full_pandas[
    data_full_pandas["pm2.5"] == data_full_pandas["pm2.5"]
]
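
The filter in this cell relies on the fact that NaN never compares equal to itself. A minimal sketch on a toy frame (the values are made up) suggests that the self-comparison selects the same rows as the more explicit `notna()`:

```python
import numpy as np
import pandas as pd

# Toy frame with two missing target values (made-up numbers).
toy = pd.DataFrame(
    {"pm2.5": [129.0, np.nan, 159.0, np.nan], "TEMP": [-4.0, -4.0, -5.0, -5.0]}
)

# NaN == NaN is False, so the self-comparison drops rows with a missing target.
by_self_comparison = toy[toy["pm2.5"] == toy["pm2.5"]]

# The same filter, written with pandas' explicit missing-value check.
by_notna = toy[toy["pm2.5"].notna()]
```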

tsfresh requires a date column, so we build one.

In [7]:
def add_leading_zero(val):
    if len(str(val)) == 1:
        return "0" + str(val)
    return str(val)

data_full_pandas["month"] = [
    add_leading_zero(val) for val in data_full_pandas["month"]
]

data_full_pandas["day"] = [
    add_leading_zero(val) for val in data_full_pandas["day"]
]

data_full_pandas["hour"] = [
    add_leading_zero(val) for val in data_full_pandas["hour"]
]

def make_date(year, month, day, hour):
    return year + "-" + month + "-" + day + " " + hour + ":00:00"

data_full_pandas["date"] = [
    make_date(str(year), month, day, hour) \
    for year, month, day, hour in zip(
        data_full_pandas["year"],
        data_full_pandas["month"],
        data_full_pandas["day"],
        data_full_pandas["hour"],
    )
]
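
As a hedged alternative, the same date strings can be assembled with pandas' own datetime handling. The sketch below uses a made-up two-row frame with integer `year`, `month`, `day` and `hour` columns, as they appear in the raw file. It is not the code this notebook runs.

```python
import pandas as pd

# Two made-up rows with the integer date components of the raw file.
toy = pd.DataFrame(
    {"year": [2010, 2014], "month": [1, 12], "day": [2, 31], "hour": [0, 22]}
)

# pd.to_datetime assembles time stamps from columns named year/month/day/hour.
stamps = pd.to_datetime(toy[["year", "month", "day", "hour"]])

# Format them as zero-padded strings, matching the layout built above.
toy["date"] = stamps.dt.strftime("%Y-%m-%d %H:%M:%S")
```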


tsfresh also requires the time series to have ids. Since there is only a single time series, every row gets the same id.

In [8]:
data_full_pandas["id"] = 1

The dataset now contains many columns that we do not need or that tsfresh cannot process. For instance, *cbwd* might actually contain useful information, but it is a categorical variable, which is difficult to handle for tsfresh, so we remove it.

We also want to split our data into a training and testing set.

In [9]:
# Split by year: 2010-2013 for training, 2014 for testing.
# .copy() makes the subsets independent of data_full_pandas.
data_train_pandas = data_full_pandas[data_full_pandas["year"] < 2014].copy()
data_test_pandas = data_full_pandas[data_full_pandas["year"] == 2014].copy()

In [10]:
def remove_unwanted_columns(df):
    del df["cbwd"]
    del df["year"]
    del df["month"]
    del df["day"]
    del df["hour"]
    del df["No"]
    return df

data_full_pandas = remove_unwanted_columns(data_full_pandas)
data_train_pandas = remove_unwanted_columns(data_train_pandas)
data_test_pandas = remove_unwanted_columns(data_test_pandas)

In [11]:
data_full_pandas

Unnamed: 0,pm2.5,DEWP,TEMP,PRES,Iws,Is,Ir,date,id
24,129.0,-16,-4.0,1020.0,1.79,0,0,2010-01-02 00:00:00,1
25,148.0,-15,-4.0,1020.0,2.68,0,0,2010-01-02 01:00:00,1
26,159.0,-11,-5.0,1021.0,3.57,0,0,2010-01-02 02:00:00,1
27,181.0,-7,-5.0,1022.0,5.36,1,0,2010-01-02 03:00:00,1
28,138.0,-7,-5.0,1022.0,6.25,2,0,2010-01-02 04:00:00,1
...,...,...,...,...,...,...,...,...,...
43819,8.0,-23,-2.0,1034.0,231.97,0,0,2014-12-31 19:00:00,1
43820,10.0,-22,-3.0,1034.0,237.78,0,0,2014-12-31 20:00:00,1
43821,10.0,-22,-3.0,1034.0,242.70,0,0,2014-12-31 21:00:00,1
43822,8.0,-22,-4.0,1034.0,246.72,0,0,2014-12-31 22:00:00,1


We then **load the data into the getML engine**. We begin by setting a project.

In [12]:
getml.engine.set_project('air_pollution')


Connected to project 'air_pollution'


In [13]:
df_full = getml.data.DataFrame.from_pandas(data_full_pandas, name='full')
df_train = getml.data.DataFrame.from_pandas(data_train_pandas, name='train')
df_test = getml.data.DataFrame.from_pandas(data_test_pandas, name='test')

df_full["date"] = df_full["date"].as_ts()

We need to **assign roles** to the columns, such as defining the target column.

In [14]:
def set_roles(df):
    df.set_role(["date"], getml.data.roles.time_stamp)
    df.set_role(["pm2.5"], getml.data.roles.target)
    df.set_role([
        "DEWP", 
        "TEMP",
        "PRES",
        "Iws",
        "Is",
        "Ir"], getml.data.roles.numerical)

set_roles(df_full)
set_roles(df_train)
set_roles(df_test)

## 3. Predictive modelling


### 3.1 Pipeline 1: Complex features, 7 days

For our first experiment, we will learn complex features and allow a memory of up to seven days. That means that at any given point in time, the algorithm is allowed to look back seven days into the past.

getML uses relational learning to build the pipelines. Even though there is a simpler time series API, the relational API is more flexible, which is why we decided to use it.
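
To make the notion of memory concrete, the following sketch selects the rows that are visible from a given point in time. It is a plain pandas illustration, not getML code. The helper `rows_in_memory` is hypothetical, and the treatment of the window boundaries (exclusive lower bound, inclusive upper bound) is an assumption.

```python
import numpy as np
import pandas as pd


def rows_in_memory(peripheral, now, memory):
    """Return the rows with now - memory < date <= now (boundary handling is assumed)."""
    lower = now - memory
    mask = (peripheral["date"] > lower) & (peripheral["date"] <= now)
    return peripheral[mask]


# Ten days of hourly time stamps (toy data).
start = pd.Timestamp("2010-01-01")
toy = pd.DataFrame({"date": start + pd.to_timedelta(np.arange(240), unit="h")})

now = toy["date"].iloc[-1]
week = rows_in_memory(toy, now, pd.Timedelta(days=7))  # 7 * 24 hourly rows
day = rows_in_memory(toy, now, pd.Timedelta(days=1))   # 24 hourly rows
```

Every feature for the row at `now` is an aggregation over such a window, so a longer memory gives the algorithm seven times as many rows to aggregate over.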

In [15]:
population = getml.data.Placeholder('population')

peripheral = getml.data.Placeholder('peripheral')

population.join(
    peripheral,
    time_stamp='date',
    memory=getml.data.time.days(7)
)

population

In [16]:
relmt = getml.feature_learning.RelMTModel(
    num_features=10,
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    seed=4367,
    num_threads=1
)

predictor = getml.predictors.XGBoostRegressor(n_jobs=1)

pipe = getml.pipeline.Pipeline(
    tags=['memory: 7d', 'complex features'],
    population=population,
    peripheral=[peripheral],
    feature_learners=[relmt],
    predictors=[predictor]
)

pipe

It is good practice to always check your data model first, even though `check(...)` is also called by `fit(...)`. That enables us to make last-minute changes.

In [17]:
pipe.check(df_train, [df_full])

Checking data model...
OK.


We now fit the pipeline on the training set and evaluate it, both in-sample and out-of-sample.

In [18]:
pipe.fit(df_train, [df_full])

Checking data model...
OK.

RelMT: Training features...

RelMT: Building features...

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:4m:48.03673



In [19]:
pipe.score(df_test, [df_full])


RelMT: Building features...



date time           | set used | target | mae      | rmse     | rsquared
------------------- | -------- | ------ | -------- | -------- | --------
2021-02-24 10:50:01 | train    | pm2.5  | 35.27219 | 51.11251 | 0.69007
2021-02-24 10:50:12 | test     | pm2.5  | 40.0586  | 57.79037 | 0.62263
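
The three metrics in this table can be sketched with numpy. The function below is an illustration only. In particular, it computes `rsquared` as the squared Pearson correlation between predictions and targets, which is an assumption about how the score is defined and not something this notebook states.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and the squared Pearson correlation (assumed R-squared definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return {"mae": mae, "rmse": rmse, "rsquared": float(r ** 2)}


# Made-up targets and predictions.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 2.5, 4.5])
```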


### 3.2 Pipeline 2: Complex features, 1 day

For our second experiment, we will learn complex features and allow a memory of up to one day.

In [20]:
population = getml.data.Placeholder('population')

peripheral = getml.data.Placeholder('peripheral')

population.join(
    peripheral,
    time_stamp='date',
    memory=getml.data.time.days(1)
)

population

In [21]:
relmt = getml.feature_learning.RelMTModel(
    num_features=10,
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    seed=4367,
    num_threads=1
)

predictor = getml.predictors.XGBoostRegressor(n_jobs=1)

pipe = getml.pipeline.Pipeline(
    tags=['memory: 1d', 'complex features'],
    population=population,
    peripheral=[peripheral],
    feature_learners=[relmt],
    predictors=[predictor]
)

pipe

In [22]:
pipe.check(df_train, [df_full])

Checking data model...
OK.


In [23]:
pipe.fit(df_train, [df_full])

Checking data model...
OK.

RelMT: Training features...

RelMT: Building features...

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:2m:7.974977



In [24]:
pipe.score(df_test, [df_full])


RelMT: Building features...



date time           | set used | target | mae      | rmse     | rsquared
------------------- | -------- | ------ | -------- | -------- | --------
2021-02-24 10:52:29 | train    | pm2.5  | 38.75185 | 56.17597 | 0.62551
2021-02-24 10:52:37 | test     | pm2.5  | 44.40302 | 66.1336  | 0.51528


### 3.3 Pipeline 3: Simple features, 7 days

For our third experiment, we will learn simple features and allow a memory of up to seven days.

In [25]:
population = getml.data.Placeholder('population')

peripheral = getml.data.Placeholder('peripheral')

population.join(
    peripheral,
    time_stamp='date',
    memory=getml.data.time.days(7)
)

population

In [26]:
fast_prop = getml.feature_learning.FastPropModel(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_features=40,
    num_threads=1
)

predictor = getml.predictors.XGBoostRegressor(n_jobs=1)

pipe = getml.pipeline.Pipeline(
    tags=['memory: 7d', 'simple features'],
    population=population,
    peripheral=[peripheral],
    feature_learners=[fast_prop],
    predictors=[predictor]
)

pipe

In [27]:
pipe.check(df_train, [df_full])

Checking data model...
OK.


In [28]:
pipe.fit(df_train, [df_full])

Checking data model...
OK.

FastProp: Trying 72 features...

FastProp: Building features...

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:0m:27.943497



In [29]:
pipe.score(df_test, [df_full])


FastProp: Building features...



date time           | set used | target | mae      | rmse     | rsquared
------------------- | -------- | ------ | -------- | -------- | --------
2021-02-24 10:53:13 | train    | pm2.5  | 39.14191 | 55.00419 | 0.66066
2021-02-24 10:53:16 | test     | pm2.5  | 48.50128 | 68.62869 | 0.48425


### 3.4 Pipeline 4: Simple features, 1 day

For our fourth experiment, we will learn simple features and allow a memory of up to one day.

In [30]:
population = getml.data.Placeholder('population')

peripheral = getml.data.Placeholder('peripheral')

population.join(
    peripheral,
    time_stamp='date',
    memory=getml.data.time.days(1)
)

population

In [31]:
fast_prop = getml.feature_learning.FastPropModel(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_features=40,
    num_threads=1
)

predictor = getml.predictors.XGBoostRegressor(n_jobs=1)

pipe = getml.pipeline.Pipeline(
    tags=['memory: 1d', 'simple features'],
    population=population,
    peripheral=[peripheral],
    feature_learners=[fast_prop],
    predictors=[predictor]
)

pipe

In [32]:
pipe.check(df_train, [df_full])

Checking data model...
OK.


In [33]:
pipe.fit(df_train, [df_full])

Checking data model...
OK.

FastProp: Trying 72 features...

FastProp: Building features...

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:0m:15.773669



In [34]:
pipe.score(df_test, [df_full])


FastProp: Building features...



date time           | set used | target | mae    | rmse     | rsquared
------------------- | -------- | ------ | ------ | -------- | --------
2021-02-24 10:53:40 | train    | pm2.5  | 40.79  | 58.38303 | 0.60263
2021-02-24 10:53:41 | test     | pm2.5  | 46.548 | 65.88426 | 0.51142


### 3.5 Using featuretools

To make things a bit easier, we have written a high-level wrapper around featuretools which we placed in a separate module (`utils`).

In [35]:
if RUN_FEATURETOOLS:
    ft_builder = FTTimeSeriesBuilder(
        num_features=40,
        horizon=pd.Timedelta(days=0),
        memory=pd.Timedelta(days=1),
        column_id="id",
        time_stamp="date",
        target="pm2.5")
    #
    featuretools_training = ft_builder.fit(data_train_pandas)
    featuretools_test = ft_builder.transform(data_test_pandas)
    #
    featuretools_training.to_csv("featuretools_training.csv", index=False)
    featuretools_test.to_csv("featuretools_test.csv", index=False)

featuretools: Trying features...
Time taken: 0h:1m:37.306428



In [36]:
if not RUN_FEATURETOOLS:
    featuretools_training = pd.read_csv("featuretools_training.csv")
    featuretools_test = pd.read_csv("featuretools_test.csv")

In [37]:
df_featuretools_training = getml.data.DataFrame.from_pandas(featuretools_training, name='featuretools_training')
df_featuretools_test = getml.data.DataFrame.from_pandas(featuretools_test, name='featuretools_test')

In [38]:
def set_roles_featuretools(df):
    df["date"] = df["date"].as_ts()
    df.set_role(["pm2.5"], getml.data.roles.target)
    df.set_role(["date"], getml.data.roles.time_stamp)
    df.set_role(df.unused_names, getml.data.roles.numerical)
    df.set_role(["id"], getml.data.roles.unused_float)
    return df

df_featuretools_training = set_roles_featuretools(df_featuretools_training)
df_featuretools_test = set_roles_featuretools(df_featuretools_test)

In [39]:
predictor = getml.predictors.XGBoostRegressor()

pipe = getml.pipeline.Pipeline(
    tags=['featuretools', 'memory: 1d'],
    predictors=[predictor]
)

pipe

In [40]:
pipe.check(df_featuretools_training)

Checking data model...
OK.


In [41]:
pipe.fit(df_featuretools_training)

Checking data model...
OK.

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:0m:4.410858



In [42]:
pipe.score(df_featuretools_test)




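date time           | set used              | target | mae      | rmse     | rsquared
------------------- | --------------------- | ------ | -------- | -------- | --------
2021-02-24 10:55:50 | featuretools_training | pm2.5  | 40.73829 | 57.75336 | 0.6123
2021-02-24 10:55:50 | featuretools_test     | pm2.5  | 46.85966 | 66.42585 | 0.5041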


### 3.6 Using tsfresh

tsfresh is a rather low-level library. To make things a bit easier, we have written a high-level wrapper which we placed in a separate module (`utils`).

To limit the memory consumption, we undertake the following steps:

- We limit ourselves to a memory of 1 day from any point in time. This is necessary because tsfresh duplicates records for every time stamp. That means that looking back 7 days instead of 1 day would make the memory consumption roughly seven times as high.
- We extract only tsfresh's **MinimalFCParameters** and **IndexBasedFCParameters** (the latter is a superset of **TimeBasedFCParameters**).
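
The effect of duplicating records for every time stamp can be sketched without tsfresh. The helper `roll_naively` below is hypothetical and only mimics the idea of a rolling data layout on a toy series. It is not tsfresh's implementation.

```python
import pandas as pd


def roll_naively(df, window):
    """Emit one copy of the trailing `window` rows per time stamp (hypothetical helper)."""
    pieces = []
    for end in range(1, len(df) + 1):
        piece = df.iloc[max(0, end - window):end].copy()
        piece["window_id"] = end - 1  # identifies the time stamp this copy belongs to
        pieces.append(piece)
    return pd.concat(pieces, ignore_index=True)


# 1000 hourly observations (toy data).
toy = pd.DataFrame({"TEMP": range(1000)})

rows_1d = len(roll_naively(toy, window=24))      # memory of one day
rows_7d = len(roll_naively(toy, window=24 * 7))  # memory of seven days
```

On this toy series, the seven-day layout holds about 6.5 times as many rows as the one-day layout. The factor approaches seven as the series gets longer.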

In order to make sure that tsfresh's features can be compared to getML's features, we also do the following:

- We apply tsfresh's built-in feature selection algorithm.
- Of the remaining features, we only keep the 40 features most correlated with the target (in terms of the absolute value of the correlation).
- We add the original columns as additional features.


In [43]:
data_train_pandas

Unnamed: 0,pm2.5,DEWP,TEMP,PRES,Iws,Is,Ir,date,id
24,129.0,-16,-4.0,1020.0,1.79,0,0,2010-01-02 00:00:00,1
25,148.0,-15,-4.0,1020.0,2.68,0,0,2010-01-02 01:00:00,1
26,159.0,-11,-5.0,1021.0,3.57,0,0,2010-01-02 02:00:00,1
27,181.0,-7,-5.0,1022.0,5.36,1,0,2010-01-02 03:00:00,1
28,138.0,-7,-5.0,1022.0,6.25,2,0,2010-01-02 04:00:00,1
...,...,...,...,...,...,...,...,...,...
35059,22.0,-19,7.0,1013.0,114.87,0,0,2013-12-31 19:00:00,1
35060,18.0,-21,7.0,1014.0,119.79,0,0,2013-12-31 20:00:00,1
35061,23.0,-21,7.0,1014.0,125.60,0,0,2013-12-31 21:00:00,1
35062,20.0,-21,6.0,1014.0,130.52,0,0,2013-12-31 22:00:00,1


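One of the issues with tsfresh is that it actually requires more memory than MyBinder allows. When `RUN_TSFRESH` is set to `False`, we therefore skip the feature extraction below and load the precomputed tsfresh features that were downloaded in section 1.1.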

In [44]:
if RUN_TSFRESH:
    tsfresh_builder = TSFreshBuilder(
        num_features=40,
        memory=24,
        column_id="id",
        time_stamp="date",
        target="pm2.5")
    #
    tsfresh_training = tsfresh_builder.fit(data_train_pandas)
    tsfresh_test = tsfresh_builder.transform(data_test_pandas)
    #
    tsfresh_training.to_csv("tsfresh_training.csv", index=False)
    tsfresh_test.to_csv("tsfresh_test.csv", index=False)

Rolling: 100%|██████████| 20/20 [01:22<00:00,  4.14s/it]
Feature Extraction: 100%|██████████| 20/20 [00:47<00:00,  2.38s/it]
Feature Extraction: 100%|██████████| 20/20 [01:27<00:00,  4.39s/it]


Time taken: 0h:4m:1.47543



Rolling: 100%|██████████| 20/20 [00:15<00:00,  1.25it/s]
Feature Extraction: 100%|██████████| 20/20 [00:13<00:00,  1.49it/s]
Feature Extraction: 100%|██████████| 20/20 [00:27<00:00,  1.38s/it]


tsfresh does not contain built-in machine learning algorithms. In order to ensure a fair comparison, we use the exact same machine learning algorithm we have also used for getML: An XGBoost regressor with all hyperparameters set to their default value.

In order to do so, we load the tsfresh features into the getML engine.

In [45]:
if not RUN_TSFRESH:
    tsfresh_training = pd.read_csv("tsfresh_training.csv")
    tsfresh_test = pd.read_csv("tsfresh_test.csv")

In [46]:
df_tsfresh_training = getml.data.DataFrame.from_pandas(tsfresh_training, name='tsfresh_training')
df_tsfresh_test = getml.data.DataFrame.from_pandas(tsfresh_test, name='tsfresh_test')

As usual, we need to set roles:

In [47]:
def set_roles_tsfresh(df):
    df["date"] = df["date"].as_ts()
    df.set_role(["pm2.5"], getml.data.roles.target)
    df.set_role(["date"], getml.data.roles.time_stamp)
    df.set_role(df.unused_names, getml.data.roles.numerical)
    df.set_role(["id"], getml.data.roles.unused_float)
    return df

df_tsfresh_training = set_roles_tsfresh(df_tsfresh_training)
df_tsfresh_test = set_roles_tsfresh(df_tsfresh_test)

In this case, our pipeline is very simple. It only consists of a single XGBoostRegressor.

In [48]:
predictor = getml.predictors.XGBoostRegressor()

pipe = getml.pipeline.Pipeline(
    tags=['tsfresh', 'memory: 1d'],
    predictors=[predictor]
)

pipe

In [49]:
pipe.check(df_tsfresh_training)

Checking data model...
OK.


In [50]:
pipe.fit(df_tsfresh_training)

Checking data model...
OK.

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:0m:5.822567



In [51]:
pipe.score(df_tsfresh_test)




date time           | set used         | target | mae      | rmse     | rsquared
------------------- | ---------------- | ------ | -------- | -------- | --------
2021-02-24 11:01:05 | tsfresh_training | pm2.5  | 41.12497 | 58.4532  | 0.6013
2021-02-24 11:01:05 | tsfresh_test     | pm2.5  | 46.66552 | 66.26759 | 0.50394


## 4. Discussion

We have seen that getML outperforms featuretools and tsfresh by more than 10 percentage points in terms of R-squared (62.3% vs 50.4%). We now want to analyze why that is.

There are two possible hypotheses:

- getML outperforms featuretools and tsfresh because it uses feature learning and is able to produce more complex features.
- getML outperforms featuretools and tsfresh because it makes better use of memory and is able to look back further.

Let's summarize our findings:


Name         | Memory  | Feature complexity | R-squared | RMSE | Time taken
------------ | ------- | ------------------ | --------- | ---- | -----------------------
Pipeline 1   |  7 days |            complex |     62.3% | 57.8 | ~4 minutes 48 seconds
Pipeline 2   |   1 day |            complex |     51.5% | 66.1 | ~2 minutes 7 seconds
Pipeline 3   |  7 days |             simple |     48.4% | 68.6 | ~27 seconds
Pipeline 4   |   1 day |             simple |     51.1% | 65.9 | ~15 seconds
featuretools |   1 day |             simple |     50.4% | 66.4 | ~1 minute 40 seconds
tsfresh      |   1 day |             simple |     50.4% | 66.3 | ~4 minutes


We have built simple features and complex features, and we have also differentiated between a memory of 1 day and a memory of 7 days. When we use a memory of one day and allow only simple features, getML produces features that are very similar to those of featuretools and tsfresh. It is therefore unsurprising that its performance is roughly on par with that of featuretools and tsfresh, even though getML is considerably faster: about six times faster than featuretools (about 16 seconds vs 1 minute 37 seconds) and about 15 times faster than tsfresh (about 16 seconds vs about 4 minutes).
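
Using the wall-clock times printed in the cell outputs above, the speed-up factors can be worked out as follows. The comparison takes the reported times at face value, so it inherits any differences in what each timer covers.

```python
# Wall-clock times in seconds, copied from the "Time taken" outputs above.
seconds = {
    "getML FastProp, 1 day": 15.773669,
    "featuretools": 1 * 60 + 37.306428,
    "tsfresh": 4 * 60 + 1.47543,
}

baseline = seconds["getML FastProp, 1 day"]

# How many times longer each library took than the getML pipeline.
speedup = {
    name: elapsed / baseline
    for name, elapsed in seconds.items()
    if name != "getML FastProp, 1 day"
}
```

This gives a factor of roughly 6.2 for featuretools and roughly 15.3 for tsfresh.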

The summary table shows that a combination of both of our hypotheses explains why getML outperforms featuretools and tsfresh. With a memory of one day, complex features do only slightly better than simple features (51.5% vs 51.1%). With a memory of seven days, simple features actually get worse (48.4%). But when you look back seven days and allow more complex features, you get good results (62.3%).
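
The interplay between the two factors can be made explicit with a little arithmetic on the out-of-sample R-squared values of the four getML pipelines in the summary table:

```python
# Out-of-sample R-squared in percent, taken from the summary table.
r2 = {
    ("complex", "7d"): 62.3,
    ("complex", "1d"): 51.5,
    ("simple", "7d"): 48.4,
    ("simple", "1d"): 51.1,
}

# Gain from complex over simple features, for each memory setting.
gain_from_complexity = {
    memory: round(r2[("complex", memory)] - r2[("simple", memory)], 1)
    for memory in ("1d", "7d")
}

# Gain from a seven-day over a one-day memory, for each feature type.
gain_from_memory = {
    kind: round(r2[(kind, "7d")] - r2[(kind, "1d")], 1)
    for kind in ("simple", "complex")
}
```

Complex features add 0.4 percentage points at a memory of one day but 13.9 points at seven days. The longer memory costs simple features 2.7 points but gains complex features 10.8 points. Neither factor helps much on its own.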

This suggests that getML outperforms featuretools and tsfresh because it can make more efficient use of memory and thus look back further. Because getML uses feature learning and can build more complex features, it can make better use of the greater look-back window.

## 5. Conclusion

We have compared getML's feature learning algorithms to the brute-force feature engineering approaches of featuretools and tsfresh on a data set related to air pollution in China. We found that getML significantly outperforms featuretools and tsfresh. These results are consistent with the view that feature learning can yield significant improvements over simple propositionalization approaches.

However, there are other datasets on which simple propositionalization performs well. Our suggestion is therefore to think of algorithms like `FastProp` and `RelMT` as tools in a toolbox. If a simple tool like `FastProp` gets the job done, then use that. But when you need more advanced approaches, like `RelMT`, you should have them at your disposal as well.

You are encouraged to reproduce these results. You will need getML (https://getml.com/product), featuretools (https://www.featuretools.com/) and tsfresh (https://tsfresh.readthedocs.io/en/latest/). You can download all of them for free.