# Predicting air pollution

### Why feature learning is better than brute-force feature engineering

In this notebook, we will train ML models to predict the PM2.5 concentration based on several external factors. PM2.5 are fine inhalable particles that pose an enormous health risk. We will compare getML to tsfresh, an open-source library that generates features for time series. tsfresh uses a brute-force approach to feature engineering, whereas getML uses feature learning. We find that getML generates significantly better predictions and consumes roughly 2% of the memory that tsfresh requires.

Summary:

- Prediction type: __Regression model__
- Domain: __Environmental science__
- Prediction target: __PM 2.5 concentration__ 
- Source data: __Multivariate time series__
- Population size: __41757__

_Author: Patrick Urbanke_

# Background

Many data scientists and AutoML tools use brute-force methods for feature engineering. These brute-force methods usually work as follows:

- Generate a large number of hard-coded features
- Use feature selection to pick a small percentage of these features
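
As a rough illustration of these two steps, the sketch below generates every combination of a few hard-coded aggregations and window lengths for a toy series and then keeps only the candidates most correlated with the target. The function names, the window lengths and the toy data are assumptions made for this example; this is not how tsfresh is implemented.

```python
import math
import statistics

# A fixed, hard-coded menu of aggregations (assumption for this sketch).
AGGREGATIONS = {
    "mean": statistics.fmean,
    "min": min,
    "max": max,
    "sum": sum,
}

def generate_features(series, windows=(2, 3, 5)):
    """Step 1: build one candidate feature per (aggregation, window) pair."""
    features = {}
    for window in windows:
        for name, func in AGGREGATIONS.items():
            features[f"{name}_{window}"] = [
                func(series[max(0, i - window + 1): i + 1])
                for i in range(len(series))
            ]
    return features

def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov / math.sqrt(var_x * var_y))

def select_features(features, target, keep):
    """Step 2: keep only the `keep` candidates most correlated with the target."""
    ranked = sorted(
        features,
        key=lambda name: abs_correlation(features[name], target),
        reverse=True,
    )
    return ranked[:keep]

# Toy data, made up for illustration.
series = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
target = [4.0, 2.0, 5.0, 3.0, 6.0, 9.0, 5.0, 7.0, 6.0, 4.0]

candidates = generate_features(series)
selected = select_features(candidates, target, keep=3)
print(len(candidates), "candidates generated,", len(selected), "kept")
```

Note that all candidates have to be held in memory before the selection step can discard most of them.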

By contrast, [getML](https://getml.com/product) uses feature learning: Feature learning adapts machine learning approaches such as decision trees or gradient boosting to the problem of extracting features from relational data and time series.

In this notebook, we will benchmark getML against [tsfresh](https://tsfresh.readthedocs.io/en/latest/). tsfresh is a popular Python library that uses brute-force methods as described above to generate features for time series.

As our example dataset, we use a publicly available dataset on [air pollution in Beijing](https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data), China. The data set was originally used in the following study:

Liang, X., Zou, T., Guo, B., Li, S., Zhang, H., Zhang, S., Huang, H. and Chen, S. X. (2015). Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating. Proceedings of the Royal Society A, 471, 20150257.

We find that:

- getML significantly outperforms tsfresh in terms of predictive accuracy (**R-squared of 60.9%** vs **R-squared of 48.7%**).
- getML consumes **considerably less memory** than tsfresh (**0.08 GB** vs **3.63 GB**).

Our findings indicate that feature learning algorithms are better at adapting to data sets and are also more scalable due to their lower memory requirement.

### A web frontend for getML

The getML monitor is a frontend built to support your work with getML. It displays information such as the imported data frames and trained pipelines, and allows easy data and feature exploration. You can launch the getML monitor [here](http://localhost:1709).

### Where is this running?

Your getML live session is running inside a docker container on [mybinder.org](https://mybinder.org/), a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.

# Analysis

Let's get started with the analysis and set up your session:

In [1]:
import gc
import os
import psutil
from urllib import request
import threading
import time
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline  

import getml

print(f"getML API version: {getml.__version__}\n")

getml.engine.set_project('air_pollution')

getML API version: 0.12.0

Loading existing project 'air_pollution'


## 1. Loading data

### 1.1 Download from source

We begin by downloading the data from the UCI Machine Learning repository.

In [2]:
fname = "PRSA_data_2010.1.1-2014.12.31.csv"

if not os.path.exists(fname):
    fname, res = request.urlretrieve(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/00381/" + fname, 
        fname
    )


### 1.2 Prepare data for tsfresh and getML

Our goal is to predict the pm2.5 concentration from factors such as weather or time of day. However, there are some missing entries for pm2.5, so we get rid of them.

In [3]:
data_full_pandas = pd.read_csv(fname)

data_full_pandas = data_full_pandas[
    data_full_pandas["pm2.5"] == data_full_pandas["pm2.5"]
]
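
The comparison of the column with itself works because NaN is the only floating-point value that is not equal to itself, so the mask is False exactly where pm2.5 is missing. A minimal pure-Python sketch of the same idea, using made-up readings:

```python
import math

# Made-up readings; NaN marks a missing measurement.
readings = [129.0, float("nan"), 148.0, float("nan"), 159.0]

# NaN != NaN, so `value == value` is False only for missing entries.
kept_by_self_comparison = [value for value in readings if value == value]

# The explicit check gives the same result and states the intent directly.
kept_by_isnan = [value for value in readings if not math.isnan(value)]

print(kept_by_self_comparison)
```

In pandas, `data_full_pandas["pm2.5"].notna()` expresses the same mask more explicitly.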

tsfresh requires a date column, so we build one.

In [4]:
def add_leading_zero(val):
    if len(str(val)) == 1:
        return "0" + str(val)
    return str(val)

data_full_pandas["month"] = [
    add_leading_zero(val) for val in data_full_pandas["month"]
]

data_full_pandas["day"] = [
    add_leading_zero(val) for val in data_full_pandas["day"]
]

data_full_pandas["hour"] = [
    add_leading_zero(val) for val in data_full_pandas["hour"]
]

def make_date(year, month, day, hour):
    return year + "-" + month + "-" + day + " " + hour + ":00:00"

data_full_pandas["date"] = [
    make_date(str(year), month, day, hour) \
    for year, month, day, hour in zip(
        data_full_pandas["year"],
        data_full_pandas["month"],
        data_full_pandas["day"],
        data_full_pandas["hour"],
    )
]
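
The strings built above follow the pattern `YYYY-MM-DD HH:00:00`. As a cross-check, the same string can be produced with the standard library's `datetime`, which takes care of the zero padding. The helper name `make_date_stdlib` is made up for this sketch and is not part of the notebook's pipeline.

```python
from datetime import datetime

def make_date_stdlib(year, month, day, hour):
    """Formats integer date parts as 'YYYY-MM-DD HH:00:00'."""
    return datetime(year, month, day, hour).strftime("%Y-%m-%d %H:%M:%S")

print(make_date_stdlib(2010, 1, 2, 0))
```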


tsfresh also requires the time series to have ids. Since there is only a single time series, every row gets the same id.

In [5]:
data_full_pandas["id"] = 1

The dataset now contains many columns that we do not need or that tsfresh cannot process. For instance, *cbwd* might actually contain useful information, but it is a categorical variable, which is difficult to handle for tsfresh, so we remove it.

We also want to split our data into a training and testing set.

In [6]:
data_train_pandas = data_full_pandas[data_full_pandas["year"] < 2014]
data_test_pandas = data_full_pandas[data_full_pandas["year"] == 2014]

In [7]:
def remove_unwanted_columns(df):
    del df["cbwd"]
    del df["year"]
    del df["month"]
    del df["day"]
    del df["hour"]
    del df["No"]
    return df

data_full_pandas = remove_unwanted_columns(data_full_pandas)
data_train_pandas = remove_unwanted_columns(data_train_pandas)
data_test_pandas = remove_unwanted_columns(data_test_pandas)

In [8]:
data_full_pandas

Unnamed: 0,pm2.5,DEWP,TEMP,PRES,Iws,Is,Ir,date,id
24,129.0,-16,-4.0,1020.0,1.79,0,0,2010-01-02 00:00:00,1
25,148.0,-15,-4.0,1020.0,2.68,0,0,2010-01-02 01:00:00,1
26,159.0,-11,-5.0,1021.0,3.57,0,0,2010-01-02 02:00:00,1
27,181.0,-7,-5.0,1022.0,5.36,1,0,2010-01-02 03:00:00,1
28,138.0,-7,-5.0,1022.0,6.25,2,0,2010-01-02 04:00:00,1
...,...,...,...,...,...,...,...,...,...
43819,8.0,-23,-2.0,1034.0,231.97,0,0,2014-12-31 19:00:00,1
43820,10.0,-22,-3.0,1034.0,237.78,0,0,2014-12-31 20:00:00,1
43821,10.0,-22,-3.0,1034.0,242.70,0,0,2014-12-31 21:00:00,1
43822,8.0,-22,-4.0,1034.0,246.72,0,0,2014-12-31 22:00:00,1


We then load the data into the getML engine. We begin by setting a project.

In [9]:
getml.engine.set_project('air_pollution')

Loading existing project 'air_pollution'


In [10]:
df_full = getml.data.DataFrame.from_pandas(data_full_pandas, name='full')
df_train = getml.data.DataFrame.from_pandas(data_train_pandas, name='train')
df_test = getml.data.DataFrame.from_pandas(data_test_pandas, name='test')

df_full["date"] = df_full["date"].as_ts()

We need to assign roles to the columns, such as defining the target column.

In [11]:
def set_roles(df):
    df.set_role(["date"], getml.data.roles.time_stamp)
    df.set_role(["pm2.5"], getml.data.roles.target)
    df.set_role([
        "DEWP", 
        "TEMP",
        "PRES",
        "Iws",
        "Is",
        "Ir"], getml.data.roles.numerical)

set_roles(df_full)
set_roles(df_train)
set_roles(df_test)

## 2. Tracking memory consumption

A major issue with brute-force approaches is their memory consumption. We would therefore like to be able to measure the memory consumption of different algorithms.

We will do so by tracking the overall system memory usage and then subtracting the initial memory usage from the peak system memory usage. This gives a good approximation as long as we do not start any other memory-heavy processes while training.
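
In other words, the reported figure is the peak reading minus the reading taken at the start. A minimal sketch of that arithmetic on made-up byte counts (the function name and the numbers are assumptions for illustration):

```python
def peak_consumption_gb(initial_bytes, samples_bytes):
    """Peak usage above the starting level, in GB (1 GB = 1e9 bytes)."""
    # The initial reading counts as a sample, so the result is never negative.
    peak_bytes = max([initial_bytes] + list(samples_bytes))
    return (peak_bytes - initial_bytes) / 1e9

# Hypothetical readings: the system starts at 4.0 GB and peaks at 5.2 GB.
initial = 4_000_000_000
samples = [4_100_000_000, 5_200_000_000, 4_600_000_000]

print(peak_consumption_gb(initial, samples))
```

Because the measurement is system-wide, anything else that allocates memory during the run is attributed to the algorithm being measured.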

In [12]:
class MemoryTracker():
    """
    The MemoryTracker measures the system's memory consumption
    once every second. It can be used to get an approximation of 
    the overall memory consumption of certain algorithms.
    """
    
    def __init__(self):
        self._initial_usage = 0
        self._max_usage = 0
        
        self._stop = False
        
        self.lock = threading.Lock()
        
        self.th = threading.Thread(
            target=self._measure_memory_usage,
        )
        
    def __del__(self):
        self.stop()
        
    def _get_memory_usage(self):
        return psutil.virtual_memory().used

    def _measure_memory_usage(self):
        while True:
            time.sleep(1)
            
            self.lock.acquire()
                                    
            if self._stop:
                self.lock.release()
                break
            
            current_usage = self._get_memory_usage()
            
            if current_usage > self._max_usage:
                self._max_usage = current_usage
            
            self.lock.release()

    @property
    def peak_consumption(self):
        """
        The peak system memory consumption, in GB
        """
        self.lock.acquire()
        
        p_con = self._max_usage - self._initial_usage
        
        self.lock.release()
        
        p_con /= 1e9
        
        return p_con
    
    def start(self):
        """
        Starts measuring the memory consumption.
        """
        self.lock.acquire()
        
        self._initial_usage = self._get_memory_usage()
        
        self._max_usage = self._initial_usage
        
        self._stop = False
        
        self.th = threading.Thread(
            target=self._measure_memory_usage,
        )
        
        self.th.start()
        
        self.lock.release()
        
    def stop(self):
        """
        Stops measuring the memory consumption.
        """
        self.lock.acquire()
        self._stop = True
        self.lock.release()

In [13]:
memory_tracker = MemoryTracker()

## 3. Predictive modelling


### 3.1 Pipeline 1: Complex features, 7 days

For our first experiment, we will learn complex features and allow a memory of up to seven days. That means at every given point in time, the algorithm is allowed to look back seven days into the past.

getML uses relational learning to construct the pipelines. Even though there is a simpler time series API, the relational API is more flexible, which is why we decided to use it.
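
To make the notion of memory concrete, the sketch below selects, for one reference time stamp, the rows that fall within a seven-day look-back window. It is a plain-Python illustration and not getML's implementation; in particular, whether the window boundaries are inclusive or exclusive is an assumption here.

```python
from datetime import datetime, timedelta

def rows_in_window(time_stamps, reference, memory):
    """Returns the time stamps in (reference - memory, reference]."""
    return [
        ts for ts in time_stamps
        if reference - memory < ts <= reference
    ]

# Hourly time stamps covering ten days (made-up range).
start = datetime(2010, 1, 2)
time_stamps = [start + timedelta(hours=h) for h in range(24 * 10)]

memory = timedelta(days=7)
reference = time_stamps[-1]

visible = rows_in_window(time_stamps, reference, memory)
print(len(visible), "of", len(time_stamps), "rows are visible")
print(memory.total_seconds(), "seconds of memory")
```

Seven days correspond to 604800 seconds, which is the unit in which the join reports its `memory`.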

In [14]:
population = getml.data.Placeholder('population')

peripheral = getml.data.Placeholder('peripheral')

population.join(
    peripheral,
    time_stamp='date',
    memory=getml.data.time.days(7)
)

population

placeholder,other placeholder,allow lagged targets,horizon,join keys used,memory,other join keys used,other time stamps used,relationship,time stamps used,upper time stamps used
population,peripheral,False,0.0,,604800.0,,date,many-to-many,date,


We then set up the features. We will use two different feature learning algorithms, namely MultirelModel and RelboostModel. Because we want complex features, we set *max_length* and *max_depth* to 7.

In [15]:
aggregations = [
    getml.feature_learning.aggregations.Avg,
    getml.feature_learning.aggregations.Sum,
    getml.feature_learning.aggregations.Min,
    getml.feature_learning.aggregations.Max,
    getml.feature_learning.aggregations.Median,
    getml.feature_learning.aggregations.Stddev
]

multirel = getml.feature_learning.MultirelModel(
    aggregation=aggregations,
    num_features=10,
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    seed=4367,
    max_length=7
)

relboost = getml.feature_learning.RelboostModel(
    num_features=10,
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    seed=4367,
    max_depth=7
)

predictor = getml.predictors.XGBoostRegressor()

pipe = getml.pipeline.Pipeline(
    tags=['memory: 7d', 'complex features'],
    population=population,
    peripheral=[peripheral],
    feature_learners=[multirel, relboost],
    predictors=[predictor]
)

pipe

It is good practice to always check your data model first, even though `check(...)` is also called by `fit(...)`. That enables us to make last-minute changes.

In [16]:
pipe.check(df_train, [df_full])

Checking data model...
OK.


We now fit our pipeline on the training set and evaluate it, both in-sample and out-of-sample.

In [17]:
memory_tracker.start()
pipe.fit(df_train, [df_full])
memory_tracker.stop()

print("Memory consumption: ")
print(memory_tracker.peak_consumption)

Checking data model...
OK.

MultirelModel: Training features...

RelboostModel: Training features...

MultirelModel: Building features...

RelboostModel: Building features...

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:11m:1.75538

Memory consumption: 
1.021763584


In [18]:
in_sample = pipe.score(df_train, [df_full])
print('In sample:', in_sample)

out_of_sample = pipe.score(df_test, [df_full])
print('Out of sample:', out_of_sample)


MultirelModel: Building features...

RelboostModel: Building features...

In sample: {'mae': [30.921433475676267], 'rmse': [43.9544175981943], 'rsquared': [0.7706751157318166]}

MultirelModel: Building features...

RelboostModel: Building features...

Out of sample: {'mae': [40.097575759601405], 'rmse': [58.56721852677215], 'rsquared': [0.6085457300725953]}
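
The `mae` and `rmse` entries are the mean absolute error and the root mean squared error. A small sketch of both on made-up numbers follows; how getML computes `rsquared` is not stated in this notebook, so it is left out here.

```python
import math

def mae(targets, predictions):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)

def rmse(targets, predictions):
    """Root mean squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    return math.sqrt(squared / len(targets))

# Made-up targets and predictions, for illustration only.
targets = [129.0, 148.0, 159.0, 181.0]
predictions = [120.0, 150.0, 170.0, 175.0]

print(mae(targets, predictions), rmse(targets, predictions))
```

RMSE penalizes large errors more heavily than MAE, which is why it is the larger of the two in the scores above.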


### 3.2 Pipeline 2: Complex features, 1 day

In [19]:
population = getml.data.Placeholder('population')

peripheral = getml.data.Placeholder('peripheral')

population.join(
    peripheral,
    time_stamp='date',
    memory=getml.data.time.days(1)
)

population

placeholder,other placeholder,allow lagged targets,horizon,join keys used,memory,other join keys used,other time stamps used,relationship,time stamps used,upper time stamps used
population,peripheral,False,0.0,,86400.0,,date,many-to-many,date,


In [20]:
aggregations = [
    getml.feature_learning.aggregations.Avg,
    getml.feature_learning.aggregations.Sum,
    getml.feature_learning.aggregations.Min,
    getml.feature_learning.aggregations.Max,
    getml.feature_learning.aggregations.Median,
    getml.feature_learning.aggregations.Stddev
]

multirel = getml.feature_learning.MultirelModel(
    aggregation=aggregations,
    num_features=10,
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    seed=4367,
    max_length=0
)

relboost = getml.feature_learning.RelboostModel(
    num_features=10,
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    seed=4367,
    max_depth=5
)

predictor = getml.predictors.XGBoostRegressor()

pipe = getml.pipeline.Pipeline(
    tags=['memory: 1d', 'complex features'],
    population=population,
    peripheral=[peripheral],
    feature_learners=[multirel, relboost],
    predictors=[predictor]
)

pipe

In [21]:
pipe.check(df_train, [df_full])

Checking data model...
OK.


In [22]:
memory_tracker.start()
pipe.fit(df_train, [df_full])
memory_tracker.stop()

print("Memory consumption: ")
print(memory_tracker.peak_consumption)

Checking data model...
OK.

MultirelModel: Training features...

RelboostModel: Training features...

MultirelModel: Building features...

RelboostModel: Building features...

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:2m:35.041272

Memory consumption: 
0.01992704


In [23]:
in_sample = pipe.score(df_train, [df_full])
print('In sample:', in_sample)

out_of_sample = pipe.score(df_test, [df_full])
print('Out of sample:', out_of_sample)


MultirelModel: Building features...

RelboostModel: Building features...

In sample: {'mae': [38.17490033531624], 'rmse': [54.7799460058], 'rsquared': [0.6442914910608807]}

MultirelModel: Building features...

RelboostModel: Building features...

Out of sample: {'mae': [44.95119658580155], 'rmse': [66.09828658228155], 'rsquared': [0.5018899991896832]}


### 3.3 Pipeline 3: Simple features, 7 days

For our third experiment, we will learn simple features and allow a memory of up to seven days.

This simplicity is accomplished by learning 20 features using MultirelModel with a *max_length* of 0. That means that MultirelModel is not allowed to learn any conditions. The resulting features can be expected to be very similar to the features produced by tsfresh.
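
The difference between a feature without conditions and one with a condition can be sketched in plain Python. The column names follow the dataset, but the rows and the threshold are made up, and the conditional feature is only an illustration of the idea, not a feature that getML actually learned.

```python
from statistics import fmean

# Hypothetical rows inside one look-back window.
window = [
    {"TEMP": -4.0, "Iws": 1.79},
    {"TEMP": -5.0, "Iws": 3.57},
    {"TEMP": 2.0, "Iws": 6.25},
    {"TEMP": 3.0, "Iws": 8.05},
]

# Simple feature: a plain aggregation over every row in the window.
avg_iws = fmean(row["Iws"] for row in window)

# Feature with a condition: the same aggregation, restricted to the
# rows that satisfy a threshold on another column.
avg_iws_when_warm = fmean(row["Iws"] for row in window if row["TEMP"] > 0.0)

print(avg_iws, avg_iws_when_warm)
```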

In [24]:
population = getml.data.Placeholder('population')

peripheral = getml.data.Placeholder('peripheral')

population.join(
    peripheral,
    time_stamp='date',
    memory=getml.data.time.days(7)
)

population

placeholder,other placeholder,allow lagged targets,horizon,join keys used,memory,other join keys used,other time stamps used,relationship,time stamps used,upper time stamps used
population,peripheral,False,0.0,,604800.0,,date,many-to-many,date,


In [25]:
aggregations = [
    getml.feature_learning.aggregations.Avg,
    getml.feature_learning.aggregations.Sum,
    getml.feature_learning.aggregations.Min,
    getml.feature_learning.aggregations.Max,
    getml.feature_learning.aggregations.Median,
    getml.feature_learning.aggregations.Stddev
]

multirel = getml.feature_learning.MultirelModel(
    aggregation=aggregations,
    num_features=20,
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    seed=4367,
    max_length=0
)

predictor = getml.predictors.XGBoostRegressor()

pipe = getml.pipeline.Pipeline(
    tags=['memory: 7d', 'simple features'],
    population=population,
    peripheral=[peripheral],
    feature_learners=[multirel],
    predictors=[predictor]
)

pipe

In [26]:
pipe.check(df_train, [df_full])

Checking data model...
OK.


In [27]:
memory_tracker.start()
pipe.fit(df_train, [df_full])
memory_tracker.stop()

print("Memory consumption: ")
print(memory_tracker.peak_consumption)

Checking data model...
OK.

MultirelModel: Training features...

MultirelModel: Building features...

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:1m:44.331903

Memory consumption: 
0.25921536


In [28]:
in_sample = pipe.score(df_train, [df_full])
print('In sample:', in_sample)

out_of_sample = pipe.score(df_test, [df_full])
print('Out of sample:', out_of_sample)


MultirelModel: Building features...

In sample: {'mae': [44.39174692957318], 'rmse': [62.05820701761394], 'rsquared': [0.5578533318605543]}

MultirelModel: Building features...

Out of sample: {'mae': [51.86212120127656], 'rmse': [71.42298768943652], 'rsquared': [0.4305880166321852]}


### 3.4 Pipeline 4: Simple features, 1 day

For our fourth experiment, we will learn simple features and allow a memory of up to one day.

As we will see, tsfresh consumes a lot of memory. Looking further into the past increases the memory requirement to the point that looking back up to 7 days is not feasible on a normal desktop computer. For reasons we will discuss later, we want to replicate the tsfresh features using getML's greedy approach.

In [29]:
population = getml.data.Placeholder('population')

peripheral = getml.data.Placeholder('peripheral')

population.join(
    peripheral,
    time_stamp='date',
    memory=getml.data.time.days(1)
)

population

placeholder,other placeholder,allow lagged targets,horizon,join keys used,memory,other join keys used,other time stamps used,relationship,time stamps used,upper time stamps used
population,peripheral,False,0.0,,86400.0,,date,many-to-many,date,


In [30]:
aggregations = [
    getml.feature_learning.aggregations.Avg,
    getml.feature_learning.aggregations.Sum,
    getml.feature_learning.aggregations.Min,
    getml.feature_learning.aggregations.Max,
    getml.feature_learning.aggregations.Median,
    getml.feature_learning.aggregations.Stddev
]

multirel = getml.feature_learning.MultirelModel(
    aggregation=aggregations,
    num_features=20,
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    seed=4367,
    max_length=0
)

predictor = getml.predictors.XGBoostRegressor()

pipe = getml.pipeline.Pipeline(
    tags=['memory: 1d', 'simple features'],
    population=population,
    peripheral=[peripheral],
    feature_learners=[multirel],
    predictors=[predictor]
)

pipe

In [31]:
pipe.check(df_train, [df_full])

Checking data model...
OK.


In [32]:
memory_tracker.start()
pipe.fit(df_train, [df_full])
memory_tracker.stop()

print("Memory consumption: ")
print(memory_tracker.peak_consumption)

Checking data model...
OK.

MultirelModel: Training features...

MultirelModel: Building features...

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:1m:8.451705

Memory consumption: 
0.17676288


In [33]:
in_sample = pipe.score(df_train, [df_full])
print('In sample:', in_sample)

out_of_sample = pipe.score(df_test, [df_full])
print('Out of sample:', out_of_sample)


MultirelModel: Building features...

In sample: {'mae': [44.076551880232735], 'rmse': [62.61924427999326], 'rsquared': [0.5377020375473408]}

MultirelModel: Building features...

Out of sample: {'mae': [48.43578703042721], 'rmse': [68.2086724841597], 'rsquared': [0.4713164136964429]}


### 3.5 Using tsfresh

tsfresh is a rather low-level library. To make things a bit easier, we write a high-level wrapper.

To limit the memory consumption, we undertake the following steps:

- We limit ourselves to a memory of 1 day from any point in time. This is necessary because tsfresh duplicates records for every time stamp, so looking back 7 days instead of 1 day would make the memory consumption roughly seven times as high.
- We extract only tsfresh's MinimalFCParameters and IndexBasedFCParameters (the latter is a superset of TimeBasedFCParameters).
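
The effect of the look-back window on the size of the rolled data can be estimated with a back-of-the-envelope count. The sketch assumes that rolling produces one sub-series per time stamp containing at most `window` rows; the exact window length tsfresh uses for a given `max_timeshift` may differ by one, and the function name is made up.

```python
def rolled_row_count(num_time_stamps, window):
    """Total rows after copying up to `window` past rows for each time stamp."""
    return sum(min(i + 1, window) for i in range(num_time_stamps))

num_time_stamps = 10_000  # hypothetical length of an hourly series

rows_1_day = rolled_row_count(num_time_stamps, window=24)
rows_7_days = rolled_row_count(num_time_stamps, window=24 * 7)

print(rows_1_day, rows_7_days, rows_7_days / rows_1_day)
```

The ratio approaches seven as the series gets longer, because the shorter windows at the start of the series matter less and less.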

In order to make sure that tsfresh's features can be compared to getML's features, we also do the following:

- We apply tsfresh's built-in feature selection algorithm.
- Of the remaining features, we only keep the 20 features most correlated with the target (in terms of the absolute value of the correlation).
- We add the original columns as additional features.

We do this because we used getML to train 20 features, but have also kept the original columns.


In [34]:
# import tsfresh
# from tsfresh.utilities.dataframe_functions import roll_time_series
# from tsfresh.feature_selection.relevance import calculate_relevance_table


In [35]:
class TSFreshBuilder():
    
    def __init__(self, num_features, memory, column_id, time_stamp, target):
        """
        Scikit-learn style feature builder based on TSFresh.
        
        Args:
            
            num_features: The (maximum) number of features to build.
            
            memory: How much back in time you want to go until the
                    feature builder starts "forgetting" data.
                    
            column_id: The name of the column containing the ids.
            
            time_stamp: The name of the column containing the time stamps.
            
            target: The name of the target column.
        """
        self.num_features = num_features
        self.memory = memory
        self.column_id = column_id
        self.time_stamp = time_stamp
        self.target = target
        
        self.selected_features = []
        
    def _add_original_columns(self, original_df, df_selected):
        for colname in original_df.columns:
            df_selected[colname] = np.asarray(
                original_df[colname])
                    
        return df_selected

    def _extract_features(self, df):
        df_rolled = roll_time_series(
            df, 
            column_id=self.column_id, 
            column_sort=self.time_stamp,
            max_timeshift=self.memory
        )

        extracted_minimal = tsfresh.extract_features(
            df_rolled,
            column_id=self.column_id, 
            column_sort=self.time_stamp,
            default_fc_parameters=tsfresh.feature_extraction.MinimalFCParameters()
        )
        
        extracted_index_based = tsfresh.extract_features(
            df_rolled,
            column_id=self.column_id, 
            column_sort=self.time_stamp,
            default_fc_parameters=tsfresh.feature_extraction.settings.IndexBasedFCParameters()
        )
        
        extracted_features = pd.concat(
            [extracted_minimal, extracted_index_based], axis=1
        )
        del extracted_minimal
        del extracted_index_based
        
        gc.collect()
        
        extracted_features[
            extracted_features != extracted_features] = 0.0  
        
        extracted_features[
            np.isinf(extracted_features)] = 0.0 
        
        return extracted_features
        
    def _print_time_taken(self, begin, end):

        seconds = end - begin

        hours = int(seconds / 3600)
        seconds -= float(hours * 3600)

        minutes = int(seconds / 60)
        seconds -= float(minutes * 60)

        seconds = round(seconds, 6)

        print(
            "Time taken: " + str(hours) + "h:" +
            str(minutes) + "m:" + str(seconds)
        )

        print("")
        
    def _remove_target_column(self, df):
        colnames = np.asarray(df.columns)
        
        if self.target not in colnames:
            return df
        
        colnames = colnames[colnames != self.target]
        
        return df[colnames]
        
    def _select_features(self, df, target):
        df_selected = tsfresh.select_features(
            df, 
            target
        )
        
        colnames = np.asarray(df_selected.columns)

        correlations = np.asarray([
            np.abs(pearsonr(target, df_selected[col]))[0] for col in colnames
        ])
        
        # [::-1] is somewhat unintuitive syntax,
        # but it reverses the entire array.
        self.selected_features = colnames[
            np.argsort(correlations)
        ][::-1][:self.num_features]

        return df_selected[self.selected_features]
        
    def fit(self, df):
        """
        Fits the features.
        """
        begin = time.time()

        target = np.asarray(df[self.target])
        
        df_without_target = self._remove_target_column(df)
        
        df_extracted = self._extract_features(
            df_without_target)
        
        df_selected = self._select_features(
            df_extracted, target)
                
        del df_extracted
        gc.collect()
        
        df_selected = self._add_original_columns(df, df_selected)

        end = time.time()
        
        self._print_time_taken(begin, end)
        
        return df_selected
    
    def transform(self, df):
        """
        Transforms the raw data into a set of features.
        """
        df_extracted = self._extract_features(df)
        
        df_selected = df_extracted[self.selected_features]
        
        del df_extracted
        gc.collect()
        
        df_selected = self._add_original_columns(df, df_selected)
                                         
        return df_selected

In [36]:
data_train_pandas

Unnamed: 0,pm2.5,DEWP,TEMP,PRES,Iws,Is,Ir,date,id
24,129.0,-16,-4.0,1020.0,1.79,0,0,2010-01-02 00:00:00,1
25,148.0,-15,-4.0,1020.0,2.68,0,0,2010-01-02 01:00:00,1
26,159.0,-11,-5.0,1021.0,3.57,0,0,2010-01-02 02:00:00,1
27,181.0,-7,-5.0,1022.0,5.36,1,0,2010-01-02 03:00:00,1
28,138.0,-7,-5.0,1022.0,6.25,2,0,2010-01-02 04:00:00,1
...,...,...,...,...,...,...,...,...,...
35059,22.0,-19,7.0,1013.0,114.87,0,0,2013-12-31 19:00:00,1
35060,18.0,-21,7.0,1014.0,119.79,0,0,2013-12-31 20:00:00,1
35061,23.0,-21,7.0,1014.0,125.60,0,0,2013-12-31 21:00:00,1
35062,20.0,-21,6.0,1014.0,130.52,0,0,2013-12-31 22:00:00,1


In [37]:
builder = TSFreshBuilder(
    num_features=20,
    memory=24,
    column_id="id",
    time_stamp="date",
    target="pm2.5")

One of the issues with tsfresh is that it actually requires more memory than mybinder allows. We therefore have to comment out the parts that relate to it.

In [38]:
# memory_tracker.start()
# tsfresh_training = builder.fit(data_train_pandas)
# memory_tracker.stop()

print("Memory consumption: ")
print(memory_tracker.peak_consumption)

Memory consumption: 
0.17676288


In [39]:
# tsfresh_test = builder.transform(data_test_pandas)

tsfresh does not contain built-in machine learning algorithms. In order to ensure a fair comparison, we use the exact same machine learning algorithm we have also used for getML: An XGBoost regressor with all hyperparameters set to their default value.

In order to do so, we load the tsfresh features into the getML engine.

In [40]:
# df_tsfresh_training = getml.data.DataFrame.from_pandas(tsfresh_training, name='tsfresh_training')
# df_tsfresh_test = getml.data.DataFrame.from_pandas(tsfresh_test, name='tsfresh_test')

As usual, we need to set roles:

In [41]:
def set_roles_tsfresh(df):
    df["date"] = df["date"].as_ts()
    df.set_role(["pm2.5"], getml.data.roles.target)
    df.set_role(["date"], getml.data.roles.time_stamp)
    df.set_role(df.unused_names, getml.data.roles.numerical)
    df.set_role(["id"], getml.data.roles.unused_float)
    return df

# df_tsfresh_training = set_roles_tsfresh(df_tsfresh_training)
# df_tsfresh_test = set_roles_tsfresh(df_tsfresh_test)

In this case, our pipeline is very simple. It only consists of a single XGBoostRegressor.

In [42]:
predictor = getml.predictors.XGBoostRegressor()

pipe = getml.pipeline.Pipeline(
    tags=['tsfresh', 'memory: 1d'],
    predictors=[predictor]
)

pipe

In [43]:
# pipe.check(df_tsfresh_training)

In [44]:
# pipe.fit(df_tsfresh_training)

In [45]:
# in_sample = pipe.score(df_tsfresh_training)
# print('In sample:', in_sample)

# out_of_sample = pipe.score(df_tsfresh_test)
# print('Out of sample:', out_of_sample)

## 4. Discussion

We have seen that getML outperforms tsfresh by more than 10 percentage points in terms of R-squared. We now want to analyze why that is.

There are two possible hypotheses:

- getML outperforms tsfresh because it uses feature learning and is able to produce more complex features.
- getML outperforms tsfresh, because it makes better use of memory and is able to look back further.

Let's summarize our findings:


Name       | Look-back | Feature complexity | R-squared | RMSE | Memory usage
---------- | --------- | ------------------ | --------- | ---- | ------------ 
Pipeline 1 |    7 days |            complex |     60.9% | 58.6 | 0.08 GB
Pipeline 2 |     1 day |            complex |     50.2% | 66.1 | 0.02 GB
Pipeline 3 |    7 days |             simple |     43.0% | 71.4 | 0.07 GB
Pipeline 4 |     1 day |             simple |     47.1% | 68.2 | 0.07 GB
tsfresh    |     1 day |             simple |     48.7% | 67.4 | 3.63 GB


We have built simple features and complex features, and we also differentiate between looking back 1 day and looking back 7 days. When we look back one day and allow only simple features, getML produces features that are very similar to those of tsfresh. It is therefore unsurprising that its performance is about on par with the performance of tsfresh. It is actually a bit worse, because getML uses a greedy algorithm and there is a price we pay for that. But since the greedy algorithm also allows us to build more complex features, its benefits outweigh its costs.

Let's also compare the memory consumption: Even Pipeline 1, which has the highest memory consumption of all of getML's pipelines, only consumes about 2.2% of the memory that tsfresh needs. This is in part due to the way tsfresh is implemented, but it is also an inherent problem: If your approach is to generate many features to then select a small share, you will need a lot of memory. In theory, it is possible to write a more memory-efficient implementation of tsfresh and then use a look-back of 7 days, but when we compare Pipeline 3 and Pipeline 4, we can conclude that it is unlikely that this would improve tsfresh's predictive performance.
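
The headline figures can be re-derived from the summary table with a few lines of arithmetic:

```python
# Values copied from the summary table above.
memory_gb = {"Pipeline 1": 0.08, "tsfresh": 3.63}
r_squared_pct = {"Pipeline 1": 60.9, "tsfresh": 48.7}

# Share of tsfresh's memory that Pipeline 1 needs, in percent.
memory_share_pct = 100.0 * memory_gb["Pipeline 1"] / memory_gb["tsfresh"]

# Difference in out-of-sample R-squared, in percentage points.
r_squared_gap_pp = r_squared_pct["Pipeline 1"] - r_squared_pct["tsfresh"]

print(f"Memory share: {memory_share_pct:.1f}%")
print(f"R-squared gap: {r_squared_gap_pp:.1f} percentage points")
```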

The summary table shows that a combination of both of our hypotheses explains why getML outperforms tsfresh. Complex features do better than simple features when looking back one day. When looking back seven days, simple features actually get worse. But when you look back seven days and allow more complex features, you really get good results.

This suggests that getML outperforms tsfresh because it can make more efficient use of memory and thus look back further. Because getML uses feature learning and can build more complex features, it can make better use of the greater look-back window.

## 5. Conclusion

We have compared getML's feature learning algorithms to tsfresh's brute-force feature engineering approaches on a data set related to air pollution in China. We found that getML significantly outperforms tsfresh. These results are consistent with the view that feature learning is better than brute-force feature engineering.

You are encouraged to reproduce these results. You will need getML (https://getml.com/product) and tsfresh (https://tsfresh.readthedocs.io/en/latest/). You can download both for free.

# Next Steps

If you want to learn more about getML, here are some additional tutorials and articles that will help you:

__Tutorials:__
* [Loan default prediction: Introduction to relational learning](loans_demo.ipynb)
* [Occupancy detection: A multivariate time series example](occupancy_demo.ipynb)  
* [Expenditure categorization: Why relational learning matters](consumer_expenditures_demo.ipynb)
* [Disease lethality prediction: Feature engineering and the curse of dimensionality](atherosclerosis_demo.ipynb)
* [Traffic volume prediction: Feature engineering on multivariate time series](interstate94_demo.ipynb)
* [Air pollution prediction: Why feature learning outperforms brute-force approaches](air_pollution_demo.ipynb) 


__User Guides__ (from our [documentation](https://docs.getml.com/latest/)):
* [Feature learning with Multirel](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#multirel)
* [Feature learning with Relboost](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relboost)



# Get into contact

If you have any questions, schedule a [call with Alex](https://go.getml.com/meetings/alexander-uhlig/getml-demo), the co-founder of getML, or write us an [email](team@getml.com). Prefer a private demo of getML? Just contact us to make an appointment.