# Feature engineering on sensor data - how to overcome feature explosion

This notebook illustrates how to overcome the feature explosion problem, using an example dataset of sensor readings from a robot arm.


Summary:

- Prediction type: __Regression__
- Domain: __Robotics__
- Prediction target: __The force vector on the robot's arm__ 
- Population size: __15001__

_Author: Dr. Patrick Urbanke_

## The data set

The data set has been generously provided by Erik Berger, who originally collected it for his dissertation:

> Berger, E. (2018). *Behavior-Specific Proprioception Models for Robotic Force Estimation: A Machine Learning Approach.* Freiberg, Germany: Technische Universitaet Bergakademie Freiberg.

## A web frontend for getML

The getML monitor is a web frontend that supports your work with getML. It displays the imported data frames and trained pipelines and lets you explore data and features. You can launch the getML monitor [here](http://localhost:1709).

## 1. Loading data

We begin by importing the libraries and setting the project.

In [11]:
import datetime
import os
from urllib import request
import time

import pandas as pd
import numpy as np
import getml
import getml.data as data
import getml.database as database
import getml.engine as engine
import getml.feature_learning.aggregations as agg
import getml.data.roles as roles

from utils import FTTimeSeriesBuilder, TSFreshBuilder

from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline  
 
getml.engine.set_project('robot')


Connected to project 'robot'


### 1.1 Download from source


In [12]:
fname = "robot-demo.csv"

if not os.path.exists(fname):
    fname, res = request.urlretrieve(
        "https://static.getml.com/datasets/robotarm/" + fname, 
        fname
    )

In [13]:
data_all = getml.data.DataFrame.from_csv("robot-demo.csv", "data_all")

In [14]:
data_all

Name,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,98,99,100,101,102,103,104,105,106,f_x,f_y,f_z
Role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float
0.0,3.4098,-0.3274,0.9604,-3.7436,-1.0191,-6.0205,0,0,0,0,0,0,0,0,0,0,0,0,8.38e-17,-4.8116,-1.4033,-0.1369,0.002472,0,9.803e-16,-55.642,-16.312,-1.2042,0.02167,0,3.4098,-0.3274,0.9605,-3.7437,-1.0191,-6.0205,0,0,0,0,0,0,0.1233,-6.5483,-2.8045,-0.8296,0.07625,-0.1906,0.1211,-6.5483,-2.8157,-0.8281,0.07015,-0.1983,0.7699,0.41,0.08279,-1.4094,0.786,-0.3682,0,0,0,0,0,0,-22.654,-11.503,-18.673,-3.5155,5.8354,-2.05,0.7699,0.41,0.08278,-1.4094,0.786,-0.3681,0,0,0,0,0,0,48.069,48.009,0.9668,47.834,47.925,47.818,47.834,47.955,47.971,-11.03,6.9,-7.33
1.0,3.4098,-0.3274,0.9604,-3.7436,-1.0191,-6.0205,0,0,0,0,0,0,0,0,0,0,0,0,8.38e-17,-4.8116,-1.4033,-0.1369,0.002472,0,9.803e-16,-55.642,-16.312,-1.2042,0.02167,0,3.4098,-0.3274,0.9604,-3.7437,-1.0191,-6.0205,0,0,0,0,0,0,0.1188,-6.5506,-2.8404,-0.8281,0.06405,-0.1998,0.1211,-6.5483,-2.8157,-0.8281,0.07015,-0.1983,0.7699,0.41,0.0828,-1.4094,0.7859,-0.3682,0,0,0,0,0,0,-21.627,-11.046,-18.66,-3.5395,5.7577,-1.9805,0.7699,0.41,0.08278,-1.4094,0.786,-0.3681,0,0,0,0,0,0,48.009,48.009,0.8594,47.834,47.925,47.818,47.834,47.955,47.971,-10.848,6.7218,-7.4427
2.0,3.4098,-0.3274,0.9604,-3.7436,-1.0191,-6.0205,0,0,0,0,0,0,0,0,0,0,0,0,8.38e-17,-4.8116,-1.4033,-0.1369,0.002472,0,9.803e-16,-55.642,-16.312,-1.2042,0.02167,0,3.4098,-0.3274,0.9605,-3.7437,-1.0191,-6.0205,0,0,0,0,0,0,0.1099,-6.5438,-2.8,-0.8205,0.07473,-0.183,0.1211,-6.5483,-2.8157,-0.8281,0.07015,-0.1922,0.7699,0.41,0.08279,-1.4094,0.7859,-0.3682,0,0,0,0,0,0,-23.843,-12.127,-18.393,-3.6453,5.978,-1.9978,0.7699,0.41,0.08278,-1.4094,0.786,-0.3681,0,0,0,0,0,0,48.009,48.069,0.931,47.879,47.925,47.818,47.834,47.955,47.971,-10.666,6.5436,-7.5555
3.0,3.4098,-0.3274,0.9604,-3.7436,-1.0191,-6.0205,0,0,0,0,0,0,0,0,0,0,0,0,8.38e-17,-4.8116,-1.4033,-0.1369,0.002472,0,9.803e-16,-55.642,-16.312,-1.2042,0.02167,0,3.4098,-0.3273,0.9604,-3.7437,-1.0191,-6.0205,0,0,0,0,0,0,0.1233,-6.5483,-2.8224,-0.8266,0.07168,-0.1998,0.1211,-6.5483,-2.8157,-0.8281,0.07015,-0.1967,0.7699,0.41,0.08275,-1.4094,0.786,-0.3681,0,0,0,0,0,0,-21.772,-10.872,-18.691,-3.5512,5.6648,-1.9976,0.7699,0.41,0.08278,-1.4094,0.786,-0.3681,0,0,0,0,0,0,48.069,48.069,0.931,47.879,47.925,47.818,47.834,47.955,47.971,-10.507,6.4533,-7.65
4.0,3.4098,-0.3274,0.9604,-3.7436,-1.0191,-6.0205,0,0,0,0,0,0,0,0,0,0,0,0,8.38e-17,-4.8116,-1.4033,-0.1369,0.002472,0,9.803e-16,-55.642,-16.312,-1.2042,0.02167,0,3.4098,-0.3274,0.9604,-3.7437,-1.0191,-6.0205,0,0,0,0,0,0,0.1255,-6.5394,-2.8,-0.8327,0.07473,-0.1952,0.1211,-6.5483,-2.8157,-0.8327,0.07015,-0.1922,0.7699,0.41,0.08278,-1.4094,0.786,-0.3681,0,0,0,0,0,0,-22.823,-11.645,-18.524,-3.5305,5.8712,-2.0096,0.7699,0.41,0.08278,-1.4094,0.786,-0.3681,0,0,0,0,0,0,48.069,48.069,0.8952,47.879,47.925,47.818,47.834,47.955,47.971,-10.413,6.6267,-7.69
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14996.0,3.0837,-0.8836,1.4501,-2.2102,-1.559,-5.3265,-0.03151,-0.05375,0.04732,0.1482,-0.05218,0.06706,0.2969,0.5065,-0.4459,-1.3963,0.4916,-0.6319,-0.3694,-4.1879,-1.1847,-0.09441,-0.1568,0.1898,1.1605,-42.951,-19.023,-2.6343,0.1551,-0.1338,3.0836,-0.8836,1.4503,-2.2101,-1.5591,-5.3263,-0.03347,-0.05585,0.04805,0.151,-0.05513,0.07114,-0.3564,-6.0394,-2.3001,-0.2181,-0.1159,0.09608,-0.3632,-6.0394,-2.3023,-0.212,-0.125,0.1113,0.7116,0.06957,0.06036,-0.8506,2.9515,-0.03352,-0.03558,-0.03029,0.002444,-0.04208,0.1458,-0.1098,-0.8784,-0.07291,-37.584,0.0001132,-2.1031,0.03318,0.7117,0.0697,0.06044,-0.8511,2.951,-0.03356,-0.03508,-0.02849,0.001571,-0.03951,0.1442,-0.1036,48.069,48.009,0.8952,47.818,47.834,47.818,47.803,47.94,47.94,10.84,-1.41,16.14
14997.0,3.0835,-0.884,1.4505,-2.2091,-1.5594,-5.326,-0.02913,-0.0497,0.04376,0.137,-0.04825,0.062,0.2969,0.5065,-0.4459,-1.3963,0.4916,-0.6319,-0.3677,-4.1837,-1.1874,-0.09682,-0.1562,0.189,1.1592,-42.937,-19.023,-2.6331,0.1545,-0.1338,3.0833,-0.8841,1.4507,-2.209,-1.5596,-5.3258,-0.02909,-0.04989,0.04198,0.1481,-0.05465,0.06249,-0.3161,-6.1179,-2.253,-0.3752,-0.03965,0.08693,-0.3273,-6.1022,-2.2597,-0.366,-0.05033,0.0915,0.7114,0.06932,0.06039,-0.8497,2.953,-0.03359,-0.0335,-0.02723,0.001208,-0.04242,0.1428,-0.0967,-2.7137,0.8552,-38.514,-0.6088,-3.2383,-0.9666,0.7114,0.06948,0.06045,-0.8503,2.9525,-0.03359,-0.03246,-0.02633,0.001469,-0.03657,0.1333,-0.09571,48.009,48.009,0.8594,47.818,47.834,47.818,47.803,47.94,47.94,10.857,-1.52,15.943
14998.0,3.0833,-0.8844,1.4508,-2.208,-1.5598,-5.3256,-0.02676,-0.04565,0.04019,0.1258,-0.04431,0.05695,0.2969,0.5065,-0.4459,-1.3963,0.4916,-0.6319,-0.3659,-4.1797,-1.1901,-0.09922,-0.1555,0.1881,1.1579,-42.924,-19.023,-2.6321,0.154,-0.1338,3.0831,-0.8844,1.451,-2.2078,-1.56,-5.3253,-0.02776,-0.04382,0.03652,0.1295,-0.05064,0.04818,-0.343,-6.2569,-2.1566,-0.3035,0.00305,0.1434,-0.3385,-6.2322,-2.1589,-0.302,-0.00915,0.1571,0.7111,0.06912,0.06039,-0.849,2.9544,-0.0337,-0.02911,-0.02589,0.001292,-0.04046,0.1246,-0.08058,4.2749,1.0128,-36.412,-1.2811,-0.4296,-1.1013,0.7112,0.06928,0.06046,-0.8495,2.9538,-0.03362,-0.02984,-0.02417,0.001364,-0.03362,0.1224,-0.08786,48.009,48.009,0.931,47.818,47.834,47.818,47.803,47.94,47.94,10.89,-1.74,15.55
14999.0,3.0831,-0.8847,1.4511,-2.2071,-1.5602,-5.3251,-0.02438,-0.0416,0.03662,0.1147,-0.04038,0.0519,0.2969,0.5065,-0.4459,-1.3963,0.4916,-0.6319,-0.3642,-4.1758,-1.1928,-0.1016,-0.1548,0.1873,1.1568,-42.912,-19.023,-2.6311,0.1535,-0.1338,3.0829,-0.8848,1.4513,-2.2068,-1.5604,-5.3249,-0.02149,-0.04059,0.03417,0.1202,-0.0395,0.04178,-0.4237,-6.2703,-2.0939,-0.302,-0.01372,0.1739,-0.4125,-6.2569,-2.0916,-0.2943,-0.02898,0.1891,0.7109,0.06894,0.06039,-0.8484,2.9557,-0.03384,-0.02738,-0.01982,0.001031,-0.03028,0.1157,-0.06702,11.518,1.5002,-39.314,-1.8671,-0.3734,-0.5733,0.7109,0.06909,0.06047,-0.8488,2.955,-0.03364,-0.02721,-0.02201,0.001255,-0.03067,0.1115,-0.08003,48.009,48.009,0.931,47.818,47.834,47.818,47.803,47.94,47.94,11.29,-1.4601,15.743


### 1.2 Prepare data

The force vector consists of three components (*f_x*, *f_y* and *f_z*), meaning that we have three targets. For this comparison, we only predict the first component (*f_x*).

Also, we want to speed things up a little, so we only use 10 columns. A previous analysis has revealed that the predictive power is mainly extracted from these 10 columns.

In [15]:
only_use = ['30', '34', '37', '38', '4', '59', '61', '7', '77', '78']

data_all.set_role(["f_x"], getml.data.roles.target)
data_all.set_role(only_use, getml.data.roles.numerical)

This is what the data set looks like:

In [16]:
data_all

Name,f_x,30,34,37,38,4,59,61,7,77,78,3,5,6,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,31,32,33,35,36,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,60,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,79,80,81,82,83,84,85,86,98,99,100,101,102,103,104,105,106,f_y,f_z
Role,target,numerical,numerical,numerical,numerical,numerical,numerical,numerical,numerical,numerical,numerical,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float
0.0,-11.03,-1.2042,-0.3274,-1.0191,-6.0205,-0.3274,0.08279,0.786,-1.0191,0.08278,-1.4094,3.4098,0.9604,-3.7436,-6.0205,0,0,0,0,0,0,0,0,0,0,0,0,8.38e-17,-4.8116,-1.4033,-0.1369,0.002472,0,9.803e-16,-55.642,-16.312,0.02167,0,3.4098,0.9605,-3.7437,0,0,0,0,0,0,0.1233,-6.5483,-2.8045,-0.8296,0.07625,-0.1906,0.1211,-6.5483,-2.8157,-0.8281,0.07015,-0.1983,0.7699,0.41,-1.4094,-0.3682,0,0,0,0,0,0,-22.654,-11.503,-18.673,-3.5155,5.8354,-2.05,0.7699,0.41,0.786,-0.3681,0,0,0,0,0,0,48.069,48.009,0.9668,47.834,47.925,47.818,47.834,47.955,47.971,6.9,-7.33
1.0,-10.848,-1.2042,-0.3274,-1.0191,-6.0205,-0.3274,0.0828,0.7859,-1.0191,0.08278,-1.4094,3.4098,0.9604,-3.7436,-6.0205,0,0,0,0,0,0,0,0,0,0,0,0,8.38e-17,-4.8116,-1.4033,-0.1369,0.002472,0,9.803e-16,-55.642,-16.312,0.02167,0,3.4098,0.9604,-3.7437,0,0,0,0,0,0,0.1188,-6.5506,-2.8404,-0.8281,0.06405,-0.1998,0.1211,-6.5483,-2.8157,-0.8281,0.07015,-0.1983,0.7699,0.41,-1.4094,-0.3682,0,0,0,0,0,0,-21.627,-11.046,-18.66,-3.5395,5.7577,-1.9805,0.7699,0.41,0.786,-0.3681,0,0,0,0,0,0,48.009,48.009,0.8594,47.834,47.925,47.818,47.834,47.955,47.971,6.7218,-7.4427
2.0,-10.666,-1.2042,-0.3274,-1.0191,-6.0205,-0.3274,0.08279,0.7859,-1.0191,0.08278,-1.4094,3.4098,0.9604,-3.7436,-6.0205,0,0,0,0,0,0,0,0,0,0,0,0,8.38e-17,-4.8116,-1.4033,-0.1369,0.002472,0,9.803e-16,-55.642,-16.312,0.02167,0,3.4098,0.9605,-3.7437,0,0,0,0,0,0,0.1099,-6.5438,-2.8,-0.8205,0.07473,-0.183,0.1211,-6.5483,-2.8157,-0.8281,0.07015,-0.1922,0.7699,0.41,-1.4094,-0.3682,0,0,0,0,0,0,-23.843,-12.127,-18.393,-3.6453,5.978,-1.9978,0.7699,0.41,0.786,-0.3681,0,0,0,0,0,0,48.009,48.069,0.931,47.879,47.925,47.818,47.834,47.955,47.971,6.5436,-7.5555
3.0,-10.507,-1.2042,-0.3273,-1.0191,-6.0205,-0.3274,0.08275,0.786,-1.0191,0.08278,-1.4094,3.4098,0.9604,-3.7436,-6.0205,0,0,0,0,0,0,0,0,0,0,0,0,8.38e-17,-4.8116,-1.4033,-0.1369,0.002472,0,9.803e-16,-55.642,-16.312,0.02167,0,3.4098,0.9604,-3.7437,0,0,0,0,0,0,0.1233,-6.5483,-2.8224,-0.8266,0.07168,-0.1998,0.1211,-6.5483,-2.8157,-0.8281,0.07015,-0.1967,0.7699,0.41,-1.4094,-0.3681,0,0,0,0,0,0,-21.772,-10.872,-18.691,-3.5512,5.6648,-1.9976,0.7699,0.41,0.786,-0.3681,0,0,0,0,0,0,48.069,48.069,0.931,47.879,47.925,47.818,47.834,47.955,47.971,6.4533,-7.65
4.0,-10.413,-1.2042,-0.3274,-1.0191,-6.0205,-0.3274,0.08278,0.786,-1.0191,0.08278,-1.4094,3.4098,0.9604,-3.7436,-6.0205,0,0,0,0,0,0,0,0,0,0,0,0,8.38e-17,-4.8116,-1.4033,-0.1369,0.002472,0,9.803e-16,-55.642,-16.312,0.02167,0,3.4098,0.9604,-3.7437,0,0,0,0,0,0,0.1255,-6.5394,-2.8,-0.8327,0.07473,-0.1952,0.1211,-6.5483,-2.8157,-0.8327,0.07015,-0.1922,0.7699,0.41,-1.4094,-0.3681,0,0,0,0,0,0,-22.823,-11.645,-18.524,-3.5305,5.8712,-2.0096,0.7699,0.41,0.786,-0.3681,0,0,0,0,0,0,48.069,48.069,0.8952,47.879,47.925,47.818,47.834,47.955,47.971,6.6267,-7.69
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14996.0,10.84,-2.6343,-0.8836,-1.5591,-5.3263,-0.8836,0.06036,2.9515,-1.559,0.06044,-0.8511,3.0837,1.4501,-2.2102,-5.3265,-0.03151,-0.05375,0.04732,0.1482,-0.05218,0.06706,0.2969,0.5065,-0.4459,-1.3963,0.4916,-0.6319,-0.3694,-4.1879,-1.1847,-0.09441,-0.1568,0.1898,1.1605,-42.951,-19.023,0.1551,-0.1338,3.0836,1.4503,-2.2101,-0.03347,-0.05585,0.04805,0.151,-0.05513,0.07114,-0.3564,-6.0394,-2.3001,-0.2181,-0.1159,0.09608,-0.3632,-6.0394,-2.3023,-0.212,-0.125,0.1113,0.7116,0.06957,-0.8506,-0.03352,-0.03558,-0.03029,0.002444,-0.04208,0.1458,-0.1098,-0.8784,-0.07291,-37.584,0.0001132,-2.1031,0.03318,0.7117,0.0697,2.951,-0.03356,-0.03508,-0.02849,0.001571,-0.03951,0.1442,-0.1036,48.069,48.009,0.8952,47.818,47.834,47.818,47.803,47.94,47.94,-1.41,16.14
14997.0,10.857,-2.6331,-0.8841,-1.5596,-5.3258,-0.884,0.06039,2.953,-1.5594,0.06045,-0.8503,3.0835,1.4505,-2.2091,-5.326,-0.02913,-0.0497,0.04376,0.137,-0.04825,0.062,0.2969,0.5065,-0.4459,-1.3963,0.4916,-0.6319,-0.3677,-4.1837,-1.1874,-0.09682,-0.1562,0.189,1.1592,-42.937,-19.023,0.1545,-0.1338,3.0833,1.4507,-2.209,-0.02909,-0.04989,0.04198,0.1481,-0.05465,0.06249,-0.3161,-6.1179,-2.253,-0.3752,-0.03965,0.08693,-0.3273,-6.1022,-2.2597,-0.366,-0.05033,0.0915,0.7114,0.06932,-0.8497,-0.03359,-0.0335,-0.02723,0.001208,-0.04242,0.1428,-0.0967,-2.7137,0.8552,-38.514,-0.6088,-3.2383,-0.9666,0.7114,0.06948,2.9525,-0.03359,-0.03246,-0.02633,0.001469,-0.03657,0.1333,-0.09571,48.009,48.009,0.8594,47.818,47.834,47.818,47.803,47.94,47.94,-1.52,15.943
14998.0,10.89,-2.6321,-0.8844,-1.56,-5.3253,-0.8844,0.06039,2.9544,-1.5598,0.06046,-0.8495,3.0833,1.4508,-2.208,-5.3256,-0.02676,-0.04565,0.04019,0.1258,-0.04431,0.05695,0.2969,0.5065,-0.4459,-1.3963,0.4916,-0.6319,-0.3659,-4.1797,-1.1901,-0.09922,-0.1555,0.1881,1.1579,-42.924,-19.023,0.154,-0.1338,3.0831,1.451,-2.2078,-0.02776,-0.04382,0.03652,0.1295,-0.05064,0.04818,-0.343,-6.2569,-2.1566,-0.3035,0.00305,0.1434,-0.3385,-6.2322,-2.1589,-0.302,-0.00915,0.1571,0.7111,0.06912,-0.849,-0.0337,-0.02911,-0.02589,0.001292,-0.04046,0.1246,-0.08058,4.2749,1.0128,-36.412,-1.2811,-0.4296,-1.1013,0.7112,0.06928,2.9538,-0.03362,-0.02984,-0.02417,0.001364,-0.03362,0.1224,-0.08786,48.009,48.009,0.931,47.818,47.834,47.818,47.803,47.94,47.94,-1.74,15.55
14999.0,11.29,-2.6311,-0.8848,-1.5604,-5.3249,-0.8847,0.06039,2.9557,-1.5602,0.06047,-0.8488,3.0831,1.4511,-2.2071,-5.3251,-0.02438,-0.0416,0.03662,0.1147,-0.04038,0.0519,0.2969,0.5065,-0.4459,-1.3963,0.4916,-0.6319,-0.3642,-4.1758,-1.1928,-0.1016,-0.1548,0.1873,1.1568,-42.912,-19.023,0.1535,-0.1338,3.0829,1.4513,-2.2068,-0.02149,-0.04059,0.03417,0.1202,-0.0395,0.04178,-0.4237,-6.2703,-2.0939,-0.302,-0.01372,0.1739,-0.4125,-6.2569,-2.0916,-0.2943,-0.02898,0.1891,0.7109,0.06894,-0.8484,-0.03384,-0.02738,-0.01982,0.001031,-0.03028,0.1157,-0.06702,11.518,1.5002,-39.314,-1.8671,-0.3734,-0.5733,0.7109,0.06909,2.955,-0.03364,-0.02721,-0.02201,0.001255,-0.03067,0.1115,-0.08003,48.009,48.009,0.931,47.818,47.834,47.818,47.803,47.94,47.94,-1.4601,15.743
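The previous analysis that selected the 10 columns in section 1.2 is not part of this notebook. As a rough stand-in only, the sketch below ranks columns by their absolute Pearson correlation with the target. The function name `rank_columns_by_correlation` and the toy data are hypothetical, and this is not necessarily how the original selection was made.

```python
import numpy as np
import pandas as pd


def rank_columns_by_correlation(df: pd.DataFrame, target: str, top_k: int) -> list:
    """Return the top_k column names with the largest absolute
    Pearson correlation to the target column (illustrative helper)."""
    features = df.drop(columns=[target])
    # Constant columns have an undefined correlation (NaN); rank them last.
    corr = features.corrwith(df[target]).abs().fillna(0.0)
    return corr.sort_values(ascending=False).head(top_k).index.tolist()


# Toy data: column "a" drives the target, column "b" is pure noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
toy["f_x"] = 3.0 * toy["a"] + 0.1 * rng.normal(size=200)

selected = rank_columns_by_correlation(toy, target="f_x", top_k=1)
print(selected)
```

A linear correlation filter can miss columns that only matter through lagged or nonlinear effects, so it should be read as a quick first pass rather than a replacement for a proper analysis.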


### 1.3 Separate data into a training and testing set

We also want to separate the data set into a training and testing set. We do so by using the first 10,500 measurements for training and then using the remainder for testing.

In [17]:
separator = 10500

data_train = data_all.where("data_train", data_all.rowid() < separator)
data_test = data_all.where("data_test", data_all.rowid() >= separator)
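The split above is chronological rather than random, so the model is always evaluated on measurements taken after the ones it was trained on. A minimal pandas sketch of the same idea, using a hypothetical helper `chronological_split` and toy data:

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, separator: int):
    """Split a time-ordered frame: rows before `separator` go to the
    training set, all remaining rows go to the testing set."""
    train = df.iloc[:separator]
    test = df.iloc[separator:]
    return train, test


toy = pd.DataFrame({"x": range(10)})
train, test = chronological_split(toy, separator=7)

# Every training row precedes every test row, so no future
# measurement leaks into training.
print(len(train), len(test))
print(train.index.max() < test.index.min())
```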

## 2. Predictive modeling

### 2.1 Propositionalization with getML's FastProp

In [18]:
fast_prop = getml.feature_learning.FastPropTimeSeries(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    allow_lagged_targets=False,
    memory=15,
    num_threads=1,
)
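To give an intuition for what `memory` controls, the sketch below builds a few aggregations over a sliding window of recent rows with plain pandas. It is an illustration of the general idea of propositionalization on a time series, not FastProp's implementation. The window length of 3, the feature names and the choice to include the current row are assumptions made for the example.

```python
import pandas as pd

# Toy sensor signal; the column name "7" mirrors one of the selected columns.
signal = pd.DataFrame({"7": [1.0, 2.0, 3.0, 4.0, 5.0]})

memory = 3  # number of most recent rows each feature may look at

# min_periods=1 lets the first rows use however much history exists.
window = signal["7"].rolling(window=memory, min_periods=1)

features = pd.DataFrame(
    {
        "AVG(7)": window.mean(),
        "MIN(7)": window.min(),
        "MAX(7)": window.max(),
    }
)
print(features)
```

Every combination of column, aggregation and window yields one candidate feature, which is why the number of candidates grows quickly as columns are added.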


In [19]:
pipe_fp_fl = getml.pipeline.Pipeline(
    feature_learners=[fast_prop],
    tags=["feature learning", "fastprop"],
)

In [20]:
pipe_fp_fl.check(data_train)

Checking data model...
OK.


In [21]:
begin = time.time()

pipe_fp_fl.fit(data_train)

fastprop_train = pipe_fp_fl.transform(data_train, df_name="fastprop_train")

end = time.time()

fastprop_runtime = datetime.timedelta(seconds=end - begin)

Checking data model...
OK.

FastProp: Trying 123 features...

Trained pipeline.
Time taken: 0h:0m:0.064701


FastProp: Building features...
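The runtime above is measured by hand with `time.time()`. If the same measurement is needed in several places, a small context manager can keep it consistent. The sketch below is one possible way to do that; the name `stopwatch` is hypothetical, and it uses `time.perf_counter()`, which is a monotonic clock intended for measuring intervals.

```python
import datetime
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(results: dict, key: str):
    """Store the elapsed wall-clock time of the enclosed block
    in results[key] as a datetime.timedelta."""
    begin = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - begin
        results[key] = datetime.timedelta(seconds=elapsed)


runtimes = {}
with stopwatch(runtimes, "toy_step"):
    sum(i * i for i in range(100_000))  # stand-in for fit + transform

print(runtimes["toy_step"])
```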



In [22]:
fastprop_test = pipe_fp_fl.transform(data_test, df_name="fastprop_test")


FastProp: Building features...



In [23]:
predictor = getml.predictors.XGBoostRegressor()

pipe_fp_pr = getml.pipeline.Pipeline(
    tags=["prediction", "fastprop"], predictors=[predictor]
)

In [24]:
pipe_fp_pr.check(fastprop_train)



Checking data model...


In [25]:
pipe_fp_pr.fit(fastprop_train)



Checking data model...

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:0m:4.419197



In [26]:
pipe_fp_pr.score(fastprop_test)




Unnamed: 0,date time,set used,target,mae,rmse,rsquared
0,2021-05-18 13:17:28,fastprop_train,f_x,0.4479,0.5908,0.9961
1,2021-05-18 13:17:29,fastprop_test,f_x,0.5593,0.7332,0.9949
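For reference, the three score columns can be reproduced from predictions and targets with a few lines of NumPy. This is a sketch with toy numbers. R-squared is computed here as the squared Pearson correlation between predictions and targets, which is an assumption about the definition; another common definition is 1 - SS_res / SS_tot, and the getML documentation should be consulted for the one it reports.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute MAE, RMSE and R-squared (as squared Pearson correlation)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    return {
        "mae": float(np.mean(np.abs(errors))),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "rsquared": float(np.corrcoef(y_true, y_pred)[0, 1] ** 2),
    }


metrics = regression_metrics(
    y_true=[1.0, 2.0, 3.0, 4.0],
    y_pred=[1.5, 2.0, 2.5, 4.0],
)
print(metrics)
```

RMSE is never smaller than MAE because squaring penalizes large errors more heavily, which matches the pattern in the score tables.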


### 2.2 Propositionalization with featuretools

In [27]:
dfs_pandas = {}

# Keep only the selected sensor columns plus the target.
keep_columns = only_use + ["f_x"]

for df in [data_train, data_test, data_all]:
    df_pandas = df.to_pandas()
    df_pandas = df_pandas.drop(
        columns=[col for col in df_pandas.columns if col not in keep_columns]
    )
    # featuretools and tsfresh expect an id and a time stamp column:
    # all rows belong to one series, with one synthetic second per row.
    df_pandas["id"] = 1
    df_pandas["ds"] = pd.to_datetime(np.arange(0, df_pandas.shape[0]), unit="s")
    dfs_pandas[df.name] = df_pandas

In [28]:
dfs_pandas["data_train"]

Unnamed: 0,30,34,37,38,4,59,61,7,77,78,f_x,id,ds
0,-1.2042,-0.32739,-1.0191,-6.0205,-0.32737,0.082791,0.78597,-1.0191,0.082782,-1.4094,-11.0300,1,1970-01-01 00:00:00
1,-1.2042,-0.32739,-1.0191,-6.0205,-0.32737,0.082800,0.78592,-1.0191,0.082782,-1.4094,-10.8480,1,1970-01-01 00:00:01
2,-1.2042,-0.32737,-1.0191,-6.0205,-0.32737,0.082786,0.78594,-1.0191,0.082782,-1.4094,-10.6660,1,1970-01-01 00:00:02
3,-1.2042,-0.32734,-1.0191,-6.0205,-0.32737,0.082755,0.78599,-1.0191,0.082782,-1.4094,-10.5070,1,1970-01-01 00:00:03
4,-1.2042,-0.32736,-1.0191,-6.0205,-0.32737,0.082782,0.78597,-1.0191,0.082782,-1.4094,-10.4130,1,1970-01-01 00:00:04
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10495,-1.1446,-0.37311,-1.0486,-5.9532,-0.37326,0.087343,0.90793,-1.0488,0.087468,-1.4162,-9.7673,1,1970-01-01 02:54:55
10496,-1.1349,-0.37103,-1.0472,-5.9564,-0.37108,0.087241,0.90199,-1.0474,0.087274,-1.4160,-9.9200,1,1970-01-01 02:54:56
10497,-1.1255,-0.36889,-1.0458,-5.9596,-0.36896,0.087055,0.89618,-1.0460,0.087082,-1.4158,-9.7743,1,1970-01-01 02:54:57
10498,-1.1163,-0.36680,-1.0444,-5.9627,-0.36689,0.086907,0.89034,-1.0447,0.086893,-1.4155,-8.6109,1,1970-01-01 02:54:58
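The `ds` column in the table above is synthetic: each row number is interpreted as a count of seconds since the Unix epoch, which is why the time stamps start at 1970-01-01. A minimal sketch on toy data shows the construction in isolation:

```python
import numpy as np
import pandas as pd

# Three toy measurements of a single sensor column.
toy = pd.DataFrame({"7": [-1.0191, -1.0191, -1.0190]})

toy["id"] = 1  # all rows belong to the same time series
toy["ds"] = pd.to_datetime(np.arange(0, toy.shape[0]), unit="s")

print(toy)
```

With this encoding, a memory of 15 seconds corresponds to a window of roughly 15 consecutive rows.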


In [29]:
ft_builder = FTTimeSeriesBuilder(
    num_features=200,
    horizon=pd.Timedelta(seconds=0),
    memory=pd.Timedelta(seconds=15),
    column_id="id",
    time_stamp="ds",
    target="f_x",
)

In [30]:
featuretools_train = ft_builder.fit(dfs_pandas["data_train"])
featuretools_test = ft_builder.transform(dfs_pandas["data_test"])

featuretools: Trying features...


  agg_primitives: ['all', 'any', 'entropy', 'num_true', 'percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.


Selecting the best out of 177 features...
Time taken: 0h:6m:8.647821



  agg_primitives: ['all', 'any', 'entropy', 'num_true', 'percent_true']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.


In [31]:
featuretools_train

Unnamed: 0_level_0,MIN(peripheral.37),MIN(peripheral.7),FIRST(peripheral.37),FIRST(peripheral.7),MEDIAN(peripheral.37),MEDIAN(peripheral.7),MEAN(peripheral.37),MEAN(peripheral.7),SUM(peripheral.37),SUM(peripheral.7),...,"TREND(peripheral.38, ds)","TREND(peripheral.59, ds)","TREND(peripheral.30, ds)","TREND(peripheral.4, ds)","TREND(peripheral.78, ds)","TREND(peripheral.34, ds)","TREND(peripheral.77, ds)",f_x,id,ds
_featuretools_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,-1.0191,-1.0191,-1.0191,-1.0191,-1.0191,-1.0191,-1.019100,-1.019100,-1.0191,-1.0191,...,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,-11.0300,1,1970-01-01 00:00:00
1,-1.0191,-1.0191,-1.0191,-1.0191,-1.0191,-1.0191,-1.019100,-1.019100,-2.0382,-2.0382,...,0.000000e+00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,-10.8480,1,1970-01-01 00:00:01
2,-1.0191,-1.0191,-1.0191,-1.0191,-1.0191,-1.0191,-1.019100,-1.019100,-3.0573,-3.0573,...,-1.204867e-26,-0.216000,0.000000,0.000000,0.000000,0.864000,0.000000,-10.6660,1,1970-01-01 00:00:02
3,-1.0191,-1.0191,-1.0191,-1.0191,-1.0191,-1.0191,-1.019100,-1.019100,-4.0764,-4.0764,...,0.000000e+00,-1.054080,0.000000,0.000000,0.000000,1.468800,0.000000,-10.5070,1,1970-01-01 00:00:03
4,-1.0191,-1.0191,-1.0191,-1.0191,-1.0191,-1.0191,-1.019100,-1.019100,-5.0955,-5.0955,...,0.000000e+00,-0.544320,0.000000,0.000000,0.000000,0.950400,0.000000,-10.4130,1,1970-01-01 00:00:04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10495,-1.0716,-1.0721,-1.0716,-1.0721,-1.0596,-1.0596,-1.059793,-1.059927,-15.8969,-15.8989,...,-3.789643e-03,-0.000183,0.011060,0.002575,0.000191,0.002561,-0.000201,-9.7673,1,1970-01-01 02:54:55
10496,-1.0698,-1.0702,-1.0698,-1.0702,-1.0580,-1.0580,-1.058167,-1.058280,-15.8725,-15.8742,...,-3.708571e-03,-0.000179,0.010880,0.002522,0.000198,0.002506,-0.000201,-9.9200,1,1970-01-01 02:54:56
10497,-1.0681,-1.0684,-1.0681,-1.0684,-1.0565,-1.0563,-1.056567,-1.056667,-15.8485,-15.8500,...,-3.629286e-03,-0.000177,0.010697,0.002469,0.000206,0.002450,-0.000200,-9.7743,1,1970-01-01 02:54:57
10498,-1.0662,-1.0665,-1.0662,-1.0665,-1.0549,-1.0548,-1.054987,-1.055087,-15.8248,-15.8263,...,-3.550357e-03,-0.000176,0.010510,0.002416,0.000214,0.002398,-0.000199,-8.6109,1,1970-01-01 02:54:58


In [32]:
roles={
    getml.data.roles.target: ["f_x"],
    getml.data.roles.join_key: ["id"],
    getml.data.roles.time_stamp: ["ds"],
}

df_featuretools_train = getml.data.DataFrame.from_pandas(
    featuretools_train, name="featuretools_train", roles=roles
)

df_featuretools_test = getml.data.DataFrame.from_pandas(
    featuretools_test, name="featuretools_test", roles=roles
)


In [33]:
df_featuretools_train.set_role(
    df_featuretools_train.unused_names, getml.data.roles.numerical
)

df_featuretools_test.set_role(
    df_featuretools_test.unused_names, getml.data.roles.numerical
)

In [34]:
predictor = getml.predictors.XGBoostRegressor()

pipe_ft_pr = getml.pipeline.Pipeline(
    tags=["prediction", "featuretools"], predictors=[predictor]
)

pipe_ft_pr

In [35]:
pipe_ft_pr.fit(df_featuretools_train)

Checking data model...
OK.

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:0m:4.722218



In [36]:
pipe_ft_pr.score(df_featuretools_test)




Unnamed: 0,date time,set used,target,mae,rmse,rsquared
0,2021-05-18 13:26:23,featuretools_train,f_x,0.4309,0.5701,0.9964
1,2021-05-18 13:26:23,featuretools_test,f_x,0.5596,0.7385,0.9948


### 2.3 Propositionalization with tsfresh

In [37]:
tsfresh_builder = TSFreshBuilder(
    num_features=200, 
    memory=15, 
    column_id="id",
    time_stamp="ds",
    target="f_x",
)

In [38]:
tsfresh_train = tsfresh_builder.fit(dfs_pandas["data_train"])

tsfresh_test = tsfresh_builder.transform(dfs_pandas["data_test"])

Rolling: 100%|██████████| 20/20 [00:13<00:00,  1.45it/s]
Feature Extraction: 100%|██████████| 20/20 [00:20<00:00,  1.01s/it]
Feature Extraction: 100%|██████████| 20/20 [00:37<00:00,  1.87s/it]


Selecting the best out of 120 features...
Time taken: 0h:1m:19.589981



Rolling: 100%|██████████| 20/20 [00:06<00:00,  3.01it/s]
Feature Extraction: 100%|██████████| 20/20 [00:09<00:00,  2.05it/s]
Feature Extraction: 100%|██████████| 20/20 [00:16<00:00,  1.19it/s]


In [39]:
roles={
    getml.data.roles.target: ["f_x"],
    getml.data.roles.join_key: ["id"],
    getml.data.roles.time_stamp: ["ds"],
}

df_tsfresh_train = getml.data.DataFrame.from_pandas(
    tsfresh_train, name="tsfresh_train", roles=roles
)

df_tsfresh_test = getml.data.DataFrame.from_pandas(
    tsfresh_test, name="tsfresh_test", roles=roles
)

In [40]:
df_tsfresh_train.set_role(df_tsfresh_train.unused_names, getml.data.roles.numerical)

df_tsfresh_test.set_role(df_tsfresh_test.unused_names, getml.data.roles.numerical)

In [41]:
pipe_tsf_pr = getml.pipeline.Pipeline(
    tags=["prediction", "tsfresh"], predictors=[predictor]
)

pipe_tsf_pr

In [42]:
pipe_tsf_pr.check(df_tsfresh_train)

Checking data model...
OK.


In [43]:
pipe_tsf_pr.fit(df_tsfresh_train)

Checking data model...
OK.

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:0m:4.324484



In [44]:
pipe_tsf_pr.score(df_tsfresh_test)




Unnamed: 0,date time,set used,target,mae,rmse,rsquared
0,2021-05-18 13:28:25,tsfresh_train,f_x,0.4873,0.6564,0.9952
1,2021-05-18 13:28:25,tsfresh_test,f_x,0.5939,0.7857,0.9939


### 2.4 Comparison

In [45]:
num_features = dict(
    fastprop=123,
    featuretools=177,
    tsfresh=110,
)

runtime_per_feature = [
    fastprop_runtime / num_features['fastprop'],
    ft_builder.runtime / num_features['featuretools'],
    tsfresh_builder.runtime / num_features['tsfresh'],
]

features_per_second = [1.0/r.total_seconds() for r in runtime_per_feature]

speedup_per_feature = [r/runtime_per_feature[0] for r in runtime_per_feature]

comparison = pd.DataFrame(
    dict(
        runtime=[fastprop_runtime, ft_builder.runtime, tsfresh_builder.runtime],
        num_features=list(num_features.values()),
        features_per_second=features_per_second,
        speedup=[1, ft_builder.runtime/fastprop_runtime, tsfresh_builder.runtime/fastprop_runtime],
        speedup_per_feature=speedup_per_feature,
        mae=[pipe_fp_pr.mae, pipe_ft_pr.mae, pipe_tsf_pr.mae],
        rmse=[pipe_fp_pr.rmse, pipe_ft_pr.rmse, pipe_tsf_pr.rmse],
        rsquared=[pipe_fp_pr.rsquared, pipe_ft_pr.rsquared, pipe_tsf_pr.rsquared],
    )
)

comparison.index = ["getML: FastProp", "featuretools", "tsfresh"]

In [46]:
comparison

Unnamed: 0,runtime,num_features,features_per_second,speedup,speedup_per_feature,mae,rmse,rsquared
getML: FastProp,0 days 00:00:03.929530,123,31.301844,1.0,1.0,0.559264,0.733232,0.994929
featuretools,0 days 00:06:08.647821,177,0.480133,93.814736,65.194103,0.55958,0.738467,0.994774
tsfresh,0 days 00:01:19.589981,110,1.382084,20.254326,22.648292,0.593878,0.785746,0.993929
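The derived columns in the table follow directly from the runtimes and feature counts. The sketch below repeats the arithmetic with the reported values hard-coded, using only the standard library. Note that dividing a `timedelta` by an integer rounds to whole microseconds, which is why the last digits depend on using `timedelta` objects rather than plain floats.

```python
import datetime

# Runtimes and feature counts as reported in the comparison table.
runtimes = {
    "fastprop": datetime.timedelta(seconds=3, microseconds=929530),
    "featuretools": datetime.timedelta(minutes=6, seconds=8, microseconds=647821),
    "tsfresh": datetime.timedelta(minutes=1, seconds=19, microseconds=589981),
}
num_features = {"fastprop": 123, "featuretools": 177, "tsfresh": 110}

# timedelta / int gives a timedelta; timedelta / timedelta gives a float.
runtime_per_feature = {k: runtimes[k] / num_features[k] for k in runtimes}
features_per_second = {
    k: 1.0 / v.total_seconds() for k, v in runtime_per_feature.items()
}
speedup = {k: runtimes[k] / runtimes["fastprop"] for k in runtimes}
speedup_per_feature = {
    k: runtime_per_feature[k] / runtime_per_feature["fastprop"] for k in runtimes
}

for name in runtimes:
    print(
        name,
        round(features_per_second[name], 6),
        round(speedup[name], 6),
        round(speedup_per_feature[name], 6),
    )
```

The `speedup` column compares total runtimes, while `speedup_per_feature` normalizes by the number of generated features, so the two differ whenever the tools produce different feature counts.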


In [47]:
comparison.to_csv("comparisons/robot.csv")

## 3. Conclusion


The purpose of this notebook has been to illustrate the problem of the curse of dimensionality when engineering features from datasets with many columns.

The most important thing to remember is that this problem exists regardless of whether you engineer your features manually or using algorithms. Whether you like it or not: If you write your features in the traditional way, your search space grows quadratically with the number of columns.
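One way to see the quadratic growth is to count column pairs. The sketch assumes, purely for illustration, that a hand-written feature template combines two columns (for example, one column that is aggregated and one that is used in a condition). The helper name `num_column_pairs` is hypothetical.

```python
from math import comb


def num_column_pairs(num_columns: int) -> int:
    """Number of unordered pairs of distinct columns: n * (n - 1) / 2."""
    return comb(num_columns, 2)


for n in (10, 20, 40):
    print(n, num_column_pairs(n))
```

Doubling the number of columns roughly quadruples the number of pairs, and every pair is multiplied again by the available aggregations and window sizes.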

# Next Steps

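This tutorial showed how to tackle feature explosion on sensor data with getML's FastProp algorithm and compared it with featuretools and tsfresh. getML's other feature learning algorithms, Multirel and Relboost, are covered in the documentation linked below.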

If you are interested in further real-world applications of getML, head back to the [notebook overview](welcome.md) and choose one of the remaining examples. 

Here is some additional material from our [documentation](https://docs.getml.com/latest/) if you want to learn more about getML:
* [Feature learning with Multirel](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#multirel)
* [Feature learning with Relboost](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relboost)

# Get in contact

If you have any questions, schedule a [call with Alex](https://go.getml.com/meetings/alexander-uhlig/getml-demo), the co-founder of getML, or write us an [email](mailto:team@getml.com). Prefer a private demo of getML? Just contact us to make an appointment.