# Feature extraction with tsfresh transformer

In this tutorial, we show how to use `sktime` with [`tsfresh`](https://tsfresh.readthedocs.io) to extract features from time series, so that the resulting tabular features can be passed to any scikit-learn estimator.

## Preliminaries
`tsfresh` must be installed to run this tutorial. If it is not installed yet, uncomment and run the line in the cell below:

In [1]:
# !pip install --upgrade tsfresh
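
Before running the rest of the notebook, it can help to confirm that the package is actually importable. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper introduced here for illustration and is not part of `sktime` or `tsfresh`.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the import system can locate `package_name`."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("tsfresh"):
    print("tsfresh is available")
else:
    print("tsfresh is missing; run `pip install --upgrade tsfresh` first")
```

The check only looks the package up and does not import it, so it runs quickly even when `tsfresh` is present.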

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sktime.datasets import load_basic_motions
from sktime.datasets import load_gunpoint
from sktime.transformers.series_as_features.summarize import (
    TSFreshFeatureExtractor,
)

## Univariate time series classification data

For more details on the data set, see the [univariate time series classification notebook](https://github.com/alan-turing-institute/sktime/blob/master/examples/01_classification_univariate.ipynb).

In [3]:
X, y = load_gunpoint(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(150, 1) (150,) (50, 1) (50,)


In [4]:
X_train.head()

|     | dim_0 |
| --- | --- |
| 7 | 0 -1.1404 1 -1.1382 2 -1.1347 3 ... |
| 41 | 0 -1.4143 1 -1.4154 2 -1.4163 3 ... |
| 110 | 0 -0.61669 1 -0.61457 2 -0.61434 3... |
| 36 | 0 -1.2870 1 -1.2824 2 -1.3037 3 ... |
| 42 | 0 -0.96986 1 -0.97268 2 -0.97060 3... |
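
The output above shows a nested data frame: each row is one instance, each column is one dimension, and each cell holds an entire time series. The sketch below builds a small frame with the same layout from made-up values, using only `pandas` and NumPy, and then stacks the cells into an ordinary 2-D array. The names `series_list`, `X_nested` and `X_2d` are illustrative, and the construction is one possible way to build such a frame, not necessarily how `sktime` does it internally.

```python
import numpy as np
import pandas as pd

# Three toy series of length 5; the values are made up for illustration.
series_list = [pd.Series(np.arange(5, dtype=float) * k) for k in (1, 2, 3)]

# Store each series as a single object in a 1-D object array,
# so that pandas keeps one whole series per cell.
cells = np.empty(len(series_list), dtype=object)
for i, s in enumerate(series_list):
    cells[i] = s

# One row per instance, one column per dimension.
X_nested = pd.DataFrame({"dim_0": cells})
print(X_nested.shape)             # 3 instances, 1 dimension
print(type(X_nested.iloc[0, 0]))  # each cell is itself a pandas Series

# Stack the cells into a plain array of shape (instances, time points).
X_2d = np.vstack([s.to_numpy() for s in X_nested["dim_0"]])
print(X_2d.shape)
```

This is why `X_train.shape` reports a single column for the univariate data: the time points live inside the cells rather than in separate columns.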


In [5]:
# binary classification task
np.unique(y_train)

array(['1', '2'], dtype=object)

## Using tsfresh to extract features

In [6]:
# extract the "efficient" tsfresh feature set, one row of features per series
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:18<00:00,  3.67s/it]




variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_0__symmetry_looking__r_0.9500000000000001,dim_0__time_reversal_asymmetry_statistic__lag_1,dim_0__time_reversal_asymmetry_statistic__lag_2,dim_0__time_reversal_asymmetry_statistic__lag_3,dim_0__value_count__value_-1,dim_0__value_count__value_0,dim_0__value_count__value_1,dim_0__variance,dim_0__variance_larger_than_standard_deviation,dim_0__variation_coefficient
0,148.999495,36.21824,0.503181,0.532003,0.041019,-0.756751,-1.152128,-1.362392,0.12246,0.714658,...,1.0,0.00316,6.6e-05,-0.00915,0.0,0.0,0.0,0.99333,0.0,1124052.0
1,149.000791,45.83683,0.210965,0.219571,0.105857,-0.067978,-0.618506,-1.743038,0.427029,0.774103,...,1.0,0.004457,-0.006031,-0.030647,0.0,0.0,0.0,0.993339,0.0,-605261.4
2,148.999048,25.83833,0.458544,0.510811,0.095597,-1.131448,-1.087768,-0.729832,-0.097042,-0.460083,...,1.0,0.001321,0.002268,0.007223,0.0,0.0,0.0,0.993327,0.0,-543631.6
3,149.000022,42.87819,0.364435,0.359077,0.057456,-0.395867,-0.85189,-1.541396,0.308219,0.965678,...,1.0,0.000458,-0.003426,-0.026767,0.0,0.0,0.0,0.993333,0.0,-3646321.0
4,148.999406,33.419822,0.584903,0.635598,0.042326,-1.075007,-1.230166,-1.125237,-0.046539,0.195284,...,1.0,0.005683,0.004689,0.000136,0.0,0.0,0.0,0.993329,0.0,-1256293.0
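
A few of the columns above are easier to read with their definitions in hand. The sketch below recomputes `abs_energy`, `absolute_sum_of_changes`, `variance` and `variation_coefficient` with plain NumPy, following the formulas given in the tsfresh documentation. The series is synthetic, and the assumption that it is z-normalised with the sample standard deviation is illustrative rather than a statement about the data set. Under that assumption, a series of length 150 has an absolute energy of 149 and a population variance of 149/150, which is consistent with the values shown above.

```python
import numpy as np

# Synthetic series of length 150, z-normalised with the sample standard
# deviation (ddof=1). This is an assumption made for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=150)
x = (x - x.mean()) / x.std(ddof=1)

# Sum of squared values.
abs_energy = float(np.sum(x ** 2))

# Sum of absolute differences between consecutive values.
absolute_sum_of_changes = float(np.sum(np.abs(np.diff(x))))

# Population variance (ddof=0).
variance = float(np.var(x))

# Standard deviation divided by the mean; undefined for a zero mean.
mean = float(np.mean(x))
variation_coefficient = float(np.std(x)) / mean if mean != 0 else float("nan")

print(round(abs_energy, 6))  # approximately n - 1 = 149
print(round(variance, 6))    # approximately 149 / 150
print(absolute_sum_of_changes, variation_coefficient)
```

Because the mean of a z-normalised series is close to zero, the variation coefficient divides by a very small number, which would explain the very large magnitudes in the last column.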


## Using tsfresh with sktime

In [7]:
classifier = make_pipeline(
    TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False),
    RandomForestClassifier()
)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:18<00:00,  3.65s/it]




  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:06<00:00,  1.22s/it]




0.96
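
The pipeline works because the feature extractor turns each series into a fixed-length row of numbers, which is all a scikit-learn estimator needs. The sketch below illustrates that idea without `tsfresh`: a hypothetical hand-written function, `summary_features`, stands in for the extractor, and the data are synthetic. It is an illustration of the pattern under those assumptions, not a replacement for `TSFreshFeatureExtractor`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def summary_features(X):
    """Map each row (one time series) to three summary statistics."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([
        X.mean(axis=1),
        X.std(axis=1),
        np.abs(np.diff(X, axis=1)).sum(axis=1),  # absolute sum of changes
    ])


# Synthetic two-class data: class "1" is noise, class "2" is noise plus a sine.
rng = np.random.default_rng(42)
n_per_class, n_timepoints = 50, 100
t = np.linspace(0, 4 * np.pi, n_timepoints)
X_noise = rng.normal(size=(n_per_class, n_timepoints))
X_sine = rng.normal(size=(n_per_class, n_timepoints)) + 3 * np.sin(t)
X_toy = np.vstack([X_noise, X_sine])
y_toy = np.array(["1"] * n_per_class + ["2"] * n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# Feature extraction is just the first step of an ordinary pipeline.
toy_classifier = make_pipeline(
    FunctionTransformer(summary_features),
    RandomForestClassifier(random_state=0),
)
toy_classifier.fit(X_tr, y_tr)
toy_score = toy_classifier.score(X_te, y_te)
print(toy_score)
```

Swapping `RandomForestClassifier` for another scikit-learn classifier requires no other change, since the final step only ever sees the tabular features.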

## Multivariate time series classification data

In [8]:
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(60, 6) (60,) (20, 6) (20,)


In [9]:
# multivariate input data
X_train.head()

|     | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 |
| --- | --- | --- | --- | --- | --- | --- |
| 32 | 0 -0.179131 1 -0.179131 2 0.461767 3... | 0 -1.108077 1 -1.108077 2 -1.187180 3... | 0 0.012600 1 0.012600 2 2.360390 3... | 0 0.066584 1 0.066584 2 -0.463427 3... | 0 -0.095881 1 -0.095881 2 0.639209 3... | 0 0.396843 1 0.396843 2 -0.383526 3... |
| 0 | 0 0.079106 1 0.079106 2 -0.903497 3... | 0 0.394032 1 0.394032 2 -3.666397 3... | 0 0.551444 1 0.551444 2 -0.282844 3... | 0 0.351565 1 0.351565 2 -0.095881 3... | 0 0.023970 1 0.023970 2 -0.319605 3... | 0 0.633883 1 0.633883 2 0.972131 3... |
| 35 | 0 1.102297 1 1.102297 2 0.73238... | 0 -1.790773 1 -1.790773 2 0.661191 3... | 0 0.001413 1 0.001413 2 -1.57956... | 0 0.258347 1 0.258347 2 -0.127842 3... | 0 -0.165129 1 -0.165129 2 -0.16779... | 0 0.516694 1 0.516694 2 -0.58860... |
| 4 | 0 0.354481 1 0.354481 2 0.449142 3... | 0 -0.567671 1 -0.567671 2 -1.899854 3... | 0 -0.084270 1 -0.084270 2 0.913056 3... | 0 -0.223723 1 -0.223723 2 0.692477 3... | 0 -0.247694 1 -0.247694 2 0.149149 3... | 0 0.050604 1 0.050604 2 0.849616 3... |
| 29 | 0 0.118553 1 0.118553 2 -0.545332 3... | 0 0.419456 1 0.419456 2 0.371223 3... | 0 -0.283447 1 -0.283447 2 0.707172 3... | 0 0.135832 1 0.135832 2 0.159802 3... | 0 -0.079901 1 -0.079901 2 -0.090555 3... | 0 0.050604 1 0.050604 2 0.474080 3... |


In [10]:
t = TSFreshFeatureExtractor(default_fc_parameters="efficient", show_warnings=False)
Xt = t.fit_transform(X_train)
Xt.head()

  warn("Found non-unique index, replaced with unique index.")


Feature Extraction: 100%|██████████| 5/5 [00:39<00:00,  7.88s/it]




variable,dim_0__abs_energy,dim_0__absolute_sum_of_changes,"dim_0__agg_autocorrelation__f_agg_""mean""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""median""__maxlag_40","dim_0__agg_autocorrelation__f_agg_""var""__maxlag_40","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","dim_0__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,dim_5__symmetry_looking__r_0.9500000000000001,dim_5__time_reversal_asymmetry_statistic__lag_1,dim_5__time_reversal_asymmetry_statistic__lag_2,dim_5__time_reversal_asymmetry_statistic__lag_3,dim_5__value_count__value_-1,dim_5__value_count__value_0,dim_5__value_count__value_1,dim_5__variance,dim_5__variance_larger_than_standard_deviation,dim_5__variation_coefficient
0,5467.09735,370.994067,-0.016418,-0.030386,0.016192,12.91137,3.722119,0.30169,22.223362,27.201447,...,1.0,-43.876241,-12.725011,4.862733,0.0,0.0,0.0,15.936696,1.0,154.206374
1,10.629914,22.690124,0.039365,0.029099,0.008885,1.021608,0.068493,-0.493076,0.172195,1.6382,...,1.0,0.019919,-0.005089,-0.02841,0.0,0.0,0.0,0.260379,0.0,9.377847
2,8664.94077,324.813455,-0.034966,-0.112034,0.071626,20.912103,5.560128,-1.942906,63.322971,28.026657,...,1.0,32.364692,48.229669,93.801281,0.0,1.0,0.0,19.9436,1.0,-14.591905
3,8.82363,23.875852,0.00393,-0.040282,0.026035,0.892803,-0.024433,-0.348778,0.129618,1.40552,...,1.0,-0.00735,-0.017857,-0.011087,0.0,0.0,0.0,0.190012,0.0,21.338468
4,232.319298,137.452006,-0.012131,-0.039673,0.023279,4.086147,0.998918,-0.92573,2.506426,5.070925,...,1.0,0.130831,0.451732,0.787399,0.0,0.0,0.0,2.031789,1.0,17.833684
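
For multivariate input, the features of every dimension end up side by side in one table, and the column names record where each feature came from. The names above follow the pattern `<dimension>__<feature calculator>__<parameter>...`, with parts separated by double underscores. The sketch below splits a few column names copied from the output into those parts; `split_feature_name` is a hypothetical helper written for this illustration.

```python
from collections import defaultdict

# A few column names copied from the output above.
columns = [
    "dim_0__abs_energy",
    "dim_0__absolute_sum_of_changes",
    'dim_0__agg_autocorrelation__f_agg_"mean"__maxlag_40',
    "dim_5__time_reversal_asymmetry_statistic__lag_1",
    "dim_5__variance",
]


def split_feature_name(column):
    """Split a column name into (dimension, calculator, parameter strings)."""
    dimension, calculator, *params = column.split("__")
    return dimension, calculator, params


# Group the feature calculators by the dimension they were computed on.
by_dimension = defaultdict(list)
for column in columns:
    dimension, calculator, params = split_feature_name(column)
    by_dimension[dimension].append((calculator, params))

for dimension, features in by_dimension.items():
    print(dimension, features)
```

Grouping columns this way makes it straightforward to inspect or select the features belonging to a single dimension of the original data.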
