Replies: 1 comment
-
First of all, sorry for the massive delay in answering this!
That is correct! What you can do, however, is the following: chunk up your data into smaller partitions (e.g. with dask or spark) and then calculate features on them. Maybe choose overlapping chunks (although you probably do not want a full rolling). Additionally, you can downsample the time series and extract features on the downsampled version.
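A minimal sketch of that chunking idea, using only pandas/numpy (the column names `id`, `time`, `value` and the helper `chunk_long_series` are illustrative assumptions, not tsfresh API; the resulting frame is in the long format that `tsfresh.extract_features(column_id="id", column_sort="time")` expects):

```python
import numpy as np
import pandas as pd

def chunk_long_series(series, chunk_size, overlap=0):
    """Split one long series into (possibly overlapping) chunks and
    return a long-format frame with one synthetic id per chunk,
    so each chunk is treated as its own time series downstream."""
    step = chunk_size - overlap
    rows = []
    for chunk_no, start in enumerate(range(0, len(series) - chunk_size + 1, step)):
        window = series.iloc[start:start + chunk_size]
        rows.append(pd.DataFrame({
            "id": chunk_no,          # synthetic id: one per chunk
            "time": window.index,    # keep original positions for sorting
            "value": window.to_numpy(),
        }))
    return pd.concat(rows, ignore_index=True)

# One long series: 1000 points, chunked into windows of 200 with 50 overlap.
long_series = pd.Series(np.sin(np.linspace(0, 50, 1000)))
df = chunk_long_series(long_series, chunk_size=200, overlap=50)

# Downsampling variant: keep every 10th point before feature extraction.
downsampled = long_series.iloc[::10]
```

Each synthetic `id` then becomes an independent unit of work, so the chunks can also be distributed with dask or spark as suggested above.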
-
Thanks for the great project!
My question is about using `tsfresh` when an individual time series is very large.

I've glanced over the documentation (particularly "Large Input Data"), and it says that for big data the input is divided into chunks which are then distributed over the cluster, where the minimal unit of a chunk is an individual time series (i.e. two chunks cannot contain data from the same time series, unless the user handles this case manually). Hence, my understanding is that the whole framework assumes each individual unit of the user's input data must still fit into a single-machine pandas DataFrame. Looking at `feature_calculators` also seems to support this, since all of the functions operate on NumPy arrays.

But what do we do if I want to calculate a feature (e.g. `stddev`) on a single time series which does not fit into memory? Although the framework claims Dask support, the feature calculators still operate only on NumPy arrays, not on Dask collections/dataframes. Does this mean the user has to chunk the input manually? If so, we also need to handle the reduce step that gathers the per-chunk results into a single output feature value, which adds a whole layer of ambiguity (e.g. how do I get the `stddev` of a large time series from the 10 `stddev` values of 10 smaller ones?). In that case, is there an idea of how to handle this reduction step for all of the calculators, so the whole framework scales to large data? Or should it be done at the calculator level (e.g. one reducer per calculator, individually for each feature)?

The same question goes for rolling: is it possible to roll a larger-than-memory time series with the framework?
Thanks!