Dask support #10

Closed
jneuff opened this issue Oct 27, 2016 · 11 comments

@jneuff
Collaborator

jneuff commented Oct 27, 2016

To efficiently handle huge data sets, we could make use of Dask.

@nils-braun
Collaborator

This is quite an interesting topic. However, the dask DataFrame-creation API is highly incompatible with the pandas one and using dask correctly would mean some severe changes to the implementation.

So my questions are:
(1) do we really need this?
(2) are there use cases where we need to run over more data than fits into memory? and
(3) isn't the processing time in those cases so large that we should rather think about a "cluster-parallel" implementation?

@jneuff
Collaborator Author

jneuff commented Nov 21, 2016

Yes, this would imply quite some changes. This issue was meant to trigger the discussion.

On (3): Dask actually offers distributed scheduling.
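
For context, a minimal sketch of what that distributed scheduling looks like from user code; the scheduler address here is purely illustrative:

from dask.distributed import Client

# Connect to an already running dask scheduler instead of computing locally.
client = Client("tcp://scheduler-host:8786")  # hypothetical address
print(client)  # reports the workers, cores and memory available in the cluster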

@MaxBenChrist
Collaborator

(1) Maybe dask is not the right solution for our problem. But at the moment, when one wants to extract features for a big number of time series, one has to apply tsfresh chunkwise because some feature calculators are quite memory intensive. So let's say you have 16 GB of RAM: then you probably cannot process 2-3 GB of time series data in one chunk; you have to split it by devices and pickle the feature DataFrames (see the sketch after this list). What do you think about renaming this issue to "allow extraction of features for big time series"?

(2) yes, definitely

(3) you are probably right here.
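
As an illustration of that chunkwise workaround, a rough sketch; the file name and the "device"/"time"/"value" column names are made up, and only one device's data is held in memory at a time:

import pandas as pd
from tsfresh import extract_features

# Hypothetical long-format input with columns "device", "time", "value".
df = pd.read_csv("timeseries.csv")

# Extract features one device at a time and pickle each intermediate result,
# so the memory-hungry feature calculators only ever see a single chunk.
for device, chunk in df.groupby("device"):
    features = extract_features(chunk, column_id="device",
                                column_sort="time", column_value="value")
    features.to_pickle("features_{}.pkl".format(device))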

@mrocklin

Just came across this. If there are particular incompatibilities between Dask dataframes and Pandas dataframes that affect this project, please let us know.

Also, it may be that the right approach isn't to use dask.dataframe directly, but instead to use it to pre-process data and then apply tsfresh functions over it. For example, it would be easy to use dask.dataframe to load large datasets, share a fixed window of data (maybe five minutes) between neighboring partitions (which are just pandas dataframes), and then call tsfresh functions on each of those pandas dataframes. This sort of solution (or something similar) is common for more complex applications like yours.
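
A hedged sketch of that pre-processing idea, assuming the files are already split so that each partition holds complete time series (paths and column names are made up):

import dask
import dask.dataframe as dd
from tsfresh import extract_features

# Load data that does not fit into memory at once; each CSV is assumed to
# contain whole time series, identified by a hypothetical "device" column.
ddf = dd.read_csv("timeseries-*.csv")

# Every partition of a dask dataframe is a plain pandas DataFrame, so tsfresh
# can be applied to each partition independently via dask.delayed.
delayed_features = [
    dask.delayed(extract_features)(part, column_id="device",
                                   column_sort="time", column_value="value")
    for part in ddf.to_delayed()
]

# Computes in parallel; a distributed Client would spread this over a cluster.
feature_frames = dask.compute(*delayed_features)

(The window-sharing part of the suggestion would roughly correspond to dask.dataframe's map_overlap, which hands each partition a slice of its neighbors as well.)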

Anyway, let me know if dask developers can help.

@MaxBenChrist
Collaborator

@mrocklin, thank you for your nice message.

At the moment we are working on some other issues (e.g. shaping the tsfresh API, restructuring the test framework, ...). I discussed a possible dask implementation with @chmp (he contributed a few things to the dask project). As soon as this comes up again and we have some spare time to tackle this, I will get in contact with you guys :).

@MaxBenChrist MaxBenChrist self-assigned this Jul 23, 2017
@MaxBenChrist MaxBenChrist modified the milestone: v0.10.0 Jul 23, 2017
@MaxBenChrist MaxBenChrist removed their assignment Jul 28, 2017
@MaxBenChrist
Collaborator

MaxBenChrist commented Sep 13, 2017

@mrocklin

We are working on supporting dask to calculate tsfresh features in a distributed fashion (see the distributed branch). dask is a great tool and our first experiments on a cluster make me really excited about the upcoming tsfresh release.

I have one question regarding the pure argument of the client.map method. Essentially, tsfresh is a wrapper around numpy, scipy and pandas methods. Most of the features that we calculate, e.g. Fourier coefficients, are computed in C libraries.

In the dask documentation you say

By default we assume that all functions are pure. If this is not the case we should use the pure=False keyword argument.

so following this advice, I should set pure=False because we rely on those scipy/numpy methods. But, on that same page you mention that

This key should be the same across all computations with the same inputs and across all machines. If we run the computation above on any computer with the same environment then we should get the exact same key.

The scheduler avoids redundant computations. If the result is already in memory from a previous call then that old result will be used rather than recomputing it. Calls to submit or map are idempotent in the common case.

We want the same features to be calculated for the same input. So, to achieve that and reduce the number of redundant calculations, we have to set pure=True so that the same input gets the same calculation key, right?
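
For concreteness, a sketch of the pattern this question is about; the scheduler address, file name and column names are invented:

import pandas as pd
from dask.distributed import Client
from tsfresh import extract_features

client = Client("tcp://scheduler-host:8786")  # hypothetical scheduler address

# Hypothetical long-format input with columns "device", "time", "value".
df = pd.read_csv("timeseries.csv")
chunks = [chunk for _, chunk in df.groupby("device")]

def features_for_chunk(chunk):
    return extract_features(chunk, column_id="device",
                            column_sort="time", column_value="value")

# With pure=True (the default) every task is keyed by function + inputs, so
# mapping an identical chunk twice reuses the result instead of recomputing it.
futures = client.map(features_for_chunk, chunks, pure=True)
feature_frames = client.gather(futures)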

@MaxBenChrist
Collaborator

Reference: PR #316

@mrocklin

The pure= keyword is a bit inaccurate. Pure has a few meanings. It would be more accurate to define this as deterministic= instead. If you use pure=True then you are stating that applying the same function to the same arguments will always produce the same result.

If you submit two functions

x = client.submit(inc, 1)
y = client.submit(inc, 1)

Under pure=True these will point to the same data. Under pure=False they will point to different data.

Relying on NumPy/SciPy/Pandas has no impact here. Most of those functions are deterministic.
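
A small runnable sketch of that key behavior (a local Client is started purely for illustration):

from dask.distributed import Client

def inc(x):
    return x + 1

client = Client()  # local cluster, just for demonstration

a = client.submit(inc, 1)              # pure=True is the default
b = client.submit(inc, 1)
assert a.key == b.key                  # identical key: computed once, result shared

c = client.submit(inc, 1, pure=False)
d = client.submit(inc, 1, pure=False)
assert c.key != d.key                  # random keys: each call computed separately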

@MaxBenChrist
Collaborator

MaxBenChrist commented Sep 14, 2017

The pure= keyword is a bit inaccurate. Pure has a few meanings. It would be more accurate to define this as deterministic= instead. If you use pure=True then you are stating that applying the same function to the same arguments will always produce the same result.

Thanks for the explanation!

@mrocklin

mrocklin commented Sep 14, 2017 via email

@MaxBenChrist
Collaborator

Dask support is now in master 👍
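
For anyone finding this issue later, a hedged sketch of how distributed extraction can be invoked; the names, paths and scheduler address are illustrative, and the exact distributor API for your version is described in the tsfresh documentation:

import pandas as pd
from tsfresh import extract_features
from tsfresh.utilities.distribution import ClusterDaskDistributor

# Hypothetical long-format input with columns "device", "time", "value".
df = pd.read_csv("timeseries.csv")

# Hand the per-chunk work to an existing dask cluster via its scheduler address.
distributor = ClusterDaskDistributor(address="tcp://scheduler-host:8786")

features = extract_features(df, column_id="device", column_sort="time",
                            column_value="value", distributor=distributor)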
