Dask support #10

Closed
jneuff opened this issue Oct 27, 2016 · 11 comments

@jneuff
Collaborator

jneuff commented Oct 27, 2016

To efficiently handle huge data sets, we could make use of Dask.

@nils-braun
Collaborator

This is quite an interesting topic. However, the dask DataFrame-creation API is highly incompatible with the pandas one and using dask correctly would mean some severe changes to the implementation.

So my questions are:
(1) do we really need this?
(2) are there use cases where we need to run over more data than fits into memory? and
(3) isn't the processing time in those cases so large that we should rather think about a "cluster-parallel" implementation?

@jneuff
Collaborator Author

jneuff commented Nov 21, 2016

Yes, this would imply quite some changes. This issue was meant to trigger the discussion.

On (3): Dask actually offers distributed scheduling.
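
For context, a minimal sketch of what that distributed scheduling looks like from user code; the scheduler address here is purely illustrative:

from dask.distributed import Client

# Connect to an already running dask scheduler instead of computing locally.
client = Client("tcp://scheduler-host:8786")  # hypothetical address
print(client)  # reports the workers, cores and memory available in the cluster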

@MaxBenChrist
Collaborator

(1) Maybe dask is not the right solution for our problem. But at the moment, when one wants to extract features for a big number of time series, one has to apply tsfresh chunkwise because some feature calculators are quite memory intensive. So let's say you have 16 GB of RAM: then you probably cannot process 2-3 GB of time series data in one chunk; you have to split it by devices and pickle the feature DataFrames (see the sketch after this list). What do you think about renaming this issue to "allow extraction of features for big time series"?

(2) yes, definitely

(3) you are probably right here.
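
As an illustration of that chunkwise workaround, a rough sketch; the file name and the "device"/"time"/"value" column names are made up, and only one device's data is held in memory at a time:

import pandas as pd
from tsfresh import extract_features

# Hypothetical long-format input with columns "device", "time", "value".
df = pd.read_csv("timeseries.csv")

# Extract features one device at a time and pickle each intermediate result,
# so the memory-hungry feature calculators only ever see a single chunk.
for device, chunk in df.groupby("device"):
    features = extract_features(chunk, column_id="device",
                                column_sort="time", column_value="value")
    features.to_pickle("features_{}.pkl".format(device))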

@mrocklin

Just came across this. If there are particular incompatibilities between Dask dataframes and Pandas dataframes that affect this project, please let us know.

Also, it may be that the right approach isn't to use dask.dataframe directly, but instead to use it to pre-process data and then apply tsfresh functions over it. For example, it would be easy to use dask.dataframe to load large datasets, share a fixed window of data (maybe five minutes) between neighboring partitions (which are just pandas dataframes), and then call tsfresh functions on each of those pandas dataframes. This sort of solution (or something similar) is common for more complex applications like yours.
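
A hedged sketch of that pre-processing idea, assuming the files are already split so that each partition holds complete time series (paths and column names are made up):

import dask
import dask.dataframe as dd
from tsfresh import extract_features

# Load data that does not fit into memory at once; each CSV is assumed to
# contain whole time series, identified by a hypothetical "device" column.
ddf = dd.read_csv("timeseries-*.csv")

# Every partition of a dask dataframe is a plain pandas DataFrame, so tsfresh
# can be applied to each partition independently via dask.delayed.
delayed_features = [
    dask.delayed(extract_features)(part, column_id="device",
                                   column_sort="time", column_value="value")
    for part in ddf.to_delayed()
]

# Computes in parallel; a distributed Client would spread this over a cluster.
feature_frames = dask.compute(*delayed_features)

(The window-sharing part of the suggestion would roughly correspond to dask.dataframe's map_overlap, which hands each partition a slice of its neighbors as well.)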

Anyway, let me know if dask developers can help.

@MaxBenChrist
Collaborator

@mrocklin, thank you for your nice message.

At the moment we are working on some other issues (e.g. shaping the tsfresh API, restructuring the test framework, ...). I discussed a possible dask implementation with @chmp (he contributed a few things to the dask project). As soon as this comes up again and we have some spare time to tackle this, I will get in contact with you guys :).

@MaxBenChrist MaxBenChrist self-assigned this Jul 23, 2017
@MaxBenChrist MaxBenChrist modified the milestone: v0.10.0 Jul 23, 2017
@MaxBenChrist MaxBenChrist removed their assignment Jul 28, 2017
@MaxBenChrist
Collaborator

MaxBenChrist commented Sep 13, 2017

@mrocklin

We are working on supporting dask to calculate tsfresh features in a distributed fashion (see the distributed branch). dask is a great tool and our first experiments on a cluster make me really excited about the upcoming tsfresh release.

I have one question regarding the pure argument of the client.map method. Essentially, tsfresh is a wrapper around numpy, scipy and pandas methods. Most of the features that we calculate, e.g. Fourier coefficients, are computed in C libraries.

In the dask documentation you say

By default we assume that all functions are pure. If this is not the case we should use the pure=False keyword argument.

so following this advice, I should set pure=False because we rely on those scipy/numpy methods. But, on that same page you mention that

This key should be the same across all computations with the same inputs and across all machines. If we run the computation above on any computer with the same environment then we should get the exact same key.

The scheduler avoids redundant computations. If the result is already in memory from a previous call then that old result will be used rather than recomputing it. Calls to submit or map are idempotent in the common case.

We want the same features to be calculated for the same input. So, to achieve that and reduce the number of redundant calculations, we have to set pure=True so that the same input gets the same calculation key, right?
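
For concreteness, a sketch of the pattern this question is about; the scheduler address, file name and column names are invented:

import pandas as pd
from dask.distributed import Client
from tsfresh import extract_features

client = Client("tcp://scheduler-host:8786")  # hypothetical scheduler address

# Hypothetical long-format input with columns "device", "time", "value".
df = pd.read_csv("timeseries.csv")
chunks = [chunk for _, chunk in df.groupby("device")]

def features_for_chunk(chunk):
    return extract_features(chunk, column_id="device",
                            column_sort="time", column_value="value")

# With pure=True (the default) every task is keyed by function + inputs, so
# mapping an identical chunk twice reuses the result instead of recomputing it.
futures = client.map(features_for_chunk, chunks, pure=True)
feature_frames = client.gather(futures)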

@MaxBenChrist
Collaborator

Reference: PR #316

@mrocklin

The pure= keyword is a bit inaccurate. Pure has a few meanings. It would be more accurate to define this as deterministic= instead. If you use pure=True then you are stating that applying the same function to the same arguments will always produce the same result.

If you submit two functions

x = client.submit(inc, 1)
y = client.submit(inc, 1)

Under pure=True these will point to the same data. Under pure=False they will point to different data.

Relying on NumPy/SciPy/Pandas has no impact here. Most of those functions are deterministic.
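
A small runnable sketch of that key behavior (a local Client is started purely for illustration):

from dask.distributed import Client

def inc(x):
    return x + 1

client = Client()  # local cluster, just for demonstration

a = client.submit(inc, 1)              # pure=True is the default
b = client.submit(inc, 1)
assert a.key == b.key                  # identical key: computed once, result shared

c = client.submit(inc, 1, pure=False)
d = client.submit(inc, 1, pure=False)
assert c.key != d.key                  # random keys: each call computed separately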

@MaxBenChrist
Collaborator

MaxBenChrist commented Sep 14, 2017

The pure= keyword is a bit inaccurate. Pure has a few meanings. It would be more accurate to define this as deterministic= instead. If you use pure=True then you are stating that applying the same function to the same arguments will always produce the same result.

Thanks for the explanation!

@mrocklin

mrocklin commented Sep 14, 2017 via email

@MaxBenChrist
Collaborator

Dask support is now in master 👍
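
For anyone finding this issue later, a hedged sketch of how distributed extraction can be invoked; the names, paths and scheduler address are illustrative, and the exact distributor API for your version is described in the tsfresh documentation:

import pandas as pd
from tsfresh import extract_features
from tsfresh.utilities.distribution import ClusterDaskDistributor

# Hypothetical long-format input with columns "device", "time", "value".
df = pd.read_csv("timeseries.csv")

# Hand the per-chunk work to an existing dask cluster via its scheduler address.
distributor = ClusterDaskDistributor(address="tcp://scheduler-host:8786")

features = extract_features(df, column_id="device", column_sort="time",
                            column_value="value", distributor=distributor)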
