Dataframe isin function with strange behaviour on large lists #3198

Closed
dennis-ec opened this issue Feb 23, 2018 · 7 comments
@dennis-ec

This issue is based on a Stack Overflow question and was motivated by @mrocklin's answer there.

I was experimenting with filtering dask dataframes using the isin function with large lists of values to check against:

import dask.dataframe as dd
import pandas
from dask.distributed import Client, LocalCluster

c = Client(LocalCluster())
# 5000-row dataframe split into 10 partitions
dask_df = dd.from_pandas(pandas.DataFrame.from_dict({'A': [1, 2, 3, 4, 5] * 1000}), npartitions=10)
# ~300,000 even numbers (about 1.4 MB as a plain Python list)
filter_list = list(range(2, 600000, 2))

I then tried several ways of filtering A from dask_df with filter_list:

# Method A: call isin directly; the list ends up embedded in the task graph
mask = dask_df['A'].isin(filter_list)
res = dask_df[mask].compute()

# Method B: scatter the list first, then submit the bound isin method
scattered_filter_list = c.scatter(filter_list)
mask = c.submit(dask_df['A'].isin, scattered_filter_list).result()
res = dask_df[mask].compute()

# Method C: submit the bound isin method with the raw list
mask = c.submit(dask_df['A'].isin, filter_list).result()
res = dask_df[mask].compute()

# Method D: pass the scattered future directly to isin
scattered_filter_list = c.scatter(filter_list)
mask = dask_df['A'].isin(scattered_filter_list)
res = dask_df[mask].compute()

Each one of them gave me the warning:

/home/miniconda3/lib/python3.6/site-packages/distributed/worker.py:739: UserWarning: Large object of size 1.44 MB detected in task graph: 
  ([2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26,  ... 9996, 599998],)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  % (format_bytes(len(b)), s))

I think there should be a 'right way' to do this, one that does not produce the warning, and it should be covered in the docs.
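
In the meantime, one workaround in the spirit of the warning's advice might be to wrap the list in dask.delayed, so it is serialized once rather than embedded in every task, and then apply isin per partition. A minimal sketch (untested on these versions; delayed_values is just an illustrative name, and it relies on map_partitions accepting delayed arguments as documented):

import dask

# Ship the large list once as a delayed object instead of embedding it
# in the task graph, then apply isin within each partition.
delayed_values = dask.delayed(filter_list)
mask = dask_df['A'].map_partitions(
    lambda part, values: part.isin(values),
    delayed_values,
    meta=('A', 'bool'),
)
res = dask_df[mask].compute()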

pandas='0.22.0'
dask='0.17.0'
@dennis-ec
Author

Bump. @mrocklin, any ideas?
I'm happy to try out new ideas if anyone has any.
How best to filter dask dataframes with large lists is something that has been on my mind for some time.

@mrocklin
Member

mrocklin commented Mar 1, 2018

I agree that there should be a right way to do this. Currently it isn't supported, which seems like a bug to me, and it should be fixed.

@groceryheist

I ran into a related problem today: ddf = ddf[ddf.A.isin(filter_set)] was slow and ran in only a single thread, and filter_set was quite large.

I found that setting the index, creating a dataframe from filter_set, and doing a merge was faster and less error-prone; see the sketch below.
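
A rough sketch of what that looks like (simplified, with illustrative names like filter_df; it assumes filter_set holds plain values for column A):

import pandas as pd

# Build a one-column lookup frame from the filter values and index it on
# that column; an inner merge on the index then keeps only the rows of
# ddf whose A value appears in filter_set.
filter_df = pd.DataFrame({'A': sorted(filter_set)}).set_index('A')
result = ddf.set_index('A').merge(
    filter_df, left_index=True, right_index=True, how='inner'
)
res = result.compute()

Note that set_index on a dask dataframe triggers a shuffle, which is the up-front cost of this approach.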

@nabinkhadka
Contributor

@mrocklin is there any update on this? You mentioned it seemed like a bug to you. Was that ever fixed?

@mrocklin
Member

mrocklin commented Jan 2, 2019

We usually close issues after they have been fixed (though not always). If the issue is still open then I suspect that the bug remains. If you wanted to verify this you could look through the git history or try out the failing example above on master. Unfortunately I'm not personally able to keep track of all bugs.

If you have any interest in helping to resolve this issue that would be welcome.

@bnaul
Contributor

bnaul commented May 23, 2019

I can confirm that this was fixed by #4727, thanks @jcrist! Anyone want to close?

@jcrist jcrist closed this as completed May 23, 2019
@jrbourbeau
Member

Thanks for following up on this @bnaul!
