Possible bug in using apply in dask dataframes #2774
The metadata is maybe getting messed up, since you're repeatedly assigning a different `meta` name in each iteration:

```python
In [50]: xs = [df2.triggers.apply(lambda x: trigger in x, meta=(trigger, 'bool'))
   ....:       for trigger in pop_triggers]

In [51]: dd.concat(xs, axis=1).compute()
Out[51]:
    Total Traffic    UDP    DNS  TCP SYN  TCP null   ICMP
0           False  False  False    False     False  False
1           False  False  False    False     False  False
2           False  False  False    False     False  False
3           False  False  False    False     False  False
4            True   True   True     True      True   True
..            ...    ...    ...      ...       ...    ...
45          False  False  False    False     False  False
46          False  False  False    False     False  False
47          False  False  False    False     False  False
48          False  False  False    False     False  False
49           True   True   True     True      True   True

[50 rows x 6 columns]
```

FYI, if you can avoid it, you don't want to be storing lists in pandas / dask dataframes. Typically it's better to do that kind of pre-processing outside pandas.

I won't have a chance to look into this bug more until later next week. Feel free to have a look at what's going on if you're interested.
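The "pre-processing outside pandas" suggestion can be sketched in plain Python: build the one-hot columns from the raw lists first, then hand ordinary boolean columns to pandas or dask. The sample data below is invented for illustration, not taken from the thread:

```python
# Hypothetical sample standing in for the list-valued `triggers` column.
raw_triggers = [["DNS"], ["UDP", "DNS"], [], ["ICMP"]]
pop_triggers = ["DNS", "UDP", "ICMP"]

# One plain-Python pass over the lists; the result is a dict of
# boolean columns that can be passed directly to pd.DataFrame or
# dd.from_pandas, so no lists ever live inside the dataframe.
one_hot = {t: [t in lst for lst in raw_triggers] for t in pop_triggers}

print(one_hot["DNS"])  # [True, True, False, False]
```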
@TomAugspurger if you look at your results, you will see something's wrong (in addition to the metadata issue). It seems that the `trigger` variable used in the lambda remains mutable until the `compute()` call, and so at `compute()` time the lambda is evaluated against the last value that `trigger` held.

If you implement the lambda as a function, I guess the explicit closure bypasses the issue (see my answer to OP's SO question, link above). I'm not familiar enough with dask's internals to say if it's a limitation or a bug.
Thanks, I didn't see that. I think we've had an issue like this before...
I'll see if I can dig it up.
This isn't a bug, this is just how python works. Closures evaluate based on the defining scope; if you change the value of a variable in that scope, the closure sees the new value:

```python
In [1]: [(lambda: a)() for a in range(3)]  # call immediately
Out[1]: [0, 1, 2]

In [2]: funcs = [lambda: a for a in range(3)]  # call later

In [3]: [f() for f in funcs]
Out[3]: [2, 2, 2]
```

The issue here is that this code would run fine in pandas, since there is an evaluation in each loop, but in dask all the evaluations are delayed until later, and thus all use the same (final) value for `trigger`.

The solution you posted on stackoverflow using an explicit function works because each call creates a new scope in which the trigger value is bound, so each closure keeps its own value.

Also note that in general it's best to avoid creating closures when working with dask, as they don't serialize well and may degrade performance if using the distributed scheduler.
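A standard Python idiom for forcing early binding (not spelled out in the thread, but equivalent to the explicit-function fix) is a default argument, which is evaluated once at definition time rather than at call time:

```python
# Late binding: every closure reads `a` from the enclosing scope,
# which holds 2 by the time the functions are called.
late = [lambda: a for a in range(3)]

# Early binding: the default value `a=a` is evaluated when each
# lambda is *defined*, so each function keeps its own copy.
early = [lambda a=a: a for a in range(3)]

print([f() for f in late])   # [2, 2, 2]
print([f() for f in early])  # [0, 1, 2]
```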
Ok, thanks for the clarification. It was still not 100% clear on first reading, but with an explicit function and loop, the function defined in the loop is rewritten over and over; it's only when going out of scope that the functions become distinct from each other, each being bound with its own closure. So my solution worked, but not for the reason I thought...

Well noted about avoiding closures under dask. As for the dask-specific operations, do they bring a speedup?
Yes, but the speedup could be negligible or significant depending on the task. Most dask dataframe operations are implemented using a core set of functions (`map_partitions`, for example).

In general I recommend sticking with the pandas api until you find tasks that are slow, and then drop down to using the dask-specific operations to optimize (e.g. `map_partitions`).
Closing, as this is not really a bug in dask but more an issue with how python's closures work. Feel free to reopen if you disagree.
This is a crosspost from https://stackoverflow.com/questions/46720983/incompatibility-of-apply-in-dask-and-pandas-dataframes. Would appreciate it if it could be answered there if it isn't a bug.
A sample of the `triggers` column in my Dask dataframe looks like the following. I wish to create a one-hot encoded version of the above arrays (putting a `1` against the `DNS` column in row 1, for example) by doing the following; `pop_triggers` contains all possible values of `triggers`.

However, the `Total Traffic`, `DNS`, etc. columns all contain the value 0 and not 1 for the relevant value. When I copy it into a pandas dataframe and do the same operation, they get the expected value.

What am I missing here? Is it because dask is lazy that somehow it's not filling in the values as expected?
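The late-binding problem described in the comments can also be avoided with `functools.partial`, which captures each trigger value eagerly. A standalone sketch with made-up data, independent of dask:

```python
from functools import partial

def contains(trigger, lst):
    return trigger in lst

pop_triggers = ["Total Traffic", "UDP", "DNS"]

# partial() binds each trigger's *current* value at creation time,
# so delayed evaluation later still sees the right one.
predicates = {t: partial(contains, t) for t in pop_triggers}

row = ["DNS", "UDP"]
results = {t: p(row) for t, p in predicates.items()}
print(results)  # {'Total Traffic': False, 'UDP': True, 'DNS': True}
```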
Possible bug below:
I investigated some places where the flag was set in the first place (which turned out to be far fewer than I expected), and got some really weird results. See below:
output:
Minimal working example:
Output: