-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Documentation for set_index(col, compute=True)
is unclear/inaccurate
#8415
Comments
Thanks for raising this issue and for the examples-- I was able to reproduce what you're describing and was actually about to open an issue after reading your SO post. @ian-r-rose and I had a good discussion on this, and it does seem that at the very least the documentation should be updated to reflect this behavior, but it's possible this could be a bug. @gjoseph92 maybe you have thoughts on this? |
@scharlottej13 thanks for looking into it! Btw, do please upvote any SO questions you find useful — it's a minor thing that nonetheless helps the question get recognized and lets the asker know that they're at least being useful :) |
A quick skim of the code: dask/dask/dataframe/shuffle.py Lines 558 to 564 in d34a354
makes me think that It's not immediately clear to me what the value of this is? Seems like it was added long, long ago in #378. The documentation is certainly lacking here, but I'm not sure how much reason there even is to keep the feature around? |
I agree that it's not obvious that
I don't think these are hugely valuable options, personally, though unfortunately |
Regarding the usefulness:
|
Agreed that If
@DahnJ yes, basically. You'd probably use it in a similar way. Being pedantic, it is slightly different, in that you are writing 100% of the data to disk, but in an intermediate form where there's still a little work left to do to get it in its final shuffled form. It's basically writing If there was some more general way to have |
Yes, I'm referring to "O".
I suppose you could make this argument, but it seems pretty slippery to me. I can say that I've been surprised by all of the above having |
Agreed, it is a little odd that in this lazy computing library, a handful of functions are not lazy by default. And that that is probably not going to change. Back to this one in |
To me, the interesting question would be whether using this option provides any improvements compared to using Based on this quick benchmark, it seems that while the writing is faster, reading is much slower (possibly because of the necessity to still import dask
import dask.dataframe as dd
df = dask.datasets.timeseries()
# `compute=True`
# 8.97s
dfb = df.set_index('name', divisions=('Alice', 'Michael', 'Zelda'), shuffle='disk', compute=True)
# 36.9s
dfb.compute()
# `to_parquet`
# 49s
df.set_index('name', divisions=('Alice', 'Michael', 'Zelda'), shuffle='disk').to_parquet('test.temp')
# 1.21s
dd.read_parquet('test.temp').compute() One more thing to possibly consider: where are the files written to? If to the workers' disks, then this is possibly still a significantly different operation from what An additional point is that I think it might be quite likely that only a very few people are using this option (an perhaps even fewer who actually do so on purpose), given that it's so hard to work out what it even does. |
Funny, running your benchmarks on my machine (2019 Intel MacBook Pro, 8-core, SSD), they only take 6-8sec, not the 45sec-1min you posted. And with This makes sense, because with The fact that with
Yup, written to workers' disks. It's true that with multiple worker machines, you probably couldn't get All this is making me lean towards deprecating the |
This makes sense to me. However, I still want to explicitly ask about the option of having I would probably start this as another issue since it's independent of this issue, but I first wanted to ask you to see if I'm not missing something. |
Thanks all for the discussion! To summarize, it sounds like there are two options (1) to deprecate the compute option in |
Thanks for the ping, I am all for deprecating surprising behaviors and eager computations. @DahnJ did you open an issue for adding a |
I think the documentation is currently unclear/inaccurate about the nature of the
compute
parameter forset_index
:The documentation currently contains this description:
This would suggest that if I provide divisions and set
compute=True
, immediate computation will be triggered. This only seems to be the case when usingshuffle=disk
, however. Even then, it's not clear to me what is actually being computed.Examples from the SO question where I originally asked about this (What does set_index(col, compute=True) do in Dask?):
Going down the stack of functions
set_index
actually calls, it appears that the only place wherecompute
is actually used inrearrange_by_column_disk
. And indeed:If I'm correct, then I believe the documentation should reflect the fact that this setting only affects the
shuffle=disk
case. Also, I can't work out from the documentation what is actually being computed — "immediate computation" of what?The text was updated successfully, but these errors were encountered: