added mode #5958
Conversation
TomAugspurger
left a comment
Thanks for working on this. We'll want to make sure we have tests for Series and DataFrames that have multiple values for the mode.
@TomAugspurger this should be ready for another review whenever you get a chance. Thank you!
@balast just an FYI, the Windows CI failures are unrelated to the changes here. I'll work on fixing those CI builds.
@TomAugspurger All checks are passing! This should now be ready for review when you get a chance. Thanks!
dask/dataframe/core.py
Outdated
name = "concat-" + tokenize(*mode_series_list)
concat_axis1 = partial(methods.concat, axis=1)
We try to avoid placing partials and other dynamically generated functions into the task graph. This aids future debugging and reduces scheduling overhead. If you need to add a keyword, then consider using the apply function.
However, I also wonder if there is a way to do what you're trying to do here with map_partitions instead. If so, it would be good to do so. More future devs/reviewers will be able to understand map_partitions than custom graphs (which are used only when we need to do something complicated).
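As a rough illustration of the partial-versus-apply point above, here is a minimal sketch in plain Python. Everything in it is a stand-in: `apply` is a local helper assumed to have the same `(func, args, kwargs)` shape as `dask.utils.apply`, `concat` is a hypothetical toy function, and `run` is a one-line task runner, not Dask's scheduler.

```python
from functools import partial


def apply(func, args, kwargs=None):
    # Local stand-in, assumed to mirror the (func, args, kwargs) shape of dask.utils.apply.
    if kwargs:
        return func(*args, **kwargs)
    return func(*args)


def concat(parts, axis=0):
    # Hypothetical toy concat: axis=0 chains the lists, axis=1 zips them into rows.
    if axis == 1:
        return [list(row) for row in zip(*parts)]
    return [x for part in parts for x in part]


def run(task):
    # Minimal task runner for illustration: a task is a tuple (callable, *arguments).
    func, *args = task
    return func(*args)


parts = [[1, 2], [3, 4]]

# Style the review discourages: the keyword is hidden inside a partial object.
task_partial = (partial(concat, axis=1), parts)

# Style the review suggests: a named function plus explicit keyword arguments.
task_apply = (apply, concat, [parts], {"axis": 1})

result_partial = run(task_partial)
result_apply = run(task_apply)
```

Both tasks compute the same thing, but in the second one the function (`concat`) and its keyword (`axis=1`) stay visible as plain entries of the task tuple, which is what makes the graph easier to inspect.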
I replaced partial with apply.
I don't think map_partitions is appropriate here though I'm happy to try additional suggestions.
Why I don't believe map_partitions is appropriate here
In this implementation, mode is computed on each series of the dataframe. As part of computing mode on a series, all of the series' partitions are collapsed into a single partition via value_counts and sum. This last bit of the mode method concatenates the results of running mode on each of the series. The resulting series often differ in length, and I am trying to concatenate them along axis=1 to construct the final dataframe, with NaNs filling in the extra space in the shorter series (as pandas does).
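The per-series step described above can be sketched in plain pandas. This is only an illustration of the idea (count per partition, sum the counts, keep the values with the highest total), not the code in this PR; `mode_from_partitions` is a hypothetical helper and the "partitions" are ordinary slices of one Series.

```python
import pandas as pd


def mode_from_partitions(partitions):
    # Count values within each partition independently.
    counts = [part.value_counts() for part in partitions]
    # Sum the counts for the same value across partitions (groupby sorts the values).
    total = pd.concat(counts).groupby(level=0).sum()
    # Keep every value that reaches the maximum total count.
    modes = total[total == total.max()].index
    return pd.Series(modes.to_numpy(), name=partitions[0].name)


s = pd.Series([1, 1, 2, 2, 3], name="x")
partitions = [s.iloc[:2], s.iloc[2:]]  # pretend these are two Dask partitions

result = mode_from_partitions(partitions)
expected = s.mode()
```

Here both 1 and 2 occur twice, so the result has two rows, which is exactly the multiple-mode case that makes the per-column results differ in length.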
The concat_unindexed_dataframes function in dask/dataframe/multi.py (see below) is very close to what I want to do, except that it calls concat_and_check, which checks that the series are the same length, which they are not. I therefore copied concat_and_check and modified it slightly to do what I want in this bit of code.
Part of dask/dataframe/multi.py for reference
def concat_and_check(dfs):
    if len(set(map(len, dfs))) != 1:
        raise ValueError("Concatenated DataFrames of different lengths")
    return methods.concat(dfs, axis=1)

def concat_unindexed_dataframes(dfs):
    name = "concat-" + tokenize(*dfs)
    dsk = {
        (name, i): (concat_and_check, [(df._name, i) for df in dfs])
        for i in range(dfs[0].npartitions)
    }
    meta = methods.concat([df._meta for df in dfs], axis=1)
    graph = HighLevelGraph.from_collections(name, dsk, dependencies=dfs)
    return new_dd_object(graph, name, meta, dfs[0].divisions)
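To make the difference concrete, here is a small pandas-only sketch. `concat_same_length` copies the length check from concat_and_check quoted above but calls `pd.concat` directly (an assumption made so the example runs without Dask), and `concat_padded` is a hypothetical name for the variant without the check; neither is the code in this PR.

```python
import pandas as pd


def concat_same_length(dfs):
    # Same check as concat_and_check above, using pandas.concat directly.
    if len(set(map(len, dfs))) != 1:
        raise ValueError("Concatenated DataFrames of different lengths")
    return pd.concat(dfs, axis=1)


def concat_padded(dfs):
    # Hypothetical variant without the length check: pandas aligns on the
    # index and pads the shorter columns with NaN.
    return pd.concat(dfs, axis=1)


mode_a = pd.Series([1, 2], name="a")  # column with two modes
mode_b = pd.Series([5], name="b")     # column with one mode

try:
    concat_same_length([mode_a, mode_b])
    raised = False
except ValueError:
    raised = True

padded = concat_padded([mode_a, mode_b])
```

With these inputs the checked version raises, while the padded version returns a two-row frame whose second row has NaN in column `b`.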
@TomAugspurger I would welcome additional feedback on the latest changes when you are able. Thank you!
TomAugspurger
left a comment
LGTM once we have a note for #5958 (comment).
The failing test is test_cov. I believe I'm seeing #5910. Recommitting to get around this.
Thanks @balast!
(Added a few tests which pass; one fails during pytest but runs fine otherwise, and I'm stumped)
black dask / flake8 dask
Adds the mode function for Series and DataFrame. It doesn't currently support the numeric_only parameter, though I believe I could add it. Hoping to get some feedback generally, and specifically on why my last test won't pass when run in pytest but runs fine otherwise. It's commented out at the moment.