added mode by Adam-D-Lewis · Pull Request #5958 · dask/dask

Adam-D-Lewis · 2020-02-28T06:51:44Z

Tests added / passed ~~(Added a few which pass; one fails during pytest but runs fine otherwise, and I'm stumped)~~
Passes black dask / flake8 dask

Adds the mode function for Series and DataFrame. It doesn't currently support the numeric_only parameter, though I believe I could add it. I'm hoping for feedback in general, and specifically on why my last test won't pass under pytest but runs fine otherwise. It's commented out at the moment.

TomAugspurger

Thanks for working on this. We'll want to make sure we have tests for Series and DataFrames that have multiple values for the mode.

dask/dataframe/core.py

dask/dataframe/tests/test_dataframe.py

Adam-D-Lewis · 2020-03-06T04:14:45Z

~~Closing in favor of #5983~~

dask/dataframe/core.py

dask/dataframe/tests/test_dataframe.py

dask/dataframe/core.py

Adam-D-Lewis · 2020-03-12T04:29:32Z

@TomAugspurger this should be ready for another review whenever you get a chance. Thank you!

jrbourbeau · 2020-03-13T15:20:33Z

@balast, just an FYI: the Windows CI failures are unrelated to the changes here. I'll work on fixing those CI builds.

Adam-D-Lewis · 2020-03-20T23:00:31Z

@TomAugspurger All checks are passing! This should now be ready for review when you get a chance. Thanks!

mrocklin · 2020-03-22T19:16:49Z

dask/dataframe/core.py

+
+            name = "concat-" + tokenize(*mode_series_list)
+
+            concat_axis1 = partial(methods.concat, axis=1)


We try to avoid placing partials and other dynamically generated functions into the task graph. This aids future debugging and reduces scheduling overhead. If you need to add a keyword, then consider using the apply function.

However, I also wonder if there is a way to do what you're trying to do here with map_partitions instead. If so, it would be good to do so. More future devs/reviewers will be able to understand map_partitions than custom graphs (which are used only when we need to do something complicated).
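To make the apply suggestion concrete, here is a minimal, dask-free sketch of the two task shapes. Everything in it is an assumption made for illustration: the `apply` helper is a local stand-in assumed to behave like `dask.utils.apply` (it calls `func(*args, **kwargs)`), `concat_columns` is a hypothetical placeholder for `methods.concat`, and `run_task` is a toy evaluator, not dask's scheduler.

```python
from functools import partial


def apply(func, args, kwargs=None):
    """Local stand-in, assumed to mirror dask.utils.apply."""
    return func(*args, **(kwargs or {}))


def concat_columns(columns, axis=0):
    """Hypothetical placeholder for methods.concat; it only records its inputs."""
    return {"axis": axis, "columns": list(columns)}


def run_task(task):
    """Toy evaluator: a task is a tuple of (callable, *arguments)."""
    func, *args = task
    return func(*args)


columns = [[1, 2], [3]]

# Shape to avoid: the keyword is hidden inside a dynamically built callable.
task_with_partial = (partial(concat_columns, axis=1), columns)

# Suggested shape: a named function plus explicit args and kwargs.
task_with_apply = (apply, concat_columns, [columns], {"axis": 1})

# Both tasks compute the same thing.
assert run_task(task_with_partial) == run_task(task_with_apply)

# With apply, the function in the task is still inspectable by name.
print(task_with_apply[1].__name__)
```

In the second form the function and its keyword arguments sit in the task as plain, named values, which is what makes a graph easier to read when something goes wrong.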

Adam-D-Lewis · 2020-03-24

I replaced partial with apply.

I don't think map_partitions is appropriate here, though I'm happy to try additional suggestions.

Why I don't believe map_partitions is appropriate here
In this implementation, mode is computed on each series of the dataframe. As part of computing mode on a series, all of the series' partitions are collapsed into a single partition via value_counts and sum. This last part of the mode method concatenates the results of running mode on each series. The series often have different lengths, so I concatenate them along axis=1 to build the returned dataframe, with NaNs filling the extra space in the shorter series (as pandas does).
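The per-series reduction described above can be sketched without dask. This is a pure-Python illustration under simplifying assumptions, not the code in this PR: each partition is a plain list, `collections.Counter` stands in for `value_counts`, and `partition_counts` and `mode_from_partitions` are hypothetical names.

```python
from collections import Counter


def partition_counts(partition):
    """Count occurrences within one partition (stand-in for value_counts)."""
    return Counter(partition)


def mode_from_partitions(partitions):
    """Sum the per-partition counts, then keep every value tied for the top count."""
    total = Counter()
    for partition in partitions:
        # Counter.update adds the incoming counts to the running totals.
        total.update(partition_counts(partition))
    if not total:
        return []
    top = max(total.values())
    return sorted(value for value, count in total.items() if count == top)


partitions = [[1, 2, 2], [3, 3, 1], [4]]
print(mode_from_partitions(partitions))
```

In this data the value 1 is not the most frequent value in any single partition, yet it ties for the overall mode, which is why the counts have to be combined before the mode is picked.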

The concat_unindexed_dataframes function in dask/dataframe/multi.py (see below) is very close to what I want, except that it calls concat_and_check, which requires the series to have the same length, and here they do not. I therefore copied concat_and_check and modified it slightly for this bit of code.

Part of dask/dataframe/multi.py for reference

    def concat_and_check(dfs):
        if len(set(map(len, dfs))) != 1:
            raise ValueError("Concatenated DataFrames of different lengths")
        return methods.concat(dfs, axis=1)


    def concat_unindexed_dataframes(dfs):
        name = "concat-" + tokenize(*dfs)

        dsk = {
            (name, i): (concat_and_check, [(df._name, i) for df in dfs])
            for i in range(dfs[0].npartitions)
        }

        meta = methods.concat([df._meta for df in dfs], axis=1)

        graph = HighLevelGraph.from_collections(name, dsk, dependencies=dfs)
        return new_dd_object(graph, name, meta, dfs[0].divisions)
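For readers who want to see the intended result of concatenating without the length check, here is a rough, pandas-free illustration. It is only a sketch: `concat_columns_nocheck` is a hypothetical helper, columns are plain lists, and the actual change operates on pandas objects through `methods.concat`.

```python
import math
from itertools import zip_longest


def concat_columns_nocheck(columns):
    """Place columns side by side, padding shorter ones with NaN.

    `columns` maps a column name to its list of mode values.
    Returns a list of rows, one dict per row position.
    """
    names = list(columns)
    rows = zip_longest(*(columns[name] for name in names), fillvalue=math.nan)
    return [dict(zip(names, row)) for row in rows]


# Column "a" has three modes, column "b" has one.
per_column_modes = {"a": [1, 2, 3], "b": [7]}
for row in concat_columns_nocheck(per_column_modes):
    print(row)
```

The result has as many rows as the longest column, and the shorter column is NaN past its last value, which is the behavior the comment above attributes to pandas.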

dask/dataframe/core.py

Adam-D-Lewis · 2020-03-30T18:53:07Z

@TomAugspurger I would welcome additional feedback on the latest changes when you are able. Thank you!

TomAugspurger

Looking close, thanks.

dask/dataframe/core.py

dask/dataframe/tests/test_dataframe.py

TomAugspurger

LGTM once we have a note for #5958 (comment).

Adam-D-Lewis · 2020-04-06T15:20:00Z

The failing test is test_cov. I believe I'm seeing #5910. Recommitting to get around this.

TomAugspurger · 2020-04-06T19:00:55Z

Thanks @balast!

added mode

9378b0c

Adam-D-Lewis mentioned this pull request Feb 28, 2020

Feature request: DataFrame.mode() #5744

Closed

TomAugspurger reviewed Mar 1, 2020

cleanup

6175a29

Adam-D-Lewis closed this Mar 6, 2020

TomAugspurger mentioned this pull request Mar 6, 2020

Add mode [WIP] #5983

Closed

2 tasks

Adam-D-Lewis reopened this Mar 6, 2020

Adam-D-Lewis commented Mar 6, 2020

dask/dataframe/core.py (outdated)

Adam-D-Lewis mentioned this pull request Mar 7, 2020

allow dataframes of different lengths to be concatenated when divisions are unknown #5990

Closed

2 tasks

balast added 3 commits March 11, 2020 00:02

added a concat nocheck for mode

4ca9873

fix tests and cleanup

efee765

undo changes on multi.py

8abe430

Adam-D-Lewis commented Mar 12, 2020

dask/dataframe/tests/test_dataframe.py (outdated)

Adam-D-Lewis commented Mar 12, 2020

dask/dataframe/core.py (outdated)

small change to retrigger github tests

8a60378

balast added 7 commits March 13, 2020 15:56

delete blank line to retrigger CI

e0a4782

throw error when dropna is False when pandas version too low

8b53688

fixed if statement

9dd532f

fix ifs

77d5ce0

fix tests and logic for pandas <= 0.24

dc2e005

removed unnecessary compute

e9fec60

consolidate warning statement

0b2150c

mrocklin reviewed Mar 22, 2020

TomAugspurger reviewed Mar 23, 2020

dask/dataframe/core.py (outdated)

dask/dataframe/core.py (outdated)

dask/dataframe/core.py (outdated)

dask/dataframe/core.py

balast added 2 commits March 23, 2020 11:11

add _mode_aggregate function

c0ece17

remove axis parameter

dd08af0

black

4ab63f8

Adam-D-Lewis requested review from TomAugspurger and mrocklin March 24, 2020 21:20

TomAugspurger reviewed Mar 30, 2020

dask/dataframe/core.py (outdated)

dask/dataframe/tests/test_dataframe.py

balast added 2 commits April 3, 2020 12:07

allow duplicate column names

8c9b529

empty dataframe mode test

bdf811d

Adam-D-Lewis requested a review from TomAugspurger April 3, 2020 18:41

TomAugspurger reviewed Apr 6, 2020

add reference to related pandas issue

df36f69

retrigger CI tests

4e3b6b0

Adam-D-Lewis requested a review from TomAugspurger April 6, 2020 15:49

TomAugspurger merged commit ecff32f into dask:master Apr 6, 2020

Adam-D-Lewis deleted the add_mode branch April 6, 2020 19:04

