Skip to content

added mode#5958

Merged
TomAugspurger merged 20 commits intodask:masterfrom
Adam-D-Lewis:add_mode
Apr 6, 2020
Merged

added mode#5958
TomAugspurger merged 20 commits intodask:masterfrom
Adam-D-Lewis:add_mode

Conversation

@Adam-D-Lewis
Copy link
Copy Markdown

@Adam-D-Lewis Adam-D-Lewis commented Feb 28, 2020

  • Tests added / passed (Added a few which passes, one fails during pytest, but runs fine otherwise, and I'm stumped)
  • Passes black dask / flake8 dask

Adds the mode function for Series and DataFrame, doesn't support numeric_only parameter currently, though I believe I could add it. Hoping to get some feedback generally and specifically on why my last test won't pass when run in pytest but runs fine otherwise. It's commented out at the moment.

Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. We'll want to make sure we have tests for Series and Dataframes that have multiple values for the mode

@Adam-D-Lewis
Copy link
Copy Markdown
Author

Adam-D-Lewis commented Mar 6, 2020

Closing in favor of #5983

@TomAugspurger TomAugspurger mentioned this pull request Mar 6, 2020
2 tasks
@Adam-D-Lewis Adam-D-Lewis reopened this Mar 6, 2020
@Adam-D-Lewis
Copy link
Copy Markdown
Author

@TomAugspurger this should be ready for another review whenever you get a chance. Thank you!

@jrbourbeau
Copy link
Copy Markdown
Member

@balast just an FYI the Windows CI failures are unrelated to the changes here. I'll work on fixing those CI builds

@Adam-D-Lewis
Copy link
Copy Markdown
Author

Adam-D-Lewis commented Mar 20, 2020

@TomAugspurger All checks are passing! This should now be ready for review when you get a chance. Thanks!


name = "concat-" + tokenize(*mode_series_list)

concat_axis1 = partial(methods.concat, axis=1)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We try to avoid placing partials and other dynamicly generated functions into the task graph. This aids future debugging and reduces scheduling overhead. If you need to add a keyword, then consider using the apply function.

However, I also wonder if there is a way to do what you're trying to do here with map_partitions instead. If
so, it would be good to do so. More future devs/reviewers will be able to understand map_partitions than custom graphs (which are used only when we need to do something complicated).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced partial with apply.

I don't think map_partitions is appropriate here though I'm happy to try additional suggestions.

Why I don't believe map_partitions is appropriate here
In this implementation, mode is computed on each series of the dataframe. As part of computing mode on a series, all of the Series partitions are collapsed into a single partition via value_counts and sum. This last bit of the mode method is trying to concatenate the result of having run mode on each of the series. Each of the series are often of differing lengths, and I am trying to concatenate them along axis=1 to construct the final dataframe to be returned with nans filling in the extra space in the shorter series (as pandas does).

The concat_unindexed_dataframes in dask/dataframe/multi.py (see below) is very close to what I want to do except it calls concat_and_check which checks that the series are the same length, which they are not. I therefore copied concat_and_check and modified it slightly to do what I want in this bit of code.

Part of dask/dataframe/multi.py for reference

def concat_and_check(dfs):
    if len(set(map(len, dfs))) != 1:
        raise ValueError("Concatenated DataFrames of different lengths")
    return methods.concat(dfs, axis=1)


def concat_unindexed_dataframes(dfs):
    name = "concat-" + tokenize(*dfs)

    dsk = {
        (name, i): (concat_and_check, [(df._name, i) for df in dfs])
        for i in range(dfs[0].npartitions)
    }

    meta = methods.concat([df._meta for df in dfs], axis=1)

    graph = HighLevelGraph.from_collections(name, dsk, dependencies=dfs)
    return new_dd_object(graph, name, meta, dfs[0].divisions)

@Adam-D-Lewis
Copy link
Copy Markdown
Author

@TomAugspurger I would welcome additional feedback on the latest changes when you are able. Thank you!

Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking close, thanks.

Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once we have a note for #5958 (comment).

@Adam-D-Lewis
Copy link
Copy Markdown
Author

The failing test is test_cov. I believe I'm seeing #5910. Recommitting to get around this.

@TomAugspurger TomAugspurger merged commit ecff32f into dask:master Apr 6, 2020
@TomAugspurger
Copy link
Copy Markdown
Member

Thanks @balast!

@Adam-D-Lewis Adam-D-Lewis deleted the add_mode branch April 6, 2020 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants