
DataFrameGroupBy.value_counts raises an error when at least one partition is empty #7065

Closed
FilippoBovo opened this issue Jan 13, 2021 · 9 comments · Fixed by #7073

Comments

@FilippoBovo

What happened:
DataFrameGroupBy.value_counts gives an error if there is at least one empty partition.

What you expected to happen:
I expect no error to be raised and the result to match the corresponding pandas DataFrame result.

Minimal Complete Verifiable Example:

import pandas as pd
import dask.dataframe as dd


if __name__ == "__main__":
    df = pd.DataFrame(
        data=[
            ["a1", "b1"],
            ["a1", None],
            ["a1", "b1"],
            [None, None],
            [None, None],
            ["a1", None],
            [None, "b1"],
            [None, None],
            [None, None],
            ["a3", "b3"],
            ["a3", "b3"],
            ["a3", "b3"],
            ["a5", "b5"],
            ["a5", "b5"],
        ],
        columns=["A", "B"],
    )

    # Reference result from pandas, which raises no error.
    expected_result = df.groupby("A")["B"].value_counts()

    ddf = dd.from_pandas(df, npartitions=2)

    # Works with two non-empty partitions.
    ddf.groupby("A")["B"].value_counts().compute()

    print("With null values:")
    for npartitions in [1, 2, 3, 4, 5, 6, 7, 8]:
        try:
            result = ddf.repartition(npartitions=npartitions).groupby("A")["B"].value_counts().compute()
        except Exception as e:
            print(f"npartitions: {npartitions}, error: {e}")

    print("\nWithout null values:")
    for npartitions in [1, 2, 3, 4, 5]:
        try:
            result = ddf.dropna().repartition(npartitions=npartitions).groupby("A")["B"].value_counts().compute()
        except Exception as e:
            print(f"npartitions: {npartitions}, error: {e}")

The output of the above script is shown below. Note that a line is printed only when an error is raised, so partition counts that succeed do not appear.

With null values:
npartitions: 6, error: boolean index did not match indexed array along dimension 0; dimension is 0 but corresponding boolean dimension is 1
npartitions: 7, error: boolean index did not match indexed array along dimension 0; dimension is 0 but corresponding boolean dimension is 1
npartitions: 8, error: boolean index did not match indexed array along dimension 0; dimension is 0 but corresponding boolean dimension is 1

Without null values:
npartitions: 3, error: boolean index did not match indexed array along dimension 0; dimension is 0 but corresponding boolean dimension is 1
npartitions: 4, error: boolean index did not match indexed array along dimension 0; dimension is 0 but corresponding boolean dimension is 1
npartitions: 5, error: boolean index did not match indexed array along dimension 0; dimension is 0 but corresponding boolean dimension is 1

There seems to be an error when at least one of the partitions is empty.
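
One way to check for empty partitions in a failing case is to count the rows in each partition. This is a diagnostic sketch reusing the ddf defined above, not part of the original script:

# Row counts per partition for one of the failing dropna cases.
sizes = ddf.dropna().repartition(npartitions=3).map_partitions(len).compute()
print(sizes.tolist())  # look for zero-length partitions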

Anything else we need to know?:
No.

Environment:

  • Dask version: 2020.12.0
  • Python version: 3.8.5
  • Operating System: macOS Big Sur (11.1)
  • Install method (conda, pip, source): pip
@jsignell
Member

Well, that is interesting. I think this might be a pandas bug: you can't do a groupby.value_counts on an empty dataframe.

import pandas as pd

pd.DataFrame(columns=["A", "B"]).groupby("A")["B"].value_counts()  # raises

@FilippoBovo
Author

Thank you, @jsignell.

Yes, pandas does not allow that, but I am not sure whether this behaviour is intended by the developers or is just a bug. Is this something you can find out? Otherwise, I can open an issue on the pandas GitHub page.

Meanwhile, do you know a way to ensure that none of the partitions are empty?

@quasiben
Member

I don't believe there is an explicit way to remove empty partitions, since finding them forces computation. This SO post outlines how you can remove empty partitions (should any exist), or you could also try repartitioning -- xref: #5729
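
For reference, a minimal sketch of that recipe. The function name is illustrative, and the approach has to compute the partition lengths up front:

import dask.dataframe as dd

def cull_empty_partitions(ddf):
    # Partition lengths are only known after a real pass over the data,
    # which is why there is no lazy built-in for this.
    lengths = ddf.map_partitions(len).compute()
    nonempty = [ddf.get_partition(i) for i, n in enumerate(lengths) if n > 0]
    return dd.concat(nonempty)  # assumes at least one non-empty partition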

@FilippoBovo
Author

Thank you, @quasiben. I saw that post, but I was wondering whether there was a built-in method for it.

@jsignell
Member

I am not sure if this behaviour is what the developers want or if it is just a bug

Yeah I am also not sure about that. I would expect this kind of thing to work on an empty dataframe though. So raising an issue on pandas would be good.

do you know a way to ensure that none of the partitions are empty?

Maybe that is the wrong approach. What if we only run the function on partitions that are non-empty? That seems like it would be fine for value_counts.

@mrocklin
Member

mrocklin commented Jan 14, 2021 via email

@FilippoBovo
Author

I would expect this kind of thing to work on an empty dataframe though. So raising an issue on pandas would be good.

Done: pandas-dev/pandas#39172

What if we only run the function on partitions that are non-empty?

Does this have to be done manually or is there a simple way to do it?

@jsignell
Member

Does this have to be done manually or is there a simple way to do it?

I looked into this for a little bit. My first attempt was to use a special function on each chunk, so that instead of using func=M.value_counts here:

dask/dask/dataframe/groupby.py, lines 1932 to 1938 at 1a52a0c:

return self._aca_agg(
    token="value_counts",
    func=M.value_counts,
    aggfunc=_value_counts_aggregate,
    split_every=split_every,
    split_out=split_out,
)

We could use something like:

def _value_counts(x, **kwargs):
    # Only call value_counts on non-empty chunks; pandas raises on empty ones.
    if len(x):
        return M.value_counts(x, **kwargs)
    else:
        # An explicit dtype avoids the empty-Series warning in newer pandas.
        return pd.Series(dtype=int)

That seems to work locally. Another option would be to add a kwarg to _aca_apply that allows for skipping empty partitions.
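
As a quick sanity check (not from the thread), the wrapper can be exercised directly on an empty grouped chunk. Here M comes from dask.utils and simply dispatches to the method of the same name:

import pandas as pd
from dask.utils import M  # M.value_counts(x) calls x.value_counts()

def _value_counts(x, **kwargs):
    if len(x):  # len() of a GroupBy is its number of groups
        return M.value_counts(x, **kwargs)
    else:
        return pd.Series(dtype=int)

empty_gb = pd.DataFrame(columns=["A", "B"]).groupby("A")["B"]
print(len(_value_counts(empty_gb)))  # 0, instead of raising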

@FilippoBovo
Author

@jsignell, it seems that it's a bug in Pandas: pandas-dev/pandas#39172 (comment)
