Add support for normalize in value_counts #7342

jsignell · 2021-03-09T15:24:06Z

Closes .value_counts(normalize=True) Unsupported #7341
Tests added / passed
Passes black dask / flake8 dask

rjzamora

Thanks for this, @jsignell! My only real concern is that split_out>1 may not be supported. Related test coverage would clarify whether this is an issue.

jsignell · 2021-03-15T14:30:17Z

dask/dataframe/core.py

+
+        if split_out > 1 and normalize:
+            aggregate_kwargs["length"] = (
+                len(self) if dropna is False else len(self.dropna())


I'm not sure if this is acceptable. Basically I needed a way to pass the total len into the split_out combine.
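
To illustrate why the total length has to be passed in: with split_out > 1, each output partition holds only a subset of the distinct values, so dividing by that partition's own sum would normalize against the wrong denominator. Below is a minimal pure-Python sketch of that effect. It is not dask's code; `hash_split` and `normalize_partition` are hypothetical names, and assigning keys to partitions with `hash(key) % n_out` is an assumption made for illustration only.

```python
from collections import Counter


def hash_split(counts, n_out):
    """Distribute value counts across n_out output partitions by key (illustrative)."""
    parts = [Counter() for _ in range(n_out)]
    for key, n in counts.items():
        parts[hash(key) % n_out][key] += n
    return parts


def normalize_partition(part, length):
    """Turn the counts of one output partition into relative frequencies."""
    return {key: n / length for key, n in part.items()}


data = [1, 1, 2, 3, 3, 3, 4, 4]
parts = hash_split(Counter(data), n_out=2)

# Dividing by each partition's own sum: every partition sums to 1 by itself,
# so the combined result no longer describes the whole series.
local_freq = {}
for part in parts:
    local_freq.update(normalize_partition(part, sum(part.values())))

# Dividing by the global length, which is what passing "length" makes possible.
global_freq = {}
for part in parts:
    global_freq.update(normalize_partition(part, len(data)))
```

Here `global_freq` sums to 1 across both partitions, while `local_freq` sums to 2 because each partition was normalized in isolation.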

rjzamora · 2021-03-15T15:53:34Z

dask/dataframe/core.py

+
+        if split_out > 1 and normalize:
+            aggregate_kwargs["length"] = (
+                len(self) if dropna is False else len(self.dropna())


This seems reasonable to me - I guess an "optimal" solution for multi-partition results would be to collect/aggregate/combine the counts within the first aca pass. However, I'm not sure it is completely worth the additional code complexity to do this, because the result would need to be quite large to benefit (and split_out is often 1 anyway).
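
One possible reading of the "optimal" approach mentioned above is to carry the row count along with the counts through the reduction, so no separate `len(...)` computation is needed. The sketch below is an assumption about what that could look like, not what this PR implements and not dask's aca API; `chunk`, `combine`, and `aggregate` are hypothetical pure-Python stand-ins, and `None` stands in for missing values that are dropped.

```python
from collections import Counter


def chunk(values):
    """Per-partition step: count non-missing values and remember how many there were."""
    kept = [v for v in values if v is not None]
    return Counter(kept), len(kept)


def combine(results):
    """Merge per-partition (counts, n_rows) pairs into one pair."""
    total_counts = Counter()
    total_rows = 0
    for counts, n_rows in results:
        total_counts.update(counts)
        total_rows += n_rows
    return total_counts, total_rows


def aggregate(results):
    """Final step: normalize the merged counts by the merged row count."""
    counts, n_rows = combine(results)
    return {key: n / n_rows for key, n in counts.items()}


partitions = [[1, 1, None], [2, 1], [2, 3, None]]
freq = aggregate([chunk(p) for p in partitions])
```

With split_out > 1 the row count would still have to reach every output partition, which is part of the extra complexity the comment above weighs against the benefit.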

jsignell · 2021-03-18T13:24:34Z

Ok I will merge this if it passes the regular tests.

rjzamora reviewed Mar 9, 2021

rjzamora approved these changes Mar 15, 2021

jsignell force-pushed the value_counts branch from 21ca4df to cfa4e6b Compare March 15, 2021 17:47

jsignell added 4 commits March 18, 2021 09:23

Add support for normalize in value_counts

9b07efb

Add support for split_out with normalize

0b41e7a

Move meta def back into aca

21b6817

Just use kwargs in aca

3858302

jsignell force-pushed the value_counts branch from cfa4e6b to 3858302 Compare March 18, 2021 13:24

Fix kwargs

4f19d11

jsignell merged commit 0bc155e into dask:main Mar 19, 2021

jsignell deleted the value_counts branch March 19, 2021 13:20
