Don't sample dict result of a shuffle group when calculating its size by fjetter · Pull Request #7834 · dask/dask

fjetter · 2021-06-25T13:31:02Z

The size calculation for shuffle group results is very sensitive to sampling since there may
be empty splits skewing the result.

See also dask/distributed#4962

I went with this slightly odd sentinel class so that dask.sizeof does not have to import dask.dataframe.backends; the import goes the other way instead. Open to other suggestions.
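To make the sentinel idea concrete, here is a minimal sketch that does not use dask at all. Everything in it is a hypothetical stand-in rather than the code from this PR: `SizeofSentinel`, `GroupResult`, and the toy `sizeof` dispatcher built on `functools.singledispatch` are assumptions chosen for illustration. The point is only the import direction: the generic module knows about the marker class, and the specialised container opts in by inheriting from it.

```python
import sys
from functools import singledispatch


# --- would live in the generic "sizeof" module (hypothetical) ---
class SizeofSentinel:
    """Marker base class: 'trust my own __sizeof__, do not sample me'."""


@singledispatch
def sizeof(obj):
    # Fallback: whatever the interpreter reports.
    return sys.getsizeof(obj)


@sizeof.register(dict)
def _sizeof_dict_sampled(d):
    # Stand-in for a sampling estimator: measure only the first two
    # values and scale up to the full length.
    values = list(d.values())
    head = values[:2]
    if not head:
        return sys.getsizeof(d)
    scale = len(values) / len(head)
    return sys.getsizeof(d) + int(scale * sum(sizeof(v) for v in head))


@sizeof.register(SizeofSentinel)
def _sizeof_marked(obj):
    # Marked objects bypass sampling and report their own size.
    return sys.getsizeof(obj)


# --- would live in the specialised module, which imports the one above ---
class GroupResult(SizeofSentinel, dict):
    """Dict of splits that sums the size of every value."""

    def __sizeof__(self):
        return super().__sizeof__() + sum(sizeof(v) for v in self.values())


result = GroupResult({i: b"" for i in range(50)})
result[7] = b"x" * 100_000  # one large split among many empty ones
print(sizeof(result), sizeof(dict(result)))
```

Because the marker class comes before `dict` in the method resolution order of `GroupResult`, the dispatcher picks the marker's handler and the sampled `dict` handler is never used for it.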

Regarding the implementation, another option would be some form of pseudo-sampling that ensures at least X% of the rows are included in the sum. I figured this is not necessary, since iterating over the splits should be sufficiently fast. In my micro-benchmarks it was about a factor of 2 slower than the ordinary sizeof, but still around 1 ms for a DataFrame with 1M rows (including a str column).
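As a rough illustration of why sampling is fragile here, the sketch below compares a sample-and-scale estimate with a full pass over all values. It uses only the standard library; `sampled_size` and `exact_size` are hypothetical helpers, and the sampling scheme is an assumption made for the example rather than a copy of what dask.sizeof does.

```python
import random
import sys


def sampled_size(d, n_samples=10, seed=0):
    """Estimate the size of a dict by measuring a random sample of its
    values and scaling the result up to the full length."""
    values = list(d.values())
    if len(values) <= n_samples:
        return exact_size(d)
    rng = random.Random(seed)
    sample = rng.sample(values, n_samples)
    scale = len(values) / n_samples
    return sys.getsizeof(d) + int(scale * sum(sys.getsizeof(v) for v in sample))


def exact_size(d):
    """Measure every value; linear in the number of splits."""
    return sys.getsizeof(d) + sum(sys.getsizeof(v) for v in d.values())


# 100 splits, 99 of them empty, one holding roughly 1 MB.
splits = {i: b"" for i in range(100)}
splits[42] = b"x" * 1_000_000

estimate = sampled_size(splits)
exact = exact_size(splits)
print(f"sampled: {estimate}  exact: {exact}")
```

If the large split is missed by the sample, the estimate is far too small; if it is hit, the scaling overstates the size by roughly the scale factor. Either way the error is on the order of the payload itself, which is the skew described above.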

Closes #xxxx
Tests added / passed
Passes black dask / flake8 dask / isort dask

mrocklin · 2021-06-25T14:03:16Z

cc @madsbk

github-actions bot added the dataframe and dispatch labels Jun 25, 2021

fjetter mentioned this pull request Jun 25, 2021

Worker assignment for split-shuffle tasks dask/distributed#4962

Open

Don't sample dict result of a shuffle group when calculating its size

b2e5a3f

The size calculation for shuffle group results is very sensitive to sampling since there may be empty splits skewing the result. See also dask/distributed#4962

fjetter force-pushed the shuffle_group_sizeof branch from f836cef to b2e5a3f June 25, 2021 16:00

fjetter mentioned this pull request Jun 29, 2021

Ensure shuffle split operations are blacklisted from work stealing dask/distributed#4964

Merged

mrocklin merged commit 8cc9468 into dask:main Jun 29, 2021
