Skip to content

Don't sample dict result of a shuffle group when calculating its size#7834

Merged
mrocklin merged 1 commit intodask:mainfrom
fjetter:shuffle_group_sizeof
Jun 29, 2021
Merged

Don't sample dict result of a shuffle group when calculating its size#7834
mrocklin merged 1 commit intodask:mainfrom
fjetter:shuffle_group_sizeof

Conversation

@fjetter
Copy link
Copy Markdown
Member

@fjetter fjetter commented Jun 25, 2021

The size calculation for shuffle group results is very sensitive to sampling since there may
be empty splits skewing the result.

See also dask/distributed#4962

I decided to go for this weird sentinel to not have to import dask.dataframe.backends in dask.sizeof but rather the opposite. Open to other suggestions.

Regarding the implementation, there is also the possibility to have some pseudo sampling which ensures that we have at least X% of the rows in our sum. I figured this is not necessary since iterating over the splits should be sufficiently fast. In my micro benchmarks it was still about a factor of 2 slower than the ordinary sizeof but still around 1ms for a DF with 1M rows (incl a str col)

  • Closes #xxxx
  • Tests added / passed
  • Passes black dask / flake8 dask / isort dask

@github-actions github-actions bot added dataframe dispatch Related to `Dispatch` extension objects labels Jun 25, 2021
@mrocklin
Copy link
Copy Markdown
Member

cc @madsbk

The size calculation for shuffle group results is very sensitive to sampling since there may
be empty splits skewing the result.

See also dask/distributed#4962
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dataframe dispatch Related to `Dispatch` extension objects

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants