Skip to content

Add transparent compression for on-disk shuffle with partd#5786

Merged
TomAugspurger merged 11 commits intodask:masterfrom
ChrWesp:master
Jan 16, 2020
Merged

Add transparent compression for on-disk shuffle with partd#5786
TomAugspurger merged 11 commits intodask:masterfrom
ChrWesp:master

Conversation

@ChrWesp
Copy link
Copy Markdown
Contributor

@ChrWesp ChrWesp commented Jan 14, 2020

Shoud close #5704

This change adds transparent file compression for on-disk shuffle via partd.
The compression algorithm can be specified by dask config (keyword shuffle_disk_compression).
It can be any algorithm which is exposed by partd.compressed. In case an algorithm is absent, it silently falls back to no compression.

Please check one thing - I am not sure about the naming schema and where to document the config parameter.

  • Tests added / passed
  • Passes black dask / flake8 dask

@quasiben
Copy link
Copy Markdown
Member

Thanks for submitting a PR for #5704 . The default config file can be found here: https://github.com/dask/dask/blob/master/dask/dask.yaml . Can I ask that you also add to the config test here:

def test_core_file():
assert "temporary-directory" in dask.config.config

@mrocklin
Copy link
Copy Markdown
Member

What do we think about the name? Two things come to mind:

  1. We've started namespacing most of our configuration names. Maybe we should do the same here?
  2. We tend to use hyphens rather than underscores. Maybe we should do the same here.

@ChrWesp
Copy link
Copy Markdown
Contributor Author

ChrWesp commented Jan 14, 2020

@quasiben : Thanks, I will add the config test.
@mrocklin : The current naming is definitly not optimal but I do not know the general naming convention or what would fit best here. Please make a proposal

@TomAugspurger
Copy link
Copy Markdown
Member

I'd suggest dataframe.shuffle-comperssion.

Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Can you also add tests?

@TomAugspurger
Copy link
Copy Markdown
Member

There's a linting issue too: https://travis-ci.org/dask/dask/jobs/637027038#L1437-L1440. Contributing guidelines are at https://docs.dask.org/en/latest/develop.html#code-formatting. You might want to install and use pre-commit.

…+ Refactored naming of tests + Refactor to fix an incompatibility between black and flake8 style enforcement on multi-line strings and line length which lead CI to fail
ChrWesp and others added 3 commits January 15, 2020 14:53
@TomAugspurger TomAugspurger merged commit 3893c68 into dask:master Jan 16, 2020
@TomAugspurger
Copy link
Copy Markdown
Member

Thanks @ChrWesp!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enable compression for on-disk shuffle with PartD

4 participants