Skip to content

Add compression='infer' default to read_csv#6960

Merged
jsignell merged 6 commits intodask:masterfrom
rjzamora:compression-infer
Jan 14, 2021
Merged

Add compression='infer' default to read_csv#6960
jsignell merged 6 commits intodask:masterfrom
rjzamora:compression-infer

Conversation

@rjzamora
Copy link
Copy Markdown
Member

@rjzamora rjzamora commented Dec 11, 2020

Adds default compression="infer" option to read_csv.

Closes #6929

TODO:

  • Test compression="infer" for bz2, zip and xz (gzip done)

@rjzamora rjzamora marked this pull request as ready for review December 11, 2020 22:24
@rjzamora
Copy link
Copy Markdown
Member Author

@martindurant - Let me know if you think there is a more elegant way to accomplish what we need here

@martindurant
Copy link
Copy Markdown
Member

Maybe fsspec.utils.infer_compression does this?

@martindurant
Copy link
Copy Markdown
Member

Also, I think if you are using fsspec anyway, you can pass compression="infer" to open_files and get what you want. I don't remember if any decompression is handled in dask or passed for pandas to handle.

@rjzamora
Copy link
Copy Markdown
Member Author

Maybe fsspec.utils.infer_compression does this?

Perfect - thanks!

Also, I think if you are using fsspec anyway, you can pass compression="infer" to open_files and get what you want. I don't remember if any decompression is handled in dask or passed for pandas to handle.

Understood - I'll think about how this relates to the read_csv code path. We may need to detect compression before opening the files so that we are ensuring blocksize==None. Right now, the files are opened within read_bytes, and if blocksize is not None, this will fail. I'm not sure it is worthwhile to detect the compression on open if we can use something like infer_compression.

@martindurant
Copy link
Copy Markdown
Member

(sorry, wrong place for link)

@rjzamora
Copy link
Copy Markdown
Member Author

Thanks for the feedback here @martindurant - We are now using fsspec.utils.infer_compression. I decided not to open any files (with fsspec or pandas) to infer compression, since pandas relies on the file suffix for inference anyway.

Let me know if you have any other thoughts, or feel strongly that we should be doing things differently here.

Copy link
Copy Markdown
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All seems fair, yes. I still have a couple of small comments.

if compression == "infer":
# Translate the input urlpath to a simple path list
if not isinstance(urlpath, (str, list, tuple, os.PathLike)):
raise TypeError("Path should be a string, os.PathLike, list or tuple")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message isn't quite right.
IF we allow other types at all (does file-like work? I thought not), then the message would be "compression='infer' only makes sense when passing a path or set of paths". Or something. I'm not sure we need to have the check at all.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay - I think I agree that the instance check is unnecessary. Removed it

def test_read_csv_compression(fmt, blocksize):
if fmt not in compress:
pytest.skip("compress function not provided for %s" % fmt)
suffix = {"gzip": ".gz", "bz2": ".bz2", "zip": ".zip", "xz": ".xz"}.get(fmt, "")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should somehow test for no compression.

A comment should be added to say that this test if effectively using "infer" and the extension, or we could test with both infer and explicit (which you already did above, so not necessary).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay - I changed the parametrize layout a bit and added a non-compression test. I also added a brief comment to explain that we are relying on compression="infer"

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Dec 17, 2020
Closes #6850 

dask_cudf version of the `dask.dataframe` changes proposed in [dask#6960](dask/dask#6960).  Uses `fsspec` to infer the default `compression` argument from the suffix of the first file-path argument.

Authors:
  - rjzamora <rzamora217@gmail.com>

Approvers:
  - Keith Kraus

URL: #7013
@jsignell jsignell merged commit d055081 into dask:master Jan 14, 2021
@rjzamora rjzamora deleted the compression-infer branch January 14, 2021 14:17
# set the proper compression option if the suffix is recongnized.
if compression == "infer":
# Translate the input urlpath to a simple path list
paths = get_fs_token_paths(urlpath, mode="rb", storage_options=kwargs)[2]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran into an issue with read_csv on a remote source during dask-cudf s3 tests. (rapidsai/cudf#7144)
Looks like storage options isn't being propagated here.
paths = get_fs_token_paths(urlpath, mode="rb", storage_options=storage_options)[2]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ayushdg! cc @rjzamora

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! I'll submit a quick fix. Sorry about that!

abduhbm pushed a commit to abduhbm/dask that referenced this pull request Jan 19, 2021
* support compression='infer' default

* test all compression options with default 'infer'

* use infer_compression

* address code review
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add default compression="infer" option to read_csv

5 participants