Add compression='infer' default to read_csv by rjzamora · Pull Request #6960 · dask/dask

rjzamora · 2020-12-11T22:06:25Z

Adds default compression="infer" option to read_csv.

Closes #6929

TODO:

Test compression="infer" for bz2, zip and xz (gzip done)

rjzamora · 2020-12-14T15:47:54Z

@martindurant - Let me know if you think there is a more elegant way to accomplish what we need here

martindurant · 2020-12-14T15:52:48Z

Maybe fsspec.utils.infer_compression does this?

martindurant · 2020-12-14T15:58:00Z

Also, I think if you are using fsspec anyway, you can pass compression="infer" to open_files and get what you want. I don't remember if any decompression is handled in dask or passed for pandas to handle.

rjzamora · 2020-12-14T16:07:07Z

> Maybe fsspec.utils.infer_compression does this?

Perfect - thanks!

> Also, I think if you are using fsspec anyway, you can pass compression="infer" to open_files and get what you want. I don't remember if any decompression is handled in dask or passed for pandas to handle.

Understood - I'll think about how this relates to the read_csv code path. We may need to detect compression before opening the files so that we can ensure blocksize is None for compressed input. Right now, the files are opened within read_bytes, and that fails for compressed files if blocksize is not None. I'm not sure it is worthwhile to detect the compression on open if we can use something like infer_compression instead.
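The ordering concern above can be sketched as follows: resolve the compression from the path suffix first, then drop blocksize before any bytes are read. This is a minimal stdlib-only sketch, not the dask implementation. `SUFFIX_TO_COMPRESSION`, `infer_from_suffix` and `resolve_read_options` are hypothetical names, the default blocksize value is arbitrary, and inspecting only the first path is an assumption made for illustration.

```python
import os
import warnings

# Hypothetical suffix table for illustration only; it is not fsspec's registry.
SUFFIX_TO_COMPRESSION = {".gz": "gzip", ".bz2": "bz2", ".zip": "zip", ".xz": "xz"}


def infer_from_suffix(path):
    """Return a compression name based on the file suffix, or None."""
    suffix = os.path.splitext(path)[-1].lower()
    return SUFFIX_TO_COMPRESSION.get(suffix)


def resolve_read_options(paths, compression="infer", blocksize=64_000_000):
    """Decide compression and blocksize before any file is opened."""
    if compression == "infer":
        if not paths:
            raise OSError("no files to read")
        # Assumption: only the first path is inspected, and it is never opened.
        compression = infer_from_suffix(paths[0])
    if compression is not None and blocksize is not None:
        # Compressed files cannot be split at arbitrary byte offsets.
        warnings.warn("compressed input: falling back to blocksize=None")
        blocksize = None
    return compression, blocksize
```

With this layout the chunked-read failure cannot be reached, because blocksize is already None by the time a compressed file would be opened.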

martindurant · 2020-12-14T16:11:02Z

(sorry, wrong place for link)

rjzamora · 2020-12-17T16:56:47Z

Thanks for the feedback here @martindurant - We are now using fsspec.utils.infer_compression. I decided not to open any files (with fsspec or pandas) to infer compression, since pandas relies on the file suffix for inference anyway.

Let me know if you have any other thoughts, or feel strongly that we should be doing things differently here.

martindurant

All seems fair, yes. I still have a couple of small comments.

martindurant · 2020-12-17T17:03:57Z

dask/dataframe/io/csv.py

+    if compression == "infer":
+        # Translate the input urlpath to a simple path list
+        if not isinstance(urlpath, (str, list, tuple, os.PathLike)):
+            raise TypeError("Path should be a string, os.PathLike, list or tuple")


This message isn't quite right.
If we allow other types at all (does file-like work? I thought not), then the message should be "compression='infer' only makes sense when passing a path or set of paths", or something similar. I'm not sure we need the check at all.

rjzamora · Dec 17, 2020

Okay - I think I agree that the instance check is unnecessary. Removed it.

martindurant · 2020-12-17T17:08:47Z

dask/dataframe/io/tests/test_csv.py

 def test_read_csv_compression(fmt, blocksize):
    if fmt not in compress:
        pytest.skip("compress function not provided for %s" % fmt)
+    suffix = {"gzip": ".gz", "bz2": ".bz2", "zip": ".zip", "xz": ".xz"}.get(fmt, "")


We should somehow test for no compression.

A comment should be added to say that this test is effectively using "infer" and the extension, or we could test with both infer and explicit (which you already did above, so not necessary).

rjzamora · Dec 17, 2020

Okay - I changed the parametrize layout a bit and added a non-compression test. I also added a brief comment to explain that we are relying on compression="infer".
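The idea behind the test change can be sketched with the standard library alone: write a file whose suffix matches its format, then read it back while relying only on that suffix, including the uncompressed case. This is an illustrative sketch, not the test in dask/dataframe/io/tests/test_csv.py. `FORMATS`, `open_inferred` and `roundtrip` are hypothetical names, and zip is left out to keep the example short.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Hypothetical table: format name -> (file suffix, opener). None means no compression.
FORMATS = {
    None: ("", open),
    "gzip": (".gz", gzip.open),
    "bz2": (".bz2", bz2.open),
    "xz": (".xz", lzma.open),
}
SUFFIX_TO_OPENER = {suffix: opener for suffix, opener in FORMATS.values() if suffix}


def open_inferred(path, mode="rb"):
    """Pick an opener from the suffix alone, falling back to plain open()."""
    suffix = os.path.splitext(path)[-1].lower()
    return SUFFIX_TO_OPENER.get(suffix, open)(path, mode)


def roundtrip(fmt, payload=b"a,b\n1,2\n3,4\n"):
    """Write payload using fmt, then read it back relying only on the suffix."""
    suffix, opener = FORMATS[fmt]
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "data.csv" + suffix)
        with opener(path, "wb") as f:
            f.write(payload)
        with open_inferred(path) as f:
            return f.read()
```

Looping over `FORMATS` plays the role that the parametrize decorator plays in the real test; the `None` entry is the no-compression case requested in the review.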

Closes #6850

dask_cudf version of the `dask.dataframe` changes proposed in [dask#6960](dask/dask#6960). Uses `fsspec` to infer the default `compression` argument from the suffix of the first file-path argument.

Authors:
- rjzamora <rzamora217@gmail.com>

Approvers:
- Keith Kraus

URL: #7013

ayushdg · 2021-01-15T03:43:57Z

dask/dataframe/io/csv.py

+    # set the proper compression option if the suffix is recognized.
+    if compression == "infer":
+        # Translate the input urlpath to a simple path list
+        paths = get_fs_token_paths(urlpath, mode="rb", storage_options=kwargs)[2]


Ran into an issue with read_csv on a remote source during the dask-cudf s3 tests (rapidsai/cudf#7144).
It looks like storage_options isn't being propagated here. The call should be:
paths = get_fs_token_paths(urlpath, mode="rb", storage_options=storage_options)[2]

jrbourbeau · Jan 15, 2021

Thanks @ayushdg! cc @rjzamora

rjzamora · Jan 15, 2021

Oops! I'll submit a quick fix. Sorry about that!
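The bug reported above comes down to which dictionary reaches the filesystem layer. The sketch below isolates that difference with a stand-in function. `list_paths`, `read_csv_buggy` and `read_csv_fixed` are hypothetical names, the s3 URL and the `anon` option are placeholders, and no remote access takes place.

```python
def list_paths(urlpath, storage_options=None):
    """Hypothetical stand-in for a filesystem lookup; it echoes what it received."""
    return {"urlpath": urlpath, "storage_options": storage_options or {}}


def read_csv_buggy(urlpath, storage_options=None, **kwargs):
    # Bug: the parser keyword arguments are sent to the filesystem layer.
    return list_paths(urlpath, storage_options=kwargs)


def read_csv_fixed(urlpath, storage_options=None, **kwargs):
    # Fix: filesystem options and parser options travel separately.
    return list_paths(urlpath, storage_options=storage_options)


opts = {"anon": True}  # placeholder filesystem option
buggy = read_csv_buggy("s3://bucket/data.csv.gz", storage_options=opts, sep=",")
fixed = read_csv_fixed("s3://bucket/data.csv.gz", storage_options=opts, sep=",")
print(buggy["storage_options"])  # {'sep': ','}
print(fixed["storage_options"])  # {'anon': True}
```

In the buggy variant the credentials or endpoint settings never arrive, which is consistent with the failure only showing up on a remote source.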

* support compression='infer' default
* test all compression options with default 'infer'
* use infer_compression
* address code review

rjzamora added 2 commits December 11, 2020 15:49

support compression='infer' default

a570255

test all compression options with default 'infer'

f7cf199

rjzamora marked this pull request as ready for review December 11, 2020 22:24

use infer_compression

cc5ebad

Merge remote-tracking branch 'upstream/master' into compression-infer

652b0c0

rjzamora mentioned this pull request Dec 15, 2020

Add compression="infer" as default for dask_cudf.read_csv rapidsai/cudf#7013

Merged

martindurant reviewed Dec 17, 2020

address code review

4f47252

Merge remote-tracking branch 'upstream/master' into compression-infer

6c3c460

jsignell merged commit d055081 into dask:master Jan 14, 2021

rjzamora deleted the compression-infer branch January 14, 2021 14:17

ayushdg reviewed Jan 15, 2021

ayushdg mentioned this pull request Jan 15, 2021

Update s3 tests to use moto_server rapidsai/cudf#7144

Merged

rjzamora mentioned this pull request Jan 15, 2021

Propagate storage_options in read_csv #7074

Merged

abduhbm pushed a commit to abduhbm/dask that referenced this pull request Jan 19, 2021

Add compression='infer' default to read_csv (dask#6960)

918a7d2
