Add compression='infer' default to read_csv #6960
Conversation
@martindurant - Let me know if you think there is a more elegant way to accomplish what we need here

Maybe

Also, I think if you are using fsspec anyway, you can pass

Perfect - thanks!

Understood - I'll think about how this relates to the read_csv code path. We may need to detect compression before opening the files so that we are ensuring

(sorry, wrong place for link)

Thanks for the feedback here @martindurant - We are now using Let me know if you have any other thoughts, or feel strongly that we should be doing things differently here.
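The inference being discussed is simple suffix matching. A minimal stdlib sketch of what fsspec-style compression inference does (the helper and mapping here are illustrative, not dask's actual code):

```python
import os

# Suffixes covered by the compressions tested in this PR
_SUFFIX_TO_COMPRESSION = {".gz": "gzip", ".bz2": "bz2", ".zip": "zip", ".xz": "xz"}

def infer_compression(path):
    """Return the compression name implied by the file suffix, or None."""
    suffix = os.path.splitext(path)[1]
    return _SUFFIX_TO_COMPRESSION.get(suffix)
```

With `compression="infer"`, a result of `None` simply means the file is read uncompressed.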
martindurant
left a comment
All seems fair, yes. I still have a couple of small comments.
dask/dataframe/io/csv.py
Outdated
    if compression == "infer":
        # Translate the input urlpath to a simple path list
        if not isinstance(urlpath, (str, list, tuple, os.PathLike)):
            raise TypeError("Path should be a string, os.PathLike, list or tuple")
This message isn't quite right.
IF we allow other types at all (does file-like work? I thought not), then the message would be "compression='infer' only makes sense when passing a path or set of paths". Or something. I'm not sure we need to have the check at all.
Okay - I think I agree that the instance check is unnecessary. Removed it
    def test_read_csv_compression(fmt, blocksize):
        if fmt not in compress:
            pytest.skip("compress function not provided for %s" % fmt)
        suffix = {"gzip": ".gz", "bz2": ".bz2", "zip": ".zip", "xz": ".xz"}.get(fmt, "")
We should somehow test for no compression.
A comment should be added to say that this test is effectively using "infer" and the extension, or we could test with both infer and explicit (which you already did above, so not necessary).
Okay - I changed the parametrize layout a bit and added a non-compression test. I also added a brief comment to explain that we are relying on compression="infer"
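One way the revised parametrization could look (a sketch, not the exact test in the PR): include `fmt=None` so the uncompressed case runs through the same `compression="infer"` code path, with the suffix lookup falling back to an empty string:

```python
import pytest

@pytest.mark.parametrize("fmt", [None, "gzip", "bz2", "zip", "xz"])
def test_read_csv_compression(fmt):
    # Relies on compression="infer": the file suffix alone determines
    # how read_csv decompresses the data; fmt=None means no suffix.
    suffix = {"gzip": ".gz", "bz2": ".bz2", "zip": ".zip", "xz": ".xz"}.get(fmt, "")
    assert (suffix == "") == (fmt is None)
```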
Closes #6850

dask_cudf version of the `dask.dataframe` changes proposed in [dask#6960](dask/dask#6960). Uses `fsspec` to infer the default `compression` argument from the suffix of the first file-path argument.

Authors:
- rjzamora <rzamora217@gmail.com>

Approvers:
- Keith Kraus

URL: #7013
    # set the proper compression option if the suffix is recognized.
    if compression == "infer":
        # Translate the input urlpath to a simple path list
        paths = get_fs_token_paths(urlpath, mode="rb", storage_options=kwargs)[2]
Ran into an issue with read_csv on a remote source during dask-cudf s3 tests. (rapidsai/cudf#7144)
Looks like `storage_options` isn't being propagated here.
paths = get_fs_token_paths(urlpath, mode="rb", storage_options=storage_options)[2]
Oops! I'll submit a quick fix. Sorry about that!
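The fix amounts to forwarding `storage_options` (rather than the generic `kwargs` dict) into `get_fs_token_paths`. A sketch, assuming fsspec is installed (`resolve_paths` is a hypothetical wrapper, not dask's code):

```python
from fsspec.core import get_fs_token_paths

def resolve_paths(urlpath, storage_options=None):
    # Pass storage_options explicitly so remote credentials
    # (e.g. S3 keys) actually reach the target filesystem.
    return get_fs_token_paths(
        urlpath, mode="rb", storage_options=storage_options
    )[2]
```

For local paths `storage_options` can simply be `None`; for `s3://` URLs it carries the credentials that were being dropped in the original call.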
* support compression='infer' default
* test all compression options with default 'infer'
* use infer_compression
* address code review
Adds default `compression="infer"` option to `read_csv`. Closes #6929

TODO:
- `compression="infer"` for bz2, zip and xz (gzip done)
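End to end, the behavior this PR enables can be mimicked with the standard library alone (`read_csv_rows` is a hypothetical illustration, not dask's API): a `.gz` suffix selects gzip decompression with no explicit `compression` argument.

```python
import csv
import gzip

def read_csv_rows(path, compression="infer"):
    # Mirror compression="infer": pick the codec from the file suffix
    if compression == "infer":
        compression = "gzip" if path.endswith(".gz") else None
    opener = gzip.open if compression == "gzip" else open
    with opener(path, "rt", newline="") as f:
        return list(csv.reader(f))
```

Passing `compression="gzip"` explicitly still works, so inference is a default rather than a behavior change for existing callers.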