-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-10372: [Dataset][C++][Python][R] Support reading compressed CSV #9685
Conversation
Leaving as draft for now since I observed the Python tests hang without ARROW-11937/#9680 fixed. |
157e14c
to
aacd903
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite clean, thanks!
some minor comments
python/pyarrow/_dataset.pyx
Outdated
if compression: | ||
self.compression = compression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: is it necessary to check for None
twice?
if compression: | |
self.compression = compression | |
self.compression = compression |
python/pyarrow/_dataset.pyx
Outdated
if isinstance(compression, str): | ||
compression = Codec(compression) | ||
self.csv_format.compression = \ | ||
(<Codec> compression).unwrap().compression_type() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if isinstance(compression, str): | |
compression = Codec(compression) | |
self.csv_format.compression = \ | |
(<Codec> compression).unwrap().compression_type() | |
elif isinstance(compression, str): | |
self.csv_format.compression = _ensure_compression(compression) | |
elif isinstance(compression, Codec): | |
self.csv_format.compression = \ | |
(<Codec> compression).unwrap().compression_type() | |
else: | |
raise TypeError(f'Cannot set compression with value of type {type(compression)}') |
cpp/src/arrow/dataset/file_csv.cc
Outdated
ARROW_ASSIGN_OR_RAISE(auto file, source.Open()); | ||
if (format.compression == Compression::UNCOMPRESSED) { | ||
input = file; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(For follow up, maybe ARROW-8981) maybe it'd be more useful to encapsulate compressed FIleSources with an overload of Open()
ARROW_ASSIGN_OR_RAISE(auto file, source.Open()); | |
if (format.compression == Compression::UNCOMPRESSED) { | |
input = file; | |
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<io::InputStream> file, source.Open(format.compression)); |
rather than the (currently ignored) FileSource::compression property
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this! A couple of suggestions on the R side.
r/R/dataset-format.R
Outdated
compression = CompressionType$UNCOMPRESSED) { | ||
dataset___CsvFileFormat__Make(opts, compression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a helper used in a few places that makes this a little more human-friendly:
compression = CompressionType$UNCOMPRESSED) { | |
dataset___CsvFileFormat__Make(opts, compression) | |
compression = "uncompressed") { | |
dataset___CsvFileFormat__Make(opts, compression_from_name(compression)) |
Should also add a note in the FileFormat$create docs above (around L48) that compression is an option now.
r/tests/testthat/test-dataset.R
Outdated
dst_dir <- make_temp_dir() | ||
dst_file <- file.path(dst_dir, "data.csv.gz") | ||
write.csv(df1, gzfile(dst_file), row.names = FALSE, quote = FALSE) | ||
format <- FileFormat$create("csv", compression = CompressionType$GZIP) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the above suggestion, this becomes:
format <- FileFormat$create("csv", compression = CompressionType$GZIP) | |
format <- FileFormat$create("csv", compression = "gzip") |
Thanks for the feedback. I've fixed things and opened up the PR (the issue in #9680 apparently only affects my local build, not CI). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Changes look good, just one minor suggestion on the docs
r/R/dataset-format.R
Outdated
#' `format = "csv"``: | ||
#' * `compression`: Assume CSV files have been compressed with this codec. | ||
#' Any options from [CsvParseOptions] may also be passed. | ||
#' | ||
#' `format = "parquet"``: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#' `format = "parquet"``: | |
#' `format = "parquet"`: |
Should fix this while we're here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just pushed the R doc tweaks I wanted, so don't worry about it
Ah thanks for fixing that, I was just about to take a look. |
::testing::Values(Compression::UNCOMPRESSED)); | ||
#ifdef ARROW_WITH_BROTLI | ||
INSTANTIATE_TEST_SUITE_P(TestBrotliCsv, TestCsvFileFormat, | ||
::testing::Values(Compression::BROTLI)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also test with bz2 (which is much more likely for compression of CSV files than Brotli)?
Small note here: the python API for reading plain CSV files (using |
Yeah, we'd have to implement that on the C++ side as well. It could be tackled in ARROW-8981 as part of the refactoring that Ben suggested above for that issue, too. |
If we want to support detection of compression then that requires a fairly significant change to this PR. As written, compression is a property of the FileFormat, which is not mutated (even during discovery). Thus we couldn't look at (for example) the Adding discovery of file formats would give us a place to put this functionality, but that's a larger change and definitely out of scope here. If guessing compression will ever be a priority, I'd recommend removing compression-as-property and instead writing |
In that case, we could still keep the property if people wanted to force a certain compression type instead of guessing it, right? (Though maybe that isn't something that's ever done.) |
I'd say forcing a given compression is something we could add later if there's specific demand. IMO, this PR should provide a single clear answer to supporting compressed CSV, be it
|
I'll rework this then, since from the original issue, people expect compression to be transparent to CSV. |
ARROW_ASSIGN_OR_RAISE(auto file, Open()); | ||
auto actual_compression = Compression::type::UNCOMPRESSED; | ||
if (!compression.has_value()) { | ||
// Guess compression from file extension |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be made more compact with arrow::fs::internal::GetAbstractPathExtension and arrow::util::GetCompressionType
feb1671
to
48a63fe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, merging
This adds support for reading compressed CSV datasets in C++/Python/R. Files' compression will be guessed from their extensions (`f.csv.gz` -> gzip compression). Closes apache#9685 from lidavidm/arrow-10372 Lead-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
This adds support for reading compressed CSV datasets in C++/Python/R. Files' compression will be guessed from their extensions (`f.csv.gz` -> gzip compression). Closes apache#9685 from lidavidm/arrow-10372 Lead-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
This adds support for reading compressed CSV datasets in C++/Python/R. Files' compression will be guessed from their extensions (
f.csv.gz
-> gzip compression).