ARROW-10372: [Dataset][C++][Python][R] Support reading compressed CSV #9685

Closed
wants to merge 7 commits into from

@lidavidm lidavidm commented Mar 12, 2021

This adds support for reading compressed CSV datasets in C++/Python/R. Files' compression will be guessed from their extensions (f.csv.gz -> gzip compression).

@lidavidm
Member Author

Leaving as draft for now since I observed the Python tests hang without ARROW-11937/#9680 fixed.

@lidavidm lidavidm force-pushed the arrow-10372 branch 2 times, most recently from 157e14c to aacd903 Compare March 12, 2021 15:37
Member

@bkietz bkietz left a comment

This is quite clean, thanks!

some minor comments

Comment on lines 1370 to 1371
if compression:
    self.compression = compression

Nit: is it necessary to check for None twice?

Suggested change
- if compression:
-     self.compression = compression
+ self.compression = compression

Comment on lines 1396 to 1399
if isinstance(compression, str):
    compression = Codec(compression)
self.csv_format.compression = \
    (<Codec> compression).unwrap().compression_type()

Suggested change
- if isinstance(compression, str):
-     compression = Codec(compression)
- self.csv_format.compression = \
-     (<Codec> compression).unwrap().compression_type()
+ elif isinstance(compression, str):
+     self.csv_format.compression = _ensure_compression(compression)
+ elif isinstance(compression, Codec):
+     self.csv_format.compression = \
+         (<Codec> compression).unwrap().compression_type()
+ else:
+     raise TypeError(f'Cannot set compression with value of type {type(compression)}')
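
The branching in this suggestion can be sketched in plain Python. This is only an illustration, not the Cython setter from the PR: the `Codec` class below is a hypothetical stand-in for pyarrow's wrapper, `normalize_compression` is an invented name, and the `None` branch is an assumption about what precedes the `elif` in the real code.

```python
class Codec:
    """Hypothetical stand-in for a codec wrapper; it only stores a name."""

    def __init__(self, name):
        self.name = name


def normalize_compression(compression):
    """Return a lowercase codec name, or None when no compression is set.

    Accepts None, a codec name, or a Codec instance; anything else is
    rejected with a TypeError instead of being silently coerced.
    """
    if compression is None:
        # Assumed first branch: nothing to set.
        return None
    elif isinstance(compression, str):
        return compression.lower()
    elif isinstance(compression, Codec):
        return compression.name.lower()
    else:
        raise TypeError(
            f"Cannot set compression with value of type {type(compression)}"
        )
```

Dispatching on the type explicitly means an unexpected value such as an integer fails at the setter rather than later, when the file is opened.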

Comment on lines 115 to 117
ARROW_ASSIGN_OR_RAISE(auto file, source.Open());
if (format.compression == Compression::UNCOMPRESSED) {
input = file;

(For follow-up, maybe ARROW-8981) maybe it'd be more useful to encapsulate compressed FileSources with an overload of Open()

Suggested change
- ARROW_ASSIGN_OR_RAISE(auto file, source.Open());
- if (format.compression == Compression::UNCOMPRESSED) {
-   input = file;
+ ARROW_ASSIGN_OR_RAISE(std::shared_ptr<io::InputStream> file, source.Open(format.compression));

rather than the (currently ignored) FileSource::compression property

Member

@nealrichardson nealrichardson left a comment

Thanks for doing this! A couple of suggestions on the R side.

Comment on lines 106 to 107
compression = CompressionType$UNCOMPRESSED) {
dataset___CsvFileFormat__Make(opts, compression)

We have a helper used in a few places that makes this a little more human-friendly:

Suggested change
- compression = CompressionType$UNCOMPRESSED) {
-   dataset___CsvFileFormat__Make(opts, compression)
+ compression = "uncompressed") {
+   dataset___CsvFileFormat__Make(opts, compression_from_name(compression))

Should also add a note in the FileFormat$create docs above (around L48) that compression is an option now.

dst_dir <- make_temp_dir()
dst_file <- file.path(dst_dir, "data.csv.gz")
write.csv(df1, gzfile(dst_file), row.names = FALSE, quote = FALSE)
format <- FileFormat$create("csv", compression = CompressionType$GZIP)

With the above suggestion, this becomes:

Suggested change
- format <- FileFormat$create("csv", compression = CompressionType$GZIP)
+ format <- FileFormat$create("csv", compression = "gzip")

@lidavidm lidavidm marked this pull request as ready for review March 12, 2021 21:18
@lidavidm
Member Author

Thanks for the feedback. I've fixed things and opened up the PR (the issue in #9680 apparently only affects my local build, not CI).

Member

@nealrichardson nealrichardson left a comment

Thanks! Changes look good, just one minor suggestion on the docs

#' `format = "csv"``:
#' * `compression`: Assume CSV files have been compressed with this codec.
#' Any options from [CsvParseOptions] may also be passed.
#'
#' `format = "parquet"``:

Suggested change
- #' `format = "parquet"``:
+ #' `format = "parquet"`:

Should fix this while we're here.

r/R/dataset-format.R (outdated, resolved)
Member

@nealrichardson nealrichardson left a comment

I just pushed the R doc tweaks I wanted, so don't worry about it

@lidavidm
Member Author

Ah thanks for fixing that, I was just about to take a look.

::testing::Values(Compression::UNCOMPRESSED));
#ifdef ARROW_WITH_BROTLI
INSTANTIATE_TEST_SUITE_P(TestBrotliCsv, TestCsvFileFormat,
::testing::Values(Compression::BROTLI));

Can you also test with bz2 (which is much more likely for compression of CSV files than Brotli)?
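
As a rough illustration of what a per-codec parametrized test exercises, here is a round trip over several codecs using only the Python standard library. It is a sketch, not the Arrow C++ test from this PR; the `CODECS` table and the `round_trip` helper are invented for this example.

```python
import bz2
import csv
import gzip
import io

# Hypothetical codec table: name -> (compress, decompress) on bytes.
CODECS = {
    "uncompressed": (lambda data: data, lambda data: data),
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}


def round_trip(rows, codec):
    """Serialize rows as CSV, compress, decompress, and parse them back."""
    compress, decompress = CODECS[codec]
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    payload = compress(buffer.getvalue().encode("utf-8"))
    text = decompress(payload).decode("utf-8")
    return list(csv.reader(io.StringIO(text, newline="")))


rows = [["a", "b"], ["1", "2"]]
for codec in CODECS:
    # Every codec should give back exactly the rows that went in.
    assert round_trip(rows, codec) == rows
```

Adding a codec to the table is enough to cover it, which mirrors how one INSTANTIATE_TEST_SUITE_P line per codec extends the C++ suite.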

@jorisvandenbossche
Member

> It does not autodetect the type of compression (but perhaps this could be added, by inspecting FileSource).

Small note here: the Python API for reading plain CSV files (pyarrow.csv) automatically detects compressed files and has no explicit option for that. So ideally, dataset CSV reading would work the same way, I think.
But AFAIK, decompression for pyarrow.csv currently happens on the Python side and not in C++ (i.e. get_input_stream in the Cython code detects compression)?

@lidavidm
Member Author

> > It does not autodetect the type of compression (but perhaps this could be added, by inspecting FileSource).
>
> Small note here: the Python API for reading plain CSV files (pyarrow.csv) automatically detects compressed files and has no explicit option for that. So ideally, dataset CSV reading would work the same way, I think.
> But AFAIK, decompression for pyarrow.csv currently happens on the Python side and not in C++ (i.e. get_input_stream in the Cython code detects compression)?

Yeah, we'd have to implement that on the C++ side as well. It could be tackled in ARROW-8981 as part of the refactoring that Ben suggested above for that issue, too.

@bkietz
Member

bkietz commented Mar 15, 2021

If we want to support detection of compression then that requires a fairly significant change to this PR. As written, compression is a property of the FileFormat, which is not mutated (even during discovery). Thus we couldn't look at (for example) the .gz extension on provided file sources and switch from "CSV" to "gzipped CSV". Compression-as-FileFormat-property paints us into a corner WRT guessing compression.

Adding discovery of file formats would give us a place to put this functionality, but that's a larger change and definitely out of scope here.

If guessing compression ever becomes a priority, I'd recommend removing compression-as-property and instead writing `Result<shared_ptr<InputStream>> FileSource::OpenCompressed(optional<Compression::type> = {})` (without an explicit compression type, it guesses which codec to use). This can replace the use of FileSource::Open in file_csv.cc:OpenReader.
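
The shape of that proposal (an optional codec argument, with guessing as the fallback) can be sketched in Python with the standard library. This is not Arrow's implementation: `guess_compression`, `open_compressed`, and both lookup tables are hypothetical names, and only gzip and bz2 are covered.

```python
import bz2
import gzip
import os

# Hypothetical lookup tables for this sketch.
_EXTENSIONS = {".gz": "gzip", ".bz2": "bz2"}
_OPENERS = {"uncompressed": open, "gzip": gzip.open, "bz2": bz2.open}


def guess_compression(path):
    """Guess the codec from the last file extension (f.csv.gz -> gzip)."""
    extension = os.path.splitext(path)[1].lower()
    return _EXTENSIONS.get(extension, "uncompressed")


def open_compressed(path, compression=None):
    """Open path as text, decompressing transparently.

    With compression=None the codec is guessed from the extension;
    passing a codec name forces it, whatever the file is called.
    """
    if compression is None:
        compression = guess_compression(path)
    try:
        opener = _OPENERS[compression]
    except KeyError:
        raise ValueError(f"Unsupported compression: {compression!r}") from None
    return opener(path, "rt", encoding="utf-8", newline="")
```

Because the codec is resolved per file at open time, a directory mixing `a.csv` and `b.csv.gz` needs no format-level setting, which is the property the discussion above is after.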

@lidavidm
Member Author

In that case, we could still keep the property if people wanted to force a certain compression type instead of guessing it, right? (Though maybe that isn't something that's ever done.)

@bkietz
Member

bkietz commented Mar 15, 2021

I'd say forcing a given compression is something we could add later if there's specific demand. IMO, this PR should provide a single clear answer to supporting compressed CSV, be it

  • compression is a property of CsvFileFormat or
  • compression is transparent to CsvFileFormat

@lidavidm
Member Author

I'll rework this then, since from the original issue, people expect compression to be transparent to CSV.

ARROW_ASSIGN_OR_RAISE(auto file, Open());
auto actual_compression = Compression::type::UNCOMPRESSED;
if (!compression.has_value()) {
// Guess compression from file extension

This can be made more compact with arrow::fs::internal::GetAbstractPathExtension and arrow::util::GetCompressionType

@lidavidm lidavidm force-pushed the arrow-10372 branch 2 times, most recently from feb1671 to 48a63fe Compare March 15, 2021 17:43
Member

@bkietz bkietz left a comment

LGTM, merging

@bkietz bkietz closed this in 06cb1a6 Mar 17, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This adds support for reading compressed CSV datasets in C++/Python/R. Files' compression will be guessed from their extensions (`f.csv.gz` -> gzip compression).

Closes apache#9685 from lidavidm/arrow-10372

Lead-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
5 participants