ARROW-10372: [Dataset][C++][Python][R] Support reading compressed CSV #9685

lidavidm · 2021-03-12T13:58:20Z

This adds support for reading compressed CSV datasets in C++/Python/R. Files' compression will be guessed from their extensions (f.csv.gz -> gzip compression).

lidavidm · 2021-03-12T13:58:57Z

Leaving as draft for now since I observed the Python tests hang without ARROW-11937/#9680 fixed.

github-actions · 2021-03-12T14:29:15Z

https://issues.apache.org/jira/browse/ARROW-10372

bkietz

This is quite clean, thanks!

Some minor comments.

bkietz · 2021-03-12T16:08:43Z

python/pyarrow/_dataset.pyx

+        if compression:
+            self.compression = compression


Nit: is it necessary to check for None twice?

Suggested change

-        if compression:
-            self.compression = compression
+        self.compression = compression

bkietz · 2021-03-12T16:12:12Z

python/pyarrow/_dataset.pyx

+        if isinstance(compression, str):
+            compression = Codec(compression)
+        self.csv_format.compression = \
+            (<Codec> compression).unwrap().compression_type()


Suggested change

-        if isinstance(compression, str):
-            compression = Codec(compression)
-        self.csv_format.compression = \
-            (<Codec> compression).unwrap().compression_type()
+        elif isinstance(compression, str):
+            self.csv_format.compression = _ensure_compression(compression)
+        elif isinstance(compression, Codec):
+            self.csv_format.compression = \
+                (<Codec> compression).unwrap().compression_type()
+        else:
+            raise TypeError(f'Cannot set compression with value of type {type(compression)}')
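The type dispatch in the suggestion above can be sketched in plain Python. This is only an illustration under assumptions, not the pyarrow code: `CompressionType`, `FakeCodec`, and `resolve_compression` are hypothetical stand-ins for the Cython enum, `Codec`, and `_ensure_compression`, and the leading `None` branch is assumed from the `elif` that opens the suggestion.

```python
from enum import Enum


class CompressionType(Enum):
    """Hypothetical stand-in for Arrow's compression enum."""
    UNCOMPRESSED = "uncompressed"
    GZIP = "gzip"
    BZ2 = "bz2"


class FakeCodec:
    """Hypothetical stand-in for a codec wrapper object."""

    def __init__(self, name):
        # Look the codec up by its lowercase name; unknown names raise ValueError.
        self.compression_type = CompressionType(name.lower())


def resolve_compression(compression):
    """Map None, a codec name, or a codec object to a CompressionType."""
    if compression is None:
        return CompressionType.UNCOMPRESSED
    elif isinstance(compression, str):
        return CompressionType(compression.lower())
    elif isinstance(compression, FakeCodec):
        return compression.compression_type
    else:
        # Anything else is a caller error, reported with the offending type.
        raise TypeError(
            f"Cannot set compression with value of type {type(compression)}"
        )
```

Handling each accepted type in its own branch keeps the failure mode explicit: an unsupported value fails with a `TypeError` instead of reaching a cast.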

bkietz · 2021-03-12T16:25:09Z

cpp/src/arrow/dataset/file_csv.cc

+  ARROW_ASSIGN_OR_RAISE(auto file, source.Open());
+  if (format.compression == Compression::UNCOMPRESSED) {
+    input = file;


(For follow-up, maybe ARROW-8981) maybe it'd be more useful to encapsulate compressed FileSources with an overload of Open()

Suggested change

-  ARROW_ASSIGN_OR_RAISE(auto file, source.Open());
-  if (format.compression == Compression::UNCOMPRESSED) {
-    input = file;
+  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<io::InputStream> file, source.Open(format.compression));

rather than the (currently ignored) FileSource::compression property.

nealrichardson

Thanks for doing this! A couple of suggestions on the R side.

nealrichardson · 2021-03-12T16:53:33Z

r/R/dataset-format.R

+                                 compression = CompressionType$UNCOMPRESSED) {
+  dataset___CsvFileFormat__Make(opts, compression)


We have a helper used in a few places that makes this a little more human-friendly:

Suggested change

-                                 compression = CompressionType$UNCOMPRESSED) {
-  dataset___CsvFileFormat__Make(opts, compression)
+                                 compression = "uncompressed") {
+  dataset___CsvFileFormat__Make(opts, compression_from_name(compression))

Should also add a note in the FileFormat$create docs above (around L48) that compression is an option now.

nealrichardson · 2021-03-12T16:54:05Z

r/tests/testthat/test-dataset.R

+  dst_dir <- make_temp_dir()
+  dst_file <- file.path(dst_dir, "data.csv.gz")
+  write.csv(df1, gzfile(dst_file), row.names = FALSE, quote = FALSE)
+  format <- FileFormat$create("csv", compression = CompressionType$GZIP)


With the above suggestion, this becomes:

Suggested change

-  format <- FileFormat$create("csv", compression = CompressionType$GZIP)
+  format <- FileFormat$create("csv", compression = "gzip")

lidavidm · 2021-03-12T21:18:51Z

Thanks for the feedback. I've fixed things and opened up the PR (the issue in #9680 apparently only affects my local build, not CI).

nealrichardson

Thanks! Changes look good, just one minor suggestion on the docs

nealrichardson · 2021-03-13T15:50:55Z

r/R/dataset-format.R

+#'   `format = "csv"``:
+#'   * `compression`: Assume CSV files have been compressed with this codec.
+#'   Any options from [CsvParseOptions] may also be passed.
+#'
 #'   `format = "parquet"``:


Suggested change

-#'   `format = "parquet"``:
+#'   `format = "parquet"`:

Should fix this while we're here.

r/R/dataset-format.R

nealrichardson

I just pushed the R doc tweaks I wanted, so don't worry about it

lidavidm · 2021-03-13T15:59:21Z

Ah thanks for fixing that, I was just about to take a look.

pitrou · 2021-03-15T14:01:25Z

cpp/src/arrow/dataset/file_csv_test.cc

+                         ::testing::Values(Compression::UNCOMPRESSED));
+#ifdef ARROW_WITH_BROTLI
+INSTANTIATE_TEST_SUITE_P(TestBrotliCsv, TestCsvFileFormat,
+                         ::testing::Values(Compression::BROTLI));


Can you also test with bz2 (which is much more likely for compression of CSV files than Brotli)?
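As a rough analogue of instantiating the same test suite once per codec, here is a minimal sketch that runs one round-trip check for both gzip and bz2. It uses only the Python standard library rather than Arrow or googletest; the `OPENERS` table, `CSV_TEXT`, and `round_trip` are hypothetical names introduced for illustration.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical fixture table: file extension -> stdlib opener for that codec.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}

CSV_TEXT = "a,b\n1,x\n2,y\n"


def round_trip(extension):
    """Write CSV_TEXT with the codec for `extension`, then read it back."""
    opener = OPENERS[extension]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.csv" + extension)
        # newline="" disables newline translation in both directions.
        with opener(path, "wt", encoding="utf-8", newline="") as f:
            f.write(CSV_TEXT)
        with opener(path, "rt", encoding="utf-8", newline="") as f:
            return f.read()


# One check per codec, mirroring a parametrized test suite.
for ext in OPENERS:
    assert round_trip(ext) == CSV_TEXT
```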

jorisvandenbossche · 2021-03-15T14:12:35Z

> It does not autodetect the type of compression (but perhaps this could be added, by inspecting FileSource).

Small note here: the Python API for reading plain CSV files (using pyarrow.csv) automatically detects compressed files and doesn't have an explicit option for that. So ideally, the dataset CSV reading would work similarly, I think.
But AFAIK, the decompressing for pyarrow.csv currently happens on the Python side (and not in C++)? (i.e. get_input_stream in the Cython code detects compression)

lidavidm · 2021-03-15T14:14:29Z

> It does not autodetect the type of compression (but perhaps this could be added, by inspecting FileSource).
>
> Small note here: the Python API for reading plain CSV files (using pyarrow.csv) automatically detects compressed files and doesn't have an explicit option for that. So ideally, the dataset CSV reading would work similarly, I think.
> But AFAIK, the decompressing for pyarrow.csv currently happens on the Python side (and not in C++)? (i.e. get_input_stream in the Cython code detects compression)

Yeah, we'd have to implement that on the C++ side as well. It could be tackled in ARROW-8981 as part of the refactoring that Ben suggested above for that issue, too.

bkietz · 2021-03-15T14:55:08Z

If we want to support detection of compression then that requires a fairly significant change to this PR. As written, compression is a property of the FileFormat, which is not mutated (even during discovery). Thus we couldn't look at (for example) the .gz extension on provided file sources and switch from "CSV" to "gzipped CSV". Compression-as-FileFormat-property paints us into a corner WRT guessing compression.

Adding discovery of file formats would give us a place to put this functionality, but that's a larger change and definitely out of scope here.

If guessing compression ever becomes a priority, I'd recommend removing compression-as-property and instead writing Result<shared_ptr<InputStream>> FileSource::OpenCompressed(optional<Compression::type> = {}) (without an explicit compression type, it will guess what codec to use). This can replace the use of FileSource::Open in file_csv.cc:OpenReader.
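The shape of that proposal (an explicit codec when given, otherwise a guess from the file extension) can be sketched in Python with the standard library. This is not Arrow's implementation: `open_compressed` and both lookup tables are hypothetical, and the set of extensions Arrow actually recognizes may differ.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical lookup tables, for illustration only.
_EXTENSION_TO_CODEC = {".gz": "gzip", ".bz2": "bz2"}
_CODEC_OPENERS = {"uncompressed": open, "gzip": gzip.open, "bz2": bz2.open}


def open_compressed(path, compression=None):
    """Open `path` for binary reading, decompressing if needed.

    With compression=None the codec is guessed from the file extension;
    an unrecognized extension means the file is read as-is.
    """
    if compression is None:
        extension = os.path.splitext(path)[1].lower()
        compression = _EXTENSION_TO_CODEC.get(extension, "uncompressed")
    try:
        opener = _CODEC_OPENERS[compression]
    except KeyError:
        raise ValueError(f"Unsupported compression: {compression!r}") from None
    return opener(path, "rb")


# Usage: the ".gz" suffix selects gzip without the caller naming a codec.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "f.csv.gz")
    with gzip.open(path, "wb") as f:
        f.write(b"a,b\n1,2\n")
    with open_compressed(path) as f:
        data = f.read()
```

Because the codec is resolved per file at open time, a directory mixing `f.csv` and `f.csv.gz` needs no per-format setting, which is the transparency discussed in this thread.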

lidavidm · 2021-03-15T15:01:11Z

In that case, we could still keep the property if people wanted to force a certain compression type instead of guessing it, right? (Though maybe that isn't something that's ever done.)

bkietz · 2021-03-15T15:03:53Z

I'd say forcing a given compression is something we could add later if there's specific demand. IMO, this PR should provide a single clear answer to supporting compressed CSV, be it:

* compression is a property of CsvFileFormat, or
* compression is transparent to CsvFileFormat

lidavidm · 2021-03-15T15:05:00Z

I'll rework this then, since from the original issue, people expect compression to be transparent to CSV.

bkietz · 2021-03-15T17:04:40Z

cpp/src/arrow/dataset/file_base.cc

+  ARROW_ASSIGN_OR_RAISE(auto file, Open());
+  auto actual_compression = Compression::type::UNCOMPRESSED;
+  if (!compression.has_value()) {
+    // Guess compression from file extension


This can be made more compact with arrow::fs::internal::GetAbstractPathExtension and arrow::util::GetCompressionType

bkietz

LGTM, merging

This adds support for reading compressed CSV datasets in C++/Python/R. Files' compression will be guessed from their extensions (`f.csv.gz` -> gzip compression).

Closes apache#9685 from lidavidm/arrow-10372

Lead-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>

lidavidm added the Component: R, Component: C++, and Component: Python labels Mar 12, 2021

lidavidm force-pushed the arrow-10372 branch 2 times, most recently from 157e14c to aacd903 Compare March 12, 2021 15:37

bkietz requested changes Mar 12, 2021

nealrichardson requested changes Mar 12, 2021

lidavidm force-pushed the arrow-10372 branch from aacd903 to 1a7e7e1 Compare March 12, 2021 17:24

lidavidm marked this pull request as ready for review March 12, 2021 21:18

nealrichardson reviewed Mar 13, 2021

nealrichardson approved these changes Mar 13, 2021

pitrou reviewed Mar 15, 2021

lidavidm force-pushed the arrow-10372 branch from 4ffd49b to ae47a5e Compare March 15, 2021 14:08

bkietz reviewed Mar 15, 2021

lidavidm force-pushed the arrow-10372 branch 2 times, most recently from feb1671 to 48a63fe Compare March 15, 2021 17:43

lidavidm and others added 4 commits March 16, 2021 09:38

ARROW-10372: [C++][Dataset] Support compressed CSV

aa7c61b

ARROW-10372: [Python][Dataset] Support compressed CSV in Python

3e7b971

ARROW-10372: [R][Dataset] Support compressed CSV in R

e63f030

Update docs

81f9cf8

lidavidm added 3 commits March 16, 2021 09:38

ARROW-10372: [C++][Dataset] Auto-detect compression for CSV

32181e0

ARROW-10372: [Python][Dataset] Auto-detect compression for CSV

10aa089

ARROW-10372: [R][Dataset] Auto-detect compression for CSV

ae0c43d

lidavidm force-pushed the arrow-10372 branch from 48a63fe to ae0c43d Compare March 16, 2021 13:41

bkietz approved these changes Mar 17, 2021

bkietz closed this in 06cb1a6 Mar 17, 2021

asfimport mentioned this pull request Jan 5, 2022

[C++][Dataset] Read compressed CSVs #26358

Closed

jinchengchenghh mentioned this pull request Apr 19, 2024

[GLUTEN-5414][VL] FEAT: Support read CSV apache/incubator-gluten#5447

Merged
