
ARROW-7547: [C++][Dataset][Python] Add ParquetFileFormat options #6235

Closed · wants to merge 24 commits

Conversation

@bkietz (Member) commented Jan 20, 2020

Add parquet reader and arrow reader options to file format


@bkietz bkietz force-pushed the 7547-Python-Dataset-Additional branch from f70eeee to 7f0b358 Compare January 20, 2020 21:58
@fsaintjacques (Contributor) left a comment

I would have expected this to simply receive a parquet::ArrowReaderProperties and a parquet::ReaderProperties. Is there a reason this is not the case? The members are copied into a new struct, which will diverge over time.

Also, remove the #pragma once from files whose content you didn't modify; maybe in a separate patch.

@bkietz (Member, Author) commented Jan 21, 2020

@fsaintjacques I implemented that initially (taking /.*Properties/ explicitly), but replicating those fields seemed more robust than documenting which fields would be ignored. For example: ReaderProperties::memory_pool (which will be ignored in favor of ScanContext::pool) and ArrowReaderProperties::use_threads (which is redundant, since the scanner will already parallelize across threads).

Furthermore, as noted above, it seems to me that ParquetFileFormat should expose a set of field names (rather than indices) to be read as dictionaries, which conflicts directly with ArrowReaderProperties::read_dictionary's column indices.
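The name-to-index conversion the dataset layer would need (since the parquet reader tracks dictionary columns by index) is a small mapping step. A minimal sketch in plain Python; the function name and shape are hypothetical, purely illustrative:

```python
def dictionary_column_indices(schema_names, dictionary_columns):
    """Map dictionary column names to their indices in a schema.

    Unknown names raise KeyError so typos surface early instead of
    being silently ignored.
    """
    index_of = {name: i for i, name in enumerate(schema_names)}
    return sorted(index_of[name] for name in dictionary_columns)

print(dictionary_column_indices(["a", "b", "c"], ["c", "a"]))  # [0, 2]
```

Doing this once, against the file's schema, would let the public API accept names while the underlying reader properties keep working with indices.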

@bkietz bkietz force-pushed the 7547-Python-Dataset-Additional branch from 7f0b358 to 9ae341c Compare February 5, 2020 16:05
@kszucs kszucs force-pushed the 7547-Python-Dataset-Additional branch from 76dbbef to be466c0 Compare February 7, 2020 10:13
@bkietz bkietz force-pushed the 7547-Python-Dataset-Additional branch from be466c0 to 3912553 Compare February 10, 2020 16:59
@bkietz bkietz marked this pull request as ready for review February 10, 2020 21:34
@jorisvandenbossche (Member) left a comment

I tried this out a bit, working nicely! Two main remarks:

The options you now added to ParquetFileFormat are reader options. But if we are going to start adding write functionality as well, might that conflict with writer options? Should they then still be part of the Format object?

You mentioned in an inline comment that we should rather use field names instead of indices. Fully agreed (in the cython parquet code, the names are converted into indices inside the reader class, so the user can pass names). Can this be done on the C++ side, or is something blocking that?

Review threads on python/pyarrow/_dataset.pyx (resolved)
@bkietz bkietz force-pushed the 7547-Python-Dataset-Additional branch 5 times, most recently from 54c25f4 to 9197c62 Compare February 18, 2020 16:04
@bkietz (Member, Author) commented Feb 18, 2020

@jorisvandenbossche I have namespaced the reader-specific options within ParquetFileFormat so they won't conflict with writer options. I've also switched to using names to designate dictionary columns. batch_size is a general consideration of scans, so I've promoted it to a scan option (@fsaintjacques: testing that necessitated resolving https://issues.apache.org/jira/browse/ARROW-7338).

PTAL
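The resulting layout can be sketched with hypothetical Python dataclasses mirroring the description above (these are illustrative names, not the actual Arrow structs): reader-specific options are namespaced under the format object, while batch_size lives with the scan.

```python
from dataclasses import dataclass, field

@dataclass
class ParquetReaderOptions:
    # Format-specific: which columns to read as dictionary-encoded,
    # designated by name rather than index.
    dictionary_columns: tuple = ()

@dataclass
class ParquetFileFormat:
    # Reader options are namespaced here, leaving room for a future
    # writer_options without any naming conflict.
    reader_options: ParquetReaderOptions = field(default_factory=ParquetReaderOptions)

@dataclass
class ScanOptions:
    # batch_size applies to every file format, so it belongs to the
    # scan rather than to any one format. Default is illustrative.
    batch_size: int = 32 * 1024

# Usage: format-level vs scan-level configuration stay separate.
fmt = ParquetFileFormat(ParquetReaderOptions(dictionary_columns=("key",)))
opts = ScanOptions(batch_size=1_000)
```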

@nealrichardson (Contributor) left a comment

Some notes; LMK if you want backup on the R stuff.

Review threads on r/src/dataset.cpp, r/R/dataset.R, and r/tests/testthat/test-dataset.R (resolved)
@jorisvandenbossche (Member) left a comment

> I have namespaced reader specific options

+1

Review threads on cpp/src/arrow/dataset/scanner.h and python/pyarrow/_dataset.pyx (resolved)
@bkietz (Member, Author) commented Feb 19, 2020

@fsaintjacques Since use_buffered_stream/buffer_size were introduced to resolve https://issues.apache.org/jira/browse/ARROW-6180 (multiple threads accessing a single RandomAccessFile), maybe this shouldn't be a user-facing option; instead it could be configured automatically during a multithreaded scan?

@jorisvandenbossche (Member) commented
I am not really familiar with those options, so if your comment about them is accurate, it indeed doesn't sound useful to expose them here.

@bkietz (Member, Author) commented Feb 20, 2020

@fsaintjacques @jorisvandenbossche I was incorrect: these options are not related to threading. They are provided to reduce memory overhead in cases where large row groups would otherwise be mapped whole into memory. I'll reintroduce the options and update the doc comments.
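The memory trade-off can be illustrated with a minimal, hypothetical sketch (plain Python, not the parquet implementation): streaming a payload through a fixed-size buffer keeps peak memory near buffer_size, regardless of how large the row group is.

```python
import io

def scan_with_buffer(raw: bytes, buffer_size: int = 8192) -> int:
    """Consume a payload through a fixed-size buffer, analogous to
    use_buffered_stream/buffer_size: peak memory stays near buffer_size
    instead of the full row-group size.
    """
    stream = io.BytesIO(raw)
    consumed = 0
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:
            break
        consumed += len(chunk)  # process the chunk, then let it go
    return consumed
```

The cost is extra read calls; the benefit is bounded memory when a column chunk is much larger than the buffer.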

@bkietz bkietz force-pushed the 7547-Python-Dataset-Additional branch from 0bd7cef to 12e33d3 Compare February 20, 2020 16:19
Review threads on cpp/src/arrow/dataset/file_parquet.h, cpp/src/arrow/dataset/scanner.cc, cpp/src/arrow/dataset/scanner.h, cpp/src/arrow/status.h, cpp/src/arrow/dataset/dataset.h, cpp/src/arrow/dataset/dataset.cc, and cpp/CMakeLists.txt (resolved)
@nealrichardson (Contributor) commented

ICYMI there's a bug in the Rcpp you added: https://github.com/apache/arrow/pull/6235/checks?check_run_id=458210010#step:5:1020

@bkietz bkietz force-pushed the 7547-Python-Dataset-Additional branch from 139d274 to 41f9f10 Compare February 21, 2020 15:05
@bkietz bkietz force-pushed the 7547-Python-Dataset-Additional branch from f68c8de to e311959 Compare February 22, 2020 19:59
@nealrichardson (Contributor) left a comment

One final question but otherwise I approve. I have some ideas for making the open_dataset interface nicer with these options, but I'll follow up with that separately.

@@ -449,6 +466,10 @@ ScannerBuilder <- R6Class("ScannerBuilder", inherit = Object,
dataset___ScannerBuilder__UseThreads(self, threads)
self
},
BatchSize = function(batch_size) {
Contributor commented on this diff:

Why exactly is this method added in this PR? It doesn't seem related to ParquetFileFormat options. Also, it's not tested.

@bkietz (Member, Author) replied:

This is a property on the parquet reader, so initially I had it alongside the other options in this PR. However, it's not a parquet-specific concern, and it's one that might need tweaking on a scan-by-scan basis, so I extracted it to ScanOptions.

It's tested in C++, but I can certainly add an equivalent test in R.
