
ARROW-7547: [C++][Dataset][Python] Add ParquetFileFormat options #6235

Closed · wants to merge 24 commits

Conversation

@bkietz (Member) commented Jan 20, 2020

Add parquet reader and arrow reader options to file format


@bkietz bkietz force-pushed the 7547-Python-Dataset-Additional branch from f70eeee to 7f0b358 Compare January 20, 2020 21:58
@fsaintjacques (Contributor) left a comment

I would have expected this to simply receive a parquet::ArrowReaderProperties and a parquet::ReaderProperties. Is there a reason this is not the case? The members are copied into a new struct, which will diverge over time.

Also, remove the #pragma once from files whose content you didn't modify; maybe in a separate patch.

@bkietz (Member, Author) commented Jan 21, 2020

@fsaintjacques I implemented that initially (taking /.*Properties/ explicitly), but replicating those fields seemed more robust than documenting which fields would be ignored. For example: ReaderProperties::memory_pool (which will be ignored in favor of ScanContext::pool) and ArrowReaderProperties::use_threads (which is redundant, since the scanner will already parallelize across threads).

Furthermore, as noted above, it seems to me that ParquetFileFormat should expose a set of field names (rather than indices) to be read as dictionaries, which conflicts directly with ArrowReaderProperties::read_dictionary's column indices.
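The name-to-index conversion the dataset layer would need (since the parquet reader tracks dictionary columns by index) is a small mapping step. A minimal sketch in plain Python; the function name and shape are hypothetical, purely illustrative:

```python
def dictionary_column_indices(schema_names, dictionary_columns):
    """Map dictionary column names to their indices in a schema.

    Unknown names raise KeyError so typos surface early instead of
    being silently ignored.
    """
    index_of = {name: i for i, name in enumerate(schema_names)}
    return sorted(index_of[name] for name in dictionary_columns)

print(dictionary_column_indices(["a", "b", "c"], ["c", "a"]))  # [0, 2]
```

Doing this once, against the file's schema, would let the public API accept names while the underlying reader properties keep working with indices.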

@bkietz bkietz force-pushed the 7547-Python-Dataset-Additional branch from 7f0b358 to 9ae341c Compare February 5, 2020 16:05
@kszucs kszucs force-pushed the 7547-Python-Dataset-Additional branch from 76dbbef to be466c0 Compare February 7, 2020 10:13
@bkietz bkietz force-pushed the 7547-Python-Dataset-Additional branch from be466c0 to 3912553 Compare February 10, 2020 16:59
@bkietz bkietz marked this pull request as ready for review February 10, 2020 21:34
@jorisvandenbossche (Member) left a comment

I tried this out a bit, working nicely! Two main remarks:

The options you now added to ParquetFileFormat are reader options. But if we are going to start adding write functionality as well, might that conflict with writer options? Should they then still be part of the Format object?

You mentioned in an inline comment that we should rather use field names instead of indices. Fully agreed (in the cython parquet code, the names are converted into indices inside the reader class, so the user can pass names). Can this be done on the C++ side, or is something blocking that?

Review threads on python/pyarrow/_dataset.pyx (resolved)
@bkietz bkietz force-pushed the 7547-Python-Dataset-Additional branch 5 times, most recently from 54c25f4 to 9197c62 Compare February 18, 2020 16:04
@bkietz (Member, Author) commented Feb 18, 2020

@jorisvandenbossche I have namespaced the reader-specific options within ParquetFileFormat so they won't conflict with writer options. I've also switched to using names to designate dictionary columns. batch_size is a general consideration of scans, so I've promoted it to a scan option (@fsaintjacques: testing that necessitated resolving https://issues.apache.org/jira/browse/ARROW-7338).

PTAL
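The resulting layout can be sketched with hypothetical Python dataclasses mirroring the description above (these are illustrative names, not the actual Arrow structs): reader-specific options are namespaced under the format object, while batch_size lives with the scan.

```python
from dataclasses import dataclass, field

@dataclass
class ParquetReaderOptions:
    # Format-specific: which columns to read as dictionary-encoded,
    # designated by name rather than index.
    dictionary_columns: tuple = ()

@dataclass
class ParquetFileFormat:
    # Reader options are namespaced here, leaving room for a future
    # writer_options without any naming conflict.
    reader_options: ParquetReaderOptions = field(default_factory=ParquetReaderOptions)

@dataclass
class ScanOptions:
    # batch_size applies to every file format, so it belongs to the
    # scan rather than to any one format. Default is illustrative.
    batch_size: int = 32 * 1024

# Usage: format-level vs scan-level configuration stay separate.
fmt = ParquetFileFormat(ParquetReaderOptions(dictionary_columns=("key",)))
opts = ScanOptions(batch_size=1_000)
```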

@nealrichardson (Contributor) left a comment

Some notes; LMK if you want backup on the R stuff.

Review threads on r/src/dataset.cpp, r/R/dataset.R, and r/tests/testthat/test-dataset.R (resolved)
@jorisvandenbossche (Member) left a comment

> I have namespaced reader specific options

+1

Review threads on cpp/src/arrow/dataset/scanner.h and python/pyarrow/_dataset.pyx (resolved)
@bkietz (Member, Author) commented Feb 19, 2020

@fsaintjacques Since use_buffered_stream/buffer_size were introduced to resolve https://issues.apache.org/jira/browse/ARROW-6180 (multiple threads accessing a single RandomAccessFile), maybe this shouldn't be a user-facing option; instead it could be configured automatically during a multithreaded scan?

@jorisvandenbossche (Member) commented
I am not really familiar with those options, so if your comment about them is accurate, it indeed doesn't sound useful to expose them here.

@bkietz (Member, Author) commented Feb 20, 2020

@fsaintjacques @jorisvandenbossche I was incorrect: these options are not related to threading. They are provided to reduce memory overhead in cases where large row groups would otherwise be mapped whole into memory. I'll reintroduce the options and update the doc comments.
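The memory trade-off can be illustrated with a minimal, hypothetical sketch (plain Python, not the parquet implementation): streaming a payload through a fixed-size buffer keeps peak memory near buffer_size, regardless of how large the row group is.

```python
import io

def scan_with_buffer(raw: bytes, buffer_size: int = 8192) -> int:
    """Consume a payload through a fixed-size buffer, analogous to
    use_buffered_stream/buffer_size: peak memory stays near buffer_size
    instead of the full row-group size.
    """
    stream = io.BytesIO(raw)
    consumed = 0
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:
            break
        consumed += len(chunk)  # process the chunk, then let it go
    return consumed
```

The cost is extra read calls; the benefit is bounded memory when a column chunk is much larger than the buffer.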

@bkietz bkietz force-pushed the 7547-Python-Dataset-Additional branch from 0bd7cef to 12e33d3 Compare February 20, 2020 16:19
Review threads on cpp/src/arrow/dataset/file_parquet.h, cpp/src/arrow/dataset/scanner.cc, cpp/src/arrow/dataset/scanner.h, cpp/src/arrow/status.h, cpp/src/arrow/dataset/dataset.h, cpp/src/arrow/dataset/dataset.cc, and cpp/CMakeLists.txt (resolved)
@nealrichardson (Contributor) commented

ICYMI there's a bug in the Rcpp you added: https://github.com/apache/arrow/pull/6235/checks?check_run_id=458210010#step:5:1020

@bkietz bkietz force-pushed the 7547-Python-Dataset-Additional branch from 139d274 to 41f9f10 Compare February 21, 2020 15:05
@bkietz bkietz force-pushed the 7547-Python-Dataset-Additional branch from f68c8de to e311959 Compare February 22, 2020 19:59
@nealrichardson (Contributor) left a comment

One final question but otherwise I approve. I have some ideas for making the open_dataset interface nicer with these options, but I'll follow up with that separately.

@@ -449,6 +466,10 @@ ScannerBuilder <- R6Class("ScannerBuilder", inherit = Object,
dataset___ScannerBuilder__UseThreads(self, threads)
self
},
BatchSize = function(batch_size) {
Contributor commented on this diff:

Why exactly is this method added in this PR? It doesn't seem related to ParquetFileFormat options. Also, it's not tested.

@bkietz (Member, Author) replied:

This is a property on the parquet reader, so initially I had it alongside the other options in this PR. However, it's not a parquet-specific concern, and it's one that might need tweaking on a scan-by-scan basis, so I extracted it to ScanOptions.

It's tested in C++, but I can certainly add an equivalent test in R.
