ARROW-7547: [C++][Dataset][Python] Add ParquetFileFormat options #6235
I would have expected this to simply receive a parquet::ArrowReaderProperties and parquet::ReaderProperties. Is there a reason this is not the case, and members are instead copied into a new struct that will diverge over time?
Also, remove the #pragma once from files whose content you didn't otherwise modify; maybe move that to a separate patch.
@fsaintjacques I implemented that (taking [...] Furthermore, as noted above: it seems to me that ParquetFormat should expose a set of field names (rather than indices) to be read as dictionaries, which conflicts directly with [...]
I tried this out a bit, working nicely! Two main remarks:
The options you have now added to ParquetFileFormat are reader options. But if we are going to start adding writing functionality as well, they might come into conflict with writer options. Should they then still be part of the Format object?
You mentioned in an inline comment that we should rather use field names instead of indices. Fully agreed with that (in the Cython parquet code, the names are converted into indices inside the reader class, so the user can pass names). Can this be done on the C++ side, or is there something blocking this?
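A minimal sketch of the name-to-index conversion described above, in plain Python. The helper name `resolve_column_indices` and the flat list of schema names are hypothetical and chosen only for illustration; this is not the Cython or C++ code from the PR, and it assumes top-level column names are unique.

```python
from typing import Iterable, List


def resolve_column_indices(schema_names: List[str],
                           requested_names: Iterable[str]) -> List[int]:
    """Map user-supplied column names to their positions in the schema.

    Hypothetical helper: assumes unique, top-level column names.
    """
    # Build a lookup from column name to its position in the schema.
    positions = {name: index for index, name in enumerate(schema_names)}

    requested = list(requested_names)
    missing = [name for name in requested if name not in positions]
    if missing:
        # Fail loudly instead of silently ignoring unknown names.
        raise KeyError(f"columns not in schema: {missing}")

    # Return indices in schema order, without duplicates.
    return sorted({positions[name] for name in requested})


indices = resolve_column_indices(["id", "city", "price"], ["price", "city"])
```

With a helper like this, the user-facing option can accept names while the reader continues to work with indices internally.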
@jorisvandenbossche I have namespaced reader-specific options within [...]. PTAL
Some notes; LMK if you want backup on the R stuff.
I have namespaced reader specific options
+1
@fsaintjacques Since [...]
I am not really familiar with those options. So if your comment about them is accurate, it indeed doesn't sound useful to expose them here.
@fsaintjacques @jorisvandenbossche I was incorrect: these options are not related to threading. They are provided to reduce memory overhead in the case where large row groups would otherwise be mapped whole into memory. I'll reintroduce the options and update the doc comments.
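As a rough analogy for the memory trade-off described above, the following plain-Python sketch reads a byte stream in chunks of bounded size instead of loading it whole. It is only an illustration: `iter_buffered` and its `buffer_size` argument are hypothetical names, and this is not how the Parquet reader is implemented.

```python
import io
from typing import BinaryIO, Iterator


def iter_buffered(stream: BinaryIO, buffer_size: int) -> Iterator[bytes]:
    """Yield successive chunks of at most buffer_size bytes from stream."""
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:
            # An empty read signals the end of the stream.
            return
        yield chunk


# Ten bytes stand in for a large row group; at most four are held per chunk.
data = io.BytesIO(b"x" * 10)
sizes = [len(chunk) for chunk in iter_buffered(data, buffer_size=4)]
```

Peak memory here is bounded by the chunk size rather than by the size of the whole input, at the cost of more read calls.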
ICYMI there's a bug in the Rcpp you added: https://github.com/apache/arrow/pull/6235/checks?check_run_id=458210010#step:5:1020
One final question but otherwise I approve. I have some ideas for making the open_dataset interface nicer with these options, but I'll follow up with that separately.
@@ -449,6 +466,10 @@ ScannerBuilder <- R6Class("ScannerBuilder", inherit = Object,
dataset___ScannerBuilder__UseThreads(self, threads)
self
},
BatchSize = function(batch_size) {
Why exactly is this method added in this PR? Doesn't seem related to ParquetFileFormat options. Also, it's not tested.
This is a property on the parquet reader, so initially I had it alongside the other options in this PR. However, it's not a Parquet-specific concern, and it's one that might need tweaking on a scan-by-scan basis, so I extracted it to ScanOptions.
It's tested in C++, but I can certainly add an equivalent test in R.
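The split described above can be sketched with plain Python dataclasses: reader options live on the format object, while batch size is set per scan. All class names, field names, and default values here are hypothetical placeholders for illustration; they are not the Arrow C++, Python, or R API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ParquetReaderOptionsSketch:
    """Hypothetical reader-specific options, namespaced under the format."""
    use_buffered_stream: bool = False
    buffer_size: int = 8192  # placeholder default, not Arrow's
    dictionary_columns: List[str] = field(default_factory=list)


@dataclass
class ParquetFormatSketch:
    """Hypothetical format object: holds options tied to the file format."""
    reader_options: ParquetReaderOptionsSketch = field(
        default_factory=ParquetReaderOptionsSketch)


@dataclass
class ScanOptionsSketch:
    """Hypothetical per-scan options: not specific to any file format."""
    batch_size: int = 1024  # placeholder default, not Arrow's


def describe_scan(fmt: ParquetFormatSketch,
                  scan: ScanOptionsSketch) -> Dict[str, object]:
    """Combine format-level and scan-level settings for a single scan."""
    return {
        "dictionary_columns": list(fmt.reader_options.dictionary_columns),
        "use_buffered_stream": fmt.reader_options.use_buffered_stream,
        "batch_size": scan.batch_size,
    }


# One format object is reused across scans with different batch sizes.
fmt = ParquetFormatSketch(ParquetReaderOptionsSketch(dictionary_columns=["city"]))
small = describe_scan(fmt, ScanOptionsSketch(batch_size=64))
large = describe_scan(fmt, ScanOptionsSketch(batch_size=4096))
```

Keeping batch size out of the format object means the same format can be shared by scans that need different batch sizes.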
This reverts commit ed53f97.
Add parquet reader and arrow reader options to file format