ARROW-8220: [Python] Make dataset FileFormat objects serializable #6720

kszucs · 2020-03-25T20:56:05Z

Also did some refactoring for a more pleasant user API.

github-actions · 2020-03-25T21:02:12Z

https://issues.apache.org/jira/browse/ARROW-8220

jorisvandenbossche · 2020-03-26T09:20:39Z

> Also did some refactoring for a more pleasant user API.

I also don't like ParquetFileFormatReaderOptions very much as a user API, but I am not sure we can just pass all of its options to ParquetFileFormat: we are going to use that class for both reading and writing, and mixing the keywords for both in a single constructor is going to get confusing.

I think we should rather provide a nicer interface in a Parquet-specific, reading-specific API such as parquet.read_table or ParquetDataset.

kszucs · 2020-03-26T11:07:10Z

@jorisvandenbossche Agreed. May we defer your suggestion to a follow-up?

jorisvandenbossche · 2020-03-26T11:09:54Z

Well, my comment is kind of: we need to keep ParquetFileFormatReaderOptions, so since you are removing that, I would rather not defer that to a follow-up (but you don't need to agree with keeping it, of course :-))

kszucs · 2020-03-26T11:14:21Z

ParquetFileFormatReaderOptions was still bound to the ParquetFormat; your proposal is more about making the reader and writer options independent of the ParquetFormat. So this PR doesn't change that dependency.

kszucs · 2020-03-26T11:17:28Z

I can wire these options, but it's not entirely clear because we don't have a read() method on the datasets. Once we add support for writing we can refine the API.

jorisvandenbossche · 2020-03-26T12:40:06Z

> ParquetFileFormatReaderOptions was still bound to the ParquetFormat, your proposal is more about making the reader and writer options independent from the ParquetFormat. So this PR doesn't change that dependency.

Yes, it is still bound to the format, but it splits its keywords in two groups:

format = ParquetFileFormat(reader_options=dict(...), writer_options=dict(...))

> it's not entirely clear because we don't have a read() method on the datasets

I think to_table is the "read" method?

> Once we add support for writing we can refine the API.

Yeah, I fully agree that much of this discussion is a bit "up in the air": we don't have writing yet, so we don't yet know what the API for writing should look like.
But that is exactly why I suggested keeping it as is. IMO there is no clear reason to change it yet, since we don't know the final API with writing (and it was an explicit decision, at least on the C++ side, to have this as a separate set of options instead of direct ParquetFileFormat options). But OK, since it is easy to put it back later, I won't block removing it if you prefer that :)
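To make the "two groups of keywords" idea above concrete, here is a minimal pure-Python sketch. It is not pyarrow code: `SketchParquetFormat`, `SketchReaderOptions`, `SketchWriterOptions` and the `compression` field are hypothetical names invented for illustration, and only the three reader fields and their defaults are taken from the diff hunk quoted later in this thread.

```python
from dataclasses import dataclass


@dataclass
class SketchReaderOptions:
    # Field names and defaults mirror the hunk quoted in this thread.
    use_buffered_stream: bool = False
    buffer_size: int = 8192
    dictionary_columns: frozenset = frozenset()


@dataclass
class SketchWriterOptions:
    # Purely illustrative; no writer options exist in this PR.
    compression: str = "snappy"


def _coerce(cls, value):
    """Accept None (defaults), a dict of keywords, or a ready instance."""
    if value is None:
        return cls()
    if isinstance(value, dict):
        return cls(**value)
    if isinstance(value, cls):
        return value
    raise TypeError(f"expected {cls.__name__}, dict or None, "
                    f"got {type(value).__name__}")


class SketchParquetFormat:
    """Hypothetical format object keeping read and write keywords apart."""

    def __init__(self, reader_options=None, writer_options=None):
        self.reader_options = _coerce(SketchReaderOptions, reader_options)
        self.writer_options = _coerce(SketchWriterOptions, writer_options)


fmt = SketchParquetFormat(reader_options=dict(buffer_size=4096))
print(fmt.reader_options, fmt.writer_options)
```

Keeping the two groups as separate objects means a reading keyword can never collide with a writing keyword of the same name, which is the confusion the single-constructor approach would risk.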

kszucs · 2020-03-26T13:40:56Z

@jorisvandenbossche updated as you requested

jorisvandenbossche · 2020-03-26T13:47:07Z

python/pyarrow/_dataset.pyx

+        uint32_t buffer_size
+        set dictionary_columns
+
+    def __init__(self, bint use_buffered_stream=False, buffer_size=8192,


Hmm, see my earlier comment about this default. But when setting it like this on the options class, it becomes more difficult to do that (unless the attribute is not typed as a uint).

I'm not sure what you mean.

kszucs

+1

kszucs · 2020-03-30T12:00:27Z

Build failure is unrelated.

kszucs requested a review from jorisvandenbossche March 25, 2020 20:56

jorisvandenbossche reviewed Mar 26, 2020


jorisvandenbossche reviewed Mar 26, 2020

kszucs force-pushed the ARROW-8220 branch from c10746e to 241e7f5 Compare March 27, 2020 13:00

kszucs added 6 commits March 27, 2020 22:23

make FileFormat options serializable

19840bb

docstring

0573784

expose ParquetReadOptions

5d27220

remove parametrization

5c41ab2

support dict input

bf5fc33

add note to update the defaults in the python bindings

1493ce2
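As an illustration of what "serializable" means for objects like these, the sketch below shows one common way to make a wrapper class survive a pickle round trip: `__reduce__` returns the constructor together with the arguments needed to rebuild the object. This is an assumption about the general technique, not a copy of the code in these commits; `SketchReadOptions`, `SketchFileFormat` and `roundtrip` are hypothetical names, and only the three option fields come from the hunk quoted above.

```python
import pickle


class SketchReadOptions:
    """Hypothetical stand-in for a reader options object."""

    def __init__(self, use_buffered_stream=False, buffer_size=8192,
                 dictionary_columns=None):
        self.use_buffered_stream = bool(use_buffered_stream)
        self.buffer_size = int(buffer_size)
        self.dictionary_columns = set(dictionary_columns or ())

    def _state(self):
        # Constructor arguments, in positional order.
        return (self.use_buffered_stream, self.buffer_size,
                sorted(self.dictionary_columns))

    def __eq__(self, other):
        if not isinstance(other, SketchReadOptions):
            return NotImplemented
        return self._state() == other._state()

    def __reduce__(self):
        # pickle rebuilds the object by calling SketchReadOptions(*state).
        return SketchReadOptions, self._state()


class SketchFileFormat:
    """Hypothetical format object that owns its read options."""

    def __init__(self, read_options=None):
        if read_options is None:
            read_options = SketchReadOptions()
        elif not isinstance(read_options, SketchReadOptions):
            raise TypeError("read_options must be SketchReadOptions or None")
        self.read_options = read_options

    def __eq__(self, other):
        if not isinstance(other, SketchFileFormat):
            return NotImplemented
        return self.read_options == other.read_options

    def __reduce__(self):
        # The nested options object is pickled through its own __reduce__.
        return SketchFileFormat, (self.read_options,)


def roundtrip(obj):
    """Serialize and deserialize obj, returning the reconstructed copy."""
    return pickle.loads(pickle.dumps(obj))
```

Defining `__eq__` alongside `__reduce__` is what makes a round trip testable: the restored object is a different instance, so equality has to compare the stored options rather than identity.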

kszucs force-pushed the ARROW-8220 branch from c4a02cd to 1493ce2 Compare March 27, 2020 21:26

kszucs commented Mar 30, 2020

kszucs closed this in 6be085f Mar 30, 2020

asfimport mentioned this pull request Mar 30, 2020

[Python] Make dataset FileFormat objects serializable #24417

Closed
