Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-8220: [Python] Make dataset FileFormat objects serializable #6720

Closed
wants to merge 6 commits into from

Conversation

kszucs
Copy link
Member

@kszucs kszucs commented Mar 25, 2020

Also did some refactoring for a more pleasant user API.

@github-actions
Copy link

@jorisvandenbossche
Copy link
Member

Also did some refactoring for a more pleasant user API.

I also don't like the ParquetFileFormatReaderOptions very much as user API, but, I am not sure we can just pass them all to ParquetFileFormat, since we are going to use that for both reading and writing, and mixing keywords for those all in a single constructor is going to get confusing.

I think we should rather give a better API in a parquet / reading specific API like parquet.read_table or ParquetDataset.

@kszucs
Copy link
Member Author

kszucs commented Mar 26, 2020

@jorisvandenbossche Agreed. May we defer your suggestion to a follow-up?

@jorisvandenbossche
Copy link
Member

Well, my comment is kind of: we need to keep ParquetFileFormatReaderOptions, so since you are removing that, I would rather not defer that to a follow-up (but you don't need to agree with keeping it, of course :-))

@kszucs
Copy link
Member Author

kszucs commented Mar 26, 2020

ParquetFileFormatReaderOptions was still bound to the ParquetFormat, your proposal is more about making the reader and writer options independent from the ParquetFormat. So this PR doesn't change that dependency.

@kszucs
Copy link
Member Author

kszucs commented Mar 26, 2020

I can wire these options, but it's not entirely clear because we don't have a read() method on the datasets. Once we add support for writing we can refine the API.

@jorisvandenbossche
Copy link
Member

ParquetFileFormatReaderOptions was still bound to the ParquetFormat, your proposal is more about making the reader and writer options independent from the ParquetFormat. So this PR doesn't change that dependency.

Yes, it is still bound to the format, but it splits its keywords in two groups:

format = ParquetFileFormat(reader_options=dict(...), writer_options=dict(...))

it's not entirely clear because we don't have a read() method on the datasets

I think to_table is the "read" method?

Once we add support for writing we can refine the API.

Yeah, I fully agree much of this discussion is a bit "up in the air", since we don't yet have writing, so don't yet know how we would want to make the API for writing.
But it's for that reason that I commented to keep it as is, as there is also no clear reason yet for changing IMO, since we don't know the final API with writing (but it was an explicit decision, at least on the C++ side, to have this a separate set of options instead of direct ParquetFileFormat options). But OK, since it is easy to put it back later, I won't block removing it if you prefer that :)

@kszucs
Copy link
Member Author

kszucs commented Mar 26, 2020

@jorisvandenbossche updated as you requested

python/pyarrow/_dataset.pyx Outdated Show resolved Hide resolved
uint32_t buffer_size
set dictionary_columns

def __init__(self, bint use_buffered_stream=False, buffer_size=8192,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, see my earlier comment about this default. But when setting it like this on the options class, it becomes more difficult to do that (unless not typing the attribute as an uint)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean.

Copy link
Member Author

@kszucs kszucs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@kszucs
Copy link
Member Author

kszucs commented Mar 30, 2020

Build failure is unrelated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants