[Python][Dataset] Improve ergonomy of the FileSystemDataset constructor #24483

Closed
asfimport opened this issue Mar 31, 2020 · 2 comments

Comments

@asfimport

Currently, to manually create a FileSystemDataset, you can do something like:

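import pyarrow as pa
import pyarrow.fs
import pyarrow.dataset as ds

# `schema` is assumed to be an existing pa.Schema; all arguments are positional.
dataset = ds.FileSystemDataset(
    schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
    ["data_file1.parquet", "data_file2.parquet"],
    [ds.field('file') == 1, ds.field('file') == 2])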

There are some usability improvements we can make, though:

  • Allow passing the arguments by name to improve the readability of the calling code (currently they all have to be passed positionally, because of how they are declared in Cython as not None)
  • I would maybe change the order of the arguments (e.g. start with the paths; we don't need to match the order of the C++ constructor)
  • Potentially allow partitions to be optional, in which case they would be set to a list of ScalarExpression(True) values.
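
To make the three points above concrete, here is a rough pure-Python sketch of the argument handling they suggest. It does not call pyarrow at all: dataset_kwargs is a hypothetical helper name, ALWAYS_TRUE is a plain bool standing in for ScalarExpression(True), and the length check is an added assumption rather than something proposed in this issue.

```python
from typing import Any, Dict, Optional, Sequence

# Stand-in for ScalarExpression(True); a plain bool, not a pyarrow object.
ALWAYS_TRUE = True


def dataset_kwargs(paths: Sequence[str], *, schema: Any, format: Any,
                   filesystem: Any,
                   partitions: Optional[Sequence[Any]] = None) -> Dict[str, Any]:
    """Hypothetical helper: paths first, everything else by keyword."""
    paths = list(paths)
    if not partitions:
        # None or empty: fill in one always-true expression per file.
        partitions = [ALWAYS_TRUE] * len(paths)
    elif len(partitions) != len(paths):
        # Assumption for this sketch: partitions pair up one-to-one with paths.
        raise ValueError("expected one partition expression per path")
    return {
        "schema": schema,
        "format": format,
        "filesystem": filesystem,
        "paths": paths,
        "partitions": list(partitions),
    }


# Placeholder strings are used instead of real pyarrow objects.
kwargs = dataset_kwargs(
    ["data_file1.parquet", "data_file2.parquet"],
    schema="<schema>", format="<parquet format>", filesystem="<local fs>")
print(kwargs["partitions"])  # [True, True]
```

The bare * in the signature forces everything after the paths to be passed by name, which is one possible way to get the readability asked for in the first point.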

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Joris Van den Bossche / @jorisvandenbossche

PRs and other links:

Note: This issue was originally created as ARROW-8290. Please see the migration documentation for further details.

@asfimport

Ben Kietzman / @bkietz:
Small amenity: if an empty vector is passed for partitions, we will populate it with scalar(true) automatically.

@asfimport

Krisztian Szucs / @kszucs:
Issue resolved by pull request 6913
#6913
