[C++][Dataset] Revisit File discovery failure mode #23918

Closed
asfimport opened this issue Jan 24, 2020 · 2 comments

asfimport commented Jan 24, 2020

Currently, with the default value of FileSystemFactoryOptions::exclude_invalid_files, unsupported files (IO error, invalid format, corruption, missing compression codecs, etc.) are silently ignored when creating a FileSystemSource.

We should change this behavior to propagate an error from the Inspect/Finish calls by default, and let the user opt in to skipping such files by toggling exclude_invalid_files. The error should contain at least the file path and, if possible, a decipherable error message.
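
To make the proposed failure mode concrete, here is a minimal stdlib-only Python sketch. It is not Arrow's implementation: `InvalidFileError`, `check_parquet_footer` and `discover` are hypothetical names, and the only validity check is whether the file ends with the 4-byte Parquet magic `PAR1`. It only illustrates the idea of failing by default with the path in the message, with skipping as an opt-in.

```python
import os
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # Parquet files end (and begin) with these 4 bytes


class InvalidFileError(OSError):
    """Hypothetical error type: carries the offending path and a readable reason."""


def check_parquet_footer(path: Path) -> None:
    """Raise InvalidFileError unless `path` ends with the Parquet magic bytes."""
    try:
        with path.open("rb") as f:
            f.seek(0, os.SEEK_END)
            size = f.tell()
            f.seek(max(size - len(PARQUET_MAGIC), 0))
            tail = f.read()
    except OSError as exc:
        # IO failures are reported with the path instead of being swallowed.
        raise InvalidFileError(f"Could not open '{path}': {exc}") from exc
    if tail != PARQUET_MAGIC:
        raise InvalidFileError(
            f"Could not open '{path}': Parquet magic bytes not found in footer"
        )


def discover(root, exclude_invalid_files=False):
    """Return the files under `root` that pass the check.

    By default the first invalid file raises; with exclude_invalid_files=True
    invalid files are skipped instead (the opt-in behavior proposed above).
    """
    valid = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            check_parquet_footer(path)
        except InvalidFileError:
            if not exclude_invalid_files:
                raise
            continue
        valid.append(path)
    return valid
```

With this shape, a caller who wants the old best-effort behavior has to ask for it explicitly via `discover(root, exclude_invalid_files=True)`; everyone else gets an error that names the bad file.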

Reporter: Francois Saint-Jacques / @fsaintjacques
Assignee: Francois Saint-Jacques / @fsaintjacques

Note: This issue was originally created as ARROW-7673. Please see the migration documentation for further details.

Krisztian Szucs / @kszucs:
@fsaintjacques So in 0.16 we plan to provide best-effort readers, and we are targeting this ticket at 1.0.0, right?

Francois Saint-Jacques / @fsaintjacques:
This has been refactored/fixed in ARROW-8058:

In [40]: da.dataset("/home/fsaintjacques/datasets/nyc-tlc/csv/2016", format="csv")
Out[40]: <pyarrow._dataset.FileSystemDataset at 0x7fef446b2930>

In [41]: da.dataset("/home/fsaintjacques/datasets/nyc-tlc/csv/2016", format="parquet")
...
OSError: Could not open parquet input source '/home/fsaintjacques/datasets/nyc-tlc/csv/2016/01/data.csv': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

In [42]: da.dataset("/home/fsaintjacques/datasets/nyc-tlc/parquet/2016", format="parquet")
Out[42]: <pyarrow._dataset.FileSystemDataset at 0x7fef447ad7f0>
