[python] Use duck typing for arrow dataset #5998
Merged
Currently, the pyarrow.dataset.Scanner.from_dataset static method is called to create a scanner, which depends on the Arrow C++ Dataset API. This makes it harder to integrate Rust packages built on arrow-rs, because they use the Arrow C data interface, which is deliberately limited (for stability purposes).
Instead, this small PR uses the Dataset.scanner method to create the Scanner. For regular pyarrow Datasets, that calls the same Scanner.from_dataset under the hood, so there is no difference in behavior. For Rust-based packages integrated via pyo3, it provides a way to override the behavior and supply their own scanner, as long as the final data still meets the RecordBatchReader C data interface.
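To illustrate the idea, here is a minimal sketch of the duck-typed call pattern. It deliberately avoids importing pyarrow so it runs anywhere: `FakeScanner`, `CustomDataset`, and `open_reader` are hypothetical names invented for this example, and a plain Python generator stands in for a real RecordBatchReader. The only assumption carried over from pyarrow is the method shape, `dataset.scanner(columns=...)` followed by `to_reader()`. This is not the code changed in this PR.

```python
class FakeScanner:
    """Hypothetical stand-in for a scanner; only to_reader() is modeled."""

    def __init__(self, rows, columns):
        self._rows = rows
        self._columns = columns

    def to_reader(self):
        # Lazily yield each row projected onto the requested columns.
        # A real implementation would return a RecordBatchReader instead.
        for row in self._rows:
            yield {name: row[name] for name in self._columns}


class CustomDataset:
    """Hypothetical dataset, e.g. one backed by a Rust extension via pyo3."""

    def __init__(self, rows):
        self._rows = rows

    def scanner(self, columns=None):
        # The dataset decides how its own scanner is built.
        if columns is None:
            columns = list(self._rows[0]) if self._rows else []
        return FakeScanner(self._rows, columns)


def open_reader(dataset, columns=None):
    # Duck typing: ask the object for its scanner instead of calling a
    # static constructor that requires a specific dataset class.
    return dataset.scanner(columns=columns).to_reader()


rows = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
result = list(open_reader(CustomDataset(rows), columns=["a"]))
```

Because `open_reader` only relies on the `scanner` method being present, a regular pyarrow Dataset and a custom dataset can both be passed to the same caller.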