Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] Use duck typing for arrow dataset #5998

Merged
merged 3 commits into from
Jan 27, 2023

Conversation

changhiskhan
Copy link
Contributor

Currently the pyarrow.dataset.Scanner.from_dataset static method is being called to create a scanner which depends on the Arrow C++ Dataset API. This makes it harder to integrate Rust packages using arrow-rs because these use the Arrow C data interface which is very very limited (for stability purposes).

Instead, this is a small PR that uses the Dataset.scanner method to create the Scanner. For regular pyarrow Datasets, that just calls the same Scanner.from_dataset under the hood so there is no difference in behavior. But for Rust-based packages integrated via pyo3, it provides a way to override the behavior and create its own scanner, as long as the final data still meets the RecordBatchReader C data interface.

Currently the pyarrow.dataset.Scanner.from_dataset static method
is being called to create a scanner which depends on the Arrow
C++ Dataset API. This makes it harder to integrate Rust packages
using arrow-rs because these use the Arrow C data interface which
is very very limited (for stability purposes).

Instead, this is a small PR that uses the Dataset.scanner method
to create the Scanner. For regular pyarrow Datasets, that just
calls the same Scanner.from_dataset under the hood so there is no
difference in behavior. But for Rust-based packages integrated via pyo3,
it provides a way to override the behavior and create its own scanner,
as long as the final data still meets the RecordBatchReader
C data interface.
@Mytherin
Copy link
Collaborator

Thanks for the PR! Looks like the CI is not passing - could you have a look?



if can_run:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you're changing this file, can you update it to use pytests importorskip instead?

@changhiskhan
Copy link
Contributor Author

Thanks for the PR! Looks like the CI is not passing - could you have a look?

Will do . It looks like what @Mause pointed out, the test case was running but can_run was false i think? Will change it to importorskip

@changhiskhan
Copy link
Contributor Author

ok looks like it passed. Thanks @Mause for pointing out importorskip that did the trick.

@Mytherin Mytherin merged commit 4248488 into duckdb:master Jan 27, 2023
@Mytherin
Copy link
Collaborator

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants