[python] Use duck typing for arrow dataset #5998

changhiskhan · 2023-01-25T20:50:53Z

Currently, the static method pyarrow.dataset.Scanner.from_dataset is called to create a scanner, which depends on the Arrow C++ Dataset API. This makes it harder to integrate Rust packages built on arrow-rs, because they use the Arrow C data interface, which is deliberately very limited (for stability purposes).

Instead, this small PR uses the Dataset.scanner method to create the Scanner. For regular pyarrow Datasets, that calls the same Scanner.from_dataset under the hood, so there is no difference in behavior. For Rust-based packages integrated via pyo3, it provides a way to override the behavior and create their own scanner, as long as the final data still meets the RecordBatchReader C data interface.
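To illustrate the duck-typing idea, here is a minimal pure-Python sketch. It is not the code in this PR and it does not import pyarrow or DuckDB: CustomDataset, CustomScanner, FakeReader and open_reader are all hypothetical stand-ins, and the batches are plain lists of dicts rather than Arrow record batches. The only point it makes is that the caller asks the object for its own scanner instead of handing the object to a static factory that expects one concrete type.

```python
# Hypothetical stand-ins only: none of these classes come from pyarrow or DuckDB.


class FakeReader:
    """Plays the role of a RecordBatchReader: iterating yields batches."""

    def __init__(self, batches):
        self._batches = list(batches)

    def __iter__(self):
        return iter(self._batches)


class CustomScanner:
    """Scanner supplied by the dataset itself (e.g. one backed by Rust)."""

    def __init__(self, rows, columns=None):
        self._rows = rows
        self._columns = columns

    def to_reader(self):
        cols = self._columns
        # Apply the column projection, then emit everything as one batch.
        projected = [
            {k: v for k, v in row.items() if cols is None or k in cols}
            for row in self._rows
        ]
        return FakeReader([projected])


class CustomDataset:
    """Any object with a scanner() method qualifies; no base class needed."""

    def __init__(self, rows):
        self._rows = rows

    def scanner(self, columns=None):
        return CustomScanner(self._rows, columns=columns)


def open_reader(dataset, columns=None):
    # Duck typing: call the dataset's own scanner() instead of passing the
    # dataset to a static factory that requires a specific concrete type.
    return dataset.scanner(columns=columns).to_reader()


dataset = CustomDataset([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])
batches = list(open_reader(dataset, columns=["a"]))
```

With a regular pyarrow Dataset the same call shape would go through pyarrow's own scanner, which is why the description above expects no change in behavior for existing users.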


Mytherin · 2023-01-26T09:50:00Z

Thanks for the PR! Looks like the CI is not passing - could you have a look?

Mause · 2023-01-26T10:19:49Z

tools/pythonpkg/tests/fast/arrow/test_dataset.py

+
+
+if can_run:


While you're changing this file, can you update it to use pytest's importorskip instead?
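For reference, a rough sketch of the pattern being suggested. pytest.importorskip imports the named module and returns it, or skips the whole test module if the import fails, so no manual can_run flag is needed. The module name "json" and the test function below are stand-ins chosen so the sketch runs anywhere; the test file under review would presumably name the pyarrow modules it needs instead.

```python
import pytest

# Returns the imported module, or skips this whole test module at collection
# time when the import fails. "json" is a stand-in for an optional dependency.
json_mod = pytest.importorskip("json")


def test_roundtrip():
    # Only runs when the import above succeeded.
    assert json_mod.loads(json_mod.dumps({"a": 1})) == {"a": 1}
```

Compared with an `if can_run:` guard, a missing dependency shows up as an explicit skip in the test report instead of the tests silently not being defined.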

changhiskhan · 2023-01-26T18:58:38Z

> Thanks for the PR! Looks like the CI is not passing - could you have a look?

Will do. It looks like it is what @Mause pointed out: the test case was running but can_run was false, I think? Will change it to importorskip.

changhiskhan · 2023-01-27T01:25:37Z

OK, looks like it passed. Thanks @Mause for pointing out importorskip; that did the trick.

Mytherin · 2023-01-27T13:15:57Z

Thanks!

changhiskhan added 2 commits January 25, 2023 12:45

add unit test

a25fe6b

Mause suggested changes Jan 26, 2023

update tests to use pytest.importorskip

c8fda48

Mytherin merged commit 4248488 into duckdb:master Jan 27, 2023
