Allow SessionContext.read_table to accept objects exposing __datafusion_table_provider__ (PyCapsule)
#1246
Which issue does this PR close?
Rationale for this change
`SessionContext.read_table` previously required a `datafusion.catalog.Table` (the Python `Table` wrapper) and forwarded its `.table` member into the Rust binding. That meant objects that expose a `__datafusion_table_provider__()` API returning a PyCapsule (a `TableProvider` exported via the FFI) could not be passed directly to `read_table` and instead had to be registered in the catalog first. This added an unnecessary registration round-trip and prevented ergonomic use of PyCapsule-backed/custom table providers.

This PR makes `read_table` accept either a `datafusion.catalog.Table` or any Python object that implements `__datafusion_table_provider__()` and returns a properly-validated PyCapsule. The change removes the need to register a provider just to obtain a `DataFrame` and unifies the behavior with other places that already accept PyCapsule-backed providers.

What changes are included in this PR?
High-level summary of the changes applied across Python and Rust layers:
Python documentation
- `docs/source/user-guide/io/table_provider.rst`: document `SessionContext.read_table(provider)` usage.

Python bindings
- `python/datafusion/catalog.py`
  - `Table.__datafusion_table_provider__`: exposes the underlying PyCapsule from the Python `Table` wrapper so it can be treated as a TableProvider-exportable object by other Python code.
- `python/datafusion/context.py`
  - `SessionContext.read_table`: typing and docstring accept either `Table` or a `TableProviderExportable` object (an object implementing `__datafusion_table_provider__`).
  - `Table` instances and provider objects are supported.

Rust core
- `src/utils.rs`
  - `foreign_table_provider_from_capsule` and `try_table_provider_from_object`: helpers that centralize validation and extraction of `FFI_TableProvider` from a PyCapsule and convert it into an `Arc<dyn TableProvider>`.
- `src/catalog.rs`
  - `try_table_provider_from_object`: used to detect and accept provider objects that expose `__datafusion_table_provider__` when registering tables into the catalog.
  - `PyTable::__datafusion_table_provider__`: lets `Table` export an `FFI_TableProvider` PyCapsule (this is what `python/catalog.py` calls through the Python layer).
  - `register_table` and schema provider lookup: prefer direct `PyTable` extraction, then `try_table_provider_from_object`, then fall back to constructing a `Dataset` as before.
- `src/context.rs`
  - `PySessionContext::register_table`: accepts PyCapsule-backed provider objects by using `try_table_provider_from_object`.
  - `PySessionContext::read_table`: accepts a generic `PyAny` bound and detects either `PyTable` (native, avoiding an FFI round-trip) or any object that exposes `__datafusion_table_provider__`. Returns an error if neither condition is met.
- `src/udtf.rs`
  - `try_table_provider_from_object`: used when calling Python table functions so that UDTFs returning a provider object via `__datafusion_table_provider__` are accepted.

Tests
- `python/tests/test_catalog.py`
  - `test_register_raw_table_without_capsule`: ensures raw `RawTable` objects can be registered (a monkeypatch ensures the capsule path is not invoked), queried, and deregistered.
- `python/tests/test_context.py`
  - `test_read_table_accepts_table_provider`: verifies that `ctx.read_table(provider)` works when `provider` is a PyCapsule-backed object, and that `ctx.read_table(table)` still works for regular `Table` objects.
  - `uuid4` import moved to module level where appropriate.

Other smaller maintenance changes: imports reorganized and some helper functions added to centralize PyCapsule validation and conversion.
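
The lookup order described above for `register_table` (native table first, then the capsule protocol, then the `Dataset` fallback) and for `read_table` (same first two steps, then an error) can be sketched in plain Python. This is only an illustration of the decision order, not the Rust implementation: the `RawTable` and `CapsuleProvider` classes and the `resolve_provider_path` function below are hypothetical stand-ins.

```python
class RawTable:
    """Hypothetical stand-in for the native table type (PyTable on the Rust side)."""


class CapsuleProvider:
    """Hypothetical object that exposes the capsule protocol."""

    def __datafusion_table_provider__(self):
        # A real provider would return a PyCapsule; a placeholder is enough
        # here because the sketch only inspects which path would be taken.
        return object()


def resolve_provider_path(obj, *, allow_dataset_fallback):
    """Return which extraction path an object would take.

    allow_dataset_fallback=True mirrors the register_table behaviour,
    allow_dataset_fallback=False mirrors read_table (error if nothing matches).
    """
    # 1. Native table: use it directly and skip the FFI round-trip.
    if isinstance(obj, RawTable):
        return "native"
    # 2. Any object exposing the capsule protocol.
    if hasattr(obj, "__datafusion_table_provider__"):
        return "capsule"
    # 3. Fall back to treating the object as a Dataset, if permitted.
    if allow_dataset_fallback:
        return "dataset"
    raise TypeError(
        "expected a table or an object implementing __datafusion_table_provider__"
    )
```

Checking the native type before the capsule protocol matters because, per this PR, the `Table` wrapper also exposes `__datafusion_table_provider__`; testing for it first keeps the cheaper path.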
Are these changes tested?
Yes — new unit tests have been added to validate the new behavior and to guard against regressions:
- `test_read_table_accepts_table_provider` (in `python/tests/test_context.py`) exercises reading from a registered provider and from a provider object directly.
- `test_register_raw_table_without_capsule` (in `python/tests/test_catalog.py`) verifies that the raw table registration path does not trigger the capsule-based extraction and that queries against the registered table return expected results.

Existing tests were left intact and the new tests exercise both the Python and Rust-side changes.
Are there any user-facing changes?
Yes — API behavior and documentation are updated:
- `SessionContext.read_table` now accepts either a `datafusion.catalog.Table` or any object that implements `__datafusion_table_provider__()` and returns a `datafusion_table_provider` PyCapsule. Users can now call `ctx.read_table(provider)` on provider objects without registering them first.
- `docs/source/user-guide/io/table_provider.rst` shows the direct-use pattern via `ctx.read_table(provider)`.

This is backwards-compatible: previously accepted inputs (the Python `Table` wrapper and `Dataset`-like objects) continue to work.

No public API breaking changes were made to function signatures on the Rust side; changes are additive and focus on extending accepted input types and centralizing provider extraction logic.
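
As a rough illustration of which objects qualify as "provider objects" here, the duck-typed contract can be written as a `typing.Protocol`. The PR names a `TableProviderExportable` type, but its exact definition is not shown in this description, so the protocol below is an assumed shape; `MyProvider`, `NotAProvider`, and `is_provider_exportable` are hypothetical names used only for the example.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TableProviderExportable(Protocol):
    """Assumed shape: any object with a __datafusion_table_provider__ method."""

    def __datafusion_table_provider__(self) -> Any: ...


class MyProvider:
    """Hypothetical provider; a real one would return a PyCapsule."""

    def __datafusion_table_provider__(self) -> Any:
        return object()  # placeholder for the exported capsule


class NotAProvider:
    """Hypothetical object without the export method."""


def is_provider_exportable(obj: object) -> bool:
    # runtime_checkable protocols only check that the method exists,
    # not what it returns, so the capsule itself is still validated later.
    return isinstance(obj, TableProviderExportable)
```

With a real provider, the direct-use pattern from this PR would then be `ctx.read_table(provider)` with no prior registration step.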
Notes / Caveats
- Provider objects must implement `__datafusion_table_provider__()` returning a PyCapsule named `"datafusion_table_provider"`.
- `PyTable` (the native Python `Table` wrapper) still exposes its provider via `__datafusion_table_provider__()`; however, the Rust `read_table` path prefers direct `PyTable` usage to avoid unnecessary FFI round-trips when the object is already a `RawTable`.
- The `FFI_TableProvider::new(..., Some(runtime))` call means the created FFI wrapper captures a Tokio runtime handle; ensure that embedding contexts keep compatible runtimes available.
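
The capsule-name requirement in the first note can be observed from Python using CPython's capsule C API through `ctypes`. This is a sketch of the kind of name check involved, assuming a standard CPython interpreter; it is not how the Rust helpers in this PR are implemented, and `make_capsule` and `has_expected_capsule_name` are hypothetical helper names.

```python
import ctypes

# Capsule names are C strings; these bytes objects stay alive at module level
# because PyCapsule_New stores the pointer rather than copying the name.
EXPECTED_NAME = b"datafusion_table_provider"
WRONG_NAME = b"some_other_capsule"

_new = ctypes.pythonapi.PyCapsule_New
_new.restype = ctypes.py_object
_new.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

_is_valid = ctypes.pythonapi.PyCapsule_IsValid
_is_valid.restype = ctypes.c_int
_is_valid.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Dummy payload: a capsule needs a non-NULL pointer, which is never
# dereferenced in this sketch.
_PAYLOAD = ctypes.c_int(0)


def make_capsule(name: bytes):
    """Build a demo capsule with the given name and no destructor."""
    return _new(ctypes.addressof(_PAYLOAD), name, None)


def has_expected_capsule_name(obj: object) -> bool:
    """True only for a PyCapsule whose name matches EXPECTED_NAME."""
    return bool(_is_valid(obj, EXPECTED_NAME))
```

A capsule created with any other name, or an object that is not a capsule at all, fails the check, which is why a provider's `__datafusion_table_provider__()` has to use exactly that name.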