kosiew
Contributor

@kosiew kosiew commented Sep 19, 2025

Which issue does this PR close?

Rationale for this change

SessionContext.read_table previously required a datafusion.catalog.Table (the Python Table wrapper) and forwarded its .table member into the Rust binding. That meant objects that expose a __datafusion_table_provider__() API returning a PyCapsule (a TableProvider exported via the FFI) could not be passed directly to read_table and instead had to be registered in the catalog first. This added an unnecessary registration round-trip and prevented ergonomic use of PyCapsule-backed/custom table providers.

This PR makes read_table accept either a datafusion.catalog.Table or any Python object that implements __datafusion_table_provider__() and returns a PyCapsule that passes validation. The change removes the need to register a provider just to obtain a DataFrame and unifies the behavior with other places that already accept PyCapsule-backed providers.
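
The accepted-input rule can be sketched in plain Python. This is only an illustration and not the code in this PR: TableProviderExportable is modeled here as a typing.Protocol, and Table, FakeProvider, and classify_read_table_input are hypothetical stand-ins so the snippet runs without datafusion installed.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TableProviderExportable(Protocol):
    """Anything that can hand out a table-provider PyCapsule."""

    def __datafusion_table_provider__(self) -> object: ...


class Table:
    """Hypothetical stand-in for datafusion.catalog.Table."""

    def __init__(self, raw: object) -> None:
        self.table = raw  # the wrapped native table

    def __datafusion_table_provider__(self) -> object:
        # Stand-in only: the real wrapper would return a PyCapsule.
        return self.table


class FakeProvider:
    """Hypothetical custom provider; a real one would return a PyCapsule."""

    def __datafusion_table_provider__(self) -> object:
        return object()


def classify_read_table_input(obj: Any) -> str:
    """Report which path read_table would take for obj (illustrative only)."""
    if isinstance(obj, Table):
        return "table"
    if isinstance(obj, TableProviderExportable):
        return "provider"
    raise TypeError(
        "expected a Table or an object implementing "
        "__datafusion_table_provider__"
    )
```

With these stand-ins, a Table instance takes the "table" path, FakeProvider takes the "provider" path, and any other object raises TypeError.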

What changes are included in this PR?

High-level summary of the changes applied across Python and Rust layers:

  • Python documentation

    • docs/source/user-guide/io/table_provider.rst: document SessionContext.read_table(provider) usage.
  • Python bindings

    • python/datafusion/catalog.py

      • Add Table.__datafusion_table_provider__ to expose the underlying PyCapsule from the Python Table wrapper so it can be treated as a TableProvider-exportable object by other Python code.
    • python/datafusion/context.py

      • Update SessionContext.read_table typing and docstring to accept either Table or a TableProviderExportable object (an object implementing __datafusion_table_provider__).
      • Adjust internal dispatch so both Table instances and provider objects are supported.
  • Rust core

    • src/utils.rs

      • Add foreign_table_provider_from_capsule and try_table_provider_from_object helpers to centralize validation and extraction of FFI_TableProvider from a PyCapsule and to convert it into an Arc<dyn TableProvider>.
    • src/catalog.rs

      • Use try_table_provider_from_object to detect and accept provider objects that expose __datafusion_table_provider__ when registering tables into the catalog.
      • Add PyTable::__datafusion_table_provider__ so Table can export an FFI_TableProvider PyCapsule (this is what python/datafusion/catalog.py calls through the Python layer).
      • Simplify and reorganize provider extraction logic inside register_table and schema provider lookup to prefer direct PyTable extraction, then try_table_provider_from_object, then fallback to constructing a Dataset as before.
    • src/context.rs

      • Update PySessionContext::register_table to accept PyCapsule-backed provider objects by using try_table_provider_from_object.
      • Update PySessionContext::read_table to accept a generic PyAny bound and detect either PyTable (native, avoid FFI round-trip) or any object that exposes __datafusion_table_provider__. Returns an error if neither condition is met.
    • src/udtf.rs

      • Use try_table_provider_from_object when calling Python table functions so UDTFs that return a provider object via __datafusion_table_provider__ are accepted.
  • Tests

    • python/tests/test_catalog.py

      • Add test_register_raw_table_without_capsule to ensure RawTable objects can be registered (monkeypatch ensures the capsule path is not invoked), queried, and deregistered.
    • python/tests/test_context.py

      • Add test_read_table_accepts_table_provider to verify ctx.read_table(provider) works when provider is a PyCapsule-backed object, and that ctx.read_table(table) still works for regular Table objects.
      • Minor import cleanup (moved uuid4 import to module-level where appropriate).

Other smaller maintenance changes: imports reorganized and some helper functions added to centralize PyCapsule validation and conversion.
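
The extraction order described for register_table (direct native table first, then __datafusion_table_provider__, then the Dataset fallback) can be sketched as follows. The real logic lives in Rust; RawTable, CapsuleProvider, and resolve_provider below are hypothetical Python stand-ins used only to show the ordering.

```python
from typing import Any


class RawTable:
    """Hypothetical stand-in for the native table class (PyTable in Rust)."""

    def __datafusion_table_provider__(self) -> object:
        # Raising here makes it visible if the capsule path is taken by mistake.
        raise AssertionError("capsule path should not run for native tables")


class CapsuleProvider:
    """Hypothetical provider; a real one would return a PyCapsule."""

    def __datafusion_table_provider__(self) -> object:
        return "capsule"


def resolve_provider(obj: Any) -> tuple[str, Any]:
    """Mirror the described order: native table, capsule provider, dataset."""
    # 1. Native table: use it directly, with no FFI round-trip.
    if isinstance(obj, RawTable):
        return ("native", obj)
    # 2. Object exposing __datafusion_table_provider__: go through the capsule.
    export = getattr(obj, "__datafusion_table_provider__", None)
    if callable(export):
        return ("capsule", export())
    # 3. Anything else: fall back to treating it as a dataset, as before.
    return ("dataset", obj)
```

Checking the native type before probing for the dunder method is what keeps a RawTable off the capsule path, even though it also implements __datafusion_table_provider__.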

Are these changes tested?

Yes — new unit tests have been added to validate the new behavior and to guard against regressions:

  • test_read_table_accepts_table_provider (in python/tests/test_context.py) exercises reading from a registered provider and from a provider object directly.
  • test_register_raw_table_without_capsule (in python/tests/test_catalog.py) verifies raw table registration path does not trigger the capsule-based extraction and that queries against the registered table return expected results.

Existing tests were left intact and the new tests exercise both the Python and Rust-side changes.

Are there any user-facing changes?

Yes — API behavior and documentation are updated:

  • SessionContext.read_table now accepts either a datafusion.catalog.Table or any object that implements __datafusion_table_provider__() and returns a datafusion_table_provider PyCapsule. Users can now call ctx.read_table(provider) on provider objects without registering them first.
  • New docs in docs/source/user-guide/io/table_provider.rst show the direct-use pattern via ctx.read_table(provider).

This is backwards-compatible: previously-accepted inputs (the Python Table wrapper and Dataset-like objects) continue to work.

No public API breaking changes were made to function signatures on the Rust side; changes are additive and focus on extending accepted input types and centralizing provider extraction logic.

Notes / Caveats

  • The capsule name used and validated is "datafusion_table_provider". Provider objects must implement __datafusion_table_provider__() that returns a PyCapsule with that name.
  • A PyTable (the native table behind the Python Table wrapper) still exposes its provider via __datafusion_table_provider__(). However, the Rust read_table path prefers direct PyTable usage to avoid an unnecessary FFI round-trip when the object is already a RawTable.
  • The FFI_TableProvider::new(..., Some(runtime)) call means the created FFI wrapper captures a Tokio runtime handle — ensure that embedding contexts keep compatible runtimes available.
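
To make the capsule-name requirement concrete, here is a rough sketch of a name check done from Python through ctypes. It assumes CPython (ctypes.pythonapi is CPython-specific) and is not how the Rust helpers in this PR do it. make_capsule and has_expected_name are hypothetical helpers, and the payload is a dummy integer rather than a real FFI_TableProvider.

```python
import ctypes

CAPSULE_NAME = b"datafusion_table_provider"  # name stated in the notes above

# CPython C-API entry points, reached through ctypes for illustration only.
_new = ctypes.pythonapi.PyCapsule_New
_new.restype = ctypes.py_object
_new.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

_is_valid = ctypes.pythonapi.PyCapsule_IsValid
_is_valid.restype = ctypes.c_int
_is_valid.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Dummy payload; a real capsule would point at an FFI_TableProvider.
_payload = ctypes.c_int(0)

# A capsule stores only a pointer to its name, so the bytes must stay alive.
_names: list[bytes] = []


def make_capsule(name: bytes) -> object:
    """Build a capsule with the given name around the dummy payload."""
    _names.append(name)
    return _new(ctypes.addressof(_payload), name, None)


def has_expected_name(obj: object) -> bool:
    """True only for a PyCapsule whose name equals CAPSULE_NAME."""
    return bool(_is_valid(obj, CAPSULE_NAME))
```

Under these assumptions, has_expected_name accepts a capsule created with CAPSULE_NAME and rejects both a capsule with any other name and a non-capsule object.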

@kosiew kosiew changed the title from "Allow SessionContext.read\_table to accept objects exposing __datafusion_table_provider__ (PyCapsule)" to "Allow SessionContext.read_table to accept objects exposing __datafusion_table_provider__ (PyCapsule)" on Sep 19, 2025
@kosiew kosiew marked this pull request as ready for review September 19, 2025 14:38
@timsaucer
Member

I'm going to hold off on reviewing this one until #1243 is complete because they are so closely related and I think changes there will impact this code.

Development

Successfully merging this pull request may close these issues.

SessionContext.read_table should take PyCapsule objects