Unify Table representations #1256
Conversation
Migrates the codebase from using `Table` to a `TableProvider`-based API, refactors registration and access paths to simplify catalog/context interactions, and updates documentation, tests, and examples. DataFrame view handling is improved (`into_view` is now public), the test suite is expanded to cover new registration and async SQL scenarios, and `TableProvider` now supports the `Send` trait across modules for safer concurrency. Minor import cleanup and utility adjustments (including a refined `pyany_to_table_provider`) are included.
DataFrame→TableProvider conversion, plus tests and FFI/PyCapsule improvements.

Registration logic & API
* Refactor of table provider registration logic for improved clarity and simpler call sites.
* Remove PyTableProvider registration from an internal module (reduces surprising side effects).
* Update the table registration method to call `register_table` instead of `register_table_provider`.
* Extend `register_table` to support `TableProviderExportable` so more provider types can be registered uniformly.
* Improve error messages related to registration failures (missing PyCapsule name and DataFrame registration errors).

DataFrame ↔ TableProvider conversions
* Introduce utility functions to simplify table provider conversions and centralize conversion logic.
* Rename `into_view_provider` → `to_view_provider` for clearer intent.
* Fix `from_dataframe` to return the correct type and update `DataFrame.into_view` to import the correct `TableProvider`.
* Remove an obsolete `dataframe_into_view` test case after the refactor.

FFI / PyCapsule handling
* Update `FFI_TableProvider` initialization to accept an optional parameter (improves FFI ergonomics).
* Introduce a `table_provider_from_pycapsule` utility to standardize PyCapsule-based construction.
* Improve the error message when a PyCapsule name is missing to help debugging.

DeltaTable & specific integrations
* Update TableProvider registration for `DeltaTable` to use the correct registration method (matches the new API surface).

Tests, docs & minor fixes
* Add tests for registering a `TableProvider` from a `DataFrame` and from a capsule to ensure conversion paths are covered.
* Fix a typo in the `register_view` docstring and another in the error message for the unsupported volatility type.
* Simplify version retrieval by removing exception handling around `PackageNotFoundError` (streamlines the code path).
* Removed unused helpers (`extract_table_provider`, `_wrap`) and dead code to simplify maintenance.
* Consolidated and streamlined table-provider extraction and registration logic; improved error handling and replaced a hardcoded error message with `EXPECTED_PROVIDER_MSG`.
* Marked `from_view` as deprecated; updated the deprecation message formatting and adjusted the warning `stacklevel` so it points to caller code.
* Removed the `Send` marker from TableProvider trait objects to increase type flexibility; review threading assumptions.
* Added type hints to the `register_schema` and `deregister_table` methods.
* Adjusted tests and exceptions (e.g., changed one test to expect `RuntimeError`) and updated test coverage accordingly.
* Introduced a refactored `TableProvider` class and enhanced Python integration by adding support for extracting `PyDataFrame` in `PySchema`.

Notes:
* Consumers should migrate away from `TableProvider::from_view` to the new TableProvider integration.
* Audit any code relying on `Send` for trait objects passed across threads.
* Update downstream tests and documentation to reflect the changed exception types and deprecation.
Utilities, docs, and robustness fixes
* Normalized table-provider handling and simplified the registration flow across the codebase; multiple commits centralize provider coercion and normalization.
* Introduced utility helpers (`coerce_table_provider`, `extract_table_provider`, `_normalize_table_provider`) to centralize extraction and error handling and improve clarity.
* Simplified `from_dataframe` / `into_view` behavior: clearer implementations, direct returns of DataFrame views where appropriate, and added internal tests for DataFrame flows.
* Fixed DataFrame registration semantics: enforce `TypeError` for invalid registrations; added handling for `DataFrameWrapper` by converting it to a view.
* Added tests, including a schema registration test using a PyArrow dataset and internal DataFrame tests to cover the new flows.
* Documentation improvements: expanded `from_dataframe` docstrings with parameter details, added usage examples for `into_view`, and documented deprecations (e.g., `register_table_provider` → `register_table`).
* Warning and UX fixes: synchronized the deprecation `stacklevel` so warnings point to caller code; improved `__dir__` to return sorted, unique attributes.
* Cleanup: removed unused imports (including an unused error import from `utils.rs`) and other dead code to reduce noise.
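Taken together, the DataFrame-view changes look roughly like the following sketch, assuming the now-public `into_view` and the unified `register_table` entry point described above:

```python
from datafusion import SessionContext

ctx = SessionContext()
df = ctx.sql("SELECT 1 AS value")

# into_view is now public: wrap the DataFrame as a view-backed table.
view = df.into_view()

# Register it through the single register_table entry point; the deprecated
# register_table_provider path forwards to the same logic.
ctx.register_table("df_view", view)
print(ctx.sql("SELECT value FROM df_view").collect())
```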
@kosiew Would you mind taking a look at this? I started from your PR and made a few adjustments to try to make it as ergonomic as possible for end users. My goal is that a consumer of this project does not need to know much about the internals of the table types as long as they have something that is table-like. With this change, most of the type coercion is pushed into a single place on the Rust side. I think it also makes it nice not to have multiple entry points such as `register_table` and `register_table_provider`. What do you think?
Centralizing the coercion logic in PyTable definitely improves ergonomics; callers can now hand almost anything “table-like” to Rust and let it figure things out, which is great. Restoring the capsule hook (even if it just forwards to the inner provider) keeps those advanced scenarios working without forcing ordinary users to think about internals.
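A minimal sketch of that forwarding, assuming `Table` stores its wrapped provider in an attribute (`_inner` is an illustrative name, not the actual field):

```python
class Table:
    def __init__(self, inner: object) -> None:
        self._inner = inner  # illustrative name for the wrapped provider

    def __datafusion_table_provider__(self) -> object:
        # Forward the capsule request to the inner provider so advanced
        # integrations that unwrap tables via the PyCapsule keep working.
        return self._inner.__datafusion_table_provider__()
```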
Can you expand on this? I don't understand the use case. I did run the integration tests in our repo without issue. I removed some of the tests you had added that were looking for `__datafusion_table_provider__`.
here's an example that shows the NotImplementedError (`examples/table_capsule_failure.py`):

```python
"""Demonstrate how missing __datafusion_table_provider__ breaks table UDTFs.

Run with ``python examples/table_capsule_failure.py``.

This example mirrors how advanced integrations unwrap ``Table`` instances via
the ``__datafusion_table_provider__`` PyCapsule. The refactor that removed this
method means user-defined table functions returning ``Table`` now raise
``NotImplementedError``, and the script prints the resulting error message
instead of crashing.
"""

from __future__ import annotations

from datafusion import SessionContext, Table, udtf


def main() -> None:
    """Register a Python table UDTF that returns a ``Table`` and trigger it."""
    ctx = SessionContext()
    failing_table = Table(ctx.sql("SELECT 1 AS value"))

    @udtf("capsule_dependent")
    def capsule_dependent_udtf() -> Table:
        """Return a ``Table`` so DataFusion unwraps it via the FFI capsule."""
        # Prior to the refactor the wrapper exposed
        # ``__datafusion_table_provider__`` so this conversion succeeded.
        # Without it the runtime raises a ``NotImplementedError`` complaining
        # about the missing attribute.
        return failing_table

    ctx.register_udtf(capsule_dependent_udtf)

    # Executing the UDTF now fails because ``Table`` no longer exposes the
    # ``__datafusion_table_provider__`` helper that PyTableFunction expects.
    try:
        ctx.sql("SELECT * FROM capsule_dependent()").collect()
        print("capsule_dependent() works")
    except NotImplementedError as err:
        # Document the regression by surfacing the missing capsule attribute
        # instead of crashing with a panic inside the execution engine.
        print(
            "capsule_dependent() failed due to missing "
            f"__datafusion_table_provider__: {err}"
        )


if __name__ == "__main__":
    main()
```

On my computer, I got this traceback:
I was able to reproduce the problem, but I think we can lean on the `__datafusion_table_provider__` hook to handle this.
Source: #1245. I tested with `SessionContext.read_table` and it chokes on raw PyCapsule objects.

```python
from __future__ import annotations

import ctypes

from datafusion import SessionContext, Table

# Keep the backing memory alive for the lifetime of the module so the capsule
# always wraps a valid (non-null) pointer. The capsule content is irrelevant
# for this regression example; we only need a non-null address.
_DUMMY_CAPSULE_BYTES = ctypes.create_string_buffer(b"x")


def make_table_provider_capsule() -> object:
    """Create a dummy PyCapsule with the expected table provider name."""
    pycapsule_new = ctypes.pythonapi.PyCapsule_New
    pycapsule_new.restype = ctypes.py_object
    pycapsule_new.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
    dummy_ptr = ctypes.cast(_DUMMY_CAPSULE_BYTES, ctypes.c_void_p)
    return pycapsule_new(dummy_ptr, b"datafusion_table_provider", None)


def main() -> None:
    """Attempt to use the capsule the same way existing callers do."""
    ctx = SessionContext()
    try:
        capsule = make_table_provider_capsule()
    except Exception as err:
        print("Creating the PyCapsule failed:", err)
        return
    ctx.read_table(capsule)


if __name__ == "__main__":
    main()
```

Running it raises this traceback:
I don't think that example demonstrates how users are expected to provide PyCapsule-based providers. I changed it slightly below to fit `TableProviderExportable`. With this change it does segfault, but I expect that, because it is not a valid object.
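A minimal sketch of that adjustment, reusing `make_table_provider_capsule` from the example above; `CapsuleWrapper` is a hypothetical name for an object that fits `TableProviderExportable` by exposing `__datafusion_table_provider__`:

```python
class CapsuleWrapper:
    """Hypothetical wrapper that fits the TableProviderExportable shape."""

    def __init__(self, capsule: object) -> None:
        self._capsule = capsule

    def __datafusion_table_provider__(self) -> object:
        # Hand the stored capsule to DataFusion when it asks for the provider.
        return self._capsule


ctx = SessionContext()
# As noted above, this still segfaults: the dummy capsule does not wrap a
# valid FFI provider. The wrapper only fixes the entry-point shape.
ctx.read_table(CapsuleWrapper(make_table_provider_capsule()))
```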
We still don't have an example/test to show that SessionContext.read_table can take a PyCapsule object (#1245). Can you resolve the conflicts and also leave #1245 open to move the PR forward?
```python
result = ctx.read_table(table).collect()
result = [r.column(0) for r in result]
assert result == expected
```
@kosiew Here is the unit test for read_table with a PyCapsule-based table provider. I removed the statement that this would close #1245 per your request, but I do think it has the appropriate unit test.
Thank you @kosiew for the collaboration on this! I think the final result is a really nice step forward!
Which issue does this PR close?
Closes #1239
Closes #1245
Rationale for this change
This is built on top of #1243.
With this change we have a single class and a single entry point for turning any table-like object into a `Table` class in Python. On the Rust side, we use this one struct for things like registering with the schema provider or session context.

What changes are included in this PR?
Pushes the Python object type evaluation into the Rust side with a constructor for the `PyTable` struct that handles:

* `PyTable` objects
* `DataFrame` instances, which are converted to views
* objects that export a table provider via the `__datafusion_table_provider__` PyCapsule

This removes the cognitive load on end users of having to understand the differences between any of these. If they have something that is table-like, they can simply call `Table(my_obj)` and get something that can be used to register in the schema provider or directly with the session context.
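A minimal usage sketch of that single entry point, assuming the coercion behavior described above:

```python
from datafusion import SessionContext, Table

ctx = SessionContext()
df = ctx.sql("SELECT 1 AS value")

# Any table-like object goes through the same constructor; the Rust side
# decides how to coerce what it was handed.
table = Table(df)

# The result registers like any other table.
ctx.register_table("my_table", table)
print(ctx.sql("SELECT value FROM my_table").collect())
```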
Are there any user-facing changes?
We have deprecated one method, `register_table_provider`.
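Migration is a rename at the call site; a sketch, assuming a view-backed provider as in the earlier examples:

```python
from datafusion import SessionContext

ctx = SessionContext()
provider = ctx.sql("SELECT 1 AS value").into_view()

# Before (deprecated; per the notes above this now warns and forwards):
ctx.register_table_provider("old_path", provider)

# After: the unified entry point accepts the same table-like objects.
ctx.register_table("new_path", provider)
```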