
Conversation

timsaucer
Member

@timsaucer timsaucer commented Oct 1, 2025

Which issue does this PR close?

Closes #1239
Closes #1245

Rationale for this change

This is built on top of #1243

With this change we have a single class and a single entry point for turning any table-like object into a `Table` class in Python. On the Rust side, we use this one struct for things like registering with the schema provider or the session context.

What changes are included in this PR?

Pushes the Python object type evaluation into the Rust side with a constructor for the `PyTable` struct that handles:

  • Existing PyTable objects
  • pyarrow datasets
  • DataFrames (as views)
  • External table providers via PyCapsule

This removes the cognitive load on end users of having to understand the differences between any of these. If they have something table-like, they can simply call `Table(my_obj)` and get something that can be registered with the schema provider or directly with the session context.
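The dispatch described above can be sketched as a single coercing constructor. This is an illustrative, stdlib-only model, not the actual implementation (the real coercion happens in the Rust constructor of the `PyTable` struct); the `TableLike` and `FakeDataFrame` names are hypothetical stand-ins that only mirror the dispatch order listed above:

```python
class TableLike:
    """Illustrative model of the single Table(obj) entry point."""

    def __init__(self, obj):
        if isinstance(obj, TableLike):
            # Case 1: an existing table object is reused as-is.
            self.provider = obj.provider
        elif type(obj).__name__ == "Dataset":
            # Case 2: a pyarrow.dataset.Dataset (checked by name here so
            # this sketch does not require pyarrow to be installed).
            self.provider = obj
        elif hasattr(obj, "into_view"):
            # Case 3: a DataFrame-like object is wrapped as a view.
            self.provider = obj.into_view()
        elif hasattr(obj, "__datafusion_table_provider__"):
            # Case 4: an external provider exported via the PyCapsule protocol.
            self.provider = obj.__datafusion_table_provider__()
        else:
            raise TypeError(f"cannot coerce {type(obj).__name__} to a table")


class FakeDataFrame:
    """Hypothetical DataFrame stand-in exposing only into_view()."""

    def into_view(self):
        return "view-provider"
```

The point of the pattern is that callers never branch on the input type themselves; they hand any of the four shapes to the one constructor.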

Are there any user-facing changes?

We have deprecated one method, `register_table_provider`.
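The deprecation pattern used here (also mentioned in the commit notes below: forward to the new method and set `stacklevel` so the warning points at the caller) can be sketched as follows. The `SchemaLike` class is a hypothetical stand-in, not the real datafusion-python class:

```python
import warnings


class SchemaLike:
    """Hypothetical stand-in; only the two registration methods matter."""

    def __init__(self):
        self.tables = {}

    def register_table(self, name, table):
        # New unified entry point.
        self.tables[name] = table

    def register_table_provider(self, name, provider):
        # Deprecated shim: warn, then forward to register_table.
        # stacklevel=2 attributes the warning to the caller's line,
        # not to this shim.
        warnings.warn(
            "register_table_provider is deprecated; use register_table",
            DeprecationWarning,
            stacklevel=2,
        )
        self.register_table(name, provider)
```

Callers of the old name keep working but see a `DeprecationWarning` pointing at their own code.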

kosiew and others added 30 commits September 18, 2025 15:11
docs/tests, add DataFrame view support, and improve Send/concurrency
support.

Migrates the codebase from using `Table` to a
`TableProvider`-based API, refactors registration and access paths to
simplify catalog/context interactions, and updates documentation and
examples. DataFrame view handling is improved (`into_view` is now
public), the test-suite is expanded to cover new registration and async
SQL scenarios, and `TableProvider` now supports the `Send` trait across
modules for safer concurrency. Minor import cleanup and utility
adjustments (including a refined `pyany_to_table_provider`) are
included.
DataFrame→TableProvider conversion, plus tests and FFI/pycapsule
improvements.

-- Registration logic & API

* Refactor of table provider registration logic for improved clarity and
  simpler call sites.
* Remove PyTableProvider registration from an internal module (reduces
  surprising side effects).
* Update table registration method to call `register_table` instead of
  `register_table_provider`.
* Extend `register_table` to support `TableProviderExportable` so more
  provider types can be registered uniformly.
* Improve error messages related to registration failures (missing
  PyCapsule name and DataFrame registration errors).

-- DataFrame ↔ TableProvider conversions

* Introduce utility functions to simplify table provider conversions and
  centralize conversion logic.
* Rename `into_view_provider` → `to_view_provider` for clearer intent.
* Fix `from_dataframe` to return the correct type and update
  `DataFrame.into_view` to import the correct `TableProvider`.
* Remove an obsolete `dataframe_into_view` test case after the refactor.

-- FFI / PyCapsule handling

* Update `FFI_TableProvider` initialization to accept an optional
  parameter (improves FFI ergonomics).
* Introduce `table_provider_from_pycapsule` utility to standardize
  pycapsule-based construction.
* Improve the error message when a PyCapsule name is missing to help
  debugging.

-- DeltaTable & specific integrations

* Update TableProvider registration for `DeltaTable` to use the correct
  registration method (matches the new API surface).

-- Tests, docs & minor fixes

* Add tests for registering a `TableProvider` from a `DataFrame` and
  from a capsule to ensure conversion paths are covered.
* Fix a typo in the `register_view` docstring and another typo in the
  error message for unsupported volatility type.
* Simplify version retrieval by removing exception handling around
  `PackageNotFoundError` (streamlines code path).
* Removed unused helpers (`extract_table_provider`, `_wrap`) and dead code to simplify maintenance.
* Consolidated and streamlined table-provider extraction and registration logic; improved error handling and replaced a hardcoded error message with `EXPECTED_PROVIDER_MSG`.
* Marked `from_view` as deprecated; updated deprecation message formatting and adjusted the warning `stacklevel` so it points to caller code.
* Removed the `Send` marker from TableProvider trait objects to increase type flexibility — review threading assumptions.
* Added type hints to `register_schema` and `deregister_table` methods.
* Adjusted tests and exceptions (e.g., changed one test to expect `RuntimeError`) and updated test coverage accordingly.
* Introduced a refactored `TableProvider` class and enhanced Python integration by adding support for extracting `PyDataFrame` in `PySchema`.

Notes:

* Consumers should migrate away from `TableProvider::from_view` to the new TableProvider integration.
* Audit any code relying on `Send` for trait objects passed across threads.
* Update downstream tests and documentation to reflect the changed exception types and deprecation.
utilities, docs, and robustness fixes

* Normalized table-provider handling and simplified registration flow
  across the codebase; multiple commits centralize provider coercion and
  normalization.
* Introduced utility helpers (`coerce_table_provider`,
  `extract_table_provider`, `_normalize_table_provider`) to centralize
  extraction and error handling and improve clarity.
* Simplified `from_dataframe` / `into_view` behavior: clearer
  implementations, direct returns of DataFrame views where appropriate,
  and added internal tests for DataFrame flows.
* Fixed DataFrame registration semantics: enforce `TypeError` for
  invalid registrations; added handling for `DataFrameWrapper` by
  converting it to a view.
* Added tests, including a schema registration test using a PyArrow
  dataset and internal DataFrame tests to cover new flows.
* Documentation improvements: expanded `from_dataframe` docstrings with
  parameter details, added usage examples for `into_view`, and
  documented deprecations (e.g., `register_table_provider` →
  `register_table`).
* Warning and UX fixes: synchronized deprecation `stacklevel` so
  warnings point to caller code; improved `__dir__` to return sorted,
  unique attributes.
* Cleanup: removed unused imports (including an unused error import from
  `utils.rs`) and other dead code to reduce noise.
@timsaucer timsaucer added the labels enhancement (New feature or request), python (Pull requests that update Python code), rust (Pull requests that update Rust code), and api change on Oct 1, 2025
@timsaucer
Member Author

@kosiew Would you mind taking a look at this? I started from your PR and made a few adjustments to try to make it as ergonomic as possible for the end users. My goal is that a consumer of this project does not need to know much about the internals of the table types as long as they have something that should be table-like.

With this change, most of the type coercion is pushed into a single place on the Rust side. I think it is also nice not to have multiple entry points such as from_dataset and from_capsule. Any Python object that can be one of our table types is checked in the Rust code.

What do you think?

@kosiew
Contributor

kosiew commented Oct 2, 2025

@timsaucer ,

Centralizing the coercion logic in PyTable definitely improves ergonomics; callers can now hand almost anything “table-like” to Rust and let it figure things out, which is great.
The remaining blocker is that the unified wrapper no longer exposes __datafusion_table_provider__, so anything that depends on that FFI capsule now fails. PyTableFunction.__call__ still converts Python-returned objects by invoking table_provider_from_pycapsule, which requires __datafusion_table_provider__ to be present; without it, Python table UDTFs raise PyNotImplementedError, and external integrations that unwrap tables via the capsule break as well.

Restoring the capsule hook (even if it just forwards to the inner provider) keeps those advanced scenarios working without forcing ordinary users to think about internals.
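The forwarding fix suggested here can be sketched as a thin wrapper whose `__datafusion_table_provider__` simply delegates to the inner provider. This is a hedged, stdlib-only illustration; `UnifiedTable` and `ExternalProvider` are hypothetical stand-ins, and a real provider returns a PyCapsule rather than a string:

```python
class ExternalProvider:
    """Hypothetical stand-in for a provider exporting the FFI capsule."""

    def __datafusion_table_provider__(self):
        # A real implementation returns a PyCapsule named
        # "datafusion_table_provider"; a sentinel value is enough
        # to demonstrate the forwarding.
        return "capsule"


class UnifiedTable:
    """Unified wrapper that restores the capsule hook by forwarding."""

    def __init__(self, inner):
        self._inner = inner

    def __datafusion_table_provider__(self):
        # Forward to the inner provider so UDTFs and external
        # integrations that unwrap tables via the capsule keep working.
        return self._inner.__datafusion_table_provider__()
```

With such a hook in place, code that calls `obj.__datafusion_table_provider__()` on the wrapper behaves exactly as it did against the inner provider.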

@timsaucer
Member Author

The remaining blocker is that the unified wrapper no longer exposes __datafusion_table_provider__, so anything that depends on that FFI capsule now fails. PyTableFunction.__call__ still converts Python-returned objects by invoking table_provider_from_pycapsule, which requires __datafusion_table_provider__ to be present; without it, Python table UDTFs raise PyNotImplementedError, and external integrations that unwrap tables via the capsule break as well.

Can you expand on this? I don't understand the use case. I did run the integration tests in our repo without issue. I removed some of the tests you had added that were looking for __datafusion_table_provider__ because I couldn't understand their utility.

@timsaucer timsaucer marked this pull request as ready for review October 2, 2025 22:29
@kosiew
Contributor

kosiew commented Oct 3, 2025

@timsaucer,

here's an example that shows the NotImplementedError

examples/table_capsule_failure.py

"""Demonstrate how missing __datafusion_table_provider__ breaks table UDTFs.

Run with ``python examples/table_capsule_failure.py``.

This example mirrors how advanced integrations unwrap ``Table`` instances via the
``__datafusion_table_provider__`` PyCapsule. The refactor that removed this method
means user-defined table functions returning ``Table`` now raise ``NotImplementedError``
and the script prints the resulting error message instead of crashing.
"""

from __future__ import annotations

from datafusion import SessionContext, Table, udtf


def main() -> None:
    """Register a Python table UDTF that returns a ``Table`` and trigger it."""

    ctx = SessionContext()
    failing_table = Table(ctx.sql("SELECT 1 AS value"))

    @udtf("capsule_dependent")
    def capsule_dependent_udtf() -> Table:
        """Return a ``Table`` so DataFusion unwraps it via the FFI capsule."""

        # Prior to the refactor the wrapper exposed ``__datafusion_table_provider__``
        # so this conversion succeeded. Without it the runtime raises a
        # ``NotImplementedError`` complaining about the missing attribute.
        return failing_table

    ctx.register_udtf(capsule_dependent_udtf)

    # Executing the UDTF now fails because ``Table`` no longer exposes the
    # ``__datafusion_table_provider__`` helper that PyTableFunction expects.
    try:
        ctx.sql("SELECT * FROM capsule_dependent()").collect()
        print("capsule_dependent() works")
    except NotImplementedError as err:
        # Document the regression by surfacing the missing capsule attribute
        # instead of crashing with a panic inside the execution engine.
        print(
            "capsule_dependent() failed due to missing __datafusion_table_provider__: "
            f"{err}"
        )


if __name__ == "__main__":
    main()

On my computer, I got this traceback:

  File "/Users/kosiew/GitHub/datafusion-python/examples/table_capsule_failure.py", line 48, in <module>
    main()
    ~~~~^^
  File "/Users/kosiew/GitHub/datafusion-python/examples/table_capsule_failure.py", line 36, in main
    ctx.sql("SELECT * FROM capsule_dependent()").collect()
    ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kosiew/GitHub/datafusion-python/python/datafusion/context.py", line 611, in sql
    return DataFrame(self.ctx.sql(query))
                     ~~~~~~~~~~~~^^^^^^^
Exception: DataFusion error: Execution("PyErr { type: <class 'NotImplementedError'>, value: NotImplementedError('__datafusion_table_provider__ does not exist on Table Provider object.'), traceback: None }")

@timsaucer
Member Author

I was able to reproduce the problem, but I think we can lean on the PyTable to handle these cases as well. I have pushed a unit test that fails without my addition and passes with my addition in the last commit. Do you still have concerns?

@kosiew
Contributor

kosiew commented Oct 6, 2025

If I have a Python object that implements TableProvider via PyCapsule, I should be able to pass this object directly to SessionContext.read_table

Source: #1245

I tested with SessionContext.read_table and it chokes on raw PyCapsule objects.

from __future__ import annotations

import ctypes

from datafusion import SessionContext, Table


# Keep the backing memory alive for the lifetime of the module so the capsule
# always wraps a valid (non-null) pointer. The capsule content is irrelevant for
# this regression example—we only need a non-null address.
_DUMMY_CAPSULE_BYTES = ctypes.create_string_buffer(b"x")


def make_table_provider_capsule() -> object:
    """Create a dummy PyCapsule with the expected table provider name."""

    pycapsule_new = ctypes.pythonapi.PyCapsule_New
    pycapsule_new.restype = ctypes.py_object
    pycapsule_new.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
    dummy_ptr = ctypes.cast(_DUMMY_CAPSULE_BYTES, ctypes.c_void_p)
    return pycapsule_new(dummy_ptr, b"datafusion_table_provider", None)


def main() -> None:
    """Attempt to use the capsule the same way existing callers do."""

    ctx = SessionContext()
    try:
        capsule = make_table_provider_capsule()
    except Exception as err:
        print("Creating the PyCapsule failed:", err)
        return


    ctx.read_table(capsule)

if __name__ == "__main__":
    main()

raises this Traceback:

  File "/Users/kosiew/GitHub/datafusion-python/examples/raw_capsule_registration_failure.py", line 49, in <module>
    main()
    ~~~~^^
  File "/Users/kosiew/GitHub/datafusion-python/examples/raw_capsule_registration_failure.py", line 45, in main
    ctx.read_table(capsule)
    ~~~~~~~~~~~~~~^^^^^^^^^
  File "/Users/kosiew/GitHub/datafusion-python/python/datafusion/context.py", line 1184, in read_table
    return DataFrame(self.ctx.read_table(table))
                     ~~~~~~~~~~~~~~~~~~~^^^^^^^
ValueError: dataset argument must be a pyarrow.dataset.Dataset object

@timsaucer
Member Author

I don't think that example demonstrates how users are expected to provide PyCapsule-based providers. I changed it slightly below to fit `TableProviderExportable`. With this change it does segfault, but that is expected because it is not a valid provider object.

from __future__ import annotations

import ctypes

from datafusion import SessionContext, Table


# Keep the backing memory alive for the lifetime of the module so the capsule
# always wraps a valid (non-null) pointer. The capsule content is irrelevant for
# this regression example—we only need a non-null address.
_DUMMY_CAPSULE_BYTES = ctypes.create_string_buffer(b"x")

class CapsuleContainer:
    def __init__(self):
        # The protocol expects a callable attribute that returns the capsule.
        self.__datafusion_table_provider__ = make_table_provider_capsule

def make_table_provider_capsule() -> object:
    """Create a dummy PyCapsule with the expected table provider name."""

    pycapsule_new = ctypes.pythonapi.PyCapsule_New
    pycapsule_new.restype = ctypes.py_object
    pycapsule_new.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
    dummy_ptr = ctypes.cast(_DUMMY_CAPSULE_BYTES, ctypes.c_void_p)
    return pycapsule_new(dummy_ptr, b"datafusion_table_provider", None)


def main() -> None:
    """Attempt to use the capsule the same way existing callers do."""

    ctx = SessionContext()
    try:
        capsule = CapsuleContainer()
    except Exception as err:
        print("Creating the PyCapsule failed:", err)
        return


    ctx.read_table(capsule)

if __name__ == "__main__":
    main()

@kosiew
Contributor

kosiew commented Oct 10, 2025

@timsaucer

I changed it slightly below to fit `TableProviderExportable`. With this change it does segfault, but that is expected because it is not a valid provider object.

We still don't have an example/test to show that SessionContext.read_table can take a PyCapsule object (#1245).

Can you resolve the conflicts and also leave #1245 open to move the PR forward?

Comment on lines +44 to +46
result = ctx.read_table(table).collect()
result = [r.column(0) for r in result]
assert result == expected
Member Author


@kosiew Here is the unit test for read_table with a PyCapsule based table provider.

@timsaucer
Member Author

I removed the statement this would close #1245 per your request, but I do think it has the appropriate unit test.

Contributor

@kosiew kosiew left a comment


I removed the statement this would close #1245 per your request, but I do think it has the appropriate unit test.

You're right.
This does close #1245 too.

@timsaucer
Member Author

Thank you @kosiew for the collaboration on this! I think the final result is a really nice step forward!

@timsaucer timsaucer merged commit 6f3b1ca into apache:main Oct 11, 2025
16 checks passed
Successfully merging this pull request may close these issues.

* SessionContext.read_table should take PyCapsule objects (#1245)
* A single common PyTableProvider that can be created either via a pycapsule or into_view (#1239)
