Unify Table representations #1256
Conversation
Migrates the codebase from using `Table` to a `TableProvider`-based API, refactors registration and access paths to simplify catalog/context interactions, and updates documentation, tests, and examples. DataFrame view handling is improved (`into_view` is now public), the test suite is expanded to cover new registration and async SQL scenarios, and `TableProvider` now supports the `Send` trait across modules for safer concurrency. Minor import cleanup and utility adjustments (including a refined `pyany_to_table_provider`) are included.
DataFrame→TableProvider conversion, plus tests and FFI/PyCapsule improvements.

Registration logic & API
* Refactor of table provider registration logic for improved clarity and simpler call sites.
* Remove PyTableProvider registration from an internal module (reduces surprising side effects).
* Update the table registration method to call `register_table` instead of `register_table_provider`.
* Extend `register_table` to support `TableProviderExportable` so more provider types can be registered uniformly.
* Improve error messages related to registration failures (missing PyCapsule name and DataFrame registration errors).

DataFrame ↔ TableProvider conversions
* Introduce utility functions to simplify table provider conversions and centralize conversion logic.
* Rename `into_view_provider` → `to_view_provider` for clearer intent.
* Fix `from_dataframe` to return the correct type and update `DataFrame.into_view` to import the correct `TableProvider`.
* Remove an obsolete `dataframe_into_view` test case after the refactor.

FFI / PyCapsule handling
* Update `FFI_TableProvider` initialization to accept an optional parameter (improves FFI ergonomics).
* Introduce a `table_provider_from_pycapsule` utility to standardize PyCapsule-based construction.
* Improve the error message when a PyCapsule name is missing to help debugging.

DeltaTable & specific integrations
* Update TableProvider registration for `DeltaTable` to use the correct registration method (matches the new API surface).

Tests, docs & minor fixes
* Add tests for registering a `TableProvider` from a `DataFrame` and from a capsule to ensure conversion paths are covered.
* Fix a typo in the `register_view` docstring and another in the error message for the unsupported volatility type.
* Simplify version retrieval by removing exception handling around `PackageNotFoundError` (streamlines the code path).
* Removed unused helpers (`extract_table_provider`, `_wrap`) and dead code to simplify maintenance.
* Consolidated and streamlined table-provider extraction and registration logic; improved error handling and replaced a hardcoded error message with `EXPECTED_PROVIDER_MSG`.
* Marked `from_view` as deprecated; updated the deprecation message formatting and adjusted the warning `stacklevel` so it points to caller code.
* Removed the `Send` marker from TableProvider trait objects to increase type flexibility; review threading assumptions.
* Added type hints to the `register_schema` and `deregister_table` methods.
* Adjusted tests and exceptions (e.g., changed one test to expect `RuntimeError`) and updated test coverage accordingly.
* Introduced a refactored `TableProvider` class and enhanced Python integration by adding support for extracting `PyDataFrame` in `PySchema`.

Notes:
* Consumers should migrate away from `TableProvider::from_view` to the new TableProvider integration.
* Audit any code relying on `Send` for trait objects passed across threads.
* Update downstream tests and documentation to reflect the changed exception types and deprecation.
Utilities, docs, and robustness fixes
* Normalized table-provider handling and simplified the registration flow across the codebase; multiple commits centralize provider coercion and normalization.
* Introduced utility helpers (`coerce_table_provider`, `extract_table_provider`, `_normalize_table_provider`) to centralize extraction and error handling and improve clarity.
* Simplified `from_dataframe` / `into_view` behavior: clearer implementations, direct returns of DataFrame views where appropriate, and added internal tests for DataFrame flows.
* Fixed DataFrame registration semantics: enforce `TypeError` for invalid registrations; added handling for `DataFrameWrapper` by converting it to a view.
* Added tests, including a schema registration test using a PyArrow dataset and internal DataFrame tests to cover the new flows.
* Documentation improvements: expanded `from_dataframe` docstrings with parameter details, added usage examples for `into_view`, and documented deprecations (e.g., `register_table_provider` → `register_table`).
* Warning and UX fixes: synchronized the deprecation `stacklevel` so warnings point to caller code; improved `__dir__` to return sorted, unique attributes.
* Cleanup: removed unused imports (including an unused error import from `utils.rs`) and other dead code to reduce noise.
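Taken together, the DataFrame-view changes look roughly like the following sketch, assuming the now-public `into_view` and the unified `register_table` entry point described above:

```python
from datafusion import SessionContext

ctx = SessionContext()
df = ctx.sql("SELECT 1 AS value")

# into_view is now public: wrap the DataFrame as a view-backed table.
view = df.into_view()

# Register it through the single register_table entry point; the deprecated
# register_table_provider path forwards to the same logic.
ctx.register_table("df_view", view)
print(ctx.sql("SELECT value FROM df_view").collect())
```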
@kosiew Would you mind taking a look at this? I started from your PR and made a few adjustments to try to make it as ergonomic as possible for end users. My goal is that a consumer of this project does not need to know much about the internals of the table types as long as they have something that is table-like. With this change, most of the type coercion is pushed into a single place on the Rust side. I think it also makes it nice not to have multiple entry points such as `register_table` and `register_table_provider`. What do you think?
Centralizing the coercion logic in PyTable definitely improves ergonomics; callers can now hand almost anything “table-like” to Rust and let it figure things out, which is great. Restoring the capsule hook (even if it just forwards to the inner provider) keeps those advanced scenarios working without forcing ordinary users to think about internals.
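A minimal sketch of that forwarding, assuming `Table` stores its wrapped provider in an attribute (`_inner` is an illustrative name, not the actual field):

```python
class Table:
    def __init__(self, inner: object) -> None:
        self._inner = inner  # illustrative name for the wrapped provider

    def __datafusion_table_provider__(self) -> object:
        # Forward the capsule request to the inner provider so advanced
        # integrations that unwrap tables via the PyCapsule keep working.
        return self._inner.__datafusion_table_provider__()
```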
Can you expand on this? I don't understand the use case. I did run the integration tests in our repo without issue. I removed some of the tests you had added that were looking for `__datafusion_table_provider__`.
here's an example that shows the NotImplementedError (`examples/table_capsule_failure.py`):

```python
"""Demonstrate how missing __datafusion_table_provider__ breaks table UDTFs.

Run with ``python examples/table_capsule_failure.py``.

This example mirrors how advanced integrations unwrap ``Table`` instances via
the ``__datafusion_table_provider__`` PyCapsule. The refactor that removed this
method means user-defined table functions returning ``Table`` now raise
``NotImplementedError``, and the script prints the resulting error message
instead of crashing.
"""

from __future__ import annotations

from datafusion import SessionContext, Table, udtf


def main() -> None:
    """Register a Python table UDTF that returns a ``Table`` and trigger it."""
    ctx = SessionContext()
    failing_table = Table(ctx.sql("SELECT 1 AS value"))

    @udtf("capsule_dependent")
    def capsule_dependent_udtf() -> Table:
        """Return a ``Table`` so DataFusion unwraps it via the FFI capsule."""
        # Prior to the refactor the wrapper exposed
        # ``__datafusion_table_provider__`` so this conversion succeeded.
        # Without it the runtime raises a ``NotImplementedError`` complaining
        # about the missing attribute.
        return failing_table

    ctx.register_udtf(capsule_dependent_udtf)

    # Executing the UDTF now fails because ``Table`` no longer exposes the
    # ``__datafusion_table_provider__`` helper that PyTableFunction expects.
    try:
        ctx.sql("SELECT * FROM capsule_dependent()").collect()
        print("capsule_dependent() works")
    except NotImplementedError as err:
        # Document the regression by surfacing the missing capsule attribute
        # instead of crashing with a panic inside the execution engine.
        print(
            "capsule_dependent() failed due to missing "
            f"__datafusion_table_provider__: {err}"
        )


if __name__ == "__main__":
    main()
```

On my computer, I got this traceback:
I was able to reproduce the problem, but I think we can lean on the `__datafusion_table_provider__` hook to handle this.
Source: #1245. I tested with `SessionContext.read_table` and it chokes on raw PyCapsule objects.

```python
from __future__ import annotations

import ctypes

from datafusion import SessionContext, Table

# Keep the backing memory alive for the lifetime of the module so the capsule
# always wraps a valid (non-null) pointer. The capsule content is irrelevant
# for this regression example; we only need a non-null address.
_DUMMY_CAPSULE_BYTES = ctypes.create_string_buffer(b"x")


def make_table_provider_capsule() -> object:
    """Create a dummy PyCapsule with the expected table provider name."""
    pycapsule_new = ctypes.pythonapi.PyCapsule_New
    pycapsule_new.restype = ctypes.py_object
    pycapsule_new.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
    dummy_ptr = ctypes.cast(_DUMMY_CAPSULE_BYTES, ctypes.c_void_p)
    return pycapsule_new(dummy_ptr, b"datafusion_table_provider", None)


def main() -> None:
    """Attempt to use the capsule the same way existing callers do."""
    ctx = SessionContext()
    try:
        capsule = make_table_provider_capsule()
    except Exception as err:
        print("Creating the PyCapsule failed:", err)
        return
    ctx.read_table(capsule)


if __name__ == "__main__":
    main()
```

Running it raises this traceback:
I don't think that example demonstrates how users are expected to provide PyCapsule-based providers. I changed it slightly below to fit `TableProviderExportable`. With this change it does segfault, but I expect that, because it is not a valid object.
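A minimal sketch of that adjustment, reusing `make_table_provider_capsule` from the example above; `CapsuleWrapper` is a hypothetical name for an object that fits `TableProviderExportable` by exposing `__datafusion_table_provider__`:

```python
class CapsuleWrapper:
    """Hypothetical wrapper that fits the TableProviderExportable shape."""

    def __init__(self, capsule: object) -> None:
        self._capsule = capsule

    def __datafusion_table_provider__(self) -> object:
        # Hand the stored capsule to DataFusion when it asks for the provider.
        return self._capsule


ctx = SessionContext()
# As noted above, this still segfaults: the dummy capsule does not wrap a
# valid FFI provider. The wrapper only fixes the entry-point shape.
ctx.read_table(CapsuleWrapper(make_table_provider_capsule()))
```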
We still don't have an example/test to show that SessionContext.read_table can take a PyCapsule object (#1245). Can you resolve the conflicts and also leave #1245 open to move the PR forward?
```python
result = ctx.read_table(table).collect()
result = [r.column(0) for r in result]
assert result == expected
```
@kosiew Here is the unit test for read_table with a PyCapsule-based table provider. I removed the statement that this would close #1245 per your request, but I do think it has the appropriate unit test.
Thank you @kosiew for the collaboration on this! I think the final result is a really nice step forward!
Which issue does this PR close?
Closes #1239
Closes #1245
Rationale for this change
This is built on top of #1243.
With this change we have a single class and a single entry point for turning any table-like object into a `Table` class in Python. On the Rust side, we use this one struct for things like registering with the schema provider or session context.

What changes are included in this PR?
Pushes the Python object type evaluation into the Rust side with a constructor for the `PyTable` struct that handles:

* `PyTable` objects
* `DataFrame` instances, which are converted to views
* objects that export a table provider via the `__datafusion_table_provider__` PyCapsule

This removes the cognitive load on end users of having to understand the differences between any of these. If they have something that is table-like, they can simply call `Table(my_obj)` and get something that can be used to register in the schema provider or directly with the session context.
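A minimal usage sketch of that single entry point, assuming the coercion behavior described above:

```python
from datafusion import SessionContext, Table

ctx = SessionContext()
df = ctx.sql("SELECT 1 AS value")

# Any table-like object goes through the same constructor; the Rust side
# decides how to coerce what it was handed.
table = Table(df)

# The result registers like any other table.
ctx.register_table("my_table", table)
print(ctx.sql("SELECT value FROM my_table").collect())
```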
Are there any user-facing changes?
We have deprecated one method, `register_table_provider`.
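Migration is a rename at the call site; a sketch, assuming a view-backed provider as in the earlier examples:

```python
from datafusion import SessionContext

ctx = SessionContext()
provider = ctx.sql("SELECT 1 AS value").into_view()

# Before (deprecated; per the notes above this now warns and forwards):
ctx.register_table_provider("old_path", provider)

# After: the unified entry point accepts the same table-like objects.
ctx.register_table("new_path", provider)
```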