feat(python): introduce pypaimon core and DataFusion catalog integration #204
JingsongLi merged 7 commits into apache:main
Conversation
```python
ctx.register_catalog_provider("paimon", catalog)
```

Time travel queries are not supported in the Python binding at this time.
Why time travel is not supported in this PR
- This Python integration intentionally uses the native DataFusion `SessionContext` plus `CatalogProvider`.
- That path works well for normal table queries, but time travel requires a planner-level extension rather than simple catalog/table registration.
- DataFusion Python/FFI does not currently expose a suitable planner extension point for this, so we cannot cleanly wire time-travel semantics into the native `SessionContext` here.
What would be needed to support time travel
- Either introduce a custom context that owns the planner extension,
- or wait for / contribute planner-level registration support in DataFusion Python/FFI,
- or expose time travel through a separate explicit API instead of native SQL/catalog resolution.
Why not use a custom context now
- The goal of this PR is to stay aligned with native DataFusion usage and keep the Python integration lightweight and predictable.
- If we introduced a custom context and still wanted a user experience close to the native `SessionContext`, we would need to re-expose a large part of the `SessionContext` API surface, which is relatively heavy to implement and maintain.
- It would also introduce another API model and make the overall Python experience less consistent with native DataFusion.
- If time travel becomes a strong requirement, we can revisit it and design dedicated support separately.
```yaml
RUST_BACKTRACE: full
```

```yaml
- name: Install uv
  uses: astral-sh/setup-uv@v6
```
ASF cannot execute this `uses` action.
Thanks. I have updated.
```
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright 2021 Datafuse Labs
```
This is wrong... Please also update the project LICENSE file.
```rust
    F: Future + Send + 'static,
    F::Output: Send + 'static,
{
    let run = move || block_on_new_runtime(future, runtime_error);
```
Why not use a global `OnceLock<Runtime>` to avoid creating a runtime every time?
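The pattern being suggested can be sketched with std's `OnceLock` alone; the cached `String` below is a stand-in for the actual `tokio::runtime::Runtime` (an assumption of this sketch, omitted to keep it dependency-free):

```rust
use std::sync::OnceLock;

// Global cache: `OnceLock` initializes the value on first use and then
// hands out the same instance forever, so nothing is rebuilt per call.
// In the real code this would hold a `tokio::runtime::Runtime` (assumption).
static RUNTIME: OnceLock<String> = OnceLock::new();

fn runtime() -> &'static String {
    // The closure runs at most once, even under concurrent first calls;
    // every later call returns the already-initialized value.
    RUNTIME.get_or_init(|| String::from("shared-runtime"))
}

fn main() {
    let first = runtime();
    let second = runtime();
    // Same instance both times: no per-call creation cost.
    assert!(std::ptr::eq(first, second));
}
```

With a real Tokio runtime, callers would typically clone a `Handle` from the cached `Runtime` on each FFI entry, which is cheap compared to building a runtime per call.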
```rust
    TableType::Base
}

fn supports_filters_pushdown(
```
No behavior change here: I didn't remove `supports_filters_pushdown()`, I only moved it to match the method order in the `TableProvider` trait, to keep the IDE happy.
```rust
// Tokio runtime. `scan.plan()` can reach OpenDAL/Tokio filesystem calls while
// reading Paimon metadata, so we must provide a runtime here instead of
// assuming the caller already entered one.
let plan = await_with_runtime(
```
`scan()` itself is an async fn, and the DataFusion FFI uses the runtime handle passed by `FFI_CatalogProvider` to poll it, so there should be no need for a runtime fallback here.
I initially thought the same, but it turns out not all async FFI entry points propagate the runtime. Without the `await_with_runtime` fallback, `scan.plan()` would panic at runtime with "no active runtime" on the thread.
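The fallback being discussed can be illustrated with a std-only control-flow sketch; the thread-local flag below is a stand-in for tokio's entered-runtime state (in the real code the check would be `tokio::runtime::Handle::try_current()`, with a shared global runtime as the fallback):

```rust
use std::cell::Cell;

thread_local! {
    // Stand-in for "this thread has entered a Tokio runtime".
    static IN_RUNTIME: Cell<bool> = Cell::new(false);
}

/// Sketch of the `await_with_runtime` decision: prefer the caller's runtime
/// when one is active on the thread, otherwise fall back to a shared global
/// runtime so FFI callbacks never panic with "no active runtime".
fn await_with_runtime_sketch() -> &'static str {
    if IN_RUNTIME.with(|f| f.get()) {
        "caller runtime" // Handle::try_current() would succeed here
    } else {
        "global fallback" // block on the shared OnceLock<Runtime> instead
    }
}

fn main() {
    // An FFI callback arriving on a plain thread: no entered runtime.
    assert_eq!(await_with_runtime_sketch(), "global fallback");

    // A caller that has entered a runtime before calling in.
    IN_RUNTIME.with(|f| f.set(true));
    assert_eq!(await_with_runtime_sketch(), "caller runtime");
}
```

The point of the fallback is that both branches succeed; only the branch taken differs depending on how the FFI callback arrived.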
```rust
static RUNTIME: OnceLock<Runtime> = OnceLock::new();

pub fn runtime() -> Handle {
```
Can this just use `paimon_datafusion::runtime::runtime`?
```python
from datafusion import SessionContext
from pypaimon_core.datafusion import PaimonCatalog
```
Maybe use `pypaimon_rust.datafusion`?
Good point. The current name `pypaimon_core` was inspired by `pyiceberg-core`, but given that `pypaimon` already exists as the pure-Python package, `pypaimon_core` could be misleading. `pypaimon_rust` might be more straightforward: it clearly indicates this is the Rust-backed implementation, similar to how `cryptography` has `cryptography-rust`. The mixed Python/Rust naming is a bit unusual but arguably more honest about what the package actually is.
Purpose
Linked issue: close #189
Subtask of #173
Introduce the Rust-powered core for `pypaimon` and provide Python-side DataFusion integration for querying Paimon tables.

This change adds a new `bindings/python` package, exposes a `PaimonCatalog` that can be registered into a native DataFusion `SessionContext`, and updates the Python API/docs to focus on catalog-based integration for simple table queries.

It also includes a runtime fallback on the Rust DataFusion integration side to handle DataFusion FFI callbacks that may run without an entered Tokio runtime. This works around the current gap discussed in apache/datafusion#16312.
Brief change log
- Add a `bindings/python` package with a `maturin`-based build configuration
- Expose `PaimonCatalog` from `pypaimon`, registered via `ctx.register_catalog_provider("paimon", catalog)`
- Add `datafusion` as a Python dev dependency for local testing
- Add a runtime fallback in `paimon-datafusion` for catalog and scan-planning FFI paths when DataFusion invokes callbacks without a Tokio runtime

Tests
- `cargo check -p paimon-datafusion`
- `uv run --no-sync maturin develop`
- `uv run --no-sync pytest tests/test_datafusion.py -q`

API and Format
- New `bindings/python` package
- New `PaimonCatalog` for DataFusion integration in Python, registered into a native `SessionContext`

Documentation
- `bindings/python/README.md`
- `bindings/python/project-description.md`
- `pypaimon` documentation