feat(python): introduce pypaimon core and DataFusion catalog integration#204

Merged
JingsongLi merged 7 commits into apache:main from luoyuxia:introduce-pypaimon on Apr 6, 2026

Conversation

Contributor

@luoyuxia commented Apr 4, 2026

Purpose

Linked issue: close #189

Subtask of #173

Introduce the Rust-powered core for pypaimon and provide Python-side DataFusion integration for querying Paimon tables.

This change adds a new bindings/python package, exposes a PaimonCatalog that can be registered into a native DataFusion SessionContext, and updates the Python API/docs to focus on catalog-based integration for simple table queries.

It also includes a runtime fallback on the Rust DataFusion integration side to handle DataFusion FFI callbacks that may run without an entered Tokio runtime. This works around the current gap discussed in apache/datafusion#16312.
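
As a rough illustration of the fallback idea only, the following Python sketch uses asyncio as an analogue. It is not the Rust code in this PR, which relies on Tokio. It assumes a synchronous callback that is invoked on a thread with no running event loop, and it reuses one lazily started loop instead of creating a new one per call. The names fallback_loop and block_on are hypothetical.

```python
import asyncio
import threading
from typing import Any, Coroutine, Optional, TypeVar

T = TypeVar("T")

_lock = threading.Lock()
_fallback_loop: Optional[asyncio.AbstractEventLoop] = None


def fallback_loop() -> asyncio.AbstractEventLoop:
    """Return a process-wide event loop, started once on a daemon thread.

    Hypothetical helper: a Python stand-in for a lazily initialized
    global runtime.
    """
    global _fallback_loop
    with _lock:
        if _fallback_loop is None:
            loop = asyncio.new_event_loop()
            # The loop runs forever on its own thread and serves all callers.
            threading.Thread(target=loop.run_forever, daemon=True).start()
            _fallback_loop = loop
        return _fallback_loop


def block_on(coro: Coroutine[Any, Any, T]) -> T:
    """Run `coro` to completion from synchronous code on the shared loop.

    Hypothetical helper: the caller has no running loop of its own, so the
    coroutine is handed to the fallback loop and the caller blocks on it.
    """
    future = asyncio.run_coroutine_threadsafe(coro, fallback_loop())
    return future.result()
```

block_on is meant for synchronous callers only. Calling it from a coroutine that is already running on the fallback loop would block that loop and deadlock.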

Brief change log

  • add the new bindings/python package with maturin-based build configuration
  • expose the Rust-powered core for pypaimon
  • add Python DataFusion integration through PaimonCatalog
  • align the Python API with native DataFusion catalog registration via ctx.register_catalog_provider("paimon", catalog)
  • simplify the Python-side test to cover normal table queries only and drop time-travel coverage for now
  • update Python README and package description to describe the Rust-powered core and DataFusion integration
  • add Python typing support for DataFusion catalog export
  • add datafusion as a Python dev dependency for local testing
  • add a runtime fallback in paimon-datafusion for catalog and scan-planning FFI paths when DataFusion invokes callbacks without a Tokio runtime
  • centralize the runtime fallback helpers and document that they are a temporary workaround until the upstream FFI runtime propagation issue is resolved

Tests

  • cargo check -p paimon-datafusion
  • uv run --no-sync maturin develop
  • uv run --no-sync pytest tests/test_datafusion.py -q

API and Format

  • adds a new Python package under bindings/python
  • exposes PaimonCatalog for DataFusion integration in Python
  • Python users register the catalog with native DataFusion SessionContext
  • time travel is not exposed through the current Python DataFusion catalog integration
  • no storage format change
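
A minimal sketch of the registration flow listed above, written against a stand-in context so that it runs without datafusion or the new package installed. Only register_catalog_provider comes from this PR. The names register_paimon_catalog, qualified_table, and RecordingContext are hypothetical, and the three-part catalog.schema.table naming is DataFusion's usual convention rather than something stated in this PR.

```python
from typing import Any, Protocol


class SupportsCatalogRegistration(Protocol):
    """Structural type for the one context method this sketch relies on."""

    def register_catalog_provider(self, name: str, provider: Any) -> None: ...


def register_paimon_catalog(
    ctx: SupportsCatalogRegistration, catalog: Any, name: str = "paimon"
) -> str:
    """Register `catalog` under `name` and return the name used.

    Hypothetical helper, not part of the package introduced by this PR.
    """
    ctx.register_catalog_provider(name, catalog)
    return name


def qualified_table(catalog_name: str, database: str, table: str) -> str:
    """Build a three-part table name of the form catalog.schema.table."""
    return f"{catalog_name}.{database}.{table}"


class RecordingContext:
    """Stand-in for a DataFusion SessionContext (assumption for this sketch)."""

    def __init__(self) -> None:
        self.catalogs = {}

    def register_catalog_provider(self, name: str, provider: Any) -> None:
        # Record the provider so the registration can be inspected later.
        self.catalogs[name] = provider
```

With the real packages, ctx would be a datafusion SessionContext and catalog a PaimonCatalog instance. How PaimonCatalog is constructed is not shown in this conversation, so it is left out here.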

Documentation

  • add bindings/python/README.md
  • add bindings/python/project-description.md
  • document that this project builds the Rust-powered core for pypaimon
  • document DataFusion integration for querying Paimon tables from Python

@luoyuxia force-pushed the introduce-pypaimon branch from c8ca0a6 to 1a39c6e April 4, 2026 02:41
@luoyuxia changed the title from "WIP: feat: introduce python binding" to "WIP: feat: introduce python binding to expose datadusion" Apr 4, 2026
@luoyuxia marked this pull request as draft April 4, 2026 02:42
@luoyuxia changed the title from "WIP: feat: introduce python binding to expose datadusion" to "feat: introduce python binding to expose datadusion" Apr 4, 2026
@luoyuxia force-pushed the introduce-pypaimon branch 3 times, most recently from 2fda549 to fdc9cfb April 4, 2026 11:11
@luoyuxia changed the title from "feat: introduce python binding to expose datadusion" to "feat(python): introduce pypaimon core and DataFusion catalog integration" Apr 4, 2026
```python
ctx.register_catalog_provider("paimon", catalog)
```

Time travel queries are not supported in the Python binding at this time.

Contributor Author

Why time travel is not supported in this PR

  • This Python integration intentionally uses native DataFusion SessionContext plus CatalogProvider.
  • That path works well for normal table queries, but time travel requires planner-level extension rather than simple catalog/table
    registration.
  • DataFusion Python/FFI does not currently expose a suitable planner extension point for this, so we cannot cleanly wire time-travel semantics
    into native SessionContext here.

What would be needed to support time travel

  • Either introduce a custom context that owns the planner extension,
  • or wait for / contribute planner-level registration support in DataFusion Python/FFI,
  • or expose time travel through a separate explicit API instead of native SQL/catalog resolution.

Why not use a custom context now

  • The goal of this PR is to stay aligned with native DataFusion usage and keep the Python integration lightweight and predictable.
  • If we introduce a custom context and still want a user experience close to native SessionContext, we would need to re-expose a large part of
    the SessionContext API surface there, which is relatively heavy to implement and maintain.
  • It would also introduce another API model and make the overall Python experience less consistent with native DataFusion.
  • If time travel becomes a strong requirement, we can revisit it and design dedicated support separately.

@luoyuxia force-pushed the introduce-pypaimon branch 3 times, most recently from 74b297e to a96e040 April 4, 2026 11:49
@luoyuxia marked this pull request as ready for review April 4, 2026 11:50
@luoyuxia force-pushed the introduce-pypaimon branch 2 times, most recently from 68dc25c to 3b1ff8c April 4, 2026 11:54
@luoyuxia closed this Apr 4, 2026
@luoyuxia reopened this Apr 4, 2026
Comment thread .github/workflows/ci.yml Outdated
RUST_BACKTRACE: full

- name: Install uv
  uses: astral-sh/setup-uv@v6

Contributor

ASF cannot run the action referenced by this uses entry.

Contributor Author

Thanks. I have updated it.

@luoyuxia force-pushed the introduce-pypaimon branch 3 times, most recently from 505c877 to 59a9ac8 April 5, 2026 10:01
@luoyuxia marked this pull request as draft April 5, 2026 10:18
@luoyuxia force-pushed the introduce-pypaimon branch 2 times, most recently from 621c2fe to e70e4c1 April 5, 2026 11:22
@luoyuxia marked this pull request as ready for review April 5, 2026 11:38
@luoyuxia requested a review from JingsongLi April 5, 2026 11:45
Comment thread bindings/python/LICENSE Outdated
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright 2021 Datafuse Labs

Contributor

This is wrong... Also modify the project LICENSE file.

    F: Future + Send + 'static,
    F::Output: Send + 'static,
{
    let run = move || block_on_new_runtime(future, runtime_error);

Contributor

Why not use a global OnceLock<Runtime> to avoid creating a runtime every time?

        TableType::Base
    }

    fn supports_filters_pushdown(

Contributor

Why? Revert this?

Contributor Author

No behavior change here. I didn't remove supports_filters_pushdown(); I only moved it to match the method order in the TableProvider trait, to keep the IDE happy.

// Tokio runtime. `scan.plan()` can reach OpenDAL/Tokio filesystem calls while
// reading Paimon metadata, so we must provide a runtime here instead of
// assuming the caller already entered one.
let plan = await_with_runtime(

Contributor

scan() itself is an asynchronous fn, and the DataFusion FFI uses the runtime handle passed by FFI_CatalogProvider to poll it, so there is no need to do a runtime fallback here.

Contributor Author

I initially thought the same, but it turns out that is not the case. It seems that not all async FFI entry points propagate the runtime. Without the await_with_runtime fallback, scan.plan() would panic because there is no active runtime on the thread.

Comment thread bindings/python/src/runtime.rs Outdated

static RUNTIME: OnceLock<Runtime> = OnceLock::new();

pub fn runtime() -> Handle {

Contributor

Can this just use paimon_datafusion::runtime::runtime?

Comment thread bindings/python/README.md Outdated

```python
from datafusion import SessionContext
from pypaimon_core.datafusion import PaimonCatalog
```

Contributor

@JingsongLi Apr 5, 2026

Maybe use pypaimon_rust.datafusion?

Contributor Author

@luoyuxia Apr 5, 2026

Good point. The current name pypaimon_core was inspired by pyiceberg-core, but given that pypaimon already exists as the pure-Python package, pypaimon_core could be misleading. pypaimon_rust might be more straightforward — it clearly indicates this is the Rust-backed implementation, similar to how cryptography has cryptography-rust. The mixed Python/Rust naming is a bit unusual but arguably more honest about what the package actually is.

@luoyuxia force-pushed the introduce-pypaimon branch from e70e4c1 to de602b6 April 5, 2026 14:18
Contributor

@JingsongLi left a comment

+1

@JingsongLi merged commit e8c98d9 into apache:main Apr 6, 2026
8 checks passed

Development

Successfully merging this pull request may close these issues.

introduce python binding to expose datafusion intergration

2 participants