[Enhancement] Enable Distributed Computing Support by Making Core Objects Serializable

Currently, paimon-python uses Py4J to interact with Java Paimon, so core objects such as Split and Predicate hold non-serializable Py4J properties. As a result, these objects cannot be serialized with standard Python mechanisms (e.g., pickle), which makes them incompatible with distributed computing frameworks that ship objects between processes.
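
A minimal sketch of the failure mode, runnable without a JVM. `FakeJavaHandle` and `HandleBackedSplit` are hypothetical stand-ins and not paimon-python classes; a `threading.Lock` plays the role of the process-bound resource that a Py4J proxy carries, on the assumption that the real proxy fails to pickle for a similar reason.

```python
import pickle
import threading


class FakeJavaHandle:
    """Hypothetical stand-in for a Py4J proxy tied to the current process."""

    def __init__(self):
        # A lock cannot be pickled, much like a live gateway connection.
        self._connection = threading.Lock()


class HandleBackedSplit:
    """Hypothetical split that stores the live handle directly."""

    def __init__(self, j_split):
        self.j_split = j_split


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


split_is_picklable = can_pickle(HandleBackedSplit(FakeJavaHandle()))
print(split_is_picklable)
```

Any framework that pickles task arguments to send them to workers would hit the same error as soon as such an object is passed to a remote task.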

This is a critical limitation because distributed computing is a core use case for data lake formats like Paimon. Without distributed computing support, paimon-python's practical utility is severely limited.

The proposed solution is to refactor the core objects (Split, Predicate, etc.) so that they keep only the serialized bytes of the Java object rather than the Java object itself, and to delegate the Py4J properties to the upstream/downstream objects.
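
One way this refactoring could look, as a sketch under stated assumptions rather than the actual design. `BytesBackedSplit`, `java_bytes`, `to_java`, and `fake_deserialize` are hypothetical names; `fake_deserialize` stands in for whatever call would ask the JVM to rebuild the Java object from its bytes.

```python
import pickle
import threading


class FakeJavaHandle:
    """Hypothetical stand-in for a live, non-picklable Py4J proxy."""

    def __init__(self, payload):
        self.payload = payload
        self._connection = threading.Lock()  # makes the handle unpicklable


def fake_deserialize(data):
    """Stand-in for asking the JVM to rebuild an object from bytes."""
    return FakeJavaHandle(data)


class BytesBackedSplit:
    """Hypothetical split that keeps only the Java object's bytes."""

    def __init__(self, java_bytes):
        self.java_bytes = java_bytes
        self._j_split = None  # live handle, rebuilt lazily in each process

    def to_java(self, deserialize=fake_deserialize):
        # Rebuild the handle on first use and cache it for this process.
        if self._j_split is None:
            self._j_split = deserialize(self.java_bytes)
        return self._j_split

    def __getstate__(self):
        # Only the bytes travel across process boundaries.
        return {"java_bytes": self.java_bytes}

    def __setstate__(self, state):
        self.java_bytes = state["java_bytes"]
        self._j_split = None


original = BytesBackedSplit(b"serialized-split")
original.to_java()  # cache a live handle before pickling
restored = pickle.loads(pickle.dumps(original))
```

The restored object carries no handle until `to_java()` is called on the worker, so the bytes are the only state that has to be serializable.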

In the long term, we might want to follow PyIceberg's approach and reimplement paimon-python as a pure Python project, removing the Py4J dependency entirely. This would not only resolve the issues above but also eliminate the serialization cost of passing Arrow data between Python and Java. However, it would require significant development effort and can be considered in a future phase.

[Enhancement] Enable Distributed Computing Support by Making Core Objects Serializable #44
