[Enhancement] Enable Distributed Computing Support by Making Core Objects Serializable #44

@chenghuichen

Description

Currently, paimon-python uses Py4J to interact with Java Paimon. As a result, core objects like Split and Predicate hold Py4J properties that are not serializable. These objects therefore cannot be serialized with standard Python mechanisms (e.g., pickle), which makes them incompatible with distributed computing frameworks.
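To make the failure mode concrete, here is a minimal sketch that does not use Py4J at all. `FakeJavaHandle` and `NaiveSplit` are hypothetical names, and a `threading.Lock` stands in for the process-local resources (such as a gateway connection) that a Py4J proxy is assumed to hold. It only illustrates the general problem, not paimon-python's actual classes.

```python
import pickle
import threading


class FakeJavaHandle:
    """Hypothetical stand-in for a Py4J proxy that owns a live, process-local resource."""

    def __init__(self):
        # A lock, like an open socket, cannot be pickled.
        self._connection = threading.Lock()


class NaiveSplit:
    """Hypothetical Split that keeps the Java-side handle as an attribute."""

    def __init__(self, handle):
        self.j_split = handle


def try_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


# The handle makes the whole object unpicklable; plain bytes are fine.
naive_ok = try_pickle(NaiveSplit(FakeJavaHandle()))
bytes_ok = try_pickle(b"\xac\xed\x00\x05")
```

Any framework that ships task inputs to workers with pickle would hit the same error as soon as it tries to send such an object.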

This is a critical limitation because distributed computing is a core use case for data lake formats like Paimon. Without distributed computing support, paimon-python's practical utility is severely limited.

The proposed solution is to refactor core objects (Split, Predicate, etc.) so that they keep only the serialized bytes of the Java object rather than the Java object itself, and to delegate Py4J properties to upstream/downstream objects.
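One possible shape of that refactoring is sketched below, under stated assumptions. `SerializableSplit`, `FakeReader`, and the `deserialize` callable are hypothetical names for illustration, not part of paimon-python or Py4J. The Split holds only bytes, while the object that owns the Java gateway (here a fake reader) turns those bytes back into a Java-side object when needed.

```python
import pickle
from dataclasses import dataclass


@dataclass(frozen=True)
class SerializableSplit:
    """Hypothetical Split that stores only the Java object's serialized bytes."""

    java_bytes: bytes

    def to_java(self, deserialize):
        """Rebuild the Java-side object on demand.

        `deserialize` is a hypothetical callable supplied by whichever
        object owns the Py4J gateway (e.g. a reader on the worker).
        """
        return deserialize(self.java_bytes)


class FakeReader:
    """Hypothetical downstream object that would own the Py4J gateway."""

    def _deserialize(self, data):
        # Assumption: a real reader would call into Java here.
        return ("java-split", data)

    def read(self, split):
        return split.to_java(self._deserialize)


# The split crosses a process boundary as plain bytes ...
split = SerializableSplit(java_bytes=b"\xac\xed\x00\x05")
restored = pickle.loads(pickle.dumps(split))

# ... and the Java object is rebuilt only where a gateway is available.
java_obj = FakeReader().read(restored)
```

With this layout the driver can plan splits and send them to workers, and each worker rebuilds the Java object with its own gateway.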

In the long term, we might want to follow PyIceberg's approach and reimplement paimon-python as a pure Python project, removing the Py4J dependency entirely. This would not only resolve the issues above, but also eliminate the serialization cost of passing Arrow data between Python and Java. However, it would require significant development effort and can be considered in a future phase.
