
[Umbrella] Arrow-based results serialization #3863

Closed
2 of 5 tasks
pan3793 opened this issue Nov 28, 2022 · 0 comments
Labels
kind:umbrella This is an umbrella ticket priority:major

Comments

@pan3793
Member

pan3793 commented Nov 28, 2022

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the proposal

Currently, Kyuubi supports the Thrift-based HiveServer2 (HS2) protocol, and result transmission over it is not efficient enough.

For the Spark engine, the main pain points are (a simplified sketch of the current path follows the list):

  • The Driver has high memory pressure: it collects the RDD as InternalRow, converts to Row, and converts to TRow (row-based or column-based, depending on the client protocol) before sending the results back to the Kyuubi Server, which typically consumes several times more memory than the same data stored in a Parquet file.

  • The data conversion happens on the Driver side, which also consumes significant CPU time.

  • The protocol does not support compression, which would be quite helpful in network bandwidth-limited scenarios.
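
A simplified sketch of the current Thrift path described above (not Kyuubi's actual code; TRowLike and TColumnValueLike are stand-ins for the TRow/TColumnValue structures generated from the HS2 IDL):

```scala
import org.apache.spark.sql.{DataFrame, Row}

object ThriftPathSketch {
  // Stand-ins for the Thrift structures generated from the HS2 IDL.
  case class TColumnValueLike(value: String)
  case class TRowLike(colVals: Seq[TColumnValueLike])

  def collectAsThriftRows(df: DataFrame): Seq[TRowLike] = {
    // 1. collect() pulls every partition to the Driver and decodes
    //    InternalRow -> Row, materializing each value as a JVM object.
    val rows: Array[Row] = df.collect()
    // 2. A second Driver-side pass copies every value into a Thrift structure,
    //    growing the footprint again before anything is sent to the Server.
    rows.toSeq.map { row =>
      TRowLike((0 until row.length).map { i =>
        TColumnValueLike(if (row.isNullAt(i)) null else row.get(i).toString)
      })
    }
  }
}
```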

Apache Arrow is a columnar format that is more efficient for data transmission. It is adopted by PySpark as the data serialization format between the JVM and the Python process, and will be adopted by the ongoing Spark Connect. Kyuubi can support fetching results in Arrow format to improve transmission efficiency.
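
From the client's point of view, opting in could look something like the following sketch (the configuration key and JDBC details are assumptions for illustration, not something defined by this issue):

```scala
import java.sql.DriverManager

object ArrowClientSketch extends App {
  // Connect to Kyuubi's HS2-compatible Thrift frontend (default port 10009).
  val conn = DriverManager.getConnection(
    "jdbc:hive2://kyuubi-host:10009/default", "user", "")
  val stmt = conn.createStatement()
  // Hypothetical session-level switch asking the engine to serialize result
  // sets as Arrow batches instead of row-based TRows.
  stmt.execute("SET kyuubi.operation.result.format=arrow")
  val rs = stmt.executeQuery("SELECT * FROM some_table")
}
```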

The core ideas are (a rough sketch of the engine-side flow follows the list):

  • convert (encode as Arrow, with optional compression) the data on the executor side before collecting it to the Driver
  • the Driver collects the Arrow results, encodes the Arrow data as Thrift binary data, sets a flag indicating the client should decode the result in Arrow format, and then sends them back to the Server directly
  • the client should be updated to support decompressing and decoding the Arrow format
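
A minimal sketch of the engine-side flow under these ideas (not the actual implementation; encodePartitionToArrowIpc is a hypothetical helper standing in for Spark's internal Arrow serialization, and compression such as zstd on the IPC stream would also live there):

```scala
import org.apache.spark.sql.{DataFrame, Row}

object ArrowEngineSketch {
  def collectAsArrowBatches(df: DataFrame): Array[Array[Byte]] = {
    df.rdd
      // 1. Encode rows to Arrow IPC bytes (and optionally compress) on the
      //    executors, so the Driver never materializes per-row JVM objects.
      .mapPartitions { rows =>
        Iterator.single(encodePartitionToArrowIpc(rows))
      }
      // 2. The Driver only collects opaque byte arrays; it wraps them as
      //    Thrift binary payloads with a flag telling the client to decode
      //    Arrow, then sends them back to the Server without re-encoding rows.
      .collect()
  }

  // Hypothetical placeholder for Arrow IPC encoding of one partition.
  def encodePartitionToArrowIpc(rows: Iterator[Row]): Array[Byte] = ???
}
```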

Task list

Are you willing to submit PR?

  • Yes. I can submit a PR independently to improve.
  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
  • No. I cannot submit a PR at this time.
@pan3793 pan3793 added kind:umbrella This is an umbrella ticket priority:major labels Nov 28, 2022
@pan3793 pan3793 changed the title [Umbrella] Arrow based results serialization [Umbrella] Arrow-based results serialization Nov 28, 2022
@pan3793 pan3793 pinned this issue Nov 28, 2022
@pan3793 pan3793 unpinned this issue Dec 22, 2022
@pan3793 pan3793 closed this as completed Apr 25, 2024