Improve parquet binary writer speed by reducing allocations #819

Closed
nevi-me opened this issue Oct 7, 2021 · 0 comments · Fixed by #820
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@nevi-me
Contributor

nevi-me commented Oct 7, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When writing Arrow binary columns to parquet, we create thousands of small ByteBuffer objects, so much of the writer's time is spent allocating and dropping these objects.

Describe the solution you'd like

A ByteBuffer is backed by a ByteBufferPtr, which is an alias for Arc<Vec<u8>> with similar abstractions for length and offset. If we create a single ByteBuffer from the Arrow data, we reduce the allocations to one, and can then reuse this buffer when writing binary values.

Local experiments have shown reasonable improvements in the writer.
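A minimal sketch of the idea, using a simplified stand-in for ByteBufferPtr (modeled here as an Arc<Vec<u8>> plus an offset/length window); the names SharedSlice, per_value_buffers and single_buffer are illustrative, not the crate's API:

```rust
use std::sync::Arc;

/// Simplified stand-in for the parquet crate's ByteBufferPtr:
/// a shared byte vector plus an offset/length window into it.
#[derive(Clone)]
struct SharedSlice {
    data: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl SharedSlice {
    fn new(data: Vec<u8>) -> Self {
        let len = data.len();
        Self { data: Arc::new(data), offset: 0, len }
    }

    /// Re-window the same allocation instead of allocating a new Vec.
    fn range(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len);
        Self { data: Arc::clone(&self.data), offset: self.offset + offset, len }
    }

    fn as_bytes(&self) -> &[u8] {
        &self.data[self.offset..self.offset + self.len]
    }
}

/// Current behaviour: one Vec allocation (and later drop) per binary value.
fn per_value_buffers(values: &[&[u8]]) -> Vec<SharedSlice> {
    values.iter().map(|v| SharedSlice::new(v.to_vec())).collect()
}

/// Proposed behaviour: copy all values into one backing buffer, then hand
/// out windows that share that single allocation via Arc clones.
fn single_buffer(values: &[&[u8]]) -> Vec<SharedSlice> {
    let total: usize = values.iter().map(|v| v.len()).sum();
    let mut backing = Vec::with_capacity(total);
    let mut spans = Vec::with_capacity(values.len());
    for v in values {
        spans.push((backing.len(), v.len()));
        backing.extend_from_slice(v);
    }
    let shared = SharedSlice::new(backing);
    spans.into_iter().map(|(offset, len)| shared.range(offset, len)).collect()
}

fn main() {
    let values: [&[u8]; 3] = [b"alpha", b"beta", b"gamma"];
    let many = per_value_buffers(&values); // three heap allocations
    let one = single_buffer(&values);      // one heap allocation, three windows
    assert_eq!(many[1].as_bytes(), one[1].as_bytes());
}
```

Because each window only Arc-clones the shared backing vector, the per-value cost drops to a reference-count bump instead of a heap allocation and drop.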

Describe alternatives you've considered

I considered slicing into the Arrow buffer directly, but the parquet::encoding::Encoding interface is too inflexible for this approach.

Additional context

I noticed this while profiling code and trying to simplify how we write nested lists.

@nevi-me added the enhancement label Oct 7, 2021