Improve parquet binary writer speed by reducing allocations #819

Closed
nevi-me opened this issue Oct 7, 2021 · 0 comments · Fixed by #820
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@nevi-me
Contributor

nevi-me commented Oct 7, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When writing Arrow binary columns to parquet, we create thousands of small ByteBuffer objects, so much of the writer's time is spent allocating and dropping these objects.

Describe the solution you'd like

A ByteBuffer is backed by a ByteBufferPtr, which is an alias for Arc<Vec<u8>> with similar abstractions for length and offset. If we create a single ByteBuffer from the Arrow data, we reduce the allocations to one, and can then reuse this buffer when writing binary values.

Local experiments have shown reasonable improvements in the writer.
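A minimal sketch of the idea, using a simplified stand-in for ByteBufferPtr (modeled here as an Arc<Vec<u8>> plus an offset/length window); the names SharedSlice, per_value_buffers and single_buffer are illustrative, not the crate's API:

```rust
use std::sync::Arc;

/// Simplified stand-in for the parquet crate's ByteBufferPtr:
/// a shared byte vector plus an offset/length window into it.
#[derive(Clone)]
struct SharedSlice {
    data: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl SharedSlice {
    fn new(data: Vec<u8>) -> Self {
        let len = data.len();
        Self { data: Arc::new(data), offset: 0, len }
    }

    /// Re-window the same allocation instead of allocating a new Vec.
    fn range(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len);
        Self { data: Arc::clone(&self.data), offset: self.offset + offset, len }
    }

    fn as_bytes(&self) -> &[u8] {
        &self.data[self.offset..self.offset + self.len]
    }
}

/// Current behaviour: one Vec allocation (and later drop) per binary value.
fn per_value_buffers(values: &[&[u8]]) -> Vec<SharedSlice> {
    values.iter().map(|v| SharedSlice::new(v.to_vec())).collect()
}

/// Proposed behaviour: copy all values into one backing buffer, then hand
/// out windows that share that single allocation via Arc clones.
fn single_buffer(values: &[&[u8]]) -> Vec<SharedSlice> {
    let total: usize = values.iter().map(|v| v.len()).sum();
    let mut backing = Vec::with_capacity(total);
    let mut spans = Vec::with_capacity(values.len());
    for v in values {
        spans.push((backing.len(), v.len()));
        backing.extend_from_slice(v);
    }
    let shared = SharedSlice::new(backing);
    spans.into_iter().map(|(offset, len)| shared.range(offset, len)).collect()
}

fn main() {
    let values: [&[u8]; 3] = [b"alpha", b"beta", b"gamma"];
    let many = per_value_buffers(&values); // three heap allocations
    let one = single_buffer(&values);      // one heap allocation, three windows
    assert_eq!(many[1].as_bytes(), one[1].as_bytes());
}
```

Because each window only Arc-clones the shared backing vector, the per-value cost drops to a reference-count bump instead of a heap allocation and drop.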

Describe alternatives you've considered

I considered slicing into the Arrow buffer directly, but the parquet::encoding::Encoding interface is too inflexible for this approach.

Additional context

I noticed this while profiling code and trying to simplify how we write nested lists.

@nevi-me added the enhancement label Oct 7, 2021