parquet: optimize CachedArrayReader byte-array coalescing #9743
ClSlaid wants to merge 1 commit into apache:main from issue-9060-cached-array-reader-byte-coalescer
Conversation
When CachedArrayReader builds output from multiple cached batches, the old path materialized filtered byte arrays and then concatenated them. Replace that path for Utf8/Binary arrays with a direct coalescer that builds offsets, values, and validity in one output array, while keeping the existing generic MutableArrayData path for other types.

Add a dedicated CachedArrayReader benchmark and a sparse string regression test so this path is measured directly and covered independently of the broader parquet reader benchmarks.

Benchmark vs main:
- cached_array_reader/utf8_sparse_cross_batch_4m_rows/consume_batch: 11.949 ms -> 4.153 ms (-65.2%)
- arrow_reader_clickbench/sync/Q24 (same filter/projection as ClickBench Q26): 28.377 ms -> 28.443 ms (+0.2%, no measurable change)

Signed-off-by: cl <cailue@apache.org>
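To make the "build offsets, values, and validity in one output array" idea concrete, here is a hedged sketch in plain Rust. It uses a hypothetical simplified `Utf8Batch` struct (not the actual arrow-rs `StringArray` API, and validity is omitted for brevity) to show how a coalescer can walk every cached batch's selection once and append selected bytes directly into a single offsets/values pair, with no intermediate filtered arrays:

```rust
// Hypothetical simplified model of an Arrow Utf8 array: an offsets buffer
// with len + 1 entries and a flat values buffer of UTF-8 bytes.
struct Utf8Batch {
    offsets: Vec<i32>,
    values: Vec<u8>,
}

impl Utf8Batch {
    fn from_strs(strs: &[&str]) -> Self {
        let mut offsets = vec![0i32];
        let mut values = Vec::new();
        for s in strs {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len() as i32);
        }
        Utf8Batch { offsets, values }
    }

    fn get(&self, i: usize) -> &str {
        let (a, b) = (self.offsets[i] as usize, self.offsets[i + 1] as usize);
        std::str::from_utf8(&self.values[a..b]).unwrap()
    }

    fn len(&self) -> usize {
        self.offsets.len() - 1
    }
}

// One pass over all (batch, selection) pairs: append the selected values
// and extend the offsets buffer of the single output array directly.
fn coalesce_selected(batches: &[(Utf8Batch, Vec<bool>)]) -> Utf8Batch {
    let mut offsets = vec![0i32];
    let mut values = Vec::new();
    for (batch, selection) in batches {
        for i in 0..batch.len() {
            if selection[i] {
                values.extend_from_slice(batch.get(i).as_bytes());
                offsets.push(values.len() as i32);
            }
        }
    }
    Utf8Batch { offsets, values }
}
```

The real implementation additionally carries a validity bitmap and dispatches on the Arrow data type; the sketch only shows why a single output allocation can replace the per-batch filter-then-concat sequence.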
@alamb I've tried to optimize with GPT 5.4; the improvement was not that obvious in the original test case you gave, so I had it write a new benchmark and optimized against that. However, I'm still not really confident in the current implementation, so please have a look.
@XiangpengHao can you help review this PR?
run benchmarks arrow_reader_clickbench |
🤖 Arrow criterion benchmark running (GKE): comparing issue-9060-cached-array-reader-byte-coalescer (5e78671) to d7d9ad3 (merge-base)
🤖 Arrow criterion benchmark completed (GKE)
```rust
    selected_row_count: usize,
) -> ArrayRef {
    match selected_batches[0].array.data_type() {
        ArrowType::Utf8 => {
```
Wouldn't it need utf8view to show up in the clickbench benchmarks?
Thank you @ClSlaid. The idea is to avoid materializing the filtered batch and to build the final batch directly, so essentially a fused filter-and-concat kernel. It makes sense to me.
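The "fused filter-and-concat kernel" framing can be illustrated with a hedged sketch, using plain `Vec<String>` in place of Arrow arrays (the function names and types here are hypothetical, purely for illustration). The old path copies every selected value twice: once into a per-batch filtered batch and once into the concatenated result. The fused path copies each selected value exactly once:

```rust
// Old two-step path: materialize a filtered copy of every batch,
// then concatenate the copies. Each selected value is copied twice.
fn filter_then_concat(batches: &[(Vec<String>, Vec<bool>)]) -> Vec<String> {
    let filtered: Vec<Vec<String>> = batches
        .iter()
        .map(|(batch, sel)| {
            batch
                .iter()
                .zip(sel)
                .filter(|(_, keep)| **keep)
                .map(|(s, _)| s.clone()) // first copy: intermediate filtered batch
                .collect()
        })
        .collect();
    // second copy: concatenation into the final output
    filtered.into_iter().flatten().collect()
}

// Fused path: one pass over all batches, one copy per selected value,
// written straight into the final output.
fn fused_filter_concat(batches: &[(Vec<String>, Vec<bool>)]) -> Vec<String> {
    let mut out = Vec::new();
    for (batch, sel) in batches {
        for (s, keep) in batch.iter().zip(sel) {
            if *keep {
                out.push(s.clone());
            }
        }
    }
    out
}
```

Both functions produce the same output; the fused form only removes the intermediate allocations, which is why the change is behavior-preserving.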