[C++] Parquet reading performance regressions #38432

Closed
jorisvandenbossche opened this issue Oct 24, 2023 · 14 comments · Fixed by #38784

Comments

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 24, 2023

Looking at some of the Parquet read benchmarks we have (https://conbench.ursa.dev/c-benchmarks/file-read), two small regressions can be observed over the last few months.

For example, zooming in on that period for nyctaxi_2010 with snappy compression (https://conbench.ursa.dev/benchmark-results/0653775157e07b6980005bce6e59a2ad/):

[image: benchmark timeline]

And for fanniemae (https://conbench.ursa.dev/benchmark-results/0653774cc429733d8000b2db91fa63fa/); here only the first bump is noticeable:

[image: benchmark timeline]

There are two small bumps, which I think are caused by the following two PRs, in order of the bumps:

@jorisvandenbossche
Member Author

Although I have to add that we also have "dataset-read" benchmarks (https://conbench.ursa.dev/c-benchmarks/dataset-read), which also exercise Parquet reading, and there I don't directly see any visible regressions (although those read from S3, and so might be less sensitive to small regressions in the reading part).

Also, the "file-read" benchmark actually uses pq.read_table, which has already defaulted to pre_buffer=True for some time. So it can't be that change of default (on the dataset side) that causes the slowdown. The other change in that PR is switching CacheOptions::Defaults() to LazyDefaults().
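
As a concrete way to A/B the caching change locally, here is a minimal C++ sketch that reads a file with pre-buffering enabled under either cache policy. The file path is a placeholder, and this is only a sketch, not the conbench benchmark itself (which goes through pq.read_table):

```cpp
#include <memory>

#include <arrow/io/api.h>
#include <arrow/io/caching.h>
#include <arrow/result.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>
#include <parquet/properties.h>

// Read a Parquet file with pre-buffering on, switching between the eager
// and lazy cache defaults. "example.parquet" is a placeholder path.
arrow::Status ReadWithCachePolicy(bool lazy) {
  ARROW_ASSIGN_OR_RAISE(auto infile,
                        arrow::io::ReadableFile::Open("example.parquet"));

  parquet::ArrowReaderProperties arrow_props;
  arrow_props.set_pre_buffer(true);  // pq.read_table's default for a while
  arrow_props.set_cache_options(lazy
                                    ? arrow::io::CacheOptions::LazyDefaults()
                                    : arrow::io::CacheOptions::Defaults());

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(infile));
  builder.properties(arrow_props);
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.Build(&reader));

  std::shared_ptr<arrow::Table> table;
  return reader->ReadTable(&table);  // time this call to compare policies
}
```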

@rok
Member

rok commented Oct 24, 2023

@jorisvandenbossche I can't seem to find the exact benchmark that is producing this metric.
Is it possible that the new encoding added by #14341 is now used by default for the benchmark, or added as another codepath?

@jorisvandenbossche
Member Author

No, that benchmark is just using a default pq.read_table (https://github.com/voltrondata-labs/benchmarks/blob/93a57ba1019f65a4cea124d17772b0e8c60ac5ce/benchmarks/file_benchmark.py#L84), which should not use any of the newer encodings by default AFAIK
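
One way to verify that is to list the encodings recorded in the file's metadata. A small C++ sketch, with a placeholder path:

```cpp
#include <iostream>
#include <memory>

#include <parquet/file_reader.h>
#include <parquet/metadata.h>
#include <parquet/types.h>

// Print the encodings actually used by each column chunk of the first
// row group, to verify no newer encodings are in play.
int main() {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile("example.parquet");
  std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
  std::unique_ptr<parquet::RowGroupMetaData> row_group = metadata->RowGroup(0);
  for (int i = 0; i < row_group->num_columns(); ++i) {
    auto chunk = row_group->ColumnChunk(i);
    std::cout << "column " << i << ":";
    for (auto encoding : chunk->encodings()) {
      std::cout << " " << parquet::EncodingToString(encoding);
    }
    std::cout << "\n";
  }
  return 0;
}
```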

@mapleFU
Member

mapleFU commented Oct 24, 2023

Hmm, do we have any profiling data for the benchmark? Or what's the simplest way to reproduce this problem?

I think it might be related to caching, since pre-buffering on an SSD might not be a good approach. I don't think the encoding is related.

@jorisvandenbossche
Member Author

> I don't think the encoding is related.

To be clear, there are potentially two separate issues here (maybe it was not the best idea to open a single issue for both): one related to the encoding change, and one related to the pre_buffer/caching change.

@rok
Member

rok commented Oct 24, 2023

Looking at https://github.com/apache/arrow/pull/14341/files - we abstracted some logic into ArrowBinaryHelper that gets used when decoding plain-encoded string arrays. I don't see other potential culprits.

@mapleFU
Member

mapleFU commented Oct 24, 2023

I'll try running the micro-benchmarks against https://github.com/apache/arrow/pull/14341/files
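
For anyone reproducing this: the real benchmark lives in cpp/src/parquet/encoding_benchmark.cc. The following is only a self-contained Google Benchmark skeleton showing the shape of such a benchmark and its max-string-length/batch-size arguments; the timed loop is a stand-in, not the actual Parquet dictionary decoder:

```cpp
#include <benchmark/benchmark.h>

#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Toy skeleton: build a batch of random-length strings, then "decode" them
// in the timed loop. The arg pairs mirror the max-string-length/batch-size
// parameters of the real BM_DictDecodingByteArray.
static void BM_DecodeStrings(benchmark::State& state) {
  const int64_t max_length = state.range(0);
  const int64_t batch_size = state.range(1);
  std::mt19937 rng(42);
  std::uniform_int_distribution<int64_t> len_dist(1, max_length);

  std::vector<std::string> values;
  int64_t total_bytes = 0;
  for (int64_t i = 0; i < batch_size; ++i) {
    values.emplace_back(static_cast<size_t>(len_dist(rng)), 'x');
    total_bytes += static_cast<int64_t>(values.back().size());
  }

  for (auto _ : state) {
    int64_t decoded = 0;
    for (const auto& v : values) decoded += static_cast<int64_t>(v.size());
    benchmark::DoNotOptimize(decoded);
  }
  // Report the same counters seen in the results further down the thread.
  state.SetBytesProcessed(state.iterations() * total_bytes);
  state.SetItemsProcessed(state.iterations() * batch_size);
}
BENCHMARK(BM_DecodeStrings)
    ->Args({8, 512})
    ->Args({64, 512})
    ->Args({1024, 2048});
BENCHMARK_MAIN();
```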

@mapleFU
Member

mapleFU commented Oct 24, 2023

I found the DictDecodingByteArray benchmark becomes slightly slower (about 2% or less); not sure whether that's the reason or the benchmark is just unstable...

Updated: DictDecodingByteArray does become slower, let me analyze it...

@rok
Member

rok commented Oct 24, 2023

Looking at the commits in the vicinity, it seems most likely that #14341 caused this, and looking at the timeline it doesn't seem like noise (unless something changed with the benchmarking infra). Conbench comparison.

mapleFU added a commit to mapleFU/arrow that referenced this issue Oct 24, 2023
@mapleFU
Member

mapleFU commented Oct 24, 2023

#38437
@rok @jorisvandenbossche Would you mind running the same benchmark here? I've tried changing the logic of ArrowBinaryHelper back to its previous behavior, but I'm not sure whether it helps.

mapleFU added a commit to mapleFU/arrow that referenced this issue Oct 24, 2023
@rok
Member

rok commented Oct 24, 2023

On eba5044 I ran:

```
archery benchmark run --output=run-head-1.json HEAD~1 --benchmark-filter=DictDecodingByteArray
archery benchmark diff HEAD run-head-1.json --benchmark-filter=DictDecodingByteArray
```

And actually got a small regression (but could be noise):

```
--------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------
BM_DictDecodingByteArray/max-string-length:8/batch-size:512           2750 ns         2750 ns       257429 bytes_per_second=1.3859G/s items_per_second=186.194M/s
BM_DictDecodingByteArray/max-string-length:64/batch-size:512          3405 ns         3405 ns       207141 bytes_per_second=5.05129G/s items_per_second=150.383M/s
BM_DictDecodingByteArray/max-string-length:1024/batch-size:512       10643 ns        10642 ns        67129 bytes_per_second=23.1467G/s items_per_second=48.1097M/s
BM_DictDecodingByteArray/max-string-length:8/batch-size:2048          7744 ns         7744 ns        90086 bytes_per_second=1.96492G/s items_per_second=264.454M/s
BM_DictDecodingByteArray/max-string-length:64/batch-size:2048        10882 ns        10882 ns        63705 bytes_per_second=6.29198G/s items_per_second=188.207M/s
BM_DictDecodingByteArray/max-string-length:1024/batch-size:2048      56034 ns        56034 ns        14860 bytes_per_second=17.5095G/s items_per_second=36.5494M/s
--------------------------------------------------------------------------------------------------------------------------
Non-regressions: (5)
--------------------------------------------------------------------------------------------------------------------------
                                                      benchmark       baseline      contender  change %  counters
  BM_DictDecodingByteArray/max-string-length:8/batch-size:2048  1.951 GiB/sec  1.965 GiB/sec     0.706   {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BM_DictDecodingByteArray/max-string-length:8/batch-size:2048', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 86664}
BM_DictDecodingByteArray/max-string-length:1024/batch-size:512 23.075 GiB/sec 23.147 GiB/sec     0.310 {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BM_DictDecodingByteArray/max-string-length:1024/batch-size:512', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 69531}
 BM_DictDecodingByteArray/max-string-length:64/batch-size:2048  6.302 GiB/sec  6.292 GiB/sec    -0.154  {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'BM_DictDecodingByteArray/max-string-length:64/batch-size:2048', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 64627}
  BM_DictDecodingByteArray/max-string-length:64/batch-size:512  5.097 GiB/sec  5.051 GiB/sec    -0.901  {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_DictDecodingByteArray/max-string-length:64/batch-size:512', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 205168}
   BM_DictDecodingByteArray/max-string-length:8/batch-size:512  1.410 GiB/sec  1.386 GiB/sec    -1.730   {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BM_DictDecodingByteArray/max-string-length:8/batch-size:512', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 259846}

--------------------------------------------------------------------------------------------------------------------------
Regressions: (1)
--------------------------------------------------------------------------------------------------------------------------
                                                       benchmark       baseline      contender  change %  counters
BM_DictDecodingByteArray/max-string-length:1024/batch-size:2048 21.256 GiB/sec 17.510 GiB/sec   -17.626 {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'BM_DictDecodingByteArray/max-string-length:1024/batch-size:2048', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 15530}
```

@mapleFU
Member

mapleFU commented Oct 24, 2023

Because the DictDecodingByteArray benchmark itself is unstable... T_T

@mapleFU
Member

mapleFU commented Oct 24, 2023

> Benchmark runs are scheduled for commit eba5044. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev/ for updates. A comment will be posted here when the runs are complete.

After reverting some changes, it seems the dataset-read doesn't get faster. Maybe we can profile the two runs for more detail.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Oct 27, 2023

> After reverting some changes, it seems the dataset-read doesn't get faster.

I posted on the PR as well (#38437 (comment)), but from what I can see, the file-read benchmark did significantly improve with the PR.

mapleFU added a commit to mapleFU/arrow that referenced this issue Nov 19, 2023
mapleFU added a commit to mapleFU/arrow that referenced this issue Nov 20, 2023
pitrou pushed a commit that referenced this issue Nov 27, 2023
…tByteArrayDecoderImpl (#38784)

### Rationale for this change

Makes some of the changes mentioned in #38432.

I believe this might also fix #38577.

Problem 1:

`BinaryHelper` may call either `Prepare()` or `Prepare(estimated-output-binary-length)` for the data. This is because:

1. For plain-encoded ByteArray, `len_` is close to the data-page size, so the `Reserve` is meaningful.
2. For dict-encoded ByteArray, the data page is just an RLE-encoded page, so its `len_` may not be directly related to the output binary size.

Problem 2:

`Prepare` takes `::arrow::kBinaryMemoryLimit` as the `min()` bound; we should use `this->chunk_space_remaining_` instead.

Problem 3:

`std::optional<int64_t>` is hard for some compilers to optimize.

### What changes are included in this PR?

Document the behavior of `BinaryHelper`, and try to fix it.

### Are these changes tested?

No

### Are there any user-facing changes?

Regression fixes

* Closes: #38432

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
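
To make Problems 1 and 2 concrete, here is a minimal sketch of the reservation logic (the function name and parameters are illustrative only; the actual change presumably lives in `ArrowBinaryHelper` / `DictByteArrayDecoderImpl` in `cpp/src/parquet/encoding.cc`):

```cpp
#include <algorithm>
#include <cstdint>

#include <arrow/builder.h>
#include <arrow/status.h>

// Illustrative only: bound the data reservation by the space left in the
// current output chunk rather than by ::arrow::kBinaryMemoryLimit.
// For dict-encoded pages, the encoded page length (len_) says little about
// the decoded output size, so a bound based on len_ alone can overshoot.
arrow::Status PrepareOutput(arrow::BinaryBuilder* builder,
                            int64_t estimated_output_length,
                            int64_t chunk_space_remaining) {
  // Before (roughly): std::min(estimated_output_length, kBinaryMemoryLimit)
  const int64_t to_reserve =
      std::min(estimated_output_length, chunk_space_remaining);
  return builder->ReserveData(to_reserve);
}
```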
@pitrou pitrou added this to the 15.0.0 milestone Nov 27, 2023
@raulcd raulcd modified the milestones: 15.0.0, 14.0.2 Nov 28, 2023
raulcd pushed a commit that referenced this issue Nov 28, 2023
…tByteArrayDecoderImpl (#38784)

dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…he DictByteArrayDecoderImpl (apache#38784)
