[SPARK-38354][SQL] Add hash probes metric for shuffled hash join#35686

Closed
c21 wants to merge 1 commit into apache:master from c21:probe-metrics

Conversation


@c21 c21 commented Feb 28, 2022

What changes were proposed in this pull request?

For hash aggregate, there's a SQL metric that tracks the number of hash probes per looked-up key. It would be better to add a similar metric for shuffled hash join as well, to get some idea of hash probing performance. Also renamed the existing SQL metric (and related method names) in hash aggregate from `avg hash probe bucket list iters` to `avg hash probes per key`, as the original name is quite obscure.

Why are the changes needed?

To surface shuffled hash join probing performance in the Spark web UI (and allow metrics collection). The closer the metric is to 1.0, the better the probing performance.

Does this PR introduce any user-facing change?

Yes, the added SQL metric. A screenshot will be attached later.

How was this patch tested?

The modified unit test in `SQLMetricsSuite.scala`.
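The metric described above is just the ratio of two counters. As a minimal sketch of the idea — the names `ProbeCounters`, `recordLookup`, and `avgHashProbesPerKey` are hypothetical, not Spark's actual identifiers — the value could be derived like this:

```scala
// Hypothetical sketch of an "avg hash probes per key" metric derived from
// two counters; this is an illustration, not Spark's implementation.
class ProbeCounters {
  private var numKeyLookups = 0L
  private var numProbes = 0L

  // Record one key lookup; `extraProbes` is the number of additional
  // buckets visited beyond the first because of hash collisions.
  def recordLookup(extraProbes: Long): Unit = {
    numKeyLookups += 1
    numProbes += 1 + extraProbes
  }

  // 1.0 means every key was found on its first probe (no collisions).
  def avgHashProbesPerKey: Double =
    if (numKeyLookups == 0) 1.0 else numProbes.toDouble / numKeyLookups
}
```

This makes the "closer to 1.0 is better" interpretation concrete: every collision-free lookup contributes exactly one probe, so only chained bucket walks push the average above 1.0.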

@c21 c21 force-pushed the probe-metrics branch from c491e9f to 3c56b98 Compare March 1, 2022 00:48

c21 commented Mar 1, 2022

cc @cloud-fan could you help take a look when you have time? Thanks.

```scala
private def updateIndex(key: Long, address: Long): Unit = {
  numKeyLookups += 1
  numProbes += 1
```
cloud-fan commented on the diff above:

hmm, do we need to track the probe time when building the hash relation?

c21 replied:

@cloud-fan - This matches the existing behavior of UnsafeHashedRelation, which also updates the lookup/probe metrics while building the hash relation. I guess it would be good to keep UnsafeHashedRelation and LongHashedRelation consistent here?

@cloud-fan

thanks, merging to master!

@cloud-fan cloud-fan closed this in 1584366 Mar 9, 2022

c21 commented Mar 9, 2022

Thank you @cloud-fan for review!

@c21 c21 deleted the probe-metrics branch March 9, 2022 20:53
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request Mar 10, 2022
Closes apache#35686 from c21/probe-metrics.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

somani commented Apr 24, 2022

@c21 @cloud-fan This has caused a performance regression in our tests where broadcast hash join is 5x slower.
It can be reproduced easily on tpcds 3tb data with the following query:

```sql
select sum(ws_ext_sales_price) sum_sales, count(*)
from web_sales, date_dim
where ws_sold_date_sk = d_date_sk
```

I could not figure out why it caused a regression, but it is clear it goes away on reverting the commit.

@cloud-fan

Probably because adding a new metric in a critical code path has perf overhead. @c21 can you open a PR to revert it? We can take more time to think about how to add this metric without significant perf overhead in Spark 3.4.
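One common way to cut this kind of per-row overhead — sketched below as a general technique, not as what Spark ultimately did; `SQLMetricStub` and `probeBatch` are hypothetical stand-ins, not Spark's API — is to accumulate into a plain local variable in the hot loop and flush to the shared metric once per batch:

```scala
// Hypothetical stand-in for a shared, thread-safe metric.
final class SQLMetricStub {
  private var value = 0L
  def add(v: Long): Unit = synchronized { value += v } // cross-thread update
  def get: Long = synchronized { value }
}

// The hot loop touches only a local Long; the shared metric is updated
// once per batch, so the per-row cost is a plain register increment
// rather than a synchronized (or volatile) write.
def probeBatch(keys: Array[Long], probeMetric: SQLMetricStub): Unit = {
  var localProbes = 0L
  var i = 0
  while (i < keys.length) {
    localProbes += 1 // assume each key costs exactly one probe, for brevity
    i += 1
  }
  probeMetric.add(localProbes) // single flush per batch
}
```

The trade-off is that the metric lags reality by up to one batch, which is usually acceptable for UI-only counters.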


c21 commented Apr 25, 2022

@cloud-fan and @somani - makes sense, let me revert this to unblock release for now.
@somani - it would be very helpful if you could share any profiling data or a flame graph for the regressed query.

dongjoon-hyun pushed a commit that referenced this pull request Apr 26, 2022
…oin"

This reverts commit 1584366, as the original PR caused performance regression reported in #35686 (comment) .

Closes #36338 from c21/revert-metrics.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Apr 26, 2022

(cherry picked from commit 6b5a1f9)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>