Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32420][SQL] Add handling for unique key in non-codegen hash join #29216

Closed
wants to merge 1 commit into from

Conversation

c21
Copy link
Contributor

@c21 c21 commented Jul 24, 2020

What changes were proposed in this pull request?

HashRelation has two separate code paths for unique key look up and non-unique key look up E.g. in its subclass UnsafeHashedRelation, unique key look up is more efficient as it does not have e.g. extra Iterator[UnsafeRow].hasNext()/next() overhead per row.

BroadcastHashJoinExec has handled unique key vs non-unique key separately in code-gen path. But the non-codegen path for broadcast hash join and shuffled hash join do not separate it yet, so adding the support here.

Why are the changes needed?

Shuffled hash join and non-codegen broadcast hash join still rely on this code path for execution. So this PR will help save CPU for executing this two type of join. Adding codegen for shuffled hash join would be a different topic and I will add it in https://issues.apache.org/jira/browse/SPARK-32421 .

Ran the same query as JoinBenchmark, with enabling and disabling this feature. Verified 20% wall clock time improvement (switch control and test group order as well to verify the improvement to not be the noise).

Running benchmark: shuffle hash join
  Running case: shuffle hash join unique key SHJ off
  Stopped after 5 iterations, 4039 ms
  Running case: shuffle hash join unique key SHJ on
  Stopped after 5 iterations, 2898 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join unique key SHJ off                707            808          81          5.9         168.6       1.0X
shuffle hash join unique key SHJ on                 547            580          50          7.7         130.4       1.3X
Running benchmark: shuffle hash join
  Running case: shuffle hash join unique key SHJ on
  Stopped after 5 iterations, 3333 ms
  Running case: shuffle hash join unique key SHJ off
  Stopped after 5 iterations, 4268 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join unique key SHJ on                 565            667          60          7.4         134.8       1.0X
shuffle hash join unique key SHJ off                774            854          85          5.4         184.4       0.7X

Does this PR introduce any user-facing change?

No.

How was this patch tested?

@SparkQA
Copy link

SparkQA commented Jul 24, 2020

Test build #126467 has finished for PR 29216 at commit 09d5a0d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@c21
Copy link
Contributor Author

c21 commented Jul 24, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Jul 24, 2020

Test build #126501 has finished for PR 29216 at commit 09d5a0d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@c21
Copy link
Contributor Author

c21 commented Jul 24, 2020

cc @cloud-fan and @sameeragarwal if you guys can help take a look. Thanks!

matches.map(joinRow.withRight(_)).filter(boundCondition)
} else {
Seq.empty

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we already have test for inner join?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

})
result.setBoolean(0, exists)
joinedRow(current, result)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And test for existenceJoin?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya - existence join is tricky and I will try to add one. But besides testing, wondering what do you think of this PR? Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya - I just found test single unique condition (equal) for left Anti join in ExistenceJoinSuite already covered existence join. So we should be good with existence join. More historical context for existence join is this PR just FYI.

@c21
Copy link
Contributor Author

c21 commented Jul 26, 2020

@viirya - given tests already cover all joins, could you give a review of core logic to help move forward? Thanks.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 548b7db Jul 27, 2020
@c21
Copy link
Contributor Author

c21 commented Jul 27, 2020

Thanks @cloud-fan and @viirya for review!

@c21 c21 deleted the unique-key branch July 27, 2020 18:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants