
[SPARK-37175][SQL] Performance improvement to hash joins with many duplicate keys #35047

Closed
sumeetgajjar wants to merge 30 commits into apache:master from sumeetgajjar:SPARK-37175-HashRelation-Optimization

Conversation

@sumeetgajjar
Contributor

What changes were proposed in this pull request?

This PR aims at improving performance for Hash joins with many duplicate keys.

A HashedRelation uses a map underneath to store rows against a corresponding key. A LongToUnsafeRowMap is used by LongHashedRelation and a BytesToBytesMap is used by UnsafeHashedRelation.
We propose to reorder the underlying map, placing all the rows for a given key adjacent in memory to improve spatial locality when the stream side of the join iterates over them.

This is achieved in the following steps:

  • creating another copy of the underlying map
  • for all keys in the existing map
    • get the corresponding rows
    • insert all the rows for the given key at once in the new map
  • use the new map for look-ups

This optimization can be enabled by specifying spark.sql.hashedRelationReorderFactor=<value>.
The optimization kicks in once the underlying map satisfies the condition: number of rows >= number of unique keys * the configured value.
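
For illustration, here is a rough sketch of the rebuild described above against a simplified map interface; the names (RowMap, appendAll, etc.) are hypothetical stand-ins for Spark's LongToUnsafeRowMap / BytesToBytesMap, whose real APIs differ.

```scala
// Illustrative sketch only: RowMap is a hypothetical stand-in for
// LongToUnsafeRowMap / BytesToBytesMap.
trait RowMap[K, Row] {
  def keys: Iterator[K]
  def get(key: K): Iterator[Row]               // rows may be scattered in memory
  def appendAll(key: K, rows: Seq[Row]): Unit  // writes all rows for a key contiguously
  def numRows: Long
  def numUniqueKeys: Long
}

def reorderIfBeneficial[K, Row](
    original: RowMap[K, Row],
    fresh: RowMap[K, Row],
    reorderFactor: Double): RowMap[K, Row] = {
  // Only rebuild when keys are duplicated enough on average to pay for the extra copy.
  if (original.numRows >= original.numUniqueKeys * reorderFactor) {
    original.keys.foreach { key =>
      // Insert all rows for the key at once so they end up adjacent in memory.
      fresh.appendAll(key, original.get(key).toSeq)
    }
    fresh  // use the compact map for look-ups
  } else {
    original
  }
}
```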

Why are the changes needed?

There is no order maintained when rows are added to the underlying map, so for a given key the corresponding rows are typically non-adjacent in memory, resulting in poor spatial locality. Placing the rows for a given key adjacent in memory yields a performance boost, thereby reducing execution time.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Modified existing unit tests to run against the suggested improvement.
  • Added a couple of cases to test the scenarios when the improvement throws an exception due to insufficient memory.
  • Added a micro-benchmark that clearly indicates performance improvements when there are duplicate keys.
  • Ran the four example queries mentioned in the JIRA in spark-sql as a final check for performance improvement.

Credits

This work is based on the initial idea proposed by @bersprockets.

Comment on lines +1158 to +1162

[question] IIUC, at this point in time we will have both the old map as well as the new compact map in memory, so we would be holding 2x the memory (at this time) that we were holding before (as we just had 1 map). This itself can cause an OOM. Do we need some heuristic here? Or perhaps delete / free (not sure if we have this functionality at present) entries once they have been inserted into the compactMap?

Contributor Author

In case an OOM-related exception is thrown while appending to the map, we invoke maybeCompactMap.foreach(_.free()) in the catch clause which releases the memory of the compactMap.

https://github.com/apache/spark/pull/35047/files#diff-127291a0287f790755be5473765ea03eb65f8b58b9ec0760955f124e21e3452fR1171
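
A toy, self-contained illustration of that clean-up path; FakeCompactMap and the surrounding wiring are hypothetical, whereas the PR works against LongToUnsafeRowMap / BytesToBytesMap with a TaskMemoryManager.

```scala
// Hypothetical stand-in for the compact map; only free() matters for this sketch.
class FakeCompactMap {
  def append(key: Long, row: Array[Byte]): Unit = ()
  def free(): Unit = println("released compact map memory")
}

def buildCompactOrFallback(rows: Iterator[(Long, Array[Byte])]): Option[FakeCompactMap] = {
  var maybeCompactMap: Option[FakeCompactMap] = None
  try {
    maybeCompactMap = Some(new FakeCompactMap)
    rows.foreach { case (k, r) => maybeCompactMap.get.append(k, r) }
    maybeCompactMap
  } catch {
    case _: OutOfMemoryError =>
      // Mirrors the maybeCompactMap.foreach(_.free()) mentioned above.
      maybeCompactMap.foreach(_.free())
      None  // caller keeps using the original map
  }
}
```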

@singhpk234 singhpk234 Dec 29, 2021

What I meant here is: if we know beforehand that we don't have this much memory available, we should try to short-circuit this rewrite by not attempting it. We have getTotalMemoryConsumption from the earlier map, and we can calculate the memory available to us at the start; if we know we can't support 2x, maybe we save time by avoiding the extra work when the exception would otherwise happen.
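
As a sketch, the kind of pre-check being suggested might look like the following, assuming the available execution memory could be queried (the reply below notes that no such API is exposed where the map is built):

```scala
// Hypothetical short-circuit: skip the rewrite when we cannot afford roughly
// another copy of the existing map. currentMapBytes would come from something
// like getTotalMemoryConsumption on the existing map; availableExecutionMemory
// is an assumption and is not actually exposed at the build site today.
def shouldAttemptReorder(currentMapBytes: Long, availableExecutionMemory: Long): Boolean =
  availableExecutionMemory >= currentMapBytes
```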

Contributor Author

That sounds like a good idea; however, TaskMemoryManager does not expose an API to fetch the available execution memory. There does exist MemoryManager.getExecutionMemoryUsageForTask, but unfortunately MemoryManager is not accessible at the place where the map is built.

private[memory] def getExecutionMemoryUsageForTask(taskAttemptId: Long): Long = synchronized {

Furthermore, re-building the map takes only a small fraction of the overall query execution time.
I observed the following times while running a test with this feature:

Stream side: 300M rows
Build side: 90M rows
Rebuilding the map: 4 seconds (diff from the logs)
Total query execution time with optimization enabled: 3.9 minutes
Total query execution time with optimization disabled: 9.2 minutes

Contributor

It's a valid concern that this puts more memory pressure on the driver. Is it possible to improve the relation-building logic and make it co-locate the values of the same key? Then we don't need to rewrite the relation.

Contributor

It's a valid concern that this puts more memory pressure on the driver. Is it possible to improve the relation-building logic and make it co-locate the values of the same key? Then we don't need to rewrite the relation.

+1 on this. Besides the memory concern, I am also not sure about the performance penalty when doing a shuffled hash join with a large hash table in memory. We need to probe each key in the hash table to rebuild the table, and we kind of waste the time spent on the first build of the table.

Contributor Author

Is it possible to improve the relation-building logic and make it co-locate the values of the same key?

Since there is no ordering guarantee with Iterator[InternalRow], I don't believe there is a way to co-locate the values of the same key when building the relation, but let me think about it again.

Contributor Author

I am also not sure about the performance penalty when doing a shuffled hash join with a large hash table in memory. We need to probe each key in the hash table to rebuild the table, and we kind of waste the time spent on the first build of the table.

I totally agree regarding wasting time on the first build of the table. However, the time taken to re-build the table is minuscule compared to the overall execution time of the query.
This can be clearly seen in the example above.
Also, in the above case I was running the Spark application in local mode. If we run the application on YARN or K8s, the overall execution time will increase due to the added network cost, and the table rebuild time would become an even smaller fraction of the overall time.

@AmplabJenkins

Can one of the admins verify this patch?

@sumeetgajjar sumeetgajjar force-pushed the SPARK-37175-HashRelation-Optimization branch from a0a6e0e to 772070c on December 31, 2021 06:11
@sumeetgajjar
Contributor Author

sumeetgajjar commented Jan 5, 2022

Hi @HyukjinKwon @cloud-fan @dongjoon-hyun can you please review this PR?

(Test / javaOptions) += "-ea",
(Test / javaOptions) ++= {
  val metaspaceSize = sys.env.get("METASPACE_SIZE").getOrElse("1300m")
  val extraTestJavaArgs = Array("-XX:+IgnoreUnrecognizedVMOptions",
Contributor

what's the actual change?

Contributor Author

The actual change is to expose an environment variable to set the heap size. If set, this heap size is used as -Xmx when sbt invokes runMain.
The default 4g heap was not sufficient when I was running the newly added benchmark for 10 billion rows (the driver OOM'd), hence the change.

While making the change I noticed that the extraTestJavaArgs Array was first converted to a string and then, on the very next line, split on spaces and converted back to a Seq.
Thus I refactored the code a bit.
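
A minimal sketch of the kind of sbt change being described, assuming a hypothetical HEAP_SIZE environment variable (the actual variable name and default used in the PR may differ):

```scala
// build.sbt fragment (illustrative; the HEAP_SIZE name and 4g fallback are assumptions)
(Test / javaOptions) ++= {
  val heapSize = sys.env.getOrElse("HEAP_SIZE", "4g")  // fall back to the old default
  Seq(s"-Xmx$heapSize")
}
```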

var maybeCompactMap: Option[LongToUnsafeRowMap] = None
try {
  maybeCompactMap = Some(new LongToUnsafeRowMap(taskMemoryManager,
    Math.toIntExact(map.numUniqueKeys)))
Contributor

shall we let this LongToUnsafeRowMap allocate all the pages ahead, so that we can fail earlier if memory is not enough?

Contributor

nvm, it already does.

.version("3.3.0")
.doubleConf
.checkValue(_ > 1, "The value must be greater than 1.")
.createOptional
Contributor

Curious what would be a good default value for this? Are we expecting users to tune this config for each query? If the user needs to tune this config per query, then I feel this feature is less useful, because the user can always choose to rewrite the query to sort on the join keys before the join.

Contributor Author

As per the SQL benchmarks (not the micro-benchmark) that I ran, 4 is a good default value for this config.
A value of 4 cuts the query processing time in half.

This was not set at the query level but at the Spark application level.

Contributor

Yeah, I mean do we expect users to tune the config for each Spark app/query? I feel the improvement also depends on the access pattern from the probe/stream side. If the probe side does many lookups of keys with a lot of values, then we can see the improvement. But if the probe side does not look up these keys much, then we probably cannot see the benefit. I kind of feel this config is not so easy to use in practice.

Contributor Author
@sumeetgajjar sumeetgajjar Jan 6, 2022

do we expect users to tune the config for each Spark app/query?

No, a global default value of 4 should suffice here.

Contributor

Maybe a default hash reorder factor of 4, but a separate config for on/off (off would mean the reorder factor gets passed to the various hash relation factories as None, regardless of the setting of spark.sql.hashedRelationReorderFactor).

Contributor Author

Maybe a default hash reorder factor of 4, but a separate config for on/off (off would mean the reorder factor gets passed to the various hash relation factories as None, regardless of the setting of spark.sql.hashedRelationReorderFactor).

I like the idea, I'll add it.
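
For illustration, the pairing being suggested might look roughly like this in SQLConf style; the enable-flag name and the defaults are hypothetical, not what the PR actually adds:

```scala
// Illustrative only: config names and defaults are assumptions.
val HASHED_RELATION_REORDER_ENABLED =
  buildConf("spark.sql.hashedRelationReorder.enabled")
    .doc("When true, hashed relations may be rebuilt to co-locate rows of the same key.")
    .version("3.3.0")
    .booleanConf
    .createWithDefault(false)

val HASHED_RELATION_REORDER_FACTOR =
  buildConf("spark.sql.hashedRelationReorderFactor")
    .doc("Rebuild a hashed relation once numRows >= numUniqueKeys * this factor.")
    .version("3.3.0")
    .doubleConf
    .checkValue(_ > 1, "The value must be greater than 1.")
    .createWithDefault(4.0)
```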

Contributor

I feel the improvement also depends on the access pattern from the probe/stream side. If the probe side does many lookups of keys with a lot of values, then we can see the improvement. But if the probe side does not look up these keys much, then we probably cannot see the benefit. I kind of feel this config is not so easy to use in practice.

No, a global default value of 4 should suffice here.

@sumeetgajjar - could you help elaborate more why a global default value is sufficient per my question above?

Contributor Author

Apologies for the delayed response, I was stuck with some work stuff followed by a sick week due to covid.

@sumeetgajjar - could you help elaborate more why a global default value is sufficient per my question above?

@c21 My rationale behind suggesting a global value of 4 was based on an experiment I ran. I ran a synthetic workload with the hash optimization enabled regardless of the duplication factor of the keys, and gradually increased the duplication factor from 1 to 20. I noticed the optimization became beneficial once the duplication factor crossed 4. Based on that local experiment, I suggested a value of 4.

If the probe side does many lookups of keys with a lot of values, then we can see the improvement. But if the probe side does not look up these keys much, then we probably cannot see the benefit.

I agree; the synthetic workload I ran queried the probe side such that the majority of the keys had multiple values.

Anyway, due to concerns over the added memory pressure introduced by this optimization and the feedback that the config is difficult to tune, I've decided to close the PR. If I find a better solution, I'll reopen it.

  new UnsafeHashedRelation(key.size, numFields, compactMap)
} catch {
  case e: SparkOutOfMemoryError =>
    logWarning("Reordering BytesToBytesMap failed, " +
Contributor

HashedRelation building can happen either on the driver side (for broadcast join) or on the executor side (for shuffled hash join), so suggesting that users increase driver memory to mitigate this is not accurate. We probably want to suggest that users disable reordering if it causes an OOM here.

Contributor Author

Good catch, will modify the warning message.
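
For example, a reworded warning along these lines (the wording is hypothetical; the actual message depends on the final change):

```scala
// Hypothetical rewording: avoid assuming the build happens on the driver.
logWarning("Reordering the hashed relation failed due to insufficient memory; " +
  "falling back to the original map. Consider unsetting " +
  "spark.sql.hashedRelationReorderFactor to disable reordering.")
```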

@cloud-fan
Contributor

Since this introduces overhead (rebuild hash relation, more memory), I think we need to carefully make sure the benefit is larger than the overhead. Asking users to tune the config is really not a good way to roll out this optimization.

Some random ideas:

  1. look at the ndv of join keys. If there are very few duplicated keys, don't do this optimization
  2. doing it at the executor side, in a dynamic way. If we detect that a key is consistently looked up and has many values, compact the values of this key and put them into a new fast map. We can even set an upper bound on the number of keys to optimize, to avoid taking too much memory. We don't need to rewrite the entire hash relation.
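
A toy, self-contained illustration of idea 2 above (the types, thresholds, and lookup function are hypothetical; Spark's actual HashedRelation API is different):

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

// Toy illustration: compact only the hottest keys, lazily, with a bound on how
// many keys get the treatment so the extra memory stays capped.
class HotKeyCompactor[K, V: ClassTag](
    lookup: K => Seq[V],   // probe into the existing (uncompacted) map
    hotThreshold: Int,     // how many probes before a key counts as hot
    maxHotKeys: Int) {     // cap on extra memory spent on compacted copies

  private val probeCounts = mutable.Map.empty[K, Int].withDefaultValue(0)
  private val compacted = mutable.Map.empty[K, Array[V]]

  def get(key: K): Seq[V] = compacted.get(key) match {
    case Some(values) =>
      values.toSeq          // rows already co-located in a single array
    case None =>
      probeCounts(key) += 1
      val values = lookup(key)
      // Compact once a key proves hot and there is still budget for more keys.
      if (probeCounts(key) >= hotThreshold &&
          compacted.size < maxHotKeys && values.length > 1) {
        compacted(key) = values.toArray
      }
      values
  }
}
```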

@sumeetgajjar
Contributor Author

sumeetgajjar commented Feb 1, 2022

Apologies for the delayed response, I was stuck with some work stuff followed by a sick week due to covid.

Some random ideas:

Thanks for the suggestions, appreciate it.

Since this introduces overhead (rebuild hash relation, more memory), I think we need to carefully make sure the benefit is larger than the overhead. Asking users to tune the config is really not a good way to roll out this optimization.

Agreed; in that case, I'll close this PR for the time being. If I find a better solution, I'll reopen it.

Thank you @cloud-fan @c21 @bersprockets @singhpk234 for your comments on the PR.
