[SPARK-21052][SQL] Add hash map metrics to join #18301
Conversation
Test build #78047 has finished for PR 18301 at commit
Can you put a screenshot of the UI up, for both join and aggregate?
@@ -573,8 +586,11 @@ private[execution] final class LongToUnsafeRowMap(val mm: TaskMemoryManager, cap
  private def updateIndex(key: Long, address: Long): Unit = {
    var pos = firstSlot(key)
    assert(numKeys < array.length / 2)
    numKeyLookups += 1
You should also add this code to the get and the getValue methods.
Should we? It seems to me that we should only care about the hash collisions that happen when inserting the data into the hash map.
IMO we should. The number of required probes is different per key, and also depends on the order in which the map was constructed. If you combine this with some skew and missing keys, the number of probes can be much higher than expected.
You could even argue that we do not really care about the number of probes when building the map.
Yeah, OK, I think you're right. We should also care about collisions when searching keys in the join operator. I'll update this in the next commit.
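The conclusion of this thread (counting probes on lookups as well as on inserts) can be sketched with a toy open-addressing map. This is purely illustrative; `ProbeCountingMap` and its members are made-up names, not Spark's actual `LongToUnsafeRowMap`, and it assumes non-negative keys since `-1` marks an empty slot:

```scala
// Illustrative sketch only: counts hash probes on both insert and
// lookup, as agreed in the thread above. Assumes keys >= 0.
class ProbeCountingMap(capacity: Int) {
  private val keys = Array.fill[Long](capacity)(-1L) // -1 marks an empty slot
  private val values = new Array[Long](capacity)
  private var numKeyLookups = 0L
  private var numProbes = 0L

  private def firstSlot(key: Long): Int = (key.hashCode & Int.MaxValue) % capacity
  private def nextSlot(pos: Int): Int = (pos + 1) % capacity

  def put(key: Long, value: Long): Unit = {
    numKeyLookups += 1
    var pos = firstSlot(key)
    numProbes += 1
    // Walk forward until we find an empty slot or the same key.
    while (keys(pos) != -1L && keys(pos) != key) {
      pos = nextSlot(pos)
      numProbes += 1
    }
    keys(pos) = key
    values(pos) = value
  }

  def get(key: Long): Option[Long] = {
    numKeyLookups += 1          // lookups are counted too, per the review
    var pos = firstSlot(key)
    numProbes += 1
    while (keys(pos) != -1L && keys(pos) != key) {
      pos = nextSlot(pos)
      numProbes += 1
    }
    if (keys(pos) == key) Some(values(pos)) else None
  }

  def getAverageProbesPerLookup: Double =
    if (numKeyLookups == 0) 0.0 else numProbes.toDouble / numKeyLookups
}
```

With this shape, a key that collides on insert also pays extra probes on every later lookup, which is exactly why the reviewers wanted both paths instrumented.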
Ain't you on a beach somewhere?!
Thanks for the review even though you're on a beach.
I'd shorten it to "avg hash probe". Also, do we really need min, med, max? Maybe just a single global avg?
So just get the global average of all avg hash probe metrics of all tasks? If there's skew, wouldn't we like to see min, med, max?
Yes, but I just feel it is getting very long and verbose...
Also, the avg probe probably shouldn't be an integer. At least we should show something like 1.9?
When preparing the values for the UI, we divide the long by 1000 to get a float back. So it's a workaround for the long-based `SQLMetric`. Does it sound too hacky to you?
Maybe just min and max? Or med and max?
Test build #78088 has finished for PR 18301 at commit
Test build #78089 has finished for PR 18301 at commit
Test build #78091 has finished for PR 18301 at commit
// Because `SQLMetric` only stores long value, in order to store double average metrics, we
// multiply the given double with a base integer. When showing the metrics, it will be
// divided by the base integer to restore back a double.
def setWithDouble(v: Double): Unit = _value = (v * SQLMetrics.baseForAvgMetric).toLong
Not sure if you think this is a bit hacky. Currently I have no better idea for storing a float value into a SQLMetric. Any suggestions welcome.
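The round-trip behind this workaround (scale the double into a long on write, divide on display) can be sketched as follows. `SQLMetricsSketch` and its members are illustrative names for this example, not Spark's actual `SQLMetrics` API; the base of 1000 follows the "dividing the long with 1000" remark earlier in the thread:

```scala
// Sketch of the long-based metric workaround discussed above.
// A double average is stored in a long metric by scaling it up;
// the UI divides by the same base to recover the double.
object SQLMetricsSketch {
  val baseForAvgMetric: Int = 1000

  // Store a double average into a long-valued metric slot.
  def toLongMetric(v: Double): Long = (v * baseForAvgMetric).toLong

  // Restore the double when rendering the metric in the UI.
  def fromLongMetric(v: Long): Double = v.toDouble / baseForAvgMetric
}
```

Note the truncation to `Long` loses precision beyond three decimal places, which is acceptable for a UI-facing average but is indeed a bit hacky, as the comment concedes.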
Force-pushed eb979dd to 59c3e93
Test build #78105 has finished for PR 18301 at commit
Test build #78107 has finished for PR 18301 at commit
Test build #78109 has finished for PR 18301 at commit
cc @cloud-fan @gatorsmile for review.
Hey, I didn't track super closely, but it is pretty important to show at least one more digit, e.g. 1.7, rather than just 2.
@rxin I just reverted it in previous commits. @cloud-fan should I revert it back?
Test build #78859 has finished for PR 18301 at commit
@viirya ok let's add it back
//
// WholeStageCodegen enabled:
// ... ->
// WholeStageCodegen(nodeId = 0, Filter(nodeId = 4) -> Project(nodeId = 3) ->
Can you format it a little bit, to indicate that we only have a WholeStageCodegen, and all other plans are the inner children of WholeStageCodegen?
Test build #78862 has finished for PR 18301 at commit
Test build #78878 has finished for PR 18301 at commit
retest this please.
LGTM, pending jenkins
Test build #78882 has finished for PR 18301 at commit
retest this please.
Test build #78893 has finished for PR 18301 at commit
thanks, merging to master!
Thanks! @cloud-fan @rxin @hvanhovell @dongjoon-hyun
## What changes were proposed in this pull request?

This adds the average hash map probe metric to join operators such as `BroadcastHashJoin` and `ShuffledHashJoin`. This PR adds the API to `HashedRelation` to get the average hash map probe.

## How was this patch tested?

Related test cases are added.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#18301 from viirya/SPARK-21052.
/**
 * Returns the average number of probes per key lookup.
 */
def getAverageProbesPerLookup(): Double
`def getAverageProbesPerLookup(): Double` -> `def getAverageProbesPerLookup: Double`
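For context, this suggestion follows the Scala convention that a pure, parameterless accessor is declared without `()`, while methods with side effects keep the empty parameter list. A minimal illustration, with made-up names for the example:

```scala
// Scala convention: a side-effect-free accessor is declared without
// parentheses. Callers then invoke it without () as well.
trait HashProbeStats {
  def getAverageProbesPerLookup: Double // accessor: no ()
}

// Hypothetical implementation returning a fixed average.
class FixedStats(avg: Double) extends HashProbeStats {
  override def getAverageProbesPerLookup: Double = avg
}
```

Dropping the parentheses also lets the member later be overridden by a `val` without breaking callers, which is one reason the convention exists.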
If you insist on this change, I can do it in a related PR or a follow-up PR.
@@ -273,6 +279,8 @@ private[joins] class UnsafeHashedRelation(
  override def read(kryo: Kryo, in: Input): Unit = Utils.tryOrIOException {
    read(in.readInt, in.readLong, in.readBytes)
  }

  override def getAverageProbesPerLookup(): Double = binaryMap.getAverageProbesPerLookup()
override def getAverageProbesPerLookup: Double = binaryMap.getAverageProbesPerLookup
How about getting rid of the `()`? Also remove the useless `()` at the call site.
A dumb question: why not report the number of hash collisions?
But we want to show the average number of probes, which covers both cases (a slot taken by a key with a different hash, and an equal key hash, i.e. a hash collision) before finding an empty slot in the map or verifying it is an existing key in the map.
Let me rephrase my question: why do users care about the number of probes?
My understanding is that even if a key doesn't have a collision when inserting, it may still need multiple probes to find an empty slot. Personally I think the number of collisions doesn't tell users much information; it reflects how well the hash algorithm is designed.
@viirya Could you show me the code that needs multiple probes to find an empty slot?
In the probe at L473, if the slot pointed to by the hash code is not empty, it's possible that there's a hash collision (equal hash codes, different keys), but it's also possible that the slot is occupied by a key with a different hash (the if condition at L475 is false). In this case, we continue to look for an empty slot by going forward.
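The scenario described here can be shown with a tiny linear-probing table. This is an illustrative sketch only (`ProbeDemo` is not Spark code): the third key has a unique hash, yet its home slot is occupied by a key displaced there by an earlier collision, so it still pays an extra probe:

```scala
// Sketch: with linear probing, a slot can be occupied by a key whose
// hash is different, so a lookup/insert walks forward even when there
// is no true hash collision for the probing key.
object ProbeDemo {
  val capacity = 8

  // Insert (key, hash) pairs with linear probing; return how many
  // probes each insertion needed.
  def insertAll(entries: Seq[(String, Int)]): Map[String, Int] = {
    val table = new Array[String](capacity)
    entries.map { case (key, hash) =>
      var pos = hash % capacity
      var probes = 1
      while (table(pos) != null) { // slot taken, regardless of whose hash
        pos = (pos + 1) % capacity
        probes += 1
      }
      table(pos) = key
      key -> probes
    }.toMap
  }
}
```

Inserting hashes 1, 1, 2: the second key collides and spills into slot 2, so the third key (hash 2, no collision with anyone) still needs two probes. This is why counting probes rather than collisions better reflects the real lookup cost.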