[SPARK-11792] [SQL] SizeEstimator cannot provide a good size estimation of UnsafeHashedRelations #9788
Conversation
```scala
  Some(binaryMap.getTotalMemoryConsumption)
} else {
  None
}
```
This can be `Option(binaryMap).map(_.getTotalMemoryConsumption)`.
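The suggested refactor works because `Option.apply` returns `None` for `null`, so the explicit null check and `if/else` collapse into a single `map`. A minimal standalone sketch (the `BinaryMap` class below is a stand-in for `BytesToBytesMap`, not the real Spark class):

```scala
object OptionMapSketch {
  // Stand-in for BytesToBytesMap; the name and method are assumptions.
  class BinaryMap {
    def getTotalMemoryConsumption: Long = 1024L
  }

  // Option(x) is None when x is null, so no explicit null check is needed.
  def sizeOf(binaryMap: BinaryMap): Option[Long] =
    Option(binaryMap).map(_.getTotalMemoryConsumption)

  def main(args: Array[String]): Unit = {
    assert(sizeOf(new BinaryMap) == Some(1024L)) // non-null: exact size
    assert(sizeOf(null) == None)                 // null: no estimate
    println("ok")
  }
}
```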
LGTM
LGTM too (assuming it fixes the TPC-DS problem)
I am running the test. From what I have seen so far, it fixes the problem.
```scala
 * as the size of the object. Otherwise, [[SizeEstimator]] will do the estimation work.
 */
private[spark] trait SizeEstimation {
  def estimatedSize: Option[Long]
```
Why not return Long? If a class extends this, it should return a Long.
On the driver side, UnsafeHashedRelation uses a Java HashMap, so no exact size is available there.
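This is why `estimatedSize` returns `Option[Long]` rather than `Long`: the same relation can report an exact size (executor side, backed by `BytesToBytesMap`) or no size at all (driver side, backed by a Java HashMap), in which case the caller falls back to the reflection-based walk. A hedged sketch of that contract, with simplified stand-ins rather than the real Spark classes:

```scala
object SizeSketch {
  trait SizeEstimation {
    def estimatedSize: Option[Long]
  }

  // Stand-in relation: Some(size) models the executor side (BytesToBytesMap,
  // exact size known); None models the driver side (Java HashMap, unknown).
  class FakeRelation(binaryMapSize: Option[Long]) extends SizeEstimation {
    def estimatedSize: Option[Long] = binaryMapSize
  }

  // Stand-in for SizeEstimator.estimate: take the exact size when offered,
  // otherwise run the (faked) reflection-based estimation.
  def estimate(obj: AnyRef, reflectionFallback: => Long): Long = obj match {
    case s: SizeEstimation => s.estimatedSize.getOrElse(reflectionFallback)
    case _                 => reflectionFallback
  }

  def main(args: Array[String]): Unit = {
    assert(estimate(new FakeRelation(Some(4096L)), 999L) == 4096L) // exact path
    assert(estimate(new FakeRelation(None), 999L) == 999L)         // fallback path
    println("ok")
  }
}
```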
Should the BytesToBytesMap implement this interface?
I think we do not need to do that. SizeEstimator.estimate (the public method of SizeEstimator) is used in two places: one is the memory store and the other is the trait SizeTracker (a utility trait used to implement collections that need to track their estimated size). We do not put BytesToBytesMap into the memory store, right?
BytesToBytesMap is used by UnsafeHashedRelation, so it is put into the memory store; that's the root cause.
Another approach could be to remove the reference to BlockManager in BytesToBytesMap and use SparkEnv.get when needed; the difficulty would be fixing the tests (which use a mocked BlockManager).
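The alternative floated here can be sketched as follows (the names below are simplified stand-ins, not the real Spark classes): drop the stored BlockManager field, which SizeEstimator's object walk would otherwise traverse into, and resolve it lazily through a SparkEnv-style global only when it is actually needed.

```scala
object LazyEnvSketch {
  class BlockManagerStub

  // Stand-in for SparkEnv: a process-wide handle, set during startup.
  object SparkEnvStub {
    @volatile var blockManager: BlockManagerStub = _
  }

  class MapWithoutReference {
    // No BlockManager field on the instance, so there is nothing for a
    // reflection-based object walk to follow; the lookup happens per call.
    private def blockManager: BlockManagerStub = SparkEnvStub.blockManager
    def canSpill: Boolean = blockManager != null
  }

  def main(args: Array[String]): Unit = {
    val m = new MapWithoutReference
    assert(!m.canSpill)                          // env not initialized yet
    SparkEnvStub.blockManager = new BlockManagerStub
    assert(m.canSpill)                           // resolved lazily at call time
    println("ok")
  }
}
```

The trade-off, as noted in the following comment, is that a global lookup is an implicit dependency, which makes unit tests with a mocked BlockManager harder to wire up.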
I'd like to get it merged first if there is no fundamental issue, so we can unblock the preview package. I can make the change if we prefer to change BytesToBytesMap instead of UnsafeHashedRelation. I agree that returning an Option is weird. But I feel that, if possible, we should prefer changing UnsafeHashedRelation because it is the one used as the broadcast variable.
Are we going to publish a preview tonight or tomorrow morning? I will try to send out a patch to fix BytesToBytesMap; if I can't make it before the preview is published, feel free to merge this one.
The high-level approach of not relying on reflection and object walking is a good one; with Datasets and DataFrames, we don't really need size estimation. I also think relying on thread locals and SparkEnv is much less ideal than explicit dependencies.
Either way, this pull request is OK to merge in its current shape, given it's fairly critical. We can do more changes later.
Test build #46147 has finished for PR 9788 at commit
Test build #46154 has finished for PR 9788 at commit
retest this please
Test build #46176 has finished for PR 9788 at commit
I'm going to merge this for the size estimator change.
…n of UnsafeHashedRelations

https://issues.apache.org/jira/browse/SPARK-11792

Right now, SizeEstimator will "think" a small UnsafeHashedRelation is several GBs.

Author: Yin Huai <yhuai@databricks.com>

Closes #9788 from yhuai/SPARK-11792.

(cherry picked from commit 1714350)
Signed-off-by: Reynold Xin <rxin@databricks.com>