[SPARK-11792] [SQL] SizeEstimator cannot provide a good size estimation of UnsafeHashedRelations #9788
Conversation
```scala
  Some(binaryMap.getTotalMemoryConsumption)
} else {
  None
}
```
This can be `Option(binaryMap).map(_.getTotalMemoryConsumption)`.
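The suggested refactor works because `Option.apply` returns `None` for `null`, so the explicit null check and `if/else` collapse into a single `map`. A minimal standalone sketch (the `BinaryMap` class below is a stand-in for `BytesToBytesMap`, not the real Spark class):

```scala
object OptionMapSketch {
  // Stand-in for BytesToBytesMap; the name and method are assumptions.
  class BinaryMap {
    def getTotalMemoryConsumption: Long = 1024L
  }

  // Option(x) is None when x is null, so no explicit null check is needed.
  def sizeOf(binaryMap: BinaryMap): Option[Long] =
    Option(binaryMap).map(_.getTotalMemoryConsumption)

  def main(args: Array[String]): Unit = {
    assert(sizeOf(new BinaryMap) == Some(1024L)) // non-null: exact size
    assert(sizeOf(null) == None)                 // null: no estimate
    println("ok")
  }
}
```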
LGTM
LGTM too (assuming it fixes the TPC-DS problem)
I am running the test. From what I have seen so far, it fixes the problem.
```scala
 * as the size of the object. Otherwise, [[SizeEstimator]] will do the estimation work.
 */
private[spark] trait SizeEstimation {
  def estimatedSize: Option[Long]
```
Why not return Long? If a class extends this, it should return a Long.
On the driver side, UnsafeHashedRelation uses a Java HashMap, so no exact size is available there.
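This is why `estimatedSize` returns `Option[Long]` rather than `Long`: the same relation can report an exact size (executor side, backed by `BytesToBytesMap`) or no size at all (driver side, backed by a Java HashMap), in which case the caller falls back to the reflection-based walk. A hedged sketch of that contract, with simplified stand-ins rather than the real Spark classes:

```scala
object SizeSketch {
  trait SizeEstimation {
    def estimatedSize: Option[Long]
  }

  // Stand-in relation: Some(size) models the executor side (BytesToBytesMap,
  // exact size known); None models the driver side (Java HashMap, unknown).
  class FakeRelation(binaryMapSize: Option[Long]) extends SizeEstimation {
    def estimatedSize: Option[Long] = binaryMapSize
  }

  // Stand-in for SizeEstimator.estimate: take the exact size when offered,
  // otherwise run the (faked) reflection-based estimation.
  def estimate(obj: AnyRef, reflectionFallback: => Long): Long = obj match {
    case s: SizeEstimation => s.estimatedSize.getOrElse(reflectionFallback)
    case _                 => reflectionFallback
  }

  def main(args: Array[String]): Unit = {
    assert(estimate(new FakeRelation(Some(4096L)), 999L) == 4096L) // exact path
    assert(estimate(new FakeRelation(None), 999L) == 999L)         // fallback path
    println("ok")
  }
}
```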
Should the BytesToBytesMap implement this interface?
I think we do not need to do that. SizeEstimator.estimate (the public method of SizeEstimator) is used in two places: one is the memory store and the other is the trait SizeTracker (a utility trait used to implement collections that need to track their estimated size). We do not put BytesToBytesMap into the memory store, right?
BytesToBytesMap is used by UnsafeHashedRelation, so it is put into the memory store; that's the root cause.
Another approach could be to remove the reference to BlockManager in BytesToBytesMap and use SparkEnv.get when needed; the difficulty would be fixing the tests (which use a mocked BlockManager).
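The alternative floated here can be sketched as follows (the names below are simplified stand-ins, not the real Spark classes): drop the stored BlockManager field, which SizeEstimator's object walk would otherwise traverse into, and resolve it lazily through a SparkEnv-style global only when it is actually needed.

```scala
object LazyEnvSketch {
  class BlockManagerStub

  // Stand-in for SparkEnv: a process-wide handle, set during startup.
  object SparkEnvStub {
    @volatile var blockManager: BlockManagerStub = _
  }

  class MapWithoutReference {
    // No BlockManager field on the instance, so there is nothing for a
    // reflection-based object walk to follow; the lookup happens per call.
    private def blockManager: BlockManagerStub = SparkEnvStub.blockManager
    def canSpill: Boolean = blockManager != null
  }

  def main(args: Array[String]): Unit = {
    val m = new MapWithoutReference
    assert(!m.canSpill)                          // env not initialized yet
    SparkEnvStub.blockManager = new BlockManagerStub
    assert(m.canSpill)                           // resolved lazily at call time
    println("ok")
  }
}
```

The trade-off, as noted in the following comment, is that a global lookup is an implicit dependency, which makes unit tests with a mocked BlockManager harder to wire up.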
I'd like to get it merged first if there is no fundamental issue, so we can unblock the preview package. I can make the change if we prefer to change BytesToBytesMap instead of UnsafeHashedRelation. I agree that returning an Option is weird. But I feel that, if possible, we should prefer changing UnsafeHashedRelation because it is the one used as the broadcast variable.
Are we going to publish a preview tonight or tomorrow morning? I will try to send out a patch to fix BytesToBytesMap; if I can't make it before the preview is published, feel free to merge this one.
The high-level approach of not relying on reflection and object walking is a good one; with Datasets and DataFrames, we don't really need size estimation. I also think relying on thread locals and SparkEnv is much less ideal than explicit dependencies.
Either way, this pull request is OK to merge in its current shape, given it's fairly critical. We can do more changes later.
Test build #46147 has finished for PR 9788 at commit
Test build #46154 has finished for PR 9788 at commit
retest this please
Test build #46176 has finished for PR 9788 at commit
I'm going to merge this for the size estimator change.
…n of UnsafeHashedRelations

https://issues.apache.org/jira/browse/SPARK-11792

Right now, SizeEstimator will "think" a small UnsafeHashedRelation is several GBs.

Author: Yin Huai <yhuai@databricks.com>

Closes #9788 from yhuai/SPARK-11792.

(cherry picked from commit 1714350)
Signed-off-by: Reynold Xin <rxin@databricks.com>