[SPARK-11792] [SQL] SizeEstimator cannot provide a good size estimation of UnsafeHashedRelations #9788

Closed
wants to merge 1 commit

Conversation

@yhuai (Contributor) commented Nov 18, 2015

https://issues.apache.org/jira/browse/SPARK-11792

Right now, SizeEstimator will "think" a small UnsafeHashedRelation is several GBs.

override def estimatedSize: Option[Long] = {
  if (binaryMap != null) {
    Some(binaryMap.getTotalMemoryConsumption)
  } else {
    None
  }
}
Contributor:

this can be Option(binaryMap).map(_.getTotalMemoryConsumption)
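
For illustration only (not part of this patch), a minimal sketch of how the suggested form collapses the null check, assuming the estimatedSize override shown in this diff:

// Option(x) is None when x is null, so the explicit if/else becomes one expression.
override def estimatedSize: Option[Long] =
  Option(binaryMap).map(_.getTotalMemoryConsumption)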

@andrewor14 (Contributor)

LGTM

@rxin (Contributor) commented Nov 18, 2015

LGTM too (assuming it fixes the TPC-DS problem)

@yhuai (Contributor, Author) commented Nov 18, 2015

I am running the test. From what I have seen so far, it fixes the problem.

 * as the size of the object. Otherwise, [[SizeEstimator]] will do the estimation work.
 */
private[spark] trait SizeEstimation {
  def estimatedSize: Option[Long]
}
Contributor:

Why not return Long? If a class extends this, it should return a Long.

Contributor (Author):

On the driver side, UnsafeHashedRelation uses a Java HashMap, so binaryMap is null there and there is no exact size to report.

Contributor:

Should BytesToBytesMap implement this interface?

Contributor (Author):

I think we do not need to do that.

SizeEstimator.estimate (the public method of SizeEstimator) is used in two places: the memory store and the trait SizeTracker (a utility trait used to implement collections that need to track their estimated size). We do not put a BytesToBytesMap into the memory store, right?
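
For context, a rough sketch (assumed, not quoted from the patch) of how SizeEstimator can short-circuit for objects that implement the SizeEstimation trait; estimateWithReflection is a hypothetical name standing in for the existing reflection-based walk:

// If the object reports its own size, trust it; otherwise walk the object graph.
def estimate(obj: AnyRef): Long = obj match {
  case s: SizeEstimation if s.estimatedSize.isDefined => s.estimatedSize.get
  case _ => estimateWithReflection(obj)  // hypothetical name for the existing path
}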

Contributor:

BytesToBytesMap is used by UnsafeHashedRelation, so it is put into the memory store; that's the root cause.

Contributor:

Another approach could be to remove the reference to BlockManager in BytesToBytesMap and use SparkEnv.get when needed; the difficulty would be fixing the tests (which use a mocked BlockManager).
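
A rough illustration of that alternative (assumed, not part of this pull request; sketched in Scala for brevity even though BytesToBytesMap itself is Java): instead of holding a BlockManager field, the map would resolve it lazily via SparkEnv, which is why tests that mock BlockManager would need adjusting.

// Hypothetical sketch: no BlockManager constructor argument; look it up at the point of use.
private def blockManager: BlockManager = SparkEnv.get.blockManager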

Contributor (Author):

I'd like to get it merged first if there is no fundamental issue, so we can unblock the preview package. I can make the change if we prefer to change BytesToBytesMap instead of UnsafeHashedRelation. I agree that returning an Option is weird, but if possible I feel we should prefer changing UnsafeHashedRelation, because it is the one used as the broadcast variable.

Contributor:

Are we going to publish a preview tonight or tomorrow morning? I will try to send out a patch to fix BytesToBytesMap; if I can't make it before the preview is published, feel free to merge this one.

Contributor:

The high-level approach of not relying on reflection and object walking is a good one -- actually, with Datasets and DataFrames we don't really need size estimation. I also think relying on thread locals and SparkEnv is much less ideal than explicit dependencies.

Either way, this pull request is OK to merge in its current shape, given it's fairly critical. We can make more changes later.

Contributor:

Created #9799.

@rxin We didn't pass BlockManager down to BytesToBytesMap; it already relies on a thread local.

@SparkQA commented Nov 18, 2015

Test build #46147 has finished for PR 9788 at commit bba848a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 18, 2015

Test build #46154 has finished for PR 9788 at commit c1a9a4f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 (Contributor)

retest this please

@SparkQA commented Nov 18, 2015

Test build #46176 has finished for PR 9788 at commit c1a9a4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Nov 18, 2015

I'm going to merge this for the size estimator change.

@asfgit asfgit closed this in 1714350 Nov 18, 2015
asfgit pushed a commit that referenced this pull request Nov 18, 2015
[SPARK-11792] [SQL] SizeEstimator cannot provide a good size estimation of UnsafeHashedRelations

https://issues.apache.org/jira/browse/SPARK-11792

Right now, SizeEstimator will "think" a small UnsafeHashedRelation is several GBs.

Author: Yin Huai <yhuai@databricks.com>

Closes #9788 from yhuai/SPARK-11792.

(cherry picked from commit 1714350)
Signed-off-by: Reynold Xin <rxin@databricks.com>