[SPARK-25474][SQL][FOLLOW-UP] fallback to hdfs when relation table stats is not available #25460
Conversation
@@ -72,7 +72,8 @@ case class HadoopFsRelation(
     val compressionFactor = sqlContext.conf.fileCompressionFactor
     val defaultSize = (location.sizeInBytes * compressionFactor).toLong
     location match {
-      case cfi: CatalogFileIndex if sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled =>
+      case cfi: CatalogFileIndex if sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled
+        && defaultSize == sqlContext.conf.defaultSizeInBytes =>
nit: keep the && at the end of the first line and indent the continuation:

case cfi: CatalogFileIndex if sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled &&
    defaultSize == sqlContext.conf.defaultSizeInBytes =>
I have 2 questions here:
- Is this defaultSize the correct data size?
- If this defaultSize is the correct data size, could we benchmark
  (location.sizeInBytes * compressionFactor).toLong
  against CommandUtils.getSizeInBytesFallBackToHdfs(sparkSession, new Path(cfi.table.location), defaultSize)?
@wangyum
The default size comes in as Long.MaxValue. If it were the correct size, it would not fall back to HDFS. Even if it is the correct size, falling back to HDFS gives the same result. Also, this is not a performance-sensitive path, I think: the flow only reaches here when statistics need to be computed, e.g. during a join operation, and if the table already has statistics the flow will not come here at all.
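As a standalone sketch of the guard being discussed (simplified, hypothetical names mirroring the PR, not the actual Spark classes): when a table has no stats, its reported size arrives as the session default, Long.MaxValue (8.0EB) unless reconfigured, so comparing against the default is how the patch detects "no stats":

```scala
// Simplified sketch of the fallback guard discussed above.
// Illustration only, not the actual Spark implementation.
object FallbackSketch {
  // spark.sql.defaultSizeInBytes defaults to Long.MaxValue (8.0EB)
  val defaultSizeInBytes: Long = Long.MaxValue
  val fallBackToHdfsForStatsEnabled: Boolean = true

  // Fall back to scanning HDFS only when the reported size is still
  // the session default, i.e. no real stats were found.
  def shouldFallBack(locationSizeInBytes: Long): Boolean =
    fallBackToHdfsForStatsEnabled && locationSizeInBytes == defaultSizeInBytes

  def main(args: Array[String]): Unit = {
    println(shouldFallBack(Long.MaxValue)) // no stats: fall back
    println(shouldFallBack(1024L))         // real stats: do not
  }
}
```

This also illustrates the reviewer's concern below: the guard only works while the real size can never equal the configured default.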
Please see benchmark here: #24715 (comment)
@wangyum
(location.sizeInBytes * compressionFactor).toLong is always 8.0EB, even after PR #24715.
I am not sure I understand your comment. If the statistics don't exist, it has to fall back to HDFS, right? From then on it will read from the stats cache.
The number of times we fall back to HDFS after this PR and after #24715 is also the same, right?
We can avoid this when constructing this CatalogFileIndex:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
Lines 383 to 387 in c30b529

val defaultTableSize = sparkSession.sessionState.conf.defaultSizeInBytes
val index = new CatalogFileIndex(
  sparkSession,
  catalogTable.get,
  catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(defaultTableSize))
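The getOrElse pattern quoted above can be sketched standalone (simplified types; Stats here is a hypothetical stand-in for Spark's CatalogStatistics):

```scala
// Standalone sketch of the quoted construction: the index size comes
// from the catalog stats when present, else the session default.
object IndexSizeSketch {
  final case class Stats(sizeInBytes: BigInt) // stand-in for CatalogStatistics

  val defaultTableSize: Long = Long.MaxValue

  def indexSize(stats: Option[Stats]): Long =
    stats.map(_.sizeInBytes.toLong).getOrElse(defaultTableSize)
}
```

indexSize(Some(Stats(1024))) yields 1024, while indexSize(None) yields Long.MaxValue, which is exactly the value the HadoopFsRelation guard later compares against.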
btw, shouldn't it be location.sizeInBytes == sqlContext.conf.defaultSizeInBytes rather than defaultSize == sqlContext.conf.defaultSizeInBytes? See: #22502 (comment)
But does this comparison work well even when sqlContext.conf.defaultSizeInBytes is changed by users?
Ah, good point! Basically there is no way to tell whether the table stats are available at this point. sqlContext.conf.defaultSizeInBytes is configurable, and it's possible that the table stats just happen to equal sqlContext.conf.defaultSizeInBytes.
#24715 seems to be able to fix it.
Yea, so how about closing this and moving to #24715 for more discussion about solving this case?
If the table statistics are available here:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
Lines 387 to 390 in 0ea8db9

val index = new CatalogFileIndex(
  sparkSession,
  catalogTable.get,
  catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(defaultTableSize))
then they should be available here too, right?

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala
Lines 43 to 45 in 0ea8db9

catalogTable
  .flatMap(_.stats.map(_.toPlanStats(output, conf.cboEnabled)))
  .getOrElse(Statistics(sizeInBytes = relation.sizeInBytes))
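The LogicalRelation fallback quoted above can be sketched in isolation (Statistics and CatalogTable here are simplified stand-ins, not the Spark classes, and the toPlanStats conversion is elided):

```scala
// Sketch of the quoted stats resolution: plan stats come from the
// catalog when present, else from the relation's own size estimate
// (which may still be the 8.0EB default).
object PlanStatsSketch {
  final case class Statistics(sizeInBytes: BigInt)
  final case class CatalogTable(stats: Option[Statistics]) // stand-in

  def planStats(catalogTable: Option[CatalogTable], relationSize: BigInt): Statistics =
    catalogTable
      .flatMap(_.stats)
      .getOrElse(Statistics(sizeInBytes = relationSize))
}
```

If the catalog carries stats, planStats returns them; only when they are absent does the relation's own (possibly default) size leak into the plan, which is the point being made here.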
So, ideally the flow shouldn't reach the fallback logic if the table statistics already exist. That is why, even after #24715, location.sizeInBytes is 8.0EB.
@cloud-fan Could you please provide a reproducible test where the issue can happen?
Test build #109146 has finished for PR 25460 at commit
retest this please
Retest this please.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala
Thank you for the update, @shahidki31. Hi, @cloud-fan, @maropu, @wangyum, @srowen. What do you think about that?
I feel ok to do so (I think we should fix the existing issue on master/2.4/2.3...)
To @shahidki31: @maropu and @cloud-fan meant the corner case where the table size is equal to the user-configured value (not 8.0EB). Let's say we set the configuration to 1GB and we have a static table T1 whose size happens to be 1GB. In that case, every query on that table might invoke this function. Although it's a very special case, it's a regression. So, @cloud-fan and @maropu suggested closing this PR and proceeding with #24715. I'm +1 for that suggestion because that is the correct way. I know you are worried that #24715 doesn't resolve the 8.0EB issue. However, that should be covered by your UTs in the previous PR. In the worst case, some of your code might be reverted, but your test cases should survive there. It's your contribution. I believe @wangyum's PR will pass your existing test cases in addition to his new test code. That's the way we make Apache Spark stronger. What do you think about this, @shahidki31? It's a way of collaboration.
@@ -72,7 +72,8 @@ case class HadoopFsRelation(
     val compressionFactor = sqlContext.conf.fileCompressionFactor
     val defaultSize = (location.sizeInBytes * compressionFactor).toLong
     location match {
-      case cfi: CatalogFileIndex if sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled =>
+      case cfi: CatalogFileIndex if sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled &&
+          location.sizeInBytes == sqlContext.conf.defaultSizeInBytes =>
The point @maropu and I were making is that location.sizeInBytes == sqlContext.conf.defaultSizeInBytes doesn't mean the table stats are not available. sqlContext.conf.defaultSizeInBytes is configurable, and it's possible that the table stats are the same as sqlContext.conf.defaultSizeInBytes, in which case we shouldn't fall back to HDFS.
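The ambiguity can be shown concretely (a hedged sketch, not Spark code): once a user lowers the configured default, a real table of exactly that size becomes indistinguishable from a table with no stats.

```scala
// Sketch of the corner case: size == configured default cannot
// distinguish "stats missing" from "stats happen to equal the default".
object AmbiguitySketch {
  def looksLikeMissingStats(sizeInBytes: Long, defaultSizeInBytes: Long): Boolean =
    sizeInBytes == defaultSizeInBytes

  val oneGiB: Long = 1L << 30
}
```

With defaultSizeInBytes lowered to 1 GiB, a genuine 1 GiB table triggers the HDFS fallback on every query, which is the regression described in this thread.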
Yes, I agree. My point was: if the table statistics are not empty, it will not fall back to HDFS even without the condition, so the PR itself isn't necessary. I will close this in favor of #24715.
Thanks @dongjoon-hyun, @cloud-fan, @maropu for the feedback.
Test build #109157 has finished for PR 25460 at commit
Test build #109180 has finished for PR 25460 at commit
What changes were proposed in this pull request?
When the table relation stats are not empty, do not fall back to HDFS for size estimation.
How was this patch tested?
Existing tests