[SPARK-25474][SQL] Support spark.sql.statistics.fallBackToHdfs in data source tables #22502
Conversation
@shahidki31 The code doesn't look specific to the Parquet data source. If so, please remove "in case of parquet datasource table" from the title.
@dongjoon-hyun Thanks for the comment. I have modified the title. Kindly review the PR.
Hi @cloud-fan, could you please review the code?
@shahidki31 thanks for fixing it! Do you know where we read …
@cloud-fan Thanks. I will check and update the PR.
Test build #101276 has finished for PR 22502 at commit …
```scala
}

private def sizeInBytesFallBackToHdfs: Long = {
```
Rather than repeat the 'compression' part here, you could inline this method, return from the try block, and ignore the exception, falling through to a default return value with the 'compression' logic.
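A minimal standalone sketch of that shape, assuming hypothetical parameter names (in the PR these values come from the relation and its conf, not from arguments):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch only: try HDFS first, return from the try block, and on failure
// fall through to the compression-scaled default. All parameter names here
// are assumptions for illustration.
def estimatedSizeInBytes(
    tablePath: Path,
    hadoopConf: Configuration,
    locationSizeInBytes: Long,
    compressionFactor: Double,
    fallBackToHdfs: Boolean): Long = {
  if (fallBackToHdfs) {
    try {
      val fs = tablePath.getFileSystem(hadoopConf)
      return fs.getContentSummary(tablePath).getLength
    } catch {
      case _: java.io.IOException => // ignore and fall through to the default
    }
  }
  (locationSizeInBytes * compressionFactor).toLong
}
```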
I have updated the PR, based on the above comments
…tes is coming as default size in bytes (8.0 EB)
Test build #106905 has finished for PR 22502 at commit …
```diff
  val compressionFactor = sqlContext.conf.fileCompressionFactor
- (location.sizeInBytes * compressionFactor).toLong
+ val defaultSize = (location.sizeInBytes * compressionFactor).toLong
```
```scala
location match {
  case cfi: CatalogFileIndex if sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled =>
    DDLUtils...
  case _ => defaultSize
}
```
maybe?
Does it make sense to push the check for sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled into the method?
Thanks. Updated.

> Does it make sense to push the check for sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled into the method?

The method sizeInBytesFallBackToHdfs is supposed to get sizeInBytes from HDFS if the user enables the fallback configuration. I am not sure about moving the configuration check into sizeInBytesFallBackToHdfs.
```scala
location match {
  case _: CatalogFileIndex if sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled =>
    DDLUtils.sizeInBytesFallBackToHdfs(sparkSession,
      new Path(location.asInstanceOf[CatalogFileIndex].table.location), defaultSize)
```
If you name the case match variable, you already have what the cast gives you here. But yeah, this is cleaner.
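For reference, a sketch of the named-variable form being described, using the same calls as the snippet above:

```scala
// Naming the matched variable (cfi) removes the need for asInstanceOf:
location match {
  case cfi: CatalogFileIndex if sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled =>
    DDLUtils.sizeInBytesFallBackToHdfs(sparkSession, new Path(cfi.table.location), defaultSize)
  case _ => defaultSize
}
```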
Yes. Updated.
Test build #106935 has finished for PR 22502 at commit …
Test build #106940 has finished for PR 22502 at commit …
Looks reasonable to me. @dongjoon-hyun ?
I think the correct approach should be to add a new rule (#24715) if the issue occurs at the table level.
For example, after #24715:

```
[root@spark-3267648 spark]# bin/spark-shell --conf spark.sql.statistics.fallBackToHdfs=true
Spark context Web UI available at http://spark-3267648.lvs02.dev.ebayc3.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1561652081851).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("create table table1 (id int, name string) using parquet partitioned by (name)")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into table1 values (1, 'a')")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("explain cost select * from table1").show(false)
== Optimized Logical Plan ==
Relation[id#2,name#3] parquet, Statistics(sizeInBytes=421.0 B)

== Physical Plan ==
*(1) FileScan parquet default.table1[id#2,name#3] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex[file:/root/opensource/spark/spark-warehouse/table1], PartitionCount: 1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
```
@wangyum The issue seems to happen only with CatalogFileIndex data source tables. In the InMemoryFileIndex case, sizeInBytes is already estimated from HDFS. That is why, in the PR, I put the condition only for CatalogFileIndex (line 103 in a7e1619).
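For context, a toy illustration of why an in-memory file index doesn't need the fallback: it already knows each listed file's length, so its total is real rather than a seeded default (a simplification, not the actual Spark source):

```scala
// Toy model: an in-memory index sums the lengths of the files it listed,
// while a catalog-backed index is seeded with a default when no stats exist.
final case class ListedFile(path: String, length: Long)

def inMemorySizeInBytes(listedFiles: Seq[ListedFile]): Long =
  listedFiles.map(_.length).sum

def catalogSeededSizeInBytes(catalogStats: Option[Long], defaultSize: Long): Long =
  catalogStats.getOrElse(defaultSize) // 8.0 EB default when stats are absent
```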
Very sorry for the delay, @srowen and @shahidki31. I left a few comments. I'll review again after the PR is rebased and updated.
Thanks @srowen @dongjoon-hyun @maropu for the review comments. I have updated the code.
Test build #108271 has finished for PR 22502 at commit …
Test build #108272 has finished for PR 22502 at commit …
Test build #108277 has finished for PR 22502 at commit …
sql("CREATE TABLE t1 (id INT, name STRING) USING PARQUET PARTITIONED BY (name)") | ||
sql("INSERT INTO t1 VALUES (1, 'a')") | ||
// Analyze command updates the statistics of table `t1` | ||
sql("analyze table t1 compute statistics") |
nit.
```diff
- // Analyze command updates the statistics of table `t1`
- sql("analyze table t1 compute statistics")
+ sql("ANALYZE TABLE t1 COMPUTE STATISTICS")
```
@shahidki31 . I'll fix this during merging~
```scala
      }
    }
  }
  assert(sizeInBytesEnabledFallBack === sizeInBytesDisabledFallBack)
```
Ur, if the fallback logic returns the same value as ANALYZE TABLE t1 COMPUTE STATISTICS, this assertion doesn't prove anything. Never mind.
```diff
@@ -1484,4 +1484,44 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleton
     }
   }
 }
+
+  test("SPARK-25474: test sizeInBytes for CatalogFileIndex dataSourceTable") {
+    withSQLConf("spark.sql.statistics.fallBackToHdfs" -> "true") {
```
nit. Use SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key.
```scala
var sizeInBytesDisabledFallBack, sizeInBytesEnabledFallBack = 0L
Seq(true, false).foreach { fallBackToHdfs =>
  withSQLConf("spark.sql.statistics.fallBackToHdfs" -> fallBackToHdfs.toString) {
```
nit. Use SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key.
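Piecing together the diff fragments quoted in this review, the new test is roughly shaped as below. This is only a sketch: the lines that actually capture sizeInBytes from the relation are not quoted in this thread, so they are indicated by a comment.

```scala
test("SPARK-25474: test sizeInBytes for CatalogFileIndex dataSourceTable") {
  var sizeInBytesDisabledFallBack, sizeInBytesEnabledFallBack = 0L
  Seq(true, false).foreach { fallBackToHdfs =>
    withSQLConf("spark.sql.statistics.fallBackToHdfs" -> fallBackToHdfs.toString) {
      sql("CREATE TABLE t1 (id INT, name STRING) USING PARQUET PARTITIONED BY (name)")
      sql("INSERT INTO t1 VALUES (1, 'a')")
      // Analyze command updates the statistics of table `t1`
      sql("ANALYZE TABLE t1 COMPUTE STATISTICS")
      // ... capture the relation's sizeInBytes into one of the two vars ...
    }
  }
  assert(sizeInBytesEnabledFallBack === sizeInBytesDisabledFallBack)
}
```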
+1, LGTM. Thank you, @shahidki31, @maropu, @srowen, and @wangyum.
Merged to master.
Could you make two backporting PRs to …?
cc @gatorsmile and @cloud-fan
Thank you @dongjoon-hyun for merging. Sure, I will create PRs for backporting.
Thanks!
```diff
- (location.sizeInBytes * compressionFactor).toLong
+ val defaultSize = (location.sizeInBytes * compressionFactor).toLong
+ location match {
+   case cfi: CatalogFileIndex if sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled =>
```
We should only fall back to HDFS if the table stats are not available from the table metadata. How can we know the table stats are not available here?
I checked the code in DataSource.resolveRelation; the CatalogFileIndex is created as:

```scala
val index = new CatalogFileIndex(
  sparkSession,
  catalogTable.get,
  catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(defaultTableSize))
```
Nice catch! Then, shall we compare location.sizeInBytes with spark.sql.defaultSizeInBytes in order to check that?
SGTM
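A sketch of the check being agreed on here: only fall back when the index's size still equals the spark.sql.defaultSizeInBytes seed, i.e. when no real stats came from the catalog. The guard is an assumption; the helper name is taken from the comment below.

```scala
// Fall back to HDFS only when (a) the user enabled it and (b) the
// CatalogFileIndex still carries the defaultSizeInBytes seed, which means
// no stats were available in the table metadata.
val conf = sparkSession.sessionState.conf
location match {
  case cfi: CatalogFileIndex
      if conf.fallBackToHdfsForStatsEnabled && cfi.sizeInBytes == conf.defaultSizeInBytes =>
    CommandUtils.getSizeInBytesFallBackToHdfs(sparkSession, new Path(cfi.table.location), defaultSize)
  case _ => defaultSize
}
```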
I think (location.sizeInBytes * compressionFactor).toLong is enough, and it is faster than CommandUtils.getSizeInBytesFallBackToHdfs(sparkSession, new Path(cfi.table.location), defaultSize), which needs a round trip to the file system. That is why I only recalculate table statistics if we go down this code path:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala, lines 383 to 387 in c30b529:

```scala
val defaultTableSize = sparkSession.sessionState.conf.defaultSizeInBytes
val index = new CatalogFileIndex(
  sparkSession,
  catalogTable.get,
  catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(defaultTableSize))
```

https://github.com/apache/spark/pull/24715/files#diff-d99813bd5bbc18277e4090475e4944cfR643-R646
I am not sure there is any issue in this PR. As per this code, only if the table doesn't have any statistics will it come to the sizeInBytes method. Maybe we can add the extra check mentioned above.

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala, lines 42 to 46 in 0526529:

```scala
override def computeStats(): Statistics = {
  catalogTable
    .flatMap(_.stats.map(_.toPlanStats(output, conf.cboEnabled)))
    .getOrElse(Statistics(sizeInBytes = relation.sizeInBytes))
}
```
IIUC, the issue in this PR is that we always fall back to HDFS stats even if table stats are available.
@cloud-fan I have created a follow-up PR to add the extra condition. Thanks.
Yes. I have prepared some tests to illustrate this issue. These tests pass before this commit: …
@wangyum The first and third will pass after the PR. The second test is a bug that was fixed in the commit. Btw, the first and third tests are not CatalogFileIndex, so they won't come to this flow anyway.
Hi, guys. As @cloud-fan mentioned, since there is a regression case, I'll revert this from the release branches. cc @kiszk, since he is the release manager for 2.3.4.
What changes were proposed in this pull request?
In the case of a CatalogFileIndex data source table, sizeInBytes always comes back as the default size in bytes, which is 8.0 EB, even when the user sets fallBackToHdfsForStatsEnabled=true. So a data source table backed by a CatalogFileIndex always prefers SortMergeJoin over BroadcastJoin, even when its size is below the broadcast join threshold.
With this PR, when fallBackToHdfsForStatsEnabled=true for a CatalogFileIndex table, computeStatistics gets sizeInBytes from HDFS, so we get the actual size of the table. Hence, during a join, when the table size is below the broadcast threshold, BroadcastHashJoin is preferred instead of SortMergeJoin.
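For scale, 8.0 EB is the spark.sql.defaultSizeInBytes default (Long.MaxValue, about 9.22 × 10^18 bytes ≈ 8 EiB). A spark-shell sketch of the intended effect, mirroring the session quoted earlier in this thread (the table name is illustrative):

```scala
// With the fallback enabled, a CatalogFileIndex-backed table reports its
// real HDFS size instead of the 8.0 EB default, so a small table can fall
// under spark.sql.autoBroadcastJoinThreshold and be broadcast in joins.
spark.conf.set("spark.sql.statistics.fallBackToHdfs", "true")
spark.sql("CREATE TABLE table1 (id INT, name STRING) USING PARQUET PARTITIONED BY (name)")
spark.sql("INSERT INTO table1 VALUES (1, 'a')")
// Expect a small real sizeInBytes in the optimized plan, not 8.0 EB.
spark.sql("EXPLAIN COST SELECT * FROM table1").show(false)
```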
How was this patch tested?
Added UT