
[SPARK-25474][SQL][DOCS] Update the docs for spark.sql.statistics.fallBackToHdfs #24715

Closed · wants to merge 20 commits

Conversation

@wangyum wangyum commented May 27, 2019

What changes were proposed in this pull request?

This PR updates the documentation for spark.sql.statistics.fallBackToHdfs:

  1. This flag is effective only for Hive tables.
  2. For non-partitioned data source tables, the statistics are automatically recalculated if they are not available.
  3. For partitioned data source tables, the size falls back to 'spark.sql.defaultSizeInBytes' if table statistics are not available.

Related code:
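A condensed sketch of the fallback decision (the full DetermineTableStats rule is quoted later in this conversation; conf, fs, and tablePath abbreviate the session conf, the filesystem, and the table location):

// When a Hive table has no stats, either measure the table location on HDFS
// or fall back to the spark.sql.defaultSizeInBytes sentinel.
val sizeInBytes = if (conf.fallBackToHdfsForStatsEnabled) {
  fs.getContentSummary(tablePath).getLength  // live HDFS call
} else {
  conf.defaultSizeInBytes                    // Long.MaxValue by default
}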

How was this patch tested?

N/A

override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
  case logicalRelation @ LogicalRelation(_, _, catalogTable, _) if catalogTable.nonEmpty &&
      catalogTable.forall(DDLUtils.isDatasourceTable) && catalogTable.forall(_.stats.isEmpty) =>
    val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) {
Member:

Looks like this code will be duplicated at #24712. Are there more places where similar patterns occur? Let's do that PR first, then deduplicate if there are more such places.

Member Author (@wangyum):

I moved DetermineTableStats from HiveStrategies to DataSourceStrategy to reduce the duplication.
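Roughly, the move amounts to registering the rule where both Hive and pure data source sessions pick it up; a sketch of the shape (assuming registration via postHocResolutionRules, as the BaseSessionStateBuilder merge conflict below suggests; not the exact diff):

// Sketch: the rule joins the shared post-hoc resolution rules instead of
// being wired up only in HiveSessionStateBuilder.
override val postHocResolutionRules: Seq[Rule[LogicalPlan]] =
  DetermineDataSourceTableStats(session) +:
    PreprocessTableCreation(session) +:
    customPostHocResolutionRules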


SparkQA commented May 27, 2019

Test build #105808 has finished for PR 24715 at commit dd5a125.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DetermineDataSourceTableStats(session: SparkSession) extends Rule[LogicalPlan]

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
#	sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala

SparkQA commented Jun 28, 2019

Test build #106995 has finished for PR 24715 at commit 70d3557.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan]


SparkQA commented Jun 28, 2019

Test build #107010 has finished for PR 24715 at commit dd5f356.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 29, 2019

Test build #107034 has finished for PR 24715 at commit 22dd26e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member Author) commented Jun 29, 2019

retest this please


SparkQA commented Jun 29, 2019

Test build #107035 has finished for PR 24715 at commit 22dd26e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 1, 2019

Test build #107080 has finished for PR 24715 at commit 7748a32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

The idea LGTM, can you rebase this PR?

# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala
#	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
@wangyum (Member Author) commented Aug 15, 2019

I did some benchmarking.

Prepare data:

spark.range(100000000).repartition(10000).write.saveAsTable("test_non_partition_10000")
spark.range(100000000).repartition(300000).write.saveAsTable("test_non_partition_300000")
spark.range(100000000).selectExpr("id", "id % 5000 as c2", "id as c3").repartition(org.apache.spark.sql.functions.col("c2")).write.partitionBy("c2").saveAsTable("test_partition_5000")
spark.range(100000000).selectExpr("id", "id % 10000 as c2", "id as c3").repartition(org.apache.spark.sql.functions.col("c2")).write.partitionBy("c2").saveAsTable("test_partition_10000")

Add these lines to LogicalRelation.computeStats:

val time1 = System.currentTimeMillis()
val relationSize = relation.sizeInBytes
val time2 = System.currentTimeMillis()
val fallBackToHdfsSize = CommandUtils.getSizeInBytesFallBackToHdfs(relation.sqlContext.sparkSession, catalogTable.get)
val time3 = System.currentTimeMillis()
// scalastyle:off
println(s"Get size from relation: $relationSize, time: ${time2 - time1}")
println(s"Get size fall back to HDFS: $fallBackToHdfsSize, time: ${time3 - time2}")
// scalastyle:on

Non-partitioned table benchmark result:

scala> spark.sql("explain cost select * from test_non_partition_10000 limit 1").show
Get size from relation: 576588171, time: 22
Get size fall back to HDFS: 576588171, time: 41
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+


scala> spark.sql("explain cost select * from test_non_partition_10000 limit 1").show
Get size from relation: 576588171, time: 3
Get size fall back to HDFS: 576588171, time: 28
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+


scala>

scala> spark.sql("explain cost select * from test_non_partition_300000 limit 1").show
Get size from relation: 706507984, time: 135
Get size fall back to HDFS: 706507984, time: 2038
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+


scala> spark.sql("explain cost select * from test_non_partition_300000 limit 1").show
Get size from relation: 706507984, time: 168
Get size fall back to HDFS: 706507984, time: 3629
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+

Partitioned table benchmark result:

scala> spark.sql("explain cost select * from test_partition_5000 limit 1").show
Get size from relation: 9223372036854775807, time: 0
Get size fall back to HDFS: 1018560794, time: 46
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+


scala> spark.sql("explain cost select * from test_partition_10000 limit 1").show
Get size from relation: 9223372036854775807, time: 0
Get size fall back to HDFS: 1036799332, time: 43
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+

Benchmark result for a partitioned table with spark.sql.hive.manageFilesourcePartitions=false (set via --conf):

scala> spark.sql("set spark.sql.hive.manageFilesourcePartitions").show
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.hive.ma...|false|
+--------------------+-----+


scala> spark.sql("explain cost select * from test_partition_5000 limit 1").show
Get size from relation: 1018560794, time: 3
Get size fall back to HDFS: 1018560794, time: 45
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+


scala> spark.sql("explain cost select * from test_partition_10000 limit 1").show
Get size from relation: 1036799332, time: 865
Get size fall back to HDFS: 1036799332, time: 69
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+

@cloud-fan (Contributor):

@wangyum do you mean CommandUtils.getSizeInBytesFallBackToHdfs is very slow if there are many files?


override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
// For the data source table, we only recalculate the table statistics when it creates
// the CatalogFileIndex using defaultSizeInBytes. See SPARK-25474 for more details.
Contributor:

when it creates the CatalogFileIndex using defaultSizeInBytes -> when the table stats are not available

Member Author (@wangyum):

Done

*/
class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {

private val sessionConf = session.sessionState.conf
Contributor:

nit: just call it conf

Member Author (@wangyum):

Done

val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
logical.copy(catalogTable = Some(withStats))

case relation: HiveTableRelation
Contributor:

shall we catch InsertIntoTable(HiveTableRelation) as well?

Member Author (@wangyum):

@advancedxy already worked on this: c86a27b
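For reference, the added case presumably looks something like this (a sketch based on the reviewer's suggested pattern; hiveTableWithStats is a hypothetical helper that computes and attaches the size, and the exact code is in c86a27b):

// Sketch: also fill in stats when the Hive relation appears under an insert.
case i @ InsertIntoTable(relation: HiveTableRelation, _, _, _, _)
    if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty =>
  i.copy(table = hiveTableWithStats(relation))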

// Non-partitioned table
withTempDir { dir =>
  Seq(false, true).foreach { fallBackToHDFSForStats =>
    withSQLConf(SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key -> s"$fallBackToHDFSForStats") {
Contributor:

Why does this config have no effect in this test?

Member Author (@wangyum):

// fallBackToHDFSForStats = true: The table stats will be recalculated by DetermineTableStats
// fallBackToHDFSForStats = false: The table stats will be recalculated by FileIndex

}
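In other words, both settings end up with a real size for a non-partitioned table, so an assertion along these lines (matching the suite's helpers; hypothetical table name) passes regardless of the flag:

// Flag on: DetermineTableStats recalculates; flag off: the FileIndex does.
// Either way the size is real, never the defaultSizeInBytes sentinel.
Seq(false, true).foreach { fallBackToHDFSForStats =>
  withSQLConf(SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key -> s"$fallBackToHDFSForStats") {
    val relation = spark.table("spark_25474").queryExecution.analyzed.children.head
    assert(relation.stats.sizeInBytes > 0)
    assert(relation.stats.sizeInBytes < conf.defaultSizeInBytes)
  }
}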

// Partitioned table
Seq(false, true).foreach { fallBackToHDFSForStats =>
Contributor:

please create a test case for it. e.g.

test("partitioned data source tables support fallback to HDFS for size estimation")

Member Author (@wangyum):

Done

@wangyum (Member Author) commented Aug 15, 2019

@wangyum do you mean CommandUtils.getSizeInBytesFallBackToHdfs is very slow if there are many files?

CommandUtils.getSizeInBytesFallBackToHdfs is not very slow.
I have no idea why PartitioningAwareFileIndex.sizeInBytes is faster than CommandUtils.getSizeInBytesFallBackToHdfs.
It may be related to cluster load; I plan to switch to an idle cluster and test again tomorrow.
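One plausible factor (an assumption, not something verified in this thread): PartitioningAwareFileIndex.sizeInBytes sums file lengths that were already listed and cached in memory, while getSizeInBytesFallBackToHdfs makes a fresh filesystem round trip each time:

// PartitioningAwareFileIndex (in Spark): no RPC, just a sum over cached FileStatus entries.
override def sizeInBytes: Long = allFiles().map(_.getLen).sum

// getSizeInBytesFallBackToHdfs follows the getContentSummary pattern quoted
// later in this thread: a live call to the NameNode on every invocation.
fs.getContentSummary(tablePath).getLength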


SparkQA commented Aug 20, 2019

Test build #109415 has finished for PR 24715 at commit cc32b48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

test("External partitioned data source table does not support fallback to HDFS " +
Contributor:

how is this implemented?

Member Author (@wangyum):

We do not support it: with external partitions, the data lives outside the table location, so the HDFS fallback (which measures the table location) reports a size of 0:

              if (fallBackToHDFS) {
                assert(relation.stats.sizeInBytes === 0)
              } else {
                assert(relation.stats.sizeInBytes === conf.defaultSizeInBytes)
              }

val relation = spark.table("spark_25474").queryExecution.analyzed.children.head
assert(spark.table("spark_25474").count() === 5)
if (fallBackToHDFS) {
  assert(relation.stats.sizeInBytes === 0)
Contributor:

We have a serious issue here: the wrong stats may mislead Spark into broadcasting a very large table and causing an OOM (a reported size of 0 falls below any spark.sql.autoBroadcastJoinThreshold).

I think we can only fall back to the HDFS size for non-partitioned tables.

Member Author (@wangyum):

+1. Partitioned tables are usually very large.

Member Author (@wangyum):

@shahidki31 @dongjoon-hyun What do you think?

Member:

+1 for disabling this for partitioned tables.
This is the master branch status, isn't it? Previously, Spark safely returned 8 EB (Long.MaxValue bytes, the default of spark.sql.defaultSizeInBytes) in this case.

@shahidki31 (Contributor) commented Aug 22, 2019:

I think non-partitioned data source tables already get the correct statistics. I am not sure we need to support the fallback to HDFS for size estimation for non-partitioned tables.

Member:

@wangyum, why do we need to revert that? You can revert the functional part here and keep the test code.

Member:

For me, this PR already contains the revert here (#24715 (comment)).

Member:

Personally, I'm -1 for removing the existing test cases.

Member Author (@wangyum):

@dongjoon-hyun It is expensive to support partitioned tables with external partitions. Please see this test case; its data size is incorrect.

Related discussion:
#24715 (comment)
#24715 (comment)

So we plan not to fall back to the HDFS size for partitioned tables.

@shahidki31 (Contributor) commented Aug 22, 2019:

+1. We need to document it.

@wangyum Maybe you can do both in this PR, if we don't support the fallback config.

s"PARTITIONED BY(a) LOCATION '${dir.toURI}'")

withTempDir { partitionDir =>
spark.range(5).write.mode(SaveMode.Overwrite).parquet(partitionDir.getCanonicalPath)
Member:

spark.range(5).toDF("b") instead of spark.range(5)?

@wangyum wangyum closed this Aug 23, 2019
# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala
#	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
@wangyum wangyum reopened this Aug 23, 2019
"determining if a table is small enough to use auto broadcast joins. " +
"For non-partitioned data source table, it will be automatically recalculated if table " +
"statistics are not available. For partitioned data source table, It is " +
s"'${DEFAULT_SIZE_IN_BYTES.key}' if table statistics are not available.")
.booleanConf
@wangyum (Member Author) commented Aug 23, 2019:

@shahidki31 I have updated the documentation. Please take a look.

Contributor:

Thanks @wangyum, looks good. Shall we add it to configuration.md as well? It seems these configs are not there.

Member:

Hi, @wangyum. We should always minimize the patch diff. Please keep the original location of this conf; I don't see any difference in the following three lines.

val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS = buildConf("spark.sql.statistics.fallBackToHdfs")
.booleanConf
.createWithDefault(false)

Member Author (@wangyum):

This is because we now reference DEFAULT_SIZE_IN_BYTES in ENABLE_FALL_BACK_TO_HDFS_FOR_STATS, so we need to move DEFAULT_SIZE_IN_BYTES before ENABLE_FALL_BACK_TO_HDFS_FOR_STATS; otherwise:

[error] [warn] /home/jenkins/workspace/SparkPullRequestBuilder/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1224: Reference to uninitialized value DEFAULT_SIZE_IN_BYTES
[error] [warn]       s"'${DEFAULT_SIZE_IN_BYTES.key}' if table statistics are not available.")
[error] [warn] 
[warn] 8 warnings found
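The warning is plain Scala object-initialization order; a minimal standalone reproduction (hypothetical names):

object InitOrder {
  // `later` is referenced before its initializer has run, so scalac (with -Xlint)
  // warns "Reference to uninitialized value later", and `early` sees null at runtime.
  val early: String = s"value: $later"
  val later: String = "defined below"
}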

Member Author (@wangyum):

@shahidki31 Yes. Running the SET -v command will show the entire list of SQL configurations:
http://spark.apache.org/docs/latest/configuration.html#spark-sql
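For example, to locate this config in that list (a minimal sketch; SET -v returns key/value/meaning columns):

spark.sql("SET -v")
  .filter(org.apache.spark.sql.functions.col("key").contains("statistics"))
  .show(truncate = false)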

@wangyum wangyum changed the title [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation [SPARK-25474][SQL] Update the documentation for spark.sql.statistics.fallBackToHdfs Aug 23, 2019

SparkQA commented Aug 23, 2019

Test build #109645 has finished for PR 24715 at commit 3b0c234.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 23, 2019

Test build #109647 has finished for PR 24715 at commit d23fd47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-25474][SQL] Update the documentation for spark.sql.statistics.fallBackToHdfs [SPARK-25474][SQL][DOCS] Update the docs for spark.sql.statistics.fallBackToHdfs Aug 24, 2019
@@ -1230,6 +1224,16 @@ object SQLConf {
.bytesConf(ByteUnit.BYTE)
.createWithDefault(Long.MaxValue)

val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS = buildConf("spark.sql.statistics.fallBackToHdfs")
.doc("This flag is effective only if it is Hive table. When true, it will fall back to HDFS " +
Contributor:

Hive table -> Non-partitioned Hive table?

Member Author (@wangyum):

Currently partitioned Hive tables are supported; do you think we need to disable that?

override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
  case relation: HiveTableRelation
      if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty =>
    val table = relation.tableMeta
    val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) {
      try {
        val hadoopConf = session.sessionState.newHadoopConf()
        val tablePath = new Path(table.location)
        val fs: FileSystem = tablePath.getFileSystem(hadoopConf)
        fs.getContentSummary(tablePath).getLength
      } catch {
        case e: IOException =>
          logWarning("Failed to get table size from hdfs.", e)
          session.sessionState.conf.defaultSizeInBytes
      }
    } else {
      session.sessionState.conf.defaultSizeInBytes
    }
    val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
    relation.copy(tableMeta = withStats)

Contributor:

Yea I think so. Can you send a new PR to fix it?

Member Author (@wangyum):

OK.
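A plausible shape for that follow-up (a hypothetical sketch, not the merged change): gate the fallback on the table having no partitions, so partitioned Hive tables keep the safe default size.

// Sketch: only measure the filesystem for non-partitioned Hive tables.
val canFallBack = session.sessionState.conf.fallBackToHdfsForStatsEnabled &&
  table.partitionColumnNames.isEmpty
val sizeInBytes = if (canFallBack) {
  fs.getContentSummary(new Path(table.location)).getLength
} else {
  session.sessionState.conf.defaultSizeInBytes
}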

@@ -1230,6 +1224,16 @@ object SQLConf {
.bytesConf(ByteUnit.BYTE)
.createWithDefault(Long.MaxValue)

val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS = buildConf("spark.sql.statistics.fallBackToHdfs")
.doc("When true, it will fall back to HDFS if the table statistics are not available from " +
"table metadata. This is useful in determining if a table is small enough to use auto " +
Contributor:

auto broadcast join? maybe just say broadcast join

val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS = buildConf("spark.sql.statistics.fallBackToHdfs")
.doc("When true, it will fall back to HDFS if the table statistics are not available from " +
"table metadata. This is useful in determining if a table is small enough to use auto " +
"broadcast joins. This flag is effective only if it is non-partitioned Hive table. " +
Contributor:

only for non-partitioned Hive tables


SparkQA commented Aug 28, 2019

Test build #109850 has finished for PR 24715 at commit 44ac6cc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 28, 2019

Test build #109851 has finished for PR 24715 at commit 55d59e3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member Author) commented Aug 28, 2019

retest this please


SparkQA commented Aug 28, 2019

Test build #109853 has finished for PR 24715 at commit 55d59e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in e3b32da Aug 28, 2019
@wangyum wangyum deleted the SPARK-25474 branch August 28, 2019 11:20