
[SPARK-25474][SQL][DOCS] Update the docs for spark.sql.statistics.fallBackToHdfs #24715

Closed · wants to merge 20 commits into apache:master from wangyum:SPARK-25474

Conversation

@wangyum (Member) commented May 27, 2019

What changes were proposed in this pull request?

This PR updates `spark.sql.statistics.fallBackToHdfs`'s doc:

  1. This flag is effective only for Hive tables.
  2. For non-partitioned data source tables, the size is automatically recalculated if table statistics are not available.
  3. For partitioned data source tables, it is 'spark.sql.defaultSizeInBytes' if table statistics are not available.

Related code: see the call chains quoted in the cherry-picked commit message at the end of this thread.

How was this patch tested?

N/A
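
For context, the flag under discussion is an ordinary SQL conf (it defaults to false, per the conf definition quoted later in this thread). A minimal usage sketch, assuming an existing SparkSession named spark:

// Fall back to HDFS file sizes when catalog statistics are missing.
spark.conf.set("spark.sql.statistics.fallBackToHdfs", "true")
// Equivalently, in SQL:
spark.sql("SET spark.sql.statistics.fallBackToHdfs=true")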

override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
  case logicalRelation @ LogicalRelation(_, _, catalogTable, _) if catalogTable.nonEmpty &&
      catalogTable.forall(DDLUtils.isDatasourceTable) && catalogTable.forall(_.stats.isEmpty) =>
    val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) {

@HyukjinKwon (Member) commented May 27, 2019

Looks like this code duplicates #24712. Are there other places where similar patterns occur? Let's do that PR first, then deduplicate if there are more places to deduplicate.

@wangyum (Author) commented Jun 28, 2019

I moved DetermineTableStats from HiveStrategies to DataSourceStrategy to reduce the duplication.

@SparkQA commented May 27, 2019

Test build #105808 has finished for PR 24715 at commit dd5a125.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DetermineDataSourceTableStats(session: SparkSession) extends Rule[LogicalPlan]
wangyum added 2 commits Jun 28, 2019
Merge remote-tracking branch 'upstream/master' into SPARK-25474
# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
#	sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala
@SparkQA commented Jun 28, 2019

Test build #106995 has finished for PR 24715 at commit 70d3557.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan]
@SparkQA commented Jun 28, 2019

Test build #107010 has finished for PR 24715 at commit dd5f356.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA commented Jun 29, 2019

Test build #107034 has finished for PR 24715 at commit 22dd26e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.
@wangyum (Author) commented Jun 29, 2019

retest this please

@SparkQA commented Jun 29, 2019

Test build #107035 has finished for PR 24715 at commit 22dd26e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA commented Jul 1, 2019

Test build #107080 has finished for PR 24715 at commit 7748a32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@cloud-fan (Contributor) commented Aug 15, 2019

The idea LGTM, can you rebase this PR?

wangyum added 3 commits Aug 15, 2019
Merge remote-tracking branch 'upstream/master' into SPARK-25474
# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala
#	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala
@wangyum (Author) commented Aug 15, 2019

I ran some benchmarks.

Prepare data:

spark.range(100000000).repartition(10000).write.saveAsTable("test_non_partition_10000")
spark.range(100000000).repartition(300000).write.saveAsTable("test_non_partition_300000")
spark.range(100000000).selectExpr("id", "id % 5000 as c2", "id as c3").repartition(org.apache.spark.sql.functions.col("c2")).write.partitionBy("c2").saveAsTable("test_partition_5000")
spark.range(100000000).selectExpr("id", "id % 10000 as c2", "id as c3").repartition(org.apache.spark.sql.functions.col("c2")).write.partitionBy("c2").saveAsTable("test_partition_10000")

Add these lines to LogicalRelation.computeStats:

val time1 = System.currentTimeMillis()
val relationSize = relation.sizeInBytes
val time2 = System.currentTimeMillis()
val fallBackToHdfsSize = CommandUtils.getSizeInBytesFallBackToHdfs(relation.sqlContext.sparkSession, catalogTable.get)
val time3 = System.currentTimeMillis()
// scalastyle:off
println(s"Get size from relation: $relationSize, time: ${time2 - time1}")
println(s"Get size fall back to HDFS: $fallBackToHdfsSize, time: ${time3 - time2}")
// scalastyle:on

Non-partitioned table benchmark results:

scala> spark.sql("explain cost select * from test_non_partition_10000 limit 1").show
Get size from relation: 576588171, time: 22
Get size fall back to HDFS: 576588171, time: 41
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+


scala> spark.sql("explain cost select * from test_non_partition_10000 limit 1").show
Get size from relation: 576588171, time: 3
Get size fall back to HDFS: 576588171, time: 28
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+


scala>

scala> spark.sql("explain cost select * from test_non_partition_300000 limit 1").show
Get size from relation: 706507984, time: 135
Get size fall back to HDFS: 706507984, time: 2038
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+


scala> spark.sql("explain cost select * from test_non_partition_300000 limit 1").show
Get size from relation: 706507984, time: 168
Get size fall back to HDFS: 706507984, time: 3629
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+

Partitioned table benchmark results:

scala> spark.sql("explain cost select * from test_partition_5000 limit 1").show
Get size from relation: 9223372036854775807, time: 0
Get size fall back to HDFS: 1018560794, time: 46
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+


scala> spark.sql("explain cost select * from test_partition_10000 limit 1").show
Get size from relation: 9223372036854775807, time: 0
Get size fall back to HDFS: 1036799332, time: 43
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+
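
(9223372036854775807 is Long.MaxValue, the default of spark.sql.defaultSizeInBytes per the conf definition quoted later in this thread; it is the placeholder size reported when no statistics are available for the partitioned table.)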

Benchmark results for a partitioned table with spark.sql.hive.manageFilesourcePartitions=false (set via --conf):

scala> spark.sql("set spark.sql.hive.manageFilesourcePartitions").show
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.hive.ma...|false|
+--------------------+-----+


scala> spark.sql("explain cost select * from test_partition_5000 limit 1").show
Get size from relation: 1018560794, time: 3
Get size fall back to HDFS: 1018560794, time: 45
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+


scala> spark.sql("explain cost select * from test_partition_10000 limit 1").show
Get size from relation: 1036799332, time: 865
Get size fall back to HDFS: 1036799332, time: 69
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+
@cloud-fan (Contributor) commented Aug 15, 2019

@wangyum do you mean CommandUtils.getSizeInBytesFallBackToHdfs is very slow if there are many files?


override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
  // For the data source table, we only recalculate the table statistics when it creates
  // the CatalogFileIndex using defaultSizeInBytes. See SPARK-25474 for more details.

@cloud-fan (Contributor) commented Aug 15, 2019

when it creates the CatalogFileIndex using defaultSizeInBytes -> when the table stats are not available

@wangyum (Author) commented Aug 16, 2019

Done

*/
class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {

  private val sessionConf = session.sessionState.conf

@cloud-fan (Contributor) commented Aug 15, 2019

nit: just call it conf

@wangyum (Author) commented Aug 16, 2019

Done

      val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
      logical.copy(catalogTable = Some(withStats))

    case relation: HiveTableRelation

@cloud-fan (Contributor) commented Aug 15, 2019

shall we catch InsertIntoTable(HiveTableRelation) as well?

@wangyum (Author) commented Aug 16, 2019

@advancedxy already worked on this: c86a27b.

// Non-partitioned table
withTempDir { dir =>
  Seq(false, true).foreach { fallBackToHDFSForStats =>
    withSQLConf(SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key -> s"$fallBackToHDFSForStats") {

@cloud-fan (Contributor) commented Aug 15, 2019

Why does this config have no effect in this test?

@wangyum (Author) commented Aug 16, 2019

// fallBackToHDFSForStats = true: The table stats will be recalculated by DetermineTableStats
// fallBackToHDFSForStats = false: The table stats will be recalculated by FileIndex

}

// Partitioned table
Seq(false, true).foreach { fallBackToHDFSForStats =>

@cloud-fan (Contributor) commented Aug 15, 2019

Please create a test case for it, e.g.

test("partitioned data source tables support fallback to HDFS for size estimation")

@wangyum (Author) commented Aug 16, 2019

Done

@wangyum (Author) commented Aug 15, 2019

@wangyum do you mean CommandUtils.getSizeInBytesFallBackToHdfs is very slow if there are many files?

CommandUtils.getSizeInBytesFallBackToHdfs is not very slow.
I have no idea why PartitioningAwareFileIndex.sizeInBytes is faster than CommandUtils.getSizeInBytesFallBackToHdfs.
It may be related to the cluster load; I plan to retest on an idle cluster tomorrow.
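
(A plausible explanation, given the call chains cited in the commit message at the end of this thread: PartitioningAwareFileIndex.sizeInBytes sums file lengths the index has already listed, allFiles().map(_.getLen).sum, whereas the fallback path makes a fresh filesystem call, fs.getContentSummary, against the NameNode on every invocation.)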


@SparkQA commented Aug 20, 2019

Test build #109413 has finished for PR 24715 at commit 1e1378d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan]
@SparkQA commented Aug 20, 2019

Test build #109415 has finished for PR 24715 at commit cc32b48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
}
}

test("External partitioned data source table does not support fallback to HDFS " +

@cloud-fan (Contributor) commented Aug 21, 2019

How is this implemented?

@wangyum (Author) commented Aug 21, 2019

We do not support it:

val relation = spark.table("spark_25474").queryExecution.analyzed.children.head
assert(spark.table("spark_25474").count() === 5)
if (fallBackToHDFS) {
  assert(relation.stats.sizeInBytes === 0)
} else {
  assert(relation.stats.sizeInBytes === conf.defaultSizeInBytes)
}

@cloud-fan (Contributor) commented Aug 21, 2019

We have a serious issue here: the wrong stats may mislead Spark into broadcasting a very large table and hitting an OOM.

I think we can only fall back to the HDFS size for non-partitioned tables.

@wangyum (Author) commented Aug 21, 2019

+1. Partitioned tables are usually very large.

@wangyum (Author) commented Aug 22, 2019

@shahidki31 @dongjoon-hyun What do you think?

@dongjoon-hyun (Member) commented Aug 22, 2019

+1 for disabling it for partitioned tables.
This is the master branch status, isn't it? Previously, Spark safely returned 8EB in this case.

@shahidki31 (Contributor) commented Aug 22, 2019

I think non-partitioned data source tables already get the correct statistics. I am not sure we need to support falling back to HDFS for the size of non-partitioned tables.

@dongjoon-hyun (Member) commented Aug 22, 2019

@wangyum. Why do we need to revert that? You can revert the functional part here and keep the test code.

@dongjoon-hyun (Member) commented Aug 22, 2019

For me, this PR already contains the revert here (#24715 (comment)).

@dongjoon-hyun (Member) commented Aug 22, 2019

Personally, I'm -1 for removing the existing test cases.

@wangyum (Author) commented Aug 22, 2019

@dongjoon-hyun It is expensive to support partitioned tables with external partitions. Please see this test case; its data size is incorrect.

Related discussion:
#24715 (comment)
#24715 (comment)

So we plan not to fall back to the HDFS size for partitioned tables.

@shahidki31 (Contributor) commented Aug 22, 2019

+1. We need to document it.

@wangyum Maybe you can do both in this PR, if we don't support the fallback config.

s"PARTITIONED BY(a) LOCATION '${dir.toURI}'")

withTempDir { partitionDir =>
spark.range(5).write.mode(SaveMode.Overwrite).parquet(partitionDir.getCanonicalPath)

@dongjoon-hyun (Member) commented Aug 22, 2019

spark.range(5).toDF("b") instead of spark.range(5)?

@wangyum closed this Aug 23, 2019

wangyum added 2 commits Aug 23, 2019
Merge remote-tracking branch 'upstream/master' into SPARK-25474
# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala
#	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala

@wangyum reopened this Aug 23, 2019

"determining if a table is small enough to use auto broadcast joins. " +
"For non-partitioned data source table, it will be automatically recalculated if table " +
"statistics are not available. For partitioned data source table, It is " +
s"'${DEFAULT_SIZE_IN_BYTES.key}' if table statistics are not available.")
.booleanConf

@wangyum (Author) commented Aug 23, 2019

@shahidki31 I have updated the documentation. Please take a look.

@shahidki31 (Contributor) commented Aug 24, 2019

Thanks @wangyum. Looks good. Shall we add it to configuration.md as well? It seems these configs are not there.

@dongjoon-hyun (Member) commented Aug 24, 2019

Hi, @wangyum. We should always minimize the patch diff. Please keep the original location of this conf; I don't see any difference in the following three lines.

val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS = buildConf("spark.sql.statistics.fallBackToHdfs")
  .booleanConf
  .createWithDefault(false)

@wangyum (Author) commented Aug 25, 2019

This is because ENABLE_FALL_BACK_TO_HDFS_FOR_STATS's doc now references DEFAULT_SIZE_IN_BYTES, so we need to move DEFAULT_SIZE_IN_BYTES before ENABLE_FALL_BACK_TO_HDFS_FOR_STATS; otherwise:

[error] [warn] /home/jenkins/workspace/SparkPullRequestBuilder/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1224: Reference to uninitialized value DEFAULT_SIZE_IN_BYTES
[error] [warn]       s"'${DEFAULT_SIZE_IN_BYTES.key}' if table statistics are not available.")
[error] [warn] 
[warn] 8 warnings found
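
For illustration, a minimal standalone sketch of this pitfall (hypothetical object, not Spark code): vals in a Scala object initialize top to bottom, so a val that references a later val observes its uninitialized default, which is exactly what the warning above flags.

object ConfOrderDemo {
  // FIRST initializes before SECOND, so the interpolation below sees SECOND
  // as null; with -Xlint, scalac warns "Reference to uninitialized value SECOND".
  val FIRST: String = s"falls back to $SECOND"
  val SECOND: String = "spark.sql.defaultSizeInBytes"
}

Declaring SECOND above FIRST, as done here with DEFAULT_SIZE_IN_BYTES, avoids the warning.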

@wangyum (Author) commented Aug 25, 2019

@shahidki31 Yes. Running the SET -v command shows the entire list of SQL configurations:
http://spark.apache.org/docs/latest/configuration.html#spark-sql

@wangyum changed the title from "[SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation" to "[SPARK-25474][SQL] Update the documentation for spark.sql.statistics.fallBackToHdfs", Aug 23, 2019

@SparkQA commented Aug 23, 2019

Test build #109645 has finished for PR 24715 at commit 3b0c234.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA commented Aug 23, 2019

Test build #109647 has finished for PR 24715 at commit d23fd47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun changed the title from "[SPARK-25474][SQL] Update the documentation for spark.sql.statistics.fallBackToHdfs" to "[SPARK-25474][SQL][DOCS] Update the docs for spark.sql.statistics.fallBackToHdfs", Aug 24, 2019

@@ -1230,6 +1224,16 @@ object SQLConf {
    .bytesConf(ByteUnit.BYTE)
    .createWithDefault(Long.MaxValue)

  val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS = buildConf("spark.sql.statistics.fallBackToHdfs")
    .doc("This flag is effective only if it is Hive table. When true, it will fall back to HDFS " +

@cloud-fan (Contributor) commented Aug 26, 2019

Hive table -> Non-partitioned Hive table?

@wangyum (Author) commented Aug 26, 2019

Currently partitioned Hive tables are supported; do you think we need to disable that?

override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
  case relation: HiveTableRelation
      if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty =>
    val table = relation.tableMeta
    val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) {
      try {
        val hadoopConf = session.sessionState.newHadoopConf()
        val tablePath = new Path(table.location)
        val fs: FileSystem = tablePath.getFileSystem(hadoopConf)
        fs.getContentSummary(tablePath).getLength
      } catch {
        case e: IOException =>
          logWarning("Failed to get table size from hdfs.", e)
          session.sessionState.conf.defaultSizeInBytes
      }
    } else {
      session.sessionState.conf.defaultSizeInBytes
    }
    val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
    relation.copy(tableMeta = withStats)

@cloud-fan (Contributor) commented Aug 26, 2019

Yea I think so. Can you send a new PR to fix it?

@wangyum (Author) commented Aug 26, 2019

OK.

@@ -1230,6 +1224,16 @@ object SQLConf {
    .bytesConf(ByteUnit.BYTE)
    .createWithDefault(Long.MaxValue)

  val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS = buildConf("spark.sql.statistics.fallBackToHdfs")
    .doc("When true, it will fall back to HDFS if the table statistics are not available from " +
      "table metadata. This is useful in determining if a table is small enough to use auto " +

@cloud-fan (Contributor) commented Aug 28, 2019

auto broadcast join? Maybe just say broadcast join.

val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS = buildConf("spark.sql.statistics.fallBackToHdfs")
  .doc("When true, it will fall back to HDFS if the table statistics are not available from " +
    "table metadata. This is useful in determining if a table is small enough to use auto " +
    "broadcast joins. This flag is effective only if it is non-partitioned Hive table. " +

@cloud-fan (Contributor) commented Aug 28, 2019

only for non-partitioned Hive tables
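
Assembling the quoted fragments with the two suggestions above, the resulting conf definition would presumably read as follows (a sketch pieced together from this thread, not the verbatim merged code):

val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS = buildConf("spark.sql.statistics.fallBackToHdfs")
  .doc("When true, it will fall back to HDFS if the table statistics are not available from " +
    "table metadata. This is useful in determining if a table is small enough to use " +
    "broadcast joins. This flag is effective only for non-partitioned Hive tables. " +
    "For non-partitioned data source tables, it will be automatically recalculated if table " +
    "statistics are not available. For partitioned data source tables, it is " +
    s"'${DEFAULT_SIZE_IN_BYTES.key}' if table statistics are not available.")
  .booleanConf
  .createWithDefault(false)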

@SparkQA commented Aug 28, 2019

Test build #109850 has finished for PR 24715 at commit 44ac6cc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA commented Aug 28, 2019

Test build #109851 has finished for PR 24715 at commit 55d59e3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.
@wangyum (Author) commented Aug 28, 2019

retest this please

@SparkQA commented Aug 28, 2019

Test build #109853 has finished for PR 24715 at commit 55d59e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@cloud-fan (Contributor) commented Aug 28, 2019

thanks, merging to master!

@cloud-fan closed this in e3b32da, Aug 28, 2019

@wangyum deleted the wangyum:SPARK-25474 branch, Aug 28, 2019

PavithraRamachandran added a commit to PavithraRamachandran/spark that referenced this pull request Sep 14, 2019
[SPARK-25474][SQL][DOCS] Update the docs for spark.sql.statistics.fallBackToHdfs

## What changes were proposed in this pull request?

This PR updates `spark.sql.statistics.fallBackToHdfs`'s doc:
1. This flag is effective only for Hive tables.
2. For non-partitioned data source tables, the size is automatically recalculated if table statistics are not available.
3. For partitioned data source tables, it is 'spark.sql.defaultSizeInBytes' if table statistics are not available.

Related code:
- Non-partitioned data source table:
[SizeInBytesOnlyStatsPlanVisitor.default()](https://github.com/apache/spark/blob/98be8953c75c026c1cb432cc8f66dd312feed0c6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L54-L57) -> [LogicalRelation.computeStats()](https://github.com/apache/spark/blob/a1c1dd3484a4dcd7c38fe256e69dbaaaf10d1a92/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala#L42-L46) -> [HadoopFsRelation.sizeInBytes()](https://github.com/apache/spark/blob/c0632cec04e5b0f3fb3c3f27c21a2d3f3fbb4f7e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala#L72-L75) -> [PartitioningAwareFileIndex.sizeInBytes()](https://github.com/apache/spark/blob/b276788d57b270d455ef6a7c5ed6cf8a74885dde/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L103)
`PartitioningAwareFileIndex.sizeInBytes()` is calculated by [`allFiles().map(_.getLen).sum`](https://github.com/apache/spark/blob/b276788d57b270d455ef6a7c5ed6cf8a74885dde/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L103) if table statistics are not available.

- Partitioned data source table:
[SizeInBytesOnlyStatsPlanVisitor.default()](https://github.com/apache/spark/blob/98be8953c75c026c1cb432cc8f66dd312feed0c6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L54-L57) -> [LogicalRelation.computeStats()](https://github.com/apache/spark/blob/a1c1dd3484a4dcd7c38fe256e69dbaaaf10d1a92/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala#L42-L46) -> [CatalogFileIndex.sizeInBytes](https://github.com/apache/spark/blob/5d672b7f3e07cfd7710df319fc6c7d2b9056a068/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CatalogFileIndex.scala#L41)
`CatalogFileIndex.sizeInBytes` is [spark.sql.defaultSizeInBytes](https://github.com/apache/spark/blob/c30b5297bc607ae33cc2fcf624b127942154e559/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L387) if table statistics are not available.

## How was this patch tested?

N/A

Closes apache#24715 from wangyum/SPARK-25474.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>