[SPARK-31877][SQL]Avoid stats computation for Hive table #28686
Conversation
@viirya @dongjoon-hyun @maropu @gatorsmile Can any of you help review the change?
ok to test
Looks okay to me.
Test build #123362 has finished for PR 28686 at commit
Test build #123404 has finished for PR 28686 at commit
Test build #123418 has finished for PR 28686 at commit
Failure doesn't look genuine to me.
retest this please
Test build #123443 has finished for PR 28686 at commit
@@ -654,7 +654,7 @@ case class HiveTableRelation(
     tableMeta: CatalogTable,
     dataCols: Seq[AttributeReference],
     partitionCols: Seq[AttributeReference],
-    tableStats: Option[Statistics] = None,
+    tableStats: Option[Statistics] = Some(Statistics(sizeInBytes = SQLConf.get.defaultSizeInBytes)),
We still need to use Option?
Test build #123456 has finished for PR 28686 at commit
Test build #123457 has finished for PR 28686 at commit
@@ -654,7 +654,8 @@ case class HiveTableRelation(
     tableMeta: CatalogTable,
     dataCols: Seq[AttributeReference],
     partitionCols: Seq[AttributeReference],
-    tableStats: Option[Statistics] = None,
+    tableStats: Option[Statistics] = Option(Statistics(sizeInBytes
+      = SQLConf.get.defaultSizeInBytes)),
I meant tableStats: Statistics = Statistics(sizeInBytes = SQLConf.get.defaultSizeInBytes),
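The contrast between the diff above and this suggestion can be sketched with simplified stand-in types (the one-field Statistics below is a toy, not Spark's; the relation classes are invented here purely to show the two signatures side by side):

```scala
// Sketch only: a toy Statistics so both signatures compile standalone.
case class Statistics(sizeInBytes: BigInt)

// As in the current diff: the Option wrapper stays, but the default is
// always Some(...), so the None case no longer carries information.
case class RelOptionDefault(
    tableStats: Option[Statistics] = Some(Statistics(sizeInBytes = Long.MaxValue)))

// As the reviewer suggests: if the default is never None, the Option can
// be dropped and callers read tableStats directly, without unwrapping.
case class RelPlainDefault(
    tableStats: Statistics = Statistics(sizeInBytes = Long.MaxValue))
```

The design point is that an `Option` whose default is always `Some` forces every call site to handle a `None` that can no longer occur.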
Test build #123495 has finished for PR 28686 at commit
RelationConversions(conf, catalog) +:
new DetermineTableStats(session) +:
DetermineTableStats updates statistics in HiveTableRelation for some cases. Updated statistics will be propagated into HadoopFsRelation, and so the sizeInBytes depends on it. As you change the order of DetermineTableStats to after RelationConversions, it could change the sizeInBytes of the converted HadoopFsRelation.
That said, if the user is willing to use fallBackToHdfsForStatsEnabled to calculate table size, this change will make it not work as before.
@viirya Hive tables are converted to HiveTableScanExec instead of LogicalRelation (which in turn uses HadoopFsRelation). This happens in org.apache.spark.sql.hive.HiveStrategies.HiveTableScans. It will not affect HadoopFsRelation, I think. Let me know if I am missing something.
Note: conversion to HiveTableScanExec happens for Hive tables with any format other than Parquet and ORC. In the case of ORC and Parquet, it happens when the flag to convert to a datasource table is disabled.
I mean previously DetermineTableStats would calculate the table size and update it in HiveTableRelation before RelationConversions ran. Then in RelationConversions, this calculated table size would be propagated into HadoopFsRelation and used by sizeInBytes.
Now, as you change the rule order, when running RelationConversions, even if users enable fallBackToHdfsForStatsEnabled, the HadoopFsRelation won't get the table size calculated in DetermineTableStats (because this rule is run after RelationConversions now).
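This ordering concern can be illustrated with a toy model (the Rel type, the 1234-byte fallback size, and the two functions below are invented for illustration; the real rules operate on Catalyst logical plans):

```scala
// Toy model of the two rule orders; not Spark code.
case class Rel(sizeInBytes: BigInt, converted: Boolean = false)

// Stand-in for DetermineTableStats: fills in an HDFS-derived size, but only
// while the relation is still a Hive relation (not yet converted).
def determineTableStats(r: Rel): Rel =
  if (!r.converted) r.copy(sizeInBytes = BigInt(1234)) else r

// Stand-in for RelationConversions: converts, carrying stats along as-is.
def relationConversions(r: Rel): Rel = r.copy(converted = true)

val default = BigInt(Long.MaxValue)
// Old order: stats first, then convert -> converted relation sees 1234.
val oldOrder = relationConversions(determineTableStats(Rel(default)))
// New order: convert first -> the stats rule no longer matches, so the
// converted relation keeps the default size estimate.
val newOrder = determineTableStats(relationConversions(Rel(default)))
// oldOrder.sizeInBytes == BigInt(1234); newOrder.sizeInBytes == default
```

In this sketch the reordering silently changes the size estimate the optimizer would see, which is exactly the regression being described.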
@viirya Thanks for catching this, I think this re-order will not be useful. I will decline this pull request.
Updated statistics will be propagated into HadoopFsRelation and so the sizeInBytes depends on it.
Ah, I see. I missed that code flow.
And I think DetermineTableStats doesn't always calculate the table size. It is controlled by fallBackToHdfsForStatsEnabled and only applies to non-partitioned tables. If the user wants to avoid the computation, fallBackToHdfsForStatsEnabled can be disabled for that.
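For reference, the flag discussed here is exposed to users as the spark.sql.statistics.fallBackToHdfs SQL conf (the internal name fallBackToHdfsForStatsEnabled is the SQLConf accessor); a user who wants to skip the HDFS size scan can turn it off, e.g.:

```scala
// Assumes a live SparkSession named `spark`; the config key is the one
// backing fallBackToHdfsForStatsEnabled in SQLConf.
spark.conf.set("spark.sql.statistics.fallBackToHdfs", "false")
```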
Test build #123497 has finished for PR 28686 at commit
@viirya @maropu To add to how this change is useful, I took the example of the q17.sql TPCDS query at scale 1000, on non-partitioned data.
The time is also pretty high due to SPARK-31850. The stats computed are not used and can be avoided completely.
@@ -171,6 +171,32 @@ class HiveCatalogedDDLSuite extends DDLSuite with TestHiveSingleton with BeforeA
     testDropTable(isDatasourceTable = false)
   }

+  test("DetermineTableStats should not cause any plan changes" +
Test build #124814 has finished for PR 28686 at commit
Test build #124813 has finished for PR 28686 at commit
Test build #124819 has finished for PR 28686 at commit
Retest this please.
Test build #125134 has finished for PR 28686 at commit
@dongjoon-hyun The error doesn't seem to be related to the change. Can you take a look, and if it's intermittent, can we re-trigger the tests?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
As part of the DetermineTableStats rule, we compute the stats for a HiveTableRelation, which can be an expensive operation, and it could happen multiple times for a query (SPARK-31850).
In most cases (the flag for converting Parquet and ORC tables to datasource tables is enabled by default in the master branch), the RelationConversions rule converts the HiveTableRelation to a LogicalRelation.
When the conversion happens, the stats computed for the HiveTableRelation do not get used.
In this change, stats computation is avoided by performing the conversion before computing stats.
Why are the changes needed?
With the change, stats for Hive table will not be computed unnecessarily.
Does this PR introduce any user-facing change?
No
How was this patch tested?
It was tested on a local machine and the behaviour was verified.