[SPARK-27266][SQL] Support ANALYZE TABLE to collect tables stats for cached catalog views #24200
Conversation
Test build #103895 has finished for PR 24200 at commit

retest this please

Test build #103906 has finished for PR 24200 at commit

cc: @dongjoon-hyun

Thank you for pinging me, @maropu. Yep. I'll take a look in a few hours.
statsOfPlanToCache.copy(sizeInBytes = cacheBuilder.sizeInBytesStats.value.longValue)
statsOfPlanToCache.copy(
  sizeInBytes = cacheBuilder.sizeInBytesStats.value.longValue,
  rowCount = Some(cacheBuilder.rowCountStats.value.longValue))
Is rowCount required additionally? If not, please remove this.
Yea, we need it because this change passes rowCount into upper nodes:
scala> sql("CREATE VIEW v AS SELECT 1 c")
scala> sql("CACHE TABLE v")
scala> spark.table("v").explain(true)
...
== Optimized Logical Plan ==
InMemoryRelation [c#28], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [1 AS c#1]
+- Scan OneRowRelation[]
...
> w/o this change
scala> val stats = spark.table("v").queryExecution.optimizedPlan.stats
.... Statistics(sizeInBytes=4.0 B)
> w/ this change
scala> val stats = spark.table("v").queryExecution.optimizedPlan.stats
.... Statistics(sizeInBytes=4.0 B, rowCount=1)
^^^^^^^^^^^
val cacheManager = sparkSession.sharedState.cacheManager
if (cacheManager.lookupCachedData(table.logicalPlan).isDefined) {
  // To collect table stats, materializes an underlying columnar RDD
  table.collect()
Ur, @maropu, is this safe? Although ANALYZE TABLE is a heavy operation in general, and Spark CBO has collected statistics this way until now, table.collect() looks too heavy to me.
Hi, @cloud-fan. Is this kind of heavy operation allowed?
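The cost concern here can be sketched as follows (a hedged illustration, assuming a live spark-shell session and the cached view `v` from the example above): both actions materialize the cached columnar RDD, but `collect()` additionally deserializes and ships every row to the driver.

```scala
// Hedged sketch: both actions trigger a job over the cached plan, but
// collect() copies every row to the driver (risking driver OOM on big tables),
// while count() only returns a single Long.
val table = spark.table("v")
table.collect()  // materializes the cache AND pulls all rows to the driver
table.count()    // materializes the cache; only the row count comes back
```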
I wrote this code to do the same thing as the normal case:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala
Line 44 in 90b7251

if (noscan) None else Some(BigInt(sparkSession.table(tableIdentWithDB).count()))
Test build #103953 has finished for PR 24200 at commit

retest this please

Test build #103959 has finished for PR 24200 at commit
table.collect()
if (!noscan) {
  // To collect table stats, materializes an underlying columnar RDD
  table.collect()
According to the pointer you gave in the previous comment, table.count is enough for this?
oh! I missed it... this is my silly mistake, sorry.
Test build #103999 has finished for PR 24200 at commit

Retest this please.

Test build #104130 has finished for PR 24200 at commit

Retest this please.

Test build #104144 has finished for PR 24200 at commit
if (!noscan) {
  // To collect table stats, materializes an underlying columnar RDD
  table.count()
}
If noscan is true, this is a no-op. We may want to show some warning or info here later, but this is not a big deal. For now, this PR is much better than before because it prevents AnalysisException by supporting this case. We can add docs later, before the 3.0.0 release (if needed).
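For reference, the two forms of the command would behave like this (a hedged sketch, reusing the cached view `v` from the earlier spark-shell example):

```scala
// Hedged sketch: the scanning form triggers a count over the cached plan
// and records stats, while NOSCAN skips the materializing scan, so for a
// cached catalog view it effectively does nothing.
sql("ANALYZE TABLE v COMPUTE STATISTICS")        // scans the cache, fills stats
sql("ANALYZE TABLE v COMPUTE STATISTICS NOSCAN") // no-op for the cached view
```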
What changes were proposed in this pull request?
The current master doesn't support ANALYZE TABLE to collect table stats for catalog views, even if they are cached.
Since SPARK-25196 added support for the ANALYZE command to collect column statistics for cached catalog views, we can support table stats, too.
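A minimal reproduction might look like this (a hedged sketch of a spark-shell session, based on the example used in the review thread):

```scala
// Hedged sketch: create and cache a catalog view, then analyze it.
sql("CREATE VIEW v AS SELECT 1 c")
sql("CACHE TABLE v")
// Before this PR, analyzing the cached catalog view failed with an
// AnalysisException; with this change, it collects table stats instead.
sql("ANALYZE TABLE v COMPUTE STATISTICS")
```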
How was this patch tested?
Added tests in StatisticsCollectionSuite and InMemoryColumnarQuerySuite.