[SPARK-27266][SQL] Support ANALYZE TABLE to collect tables stats for cached catalog views #24200
Conversation
Test build #103895 has finished for PR 24200 at commit

retest this please

Test build #103906 has finished for PR 24200 at commit

cc: @dongjoon-hyun

Thank you for pinging me, @maropu. Yep. I'll take a look in a few hours.
statsOfPlanToCache.copy(sizeInBytes = cacheBuilder.sizeInBytesStats.value.longValue)
statsOfPlanToCache.copy(
  sizeInBytes = cacheBuilder.sizeInBytesStats.value.longValue,
  rowCount = Some(cacheBuilder.rowCountStats.value.longValue))
Is rowCount required additionally? If not, please remove this.
Yea, we need it because this change passes rowCount into upper nodes:
scala> sql("CREATE VIEW v AS SELECT 1 c")
scala> sql("CACHE TABLE v")
scala> spark.table("v").explain(true)
...
== Optimized Logical Plan ==
InMemoryRelation [c#28], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [1 AS c#1]
+- Scan OneRowRelation[]
...
> w/o this change
scala> val stats = spark.table("v").queryExecution.optimizedPlan.stats
.... Statistics(sizeInBytes=4.0 B)
> w/ this change
scala> val stats = spark.table("v").queryExecution.optimizedPlan.stats
.... Statistics(sizeInBytes=4.0 B, rowCount=1)
^^^^^^^^^^^
val cacheManager = sparkSession.sharedState.cacheManager
if (cacheManager.lookupCachedData(table.logicalPlan).isDefined) {
  // To collect table stats, materializes an underlying columnar RDD
  table.collect()
Ur, @maropu, is this safe? Although ANALYZE TABLE is a heavy operation in general, and Spark CBO has collected statistics this way until now, table.collect() looks too heavy to me.
Hi, @cloud-fan. Is this kind of heavy operation allowed?
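The cost concern here can be sketched as follows (a hedged illustration, assuming a live spark-shell session and the cached view `v` from the example above): both actions materialize the cached columnar RDD, but `collect()` additionally deserializes and ships every row to the driver.

```scala
// Hedged sketch: both actions trigger a job over the cached plan, but
// collect() copies every row to the driver (risking driver OOM on big tables),
// while count() only returns a single Long.
val table = spark.table("v")
table.collect()  // materializes the cache AND pulls all rows to the driver
table.count()    // materializes the cache; only the row count comes back
```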
I wrote this code to do the same thing as the normal case:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala
Line 44 in 90b7251

if (noscan) None else Some(BigInt(sparkSession.table(tableIdentWithDB).count()))
Test build #103953 has finished for PR 24200 at commit

retest this please

Test build #103959 has finished for PR 24200 at commit
table.collect()
if (!noscan) {
  // To collect table stats, materializes an underlying columnar RDD
  table.collect()
According to the pointer you gave in the previous comment, table.count is enough for this?
oh! I missed it... this is my silly mistake, sorry.
Test build #103999 has finished for PR 24200 at commit

Retest this please.

Test build #104130 has finished for PR 24200 at commit

Retest this please.

Test build #104144 has finished for PR 24200 at commit
if (!noscan) {
  // To collect table stats, materializes an underlying columnar RDD
  table.count()
}
If noscan is true, this is a no-op. We may want to show some warning or info here later, but this is not a big deal. For now, this PR is much better than before because it prevents AnalysisException by supporting this case. We can add docs later, before the 3.0.0 release (if needed).
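For reference, the two forms of the command would behave like this (a hedged sketch, reusing the cached view `v` from the earlier spark-shell example):

```scala
// Hedged sketch: the scanning form triggers a count over the cached plan
// and records stats, while NOSCAN skips the materializing scan, so for a
// cached catalog view it effectively does nothing.
sql("ANALYZE TABLE v COMPUTE STATISTICS")        // scans the cache, fills stats
sql("ANALYZE TABLE v COMPUTE STATISTICS NOSCAN") // no-op for the cached view
```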
What changes were proposed in this pull request?
The current master doesn't support ANALYZE TABLE to collect table stats for catalog views, even if they are cached.
Since SPARK-25196 added support for the ANALYZE command to collect column statistics for cached catalog views, we can support table stats, too.
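A minimal reproduction might look like this (a hedged sketch of a spark-shell session, based on the example used in the review thread):

```scala
// Hedged sketch: create and cache a catalog view, then analyze it.
sql("CREATE VIEW v AS SELECT 1 c")
sql("CACHE TABLE v")
// Before this PR, analyzing the cached catalog view failed with an
// AnalysisException; with this change, it collects table stats instead.
sql("ANALYZE TABLE v COMPUTE STATISTICS")
```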
How was this patch tested?
Added tests in StatisticsCollectionSuite and InMemoryColumnarQuerySuite.