[SPARK-21969][SQL] CommandUtils.updateTableStats should call refreshTable #19252
Conversation
Test build #81842 has finished for PR 19252 at commit
@@ -44,6 +44,7 @@ object CommandUtils extends Logging {
     } else {
       catalog.alterTableStats(table.identifier, None)
     }
+    catalog.refreshTable(table.identifier)
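To illustrate why the added `refreshTable` call matters, here is a minimal plain-Scala sketch of a catalog with a relation cache. `TinyCatalog`, `Stats`, and `CachedRelation` are hypothetical stand-ins for illustration, not Spark's real classes: without evicting the cached relation after a stats update, lookups keep returning the stale entry.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for the catalog metadata and its relation cache.
case class Stats(sizeInBytes: Long)
case class CachedRelation(name: String, stats: Stats)

class TinyCatalog {
  private val metadata = mutable.Map.empty[String, Stats]
  private val relationCache = mutable.Map.empty[String, CachedRelation]

  def alterTableStats(name: String, stats: Stats): Unit =
    metadata(name) = stats

  // Drop the cached relation so the next lookup rebuilds it.
  def refreshTable(name: String): Unit = {
    relationCache.remove(name)
  }

  // A lookup returns the cached relation if present, otherwise builds one
  // from the current metadata and caches it.
  def lookup(name: String): CachedRelation =
    relationCache.getOrElseUpdate(name, CachedRelation(name, metadata(name)))
}

val catalog = new TinyCatalog
catalog.alterTableStats("tab1", Stats(0L))
catalog.lookup("tab1")                     // relation is now cached

// Stats change (e.g. after an INSERT followed by ANALYZE) ...
catalog.alterTableStats("tab1", Stats(1024L))
val staleSize = catalog.lookup("tab1").stats.sizeInBytes  // still the old value

// ... the fix: invalidate after updating stats.
catalog.refreshTable("tab1")
val freshSize = catalog.lookup("tab1").stats.sizeInBytes  // reflects the update
```

The sketch mirrors the fix: the eviction in `refreshTable` is the one-line change above, placed right after the stats update.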
Add a comment above this line:
// Invalidate the table relation cache
Actually, the right fix should add
@gatorsmile thanks for the feedback. I also covered
Why are the results different? Is it a bug?
This is not a bug. We just follow the behavior of Hive's dynamic partition insert.
@@ -261,6 +261,11 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with SharedSQLContext
     assert(fetched1.get.sizeInBytes == 0)
     assert(fetched1.get.colStats.size == 2)

+    // compute stats based on the catalog table metadata and
+    // put the relation into the catalog cache
+    sql(s"EXPLAIN COST SELECT DISTINCT * FROM $table")
Could you replace the usage of EXPLAIN COST with a plain table lookup?
// Table lookup will make the table cached.
spark.table(table)
@@ -377,6 +377,8 @@ class SessionCatalog(
     requireDbExists(db)
     requireTableExists(tableIdentifier)
     externalCatalog.alterTableStats(db, table, newStats)
+    // Invalidate the table relation cache
+    refreshTable(identifier)
Could you remove the unneeded refreshTable calls in AnalyzeTableCommand and AnalyzeColumnCommand?
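The point behind this review comment: once the invalidation lives inside `SessionCatalog.alterTableStats`, every caller gets it for free, so per-command `refreshTable` calls become redundant. A rough sketch of that design choice, using hypothetical stand-ins (`TinyCatalog` and the two command functions are illustrative, not Spark's real implementations):

```scala
import scala.collection.mutable

class TinyCatalog {
  val cache = mutable.Set.empty[String]               // names of cached relations
  private val stats = mutable.Map.empty[String, Long]

  // A lookup populates the relation cache.
  def lookup(name: String): Unit = {
    cache += name
  }

  // Invalidation lives at the lowest layer, so callers never need
  // to pair this with their own refresh call.
  def alterTableStats(name: String, sizeInBytes: Long): Unit = {
    stats(name) = sizeInBytes
    cache -= name                                     // the refreshTable equivalent
  }
}

// Both commands just update stats; neither carries its own refresh logic.
def analyzeTableCommand(c: TinyCatalog, t: String): Unit = c.alterTableStats(t, 100L)
def analyzeColumnCommand(c: TinyCatalog, t: String): Unit = c.alterTableStats(t, 200L)

val cat = new TinyCatalog
cat.lookup("tab1")
analyzeTableCommand(cat, "tab1")
val invalidated = !cat.cache.contains("tab1")         // eviction happened centrally
```

Centralizing the eviction avoids the failure mode where a future command updates stats through the catalog but forgets its own refresh.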
Test build #81896 has finished for PR 19252 at commit
Test build #81897 has finished for PR 19252 at commit
Test build #81941 has finished for PR 19252 at commit
LGTM
Thanks! Merged to master.
What changes were proposed in this pull request?
Tables in the catalog cache are not invalidated once their statistics are updated. As a consequence, existing sessions will use the cached information even though it is no longer valid. Consider the example below.
After step 3, the table will be present in the catalog relation cache. Step 4 will correctly update the metadata inside the catalog but will NOT invalidate the cache.
By the way, running
spark.sql("analyze table tab1 compute statistics")
between step 3 and step 4 would also solve the problem.
How was this patch tested?
Existing and additional unit tests.