
[SPARK-21969][SQL] CommandUtils.updateTableStats should call refreshTable #19252

Closed
wants to merge 5 commits

Conversation

aokolnychyi
Contributor

What changes were proposed in this pull request?

Tables in the catalog cache are not invalidated once their statistics are updated. As a consequence, existing sessions will use the cached information even though it is no longer valid. Consider the example below.

// step 1
spark.range(100).write.saveAsTable("tab1")
// step 2
spark.sql("analyze table tab1 compute statistics")
// step 3
spark.sql("explain cost select distinct * from tab1").show(false)
// step 4
spark.range(100).write.mode("append").saveAsTable("tab1")
// step 5
spark.sql("explain cost select distinct * from tab1").show(false)

After step 3, the table will be present in the catalog relation cache. Step 4 will correctly update the metadata inside the catalog but will NOT invalidate the cache.

By the way, running spark.sql("analyze table tab1 compute statistics") between step 3 and step 4 would also solve the problem.
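The idea behind the fix can be sketched as follows (a minimal illustration, not the exact PR code; only `alterTableStats` and `refreshTable` are real SessionCatalog methods, the wrapper name is made up):

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, SessionCatalog}

// Sketch: after altering the statistics in the catalog, drop the cached
// relation so that existing sessions re-read the fresh metadata.
def updateStatsAndInvalidate(
    catalog: SessionCatalog,
    identifier: TableIdentifier,
    newStats: Option[CatalogStatistics]): Unit = {
  catalog.alterTableStats(identifier, newStats)
  // Invalidate the table relation cache
  catalog.refreshTable(identifier)
}
```

Without the `refreshTable` call, step 5 above keeps reporting the statistics cached at step 3.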

How was this patch tested?

Current and additional unit tests.

@SparkQA

SparkQA commented Sep 16, 2017

Test build #81842 has finished for PR 19252 at commit ba963b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Sep 17, 2017

TruncateTableCommand and AlterTableAddPartitionCommand also have similar issues. Could you fix them in this PR as well?

@@ -44,6 +44,7 @@ object CommandUtils extends Logging {
} else {
catalog.alterTableStats(table.identifier, None)
}
catalog.refreshTable(table.identifier)

Add a comment above this line:

Invalidate the table relation cache

@gatorsmile
Member

Actually, the right fix is to add refreshTable(identifier) to SessionCatalog's alterTableStats API.

@aokolnychyi
Contributor Author

@gatorsmile thanks for the feedback. I also covered TruncateTableCommand with additional tests. However, I see somewhat strange behavior while creating a test for AlterTableAddPartitionCommand.

sql(s"CREATE TABLE t1 (col1 int, col2 int) USING PARQUET")
sql(s"INSERT INTO TABLE t1 SELECT 1, 2")
sql(s"INSERT INTO TABLE t1 SELECT 2, 4")
sql("SELECT * FROM t1").show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   2|   4|
+----+----+

sql(s"CREATE TABLE t2 (col1 int, col2 int) USING PARQUET PARTITIONED BY (col1)")
sql(s"INSERT INTO TABLE t2 SELECT 1, 2")
sql(s"INSERT INTO TABLE t2 SELECT 2, 4")
sql("SELECT * FROM t2").show()
+----+----+
|col2|col1|
+----+----+
|   2|   4|
|   1|   2|
+----+----+

Why are the results different? Is it a bug?

@gatorsmile
Member

This is not a bug. We just follow the behavior of Hive's dynamic partition insert.

The dynamic partition columns must be specified last in both the part_spec and the input result set (of the row value lists or the select query). They are resolved by position instead of by name, so the orders must match exactly.
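Concretely, for the t2 table from the snippet above (a sketch of the positional resolution, assuming the same session): since t2 is partitioned by col1, that column is moved to the end of the table's schema, and the SELECT output is matched against (col2, col1) by position.

```scala
// t2's effective schema is (col2, col1): the partition column comes last,
// and INSERT matches the SELECT output by position, not by name.
// So "SELECT 1, 2" writes col2 = 1 into the partition col1 = 2,
// which is why the columns appear "swapped" compared to t1.
spark.sql("INSERT INTO TABLE t2 SELECT 1, 2")
```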

@@ -261,6 +261,11 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
assert(fetched1.get.sizeInBytes == 0)
assert(fetched1.get.colStats.size == 2)

// compute stats based on the catalog table metadata and
// put the relation into the catalog cache
sql(s"EXPLAIN COST SELECT DISTINCT * FROM $table")

Could you replace the usage of EXPLAIN COST with

// Table lookup will make the table cached.
spark.table(table)

@@ -377,6 +377,8 @@ class SessionCatalog(
requireDbExists(db)
requireTableExists(tableIdentifier)
externalCatalog.alterTableStats(db, table, newStats)
// Invalidate the table relation cache
refreshTable(identifier)

Could you remove the unneeded refreshTable calls in AnalyzeTableCommand and AnalyzeColumnCommand?

@SparkQA

SparkQA commented Sep 19, 2017

Test build #81896 has finished for PR 19252 at commit ca09962.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 19, 2017

Test build #81897 has finished for PR 19252 at commit a5cb16d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 19, 2017

Test build #81941 has finished for PR 19252 at commit 63f9dc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in ee13f3e Sep 19, 2017