[SPARK-34060][SQL][3.0] Fix Hive table caching while updating stats by `ALTER TABLE .. DROP PARTITION` #31126

MaxGekk · 2021-01-11T09:34:45Z

What changes were proposed in this pull request?

Fix canonicalization of HiveTableRelation by normalizing its CatalogTable, and exclude table stats and temporary fields from the canonicalized plan.
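To illustrate why stripping stats from the canonicalized plan matters for cache lookups, here is a minimal sketch using a toy model rather than Spark's real classes. All names (`ToyStats`, `ToyCatalogTable`, `ToyRelation`, `CacheLookupSketch`) are hypothetical, and `createTime` merely stands in for the "temporary fields" mentioned above; the actual set of normalized fields in Spark may differ.

```scala
// Toy model only: these are NOT Spark's HiveTableRelation / CatalogTable.
final case class ToyStats(sizeInBytes: BigInt)

final case class ToyCatalogTable(
    name: String,
    schema: Seq[String],
    stats: Option[ToyStats] = None, // rewritten whenever table stats are updated
    createTime: Long = 0L)          // assumed example of a volatile metadata field

final case class ToyRelation(table: ToyCatalogTable) {
  // Normalize away fields that do not influence the query result.
  def canonicalized: ToyRelation =
    ToyRelation(table.copy(stats = None, createTime = 0L))
}

object CacheLookupSketch {
  // A cache that compares canonicalized relations survives a stats-only change.
  def sameResult(a: ToyRelation, b: ToyRelation): Boolean =
    a.canonicalized == b.canonicalized

  def main(args: Array[String]): Unit = {
    val before = ToyRelation(
      ToyCatalogTable("tbl", Seq("id", "part"), Some(ToyStats(4)), createTime = 1L))
    // Simulate a stats update such as the one triggered by dropping a partition.
    val after = ToyRelation(
      before.table.copy(stats = Some(ToyStats(2)), createTime = 2L))

    println(before == after)           // raw comparison: relations differ
    println(sameResult(before, after)) // canonicalized comparison: relations match
  }
}
```

In this sketch, a cache keyed on the raw relation would miss after the stats change, while one keyed on the canonicalized relation would still hit, which is the behaviour the fix aims for.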

Why are the changes needed?

This fixes the issue demonstrated by the example below:

scala> spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", true)
scala> sql(s"CREATE TABLE tbl (id int, part int) USING hive PARTITIONED BY (part)")
scala> sql("INSERT INTO tbl PARTITION (part=0) SELECT 0")
scala> sql("INSERT INTO tbl PARTITION (part=1) SELECT 1")
scala> sql("CACHE TABLE tbl")
scala> sql("SELECT * FROM tbl").show(false)
+---+----+
|id |part|
+---+----+
|0  |0   |
|1  |1   |
+---+----+

scala> spark.catalog.isCached("tbl")
scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
scala> spark.catalog.isCached("tbl")
res19: Boolean = false

ALTER TABLE .. DROP PARTITION must keep the table in the cache.

Does this PR introduce any user-facing change?

Yes. After the changes, the drop partition command keeps the table in the cache while updating table stats:

scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
scala> spark.catalog.isCached("tbl")
res19: Boolean = true

How was this patch tested?

By running the new unit tests:

$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowCreateTableSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d97e991)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

…TER TABLE .. DROP PARTITION`

Tested by running the new UT in `AlterTableDropPartitionSuite`.

Closes apache#31112 from MaxGekk/fix-caching-hive-table-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d97e991)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

MaxGekk · 2021-01-11T09:41:20Z

@cloud-fan Please, review this.

SparkQA · 2021-01-11T11:23:20Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38507/

SparkQA · 2021-01-11T11:56:46Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38507/

SparkQA · 2021-01-11T14:09:38Z

Test build #133920 has finished for PR 31126 at commit ce2b6cb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-01-11T14:36:07Z

thanks, merging to 3.0!

…y `ALTER TABLE .. DROP PARTITION`

Closes #31126 from MaxGekk/fix-caching-hive-table-2-3.0.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

MaxGekk mentioned this pull request Jan 11, 2021

[SPARK-34060][SQL] Fix Hive table caching while updating stats by ALTER TABLE .. DROP PARTITION #31112

Closed

cloud-fan closed this Jan 11, 2021
