
[SPARK-34060][SQL] Fix Hive table caching while updating stats by ALTER TABLE .. DROP PARTITION #31112

Closed

Conversation

@MaxGekk (Member) commented Jan 10, 2021

What changes were proposed in this pull request?

Fix canonicalisation of `HiveTableRelation` by normalisation of `CatalogTable`, and exclude table stats and temporary fields from the canonicalized plan.
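
Conceptually, a cached query is found by matching canonicalized plans, so any field that leaks into the canonical form of `HiveTableRelation` (for example, freshly updated statistics) makes that match miss. The sketch below is a toy, plain-Scala model of that mechanism (not Spark's real classes) showing why stats and creation time must be cleared from the canonical form:

```scala
// Toy model: the cache is keyed by a canonical form of the relation.
case class Stats(sizeInBytes: Long)
case class Relation(name: String, stats: Option[Stats], createTime: Long) {
  // Canonical form: drop fields that do not affect query results.
  def canonicalized: Relation = copy(stats = None, createTime = -1L)
}

object CacheLookupSketch extends App {
  val cache = scala.collection.mutable.Map.empty[Relation, String]

  val before = Relation("tbl", stats = Some(Stats(100L)), createTime = 12345L)
  cache(before.canonicalized) = "cached data"

  // ALTER TABLE .. DROP PARTITION updates the stats ...
  val after = before.copy(stats = Some(Stats(50L)))

  // ... but the lookup still hits because stats are excluded from the canonical form.
  println(cache.contains(after.canonicalized)) // true
}
```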

Why are the changes needed?

This fixes the issue demonstrated by the example below:

```scala
scala> spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", true)
scala> sql(s"CREATE TABLE tbl (id int, part int) USING hive PARTITIONED BY (part)")
scala> sql("INSERT INTO tbl PARTITION (part=0) SELECT 0")
scala> sql("INSERT INTO tbl PARTITION (part=1) SELECT 1")
scala> sql("CACHE TABLE tbl")
scala> sql("SELECT * FROM tbl").show(false)
+---+----+
|id |part|
+---+----+
|0  |0   |
|1  |1   |
+---+----+

scala> spark.catalog.isCached("tbl")
scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
scala> spark.catalog.isCached("tbl")
res19: Boolean = false
```

`ALTER TABLE .. DROP PARTITION` must keep the table in the cache.

Does this PR introduce any user-facing change?

Yes. After the changes, the drop partition command keeps the table in the cache while updating table stats:

scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
scala> spark.catalog.isCached("tbl")
res19: Boolean = true

How was this patch tested?

By running the new unit test in `AlterTableDropPartitionSuite`.
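
For illustration, here is a rough sketch of the kind of check such a test performs; this is not the merged test body, and the helpers (`withSQLConf`, `withTable`, `checkAnswer`, `Row`) are Spark's standard SQL test utilities:

```scala
// Illustrative only: this would live inside a suite extending QueryTest with Spark's
// SQL test helpers (withSQLConf, withTable, checkAnswer, Row), e.g. in the hive module.
test("keep the table cached after ALTER TABLE .. DROP PARTITION updates its stats") {
  withSQLConf("spark.sql.statistics.size.autoUpdate.enabled" -> "true") {
    withTable("tbl") {
      sql("CREATE TABLE tbl (id int, part int) USING hive PARTITIONED BY (part)")
      sql("INSERT INTO tbl PARTITION (part=0) SELECT 0")
      sql("INSERT INTO tbl PARTITION (part=1) SELECT 1")
      sql("CACHE TABLE tbl")
      assert(spark.catalog.isCached("tbl"))

      sql("ALTER TABLE tbl DROP PARTITION (part=0)")
      assert(spark.catalog.isCached("tbl"))            // the table must stay cached
      checkAnswer(sql("SELECT * FROM tbl"), Row(1, 1)) // and reflect the dropped partition
    }
  }
}
```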

Inline review comment by @MaxGekk (Member, Author) on the changed code:

```scala
  storage = CatalogStorageFormat.empty,
  createTime = -1
),
tableMeta = CatalogTable.normalize(tableMeta),
```

This is the bug fix. Cleaning `storage` and `createTime` is not enough: `tableMeta` can have other "temporary" fields.
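
For context, a hedged sketch of what such a `normalize` helper could look like. The field names exist on `CatalogTable`, but the exact set of fields cleared below is an assumption for illustration, not necessarily what the merged code does:

```scala
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable}

// Sketch: reset every field that is irrelevant to query results before canonicalization,
// so that stats updates (or other "temporary" metadata changes) cannot change the cache key.
def normalize(table: CatalogTable): CatalogTable = table.copy(
  createTime = -1,
  lastAccessTime = -1,
  owner = "",
  storage = CatalogStorageFormat.empty,
  stats = None,                 // the field that was un-caching the table in this PR
  ignoredProperties = Map.empty
)
```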

@SparkQA commented Jan 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38480/

@SparkQA commented Jan 10, 2021

Test build #133891 has finished for PR 31112 at commit 6d3058e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38480/

@MaxGekk (Member, Author) commented Jan 10, 2021

@cloud-fan @HyukjinKwon @dongjoon-hyun Please review this PR.

@cloud-fan (Contributor):
@MaxGekk starting from which version do we have this perf issue?

@MaxGekk (Member, Author) commented Jan 11, 2021

> starting from which version do we have this perf issue?

I have checked 3.0; it has the issue.

@cloud-fan (Contributor):
thanks, merging to master!

@cloud-fan closed this in d97e991 on Jan 11, 2021
@cloud-fan (Contributor):
@MaxGekk please open backport PRs for 3.1/3.0, thanks!

MaxGekk added a commit to MaxGekk/spark that referenced this pull request Jan 11, 2021
…TER TABLE .. DROP PARTITION`

### What changes were proposed in this pull request?
Fix canonicalisation of `HiveTableRelation` by normalisation of `CatalogTable`, and exclude table stats and temporary fields from the canonicalized plan.

### Why are the changes needed?
This fixes the issue demonstrated by the example below:
```scala
scala> spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", true)
scala> sql(s"CREATE TABLE tbl (id int, part int) USING hive PARTITIONED BY (part)")
scala> sql("INSERT INTO tbl PARTITION (part=0) SELECT 0")
scala> sql("INSERT INTO tbl PARTITION (part=1) SELECT 1")
scala> sql("CACHE TABLE tbl")
scala> sql("SELECT * FROM tbl").show(false)
+---+----+
|id |part|
+---+----+
|0  |0   |
|1  |1   |
+---+----+

scala> spark.catalog.isCached("tbl")
scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
scala> spark.catalog.isCached("tbl")
res19: Boolean = false
```
`ALTER TABLE .. DROP PARTITION` must keep the table in the cache.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the drop partition command keeps the table in the cache while updating table stats:
```scala
scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
scala> spark.catalog.isCached("tbl")
res19: Boolean = true
```

### How was this patch tested?
By running new UT in `AlterTableDropPartitionSuite`.

Closes apache#31112 from MaxGekk/fix-caching-hive-table-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d97e991)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk added a commit to MaxGekk/spark that referenced this pull request Jan 11, 2021
…TER TABLE .. DROP PARTITION`

(cherry picked from commit d97e991)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
cloud-fan pushed a commit that referenced this pull request Jan 13, 2021
### What changes were proposed in this pull request?
Port the test added by #31112 to:
1. v1 In-Memory catalog for `ALTER TABLE .. DROP PARTITION`
2. v1 In-Memory and Hive external catalogs for `ALTER TABLE .. ADD PARTITION`
3. v1 In-Memory and Hive external catalogs for `ALTER TABLE .. RENAME PARTITION`

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableRenamePartitionSuite"
```

Closes #31131 from MaxGekk/cache-stats-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
dongjoon-hyun pushed a commit that referenced this pull request Jan 16, 2021
…d CatalogTable

### What changes were proposed in this pull request?
Replace `toMap` with `map(identity).toMap` when getting the canonicalized representation of `CatalogTable`. `CatalogTable` became non-serializable after #31112 due to the usage of `filterKeys`. The workaround was taken from scala/bug#7005.
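
For illustration, a small standalone snippet of the Scala 2.12 pitfall behind this change (the map contents here are made up; only the `filterKeys` / `map(identity)` behaviour matters, and it differs on Scala 2.13):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object FilterKeysDemo extends App {
  def serialize(o: AnyRef): Unit =
    new ObjectOutputStream(new ByteArrayOutputStream).writeObject(o)

  val props = Map("numFiles" -> "2", "transient_lastDdlTime" -> "1610000000")

  // On 2.12, filterKeys returns a lazy view (MapLike$$anon$1), and toMap on something
  // that is already a Map[K, V] returns the same object, so the view survives.
  val stillLazy = props.filterKeys(_ != "transient_lastDdlTime").toMap

  // map(identity) forces a strict, serializable copy (workaround from scala/bug#7005).
  val strict = props.filterKeys(_ != "transient_lastDdlTime").map(identity).toMap

  serialize(strict)                 // fine
  try serialize(stillLazy) catch {  // throws on Scala 2.12
    case e: NotSerializableException => println(s"not serializable: ${e.getMessage}")
  }
}
```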

### Why are the changes needed?
This prevents errors like:
```
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
[info]   Cause: java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
```

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
By running the test suite affected by #31112:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #31197 from MaxGekk/fix-caching-hive-table-2-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Jan 16, 2021
…d CatalogTable

(cherry picked from commit c3d81fb)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Jan 16, 2021
…d CatalogTable

(cherry picked from commit c3d81fb)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>