[SPARK-33591][SQL] Recognize `null` in partition spec values #30538

MaxGekk · 2020-11-29T20:00:26Z

What changes were proposed in this pull request?

1. Recognize null while parsing partition specs, and put null instead of "null" as partition values.
2. For V1 catalog: replace null by __HIVE_DEFAULT_PARTITION__.
3. For V2 catalogs: pass null AS IS, and let catalog implementations decide how to handle nulls as partition values in the spec.
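
To illustrate the intended behavior, here is a minimal sketch rather than the actual parser or catalog code. `SpecValue`, `NullLiteral`, `StringLiteral`, `toPartitionValue`, `forV1Catalog` and `forV2Catalog` are hypothetical names used only for this example; only the `__HIVE_DEFAULT_PARTITION__` literal comes from the description above.

```scala
object PartitionSpecValues {
  // Literal taken from the PR description.
  val DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__"

  // Hypothetical stand-ins for parsed partition values: the unquoted
  // keyword null versus an ordinary string literal such as 'null'.
  sealed trait SpecValue
  case object NullLiteral extends SpecValue
  final case class StringLiteral(text: String) extends SpecValue

  // The null keyword becomes a real null, not the four-character text "null".
  def toPartitionValue(value: SpecValue): String = value match {
    case NullLiteral => null
    case StringLiteral(text) => text
  }

  // V1 catalog: a null value is replaced by the default partition name.
  def forV1Catalog(value: String): String =
    if (value == null) DEFAULT_PARTITION_NAME else value

  // V2 catalogs: the value is passed through unchanged, null included.
  def forV2Catalog(value: String): String = value
}
```

With this model, `PARTITION (p1 = null)` and `PARTITION (p1 = 'null')` yield different partition values, which is the distinction the change introduces.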

Why are the changes needed?

Currently, null in partition specs is parsed as the string "null", which can lead to incorrect results. For example:

spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
spark-sql> SELECT isnull(p1) FROM tbl5;
false

Even though we inserted a row into the partition with the null value, the resulting table doesn't contain null.

Does this PR introduce any user-facing change?

Yes. After the changes, the example above works as expected:

spark-sql> SELECT isnull(p1) FROM tbl5;
true

How was this patch tested?

1. By running the affected test suites SQLQuerySuite, AlterTablePartitionV2SQLSuite and v1/ShowPartitionsSuite.
2. Compiling with Scala 2.13:

$ ./dev/change-scala-version.sh 2.13
$ ./build/sbt -Pscala-2.13 compile

SparkQA · 2020-11-30T00:24:36Z

Test build #131936 has finished for PR 30538 at commit cbf79f1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2020-11-30T06:00:02Z

@cloud-fan @HyukjinKwon Please, take a look at this PR.

SparkQA · 2020-12-22T21:32:40Z

Test build #133236 has finished for PR 30538 at commit 4e4b6cf.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-23T07:46:02Z

Test build #133273 has finished for PR 30538 at commit 343d15d.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-23T08:37:46Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37871/

SparkQA · 2020-12-23T09:14:46Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37871/

SparkQA · 2020-12-28T20:20:38Z

Test build #133459 has started for PR 30538 at commit af2ec3c.

SparkQA · 2020-12-28T20:57:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38048/

SparkQA · 2020-12-28T21:25:59Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38048/

cloud-fan · 2020-12-29T13:48:03Z

retest this please

cloud-fan · 2020-12-29T14:10:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala

@@ -169,9 +169,15 @@ object ExternalCatalogUtils {
      spec1: TablePartitionSpec,
      spec2: TablePartitionSpec): Boolean = {
    spec1.forall {
+      case (partitionColumn, null | DEFAULT_PARTITION_NAME) =>


Can we add a util function isInvalidPartitionValue? Then the code can be

case (partitionColumn, value) if isInvalidPartitionValue(value) => isInvalidPartitionValue(spec2(partitionColumn))

cloud-fan · 2020-12-29

Will we hit an empty string partition value here?

MaxGekk · 2021-01-06

> Can we add a util function isInvalidPartitionValue?

Why are partition values invalid? They are still valid here.

Where else will the function be used? Since this is the only place, wouldn't it be better to keep the code embedded here?

> will we hit empty string partition value here?

The empty string is handled earlier, so we cannot have it here. For example, SessionCatalog.createPartitions -> requireNonEmptyValueInPartitionSpec, which is called before externalCatalog.createPartitions, where we convert null to __HIVE_DEFAULT_PARTITION__.

cloud-fan · 2021-01-07

How about isNullPartitionValue?
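
For readers following the thread, a minimal sketch of what the suggested helper could look like. This is an assumption for illustration, not the merged code: `specsMatch` is a hypothetical name for the enclosing function, and the `TablePartitionSpec` alias is redefined locally so the snippet is self-contained.

```scala
object PartitionSpecMatching {
  // Constant name mirrors the diff above; the literal is from the PR description.
  val DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__"

  // Local alias so the example compiles without Spark on the classpath.
  type TablePartitionSpec = Map[String, String]

  // Helper named after the suggestion in this thread: both null and the
  // default partition name denote the null partition.
  def isNullPartitionValue(value: String): Boolean =
    value == null || value == DEFAULT_PARTITION_NAME

  // Hypothetical name. Returns true if every entry of spec1 matches the
  // corresponding entry of spec2, treating null and the default partition
  // name as equivalent. A column missing from spec2 counts as a mismatch.
  def specsMatch(spec1: TablePartitionSpec, spec2: TablePartitionSpec): Boolean =
    spec1.forall {
      case (partitionColumn, value) if isNullPartitionValue(value) =>
        spec2.get(partitionColumn).exists(isNullPartitionValue)
      case (partitionColumn, value) =>
        spec2.get(partitionColumn).contains(value)
    }
}
```

Using `spec2.get(...)` instead of `spec2(...)` is a choice made for this sketch so that a missing column yields `false` rather than an exception; the snippet quoted in the review indexes the map directly.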

SparkQA · 2020-12-29T14:19:27Z

Test build #133495 has finished for PR 30538 at commit af2ec3c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-29T14:44:34Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38084/

SparkQA · 2020-12-29T15:12:36Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38084/

SparkQA · 2021-01-07T20:56:56Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38397/

SparkQA · 2021-01-07T21:24:45Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38397/

SparkQA · 2021-01-07T22:00:52Z

Test build #133808 has finished for PR 30538 at commit 17938dc.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-01-07T23:01:03Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38400/

SparkQA · 2021-01-07T23:35:46Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38400/

SparkQA · 2021-01-08T02:45:16Z

Test build #133811 has finished for PR 30538 at commit 71ca35a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-01-08T05:14:52Z

[error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala:183:19: type mismatch;
[error]  found   : scala.collection.MapView[String,String]
[error]  required: org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec
[error]     (which expands to)  scala.collection.immutable.Map[String,String]
[error]     spec.mapValues(v => if (v == null) DEFAULT_PARTITION_NAME else v)
[error]                   ^

@MaxGekk It doesn't compile with Scala 2.13.
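
Some background on the error above: in Scala 2.13, `Map.mapValues` returns a lazy `MapView` instead of a `Map`, so its result no longer conforms to `immutable.Map[String, String]`. The sketch below shows one way to write the conversion so that it compiles on both 2.12 and 2.13. It is an illustration under stated assumptions, not necessarily what the fix commit does; `convertNullPartitionValues` is a hypothetical name, and the type alias is redefined locally.

```scala
object NullPartitionValues {
  val DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__"

  // Local alias so the example compiles without Spark on the classpath.
  type TablePartitionSpec = Map[String, String]

  // Mapping over key/value pairs builds a strict immutable Map on both
  // Scala 2.12 and 2.13, so the result type matches TablePartitionSpec.
  def convertNullPartitionValues(spec: TablePartitionSpec): TablePartitionSpec =
    spec.map { case (column, value) =>
      column -> (if (value == null) DEFAULT_PARTITION_NAME else value)
    }
}
```

An alternative is to keep `mapValues` and append `.toMap`, which materializes the view into an immutable `Map`; note that `mapValues` is deprecated in 2.13.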

SparkQA · 2021-01-08T09:07:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38417/

SparkQA · 2021-01-08T09:41:36Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38417/

SparkQA · 2021-01-08T12:38:01Z

Test build #133828 has finished for PR 30538 at commit 89c1572.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-01-08T14:14:18Z

thanks, merging to master!

cloud-fan · 2021-01-08T14:15:12Z

@MaxGekk can you send backport PRs for 3.1/3.0? thanks!

1. Recognize `null` while parsing partition specs, and put `null` instead of `"null"` as partition values.
2. For V1 catalog: replace `null` by `__HIVE_DEFAULT_PARTITION__`.
3. For V2 catalogs: pass `null` AS IS, and let catalog implementations decide how to handle `null`s as partition values in spec.

Currently, `null` in partition specs is recognized as the `"null"` string which could lead to incorrect results, for example:

```sql
spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
spark-sql> SELECT isnull(p1) FROM tbl5;
false
```

Even though we inserted a row into the partition with the `null` value, **the resulting table doesn't contain `null`**.

Yes. After the changes, the example above works as expected:

```sql
spark-sql> SELECT isnull(p1) FROM tbl5;
true
```

1. By running the affected test suites `SQLQuerySuite`, `AlterTablePartitionV2SQLSuite` and `v1/ShowPartitionsSuite`.
2. Compiling with Scala 2.13:

```
$ ./dev/change-scala-version.sh 2.13
$ ./build/sbt -Pscala-2.13 compile
```

Closes apache#30538 from MaxGekk/partition-spec-value-null.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 157b72a)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

MaxGekk · 2021-01-08T15:52:50Z

Here are backports:

branch-3.0: [SPARK-33591][SQL][3.0] Recognize null in partition spec values #31095
branch-3.1: [SPARK-33591][SQL][3.1] Recognize null in partition spec values #31094

…artition spec values

### What changes were proposed in this pull request?

This is a follow up for #30538. It adds a legacy conf `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` in case users want the legacy behavior. It also adds documentation for the behavior change.

### Why are the changes needed?

In case users want the legacy behavior, they can set `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` to true.

### Does this PR introduce _any_ user-facing change?

Yes, adding a legacy configuration to restore the old behavior.

### How was this patch tested?

Unit test.

Closes #31421 from gengliangwang/legacyNullStringConstant.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

…ull partition spec values

### What changes were proposed in this pull request?

This PR is to backport #31421 and #31421 to branch 3.0

This is a follow up for #30538. It adds a legacy conf `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` in case users want the legacy behavior. It also adds documentation for the behavior change.

### Why are the changes needed?

In case users want the legacy behavior, they can set `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` to true.

### Does this PR introduce _any_ user-facing change?

Yes, adding a legacy configuration to restore the old behavior.

### How was this patch tested?

Unit test.

Closes #31441 from gengliangwang/backportLegacyConf3.0.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

…ull partition spec values

### What changes were proposed in this pull request?

This PR is to backport #31421 and #31434 to branch 3.1

This is a follow up for #30538. It adds a legacy conf `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` in case users want the legacy behavior. It also adds documentation for the behavior change.

### Why are the changes needed?

In case users want the legacy behavior, they can set `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` to true.

### Does this PR introduce _any_ user-facing change?

Yes, adding a legacy configuration to restore the old behavior.

### How was this patch tested?

Unit test.

Closes #31439 from gengliangwang/backportLegacyConf3.1.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

MaxGekk added 7 commits November 29, 2020 18:52

Add a test

6369247

Match to NullLiteralContext

4504ab7

Fix NPE

80c8972

Add tests

cfb2670

Add tests for SHOW PARTITIONS

7f0527d

Add tests for SHOW PARTITIONS with partition

2135dc2

Add JIRA to tests

b0c55eb

github-actions bot added the SQL label Nov 29, 2020

Fix list partitions

cbf79f1

cloud-fan reviewed Dec 2, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala

cloud-fan reviewed Dec 2, 2020

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

MaxGekk added 2 commits December 22, 2020 19:16

Merge remote-tracking branch 'origin/master' into partition-spec-valu…

9f2e04e

…e-null

Fix null handling in show partitions

4e4b6cf

Merge remote-tracking branch 'origin/master' into partition-spec-valu…

343d15d

…e-null

Merge remote-tracking branch 'origin/master' into partition-spec-valu…

af2ec3c

…e-null # Conflicts: # sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowPartitionsExec.scala # sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

cloud-fan reviewed Dec 29, 2020

...src/test/scala/org/apache/spark/sql/execution/command/AlterTableDropPartitionSuiteBase.scala

Remove unneeded line

17938dc

Fix AlterTableDropPartitionSuite

71ca35a

cloud-fan approved these changes Jan 8, 2021

Fix for scala 2.13

89c1572

cloud-fan closed this in 157b72a Jan 8, 2021

gengliangwang mentioned this pull request Feb 1, 2021

[SPARK-33591][SQL][FOLLOWUP] Add legacy config for recognizing null partition spec values #31421

Closed

gengliangwang mentioned this pull request Feb 2, 2021

[SPARK-33591][3.1][SQL][FOLLOWUP] Add legacy config for recognizing null partition spec values #31439

Closed

gengliangwang mentioned this pull request Feb 2, 2021

[SPARK-33591][3.0][SQL][FOLLOWUP] Add legacy config for recognizing null partition spec values #31441

Closed

AngersZhuuuu mentioned this pull request Feb 20, 2021

[SPARK-33474][SQL] Support TypeConstructed partition spec value #30421

Closed
