[SPARK-33591][SQL] Recognize `null` in partition spec values #30538
Conversation
Test build #131936 has finished for PR 30538 at commit
@cloud-fan @HyukjinKwon Please take a look at this PR.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
Test build #133236 has finished for PR 30538 at commit
Test build #133273 has finished for PR 30538 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
…e-null

# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowPartitionsExec.scala
#	sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

Test build #133459 has started for PR 30538 at commit
Kubernetes integration test starting
Kubernetes integration test status success
retest this please
```scala
@@ -169,9 +169,15 @@ object ExternalCatalogUtils {
      spec1: TablePartitionSpec,
      spec2: TablePartitionSpec): Boolean = {
    spec1.forall {
      case (partitionColumn, null | DEFAULT_PARTITION_NAME) =>
```
Can we add a util function `isInvalidPartitionValue`? Then the code can be:

```scala
case (partitionColumn, value) if isInvalidPartitionValue(value) =>
  isInvalidPartitionValue(spec2(partitionColumn))
```
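For context, here is a standalone sketch of what such a helper and the surrounding match could look like. The object name `PartitionSpecMatcher` and the exact matching logic are illustrative approximations, not the actual Spark source; `DEFAULT_PARTITION_NAME` stands in for Spark's `__HIVE_DEFAULT_PARTITION__` constant from `ExternalCatalogUtils`:

```scala
// Illustrative sketch of the suggested helper, not the exact Spark code.
object PartitionSpecMatcher {
  type TablePartitionSpec = Map[String, String]

  // Hive's placeholder directory name for null partition values.
  val DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__"

  // A partition value is "null-like" if it is a real null or the
  // Hive placeholder for null partition values.
  def isNullPartitionValue(value: String): Boolean =
    value == null || value == DEFAULT_PARTITION_NAME

  // spec1 partially matches spec2 if every column in spec1 agrees with
  // spec2, treating null and __HIVE_DEFAULT_PARTITION__ as equal.
  def isPartialPartitionSpec(
      spec1: TablePartitionSpec,
      spec2: TablePartitionSpec): Boolean =
    spec1.forall {
      case (col, value) if isNullPartitionValue(value) =>
        spec2.contains(col) && isNullPartitionValue(spec2(col))
      case (col, value) =>
        spec2.get(col).contains(value)
    }
}
```

With this shape, the `null | DEFAULT_PARTITION_NAME` alternative in the pattern match collapses into a single guarded case, which was the point of the suggestion.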
Will we hit an empty string partition value here?
> Can we add a util function `isInvalidPartitionValue`?

- Why are the partition values invalid? They are still valid here.
- Where else will the function be used? Since this is the only place, wouldn't it be better to keep the code inlined here?

> will we hit empty string partition value here?

An empty string is handled earlier, so we cannot have it here. For example, `SessionCatalog.createPartitions` -> `requireNonEmptyValueInPartitionSpec`, which is called before `externalCatalog.createPartitions`, where we convert `null` to `__HIVE_DEFAULT_PARTITION__`.
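As a rough sketch of the flow described here (illustrative only, not the actual `SessionCatalog` source; the object name `PartitionSpecNormalizer` is hypothetical): empty values are rejected first, then real `null`s are rewritten to the Hive default partition name before the spec reaches the external catalog.

```scala
// Illustrative sketch of the order of operations described above.
object PartitionSpecNormalizer {
  type TablePartitionSpec = Map[String, String]
  val DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__"

  // Mirrors the role of requireNonEmptyValueInPartitionSpec: empty
  // string values never reach the external catalog.
  def requireNonEmptyValueInPartitionSpec(spec: TablePartitionSpec): Unit =
    require(
      spec.values.forall(v => v == null || v.nonEmpty),
      "Partition spec contains an empty partition column value")

  // Mirrors the conversion done before externalCatalog.createPartitions:
  // null partition values become the Hive default partition name.
  def convertNullPartitionValues(spec: TablePartitionSpec): TablePartitionSpec =
    spec.map { case (col, v) =>
      col -> (if (v == null) DEFAULT_PARTITION_NAME else v)
    }
}
```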
How about `isNullPartitionValue`?
...src/test/scala/org/apache/spark/sql/execution/command/AlterTableDropPartitionSuiteBase.scala
Test build #133495 has finished for PR 30538 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
Test build #133808 has finished for PR 30538 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #133811 has finished for PR 30538 at commit
@MaxGekk It doesn't compile with Scala 2.13.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #133828 has finished for PR 30538 at commit
thanks, merging to master!
@MaxGekk can you send backport PRs for 3.1/3.0? thanks!
### What changes were proposed in this pull request?
1. Recognize `null` while parsing partition specs, and put `null` instead of `"null"` as the partition value.
2. For the V1 catalog: replace `null` with `__HIVE_DEFAULT_PARTITION__`.
3. For V2 catalogs: pass `null` as is, and let catalog implementations decide how to handle `null`s as partition values in the spec.

### Why are the changes needed?
Currently, `null` in partition specs is recognized as the `"null"` string, which can lead to incorrect results, for example:
```sql
spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
spark-sql> SELECT isnull(p1) FROM tbl5;
false
```
Even though we inserted a row into the partition with the `null` value, **the resulting table doesn't contain `null`**.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works as expected:
```sql
spark-sql> SELECT isnull(p1) FROM tbl5;
true
```

### How was this patch tested?
1. By running the affected test suites `SQLQuerySuite`, `AlterTablePartitionV2SQLSuite` and `v1/ShowPartitionsSuite`.
2. By compiling with Scala 2.13:
```
$ ./dev/change-scala-version.sh 2.13
$ ./build/sbt -Pscala-2.13 compile
```

Closes apache#30538 from MaxGekk/partition-spec-value-null.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 157b72a)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Here are backports:
…artition spec values

### What changes were proposed in this pull request?
This is a follow-up for #30538. It adds a legacy conf `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` in case users want the legacy behavior. It also adds documentation for the behavior change.

### Why are the changes needed?
In case users want the legacy behavior, they can set `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` to true.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a legacy configuration to restore the old behavior.

### How was this patch tested?
Unit test.

Closes #31421 from gengliangwang/legacyNullStringConstant.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
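Based on the description above, usage of the legacy flag would look roughly like this (a sketch, assuming the conf behaves as described in this follow-up; `tbl5` is the example table from the main PR description):

```sql
-- Restore the pre-fix behavior: `null` in a partition spec is parsed
-- as the string literal 'null' rather than a real NULL.
SET spark.sql.legacy.parseNullPartitionSpecAsStringLiteral=true;
INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
SELECT isnull(p1) FROM tbl5;
-- expected under the legacy behavior: false, since p1 holds the string 'null'
```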
…ull partition spec values

### What changes were proposed in this pull request?
This PR backports #31421 and #31421 to branch 3.0.

This is a follow-up for #30538. It adds a legacy conf `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` in case users want the legacy behavior. It also adds documentation for the behavior change.

### Why are the changes needed?
In case users want the legacy behavior, they can set `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` to true.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a legacy configuration to restore the old behavior.

### How was this patch tested?
Unit test.

Closes #31441 from gengliangwang/backportLegacyConf3.0.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…ull partition spec values

### What changes were proposed in this pull request?
This PR backports #31421 and #31434 to branch 3.1.

This is a follow-up for #30538. It adds a legacy conf `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` in case users want the legacy behavior. It also adds documentation for the behavior change.

### Why are the changes needed?
In case users want the legacy behavior, they can set `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` to true.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a legacy configuration to restore the old behavior.

### How was this patch tested?
Unit test.

Closes #31439 from gengliangwang/backportLegacyConf3.1.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>