
[SPARK-33529][SQL] Handle '__HIVE_DEFAULT_PARTITION__' while resolving V2 partition specs #30482

Closed
Wants to merge 8 commits

Conversation

MaxGekk (Member) commented Nov 24, 2020

What changes were proposed in this pull request?

  1. Extract the partition-value casting code from DSv1 into the common place sql.util.PartitioningUtils, as the method castPartitionValues().
  2. Re-use castPartitionValues() in the DSv2 resolver of partition specs, ResolvePartitionSpec.
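The sentinel handling that castPartitionValues() centralizes can be sketched roughly as follows. This is a minimal, self-contained sketch, not the actual Spark code: the object and method names (PartitioningUtilsSketch, normalizePartitionValue) are hypothetical simplifications, and the real method casts values to the partition schema's types rather than returning strings.

```scala
// Hedged sketch: the Hive default-partition marker is mapped to null
// before partition values are cast to the partition schema.
object PartitioningUtilsSketch {
  // Spark/Hive sentinel string that stands for a null partition value.
  val DefaultPartitionName: String = "__HIVE_DEFAULT_PARTITION__"

  // Returns None (i.e. a null partition value) for the sentinel,
  // otherwise the raw value, which would then be cast to the column type.
  def normalizePartitionValue(raw: String): Option[String] =
    if (raw == DefaultPartitionName) None else Some(raw)
}
```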

Why are the changes needed?

To have the same behavior as DSv1, which interprets __HIVE_DEFAULT_PARTITION__ as NULL:

spark-sql> CREATE TABLE tbl11 (id int, part0 string) USING parquet PARTITIONED BY (part0);
spark-sql> ALTER TABLE tbl11 ADD PARTITION (part0 = '__HIVE_DEFAULT_PARTITION__');
spark-sql> INSERT INTO tbl11 PARTITION (part0='__HIVE_DEFAULT_PARTITION__') SELECT 1;
spark-sql> SELECT * FROM tbl11;
1	NULL

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Added a new test to AlterTablePartitionV2SQLSuite.


object PartitioningUtils {
private[sql] object PartitioningUtils {
MaxGekk (Member, Author) commented:

Addressed @cloud-fan 's comment #30454 (comment)

.asTableCatalog
.loadTable(Identifier.of(Array("ns1", "ns2"), "tbl"))
.asPartitionable
val expectedPartition = InternalRow.fromSeq(Seq[Any](null))
MaxGekk (Member, Author) commented Nov 24, 2020:
'__HIVE_DEFAULT_PARTITION__' should be handled as null

SparkQA commented Nov 24, 2020

Test build #131641 has finished for PR 30482 at commit 4ad95c5.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions github-actions bot added the SQL label Nov 24, 2020
.asPartitionable
val expectedPartition = InternalRow.fromSeq(Seq[Any](null))
assert(!partTable.partitionExists(expectedPartition))
val partSpec = "PARTITION (part0 = '__HIVE_DEFAULT_PARTITION__')"
cloud-fan (Contributor) commented:

I'm not sure about this. It's more of a Hive-specific thing, and we should let v2 implementations decide how to handle null partition values. This should be an internal detail and shouldn't be exposed to end users.

MaxGekk (Member, Author) commented:

OK. How can users specify a null partition value?

cloud-fan (Contributor) commented:

Does part_col = null work?

MaxGekk (Member, Author) commented:

For example, if we have a string-typed partition column, how could we distinguish null from "null"?

cloud-fan (Contributor) commented:

The parser should recognize different literals, e.g. part_col = null and part_col = "null".

MaxGekk (Member, Author) commented:

> does part_col = null work?

I have checked that: null is recognized as the string "null".

MaxGekk (Member, Author) commented:

> It's more like a hive specific thing and we should let v2 implementation to decide ...

It is already a Spark-specific thing too. Implementations don't see '__HIVE_DEFAULT_PARTITION__' at all because it is replaced by null at the analysis phase.
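The analysis-phase replacement described in this comment could be sketched, in very simplified form, as below. This is a hypothetical illustration, not the actual ResolvePartitionSpec rule: the name ResolvePartitionSpecSketch and the plain Map-based signature are assumptions for the sketch; the real rule operates on analyzed plans and typed partition specs.

```scala
// Hedged sketch: every sentinel value in a user-supplied partition spec
// is rewritten to null during analysis, so a v2 catalog implementation
// never sees the '__HIVE_DEFAULT_PARTITION__' string itself.
object ResolvePartitionSpecSketch {
  val DefaultPartitionName: String = "__HIVE_DEFAULT_PARTITION__"

  def resolveSpec(spec: Map[String, String]): Map[String, String] =
    spec.map { case (col, value) =>
      col -> (if (value == DefaultPartitionName) null else value)
    }
}
```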

SparkQA commented Nov 24, 2020

Test build #131647 has finished for PR 30482 at commit 26b83a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

MaxGekk (Member, Author) commented Nov 24, 2020

jenkins, retest this, please

SparkQA commented Nov 24, 2020

Test build #131694 has started for PR 30482 at commit 26b83a1.

MaxGekk (Member, Author) commented Nov 26, 2020

jenkins, retest this, please

SparkQA commented Nov 26, 2020

Test build #131833 has finished for PR 30482 at commit 26b83a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

MaxGekk (Member, Author) commented Dec 28, 2020

@cloud-fan Should I close this?

cloud-fan (Contributor) commented:

Yea, let's close it. __HIVE_DEFAULT_PARTITION__ should just be a normal string; only the Hive catalog should handle it specially.

@MaxGekk MaxGekk closed this Dec 29, 2020
@MaxGekk MaxGekk deleted the dsv2-default-hive-partition branch February 19, 2021 15:03