
[SPARK-41264][CONNECT][PYTHON] Make Literal support more datatypes #38800

Closed

Conversation

zhengruifeng
Contributor

What changes were proposed in this pull request?

1. On the server side, try to match https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L63-L101, and use `CreateArray`, `CreateStruct`, and `CreateMap` for complex inputs;

2. On the client side, try to match https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1335-L1349,
but do not support `datetime.time` since I could not find a corresponding SQL type for it.
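The client-side mapping described above can be sketched as a simple type dispatch. This is an illustrative sketch only: `literal_kind` and the returned tags are hypothetical names, not the actual `pyspark.sql.connect` API.

```python
import datetime
import decimal

def literal_kind(value):
    """Map a Python value to a literal-type tag (illustrative sketch)."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        # bool is a subclass of int, so it must be checked first
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, bytes):
        return "binary"
    if isinstance(value, decimal.Decimal):
        return "decimal"
    if isinstance(value, datetime.datetime):
        # datetime.datetime is a subclass of datetime.date, so check it first
        return "timestamp"
    if isinstance(value, datetime.date):
        return "date"
    raise TypeError(f"unsupported literal type: {type(value)}")
```

The ordering matters: `bool` before `int` and `datetime.datetime` before `datetime.date`, because each former type is a subclass of the latter.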

Why are the changes needed?

Try to support all data types.

Does this PR introduce any user-facing change?

No

How was this patch tested?

updated tests

@zhengruifeng zhengruifeng changed the title [SPARK-41264][CONNECT][PYTHON] Make Literal support more datatypes [SPARK-41264][CONNECT][PYTHON][WIP] Make Literal support more datatypes Nov 25, 2022
Contributor

@grundprinzip grundprinzip left a comment


Thanks for doing the tedious work!

message CalendarInterval {
  int32 months = 1;
  int32 days = 2;
  int64 microseconds = 3;
}
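The proposed message keeps months, days, and sub-day microseconds as separate components, since months have no fixed length. A small Python mirror shows how a `timedelta`-style value could populate it; `from_timedelta` is a hypothetical helper for illustration, not part of this PR.

```python
import datetime
from dataclasses import dataclass

@dataclass
class CalendarInterval:
    """Python mirror of the proposed proto message (sketch)."""
    months: int
    days: int
    microseconds: int

def from_timedelta(td: datetime.timedelta) -> CalendarInterval:
    # days and the sub-day remainder stay separate, as in the proto;
    # a timedelta carries no month component, so months is 0
    micros = td.seconds * 1_000_000 + td.microseconds
    return CalendarInterval(months=0, days=td.days, microseconds=micros)
```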

message IntervalYearToMonth {
Contributor


if you're removing the reference above, please remove it here as well.

Contributor Author


will update

// Date in units of days since the UNIX epoch.
int32 date = 16;
// Time in units of microseconds past midnight
int64 time = 17;
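The encodings in those field comments are straightforward to compute; a minimal sketch, with illustrative helper names:

```python
import datetime

UNIX_EPOCH = datetime.date(1970, 1, 1)

def date_to_days(d: datetime.date) -> int:
    """Date in units of days since the UNIX epoch."""
    return (d - UNIX_EPOCH).days

def time_to_micros(t: datetime.time) -> int:
    """Time in units of microseconds past midnight."""
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond
```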
Contributor


is this the `datetime.time`?

Contributor Author


yes, I could not find a corresponding SQL type for `datetime.time`, so I removed it.

Comment on lines -69 to -72
DataType null = 29; // a typed null literal
List list = 30;
DataType.Array empty_array = 31;
DataType.Map empty_map = 32;
Contributor


Please clean up the below elements as well, thank you!

Contributor Author


got it!

@zhengruifeng zhengruifeng force-pushed the connect_update_literal branch 2 times, most recently from e3e5ccb to e271a41 Compare November 28, 2022 01:41
@zhengruifeng zhengruifeng marked this pull request as ready for review November 28, 2022 01:42
@zhengruifeng zhengruifeng changed the title [SPARK-41264][CONNECT][PYTHON][WIP] Make Literal support more datatypes [SPARK-41264][CONNECT][PYTHON] Make Literal support more datatypes Nov 28, 2022
@zhengruifeng
Contributor Author

also cc @HyukjinKwon @amaliujia @cloud-fan

@@ -40,37 +40,34 @@ message Expression {

message Literal {
oneof literal_type {
Contributor


In Catalyst, a Literal consists of a value and a type. The benefit is that there aren't many physical values (timestamp and timestamp_ntz are both int64). When we need to add a new data type with an existing physical value (e.g. char and varchar), we don't need to update Literal.

This is not a design decision made by this PR, so I won't block this PR for this reason.

Contributor


This only works in Scala because of the inherited type hierarchy, the availability of an Any type, and the corresponding apply/unapply methods.

Since proto does not have this concept of inheritance, we need to embed the physical type with the logical one.
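The trade-off under discussion can be illustrated with a small sketch of the Catalyst-style representation, in which two logical types share one physical value. The class and field names here are illustrative, not Catalyst's actual API.

```python
from dataclasses import dataclass

@dataclass
class CatalystLiteral:
    """Catalyst-style literal: one physical value plus a logical type tag (sketch)."""
    value: object      # physical value, e.g. int64 microseconds
    data_type: str     # logical type name

# timestamp and timestamp_ntz share the same int64 physical representation,
# so adding one of them does not require a new physical value kind
ts = CatalystLiteral(1_669_600_000_000_000, "timestamp")
ts_ntz = CatalystLiteral(1_669_600_000_000_000, "timestamp_ntz")
```

A proto `oneof`, by contrast, needs a distinct field per kind, which is why the logical type must be embedded alongside the physical one.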

@@ -134,6 +138,37 @@ def test_list_to_literal(self):
lit_list_plan = fun.lit([fun.lit(10), fun.lit("str")]).to_plan(None)
self.assertIsNotNone(lit_list_plan)

def test_tuple_to_literal(self):
Contributor


QQ is this Python specific?

Contributor Author


yes, Scala `lit` does not support tuple and map.
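A sketch of the Python-only conversions being tested here (tuple to struct, dict to map, list to array). The tagged-tuple output is purely illustrative, standing in for the real proto messages.

```python
def to_literal(v):
    """Recursively tag Python containers the way the client maps them (sketch)."""
    if isinstance(v, tuple):
        # tuples become struct literals (Python-specific, per this thread)
        return ("struct", [to_literal(x) for x in v])
    if isinstance(v, dict):
        # dicts become map literals of (key, value) pairs
        return ("map", [(to_literal(k), to_literal(x)) for k, x in v.items()])
    if isinstance(v, list):
        return ("array", [to_literal(x) for x in v])
    return ("scalar", v)
```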

@amaliujia
Contributor

amaliujia commented Nov 28, 2022

LGTM

A big piece of remaining work is test coverage over literals. For example, can NaN, +inf, and -inf pass through the Connect proto, and is there any test case for them? I also have tons of questions about value ranges for timestamps and time zones.

This is not caused by the current PR, so it is not a blocker.

// (ignoring precision) Always 16 bytes in length
bytes value = 1;
// the string representation.
string value = 1;
Contributor


Why not bytes?

Contributor Author


refer to the constructors of Decimal https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L541-L564

I think we can use:

  def apply(unscaled: Long, precision: Int, scale: Int): Decimal =
    new Decimal().set(unscaled, precision, scale)

  def apply(value: String): Decimal = new Decimal().set(BigDecimal(value))

I just chose string for two reasons:
1. it is convenient;
2. I am not sure how to get the unscaled value and scale from a Python decimal.
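For the record, one way to get the unscaled value and scale from a Python decimal is via `Decimal.as_tuple()`. This is a sketch assuming finite, non-special values with a non-positive exponent; the helper name is illustrative.

```python
from decimal import Decimal

def unscaled_and_scale(d: Decimal) -> tuple[int, int]:
    """Return (unscaled, scale) so that d == unscaled * 10**(-scale)."""
    sign, digits, exponent = d.as_tuple()
    unscaled = int("".join(str(x) for x in digits))
    if sign:
        unscaled = -unscaled
    return unscaled, -exponent
```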

support tuple 2 struct

fix test

address comments

init
@zhengruifeng
Contributor Author

A big piece of remaining work is test coverage over literals. For example, can NaN, +inf, and -inf pass through the Connect proto, and is there any test case for them? I also have tons of questions about value ranges for timestamps and time zones.

@amaliujia added a simple test for NaN and inf. Will add more tests as follow-ups.
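The round-trip concern can be checked locally: NaN and +/-inf survive an IEEE-754 binary64 encode/decode, which is the wire representation a proto `double` field uses. This is an illustrative sketch, not Connect code.

```python
import math
import struct

def roundtrip(x: float) -> float:
    # serialize to 8 little-endian bytes and back, like a fixed64 double on the wire
    return struct.unpack("<d", struct.pack("<d", x))[0]
```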

@zhengruifeng
Contributor Author

Since I have other PRs depending on this work, let me merge it into master now. Thanks for the reviews!

@zhengruifeng zhengruifeng deleted the connect_update_literal branch November 29, 2022 06:38
Member

@HyukjinKwon HyukjinKwon left a comment


LGTM thanks!

beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
### What changes were proposed in this pull request?
1. On the server side, try to match https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L63-L101, and use `CreateArray`, `CreateStruct`, and `CreateMap` for complex inputs;

2. On the client side, try to match https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1335-L1349,
but do not support `datetime.time` since I could not find a corresponding SQL type for it.

### Why are the changes needed?
Try to support all data types.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
updated tests

Closes apache#38800 from zhengruifeng/connect_update_literal.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022