[SPARK-47211][CONNECT][PYTHON] Fix ignored PySpark Connect string collation #45316
Conversation
```diff
@@ -3397,6 +3397,14 @@ def test_df_caache(self):
         self.assert_eq(10, df.count())
         self.assertTrue(df.is_cached)

+    def test_collated_string(self):
```
Shall we add the test to `pyspark.sql.tests.test_types.TypesTestsMixin`, so that `pyspark.sql.tests.connect.test_parity_types` can inherit it automatically?
@xinrong-meng Moved the test and added it to `test_parity_types`, please check.
Looks good, thank you!
Just FYI, we don't need to add the test to `test_parity_types`; it will be inherited automatically.
Thanks @HyukjinKwon for the fix.
```diff
@@ -139,7 +139,7 @@ def pyspark_types_to_proto_types(data_type: DataType) -> pb2.DataType:
     if isinstance(data_type, NullType):
         ret.null.CopyFrom(pb2.DataType.NULL())
     elif isinstance(data_type, StringType):
-        ret.string.CopyFrom(pb2.DataType.String())
+        ret.string.collation_id = data_type.collationId
```
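To make the intent of this one-line change concrete, here is a minimal sketch, not the actual Spark Connect code: the proto message and PySpark `StringType` are replaced with simple stand-in dataclasses, illustrating why the collation id must be copied explicitly in both conversion directions rather than relying on the default `String()` message.

```python
# Hedged illustration with stand-in classes (the real code lives in
# pyspark.sql.connect.types and uses generated protobuf messages).
from dataclasses import dataclass, field


@dataclass
class ProtoString:            # stand-in for pb2.DataType.String
    collation_id: int = 0


@dataclass
class ProtoDataType:          # stand-in for pb2.DataType
    string: ProtoString = field(default_factory=ProtoString)


@dataclass
class StringType:             # stand-in for pyspark.sql.types.StringType
    collationId: int = 0


def to_proto(dt: StringType) -> ProtoDataType:
    ret = ProtoDataType()
    # Before the fix, the collation id was never copied into the proto
    # message here, so any non-default collation was silently dropped.
    ret.string.collation_id = dt.collationId
    return ret


def from_proto(msg: ProtoDataType) -> StringType:
    # The reverse direction must read the id back out as well.
    return StringType(collationId=msg.string.collation_id)


# A round trip now preserves the collation id.
assert from_proto(to_proto(StringType(collationId=2))).collationId == 2
assert from_proto(to_proto(StringType())).collationId == 0
```

The point of the review thread below is exactly this symmetry: a unit test that exercises only `from_proto` would not have caught the missing line in `to_proto`.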
Is this change tested? The added UT seems to test the proto -> type conversion but not vice versa.
@amaliujia The new changes cover conversions in both directions (both get invoked during testing), please check.
LGTM
```diff
@@ -86,6 +86,9 @@ def test_rdd_with_udt(self):
     def test_udt(self):
         super().test_udt()

+    def test_collated_string(self):
```
We can actually just remove this; the test will run automatically (by inheritance).
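The inheritance pattern the reviewers are relying on can be sketched roughly like this (class and module names are simplified stand-ins, not the actual Spark test code): tests are defined once on a mixin, and both the classic suite and the Connect parity suite pick them up by subclassing, so re-declaring a test in the parity class is redundant.

```python
import unittest


class TypesTestsMixin:
    # Tests defined once on the mixin are shared by every suite that
    # inherits from it.
    def test_collated_string(self):
        self.assertTrue(True)  # placeholder body for illustration


class TypesTests(TypesTestsMixin, unittest.TestCase):
    """Stand-in for the classic (non-Connect) test suite."""


class TypesParityTests(TypesTestsMixin, unittest.TestCase):
    """Stand-in for the Connect parity suite: it inherits
    test_collated_string automatically, so no override is needed."""


# Both suites expose the same test method via inheritance.
assert hasattr(TypesTests, "test_collated_string")
assert "test_collated_string" in unittest.TestLoader().getTestCaseNames(TypesParityTests)
```

This is why the extra `def test_collated_string` override in the parity file could be dropped: unittest discovery finds inherited test methods without any per-suite declaration.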
@nikolamand-db It seems you need to enable GitHub Actions?
I'm having trouble with GitHub Actions; it's disabled for my account. I've already filed a ticket with GitHub support.
Merged to master.
Closes apache#45316 from nikolamand-db/nikolamand-db/SPARK-47211.

Lead-authored-by: Nikola Mandic <nikola.mandic@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

When using Connect with PySpark, string collation silently gets dropped:

```
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.
>>> spark.sql("select 'abc' collate 'UNICODE'")
DataFrame[collate(abc): string]
>>> from pyspark.sql.types import StructType, StringType, StructField
>>> spark.createDataFrame([], StructType([StructField('id', StringType(2))]))
DataFrame[id: string]
```

Instead of the `string` type in the dataframe, we should be seeing `string COLLATE 'UNICODE'`. Fix the issue by adding collation info to the conversions under Connect `UnparsedDataType`.

### Why are the changes needed?

To enable correct behavior with PySpark Connect collated strings.

### Does this PR introduce _any_ user-facing change?

Yes, with these changes the user gets the correct string collation from a PySpark Connect session.

### How was this patch tested?

Added a test which simulates the described scenario.

### Was this patch authored or co-authored using generative AI tooling?

No.
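As a rough illustration of the user-visible difference described above, here is a hedged sketch of how a collated string type's display string could be rendered; `simple_string`, the `COLLATIONS` table, and the specific collation ids are all hypothetical stand-ins for illustration, not the PySpark API.

```python
# Assumed mapping for illustration: id 0 is the default binary collation,
# non-zero ids carry a named collation (the real ids live in Spark).
COLLATIONS = {0: "UTF8_BINARY", 2: "UNICODE"}


def simple_string(collation_id: int) -> str:
    # The default collation renders as plain "string"; any other collation
    # should remain visible in the schema, which is what the fix restores.
    if collation_id == 0:
        return "string"
    return f"string COLLATE '{COLLATIONS[collation_id]}'"


print(simple_string(0))  # -> string
print(simple_string(2))  # -> string COLLATE 'UNICODE'
```

Before the fix, the Connect path effectively behaved as if every string had collation id 0, which is why the schema above showed `string` instead of `string COLLATE 'UNICODE'`.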