
[SPARK-47211][CONNECT][PYTHON] Fix ignored PySpark Connect string collation #45316

Closed

Conversation

nikolamand-db
Contributor

What changes were proposed in this pull request?

When using Connect with PySpark, string collation silently gets dropped:

```
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.
>>> spark.sql("select 'abc' collate 'UNICODE'")
DataFrame[collate(abc): string]
>>> from pyspark.sql.types import StructType, StringType, StructField
>>> spark.createDataFrame([], StructType([StructField('id', StringType(2))]))
DataFrame[id: string]
```

Instead of the plain `string` type in the dataframe, we should be seeing `string COLLATE 'UNICODE'`.

Fix the issue by adding collation info to the conversions under the Connect `UnparsedDataType`.
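The intended round trip can be illustrated with a short, dependency-free sketch. Everything in it (the `to_proto`/`from_proto` helpers, the dict-based "proto", the collation-id table) is an illustrative stand-in for pyspark's real converters and protobuf messages, not the actual API:

```python
# Illustrative stand-ins only -- not pyspark's real classes or converters.
COLLATIONS = {0: "UTF8_BINARY", 2: "UNICODE"}  # hypothetical id mapping

class CollatedString:
    """Toy model of a string type that carries a collation id."""
    def __init__(self, collation_id=0):
        self.collationId = collation_id

    def simple_string(self):
        if self.collationId == 0:
            return "string"
        return f"string COLLATE '{COLLATIONS[self.collationId]}'"

def to_proto(dt):
    # The bug: if collation_id is not written here, it is silently
    # dropped and every string round-trips as the default collation.
    return {"string": {"collation_id": dt.collationId}}

def from_proto(msg):
    return CollatedString(msg["string"].get("collation_id", 0))

print(from_proto(to_proto(CollatedString(2))).simple_string())
# string COLLATE 'UNICODE'
```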

Why are the changes needed?

To enable correct handling of collated strings in PySpark Connect.

Does this PR introduce any user-facing change?

Yes; with these changes the user gets the correct string collation from a PySpark Connect session.

How was this patch tested?

Added a test which simulates the described scenario.

Was this patch authored or co-authored using generative AI tooling?

No.

```
@@ -3397,6 +3397,14 @@ def test_df_caache(self):
        self.assert_eq(10, df.count())
        self.assertTrue(df.is_cached)

    def test_collated_string(self):
```
Member

Shall we add the test to `pyspark.sql.tests.test_types.TypesTestsMixin`, so that `pyspark.sql.tests.connect.test_parity_types` can inherit it automatically?
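The parity-test pattern referred to here can be sketched as follows; class and method names are modeled on, but not identical to, the real pyspark test modules:

```python
import unittest

class TypesTestsMixin:
    # Any test_* method defined on the mixin runs in every suite
    # that inherits it -- classic and Connect alike.
    def test_collated_string(self):
        self.assertEqual(self.string_type_name(2), "string COLLATE 'UNICODE'")

class TypesTests(TypesTestsMixin, unittest.TestCase):
    def string_type_name(self, collation_id):
        return "string COLLATE 'UNICODE'"  # stand-in for the classic path

class ParityTypesTests(TypesTestsMixin, unittest.TestCase):
    # test_collated_string is inherited automatically; no re-declaration needed.
    def string_type_name(self, collation_id):
        return "string COLLATE 'UNICODE'"  # stand-in for the Connect path
```

Because unittest discovers inherited `test_*` methods, adding the test once on the mixin makes both suites run it.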

Contributor Author

@xinrong-meng Moved the test and added it to `test_parity_types`, please check.

Member

Looks good, thank you!

Member

Just FYI, we don't need to add the test to `test_parity_types`; it will be inherited automatically.
Thanks @HyukjinKwon for the fix.

```
@@ -139,7 +139,7 @@ def pyspark_types_to_proto_types(data_type: DataType) -> pb2.DataType:
     if isinstance(data_type, NullType):
         ret.null.CopyFrom(pb2.DataType.NULL())
     elif isinstance(data_type, StringType):
-        ret.string.CopyFrom(pb2.DataType.String())
+        ret.string.collation_id = data_type.collationId
```
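For readers unfamiliar with protobuf semantics, a dependency-free sketch of why the single assignment above suffices: in protobuf's Python API, assigning a scalar field on a nested message implicitly creates that message, so one line both selects the string variant and records the collation. The classes below merely imitate that behavior; they are not the real `pb2` classes:

```python
class StringMsg:
    """Toy stand-in for pb2.DataType.String."""
    def __init__(self):
        self.collation_id = 0

class DataTypeMsg:
    """Toy stand-in for pb2.DataType with protobuf-like lazy submessages."""
    def __init__(self):
        self._string = None

    @property
    def string(self):
        # Like protobuf, touching a nested message field creates it, so
        # `ret.string.collation_id = ...` needs no separate CopyFrom call.
        if self._string is None:
            self._string = StringMsg()
        return self._string

ret = DataTypeMsg()
ret.string.collation_id = 2  # one assignment: variant chosen, collation kept
```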
Contributor

Is this change tested? The added UT seems to test the proto -> type conversion but not vice versa?

Contributor Author

@amaliujia New changes should cover conversions in both directions (both get invoked during testing), please check.

Contributor

@amaliujia left a comment

LGTM

```
@@ -86,6 +86,9 @@ def test_rdd_with_udt(self):
    def test_udt(self):
        super().test_udt()

    def test_collated_string(self):
```
Member

We can actually just remove this; the test will run automatically (by inheritance).

@zhengruifeng
Contributor

@nikolamand-db seems you need to enable the Github Action?

@nikolamand-db
Contributor Author

> @nikolamand-db seems you need to enable the Github Action?

I'm having trouble with GitHub Actions; it's disabled for my account. I've already filed a ticket with GitHub support.

@HyukjinKwon
Member

Merged to master.

@nikolamand-db nikolamand-db deleted the nikolamand-db/SPARK-47211 branch March 1, 2024 07:53
TakawaAkirayo pushed a commit to TakawaAkirayo/spark that referenced this pull request Mar 4, 2024

Closes apache#45316 from nikolamand-db/nikolamand-db/SPARK-47211.

Lead-authored-by: Nikola Mandic <nikola.mandic@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Mar 5, 2024
jpcorreia99 pushed a commit to jpcorreia99/spark that referenced this pull request Mar 12, 2024