
[SPARK-43055][CONNECT][PYTHON] Support duplicated nested field names #40692

Closed

Conversation

@ueshin (Member) commented Apr 6, 2023

What changes were proposed in this pull request?

Supports duplicated nested field names in spark.createDataFrame and df.collect.

Why are the changes needed?

If there are duplicated nested field names, the following error is raised:

>>> from pyspark.sql.types import *
>>>
>>> data = [Row(Row("a", 1), Row(2, 3, "b", 4, "c")), Row(Row("x", 6), Row(7, 8, "y", 9, "z"))]
>>> schema = (
...     StructType()
...     .add("struct", StructType().add("x", StringType()).add("x", IntegerType()))
...     .add(
...         "struct",
...         StructType()
...         .add("a", IntegerType())
...         .add("x", IntegerType())
...         .add("x", StringType())
...         .add("y", IntegerType())
...         .add("y", StringType()),
...     )
... )
>>> df = spark.createDataFrame(data, schema=schema)
Traceback (most recent call last):
...
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
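
The conversion to Arrow appears to match struct children by field name, which is why duplicated names collide. A minimal sketch of the deduplication idea used to work around this (dedup_field_names is a hypothetical helper for illustration, not the PR's actual code); the positional suffixes mirror the a_0/a_1 style seen in the toPandas discussion below:

from collections import Counter

def dedup_field_names(names):
    # Hypothetical helper: append a positional suffix to every field name
    # that occurs more than once, leaving unique names untouched.
    total = Counter(names)
    seen = Counter()
    deduped = []
    for name in names:
        if total[name] > 1:
            deduped.append(f"{name}_{seen[name]}")
            seen[name] += 1
        else:
            deduped.append(name)
    return deduped

print(dedup_field_names(["a", "x", "x", "y", "y"]))
# ['a', 'x_0', 'x_1', 'y_0', 'y_1']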

Does this PR introduce any user-facing change?

Yes: duplicated nested field names will now be supported.
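
As a hedged illustration (the exact Row repr may differ), the failing snippet above is expected to run after this change:

df = spark.createDataFrame(data, schema=schema)  # no longer raises ArrowTypeError
rows = df.collect()
# Nested fields stay accessible by position even when names are duplicated.
assert rows[0][0][0] == "a"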

How was this patch tested?

Added a test.

@HyukjinKwon (Member)
cc @zhengruifeng

@zhengruifeng (Contributor)

Just FYI, vanilla PySpark's DataFrame.toPandas also has this issue https://issues.apache.org/jira/browse/SPARK-41971
Is it possible to move the changes to ArrowUtils to fix them all?

@ueshin (Member, Author) commented Apr 7, 2023

> Just FYI, vanilla PySpark's DataFrame.toPandas also has this issue: https://issues.apache.org/jira/browse/SPARK-41971
> Is it possible to move the changes to ArrowUtils to fix them all?

Yes, I'm aware of the issue, but let me defer it to follow-up PRs.
(Thanks for filing the ticket, btw. 😄 )

TL;DR

Actually, this PR still has an issue with toPandas.

>>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
   x                     y
0  1  {'a_0': 1, 'a_1': 2}

The duplicated fields get suffixes _0, _1, and so on.

Also, the handling of struct types in toPandas was never well-defined, and behavior differs even between Arrow enabled and disabled in PySpark.

>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', False)
>>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
   x       y
0  1  (1, 2)
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
   x                 y
0  1  {'a': 1, 'b': 2}

Currently, PySpark with Arrow enabled and Spark Connect return a map (dict) for struct-type values, whereas PySpark without Arrow returns a Row object.

The options are:

  1. Accept that the behaviors differ, and keep the suffixes.
    • In this case, the suffix is a must, because a map object holds only one value per duplicated key (see the sketch after this list).
  2. Use a Row object for the struct.
    • In this case, we lose the benefit of the fast Arrow-to-pandas conversion.
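
The first point is easy to see in plain Python, since a dict collapses duplicated keys; without suffixes, one of the duplicated struct fields would be silently lost:

>>> {"a": 1, "a": 2}  # the literal keeps only the last value for "a"
{'a': 2}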

@@ -60,13 +61,19 @@ private[sql] class SparkResult[T](
   private def processResponses(stopOnFirstNonEmptyResponse: Boolean): Boolean = {
     while (responses.hasNext) {
       val response = responses.next()
       if (response.hasSchema) {
         structType =
Contributor:
What is the difference between this schema and the one in the arrow batch?

@ueshin (Member, Author):

It is the original schema; the one in the Arrow batch is modified to deduplicate the struct field names.
Also, the original schema contains UDTs where they are supported, and the Python client works fine with that.

@amaliujia (Contributor) commented Apr 7, 2023:

This logic actually becomes more confusing now regarding the structType assignment.

I am wondering if it should become something like:

if (response.hasSchema)
else if (response.hasArrowBatch)

I am becoming unsure, as the code now reads:

  1. if the response gives a schema, use it
  2. if the response didn't give one, try Arrow's schema

Then it's not clear how to handle the case where both the response and the Arrow batch have a schema, or which one should take precedence, etc. Per my read, the response schema and the Arrow schema could even be inconsistent?

@ueshin (Member, Author) commented Apr 7, 2023:

Now that the original schema arrives earlier than the Arrow batches, we should use it when it's available; otherwise we fall back to the schema from the Arrow batch.

Yes, the response schema and the Arrow schema can be inconsistent in their nested field names when there are duplicates, but that's not a problem while the encoder handles the ColumnarBatch, as long as the data structure is consistent.

Added some comments.
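
In Python terms, the precedence rule described above amounts to something like the following sketch (illustrative only; the actual logic is the Scala code in SparkResult shown in the diff):

def resolve_schema(response_schema, arrow_batch_schema):
    # Prefer the original schema sent in the response: it preserves the
    # duplicated nested field names (and UDTs where supported), while the
    # Arrow batch schema carries the deduplicated names.
    return response_schema if response_schema is not None else arrow_batch_schema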

@ueshin marked this pull request as ready for review April 7, 2023 06:49
@HyukjinKwon (Member)

Merged to master.
