[SPARK-43055][CONNECT][PYTHON] Support duplicated nested field names #40692
Conversation
Just FYI, vanilla PySpark's DataFrame.toPandas also has this issue: https://issues.apache.org/jira/browse/SPARK-41971
Yes, I'm aware of the issue, but let me hold off on it until the following PRs.

TL;DR: Actually this PR still has an issue with toPandas:

>>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
   x                     y
0  1  {'a_0': 1, 'a_1': 2}

The duplicated fields get a suffix.

Also, the handling of struct types in toPandas differs depending on whether Arrow is enabled:

>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', False)
>>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
   x       y
0  1  (1, 2)
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
   x                 y
0  1  {'a': 1, 'b': 2}

Currently, PySpark with Arrow enabled, and Spark Connect, use a map for struct type objects in the result, whereas PySpark without Arrow does not, as shown above. The options are:
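The suffixing behavior shown above (duplicated struct fields coming back as a_0, a_1, ...) can be sketched with a small helper. The function name and the exact suffix rule here are assumptions for illustration, not Spark's actual implementation:

```python
from collections import Counter

def dedup_field_names(names):
    # Hypothetical sketch: duplicated field names get a positional
    # suffix, so ["a", "a"] becomes ["a_0", "a_1"], while names that
    # appear only once are left untouched.
    counts = Counter(names)
    seen = Counter()
    out = []
    for name in names:
        if counts[name] > 1:
            out.append(f"{name}_{seen[name]}")
            seen[name] += 1
        else:
            out.append(name)
    return out
```

With this rule, a struct with fields ["a", "a"] yields the keys 'a_0' and 'a_1' seen in the pandas output above, while ["a", "b"] passes through unchanged.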
@@ -60,13 +61,19 @@ private[sql] class SparkResult[T](
  private def processResponses(stopOnFirstNonEmptyResponse: Boolean): Boolean = {
    while (responses.hasNext) {
      val response = responses.next()
      if (response.hasSchema) {
        structType =
What is the difference between this schema and the one in the arrow batch?
It is the original schema; the one in the Arrow batch is modified to deduplicate the struct field names.
Also, the original schema contains UDTs when supported. The Python client works fine with that.
This logic actually becomes more confusing now regarding the structType assignment.
I am wondering if it should become something like:

if (response.hasSchema)
else if (response.hasArrowBatch)

I am becoming unsure, as the code is:
- if the response gives a schema, use it
- if the response didn't, then try Arrow's schema

Then it is not clear how to handle the case when both the response and the Arrow batch have a schema, or which one should be used first, etc. Per my read, the response schema and the Arrow schema could even be inconsistent?
Now that the original schema arrives earlier than the Arrow batches, we should use it if it's available; otherwise fall back to the schema from the Arrow batch.
Yes, the response schema and the Arrow schema could be inconsistent in terms of nested field names if there are duplicates, but that's not a problem while the encoder is handling the ColumnarBatch, as long as the data structure is consistent.
Added some comments.
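The preference described above — use the original schema from the response when present, and fall back to the schema derived from the Arrow batch otherwise — can be sketched as follows. The function and parameter names are hypothetical, not the actual SparkResult code:

```python
def resolve_schema(response_schema, arrow_schema):
    # Hypothetical sketch of the schema preference: the schema sent
    # explicitly in the response arrives before any Arrow batches and
    # keeps the original (possibly duplicated) nested field names, so
    # it wins when present. The Arrow batch schema, whose nested field
    # names may have been deduplicated, is only a fallback.
    if response_schema is not None:
        return response_schema
    return arrow_schema
```

This keeps the ordering unambiguous: even if both schemas are present and differ in nested field names, the response schema is used and the Arrow one is ignored.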
Merged to master.
What changes were proposed in this pull request?
Supports duplicated nested field names when using spark.createDataFrame or df.collect.

Why are the changes needed?
If there are duplicated nested field names, the following error is raised:
Does this PR introduce any user-facing change?
Yes. Duplicated nested field names will now be available.
How was this patch tested?
Added a test.