[SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow #30393
Conversation
…s fails on unimplemented for arrow to pandas conversion
ping @HyukjinKwon, please take a look when you can, thanks!
Test build #131203 has finished for PR 30393 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
Test build #131228 has finished for PR 30393 at commit
Test build #131232 has finished for PR 30393 at commit
@BryanCutler, BTW I believe we should also update the docs :-)
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Lines 1905 to 1906 in 9283484
"The following data types are unsupported: " +
"MapType, ArrayType of TimestampType, and nested StructType.")
https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L287
LGTM otherwise
Thanks for reminding me! I'll do that now.
@@ -306,3 +322,23 @@ def _check_series_convert_timestamps_tz_local(s, timezone):
    `pandas.Series` where if it is a timestamp, has been converted to tz-naive
    """
    return _check_series_convert_timestamps_localize(s, timezone, None)


def _convert_map_items_to_dict(s):
Note: these conversion functions exist because pyarrow expects map items as a list of (key, value) pairs and uses the same layout when converting to Pandas. The Arrow spec allows duplicate keys within a row and does not specify exactly how they should be handled, so these conversions make the Arrow path match the non-Arrow behavior for maps, with a dictionary as input/output.
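To make the note above concrete, here is a minimal plain-Python sketch of the two conversions (the real helpers operate on `pandas.Series`; the function names below are simplified stand-ins for illustration):

```python
def map_items_to_dict(items):
    """Convert Arrow map items -- a list of (key, value) pairs -- to a dict.

    The Arrow spec permits duplicate keys within a row; dict() keeps the
    last value for a repeated key, which is the behavior this sketch assumes.
    """
    return None if items is None else dict(items)


def dict_to_map_items(d):
    """Convert a Python dict back to the list-of-pairs layout pyarrow expects."""
    return None if d is None else list(d.items())


# Round trip: duplicate key "a" collapses to its last value.
pairs = [("a", 1), ("a", 2), ("b", 3)]
as_dict = map_items_to_dict(pairs)
back = dict_to_map_items(as_dict)
```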
@@ -341,7 +341,7 @@ Supported SQL Types

.. currentmodule:: pyspark.sql.types

Currently, all Spark SQL data types are supported by Arrow-based conversion except :class:`MapType`,
Currently, all Spark SQL data types are supported by Arrow-based conversion except
I should probably mention MapType only for pyarrow 2.0.0..
done
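Since MapType support is gated on pyarrow >= 2.0.0, the version check discussed above could be sketched like this (the helper name is hypothetical; the real gating lives in Spark's internal pyarrow version checks):

```python
def supports_map_type(pyarrow_version):
    """Return True if a pyarrow version string meets the 2.0.0 floor
    this PR requires for MapType support.

    Assumes the usual "major.minor.patch" form of pyarrow releases.
    """
    parts = tuple(int(p) for p in pyarrow_version.split(".")[:3])
    return parts >= (2, 0, 0)
```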
Test build #131260 has finished for PR 30393 at commit
Kubernetes integration test starting
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test status failure
Merged to master.
Test build #131262 has finished for PR 30393 at commit
Thanks @HyukjinKwon !
What changes were proposed in this pull request?
This change adds MapType support for PySpark with Arrow when using pyarrow >= 2.0.0.
Why are the changes needed?
MapType was previously unsupported with Arrow.
Does this PR introduce any user-facing change?
Users can now use MapType with createDataFrame() and toPandas() under Arrow optimization, and with Pandas UDFs.
How was this patch tested?
Added new PySpark tests for createDataFrame(), toPandas(), and Scalar Pandas UDFs.
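As a rough illustration of the user-facing change, the Spark calls below are shown as comments since they assume a live SparkSession named `spark` and pyarrow >= 2.0.0; the runnable part only shows the pandas-side representation a MapType column takes after conversion:

```python
import pandas as pd

# Hypothetical Spark usage (not executed here):
#
#   spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
#   df = spark.createDataFrame(pdf, "attrs map<string,int>")
#   roundtrip = df.toPandas()   # Arrow path, no fallback needed for MapType
#
# On the pandas side, each MapType value is a plain Python dict:
pdf = pd.DataFrame({"attrs": [{"a": 1}, {"b": 2, "c": 3}]})
```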