[SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow #30393
Conversation
…s fails on unimplemented for arrow to pandas conversion
ping @HyukjinKwon, please take a look when you can, thanks!
Test build #131203 has finished for PR 30393 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
Test build #131228 has finished for PR 30393 at commit
Test build #131232 has finished for PR 30393 at commit
@BryanCutler, BTW I believe we should also update the docs :-)
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Lines 1905 to 1906 in 9283484
"The following data types are unsupported: " +
"MapType, ArrayType of TimestampType, and nested StructType.")
https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L287
LGTM otherwise
Thanks for reminding me! I'll do that now.
@@ -306,3 +322,23 @@ def _check_series_convert_timestamps_tz_local(s, timezone):
    `pandas.Series` where if it is a timestamp, has been converted to tz-naive
    """
    return _check_series_convert_timestamps_localize(s, timezone, None)


def _convert_map_items_to_dict(s):
Note: these conversion functions exist because pyarrow expects map items as a list of (key, value) pairs and uses the same layout when converting to Pandas. The Arrow spec allows duplicate keys within a row and does not specify exactly how they should be handled, so these conversions make the Arrow path match the non-Arrow behavior for maps, with a dictionary as input/output.
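To make the note above concrete, here is a minimal plain-Python sketch of the two conversions (the real helpers operate on `pandas.Series`; the function names below are simplified stand-ins for illustration):

```python
def map_items_to_dict(items):
    """Convert Arrow map items -- a list of (key, value) pairs -- to a dict.

    The Arrow spec permits duplicate keys within a row; dict() keeps the
    last value for a repeated key, which is the behavior this sketch assumes.
    """
    return None if items is None else dict(items)


def dict_to_map_items(d):
    """Convert a Python dict back to the list-of-pairs layout pyarrow expects."""
    return None if d is None else list(d.items())


# Round trip: duplicate key "a" collapses to its last value.
pairs = [("a", 1), ("a", 2), ("b", 3)]
as_dict = map_items_to_dict(pairs)
back = dict_to_map_items(as_dict)
```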
@@ -341,7 +341,7 @@ Supported SQL Types

.. currentmodule:: pyspark.sql.types

Currently, all Spark SQL data types are supported by Arrow-based conversion except :class:`MapType`,
Currently, all Spark SQL data types are supported by Arrow-based conversion except
I should probably mention MapType only for pyarrow 2.0.0..
done
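Since MapType support is gated on pyarrow >= 2.0.0, the version check discussed above could be sketched like this (the helper name is hypothetical; the real gating lives in Spark's internal pyarrow version checks):

```python
def supports_map_type(pyarrow_version):
    """Return True if a pyarrow version string meets the 2.0.0 floor
    this PR requires for MapType support.

    Assumes the usual "major.minor.patch" form of pyarrow releases.
    """
    parts = tuple(int(p) for p in pyarrow_version.split(".")[:3])
    return parts >= (2, 0, 0)
```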
Test build #131260 has finished for PR 30393 at commit
Kubernetes integration test starting
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test status failure
Merged to master.
Test build #131262 has finished for PR 30393 at commit
Thanks @HyukjinKwon !
What changes were proposed in this pull request?
This change adds MapType support for PySpark with Arrow when using pyarrow >= 2.0.0.
Why are the changes needed?
MapType was previously unsupported with Arrow.
Does this PR introduce any user-facing change?
Users can now use MapType with createDataFrame() and toPandas() under Arrow optimization, and with Pandas UDFs.
How was this patch tested?
Added new PySpark tests for createDataFrame(), toPandas(), and Scalar Pandas UDFs.
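As a rough illustration of the user-facing change, the Spark calls below are shown as comments since they assume a live SparkSession named `spark` and pyarrow >= 2.0.0; the runnable part only shows the pandas-side representation a MapType column takes after conversion:

```python
import pandas as pd

# Hypothetical Spark usage (not executed here):
#
#   spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
#   df = spark.createDataFrame(pdf, "attrs map<string,int>")
#   roundtrip = df.toPandas()   # Arrow path, no fallback needed for MapType
#
# On the pandas side, each MapType value is a plain Python dict:
pdf = pd.DataFrame({"attrs": [{"a": 1}, {"b": 2, "c": 3}]})
```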