
[SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow #30393

Closed

Conversation

BryanCutler
Member

What changes were proposed in this pull request?

This change adds MapType support for PySpark with Arrow, if using pyarrow >= 2.0.0.

Why are the changes needed?

MapType was previously unsupported with Arrow-based conversion.

Does this PR introduce any user-facing change?

Users can now use MapType with createDataFrame() and toPandas() under the Arrow optimization, and with Pandas UDFs; see the example below.
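A minimal usage sketch, assuming a Spark build that includes this change and pyarrow >= 2.0.0 installed (the column and UDF names are illustrative, not taken from the PR):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
# Enable the Arrow optimization for createDataFrame() and toPandas()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# createDataFrame() from a pandas DataFrame with dict values, using an
# explicit schema so the column is treated as a MapType
pdf = pd.DataFrame({"id": [1, 2], "attrs": [{"a": 1}, {"b": 2, "c": 3}]})
df = spark.createDataFrame(pdf, schema="id long, attrs map<string,long>")

# toPandas() converts the MapType column back into Python dicts
result_pdf = df.toPandas()

# Scalar Pandas UDF that takes and returns a MapType column
@pandas_udf("map<string,long>")
def passthrough(m: pd.Series) -> pd.Series:
    return m

df.select(passthrough("attrs")).show(truncate=False)
```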

How was this patch tested?

Added new PySpark tests for createDataFrame(), toPandas() and Scalar Pandas UDFs.

@BryanCutler
Member Author

Ping @HyukjinKwon, please take a look when you can, thanks!

@SparkQA

SparkQA commented Nov 17, 2020

Test build #131203 has finished for PR 30393 at commit a92af2f.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35806/

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35806/

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35831/

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35831/

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35836/

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35836/

@SparkQA

SparkQA commented Nov 17, 2020

Test build #131228 has finished for PR 30393 at commit dec2797.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2020

Test build #131232 has finished for PR 30393 at commit 78b2604.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon left a comment
Member

LGTM otherwise

@BryanCutler
Member Author

> BTW I believe we should also update the docs :-)

Thanks for reminding me! I'll do that now.

@@ -306,3 +322,23 @@ def _check_series_convert_timestamps_tz_local(s, timezone):
`pandas.Series` where if it is a timestamp, has been converted to tz-naive
"""
return _check_series_convert_timestamps_localize(s, timezone, None)


def _convert_map_items_to_dict(s):
Member Author

Note: these conversion functions are needed because pyarrow expects map items as a list of (key, value) pairs, and uses the same format when converting to Pandas. The Arrow spec allows duplicate keys within a row and does not specify exactly how they should be handled, so these conversions let us match the existing non-Arrow behavior for maps, which uses a dictionary as input/output. A sketch of what such helpers can look like is shown below.
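For illustration, a rough sketch of what such conversion helpers can look like (an approximation of the idea, not necessarily the exact code added in this PR):

```python
import pandas as pd


def _convert_map_items_to_dict(s):
    # pyarrow hands each map cell back as a list of (key, value) tuples;
    # convert to a Python dict to match the non-Arrow behavior. Duplicate
    # keys collapse to the last value, as dict() does.
    return s.apply(lambda m: None if m is None else dict(m))


def _convert_dict_to_map_items(s):
    # The reverse direction: pyarrow builds map arrays from lists of
    # (key, value) tuples, so expand each dict into its items.
    return s.apply(lambda d: None if d is None else list(d.items()))
```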

@@ -341,7 +341,7 @@ Supported SQL Types

.. currentmodule:: pyspark.sql.types

Currently, all Spark SQL data types are supported by Arrow-based conversion except :class:`MapType`,
Currently, all Spark SQL data types are supported by Arrow-based conversion except
Member Author

I should probably mention that MapType is only supported for pyarrow 2.0.0 and above.

Member Author

done
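
For context, the pyarrow version requirement discussed above could be enforced with a check along these lines (a hedged sketch; the actual guard in the PR may differ):

```python
from distutils.version import LooseVersion

import pyarrow as pa

# Only allow MapType conversion when pyarrow is new enough; otherwise fail
# the same way unsupported types did before this change.
if LooseVersion(pa.__version__) < LooseVersion("2.0.0"):
    raise TypeError("MapType is only supported with pyarrow 2.0.0 and above")
```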

@SparkQA

SparkQA commented Nov 18, 2020

Test build #131260 has finished for PR 30393 at commit b257470.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35863/

@SparkQA

SparkQA commented Nov 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35865/

@SparkQA

SparkQA commented Nov 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35863/

@SparkQA

SparkQA commented Nov 18, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35865/

@HyukjinKwon
Member

Merged to master.

@SparkQA

SparkQA commented Nov 18, 2020

Test build #131262 has finished for PR 30393 at commit 3f2ef98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

Thanks @HyukjinKwon!
