[SPARK-43543][PYTHON] Fix nested MapType behavior in Pandas UDF by xinrong-meng · Pull Request #41147 · apache/spark

xinrong-meng · 2023-05-11T23:08:40Z

What changes were proposed in this pull request?

Fix nested MapType behavior in Pandas UDF (and Arrow-optimized Python UDF).

Previously, during Arrow-to-pandas conversion, only the outermost map layer was converted to a dictionary; inner maps were left as lists of (key, value) tuples. Now a nested MapType is converted to nested dictionaries.

That applies to Spark Connect as well.
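
The difference between the old and new conversion can be illustrated in plain Python. This is only a minimal sketch, not the code in this PR: the type encoding (`"str"` and `("map", value_type)`) and the helper names `shallow_convert` and `deep_convert` are hypothetical, and the input mimics how Arrow hands a map value to pandas as a list of (key, value) tuples.

```python
# Hypothetical stand-ins for Spark types: "str" is an atomic value and
# ("map", value_type) is a MapType whose values have type value_type.
NESTED_MAP = ("map", ("map", "str"))


def shallow_convert(items):
    """Old behavior: only the outermost list of pairs becomes a dict."""
    return None if items is None else dict(items)


def deep_convert(items, type_):
    """New behavior: follow the type so that inner maps become dicts too."""
    if items is None or type_ == "str":
        return items
    _, value_type = type_
    return {key: deep_convert(value, value_type) for key, value in items}


# A nested map as it arrives from Arrow: lists of (key, value) tuples.
arrow_value = [("personal", [("name", "John"), ("city", "New York")])]

old = shallow_convert(arrow_value)
new = deep_convert(arrow_value, NESTED_MAP)
print(old)  # {'personal': [('name', 'John'), ('city', 'New York')]}
print(new)  # {'personal': {'name': 'John', 'city': 'New York'}}
```

With the shallow result, `old['personal']['name']` raises `TypeError` because the inner value is still a list; with the deep result the same lookup returns `'John'`.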

Why are the changes needed?

Correctness and consistency (with createDataFrame and toPandas when Arrow is enabled).

Does this PR introduce any user-facing change?

Yes.

Nested MapType support is corrected in Pandas UDF.

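>>> import pandas as pd
>>> from pyspark.sql.functions import pandas_udf
>>> from pyspark.sql.types import MapType, StringType, StructField, StructType
>>> schema = StructType([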
...      StructField("id", StringType(), True),
...      StructField("attributes", MapType(StringType(), MapType(StringType(), StringType())), True)
... ])
>>> 
>>> data = [
...    ("1", {"personal": {"name": "John", "city": "New York"}}),
... ]
>>> df = spark.createDataFrame(data, schema)
>>> @pandas_udf(StringType())
... def f(s: pd.Series) -> pd.Series:
...    return s.astype(str)
... 
>>> df.select(f(df.attributes)).show(truncate=False)

The result of df.select(f(df.attributes)).show(truncate=False) is corrected

FROM

+------------------------------------------------------+
|f(attributes)                                         |
+------------------------------------------------------+
|{'personal': [('name', 'John'), ('city', 'New York')]}|
+------------------------------------------------------+

TO

>>> df.select(f(df.attributes)).show(truncate=False)
+--------------------------------------------------+
|f(attributes)                                     |
+--------------------------------------------------+
|{'personal': {'name': 'John', 'city': 'New York'}}|
+--------------------------------------------------+

Another, more obvious example:

>>> @pandas_udf(StringType())
... def extract_name(s:pd.Series) -> pd.Series:
...     return s.apply(lambda x: x['personal']['name'])
...
>>> df.select(extract_name(df.attributes)).show(truncate=False)

The result of df.select(extract_name(df.attributes)).show(truncate=False) is corrected

FROM

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
TypeError: list indices must be integers or slices, not str

TO

+------------------------+
|extract_name(attributes)|
+------------------------+
|John                    |
+------------------------+

How was this patch tested?

Unit tests.

xinrong-meng · 2023-05-17T18:42:17Z

python/pyspark/sql/pandas/serializers.py

        return s

-    # To keep the current UDF behavior.
-    def _create_array(self, series, arrow_type):


Inherit _create_array from ArrowStreamPandasSerializer. After the change, the behavior is consistent with createDataFrame from a pandas DataFrame when Arrow is enabled.

dongjoon-hyun · 2023-05-18T06:35:05Z

python/pyspark/sql/pandas/serializers.py

-            return _convert_map_items_to_dict(s)
-        else:
-            return s
+        # TODO: cache the converter for reuse


Could you file a JIRA issue officially and make this an IDed TODO, like TODO(SPARK-XXX)?

Certainly, done!
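
For context on the TODO itself, one way "cache the converter for reuse" could look is memoizing a converter factory keyed by type. This is a rough sketch under stated assumptions, not the follow-up implementation: `make_converter` and the type encoding (`"str"`, `("map", value_type)`) are hypothetical, and `functools.lru_cache` is simply one convenient memoization tool.

```python
from functools import lru_cache
from typing import Any, Callable


# Hypothetical type encoding: "str" is atomic, ("map", value_type) is a map.
@lru_cache(maxsize=None)
def make_converter(type_) -> Callable[[Any], Any]:
    """Build, once per distinct type, a function that turns Arrow-style
    lists of (key, value) tuples into nested dicts."""
    if type_ == "str":
        return lambda value: value
    _, value_type = type_
    convert_value = make_converter(value_type)

    def convert(items):
        if items is None:
            return None
        return {key: convert_value(value) for key, value in items}

    return convert


nested = ("map", ("map", "str"))
first = make_converter(nested)
second = make_converter(nested)
print(first is second)  # True: the second call reuses the cached converter
print(first([("personal", [("name", "John")])]))  # {'personal': {'name': 'John'}}
```

Keying the cache on the type works here because the hypothetical type encoding is hashable; the converter is then built once per type rather than once per batch.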

xinrong-meng · 2023-05-18T17:35:42Z

@ueshin @HyukjinKwon @zhengruifeng would you please review?

xinrong-meng · 2023-05-19T21:56:30Z

Merged to master, thank you!

xinrong-meng · 2023-05-19T21:57:16Z

Please feel free to leave comments, if any; I'll address them in follow-ups.

github-actions bot added CORE PYTHON SQL labels May 11, 2023

xinrong-meng added 3 commits May 17, 2023 11:07

ArrowStreamPandasSerializer.arrow_to_pandas

1121eeb

test

5bf52f8

pandas to arrow

b152b67

xinrong-meng force-pushed the nestedType branch from beb18c6 to b152b67 May 17, 2023 18:35

xinrong-meng changed the title ~~[WIP] Nested non-atomic input type support in Pandas UDF~~ [WIP] Standardize nested non-atomic input type support in Pandas UDF May 17, 2023

xinrong-meng commented May 17, 2023

xinrong-meng changed the title ~~[WIP] Standardize nested non-atomic input type support in Pandas UDF~~ [SPARK-43543][PYTHON] Standardize nested non-atomic input type support in Pandas UDF May 18, 2023

xinrong-meng marked this pull request as ready for review May 18, 2023 00:49

xinrong-meng changed the title ~~[SPARK-43543][PYTHON] Standardize nested non-atomic input type support in Pandas UDF~~ [SPARK-43543][PYTHON] Standardize nested MapType in Pandas UDF May 18, 2023

xinrong-meng changed the title ~~[SPARK-43543][PYTHON] Standardize nested MapType in Pandas UDF~~ [SPARK-43543][PYTHON] Fix nested MapType behavior in Pandas UDF May 18, 2023

test

b92a34c

xinrong-meng removed the CORE label May 18, 2023

dongjoon-hyun reviewed May 18, 2023

IDed TODO

38bea11

github-actions bot added the CORE label May 18, 2023

ueshin approved these changes May 19, 2023

xinrong-meng closed this in bc6f69a May 19, 2023

[SPARK-43543][PYTHON] Fix nested MapType behavior in Pandas UDF #41147
xinrong-meng wants to merge 5 commits into apache:master from xinrong-meng:nestedType

3 participants
