[SPARK-43543][PYTHON] Fix nested MapType behavior in Pandas UDF #41147

Closed
xinrong-meng wants to merge 5 commits into apache:master from xinrong-meng:nestedType

@xinrong-meng xinrong-meng commented May 11, 2023

What changes were proposed in this pull request?

Fix nested MapType behavior in Pandas UDF (and Arrow-optimized Python UDF).

Previously, during Arrow-to-pandas conversion, only the outermost MapType layer was converted to a dictionary; nested MapType values are now converted to nested dictionaries as well.

That applies to Spark Connect as well.
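
The conversion described above can be sketched in plain Python, without Spark or Arrow. This is only an illustration of the idea: `map_items_to_dict` is a hypothetical helper, not the function used in this PR, and it assumes that any non-empty list made up entirely of 2-tuples represents a map. A real implementation would presumably be driven by the declared schema rather than by the shape of the data.

```python
from typing import Any


def _looks_like_map_items(value: Any) -> bool:
    # Assumption for this sketch: a non-empty list of 2-tuples is a map.
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(isinstance(item, tuple) and len(item) == 2 for item in value)
    )


def map_items_to_dict(value: Any) -> Any:
    """Recursively turn lists of (key, value) pairs into dictionaries.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, dict):
        # Outer layer is already a dict; convert whatever is nested inside.
        return {k: map_items_to_dict(v) for k, v in value.items()}
    if _looks_like_map_items(value):
        return {k: map_items_to_dict(v) for k, v in value}
    return value


# Shape seen before the fix: only the outermost layer is a dict.
before = {"personal": [("name", "John"), ("city", "New York")]}
after = map_items_to_dict(before)
print(after)  # {'personal': {'name': 'John', 'city': 'New York'}}
```

Values that do not look like map items, such as a list of integers, are returned unchanged.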

Why are the changes needed?

Correctness and consistency (with createDataFrame and toPandas when Arrow is enabled).

Does this PR introduce any user-facing change?

Yes.

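Nested MapType support is corrected in Pandas UDF:

>>> import pandas as pd
>>> from pyspark.sql.functions import pandas_udf
>>> from pyspark.sql.types import MapType, StringType, StructField, StructType
>>> schema = StructType([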
...      StructField("id", StringType(), True),
...      StructField("attributes", MapType(StringType(), MapType(StringType(), StringType())), True)
... ])
>>> 
>>> data = [
...    ("1", {"personal": {"name": "John", "city": "New York"}}),
... ]
>>> df = spark.createDataFrame(data, schema)
>>> @pandas_udf(StringType())
... def f(s: pd.Series) -> pd.Series:
...    return s.astype(str)
... 
>>> df.select(f(df.attributes)).show(truncate=False)

The result of df.select(f(df.attributes)).show(truncate=False) is corrected

FROM

+------------------------------------------------------+
|f(attributes)                                         |
+------------------------------------------------------+
|{'personal': [('name', 'John'), ('city', 'New York')]}|
+------------------------------------------------------+

TO

>>> df.select(f(df.attributes)).show(truncate=False)
+--------------------------------------------------+
|f(attributes)                                     |
+--------------------------------------------------+
|{'personal': {'name': 'John', 'city': 'New York'}}|
+--------------------------------------------------+

Another, more obvious example:

>>> @pandas_udf(StringType())
... def extract_name(s:pd.Series) -> pd.Series:
...     return s.apply(lambda x: x['personal']['name'])
...
>>> df.select(extract_name(df.attributes)).show(truncate=False)

df.select(extract_name(df.attributes)).show(truncate=False) is corrected

FROM

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
TypeError: list indices must be integers or slices, not str

TO

+------------------------+
|extract_name(attributes)|
+------------------------+
|John                    |
+------------------------+
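
Both symptoms above can be reproduced without Spark by using plain Python objects as stand-ins for the per-row value a UDF receives. This is a sketch under that assumption: `old_value` and `new_value` are hypothetical names, and `str(...)` is used as a rough equivalent of what `s.astype(str)` does to each element of an object Series.

```python
# Stand-ins for one row of the "attributes" column as seen inside the UDF.
old_value = {"personal": [("name", "John"), ("city", "New York")]}  # inner map as (key, value) pairs
new_value = {"personal": {"name": "John", "city": "New York"}}      # inner map as a dict

# First example: the string form differs because the inner container differs.
old_repr = str(old_value)
new_repr = str(new_value)

# Second example: string indexing fails on a list but works on a dict.
try:
    old_value["personal"]["name"]
    old_error = None
except TypeError as exc:
    old_error = str(exc)

new_name = new_value["personal"]["name"]

print(old_repr)
print(new_repr)
print(old_error)
print(new_name)
```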

How was this patch tested?

Unit tests.

@xinrong-meng xinrong-meng changed the title [WIP] Nested non-atomic input type support in Pandas UDF [WIP] Standardize nested non-atomic input type support in Pandas UDF May 17, 2023
return s

# To keep the current UDF behavior.
def _create_array(self, series, arrow_type):
Member Author

@xinrong-meng xinrong-meng May 17, 2023

This now inherits _create_array from ArrowStreamPandasSerializer. After the change, the behavior is consistent with createDataFrame from a pandas DataFrame when Arrow is enabled.

@xinrong-meng xinrong-meng changed the title [WIP] Standardize nested non-atomic input type support in Pandas UDF [SPARK-43543][PYTHON] Standardize nested non-atomic input type support in Pandas UDF May 18, 2023
@xinrong-meng xinrong-meng marked this pull request as ready for review May 18, 2023 00:49
@xinrong-meng xinrong-meng changed the title [SPARK-43543][PYTHON] Standardize nested non-atomic input type support in Pandas UDF [SPARK-43543][PYTHON] Standardize nested MapType in Pandas UDF May 18, 2023
@xinrong-meng xinrong-meng changed the title [SPARK-43543][PYTHON] Standardize nested MapType in Pandas UDF [SPARK-43543][PYTHON] Fix nested MapType behavior in Pandas UDF May 18, 2023
@xinrong-meng xinrong-meng removed the CORE label May 18, 2023
return _convert_map_items_to_dict(s)
else:
return s
# TODO: cache the converter for reuse
Member

Could you officially file a JIRA issue and turn this into an ID-ed TODO, like TODO(SPARK-XXX)?

Member Author

Certainly, done!
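
For readers wondering what "cache the converter for reuse" could look like, here is one possible sketch using `functools.lru_cache`. It is not taken from this PR: `get_converter` and the string type names are hypothetical, and the actual follow-up may take a different approach.

```python
from functools import lru_cache
from typing import Any, Callable


@lru_cache(maxsize=None)
def get_converter(type_name: str) -> Callable[[Any], Any]:
    """Build a converter once per type name and reuse it afterwards.

    Hypothetical example; the type names are placeholders for this sketch.
    """
    if type_name == "map":
        # Convert a list of (key, value) pairs into a dict.
        return lambda items: dict(items)
    # Any other type is passed through unchanged.
    return lambda value: value


conv1 = get_converter("map")
conv2 = get_converter("map")  # served from the cache, no new converter is built

print(conv1 is conv2)
print(conv1([("name", "John")]))
```

The cache key has to be hashable, so a real version would need a hashable description of the data type.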

@github-actions github-actions bot added the CORE label May 18, 2023
@xinrong-meng
Member Author

@ueshin @HyukjinKwon @zhengruifeng would you please review?

@xinrong-meng
Member Author

Merged to master, thank you!

@xinrong-meng
Member Author

Please feel free to leave comments, if any; I'll address them in follow-ups.
