[SPARK-46663][PYTHON] Disable memory profiler for pandas UDFs with iterators #44668

Closed
wants to merge 6 commits into from

xinrong-meng
Member

@xinrong-meng xinrong-meng commented Jan 10, 2024

What changes were proposed in this pull request?

When a pandas UDF takes or returns iterators and the user enables the profiling Spark conf, PySpark should raise a warning that profiling is not supported and then leave profiling disabled.

Currently, however, the memory profiler is still enabled after the not-supported warning is raised.

This PR fixes that.
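
To make the intended behavior concrete, here is a minimal sketch of the "warn, then actually skip profiling" decision. It is not the code changed in this PR: `UdfKind` and `profiler_enabled_for` are hypothetical names introduced only for illustration, and only the warning text is taken from the output shown in this description.

```python
import warnings
from enum import Enum


class UdfKind(Enum):
    """Hypothetical stand-in for the kinds of UDFs discussed here."""

    SCALAR = "scalar"
    SCALAR_ITER = "scalar_iter"  # pandas UDF with iterator input/output


def profiler_enabled_for(kind: UdfKind, profile_conf: bool) -> bool:
    """Decide whether a profiler should wrap the UDF (illustrative only)."""
    if not profile_conf:
        # Profiling was never requested.
        return False
    if kind is UdfKind.SCALAR_ITER:
        warnings.warn(
            "Profiling UDFs with iterators input/output is not supported.",
            UserWarning,
        )
        # The point of the fix: stop here, so the warning is not
        # followed by the profiler being enabled anyway.
        return False
    return True
```

In this sketch the buggy behavior would correspond to emitting the warning and then falling through to `return True`.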

Why are the changes needed?

A bug fix to eliminate misleading behavior.

Does this PR introduce any user-facing change?

The change is noticeable only to users of the PySpark shell. There, the memory profiler previously raised an error, which in turn blocked execution of the UDF.
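
As a hedged illustration of why a shell-defined function can trip a tool that needs source code: the sketch below assumes (this is not confirmed in this PR) that the error comes from a source lookup such as Python's `inspect.getsource`. It mimics a shell definition by compiling a function under the pseudo-filename `<stdin>`; the module name `__shell_demo__` and the helper `source_available` are hypothetical.

```python
import inspect

# Mimic a function typed into an interactive shell: its code object
# points at the pseudo-filename "<stdin>", which has no file on disk.
SOURCE = "def plus_one(x):\n    return x + 1\n"
namespace = {"__name__": "__shell_demo__"}  # hypothetical module name
exec(compile(SOURCE, "<stdin>", "exec"), namespace)
plus_one = namespace["plus_one"]


def source_available(func) -> bool:
    """Return True if inspect can retrieve the function's source text."""
    try:
        inspect.getsource(func)
        return True
    except (OSError, TypeError):
        # inspect raises OSError when the source cannot be retrieved.
        return False
```

The function itself still runs normally (`plus_one(1)` returns `2`); only the source lookup fails, which is consistent with the UDF working once the profiler is no longer enabled.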

How was this patch tested?

Manual test.

Was this patch authored or co-authored using generative AI tooling?

Setup:

$ ./bin/pyspark --conf spark.python.profile=true

>>> from typing import Iterator
>>> from pyspark.sql.functions import *
>>> import pandas as pd
>>> @pandas_udf("long")
... def plus_one(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
...     for s in iterator:
...         yield s + 1
... 
>>> df = spark.createDataFrame(pd.DataFrame([1, 2, 3], columns=["v"]))

Before:

>>> df.select(plus_one(df.v)).show()
UserWarning: Profiling UDFs with iterators input/output is not supported.
Traceback (most recent call last):
...
OSError: could not get source code

After:

>>> df.select(plus_one(df.v)).show()
/Users/xinrong.meng/spark/python/pyspark/sql/udf.py:417: UserWarning: Profiling UDFs with iterators input/output is not supported.
+-----------+
|plus_one(v)|
+-----------+
|          2|
|          3|
|          4|
+-----------+

@xinrong-meng changed the title from "Disable memory profiler for iterator UDFs" to "[SPARK-46663][PYTHON] Disable memory profiler for pandas UDFs with iterators" on Jan 10, 2024
@xinrong-meng marked this pull request as ready for review on January 11, 2024 19:26
@xinrong-meng
Member Author

@ueshin @HyukjinKwon @zhengruifeng may I get a review please?

Member

@ueshin ueshin left a comment

Could you add a test for this?
Otherwise, LGTM.

@xinrong-meng
Member Author

Thanks all! Merged to master, will do manual cherry-pick for branch-3.5

3 participants