
[SPARK-46686][PYTHON][CONNECT] Basic support of SparkSession based Python UDF profiler #44697

Closed
wants to merge 9 commits

Conversation

@ueshin (Member) commented Jan 12, 2024

What changes were proposed in this pull request?

Adds basic support for a SparkSession-based Python UDF profiler.

To enable the profiler, set the SQL conf spark.sql.pyspark.udf.profiler:

  • "perf": enables the cProfile-based profiler
  • "memory": enables the memory-profiler-based profiler (TODO: SPARK-46687)
from pyspark.sql.functions import *

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")  # enable cProfiler

@udf("string")
def f(x):
    return str(x)

df = spark.range(10).select(f(col("id")))
df.collect()

@pandas_udf("string")
def g(x):
    return x.astype("string")

df = spark.range(10).select(g(col("id")))

spark.conf.unset("spark.sql.pyspark.udf.profiler")  # disable

df.collect()  # won't profile

spark.showPerfProfiles()  # shows the results only for the first collect()
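
As an aside, since the conf is restricted to "perf" and "memory" (see the checkValues fragment quoted in the review below), setting any other value is expected to fail. A minimal sketch under that assumption:

# Hedged sketch: values other than "perf"/"memory" should be rejected
# by the conf's value check when being set.
try:
    spark.conf.set("spark.sql.pyspark.udf.profiler", "bogus")
except Exception as e:
    print(e)  # expected: an IllegalArgumentException from the conf check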

Why are the changes needed?

The existing UDF profilers are SparkContext-based, so they can't support Spark Connect.

We should introduce SparkSession-based profilers that also support Spark Connect.
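
For illustration, the same flow is expected to work against a Spark Connect session as well; a minimal sketch, where the remote URL is only a placeholder:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf

# Hypothetical Spark Connect session; "sc://localhost" is a placeholder URL.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

@udf("string")
def f(x):
    return str(x)

spark.range(10).select(f(col("id"))).collect()
spark.showPerfProfiles()  # the new SparkSession-based API from this PR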

Does this PR introduce any user-facing change?

Yes, SparkSession-based UDF profilers will be available.

How was this patch tested?

Added related tests, tested manually, and ran existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xinrong-meng (Member) commented:

LGTM after the conflicts are resolved, thanks for the nice work!

.version("4.0.0")
.stringConf
.transform(_.toLowerCase(Locale.ROOT))
.checkValues(Set("perf", "memory"))
Member:

I wonder if it's more straightforward to use the module name, e.g., cProfile and memory-profiler.

Member:

I noticed there are multiple user-facing references to the current "perf" profiler: Python Profilers for UDFs, Workers profiling. It would be great if we could make them consistent.

Member Author (@ueshin):

@xinrong-meng what's the suggestion? Could you elaborate?

Member:

I could adjust those references once we decide on a standard name.

self.assertEqual(3, len(self.profile_results), str(list(self.profile_results)))

with self.trap_stdout() as io_all:
    self.spark.show_perf_profiles()
Member:

I wonder if we should name it something like showPerfProfiles, spark.profile.show(), or spark.showprofile.

Member Author (@ueshin):

Sure, let me change it to showPerfProfiles.

 * The accumulated results will be sent to the Python client via observed_metrics message.
 */
private[connect] val pythonAccumulator: Option[PythonAccumulator] =
  Try(session.sparkContext.collectionAccumulator[Array[Byte]]).toOption
Member:

Hm, looks like we don't need Try(...) here? I took a cursory look, and it seems it won't throw an exception.

Member:

BTW, if the profiler is disabled, we probably shouldn't create this accumulator, to avoid performance issues.

Member Author (@ueshin) commented Jan 17, 2024:

looks like we don't need Try(...) here?

In some tests, mocks of session or sparkContext are used, and they throw an exception when creating accumulators.

Member Author (@ueshin):

if the profiler is disabled, we probably shouldn't create this accumulator, to avoid performance issues

It always needs to have the accumulator because:

  • it can't know whether or when the profiler will be enabled
  • it has to support already-registered UDFs (see the sketch below)

What kind of performance issue are you concerned about?
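
A hedged sketch of the registered-UDF case mentioned above (the UDF name to_str is made up):

# Sketch: a UDF registered before the profiler conf is set can still be
# invoked (and profiled) later, so the accumulator must exist up front.
spark.udf.register("to_str", lambda x: str(x), "string")

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
spark.sql("SELECT to_str(id) FROM range(10)").collect()
spark.showPerfProfiles()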

Member:

My concern is registering too many accumulators, because calling this will create and register an accumulator for each session. Especially for Spark Connect, there could be a lot of Spark sessions.

Member Author (@ueshin):

There are already many more accumulators registered for each query, as SQLMetrics. I don't think one more accumulator per session would be an issue.

Member:

👌

@@ -0,0 +1,176 @@
#
Member:

Quick question: do we want to expose any of them in this file as an API?

Member Author (@ueshin) commented Jan 17, 2024:

No, the new config and spark.showPerfProfiles should be the new user-facing API, and SPARK-46687 will add spark.showMemoryProfiles.
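
For reference, the intended user-facing surface would then look roughly like this (spark.showMemoryProfiles is not part of this PR):

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")  # or "memory"
# ... run queries that invoke Python UDFs ...
spark.showPerfProfiles()      # added by this PR
# spark.showMemoryProfiles()  # planned in SPARK-46687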

@HyukjinKwon (Member):

Merged to master.
