
[SPARK-46686][PYTHON][CONNECT] Basic support of SparkSession based Python UDF profiler #44697

Closed
wants to merge 9 commits

Conversation

@ueshin (Member) commented Jan 12, 2024

What changes were proposed in this pull request?

Adds basic support for a SparkSession-based Python UDF profiler.

To enable the profiler, set the SQL conf spark.sql.pyspark.udf.profiler:

  • "perf": enables the cProfile-based profiler
  • "memory": enables the memory-profiler-based profiler (TODO: SPARK-46687)
from pyspark.sql.functions import *

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")  # enable cProfiler

@udf("string")
def f(x):
    return str(x)

df = spark.range(10).select(f(col("id")))
df.collect()

@pandas_udf("string")
def g(x):
    return x.astype("string")

df = spark.range(10).select(g(col("id")))

spark.conf.unset("spark.sql.pyspark.udf.profiler")  # disable

df.collect()  # won't profile

spark.showPerfProfiles()  # shows the results only for the first collect()
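
As an aside, since the conf is restricted to "perf" and "memory" (see the checkValues fragment quoted in the review below), setting any other value is expected to fail. A minimal sketch under that assumption:

# Hedged sketch: values other than "perf"/"memory" should be rejected
# by the conf's value check when being set.
try:
    spark.conf.set("spark.sql.pyspark.udf.profiler", "bogus")
except Exception as e:
    print(e)  # expected: an IllegalArgumentException from the conf check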

Why are the changes needed?

The existing UDF profilers are SparkContext-based, so they can't support Spark Connect.

We should introduce SparkSession-based profilers that also support Spark Connect.
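
For illustration, the same flow is expected to work against a Spark Connect session as well; a minimal sketch, where the remote URL is only a placeholder:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf

# Hypothetical Spark Connect session; "sc://localhost" is a placeholder URL.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

@udf("string")
def f(x):
    return str(x)

spark.range(10).select(f(col("id"))).collect()
spark.showPerfProfiles()  # the new SparkSession-based API from this PR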

Does this PR introduce any user-facing change?

Yes, SparkSession-based UDF profilers will be available.

How was this patch tested?

Added related tests, tested manually, and ran existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xinrong-meng (Member) commented:

LGTM after the conflicts are resolved, thanks for the nice work!

.version("4.0.0")
.stringConf
.transform(_.toLowerCase(Locale.ROOT))
.checkValues(Set("perf", "memory"))
Member:

I wonder if it's more straightforward to use the module name, e.g., cProfile and memory-profiler.

Member:

I noticed there are multiple user-facing references to the current "perf" profiler: Python Profilers for UDFs, Workers profiling. It would be great if we could make them consistent.

Member Author (@ueshin):

@xinrong-meng what's the suggestion? Could you elaborate?

Member:

I could adjust those references once we decide on a standard name.

self.assertEqual(3, len(self.profile_results), str(list(self.profile_results)))

with self.trap_stdout() as io_all:
    self.spark.show_perf_profiles()
Member:

I wonder if we should name it something like showPerfProfiles, spark.profile.show(), or spark.showprofile.

Member Author (@ueshin):

Sure, let me change it to showPerfProfiles.

 * The accumulated results will be sent to the Python client via observed_metrics message.
 */
private[connect] val pythonAccumulator: Option[PythonAccumulator] =
  Try(session.sparkContext.collectionAccumulator[Array[Byte]]).toOption
Member:

Hm, looks like we don't need Try(...) here? I took a cursory look, and it seems it won't throw an exception.

Member:

BTW, if the profiler is disabled, we probably shouldn't create this accumulator, to avoid performance issues.

Member Author (@ueshin) commented Jan 17, 2024:

looks like we don't need Try(...) here?

In some tests, mocks of session or sparkContext are used, and they throw an exception when creating accumulators.

Member Author (@ueshin):

if the profiler is disabled, we probably shouldn't create this accumulator, to avoid performance issues

It always needs to have the accumulator because:

  • it can't know whether or when the profiler will be enabled
  • it has to support already-registered UDFs (see the sketch below)

What kind of performance issue are you concerned about?
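
A hedged sketch of the registered-UDF case mentioned above (the UDF name to_str is made up):

# Sketch: a UDF registered before the profiler conf is set can still be
# invoked (and profiled) later, so the accumulator must exist up front.
spark.udf.register("to_str", lambda x: str(x), "string")

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
spark.sql("SELECT to_str(id) FROM range(10)").collect()
spark.showPerfProfiles()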

Member:

My concern is registering too many accumulators, because calling this will create and register an accumulator for each session. Especially for Spark Connect, there could be a lot of Spark sessions.

Member Author (@ueshin):

There are already many more accumulators registered for each query, as SQLMetrics. I don't think one more accumulator per session would be an issue.

Member:

👌

@@ -0,0 +1,176 @@
#
Member:

Quick question: do we want to expose any of them in this file as an API?

Member Author (@ueshin) commented Jan 17, 2024:

No, the new config and spark.showPerfProfiles should be the new user-facing API, and SPARK-46687 will add spark.showMemoryProfiles.
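
For reference, the intended user-facing surface would then look roughly like this (spark.showMemoryProfiles is not part of this PR):

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")  # or "memory"
# ... run queries that invoke Python UDFs ...
spark.showPerfProfiles()      # added by this PR
# spark.showMemoryProfiles()  # planned in SPARK-46687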

@HyukjinKwon (Member):

Merged to master.
