[SPARK-46686][PYTHON][CONNECT] Basic support of SparkSession based Python UDF profiler #44697
Conversation
LGTM after the conflicts are resolved, thanks for the nice work!
.version("4.0.0")
.stringConf
.transform(_.toLowerCase(Locale.ROOT))
.checkValues(Set("perf", "memory"))
I wonder if it's more straightforward to use the module name, e.g., cProfiler and memory-profiler.
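For context, these names refer to the underlying Python profiling tools: the standard-library cProfile module and the third-party memory-profiler package. A minimal, self-contained sketch of the kind of cumulative-time report the "perf" option is built on (the PR's actual reporting code may differ):

```python
import cProfile
import io
import pstats

def work():
    # Some CPU work so the profile has entries to report.
    return sum(i * i for i in range(1000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Render a cumulative-time report into a string, the way a
# driver-side "show profiles" call would print it.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```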
I noticed there are multiple user-facing references to the current "perf" profiler: Python Profilers for UDFs, Workers profiling. It would be great if we could make them consistent.
@xinrong-meng what's the suggestion? could you elaborate?
I could adjust those references once we decide on a standard name.
self.assertEqual(3, len(self.profile_results), str(list(self.profile_results)))

with self.trap_stdout() as io_all:
    self.spark.show_perf_profiles()
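As an aside, the trap_stdout helper in this test is assumed here to be a thin wrapper over contextlib.redirect_stdout; a self-contained sketch of that pattern (the helper name and exact implementation in the PR may differ):

```python
import contextlib
import io

@contextlib.contextmanager
def trap_stdout():
    # Capture anything printed inside the block, as the test helper does.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        yield buf

with trap_stdout() as io_all:
    print("profile report")

# The captured text is available after the block exits.
assert "profile report" in io_all.getvalue()
```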
I wonder if we should name it like showPerfProfiles, spark.profile.show(), or spark.showprofile.
Sure, let me change it to showPerfProfiles.
 * The accumulated results will be sent to the Python client via observed_metrics message.
 */
private[connect] val pythonAccumulator: Option[PythonAccumulator] =
  Try(session.sparkContext.collectionAccumulator[Array[Byte]]).toOption
Hm, looks like we don't need Try(...) here? I took a cursory look, and it seems it won't throw an exception.
BTW, if the profiler is disabled, we probably shouldn't create this accumulator, to avoid performance issues.
looks like we don't need Try(...) here?

In some tests, mocks of session or sparkContext are used, and they throw an exception when creating accumulators.
if the profiler is disabled, we probably shouldn't create this accumulator, to avoid performance issues.

It needs to always have the accumulator because:
- it can't know whether or when the profiler is enabled
- it needs to support the registered UDFs

What kind of performance issue are you concerned about?
My concern is registering too many accumulators, because calling this will create and register an accumulator for each session. Especially for Spark Connect, there could be a lot of Spark sessions.
There are already many more accumulators registered for each query, as SQLMetrics. I don't think one more accumulator per session would be an issue.
👌
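To illustrate why driver-side accumulation of profiles stays cheap: the per-task payloads are just profile stats that can be merged incrementally. A hedged standard-library sketch of such a merge (the PR's actual wire format and helper names may differ):

```python
import cProfile
import pstats

def profile_run(fn):
    """Profile one 'task' and return its Stats, mimicking a worker payload."""
    p = cProfile.Profile()
    p.enable()
    fn()
    p.disable()
    return pstats.Stats(p)

def task():
    sum(range(10000))

# Merge several per-task profiles into one accumulated view, roughly
# what aggregating a collection accumulator on the driver amounts to.
merged = profile_run(task)
for _ in range(2):
    merged.add(profile_run(task))

# Three profiled runs contribute at least three recorded calls in total.
assert merged.total_calls >= 3
```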
@@ -0,0 +1,176 @@
#
Quick question: do we want to expose any of them in this file as an API?
No, the new config and spark.showPerfProfiles should be the new user-facing API, and SPARK-46687 will add spark.showMemoryProfiles.
Merged to master.
What changes were proposed in this pull request?
Basic support of a SparkSession-based Python UDF profiler.
To enable the profiler, set the SQL conf spark.sql.pyspark.udf.profiler:
- "perf": enable cProfiler
- "memory": enable memory-profiler (TODO: SPARK-46687)

Why are the changes needed?
The existing UDF profilers are SparkContext based, which can't support Spark Connect.
We should introduce SparkSession based profilers and support Spark Connect.
Does this PR introduce any user-facing change?
Yes, SparkSession-based UDF profilers will be available.
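Based on the conf and API names stated in this PR (spark.sql.pyspark.udf.profiler and spark.showPerfProfiles), the intended usage might look like the following sketch; it assumes an existing SparkSession named spark and is not runnable standalone:

```python
from pyspark.sql.functions import udf

# Enable the SparkSession-based profiler; "perf" selects cProfile.
spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

@udf("long")
def plus_one(v):
    return v + 1

# Run a query so the UDF executes and its profile is accumulated.
spark.range(10).select(plus_one("id")).collect()

# Render the accumulated cProfile results for the profiled UDFs.
spark.showPerfProfiles()
```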
How was this patch tested?
Added related tests; also verified manually and with the existing tests.
Was this patch authored or co-authored using generative AI tooling?
No.