[SPARK-47014][PYTHON][CONNECT] Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession #45073
6c0d7db to 77bacc6
Otherwise, LGTM.
python/pyspark/sql/profiler.py
Outdated
with self._lock:
    stats = self._perf_profile_results

def dump(path: str, id: int) -> None:
nit: path is not necessary for this internal function?
Good catch!
python/pyspark/sql/profiler.py
Outdated
with self._lock:
    code_map = self._memory_profile_results

def dump(path: str, id: int) -> None:
ditto.
python/pyspark/sql/profiler.py
Outdated
if s is not None:
    if not os.path.exists(path):
        os.makedirs(path)
    p = os.path.join(path, "udf_%d.pstats" % id)
udf_%d_perf.pstats?
btw, f"udf_{id}_perf.pstats"?
Makes sense! I'll reuse f"udf_{id}_memory.txt" for memory profiles for backward compatibility.
python/pyspark/sql/profiler.py
Outdated
if cm is not None:
    if not os.path.exists(path):
        os.makedirs(path)
    p = os.path.join(path, "udf_%d_memory.txt" % id)
ditto.
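The dump helper discussed in the threads above can be sketched roughly as follows. This is a minimal standalone illustration, not the code merged in the PR: it assumes the results are a plain dict mapping a UDF id to a `pstats.Stats` object, and `profile_call` is a hypothetical helper used only to produce sample data. It applies the two review suggestions: the inner `dump` no longer takes `path`, and the file name follows the `udf_{id}_perf.pstats` pattern.

```python
import cProfile
import os
import pstats
from typing import Callable, Dict, Optional


def dump_perf_profiles(
    results: Dict[int, pstats.Stats], path: str, id: Optional[int] = None
) -> None:
    """Write profiles to <path>/udf_<id>_perf.pstats (one id, or all if id is None)."""

    # The inner helper closes over `path`, so it only needs the id.
    def dump(id: int) -> None:
        s = results.get(id)
        if s is not None:
            os.makedirs(path, exist_ok=True)
            s.dump_stats(os.path.join(path, f"udf_{id}_perf.pstats"))

    if id is not None:
        dump(id)
    else:
        for udf_id in sorted(results):
            dump(udf_id)


def profile_call(func: Callable, *args) -> pstats.Stats:
    """Hypothetical helper: profile one call and return its stats."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    return pstats.Stats(profiler)
```

A file written this way can be loaded back with `pstats.Stats(file_name)` for inspection.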
@@ -158,6 +159,70 @@ def _profile_results(self) -> "ProfileResults":
        """
        ...

    def dump_perf_profiles(self, path: str, id: Optional[int] = None) -> None:
I wonder if we should have something like spark.profile.memory or spark.profile.dump(type="memory"), spark.profile.show(type="memory")
Good point!
V1 has sc.dump_profiles/show_profiles for both perf and memory profiling.
V2 has spark.dumpPerfProfiles and spark.dumpMemoryProfiles for perf and memory profiling separately.
It would be more consistent and user-friendly to introduce a uniform interface for both like spark.profile.dump/show.
Let me create a ticket for now.
What's your thought on that @ueshin?
Sounds good.
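For illustration only, the uniform interface proposed in this thread might dispatch on a `type` argument along these lines. The thread names `spark.profile.dump` and `spark.profile.show` but does not specify a design, so the `Profile` class, its constructor, and the `dumpers` mapping below are all assumptions made for this sketch, not an existing PySpark API.

```python
from typing import Callable, Dict, Optional

# Assumed callback shape: (path, id) -> None, like the per-type dump methods.
Dumper = Callable[[str, Optional[int]], None]


class Profile:
    """Hypothetical facade: a single dump() that routes by profile type."""

    def __init__(self, dumpers: Dict[str, Dumper]) -> None:
        self._dumpers = dumpers

    def dump(self, path: str, id: Optional[int] = None, *, type: str) -> None:
        # Look up the type-specific dumper, rejecting unknown types early.
        dumper = self._dumpers.get(type)
        if dumper is None:
            raise ValueError(
                f"type must be one of {sorted(self._dumpers)}, got {type!r}"
            )
        dumper(path, id)
```

With such a facade, a call like `profile.dump(path, type="memory")` would route to the memory dumper, while the existing per-type methods could remain as the underlying callbacks.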
The change itself LGTM too
Merged to master, thank you!
What changes were proposed in this pull request?
Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession
Why are the changes needed?
Complete support of (v2) SparkSession-based profiling.
Does this PR introduce any user-facing change?
Yes. dumpPerfProfiles and dumpMemoryProfiles of SparkSession are supported.
An example of dumpPerfProfiles is shown below.
How was this patch tested?
Unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.