
[SPARK-47014][PYTHON][CONNECT] Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession #45073

Closed
wants to merge 6 commits

Conversation

@xinrong-meng (Member) commented Feb 9, 2024

What changes were proposed in this pull request?

Implement the methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession.

Why are the changes needed?

Complete support for (v2) SparkSession-based profiling.

Does this PR introduce any user-facing change?

Yes. dumpPerfProfiles and dumpMemoryProfiles of SparkSession are now supported.

An example of dumpPerfProfiles is shown below.

>>> from pyspark.sql.functions import udf  # imports added so the snippet is self-contained
>>> import os
>>> @udf("long")
... def add(x):
...     return x + 1
...
>>> spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
>>> spark.range(10).select(add("id")).collect()
...
>>> spark.dumpPerfProfiles("dummy_dir")
>>> os.listdir("dummy_dir")
['udf_2.pstats']
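
A corresponding flow works for dumpMemoryProfiles. The following is a hedged sketch rather than verbatim output: it assumes the memory-profiler package is installed, reuses the add UDF defined above, and the listed file name (following the udf_<id>_memory.txt convention from this PR) is illustrative; "dummy_memory_dir" is just a placeholder path.

>>> spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
>>> spark.range(10).select(add("id")).collect()
...
>>> spark.dumpMemoryProfiles("dummy_memory_dir")
>>> os.listdir("dummy_memory_dir")
['udf_2_memory.txt']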

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xinrong-meng xinrong-meng changed the title Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession [SPARK-47014][PYTHON][CONNECT] Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession Feb 9, 2024
@xinrong-meng xinrong-meng marked this pull request as ready for review February 12, 2024 22:53
@ueshin (Member) left a comment

Otherwise, LGTM.

with self._lock:
    stats = self._perf_profile_results

def dump(path: str, id: int) -> None:
Member

nit: path is not necessary for this internal function?

Member Author

Good catch!
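
For context, a minimal sketch of what the simplified helper might look like once the unused parameter is dropped (the stats lookup and the dump_stats call are assumptions based on the surrounding diff context, not the exact code in this PR):

def dump(id: int) -> None:
    # `path` is captured from the enclosing dump_perf_profiles scope,
    # so the inner helper only needs the UDF id.
    s = stats.get(id)  # assumed: stats maps UDF id -> pstats.Stats
    if s is not None:
        if not os.path.exists(path):
            os.makedirs(path)
        p = os.path.join(path, f"udf_{id}_perf.pstats")
        s.dump_stats(p)  # pstats.Stats.dump_stats writes the binary profile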

with self._lock:
    code_map = self._memory_profile_results

def dump(path: str, id: int) -> None:
Member

ditto.

if s is not None:
    if not os.path.exists(path):
        os.makedirs(path)
    p = os.path.join(path, "udf_%d.pstats" % id)
Member

udf_%d_perf.pstats?

Member

btw, f"udf_{id}_perf.pstats"?

Member Author

Makes sense! I'll reuse f"udf_{id}_memory.txt" for memory profiles for backward compatibility.
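
Put differently, the naming the thread converges on (an illustrative helper, not code from this diff):

def profile_filename(id: int, kind: str) -> str:
    # Perf results get the new "_perf.pstats" suffix; memory results keep
    # the existing "_memory.txt" name for backward compatibility.
    return f"udf_{id}_perf.pstats" if kind == "perf" else f"udf_{id}_memory.txt"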

if cm is not None:
    if not os.path.exists(path):
        os.makedirs(path)
    p = os.path.join(path, "udf_%d_memory.txt" % id)
Member

ditto.

@@ -158,6 +159,70 @@ def _profile_results(self) -> "ProfileResults":
        """
        ...

    def dump_perf_profiles(self, path: str, id: Optional[int] = None) -> None:
Member

I wonder if we should have something like spark.profile.memory or spark.profile.dump(type="memory"), spark.profile.show(type="memory")

Member Author

Good point!
V1 has sc.dump_profiles/show_profiles for both perf and memory profiling.
V2 has spark.dumpPerfProfiles and spark.dumpMemoryProfiles for perf and memory profiling separately.
It would be more consistent and user-friendly to introduce a uniform interface for both, such as spark.profile.dump/show.
Let me create a ticket for now.
What are your thoughts on that, @ueshin?

Member

Sounds good.
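
To make the proposal concrete, here is a purely hypothetical sketch of the uniform interface under discussion; spark.profile, Profile, and the type parameter do not exist in this PR and are only the shape being considered for the follow-up ticket (the dispatch targets are the methods this PR adds):

class Profile:
    # Hypothetical accessor that a future spark.profile property might return.
    def __init__(self, session: "SparkSession") -> None:
        self._session = session

    def dump(self, path: str, type: str = "perf") -> None:
        # Dispatch to the per-type dump methods introduced in this PR.
        if type == "perf":
            self._session.dumpPerfProfiles(path)
        else:
            self._session.dumpMemoryProfiles(path)

Usage idea: spark.profile.dump("dummy_dir", type="memory").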

@HyukjinKwon (Member) left a comment

The change itself LGTM too.

@xinrong-meng (Member Author)

Merged to master, thank you!
