
[SPARK-47014][PYTHON][CONNECT] Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession #45073

Closed
wants to merge 6 commits

Conversation

@xinrong-meng (Member) commented Feb 9, 2024

What changes were proposed in this pull request?

Implement the methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession.

Why are the changes needed?

Complete support for (v2) SparkSession-based profiling.

Does this PR introduce any user-facing change?

Yes. dumpPerfProfiles and dumpMemoryProfiles of SparkSession are now supported.

An example of dumpPerfProfiles is shown below.

>>> from pyspark.sql.functions import udf  # imports added so the snippet is self-contained
>>> import os
>>> @udf("long")
... def add(x):
...     return x + 1
...
>>> spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
>>> spark.range(10).select(add("id")).collect()
...
>>> spark.dumpPerfProfiles("dummy_dir")
>>> os.listdir("dummy_dir")
['udf_2.pstats']
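
A corresponding flow works for dumpMemoryProfiles. The following is a hedged sketch rather than verbatim output: it assumes the memory-profiler package is installed, reuses the add UDF defined above, and the listed file name (following the udf_<id>_memory.txt convention from this PR) is illustrative; "dummy_memory_dir" is just a placeholder path.

>>> spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
>>> spark.range(10).select(add("id")).collect()
...
>>> spark.dumpMemoryProfiles("dummy_memory_dir")
>>> os.listdir("dummy_memory_dir")
['udf_2_memory.txt']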

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xinrong-meng xinrong-meng changed the title Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession [SPARK-47014][PYTHON][CONNECT] Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession Feb 9, 2024
@xinrong-meng xinrong-meng marked this pull request as ready for review February 12, 2024 22:53
@ueshin (Member) left a comment

Otherwise, LGTM.

with self._lock:
    stats = self._perf_profile_results

def dump(path: str, id: int) -> None:
Member

nit: path is not necessary for this internal function?

Member Author

Good catch!
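
For context, a minimal sketch of what the simplified helper might look like once the unused parameter is dropped (the stats lookup and the dump_stats call are assumptions based on the surrounding diff context, not the exact code in this PR):

def dump(id: int) -> None:
    # `path` is captured from the enclosing dump_perf_profiles scope,
    # so the inner helper only needs the UDF id.
    s = stats.get(id)  # assumed: stats maps UDF id -> pstats.Stats
    if s is not None:
        if not os.path.exists(path):
            os.makedirs(path)
        p = os.path.join(path, f"udf_{id}_perf.pstats")
        s.dump_stats(p)  # pstats.Stats.dump_stats writes the binary profile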

with self._lock:
    code_map = self._memory_profile_results

def dump(path: str, id: int) -> None:
Member

ditto.

if s is not None:
    if not os.path.exists(path):
        os.makedirs(path)
    p = os.path.join(path, "udf_%d.pstats" % id)
Member

udf_%d_perf.pstats?

Member

btw, f"udf_{id}_perf.pstats"?

Member Author

Makes sense! I'll reuse f"udf_{id}_memory.txt" for memory profiles for backward compatibility.
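
Put differently, the naming the thread converges on (an illustrative helper, not code from this diff):

def profile_filename(id: int, kind: str) -> str:
    # Perf results get the new "_perf.pstats" suffix; memory results keep
    # the existing "_memory.txt" name for backward compatibility.
    return f"udf_{id}_perf.pstats" if kind == "perf" else f"udf_{id}_memory.txt"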

if cm is not None:
    if not os.path.exists(path):
        os.makedirs(path)
    p = os.path.join(path, "udf_%d_memory.txt" % id)
Member

ditto.

@@ -158,6 +159,70 @@ def _profile_results(self) -> "ProfileResults":
        """
        ...

    def dump_perf_profiles(self, path: str, id: Optional[int] = None) -> None:
Member

I wonder if we should have something like spark.profile.memory or spark.profile.dump(type="memory"), spark.profile.show(type="memory")

Member Author

Good point!
V1 has sc.dump_profiles/show_profiles for both perf and memory profiling.
V2 has spark.dumpPerfProfiles and spark.dumpMemoryProfiles for perf and memory profiling separately.
It would be more consistent and user-friendly to introduce a uniform interface for both, such as spark.profile.dump/show.
Let me create a ticket for now.
What are your thoughts on that, @ueshin?

Member

Sounds good.
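
To make the proposal concrete, here is a purely hypothetical sketch of the uniform interface under discussion; spark.profile, Profile, and the type parameter do not exist in this PR and are only the shape being considered for the follow-up ticket (the dispatch targets are the methods this PR adds):

class Profile:
    # Hypothetical accessor that a future spark.profile property might return.
    def __init__(self, session: "SparkSession") -> None:
        self._session = session

    def dump(self, path: str, type: str = "perf") -> None:
        # Dispatch to the per-type dump methods introduced in this PR.
        if type == "perf":
            self._session.dumpPerfProfiles(path)
        else:
            self._session.dumpMemoryProfiles(path)

Usage idea: spark.profile.dump("dummy_dir", type="memory").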

@HyukjinKwon (Member) left a comment

The change itself LGTM too.

@xinrong-meng (Member Author)

Merged to master, thank you!
