
[SPARK-46687][PYTHON][CONNECT] Basic support of SparkSession-based memory profiler #44775

Closed
wants to merge 5 commits

Conversation

xinrong-meng (Member) commented Jan 18, 2024

What changes were proposed in this pull request?

Basic support of SparkSession-based memory profiler in both Spark Connect and non-Spark-Connect.

Why are the changes needed?

We need to make the memory profiler SparkSession-based to support memory profiling in Spark Connect.

Does this PR introduce any user-facing change?

Yes, the SparkSession-based memory profiler is available.

An example is shown below:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.taskcontext import TaskContext

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")

@udf("string")
def f(x):
    # Return a value on even partitions only, to vary memory behavior.
    if TaskContext.get().partitionId() % 2 == 0:
        return str(x)
    else:
        return None

spark.range(10).select(f(col("id"))).show()

spark.showMemoryProfiles()

which shows the following profile result:

============================================================
Profile of UDF<id=2>
============================================================
Filename: /var/folders/h_/60n1p_5s7751jx1st4_sk0780000gp/T/ipykernel_72839/2848225169.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    113.2 MiB    113.2 MiB          10   @udf("string")
     8                                         def f(x):
     9    114.4 MiB      1.3 MiB          10     if TaskContext.get().partitionId() % 2 == 0:
    10     31.8 MiB      0.1 MiB           4       return str(x)
    11                                           else:
    12     82.8 MiB      0.1 MiB           6       return None

How was this patch tested?

New and existing unit tests:

  • pyspark.tests.test_memory_profiler
  • pyspark.sql.tests.connect.test_parity_memory_profiler

Plus manual tests in a Jupyter notebook.
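
For reference, a quick way to run these modules locally (a sketch assuming an installed PySpark development environment; Spark's python/run-tests script is the canonical entry point in CI):

# Load and run the two new test modules with the standard unittest runner.
import unittest

suite = unittest.defaultTestLoader.loadTestsFromNames([
    "pyspark.tests.test_memory_profiler",
    "pyspark.sql.tests.connect.test_parity_memory_profiler",
])
unittest.TextTestRunner(verbosity=2).run(suite)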

Was this patch authored or co-authored using generative AI tooling?

No.

@xinrong-meng xinrong-meng changed the title [WIP] Basic support of SparkSession-based memory profiler [SPARK-46687][PYTHON] Basic support of SparkSession-based memory profiler Jan 19, 2024
@xinrong-meng xinrong-meng changed the title [SPARK-46687][PYTHON] Basic support of SparkSession-based memory profiler [SPARK-46687][PYTHON][CONNECT] Basic support of SparkSession-based memory profiler Jan 19, 2024
@xinrong-meng xinrong-meng marked this pull request as ready for review January 22, 2024 21:33
xinrong-meng (Member, Author) commented:

The failure in https://github.com/xinrong-meng/spark/actions/runs/7648782322/job/20842144027 is unrelated to the PR changes; I will rebase on master.
@ueshin would you please review when you are free?

measures = self[code]
if not measures:
    continue  # skip if no measurement
linenos = range(min(measures), max(measures) + 1)
A reviewer (Member) commented on this snippet:
We may want to delay generating the full linenos until showing the results, to reduce the intermediate data?

xinrong-meng (Member, Author) replied:
Good idea! Updated.
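
For illustration only, a minimal sketch of the lazy approach (the class and method names here are hypothetical, not the actual patch): only observed lines are stored while profiling, and the contiguous line range is materialized when the profile is rendered.

# Hypothetical sketch, not the actual PySpark change: keep only raw
# per-line measurements during profiling; generate the full line range
# lazily at display time to reduce intermediate data.
class MemoryCodeMap(dict):
    def add_measurement(self, code, lineno, mem_usage):
        # Record only lines that were actually measured.
        self.setdefault(code, {})[lineno] = mem_usage

    def show(self):
        for code, measures in self.items():
            if not measures:
                continue  # skip if no measurement
            # Materialize the contiguous range only when rendering.
            for lineno in range(min(measures), max(measures) + 1):
                print(lineno, measures.get(lineno, ""))

This keeps the intermediate profiling data proportional to the number of measured lines rather than to the full source span.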

xinrong-meng (Member, Author) shared the updated profile output:
============================================================
Profile of UDF<id=2>
============================================================
Filename: /var/folders/h_/60n1p_5s7751jx1st4_sk0780000gp/T/ipykernel_69451/109011680.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     8    147.7 MiB    147.7 MiB          20   @udf("string")
     9                                         def a(x):
    10    149.6 MiB      1.8 MiB          20     if TaskContext.get().partitionId() % 2 == 0:
    11     59.9 MiB      0.1 MiB           8       return str(x)
    12                                           else:
    13     89.9 MiB      0.1 MiB          12       return None

Tested on Jupyter.

ueshin (Member) left a review comment:

LGTM, pending tests.

ueshin (Member) commented Jan 29, 2024

Thanks! Merging to master.

@ueshin ueshin closed this in 528ac8b Jan 29, 2024
xinrong-meng (Member, Author) replied:

Thank you @ueshin !

HyukjinKwon added a commit that referenced this pull request on Feb 15, 2024: …s when codecov enabled

### What changes were proposed in this pull request?

This is a follow-up of #44775 that skips the tests when codecov is enabled. They currently fail (https://github.com/apache/spark/actions/runs/7709423681/job/21010676103), and the coverage report is broken.

### Why are the changes needed?

To recover the test coverage report.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45112 from HyukjinKwon/SPARK-46687-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
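
For context, such a skip might look like the following sketch (hypothetical; the COVERAGE_PROCESS_START environment variable and the exact condition used in #45112 are assumptions, not the actual patch):

# Hypothetical sketch: skip memory profiler tests when running under
# code coverage, where trace-based tooling can conflict with the profiler.
import os
import unittest

@unittest.skipIf(
    "COVERAGE_PROCESS_START" in os.environ,
    "Flaky when code coverage is enabled",
)
class MemoryProfilerTests(unittest.TestCase):
    def test_show_memory_profiles(self):
        ...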