[SPARK-48798][PYTHON] Introduce spark.profile.render for SparkSession-based profiling #47202

Closed
wants to merge 4 commits into from

Conversation

ueshin
Member

@ueshin ueshin commented Jul 3, 2024

What changes were proposed in this pull request?

Introduces spark.profile.render for SparkSession-based profiling.

It uses flameprof as the default renderer.

$ pip install flameprof

Run pyspark in a Jupyter notebook:

from pyspark.sql.functions import pandas_udf

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

df = spark.range(10)

@pandas_udf("long")
def add1(x):
    return x + 1

added = df.select(add1("id"))
added.show()

spark.profile.render(id=2)
[Image: pyspark-udf-profile]

On the CLI, it returns the SVG source as a string:

'<?xml version="1.0" standalone="no"?>\n<!DOCTYPE svg  ...
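
Because the CLI call returns the SVG source rather than displaying it, one option is to write the string to a file and open it in a browser. This is a minimal sketch, not part of the PR: `save_svg` and the file name are hypothetical, and the placeholder string stands in for the real return value of `spark.profile.render(id=2)`.

```python
import tempfile
from pathlib import Path


def save_svg(svg_source: str, path: Path) -> Path:
    """Hypothetical helper: write an SVG source string to `path`."""
    path.write_text(svg_source, encoding="utf-8")
    return path


# Placeholder standing in for: svg_source = spark.profile.render(id=2)
svg_source = (
    '<?xml version="1.0" standalone="no"?>\n'
    '<svg xmlns="http://www.w3.org/2000/svg"></svg>'
)

# Write to the system temp directory; any writable location would do.
out_path = save_svg(svg_source, Path(tempfile.gettempdir()) / "udf-profile.svg")
```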

Currently, the only built-in renderer is renderer="flameprof", which supports type="perf".

You can also pass an arbitrary renderer function:

def render_perf(stats):
    ...
spark.profile.render(id=2, type="perf", renderer=render_perf)

def render_memory(codemap):
    ...
spark.profile.render(id=2, type="memory", renderer=render_memory)
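
As an illustration of what a custom `render_perf` might look like, here is a sketch that returns a plain-text summary instead of a flame graph. It assumes the `stats` argument is a standard-library `pstats.Stats` object, which the PR description does not state explicitly; the `limit` parameter and the local `cProfile` run used to exercise the function are also assumptions made for the example.

```python
import cProfile
import io
import pstats


def render_perf(stats, limit=10):
    """Hypothetical renderer: top `limit` entries by cumulative time, as text.

    Assumes `stats` is a pstats.Stats instance.
    """
    buffer = io.StringIO()
    original_stream = stats.stream
    stats.stream = buffer  # redirect print_stats output into the buffer
    try:
        stats.sort_stats("cumulative").print_stats(limit)
    finally:
        stats.stream = original_stream  # restore the caller's stream
    return buffer.getvalue()


# Exercise the renderer locally with a stand-in profile (no Spark needed).
profiler = cProfile.Profile()
profiler.enable()
sum(x + 1 for x in range(1000))
profiler.disable()

summary = render_perf(pstats.Stats(profiler), limit=5)

# With a Spark session, the call from the description would be:
# spark.profile.render(id=2, type="perf", renderer=render_perf)
```

Returning a string keeps the behaviour close to the built-in renderer on the CLI, which also returns text rather than displaying anything.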

Why are the changes needed?

Better debuggability.

Does this PR introduce any user-facing change?

Yes, spark.profile.render will be available.

How was this patch tested?

Added and updated the related tests, and tested manually.

Was this patch authored or co-authored using generative AI tooling?

No.

@xinrong-meng
Member

LGTM, thank you for working on that!

@ueshin
Member Author

ueshin commented Jul 8, 2024

The test failures seem unrelated to this PR.

@ueshin
Member Author

ueshin commented Jul 8, 2024

Thanks! Merging to master.

@ueshin ueshin closed this in b062d44 Jul 8, 2024
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Jul 10, 2024
…on-based profiling

### What changes were proposed in this pull request?

Introduces `spark.profile.render` for SparkSession-based profiling.

It uses [`flameprof`](https://github.com/baverman/flameprof/) as the default renderer.

```
$ pip install flameprof
```

Run `pyspark` in a Jupyter notebook:

```py
from pyspark.sql.functions import pandas_udf

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

df = spark.range(10)

@pandas_udf("long")
def add1(x):
    return x + 1

added = df.select(add1("id"))
added.show()

spark.profile.render(id=2)
```

<img width="1103" alt="pyspark-udf-profile" src="https://github.com/apache/spark/assets/506656/795972e8-f7eb-4b89-89fc-3d8d18b86541">

On the CLI, it returns the `svg` source as a string:

```py
'<?xml version="1.0" standalone="no"?>\n<!DOCTYPE svg  ...
```

Currently, the only built-in renderer is `renderer="flameprof"`, which supports `type="perf"`.

You can also pass an arbitrary renderer function:

```py
def render_perf(stats):
    ...
spark.profile.render(id=2, type="perf", renderer=render_perf)

def render_memory(codemap):
    ...
spark.profile.render(id=2, type="memory", renderer=render_memory)
```

### Why are the changes needed?

Better debuggability.

### Does this PR introduce _any_ user-facing change?

Yes, `spark.profile.render` will be available.

### How was this patch tested?

Added and updated the related tests, and tested manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47202 from ueshin/issues/SPARK-48798/render.

Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
…on-based profiling
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…on-based profiling