
Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to improve Py4J performance by leveraging `getattr`; see also #46809.

This PR fixes Core, SQL, ML, and Structured Streaming. Test code, MLlib, and DStream are not affected.

Why are the changes needed?

To reduce the overhead of Py4J calls.
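For illustration, here is a minimal sketch of the pattern this PR applies (not the actual diff; it assumes an active SparkContext, and `org.apache.spark.util.Utils` is only an example class): dotted attribute access on the Py4J gateway is resolved component by component, while a single `getattr` with the fully qualified name resolves it in one call.

```py
# Illustrative sketch only; assumes an active SparkContext and uses
# org.apache.spark.util.Utils purely as an example class name.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
jvm = sc._jvm

# Before: each dotted component can trigger its own Py4J lookup.
utils = jvm.org.apache.spark.util.Utils

# After: one getattr with the fully qualified name, one lookup.
utils = getattr(jvm, "org.apache.spark.util.Utils")
```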

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually tested, as demonstrated in #49312.

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon
Member Author

Merged to master.

HyukjinKwon added a commit that referenced this pull request Jan 9, 2025
…ng getattr

### What changes were proposed in this pull request?

This PR is a followup of #49313 that fixes more places that were missed.

This PR fixes Core, SQL, ML, and Structured Streaming. Test code, MLlib, and DStream are not affected.

### Why are the changes needed?

To reduce the overhead of Py4J calls.
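As a rough, pure-Python illustration of where that overhead comes from (no Spark required; `FakePackage` is made up for this sketch, and its counter merely stands in for Py4J round trips):

```py
# Pure-Python stand-in for the Py4J gateway: each attribute hop counts
# as one lookup, analogous to one reflection round trip per component.
class FakePackage:
    lookups = 0  # shared counter across all package nodes

    def __init__(self, name=""):
        self._name = name

    def __getattr__(self, attr):
        FakePackage.lookups += 1
        return FakePackage((self._name + "." + attr).lstrip("."))

jvm = FakePackage()

jvm.org.apache.spark.util.Utils              # one lookup per component
print(FakePackage.lookups)                   # -> 5

FakePackage.lookups = 0
getattr(jvm, "org.apache.spark.util.Utils")  # one lookup for the whole name
print(FakePackage.lookups)                   # -> 1
```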

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested, as demonstrated in #49312.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49412 from HyukjinKwon/SPARK-50685-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Feb 4, 2025
### What changes were proposed in this pull request?

Avoids using `jvm.SparkSession` style, to improve Py4J performance similar to #49312, #49313, and #49412.

### Why are the changes needed?

To reduce the overhead of Py4J calls.

```py
import time

def benchmark(f, _n=10, *args, **kwargs):
    # Call f(*args, **kwargs) _n times and print the total elapsed seconds.
    start = time.time()

    for i in range(_n):
        f(*args, **kwargs)

    print(time.time() - start)
```

```py
from pyspark.context import SparkContext
jvm = SparkContext._jvm

def f():
    # Attribute-style access: Py4J resolves the name through the JVM view,
    # issuing reflection calls over the gateway for the lookup.
    return jvm.SparkSession

benchmark(f, 10000)  # -> 3.578310251235962
```

```py
from pyspark.context import SparkContext
jvm = SparkContext._jvm

def g():
    # A single getattr with the fully qualified class name resolves the
    # class in one lookup instead of walking the package hierarchy.
    return getattr(jvm, "org.apache.spark.sql.classic.SparkSession")

benchmark(g, 10000)  # -> 0.254807710647583
```
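As a hedged follow-up sketch reusing the benchmark harness above: resolving the class once and holding a reference skips the lookup entirely on later calls (`SparkSessionClass` is a local name introduced here for illustration).

```py
# Resolve once, reuse: no per-call Py4J name resolution after the first
# lookup. SparkSessionClass is introduced only for this sketch.
SparkSessionClass = getattr(jvm, "org.apache.spark.sql.classic.SparkSession")

def h():
    return SparkSessionClass

benchmark(h, 10000)  # should be faster still; no measurement claimed
```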

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The existing tests should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49760 from ueshin/issues/SPARK-51058/spark_session.

Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Feb 4, 2025

(cherry picked from commit 2581ca1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>