
Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to improve Py4J performance by leveraging `getattr`; see also #46809.

This PR fixes Core, SQL, ML, and Structured Streaming. Test code, MLlib, and DStream are not affected.

Why are the changes needed?

To reduce the overhead of Py4J calls.
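For illustration, here is a minimal sketch of the pattern this PR applies (not the actual diff; it assumes an active SparkContext, and `org.apache.spark.util.Utils` is only an example class): dotted attribute access on the Py4J gateway is resolved component by component, while a single `getattr` with the fully qualified name resolves it in one call.

```py
# Illustrative sketch only; assumes an active SparkContext and uses
# org.apache.spark.util.Utils purely as an example class name.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
jvm = sc._jvm

# Before: each dotted component can trigger its own Py4J lookup.
utils = jvm.org.apache.spark.util.Utils

# After: one getattr with the fully qualified name, one lookup.
utils = getattr(jvm, "org.apache.spark.util.Utils")
```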

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually tested, as demonstrated in #49312.

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon
Member Author

Merged to master.

HyukjinKwon added a commit that referenced this pull request Jan 9, 2025
…ng getattr

### What changes were proposed in this pull request?

This PR is a followup of #49313 that fixes more places that were missed.

This PR fixes Core, SQL, ML, and Structured Streaming. Test code, MLlib, and DStream are not affected.

### Why are the changes needed?

To reduce the overhead of Py4J calls.
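As a rough, pure-Python illustration of where that overhead comes from (no Spark required; `FakePackage` is made up for this sketch, and its counter merely stands in for Py4J round trips):

```py
# Pure-Python stand-in for the Py4J gateway: each attribute hop counts
# as one lookup, analogous to one reflection round trip per component.
class FakePackage:
    lookups = 0  # shared counter across all package nodes

    def __init__(self, name=""):
        self._name = name

    def __getattr__(self, attr):
        FakePackage.lookups += 1
        return FakePackage((self._name + "." + attr).lstrip("."))

jvm = FakePackage()

jvm.org.apache.spark.util.Utils              # one lookup per component
print(FakePackage.lookups)                   # -> 5

FakePackage.lookups = 0
getattr(jvm, "org.apache.spark.util.Utils")  # one lookup for the whole name
print(FakePackage.lookups)                   # -> 1
```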

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested, as demonstrated in #49312.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49412 from HyukjinKwon/SPARK-50685-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Feb 4, 2025
### What changes were proposed in this pull request?

Avoids using `jvm.SparkSession` style, to improve Py4J performance similar to #49312, #49313, and #49412.

### Why are the changes needed?

To reduce the overhead of Py4J calls.

```py
import time

def benchmark(f, _n=10, *args, **kwargs):
    # Call f(*args, **kwargs) _n times and print the total elapsed seconds.
    start = time.time()

    for i in range(_n):
        f(*args, **kwargs)

    print(time.time() - start)
```

```py
from pyspark.context import SparkContext
jvm = SparkContext._jvm

def f():
    # Attribute-style access: Py4J resolves the name through the JVM view,
    # issuing reflection calls over the gateway for the lookup.
    return jvm.SparkSession

benchmark(f, 10000)  # -> 3.578310251235962
```

```py
from pyspark.context import SparkContext
jvm = SparkContext._jvm

def g():
    # A single getattr with the fully qualified class name resolves the
    # class in one lookup instead of walking the package hierarchy.
    return getattr(jvm, "org.apache.spark.sql.classic.SparkSession")

benchmark(g, 10000)  # -> 0.254807710647583
```
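As a hedged follow-up sketch reusing the benchmark harness above: resolving the class once and holding a reference skips the lookup entirely on later calls (`SparkSessionClass` is a local name introduced here for illustration).

```py
# Resolve once, reuse: no per-call Py4J name resolution after the first
# lookup. SparkSessionClass is introduced only for this sketch.
SparkSessionClass = getattr(jvm, "org.apache.spark.sql.classic.SparkSession")

def h():
    return SparkSessionClass

benchmark(h, 10000)  # should be faster still; no measurement claimed
```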

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The existing tests should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49760 from ueshin/issues/SPARK-51058/spark_session.

Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Feb 4, 2025

(cherry picked from commit 2581ca1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>