Conversation


@jonmio jonmio commented Jan 15, 2026

What changes were proposed in this pull request?

Allow users to call SparkSessionBuilder.create from PySpark in Classic mode so that they can create a new session with the provided builder configs without mutating the configurations of any existing sessions.

Why are the changes needed?

Currently, create is not supported in PySpark Classic. Users can call getOrCreate to obtain a session with the provided configurations, but if a session already exists, getOrCreate will mutate that session's existing configurations. We would like to provide an API that guarantees creation of a new session and has no side effects on existing sessions.

This change also unifies the API between Classic and Connect mode since create is already supported in Connect.

Does this PR introduce any user-facing change?

Previously calling SparkSessionBuilder.create() would throw an exception saying that create is only supported in Connect mode.

After this change, SparkSessionBuilder.create() will create a new SparkSession with the specified builder configurations.
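
For illustration, here is a minimal sketch of the new behavior (the config key and values are arbitrary examples, not taken from the PR):

from pyspark.sql import SparkSession

existing = SparkSession.builder.master("local[1]").appName("first").getOrCreate()

# create() always returns a brand-new session; the builder configs apply
# to that session only, instead of being written into the existing session
# the way getOrCreate() would when one is already running.
fresh = (
    SparkSession.builder
    .config("spark.sql.shuffle.partitions", "4")
    .create()
)

assert fresh is not existing
assert fresh.conf.get("spark.sql.shuffle.partitions") == "4"
assert existing.conf.get("spark.sql.shuffle.partitions") != "4"  # untouched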

How was this patch tested?

Added the new test class SparkSessionBuilderCreateTests in python/pyspark/sql/tests/test_session.py

Was this patch authored or co-authored using generative AI tooling?

No


github-actions bot commented Jan 15, 2026

JIRA Issue Information

=== Improvement SPARK-55055 ===
Summary: Support SparkSession.Builder.create for PySpark Classic #53820
Assignee: Jon Mio
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@jonmio jonmio changed the title Support SparkSession.Builder.create for PySpark Classic [SPARK-55055] Support SparkSession.Builder.create for PySpark Classic Jan 15, 2026
Contributor

@sryza sryza left a comment


This LGTM. @cloud-fan – are you interested in taking a look too?


sc = (
    self._sc
    if getattr(self, "_sc", None) is not None and self._sc._jsc is not None
Member


In the Classic Builder, which function actually sets self._sc? Is it always None?

CC @HyukjinKwon and @cloud-fan

Member


It's set here:

def __init__(
    self,
    sparkContext: "SparkContext",
    jsparkSession: Optional["JavaObject"] = None,
    options: Dict[str, Any] = {},
):
    self._sc = sparkContext

in the constructor of SparkSession.

messageParameters={},
)

class SparkSessionBuilderCreateTests(unittest.TestCase, PySparkErrorTestUtils):
Member


Thank you for adding the tests!

Could you help clarify coverage for the following areas?

Error Handling Tests
• How does create() behave when called with invalid configurations?
• What is the expected behavior if SparkContext creation fails?
• How should conflicting SparkContext-level configurations be handled when a context already exists?

Thread Safety Tests
• How should the implementation behave when multiple threads call create() concurrently?
• Are there any expected race conditions between create() and getOrCreate() that we should guard against or test explicitly?

Default/Active Session Update Logic
The PR description states:

“This method will update the default and/or active session if they are not set.”

Could we clarify:
• What is the expected behavior for updating the default session, and how should this be tested?
• In what cases should the default session be updated or preserved?
• Are there additional scenarios around active session updates that should be validated?

Session Stop/Cleanup
• What should happen when stop() is called on a session created via create()?
• How can we ensure resources are properly cleaned up?
• Should stopping one session affect other sessions sharing the same SparkContext?

Connect Mode Fallback
• How do we want to validate that the existing Connect mode logic continues to work correctly after these changes?

appName Configuration
• Should we explicitly verify that appName() and other common builder configurations work correctly with create() in addition to master()?

Catalog/Database Operations
• What expectations do we have around catalog operations for sessions created via create()? For example:
• Creating databases or tables
• Cross-session visibility of catalogs

Duplicate Config Keys
• What should the expected behavior be if the same config key is set multiple times in the builder before calling create()?

Remote Mode Error Paths
• Are there specific error scenarios in the Connect/remote path (e.g., invalid remote URLs) that we should explicitly test?

Author


Thanks for the detailed review! I've added some tests to improve coverage, but I also wanted to address a few of your high-level concerns around Connect mode compatibility, error handling/failure modes, and behavior with multiple sessions. Importantly, this change introduces parity with Connect (create already exists there, and we lean on existing Connect test coverage to ensure we have not introduced new regressions), multiple sessions attached to the same context are already possible via the newSession API, and we reuse the existing primitives used by getOrCreate and newSession (e.g. session construction helpers and locking). Here are my thoughts on some of the more specific comments:

Connect mode fallback & invalid remote URLs

This PR doesn't touch connect mode at all, and we rely on existing connect test coverage

Are there race conditions between create() and getOrCreate()?

Both methods use the same self._lock (line 606) and rely on SparkContext.getOrCreate() which has its own locking via SparkContext._lock. No new race conditions introduced here - just reusing the existing thread-safe mechanisms.

Duplicate config keys

Standard SparkConf behavior - last write wins. Same as getOrCreate(), nothing specific to create().
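
For example, plain SparkConf shows the same last-write-wins rule that repeated .config() calls on the builder follow (key and values here are arbitrary):

from pyspark import SparkConf

conf = SparkConf()
conf.set("spark.executor.memory", "1g")
conf.set("spark.executor.memory", "2g")
assert conf.get("spark.executor.memory") == "2g"  # last write wins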

Session stop/cleanup

Stopping a session stops the parent SparkContext. This stops all sessions attached to the SparkContext, but this is not new behavior in Classic.
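
A short sketch of that existing behavior (assuming a local session; newSession() is the pre-existing way to attach a second session to one context):

from pyspark.sql import SparkSession

a = SparkSession.builder.master("local[1]").getOrCreate()
b = a.newSession()  # separate session state, same SparkContext
assert a.sparkContext is b.sparkContext

a.stop()  # stops the shared SparkContext, so b cannot run jobs either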

Conflicting SparkContext configs

SparkContext.getOrCreate() handles this at line 615. When a SparkContext already exists, new configs apply to the session only, not the shared context. Same behavior as getOrCreate() - there's even a comment about this at line 558: "Do not update SparkConf for existing SparkContext, as it's shared by all sessions."
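
Illustrated with hypothetical values: once a SparkContext exists, later builder settings land on the new session rather than the shared context:

from pyspark.sql import SparkSession

first = SparkSession.builder.master("local[1]").appName("owner").getOrCreate()
second = SparkSession.builder.appName("too-late").create()

assert first.sparkContext is second.sparkContext  # one shared context
assert second.sparkContext.appName == "owner"     # context conf is not rewritten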

Let me know if you would like specific tests for any of these behaviors, though; happy to add them.

Author


Added some additional tests for error handling and the common confs.

@HyukjinKwon HyukjinKwon changed the title [SPARK-55055] Support SparkSession.Builder.create for PySpark Classic [SPARK-55055][PYTHON] Support SparkSession.Builder.create for PySpark Classic Jan 19, 2026

# Spark Connect-specific API
@remote_only
def create(self) -> "SparkSession":
Contributor


On the Scala side, def create and def getOrCreate share code via a forceCreate flag; shall we do the same on the Python side?
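
For reference, a rough Python rendering of that Scala pattern (purely illustrative; _build, _apply_options, and _new_session are hypothetical names, not the PR's actual helpers):

from pyspark.sql import SparkSession

class Builder:
    def _build(self, force_create: bool) -> "SparkSession":
        with self._lock:  # the same lock both entry points already share
            if not force_create:
                session = SparkSession.getActiveSession()
                if session is not None:
                    self._apply_options(session)  # reconfigure and reuse
                    return session
            return self._new_session()  # always construct a fresh session

    def getOrCreate(self) -> "SparkSession":
        return self._build(force_create=False)

    def create(self) -> "SparkSession":
        return self._build(force_create=True)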

Author


I think the build(forceCreate: Boolean) pattern works really well on the Scala side within Classic mode, but the Python code has to handle both Connect and Classic, and there is a fair amount of complex handling for each mode. IMO it would be even more confusing to put all of this under one shared function. Let me know if you have ideas on how to make this code easier to reason about, though.
