[SPARK-55055][PYTHON] Support SparkSession.Builder.create for PySpark Classic #53820
base: master
Conversation
JIRA Issue Information: Improvement SPARK-55055 (this comment was automatically generated by GitHub Actions)
sryza left a comment
This LGTM. @cloud-fan – are you interested in taking a look too?
```python
sc = (
    self._sc
    if getattr(self, "_sc", None) is not None and self._sc._jsc is not None
```
In the Builder class, which function actually sets self._sc? Is it always None?
CC @HyukjinKwon and @cloud-fan
It's set here, in the constructor of SparkSession (spark/python/pyspark/sql/session.py, lines 616 to 622 in 70a3ab5):

```python
def __init__(
    self,
    sparkContext: "SparkContext",
    jsparkSession: Optional["JavaObject"] = None,
    options: Dict[str, Any] = {},
):
    self._sc = sparkContext
```
```python
    messageParameters={},
)


class SparkSessionBuilderCreateTests(unittest.TestCase, PySparkErrorTestUtils):
```
Thank you for adding the tests!
Could you help clarify coverage for the following areas?
Error Handling Tests
• How does create() behave when called with invalid configurations?
• What is the expected behavior if SparkContext creation fails?
• How should conflicting SparkContext-level configurations be handled when a context already exists?
Thread Safety Tests
• How should the implementation behave when multiple threads call create() concurrently?
• Are there any expected race conditions between create() and getOrCreate() that we should guard against or test explicitly?
Default/Active Session Update Logic
The PR description states:
“This method will update the default and/or active session if they are not set.”
Could we clarify (see the sketch after this list):
• What is the expected behavior for updating the default session, and how should this be tested?
• In what cases should the default session be updated or preserved?
• Are there additional scenarios around active session updates that should be validated?
Session Stop/Cleanup
• What should happen when stop() is called on a session created via create()?
• How can we ensure resources are properly cleaned up?
• Should stopping one session affect other sessions sharing the same SparkContext?
Connect Mode Fallback
• How do we want to validate that the existing Connect mode logic continues to work correctly after these changes?
appName Configuration
• Should we explicitly verify that appName() and other common builder configurations work correctly with create() in addition to master()?
Catalog/Database Operations
• What expectations do we have around catalog operations for sessions created via create()? For example:
• Creating databases or tables
• Cross-session visibility of catalogs
Duplicate Config Keys
• What should the expected behavior be if the same config key is set multiple times in the builder before calling create()?
Remote Mode Error Paths
• Are there specific error scenarios in the Connect/remote path (e.g., invalid remote URLs) that we should explicitly test?
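As a concrete starting point, here is a rough sketch of how the quoted default/active-session behavior might be exercised. It assumes the semantics stated in the PR description, and `_instantiatedSession` is an internal attribute of `pyspark.sql.session`, so the exact assertions would need to be confirmed against the implementation:

```python
from pyspark.sql import SparkSession

# With no sessions yet, create() should set both the default and active session.
s1 = SparkSession.builder.master("local[2]").create()
assert SparkSession.getActiveSession() is s1
assert SparkSession._instantiatedSession is s1  # internal "default session" slot

# With s1 still alive, a second create() should preserve s1 as the default.
s2 = SparkSession.builder.create()
assert s2 is not s1
assert SparkSession._instantiatedSession is s1
```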
Thanks for the detailed review! I've added some tests to improve coverage, but I also want to address a few of your high-level concerns around Connect mode compatibility, error handling/failure modes, and behavior with multiple sessions. Importantly, this change introduces parity with Connect (create already exists there, and we lean on existing Connect test coverage to ensure we have not introduced new regressions), multiple sessions attached to the same context are already possible via the newSession API, and we are reusing the existing primitives used by getOrCreate and newSession (e.g. session construction helpers and locking). Here are my thoughts on some of the more specific comments:
Connect mode fallback & invalid remote URLs
This PR doesn't touch Connect mode at all, so we rely on the existing Connect test coverage.
Are there race conditions between create() and getOrCreate()?
Both methods use the same self._lock (line 606) and rely on SparkContext.getOrCreate(), which has its own locking via SparkContext._lock. No new race conditions are introduced here; we are just reusing the existing thread-safe mechanisms.
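As an illustration, a concurrency smoke test along these lines would make that expectation explicit (the thread count is arbitrary, and this assumes create() returns a distinct session per call):

```python
import threading
from pyspark.sql import SparkSession

results = [None] * 4

def make_session(i: int) -> None:
    # Each call should build a fresh session even under contention; the
    # builder lock and SparkContext._lock serialize the shared state.
    results[i] = SparkSession.builder.master("local[2]").create()

threads = [threading.Thread(target=make_session, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Unlike getOrCreate(), every call should return a distinct session object.
assert len({id(s) for s in results}) == 4
```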
Duplicate config keys
Standard SparkConf behavior: last write wins. This is the same as getOrCreate(); nothing here is specific to create().
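For example (`spark.myapp.mode` is a made-up key used purely for illustration):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[2]")
    .config("spark.myapp.mode", "draft")
    .config("spark.myapp.mode", "final")  # last write wins
    .create()
)
assert spark.conf.get("spark.myapp.mode") == "final"
```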
Session stop/cleanup
Stopping a session stops the parent SparkContext. This stops all sessions attached to the SparkContext, but this is not new behavior in Classic.
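A small sketch of those existing Classic semantics:

```python
from pyspark.sql import SparkSession

a = SparkSession.builder.master("local[2]").create()
b = a.newSession()  # shares a's SparkContext
a.stop()            # stops the shared SparkContext,
                    # so b can no longer run jobs either
```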
Conflicting SparkContext configs
SparkContext.getOrCreate() handles this at line 615. When a SparkContext already exists, new configs apply to the session only, not the shared context. Same behavior as getOrCreate() - there's even a comment about this at line 558: "Do not update SparkConf for existing SparkContext, as it's shared by all sessions."
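A quick illustration of that session-only scoping (the config key is just an example):

```python
from pyspark.sql import SparkSession

first = SparkSession.builder.master("local[2]").getOrCreate()
second = (
    SparkSession.builder
    .config("spark.sql.session.timeZone", "UTC")
    .create()
)
# The SparkContext is shared and its SparkConf is left untouched;
# the new value is visible only through the new session's conf.
assert second.sparkContext is first.sparkContext
assert second.conf.get("spark.sql.session.timeZone") == "UTC"
```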
Let me know if you would like specific tests for any of these behaviors, though; happy to add them.
Added some additional tests for error handling and common confs.
```python
# Spark Connect-specific API
@remote_only
def create(self) -> "SparkSession":
```
On the Scala side, def create and def getOrCreate share code by using a forceCreate flag; shall we do the same on the Python side?
I think the build(forceCreate: Boolean) pattern works really well on the Scala side within Classic mode, but the Python code has to handle both Connect and Classic, and there is a bunch of complex handling for each mode. IMO it would be even more confusing to put this all under one shared function. Let me know if you have good ideas on how to make this code easier to reason about, though.
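For concreteness, here is a rough sketch of what a shared helper in the Scala style might look like on the Python side. All of the `_connect_build`, `_apply_options`, and `_force_create_classic` helpers are hypothetical names used only to illustrate the branching; `pyspark.sql.utils.is_remote` is a real API:

```python
def _build(self, force_create: bool) -> "SparkSession":
    # Hypothetical shared helper mirroring Scala's build(forceCreate).
    from pyspark.sql.utils import is_remote

    if is_remote():
        # Connect path: separate session caching and channel setup would
        # need their own interpretation of force_create here.
        return self._connect_build(force_create)  # hypothetical helper

    with self._lock:
        if not force_create:
            session = SparkSession.getActiveSession() or SparkSession._instantiatedSession
            if session is not None and session._sc._jsc is not None:
                self._apply_options(session)  # hypothetical helper
                return session
        return self._force_create_classic()  # hypothetical helper
```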
Force-pushed from ab5e359 to 54f4113.
What changes were proposed in this pull request?
Allow users to call SparkSessionBuilder.create from PySpark in Classic mode so that they can create a new session with the provided builder configs without mutating configurations in any existing sessions.

Why are the changes needed?
Currently, create is not supported in PySpark Classic. Users can call getOrCreate to obtain a session with the provided configurations, but if a session already exists, getOrCreate will mutate that session's existing configurations. We would like to provide an API for users which guarantees creation of a new session and has no side effects on existing sessions. This change also unifies the API between Classic and Connect mode, since create is already supported in Connect.
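A minimal usage sketch of the intended difference (the config key and values are illustrative; 200 is the documented default for spark.sql.shuffle.partitions):

```python
from pyspark.sql import SparkSession

existing = SparkSession.builder.master("local[2]").appName("app").getOrCreate()

# getOrCreate() would return `existing` and apply new configs to it;
# create() builds a brand-new session and leaves `existing` untouched.
fresh = (
    SparkSession.builder
    .config("spark.sql.shuffle.partitions", "4")
    .create()
)

assert fresh is not existing
assert fresh.conf.get("spark.sql.shuffle.partitions") == "4"
assert existing.conf.get("spark.sql.shuffle.partitions") == "200"  # default, unchanged
```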
Does this PR introduce any user-facing change?
Previously, calling SparkSessionBuilder.create() would throw an exception saying that create is only supported in Connect mode. After this change, SparkSessionBuilder.create() will create a new SparkSession with the specified builder configurations.

How was this patch tested?
Added the new test class SparkSessionBuilderCreateTests in python/pyspark/sql/tests/test_session.py.

Was this patch authored or co-authored using generative AI tooling?
No