[SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in spark ML reader/writer #47453
Merged to master.
@@ -588,7 +588,7 @@ private[ml] object DefaultParamsReader {
    */
   def loadMetadata(path: String, sc: SparkContext, expectedClassName: String = ""): Metadata = {
     val metadataPath = new Path(path, "metadata").toString
-    val spark = SparkSession.getActiveSession.get
+    val spark = SparkSession.builder().sparkContext(sc).getOrCreate()
Hi, @WeichenXu123 , @HyukjinKwon , @zhengruifeng .
This sounds like a regression of
If we cannot get an existing one, I believe we should not create SparkSession here.
Can we recover the existing code?
It will not be a regression. This is Spark ML, which is DataFrame-based MLlib by definition, so we should always have a default session running. The active session is specific to a thread, so it might not exist in the thread that runs the reader or writer. Alternatively, we could use SparkSession.getDefaultSession.
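The distinction drawn in the comment above, between a per-thread active session and a process-wide default session, can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions: `FakeSession` and every function below are hypothetical stand-ins built on `threading.local`, not PySpark APIs, and real Spark's rules for propagating the active session between threads may differ in detail.

```python
import threading


class FakeSession:
    """Hypothetical stand-in for a Spark session; not a PySpark class."""

    def __init__(self, name):
        self.name = name


# Per-thread slot, modelling the "active" session.
_active = threading.local()
# Process-wide slot, modelling the "default" session.
_default = None


def set_active_session(session):
    _active.session = session


def get_active_session():
    # Returns None in any thread that never called set_active_session.
    return getattr(_active, "session", None)


def set_default_session(session):
    global _default
    _default = session


def get_default_session():
    return _default


def probe_in_worker_thread():
    """Run both lookups in a fresh thread and report what that thread saw."""
    seen = {}

    def worker():
        seen["active"] = get_active_session()
        seen["default"] = get_default_session()

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return seen


main_session = FakeSession("main")
set_active_session(main_session)
set_default_session(main_session)
seen = probe_in_worker_thread()
```

In this model the main thread still sees its active session, while the worker thread sees `None` for the active lookup and the shared object for the default lookup. Calling a method on that `None` is what produces the kind of `AttributeError` this PR describes.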
-        spark.createDataFrame(  # type: ignore[union-attr]
-            [(metadataJson,)], schema=["value"]
-        ).coalesce(1).write.text(metadataPath)
+        spark = SparkSession._getActiveSessionOrCreate()
ditto.
@@ -580,8 +580,8 @@ def loadMetadata(path: str, sc: "SparkContext", expectedClassName: str = "") ->
             If non empty, this is checked against the loaded metadata.
         """
         metadataPath = os.path.join(path, "metadata")
-        spark = SparkSession.getActiveSession()
-        metadataStr = spark.read.text(metadataPath).first()[0]  # type: ignore[union-attr,index]
+        spark = SparkSession._getActiveSessionOrCreate()
ditto.
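The Python hunks above swap a lookup that may return `None` for one that falls back to obtaining a session. The following is a minimal pure-Python sketch of that fallback idea, not the PySpark implementation: `FakeSession`, `get_active_session_or_create`, `write_metadata`, and the other names are hypothetical, and the "create" step here simply builds one shared in-process object.

```python
import threading


class FakeSession:
    """Hypothetical stand-in for a Spark session; records what was written."""

    def __init__(self):
        self.written = []

    def write_text(self, value):
        self.written.append(value)
        return len(self.written)


_active = threading.local()   # per-thread "active" slot
_lock = threading.Lock()      # guards creation of the shared session
_shared = None                # process-wide session, created on demand


def get_active_session():
    return getattr(_active, "session", None)


def get_or_create_shared_session():
    global _shared
    with _lock:
        if _shared is None:
            _shared = FakeSession()
        return _shared


def get_active_session_or_create():
    # Prefer the caller's thread-local session; otherwise fall back to a
    # shared one, so the result is never None.
    session = get_active_session()
    if session is None:
        session = get_or_create_shared_session()
    return session


def write_metadata(metadata_json):
    session = get_active_session_or_create()
    return session.write_text(metadata_json)


def run_in_worker(fn, *args):
    """Call fn(*args) in a fresh thread that has no active session."""
    box = {}
    thread = threading.Thread(target=lambda: box.update(result=fn(*args)))
    thread.start()
    thread.join()
    return box["result"]


count = run_in_worker(write_metadata, '{"class": "Example"}')
```

The worker thread never sets an active session, yet `write_metadata` succeeds because the fallback supplies one. The trade-off raised in the review is visible here too: the fallback may create a session where none existed, which is the behaviour the first reviewer objected to.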
Initially, the existing PRs assumed that there is no regression because we use the active session. AFAIK, this assumption was the same in the dev mailing list discussion: https://lists.apache.org/thread/s24lqtmno0xtoxxz6pk6tyn726bfwp8q Is this regression inevitable, @HyukjinKwon?
I replied on the existing thread.
There is no regression. This is Spark ML, which is DataFrame-based MLlib. There should always be a running Spark session.
@dongjoon-hyun spark/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala Lines 1322 to 1358 in 2e1a39c
loads the metadata
then loads the model coefficients, you can see the
I think probably we can change the signature of
to
to avoid such confusion. I will give it a try.
Thank you, @HyukjinKwon and @zhengruifeng. I'm +1 for both, to have clear semantics.
For the record and the other reviewers, (2) is implemented and merged to Apache Spark 4.0.0.
…n spark ML reader/writer

### What changes were proposed in this pull request?

`SparkSession.getActiveSession` is thread-local session, but spark ML reader / writer might be executed in different threads which causes `SparkSession.getActiveSession` returning None.

### Why are the changes needed?

It fixes the bug like:

```
        spark = SparkSession.getActiveSession()
>       spark.createDataFrame(  # type: ignore[union-attr]
            [(metadataJson,)], schema=["value"]
        ).coalesce(1).write.text(metadataPath)
E       AttributeError: 'NoneType' object has no attribute 'createDataFrame'
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47453 from WeichenXu123/SPARK-48970.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
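What changes were proposed in this pull request?
SparkSession.getActiveSession returns a thread-local session, but the Spark ML reader/writer might be executed in a different thread, which causes SparkSession.getActiveSession to return None.
Why are the changes needed?
It fixes a bug like "AttributeError: 'NoneType' object has no attribute 'createDataFrame'", raised when the writer calls spark.createDataFrame on the None returned by SparkSession.getActiveSession().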
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manually.
Was this patch authored or co-authored using generative AI tooling?
No.