[SPARK-25525][SQL][PYSPARK] Do not update conf for existing SparkContext in SparkSession.getOrCreate. #22545
Conversation
Test build #96550 has finished for PR 22545 at commit

Test build #96553 has finished for PR 22545 at commit

cc @cloud-fan
python/pyspark/sql/context.py
Outdated
```diff
@@ -485,7 +485,8 @@ def __init__(self, sparkContext, jhiveContext=None):
             "SparkSession.builder.enableHiveSupport().getOrCreate() instead.",
             DeprecationWarning)
         if jhiveContext is None:
-            sparkSession = SparkSession.builder.enableHiveSupport().getOrCreate()
+            sparkContext._conf.set("spark.sql.catalogImplementation", "hive")
+            sparkSession = SparkSession.builder._sparkContext(sparkContext).getOrCreate()
```
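Since `getOrCreate` will no longer push options into an existing context's conf, the deprecated `HiveContext` path now has to set `spark.sql.catalogImplementation` on the context itself before building the session. A minimal plain-Python sketch of that rationale (`ToyContext`, `build_session`, and `toy_hive_context` are hypothetical names, not PySpark API):

```python
# Toy model (not real PySpark): with getOrCreate no longer updating an
# existing context's conf, Hive support must be flagged on the shared
# context conf explicitly, because the session reads it from there.

class ToyContext:
    def __init__(self):
        self.conf = {}  # stands in for SparkContext._conf

def build_session(ctx):
    # The session decides on Hive support by reading the shared context conf.
    hive = ctx.conf.get("spark.sql.catalogImplementation") == "hive"
    return {"hive_enabled": hive}

def toy_hive_context(ctx):
    # Mirrors the added line: set the conf on the context, then build.
    ctx.conf["spark.sql.catalogImplementation"] = "hive"
    return build_session(ctx)
```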
why this change?
@dongjoon-hyun I'm sorry that I didn't share the context, but I talked with @cloud-fan about #22545 (comment) offline, and he wanted to separate the …
LGTM
Test build #96655 has finished for PR 22545 at commit
```diff
@@ -181,17 +181,11 @@ def getOrCreate(self):
                     sparkConf.set(key, value)
                 sc = SparkContext.getOrCreate(sparkConf)
                 # This SparkContext may be an existing one.
```
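The resulting control flow can be modeled in plain Python (a toy sketch with hypothetical names such as `ToyBuilder`, not PySpark itself): builder options are folded into the conf before the context is fetched, and if the returned context already existed, its conf is left untouched while the options still land in the session-level conf.

```python
# Toy model (not real PySpark) of the fixed builder logic: options are
# applied to the conf *before* the context is fetched, and an existing
# context's conf is never mutated afterwards.

class ToyContext:
    _active = None  # process-wide singleton, like SparkContext

    def __init__(self, conf):
        self.conf = dict(conf)

    @classmethod
    def get_or_create(cls, conf):
        if cls._active is None:
            cls._active = cls(conf)
        return cls._active  # may be an existing one; conf is ignored then


class ToyBuilder:
    def __init__(self):
        self._options = {}

    def config(self, key, value):
        self._options[key] = value
        return self

    def get_or_create(self):
        ctx = ToyContext.get_or_create(self._options)
        # Note what is *not* here: no ctx.conf.update(self._options).
        # Session conf = context conf overlaid with this builder's options.
        return {"context": ctx, "conf": {**ctx.conf, **self._options}}
```

In this model, a second builder's options reach that session's conf but never leak into the shared context conf.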
tiny nit: can we move this comment above `sc = ...`?
@cloud-fan, do we target this at 2.4? It looks like it might break an existing app, in particular when a Python shell creates a session and another shell (like Zeppelin) or another session depends on a configuration in the Spark context. For instance, the fixed doctest:

```python
>>> s1 = SparkSession.builder.config("k1", "v1").getOrCreate()
>>> s1.conf.get("k1") == s1.sparkContext.getConf().get("k1") == "v1"
```

It looks like a bug, but the behaviour change this introduces looks potentially quite crucial.
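The concern can be made concrete with a small before/after stand-in (plain Python, not PySpark; `get_or_create` here is a hypothetical helper, and the dicts model a pre-existing context's conf): under the old behaviour a later session's options leaked into the shared context conf, so apps reading `sparkContext.getConf()` saw them; after this change they do not.

```python
# Stand-in (not PySpark) for the behaviour change discussed above.
def get_or_create(context_conf, options, update_existing):
    """update_existing=True ~ old PySpark; False ~ after SPARK-25525."""
    if update_existing:
        context_conf.update(options)    # old: options leak into the context
    return {**context_conf, **options}  # the session conf always sees them

old_ctx, new_ctx = {}, {}  # conf of an already-running context
old_session = get_or_create(old_ctx, {"k1": "v1"}, update_existing=True)
new_session = get_or_create(new_ctx, {"k1": "v1"}, update_existing=False)
```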
LGTM, apart from whether we should backport or not.
The Scala-side change is already in 2.3.0. If we are OK with the behavior inconsistency between Python and Scala, it's fine to merge it to master only (and revert #22552 from 2.4 as well).
I think the session support is only partially implemented on the Python side, and not very well tested. There are some inconsistencies between the Python and Scala sides (for instance, see #21990). I was also thinking of targeting this one only at master. If the release were not so close and the code here were well tested and implemented, I'd be happy going to branch-2.4, but if you guys are fine with it, I hope we can target this only at master. WDYT, @ueshin?
Test build #96663 has finished for PR 22545 at commit
Anyway, merged to master. Let me take #22552 out of branch-2.4 for now, but please feel free to get this and that into branch-2.4 (without checking with me) if you guys feel strongly.
I'm okay with merging this only into master. |
SGTM |
…ext in SparkSession.getOrCreate.

## What changes were proposed in this pull request?

In [SPARK-20946](https://issues.apache.org/jira/browse/SPARK-20946), we modified `SparkSession.getOrCreate` to not update conf for an existing `SparkContext`, because `SparkContext` is shared by all sessions. We should not update it on the PySpark side either.

## How was this patch tested?

Added tests.

Closes apache#22545 from ueshin/issues/SPARK-25525/not_update_existing_conf.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
```diff
@@ -156,7 +156,7 @@ def getOrCreate(self):
         default.

         >>> s1 = SparkSession.builder.config("k1", "v1").getOrCreate()
-        >>> s1.conf.get("k1") == s1.sparkContext.getConf().get("k1") == "v1"
+        >>> s1.conf.get("k1") == "v1"
```
@ueshin Could we also update the migration guide about this change?
In that case, we might have to put the behaviour changes from #18536 into the migration guide as well.
We can do it together.
Submitted a PR to update the migration guide: #22682.
## What changes were proposed in this pull request?

This is a follow-up PR of #18536 and #22545 to update the migration guide.

## How was this patch tested?

Built and checked the doc locally.

Closes #22682 from ueshin/issues/SPARK-20946_25525/migration_guide.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?

In SPARK-20946, we modified `SparkSession.getOrCreate` to not update conf for an existing `SparkContext`, because `SparkContext` is shared by all sessions. We should not update it on the PySpark side either.

How was this patch tested?

Added tests.
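The added tests presumably assert that a second `getOrCreate` leaves the shared context's conf untouched while its options still reach the session conf. A self-contained sketch of that assertion against a toy singleton (hypothetical names, not the actual PySpark test suite):

```python
import unittest

class ToyContext:
    """Process-wide singleton standing in for SparkContext."""
    _active = None

    def __init__(self, conf):
        self.conf = dict(conf)

    @classmethod
    def get_or_create(cls, conf):
        if cls._active is None:
            cls._active = cls(conf)
        return cls._active

def get_or_create_session(options):
    # Mirrors the fixed logic: options go into the conf up front, and an
    # existing context's conf is not updated afterwards.
    ctx = ToyContext.get_or_create(options)
    return {**ctx.conf, **options}, ctx

class NoConfUpdateTest(unittest.TestCase):
    def test_existing_context_conf_untouched(self):
        s1_conf, ctx1 = get_or_create_session({"k1": "v1"})
        self.assertEqual(ctx1.conf.get("k1"), "v1")   # created with k1
        s2_conf, ctx2 = get_or_create_session({"k2": "v2"})
        self.assertIs(ctx2, ctx1)                     # same shared context
        self.assertEqual(s2_conf.get("k2"), "v2")     # session sees k2
        self.assertNotIn("k2", ctx1.conf)             # context untouched
```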