[SPARK-25525][SQL][PYSPARK] Do not update conf for existing SparkContext in SparkSession.getOrCreate. #22545
Conversation
Test build #96550 has finished for PR 22545 at commit

Test build #96553 has finished for PR 22545 at commit

cc @cloud-fan
python/pyspark/sql/context.py
Outdated
```diff
@@ -485,7 +485,8 @@ def __init__(self, sparkContext, jhiveContext=None):
             "SparkSession.builder.enableHiveSupport().getOrCreate() instead.",
             DeprecationWarning)
         if jhiveContext is None:
-            sparkSession = SparkSession.builder.enableHiveSupport().getOrCreate()
+            sparkContext._conf.set("spark.sql.catalogImplementation", "hive")
+            sparkSession = SparkSession.builder._sparkContext(sparkContext).getOrCreate()
```
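Since `getOrCreate` will no longer push options into an existing context's conf, the deprecated `HiveContext` path now has to set `spark.sql.catalogImplementation` on the context itself before building the session. A minimal plain-Python sketch of that rationale (`ToyContext`, `build_session`, and `toy_hive_context` are hypothetical names, not PySpark API):

```python
# Toy model (not real PySpark): with getOrCreate no longer updating an
# existing context's conf, Hive support must be flagged on the shared
# context conf explicitly, because the session reads it from there.

class ToyContext:
    def __init__(self):
        self.conf = {}  # stands in for SparkContext._conf

def build_session(ctx):
    # The session decides on Hive support by reading the shared context conf.
    hive = ctx.conf.get("spark.sql.catalogImplementation") == "hive"
    return {"hive_enabled": hive}

def toy_hive_context(ctx):
    # Mirrors the added line: set the conf on the context, then build.
    ctx.conf["spark.sql.catalogImplementation"] = "hive"
    return build_session(ctx)
```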
why this change?
@dongjoon-hyun I'm sorry that I didn't share the context, but I talked with @cloud-fan about #22545 (comment) offline, and he wanted to separate the …
LGTM
Test build #96655 has finished for PR 22545 at commit
```diff
@@ -181,17 +181,11 @@ def getOrCreate(self):
                     sparkConf.set(key, value)
                 sc = SparkContext.getOrCreate(sparkConf)
                 # This SparkContext may be an existing one.
```
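The resulting control flow can be modeled in plain Python (a toy sketch with hypothetical names such as `ToyBuilder`, not PySpark itself): builder options are folded into the conf before the context is fetched, and if the returned context already existed, its conf is left untouched while the options still land in the session-level conf.

```python
# Toy model (not real PySpark) of the fixed builder logic: options are
# applied to the conf *before* the context is fetched, and an existing
# context's conf is never mutated afterwards.

class ToyContext:
    _active = None  # process-wide singleton, like SparkContext

    def __init__(self, conf):
        self.conf = dict(conf)

    @classmethod
    def get_or_create(cls, conf):
        if cls._active is None:
            cls._active = cls(conf)
        return cls._active  # may be an existing one; conf is ignored then


class ToyBuilder:
    def __init__(self):
        self._options = {}

    def config(self, key, value):
        self._options[key] = value
        return self

    def get_or_create(self):
        ctx = ToyContext.get_or_create(self._options)
        # Note what is *not* here: no ctx.conf.update(self._options).
        # Session conf = context conf overlaid with this builder's options.
        return {"context": ctx, "conf": {**ctx.conf, **self._options}}
```

In this model, a second builder's options reach that session's conf but never leak into the shared context conf.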
tiny nit: can we move this comment above `sc = ...`?
@cloud-fan, do we target this at 2.4? It looks like it might break an existing app, in particular when a Python shell creates a session and another shell (like Zeppelin) or another session depends on a configuration in the Spark context. For instance, the fixed doctest:

```python
>>> s1 = SparkSession.builder.config("k1", "v1").getOrCreate()
>>> s1.conf.get("k1") == s1.sparkContext.getConf().get("k1") == "v1"
```

It looks like a bug, but the behaviour change this introduces looks potentially quite crucial.
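The concern can be made concrete with a small before/after stand-in (plain Python, not PySpark; `get_or_create` here is a hypothetical helper, and the dicts model a pre-existing context's conf): under the old behaviour a later session's options leaked into the shared context conf, so apps reading `sparkContext.getConf()` saw them; after this change they do not.

```python
# Stand-in (not PySpark) for the behaviour change discussed above.
def get_or_create(context_conf, options, update_existing):
    """update_existing=True ~ old PySpark; False ~ after SPARK-25525."""
    if update_existing:
        context_conf.update(options)    # old: options leak into the context
    return {**context_conf, **options}  # the session conf always sees them

old_ctx, new_ctx = {}, {}  # conf of an already-running context
old_session = get_or_create(old_ctx, {"k1": "v1"}, update_existing=True)
new_session = get_or_create(new_ctx, {"k1": "v1"}, update_existing=False)
```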
LGTM, apart from whether we should backport or not.
The Scala-side change is already in 2.3.0. If we are OK with the behavior inconsistency between Python and Scala, it's fine to merge it to master only (and revert #22552 from 2.4 as well).
I think the session support is only partially implemented on the Python side, and not very well tested. There are some inconsistencies between the Python and Scala sides (for instance, see #21990). I was also thinking of targeting this one only at master. If the release were not so close and the code here were well tested and implemented, I'd be happy going to branch-2.4, but if you guys are fine with it, I hope we can target this only at master. WDYT, @ueshin?
Test build #96663 has finished for PR 22545 at commit
Anyway, merged to master. Let me take #22552 out of branch-2.4 for now, but please feel free to get this and that into branch-2.4 (without checking with me) if you guys feel strongly.
I'm okay with merging this only into master. |
SGTM |
…ext in SparkSession.getOrCreate.

## What changes were proposed in this pull request?

In [SPARK-20946](https://issues.apache.org/jira/browse/SPARK-20946), we modified `SparkSession.getOrCreate` to not update conf for an existing `SparkContext`, because `SparkContext` is shared by all sessions. We should not update it on the PySpark side either.

## How was this patch tested?

Added tests.

Closes apache#22545 from ueshin/issues/SPARK-25525/not_update_existing_conf.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
```diff
@@ -156,7 +156,7 @@ def getOrCreate(self):
         default.

         >>> s1 = SparkSession.builder.config("k1", "v1").getOrCreate()
-        >>> s1.conf.get("k1") == s1.sparkContext.getConf().get("k1") == "v1"
+        >>> s1.conf.get("k1") == "v1"
```
@ueshin Could we also update the migration guide about this change?
In that case, we might have to put the behaviour changes from #18536 into the migration guide as well.
We can do it together.
Submitted a PR to update the migration guide: #22682.
## What changes were proposed in this pull request?

This is a follow-up PR of #18536 and #22545 to update the migration guide.

## How was this patch tested?

Built and checked the doc locally.

Closes #22682 from ueshin/issues/SPARK-20946_25525/migration_guide.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?

In SPARK-20946, we modified `SparkSession.getOrCreate` to not update conf for an existing `SparkContext`, because `SparkContext` is shared by all sessions. We should not update it on the PySpark side either.

How was this patch tested?

Added tests.
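The added tests presumably assert that a second `getOrCreate` leaves the shared context's conf untouched while its options still reach the session conf. A self-contained sketch of that assertion against a toy singleton (hypothetical names, not the actual PySpark test suite):

```python
import unittest

class ToyContext:
    """Process-wide singleton standing in for SparkContext."""
    _active = None

    def __init__(self, conf):
        self.conf = dict(conf)

    @classmethod
    def get_or_create(cls, conf):
        if cls._active is None:
            cls._active = cls(conf)
        return cls._active

def get_or_create_session(options):
    # Mirrors the fixed logic: options go into the conf up front, and an
    # existing context's conf is not updated afterwards.
    ctx = ToyContext.get_or_create(options)
    return {**ctx.conf, **options}, ctx

class NoConfUpdateTest(unittest.TestCase):
    def test_existing_context_conf_untouched(self):
        s1_conf, ctx1 = get_or_create_session({"k1": "v1"})
        self.assertEqual(ctx1.conf.get("k1"), "v1")   # created with k1
        s2_conf, ctx2 = get_or_create_session({"k2": "v2"})
        self.assertIs(ctx2, ctx1)                     # same shared context
        self.assertEqual(s2_conf.get("k2"), "v2")     # session sees k2
        self.assertNotIn("k2", ctx1.conf)             # context untouched
```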