
[SPARK-25525][SQL][PYSPARK] Do not update conf for existing SparkContext in SparkSession.getOrCreate. #22545

Closed

Conversation

@ueshin (Member) commented Sep 25, 2018

What changes were proposed in this pull request?

In SPARK-20946, we modified SparkSession.getOrCreate to not update the conf of an existing SparkContext, because the SparkContext is shared by all sessions.
We should not update it on the PySpark side either.
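
For illustration, a minimal sketch of the resulting behavior, based on the doctest updated by this PR (assumes local mode; names are illustrative):

    from pyspark import SparkContext
    from pyspark.sql import SparkSession

    sc = SparkContext("local", "demo")  # an existing, shared SparkContext
    s1 = SparkSession.builder.config("k1", "v1").getOrCreate()
    s1.conf.get("k1")            # 'v1': the option is applied to the new session's conf
    sc.getConf().contains("k1")  # False: the existing context's conf is left untouched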

How was this patch tested?

Added tests.

@SparkQA commented Sep 25, 2018

Test build #96550 has finished for PR 22545 at commit 52584d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 25, 2018

Test build #96553 has finished for PR 22545 at commit ac0243a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

cc @cloud-fan

@@ -485,7 +485,8 @@ def __init__(self, sparkContext, jhiveContext=None):
             "SparkSession.builder.enableHiveSupport().getOrCreate() instead.",
             DeprecationWarning)
         if jhiveContext is None:
-            sparkSession = SparkSession.builder.enableHiveSupport().getOrCreate()
+            sparkContext._conf.set("spark.sql.catalogImplementation", "hive")
+            sparkSession = SparkSession.builder._sparkContext(sparkContext).getOrCreate()
Review comment from a Contributor:

why this change?

@dongjoon-hyun (Member) left a comment

@ueshin, #22545 (this) is very similar to #22552 in terms of code.
Actually, #22552 looks like a subset of this. Given that, do we need both PRs?

@ueshin (Member, Author) commented Sep 27, 2018

@dongjoon-hyun I'm sorry that I didn't share the context, but I talked with @cloud-fan about #22545 (comment) offline, and he wanted to separate the HiveContext fix into another PR, so I submitted #22552.

@cloud-fan (Contributor)

LGTM

@SparkQA commented Sep 27, 2018

Test build #96655 has finished for PR 22545 at commit d8c60cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -181,17 +181,11 @@ def getOrCreate(self):
                sparkConf.set(key, value)
            sc = SparkContext.getOrCreate(sparkConf)
            # This SparkContext may be an existing one.
Review comment from a Member:

tiny nit: can we move this comment above sc = ...
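
For context, a rough sketch of the surrounding builder logic after this change, with the comment moved as suggested (reconstructed and simplified from the hunk above, not the verbatim patch):

    sparkConf = SparkConf()
    for key, value in self._options.items():
        sparkConf.set(key, value)
    # This SparkContext may be an existing one.
    sc = SparkContext.getOrCreate(sparkConf)
    # Do not update the conf of an existing SparkContext, as it is shared
    # by all sessions; apply the options to the new session only.
    session = SparkSession(sc)
    for key, value in self._options.items():
        session._jsparkSession.sessionState().conf().setConfString(key, value)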

@HyukjinKwon (Member)

@cloud-fan, do we target this at 2.4? It looks like it might break an existing app, in particular when a Python shell creates a session and another shell (like Zeppelin) or another session depends on a configuration in the SparkContext. For instance, the fixed doctest:

    >>> s1 = SparkSession.builder.config("k1", "v1").getOrCreate()
    >>> s1.conf.get("k1") == s1.sparkContext.getConf().get("k1") == "v1"

This looks like a bug, but the behaviour change it introduces is potentially quite significant.
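
To illustrate the concern, a hypothetical two-reader scenario (illustrative names; the second reader stands in for another shell or notebook sharing the same SparkContext):

    # One session sets an option through the builder.
    s1 = SparkSession.builder.config("k1", "v1").getOrCreate()

    # Code elsewhere that shares the SparkContext and used to read the
    # option from the context conf:
    s1.sparkContext.getConf().contains("k1")  # True before this patch; False after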

@HyukjinKwon (Member) left a comment

LGTM apart from whether we should backport or not.

@cloud-fan (Contributor)

The Scala-side change is already in 2.3.0. If we are OK with the behavior inconsistency between Python and Scala, it's fine to merge this to master only (and revert #22552 from 2.4 as well).

@HyukjinKwon (Member)

I think session support is only partially implemented on the Python side, and not very well tested. There are some inconsistencies between the Python and Scala sides (for instance, see #21990). I was also thinking of targeting this one only at master.

If the release were not so close and the code here were well tested and implemented, I'd be happy to go to branch-2.4, but if you guys are fine with it, I hope we can target this only at master.

WDYT, @ueshin?

@SparkQA commented Sep 27, 2018

Test build #96663 has finished for PR 22545 at commit 2ff180e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Anyway, merged to master.

Let me take #22552 out of branch-2.4 for now, but please feel free to get this and #22552 into branch-2.4 (without checking with me) if you feel strongly.

@ueshin (Member, Author) commented Sep 27, 2018

I'm okay with merging this only into master.

asfgit closed this in ee214ef on Sep 27, 2018
@cloud-fan (Contributor)

SGTM

daspalrahul pushed a commit to daspalrahul/spark that referenced this pull request Sep 29, 2018
[SPARK-25525][SQL][PYSPARK] Do not update conf for existing SparkContext in SparkSession.getOrCreate.

## What changes were proposed in this pull request?

In [SPARK-20946](https://issues.apache.org/jira/browse/SPARK-20946), we modified `SparkSession.getOrCreate` to not update conf for existing `SparkContext` because `SparkContext` is shared by all sessions.
We should not update it on the PySpark side either.

## How was this patch tested?

Added tests.

Closes apache#22545 from ueshin/issues/SPARK-25525/not_update_existing_conf.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
@@ -156,7 +156,7 @@ def getOrCreate(self):
        default.

        >>> s1 = SparkSession.builder.config("k1", "v1").getOrCreate()
-       >>> s1.conf.get("k1") == s1.sparkContext.getConf().get("k1") == "v1"
+       >>> s1.conf.get("k1") == "v1"
Review comment from a Member:

@ueshin Could we also update the migration guide about this change?

Reply (Member):

In that case, we might have to add the behaviour changes from #18536 to the migration guide as well.

Reply (Member):

We can do it together.

Reply (Member, Author):

Submitted a PR to update the migration guide: #22682.

asfgit pushed a commit that referenced this pull request Oct 10, 2018
## What changes were proposed in this pull request?

This is a follow-up PR to #18536 and #22545 to update the migration guide.

## How was this patch tested?

Build and check the doc locally.

Closes #22682 from ueshin/issues/SPARK-20946_25525/migration_guide.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019