
[SPARK-24665][PySpark] Use SQLConf in PySpark to manage all sql configs #21648

Closed · wants to merge 5 commits

Conversation

xuanyuanking (Member) commented Jun 27, 2018

What changes were proposed in this pull request?

Use SQLConf in PySpark to manage all SQL configs, and drop all the hard-coded config keys.

How was this patch tested?

Existing unit tests.


ARROW_FALLBACK_ENABLED = ConfigEntry("spark.sql.execution.arrow.fallback.enabled")\
.boolConf()\
.withDefault("true")
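The snippet above suggests a builder-style ConfigEntry. A minimal pure-Python sketch of such a class (an illustration of the pattern, not the PR's actual implementation) might look like this:

```python
# Hypothetical sketch of a builder-style ConfigEntry mirroring the usage
# above. Names, methods, and behavior are assumptions for illustration.

class ConfigEntry(object):
    """Holds a SQL config key, a value converter, and a default."""

    def __init__(self, key):
        self.key = key
        self._converter = lambda v: v
        self._default = None

    def boolConf(self):
        # Spark stores config values as strings such as "true"/"false".
        self._converter = lambda v: str(v).lower() == "true"
        return self

    def withDefault(self, default):
        self._default = default
        return self

    def read(self, conf):
        # `conf` is any mapping-like source of raw string values.
        return self._converter(conf.get(self.key, self._default))


ARROW_FALLBACK_ENABLED = ConfigEntry(
    "spark.sql.execution.arrow.fallback.enabled").boolConf().withDefault("true")
```

With this sketch, `ARROW_FALLBACK_ENABLED.read({})` falls back to the default `"true"` and converts it to the boolean `True`.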
Member:

This duplicates code on the Scala side. What's the gain from doing this?

Member Author:

I just want to remove the hard-coding, as we discussed in #21370 (comment). As for the duplication of the Scala code, my current idea is to call buildConf and doc only on the Scala side to register the config and its documentation, and to manage the name and default value in the Python SQLConf. May I ask your suggestion? :) Thanks.

Member:

The current approach introduces another chunk of code (about 100 added lines), and it doesn't look very extensible either. I suggested a few possible approaches below. If those are difficult or impossible, I wouldn't do this for now.

Member Author:

Got it, thanks for your advice.

class SQLConf(object):
"""A class that enables the getting of SQL config parameters in pyspark"""

REPL_EAGER_EVAL_ENABLED = ConfigEntry("spark.sql.repl.eagerEval.enabled")\
Member:

Can we do this by wrapping the existing SQLConf? We can make them static properties by, for example, this hack.
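The "hack" mentioned here is likely a classproperty-style descriptor, which lets a value be computed on class attribute access without instantiating the class. A minimal sketch, assuming that is the intent (the delegation target is a placeholder, not Spark's real wiring):

```python
class classproperty(object):
    """Descriptor that makes a method behave like a read-only
    class-level property, usable without creating an instance."""

    def __init__(self, fget):
        self.fget = fget

    def __get__(self, instance, owner):
        # Called on attribute access; `owner` is the class itself.
        return self.fget(owner)


class SQLConf(object):
    @classproperty
    def REPL_EAGER_EVAL_ENABLED(cls):
        # In a real wrapper this would delegate to the Scala SQLConf
        # via Py4J; a constant key stands in here for illustration.
        return "spark.sql.repl.eagerEval.enabled"
```

Accessing `SQLConf.REPL_EAGER_EVAL_ENABLED` then runs the descriptor and returns the value, no instance needed.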


Member Author:

Great, thanks for your guidance; I'll take a look at them in detail.

Member Author:

While looking at the current implementation, I wondered if maybe we can directly use the SQLConf in SessionState. I did this in the last commit, 453004a. Please have a look when you have time, thanks :)

.boolConf()\
.withDefault("false")

REPL_EAGER_EVAL_MAX_NUM_ROWS = ConfigEntry("spark.sql.repl.eagerEval.maxNumRows")\
Member:

I would also look up the attributes via Py4J and dynamically define the attributes here. It's for internal purposes only.
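Dynamically defining the attributes could look roughly like this sketch, where `__getattr__` forwards unknown names to the JVM-side SQLConf handle. A stub class stands in for the Py4J object, and all names here are illustrative:

```python
class JVMConfStub(object):
    """Stand-in for the Py4J handle to the Scala SQLConf."""

    def isReplEagerEvalEnabled(self):
        return False


class SQLConf(object):
    def __init__(self, jconf):
        self._jconf = jconf

    def __getattr__(self, name):
        # Only called for attributes not found normally; forward them
        # to the JVM-side conf object, so new Scala helper methods
        # need no Python-side changes.
        return getattr(self._jconf, name)


conf = SQLConf(JVMConfStub())
conf.isReplEagerEvalEnabled()  # delegates to the stub's method
```

The appeal of this pattern is exactly what the comment suggests: the Python side stays in sync with the Scala helpers automatically, at the cost of losing static discoverability of the attribute names.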

@@ -81,6 +82,7 @@ def __init__(self, jdf, sql_ctx):
# Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice
# by __repr__ and _repr_html_ while eager evaluation opened.
self._support_repr_html = False
self.sql_conf = SQLConf(sql_ctx)
Member:

I'd make this "private": self._sql_conf

Member:

BTW, does the Scala side have such an attribute? If not, I wouldn't do it this way.

Member Author:

Got it, I'll try to do this by wrapping existing SQLConf.

@@ -358,22 +360,19 @@ def show(self, n=20, truncate=True, vertical=False):
def _eager_eval(self):
Member:

If this can be replaced by sql_conf access, we don't really need this private function.

Member Author:

Yep, agree.

Member Author:

Thanks, Done in 453004a.


SparkQA commented Jun 27, 2018

Test build #92372 has finished for PR 21648 at commit 6a77ed5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ConfigEntry(object):
  • class SQLConf(object):

@xuanyuanking xuanyuanking changed the title [SPARK-24665][PySpark] Add SQLConf in PySpark to manage all sql configs [SPARK-24665][PySpark] Use SQLConf in PySpark to manage all sql configs Jun 30, 2018
xuanyuanking (Member Author):

I changed the PR title and description because maybe we can just use the SQLConf in SessionState and don't need to do extra wrapping work.

@@ -93,6 +93,10 @@ def _ssql_ctx(self):
"""
return self._jsqlContext

def conf(self):
Member:

Let's make this _conf with a property.

Member Author:

Thanks, done in 4fc0ae4.

HyukjinKwon (Member):

Seems fine if the tests pass. I (or someone else) should take another look before merging it though.

xuanyuanking (Member Author):

Got it, thanks again for your advice and guidance.


SparkQA commented Jun 30, 2018

Test build #92496 has finished for PR 21648 at commit b816549.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 30, 2018

Test build #92497 has finished for PR 21648 at commit 4fc0ae4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -93,6 +93,11 @@ def _ssql_ctx(self):
"""
return self._jsqlContext

@property
def _conf(self):
"""Get SQLConf from current SqlContext"""
Member:

nit: Accessor for the JVM SQL-specific configurations

Member Author:

Thanks and done in next commit.

HyukjinKwon (Member) left a review:

LGTM except for a nit


ueshin commented Jul 2, 2018

LGTM for the changes.
I'm just wondering whether all config types are supported by the Py4J method call. Do we need some tests for that?

HyukjinKwon (Member):

I roughly double-checked that most primitive types like Boolean, Long and String work via Py4J, and looked at some of the related code in Py4J. Some types like Option[...] will probably just return the Py4J object (via those private helper methods like isReplEagerEvalEnabled).

@property
def _conf(self):
"""Accessor for the JVM SQL-specific configurations"""
return self.sparkSession._jsparkSession.sessionState().conf()
Member:

What's the difference between this and SQLContext.getConf? Could users be confused by two similar APIs?

Member Author:

What's the difference between this and SQLContext.getConf?

getConf returns the value of a Spark SQL conf for a given key, while the _conf we add here gives access to the helper methods in SQLConf. This actually follows the behavior in SQLContext.scala:

private[sql] def conf: SQLConf = sessionState.conf

Can users be confused by two similar APIs?

Maybe we should add more comments to explain this? As far as I'm concerned, it's more natural than getting a conf by a hard-coded config key.
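The distinction being discussed can be sketched with plain-Python stubs (illustrative names and a dict standing in for the real conf storage): getConf is a string-keyed lookup returning a raw value, while the SQLConf-style helper bundles the key and the type conversion in one place.

```python
class SQLConfStub(object):
    """Stand-in for the JVM SQLConf exposing a typed helper method."""

    def __init__(self, settings):
        self._settings = settings

    def isReplEagerEvalEnabled(self):
        # Typed helper: the key and the string-to-bool conversion
        # live in one place, so callers never hard-code either.
        raw = self._settings.get("spark.sql.repl.eagerEval.enabled", "false")
        return raw == "true"


def getConf(settings, key, default=None):
    # String-keyed lookup: every caller must hard-code the key
    # and parse the raw string value itself.
    return settings.get(key, default)


settings = {"spark.sql.repl.eagerEval.enabled": "true"}
raw = getConf(settings, "spark.sql.repl.eagerEval.enabled")   # raw string
typed = SQLConfStub(settings).isReplEagerEvalEnabled()        # boolean
```

This mirrors the argument in the thread: the helper-method style removes hard-coded keys and ad-hoc parsing from call sites.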

Member:

Btw, _conf sounds like a property internal to the class/object; why do we have an underscore there?

Member:

Maybe we should add more comments to explain this? AFAIC its more natural than get conf in hard code config key.

Yeah, I think this is a good idea.

Member:

This one should be considered an internal API, used only to reduce the hard-coded keys, and the leading underscore implies that. I think we are fine with it.

Member:

OK, makes sense to me.

Member Author:

Thanks for your advice. I'd like to change the comment to:

"""Accessor for the JVM SQL-specific configurations.

This property returns a SQLConf that keeps the same behavior as in the Scala SQLContext;
we mainly use it to call the helper methods in SQLConf.
"""

Do you think that's reasonable?

Member:

I'm fine with it.

Member:

I am okay too but let's just leave it. It's private and not a big deal though.

Member Author:

Got it :)

viirya (Member) left a review:

LGTM


SparkQA commented Jul 2, 2018

Test build #92518 has finished for PR 21648 at commit 1f40375.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member):

Merged to master.

@asfgit asfgit closed this in 8f91c69 Jul 2, 2018
5 participants