[SPARK-24665][PySpark] Use SQLConf in PySpark to manage all sql configs #21648
Conversation
python/pyspark/sql/conf.py
Outdated
ARROW_FALLBACK_ENABLED = ConfigEntry("spark.sql.execution.arrow.fallback.enabled")\
    .boolConf()\
    .withDefault("true")
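For context, a minimal sketch of a ConfigEntry class that would support the builder chain quoted above. The method names (boolConf, withDefault) come from the diff; the conversion logic and the read helper are guesses at the intent, not the actual PR code:

```python
class ConfigEntry(object):
    """Hypothetical Python-side config entry supporting a fluent builder chain."""

    def __init__(self, key):
        self.key = key
        self._converter = lambda v: v  # identity until a typed variant is chosen
        self._default = None

    def boolConf(self):
        # JVM confs round-trip as the strings "true"/"false" over Py4J
        self._converter = lambda v: str(v).lower() == "true"
        return self

    def withDefault(self, default):
        self._default = default
        return self

    def read(self, conf_getter):
        # conf_getter(key, default) stands in for something like SQLContext.getConf
        return self._converter(conf_getter(self.key, self._default))
```

This is roughly the kind of per-entry machinery the reviewer pushes back on below, since it re-declares keys and defaults that already live in the Scala SQLConf.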
This duplicates code on the Scala side. What's the gain of doing this?
Just want to remove the hard-coding as we discussed in #21370 (comment). For the duplication of Scala code, my current idea is to just call buildConf and doc on the Scala side to register the config and keep its doc there, and manage the name and default value in the Python SQLConf. May I ask for your suggestion? :) Thx.
The current approach introduces another chunk of code (about 100 added lines), and it doesn't look very extensible either. I suggested a few possible approaches below. If those are difficult or impossible, I wouldn't do this for now.
Got it, thanks for your advice.
python/pyspark/sql/conf.py
Outdated
class SQLConf(object):
    """A class that enables the getting of SQL config parameters in pyspark"""

    REPL_EAGER_EVAL_ENABLED = ConfigEntry("spark.sql.repl.eagerEval.enabled")\
Can we do this by wrapping the existing SQLConf? We could make them static properties by, for example, this hack
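The linked hack was not preserved in this page, so as an assumption, here is one common "classproperty" descriptor pattern that would let config entries be exposed as static properties on a class; the entry name and key are illustrative:

```python
class classproperty(object):
    """Non-data descriptor that makes a method behave like a class-level property."""

    def __init__(self, fget):
        self.fget = fget

    def __get__(self, obj, owner):
        # Called for both SQLConf.X and instance.X; dispatch on the class
        return self.fget(owner)


class SQLConf(object):
    @classproperty
    def ARROW_FALLBACK_ENABLED(cls):
        return "spark.sql.execution.arrow.fallback.enabled"
```

With this, `SQLConf.ARROW_FALLBACK_ENABLED` resolves without instantiating the class, which is the "static property" effect the reviewer mentions.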
or this hack
Great, thanks for your guidance, I'll take a look at them in detail.
While looking at the current implementation, I realized maybe we can directly use the SQLConf in SessionState? I did this in the last commit 453004a. Please have a look when you have time, thanks :)
python/pyspark/sql/conf.py
Outdated
    .boolConf()\
    .withDefault("false")

REPL_EAGER_EVAL_MAX_NUM_ROWS = ConfigEntry("spark.sql.repl.eagerEval.maxNumRows")\
I would also look up the attributes via Py4J and dynamically define the attributes here. It's for internal purposes only.
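A minimal sketch of what "dynamically define the attributes" could mean in practice: delegate unknown attribute lookups to the JVM-side conf object instead of declaring every entry by hand. The `_jconf` name and the getter name below are assumptions, not from the PR:

```python
class SQLConf(object):
    """Thin dynamic wrapper over a (Py4J-style) JVM SQLConf object."""

    def __init__(self, jconf):
        self._jconf = jconf

    def __getattr__(self, name):
        # Only reached when normal lookup fails; forward the name to the
        # JVM conf, so every JVM-side getter is available without re-listing it.
        return getattr(self._jconf, name)


# Stand-in for the Py4J proxy, for demonstration only
class FakeJVMConf(object):
    def isReplEagerEvalEnabled(self):
        return True
```

The appeal is that adding a new config on the Scala side requires no Python change, which addresses the duplication concern raised earlier in the thread.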
python/pyspark/sql/dataframe.py
Outdated
@@ -81,6 +82,7 @@ def __init__(self, jdf, sql_ctx):
    # Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice
    # by __repr__ and _repr_html_ while eager evaluation opened.
    self._support_repr_html = False
    self.sql_conf = SQLConf(sql_ctx)
I'd make this "private" self._sql_conf
python/pyspark/sql/dataframe.py
Outdated
@@ -81,6 +82,7 @@ def __init__(self, jdf, sql_ctx):
    # Check whether _repr_html is supported or not, we use it to avoid calling _jdf twice
    # by __repr__ and _repr_html_ while eager evaluation opened.
    self._support_repr_html = False
    self.sql_conf = SQLConf(sql_ctx)
BTW, does the Scala side have such an attribute? If not, I wouldn't do it this way.
Got it, I'll try to do this by wrapping existing SQLConf.
python/pyspark/sql/dataframe.py
Outdated
@@ -358,22 +360,19 @@ def show(self, n=20, truncate=True, vertical=False):
    def _eager_eval(self):
If this can be replaced by sql_conf access, we don't really need this private function.
Yep, agree.
Thanks, Done in 453004a.
Test build #92372 has finished for PR 21648 at commit
6a77ed5 to b816549
I changed the PR title and description because maybe we can just use the SQLConf in SessionState and don't need to do extra wrapping work.
python/pyspark/sql/context.py
Outdated
@@ -93,6 +93,10 @@ def _ssql_ctx(self):
        """
        return self._jsqlContext

    def conf(self):
Let's make this _conf with a property.
Thanks, done in 4fc0ae4.
Seems fine if the tests pass. I (or someone else) should take another look before merging it though.
Got it, thanks again for your advice and guidance.
Test build #92496 has finished for PR 21648 at commit
Test build #92497 has finished for PR 21648 at commit
python/pyspark/sql/context.py
Outdated
@@ -93,6 +93,11 @@ def _ssql_ctx(self):
        """
        return self._jsqlContext

    @property
    def _conf(self):
        """Get SQLConf from current SqlContext"""
nit: Accessor for the JVM SQL-specific configurations
Thanks and done in next commit.
LGTM except for a nit
LGTM for the changes.
I roughly double-checked that most primitive types like Boolean, Long and String work via Py4J, as does some code related to it in Py4J. Some types like …
@property
def _conf(self):
    """Accessor for the JVM SQL-specific configurations"""
    return self.sparkSession._jsparkSession.sessionState().conf()
What's the difference between this and SQLContext.getConf? Can users be confused by two similar APIs?
What's the difference between this and SQLContext.getConf ?
getConf returns the value of a Spark SQL conf for a given key, while the _conf we add is meant to access the helper methods in SQLConf. Actually this follows the behavior in SQLContext.scala:
private[sql] def conf: SQLConf = sessionState.conf
Can users being confused by two similar APIs?
Maybe we should add more comments to explain this? AFAIC it's more natural than getting a conf via a hard-coded config key.
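To make the distinction concrete, here is a toy illustration of the two access paths discussed above, using stand-in classes rather than real PySpark objects (the class names and settings dict below are invented for the demo): getConf is a stringly-typed key lookup, while _conf exposes the typed helper methods of the JVM SQLConf.

```python
class FakeJVMSQLConf(object):
    """Stands in for the JVM SQLConf with typed helper methods."""

    def __init__(self, settings):
        self._settings = settings

    def isReplEagerEvalEnabled(self):
        # Typed accessor: parses the stored string into a boolean
        return self._settings.get("spark.sql.repl.eagerEval.enabled", "false") == "true"


class FakeSQLContext(object):
    def __init__(self, settings):
        self._settings = settings

    def getConf(self, key, defaultValue=None):
        # Stringly-typed lookup by key, mirroring SQLContext.getConf
        return self._settings.get(key, defaultValue)

    @property
    def _conf(self):
        # Typed helpers, mirroring sessionState.conf on the Scala side
        return FakeJVMSQLConf(self._settings)
```

So the two APIs serve different callers: getConf for user-facing key/value queries, _conf for internal code that wants typed accessors instead of hard-coded keys.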
Btw, _conf sounds like a property internal to the class/object; why do we have an underscore there?
Maybe we should add more comments to explain this? AFAIC its more natural than get conf in hard code config key.
Yeah, I think this is good idea.
This one should be considered an internal API, used only to reduce the hard-coded keys, and the leading underscore implies that. I think we are fine with it.
Ok. Makes sense to me.
Thanks for your advice, I want to change the comment to:
"""Accessor for the JVM SQL-specific configurations.

This property returns a SQLConf which keeps the same behavior as in the Scala SQLContext;
we mainly use it to call the helper methods in SQLConf.
"""
Do you think it's reasonable?
I'm fine with it.
I am okay too, but let's just leave it. It's private and not a big deal though.
Got it :)
LGTM
Test build #92518 has finished for PR 21648 at commit
Merged to master.
What changes were proposed in this pull request?
Use SQLConf for PySpark to manage all SQL configs, dropping all the hard-coded keys in config usage.
How was this patch tested?
Existing unit tests.