[SPARK-529] [core] [yarn] Add type-safe config keys to SparkConf. #10205
Conversation
This is, in a way, the basics to enable SPARK-529 (which was closed as won't fix but I think is still valuable). In fact, Spark SQL created something for that, and this change basically factors out that code and inserts it into SparkConf, with some extra bells and whistles. To showcase the usage of this pattern, I modified the YARN backend to use the new config keys (defined in the new YarnConfigKeys object). Most of the changes are mechanical, although logic had to be slightly modified in a handful of places. I did not port the conf entry registry that SQLConf keeps back to core, although in the future that could be a useful addition (e.g. to programmatically list all known config options). That required some code duplication in SQLConf, although it shares the actual implementation of the entries with SparkConf.
I'm leaving this as an RFC to start with; if people like it, I'll probably just attach this to SPARK-529 (after reopening it). /cc @andrewor14 @rxin @srowen
Test build #47350 has finished for PR 10205 at commit
Test build #47357 has finished for PR 10205 at commit
Test build #47388 has finished for PR 10205 at commit
Test build #47434 has started for PR 10205 at commit
Haven't seen any feedback so I guess people are ok with turning this into a real PR?
hm - is the value in doing this for core large enough yet? one problem with core is that a lot of config options are defined in yarn, etc, outside of core.
I don't really understand your question. There are config options all over; this allows all of them to be declared with this type-safe construct borrowed from Spark SQL. I changed YARN only to show how it would be done, not because that's the only place that can be changed.
Spark SQL has one place for all the config options. How are we going to do this for core when there are configs declared in the YARN module? Anyway it is not a huge deal. Let's hold off on this one for a bit. In the context of Spark 2.0, I want to think a little bit about what we should do w.r.t. sql conf and core conf (whether it'd make sense to merge them). We can revive this PR if we decide they are going to stay completely separate.
There would be a class in core for the config keys declared in core, and YARN would use those. As the comment in my code says, I'm just avoiding touching core right now, since moving all config keys to this approach at once would be very noisy. Once there's a similar class in core, code in YARN that needs to use those would just do something like:
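(A sketch of the idea; `CoreConfigKeys` and the entry name here are placeholders for illustration, not names from the patch:)

```scala
import org.apache.spark.config.CoreConfigKeys._  // hypothetical core-side config object

// The key string, type, and default all live in the entry definition;
// the YARN-side call site just reads the typed value.
val interval: Long = sparkConf.get(HEARTBEAT_INTERVAL)
```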
And be done with it. Note everything is still in SparkConf. I'm just changing the way you declare keys so that the key name and the default values and types are declared in one place instead of scattered and duplicated all over the code base.
Hopefully that makes the approach clearer.
I am in favor of the high level idea. The biggest problem right now is that several things are duplicated all around the code: the config key name, its default value, and its type.
I'm thinking we can just create a
Though IIRC there was a similar patch you wrote a while ago that was shot down for some reason. What was the counterargument back then and is it still valid? I think Spark 2.0 is a good time to do this since we're already cleaning up a bunch of other things.
Test build #47586 has finished for PR 10205 at commit
Yes. That patch tried to centralize all config key definitions in
I like that, makes the interface the same for all modules.
A little bit noisier since package objects cannot be `private[foo]`, apparently.
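Concretely, the per-module declaration ends up looking roughly like this (a sketch: `ConfigBuilder` and the `.intConf`/`.stringConf` steps are assumed names for the builder pieces, and the two entries are just examples; `.withDefault` / `.optional` are the methods discussed below):

```scala
package org.apache.spark.deploy.yarn

// The package object itself has to stay public, but the individual
// entries can still be access-restricted.
package object config {

  private[spark] val MAX_APP_ATTEMPTS = ConfigBuilder("spark.yarn.maxAppAttempts")
    .intConf
    .optional

  private[spark] val AM_CORES = ConfigBuilder("spark.yarn.am.cores")
    .intConf
    .withDefault(1)
}
```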
Test build #47679 has finished for PR 10205 at commit
Test build #48514 has finished for PR 10205 at commit
(This comment is a bump to test something with https://spark-prs.appspot.com)
Test build #52253 has finished for PR 10205 at commit
Hi @rxin, do you have any remaining concerns?
Another ping; I'd like to get this in so that other cleanups can happen (and the inevitable conflicts with the other open PRs can be fixed).
I plan to merge this as soon as unit tests pass, so last chance to comment.
Go for it. I will look later and just do some small API tweaks and submit a follow-up PR.
So what happens if I don't say either .withDefault or .optional? I assume it fails to compile? Perhaps adding some documentation on those functions and classes might help devs pick it up more easily.
You're going to get a compilation error when trying to use the configuration (e.g. when passing it to `conf.get`).
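For instance (a sketch; the key name is made up, builder pieces as above):

```scala
// Without .withDefault or .optional, the value is still just a builder,
// not a finished ConfigEntry, so it can't be read from the conf.
val incomplete = ConfigBuilder("spark.example.someKey").intConf

// conf.get(incomplete)  // would not compile: a ConfigEntry[T] is expected
```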
thanks, looks good.
Test build #52575 has finished for PR 10205 at commit
Test build #52580 has finished for PR 10205 at commit
This is, in a way, the basics to enable SPARK-529 (which was closed as won't fix but I think is still valuable). In fact, Spark SQL created something for that, and this change basically factors out that code and inserts it into SparkConf, with some extra bells and whistles. To showcase the usage of this pattern, I modified the YARN backend to use the new config keys (defined in the new `config` package object under `o.a.s.deploy.yarn`). Most of the changes are mechanical, although logic had to be slightly modified in a handful of places. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10205 from vanzin/conf-opts.
Merged to master.
## What changes were proposed in this pull request? Fire and forget is disabled by default; with patch #10205 it is enabled by default, so this is a regression that should be fixed. ## How was this patch tested? Manually verified this change. Author: jerryshao <sshao@hortonworks.com> Closes #11577 from jerryshao/hot-fix-yarn-cluster.
@vanzin I finally had some time to look over this change. I like the direction this is going, applying more semantics to configs, but I find the pre-existing SQLConf much easier to understand, as its class structure is much simpler. A few questions/comments:
It was the intent from the beginning to have that distinction, to avoid the currently common case of the default values being hardcoded in different places where the configs are used.
Because with the default arguments, you have to copy & paste the argument list for every type-specific builder (look at the current SQLConf). Also, you can't overload methods with default arguments, in case for some reason you want to.
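To illustrate the copy & paste problem (a sketch, not the actual SQLConf signatures):

```scala
// Named-argument style: every type-specific factory repeats the same
// parameter list, and since each declares default arguments, none of
// them could be overloaded.
def intConf(key: String, defaultValue: Option[Int] = None, doc: String = ""): ConfigEntry[Int] = ???
def stringConf(key: String, defaultValue: Option[String] = None, doc: String = ""): ConfigEntry[String] = ???

// Builder style: the shared pieces are written once in the builder,
// and only the type-specific step differs per entry.
ConfigBuilder("spark.example.key").doc("some doc").intConf.withDefault(1)
```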
That's intentional, although maybe the naming could be a little better.
I'm open to suggestions, but I couldn't find a clean way to need fewer classes, because of how the types are propagated, and because I wanted to easily reuse parsing functions to generate optional configs.
I realize I'm super late to this party, but I just spent a bunch of time trying to understand this new system while rebasing a PR. Overall, I think all of the different classes make this pretty complicated.
The refactored SQLConf looks pretty close to what it was before. Just now we are copying builder methods instead of named arguments. In what case do we need overloading?
This was part of a previous conversation in this thread. One of the features in the new API that wasn't present in the old one is that it handles optional config entries properly; that introduced some issues with the API, because to make it user-friendly you need different types for optional and non-optional entries (so you can say `conf.get(entry)` and get back a plain `T` or an `Option[T]` depending on the entry).
I've explained this to Reynold; the goal is not necessarily to make the implementation of the config builders simple, but their use. If you can simplify the config builder code while still keeping their use simple, I'm open to suggestions. The reason the new API has more classes than the old SQL conf is because it has more features. The old SQL conf did not handle optional configs, and it did not have an explicit fallback mechanism (instead using comments in the code to indicate that).
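For example (a sketch; key names made up, and `.stringConf` assumed as a sibling of `.intConf`):

```scala
// Entry with a default: reads always produce a value.
val TIMEOUT = ConfigBuilder("spark.example.timeout").intConf.withDefault(120)

// Optional entry: reads produce an Option instead.
val STAGING_DIR = ConfigBuilder("spark.example.stagingDir").stringConf.optional

val timeout: Int = conf.get(TIMEOUT)                    // 120 when unset
val stagingDir: Option[String] = conf.get(STAGING_DIR)  // None when unset
```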
In particular, the implicit wrapping of optional configs confused me. I expected to be able to do:

```scala
val checkpointConfig: Option[String] =
  df.sqlContext.conf.getConf(SQLConf.CHECKPOINT_LOCATION)
```

To get it to work correctly I have to write:

```scala
val checkpointConfig: Option[String] =
  df.sqlContext.conf.getConf(
    SQLConf.CHECKPOINT_LOCATION.asInstanceOf[ConfigEntry[Option[String]]],
    None)
```

Am I doing something wrong?
There's a new method in `SQLConf` that takes the optional entry directly.
That should have the same behavior as the previous one for optional configs; meaning, if they're not set in the configuration, you'll get a `None`. Also, because of that method, the return value of `getConf` depends on the static type of the entry you pass in.
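In other words, something along these lines (signatures sketched from the discussion, not copied from the patch; `OptionalConfigEntry` is a guess at the optional-entry type's name):

```scala
// Entries with defaults yield the value (or the declared default) ...
def getConf[T](entry: ConfigEntry[T]): T

// ... while optional entries yield an Option, None when unset.
def getConf[T](entry: OptionalConfigEntry[T]): Option[T]
```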
Well, it actually depends on which one of the overloads you are invoking, which is why I found this design confusing. I guess with the explicit type you can hint to the Scala compiler which one you want to call, and the IntelliJ compiler is less advanced? What is the correct way to ask if an optional configuration is set or not?