[SPARK-529] [core] [yarn] Add type-safe config keys to SparkConf. #10205

Closed
wants to merge 36 commits into apache:master from vanzin:conf-opts

Conversation

vanzin
Contributor

@vanzin vanzin commented Dec 8, 2015

This is, in a way, the basics to enable SPARK-529 (which was closed as
won't fix but I think is still valuable). In fact, Spark SQL created
something for that, and this change basically factors out that code
and inserts it into SparkConf, with some extra bells and whistles.

To showcase the usage of this pattern, I modified the YARN backend
to use the new config keys (defined in the new `config` package object
under `o.a.s.deploy.yarn`). Most of the changes are mechanical, although
the logic had to be slightly modified in a handful of places.

I did not port the conf entry registry that SQLConf keeps back to
core, although in the future that could be a useful addition (e.g.
to programmatically list all known config options). That required
some code duplication in SQLConf, although it shares the actual
implementation of the entries with SparkConf.
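
For illustration, here is a minimal, self-contained sketch of the kind of type-safe config entry the description above refers to. The class and method names (ConfigBuilder, withDefault, optional, and so on) only echo the discussion below, and the config keys are placeholders; this is not the code in the patch.

```scala
// A finished entry knows its key, how to parse its value, and how to read it.
sealed trait ConfigEntry[T] {
  def key: String
  def readFrom(settings: Map[String, String]): T
}

// Entry with a default: "not set" falls back to the default value.
case class ConfigEntryWithDefault[T](key: String, parse: String => T, default: T)
    extends ConfigEntry[T] {
  def readFrom(settings: Map[String, String]): T =
    settings.get(key).map(parse).getOrElse(default)
}

// Optional entry: "not set" is observable as None.
case class OptionalConfigEntry[T](key: String, parse: String => T)
    extends ConfigEntry[Option[T]] {
  def readFrom(settings: Map[String, String]): Option[T] =
    settings.get(key).map(parse)
}

// Type-specific builder: an entry only becomes usable after withDefault(...) or optional.
class TypedConfigBuilder[T](key: String, parse: String => T) {
  def withDefault(default: T): ConfigEntry[T] = ConfigEntryWithDefault(key, parse, default)
  def optional: ConfigEntry[Option[T]] = OptionalConfigEntry(key, parse)
}

class ConfigBuilder(key: String) {
  def intConf: TypedConfigBuilder[Int] = new TypedConfigBuilder[Int](key, _.toInt)
  def booleanConf: TypedConfigBuilder[Boolean] = new TypedConfigBuilder[Boolean](key, _.toBoolean)
  def stringConf: TypedConfigBuilder[String] = new TypedConfigBuilder[String](key, s => s)
}

object ConfigSketch extends App {
  // Declared once: key name, type, and default (or optionality) live together.
  val MAX_ATTEMPTS = new ConfigBuilder("spark.example.maxAppAttempts").intConf.optional
  val WAIT_MS      = new ConfigBuilder("spark.example.reportIntervalMs").intConf.withDefault(1000)

  val settings = Map("spark.example.maxAppAttempts" -> "3")
  println(MAX_ATTEMPTS.readFrom(settings)) // Some(3)
  println(WAIT_MS.readFrom(settings))      // 1000, falls back to the default
}
```

The point of the pattern is that a key's name, type, and default (or optionality) are declared in exactly one place, and every read goes through that declaration.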
@vanzin
Contributor Author

vanzin commented Dec 8, 2015

I'm leaving this as an RFC to start with; if people like it, I'll probably just attach this to SPARK-529 (after reopening it).

/cc @andrewor14 @rxin @srowen

@SparkQA

SparkQA commented Dec 8, 2015

Test build #47350 has finished for PR 10205 at commit 8b00d76.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 8, 2015

Test build #47357 has finished for PR 10205 at commit 85ab5f1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 9, 2015

Test build #47388 has finished for PR 10205 at commit bda7e2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 9, 2015

Test build #47434 has started for PR 10205 at commit 93736c2.

@vanzin
Contributor Author

vanzin commented Dec 10, 2015

Haven't seen any feedback so I guess people are ok with turning this into a real PR?

@rxin
Contributor

rxin commented Dec 10, 2015

Hm, is the value in doing this for core large enough yet? One problem with core is that a lot of config options are defined in YARN, etc., outside of core.

@vanzin
Contributor Author

vanzin commented Dec 10, 2015

I don't really understand your question. There are config options all over; this allows all of them to be declared with this type-safe construct borrowed from Spark SQL. I changed YARN only to show how it would be done, not because that's the only place that can be changed.

@rxin
Contributor

rxin commented Dec 11, 2015

Spark SQL has one place for all the config options. How are we going to do this for core when there are configs declared in the YARN module? Anyway, it is not a huge deal.

Let's hold off on this one for a bit. In the context of Spark 2.0, I want to think a little bit about what we should do w.r.t. the SQL conf and the core conf (whether it'd make sense to merge them). We can revive this PR if we decide they are going to stay completely separate.

@vanzin
Contributor Author

vanzin commented Dec 11, 2015

How are we going to do this for core when there are configs declared in the YARN module?

There would be a class in core for the config keys declared in core, and YARN would use those. As the comment in my code says, I'm just avoiding touching core right now, since moving all config keys to this approach at once would be very noisy.

Once there's a similar class in core, code in YARN that needs to use those would just do something like:

import CoreConfigKeys._

And be done with it. Note everything is still in SparkConf. I'm just changing the way you declare keys so that the key name and the default values and types are declared in one place instead of scattered and duplicated all over the code base.
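
For illustration, here is a hypothetical sketch of that layout; CoreConfigKeys, the key name, and the tiny stand-in ConfigEntry below are placeholders, not the actual Spark code.

```scala
// Hypothetical layout: the key name, type, and default are declared exactly once
// in core, and YARN-side code just imports the keys.
final case class ConfigEntry[T](key: String, parse: String => T, default: T) {
  def readFrom(settings: Map[String, String]): T =
    settings.get(key).map(parse).getOrElse(default)
}

object CoreConfigKeys {
  // Declared once in core (placeholder key name).
  val EXECUTOR_INSTANCES: ConfigEntry[Int] =
    ConfigEntry("spark.example.executor.instances", _.toInt, 2)
}

object YarnAllocatorSketch {
  import CoreConfigKeys._

  // YARN code imports the core keys and reads typed values.
  def targetExecutors(settings: Map[String, String]): Int =
    EXECUTOR_INSTANCES.readFrom(settings)
}
```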

Hopefully that makes the approach clearer.
@andrewor14
Contributor

I am in favor of the high-level idea. The biggest problem right now is that several things are duplicated all over the code base:

  • config key name
  • config default value
  • deprecation logic (maybe we fixed this already)

I'm thinking we can just create a conf.scala file in each module that introduces configs, and do something like

import org.apache.spark.conf._
import org.apache.spark.sql.conf._

@andrewor14
Contributor

Though IIRC there was a similar patch you wrote a while ago that was shot down for some reason. What was the counterargument back then and is it still valid? I think Spark 2.0 is a good time to do this since we're already cleaning up a bunch of other things.

@SparkQA

SparkQA commented Dec 11, 2015

Test build #47586 has finished for PR 10205 at commit c858fa8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Dec 12, 2015

IIRC there was a similar patch you wrote a while ago that was shot down for some reason

Yes. That patch tried to centralize all config key definitions in SparkConf.scala, which people were not very comfortable with. The code still exists:

https://github.com/vanzin/spark/commits/SPARK-529

@vanzin
Contributor Author

vanzin commented Dec 14, 2015

I'm thinking we can just create a conf.scala file in each module

I like that; it makes the interface the same for all modules.

It's a little bit noisier, though, since package objects apparently cannot be `private[foo]`.
@SparkQA

SparkQA commented Dec 14, 2015

Test build #47679 has finished for PR 10205 at commit c1de25f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin vanzin changed the title [RFC] Add type-safe config keys to SparkConf. [SPARK-529] [core] [yarn] Add type-safe config keys to SparkConf. Dec 30, 2015
@SparkQA

SparkQA commented Dec 30, 2015

Test build #48514 has finished for PR 10205 at commit 6b6eb27.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

(This comment is a bump to test something with https://spark-prs.appspot.com)

@SparkQA

SparkQA commented Mar 1, 2016

Test build #52253 has finished for PR 10205 at commit be1daed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Mar 3, 2016

Hi @rxin, do you have any remaining concerns?

@vanzin
Contributor Author

vanzin commented Mar 4, 2016

Another ping; I'd like to get this in so that other cleanups can happen (and the inevitable conflicts with the other open PRs can be fixed).

@vanzin
Contributor Author

vanzin commented Mar 7, 2016

I plan to merge this as soon as unit tests pass, so last chance to comment.

@rxin
Contributor

rxin commented Mar 7, 2016

Go for it. I will look later and just do some small api tweaks and submit a followup pr.

@tgravescs
Contributor

So what happens if I don't say either .withDefault or .optional? I assume it fails to compile? Perhaps adding some documentation on those functions and classes would help devs pick it up more easily.

@vanzin
Contributor Author

vanzin commented Mar 7, 2016

I assume it fails to compile?

You're going to get a compilation error when trying to use the configuration (e.g. conf.get(blah)). I can add something to the javadoc.
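
For illustration, a sketch with simplified stand-in classes (not the actual Spark code) of why a builder left unfinished surfaces as a compile error at the use site: get only accepts a finished ConfigEntry, and the builder does not produce one until withDefault(...) or optional is called.

```scala
// Simplified stand-ins to show the compile-time check.
final class ConfigEntry[T](val key: String, val parse: String => T, val default: Option[T])

final class TypedConfigBuilder[T](key: String, parse: String => T) {
  def withDefault(d: T): ConfigEntry[T] = new ConfigEntry[T](key, parse, Some(d))
  def optional: ConfigEntry[Option[T]] =
    new ConfigEntry[Option[T]](key, s => Some(parse(s)), Some(None))
}

final class ConfigBuilder(key: String) {
  def intConf: TypedConfigBuilder[Int] = new TypedConfigBuilder[Int](key, _.toInt)
}

object CompileErrorSketch {
  // Only a finished ConfigEntry can be read.
  def get[T](entry: ConfigEntry[T], settings: Map[String, String]): T =
    settings.get(entry.key).map(entry.parse).orElse(entry.default)
      .getOrElse(throw new NoSuchElementException(entry.key))

  val finished   = new ConfigBuilder("spark.example.retries").intConf.withDefault(3)
  val unfinished = new ConfigBuilder("spark.example.retries").intConf // still just a builder

  val ok: Int = get(finished, Map.empty) // compiles, evaluates to 3
  // get(unfinished, Map.empty)          // does not compile: TypedConfigBuilder is not a ConfigEntry
}
```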

@tgravescs
Contributor

Thanks, looks good.

@SparkQA

SparkQA commented Mar 7, 2016

Test build #52575 has finished for PR 10205 at commit dbfed7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2016

Test build #52580 has finished for PR 10205 at commit 31156aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin vanzin closed this Mar 7, 2016
@vanzin vanzin deleted the conf-opts branch March 7, 2016 22:14
asfgit pushed a commit that referenced this pull request Mar 7, 2016
This is, in a way, the basics to enable SPARK-529 (which was closed as
won't fix but I think is still valuable). In fact, Spark SQL created
something for that, and this change basically factors out that code
and inserts it into SparkConf, with some extra bells and whistles.

To showcase the usage of this pattern, I modified the YARN backend
to use the new config keys (defined in the new `config` package object
under `o.a.s.deploy.yarn`). Most of the changes are mechanical, although
logic had to be slightly modified in a handful of places.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10205 from vanzin/conf-opts.
@vanzin
Contributor Author

vanzin commented Mar 7, 2016

Merged to master.

@vanzin vanzin restored the conf-opts branch March 8, 2016 00:56
asfgit pushed a commit that referenced this pull request Mar 8, 2016
## What changes were proposed in this pull request?

Fire-and-forget is disabled by default; with patch #10205 it is enabled by default, so this is a regression that should be fixed.

## How was this patch tested?

Manually verified this change.

Author: jerryshao <sshao@hortonworks.com>

Closes #11577 from jerryshao/hot-fix-yarn-cluster.
@rxin
Contributor

rxin commented Mar 16, 2016

@vanzin I finally had some time to look over this change. I like the direction this is going in, applying more semantics to configs, but I find the pre-existing SQLConf much easier to understand, as its class structure is much simpler.

A few questions/comments:

  • What's the difference between OptionalConfigEntry and something that has a default value of null? I only find OptionalConfigEntry used in the setter, but nothing specific to a getter.
  • Why is the builder pattern better than just a ctor with default argument values?
  • Function naming is inconsistent, e.g. "withDefault" and "doc" vs. "withDoc".
  • The convention is for state-mutation functions (e.g. internal) to have parentheses. Right now they look like some variable that's being accessed.
  • Do we really need all these classes? Even if we use the builder pattern, I'd imagine we can live with only two classes, a builder and a config.

@vanzin
Contributor Author

vanzin commented Mar 16, 2016

What's the difference between OptionalConfigEntry and something that has default value null?

null is not a valid default value for lots of different types (e.g. int or long or even boolean). Also, there's a semantic difference between a config value that is not set and a config value that has a default value, and this captures it. It's basically the difference between calling SparkConf.get(foo, default) and SparkConf.getOption(foo).

It was the intent from the beginning to have that distinction, to avoid the currently common case of the default values being hardcoded in different places where the configs are used.
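
For illustration, the distinction described above, shown with the existing string-based SparkConf API (the key name is just a placeholder):

```scala
import org.apache.spark.SparkConf

object DefaultVsOptional {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(false) // don't load system properties

    // Defaulted read: "not set" is indistinguishable from "set to the default".
    val queueOrDefault: String = conf.get("spark.example.queue", "default")

    // Optional read: "not set" is observable as None.
    val maybeQueue: Option[String] = conf.getOption("spark.example.queue")

    println(queueOrDefault) // default
    println(maybeQueue)     // None
  }
}
```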

Why is the builder pattern better than just a ctor with default argument values?

Because with the default arguments, you have to copy & paste the argument list for every type-specific builder (look at the current SQLConf). Also, you can't overload methods with default arguments, in case for some reason you want to.

Function naming is inconsistent. E.g. "withDefault" and "doc" vs "withDoc"

That's intentional, although maybe the naming could be a little better: `doc` just modifies the internal state of the builder, while `withDefault` actually builds a config entry.

Do we really need all these classes?

I'm open to suggestions, but I couldn't find a clean way to need fewer classes, because of how the types are propagated, and because the parsing functions need to be easily reusable to generate optional configs.

@marmbrus
Contributor

marmbrus commented Apr 5, 2016

I realize I'm super late to this party, but I just spent a bunch of time trying to understand this new system while rebasing a PR. Overall, I think all of the different classes make this pretty complicated.

Because with the default arguments, you have to copy & paste the argument list for every type-specific builder (look at the current SQLConf). Also, you can't overload methods with default arguments, in case for some reason you want to.

The refactored SQLConf looks pretty close to what it was before; it's just that now we are copying builder methods instead of named arguments. In what case do we need overloading?

@vanzin
Contributor Author

vanzin commented Apr 5, 2016

In what case do we need overloading?

This was part of a previous conversation in this thread. One of the features in the new API that wasn't present in the old one is that it handles optional config entries properly; that introduced some issues with the API, because to make it user-friendly you need different types for optional and non-optional entries (so you can say set(nonOptionalIntConf, 1) and set(optionalIntConf, 1), for example). It was suggested to use overloaded methods, because the first approach was a little weird. But you can't have default parameters and overloaded methods together. So I changed it to the current builder approach.
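
For illustration, a sketch with simplified stand-ins (not the actual Spark code) of the setter issue described above: an optional Int entry is really an entry of type Option[Int], so without a dedicated overload callers would have to write set(entry, Some(1)); overloading set for the optional entry type keeps both call sites looking the same.

```scala
import scala.collection.mutable

class ConfigEntry[T](val key: String, val stringify: T => String)

// An optional Int entry is a ConfigEntry[Option[Int]] underneath.
class OptionalConfigEntry[T](key: String, val rawStringify: T => String)
    extends ConfigEntry[Option[T]](key, _.map(rawStringify).orNull)

class Settings {
  private val values = mutable.Map[String, String]()

  def set[T](entry: ConfigEntry[T], value: T): this.type = {
    values(entry.key) = entry.stringify(value); this
  }
  // Overload so callers pass the plain value instead of Some(value).
  def set[T](entry: OptionalConfigEntry[T], value: T): this.type = {
    values(entry.key) = entry.rawStringify(value); this
  }
}

object SetterSketch extends App {
  val nonOptionalIntConf = new ConfigEntry[Int]("spark.example.a", _.toString)
  val optionalIntConf    = new OptionalConfigEntry[Int]("spark.example.b", _.toString)

  // Both entries are set the same way from the caller's point of view.
  val s = new Settings().set(nonOptionalIntConf, 1).set(optionalIntConf, 1)
}
```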

Overall, I think all of the different classes make this pretty complicated.

I've explained this to Reynold; the goal is not necessarily to make the implementation of the config builders simple, but to make their use simple. If you can simplify the config builder code while still keeping its use simple, I'm open to suggestions.

The reason the new API has more classes than the old SQL conf is that it has more features. The old SQL conf did not handle optional configs, and it did not have an explicit fallback mechanism (instead using comments in the code to indicate that).

@marmbrus
Contributor

marmbrus commented Apr 5, 2016

In particular, the implicit wrapping of OptionalConfigEntry[T] into a ConfigEntry[Option[T]], coupled with the unwrapping done via overloading of getConf, took me a while to unpack. Maybe I'm just hitting a bug, but if I do the following to get an optional config:

val checkpointConfig: Option[String] = df.sqlContext.conf.getConf(SQLConf.CHECKPOINT_LOCATION)
  • IntelliJ for some reason isn't understanding it and is underlining it in red.
  • It's throwing java.util.NoSuchElementException: spark.sql.streaming.checkpointLocation.

To get it to work correctly I have to write:

val checkpointConfig: Option[String] =
  df.sqlContext.conf.getConf(
    SQLConf.CHECKPOINT_LOCATION.asInstanceOf[ConfigEntry[Option[String]]],
    None)

Am I doing something wrong?

@vanzin
Contributor Author

vanzin commented Apr 6, 2016

CHECKPOINT_LOCATION was a weird one because it didn't have a default value before my changes, and this was the best way I found to map it.

There's a new method in SQLConf to handle it:

def getConf[T](entry: OptionalConfigEntry[T]): T

That should have the same behavior as the previous one for optional configs; meaning, if they're not set in the configuration, you'll get a NoSuchElementException. I'm not sure how you were not getting the exception before, since there was no default value.

Also, because of that method, the return value of getConf(CHECKPOINT_LOCATION) would be String and not Option[String], which is probably why IntelliJ is complaining.

@marmbrus
Contributor

marmbrus commented Apr 6, 2016

Also, because of that method, the return value of getConf(CHECKPOINT_LOCATION) would be String and not Option[String], which is probably why intellij is complaining.

Well, it actually depends on which one of the overloads you are invoking, which is why I found this design confusing. I guess with the explicit type you can hint to the Scala compiler which one you want to call? And the IntelliJ compiler is less advanced?

What is the correct way to ask if an optional configuration is set or not?
