
Conversation

@rdblue (Contributor) commented Mar 4, 2016

When both the number of executor instances and dynamic allocation are configured, dynamic allocation is ignored and the number of executors is set statically. This introduces a configuration property to change that behavior, so that cluster administrators can keep dynamic allocation as the default even when users set --num-executors (rather than spark.dynamicAllocation.minExecutors) and would otherwise unintentionally disable dynamic allocation for a job.

This includes unit tests for the new cases and updates to documentation.
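For context, a minimal spark-shell style sketch (not from the PR) of the configuration interaction being described; the overrideNumInstances property name comes from this PR's diff, and the comments describe the proposed behavior, not what Spark does today:

import org.apache.spark.SparkConf

// Cluster-wide defaults an administrator would normally put in spark-defaults.conf.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // external shuffle service, needed for dynamic allocation
  .set("spark.dynamicAllocation.overrideNumInstances", "true") // the property this PR proposes

// A user who submits with --num-executors 50 effectively adds:
conf.set("spark.executor.instances", "50")

// Without this patch, spark.executor.instances wins and dynamic allocation is disabled
// (aside from a log warning). With the proposed flag, dynamic allocation would remain
// enabled despite the explicit executor count.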

@AmplabJenkins

Can one of the admins verify this patch?

@andrewor14 (Contributor)

I am against this change. We're creating a flag to control the behavior of two other flags, and that level of complexity is difficult to maintain. We already established the precedence order in SPARK-9092, so we shouldn't try to change it. Besides, the simple workaround is just to not set one of the configs.

@rdblue (Contributor, Author) commented Mar 5, 2016

@andrewor14, thanks for taking a look at this so quickly.

From what we've seen, the workaround you suggest isn't viable in practice, because the two properties are set at different times and aren't obviously related. We default jobs to dynamic allocation, so that's an admin configuration most users never see (unless they want to opt in to static allocation). Users set up their applications, and it isn't clear to them at that time that requesting some number of executors results in static allocation. They also don't necessarily notice the warning in their job logs.

I think a reasonable interpretation is that --num-executors would control the minimum or the initial number of executors. It's fine with me that the default goes the other way, but we lose a lot of resources to accidental static allocation. I think this additional option is worth the trade-off, but I'm happy to discuss other approaches. The main problem with your suggestion is just that users aren't aware that the two are related.
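For illustration only, a sketch of that interpretation, separate from the actual diff hunk quoted below; resolveInitialExecutors is a hypothetical helper, not code from this PR:

import org.apache.spark.SparkConf

// Hypothetical reading of --num-executors under dynamic allocation: treat an explicit
// spark.executor.instances as the initial/minimum executor count instead of a switch
// back to static allocation.
def resolveInitialExecutors(conf: SparkConf): Int = {
  val numExecutors = conf.getInt("spark.executor.instances", 0)
  val configuredInitial = conf.getInt("spark.dynamicAllocation.initialExecutors",
    conf.getInt("spark.dynamicAllocation.minExecutors", 0))
  math.max(numExecutors, configuredInitial)
}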

def isDynamicAllocationEnabled(conf: SparkConf): Boolean = {
  val numExecutor = conf.getInt("spark.executor.instances", 0)
  val dynamicAllocationEnabled = conf.getBoolean("spark.dynamicAllocation.enabled", false)
  val overrideExecutors = conf.getBoolean("spark.dynamicAllocation.overrideNumInstances", false)
  // Return reconstructed from the PR description: dynamic allocation applies when it is
  // enabled and either no explicit executor count is set or the proposed override flag is on.
  dynamicAllocationEnabled && (numExecutor == 0 || overrideExecutors)
}
Inline review comment from a Member on the isDynamicAllocationEnabled hunk above:

I think this just piles onto the confusion though, with yet another flag to control flags. I think this is a documentation issue, and it already generates warnings when you set these things together. I personally favor leaving it as is.

@srowen (Member) commented Mar 5, 2016

Letting --num-executors control the initial number of executors is more reasonable to me. We unfortunately already have a different flag for that, which makes it kind of funny. It would also only control the initial number of executors; someone who thinks "this job needs more resource" really wants to raise the max, or to not enable dynamic allocation at all.

I agree it's an issue for something like CDH that sets dynamic allocation by default, which is a good default. I still think the best solution is docs.

@tgravescs (Contributor)

I agree with the above that we shouldn't add another config for this. I would rather see the default behavior changed in the 2.x line: if the dynamic allocation config is on, then num-executors sets both the maximum and the initial number of executors. I think this would let users easily cap their usage while still allowing idle executors to be freed. It would also let users doing ML start out with a number of executors, and if they are actually caching the data, those executors wouldn't be freed up. So you would get behavior very similar to having dynamic allocation off.

thoughts?

We have been running into this a lot recently on our clusters: we see a lot of wasted resources, and dynamic allocation would help, but users are so used to specifying --num-executors that they end up disabling it.
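A sketch of the precedence @tgravescs describes above; applyNumExecutorsAsBounds is hypothetical and not part of this PR or of Spark:

import org.apache.spark.SparkConf

// Hypothetical: with dynamic allocation enabled, let --num-executors act as both the
// initial and the maximum executor count, so the user-supplied number caps usage
// while idle executors can still be released.
def applyNumExecutorsAsBounds(conf: SparkConf): SparkConf = {
  val numExecutors = conf.getInt("spark.executor.instances", 0)
  if (conf.getBoolean("spark.dynamicAllocation.enabled", false) && numExecutors > 0) {
    conf.set("spark.dynamicAllocation.initialExecutors", numExecutors.toString)
    conf.set("spark.dynamicAllocation.maxExecutors", numExecutors.toString)
  }
  conf
}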

@tgravescs (Contributor)

Filed https://issues.apache.org/jira/browse/SPARK-13723 for this so we can move the discussion there.

@andrewor14 (Contributor)

OK, either way I think we've decided we're not going to proceed with the changes in this patch. @rdblue would you mind closing this PR? We'll continue the discussion on the JIRA.

@rdblue (Contributor, Author) commented Mar 30, 2016

Close

@rdblue closed this Mar 30, 2016