
[SPARK-6284][MESOS] Add mesos role, principal and secret #4960

Closed
wants to merge 3 commits into from

Conversation

tnachen
Contributor

@tnachen tnachen commented Mar 10, 2015

Mesos allows a role and authentication credentials to be set per framework. The role identifies the framework for resource allocation and affects its resource-sharing weight, while the optional principal and secret authenticate the framework when it connects to the master.
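For context, a minimal sketch of how an application might supply these settings: spark.mesos.role is used throughout the discussion below, while the principal and secret property names are assumed here to follow the same spark.mesos.* scheme, and every value shown is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MesosRoleAuthExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("mesos://master.example.com:5050")    // hypothetical master URL
      .setAppName("MesosRoleAuthExample")
      .set("spark.mesos.role", "dev")                  // role used for resource allocation
      .set("spark.mesos.principal", "spark-framework") // placeholder principal (assumed property name)
      .set("spark.mesos.secret", "framework-secret")   // placeholder secret (assumed property name)

    val sc = new SparkContext(conf)
    try {
      // Trivial job, just to show the framework registering with the configured role.
      println(sc.parallelize(1 to 100).count())
    } finally {
      sc.stop()
    }
  }
}
```

The same properties can equally be passed via --conf on spark-submit or through spark-defaults.conf, as later comments in this thread do.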

@tnachen tnachen changed the title Add mesos role, principal and secret [MESOS] Add mesos role, principal and secret Mar 10, 2015
@adam-mesos

LGTM, from a Mesos perspective. 👍

@srowen
Member

srowen commented Mar 10, 2015

(JIRA please?)

@SparkQA

SparkQA commented Mar 10, 2015

Test build #28424 has finished for PR 4960 at commit 99c3a85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Mar 10, 2015

Have a look at the other PRs -- write SPARK-xxxx [yyyy] ... to have it properly "indexed" by https://spark-prs.appspot.com/

@SparkQA

SparkQA commented Mar 10, 2015

Test build #28425 has finished for PR 4960 at commit 8fa250b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tnachen tnachen changed the title [MESOS] Add mesos role, principal and secret [SPARK-6284][MESOS] Add mesos role, principal and secret Mar 11, 2015
@tnachen
Contributor Author

tnachen commented Mar 11, 2015

@srowen Sorry, I missed your notification; I've opened a JIRA now.

@realoptimal

How have you tested this? I get ExecutorLostFailure and tasks are not launched on a Mesos slave configured with resources cpus(prod):2; cpus(dev):2; mem(*):29655; disk(*):745874; ports(*):[31000-32000]

and trying to run, e.g.:
./bin/spark-submit --master mesos://iq-cluster-master:5050 --total-executor-cores 2 --executor-memory 3G --conf spark.mesos.role=dev ./examples/src/main/python/pi.py 100

@realoptimal

Also, if slave resources are all of the default role, i.e. "*", the framework should still be able to use those resources even with spark.mesos.role != "*".

@tnachen
Contributor Author

tnachen commented Mar 16, 2015

@realoptimal you did indeed find a problem with roles. I only verified that the framework registered with the right role and that tasks launched; I didn't try your case, where multiple roles with different resources and no wildcard resources are available. The scheduler currently just uses * as the role everywhere.

I have a fix and will push it plus tests to this PR. Thanks again!
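A rough sketch of the idea behind the fix (not the actual patch): when totalling a resource from an offer, count the portion reserved for the configured role before falling back to unreserved "*" resources. The object and helper names below are hypothetical.

```scala
import scala.collection.JavaConverters._
import org.apache.mesos.Protos.Resource

object RoleResourceMath {
  // Sum a named scalar resource (e.g. "cpus" or "mem") from an offer's resource list,
  // preferring resources reserved for `role` and then unreserved ("*") resources.
  def scalarForRole(resources: java.util.List[Resource], name: String, role: String): Double = {
    val matching = resources.asScala.filter(_.getName == name)
    val (reserved, rest) = matching.partition(_.getRole == role)
    val unreserved = rest.filter(_.getRole == "*")
    (reserved ++ unreserved).map(_.getScalar.getValue).sum
  }
}
```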

@tnachen
Contributor Author

tnachen commented Mar 17, 2015

@realoptimal Can you try again? I've updated the PR to include tests, and I've tested it myself in fine-grained and coarse-grained mode; it is correctly launching the tasks.

@SparkQA

SparkQA commented Mar 17, 2015

Test build #28715 has finished for PR 4960 at commit eba5669.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@realoptimal

@tnachen You are much faster than I am; I was working on modifying your previous code to dole out reserved offers first and then use unreserved offers (those with role="*"), but you beat me to it.

Tested your code with my configuration, and it works when spark.mesos.role is specified in the spark-defaults.conf file. Not sure why it won't pick it up via --conf on the command line, but I think that is an issue with SparkSubmit, not with your code changes.

@tnachen
Contributor Author

tnachen commented Mar 17, 2015

@realoptimal thanks for testing this. We can file another issue for the --conf problem, and I'm sure there are more rough edges to smooth out. @andrewor14 can you take a look at this PR?

@SparkQA

SparkQA commented Mar 22, 2015

Test build #28958 timed out for PR 4960 at commit a5d3e8d after a configured wait of 120m.

@andrewor14
Contributor

retest this please. This needs to be rebased to master

<td>Role for the Spark framework</td>
<td>
Set the role of this Spark framework for Mesos. Roles are used in Mesos for reservations
and resource weigth sharing.

weight

@tnachen
Contributor Author

tnachen commented Jul 14, 2015

retest this please

@SparkQA

SparkQA commented Jul 14, 2015

Test build #14 has finished for PR 4960 at commit 32da933.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 14, 2015

Test build #37220 has finished for PR 4960 at commit 32da933.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tnachen
Contributor Author

tnachen commented Jul 16, 2015

@andrewor14 I think I figured out what's going on with the test. It's hard to figure out from the logs since it was just hitting a System.exit(1). Hopefully this goes through!

@SparkQA

SparkQA commented Jul 16, 2015

Test build #37431 has finished for PR 4960 at commit 0f9f03e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -183,6 +186,18 @@ private[spark] class MesosSchedulerBackend(

override def reregistered(d: SchedulerDriver, masterInfo: MasterInfo) {}

def getTasksSummary(tasks: JArrayList[MesosTaskInfo]): String = {

I'll make this private when I merge. This is only used in 1 place.
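For illustration only, a hypothetical sketch of the helper once made private; the enclosing class stands in for MesosSchedulerBackend, and the summary format is assumed rather than taken from the PR.

```scala
import java.util.{ArrayList => JArrayList}
import scala.collection.JavaConverters._
import org.apache.mesos.Protos.{TaskInfo => MesosTaskInfo}

class TasksSummarySketch {
  // Build a short, log-friendly description of the tasks about to be launched.
  private def getTasksSummary(tasks: JArrayList[MesosTaskInfo]): String = {
    tasks.asScala
      .map(t => s"Task id: ${t.getTaskId.getValue}, name: ${t.getName}")
      .mkString("; ")
  }
}
```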

@andrewor14
Contributor

LGTM I'm merging this into master. Thanks @tnachen!

@asfgit asfgit closed this in d86bbb4 Jul 17, 2015
@reachbach

@tnachen @andrewor14 thanks for getting this done! Much needed for mesos deployments.

@adam-mesos

Thanks! 🎉 🍰 ✌️

@tnachen tnachen deleted the mesos_fw_auth branch July 17, 2015 08:04
@samklr

samklr commented Jul 20, 2015

Just discovering this. Great job, guys!

asfgit pushed a commit that referenced this pull request Sep 10, 2015
…or.cores

This is a regression introduced in #4960, this commit fixes it and adds a test.

tnachen andrewor14 please review, this should be an easy one.

Author: Iulian Dragos <jaguarul@gmail.com>

Closes #8653 from dragos/issue/mesos/fine-grained-maxExecutorCores.
asfgit pushed a commit that referenced this pull request Sep 10, 2015
…or.cores

This is a regression introduced in #4960, this commit fixes it and adds a test.

tnachen andrewor14 please review, this should be an easy one.

Author: Iulian Dragos <jaguarul@gmail.com>

Closes #8653 from dragos/issue/mesos/fine-grained-maxExecutorCores.

(cherry picked from commit f0562e8)
Signed-off-by: Andrew Or <andrew@databricks.com>
@ohal

ohal commented Sep 21, 2015

@tnachen
I'm trying to get the roles implementation working without any luck; can you please point out what should be configured?
I have a cluster with Mesos 0.22.1, Hadoop 2.5.0, 3 masters, and 8 slaves: 3 with role fast, 3 with role "*", and 2 with other roles.
Spark 1.5.0 is deployed in cluster mode with the dispatcher, which looks to be working; however, I cannot get jobs running on the particular nodes with the role configured.
The job is submitted as # bin/spark-submit --class org.apache.spark.examples.SparkPi --master master.novalocal:7077 --deploy-mode cluster --conf spark.mesos.role=fast http://www/spark.jar 10
I've also tried to configure Spark defaults for submitted jobs and added the line spark.mesos.role fast to /opt/spark/conf/spark-defaults.conf.
In either case, after submitting, the job is staging/running on the "*" nodes, not on the fast nodes.
Can you please describe how to configure it properly?

@adam-mesos

Oleh, did you specify --roles=fast,foo,bar when you started the Mesos master?


@ohal

ohal commented Sep 21, 2015

yes, sure, all roles are configured on masters as MESOS_ROLES=fast,big,kafka

@ohal

ohal commented Sep 21, 2015

also, it looks like the dispatcher does not receive resource offers other than the "*" role:

15/09/21 17:40:54 TRACE MesosClusterScheduler: Received offers from Mesos:
Offer id: 20150911-070321-538888970-5050-1632-O282247, cpu: 8.0, mem: 14863.0
Offer id: 20150911-070321-538888970-5050-1632-O282248, cpu: 8.0, mem: 14863.0
Offer id: 20150911-070321-538888970-5050-1632-O282249, cpu: 8.0, mem: 14863.0   

@tnachen
Contributor Author

tnachen commented Sep 21, 2015

Hi @ohal, you'll need to set spark.mesos.role when you launch the dispatcher, which you can do by setting it in spark-defaults.conf in the conf dir; it will automatically be loaded when you run the dispatcher.


@ohal

ohal commented Sep 21, 2015

actually, it was done before:

# cat conf/spark-defaults.conf
spark.master                     spark://master.novalocal:7077
spark.eventLog.enabled           true
spark.mesos.role                 fast

when the job is starting I get some output:

...
  "mainClass" : "org.apache.spark.examples.SparkPi",
  "sparkProperties" : {
    "spark.jars" : "http://www/spark.jar",
    "spark.driver.supervise" : "false",
    "spark.app.name" : "org.apache.spark.examples.SparkPi",
    "spark.mesos.role" : "fast",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://master.novalocal:7077"
  }
...

@tnachen
Contributor Author

tnachen commented Sep 21, 2015

Those are the Spark properties for the job, but they might not apply to the dispatcher.

The easiest way to check is to go to the Mesos UI, look at the dispatcher framework, and see whether it has the correct role.

Tim


@ohal

ohal commented Sep 21, 2015

looks like the driver starts with some defaults (cpu, mem); below is from the Spark UI:
State: TASK_ERROR, Message: Task uses more resources cpus(*):1; mem(*):1024 than available cpus(fast):8; mem(fast):14863; disk(fast):45243; ports(fast):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-65535], Source: SOURCE_MASTER, Reason: REASON_TASK_INVALID, Time: 1.442868857232848E9

@tnachen
Contributor Author

tnachen commented Sep 21, 2015

Ah, this is indeed a bug; the multiple-roles logic that's in the coarse-grained and fine-grained schedulers needs to be ported to the cluster scheduler. Will fix this asap
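To illustrate the mismatch in the TASK_ERROR above: the driver task declares its resources against role "*", while the offer only carries resources reserved for role "fast". A hedged sketch of declaring a resource against the offer's role instead; the object and helper names are hypothetical.

```scala
import org.apache.mesos.Protos.{Resource, Value}

object RoleAwareResources {
  // Build a CPU resource tagged with the role the offer's resources are reserved for,
  // instead of hard-coding "*" as the cluster scheduler currently does.
  def cpuResource(amount: Double, role: String): Resource =
    Resource.newBuilder()
      .setName("cpus")
      .setType(Value.Type.SCALAR)
      .setScalar(Value.Scalar.newBuilder().setValue(amount))
      .setRole(role)
      .build()
}

// e.g. RoleAwareResources.cpuResource(1.0, "fast") would match cpus(fast) in the offer above.
```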

@AndriiOmelianenko

@tnachen Hi Tim,
any update here?

@tnachen
Contributor Author

tnachen commented Oct 5, 2015

Hi @AndriiOmelianenko, I have a PR out to fix that here #8872
