
[SPARK-6284][MESOS] Add mesos role, principal and secret #4960

Closed
wants to merge 3 commits into from

Conversation

tnachen
Contributor

@tnachen tnachen commented Mar 10, 2015

Mesos allows a role and authentication credentials to be set per framework. The role identifies the framework for resource allocation and affects its resource-sharing weight, while the optional principal and secret authenticate the framework when it connects to the master.
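For context, a minimal sketch of how an application might supply these settings: spark.mesos.role is used throughout the discussion below, while the principal and secret property names are assumed here to follow the same spark.mesos.* scheme, and every value shown is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MesosRoleAuthExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("mesos://master.example.com:5050")    // hypothetical master URL
      .setAppName("MesosRoleAuthExample")
      .set("spark.mesos.role", "dev")                  // role used for resource allocation
      .set("spark.mesos.principal", "spark-framework") // placeholder principal (assumed property name)
      .set("spark.mesos.secret", "framework-secret")   // placeholder secret (assumed property name)

    val sc = new SparkContext(conf)
    try {
      // Trivial job, just to show the framework registering with the configured role.
      println(sc.parallelize(1 to 100).count())
    } finally {
      sc.stop()
    }
  }
}
```

The same properties can equally be passed via --conf on spark-submit or through spark-defaults.conf, as later comments in this thread do.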

@tnachen tnachen changed the title Add mesos role, principal and secret [MESOS] Add mesos role, principal and secret Mar 10, 2015
@adam-mesos

LGTM, from a Mesos perspective. 👍

@srowen
Member

srowen commented Mar 10, 2015

(JIRA please?)

@SparkQA

SparkQA commented Mar 10, 2015

Test build #28424 has finished for PR 4960 at commit 99c3a85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Mar 10, 2015

Have a look at the other PRs -- write SPARK-xxxx [yyyy] ... to have it properly "indexed" by https://spark-prs.appspot.com/

@SparkQA

SparkQA commented Mar 10, 2015

Test build #28425 has finished for PR 4960 at commit 8fa250b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tnachen tnachen changed the title [MESOS] Add mesos role, principal and secret [SPARK-6284][MESOS] Add mesos role, principal and secret Mar 11, 2015
@tnachen
Contributor Author

tnachen commented Mar 11, 2015

@srowen Sorry, I missed your notification; I've opened a JIRA now.

@realoptimal

How have you tested this? I get ExecutorLostFailure and tasks are not launched on a Mesos slave configured with resources cpus(prod):2; cpus(dev):2; mem(*):29655; disk(*):745874; ports(*):[31000-32000]

and trying to run, e.g.:
./bin/spark-submit --master mesos://iq-cluster-master:5050 --total-executor-cores 2 --executor-memory 3G --conf spark.mesos.role=dev ./examples/src/main/python/pi.py 100

@realoptimal

Also, if slave resources are all of the default role, i.e. "*", the framework should still be able to use those resources even with spark.mesos.role != "*".

@tnachen
Contributor Author

tnachen commented Mar 16, 2015

@realoptimal you did indeed find a problem with roles. I only verified that the framework registered with the right role and that tasks launched; I didn't try your case, where multiple roles with different resources and no wildcard resources are available. The scheduler currently just uses * as the role everywhere.

I have a fix and will push it plus tests to this PR. Thanks again!
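A rough sketch of the idea behind the fix (not the actual patch): when totalling a resource from an offer, count the portion reserved for the configured role before falling back to unreserved "*" resources. The object and helper names below are hypothetical.

```scala
import scala.collection.JavaConverters._
import org.apache.mesos.Protos.Resource

object RoleResourceMath {
  // Sum a named scalar resource (e.g. "cpus" or "mem") from an offer's resource list,
  // preferring resources reserved for `role` and then unreserved ("*") resources.
  def scalarForRole(resources: java.util.List[Resource], name: String, role: String): Double = {
    val matching = resources.asScala.filter(_.getName == name)
    val (reserved, rest) = matching.partition(_.getRole == role)
    val unreserved = rest.filter(_.getRole == "*")
    (reserved ++ unreserved).map(_.getScalar.getValue).sum
  }
}
```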

@tnachen
Contributor Author

tnachen commented Mar 17, 2015

@realoptimal Can you try again? I've updated the PR to include tests, and I've tested it myself in fine-grained and coarse-grained mode; it is correctly launching the tasks.

@SparkQA

SparkQA commented Mar 17, 2015

Test build #28715 has finished for PR 4960 at commit eba5669.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@realoptimal

@tnachen You are much faster than I am; I was working on modifying your previous code to dole out reserved offers first and then use unreserved offers (those with role="*"), but you beat me to it.

Tested your code with my configuration, and it works when spark.mesos.role is specified in the spark-defaults.conf file. Not sure why it won't pick it up via --conf on the command line, but I think that is an issue with SparkSubmit, not with your code changes.

@tnachen
Contributor Author

tnachen commented Mar 17, 2015

@realoptimal thanks for testing this. We can file another issue for the --conf problem, and I'm sure there are more rough edges to smooth out. @andrewor14 can you take a look at this PR?

@SparkQA

SparkQA commented Mar 22, 2015

Test build #28958 timed out for PR 4960 at commit a5d3e8d after a configured wait of 120m.

@andrewor14
Contributor

retest this please. This needs to be rebased to master

<td>Role for the Spark framework</td>
<td>
Set the role of this Spark framework for Mesos. Roles are used in Mesos for reservations
and resource weigth sharing.

weight

@tnachen
Contributor Author

tnachen commented Jul 14, 2015

retest this please

@SparkQA

SparkQA commented Jul 14, 2015

Test build #14 has finished for PR 4960 at commit 32da933.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 14, 2015

Test build #37220 has finished for PR 4960 at commit 32da933.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tnachen
Contributor Author

tnachen commented Jul 16, 2015

@andrewor14 I think I figured out what's going on with the test. It's hard to figure out from the logs since it was just hitting a System.exit(1). Hopefully this goes through!

@SparkQA

SparkQA commented Jul 16, 2015

Test build #37431 has finished for PR 4960 at commit 0f9f03e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -183,6 +186,18 @@ private[spark] class MesosSchedulerBackend(

override def reregistered(d: SchedulerDriver, masterInfo: MasterInfo) {}

def getTasksSummary(tasks: JArrayList[MesosTaskInfo]): String = {

I'll make this private when I merge. This is only used in 1 place.
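For illustration only, a hypothetical sketch of the helper once made private; the enclosing class stands in for MesosSchedulerBackend, and the summary format is assumed rather than taken from the PR.

```scala
import java.util.{ArrayList => JArrayList}
import scala.collection.JavaConverters._
import org.apache.mesos.Protos.{TaskInfo => MesosTaskInfo}

class TasksSummarySketch {
  // Build a short, log-friendly description of the tasks about to be launched.
  private def getTasksSummary(tasks: JArrayList[MesosTaskInfo]): String = {
    tasks.asScala
      .map(t => s"Task id: ${t.getTaskId.getValue}, name: ${t.getName}")
      .mkString("; ")
  }
}
```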

@andrewor14
Contributor

LGTM I'm merging this into master. Thanks @tnachen!

@asfgit asfgit closed this in d86bbb4 Jul 17, 2015
@reachbach

@tnachen @andrewor14 thanks for getting this done! Much needed for mesos deployments.

@adam-mesos

Thanks! 🎉 🍰 ✌️

@tnachen tnachen deleted the mesos_fw_auth branch July 17, 2015 08:04
@samklr

samklr commented Jul 20, 2015

Just discovering this. Great job, guys!

asfgit pushed a commit that referenced this pull request Sep 10, 2015
…or.cores

This is a regression introduced in #4960, this commit fixes it and adds a test.

tnachen andrewor14 please review, this should be an easy one.

Author: Iulian Dragos <jaguarul@gmail.com>

Closes #8653 from dragos/issue/mesos/fine-grained-maxExecutorCores.
asfgit pushed a commit that referenced this pull request Sep 10, 2015
…or.cores

This is a regression introduced in #4960, this commit fixes it and adds a test.

tnachen andrewor14 please review, this should be an easy one.

Author: Iulian Dragos <jaguarul@gmail.com>

Closes #8653 from dragos/issue/mesos/fine-grained-maxExecutorCores.

(cherry picked from commit f0562e8)
Signed-off-by: Andrew Or <andrew@databricks.com>
@ohal

ohal commented Sep 21, 2015

@tnachen
I'm trying to get the roles implementation working without any luck; can you please point out what should be configured?
I have a cluster with Mesos 0.22.1, Hadoop 2.5.0, 3 masters, and 8 slaves: 3 with role fast, 3 with role "*", and 2 with other roles.
Spark 1.5.0 is deployed in cluster mode with the dispatcher, which looks to be working; however, I cannot get jobs running on the particular nodes with the role configured.
The job is submitted as # bin/spark-submit --class org.apache.spark.examples.SparkPi --master master.novalocal:7077 --deploy-mode cluster --conf spark.mesos.role=fast http://www/spark.jar 10
I've also tried to configure Spark defaults for submitted jobs and added the line spark.mesos.role fast to /opt/spark/conf/spark-defaults.conf.
In either case, after submitting, the job is staging/running on the "*" nodes, not on the fast nodes.
Can you please describe how to configure it properly?

@adam-mesos

Oleh, did you specify --roles=fast,foo,bar when you started the Mesos master?


@ohal

ohal commented Sep 21, 2015

yes, sure, all roles are configured on masters as MESOS_ROLES=fast,big,kafka

@ohal

ohal commented Sep 21, 2015

also, it looks like the dispatcher does not receive resource offers other than the "*" role:

15/09/21 17:40:54 TRACE MesosClusterScheduler: Received offers from Mesos:
Offer id: 20150911-070321-538888970-5050-1632-O282247, cpu: 8.0, mem: 14863.0
Offer id: 20150911-070321-538888970-5050-1632-O282248, cpu: 8.0, mem: 14863.0
Offer id: 20150911-070321-538888970-5050-1632-O282249, cpu: 8.0, mem: 14863.0   

@tnachen
Contributor Author

tnachen commented Sep 21, 2015

Hi @ohal, you'll need to set spark.mesos.role when you launch the dispatcher, which you can do by setting it in spark-defaults.conf in the conf dir; it will automatically be loaded when you run the dispatcher.


@ohal

ohal commented Sep 21, 2015

actually, it was done before:

# cat conf/spark-defaults.conf
spark.master                     spark://master.novalocal:7077
spark.eventLog.enabled           true
spark.mesos.role                 fast

when the job is starting I get some output:

...
  "mainClass" : "org.apache.spark.examples.SparkPi",
  "sparkProperties" : {
    "spark.jars" : "http://www/spark.jar",
    "spark.driver.supervise" : "false",
    "spark.app.name" : "org.apache.spark.examples.SparkPi",
    "spark.mesos.role" : "fast",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://master.novalocal:7077"
  }
...

@tnachen
Contributor Author

tnachen commented Sep 21, 2015

Those are the Spark properties for the job, but they might not apply to the dispatcher.

The easiest way to check is to go to the Mesos UI, look at the dispatcher framework, and see whether it has the correct role.

Tim


@ohal

ohal commented Sep 21, 2015

looks like the driver starts with some defaults (cpu, mem); below is from the Spark UI:
State: TASK_ERROR, Message: Task uses more resources cpus(*):1; mem(*):1024 than available cpus(fast):8; mem(fast):14863; disk(fast):45243; ports(fast):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-65535], Source: SOURCE_MASTER, Reason: REASON_TASK_INVALID, Time: 1.442868857232848E9

@tnachen
Contributor Author

tnachen commented Sep 21, 2015

Ah, this is indeed a bug; the multiple-roles logic that's in the coarse-grained and fine-grained schedulers needs to be ported to the cluster scheduler. Will fix this asap
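To illustrate the mismatch in the TASK_ERROR above: the driver task declares its resources against role "*", while the offer only carries resources reserved for role "fast". A hedged sketch of declaring a resource against the offer's role instead; the object and helper names are hypothetical.

```scala
import org.apache.mesos.Protos.{Resource, Value}

object RoleAwareResources {
  // Build a CPU resource tagged with the role the offer's resources are reserved for,
  // instead of hard-coding "*" as the cluster scheduler currently does.
  def cpuResource(amount: Double, role: String): Resource =
    Resource.newBuilder()
      .setName("cpus")
      .setType(Value.Type.SCALAR)
      .setScalar(Value.Scalar.newBuilder().setValue(amount))
      .setRole(role)
      .build()
}

// e.g. RoleAwareResources.cpuResource(1.0, "fast") would match cpus(fast) in the offer above.
```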

@AndriiOmelianenko

@tnachen Hi Tim,
any update here?

@tnachen
Contributor Author

tnachen commented Oct 5, 2015

Hi @AndriiOmelianenko, I have a PR out to fix that here #8872
