[SPARK-2080] Yarn: report HS URL in client mode, correct user in cluster mode. #1002
Yarn client mode was not setting the app's tracking URL to the History Server's URL when configured by the user. Now client mode behaves the same as cluster mode. In SparkContext.scala, the "user.name" system property had precedence over the SPARK_USER environment variable. This means that SPARK_USER was never used, since "user.name" is always set by the JVM. In Yarn cluster mode, this means the application always reported itself as being run by user "yarn" (or whatever user was running the Yarn NM). One could argue that the correct fix would be to use UGI.getCurrentUser() here, but at least for Yarn that will match what SPARK_USER is set to.
Note: tested with yarn-client, yarn-cluster and local; checked the Spark user listed in the history server for each app.
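To make the precedence issue described above concrete, here is a minimal sketch of the two lookup orders. It is not the patch itself: the object name, the function names, the `UnknownUser` fallback, and the use of plain maps in place of the JVM system properties and process environment are all assumptions made for illustration.

```scala
object SparkUserSketch {
  // Hypothetical fallback label for the case where neither source is set.
  val UnknownUser = "<unknown>"

  // Order described as the old behavior: "user.name" is consulted first.
  // Because the JVM always sets it, SPARK_USER is effectively never used.
  def oldOrder(props: Map[String, String], env: Map[String, String]): String =
    props.get("user.name").orElse(env.get("SPARK_USER")).getOrElse(UnknownUser)

  // Order described as the fix: SPARK_USER wins when it is set,
  // and "user.name" is only a fallback.
  def newOrder(props: Map[String, String], env: Map[String, String]): String =
    env.get("SPARK_USER").orElse(props.get("user.name")).getOrElse(UnknownUser)
}
```

With `user.name` set to "yarn" (the user running the NM) and `SPARK_USER` set to the submitting user, `oldOrder` yields "yarn" while `newOrder` yields the submitting user. In a real process the inputs could be taken from `sys.props.toMap` and `sys.env`.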
Can one of the admins verify this patch?
Note that this is JIRA SPARK-1291.
Sorry, I misunderstood; this is not SPARK-1291. Can you please file a JIRA?
Can you also please update this for yarn-alpha?
yarn-alpha seems to already do it:
I think that is the ApplicationMaster; I was referring to the ExecutorLauncher for yarn-alpha:
def finishApplicationMaster(status: FinalApplicationStatus) {
}
We should add a call to finishReq.setTrackingUrl(...).
D'oh. Will update shortly.
@@ -297,7 +297,7 @@ class SparkContext(config: SparkConf) extends Logging {

  // Set SPARK_USER for user who is running SparkContext.
  val sparkUser = Option {
    Option(System.getProperty("user.name")).getOrElse(System.getenv("SPARK_USER"))
@jerryshao I think you originally added this logic. Is there a use case where this order is required? It always seemed odd to me, since I expect user.name to always be set.
Hi @tgravescs, originally I put SPARK_USER before user.name when I submitted a PR, and someone suggested that I change the order to stay consistent with other Spark parameters, so I changed it. I assume it's OK in standalone and Mesos mode, but I didn't test it in Yarn mode.
It looks good to me to change this order if needed. :)
BTW, just ran into this in SecurityManager.scala:
So that probably needs the same treatment.
(Oh wait, that's adding both user.name and SPARK_USER to the default users list? Maybe it works, but it looks strange.)
Yeah, the SecurityManager is adding both to handle the case where the containers run as one user but access HDFS as whoever is specified in SPARK_USER; it also came up in the review. @vanzin can you tell me your yarn setup where it was showing yarn as the user in the web UI? I have a single-node setup with daemons running as yarn, and when I launch a job as myself it shows up as myself (without your change). Are you running your client as a superuser and then specifying SPARK_USER to access HDFS files?
Also note that yarn does set SPARK_USER from UGI in ClientBase: env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName()
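As a rough illustration of why adding both names can make sense, the sketch below builds a default users set from both sources. This is not the code in SecurityManager.scala; the object and function names are hypothetical, and plain maps stand in for the JVM system properties and process environment.

```scala
object DefaultAclUsersSketch {
  // Collects both identities into one set: the user the container process
  // runs as ("user.name") and the user named in SPARK_USER. Unset or empty
  // values are skipped, and a set removes the duplicate when both agree.
  def defaultUsers(props: Map[String, String], env: Map[String, String]): Set[String] =
    Seq(props.get("user.name"), env.get("SPARK_USER"))
      .flatten
      .filter(_.nonEmpty)
      .toSet
}
```

When the container runs as "yarn" and SPARK_USER names the submitter, the set holds both; when they are the same user, it holds a single entry.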
@tgravescs I have a (non-secured) yarn cluster (where daemons run as "yarn"), and I was launching processes from my own machine (not in the cluster). I'm using the event logger to write app logs to HDFS. In client mode, my machine is the driver, so my user gets correctly written to the app log. But in cluster mode the AM is launched as the same user as the node manager (since it's not a kerberized setup), so the Spark user written to the app log is the user running the node manager, which is "yarn", and not myself.
(Ah, to clarify, I wasn't looking at the web UI, but at the history server UI. It should show more or less the same info, though, as far as I can tell.)
Ah ok, that makes more sense. The history UI just displays what Spark recorded as the user in the application start event. When the daemons are running as user yarn without security, it appears this is being put in as user yarn: {"Event":"SparkListenerApplicationStart","App Name":"JavaWordCount","Timestamp":1402600427658,"User":"yarn"} I'm a little bit surprised that the change you made fixes that for multiple users. Is SPARK_USER set to your user when you start the history server? Does it work with multiple users? We should probably change it to put the correct user in the event log. Or perhaps your change is fixing it so that it gets logged properly?
OK, so looking at it more, that is what is happening: postApplicationStart just uses sparkUser, and since you changed it, it is now logging the right thing. So ignore my questions.
Changes look good. Note that this only fixes the URL for unsecured YARN clusters. Secure clusters generally use the web app proxy, which requires more setup and will be handled under SPARK-1291. Thanks @vanzin
…ter mode.
Yarn client mode was not setting the app's tracking URL to the History Server's URL when configured by the user. Now client mode behaves the same as cluster mode.
In SparkContext.scala, the "user.name" system property had precedence over the SPARK_USER environment variable. This means that SPARK_USER was never used, since "user.name" is always set by the JVM. In Yarn cluster mode, this means the application always reported itself as being run by user "yarn" (or whatever user was running the Yarn NM). One could argue that the correct fix would be to use UGI.getCurrentUser() here, but at least for Yarn that will match what SPARK_USER is set to.
Author: Marcelo Vanzin <vanzin@cloudera.com>
This patch had conflicts when merged, resolved by
Committer: Thomas Graves <tgraves@apache.org>
Closes #1002 from vanzin/yarn-client-url and squashes the following commits:
4046e04 [Marcelo Vanzin] Set HS link in yarn-alpha also.
4c692d9 [Marcelo Vanzin] Yarn: report HS URL in client mode, correct user in cluster mode.
Also committed to branch-1