[SPARK-12893][YARN] Fix history URL redirect error in yarn-cluster mode #10821
Test build #49637 has finished for PR 10821 at commit
|
I've been using |
I see, thanks. |
..but if you can't get at it, copy & paste is probably the best way to make do. Make the new method private[spark] and we could add a test in the history server to verify they give the same answer, always |
Test build #49693 has finished for PR 10821 at commit
|
Can you add more description here? is this only on master branch (2.0)? |
@tgravescs , the URL here is not correct: the RM web UI redirects to a history server URL that is not accessible, and this patch fixes that. The bug affects not only the master branch but also older versions. |
hmm, so my confusion is that this works just fine for me on Spark 1.6 running on yarn cluster. Are you running secure or insecure Hadoop? Are you using the RM proxy? I see where the code looks like it would have the wrong link, but somehow I'm getting the correct link, so I want to understand the problem fully. Also note that when I manually go to the link http://historyserver:18080/history/application_1455663314062_13672/1 it just redirects to http://historyserver:18080/history/application_1455663314062_13672/jobs/ when the application does not have multiple attempts, but when an application does have multiple attempts it goes to /1 or /2 properly. Note I'm running on secure Hadoop using the RM proxy. I also see you are getting the doubled URL with history/appid/1/history/appid/1; not sure why that is. I remember seeing something like this a long time ago but thought it was fixed. If you manually go to history/appid/1, what happens? |
Hi @tgravescs , I don't have security enabled, and I don't have any specific RM proxy settings or a proxy server; everything uses the default settings. I just made a fresh build against the latest Spark master to test this problem. In yarn-client mode, the URLs are redirected to the right application page and can be accessed. But in yarn-cluster mode, this URL (http://localhost:18080/history/application_1456728554477_0006/1/jobs/) cannot be accessed; here is the screenshot: But when it is changed to (http://localhost:18080/history/application_1456728554477_0006/**appattempt_1456728554477_0006_000001**/jobs/) it can be accessed. Have you tested with yarn-cluster mode? |
/1 and /2 are not actually attempt ids, so how can they be converted to a valid attempt id? In the code of the history server:

    private val loaderServlet = new HttpServlet {
      protected override def doGet(req: HttpServletRequest, res: HttpServletResponse): Unit = {
        // Parse the URI created by getAttemptURI(). It contains an app ID and an optional
        // attempt ID (separated by a slash).
        val parts = Option(req.getPathInfo()).getOrElse("").split("/")
        if (parts.length < 2) {
          res.sendError(HttpServletResponse.SC_BAD_REQUEST,
            s"Unexpected path info in request (URI = ${req.getRequestURI()}")
          return
        }
        val appId = parts(1)
        val attemptId = if (parts.length >= 3) Some(parts(2)) else None

The GET URL is split into two fields (the app ID and the optional attempt ID). There are also some logs in the history server complaining about this:
|
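To make the splitting step in the servlet excerpt above concrete, here is a minimal standalone sketch of the same logic. The object name `HistoryPathSketch` and the helper `parse` are hypothetical, not part of Spark; the sketch only assumes that the path info has the shape `/<appId>/<attemptId>/...` described in the quoted comment.

```scala
object HistoryPathSketch {
  // Hypothetical helper mirroring the split logic of the quoted servlet code.
  // Returns None where the servlet would answer with SC_BAD_REQUEST.
  def parse(pathInfo: String): Option[(String, Option[String])] = {
    // A leading "/" yields an empty first element, so the app ID is parts(1).
    val parts = Option(pathInfo).getOrElse("").split("/")
    if (parts.length < 2) {
      None
    } else {
      val appId = parts(1)
      val attemptId = if (parts.length >= 3) Some(parts(2)) else None
      Some((appId, attemptId))
    }
  }
}
```

With the failing URL from the comment above, the path info `/application_1456728554477_0006/1/jobs/` parses to the app ID `application_1456728554477_0006` and the attempt ID `1`, so the history server would look up an attempt named `1` rather than `appattempt_1456728554477_0006_000001`.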
If you get a link like appId/1 then it means that the web UI/Spark doesn't have an instance ID; that's the default "single" link. So the issue is how the FsHistoryProvider is coming up with the wrong attempt ID. Either it's not in the history as saved, or it's not being picked up during parsing. Assuming this is yarn-client mode, an attempt ID of "1" is potentially valid. This is the FsHistoryProvider, right? |
Hi @steveloughran , yes, it uses FsHistoryProvider. In the client mode, actually You could simply run a Spark application in yarn-cluster mode and access the URL either from the RM's web UI or directly from the history server's web UI; you'll see the difference. |
OK...so the question is "where is the 1 coming from" |
@steveloughran , the "1" here is the number of attempts, and it is used to generate a URL here. Also, in the YARN code this "1" or "2" is obtained from the attempt id. A URL built with "1" or "2" as the attempt id is not accessible in my local test. |
@tgravescs , I think this PR fixes the same issue as #11518. The difference is that this one covers access from the RM's web UI, where the URL should stay consistent. Would you please review it again? Thanks a lot. |
Test build #52722 has finished for PR 10821 at commit
|
So at this point this isn't really changing anything, right? Except not setting the attempt id when not in cluster mode? But that was handled by the redirect in the history server anyway. Otherwise I'm not following how the attemptId would be wrong. In cluster mode the attemptId should be set to something. It's getting it directly from the containerId that is set on your YARN cluster. If you don't have a containerId then I assume you aren't running on YARN or YARN has a bug. Which version of Hadoop are you using? Also, are you sure the history server just hasn't loaded the file yet? |
Hi @tgravescs , here in the original code:
This will get the attempt id as "1" or "2"... And finally the history server url is: Actually history server's expected attempt id is: appattempt_1455663314062_13672_000001, and the right url for accessing history server should be: So here I change to |
I just tried locally and I see that something seems to have changed in 2.0. This is an example of an event log file name generated by 1.6 in my cluster: And this is an example of one generated by 2.0: Can you confirm that your change restores the behavior that existed in 1.6 (and actually explain that in the change summary)? The new ultra verbose attempt id is not necessary - it contains the same information as the application id, plus a counter at the end. |
Hi @vanzin , thanks a lot for your response. I just checked branch-1.6, and it looks like the behavior (attempt id) has indeed changed; the change was introduced in #9182. Originally And in But this behavior is changed in master branch. Here we use the full So this affects not only the file name of the event log but also the history server URL of each application. Here if we accept the new way of What's your opinion? @vanzin . |
I think we should do that. The verbose attempt ID does not add any information on top of app ID + counter, and looks very ugly for users. |
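The point that the verbose attempt ID carries nothing beyond the app ID plus a counter can be illustrated with a small string-only sketch. The helper names below are hypothetical and are not Spark or YARN APIs; the sketch assumes the verbose form is `appattempt_` followed by the numeric part of the application ID and a zero-padded six-digit counter, as in the `appattempt_1455663314062_13672_000001` example quoted in this thread.

```scala
object AttemptIdSketch {
  // Hypothetical: build the verbose form from an application ID and a counter.
  def toVerbose(appId: String, counter: Int): String = {
    val base = appId.stripPrefix("application_")
    f"appattempt_${base}_$counter%06d"
  }

  // Hypothetical: recover the short counter ("1", "2", ...) from the verbose form.
  def toCounter(verbose: String): Int =
    verbose.substring(verbose.lastIndexOf('_') + 1).toInt

  // Hypothetical: recover the application ID from the verbose form.
  def toAppId(verbose: String): String = {
    val withoutCounter = verbose.substring(0, verbose.lastIndexOf('_'))
    "application_" + withoutCounter.stripPrefix("appattempt_")
  }
}
```

Because the mapping is reversible in both directions, either form identifies the same attempt; the thread's concern is only that the URL generator and the history server agree on which form is used.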
I'm not sure how that crept in on the patch; it wasn't something intentional.
BTW, some bits of code related to splitting up attempts, container IDs and the like have proven to be brittle in the past across Hadoop versions; if someone is trying to break things up, they should test across Hadoop 2.2-2.6+
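As an illustration of why such splitting can be brittle, here is a hedged sketch that extracts the attempt counter from a container ID string. It rests on assumptions not stated in this thread: that container IDs look like `container_<clusterTimestamp>_<appNumber>_<attempt>_<container>`, and that newer Hadoop versions may insert an extra epoch field such as `e17` after the `container_` prefix. The helper name is hypothetical, and real code would normally rely on the YARN record classes rather than string parsing.

```scala
import scala.util.Try

object ContainerIdSketch {
  // Hypothetical helper: read the attempt counter by counting fields from the
  // right, so an optional epoch field near the front does not shift the index.
  def attemptCounter(containerId: String): Option[Int] = {
    val parts = containerId.split("_")
    if (parts.length < 5 || parts(0) != "container") {
      None
    } else {
      // Second-to-last field is assumed to be the attempt counter.
      Try(parts(parts.length - 2).toInt).toOption
    }
  }
}
```

Indexing from the left (for example `parts(3)`) would return the application number instead of the attempt counter whenever the assumed epoch field is present, which is the kind of cross-version breakage the comment above warns about.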
Fix RM redirect to wrong history URL issue, details can be seen in SPARK-12893.
Please review, CC @vanzin @steveloughran , thanks a lot.