Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-12893][YARN] Fix history URL redirect error in yarn-cluster mode #10821

Closed
wants to merge 2 commits into from

Conversation

jerryshao
Copy link
Contributor

Fix RM redirect to wrong history URL issue, details can be seen in SPARK-12893.

Please review, CC @vanzin @steveloughran , thanks a lot.

@SparkQA
Copy link

SparkQA commented Jan 19, 2016

Test build #49637 has finished for PR 10821 at commit a853933.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Copy link
Contributor

I've been using HistoryServer.getAttemptURI() to do this mapping in the timeline integration -but that method is private[history]. The code here appears to generate the same path. Of course, making that history server method private[spark] would be the best approach if a single codepath is what you want

@jerryshao
Copy link
Contributor Author

I see, thanks.

@steveloughran
Copy link
Contributor

..but if you can't get at it, copy & paste is probably the best way to make do. Make the new method private[spark] and we could add a test in the history server to verify they give the same answer, always

@SparkQA
Copy link

SparkQA commented Jan 19, 2016

Test build #49693 has finished for PR 10821 at commit a503bf7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Copy link
Contributor

@vanzin

@tgravescs
Copy link
Contributor

Can you add more description here? is this only on master branch (2.0)?

@jerryshao
Copy link
Contributor Author

@tgravescs , the URL here is not correct, using this URL to redirect from RM web UI to history server is not accessible, so the patch is going to fix it.

This is a bug for not only in master branch, but also for old version.

@tgravescs
Copy link
Contributor

hmm, so my confusion is this works just fine for me on spark 1.6 running on yarn cluster.

Are you running secure hadoop or in-secure? Are you using the RM proxy?

I see where the code looks like it would have the wrong link but somehow I'm getting the correct link so I want to understand the problem fully.

Also note that when I manually go to the link with http://historyserver:18080/history/application_1455663314062_13672/1 it just redirects to http://historyserver:18080/history/application_1455663314062_13672/jobs/ when I don't have an application with multiple attempts, but when I do have an application with multiple attempts then it goes to the /1 or /2 properly.

Note I'm running on secure hadoop using the RM proxy.

I also see you are getting the double url with history/appid/1/history/appid/1, not sure why that is. I remember I'd seen something like this a long time ago but thought that was fixed.

If you manually go to the history/appid/1 what happens?

@jerryshao
Copy link
Contributor Author

Hi @tgravescs , I don't enable security, and don't have specific rm proxy setting and proxy server. All use the default settings.

I just made a fresh building against latest Spark master to test this problem.

In the yarn-client mode, urls can be redirected and accessed to the right application page. But for the yarn-cluster mode, this url (http://localhost:18080/history/application_1456728554477_0006/1/jobs/) is failed to access, here is the screenshot:

screen shot 2016-02-29 at 3 57 32 pm

But when changing to (http://localhost:18080/history/application_1456728554477_0006/**appattempt_1456728554477_0006_000001**/jobs/) then it can be accessed.

Have you tested with yarn-cluster mode?

@jerryshao
Copy link
Contributor Author

/1, /2 is not actually an attempt id, so how to convert this to an valid attempt id?

In the code of history server:

 private val loaderServlet = new HttpServlet {
    protected override def doGet(req: HttpServletRequest, res: HttpServletResponse): Unit = {
      // Parse the URI created by getAttemptURI(). It contains an app ID and an optional
      // attempt ID (separated by a slash).
      val parts = Option(req.getPathInfo()).getOrElse("").split("/")
      if (parts.length < 2) {
        res.sendError(HttpServletResponse.SC_BAD_REQUEST,
          s"Unexpected path info in request (URI = ${req.getRequestURI()}")
        return
      }

      val appId = parts(1)
      val attemptId = if (parts.length >= 3) Some(parts(2)) else None

The GET url will be split into two fields, appId and attemptId, so if the URL you mentioned is accessible, attemptId will be 1 or 2, am not sure how to use this to get the right event log?

And there's some logs in history server also complained about this thing:

16/02/29 16:47:02 INFO ApplicationCache: Failed to load application attempt application_1456728554477_0006/Some(1)

@steveloughran
Copy link
Contributor

If you get a link like appId/1 then it means that the web UI/spark doesn't have an instance ID; that's the default "single" link. So the issue is how is the FsHistoryProvider coming up with the wrong attempt Id. Either it's not in the history as saved, or its not being picked up during parse.

Assuming this is yarn client mode, then an attempt ID of "1" is potentially valid.

1 This is the FsHistoryProvider, right?
2. Send me a link to a history and I'll have a look at it (actually, I was thinking of having a dump command which would just dump a history replay...)

@jerryshao
Copy link
Contributor Author

Hi @steveloughran , yes it uses FsHistoryProvider.

In the client mode, actually attemptId is None, and any URL can be redirected to the right one, so "1" is finally redirected. But this is not worked for yarn cluster mode according to my test.

You could simply run a spark application with yarn-cluster mode. Access the URL either from RM's web UI, or directly from history server's web UI. you'll find out the difference.

@steveloughran
Copy link
Contributor

OK...so the question is "where is the 1 coming from"

@jerryshao
Copy link
Contributor Author

@steveloughran , here "1" is the number of attempts here, and it used to generate a URL here. Also in the yarn code, this "1" or "2" is gotten from attempt id.

This "1" or "2" as attempt id to concatenate the URL is not accessable in my local test.

@jerryshao
Copy link
Contributor Author

@tgravescs , I think here this PR fix the same issue as in #11518 , the difference here is to access from RM's web UI, the URL should keep consistent, would you please help to review again, thanks a lot.

@SparkQA
Copy link

SparkQA commented Mar 9, 2016

Test build #52722 has finished for PR 10821 at commit c98c3b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor

So at this point this isn't really changing anything right? Except not to set attempt is not in cluster mode? But that was handled by redirect in history server anyway. Otherwise I'm not following how the attemptId would be wrong. In cluster mode the attemptid should be set to something. Its getting it directly from the containerid that is set on your yarn cluster. If you don't have containerid then I assume you aren't running on yarn or yarn has a bug.

Which version of hadoop are you using?

Also are you sure the history server just hasn't loaded the file yet?

@jerryshao
Copy link
Contributor Author

Hi @tgravescs , here in the original code:

val attemptId = client.getAttemptId().getAttemptId().toString()

This will get the attempt id as "1" or "2"...

And finally the history server url is: xxx/history/application_1455663314062_13672/1. In history server this "1" will be parsed as attempt id to find out the cached event log. But since here the attempt id is not a valid one, so this url will be failed to access.

Actually history server's expected attempt id is: appattempt_1455663314062_13672_000001, and the right url for accessing history server should be: xxx/history/application_1455663314062_13672/appattempt_1455663314062_13672_000001.

So here I change to client.getAttemptId().toString() to get the full attempt id. That's the point of this fix.

@vanzin
Copy link
Contributor

vanzin commented Mar 10, 2016

I just tried locally and I see that something seems to have changed in 2.0.

This is an example of an event log file name generated by 1.6 in my cluster:
/user/spark/applicationHistory/application_1456510586527_0005_1

And this is an example of one generated by 2.0:
/user/spark/applicationHistory/application_1457635277787_0001_appattempt_1457635277787_0001_000001

Can you confirm that you change restores the behavior to the one that existed in 1.6 (and actually explain that in the change summary)? The new ultra verbose attempt id is not necessary - it contains the same information as the application id, plus a counter at the end.

@jerryshao
Copy link
Contributor Author

Hi @vanzin , thanks a lot for your response. I just checked the branch-1.6, looks like the behavior (attempt id) is actually changed, and this change is introduced in #9182.

Originally attemptId is gotten from spark.yarn.app.attemptId which is set in ApplicationMaster.

And in ApplicationMaster, the way to get attemptId is appAttemptId.getAttemptId().toString(), so here the attemptId is "1", "2".

But this behavior is changed in master branch. Here we use the full attemptId rather than attempt counter here.

So this affects not only the file name of event log, also the url of history server's each application.

Here if we accept the new way of attempId, then this url link should be updated to the new one. Oppositely if we treat this new way of attemptId as a regression, then there's no issue here, all we should change is to loop back the original attemptId.

What's your opinion? @vanzin .

@vanzin
Copy link
Contributor

vanzin commented Mar 14, 2016

all we should change is to loop back the original attemptId.

I think we should do that. The verbose attempt ID does not add any information on top of app ID + counter, and looks very ugly for users.

@steveloughran
Copy link
Contributor

I'm not sure how that crept in on the patch; it wasn't something intentional.

  1. It is needed for the spark timeline stuff —but that can be handled in history load code of the YarnHistoryProvider
  2. Having a simple "1", "2", is actually useful, especially in the REST API; you don't need to know the attempt Id to know that /1 is for the first attempt; /2 for the second
  3. Really to test link validity there should be something in the web UI, though being JSON that's too complex now. Hitting the REST API should suffice

BTW, some bits of code related to splitting up attempts, container IDs and such like has proven to be brittle in the past across Hadoop versions; if someone is trying to break things up, they should test across Hadoop 2.2-2.6+

@jerryshao
Copy link
Contributor Author

I will close this PR, since there's no issue here, it is due to the regression of #9182, will fix that instead.

Also #11518 should be reverted, since that one is the same url link issue, should fix the attempt id instead.

Thanks a lot for your review.

@jerryshao jerryshao closed this Mar 15, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants