[SPARK-12893][YARN] Fix history URL redirect error in yarn-cluster mode #10821

jerryshao · 2016-01-18T23:58:52Z

Fix RM redirect to wrong history URL issue, details can be seen in SPARK-12893.

Please review, CC @vanzin @steveloughran , thanks a lot.

SparkQA · 2016-01-19T00:34:43Z

Test build #49637 has finished for PR 10821 at commit a853933.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

steveloughran · 2016-01-19T01:42:56Z

I've been using HistoryServer.getAttemptURI() to do this mapping in the timeline integration -but that method is private[history]. The code here appears to generate the same path. Of course, making that history server method private[spark] would be the best approach if a single codepath is what you want

jerryshao · 2016-01-19T01:47:11Z

I see, thanks.

steveloughran · 2016-01-19T18:25:20Z

..but if you can't get at it, copy & paste is probably the best way to make do. Make the new method private[spark] and we could add a test in the history server to verify they give the same answer, always

SparkQA · 2016-01-19T21:26:49Z

Test build #49693 has finished for PR 10821 at commit a503bf7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

andrewor14 · 2016-02-01T22:01:26Z

@vanzin

tgravescs · 2016-02-24T22:53:11Z

Can you add more description here? is this only on master branch (2.0)?

jerryshao · 2016-02-25T00:34:36Z

@tgravescs , the URL here is not correct, using this URL to redirect from RM web UI to history server is not accessible, so the patch is going to fix it.

This is a bug for not only in master branch, but also for old version.

tgravescs · 2016-02-25T14:46:25Z

hmm, so my confusion is this works just fine for me on spark 1.6 running on yarn cluster.

Are you running secure hadoop or in-secure? Are you using the RM proxy?

I see where the code looks like it would have the wrong link but somehow I'm getting the correct link so I want to understand the problem fully.

Also note that when I manually go to the link with http://historyserver:18080/history/application_1455663314062_13672/1 it just redirects to http://historyserver:18080/history/application_1455663314062_13672/jobs/ when I don't have an application with multiple attempts, but when I do have an application with multiple attempts then it goes to the /1 or /2 properly.

Note I'm running on secure hadoop using the RM proxy.

I also see you are getting the double url with history/appid/1/history/appid/1, not sure why that is. I remember I'd seen something like this a long time ago but thought that was fixed.

If you manually go to the history/appid/1 what happens?

jerryshao · 2016-02-29T08:03:27Z

Hi @tgravescs , I don't enable security, and don't have specific rm proxy setting and proxy server. All use the default settings.

I just made a fresh building against latest Spark master to test this problem.

In the yarn-client mode, urls can be redirected and accessed to the right application page. But for the yarn-cluster mode, this url (http://localhost:18080/history/application_1456728554477_0006/1/jobs/) is failed to access, here is the screenshot:

But when changing to (http://localhost:18080/history/application_1456728554477_0006/**appattempt_1456728554477_0006_000001**/jobs/) then it can be accessed.

Have you tested with yarn-cluster mode?

jerryshao · 2016-02-29T08:54:41Z

/1, /2 is not actually an attempt id, so how to convert this to an valid attempt id?

In the code of history server:

 private val loaderServlet = new HttpServlet {
    protected override def doGet(req: HttpServletRequest, res: HttpServletResponse): Unit = {
      // Parse the URI created by getAttemptURI(). It contains an app ID and an optional
      // attempt ID (separated by a slash).
      val parts = Option(req.getPathInfo()).getOrElse("").split("/")
      if (parts.length < 2) {
        res.sendError(HttpServletResponse.SC_BAD_REQUEST,
          s"Unexpected path info in request (URI = ${req.getRequestURI()}")
        return
      }

      val appId = parts(1)
      val attemptId = if (parts.length >= 3) Some(parts(2)) else None

The GET url will be split into two fields, appId and attemptId, so if the URL you mentioned is accessible, attemptId will be 1 or 2, am not sure how to use this to get the right event log?

And there's some logs in history server also complained about this thing:

16/02/29 16:47:02 INFO ApplicationCache: Failed to load application attempt application_1456728554477_0006/Some(1)

steveloughran · 2016-02-29T11:49:01Z

If you get a link like appId/1 then it means that the web UI/spark doesn't have an instance ID; that's the default "single" link. So the issue is how is the FsHistoryProvider coming up with the wrong attempt Id. Either it's not in the history as saved, or its not being picked up during parse.

Assuming this is yarn client mode, then an attempt ID of "1" is potentially valid.

1 This is the FsHistoryProvider, right?
2. Send me a link to a history and I'll have a look at it (actually, I was thinking of having a dump command which would just dump a history replay...)

jerryshao · 2016-02-29T12:36:51Z

Hi @steveloughran , yes it uses FsHistoryProvider.

In the client mode, actually attemptId is None, and any URL can be redirected to the right one, so "1" is finally redirected. But this is not worked for yarn cluster mode according to my test.

You could simply run a spark application with yarn-cluster mode. Access the URL either from RM's web UI, or directly from history server's web UI. you'll find out the difference.

steveloughran · 2016-02-29T13:45:56Z

OK...so the question is "where is the 1 coming from"

jerryshao · 2016-03-01T05:20:43Z

@steveloughran , here "1" is the number of attempts here, and it used to generate a URL here. Also in the yarn code, this "1" or "2" is gotten from attempt id.

This "1" or "2" as attempt id to concatenate the URL is not accessable in my local test.

jerryshao · 2016-03-09T04:08:41Z

@tgravescs , I think here this PR fix the same issue as in #11518 , the difference here is to access from RM's web UI, the URL should keep consistent, would you please help to review again, thanks a lot.

SparkQA · 2016-03-09T06:36:23Z

Test build #52722 has finished for PR 10821 at commit c98c3b5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2016-03-09T15:05:54Z

So at this point this isn't really changing anything right? Except not to set attempt is not in cluster mode? But that was handled by redirect in history server anyway. Otherwise I'm not following how the attemptId would be wrong. In cluster mode the attemptid should be set to something. Its getting it directly from the containerid that is set on your yarn cluster. If you don't have containerid then I assume you aren't running on yarn or yarn has a bug.

Which version of hadoop are you using?

Also are you sure the history server just hasn't loaded the file yet?

jerryshao · 2016-03-10T01:44:48Z

Hi @tgravescs , here in the original code:

val attemptId = client.getAttemptId().getAttemptId().toString()

This will get the attempt id as "1" or "2"...

And finally the history server url is: xxx/history/application_1455663314062_13672/1. In history server this "1" will be parsed as attempt id to find out the cached event log. But since here the attempt id is not a valid one, so this url will be failed to access.

Actually history server's expected attempt id is: appattempt_1455663314062_13672_000001, and the right url for accessing history server should be: xxx/history/application_1455663314062_13672/appattempt_1455663314062_13672_000001.

So here I change to client.getAttemptId().toString() to get the full attempt id. That's the point of this fix.

vanzin · 2016-03-10T19:00:17Z

I just tried locally and I see that something seems to have changed in 2.0.

This is an example of an event log file name generated by 1.6 in my cluster:
/user/spark/applicationHistory/application_1456510586527_0005_1

And this is an example of one generated by 2.0:
/user/spark/applicationHistory/application_1457635277787_0001_appattempt_1457635277787_0001_000001

Can you confirm that you change restores the behavior to the one that existed in 1.6 (and actually explain that in the change summary)? The new ultra verbose attempt id is not necessary - it contains the same information as the application id, plus a counter at the end.

jerryshao · 2016-03-14T05:52:10Z

Hi @vanzin , thanks a lot for your response. I just checked the branch-1.6, looks like the behavior (attempt id) is actually changed, and this change is introduced in #9182.

Originally attemptId is gotten from spark.yarn.app.attemptId which is set in ApplicationMaster.

And in ApplicationMaster, the way to get attemptId is appAttemptId.getAttemptId().toString(), so here the attemptId is "1", "2".

But this behavior is changed in master branch. Here we use the full attemptId rather than attempt counter here.

So this affects not only the file name of event log, also the url of history server's each application.

Here if we accept the new way of attempId, then this url link should be updated to the new one. Oppositely if we treat this new way of attemptId as a regression, then there's no issue here, all we should change is to loop back the original attemptId.

What's your opinion? @vanzin .

vanzin · 2016-03-14T17:29:20Z

all we should change is to loop back the original attemptId.

I think we should do that. The verbose attempt ID does not add any information on top of app ID + counter, and looks very ugly for users.

steveloughran · 2016-03-14T17:35:53Z

I'm not sure how that crept in on the patch; it wasn't something intentional.

It is needed for the spark timeline stuff —but that can be handled in history load code of the YarnHistoryProvider
Having a simple "1", "2", is actually useful, especially in the REST API; you don't need to know the attempt Id to know that /1 is for the first attempt; /2 for the second
Really to test link validity there should be something in the web UI, though being JSON that's too complex now. Hitting the REST API should suffice

BTW, some bits of code related to splitting up attempts, container IDs and such like has proven to be brittle in the past across Hadoop versions; if someone is trying to break things up, they should test across Hadoop 2.2-2.6+

jerryshao · 2016-03-15T02:42:14Z

I will close this PR, since there's no issue here, it is due to the regression of #9182, will fix that instead.

Also #11518 should be reverted, since that one is the same url link issue, should fix the attempt id instead.

Thanks a lot for your review.

jerryshao added 2 commits March 9, 2016 11:41

Fix history URL redirect error in yarn-cluster mode

df6027a

Address the comment

c98c3b5

jerryshao force-pushed the SPARK-12893 branch from a503bf7 to c98c3b5 Compare March 9, 2016 04:25

jerryshao closed this Mar 15, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-12893][YARN] Fix history URL redirect error in yarn-cluster mode #10821

[SPARK-12893][YARN] Fix history URL redirect error in yarn-cluster mode #10821

jerryshao commented Jan 18, 2016

SparkQA commented Jan 19, 2016

steveloughran commented Jan 19, 2016

jerryshao commented Jan 19, 2016

steveloughran commented Jan 19, 2016

SparkQA commented Jan 19, 2016

andrewor14 commented Feb 1, 2016

tgravescs commented Feb 24, 2016

jerryshao commented Feb 25, 2016

tgravescs commented Feb 25, 2016

jerryshao commented Feb 29, 2016

jerryshao commented Feb 29, 2016

steveloughran commented Feb 29, 2016

jerryshao commented Feb 29, 2016

steveloughran commented Feb 29, 2016

jerryshao commented Mar 1, 2016

jerryshao commented Mar 9, 2016

SparkQA commented Mar 9, 2016

tgravescs commented Mar 9, 2016

jerryshao commented Mar 10, 2016

vanzin commented Mar 10, 2016

jerryshao commented Mar 14, 2016

vanzin commented Mar 14, 2016

steveloughran commented Mar 14, 2016

jerryshao commented Mar 15, 2016

[SPARK-12893][YARN] Fix history URL redirect error in yarn-cluster mode #10821

[SPARK-12893][YARN] Fix history URL redirect error in yarn-cluster mode #10821

Conversation

jerryshao commented Jan 18, 2016

SparkQA commented Jan 19, 2016

steveloughran commented Jan 19, 2016

jerryshao commented Jan 19, 2016

steveloughran commented Jan 19, 2016

SparkQA commented Jan 19, 2016

andrewor14 commented Feb 1, 2016

tgravescs commented Feb 24, 2016

jerryshao commented Feb 25, 2016

tgravescs commented Feb 25, 2016

jerryshao commented Feb 29, 2016

jerryshao commented Feb 29, 2016

steveloughran commented Feb 29, 2016

jerryshao commented Feb 29, 2016

steveloughran commented Feb 29, 2016

jerryshao commented Mar 1, 2016

jerryshao commented Mar 9, 2016

SparkQA commented Mar 9, 2016

tgravescs commented Mar 9, 2016

jerryshao commented Mar 10, 2016

vanzin commented Mar 10, 2016

jerryshao commented Mar 14, 2016

vanzin commented Mar 14, 2016

steveloughran commented Mar 14, 2016

jerryshao commented Mar 15, 2016