
[WebUI][SPARK-7889] HistoryServer updates UI for incomplete apps #11118

Closed · wants to merge 54 commits

Conversation

@squito (Contributor) commented Feb 8, 2016

When the HistoryServer is showing an incomplete app, it needs to check if there is a newer version of the app available. It does this by checking if a version of the app has been loaded with a larger filesize. If so, it detaches the current UI, attaches the new one, and redirects back to the same URL to show the new UI.

https://issues.apache.org/jira/browse/SPARK-7889
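As a rough illustration of that flow (every name below is a hypothetical stand-in, not the patch's actual members):

    // Hypothetical sketch of the refresh cycle described above.
    trait LoadedUI {
      def ui: AnyRef               // stand-in for the attached SparkUI
      def updateProbe(): Boolean   // true once a larger log file has been loaded
    }

    def serve(key: String,
              cache: scala.collection.mutable.Map[String, LoadedUI],
              load: String => LoadedUI,
              detach: AnyRef => Unit,
              attach: AnyRef => Unit): AnyRef = {
      val entry = cache(key)
      if (entry.updateProbe()) {
        detach(entry.ui)           // take the stale UI off the web server
        val fresh = load(key)      // replay the now-longer event log
        cache(key) = fresh
        attach(fresh.ui)           // attach the new UI; the request then
      }                            // redirects back to the same URL
      cache(key).ui
    }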

… metrics used to track load & time, and for testing
…llelize().count() call, so the FS history provider isn't seeing an update, etc., etc.
…o scans through modified files to verify this takes.
…LoggingListener attempts to do so afterwards, swallowing exceptions raised
… it's a race condition between probe time and the scanner thread: if the initial load is after the file update but before the scanner thread has looked at the file, the file isn't detected as updated. The provider has to return the actual file timestamp of its choice for use in update checks, not the time that the initial load took place
…ore time details, but I'm about to move the FS history off time and onto a generic "attempt version" counter which will be compared on the probe. If an update has happened, this will know
log.debug(s"Probing at time $now for updated application $cacheKey -> $entry")
metrics.updateProbeCount.inc()
updated = time(metrics.updateProbeTimer) {
entry.updateProbe()
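The time(...) wrapper is not part of this excerpt; a plausible shape for it, assuming Spark's Codahale-based metrics (an assumption, not the patch's actual helper):

    import com.codahale.metrics.Timer

    // Assumed shape of the helper: run the block inside a Timer context so the
    // probe's latency lands in the history server's metrics registry.
    def time[T](timer: Timer)(block: => T): T = {
      val ctx = timer.time()
      try block finally ctx.stop()
    }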
Contributor Author

Note that this check is now extremely cheap (at least with the FSHistoryProvider). Actually checking for an update to the logs happens on its own schedule, since that scan looks for both new apps and updates to existing ones. That suggests we could either drop this extra interval completely and just do this check on every request, or, if we want to keep it for other HistoryProviders, at least make the default very rapid.
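With the file-size approach the probe itself reduces to a comparison of two longs; a sketch with illustrative names (not the patch's actual fields):

    // The scanner thread bumps scannedSize as it sees the log grow; the probe
    // merely compares it with the size recorded when this UI was built.
    class CacheEntry(sizeAtLoad: Long, scannedSize: () => Long) {
      def updateProbe(): Boolean = scannedSize() > sizeAtLoad
    }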

Contributor

My real concern was not the cost of the probe, but what happens if an app is updating rapidly with a lot of user requests coming in: it'd trigger replay too often. It's the cost of replay that worried me.

Contributor Author

right, I guess I just wanted to point out that with the changes here, this probe is entirely independent from replay. Replay happens with normal log-checking -- that frequency is controlled by spark.history.fs.update.interval. Here, we're just checking whether that regular log scanning has already loaded an updated UI for this attempt, and that is it. Since spark.history.fs.update.interval is entirely controlling the expensive part, we may not need any other interval.
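For reference, that interval is an ordinary history server setting, e.g. in spark-defaults.conf (10s is the documented default, repeated here only for illustration):

    spark.history.fs.update.interval  10s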


SparkQA commented Feb 8, 2016

Test build #2524 has finished for PR 11118 at commit bfbf348.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 8, 2016

Test build #50929 has finished for PR 11118 at commit bfbf348.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 8, 2016

Test build #2525 has finished for PR 11118 at commit d4740bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 8, 2016

Test build #50936 has finished for PR 11118 at commit 488da80.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 9, 2016

Test build #2526 has finished for PR 11118 at commit 488da80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    // actually read, we may never refresh the app.
    // We expect FileStatus to return the file size when it was initially created, but the API
    // is not explicit about this, so let's be extra-safe.
    val eventLogLength = eventLog.getLen()
Contributor

This is usually just another call to getFileStatus().getLen(); FileStatus is required to be static once created (see http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html, though it skimps on concurrency issues).

Contributor Author

Ah, I see. I expected it to behave that way but couldn't find any documentation which really made that explicit. I guess you're saying it's guaranteed by the post-conditions for getFileStatus()? I've updated the comment now.
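In other words, since a FileStatus is an immutable snapshot, a fresh size check means a fresh getFileStatus() call; a minimal sketch of that pattern:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // A FileStatus never changes after creation, so growth detection must
    // re-query the filesystem and compare lengths.
    def logHasGrown(fs: FileSystem, eventLog: Path, lengthAtLoad: Long): Boolean =
      fs.getFileStatus(eventLog).getLen > lengthAtLoad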

@steveloughran (Contributor)

LGTM; unifying the different probes for new-ness makes sense.


SparkQA commented Feb 11, 2016

Test build #51105 has started for PR 11118 at commit 2286aa8.

@shaneknapp (Contributor)

jenkins, test this please


@squito (Contributor Author) commented Feb 11, 2016

Plan to merge this a little later (assuming tests pass), any other comments?


SparkQA commented Feb 11, 2016

Test build #51117 has finished for PR 11118 at commit 57e937b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 11, 2016

Test build #2536 has finished for PR 11118 at commit 04f5385.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 11, 2016

Test build #51119 has finished for PR 11118 at commit 04f5385.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in a2c7dcf Feb 12, 2016
@squito (Contributor Author) commented Feb 12, 2016

merged to master, thanks @steveloughran!

@rxin (Contributor) commented Feb 12, 2016

Just saw this got merged. I'm probably missing some context, but can somebody explain to me why something so conceptually simple leads to such a big patch?

@steveloughran (Contributor)

Good Q. We thought it'd be simple at first too.

  1. We need a notion of "out-of-dateness" which (a) supports different back ends, and (b) works reliably for files stored in hdfs:// and other filesystems. (It isn't handled for S3 or other object stores, but that's because they only save their data on a close(), that is, at the end of a successful application.)
  2. The Google cache class is, well, limited. Essentially what we are doing is adding a probe to the cache entries which is triggered on retrieval, and which can then cause a new web UI to be loaded (see the sketch after this list).
  3. The current probe comes from the FS provider. Initially the patch looked at modification timestamps, but that proved unreliable (modtime granularity, plus questions about when a change actually becomes visible in the namenode). Hence the move to file length.
  4. The timeline provider, which I'm now working on elsewhere, does a GET of the timeline server metadata for that instance and looks at an event count pushed up there. That one is going to add a bit of a window on checks too (somehow), to keep the load on the YARN timeline server down.
  5. We need to trigger an update check on GETs all the way down the UI. Given the way the servlet API works (it still expects to be configured by web.xml), that's hard to do without singletons; hence the singleton at the bottom.
  6. Finally, there are some metrics of what's going on. SPARK-11373 adds metrics to the history server, of which this becomes a part.
  7. Oh, and then there are the tests. They use the metrics as the grey-box view into the cache, which ensures that the metrics actually get written and that they'll remain stable over time. Break the metrics and the tests fail, so you find out before the ops teams come after you.
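To make point 2 concrete, here is a sketch of the kind of wrapper involved; the types are hypothetical, only the Guava classes are real:

    import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

    // Guava's LoadingCache has no "revalidate on get" hook, so the staleness
    // probe has to be wrapped around retrieval by hand.
    case class Entry(ui: AnyRef, stillCurrent: () => Boolean)

    class AppCache(loader: String => Entry) {
      private val cache: LoadingCache[String, Entry] =
        CacheBuilder.newBuilder()
          .maximumSize(50)  // retained-UI limit; the real limit is configurable
          .build(new CacheLoader[String, Entry] {
            override def load(key: String): Entry = loader(key)
          })

      def get(key: String): Entry = {
        val entry = cache.get(key)
        if (entry.stillCurrent()) entry
        else {
          cache.invalidate(key)   // probe says stale: evict and reload
          cache.get(key)
        }
      }
    }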

There are actually two other, bigger things which it would be possible to do on this chain:

  1. Incremental playback of changes: rather than replay an entire app's history, start from where you left off (i.e. file.length()+1), as sketched after this list. Maybe I'll look at that sometime, as it would really benefit streaming work.
  2. Something that works on object stores. There I'd go for Spark application instances writing to HDFS, with a copy to S3 on completion, and the history provider being able to (a) scan both dirs, and (b) do the copy if the app is no longer running (i.e. it failed while declared incomplete). That's not on my todo list.
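A sketch of idea 1, using only the standard Hadoop filesystem API (the helper itself is hypothetical):

    import java.io.InputStream
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Rather than replaying from byte 0, seek past the bytes already
    // processed and replay only the new events.
    def openFromOffset(fs: FileSystem, eventLog: Path, alreadyRead: Long): InputStream = {
      val in = fs.open(eventLog)  // FSDataInputStream is seekable
      in.seek(alreadyRead)        // resume where the previous replay stopped
      in
    }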

Oh, and faster boot time via a summary file alongside the full history, holding the main details (finished: Boolean, spark-version, ...), so that the boot time goes from O(apps*events) to O(apps).
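A hypothetical shape for such a summary record:

    // One small record per app, so startup becomes O(apps) reads instead of
    // O(apps * events) replays.
    case class AppSummary(
        appId: String,
        finished: Boolean,
        sparkVersion: String,
        lastUpdated: Long)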

superbobry pushed a commit to criteo-forks/spark that referenced this pull request Feb 15, 2017

Author: Steve Loughran <stevel@hortonworks.com>
Author: Imran Rashid <irashid@cloudera.com>

Closes apache#11118 from squito/SPARK-7889-alternate.