
[WebUI][SPARK-7889] HistoryServer updates UI for incomplete apps #11118

Closed · wants to merge 54 commits

Conversation

@squito (Contributor) commented Feb 8, 2016

When the HistoryServer is showing an incomplete app, it needs to check if there is a newer version of the app available. It does this by checking if a version of the app has been loaded with a larger filesize. If so, it detaches the current UI, attaches the new one, and redirects back to the same URL to show the new UI.

https://issues.apache.org/jira/browse/SPARK-7889
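As a rough illustration of that flow (every name below is a hypothetical stand-in, not the patch's actual members):

    // Hypothetical sketch of the refresh cycle described above.
    trait LoadedUI {
      def ui: AnyRef               // stand-in for the attached SparkUI
      def updateProbe(): Boolean   // true once a larger log file has been loaded
    }

    def serve(key: String,
              cache: scala.collection.mutable.Map[String, LoadedUI],
              load: String => LoadedUI,
              detach: AnyRef => Unit,
              attach: AnyRef => Unit): AnyRef = {
      val entry = cache(key)
      if (entry.updateProbe()) {
        detach(entry.ui)           // take the stale UI off the web server
        val fresh = load(key)      // replay the now-longer event log
        cache(key) = fresh
        attach(fresh.ui)           // attach the new UI; the request then
      }                            // redirects back to the same URL
      cache(key).ui
    }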

… metrics used to track load & time, and for testing
…llelize().count() call, so the FS history provider isn't seeing an update, etc., etc.
…o scans through modified files to verify this takes.
…LoggingListener attempts to do so afterwards, swallowing exceptions raised
… it's a race condition between probe time and the scanner thread: if the initial load is after the file update but before the scanner thread has looked at the file, the file isn't detected as updated. The provider has to return the actual file timestamp of its choice for use in update checks, not the time that the initial load took place
…ore time details, but I'm about to move the FS history off time and onto a generic "attempt version" counter which will be compared on the probe. If an update has happened, this will know
log.debug(s"Probing at time $now for updated application $cacheKey -> $entry")
metrics.updateProbeCount.inc()
updated = time(metrics.updateProbeTimer) {
entry.updateProbe()
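The time(...) wrapper is not part of this excerpt; a plausible shape for it, assuming Spark's Codahale-based metrics (an assumption, not the patch's actual helper):

    import com.codahale.metrics.Timer

    // Assumed shape of the helper: run the block inside a Timer context so the
    // probe's latency lands in the history server's metrics registry.
    def time[T](timer: Timer)(block: => T): T = {
      val ctx = timer.time()
      try block finally ctx.stop()
    }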
Contributor Author

Note that this check is now extremely cheap (at least with the FSHistoryProvider). Actually checking for an update to the logs happens on its own schedule, since that scan looks for both new apps and updates to existing ones. That suggests we could either drop this extra interval completely and just do this check on every request, or, if we want to keep it for other HistoryProviders, at least make the default very rapid.
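With the file-size approach the probe itself reduces to a comparison of two longs; a sketch with illustrative names (not the patch's actual fields):

    // The scanner thread bumps scannedSize as it sees the log grow; the probe
    // merely compares it with the size recorded when this UI was built.
    class CacheEntry(sizeAtLoad: Long, scannedSize: () => Long) {
      def updateProbe(): Boolean = scannedSize() > sizeAtLoad
    }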

Contributor

My real concern was not the cost of the probe, but what happens if an app is updating rapidly with a lot of user requests coming in: it'd trigger replay too often. It's the cost of replay that worried me.

Contributor Author

right, I guess I just wanted to point out that with the changes here, this probe is entirely independent from replay. Replay happens with normal log-checking -- that frequency is controlled by spark.history.fs.update.interval. Here, we're just checking whether that regular log scanning has already loaded an updated UI for this attempt, and that is it. Since spark.history.fs.update.interval is entirely controlling the expensive part, we may not need any other interval.
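For reference, that interval is an ordinary history server setting, e.g. in spark-defaults.conf (10s is the documented default, repeated here only for illustration):

    spark.history.fs.update.interval  10s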


SparkQA commented Feb 8, 2016

Test build #2524 has finished for PR 11118 at commit bfbf348.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 8, 2016

Test build #50929 has finished for PR 11118 at commit bfbf348.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 8, 2016

Test build #2525 has finished for PR 11118 at commit d4740bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 8, 2016

Test build #50936 has finished for PR 11118 at commit 488da80.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 9, 2016

Test build #2526 has finished for PR 11118 at commit 488da80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    // actually read, we may never refresh the app.
    // We expect FileStatus to return the file size when it was initially created, but the API
    // is not explicit about this, so let's be extra-safe.
    val eventLogLength = eventLog.getLen()
Contributor

This is usually just another call to getFileStatus().getLen(); FileStatus is required to be static once created (see http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html, though it skimps on concurrency issues).

Contributor Author

Ah, I see. I expected it to behave that way but couldn't find any documentation which really made that explicit. I guess you're saying it's guaranteed by the post-conditions for getFileStatus()? I've updated the comment now.
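In other words, since a FileStatus is an immutable snapshot, a fresh size check means a fresh getFileStatus() call; a minimal sketch of that pattern:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // A FileStatus never changes after creation, so growth detection must
    // re-query the filesystem and compare lengths.
    def logHasGrown(fs: FileSystem, eventLog: Path, lengthAtLoad: Long): Boolean =
      fs.getFileStatus(eventLog).getLen > lengthAtLoad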

@steveloughran (Contributor)

LGTM; unifying the different probes for new-ness makes sense.


SparkQA commented Feb 11, 2016

Test build #51105 has started for PR 11118 at commit 2286aa8.

@shaneknapp (Contributor)

jenkins, test this please


@squito (Contributor Author) commented Feb 11, 2016

Plan to merge this a little later (assuming tests pass), any other comments?


SparkQA commented Feb 11, 2016

Test build #51117 has finished for PR 11118 at commit 57e937b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 11, 2016

Test build #2536 has finished for PR 11118 at commit 04f5385.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 11, 2016

Test build #51119 has finished for PR 11118 at commit 04f5385.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in a2c7dcf Feb 12, 2016
@squito (Contributor Author) commented Feb 12, 2016

merged to master, thanks @steveloughran!

@rxin (Contributor) commented Feb 12, 2016

Just saw this got merged. I'm probably missing some context, but can somebody explain to me why something so conceptually simple leads to such a big patch?

@steveloughran (Contributor)

Good Q. We thought it'd be simple at first too.

  1. We need a notion of "out-of-dateness" which (a) supports different back ends, and (b) works reliably for files stored in hdfs:// and other filesystems. (It isn't handled for S3 or other object stores, but that's because they only save their data on a close(), that is, at the end of a successful application.)
  2. The Google cache class is, well, limited. Essentially what we are doing is adding a probe to the cache entries which is triggered on retrieval, and which can then cause a new web UI to be loaded (see the sketch after this list).
  3. The current probe comes from the FS provider. Initially the patch looked at modification timestamps, but that proved unreliable (modtime granularity, plus questions about when a change actually becomes visible in the namenode). Hence the move to file length.
  4. The timeline provider, which I'm now working on elsewhere, does a GET of the timeline server metadata for that instance and looks at an event count pushed up there. That one is going to add a bit of a window on checks too (somehow), to keep the load on the YARN timeline server down.
  5. We need to trigger an update check on GETs all the way down the UI. Given the way the servlet API works (it still expects to be configured by web.xml), that's hard to do without singletons; hence the singleton at the bottom.
  6. Finally, there are some metrics of what's going on. SPARK-11373 adds metrics to the history server, of which this becomes a part.
  7. Oh, and then there are the tests. They use the metrics as the grey-box view into the cache, which ensures that the metrics actually get written and that they'll remain stable over time. Break the metrics and the tests fail, so you find out before the ops teams come after you.
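To make point 2 concrete, here is a sketch of the kind of wrapper involved; the types are hypothetical, only the Guava classes are real:

    import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

    // Guava's LoadingCache has no "revalidate on get" hook, so the staleness
    // probe has to be wrapped around retrieval by hand.
    case class Entry(ui: AnyRef, stillCurrent: () => Boolean)

    class AppCache(loader: String => Entry) {
      private val cache: LoadingCache[String, Entry] =
        CacheBuilder.newBuilder()
          .maximumSize(50)  // retained-UI limit; the real limit is configurable
          .build(new CacheLoader[String, Entry] {
            override def load(key: String): Entry = loader(key)
          })

      def get(key: String): Entry = {
        val entry = cache.get(key)
        if (entry.stillCurrent()) entry
        else {
          cache.invalidate(key)   // probe says stale: evict and reload
          cache.get(key)
        }
      }
    }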

There are actually two other, bigger things which it would be possible to do on this chain:

  1. Incremental playback of changes: rather than replay an entire app's history, start from where you left off (i.e. file.length()+1), as sketched after this list. Maybe I'll look at that sometime, as it would really benefit streaming work.
  2. Something that works on object stores. There I'd go for Spark application instances writing to HDFS, with a copy to S3 on completion, and the history provider being able to (a) scan both dirs, and (b) do the copy if the app is no longer running (i.e. it failed while declared incomplete). That's not on my todo list.
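A sketch of idea 1, using only the standard Hadoop filesystem API (the helper itself is hypothetical):

    import java.io.InputStream
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Rather than replaying from byte 0, seek past the bytes already
    // processed and replay only the new events.
    def openFromOffset(fs: FileSystem, eventLog: Path, alreadyRead: Long): InputStream = {
      val in = fs.open(eventLog)  // FSDataInputStream is seekable
      in.seek(alreadyRead)        // resume where the previous replay stopped
      in
    }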

Oh, and faster boot time via a summary file alongside the full history, holding the main details (finished: Boolean, spark-version, ...), so that the boot time goes from O(apps*events) to O(apps).
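A hypothetical shape for such a summary record:

    // One small record per app, so startup becomes O(apps) reads instead of
    // O(apps * events) replays.
    case class AppSummary(
        appId: String,
        finished: Boolean,
        sparkVersion: String,
        lastUpdated: Long)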

superbobry pushed a commit to criteo-forks/spark that referenced this pull request Feb 15, 2017

Author: Steve Loughran <stevel@hortonworks.com>
Author: Imran Rashid <irashid@cloudera.com>

Closes apache#11118 from squito/SPARK-7889-alternate.