Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-17843][Web UI] Indicate event logs pending for processing on history server UI #15410

Closed
wants to merge 5 commits into from

Conversation

vijoshi
Copy link
Contributor

@vijoshi vijoshi commented Oct 10, 2016

What changes were proposed in this pull request?

History Server UI's application listing to display information on currently under process event logs so a user knows that pending this processing an application may not list on the UI.

When there are no event logs under process, the application list page has a "Last Updated" date-time at the top indicating the date-time of the last completed scan of the event logs. The value is displayed to the user in his/her local time zone.

How was this patch tested?

All unit tests pass. Particularly all the suites under org.apache.spark.deploy.history.* were run to test changes.

  • Very first startup - Pending logs - no logs processed yet:

screen shot 2016-10-24 at 3 07 04 pm

- Very first startup - Pending logs - some logs processed:

screen shot 2016-10-24 at 3 18 42 pm

- Last updated - No currently pending logs:

screen shot 2016-10-17 at 8 34 37 pm

- Last updated - With some currently pending logs:

screen shot 2016-10-24 at 3 09 31 pm

- No applications found and No currently pending logs:

screen shot 2016-10-24 at 3 24 26 pm

@vijoshi vijoshi changed the title initial commit [SPARK-17843][Web UI] Indicate event logs pending for processing on history server UI Oct 10, 2016
Copy link
Member

@ajbozarth ajbozarth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of this change, but I think it can be more stream-lined as I commented. From a first look though the general code quality looks good


try {
for (file <- logInfos) {
tasks += replayExecutor.submit(new Runnable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need for this step? Why can't we just pendingReplayTasksCount.addAndGet(logInfos.size) below and only have to add the finally code block. This seems like a lot of unnecessary code change.

Copy link
Contributor Author

@vijoshi vijoshi Oct 11, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the replayExecutor.submit can raise an exception resulting in fewer logInfos getting queued for replay. although one can argue that the underlying ExecutorService implementation used today will probably not do this, but I did not feel comfortable leaving this possibility out. note that - if there were to be an exception on the replayExecutor. submit( ) it would break not only the counting logic (assuming we'd already done a pendingReplayTasksCount.addAndGet(logInfos.size) instead of only adding as many as got successfully submitted) but also the older one of preventing a new thread getting scheduled for "checkForLogs( )" while there are tasks already running in replayExecutor.

however if there's a thought that we can completely ignore the possibility of exceptions on replayExecutor.submit( ), I can optimise the code on those lines. Let me know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding the catch at the end of checkForLogs would stop any current issue, but without it your new counting code would break, so with the addition of the counting code this change makes sense.

@tgravescs
Copy link
Contributor

while I understand the problem, the issue here is that what you are printing might not be accurate either. We only pull for new logs every X number of seconds (default is 10s, user could change to longer) so if I come in one of those periods then it could say 0 when there are actually some to be read yet. Maybe this is ok as its better then it was but its also not guaranteed.

@vijoshi
Copy link
Contributor Author

vijoshi commented Oct 11, 2016

@tgravescs - you're right - for newer logs that are generated, there could be a window of time (10 secs or whatever the user configures) where the new logs are not picked up for replay and the UI doesn't say anything about them. however the issue we see is more with old completed apps. a little after history server startup, user browses to the app list and has no idea why the older completed apps are missing (perhaps those that were visible just before the history server was restarted). since the polling for logs/replay is scheduled immediately as part of history server startup (zero delay for the first round), this fix could help these cases.

@tgravescs
Copy link
Contributor

ah, yeah startup would definitely be a good case for this and like I mentioned its better then nothing so I'm ok with concept.

I wonder for the other use case where it hasn't looked in ~ 10 seconds if it would be more clear to users if we put a little string at the top that is like "last updated time XX:XX:XX" or app list current as of XX:XX:XX.

@ajbozarth
Copy link
Member

Just to raise an idea that would possibly mean less code change, would simply having a flag that causing a "currently processing applications" type message to display without an actual count with it?

Overall I think this is a good addition though.

@@ -38,6 +39,13 @@ private[history] class HistoryPage(parent: HistoryServer) extends WebUIPage("")
{providerConfig.map { case (k, v) => <li><strong>{k}:</strong> {v}</li> }}
</ul>
{
if (eventLogsUnderProcessCount > 0) {
<p>There are {eventLogsUnderProcessCount} event log(s) currently being
processed which may result in additional applications getting listed on this page.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of just telling the user how many logs are pending, it will be a lot more useful to tell him which ones are pending so he can wait before clicking on a specific application

@andrewor14
Copy link
Contributor

andrewor14 commented Oct 12, 2016

ok to test

I think the idea is good, but it would be a better UX if we display the pending applications as rows in the existing table (or a new one) and indicate there that they're still being processed. It might be more code change but I think it's worth it.

@tgravescs
Copy link
Contributor

If we are going to list the actual applications being loaded in the table then you have to rely on the filename to know the application id. This may be ok, but its then the interface and we can't change it or whatever we change it to has to be compatible and backwards compatible. Personally I would rather see us do something much smarter with file names/paths to include enough metadata to do simple table list or have metadata file that would allow loading of the history table much faster and load the actual contents only when actually clicked on.

@tgravescs
Copy link
Contributor

Note the reason I mention the above things is I don't want us to start relying on the filename if its just going to change again.

@vijoshi
Copy link
Contributor Author

vijoshi commented Oct 13, 2016

Display "Last updated " with a generic message, possibly omitting the count of logs

Ok, while looking at this I notice that the scan date-time the FsHistoryProvider currently internally captures (lastScanTime) is based on the filesystem date-time:

  private def getNewLastScanTime(): Long = {
      val fileName = "." + UUID.randomUUID().toString
      val path = new Path(logDir, fileName)
      val fos = fs.create(path)`

     try {
        fos.close()
        fs.getFileStatus(path).getModificationTime

This probably made sense when earlier versions of FsHistoryProvider used this value to determine new/updated logs it needed to replay. This is no longer the case. In checkForLogs() the check is based on file size now. So my question is that the fact that getNewLastScanTime is still present - is there a reason to rely on or use the filesystem time value? We don't get a timezone out of the FileStatus api. If I were to display an "Last updated/Updated as of: " value on the app listing page, I would rather read/record the time value from the JVM and ensure the Web UI converted it to user's local timezone for display.

@vijoshi
Copy link
Contributor Author

vijoshi commented Oct 13, 2016

Display applications that are pending, instead of a count

Though agree that this would be better UX-wise, but we would only have the filenames (logInfos) to go by to build this list until a clean way to extract app metadata exists without parsing event log file contents. If the UI just displayed the pending ones in a different table as "names" of pending logs without asserting that they are necessarily the app-id (though presently the names may match the app-ids) - would that be good enough ? The message on the top that I am trying to add on the application listing page could link to a new page that lists these pending logs to avoid cluttering the application listing page?

@tgravescs
Copy link
Contributor

personally I would rather not display directory or filenames unless we are doing it right. User shouldn't know the internal implementation or storage option.

@andrewor14
Copy link
Contributor

We shouldn't display file names but we should display application names and IDs, something the user understands. We don't have to do that as part of this issue.

@ajbozarth
Copy link
Member

@andrewor14 I plan to address that in another issue soon (see by end of the month) as soon as I have time. Fixing how the History Server loads has been coming up in a lot of PRs lately.

@vijoshi
Copy link
Contributor Author

vijoshi commented Oct 17, 2016

@ajbozarth @tgravescs @andrewor14 - updated pull request to show a Last Updated message on the app list page when there are no pending logs being processed. I would have liked to have the ApplicationListResource use a json schema where the application list and associated metadata - updated date, pending logs etc could be included together. but given that the json schema returned today is just a collection of application info json - change to this would be too much code change for this smallish fix - probably something for the /api/v2/ to consider.

@tgravescs
Copy link
Contributor

can you update the description with latest implementation and screen shot.

@tgravescs
Copy link
Contributor

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Oct 17, 2016

Test build #67070 has finished for PR 15410 at commit b43e241.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vijoshi
Copy link
Contributor Author

vijoshi commented Oct 20, 2016

@tgravescs @ajbozarth @andrewor14 can this be merged now? can we get it into the 2.0 branch - should i simply open a pull request on 2.0?

@vijoshi vijoshi closed this Oct 20, 2016
@vijoshi vijoshi reopened this Oct 20, 2016
@vijoshi
Copy link
Contributor Author

vijoshi commented Oct 20, 2016

This needs to remain open - clicked the wrong button!!

@tgravescs
Copy link
Contributor

were the description and screen shots updated?

@vijoshi
Copy link
Contributor Author

vijoshi commented Oct 20, 2016

@tgravescs Yes, the description for the pull request now has screen shots for the "Last Updated: XXX" case as well. I added description for the "Last Updated" message on the app list page just now.

@ajbozarth
Copy link
Member

Should we display the Last Updated even if the currently processing message is displayed? Then the user will know the last update even when a new update is processing, rare case, but still useful.

@vijoshi
Copy link
Contributor Author

vijoshi commented Oct 21, 2016

@ajbozarth yeah sounds useful. but note that for the very first load, we would have no 'Last Updated' value to display since that gets set only after the log scan+replay cycle completes at least once. any suggestions for an alternate text when we're in this situation?

@ajbozarth
Copy link
Member

I would say just not displaying it would be fine, so making the else if an if

@vijoshi
Copy link
Contributor Author

vijoshi commented Oct 24, 2016

@ajbozarth @tgravescs @andrewor14 - updated the implementation to always display "Last Updated Date-Time" when we have atleast one completed scan of the event log directory. I have changed the impl slightly

  • the javascript for "last updated" goes into a "historypage-common.js" since it's now always needed for inclusion on the history page (earlier it was inside the bigger "historypage.js" file that gets included only if there are some apps to list)
  • if there are pending event logs, do not show the "Did you specify the correct logging directory?" message

Have updated description with screenshots for each of the above scenarios.

logError(s"Exception while submitting event log for replay", e)
}

pendingReplayTasksCount.addAndGet (tasks.size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove extra space after addAndGet before (

@@ -123,6 +123,8 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
// List of application logs to be deleted by event log cleaner.
private var attemptsToClean = new mutable.ListBuffer[FsApplicationAttemptInfo]

private val pendingReplayTasksCount = new java.util.concurrent.atomic.AtomicInteger (0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra space after AtomicInteger before (

@@ -111,7 +111,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)

// The modification time of the newest log detected during the last scan. Currently only
// used for logging msgs (logs are re-scanned based on file size, rather than modtime)
private var lastScanTime = -1L
private val lastScanTime = new java.util.concurrent.atomic.AtomicLong (-1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra space after AtomicLong before (

@tgravescs
Copy link
Contributor

couple minor spacing nits, otherwise looks good.

@vijoshi
Copy link
Contributor Author

vijoshi commented Nov 7, 2016

@tgravescs thanks, have updated the fix.

@tgravescs
Copy link
Contributor

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Nov 8, 2016

Test build #68341 has finished for PR 15410 at commit 596c7ae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor

test failure looks unrelated, kicking again.

Jenkins, test this please

@vijoshi
Copy link
Contributor Author

vijoshi commented Nov 8, 2016

retest this please

@tgravescs
Copy link
Contributor

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Nov 8, 2016

Test build #68357 has finished for PR 15410 at commit 596c7ae.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vijoshi
Copy link
Contributor Author

vijoshi commented Nov 9, 2016

retest this please

@tgravescs
Copy link
Contributor

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Nov 9, 2016

Test build #68409 has finished for PR 15410 at commit 596c7ae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor

I've seen other pr tests passing now, let try again.

@tgravescs
Copy link
Contributor

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Nov 10, 2016

Test build #68471 has finished for PR 15410 at commit 596c7ae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor

@vijoshi please look at test failures in more detail to see if these changes could have affected.

@tgravescs
Copy link
Contributor

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Nov 11, 2016

Test build #68524 has finished for PR 15410 at commit 46357ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor

+1

asfgit pushed a commit that referenced this pull request Nov 11, 2016
…istory server UI

## What changes were proposed in this pull request?

History Server UI's application listing to display information on currently under process event logs so a user knows that pending this processing an application may not list on the UI.

When there are no event logs under process, the application list page has a "Last Updated" date-time at the top indicating the date-time of the last _completed_ scan of the event logs. The value is displayed to the user in his/her local time zone.
## How was this patch tested?

All unit tests pass. Particularly all the suites under org.apache.spark.deploy.history.\* were run to test changes.
- Very first startup - Pending logs - no logs processed yet:

<img width="1280" alt="screen shot 2016-10-24 at 3 07 04 pm" src="https://cloud.githubusercontent.com/assets/12079825/19640981/b8d2a96a-99fc-11e6-9b1f-2d736fe90e48.png">
- Very first startup - Pending logs - some logs processed:

<img width="1280" alt="screen shot 2016-10-24 at 3 18 42 pm" src="https://cloud.githubusercontent.com/assets/12079825/19641087/3f8e3bae-99fd-11e6-9ef1-e0e70d71d8ef.png">
- Last updated - No currently pending logs:

<img width="1280" alt="screen shot 2016-10-17 at 8 34 37 pm" src="https://cloud.githubusercontent.com/assets/12079825/19443100/4d13946c-94a9-11e6-8ee2-c442729bb206.png">
- Last updated - With some currently pending logs:

<img width="1280" alt="screen shot 2016-10-24 at 3 09 31 pm" src="https://cloud.githubusercontent.com/assets/12079825/19640903/7323ba3a-99fc-11e6-8359-6a45753dbb28.png">
- No applications found and No currently pending logs:

<img width="1280" alt="screen shot 2016-10-24 at 3 24 26 pm" src="https://cloud.githubusercontent.com/assets/12079825/19641364/03a2cb04-99fe-11e6-87d6-d09587fc6201.png">

Author: Vinayak <vijoshi5@in.ibm.com>

Closes #15410 from vijoshi/SAAS-608_master.

(cherry picked from commit a531fe1)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
@asfgit asfgit closed this in a531fe1 Nov 11, 2016
asfgit pushed a commit that referenced this pull request Nov 30, 2016
## What changes were proposed in this pull request?

Backport PR #15410 to branch-2.0

## How was this patch tested?

Existing unit tests. Screenshots for UI changes provided in PR #15410.

Author: Vinayak <vijoshi5@in.ibm.com>

Closes #15991 from vijoshi/SAAS-608.
@vijoshi vijoshi deleted the SAAS-608_master branch January 24, 2017 10:31
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…istory server UI

## What changes were proposed in this pull request?

History Server UI's application listing to display information on currently under process event logs so a user knows that pending this processing an application may not list on the UI.

When there are no event logs under process, the application list page has a "Last Updated" date-time at the top indicating the date-time of the last _completed_ scan of the event logs. The value is displayed to the user in his/her local time zone.
## How was this patch tested?

All unit tests pass. Particularly all the suites under org.apache.spark.deploy.history.\* were run to test changes.
- Very first startup - Pending logs - no logs processed yet:

<img width="1280" alt="screen shot 2016-10-24 at 3 07 04 pm" src="https://cloud.githubusercontent.com/assets/12079825/19640981/b8d2a96a-99fc-11e6-9b1f-2d736fe90e48.png">
- Very first startup - Pending logs - some logs processed:

<img width="1280" alt="screen shot 2016-10-24 at 3 18 42 pm" src="https://cloud.githubusercontent.com/assets/12079825/19641087/3f8e3bae-99fd-11e6-9ef1-e0e70d71d8ef.png">
- Last updated - No currently pending logs:

<img width="1280" alt="screen shot 2016-10-17 at 8 34 37 pm" src="https://cloud.githubusercontent.com/assets/12079825/19443100/4d13946c-94a9-11e6-8ee2-c442729bb206.png">
- Last updated - With some currently pending logs:

<img width="1280" alt="screen shot 2016-10-24 at 3 09 31 pm" src="https://cloud.githubusercontent.com/assets/12079825/19640903/7323ba3a-99fc-11e6-8359-6a45753dbb28.png">
- No applications found and No currently pending logs:

<img width="1280" alt="screen shot 2016-10-24 at 3 24 26 pm" src="https://cloud.githubusercontent.com/assets/12079825/19641364/03a2cb04-99fe-11e6-87d6-d09587fc6201.png">

Author: Vinayak <vijoshi5@in.ibm.com>

Closes apache#15410 from vijoshi/SAAS-608_master.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants