-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-17843][Web UI] Indicate event logs pending for processing on history server UI #15410
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of this change, but I think it can be more stream-lined as I commented. From a first look though the general code quality looks good
|
||
try { | ||
for (file <- logInfos) { | ||
tasks += replayExecutor.submit(new Runnable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a need for this step? Why can't we just pendingReplayTasksCount.addAndGet(logInfos.size)
below and only have to add the finally
code block. This seems like a lot of unnecessary code change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the replayExecutor.submit can raise an exception resulting in fewer logInfos getting queued for replay. although one can argue that the underlying ExecutorService implementation used today will probably not do this, but I did not feel comfortable leaving this possibility out. note that - if there were to be an exception on the replayExecutor. submit( ) it would break not only the counting logic (assuming we'd already done a pendingReplayTasksCount.addAndGet(logInfos.size)
instead of only adding as many as got successfully submitted) but also the older one of preventing a new thread getting scheduled for "checkForLogs( )" while there are tasks already running in replayExecutor.
however if there's a thought that we can completely ignore the possibility of exceptions on replayExecutor.submit( ), I can optimise the code on those lines. Let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my understanding the catch at the end of checkForLogs would stop any current issue, but without it your new counting code would break, so with the addition of the counting code this change makes sense.
while I understand the problem, the issue here is that what you are printing might not be accurate either. We only pull for new logs every X number of seconds (default is 10s, user could change to longer) so if I come in one of those periods then it could say 0 when there are actually some to be read yet. Maybe this is ok as its better then it was but its also not guaranteed. |
@tgravescs - you're right - for newer logs that are generated, there could be a window of time (10 secs or whatever the user configures) where the new logs are not picked up for replay and the UI doesn't say anything about them. however the issue we see is more with old completed apps. a little after history server startup, user browses to the app list and has no idea why the older completed apps are missing (perhaps those that were visible just before the history server was restarted). since the polling for logs/replay is scheduled immediately as part of history server startup (zero delay for the first round), this fix could help these cases. |
ah, yeah startup would definitely be a good case for this and like I mentioned its better then nothing so I'm ok with concept. I wonder for the other use case where it hasn't looked in ~ 10 seconds if it would be more clear to users if we put a little string at the top that is like "last updated time XX:XX:XX" or app list current as of XX:XX:XX. |
Just to raise an idea that would possibly mean less code change, would simply having a flag that causing a "currently processing applications" type message to display without an actual count with it? Overall I think this is a good addition though. |
@@ -38,6 +39,13 @@ private[history] class HistoryPage(parent: HistoryServer) extends WebUIPage("") | |||
{providerConfig.map { case (k, v) => <li><strong>{k}:</strong> {v}</li> }} | |||
</ul> | |||
{ | |||
if (eventLogsUnderProcessCount > 0) { | |||
<p>There are {eventLogsUnderProcessCount} event log(s) currently being | |||
processed which may result in additional applications getting listed on this page. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of just telling the user how many logs are pending, it will be a lot more useful to tell him which ones are pending so he can wait before clicking on a specific application
ok to test I think the idea is good, but it would be a better UX if we display the pending applications as rows in the existing table (or a new one) and indicate there that they're still being processed. It might be more code change but I think it's worth it. |
If we are going to list the actual applications being loaded in the table then you have to rely on the filename to know the application id. This may be ok, but its then the interface and we can't change it or whatever we change it to has to be compatible and backwards compatible. Personally I would rather see us do something much smarter with file names/paths to include enough metadata to do simple table list or have metadata file that would allow loading of the history table much faster and load the actual contents only when actually clicked on. |
Note the reason I mention the above things is I don't want us to start relying on the filename if its just going to change again. |
Display "Last updated " with a generic message, possibly omitting the count of logs Ok, while looking at this I notice that the scan date-time the FsHistoryProvider currently internally captures (
This probably made sense when earlier versions of FsHistoryProvider used this value to determine new/updated logs it needed to replay. This is no longer the case. In |
Display applications that are pending, instead of a count Though agree that this would be better UX-wise, but we would only have the filenames ( |
personally I would rather not display directory or filenames unless we are doing it right. User shouldn't know the internal implementation or storage option. |
We shouldn't display file names but we should display application names and IDs, something the user understands. We don't have to do that as part of this issue. |
@andrewor14 I plan to address that in another issue soon (see by end of the month) as soon as I have time. Fixing how the History Server loads has been coming up in a lot of PRs lately. |
@ajbozarth @tgravescs @andrewor14 - updated pull request to show a Last Updated message on the app list page when there are no pending logs being processed. I would have liked to have the ApplicationListResource use a json schema where the application list and associated metadata - updated date, pending logs etc could be included together. but given that the json schema returned today is just a collection of application info json - change to this would be too much code change for this smallish fix - probably something for the /api/v2/ to consider. |
can you update the description with latest implementation and screen shot. |
Jenkins, test this please |
Test build #67070 has finished for PR 15410 at commit
|
@tgravescs @ajbozarth @andrewor14 can this be merged now? can we get it into the 2.0 branch - should i simply open a pull request on 2.0? |
This needs to remain open - clicked the wrong button!! |
were the description and screen shots updated? |
@tgravescs Yes, the description for the pull request now has screen shots for the "Last Updated: XXX" case as well. I added description for the "Last Updated" message on the app list page just now. |
Should we display the Last Updated even if the currently processing message is displayed? Then the user will know the last update even when a new update is processing, rare case, but still useful. |
@ajbozarth yeah sounds useful. but note that for the very first load, we would have no 'Last Updated' value to display since that gets set only after the log scan+replay cycle completes at least once. any suggestions for an alternate text when we're in this situation? |
I would say just not displaying it would be fine, so making the |
@ajbozarth @tgravescs @andrewor14 - updated the implementation to always display "Last Updated Date-Time" when we have atleast one completed scan of the event log directory. I have changed the impl slightly
Have updated description with screenshots for each of the above scenarios. |
cfcae6a
to
cd5056b
Compare
logError(s"Exception while submitting event log for replay", e) | ||
} | ||
|
||
pendingReplayTasksCount.addAndGet (tasks.size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove extra space after addAndGet before (
@@ -123,6 +123,8 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) | |||
// List of application logs to be deleted by event log cleaner. | |||
private var attemptsToClean = new mutable.ListBuffer[FsApplicationAttemptInfo] | |||
|
|||
private val pendingReplayTasksCount = new java.util.concurrent.atomic.AtomicInteger (0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove extra space after AtomicInteger before (
@@ -111,7 +111,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) | |||
|
|||
// The modification time of the newest log detected during the last scan. Currently only | |||
// used for logging msgs (logs are re-scanned based on file size, rather than modtime) | |||
private var lastScanTime = -1L | |||
private val lastScanTime = new java.util.concurrent.atomic.AtomicLong (-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove extra space after AtomicLong before (
couple minor spacing nits, otherwise looks good. |
@tgravescs thanks, have updated the fix. |
Jenkins, test this please |
Test build #68341 has finished for PR 15410 at commit
|
test failure looks unrelated, kicking again. Jenkins, test this please |
retest this please |
Jenkins, test this please |
Test build #68357 has finished for PR 15410 at commit
|
retest this please |
Jenkins, test this please |
Test build #68409 has finished for PR 15410 at commit
|
I've seen other pr tests passing now, let try again. |
Jenkins, test this please |
Test build #68471 has finished for PR 15410 at commit
|
@vijoshi please look at test failures in more detail to see if these changes could have affected. |
Jenkins, test this please |
Test build #68524 has finished for PR 15410 at commit
|
+1 |
…istory server UI ## What changes were proposed in this pull request? History Server UI's application listing to display information on currently under process event logs so a user knows that pending this processing an application may not list on the UI. When there are no event logs under process, the application list page has a "Last Updated" date-time at the top indicating the date-time of the last _completed_ scan of the event logs. The value is displayed to the user in his/her local time zone. ## How was this patch tested? All unit tests pass. Particularly all the suites under org.apache.spark.deploy.history.\* were run to test changes. - Very first startup - Pending logs - no logs processed yet: <img width="1280" alt="screen shot 2016-10-24 at 3 07 04 pm" src="https://cloud.githubusercontent.com/assets/12079825/19640981/b8d2a96a-99fc-11e6-9b1f-2d736fe90e48.png"> - Very first startup - Pending logs - some logs processed: <img width="1280" alt="screen shot 2016-10-24 at 3 18 42 pm" src="https://cloud.githubusercontent.com/assets/12079825/19641087/3f8e3bae-99fd-11e6-9ef1-e0e70d71d8ef.png"> - Last updated - No currently pending logs: <img width="1280" alt="screen shot 2016-10-17 at 8 34 37 pm" src="https://cloud.githubusercontent.com/assets/12079825/19443100/4d13946c-94a9-11e6-8ee2-c442729bb206.png"> - Last updated - With some currently pending logs: <img width="1280" alt="screen shot 2016-10-24 at 3 09 31 pm" src="https://cloud.githubusercontent.com/assets/12079825/19640903/7323ba3a-99fc-11e6-8359-6a45753dbb28.png"> - No applications found and No currently pending logs: <img width="1280" alt="screen shot 2016-10-24 at 3 24 26 pm" src="https://cloud.githubusercontent.com/assets/12079825/19641364/03a2cb04-99fe-11e6-87d6-d09587fc6201.png"> Author: Vinayak <vijoshi5@in.ibm.com> Closes #15410 from vijoshi/SAAS-608_master. (cherry picked from commit a531fe1) Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
…istory server UI ## What changes were proposed in this pull request? History Server UI's application listing to display information on currently under process event logs so a user knows that pending this processing an application may not list on the UI. When there are no event logs under process, the application list page has a "Last Updated" date-time at the top indicating the date-time of the last _completed_ scan of the event logs. The value is displayed to the user in his/her local time zone. ## How was this patch tested? All unit tests pass. Particularly all the suites under org.apache.spark.deploy.history.\* were run to test changes. - Very first startup - Pending logs - no logs processed yet: <img width="1280" alt="screen shot 2016-10-24 at 3 07 04 pm" src="https://cloud.githubusercontent.com/assets/12079825/19640981/b8d2a96a-99fc-11e6-9b1f-2d736fe90e48.png"> - Very first startup - Pending logs - some logs processed: <img width="1280" alt="screen shot 2016-10-24 at 3 18 42 pm" src="https://cloud.githubusercontent.com/assets/12079825/19641087/3f8e3bae-99fd-11e6-9ef1-e0e70d71d8ef.png"> - Last updated - No currently pending logs: <img width="1280" alt="screen shot 2016-10-17 at 8 34 37 pm" src="https://cloud.githubusercontent.com/assets/12079825/19443100/4d13946c-94a9-11e6-8ee2-c442729bb206.png"> - Last updated - With some currently pending logs: <img width="1280" alt="screen shot 2016-10-24 at 3 09 31 pm" src="https://cloud.githubusercontent.com/assets/12079825/19640903/7323ba3a-99fc-11e6-8359-6a45753dbb28.png"> - No applications found and No currently pending logs: <img width="1280" alt="screen shot 2016-10-24 at 3 24 26 pm" src="https://cloud.githubusercontent.com/assets/12079825/19641364/03a2cb04-99fe-11e6-87d6-d09587fc6201.png"> Author: Vinayak <vijoshi5@in.ibm.com> Closes apache#15410 from vijoshi/SAAS-608_master.
What changes were proposed in this pull request?
History Server UI's application listing to display information on currently under process event logs so a user knows that pending this processing an application may not list on the UI.
When there are no event logs under process, the application list page has a "Last Updated" date-time at the top indicating the date-time of the last completed scan of the event logs. The value is displayed to the user in his/her local time zone.
How was this patch tested?
All unit tests pass. Particularly all the suites under org.apache.spark.deploy.history.* were run to test changes.
- Very first startup - Pending logs - some logs processed:
- Last updated - No currently pending logs:
- Last updated - With some currently pending logs:
- No applications found and No currently pending logs: