
[SPARK-33841][CORE][3.0] Fix issue with jobs disappearing intermittently from the SHS under high load #30842

Closed

Conversation

vladhlinsky
Contributor

What changes were proposed in this pull request?

Mark SHS event log entries that were `processing` at the beginning of the `checkForLogs` run as not stale, and check for this mark before deleting an event log. This fixes the issue where a job is displayed in the SHS, disappears after some time, and then shows up again several minutes later.
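
To illustrate the approach, here is a minimal, self-contained sketch in Scala. All names (`LogInfo`, the simplified `checkForLogs` signature, the sample paths) are hypothetical stand-ins; the actual change lives in `FsHistoryProvider.checkForLogs` and differs in detail:

```scala
import scala.collection.mutable

object NotStaleSketch {
  // Toy stand-in for the listing entries kept by the History Server.
  final case class LogInfo(logPath: String, lastProcessed: Long, processing: Boolean)

  def checkForLogs(listing: Seq[LogInfo], newLastScanTime: Long): Seq[LogInfo] = {
    val notStale = mutable.HashSet[String]()

    // Entries owned by an in-flight replay task are filtered out of this scan,
    // but recorded so the stale sweep below does not mistake them for stale ones.
    val updated = listing
      .filter { info =>
        if (info.processing) { notStale += info.logPath; false } else true
      }
      .map(_.copy(lastProcessed = newLastScanTime))

    // Stale sweep: without the notStale check, any entry whose lastProcessed was
    // not refreshed in this scan would be dropped from the listing (and the UI).
    (updated ++ listing.filter(_.processing)).filter { info =>
      info.lastProcessed >= newLastScanTime || notStale.contains(info.logPath)
    }
  }

  def main(args: Array[String]): Unit = {
    val listing = Seq(
      LogInfo("s3a://bucket/eventlog-1", lastProcessed = 0L, processing = false),
      LogInfo("s3a://bucket/eventlog-2", lastProcessed = 0L, processing = true) // still replaying
    )
    // Both entries survive the scan; without the mark, eventlog-2 would be dropped.
    checkForLogs(listing, newLastScanTime = 100L).foreach(println)
  }
}
```

The key point is that the set is populated and consulted within the same `checkForLogs` run, so an entry that finishes replaying mid-run can no longer be mistaken for a stale one.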

Why are the changes needed?

The issue is caused by [SPARK-29043](https://issues.apache.org/jira/browse/SPARK-29043), which was meant to improve the concurrent performance of the History Server. The [change](https://github.com/apache/spark/pull/25797/files#) breaks the ["app deletion" logic](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R563) because it lacks proper synchronization for `processing` event log entries. Since the SHS now [filters out](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462) all `processing` event log entries, such entries never get [updated with the new `lastProcessed`](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R472) time. As a result, any entry that finishes processing right after the [filtering](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462) and before [the check for stale entries](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R560) is identified as stale and is deleted from the UI until the next `checkForLogs` run. This happens because the [updated `lastProcessed` time is used as the staleness criterion](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R557), and entries that were not updated with the new time match that criterion.
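
Spelled out as a timeline within a single `checkForLogs` run (the application name and times below are illustrative, not from the original report):

```scala
// 1. The listing is filtered; "app-42" is still `processing` from a previous run,
//    so it is skipped and its lastProcessed is NOT bumped to newLastScanTime.
// 2. The replay task for "app-42" finishes and clears the `processing` flag.
// 3. The same run reaches the stale check: app-42's lastProcessed < newLastScanTime,
//    so the entry is treated as stale and removed from the UI.
// 4. The next checkForLogs run re-reads the event log and the application
//    reappears, matching the observed "disappears, then comes back" behavior.
```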

The issue can be reproduced by generating a large number of event logs and uploading them to the SHS event log directory on S3. In this test, around 800 (82.6 MB) copies of an event log file were created using the [shs-monitor](https://github.com/vladhlinsky/shs-monitor) scripts. The SHS then showed strange behavior in its total application count: at first the number increased as expected, but on a later page refresh the total decreased. No errors were logged by the SHS.

241 entities are displayed at `20:50:42`:
![1-241-entities-at-20-50](https://user-images.githubusercontent.com/61428392/102611539-c2138d00-4137-11eb-9bbd-d77b22041f3b.png)
203 entities are displayed at `20:52:17`:
![2-203-entities-at-20-52](https://user-images.githubusercontent.com/61428392/102611561-cdff4f00-4137-11eb-91ed-7405fe58a695.png)
The number of loaded applications over time:
![4-loaded-applications](https://user-images.githubusercontent.com/61428392/102611586-d8b9e400-4137-11eb-8747-4007fc5469de.png)

Does this PR introduce any user-facing change?

Yes, SHS users will no longer see the number of displayed applications decrease periodically.

How was this patch tested?

Tested using the [shs-monitor](https://github.com/vladhlinsky/shs-monitor) scripts:

  • Build SHS with the proposed change
  • Download Hadoop AWS and the AWS Java SDK
  • Prepare an S3 bucket and a user for programmatic access, grant the required roles to the user, and obtain an access key and secret key
  • Configure SHS to read event logs from S3 (see the configuration sketch below)
  • Start the [monitor](https://github.com/vladhlinsky/shs-monitor/blob/main/monitor.sh) script to query the SHS API
  • Run 8 [producers](https://github.com/vladhlinsky/shs-monitor/blob/main/producer.sh) for ~10 mins, creating 805 (83.1 MB) event log copies
  • Wait for SHS to load all the applications
  • Verify that the number of loaded applications increases continuously over time
    ![5-loaded-applications-fixed](https://user-images.githubusercontent.com/61428392/102617363-bf1d9a00-4141-11eb-9bae-f982d02fd30f.png)

For more details, please refer to the [shs-monitor](https://github.com/vladhlinsky/shs-monitor) repository.
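
For the S3 configuration step above, the History Server settings might look roughly like the following `spark-defaults.conf` fragment. The bucket name and credentials are placeholders, and the `hadoop-aws`/`aws-java-sdk` jars downloaded in the second step need to be on the History Server classpath (for example, copied into Spark's `jars/` directory):

```properties
# Hypothetical spark-defaults.conf fragment; adjust the bucket and keys for your setup.
spark.history.fs.logDirectory      s3a://<your-bucket>/event-logs
spark.hadoop.fs.s3a.access.key     <ACCESS_KEY>
spark.hadoop.fs.s3a.secret.key     <SECRET_KEY>
```

The monitor script can then poll the SHS REST API (for example, `/api/v1/applications`) to record the number of loaded applications over time.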

@vladhlinsky
Contributor Author

cc @HeartSaVioR

@HeartSaVioR
Contributor

ok to test

@HeartSaVioR
Contributor

cc. @tgravescs

@HeartSaVioR
Contributor

@vladhlinsky Thanks for the patch! We tend to describe what the PR is doing in the PR title, unlike the JIRA issue title. Could you please update the title of the PR? Thanks!

@HeartSaVioR (Contributor) left a comment

+1

Let's wait a couple of days to give others a chance to review, probably until early next week.

@vladhlinsky vladhlinsky changed the title [SPARK-33841][CORE] Jobs disappear intermittently from the SHS under high load [SPARK-33841][CORE] Fix issue with jobs disappearing intermittently from the SHS under high load Dec 18, 2020
@vladhlinsky
Contributor Author

PR title has been updated.
@HeartSaVioR thank you for the review!

@tgravescs (Contributor) left a comment

overall changes make sense

@@ -460,9 +460,17 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
val newLastScanTime = clock.getTimeMillis()
logDebug(s"Scanning $logDir with lastScanTime==$lastScanTime")

val notStale = mutable.HashSet[String]()
Contributor

it would be nice to add a comment here describing notStale and why we need it.

Contributor Author

@tgravescs thank you for the review!
PR has been updated with a comment.

@SparkQA

SparkQA commented Dec 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37618/

@SparkQA

SparkQA commented Dec 18, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37618/

@SparkQA

SparkQA commented Dec 18, 2020

Test build #133019 has finished for PR 30842 at commit f6f0217.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37626/

@SparkQA

SparkQA commented Dec 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37626/

@SparkQA

SparkQA commented Dec 18, 2020

Test build #133027 has finished for PR 30842 at commit c2b8c2d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

Hi, @vladhlinsky .

Please make a PR against the master branch.
To prevent any accidental regression, we merge the patch from master -> branch-3.1 -> branch-3.0 if you want to deliver this to 3.0.2.

This PR should not land on branch-3.0 until we have this in master for Apache Spark 3.2 and in branch-3.1 for Apache Spark 3.1.

@dongjoon-hyun
Member

dongjoon-hyun commented Dec 18, 2020

BTW, cc @HyukjinKwon since he is the release manager for Apache Spark 3.1.0 and this seems to target branch-3.1, too.

@vladhlinsky
Contributor Author

Hi @dongjoon-hyun,
Submitted #30845 and #30847 against master and branch-3.1, respectively.
Thank you!

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-33841][CORE] Fix issue with jobs disappearing intermittently from the SHS under high load [SPARK-33841][CORE][3.0] Fix issue with jobs disappearing intermittently from the SHS under high load Dec 18, 2020
@dongjoon-hyun
Member

Thank you, @vladhlinsky !

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thanks, @vladhlinsky and all.
Merged to branch-3.0 for Apache Spark 3.0.2.

dongjoon-hyun pushed a commit that referenced this pull request Dec 18, 2020
[SPARK-33841][CORE][3.0] Fix issue with jobs disappearing intermittently from the SHS under high load

Closes #30842 from vladhlinsky/SPARK-33841-branch-3.0.

Authored-by: Vlad Glinsky <vladhlinsky@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@HyukjinKwon
Member

Thanks @dongjoon-hyun for cc'ing me, and @vladhlinsky for the fix :-)
