
[SPARK-33841][CORE] Fix issue with jobs disappearing intermittently from the SHS under high load #30845

Closed
wants to merge 1 commit

Conversation

@vladhlinsky (Contributor) commented Dec 18, 2020

What changes were proposed in this pull request?

Mark SHS event log entries that were being processed at the beginning of the checkForLogs run as not stale, and check for this mark before deleting an event log. This fixes the issue where a particular job was displayed in the SHS, disappeared after some time, and then showed up again several minutes later.
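The following is a condensed, self-contained sketch of this approach. The types and names (LogEntry, CheckForLogsSketch) are illustrative stand-ins, not Spark's internal API; the real change lives in FsHistoryProvider.checkForLogs.

```scala
import scala.collection.mutable

// Illustrative stand-in for a listing entry; not Spark's internal type.
case class LogEntry(path: String, lastProcessed: Long)

// Sketch of the "not stale" marking: entries that are still being processed
// when the scan runs are remembered, and the stale pass skips them.
class CheckForLogsSketch(isProcessing: String => Boolean) {

  /** Returns the entries that may safely be treated as stale and removed. */
  def staleEntries(entries: Seq[LogEntry], newLastScanTime: Long): Seq[LogEntry] = {
    val notStale = mutable.HashSet[String]()

    // Update pass: entries being processed keep their old lastProcessed time,
    // but their paths are recorded so they cannot be mistaken for deleted logs.
    val updated = entries.map { e =>
      if (isProcessing(e.path)) { notStale.add(e.path); e }
      else e.copy(lastProcessed = newLastScanTime)
    }

    // Stale pass: an old lastProcessed time normally means the log is gone from
    // the event log directory; the notStale check prevents the false positive
    // for entries that were processing at the beginning of the scan.
    updated
      .filter(_.lastProcessed < newLastScanTime)
      .filterNot(e => isProcessing(e.path) || notStale.contains(e.path))
  }
}
```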

Why are the changes needed?

The issue is caused by SPARK-29043, which was intended to improve the concurrent performance of the History Server. That change breaks the "app deletion" logic because it lacks proper synchronization for event log entries that are being processed. Since the SHS now filters out all event log entries that are being processed, those entries never get updated with the new lastProcessed time. As a result, any entry that finishes processing right after the filtering step but before the stale-entry check is identified as stale and is deleted from the UI until the next checkForLogs run. The stale check uses the updated lastProcessed time as its criterion, and entries that missed the update match that criterion.
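A minimal, self-contained illustration of this race, using stand-in types and made-up timestamps (this is not Spark code, just a model of the pre-fix logic):

```scala
import scala.collection.mutable

// "app-42" is still being parsed when the scan filters the listing, finishes
// right afterwards, and is then wrongly classified as stale.
object StaleRaceDemo extends App {
  final case class Entry(path: String, var lastProcessed: Long)

  val processing = mutable.Set("app-42")   // being parsed by a background task
  val entries = Seq(Entry("app-41", 100L), Entry("app-42", 100L))
  val newLastScanTime = 200L

  // Update pass: processing entries are filtered out, so only app-41 gets the
  // new lastProcessed time.
  entries.filterNot(e => processing.contains(e.path))
    .foreach(e => e.lastProcessed = newLastScanTime)

  // app-42 finishes parsing between the update pass and the stale check.
  processing -= "app-42"

  // Stale check (pre-fix): old lastProcessed time and not currently processing,
  // so app-42 is treated as deleted from disk and dropped from the UI until the
  // next checkForLogs run rediscovers it.
  val stale = entries.filter(e =>
    e.lastProcessed < newLastScanTime && !processing.contains(e.path))
  println(stale.map(_.path))   // List(app-42)
}
```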

The issue can be reproduced by generating a large number of event logs and uploading them to the SHS event log directory on S3. In this case, around 236 copies (26.7 MB) of an event log directory were created using the shs-monitor script. The SHS then showed strange behavior in its total application count: at first the number increased as expected, but with the next page refresh the total number of applications decreased. No errors were logged by the SHS.

58 applications are displayed at 17:35:35 (screenshot: 1-58-entries-at-17-35).
25 applications are displayed at 17:36:40 (screenshot: 2-25-entries-at-17-36).

Does this PR introduce any user-facing change?

Yes. SHS users will no longer see the number of displayed applications decrease periodically.

How was this patch tested?

Tested using the shs-monitor script:

  • Build the SHS with the proposed change
  • Download the Hadoop AWS module and the AWS Java SDK
  • Prepare an S3 bucket and a user for programmatic access, grant the required roles to the user, and obtain the access key and secret key
  • Configure the SHS to read event logs from S3 (see the example configuration after this list)
  • Start the monitor script to query the SHS API
  • Run 5 producers for ~5 minutes, creating 125 copies (14.2 MB) of an event log directory
  • Wait for the SHS to load all the applications
  • Verify that the number of loaded applications increases monotonically over time
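For the configuration step above, a minimal spark-defaults.conf along the following lines could be used. The bucket name and credentials are placeholders, and the exact settings used by the shs-monitor setup may differ.

```properties
# Point the SHS at the S3 event log directory (the hadoop-aws and AWS SDK jars
# must be on the SHS classpath). All values below are placeholders.
spark.history.fs.logDirectory      s3a://<bucket>/eventlogs
spark.history.fs.update.interval   10s
spark.hadoop.fs.s3a.access.key     <ACCESS_KEY>
spark.hadoop.fs.s3a.secret.key     <SECRET_KEY>
```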

For more details, please refer to the shs-monitor repository.

This version of the reproduction uses event log directories instead of single files, since the recent optimization in SPARK-33790 makes the issue hard to reproduce with single event log files.

@github-actions bot added the CORE label Dec 18, 2020
@vladhlinsky (Contributor, Author)

cc @HeartSaVioR

@dongjoon-hyun (Member) left a comment

Thank you for making a PR to master, @vladhlinsky.
Could you rebase onto master once more to pick up the GitHub Actions fix?

@vladhlinsky (Contributor, Author)

The PR has been rebased.
Thank you, @dongjoon-hyun!

@HeartSaVioR (Contributor)

add to whitelist

@HeartSaVioR (Contributor)

retest this, please

@HeartSaVioR (Contributor) left a comment

+1 and I'd consider @tgravescs approved this PR as only target branch is different.

@dongjoon-hyun (Member)

I don't think this PR has an issue, but technically we need to ping @tgravescs to get his approval. Ping, @tgravescs.

> +1 and I'd consider @tgravescs approved this PR as only target branch is different.

@SparkQA commented Dec 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37637/

@SparkQA commented Dec 18, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37637/

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @vladhlinsky.
Merged to master for Apache Spark 3.2.0.

@dongjoon-hyun (Member)

Thank you for your first contribution, @vladhlinsky.
I added you to the Apache Spark contributor group and assigned SPARK-33841 to you.
Welcome to the Apache Spark community.

@vladhlinsky (Contributor, Author)

Thank you, @dongjoon-hyun!

@SparkQA commented Dec 18, 2020

Test build #133038 has finished for PR 30845 at commit e9f13a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
