[SPARK-33841][CORE] Fix issue with jobs disappearing intermittently from the SHS under high load #30845
cc @HeartSaVioR
Thank you for making a PR to master, @vladhlinsky.
Could you rebase onto master once more to pick up the GitHub Actions fix?
The PR has been rebased.
add to whitelist
retest this, please
+1. I'd consider this PR approved by @tgravescs, since only the target branch is different.
I don't think this PR has an issue, but technically we need to ping @tgravescs to get his approval. Ping, @tgravescs.
Kubernetes integration test starting
Kubernetes integration test status failure
+1, LGTM. Thank you, @vladhlinsky.
Merged to master for Apache Spark 3.2.0.
Thank you for your first contribution, @vladhlinsky.
Thank you, @dongjoon-hyun!
Test build #133038 has finished for PR 30845 at commit
What changes were proposed in this pull request?
Mark SHS event log entries that were `processing` at the beginning of the `checkForLogs` run as not stale, and check for this mark before deleting an event log. This fixes the issue where a particular job was displayed in the SHS, disappeared after some time, and then showed up again several minutes later.
Why are the changes needed?
The issue is caused by SPARK-29043, which was designed to improve the concurrent performance of the History Server. That change breaks the "app deletion" logic because it lacks proper synchronization for `processing` event log entries. Since SHS now filters out all `processing` event log entries, such entries do not have a chance to be updated with the new `lastProcessed` time. As a result, any entity that completes processing right after the filtering and before the check for stale entities is identified as stale and deleted from the UI until the next `checkForLogs` run. This happens because the updated `lastProcessed` time is used as the staleness criterion, and event log entries that were not updated with the new time match that criterion.
The issue can be reproduced by generating a large number of event logs and uploading them to the SHS event log directory on S3. Essentially, around 236 (26.7 MB) copies of an event log directory were created using the shs-monitor script. Strange behavior was noticed in how SHS counted the total number of applications: at first the number increased as expected, but on the next page refresh the total number of applications decreased. No errors were logged by SHS.
58 entities are displayed at 17:35:35.
25 entities are displayed at 17:36:40.
Does this PR introduce any user-facing change?
Yes. SHS users will no longer see the number of displayed applications decrease periodically.
How was this patch tested?
Tested using shs-monitor script:
For more details, please refer to the shs-monitor repository.
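The race described under "Why are the changes needed?" and the proposed marking of `processing` entries can be illustrated with a small model. This is a minimal sketch in Python, not Spark's Scala code: `HistoryListing`, `check_for_logs`, `mark_not_stale`, `finish_during_scan`, and `run` are hypothetical names invented for illustration, and the scan is reduced to a single thread with the race simulated explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class HistoryListing:
    """Toy stand-in for the SHS listing; not Spark's actual data structures."""
    last_processed: dict = field(default_factory=dict)  # log path -> scan time
    processing: set = field(default_factory=set)        # paths being processed


def check_for_logs(listing, log_dir, now, mark_not_stale, finish_during_scan=()):
    """Run one simplified scan and return the paths deleted as stale."""
    not_stale = set()
    for path in log_dir:
        if path in listing.processing:
            # Entries being processed are filtered out, so their
            # last_processed time is not refreshed in this run.
            if mark_not_stale:
                not_stale.add(path)  # the fix: remember them as not stale
            continue
        listing.last_processed[path] = now

    # Simulated race: processing completes after the filter above
    # but before the stale check below.
    listing.processing -= set(finish_during_scan)

    stale = [
        path
        for path, seen in listing.last_processed.items()
        if seen < now
        and path not in listing.processing
        and path not in not_stale
    ]
    for path in stale:
        del listing.last_processed[path]
    return stale


def run(mark_not_stale):
    """Scan two logs while app-2 finishes processing mid-scan."""
    listing = HistoryListing(
        last_processed={"app-1": 1, "app-2": 1},
        processing={"app-2"},
    )
    return check_for_logs(
        listing,
        ["app-1", "app-2"],
        now=2,
        mark_not_stale=mark_not_stale,
        finish_during_scan=["app-2"],
    )
```

In this model, `run(False)` deletes `app-2` even though its log is still in the directory, which mirrors the disappearing applications; `run(True)` deletes nothing, because the entry was marked as not stale at the start of the scan.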