Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YUNIKORN-657] Expose application failure reasons #258

Merged
merged 2 commits into from May 10, 2021
Merged

[YUNIKORN-657] Expose application failure reasons #258

merged 2 commits into from May 10, 2021

Conversation

ghost
Copy link

@ghost ghost commented May 5, 2021

What is this PR for?

When an application fails or is rejected by the scheduler, the pods currently stay in pending forever without visibility on what happened to them. This PR propagates the scheduling error or failure to the pods so that the user will be able to see the reason of failure. If third-party operators (e.g. Spark K8s Operator) are used, they can also react according to the pod status change and update the status of the corresponding CRD and possibly retry. Core side changes are in apache/yunikorn-core#271

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-657

How should this be tested?

Added unit tests to cover this change. When running in a cluster, tested that scheduling failures will be reflected on the pods:

For example, if an application is rejected, its pods will be marked as failed and its status field will be populated with the following information:

status:
  message: ' application ''spark-<ID>'' rejected, cannot
    create queue ''root.spark'' without placement rules'
  phase: Failed
  reason: ApplicationRejected

Another example is that if a gang-scheduled application cannot successfully run all placeholders before they expire, the app pods' status field will show:

status:
  message: Scheduling has timed out due to insufficient resources
  phase: Failed
  reason: InsufficientResources

Screenshots (if appropriate)

Questions:

  • - The licenses files need update.
  • - There is breaking changes for older versions.
  • - It needs documentation.

@codecov
Copy link

codecov bot commented May 6, 2021

Codecov Report

Merging #258 (5b5c36d) into master (c47ed51) will increase coverage by 0.90%.
The diff coverage is 65.15%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #258      +/-   ##
==========================================
+ Coverage   59.75%   60.65%   +0.90%     
==========================================
  Files          35       36       +1     
  Lines        3133     3342     +209     
==========================================
+ Hits         1872     2027     +155     
- Misses       1180     1232      +52     
- Partials       81       83       +2     
Impacted Files Coverage Δ
pkg/appmgmt/appmgmt_recovery.go 67.50% <0.00%> (-8.18%) ⬇️
pkg/cache/amprotocol_mock.go 0.00% <0.00%> (ø)
pkg/cache/node.go 84.55% <ø> (ø)
pkg/cache/nodes.go 79.80% <ø> (ø)
pkg/common/events/recorder_mock.go 0.00% <0.00%> (ø)
pkg/common/resource.go 90.72% <0.00%> (-9.28%) ⬇️
pkg/common/utils/gang_utils.go 65.43% <0.00%> (-16.11%) ⬇️
pkg/controller/application/app_controller.go 71.05% <ø> (-0.26%) ⬇️
...missioncontrollers/webhook/admission_controller.go 33.74% <0.00%> (+1.00%) ⬆️
pkg/shim/main.go 0.00% <ø> (ø)
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 54c9b31...5b5c36d. Read the comment docs.

@yangwwei yangwwei assigned ghost May 8, 2021
Comment on lines 134 to 145
func (nc SchedulerKubeClient) Update(pod *v1.Pod) (*v1.Pod, error) {
var updatedPod *v1.Pod
var err error
if updatedPod, err = nc.clientSet.CoreV1().Pods(pod.Namespace).UpdateStatus(pod); err != nil {
log.Logger().Warn("failed to update pod",
zap.String("namespace", pod.Namespace),
zap.String("podName", pod.Name),
zap.Error(err))
return pod, err
}
return updatedPod, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is called "Update" but the actual impl is "UpdateStatus".
I think we should either re-implement this with "nc.clientSet.CoreV1().Pods(pod.Namespace).Update()" or change this function name to "UpdateStatus" to avoid mis-usage.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Changed to UpdateStatus

Comment on lines 61 to 63
if pod.Status.Phase == v1.PodFailed && pod.Status.Reason == constants.ApplicationRejectedFailure || pod.Status.Reason == constants.ApplicationInsufficientResourcesFailure {
return false, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simply return false here when pod phase is Failed?
Do we have to check the reason?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Removed reason check

Comment on lines 237 to 248
var err error
if len(se.GetArgs()) >= 1 {
eventArgs := make([]string, 2)
if err = events.GetEventArgsAsStrings(eventArgs, se.GetArgs()); err != nil {
log.Logger().Error("fail to parse event arg", zap.Error(err))
}
message := eventArgs[1]
err = ss.stateMachine.Event(string(se.GetEvent()), message)
} else {
err = ss.stateMachine.Event(string(se.GetEvent()))
}
if err != nil && err.Error() == "no transition" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a more general form? The interface looks like

type SchedulerEvent interface {
  GetEvent() SchedulerEventType
  GetArgs() []interface{}
}

Can we leverage the args here:

err = ss.stateMachine.Event(string(se.GetEvent()), se.GetArgs())

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Updated accordingly. Now it's more generic and cleaner 👍

@yangwwei
Copy link
Contributor

yangwwei commented May 8, 2021

Overall looks good, thanks for improving this! Please see some minor comments, thank you @yuchaoran-apple .

@yangwwei yangwwei changed the title Shim side changes for YUNIKORN-657 [YUNIKORN-657] Expose application failure reasons May 10, 2021
Copy link
Contributor

@yangwwei yangwwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yangwwei yangwwei merged commit 8034184 into apache:master May 10, 2021
@ghost ghost deleted the YUNIKORN-657 branch May 10, 2021 15:17
craigcondit pushed a commit to craigcondit/yunikorn-k8shim that referenced this pull request May 10, 2022
When the application transits to failed state, publish pod events to all pods of this application
to indicate why the application was failed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants