Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] gRPC error codes are not correctly parsed when retrieving launchplan #2404

Closed
2 tasks done
hamersaw opened this issue Apr 25, 2022 · 3 comments · Fixed by flyteorg/flytepropeller#441
Closed
2 tasks done
Assignees
Labels
bug Something isn't working
Milestone

Comments

@hamersaw
Copy link
Contributor

Describe the bug

This was identified because a dynamic task requested a launchplan that does not exist. Propeller attempts to identify the cause of the detected error so that it can label the failure as either a system or a user failure. In this case, a user failure will automatically fail the node (because it is unrecoverable). However, it is wrongly labeled as a system failure. Therefore, propeller continually attempts to execute the node with backoff and it results in the node taking a long time to properly fail (ex. >1hour).

I believe this is caused by the GetErrorCode function using the coder interface, whereas gRPC errors require the status.Code function. This should be tested extensively.

Expected behavior

Propeller should detect the missing launchplan as a user error and fail immediately.

Additional context to reproduce

Create a dynamic task which requests a launchplan that does not exist.

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@hamersaw hamersaw added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers and removed untriaged This issues has not yet been looked at by the Maintainers labels Apr 25, 2022
@hamersaw hamersaw self-assigned this Apr 25, 2022
@hamersaw hamersaw added this to the 1.0.1 milestone Apr 25, 2022
@kumare3
Copy link
Contributor

kumare3 commented Apr 25, 2022

This is a very good find @hamersaw. I think the interface is very close and so probably very hard to identify (unit32 vs string)
Does this affect all other error codes from admin? I have seen other error codes identified.
Interesting that it seems events is handled correctly - here

@hamersaw
Copy link
Contributor Author

hamersaw commented Apr 25, 2022

@wild-endeavor actually discovered the repetitive system retries.

It should affect everywhere the GetErrorCode is called, which includes the Is and IsCausedBy functions in flytestdlib. So everywhere we use those to find the error code of a gRPC error (which may be frequent).

I think the example you linked works because we parse the error with status.FromError first. For example:

type ErrorCode = string
type coder interface {
	Code() ErrorCode
}

func main() {
	var e error
	e = status.Error(codes.NotFound, "entity is not found")

	if er, ok := e.(coder); ok {
		fmt.Printf("coder code: %v\n", er.Code())
	}

	if er, ok := status.FromError(e); ok {
		fmt.Printf("status code: %v\n", er.Code())
	}
}

prints status code: NotFound.

This should be a quick fix, but it is important.

@EngHabu EngHabu modified the milestones: 1.0.1, 1.0.2 May 11, 2022
@hamersaw hamersaw changed the title [BUG] gRPC error codes causes are not correctly parsed [BUG] gRPC error codes are not correctly parsed when retrieving launchplan May 18, 2022
@hamersaw
Copy link
Contributor Author

hamersaw commented May 18, 2022

This seems to be a localized issue with retrieving launch plans. In other locations the gRPC error code is parsed and the error is wrapped with RemoteErrorNotFound. It looks like that is how the rest of the repo handles these so that we don't have to handle gRPC error codes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants