[BUG] gRPC error codes are not correctly parsed when retrieving launchplan #2404

hamersaw · 2022-04-25T01:25:28Z

Describe the bug

This was identified when a dynamic task requested a launchplan that does not exist. Propeller attempts to identify the cause of a detected error so that it can label the failure as either a system failure or a user failure. A user failure automatically fails the node, because it is unrecoverable. In this case, however, the error is wrongly labeled as a system failure, so propeller repeatedly retries the node with backoff and the node takes a long time to fail properly (e.g. more than 1 hour).

I believe this is caused by the GetErrorCode function using the coder interface, whereas gRPC errors require the status.Code function. This should be tested extensively.

Expected behavior

Propeller should detect the missing launchplan as a user error and fail immediately.

Additional context to reproduce

Create a dynamic task which requests a launchplan that does not exist.

Screenshots

No response

Are you sure this issue hasn't been raised already?

Yes

Have you read the Code of Conduct?

Yes

kumare3 · 2022-04-25T05:15:56Z

This is a very good find @hamersaw. I think the two interfaces are very close, so the difference is probably very hard to spot (uint32 vs string).
Does this affect all other error codes from admin? I have seen other error codes identified.
Interesting that it seems events is handled correctly - here

hamersaw · 2022-04-25T10:37:01Z

@wild-endeavor actually discovered the repetitive system retries.

It should affect every place where GetErrorCode is called, which includes the Is and IsCausedBy functions in flytestdlib. That means everywhere we use those functions to find the error code of a gRPC error (which may be frequent).

I think the example you linked works because we parse the error with status.FromError first. For example:

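package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// ErrorCode and coder reproduce the string-based coder interface discussed above.
type ErrorCode = string
type coder interface {
	Code() ErrorCode
}

func main() {
	// A gRPC status error, as a failed RPC would return it.
	e := status.Error(codes.NotFound, "entity is not found")

	// Not taken: the gRPC status error has no Code() string method.
	if er, ok := e.(coder); ok {
		fmt.Printf("coder code: %v\n", er.Code())
	}

	// Taken: status.FromError extracts the gRPC status and its code.
	if er, ok := status.FromError(e); ok {
		fmt.Printf("status code: %v\n", er.Code())
	}
}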

This prints only status code: NotFound, because the gRPC error does not satisfy the coder interface.

This should be a quick fix, but it is important.

hamersaw · 2022-05-18T14:34:46Z

This seems to be a localized issue with retrieving launch plans. In other locations the gRPC error code is parsed and the error is wrapped with RemoteErrorNotFound. It looks like that is how the rest of the repo handles these so that we don't have to handle gRPC error codes.

hamersaw added the bug and untriaged labels and removed the untriaged label Apr 25, 2022

hamersaw self-assigned this Apr 25, 2022

hamersaw added this to the 1.0.1 milestone Apr 25, 2022

EngHabu modified the milestones: 1.0.1, 1.0.2 May 11, 2022

hamersaw mentioned this issue May 18, 2022

GetLaunchPlan checks for NotFound gRPC code rather than nil launchplan flyteorg/flytepropeller#441

Merged

8 tasks

hamersaw changed the title ~~[BUG] gRPC error codes causes are not correctly parsed~~ [BUG] gRPC error codes are not correctly parsed when retrieving launchplan May 18, 2022

hamersaw closed this as completed in flyteorg/flytepropeller#441 Jun 6, 2022
