Describe the bug
This was identified because a dynamic task requested a launchplan that does not exist. Propeller attempts to identify the cause of a detected error so that it can label the failure as either a system failure or a user failure. A user failure should automatically fail the node (because it is unrecoverable). In this case, however, the failure is wrongly labeled as a system failure, so Propeller repeatedly retries the node with backoff and the node takes a long time to fail properly (e.g. more than 1 hour).
I believe this is caused by the GetErrorCode function using the coder interface, whereas gRPC errors require the status.Code function. This should be tested extensively.
Expected behavior
Propeller should detect the missing launchplan as a user error and fail immediately.
Additional context to reproduce
Create a dynamic task which requests a launchplan that does not exist.
hamersaw added the bug and untriaged labels and removed the untriaged label on Apr 25, 2022.
This is a very good find @hamersaw. I think the two interfaces are very close, which probably makes the mismatch very hard to spot (uint32 vs string).
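To make the return-type mismatch concrete, here is a minimal standard-library sketch. `fakeStatus` and `stringCoded` are hypothetical stand-ins, not the real gRPC or flytestdlib types. Go only considers an interface satisfied when the method signatures match exactly, so a value whose `Code()` returns `uint32` does not satisfy an interface whose `Code()` returns `string`, and the type assertion fails without any compile-time warning.

```go
package main

import "fmt"

// ErrorCode mirrors the string-based code used by the coder interface.
type ErrorCode = string

// coder is the interface a string-coded error is expected to satisfy.
type coder interface {
	Code() ErrorCode
}

// fakeStatus is a hypothetical stand-in for an error carrying a numeric
// status code. It is NOT the real gRPC type.
type fakeStatus struct {
	code uint32
	msg  string
}

func (s *fakeStatus) Error() string { return s.msg }

// Code returns uint32, so *fakeStatus does not satisfy coder.
func (s *fakeStatus) Code() uint32 { return s.code }

// stringCoded is a hypothetical error whose Code method returns a string.
type stringCoded struct{ code ErrorCode }

func (s *stringCoded) Error() string   { return s.code }
func (s *stringCoded) Code() ErrorCode { return s.code }

// satisfiesCoder reports whether err can be asserted to the coder interface.
func satisfiesCoder(err error) bool {
	_, ok := err.(coder)
	return ok
}

func main() {
	fmt.Println(satisfiesCoder(&fakeStatus{code: 5, msg: "entity is not found"}))
	fmt.Println(satisfiesCoder(&stringCoded{code: "NotFound"}))
}
```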
Does this affect all other error codes from admin? I have seen other error codes identified.
Interesting that events seem to be handled correctly - here
@wild-endeavor actually discovered the repetitive system retries.
It should affect every place GetErrorCode is called, which includes the Is and IsCausedBy functions in flytestdlib. So it applies everywhere we use those to find the error code of a gRPC error (which may be frequent).
I think the example you linked works because we parse the error with status.FromError first. For example:
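```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type ErrorCode = string

type coder interface {
	Code() ErrorCode
}

func main() {
	var e error
	e = status.Error(codes.NotFound, "entity is not found")
	// The gRPC error does not satisfy coder, so this branch is skipped.
	if er, ok := e.(coder); ok {
		fmt.Printf("coder code: %v\n", er.Code())
	}
	// status.FromError recovers the gRPC status, which exposes the code.
	if er, ok := status.FromError(e); ok {
		fmt.Printf("status code: %v\n", er.Code())
	}
}
```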
hamersaw changed the title from "[BUG] gRPC error codes causes are not correctly parsed" to "[BUG] gRPC error codes are not correctly parsed when retrieving launchplan" on May 18, 2022.
This seems to be a localized issue with retrieving launch plans. In other locations the gRPC error code is parsed and the error is wrapped with RemoteErrorNotFound. It looks like that is how the rest of the repo handles these so that we don't have to handle gRPC error codes.
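The wrapping pattern described above might look roughly like the following standard-library sketch. Everything here is assumed for illustration: `rpcError` is a hypothetical stand-in for a transport error with a numeric code, `remoteError`, `wrapRemote`, and `codeOf` are hypothetical helpers, and the string value given to `RemoteErrorNotFound` is a guess rather than the repo's definition. The point is that once the client boundary translates the numeric code into a string-coded error, callers can use a coder-style lookup and never need to inspect gRPC codes.

```go
package main

import (
	"errors"
	"fmt"
)

type ErrorCode = string

// RemoteErrorNotFound reuses the name from the discussion; its value is assumed.
const RemoteErrorNotFound ErrorCode = "RemoteErrorNotFound"

// notFoundCode is 5, the numeric value gRPC assigns to NotFound.
const notFoundCode uint32 = 5

// rpcError is a hypothetical stand-in for a transport error with a numeric code.
type rpcError struct {
	code uint32
	msg  string
}

func (e *rpcError) Error() string { return e.msg }

// remoteError is a hypothetical string-coded error that keeps the original cause.
type remoteError struct {
	code  ErrorCode
	cause error
}

func (e *remoteError) Error() string   { return fmt.Sprintf("%s: %v", e.code, e.cause) }
func (e *remoteError) Code() ErrorCode { return e.code }
func (e *remoteError) Unwrap() error   { return e.cause }

// wrapRemote translates a transport-level not-found into a string-coded error
// and returns every other error unchanged.
func wrapRemote(err error) error {
	var re *rpcError
	if errors.As(err, &re) && re.code == notFoundCode {
		return &remoteError{code: RemoteErrorNotFound, cause: err}
	}
	return err
}

// codeOf extracts a string code through a coder-style interface, if present.
func codeOf(err error) (ErrorCode, bool) {
	var c interface{ Code() ErrorCode }
	if errors.As(err, &c) {
		return c.Code(), true
	}
	return "", false
}

func main() {
	raw := &rpcError{code: notFoundCode, msg: "launchplan does not exist"}
	_, found := codeOf(raw)
	fmt.Println(found) // the raw transport error carries no string code
	code, found := codeOf(wrapRemote(raw))
	fmt.Println(code, found)
}
```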