Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Canceled grpc is now not considered an error #556

Closed
wants to merge 1 commit into from

Conversation

ksa-real
Copy link

"Canceled" grpc state means client-initiated cancellation of streaming
grpc call, e.g. Watch. This is a legitimate non-error case. Before this
change such codes were considered an error and created false positive
alerts.

"Canceled" grpc state means client-initiated cancellation of streaming
grpc call, e.g. Watch. This is a legitimate non-error case. Before this
change such codes were considered an error and generated false positive
alerts.

Signed-off-by: Sergei Kuzmin <sergeikuzmin@gmail.com>
@ksa-real
Copy link
Author

ksa-real commented Jul 9, 2022

@spzala Can you please take a look and suggest how to proceed? I see that op-guide/etcd3_alert.rules.yml is not present anymore in op-guide in v3.5 and v3.6. The v3.4 was the latest. Are the rules not supported anymore? What is the new place for the source of truth for the rules? The issue is that kube-prometheus-stack is using etcd-io rules for its setup. At the moment I think it still takes rules from v3.4.

ksa-real added a commit to ksa-real/helm-charts that referenced this pull request Jul 9, 2022
The related PR to etcd-io/website:
etcd-io/website#556

Signed-off-by: Sergei Kuzmin <sergeikuzmin@gmail.com>
ksa-real added a commit to ksa-real/helm-charts that referenced this pull request Jul 9, 2022
The related PR to etcd-io/website:
etcd-io/website#556

There is a bunch or related issues since 2020:
https://github.com/search?q=etcdHighNumberOfFailedGRPCRequests&type=issues
I think the most common resolution was to just disable the rule.

Signed-off-by: Sergei Kuzmin <sergeikuzmin@gmail.com>
ksa-real added a commit to ksa-real/helm-charts that referenced this pull request Jul 9, 2022
"Canceled" grpc state means client-initiated cancellation of streaming
grpc call, e.g. Watch. This is a legitimate non-error case. Before this
change such codes were considered an error and generated false positive
alerts.

The related PR to etcd-io/website:
etcd-io/website#556

There is a bunch or related issues since 2020:
https://github.com/search?q=etcdHighNumberOfFailedGRPCRequests&type=issues
I think the most common resolution was to just disable the rule.

Signed-off-by: Sergei Kuzmin <sergeikuzmin@gmail.com>
ksa-real added a commit to ksa-real/helm-charts that referenced this pull request Jul 9, 2022
"Canceled" grpc state means client-initiated cancellation of streaming
grpc call, e.g. Watch. This is a legitimate non-error case. Before this
change such codes were considered an error and generated false positive
alerts.

The related PR to etcd-io/website:
etcd-io/website#556

There is a bunch or related issues since 2020:
https://github.com/search?q=etcdHighNumberOfFailedGRPCRequests&type=issues
I think the most common resolution was to just disable the rule.

Signed-off-by: Sergei Kuzmin <sergeikuzmin@gmail.com>
Copy link
Member

@spzala spzala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ksa-real I am looking at it but quickly - to reduce the noise these changes were merged etcd-io/etcd#13127 The rules are supported afaik. I think it's the doc that's missing update. As the rules file says # These rules were manually synced from https://github.com/etcd-io/etcd/blob/master/contrib/mixin/mixin.libsonnet The mixin.libsonnet is there on the master branch.

@spzala
Copy link
Member

spzala commented Jul 18, 2022

@ksa-real so as I mentioned above, the issue as mentioned in this PR is taken care in the, https://github.com/etcd-io/etcd/blob/master/contrib/mixin/mixin.libsonnet
Isn't it that the kube-prometheus use the mixin.libsonnet file?. This PR has some related old discussions, etcd-io/etcd#12689

@spzala
Copy link
Member

spzala commented Jul 18, 2022

fyi @ahrtr @serathius

@ksa-real
Copy link
Author

ksa-real commented Jul 19, 2022

Thanks for pointing it out. Few thoughts:

  • As the source of truth moved to mixin it is more logical to continue changes there instead of etcd3_alert.rules.yml. One issue is that Kube Prometheus Stack hasn't created a tool to import from the mixin yet, so I guess it is up to you whether to allow backport to 3.4.
  • I think it makes more sense to interpret all codes as an error, unless whitelisted, instead of the current way: interpret all codes as non-error unless explicitly interpreted as an error. I.e. grpc_code!~"OK|Canceled" vs grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded". This way we will be more cautious if a new code is ever added.

spzala added a commit to spzala/website that referenced this pull request Jul 19, 2022
The added file  in this PR was missing in the v3.5 doc. It was there
in the previous releases and used by the Kube Prometheus Stack.

Related etcd-io#556
spzala added a commit to spzala/website that referenced this pull request Jul 19, 2022
The added file  in this PR was missing in the v3.5 doc. It was there
in the previous releases and used by the Kube Prometheus Stack.

Related etcd-io#556

Signed-off-by: Sahdev Zala spzala@us.ibm.com
@spzala spzala mentioned this pull request Jul 19, 2022
spzala added a commit to spzala/website that referenced this pull request Jul 19, 2022
The added file  in this PR was missing in the v3.5 doc. It was there
in the previous releases and used by the Kube Prometheus Stack.

Related etcd-io#556

Signed-off-by: Sahdev Zala <spzala@us.ibm.com>
@spzala
Copy link
Member

spzala commented Jul 19, 2022

@ksa-real yrw! If changing etcd3_alert.rules.ymlper your PR helps Kube Prometheus Stack, I think that should be fine and we can merge it after review. Does that sounds good?
About your second point - this was discussed under etcd-io/etcd#13127 and the decision was to go with the current approach, so I think it's good to leave it that way. In the future, if we see this approach needs to be changed due to maintenance or usability reasons, please propose accordingly.
Also, I have created a new file from https://github.com/etcd-io/etcd/blob/release-3.5/contrib/mixin/mixin.libsonnet for the v3.5, if that sounds good? - #600

@ksa-real
Copy link
Author

SGTM. As for the second point, I guess GRPC codes change rarely, so I don't have too high motivation to insist :) I think I'll leave my arguments at the original discussion and close this diff.

@ksa-real ksa-real closed this Jul 19, 2022
@spzala
Copy link
Member

spzala commented Jul 20, 2022

Ok, cool :) Thanks @ksa-real

@ksa-real ksa-real deleted the canceled-watch branch July 20, 2022 01:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants