Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

contrib/mixin/mixin.libsonnet: Adjust gRPC failed requests #13127

Merged
merged 1 commit into from Jun 23, 2021

Conversation

lilic
Copy link
Contributor

@lilic lilic commented Jun 21, 2021

OK gRPC code is not the only one that is successful, this also captured
context canceled, NotFound, and other non-error requests codes.

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

OK is not the only one that is allowed, this before also captured
context canceled, NotFound, and other non error requests.
@lilic
Copy link
Contributor Author

lilic commented Jun 21, 2021

@hexfusion please have a look, thanks!

@@ -88,7 +88,7 @@
{
alert: 'etcdHighNumberOfFailedGRPCRequests',
expr: |||
100 * sum(rate(grpc_server_handled_total{%(etcd_selector)s, grpc_code!="OK"}[5m])) without (grpc_type, grpc_code)
100 * sum(rate(grpc_server_handled_total{%(etcd_selector)s, grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be easier to maintain list of failed codes or successful ones? I lean towards keeping the list of successful codes, as it better handles case when new code is introduced. It assumes that new codes are errors, so it will alert by default which is better than assuming that new codes are ok and not noticing problem. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure can do that, I get your point, just not sure when we would be adding new codes, as these are the gRPC standard status codes listed here https://grpc.github.io/grpc/core/md_doc_statuscodes.html.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@serathius let me know what you think, happy to reverse the order if you think it still makes sense here, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok as it doesn't look there will be any difference in cost of maintaining this list up to date.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's old, but haven't we left out codes that actually weren't creating the noise? E.g. in K8s, only the 'Canceled' code was really creating noise for me.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on the 'Canceled' code

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, only 'Canceled' is giving the noise

Copy link
Contributor

@hexfusion hexfusion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 this cleans up a fair amount of noise, thanks @lilic. Regarding new codes we have not seen an addition to the list since the initial commit[1] so I think this approach is fine.

[1]grpc/grpc-go@c463038

@hexfusion hexfusion merged commit c44d89f into etcd-io:main Jun 23, 2021
@lilic lilic deleted the fix-grpc-req-alert branch June 23, 2021 16:37
svparamzin added a commit to svparamzin/helm-charts that referenced this pull request May 17, 2022
etcd-io/etcd#13127

Let me try to summarize the issue:

The problem of the expression comes from the grpc_code!="OK" part: it treats other non-error codes (canceled, invalidArgument, notFound, and etc) as errors so the expression reports a high rate of errors thus triggers too many alerts (see GRPC standard here).
The above problem in the expression has been fixed in the upstream etcd repo: etcd-io/etcd#13127

Signed-off-by: svparamzin <60502949+svparamzin@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants