observability: Add an alert for network connections. #11825
Conversation
description: 'Network connection numbers remain high for 5 minutes.',
},
expr: |||
  node_nf_conntrack_entries{instance=~"workspace-.*"} > 5000
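For context, here is a minimal sketch of how such a rule might sit inside a jsonnet monitoring mixin. The alert name and group name are assumptions for illustration, not necessarily what this PR uses:

{
  prometheusAlerts+:: {
    groups+: [{
      name: 'node-network',
      rules: [{
        // Hypothetical alert name; the PR's actual name may differ.
        alert: 'GitpodNodeNetworkConnectionsTooHigh',
        labels: {
          severity: 'critical',
        },
        annotations: {
          description: 'Network connection numbers remain high for 5 minutes.',
        },
        expr: |||
          node_nf_conntrack_entries{instance=~"workspace-.*"} > 5000
        |||,
        'for': '5m',
      }],
    }],
  },
}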
I fear that people will get paged needlessly if the alert already fires when the number of connections goes above 5000. Is this not very similar to the GitpodNodeConntrackTableIsFull and GitpodNodeConntrackTableGettingFull alerts?
@utam0k can you change this back to draft? 🙏
Why draft? I think more testing needs to be done with the expression before marking this ready for review. For example, given the expression you want to use here, can you test whether it helps you find abusers on nodes? If not, i.e. you cannot find related abuse with it, then the connection threshold is probably too low.
I fear that people will get paged needlessly if the alert already fires when the number of connections goes above 5000.
Agreed
Is this not very similar to the GitpodNodeConntrackTableIsFull and GitpodNodeConntrackTableGettingFull alert?
This is a bit different, because those are percentage based, like (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.95. What we're trying to build here is a watermark: if it gets exceeded for a period of time, that's a signal that potential abuse needs to be investigated. I agree with @Furisto, we don't want it to fire excessively.
wdyt about something like node_nf_conntrack_entries{node=~"workspace.*", instance!~"serv.*"} > 20000 where the duration is 10m? If you look at this now, you'll see there are three nodes, which could potentially have abusers.
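To make the distinction concrete, here are the two styles side by side, using the expressions from this discussion (the field names are illustrative):

{
  // Percentage-based: relative to each node's conntrack table limit,
  // i.e. a resource-exhaustion signal.
  conntrackTableGettingFull: |||
    (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.95
  |||,
  // Absolute watermark: a fixed count independent of the node's limit,
  // meant as a signal that potential abuse should be investigated.
  connectionsWatermark: |||
    node_nf_conntrack_entries{node=~"workspace.*", instance!~"serv.*"} > 20000
  |||,
}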
This is a bit different, because those are percentage based like (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.95.
Isn't that the same thing, just expressed differently? A percentage of something resolves to a number in the end.
GitpodNodeConntrackTableIsFull is an alert that detects when a node hits its limit; this alert, by contrast, targets abusive users. I believe that is the difference. So I think the key point of this alert is that the threshold is not tied to node_nf_conntrack_entries_limit.
I created a metric.
How about having the on-caller check this metric for abuse?
wdyt about something like node_nf_conntrack_entries{node=~"workspace.*", instance!~"serv.*"} > 20000 where the duration is 10m? If you look at this now, you'll see there are three nodes, which could potentially have abusers.
Let's route to Slack, instead of PagerDuty, for this initial alert.
Also, I'm consulting with @aledbf re: the Prometheus expression being used to trigger the alert. He often catches abuse when inspecting connections, so he will be "closer" to what is "reasonable".
labels: {
  severity: 'critical',
},
Suggested change:

labels: {
  severity: 'warning',
  team: 'workspace'
},
This way it routes to https://gitpod.slack.com/archives/C03QZ1NH517, and we can gauge if this is too frequent.
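For illustration, team-based routing like this might look roughly as follows in an Alertmanager routing tree, sketched here in jsonnet; the receiver names and overall shape are assumptions, not Gitpod's actual configuration:

{
  route: {
    // Assumed default receiver: page the on-callers.
    receiver: 'pagerduty-on-call',
    routes: [{
      // Alerts carrying team=workspace and severity=warning are
      // diverted to the team's Slack channel instead of paging.
      match: {
        team: 'workspace',
        severity: 'warning',
      },
      receiver: 'slack-team-workspace',
    }],
  },
}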
☝️ because I don't want to send the alert to on-callers until we (on the workspace team) have felt it first in Slack.
},
'for': '10m',
annotations: {
  runbook_url: 'https://github.com/gitpod-io/runbooks/blob/main/runbooks/NetworkConnectionsTooHigh.md',
@utam0k can you remove the runbook or leave it empty for now? I ask because eventually I think we'll want to reuse this runbook for high normalized load and high # of connections.
Aside from that, this looks great!
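Applied to the snippet above, leaving it empty could look like this (a sketch; whether to drop the field entirely or keep it empty is up to the author):

annotations: {
  // Intentionally left empty for now; a shared runbook for high
  // normalized load and a high number of connections can be linked later.
  runbook_url: '',
},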
Sorry, I didn't quite understand 🙇 Should I close this PR and make runbook_url empty?
https://github.com/gitpod-io/runbooks/pull/379
@utam0k this PR is good, no need to close. I was just noting that the runbook doesn't exist yet, and in the future we can model it after the one I shared.
You are welcome to mark this ready for review, so we can approve. :)
Go go gadget alert