
observability: Add an alert for network connections. #11825

Merged: 1 commit merged into main on Aug 10, 2022
Conversation

utam0k (Contributor) commented Aug 3, 2022

Description

Related Issue(s)

Fixes #

How to test

Release Notes

NONE

Documentation

Werft options:

  • /werft with-preview

@utam0k utam0k requested a review from a team August 3, 2022 05:18
@github-actions github-actions bot added the team: workspace Issue belongs to the Workspace team label Aug 3, 2022
description: 'Network connection numbers remain high for 5 minutes.',
},
expr: |||
node_nf_conntrack_entries{instance=~"workspace-.*"} > 5000
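For context, the rule under review would sit in the monitoring mixin roughly as follows. This is a sketch reconstructed from the snippets quoted in this thread; the alert name is an assumption, and the severity and duration reflect the initial version before the review feedback below:

```jsonnet
{
  alert: 'GitpodWorkspaceNetworkConnectionsTooHigh',  // assumed name, not shown in this thread
  expr: |||
    node_nf_conntrack_entries{instance=~"workspace-.*"} > 5000
  |||,
  'for': '5m',
  labels: {
    severity: 'critical',  // later changed to 'warning' with team: 'workspace' per review
  },
  annotations: {
    description: 'Network connection numbers remain high for 5 minutes.',
  },
}
```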
Member:
I fear that people will get paged needlessly if the alert already fires when the number of connections goes above 5000. Is this not very similar to the GitpodNodeConntrackTableIsFull and GitpodNodeConntrackTableGettingFull alert?

Contributor:
@utam0k can you change this back to draft? 🙏

Why draft? I think more testing needs to be done with the expression before marking this ready for review. For example, given the expression you want to use here, can you test whether it helps you find abusers on nodes? If it does not surface related abuse, then the connection threshold is probably too low.

I fear that people will get paged needlessly if the alert already fires when the number of connections goes above 5000.

Agreed

Is this not very similar to the GitpodNodeConntrackTableIsFull and GitpodNodeConntrackTableGettingFull alert?

This is a bit different, because those are percentage based like (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.95. What we're trying to build here is a water mark, where, if it gets exceeded for a period of time, that's a signal that potential abuse needs to be investigated. I agree with @Furisto , we don't want it to fire excessively.

@utam0k @Furisto

wdyt about something like node_nf_conntrack_entries{node=~"workspace.*", instance!~"serv.*"} > 20000 where the duration is 10m? If you look at this now, you'll see there are three nodes, which could potentially have abusers.

[screenshot: node_nf_conntrack_entries graph showing three nodes above the proposed threshold]

Member:
This is a bit different, because those are percentage based like (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.95.

Isn't that the same thing, just expressed differently? A percentage of something resolves to a number in the end.

utam0k (Contributor Author), Aug 5, 2022:

GitpodNodeConntrackTableIsFull is an alert that detects a node hitting its limit; this alert instead targets abusive users. I believe that is the difference. So the key point of this alert is that the threshold is not tied to node_nf_conntrack_entries_limit.
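The two styles of expression under discussion can be put side by side; the percentage-based form is the one quoted earlier in the thread for the existing GitpodNodeConntrackTable* alerts, while the absolute form is the watermark proposed in this PR:

```promql
# Capacity alert: fires as a node approaches its conntrack table limit.
(node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.95

# Abuse watermark: fires on an absolute connection count,
# independent of node_nf_conntrack_entries_limit.
node_nf_conntrack_entries{instance=~"workspace-.*"} > 5000
```

The capacity alert tracks each node's own limit; the watermark uses one fixed number across nodes, which is why its threshold needs tuning against real abuse cases rather than against table size.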

Contributor Author:

I created a metric.
How about having the on-caller check this metric for abuse?

wdyt about something like node_nf_conntrack_entries{node=~"workspace.*", instance!~"serv.*"} > 20000 where the duration is 10m? If you look at this now, you'll see there are three nodes, which could potentially have abusers.

Contributor:
@utam0k I'm not sure we should route this to on-callers, yet.

@aledbf tracks connections and finds abusers very fast. @aledbf , wdyt of min_over_time(node_nf_conntrack_entries{instance=~"workspace-.*", instance!~"serv.*"}[10m]) > 20000 versus the expression that you use now?
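For what it's worth, a rough reading of the min_over_time variant proposed above (matchers quoted from the comment; whether both matchers belong on the same label is part of what still needed testing):

```promql
# Fires only when the *minimum* sample over the trailing 10 minutes
# is above 20000, i.e. the connection count stayed high for the whole
# window rather than briefly spiking above the threshold.
min_over_time(node_nf_conntrack_entries{instance=~"workspace-.*", instance!~"serv.*"}[10m]) > 20000
```

Compared with a plain `> 20000` plus `'for': '10m'`, this makes the sustained-high requirement explicit in the query itself, which fits the "watermark for potential abuse" intent described earlier in the thread.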

@utam0k utam0k marked this pull request as draft August 4, 2022 02:44
kylos101 (Contributor) left a comment:
Let's route to Slack, instead of PagerDuty, for this initial alert.

Also, I'm consulting with @aledbf regarding the Prometheus expression used to trigger the alert. He often catches abuse when inspecting connections, so he will be "closer" to what a "reasonable" threshold is.

Comment on lines 53 to 56
labels: {
severity: 'critical',
},
Contributor:
Suggested change
labels: {
severity: 'critical',
},
labels: {
severity: 'warning',
team: 'workspace'
},

This way it routes to https://gitpod.slack.com/archives/C03QZ1NH517, and we can gauge if this is too frequent.

Contributor:
☝️ because I don't want to send the alert to on-callers until we (on the workspace team) have felt it first in Slack.

utam0k (Contributor Author) commented Aug 8, 2022:

@kylos101 @Furisto Thanks for your review. I updated this PR.

},
'for': '10m',
annotations: {
runbook_url: 'https://github.com/gitpod-io/runbooks/blob/main/runbooks/NetworkConnectionsTooHigh.md',
Contributor:
@utam0k can you remove the runbook or leave it empty for now? I ask because eventually I think we'll want to reuse this runbook for high normalized load and high # of connections.

Aside from that, this looks great!

Contributor Author:
Sorry, I didn't quite understand 🙇 Should I close this PR and make runbook_url empty?
https://github.com/gitpod-io/runbooks/pull/379

Contributor:
@utam0k this PR is good, no need to close it. I was just pointing out that the runbook doesn't exist yet; in the future we can model it after the one I shared.

You are welcome to mark this ready for review, so we can approve. :)

kylos101 (Contributor) left a comment:

Go go gadget alert

@utam0k utam0k marked this pull request as ready for review August 10, 2022 03:54
@roboquat roboquat merged commit 2d1f66a into main Aug 10, 2022
@roboquat roboquat deleted the to/alert branch August 10, 2022 03:55
@roboquat roboquat added deployed: workspace Workspace team change is running in production deployed Change is completely running in production labels Aug 15, 2022