
observability: Add an alert for network connections. #11825

Merged: 1 commit merged into main on Aug 10, 2022
Conversation

utam0k (Contributor) commented Aug 3, 2022

Description

Related Issue(s)

Fixes #

How to test

Release Notes

NONE

Documentation

Werft options:

  • /werft with-preview

@utam0k utam0k requested a review from a team August 3, 2022 05:18
@github-actions github-actions bot added the team: workspace Issue belongs to the Workspace team label Aug 3, 2022
description: 'Network connection numbers remain high for 5 minutes.',
},
expr: |||
node_nf_conntrack_entries{instance=~"workspace-.*"} > 5000
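For context, the rule under review would sit in the monitoring mixin roughly as follows. This is a sketch reconstructed from the snippets quoted in this thread; the alert name is an assumption, and the severity and duration reflect the initial version before the review feedback below:

```jsonnet
{
  alert: 'GitpodWorkspaceNetworkConnectionsTooHigh',  // assumed name, not shown in this thread
  expr: |||
    node_nf_conntrack_entries{instance=~"workspace-.*"} > 5000
  |||,
  'for': '5m',
  labels: {
    severity: 'critical',  // later changed to 'warning' with team: 'workspace' per review
  },
  annotations: {
    description: 'Network connection numbers remain high for 5 minutes.',
  },
}
```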
Member:
I fear that people will get paged needlessly if the alert already fires when the number of connections goes above 5000. Is this not very similar to the GitpodNodeConntrackTableIsFull and GitpodNodeConntrackTableGettingFull alert?

Contributor:
@utam0k can you change this back to draft? 🙏

Why draft? I think more testing needs to be done with the expression before marking this ready for review. For example, given the expression you want to use here, can you test whether it helps you find abusers on nodes? If it does not surface related abuse, then the connection threshold is probably too low.

I fear that people will get paged needlessly if the alert already fires when the number of connections goes above 5000.

Agreed

Is this not very similar to the GitpodNodeConntrackTableIsFull and GitpodNodeConntrackTableGettingFull alert?

This is a bit different, because those are percentage based like (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.95. What we're trying to build here is a water mark, where, if it gets exceeded for a period of time, that's a signal that potential abuse needs to be investigated. I agree with @Furisto , we don't want it to fire excessively.

@utam0k @Furisto

wdyt about something like node_nf_conntrack_entries{node=~"workspace.*", instance!~"serv.*"} > 20000 where the duration is 10m? If you look at this now, you'll see there are three nodes, which could potentially have abusers.

[screenshot: node_nf_conntrack_entries graph showing three nodes above the proposed threshold]

Member:
This is a bit different, because those are percentage based like (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.95.

Isn't that the same thing, just expressed differently? A percentage of something resolves to a number in the end.

utam0k (Contributor Author), Aug 5, 2022:

GitpodNodeConntrackTableIsFull is an alert that detects a node hitting its limit; this alert instead targets abusive users. I believe that is the difference. So the key point of this alert is that the threshold is not tied to node_nf_conntrack_entries_limit.
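The two styles of expression under discussion can be put side by side; the percentage-based form is the one quoted earlier in the thread for the existing GitpodNodeConntrackTable* alerts, while the absolute form is the watermark proposed in this PR:

```promql
# Capacity alert: fires as a node approaches its conntrack table limit.
(node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.95

# Abuse watermark: fires on an absolute connection count,
# independent of node_nf_conntrack_entries_limit.
node_nf_conntrack_entries{instance=~"workspace-.*"} > 5000
```

The capacity alert tracks each node's own limit; the watermark uses one fixed number across nodes, which is why its threshold needs tuning against real abuse cases rather than against table size.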

Contributor Author:

I created a metric.
How about having the on-caller check this metric for abuse?

wdyt about something like node_nf_conntrack_entries{node=~"workspace.*", instance!~"serv.*"} > 20000 where the duration is 10m? If you look at this now, you'll see there are three nodes, which could potentially have abusers.

Contributor:
@utam0k I'm not sure we should route this to on-callers, yet.

@aledbf tracks connections and finds abusers very fast. @aledbf , wdyt of min_over_time(node_nf_conntrack_entries{instance=~"workspace-.*", instance!~"serv.*"}[10m]) > 20000 versus the expression that you use now?
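For what it's worth, a rough reading of the min_over_time variant proposed above (matchers quoted from the comment; whether both matchers belong on the same label is part of what still needed testing):

```promql
# Fires only when the *minimum* sample over the trailing 10 minutes
# is above 20000, i.e. the connection count stayed high for the whole
# window rather than briefly spiking above the threshold.
min_over_time(node_nf_conntrack_entries{instance=~"workspace-.*", instance!~"serv.*"}[10m]) > 20000
```

Compared with a plain `> 20000` plus `'for': '10m'`, this makes the sustained-high requirement explicit in the query itself, which fits the "watermark for potential abuse" intent described earlier in the thread.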

@utam0k utam0k marked this pull request as draft August 4, 2022 02:44
kylos101 (Contributor) left a comment:
Let's route to Slack, instead of PagerDuty, for this initial alert.

Also, I'm consulting with @aledbf regarding the Prometheus expression used to trigger the alert. He often catches abuse when inspecting connections, so he will be "closer" to what a "reasonable" threshold is.

Comment on lines 53 to 56
labels: {
severity: 'critical',
},
Contributor:
Suggested change
labels: {
severity: 'critical',
},
labels: {
severity: 'warning',
team: 'workspace'
},

This way it routes to https://gitpod.slack.com/archives/C03QZ1NH517, and we can gauge if this is too frequent.

Contributor:
☝️ because I don't want to send the alert to on-callers until we (on the workspace team) have felt it first in Slack.

utam0k (Contributor Author) commented Aug 8, 2022:

@kylos101 @Furisto Thanks for your review. I updated this PR.

},
'for': '10m',
annotations: {
runbook_url: 'https://github.com/gitpod-io/runbooks/blob/main/runbooks/NetworkConnectionsTooHigh.md',
Contributor:
@utam0k can you remove the runbook or leave it empty for now? I ask because eventually I think we'll want to reuse this runbook for high normalized load and high # of connections.

Aside from that, this looks great!

Contributor Author:
Sorry, I didn't quite understand 🙇 Should I close this PR and make runbook_url empty?
https://github.com/gitpod-io/runbooks/pull/379

Contributor:
@utam0k this PR is good, no need to close it. I was just pointing out that the runbook doesn't exist yet; in the future we can model it after the one I shared.

You are welcome to mark this ready for review, so we can approve. :)

kylos101 (Contributor) left a comment:

Go go gadget alert

@utam0k utam0k marked this pull request as ready for review August 10, 2022 03:54
@roboquat roboquat merged commit 2d1f66a into main Aug 10, 2022
@roboquat roboquat deleted the to/alert branch August 10, 2022 03:55
@roboquat roboquat added deployed: workspace Workspace team change is running in production deployed Change is completely running in production labels Aug 15, 2022