[observability] Introduce "ReplicaUnavailable" alerts #20344

Merged
2 commits merged into main from gpl/909-replica-mismatch
Nov 7, 2024

Conversation

geropl
Member

@geropl geropl commented Nov 5, 2024

Description

Adds alerts based on kube_deployment_status_replicas_unavailable.
It fits what we are aiming for much better than the existing "replica mismatch" alerts; see this image:

  • green: existing alerts, only notify at the end (overlapping spike)
  • yellow: new alerts, notifying from the beginning (with the 10m timeout window)
    image
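
As a rough illustration of the kind of rule described above, here is a minimal sketch of a Prometheus alerting rule built on kube_deployment_status_replicas_unavailable with a 10m window. The group name, alert name, deployment selector, severity label, and annotation text are all assumptions made for this example; it is not the rule added by this PR.

```yaml
# Hypothetical sketch only: names and label values are illustrative,
# not the rule merged in this PR.
groups:
  - name: example-replica-unavailable        # assumed group name
    rules:
      - alert: ExampleReplicaUnavailable      # assumed alert name
        # True whenever at least one replica of the deployment is unavailable.
        expr: kube_deployment_status_replicas_unavailable{deployment="example-deployment"} > 0
        # The condition must hold continuously for 10 minutes before the alert fires.
        for: 10m
        labels:
          severity: warning                   # assumed severity label
        annotations:
          summary: "Deployment {{ $labels.deployment }} has unavailable replicas"
```

With `for: 10m`, the alert is pending from the moment replicas first become unavailable and fires once that state has persisted for the whole window, which matches the "notifying from the beginning" behavior described in the yellow bullet.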

Related Issue(s)

Related: CLC-909

How to test

Documentation

Preview status

gitpod:summary

Build Options

Build
  • /werft with-werft
    Run the build with werft instead of GHA
  • leeway-no-cache
  • /werft no-test
    Run Leeway with --dont-test
Publish
  • /werft publish-to-npm
  • /werft publish-to-jb-marketplace
Installer
  • analytics=segment
  • with-dedicated-emulation
  • workspace-feature-flags
    Add desired feature flags to the end of the line above, space separated
Preview Environment / Integration Tests
  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-large-vm
  • /werft with-gce-vm
    If enabled this will create the environment on GCE infra
  • /werft preemptible
    Saves cost. Untick this only if you're really sure you need a non-preemptible machine.
  • with-integration-tests=all
    Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh. If enabled, with-preview and with-large-vm will be enabled.
  • with-monitoring

/hold

Member

@filiptronicek filiptronicek left a comment

Not sure if we need a review from @kylos101, but changes LGTM.

@kylos101
Contributor

kylos101 commented Nov 6, 2024

> Not sure if we need a review from @kylos101, but changes LGTM.

Thanks for the ping, @filiptronicek !

I talked with @geropl about this earlier today; we were thinking of doing a few things:

  1. leave this as a warning
  2. remove the dedicated: included label, so that the alert applies to gitpod.io only

💡 This way related alerts for gitpod.io route to Slack.

  3. create a separate set of alerts in gitpod-io/gitpod-dedicated, factor in maintenance mode in the alert PromQL, and keep it at the error level:

    sum by(cell) (kube_deployment_status_replicas_unavailable{deployment=~"image-builder-mk3", cluster!~"ephemeral.*", cell="foo"}) > 0
    and
    sum by(cell) (gitpod_ws_manager_mk2_maintenance_enabled{cell="foo"}) == 0

💡 This way we avoid false positives during rollout, but still alert on-call via PagerDuty when replicas remain unavailable after sufficient time has passed, so that the partial or complete outage can be inspected.

For example, replicas could be unavailable due to a cloud provider outage, a CNI issue on a node, etc.
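
To make the maintenance-mode idea concrete, the query above could be wrapped in an alerting rule roughly like the sketch below. The group name, alert name, the `for` duration, the `severity: error` label, and the annotation text are assumptions for illustration; only the expression comes from this comment, and `cell="foo"` remains a placeholder.

```yaml
# Hypothetical sketch only: everything except the expr is an assumption.
groups:
  - name: example-dedicated-replica-unavailable   # assumed group name
    rules:
      - alert: ExampleImageBuilderReplicaUnavailable   # assumed alert name
        # Left side: cells with at least one unavailable image-builder-mk3 replica.
        # Right side: cells where maintenance mode is off.
        # `and` keeps a left-side series only if a right-side series has the
        # same label set; both sides aggregate by(cell), so they match on cell.
        expr: |
          sum by(cell) (kube_deployment_status_replicas_unavailable{deployment=~"image-builder-mk3", cluster!~"ephemeral.*", cell="foo"}) > 0
          and
          sum by(cell) (gitpod_ws_manager_mk2_maintenance_enabled{cell="foo"}) == 0
        for: 10m              # assumed window; the comment only says "sufficient time"
        labels:
          severity: error     # assumed label for the "error level" mentioned above
        annotations:
          summary: "Cell {{ $labels.cell }} has unavailable image-builder-mk3 replicas outside maintenance mode"
```

Because the right-hand side drops out while maintenance mode is enabled, the `and` yields no series in that state, so the `for` window only starts counting once the cell is out of maintenance.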

@geropl geropl force-pushed the gpl/909-replica-mismatch branch from e81edb7 to 919e23f Compare November 7, 2024 07:38
@roboquat roboquat added the size/M label Nov 7, 2024
@geropl
Member Author

geropl commented Nov 7, 2024

I dropped the dedicated: included labels and will merge in that state. 👍

@geropl
Member Author

geropl commented Nov 7, 2024

/unhold

@roboquat roboquat merged commit 9de8339 into main Nov 7, 2024
16 of 17 checks passed
@roboquat roboquat deleted the gpl/909-replica-mismatch branch November 7, 2024 07:42
4 participants