Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alertmanager alerts limits #4253

Merged
merged 9 commits into from Jun 10, 2021

Conversation

pstibrany
Copy link
Contributor

What this PR does: This PR adds new limits in Alertmanager:

  • total number of alerts that a user can have in Alertmanager's memory (-alertmanager.max-alerts-count)
  • total size of alerts that user can have in Alertmanager's memory (-alertmanager.max-alerts-size-bytes)

These are overrideable per tenant. When limits are reached, Alertmanager will reject more alerts, produce a log message and increment cortex_alertmanager_insert_alert_failures_total metric. Additional metric to track behaviour of alerts limiter are: cortex_alertmanager_alerts_limiter_current_alerts_count and cortex_alertmanager_alerts_limiter_current_alerts_size_bytes.

This PR builds on top of #4237, and will be rebased once #4237 is merged.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pstibrany
Copy link
Contributor Author

Marking as draft until #4237 is merged.

@pstibrany pstibrany marked this pull request as draft June 3, 2021 13:23
@gotjosh
Copy link
Contributor

gotjosh commented Jun 3, 2021

Can you assign this to me? I'll give you a first pass on it.

@pstibrany
Copy link
Contributor Author

Can you assign this to me? I'll give you a first pass on it.

Unfortunately assigning only cortex-project members :( But don't let that stop you. Thank you!

@pstibrany
Copy link
Contributor Author

Rebased on top of master now, it's ready for review.

@pstibrany pstibrany marked this pull request as ready for review June 3, 2021 16:52
@stevesg
Copy link
Contributor

stevesg commented Jun 4, 2021

Will let Josh do a first pass and come back to this one.

Copy link
Contributor

@gotjosh gotjosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎖️ Fantastic job @pstibrany ! Most of my comments are nits and questions (for the benefit of my understanding).

a.mx.Lock()
defer a.mx.Unlock()

if !existing {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alert is considered unique based on the hash of its labels (fingerprint), technically if an alert comes in with a size that's OK, then its annotations change and its size is no longer acceptable we would bypass it because we would consider its size to be OK because it was already there.

I believe we need to check that if it does exist, its new size would not tip us over the limit. Unless you want the semantic to be: If we had the alert before, we'll always accept it regardless of whenever it tips us over the limit

If this is the case, can we add a comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had the alert before, we'll always accept it regardless of whenever it tips us over the limit

That was indeed my intention, mostly to make sure that we don't accidentally reject "resolved" alerts. But if the size has changed, it may make sense to reject it if it's over limit. I don't have strong opinion on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also ... when new alert arrives and "merges" with existing one, it cannot update labels (fingerprint would change) or annotations (right now, at least).

See the merging code:

func (a *Alert) Merge(o *Alert) *Alert {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or annotations (right now, at least)

I see that is true for the labels, which is expected but is it also for the annotations? I can't seem to see it in the code - am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that is true for the labels, which is expected but is it also for the annotations? I can't seem to see it in the code - am I missing something?

No, you’re right. I misread the code I linked… line 372 is copying everything over, including annotations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. In this case, I think rejection would be the best course of action - don't you think?

The alert that existed before should have a short enough endsAt that it would resolve itself eventually so there's little to no risk of just leaving the existing one to resolve by itself. On the contrary, accepting the alert that might tip us over the limit risks our availability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right, and I will make this change before merging.

Copy link
Contributor Author

@pstibrany pstibrany Jun 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the logic such that existing alert that grows and doesn't fit the limit anymore is rejected.

pkg/alertmanager/alertmanager.go Outdated Show resolved Hide resolved
pkg/alertmanager/alertmanager.go Outdated Show resolved Hide resolved
pkg/alertmanager/alertmanager.go Show resolved Hide resolved
pkg/alertmanager/alertmanager.go Show resolved Hide resolved
pkg/alertmanager/alertmanager.go Outdated Show resolved Hide resolved
pkg/util/validation/limits.go Outdated Show resolved Hide resolved
pkg/util/validation/limits.go Outdated Show resolved Hide resolved
pkg/alertmanager/multitenant.go Outdated Show resolved Hide resolved
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job! I left a couple of comments I would be glad if you could take a look 🙏 🙇

pkg/alertmanager/alertmanager_metrics.go Outdated Show resolved Hide resolved
pkg/alertmanager/alertmanager_metrics.go Outdated Show resolved Hide resolved
pkg/alertmanager/alertmanager.go Outdated Show resolved Hide resolved
pkg/alertmanager/alertmanager.go Outdated Show resolved Hide resolved
pkg/alertmanager/alertmanager.go Outdated Show resolved Hide resolved
pkg/alertmanager/alertmanager.go Show resolved Hide resolved
@pstibrany
Copy link
Contributor Author

Thank you for reviews, I've addressed all your comments. Please take a look again.

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (modulo a nit). Thanks a lot for addressing my feedback 🙏

pkg/alertmanager/alertmanager.go Outdated Show resolved Hide resolved
pstibrany and others added 9 commits June 10, 2021 17:22
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
…is rejected.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
@pracucci pracucci enabled auto-merge (squash) June 10, 2021 15:52
@pracucci pracucci merged commit cae36dc into cortexproject:master Jun 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants