Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup obsolete local files for alertmanager. #3910

Merged

Conversation

pstibrany
Copy link
Contributor

What this PR does: This PR implements cleanup of local files on alertmanager when AM is no longer running for given user.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job! I'm a bit scared about the alertmanagersMtx locking complexity in MultitenantAlertmanager. I'm wondering if it's time to properly address it (with a per-tenant alertmanager lock and a state to CAS on it, so we can safely handle all cases). I'm also open to work on such refactoring.

pkg/alertmanager/multitenant.go Outdated Show resolved Hide resolved
pkg/alertmanager/multitenant.go Show resolved Hide resolved
pkg/alertmanager/alertmanager.go Outdated Show resolved Hide resolved
pkg/alertmanager/alertmanager.go Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
pkg/alertmanager/multitenant.go Outdated Show resolved Hide resolved
pkg/alertmanager/multitenant.go Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
@pstibrany pstibrany force-pushed the alertmanager-cleanup-local-files branch from b3d987d to abecf4c Compare March 8, 2021 08:19
@pstibrany
Copy link
Contributor Author

pstibrany commented Mar 8, 2021

I've pushed next version of PR that implemens following changes:

  • per-tenant state is now kept in per-tenant directory, under cfg.DataDir
  • cleanup for the user means deleting entire directory
  • there is a migration procedure, that moves existing nflog:<tenantID>, silences:<tenantID> and templates into per-tenant directory on startup.

I'd be interested to hear opinions on these design decisions. (Tests are not yet updated to cover this, so please ignore tests for now)

One con of this approach is that reverting back to previous version of AM will ignore existing snapshot files in per-tenant directory. Hopefully moving from new to previous AM version is rare.

/cc @gotjosh @pracucci @stevesg

pracucci
pracucci previously approved these changes Mar 8, 2021
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job! Changes LGTM. I've left a couple of nits and a comment about testing.

CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated
@@ -2,6 +2,8 @@

## master / unreleased

* [CHANGE] Alertmanager: clean obsolete local files after Alertmanager is no longer running for removed or resharded user. #3910
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this CHANGELOG entry. Also remember to mention that the path where each tenant alertmanager data is stored has changed.

pkg/alertmanager/multitenant.go Outdated Show resolved Hide resolved
pkg/alertmanager/multitenant.go Show resolved Hide resolved
Copy link
Contributor

@stevesg stevesg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a quick pass over. I had a question though:

there is a migration procedure, that moves existing nflog:, silences: and > templates into per-tenant directory on startup.

I'm wondering what the benefit is from having the migration - as there is some complexity here we could avoid. Is listing out the files for a particular user difficult in some way? Or perhaps are we worried about the number of files in a directory?

Not saying one way or the other, just curious on the reasoning.

pkg/alertmanager/multitenant.go Outdated Show resolved Hide resolved
@pstibrany
Copy link
Contributor Author

I'm wondering what the benefit is from having the migration - as there is some complexity here we could avoid. Is listing out the files for a particular user difficult in some way? Or perhaps are we worried about the number of files in a directory?

Not saying one way or the other, just curious on the reasoning.

Rationale behind single directory per tenant is to make it easy to delete tenant files when needed. Previous version of this PR only deleted silence and nflog file, but today I've found that I've missed template files – which then led to idea of using per-tenant directory.

Reasoning behind migrating is to avoid losing notifications and silences state. Even though this state is propagated by other alertmanagers via gossip, it takes some time and in the meantime missing silences could cause spurious notifications. Note that this problem still exists when doing downgrade to previous AM, which doesn't understand new structure.

@pstibrany pstibrany force-pushed the alertmanager-cleanup-local-files branch from 69a9383 to e432d82 Compare March 8, 2021 17:16
@pstibrany pstibrany requested a review from pracucci March 8, 2021 20:24
@pstibrany pstibrany dismissed pracucci’s stale review March 8, 2021 20:26

PR has changed since it was approved, please take a look again.

@pstibrany
Copy link
Contributor Author

PR is now ready for review. Since original version, it has also added change to how files are stored locally. I've also extended unit tests to cover use of templates, that were missing.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Add test for migration.
Fix test for deletion of unused dirs.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
@pstibrany pstibrany force-pushed the alertmanager-cleanup-local-files branch from 5f6bdc5 to de67abb Compare March 8, 2021 21:09
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, LGTM! I left few nits just to be a bit more clear in log messages (some log messages are generic and the final user may be lost reading them).

pkg/alertmanager/multitenant.go Outdated Show resolved Hide resolved
func (am *MultitenantAlertmanager) newAlertmanager(userID string, amConfig *amconfig.Config, rawCfg string) (*Alertmanager, error) {
reg := prometheus.NewRegistry()

tenantDir := am.getTenantDirectory(userID)
err := os.MkdirAll(tenantDir, 0777)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
err := os.MkdirAll(tenantDir, 0777)
err := os.MkdirAll(tenantDir, os.ModePerm)


for userID, files := range st {
tenantDir := am.getTenantDirectory(userID)
err := os.MkdirAll(tenantDir, 0777)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
err := os.MkdirAll(tenantDir, 0777)
err := os.MkdirAll(tenantDir, os.ModePerm)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it is the same value, the intent is not the same. os.ModePerm is simply all permission-bits set and defined next to other Mode* constants for higher-bits. Note that there is no such constant for files (0666, without exec-bit).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the fact since this is merged, but why do we need 0777 instead of 0755? Every one of these causes some pain for anyone using source scanning tools for :( I know we have some other similar permissions, so there may be a good reason, but want to understand if so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for using wide permissions is mostly for consistency with vendored Prometheus code, which already uses very wide permissions (see prometheus/prometheus#7782). Cortex users that want reduced permissions need to use umask to do so. (I’m not quite sure how to do that in Kubernetes though).

pkg/alertmanager/multitenant.go Outdated Show resolved Hide resolved
pkg/alertmanager/multitenant.go Outdated Show resolved Hide resolved
pkg/alertmanager/multitenant.go Outdated Show resolved Hide resolved
pkg/alertmanager/multitenant.go Outdated Show resolved Hide resolved
pkg/alertmanager/multitenant.go Outdated Show resolved Hide resolved
Copy link
Contributor

@stevesg stevesg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a unit test nit.

pkg/alertmanager/multitenant_test.go Outdated Show resolved Hide resolved
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
@pstibrany pstibrany force-pushed the alertmanager-cleanup-local-files branch from bcf060a to 6746059 Compare March 10, 2021 16:43
@pstibrany pstibrany merged commit f485bc8 into cortexproject:master Mar 11, 2021
roystchiang pushed a commit to roystchiang/cortex that referenced this pull request Apr 6, 2022
* Cleanup obsolete local files for alertmanager.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* CHANGELOG.md

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Comment.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Don't ignore directories. Log error when deletion fails instead.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Address review feedback.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Move per-tenant state into tenant directory to simplify cleanup.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Move migration to separate function.
Add test for migration.
Fix test for deletion of unused dirs.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Store templates to correct place.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* CHANGELOG.md

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Verify that templates are stored properly into correct location.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Comments.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Comments.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Apply suggestions from code review

Co-authored-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Review feedback.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

Co-authored-by: Marco Pracucci <marco@pracucci.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants