Skip to content
This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Create metric for flux manifest errors #2535

Merged
merged 1 commit into from
Oct 24, 2019

Conversation

mpashka
Copy link
Contributor

@mpashka mpashka commented Oct 21, 2019

#2199

Expose Prometheus metric that allows alerting on invalid Kubernetes manifests:
flux_daemon_sync_manifests{success='false'} > 0 - if true then there are either some problems with applying git manifests to kubernetes - e.g. configmap size is too big to fit in annotations or immutable field (like label selector) was changed.

@mpashka mpashka changed the title Create metric for flux manifest errors - Create metric for flux manifest errors Oct 21, 2019
pkg/daemon/sync_test.go Outdated Show resolved Hide resolved
mpashka added a commit to pulsepointinc/flux that referenced this pull request Oct 22, 2019
@stefanprodan
Copy link
Member

I think the metric should be named sync_error and it should act like a boolean, values can be 0 or 1.

Labels:

  • fetch (records git errors)
  • generate (records .flux.yaml generators errors)
  • parse (records manifestsStore errors)
  • apply (records kubectl apply errors)

With this metric and labels one can compose alerts with avg_over_time(flux_daemon_sync_error [5m]) > 0 and compose messages like flux sync apply failed or flux sync parse falied.

@squaremo
Copy link
Member

squaremo commented Oct 22, 2019

It seems like the main thrust of this PR is to be able to know the number of manifests that failed to apply, so you can alert on manifests failing to apply.

There's a metric which record how long each sync takes, and whether it was successful or not -- flux_daemon_sync_duration_seconds -- but that might be too broad for your purposes, since it will label a sample as unsuccessful if any error came up while syncing, e.g., it failed to post a webhook.

So I think there's room another metric to record specifically the application step. A gauge counting how many things were applied is appropriate -- I would record how many were attempted, and label according to whether they were successful applied or not.

Tangentially for your purpose, a metric timing (and labelling the success) of each phase of the syncing might be useful too. E.g., generating the manifests, parsing the manifests, doing the application, doing the garbage collection.

@squaremo
Copy link
Member

I think the metric should be named sync_error and it should act like a boolean

I'm not a fan of metrics that are there just so you can write a very specific alert as simply as possible, but provide no other information. What if you want to alert on the proportion of apply errors to successes? Or an alert on whether the number of manifests being applied is not what you expected -- a mistake in configuration could mean you suddenly have no manifests, and no errors.

@mpashka
Copy link
Contributor Author

mpashka commented Oct 22, 2019

Main idea of initial request - #2199 - is to provide possibility to get information about current flux state - is everything is in sync and works as expected or something can't be synced and manual intervention is needed. flux_demon_sync_duration_seconds doesn't give flux current state information.
I suppose metric is proper way to get flux state.

@squaremo
Copy link
Member

Sorry, this difference of opinion is going to hold up the PR. @stefanprodan Can we agree a minimal change that makes it mergeable?

I suggest: make the metric a gauge of attempted manifest applications, labelled by success or failure. In other words,

flux_daemon_sync_manifests{success=true|false}

If you want to alert on _any_failures, you can use

flux_daemon_sync_manifests{success=false} > 0

@stefanprodan
Copy link
Member

I'm ok with flux_daemon_sync_manifests{success=true|false}. It's easy to reason about and create alerts.

@mpashka
Copy link
Contributor Author

mpashka commented Oct 22, 2019

So if sync was successful we put flux_daemon_sync_manifests{success=true}=manifest_number, if not - flux_daemon_sync_manifests{success=false}=manifest_errors
If there was error reading manifests or parsing manifest we put flux_daemon_sync_manifests{success=true}=0 and flux_daemon_sync_manifests{success=false}=number_of_parse_errors? Or just flux_daemon_sync_manifests{success=false}=1?

@stefanprodan , @squaremo ?

@squaremo
Copy link
Member

I'm not sure of the most convenient way to get these numbers, but

flux_daemon_sync_manifests{success=false}=manifest_errors
flux_daemon_sync_manifests{success=true}=total_manifest_number - manifest_errors

manifest_errors is the number of resource errors received back from the sync, as you have it now (sync.go L163).

mpashka added a commit to pulsepointinc/flux that referenced this pull request Oct 22, 2019
@mpashka
Copy link
Contributor Author

mpashka commented Oct 22, 2019

Done

Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mpashka added a commit to pulsepointinc/flux that referenced this pull request Oct 23, 2019
@mpashka
Copy link
Contributor Author

mpashka commented Oct 23, 2019

Done

Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @mpashka

Copy link
Member

@squaremo squaremo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's wrong to use a metric which is a count (manifests, here) also as a binary 1/0. Rely on another metric to determine the success or failure of the whole sync.

pkg/daemon/sync.go Outdated Show resolved Hide resolved
pkg/daemon/sync.go Outdated Show resolved Hide resolved
pkg/daemon/sync.go Outdated Show resolved Hide resolved
pkg/daemon/sync.go Outdated Show resolved Hide resolved
pkg/daemon/sync.go Outdated Show resolved Hide resolved
@mpashka
Copy link
Contributor Author

mpashka commented Oct 23, 2019

That's correct. Success or failure of the overall sync process can be obtained by running queries:
delta(flux_daemon_sync_duration_seconds_count{success='true'}[6m]) < 1
or
rate(flux_daemon_sync_duration_seconds_count{success='false'}[6m]) > 0

So we need only measure the case where manifests are correct but there were some problems applying manifests to kubernetes - e.g. configmap size is too big to fit in annotations or immutable field (like label selector) was changed.

Probably it is better to put glux alarming info into documentation as well. I suppose https://github.com/fluxcd/flux/blob/master/docs/references/monitoring.md is a good place for it.

mpashka added a commit to pulsepointinc/flux that referenced this pull request Oct 24, 2019
mpashka added a commit to pulsepointinc/flux that referenced this pull request Oct 24, 2019
Copy link
Member

@squaremo squaremo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks all squared away (I have a cosmetic suggestion only). Thank you for your diligent work @mpashka! Can you rebase+squash it into one commit (or I can do that before merging, if you have better things to do :-)

docs/references/monitoring.md Outdated Show resolved Hide resolved
mpashka added a commit to pulsepointinc/flux that referenced this pull request Oct 24, 2019
@mpashka mpashka force-pushed the errors_metrics branch 2 times, most recently from 75368ec to 45c19b1 Compare October 24, 2019 11:59
@squaremo squaremo merged commit 28db2f5 into fluxcd:master Oct 24, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants