Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: More performance metrics #4823

Merged
merged 15 commits into from
Jan 7, 2021
Merged

Conversation

simster7
Copy link
Member

@simster7 simster7 commented Jan 4, 2021

Closes: #4800

Signed-off-by: Simon Behar simbeh7@gmail.com

Checklist:

Signed-off-by: Simon Behar <simbeh7@gmail.com>
Signed-off-by: Simon Behar <simbeh7@gmail.com>
workflow/cron/controller.go Outdated Show resolved Hide resolved
Copy link
Contributor

@alexec alexec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did something similar to this recently, but I took a slightly different approach. I created a new queue:

metrics.NewWorkQueueWithMetrics(workqueue.NewNamedRateLimitingQueue(&fixedItemIntervalRateLimiter{}, "workflow_queue"), "workflow_queue")

I don't think there's much in it mind you. Just a thought.

@alexec
Copy link
Contributor

alexec commented Jan 4, 2021

@simster7 I've got a few metrics we might want to do at some point:

  1. Busy workers.
  2. K8S API Metrics
  3. Count of workflow pods by phase (similar impl to count of workflows)
  4. Count of workflows that actually have running pods (by adding a condition).

I did PoC for these, but 3 and 4 are a bit complex and I don't know if we need both.

@simster7
Copy link
Member Author

simster7 commented Jan 4, 2021

Sure, I'll look into this

Signed-off-by: Simon Behar <simbeh7@gmail.com>
Signed-off-by: Simon Behar <simbeh7@gmail.com>
Signed-off-by: Simon Behar <simbeh7@gmail.com>
@simster7
Copy link
Member Author

simster7 commented Jan 5, 2021

Busy workers Added

K8S API Metrics: not sure it should be our responsibility to emit K8s metrics. Shouldn't K8s have an endpoint for these already? K8s emits metrics already: https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/

Count of workflow pods by phase (similar impl to count of workflows): currently have only 1 metric for active pods (i.e. phase==Running). I had actually first set the metric for all phases, but that would require changing our Pod informer from only looking at not completed pods to looking at all pods. If we think this is worth it, I can work to ensure that changing the nature of the informer doesn't have any intended consequences.

Count of workflows that actually have running pods (by adding a condition): this seems to be tied more to the refactored mentioned in #4824. This should perhaps be changed with that

@simster7 simster7 marked this pull request as ready for review January 5, 2021 17:42
Signed-off-by: Simon Behar <simbeh7@gmail.com>
Copy link
Contributor

@alexec alexec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some tests are probably in order for this PR please.

workflow/controller/indexes/labels.go Outdated Show resolved Hide resolved
workflow/controller/indexes/labels_test.go Outdated Show resolved Hide resolved
workflow/metrics/util.go Outdated Show resolved Hide resolved
@@ -205,6 +213,31 @@ func (m *Metrics) CronWorkflowSubmissionError() {
m.errors[ErrorCauseCronWorkflowSubmissionError].Inc()
}

func (m *Metrics) WorkerBusy(workerType string) {
m.mutex.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these mutex may create contention in the VM - doesn't the gauge already have these?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not too sure what you mean. These mutexes are needed to keep concurrency issues between writing/updating metrics and metrics being scraped. While individual metrics may be concurrent safe, the collection of metrics and the metrics scraper are not without these locks. They are present in all other functions in this file.

workflow/metrics/metrics.go Outdated Show resolved Hide resolved
workflow/controller/controller.go Outdated Show resolved Hide resolved
@alexec
Copy link
Contributor

alexec commented Jan 5, 2021

Can you take a look at https://github.com/argoproj/argo/pull/4811/files#diff-2721f0228996fd51c5f9a8db168a353b398d051154c82cebb03f03ddd1ee0574?

I think it gives a way to do the workers busy metric very safely.

@alexec
Copy link
Contributor

alexec commented Jan 5, 2021

K8S API Metrics: not sure it should be our responsibility to emit K8s metrics. Shouldn't K8s have an endpoint for these already? K8s emits metrics already: https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/

Interesting. @jessesuen are you aware of this?

Count of workflow pods by phase (similar impl to count of workflows): currently have only 1 metric for active pods (i.e. phase==Running). I had actually first set the metric for all phases, but that would require changing our Pod informer from only looking at not completed pods to looking at all pods. If we think this is worth it, I can work to ensure that changing the nature of the informer doesn't have any intended consequences.

I think pods that are pending are useful. I think you can do all in-complete pods.

Count of workflows that actually have running pods (by adding a condition): this seems to be tied more to the refactored mentioned in #4824. This should perhaps be changed with that

This is harder and I'm not 100% sure this is a valuable metric. Lets hold on it.

Signed-off-by: Simon Behar <simbeh7@gmail.com>
Signed-off-by: Simon Behar <simbeh7@gmail.com>
Signed-off-by: Simon Behar <simbeh7@gmail.com>
workflow/metrics/util.go Outdated Show resolved Hide resolved
Signed-off-by: Simon Behar <simbeh7@gmail.com>
Signed-off-by: Simon Behar <simbeh7@gmail.com>
Copy link
Contributor

@alexec alexec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor comments - maybe when you merge give a more specific commit message than the PR currently has?

workflow/controller/indexes/pod_index.go Outdated Show resolved Hide resolved
return map[v1.PodPhase]prometheus.Gauge{
v1.PodPending: prometheus.NewGauge(getOptsByPhase(v1.PodPending)),
v1.PodRunning: prometheus.NewGauge(getOptsByPhase(v1.PodRunning)),
//v1.PodSucceeded: prometheus.NewGauge(getOptsByPhase(v1.PodSucceeded)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete comments?

workflow/metrics/util.go Outdated Show resolved Hide resolved
workflow/metrics/work_queue.go Outdated Show resolved Hide resolved
workflow/metrics/work_queue.go Outdated Show resolved Hide resolved
workflow/metrics/work_queue.go Outdated Show resolved Hide resolved
workflow/metrics/work_queue.go Outdated Show resolved Hide resolved
workflow/metrics/work_queue_test.go Show resolved Hide resolved
Signed-off-by: Simon Behar <simbeh7@gmail.com>
@simster7 simster7 merged commit 6b3ce50 into argoproj:master Jan 7, 2021
@simster7 simster7 mentioned this pull request Jan 12, 2021
15 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

More Prometheus metrics
2 participants