
Real-time metrics disappear while the workflow is running when metricsTTL is shorter than the workflow run time #12790

Open · 3 of 4 tasks
aleung2021 opened this issue Mar 12, 2024 · 7 comments · May be fixed by #12830

aleung2021 commented Mar 12, 2024

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

Workflow controller configmap is configured as follows:

  metricsConfig: |
    metricsTTL: "3m"

I created a workflow that is infinitely sleeping. The workflow template has a metric that just tracks the duration as a real-time gauge metric.

Expected:

The metric continues to be reported until the workflow terminates. Once the workflow terminates, the metric should live for another 3 minutes before being cleaned up.

This sounds like the expected behavior based on #10503 (comment)

What happened instead:

The metric stopped being reported shortly after the metrics TTL elapsed, while the workflow was still running.

[Screenshot: green = duration metric; yellow = a Prometheus metric for the workflow pod, indicating the pod was not deleted.]

I tested the TTL with several different settings (10m, 20m, 1h) and confirmed the behavior scales with the TTL, though the cutoff is not always exactly the TTL value (e.g. with a 10m TTL it can take up to 20m).

We cannot simply turn off the metrics TTL, as that would affect the non-real-time metrics.

I also confirmed that our Prometheus instance was up and running and that the Argo controller did not shut down. (I did observe separately that if the controller is shut down and a new one comes up, the metrics are not continued either, but that is a different scenario and seems like a separate bug.)

Version

v3.4.10

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: sleep
  namespace: argo
spec:
  entrypoint: sleep
  templates:
    - name: sleep
      container:
        image: alpine:latest
        command: [sh, -c]
        args: ["while true; do sleep 100000; done"]
      metrics:
        prometheus:
        - name: duration_metric
          help: Duration of metric
          gauge:
            realtime: true
            value: '{{duration}}'

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

{"level":"info","msg":"Processing workflow","namespace":"argo","time":"2024-03-12T21:53:59.861Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Updated phase  -\u003e Running","namespace":"argo","time":"2024-03-12T21:53:59.965Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Pod node sleep-8ftls initialized Pending","namespace":"argo","time":"2024-03-12T21:53:59.966Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Created pod: sleep-8ftls (sleep-8ftls)","namespace":"argo","time":"2024-03-12T21:54:00.067Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"argo","time":"2024-03-12T21:54:00.067Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"reconcileAgentPod","namespace":"argo","time":"2024-03-12T21:54:00.067Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Workflow update successful","namespace":"argo","phase":"Running","resourceVersion":"398814","time":"2024-03-12T21:54:00.098Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Processing workflow","namespace":"argo","time":"2024-03-12T21:54:10.068Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Task-result reconciliation","namespace":"argo","numObjs":0,"time":"2024-03-12T21:54:10.068Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"node changed","namespace":"argo","new.message":"","new.phase":"Running","new.progress":"0/1","nodeID":"sleep-8ftls","old.message":"","old.phase":"Pending","old.progress":"0/1","time":"2024-03-12T21:54:10.069Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"argo","time":"2024-03-12T21:54:10.069Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"reconcileAgentPod","namespace":"argo","time":"2024-03-12T21:54:10.069Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Workflow update successful","namespace":"argo","phase":"Running","resourceVersion":"399005","time":"2024-03-12T21:54:10.100Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Processing workflow","namespace":"argo","time":"2024-03-12T21:54:20.101Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Task-result reconciliation","namespace":"argo","numObjs":0,"time":"2024-03-12T21:54:20.102Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"node unchanged","namespace":"argo","nodeID":"sleep-8ftls","time":"2024-03-12T21:54:20.102Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"argo","time":"2024-03-12T21:54:20.102Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"reconcileAgentPod","namespace":"argo","time":"2024-03-12T21:54:20.102Z","workflow":"sleep-8ftls"}

Logs from the workflow's wait container

{"level":"info","msg":"Starting Workflow Executor","time":"2024-03-12T21:54:02.642Z","version":"untagged"}
{"Duration":1000000000,"Factor":1.6,"Jitter":0.5,"Steps":5,"level":"info","msg":"Using executor retry strategy","time":"2024-03-12T21:54:02.645Z"}
{"deadline":"0001-01-01T00:00:00Z","includeScriptOutput":false,"level":"info","msg":"Executor initialized","namespace":"argo","podName":"sleep-8ftls","template":"{\"name\":\"sleep\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"alpine:latest\",\"command\":[\"sh\",\"-c\"],\"args\":[\"while true; do sleep 100000; done\"],\"resources\":{}},\"podSpecPatch\":\"{\\\"enableServiceLinks\\\": false}\\n\",\"metrics\":{\"prometheus\":[{\"name\":\"duration_metric\",\"help\":\"Duration of metric\",\"gauge\":{\"value\":\"{{duration}}\",\"realtime\":true}}]}}","time":"2024-03-12T21:54:02.645Z","version":"\u0026Version{Version:untagged,BuildDate:2024-03-01T06:42:25Z,GitCommit:unknown,GitTag:untagged,GitTreeState:clean,GoVersion:go1.21.7,Compiler:gc,Platform:linux/amd64,}"}
{"level":"info","msg":"Starting deadline monitor","time":"2024-03-12T21:54:02.645Z"}
{"level":"info","msg":"Alloc=6803 TotalAlloc=13592 Sys=20069 NumGC=6 Goroutines=7","time":"2024-03-12T21:59:02.646Z"}
jswxstw (Contributor) commented Mar 13, 2024

The metric stopped being reported shortly after the metrics TTL elapsed, while the workflow was still running.

The real-time metrics are only calculated when the workflow is scheduled, but in the scenario of this issue, the pod status remains unchanged for a long time, so the workflow will also remain unscheduled for an extended period.

jswxstw (Contributor) commented Mar 13, 2024

When checking for metricsTTL expiry, the garbage collector also needs to determine whether the workflow has been completed.

for key, metric := range m.customMetrics {
	if time.Since(metric.lastUpdated) > m.metricsConfig.TTL {
		delete(m.customMetrics, key)
	}
}

I think it is necessary to add a flag to the custom metrics that represents the completion status of the workflow.

type metric struct {
	metric      prometheus.Metric
	lastUpdated time.Time
}
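
For illustration, a minimal sketch in Go of that proposed approach: keep a completion flag on each custom metric and skip TTL expiry for real-time metrics whose workflow is still running. The completed and realtime fields, the Metrics type, and garbageCollect are hypothetical names for this sketch, not the actual Argo Workflows code.

package metrics

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

type metric struct {
	metric      prometheus.Metric
	lastUpdated time.Time
	realtime    bool // hypothetical: value is recomputed on every scrape
	completed   bool // hypothetical: the owning workflow has finished
}

type Metrics struct {
	mutex         sync.Mutex
	customMetrics map[string]metric
	ttl           time.Duration
}

// garbageCollect deletes custom metrics whose TTL has expired, but keeps
// real-time metrics alive while their workflow is still running.
func (m *Metrics) garbageCollect() {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	for key, mt := range m.customMetrics {
		if mt.realtime && !mt.completed {
			continue // workflow still running: do not expire its real-time metric
		}
		if time.Since(mt.lastUpdated) > m.ttl {
			delete(m.customMetrics, key)
		}
	}
}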

aleung2021 (Author) commented:

Hi @agilgur5 @jswxstw - thanks for the initial triage. Would it be possible to increase the priority on this? We are using these real-time metrics for alerting, and while a temporary workaround is to increase the metricsTTL to something longer than our expected workflow runtime, that will not be sustainable as more metrics are added.

Joibel (Member) commented Mar 13, 2024

What is your limitation with increasing the TTL? Cardinality will be large in both cases and TTL barely saves anything.

I am reworking metrics as per #12589, but currently the rework will have the same problem, as I haven't changed how this works.

epDHowwD commented:

Hi @Joibel,

thanks for the response and for working on the rework. Regarding the limitation with the settings: you suggested earlier to decrease the metrics TTL to reduce the impact of #10503 (comment):

You can enable metricsTTL globally

in order to cause this to clear (once the TTL has expired after the workflow has completed).

This is what we did, and setting a higher value now makes the other issue, where wrong metrics are exposed, more prominent again. It might still be acceptable, as we also have the other workaround of avoiding these particular real-time metrics.

epDHowwD commented:

I would also like to understand generally the purpose of 'realtime' metrics in the context of this statement:

The real-time metrics are only calculated when the workflow is scheduled

If it is not to expose correct runtime information during execution of the workflow, what makes real-time metrics different from other workflow metrics?
At least the documentation suggests that they are updated during execution of the workflow:

Argo supports a limited number of real-time metrics. These metrics are emitted in real-time, beginning when the step execution starts and ending when it completes.

Maybe I just need to read @jswxstw's statement as confirmation that this is how it behaves right now, but not the intended behavior?

jswxstw (Contributor) commented Mar 14, 2024

I would also like to understand generally the purpose of 'realtime' metrics in the context of this statement:

The real-time metrics are only calculated when the workflow is scheduled

@epDHowwD Sorry, I expressed it ambiguously. The real-time metric value is accurate, because it is recalculated each time the metrics endpoint is scraped. However, the lastUpdated field of the metric is only updated by the controller, which in this case had not updated it for a long time.

in order to cause this to clear (once the TTL has expired after the workflow has completed).

According to the current logic of metric garbage collection, metrics will be cleared when the TTL expires, without checking whether the workflow has been completed. See #12790 (comment).
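
To illustrate the point about scrape-time calculation, here is a minimal sketch (not Argo's actual code) of how a real-time gauge can be recomputed on every scrape using the Prometheus Go client, so the exposed value stays accurate regardless of how often the controller reconciles the workflow. The metric name and start time are stand-ins for the sketch.

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	startedAt := time.Now() // stand-in for the workflow's start time

	// GaugeFunc evaluates its callback on every scrape of /metrics, so the
	// reported duration is always current even if nothing else updates it.
	durationMetric := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "duration_metric",
		Help: "Duration of metric",
	}, func() float64 {
		return time.Since(startedAt).Seconds()
	})

	registry := prometheus.NewRegistry()
	registry.MustRegister(durationMetric)

	http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":9090", nil))
}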
