
Real-time metrics disappear while the workflow is running when metricsTTL is shorter than the workflow run time #12790

Open · 3 of 4 tasks
aleung2021 opened this issue Mar 12, 2024 · 7 comments · May be fixed by #12830

aleung2021 commented Mar 12, 2024

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

Workflow controller configmap is configured as follows:

  metricsConfig: |
    metricsTTL: "3m"

I created a workflow that is infinitely sleeping. The workflow template has a metric that just tracks the duration as a real-time gauge metric.

Expected:

The metric continues to be reported until the workflow terminates. Once the workflow terminates, the metric should live for another 3 minutes before being cleaned up.

This sounds like the expected behavior based on #10503 (comment)

What happened instead:

The metric stopped being reported shortly after the metrics TTL elapsed, while the workflow was still running.

[Screenshot: green = duration metric; yellow = a Prometheus metric for the workflow pod, indicating the pod was not deleted.]

I tested the TTL with several different settings (10m, 20m, 1h) and confirmed the behavior scales with the TTL, though the cutoff is not always exactly the TTL value (e.g. with a 10m TTL it can take up to 20m).

We cannot simply turn off the metrics TTL, as that would affect the non-real-time metrics.

I also confirmed that our Prometheus instance was up and running and that the Argo controller did not shut down. (I did observe separately that if the controller is shut down and a new one comes up, the metrics are not continued either, but that is a different scenario and seems like a separate bug.)

Version

v3.4.10

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: sleep
  namespace: argo
spec:
  entrypoint: sleep
  templates:
    - name: sleep
      container:
        image: alpine:latest
        command: [sh, -c]
        args: ["while true; do sleep 100000; done"]
      metrics:
        prometheus:
        - name: duration_metric
          help: Duration of metric
          gauge:
            realtime: true
            value: '{{duration}}'

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

{"level":"info","msg":"Processing workflow","namespace":"argo","time":"2024-03-12T21:53:59.861Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Updated phase  -\u003e Running","namespace":"argo","time":"2024-03-12T21:53:59.965Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Pod node sleep-8ftls initialized Pending","namespace":"argo","time":"2024-03-12T21:53:59.966Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Created pod: sleep-8ftls (sleep-8ftls)","namespace":"argo","time":"2024-03-12T21:54:00.067Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"argo","time":"2024-03-12T21:54:00.067Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"reconcileAgentPod","namespace":"argo","time":"2024-03-12T21:54:00.067Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Workflow update successful","namespace":"argo","phase":"Running","resourceVersion":"398814","time":"2024-03-12T21:54:00.098Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Processing workflow","namespace":"argo","time":"2024-03-12T21:54:10.068Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Task-result reconciliation","namespace":"argo","numObjs":0,"time":"2024-03-12T21:54:10.068Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"node changed","namespace":"argo","new.message":"","new.phase":"Running","new.progress":"0/1","nodeID":"sleep-8ftls","old.message":"","old.phase":"Pending","old.progress":"0/1","time":"2024-03-12T21:54:10.069Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"argo","time":"2024-03-12T21:54:10.069Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"reconcileAgentPod","namespace":"argo","time":"2024-03-12T21:54:10.069Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Workflow update successful","namespace":"argo","phase":"Running","resourceVersion":"399005","time":"2024-03-12T21:54:10.100Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Processing workflow","namespace":"argo","time":"2024-03-12T21:54:20.101Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"Task-result reconciliation","namespace":"argo","numObjs":0,"time":"2024-03-12T21:54:20.102Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"node unchanged","namespace":"argo","nodeID":"sleep-8ftls","time":"2024-03-12T21:54:20.102Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"argo","time":"2024-03-12T21:54:20.102Z","workflow":"sleep-8ftls"}
{"level":"info","msg":"reconcileAgentPod","namespace":"argo","time":"2024-03-12T21:54:20.102Z","workflow":"sleep-8ftls"}

Logs from the workflow's wait container

{"level":"info","msg":"Starting Workflow Executor","time":"2024-03-12T21:54:02.642Z","version":"untagged"}
{"Duration":1000000000,"Factor":1.6,"Jitter":0.5,"Steps":5,"level":"info","msg":"Using executor retry strategy","time":"2024-03-12T21:54:02.645Z"}
{"deadline":"0001-01-01T00:00:00Z","includeScriptOutput":false,"level":"info","msg":"Executor initialized","namespace":"argo","podName":"sleep-8ftls","template":"{\"name\":\"sleep\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"alpine:latest\",\"command\":[\"sh\",\"-c\"],\"args\":[\"while true; do sleep 100000; done\"],\"resources\":{}},\"podSpecPatch\":\"{\\\"enableServiceLinks\\\": false}\\n\",\"metrics\":{\"prometheus\":[{\"name\":\"duration_metric\",\"help\":\"Duration of metric\",\"gauge\":{\"value\":\"{{duration}}\",\"realtime\":true}}]}}","time":"2024-03-12T21:54:02.645Z","version":"\u0026Version{Version:untagged,BuildDate:2024-03-01T06:42:25Z,GitCommit:unknown,GitTag:untagged,GitTreeState:clean,GoVersion:go1.21.7,Compiler:gc,Platform:linux/amd64,}"}
{"level":"info","msg":"Starting deadline monitor","time":"2024-03-12T21:54:02.645Z"}
{"level":"info","msg":"Alloc=6803 TotalAlloc=13592 Sys=20069 NumGC=6 Goroutines=7","time":"2024-03-12T21:59:02.646Z"}
jswxstw (Contributor) commented Mar 13, 2024

The metric stopped being reported shortly after the metrics TTL elapsed, while the workflow was still running.

The real-time metrics are only calculated when the workflow is scheduled, but in the scenario of this issue, the pod status remains unchanged for a long time, so the workflow will also remain unscheduled for an extended period.

jswxstw (Contributor) commented Mar 13, 2024

When checking for metricsTTL expiry, the garbage collector also needs to determine whether the workflow has been completed.

for key, metric := range m.customMetrics {
	if time.Since(metric.lastUpdated) > m.metricsConfig.TTL {
		delete(m.customMetrics, key)
	}
}

I think it is necessary to add a flag to the custom metrics that represents the completion status of the workflow.

type metric struct {
	metric      prometheus.Metric
	lastUpdated time.Time
}
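
For illustration, a minimal sketch in Go of that proposed approach: keep a completion flag on each custom metric and skip TTL expiry for real-time metrics whose workflow is still running. The completed and realtime fields, the Metrics type, and garbageCollect are hypothetical names for this sketch, not the actual Argo Workflows code.

package metrics

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

type metric struct {
	metric      prometheus.Metric
	lastUpdated time.Time
	realtime    bool // hypothetical: value is recomputed on every scrape
	completed   bool // hypothetical: the owning workflow has finished
}

type Metrics struct {
	mutex         sync.Mutex
	customMetrics map[string]metric
	ttl           time.Duration
}

// garbageCollect deletes custom metrics whose TTL has expired, but keeps
// real-time metrics alive while their workflow is still running.
func (m *Metrics) garbageCollect() {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	for key, mt := range m.customMetrics {
		if mt.realtime && !mt.completed {
			continue // workflow still running: do not expire its real-time metric
		}
		if time.Since(mt.lastUpdated) > m.ttl {
			delete(m.customMetrics, key)
		}
	}
}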

aleung2021 (Author) commented:

Hi @agilgur5 @jswxstw - thanks for the initial triage. Would it be possible to increase the priority on this? We are using these real-time metrics for alerting, and while a temporary workaround is to increase the metricsTTL to something longer than our expected workflow runtime, that will not be sustainable as more metrics are added.

Joibel (Member) commented Mar 13, 2024

What is your limitation with increasing the TTL? Cardinality will be large in both cases and TTL barely saves anything.

I am reworking metrics as per #12589, but currently the rework will have the same problem, as I haven't changed how this works.

epDHowwD commented:

Hi @Joibel,

thanks for the response and for working on the rework. Regarding the limitation with the settings: you suggested earlier to decrease the metrics TTL to reduce the impact of #10503 (comment):

You can enable metricsTTL globally

in order to cause this to clear (once the TTL has expired after the workflow has completed).

This is what we did, and setting a higher value now makes the other issue, where wrong metrics are exposed, more prominent again. It might still be acceptable, as we also have the other workaround of avoiding these particular real-time metrics.

epDHowwD commented:

I would also like to understand generally the purpose of 'realtime' metrics in the context of this statement:

The real-time metrics are only calculated when the workflow is scheduled

If it is not to expose correct runtime information during execution of the workflow, what makes real-time metrics different from other workflow metrics?
At least the documentation suggests that they are updated during execution of the workflow:

Argo supports a limited number of real-time metrics. These metrics are emitted in real-time, beginning when the step execution starts and ending when it completes.

Maybe I just need to read @jswxstw's statement as confirmation that this is how it behaves right now, but not the intended behavior?

jswxstw (Contributor) commented Mar 14, 2024

I would also like to understand generally the purpose of 'realtime' metrics in the context of this statement:

The real-time metrics are only calculated when the workflow is scheduled

@epDHowwD Sorry, I expressed it ambiguously. The real-time metric value is accurate, because it is recalculated each time the metrics endpoint is scraped. However, the lastUpdated field of the metric is only updated by the controller, which in this case had not updated it for a long time.

in order to cause this to clear (once the TTL has expired after the workflow has completed).

According to the current logic of metric garbage collection, metrics will be cleared when the TTL expires, without checking whether the workflow has been completed. See #12790 (comment).
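
To illustrate the point about scrape-time calculation, here is a minimal sketch (not Argo's actual code) of how a real-time gauge can be recomputed on every scrape using the Prometheus Go client, so the exposed value stays accurate regardless of how often the controller reconciles the workflow. The metric name and start time are stand-ins for the sketch.

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	startedAt := time.Now() // stand-in for the workflow's start time

	// GaugeFunc evaluates its callback on every scrape of /metrics, so the
	// reported duration is always current even if nothing else updates it.
	durationMetric := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "duration_metric",
		Help: "Duration of metric",
	}, func() float64 {
		return time.Since(startedAt).Seconds()
	})

	registry := prometheus.NewRegistry()
	registry.MustRegister(durationMetric)

	http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":9090", nil))
}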
