Real-time metrics disappear while the workflow is running when metricsTTL is less than the workflow run time #12790
The real-time metrics are only calculated when the workflow is scheduled, but in the scenario of this issue the pod status remains unchanged for a long time, so the workflow also remains unscheduled for an extended period.
While checking for metricsTTL compliance, the garbage collector also needs to determine whether the workflow has been completed (see argo-workflows/workflow/metrics/server.go, lines 140 to 143 at 5b3909b).
I think it is necessary to add a flag in the custom metrics that represents the completion status of the workflow (see argo-workflows/workflow/metrics/metrics.go, lines 42 to 45 at 5b3909b).
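A minimal Go sketch of that idea, assuming hypothetical type and field names (this is not the actual argo-workflows code): each tracked metric carries a completion flag, and the garbage collector only deletes metrics whose workflow has finished and whose TTL has expired.

```go
package metrics

import (
	"sync"
	"time"
)

// metric is an illustrative stand-in for a tracked custom metric.
// The `completed` flag is the proposed addition: it is set when the
// owning workflow finishes.
type metric struct {
	lastUpdated time.Time
	completed   bool
}

// store approximates the state the GC loop operates on.
type store struct {
	mu      sync.Mutex
	ttl     time.Duration
	metrics map[string]*metric
}

// garbageCollect deletes metrics whose TTL has expired, but, unlike the
// current logic referenced above, skips metrics whose workflow is still
// running, so realtime gauges keep reporting for long-running workflows.
func (s *store) garbageCollect(now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for key, m := range s.metrics {
		if !m.completed {
			continue // workflow still running: keep its metrics alive
		}
		if now.Sub(m.lastUpdated) > s.ttl {
			delete(s.metrics, key)
		}
	}
}
```

With a check like this in place, the TTL would only bound how long metrics linger after completion, rather than cutting off metrics for workflows that outlive it.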
Hi @agilgur5 @jswxstw - thanks for the initial triage. Would it be possible to increase priority on this? We are using these real-time metrics for alerting, and while a temporary workaround is to increase the metricsTTL to something longer than our expected workflow runtime, it will not be sustainable as more metrics are added.
What is your limitation with increasing the TTL? Cardinality will be large in both cases, and the TTL barely saves anything. I am reworking metrics as per #12589, but the rework currently has the same problem, as I haven't changed how this works.
Hi @Joibel, thanks for the response and for working on the rework. Regarding the limitations of the settings: you suggested earlier to decrease the metrics TTL to reduce the impact of #10503 (comment).
That is what we did, and setting a higher value now would again make the other issue, where wrong metrics are exposed, more prominent. It might still be acceptable, as we also have the other workaround in place to avoid these particular real-time metrics.
I would also like to understand generally the purpose of 'realtime' metrics in the context of this statement:
If not for exposing correct runtime information during execution of the workflow, what makes realtime metrics different from other workflow metrics?
Maybe I just need to read @jswxstw's statement as confirmation that this is the behavior right now, but not the actually intended behavior?
@epDHowwD Sorry, I expressed that ambiguously. The real-time metric itself is accurate, because it is recalculated each time you pull the metrics endpoint. However, under the current logic of metric garbage collection, metrics are cleared when the TTL expires without checking whether the workflow has completed. See #12790 (comment).
Pre-requisites
What happened/what did you expect to happen?
Workflow controller configmap is configured as follows:
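The config block itself was not captured here; below is a minimal sketch of the relevant ConfigMap section, assuming the 3-minute metricsTTL implied by the expectation described further down.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  metricsConfig: |
    enabled: true
    # Assumed value: the expected behavior below refers to a 3-minute TTL.
    metricsTTL: "3m"
```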
I created a workflow that is infinitely sleeping. The workflow template has a metric that just tracks the duration as a real-time gauge metric.
Expected:
The metric continues to be reported until the workflow terminates. Once the workflow terminates, the metric should live for another 3 minutes before being cleaned up.
This sounds like the expected behavior based on #10503 (comment)
What happened instead:
The metric stopped being reported shortly after the metrics TTL elapsed, while the workflow was still running.
(Graph omitted: green = the duration metric; yellow = a Prometheus metric for the workflow pod, indicating the pod was not deleted.)
I tested the TTL with various timings (10m, 20m, 1h) and confirmed the behavior scales with the TTL, though the metric does not always disappear after exactly the TTL setting (e.g. with a 10m TTL it can take up to 20m), presumably because the garbage collector runs on a periodic interval, so a metric can survive up to roughly twice the TTL.
We cannot simply turn off the metrics TTL, as that would affect the non-realtime metrics.
I also confirmed that our Prometheus instance was up and running and that the Argo controller did not shut down. (I did observe separately that if the controller shuts down and a new one comes up, the metrics are not continued either, but that is a different scenario and seems like a separate bug.)
Version
v3.4.10
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
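The workflow itself was not captured here; below is a minimal sketch matching the description above, an infinitely sleeping step with a realtime duration gauge (the names and image are illustrative, not the reporter's actual manifest).

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: realtime-metric-repro-   # illustrative name
spec:
  entrypoint: sleep-forever
  templates:
    - name: sleep-forever
      metrics:
        prometheus:
          - name: workflow_duration_gauge   # illustrative metric name
            help: "Duration of the workflow in seconds, reported in real time"
            gauge:
              realtime: true          # recomputed on every scrape
              value: "{{duration}}"   # built-in real-time duration variable
      container:
        image: alpine:3.19
        command: [sh, -c]
        # Loop forever so the workflow outlives the metricsTTL.
        args: ["while true; do sleep 3600; done"]
```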
Logs from the workflow controller
Logs from your workflow's wait container