[Workflow] Adding metrics for Dapr Workflow #7109
Comments
Hello @shubham1172 I am new to Dapr, and would like to contribute through this issue. I went through the issue and here is what I understood:
I see that Dapr currently uses OpenCensus, so I'll take a look at that. If no one is working on this issue, I would love to take it on.
Hello @prateek041, and welcome to the Dapr project 👋🏻 You can refer to pkg/diagnostics as a starting point to see how other metrics are written today and pkg/runtime/wfengine as a starting point to understand the workflow engine's code. Please feel free to assign this issue to yourself by commenting '/assign'. Note that this will likely be triaged for release 1.13 (upcoming release).
Sure @shubham1172, thanks! I have around 3-4 weeks, so I should get started.
/assign
Hey @shubham1172, I have added a draft PR. It does not include big changes yet; please review.
It may be useful to expose the age of messages in queues too, because increasing message age is a signal that the message is probably in an unhealthy state and can't be processed, i.e. a "poison message". But the system will continue to retry it anyway, leading to the workflow being "stuck". This is one of the more annoying problems to diagnose for Durable Functions.
Thanks for the suggestion, @lilyjma. Could you elaborate a bit more? Do you suggest recording the age of messages in the workflow inbox at the time of deletion, or something else?
The following metrics have not been added as part of #7152, as they need enhancements to support them. Workflow Execution Latency: Also, it needs to be decided what should be considered the schedule start time. It can be when the workflow scheduler returns ( dapr/pkg/runtime/wfengine/workflow.go Line 450 in 9e2bcdc
Scheduling Latency: The workflow can also be scheduled to run at a specific time. So, for a scheduled workflow, scheduling latency can be defined as (
Can we have dimensions/tags defined for all the metrics, especially the latencies? For example, for workflow activity execution latency, which dimensions will it capture?
Workflow Execution Latency
Definition: Workflow Execution Latency is the duration between the start and the end of a workflow execution. It measures the time taken for a workflow to complete its execution after it has been started.
Workflow Scheduling Latency
Definition: Workflow Scheduling Latency is the duration between the time a workflow is scheduled to start and the actual start of its execution. It measures the delay in the start of a workflow execution due to the scheduling process.
In what area(s)?
/area runtime
Describe the feature
Workflows should emit metrics that help Dapr users track overall traffic and health, and that improve overall diagnosability.
Dapr Runtime – current code
The following metrics can be extracted from the Dapr sidecars.
workflow/operations
These metrics cover the total successful and failed requests to create, get, purge workflows, and add events. It also covers the overall latency to execute these requests. Note that in the case of create workflow and add event, it only measures the time taken to create the reliable reminder.
workflow/reminders
These metrics cover the total number of internal-actor reminder requests created, which includes reminders to start a new workflow, run an activity, raise a new workflow event, or create durable timers.
workflow/execution
These metrics cover the total successful and failed workflow/activity execution requests. For failures, they also record whether the failure was recoverable (i.e., will be retried via actor reminders) or not.
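To make the success / recoverable / non-recoverable split concrete, here is a rough sketch of how the execution counters could be tagged. It deliberately uses only the Go standard library with an in-memory map as a stand-in for the real metrics backend; it is not how pkg/diagnostics is written. All names (`executionKey`, `executionCounters`, `statusFor`, the tag values) are hypothetical and only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errExample is a placeholder failure used only by this sketch.
var errExample = errors.New("example failure")

// executionKey is a hypothetical tag set for workflow/execution counters.
type executionKey struct {
	Kind   string // "workflow" or "activity"
	Status string // "success", "recoverable" or "non_recoverable"
}

// executionCounters is an in-memory stand-in for a real metrics backend.
type executionCounters struct {
	mu     sync.Mutex
	counts map[executionKey]int64
}

func newExecutionCounters() *executionCounters {
	return &executionCounters{counts: make(map[executionKey]int64)}
}

// statusFor maps an execution outcome to a status tag value.
func statusFor(err error, recoverable bool) string {
	switch {
	case err == nil:
		return "success"
	case recoverable:
		return "recoverable"
	default:
		return "non_recoverable"
	}
}

// Record increments the counter for one execution attempt.
func (c *executionCounters) Record(kind string, err error, recoverable bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[executionKey{Kind: kind, Status: statusFor(err, recoverable)}]++
}

// Count returns the current value for one tag combination.
func (c *executionCounters) Count(kind, status string) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[executionKey{Kind: kind, Status: status}]
}

func main() {
	c := newExecutionCounters()
	c.Record("workflow", nil, false)        // successful workflow execution
	c.Record("activity", errExample, true)  // failed, will be retried via a reminder
	c.Record("activity", errExample, false) // failed, will not be retried
	fmt.Println(c.Count("activity", "recoverable"))
}
```

Keeping the recoverable flag as a tag value rather than a separate metric means one counter can answer both "how many executions failed" and "how many of those will be retried".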
Dapr Runtime – enhancements
workflow/scheduling
By adding request-time in the reminder payload, we can measure the time between (1) a workflow being scheduled and the workflow’s actual start, and (2) an activity being created and the activity’s actual start.
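A minimal sketch of this idea, assuming the reminder payload is JSON and gains a request-time field. The `reminderPayload` struct, its field names, and both helper functions are hypothetical; the actual payload used by the workflow engine may look different.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// exampleRequestTime is a fixed timestamp used only to make the sketch deterministic.
var exampleRequestTime = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

// reminderPayload is a hypothetical reminder payload.
// RequestTime is the proposed addition; the other field is illustrative.
type reminderPayload struct {
	InstanceID  string    `json:"instanceId"`
	RequestTime time.Time `json:"requestTime"`
}

// newReminderPayload stamps the payload with the time the request was made.
func newReminderPayload(instanceID string, now time.Time) ([]byte, error) {
	return json.Marshal(reminderPayload{InstanceID: instanceID, RequestTime: now.UTC()})
}

// schedulingLatency decodes the payload and returns the time between the
// original request and the actual start of the workflow or activity.
func schedulingLatency(data []byte, startedAt time.Time) (time.Duration, error) {
	var p reminderPayload
	if err := json.Unmarshal(data, &p); err != nil {
		return 0, err
	}
	d := startedAt.Sub(p.RequestTime)
	if d < 0 {
		d = 0 // guard against clock skew between the scheduling and executing hosts
	}
	return d, nil
}

func main() {
	data, err := newReminderPayload("wf-1", exampleRequestTime)
	if err != nil {
		panic(err)
	}
	// Pretend the workflow actually started 1.5s after it was requested.
	latency, err := schedulingLatency(data, exampleRequestTime.Add(1500*time.Millisecond))
	if err != nil {
		panic(err)
	}
	fmt.Println(latency)
}
```

Because the timestamp travels inside the payload, the latency can be computed on whichever host picks up the reminder, at the cost of being sensitive to clock skew between hosts.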
workflow/execution
Using the same payload as above, when a workflow or activity finally finishes (with success or non-recoverable error), capture the total time taken to run a workflow or activity.
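The "only when it finally finishes" rule could look roughly like the sketch below: the total time is measured from the original request time and is recorded only for terminal outcomes, so a recoverable failure that will be retried does not produce a sample. The `outcome` type, `latencyRecorder`, and the slice used in place of a histogram are all hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"time"
)

// exampleRequestTime is a fixed timestamp used only to make the sketch deterministic.
var exampleRequestTime = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

// outcome is a hypothetical classification of one execution attempt.
type outcome int

const (
	succeeded outcome = iota
	failedRecoverable
	failedNonRecoverable
)

// isTerminal reports whether no further retry will happen for this outcome.
func isTerminal(o outcome) bool {
	return o == succeeded || o == failedNonRecoverable
}

// latencyRecorder collects observed durations; it stands in for a real
// latency histogram and is not safe for concurrent use.
type latencyRecorder struct {
	samples []time.Duration
}

// Observe records the total time from requestTime to finishedAt, but only
// for terminal outcomes. It returns true if a sample was recorded.
func (r *latencyRecorder) Observe(requestTime, finishedAt time.Time, o outcome) bool {
	if !isTerminal(o) {
		return false
	}
	r.samples = append(r.samples, finishedAt.Sub(requestTime))
	return true
}

func main() {
	var r latencyRecorder
	// First attempt fails with a recoverable error after 2s: nothing is recorded.
	fmt.Println(r.Observe(exampleRequestTime, exampleRequestTime.Add(2*time.Second), failedRecoverable))
	// The retry succeeds 5s after the original request: the full 5s is recorded.
	fmt.Println(r.Observe(exampleRequestTime, exampleRequestTime.Add(5*time.Second), succeeded))
	fmt.Println(r.samples)
}
```

Measuring from the request time carried in the payload, rather than from the start of the last attempt, means the recorded value includes time spent waiting on retries.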
Dapr SDKs
Dapr SDKs can configure Workflow resiliency in the underlying implementation of Durable Task Framework (DTF). For example, durabletask-dotnet and durabletask-python have retry policies that take effect while calling activities, which are then configurable in the Dapr SDKs like dotnet-sdk. However, not all Dapr SDKs (e.g., Python or Java) allow the same today.
When Dapr SDKs start supporting retry policies consistently, we can bubble the retry details from DTF clients to Dapr SDKs, and have them shared with the Dapr sidecars to be emitted as a metric. This can be like the existing sidecar resiliency metrics. For this issue, this can be considered out of scope.
Summary
Release Note
RELEASE NOTE: ADD metrics for Dapr Workflow