[Workflow] Adding metrics for Dapr Workflow #7109

Closed
shubham1172 opened this issue Oct 30, 2023 · 10 comments · Fixed by #7152 or #7370

@shubham1172
Member

shubham1172 commented Oct 30, 2023

In what area(s)?

/area runtime

Describe the feature

Workflow should emit metrics that help Dapr users track overall traffic and health, and that improve overall diagnosability.

Dapr Runtime – current code

The following metrics can be extracted from the Dapr sidecars.

workflow/operations

These metrics cover the total successful and failed requests to create, get, and purge workflows, and to add events. They also cover the overall latency of executing these requests. Note that for create-workflow and add-event requests, this only measures the time taken to create the reliable reminder.
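As a rough, illustrative sketch only (not the actual implementation): counters and latency distributions like these could be defined with OpenCensus, which pkg/diagnostics already uses. Every metric name, tag key, bucket boundary, and helper function below is a placeholder.

```go
package diag

import (
	"context"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// Hypothetical measures mirroring the proposed workflow operation metrics.
var (
	workflowOperationCount = stats.Int64(
		"runtime/workflow/operations/total",
		"The number of successful workflow operation requests.",
		stats.UnitDimensionless)
	workflowOperationLatency = stats.Float64(
		"runtime/workflow/operations/latencies",
		"The latencies of responses for workflow operation requests.",
		stats.UnitMilliseconds)

	operationKey = tag.MustNewKey("operation")
)

// InitWorkflowMetrics registers the views; in Dapr this would live in pkg/diagnostics.
func InitWorkflowMetrics() error {
	return view.Register(
		&view.View{Measure: workflowOperationCount, TagKeys: []tag.Key{operationKey}, Aggregation: view.Count()},
		&view.View{Measure: workflowOperationLatency, TagKeys: []tag.Key{operationKey}, Aggregation: view.Distribution(1, 10, 100, 1000, 10000)},
	)
}

// RecordWorkflowOperation records one successful operation and its latency.
func RecordWorkflowOperation(ctx context.Context, operation string, elapsed time.Duration) {
	_ = stats.RecordWithTags(ctx,
		[]tag.Mutator{tag.Upsert(operationKey, operation)},
		workflowOperationCount.M(1),
		workflowOperationLatency.M(float64(elapsed.Milliseconds())))
}
```

A failed_total counter would follow the same pattern, with its own measure recorded on the error path.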

workflow/reminders

These metrics cover the total number of internal-actor reminder requests created, including reminders to start a new workflow, run an activity, raise a new workflow event, or create a durable timer.

workflow/execution

These metrics cover the total successful and failed workflow/activity execution requests. For failures, they also record whether the failure was recoverable (i.e., will be retried via actor reminders) or not.

Dapr Runtime – enhancements

workflow/scheduling

By adding the request time to the reminder payload, we can measure the time between (1) a workflow being scheduled and the workflow's actual start, and (2) an activity being created and the activity's actual start.

workflow/execution

Using the same payload as above, when a workflow or activity finally finishes (with success or a non-recoverable error), we can capture the total time taken to run it.
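One possible shape for this enhancement, purely as a sketch (the field, type, and function names are assumptions, not the actual runtime types): carry a creation timestamp in the reminder payload, then diff against it when the reminder fires and again when execution completes.

```go
package wfengine

import "time"

// reminderPayload is a hypothetical payload carried by the workflow/activity reminder.
type reminderPayload struct {
	InstanceID string    `json:"instanceID"`
	CreatedAt  time.Time `json:"createdAt"` // set when the reminder is created
}

// onReminderInvoked would run when the reminder fires and execution actually starts.
func onReminderInvoked(p reminderPayload) {
	schedulingLatency := time.Since(p.CreatedAt)
	// record runtime/workflow/scheduling/latencies with an execution-type tag
	_ = schedulingLatency
}

// onExecutionCompleted would run when the workflow/activity finishes with
// success or a non-recoverable error.
func onExecutionCompleted(p reminderPayload, status string) {
	executionLatency := time.Since(p.CreatedAt)
	// record runtime/workflow/execution/latencies with execution-type and status tags
	_ = executionLatency
}
```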

Dapr SDKs

Dapr SDKs can configure Workflow resiliency in the underlying implementation of Durable Task Framework (DTF). For example, durabletask-dotnet and durabletask-python have retry policies that take effect while calling activities, which are then configurable in the Dapr SDKs like dotnet-sdk. However, not all Dapr SDKs (e.g., Python or Java) allow the same today.

When Dapr SDKs start supporting retry policies consistently, we can bubble the retry details up from DTF clients to the Dapr SDKs and have them shared with the Dapr sidecars to be emitted as metrics. This would be similar to the existing sidecar resiliency metrics; for this issue, it can be considered out of scope.

Summary

| Metric name | Description | Tags | Status |
| --- | --- | --- | --- |
| runtime/workflow/operations/total | The number of successful workflow operation requests. | operation | Ready for implementation |
| runtime/workflow/operations/failed_total | The number of failed workflow operation requests. | operation | Ready for implementation |
| runtime/workflow/operations/latencies | The latencies of responses for workflow operation requests. | elapsed | Ready for implementation |
| runtime/workflow/reminders/total | The number of workflow/activity reminders created. | reminder-type | Ready for implementation |
| runtime/workflow/execution/total | The number of successful workflow/activity executions. | execution-type | Ready for implementation |
| runtime/workflow/execution/failed_total | The number of failed workflow/activity executions. | execution-type, is-retriable | Ready for implementation |
| runtime/workflow/execution/latencies | The total time taken to run a workflow/activity to completion. | execution-type, status | Needs runtime changes |
| runtime/workflow/scheduling/latencies | The latencies between an execution request and the actual execution. | execution-type | Needs runtime changes |

Release Note

RELEASE NOTE: ADD metrics for Dapr Workflow

@prateek041
Contributor

Hello @shubham1172

I am new to Dapr, and would like to contribute through this issue. I went through the issue and here is what I understood:

  • Currently, Dapr Workflows do not export metrics, but they should for better diagnosability, so the metrics you mentioned above need to be implemented.

I see that Dapr currently uses OpenCensus, so I'll take a look into that.

If no one is working on it, I would love to.
PS: Directions from you would be helpful 😄

@shubham1172
Member Author

Hello @prateek041, and welcome to the Dapr project 👋🏻

You can refer to pkg/diagnostics as a starting point to see how other metrics are written today and pkg/runtime/wfengine as a starting point to understand the workflow engine's code.

Please feel free to assign this issue to yourself by commenting '/assign'. Note that this will likely be triaged for release 1.13 (upcoming release).

@mukundansundar mukundansundar added this to the v1.13 milestone Nov 3, 2023
@prateek041
Contributor

prateek041 commented Nov 3, 2023

Sure @shubham1172, thanks!

So, I have around 3-4 weeks; I should get started.

@prateek041
Contributor

/assign

@prateek041
Contributor

Hey @shubham1172, I have added a draft PR. It currently does not include big changes; please review.

@lilyjma

lilyjma commented Nov 28, 2023

It may be useful to expose the age of messages in queues too, because increasing message age is a signal that the message is probably in an unhealthy state and can't be processed, i.e., a "poison message". But the system will continue to retry it anyway, leading to the workflow being "stuck". This is one of the more annoying problems to diagnose for Durable Functions.

@JoshVanL JoshVanL added the P0 label Dec 5, 2023
@prateek041
Contributor

prateek041 commented Dec 10, 2023

Thanks for the suggestion, @lilyjma!

Could you elaborate a bit more? For example, do you suggest recording the age of messages in the workflow inbox at the time of deletion, or something else?

@shivamkm07
Contributor

The following metrics have not been added as part of #7152, as they need enhancements to support them.

Workflow Execution Latency:
To record workflow execution latency, we need to store the workflow execution start time in the database. This is because a workflow executes across multiple schedules, and the workflow execution time would be (completion time of last schedule - start time of first schedule). The start time could probably be stored as a new HistoryEvent.

Also, it needs to be decided what should be considered the schedule start time. It could be when the workflow scheduler returns (`err = wf.scheduler(ctx, wi)`), but technically the execution doesn't start at that point (the request hasn't even been sent to the application container by then). Actual execution starts later, e.g. at this point in durabletask-python. So maybe we can return an action from the application containing the startTime.

Note: Activity execution latency has been added, since an activity doesn't execute across multiple schedules. However, if the execution start time is defined as in durabletask-python, it would need to be modified accordingly as well.
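A rough sketch of the bookkeeping described above (all names are hypothetical, not actual runtime or history types): persist the start time once on the first schedule, then compute the latency when the final schedule completes.

```go
package wfengine

import "time"

// workflowExecutionState is a hypothetical stand-in for persisted workflow state;
// executionStartedAt would be stored durably, e.g. as a new history event.
type workflowExecutionState struct {
	executionStartedAt *time.Time
}

// onScheduleStarted records the start time only for the first schedule.
func onScheduleStarted(s *workflowExecutionState, now time.Time) {
	if s.executionStartedAt == nil {
		s.executionStartedAt = &now
	}
}

// onScheduleCompleted emits the latency once the workflow reaches a completed state,
// i.e. (completion time of last schedule - start time of first schedule).
func onScheduleCompleted(s *workflowExecutionState, runtimeCompleted bool, now time.Time) {
	if runtimeCompleted && s.executionStartedAt != nil {
		latency := now.Sub(*s.executionStartedAt)
		// record runtime/workflow/execution/latencies with execution-type and status tags
		_ = latency
	}
}
```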

Scheduling Latency
The normal workflow execution start flow looks like this:
Create Workflow request -> Create Workflow reminder -> Return
(Async) Invoke reminder -> Schedule Workflow -> Send request to App to execute -> App starts execution

A workflow can also be scheduled to run at a specific time. For a scheduled workflow, scheduling latency can be defined as (time at which the app starts execution - scheduled time). For immediate scheduling, it can probably be (time at which the app starts execution - time at which the workflow reminder is created). In both cases, the saved reminder needs a new field, creationTime, which will help calculate the scheduling latency later when the reminder is invoked.
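As a small sketch of that calculation (the creationTime field and the other names here are hypothetical, following the proposal above):

```go
package wfengine

import "time"

// workflowReminder is a hypothetical view of the saved reminder data.
type workflowReminder struct {
	CreationTime  time.Time  // proposed new field: when the reminder was created
	ScheduledTime *time.Time // set only when the workflow was scheduled for a specific time
}

// schedulingLatency is the delay between when the workflow was due to start
// and when the app actually began executing it.
func schedulingLatency(r workflowReminder, appExecutionStart time.Time) time.Duration {
	if r.ScheduledTime != nil {
		// Scheduled workflow: measure against the requested start time.
		return appExecutionStart.Sub(*r.ScheduledTime)
	}
	// Immediate workflow: measure against reminder creation time.
	return appExecutionStart.Sub(r.CreationTime)
}
```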

@ASHIQUEMD
Contributor

ASHIQUEMD commented Jan 3, 2024

Can we have dimensions/tags defined for all the metrics, especially the latencies? For example, for workflow activity execution latency, what are the different dimensions it will capture?

@ASHIQUEMD
Contributor

Workflow Execution Latency

Definition: Workflow Execution Latency is the duration between the start and the end of a workflow execution. It measures the time taken for a workflow to complete its execution after it has been started.

  • Start of Workflow Execution: The workflow execution is considered to have started after the execution of this line of code. This is the point where the workflow transitions from the scheduled state to the running state.

  • End of Workflow Execution: The workflow execution is considered to have completed when the function runtimeState.IsCompleted() returns true. This indicates that the workflow has finished processing all its tasks and has reached its final state.

Workflow Scheduling Latency

Definition: Workflow Scheduling Latency is the duration between the time a workflow is scheduled to start and the actual start of its execution. It measures the delay in the start of a workflow execution due to the scheduling process.

  • Calculation: The Scheduling Latency is calculated by subtracting the ScheduledStartTimestamp (which is passed as part of the workflow state) from the actual start time of the workflow execution (as defined above).
