[Workflow] Adding metrics for Dapr Workflow #7109

Closed
shubham1172 opened this issue Oct 30, 2023 · 10 comments · Fixed by #7152 or #7370

@shubham1172
Member

shubham1172 commented Oct 30, 2023

In what area(s)?

/area runtime

Describe the feature

Workflow should emit metrics that help Dapr users track overall traffic and health, and that improve overall diagnosability.

Dapr Runtime – current code

The following metrics can be extracted from the Dapr sidecars.

workflow/operations

These metrics cover the total successful and failed requests to create, get, and purge workflows, and to add events. They also cover the overall latency of executing these requests. Note that for create-workflow and add-event requests, this only measures the time taken to create the reliable reminder.
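As a rough, illustrative sketch only (not the actual implementation): counters and latency distributions like these could be defined with OpenCensus, which pkg/diagnostics already uses. Every metric name, tag key, bucket boundary, and helper function below is a placeholder.

```go
package diag

import (
	"context"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// Hypothetical measures mirroring the proposed workflow operation metrics.
var (
	workflowOperationCount = stats.Int64(
		"runtime/workflow/operations/total",
		"The number of successful workflow operation requests.",
		stats.UnitDimensionless)
	workflowOperationLatency = stats.Float64(
		"runtime/workflow/operations/latencies",
		"The latencies of responses for workflow operation requests.",
		stats.UnitMilliseconds)

	operationKey = tag.MustNewKey("operation")
)

// InitWorkflowMetrics registers the views; in Dapr this would live in pkg/diagnostics.
func InitWorkflowMetrics() error {
	return view.Register(
		&view.View{Measure: workflowOperationCount, TagKeys: []tag.Key{operationKey}, Aggregation: view.Count()},
		&view.View{Measure: workflowOperationLatency, TagKeys: []tag.Key{operationKey}, Aggregation: view.Distribution(1, 10, 100, 1000, 10000)},
	)
}

// RecordWorkflowOperation records one successful operation and its latency.
func RecordWorkflowOperation(ctx context.Context, operation string, elapsed time.Duration) {
	_ = stats.RecordWithTags(ctx,
		[]tag.Mutator{tag.Upsert(operationKey, operation)},
		workflowOperationCount.M(1),
		workflowOperationLatency.M(float64(elapsed.Milliseconds())))
}
```

A failed_total counter would follow the same pattern, with its own measure recorded on the error path.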

workflow/reminders

These metrics cover the total number of internal-actor reminder requests created, including reminders to start a new workflow, run an activity, raise a new workflow event, or create a durable timer.

workflow/execution

These metrics cover the total successful and failed workflow/activity execution requests. For failures, they also record whether the failure was recoverable (i.e., will be retried via actor reminders) or not.

Dapr Runtime – enhancements

workflow/scheduling

By adding the request time to the reminder payload, we can measure the time between (1) a workflow being scheduled and the workflow's actual start, and (2) an activity being created and the activity's actual start.

workflow/execution

Using the same payload as above, when a workflow or activity finally finishes (with success or a non-recoverable error), we can capture the total time taken to run it.
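One possible shape for this enhancement, purely as a sketch (the field, type, and function names are assumptions, not the actual runtime types): carry a creation timestamp in the reminder payload, then diff against it when the reminder fires and again when execution completes.

```go
package wfengine

import "time"

// reminderPayload is a hypothetical payload carried by the workflow/activity reminder.
type reminderPayload struct {
	InstanceID string    `json:"instanceID"`
	CreatedAt  time.Time `json:"createdAt"` // set when the reminder is created
}

// onReminderInvoked would run when the reminder fires and execution actually starts.
func onReminderInvoked(p reminderPayload) {
	schedulingLatency := time.Since(p.CreatedAt)
	// record runtime/workflow/scheduling/latencies with an execution-type tag
	_ = schedulingLatency
}

// onExecutionCompleted would run when the workflow/activity finishes with
// success or a non-recoverable error.
func onExecutionCompleted(p reminderPayload, status string) {
	executionLatency := time.Since(p.CreatedAt)
	// record runtime/workflow/execution/latencies with execution-type and status tags
	_ = executionLatency
}
```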

Dapr SDKs

Dapr SDKs can configure Workflow resiliency in the underlying implementation of Durable Task Framework (DTF). For example, durabletask-dotnet and durabletask-python have retry policies that take effect while calling activities, which are then configurable in the Dapr SDKs like dotnet-sdk. However, not all Dapr SDKs (e.g., Python or Java) allow the same today.

When Dapr SDKs start supporting retry policies consistently, we can bubble the retry details up from DTF clients to the Dapr SDKs and have them shared with the Dapr sidecars to be emitted as metrics. This would be similar to the existing sidecar resiliency metrics; for this issue, it can be considered out of scope.

Summary

| Metric name | Description | Tags | Status |
| --- | --- | --- | --- |
| runtime/workflow/operations/total | The number of successful workflow operation requests. | operation | Ready for implementation |
| runtime/workflow/operations/failed_total | The number of failed workflow operation requests. | operation | Ready for implementation |
| runtime/workflow/operations/latencies | The latencies of responses for workflow operation requests. | elapsed | Ready for implementation |
| runtime/workflow/reminders/total | The number of workflow/activity reminders created. | reminder-type | Ready for implementation |
| runtime/workflow/execution/total | The number of successful workflow/activity executions. | execution-type | Ready for implementation |
| runtime/workflow/execution/failed_total | The number of failed workflow/activity executions. | execution-type, is-retriable | Ready for implementation |
| runtime/workflow/execution/latencies | The total time taken to run a workflow/activity to completion. | execution-type, status | Needs runtime changes |
| runtime/workflow/scheduling/latencies | The latencies between an execution request and the actual execution. | execution-type | Needs runtime changes |

Release Note

RELEASE NOTE: ADD metrics for Dapr Workflow

@prateek041
Contributor

Hello @shubham1172

I am new to Dapr, and would like to contribute through this issue. I went through the issue and here is what I understood:

  • Currently, Dapr Workflows do not export metrics, but they should for better diagnosability, so the metrics you mentioned above need to be implemented.

I see that Dapr currently uses OpenCensus, so I'll take a look into that.

If no one is working on it, I would love to.
PS: Directions from you would be helpful 😄

@shubham1172
Member Author

Hello @prateek041, and welcome to the Dapr project 👋🏻

You can refer to pkg/diagnostics as a starting point to see how other metrics are written today and pkg/runtime/wfengine as a starting point to understand the workflow engine's code.

Please feel free to assign this issue to yourself by commenting '/assign'. Note that this will likely be triaged for release 1.13 (upcoming release).

@mukundansundar mukundansundar added this to the v1.13 milestone Nov 3, 2023
@prateek041
Contributor

prateek041 commented Nov 3, 2023

Sure @shubham1172, thanks!

So, I have around 3-4 weeks; I should get started.

@prateek041
Contributor

/assign

@prateek041
Contributor

Hey @shubham1172, I have added a draft PR. It currently does not include big changes; please review.

@lilyjma

lilyjma commented Nov 28, 2023

It may be useful to expose the age of messages in queues too, because increasing message age is a signal that the message is probably in an unhealthy state and can't be processed, i.e., a "poison message". But the system will continue to retry it anyway, leading to the workflow being "stuck". This is one of the more annoying problems to diagnose for Durable Functions.

@JoshVanL JoshVanL added the P0 label Dec 5, 2023
@prateek041
Contributor

prateek041 commented Dec 10, 2023

Thanks for the suggestion, @lilyjma!

Could you elaborate a bit more? For example, do you suggest recording the age of messages in the workflow inbox at the time of deletion, or something else?

@shivamkm07
Contributor

The following metrics have not been added as part of #7152, as they need enhancements to support them.

Workflow Execution Latency:
To record workflow execution latency, we need to store the workflow execution start time in the database. This is because a workflow executes across multiple schedules, and the workflow execution time would be (completion time of last schedule - start time of first schedule). The start time could probably be stored as a new HistoryEvent.

Also, it needs to be decided what should be considered the schedule start time. It could be when the workflow scheduler returns (`err = wf.scheduler(ctx, wi)`), but technically the execution doesn't start at that point (the request hasn't even been sent to the application container by then). Actual execution starts later, e.g. at this point in durabletask-python. So maybe we can return an action from the application containing the startTime.

Note: Activity execution latency has been added, since an activity doesn't execute across multiple schedules. However, if the execution start time is defined as in durabletask-python, it would need to be modified accordingly as well.
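A rough sketch of the bookkeeping described above (all names are hypothetical, not actual runtime or history types): persist the start time once on the first schedule, then compute the latency when the final schedule completes.

```go
package wfengine

import "time"

// workflowExecutionState is a hypothetical stand-in for persisted workflow state;
// executionStartedAt would be stored durably, e.g. as a new history event.
type workflowExecutionState struct {
	executionStartedAt *time.Time
}

// onScheduleStarted records the start time only for the first schedule.
func onScheduleStarted(s *workflowExecutionState, now time.Time) {
	if s.executionStartedAt == nil {
		s.executionStartedAt = &now
	}
}

// onScheduleCompleted emits the latency once the workflow reaches a completed state,
// i.e. (completion time of last schedule - start time of first schedule).
func onScheduleCompleted(s *workflowExecutionState, runtimeCompleted bool, now time.Time) {
	if runtimeCompleted && s.executionStartedAt != nil {
		latency := now.Sub(*s.executionStartedAt)
		// record runtime/workflow/execution/latencies with execution-type and status tags
		_ = latency
	}
}
```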

Scheduling Latency
The normal workflow execution start flow looks like this:
Create Workflow request -> Create Workflow reminder -> Return
(Async) Invoke reminder -> Schedule Workflow -> Send request to App to execute -> App starts execution

A workflow can also be scheduled to run at a specific time. For a scheduled workflow, scheduling latency can be defined as (time at which the app starts execution - scheduled time). For immediate scheduling, it can probably be (time at which the app starts execution - time at which the workflow reminder is created). In both cases, the saved reminder needs a new field, creationTime, which will help calculate the scheduling latency later when the reminder is invoked.
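As a small sketch of that calculation (the creationTime field and the other names here are hypothetical, following the proposal above):

```go
package wfengine

import "time"

// workflowReminder is a hypothetical view of the saved reminder data.
type workflowReminder struct {
	CreationTime  time.Time  // proposed new field: when the reminder was created
	ScheduledTime *time.Time // set only when the workflow was scheduled for a specific time
}

// schedulingLatency is the delay between when the workflow was due to start
// and when the app actually began executing it.
func schedulingLatency(r workflowReminder, appExecutionStart time.Time) time.Duration {
	if r.ScheduledTime != nil {
		// Scheduled workflow: measure against the requested start time.
		return appExecutionStart.Sub(*r.ScheduledTime)
	}
	// Immediate workflow: measure against reminder creation time.
	return appExecutionStart.Sub(r.CreationTime)
}
```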

@ASHIQUEMD
Contributor

ASHIQUEMD commented Jan 3, 2024

Can we have dimensions/tags defined for all the metrics, especially the latencies? For example, for workflow activity execution latency, what are the different dimensions it will capture?

@ASHIQUEMD
Contributor

Workflow Execution Latency

Definition: Workflow Execution Latency is the duration between the start and the end of a workflow execution. It measures the time taken for a workflow to complete its execution after it has been started.

  • Start of Workflow Execution: The workflow execution is considered to have started after the execution of this line of code. This is the point where the workflow transitions from the scheduled state to the running state.

  • End of Workflow Execution: The workflow execution is considered to have completed when the function runtimeState.IsCompleted() returns true. This indicates that the workflow has finished processing all its tasks and has reached its final state.

Workflow Scheduling Latency

Definition: Workflow Scheduling Latency is the duration between the time a workflow is scheduled to start and the actual start of its execution. It measures the delay in the start of a workflow execution due to the scheduling process.

  • Calculation: The Scheduling Latency is calculated by subtracting the ScheduledStartTimestamp (which is passed as part of the workflow state) from the actual start time of the workflow execution (as defined above).
