[Proposal] Workflow building block and engine #4576

Closed
johnewart opened this issue May 2, 2022 · 66 comments

@johnewart
Contributor

johnewart commented May 2, 2022

In what area(s)?

/area runtime

What?

This document proposes that the Dapr runtime be extended to include a new workflow building block. This building block, in combination with a lightweight, portable, workflow engine will enable developers to express workflows as code that can be executed, interacted with, monitored, and debugged using the Dapr runtime.

Why?

Many complex business processes are well modeled as workflows: sets of steps that must be orchestrated with resiliency and a guarantee of completion (where success and failure both count as completion). To build such workflows, developers are often faced with needing to solve a host of complex problems, including (but not limited to):

  • Scheduling
  • Lifecycle management
  • State storage
  • Monitoring and debugging
  • Resiliency
  • Failure handling mechanisms

Based on available data, it is clear that workflows are quite popular; at the time of this writing, hosted workflows and their tasks are executed billions of times per day across tens of thousands of Azure subscriptions.

What is a workflow?

A workflow, for the purpose of this proposal, is defined as application logic that defines a business process or data flow that:

  • Has a specific, pre-defined, deterministic lifecycle (e.g., Pending -> Running -> [Completed | Failed | Terminated])
  • Is guaranteed to complete
  • Is durable (i.e., it completes even in the face of transient errors)
  • Can be scheduled to start or execute steps at or after some future time
  • Can be paused and resumed (explicitly or implicitly)
  • Can execute portions of the workflow serially or in parallel
  • Can be directly addressed by external agents (i.e., an instance of the workflow can be interacted with directly - paused, resumed, queried, etc.)
  • May be versioned
  • May be stateful
  • May create new sub-workflows and optionally wait for those to complete before progressing
  • May rely on external components to perform its work (e.g., HTTPS API calls, pub/sub message queues, etc.)

Why Dapr?

Dapr already contains many of the building blocks required to provide reliability, scalability, and durability for workflow execution. Building such an engine inside Dapr, and providing the necessary building blocks, will increase developer productivity through reuse of existing features and independence from the underlying execution mechanism, thereby increasing portability.

In addition to the built-in execution engine, Dapr can provide a consistent programming interface for interacting with third-party workflow execution systems (e.g., AWS SWF, Apache Camel, Drools) for those who are already using these tools, thereby providing a standardized interface for working with external workflows as well as those running inside Dapr.

Proposal

High-level overview of changes

We propose that the following features / capabilities be added to the Dapr runtime:

  • A new "workflow" building block
  • A portable, lightweight, workflow engine embedded into the Dapr sidecar capable of supporting long-running, resilient, and durable workflows through Dapr's building blocks
  • An expressive, developer-friendly, programming model for building workflows as code
  • Support for containerized, declarative, workflows (such as the CNCF Serverless Workflow specification)
  • Extensions to the Dapr dashboard for monitoring / managing workflow execution
  • APIs for interacting with workflows

The Workflow building block

As mentioned before, this proposal includes the addition of a new workflow building block. Like most of the other Dapr building blocks (state stores, pubsub, etc.) the workflow building block will consist of two primary things:

  • A pluggable component model for integrating various workflow engines
  • A set of APIs for managing workflows (start, schedule, pause, resume, cancel)

Similar to the built-in support for actors, we also propose implementing a built-in runtime for workflows (see the DTFx-go engine described in the next section). Unlike actors, the workflow runtime component can be swapped out with an alternate implementation. If developers want to work with other workflow engines, such as externally hosted workflow services like Azure Logic Apps, AWS Step Functions, or Temporal.io, they can do so with alternate community-contributed workflow components.

The value of this building block for vendors is that workflows supported by their platforms can be exposed as APIs with support for HTTP and the Dapr SDKs. The less visible benefits of mTLS, distributed tracing, etc. will also be available. Various abstractions, such as async HTTP polling, can also be supported via Dapr without the workflow vendor needing to implement them themselves.

Introducing DTFx-go

We propose adding a lightweight, portable, embedded workflow engine (DTFx-go) to the Dapr sidecar that leverages existing Dapr components, including actors and state storage, in its underlying implementation. Because it is lightweight and portable, developers will be able to execute workflows that run inside DTFx-go locally as well as in production with minimal overhead; this enhances the developer experience by integrating workflows with the existing Dapr development model that users enjoy.

The new engine will be written in Go and inspired by the existing Durable Task Framework (DTFx) engine. We’ll call this new version of the framework DTFx-go to distinguish it from the .NET implementation (which is not part of this proposal). It will exist as an open-source project with a permissive license (e.g., Apache 2.0) so that it remains compatible as a dependency for CNCF projects. Note that it’s important to ensure this engine remains lightweight so as not to noticeably increase the size of the Dapr sidecar.

Importantly, DTFx-go will not be exposed to the application layer. Rather, the Dapr sidecar will expose DTFx-go functionality over a gRPC stream. The Dapr sidecar will not execute any app-specific workflow logic or load any declarative workflow documents. Instead, app containers will be responsible for hosting the actual workflow logic. The Dapr sidecar will send and receive workflow commands over gRPC to and from the connected app’s workflow logic and execute commands on behalf of the workflow (service invocation, invoking bindings, etc.). Other concerns such as activation, scale-out, and state persistence will be handled by internally managed actors. More details on all of this are discussed in subsequent sections.

Execution, scheduling and resilience

Internally, Dapr workflow instances will be implemented as actors. Actors drive workflow execution by communicating with the workflow SDK over a gRPC stream. By using actors, the problems of placement and scalability are already solved for us.

(placement diagram)

The execution of individual workflows will be triggered using actor reminders as they are both persistent and durable (two critical features of workflows). If a container or node crashes during a workflow’s execution, the actor’s reminder will ensure it gets activated again and resumes where it left off (using state storage to provide durability, see below).

To prevent a workflow from blocking (unintentionally), each workflow will be composed of two separate actor components: one acting as the scheduler/coordinator and the other performing the actual work (calling API services, performing computation, etc.).

(workflow execution diagram)

Storage of state and durability

In order for a workflow execution to reliably complete in the face of transient errors, it must be durable -- meaning that it is able to store data at checkpoints as it makes progress. To achieve this, workflow executions will rely on Dapr's state storage to provide stable storage so that the workflow can be safely resumed from a known state in the event that it is explicitly paused or a step is prematurely terminated (system failure, lack of resources, etc.).

Workflows as code

The term "workflow as code" refers to the implementation of a workflow’s logic using general purpose programming languages. "Workflow as code" is used in a growing number of modern workflow frameworks, such as Azure Durable Functions, Temporal.io, and Prefect (Orion). The advantage of this approach is its developer-friendliness. Developers can use a programming language that they already know (no need to learn a new DSL or YAML schema), they have access to the language’s standard libraries, can build their own libraries and abstractions, can use debuggers and examine local variables, and can even write unit tests for their workflows just like they would any other part of their application logic.

The Dapr SDK will internally communicate with the DTFx-go gRPC endpoint in the Dapr sidecar to receive new workflow events and send new workflow commands, but these protocol details will be hidden from the developer. Due to the complexities of the workflow protocol, we are not proposing any HTTP API for the runtime aspect of this feature.

Support for declarative workflows

We expect workflows as code to be very popular with developers because working with code is both natural for developers and much more expressive and flexible than declarative workflow modeling languages. In spite of this, there will still be users who prefer or require workflows to be declarative. To support this, we propose building an experience for declarative workflows as a layer on top of the "workflow as code" foundation. A variety of declarative workflows could be supported in this way. For example, this model could be used to support the AWS Step Functions workflow syntax, the Azure Logic Apps workflow syntax, or even the Google Cloud Workflow syntax. However, for the purpose of this proposal, we’ll focus on what it would look like to support the CNCF Serverless Workflow specification. Note, however, that the proposed model could be used to support any number of declarative workflow schemas.

CNCF Serverless Workflows

Serverless Workflow (SLWF) consists of an open-source, standards-based DSL and dev tools for authoring and validating workflows in either JSON or YAML. SLWF was specifically selected for this proposal because it represents a cloud native and industry standard way to author workflows. There is an existing set of open-source tools for generating and validating these workflows that can be adopted by the community. It’s also an ideal fit for Dapr since it’s under the CNCF umbrella (currently as a sandbox project). This proposal would support the SLWF project by providing it with a lightweight, portable runtime – i.e., the Dapr sidecar.

Hosting Serverless Workflows

In this proposal, we use the Dapr SDKs to build a new, portable SLWF runtime that leverages the Dapr sidecar. Most likely this would be implemented as a reusable container image that supports loading workflow definition files from Dapr state stores (the exact details need to be worked out). Note that the Dapr sidecar doesn’t load any workflow definitions. Rather, the sidecar simply drives the execution of the workflows, leaving all other details to the application layer.

API

Start Workflow API

HTTP / gRPC

Developers can start workflow instances by issuing an HTTP (or gRPC) API call to the Dapr sidecar:

POST http://localhost:3500/v1.0/workflows/{workflowType}/{instanceId}/start

Workflows are assumed to have a type that is identified by the {workflowType} parameter. Each workflow instance must also be created with a unique {instanceId} value. The payload of the request is the input of the workflow. If a workflow instance with this ID already exists, this call will fail with an HTTP 409 Conflict.

To support the asynchronous HTTP polling pattern for HTTP clients, this API will return an HTTP 202 Accepted response with a Location header containing a URL that can be used to get the status of the workflow (see further below). When the workflow completes, that endpoint will return an HTTP 200 response. If the workflow fails, the endpoint can return a 4XX or 5XX error HTTP response code. Some of these details may need to be configurable since there is no universal protocol for async API handling.
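
For illustration, the exchange for a hypothetical "order-processing" workflow might look like the following (the instance ID and the 202-while-running convention are illustrative only, since these details are configurable):

POST http://localhost:3500/v1.0/workflows/order-processing/12345678/start
  -> HTTP 202 Accepted
     Location: http://localhost:3500/v1.0/workflows/order-processing/12345678

GET http://localhost:3500/v1.0/workflows/order-processing/12345678
  -> HTTP 202 Accepted (workflow still running; poll again later)

GET http://localhost:3500/v1.0/workflows/order-processing/12345678
  -> HTTP 200 OK (workflow completed)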

Input bindings

For certain types of automation scenarios, it can be useful to trigger new instances of workflows directly from Dapr input bindings. For example, it may be useful to trigger a workflow in response to a tweet from a particular user account using the Twitter input binding. Another example is starting a new workflow in response to a Kubernetes event, like a deployment creation event.

The instance ID and input payload for the workflow depend on the configuration of the input binding. For example, a user may want to use a Tweet’s unique ID or the name of the Kubernetes deployment as the instance ID.

Pub/Sub

Workflows can also be started directly from pub/sub events, similar to the proposal for Actor pub/sub. Configuration on the pub/sub topic can be used to identify an appropriate instance ID and input payload to use for initializing the workflow. In the simplest case, the source + ID of the cloud event message can be used as the workflow’s instance ID.

Terminate workflow API

HTTP / gRPC

Workflow instances can also be terminated using an explicit API call.

POST http://localhost:3500/v1.0/workflows/{workflowType}/{instanceId}/terminate

Workflow termination is primarily an operation that a service operator takes if a particular business process needs to be cancelled, or if a problem with the workflow requires it to be stopped to mitigate impact to other services.

If a payload is included in the POST request, it will be saved as the output of the workflow instance.

Raise Event API

Workflows are especially useful when they can wait for and be driven by external events. For example, a workflow could subscribe to events from a pubsub topic as shown in the Phone Verification sample. However, this capability shouldn’t be limited to pub/sub events.

HTTP / gRPC

An API should exist for publishing events directly to a workflow instance:

POST http://localhost:3500/v1.0/workflows/{workflowType}/{instanceId}/raiseEvent

The result of the "raise event" API is an HTTP 202 Accepted, indicating that the event was received but possibly not yet processed. A workflow can consume an external event using the waitForExternalEvent SDK method.
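
For illustration, a workflow might consume such an event roughly as follows (a sketch only, reusing the workflowClient setup from the examples at the end of this proposal; the exact waitForExternalEvent signature and the "approval" event name are assumptions):

// Sketch: wait for an event raised via POST .../workflows/approval-workflow/{instanceId}/raiseEvent
workflowClient.addWorkflow('approval-workflow', function*(context: DaprWorkflowContext, order: any) {
    // durably block until an "approval" event is raised for this workflow instance
    const approval = yield context.waitForExternalEvent("approval");
    return approval.approved ? "approved" : "rejected";
});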

Get workflow metadata API

HTTP / gRPC

Users can fetch the metadata of a workflow instance using an explicit API call.

GET http://localhost:3500/v1.0/workflows/{workflowType}/{instanceId}

The result of this call is workflow instance metadata, such as its start time, runtime status, completion time (if completed), and custom or runtime-specific status. If supported by the target runtime, workflow inputs and outputs can also be fetched using the query API.
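
An illustrative response body might look like the following (the field names are hypothetical and would be finalized as part of the API design):

{
    "instanceId": "12345678",
    "workflowType": "order-processing",
    "runtimeStatus": "RUNNING",
    "createdAt": "2022-05-02T20:15:00Z",
    "lastUpdatedAt": "2022-05-02T20:16:30Z",
    "completedAt": null,
    "customStatus": "waiting for approval"
}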

Purge workflow metadata API

Users can delete all state associated with a workflow using the following API:

DELETE http://localhost:3500/v1.0/workflows/{workflowType}/{instanceId}

When using the embedded workflow component, this will delete all state stored by the workflow’s underlying actor(s).

Footnotes and Examples

Example 1: Bank transaction

In this example, the workflow is implemented as a JavaScript generator function. The "bank1" and "bank2" parameters are microservice apps that use Dapr, each of which exposes "withdraw" and "deposit" APIs. The Dapr APIs available to the workflow come from the context parameter object and return a "task", which is effectively the same as a Promise. Calling yield on the task causes the workflow to durably checkpoint its progress and wait until Dapr responds with the output of the service method. The value of the task is the service invocation result. If any service method call fails with an error, the error is surfaced as a raised JavaScript error that can be caught using normal try/catch syntax. This code can also be debugged using a Node.js debugger.

Note that the details around how code is written will vary depending on the language. For example, a C# SDK would allow developers to use async/await instead of yield. Regardless of the language details, the core capabilities will be the same across all languages.

import { DaprWorkflowClient, DaprWorkflowContext, HttpMethod } from "dapr-client"; 

const daprHost = process.env.DAPR_HOST || "127.0.0.1"; // Dapr sidecar host 

const daprPort = process.env.DAPR_WF_PORT || "50001"; // Dapr sidecar port for workflow 

const workflowClient = new DaprWorkflowClient(daprHost, daprPort); 

// Funds transfer workflow which receives a context object from Dapr and an input 
workflowClient.addWorkflow('transfer-funds-workflow', function*(context: DaprWorkflowContext, op: any) { 
    // use built-in methods for generating pseudo-random data in a workflow-safe way 
    const transactionId = context.createV5uuid(); 

    // try to withdraw funds from the source account. 
    const success = yield context.invoker.invoke("bank1", "withdraw", HttpMethod.POST, { 
        srcAccount: op.srcAccount, 
        amount: op.amount, 
        transactionId 
    }); 

    if (!success.success) { 
        return "Insufficient funds"; 
    } 

    try { 
        // attempt to deposit into the dest account, which is part of a separate microservice app 
        yield context.invoker.invoke("bank2", "deposit", HttpMethod.POST, {
            destAccount: op.destAccount, 
            amount: op.amount, 
            transactionId 
        }); 
        return "success"; 
    } catch { 
        // compensate for failures by returning the funds to the original account 
        yield context.invoker.invoke("bank1", "deposit", HttpMethod.POST, { 
            destAccount: op.srcAccount, 
            amount: op.amount, 
            transactionId 
        }); 
        return "failure"; 
    } 
}); 

// Call start() to start processing workflow events 
workflowClient.start(); 

Example 2: Phone Verification

Here’s another sample that shows how a developer might build an SMS phone verification workflow. The workflow receives some user’s phone number, creates a challenge code, delivers the challenge code to the user’s SMS number, and waits for the user to respond with the correct challenge code.

The important takeaway is that the end-to-end workflow can be represented as a single, easy-to-understand function. Rather than relying directly on actors to hold state explicitly, state (such as the challenge code) can simply be stored in local variables, drastically reducing the overall code complexity and making the solution easily unit testable.

import { DaprWorkflowClient, DaprWorkflowContext, HttpMethod } from "dapr-client"; 

const daprHost = process.env.DAPR_HOST || "127.0.0.1"; // Dapr sidecar host 
const daprPort = process.env.DAPR_WF_PORT || "50001"; // Dapr sidecar port for workflow 
const workflowClient = new DaprWorkflowClient(daprHost, daprPort); 

// Phone number verification workflow which receives a context object from Dapr and an input 
workflowClient.addWorkflow('phone-verification', function*(context: DaprWorkflowContext, phoneNumber: string) { 

    // Create a challenge code and send a notification to the user's phone 
    const challengeCode = yield context.invoker.invoke("authService", "createSmsChallenge", HttpMethod.POST, { 
        phoneNumber 
    }); 

    // Schedule a durable timer for some future date (e.g. 5 minutes or perhaps even 24 hours from now) 
    const expirationTimer = context.createTimer(challengeCode.expiration); 

    // The user gets three tries to respond with the right challenge code 
    let authenticated = false; 

    for (let i = 0; i < 3; i++) { 
        // subscribe to the event representing the user challenge response 
        const responseTask = context.pubsub.subscribeOnce("my-pubsub-component", "sms-challenge-topic"); 

        // block the workflow until either the timeout expires or we get a response event 
        const winner = yield context.whenAny([expirationTimer, responseTask]); 

        if (winner === expirationTimer) { 
            break; // timeout expired 
        } 

        // we get a pubsub event with the user's SMS challenge response 
        if (responseTask.result.data.challengeNumber === challengeCode.number) { 
            authenticated = true; // challenge verified! 
            expirationTimer.cancel(); 
            break; 
        } 
    } 

    // the return value is available as part of the workflow status. Alternatively, we could send a notification. 
    return authenticated; 
}); 

// Call listen() to start processing workflow events 
workflowClient.listen(); 

Example 3: Declarative workflow for monitoring patient vitals

The following is an example of a very simple SLWF workflow definition that listens on three different event types and invokes a function depending on which event was received.

{ 
    "id": "monitorPatientVitalsWorkflow", 
    "version": "1.0", 
    "name": "Monitor Patient Vitals Workflow", 
    "states": [ 
      { 
        "name": "Monitor Vitals", 
        "type": "event", 
        "onEvents": [ 
          { 
            "eventRefs": [ 
              "High Body Temp Event", 
              "High Blood Pressure Event" 
            ], 
            "actions": [{"functionRef": "Invoke Dispatch Nurse Function"}] 
          }, 
          { 
            "eventRefs": ["High Respiration Rate Event"], 
            "actions": [{"functionRef": "Invoke Dispatch Pulmonologist Function"}] 
          } 
        ], 
        "end": true 
      } 
    ], 
    "functions": "file://my/services/asyncapipatientservicedefs.json", 
    "events": "file://my/events/patientcloudeventsdefs.yml" 
} 

The functions defined in this workflow would map to Dapr service invocation calls. Similarly, the events would map to incoming Dapr pub/sub events. Behind the scenes, the runtime (which is built using the Dapr SDK APIs mentioned previously) handles the communication with the Dapr sidecar, which in turn manages the checkpointing of state and recovery semantics for the workflows.
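
As a rough illustration of that layering, a declarative runtime built with the SDK might translate the event state above into "workflow as code" calls along these lines (a heavily simplified sketch reusing the client setup from the earlier examples; definition is assumed to be the parsed SLWF JSON, and topicFor/appIdFor/methodFor are hypothetical helpers that resolve the document's event and function references):

// Sketch: a declarative runtime layer registers one generated workflow per SLWF definition.
workflowClient.addWorkflow(definition.id, function*(context: DaprWorkflowContext, input: any) { 
    for (const state of definition.states) { 
        if (state.type !== "event") continue; 

        // wait for any of the referenced pub/sub events 
        const eventTasks = state.onEvents.map(e => ({ 
            onEvent: e, 
            task: context.pubsub.subscribeOnce("my-pubsub-component", topicFor(e.eventRefs)) // topicFor is hypothetical 
        })); 
        const winner = yield context.whenAny(eventTasks.map(t => t.task)); 
        const matched = eventTasks.find(t => t.task === winner); 

        // map each matched action's functionRef to a Dapr service invocation 
        for (const action of matched.onEvent.actions) { 
            yield context.invoker.invoke(appIdFor(action.functionRef), methodFor(action.functionRef), 
                HttpMethod.POST, winner.result.data); // appIdFor/methodFor are hypothetical 
        } 
    } 
    return "completed"; 
}); 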

@yaron2
Member

yaron2 commented May 2, 2022

@johnewart
Contributor Author

(Fixed the placement image, there were previously two duplicate Actor entities)

@halspang
Contributor

halspang commented May 3, 2022

Workflows are one of my favorite areas for actors so I'm happy to see them here :) Quick question though.

Can you describe more on how you're planning on using reminders? Are they just to let users start a workflow on a given period? Or, are we going to be registering reminders for every workflow step and then deleting them when they are done?

If it's the latter, that could end up being a lot of reminders. If that's the case we should call out that we'll use reminder partitioning as I think it'd be likely that we'd hit the tipping point for reminder scaling (which does vary based on the underlying statestore).

@olitomlinson

olitomlinson commented May 3, 2022

Looking forward to this! Initial Qs...

The description of DTFx-go reads "lightweight, portable, embedded workflow engine (DTFx-go) in the Dapr sidecar"

  • Does this mean that you get a workflow engine out-of-the-box, by default, in any environment that is running Dapr? So I don't have to select SLWF, or temporal.io etc?

  • Do I have to specify a state-management component to provide the state store behind the OOTB workflow engine (DTFx-go) ? i.e. With Dapr actors, I must define a state-management component to back the actor state.

Other thoughts

  • Is it wise to call this implementation DTFx-go? I'm specifically talking about the DTFx bit. I ask because DTFx brings a bunch of baggage/concepts/knowledge, such as TaskHub and the various storage providers for the TaskHub. Could this introduce confusion, as customers may expect a degree of interoperability with DTFx?! For example, I had the initial thought that this might be compatible with the DF Monitor utility - but it won't be, as I don't see a TaskHub concept in the DTFx-go proposal thus far.
  • Assuming DF Monitor is NOT compatible, it might be worth lifting some of the concepts from DF Monitor and hosting similar concepts in the Dapr Dashboard to help with the observability and management of workflows.

@cgillum
Contributor

cgillum commented May 3, 2022

Can you describe more on how you're planning on using reminders? Are they just to let users start a workflow on a given period? Or, are we going to be registering reminders for every workflow step and then deleting them when they are done?

If it's the latter, that could end up being a lot of reminders. If that's the case we should call out that we'll use reminder partitioning as I think it'd be likely that we'd hit the tipping point for reminder scaling (which does vary based on the underlying statestore).

The exact details are still being ironed out, but it's essentially a variation of the latter - i.e., one active reminder per workflow instance (though not necessarily per action). I agree that leveraging the reminder partitioning work is the right way to ensure this remains scalable. We need to do a bit more research here to figure out the details.

@cgillum
Contributor

cgillum commented May 4, 2022

Does this mean that you get a workflow engine out-of-the-box, by default, in any environment that is running Dapr? So I don't have to select SLWF, or temporal.io etc?

Correct - the embedded engine will be the "out of the box" option for anyone that doesn't want to install additional infrastructure into their cluster. We want external workflow services to be supported by the building block (we expect many will prefer to use workflow systems that they already know and love), but not required.

Do I have to specify a state-management component to provide the state store behind the OOTB workflow engine (DTFx-go) ? i.e. With Dapr actors, I must define a state-management component to back the actor state.

Yes, this is essentially a programming model that sits on top of actors, so you'll still need to configure a state store that supports actors.

Is it wise to call this implementation DTFx-go? I'm specifically talking about the DTFx bit. I ask because DTFx brings a bunch of baggage/concepts/knowledge. Such as TaskHub, and the various storage Providers for the TaskHub. Could this introduce confusion, as customers may expect a degree of interoperability with DTFx?! For example.

I'm not too worried about confusion because DTFx-go is just an implementation detail that most users won't know or care about. Users of Dapr will simply be presented with "Dapr Workflow" as a concept and we wouldn't necessarily expose the same extensibility or tooling. The existing DTFx isn't super well-known outside of Azure Functions or internal Microsoft circles, so I'm assuming there won't be a lot of opportunity for confusion even for folks who care to look at the implementation details.

FWIW, the DTFx-go backend storage provider will be one built specifically for storing state and load balancing via the Dapr Actors infrastructure.

Assuming DF Monitor is NOT compatible, it might be worth lifting some of the concepts from DF Monitor and hosting similar concepts in the Dapr Dashboard to help with the observability and management of workflows.

Yes, integration with the Dapr Dashboard is definitely part of the plan.

@olitomlinson

@cgillum got it thanks.

Suggestion: It might be worth updating the "Storage of state and durability" section to be a little more explicit that an “Actor compatible” Dapr state store component is required in order to light up the embedded engine.

@olitomlinson

  • Any indication if workflows must be deterministic, due to replay semantics?

@jplane

jplane commented May 4, 2022

Very interesting proposal... look forward to seeing it progress!

I wonder about the feasibility of a pluggable execution layer... certainly you can define contracts for required integration points (surfacing metadata, lifecycle management, state hydration, etc.) and implement, say, SLWF on top of that from scratch.

How would you envision that working for an existing runtime like AWS Step Functions or Temporal.io, which aren't necessarily built to plug into those abstractions, or in general to be driven "from the outside"?

For this to work, it seems like you would need a "workflow internals spec" to define the required integration points... and then need vendors to implement it. Or do I misunderstand?

@cgillum
Contributor

cgillum commented May 4, 2022

Any indication if workflows must be deterministic, due to replay semantics?

Yes, thanks @olitomlinson for calling this out. @johnewart I think we need to update the description above to reflect this important coding constraint.

@cgillum
Contributor

cgillum commented May 4, 2022

For this to work, it seems like you would need a "workflow internals spec" to define the required integration points... and then need vendors to implement it. Or do I misunderstand?

@jplane I wonder if there might be a slight misunderstanding. We're not proposing that the internal execution engine should support plugging in existing WF runtimes like Step Functions or Temporal.io. That would be a really hard problem to solve, as you suggested, and may not be in everyone's best interest. Rather, we're proposing two specific stories for how other workflow languages and/or runtimes can be pulled in:

  1. The WF building block contract (i.e. the HTTP APIs mentioned) can be used to interface with externally hosted workflow services, similar to how the state stores and pubsub building blocks work. In this case, the built-in engine isn't used at all.
  2. Things like SLWF (or other declarative workflow languages, including the AWS Step Functions spec - i.e. the Amazon States Language) can be supported by implementing a new runtime layer on top of the built-in engine's programming model. In this case, we're using the Dapr workflow engine and not any other existing engine.

The latter point (2) isn't strictly "pluggable extensibility", per se, but more of a model for how developers could contribute their own declarative workflow runtimes that internally rely on the Dapr Workflow built-in engine. It's very similar to the POC SLWF prototype you and I built some time back on top of Durable Functions - the existing Durable engine was used to implement scheduling, durability, etc., and a layer on top was built to interpret the SLWF markup and interface with the Durable APIs.

I hope that makes sense. I can try to clarify further if it's still confusing.

@jplane

jplane commented May 4, 2022

Thx @cgillum for the explanation... that makes sense now.

I really like the direction... be ruthlessly sparse with the engine programming model, and let a thousand higher-order models bloom on top of it. Nice!

Not 100% sold on the BYOruntime sales pitch just yet... the Dapr WF engine's semantics will become a pseudo-standard, so anything that does state management, work scheduling, etc. differently may not be seamlessly swappable behind the building block contract (or, at least, might violate The Principle of Least Surprise for the unsuspecting caller). I see the appeal of swappable runtimes... just wondering about real-world challenges, too.

@yaron2
Member

yaron2 commented May 4, 2022

the Dapr WF engine's semantics will become a pseudo-standard

The goal is to make the Dapr WF APIs (and all Dapr APIs in general) a standard, via work that is ongoing in our API-spec special interest group.

so anything that does state management, work scheduling, etc. differently may not be seamlessly swappable behind the building block contract (or, at least, might violate The Principle of Least Surprise for the unsuspecting caller). I see the appeal of swappable runtimes... just wondering about real-world challenges, too.

I agree with what you're saying here, but I actually think it's valid. In this case, we certainly want to encourage users to choose the default, tested, and optimized path of least resistance, yet open the door for other runtimes if there are special considerations to be made.

@davidmrdavid

Excited to finally see this being discussed! Which means I get to learn more about it myself :-)

I'm a bit unsure about the capabilities of the "worker" actor. Is it correct to say that all context.invoker.invoke commands in the sample code translate to an operation in the worker? Is it also safe to say the worker actor exists simply to provide a framework-compatible means of performing I/O, which is otherwise not allowed directly in the user code? Thanks!

@cgillum
Contributor

cgillum commented May 5, 2022

Is it correct to say that all context.invoker.invoke commands in the sample code translate to an operation in the worker?

Behind the scenes, yes. This actor is designed to do any work that may take an indeterminate amount of time to complete, like service invocation. This frees up the scheduler actor to do other work, like respond to queries.

Is it also safe to say the worker actor exists simply to provide a framework-compatible means of performing I/O, which is otherwise not allowed directly in the user code?

Not necessarily. Technically, the scheduler actor could do all the I/O on behalf of the workflow code. The worker actor is really only for potentially long-running I/O, to keep the scheduler actor from getting blocked for too long (actors are single threaded). We may have the scheduler actor do other types of I/O directly, like publishing pub/sub messages.

@olitomlinson

Instead, app containers will be responsible for hosting the actual workflow logic.

  • Can a single container image host multiple different types of workflow? Any theoretical limits you are aware of at this point?

  • Can a container image that hosts a workflow also invoke endpoints hosted in that same container image?

@cgillum
Contributor

cgillum commented May 5, 2022

Can a single container image host multiple different types of workflow? Any theoretical limits you are aware of at this point?

Yes, absolutely. The code samples above show only one call to workflowClient.addWorkflow(...), but multiple calls can be made to register multiple workflows from the same app.
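
For example (a sketch using the same addWorkflow API shown in the proposal's samples):

// A single app/container can register any number of workflow types against one client 
workflowClient.addWorkflow('transfer-funds-workflow', function*(context, op) { /* ... */ }); 
workflowClient.addWorkflow('phone-verification', function*(context, phoneNumber) { /* ... */ }); 
workflowClient.start(); 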

Can a container image that hosts a workflow also invoke endpoints hosted in that same container image?

Yes, a single container image/app can host workflows and service invocation endpoints together, so if the context.invoker.invoke call targets the currently running Dapr app, then the same container image would be the one that receives the service invocation request.

@davidmrdavid

Is it correct to say that all context.invoker.invoke commands in the sample code translate to an operation in the worker?

Behind the scenes, yes. This actor is designed to do any work that may take an indeterminate amount of time to complete, like service invocation. This frees up the scheduler actor to do other work, like respond to queries.

Is it also safe to say the worker actor exists simply to provide a framework-compatible means of performing I/O, which is otherwise not allowed directly in the user code?

Not necessarily. Technically, the scheduler actor could do all the I/O on behalf of the workflow code. The worker actor is really only for potentially long-running I/O, to keep the scheduler actor from getting blocked for too long (actors are single threaded). We may have the scheduler actor do other types of I/O directly, like publishing pub/sub messages.

Thanks, this makes sense. So I suppose that there will be a pre-determined mapping that makes certain APIs, like context.invoker.invoke always go onto the worker actor, while other more constrained APIs will go to the scheduler.

Something else that stood out to me is that I don't see any reference to context.invoker.invoke in the API listing of the original post. Will that be part of a different proposal? I'm interested in understanding what other methods and utilities can be accessed from the DaprWorkflowContext object :-) .

@cgillum
Contributor

cgillum commented May 5, 2022

The original post goes into the details of the workflow building block APIs, which describe how existing app code can interact with Dapr workflows, whether self-hosted or externally hosted. APIs for implementing self-hosted workflows like context.invoker.invoke aren't currently enumerated. Right now, we're expecting to cover core Dapr APIs, like service invocation, pub/sub, bindings, etc. but will likely have a few others as well. Exact details TBD.

@olitomlinson

The context object — is part of the Workflow SDK, right?

User code must use this context object from the SDK? There is no option to use a HTTP API?

If what I’ve said is correct, this wouldn’t align with the dapr principle of being language agnostic, right? Might that cause a rub?

@cgillum
Contributor

cgillum commented May 5, 2022

User code must use this context object from the SDK? There is no option to use a HTTP API?

Correct, and @johnewart mentioned this in the original post in the "Workflow as code" section:

The Dapr SDK will internally communicate with the DTFx-go gRPC endpoint in the Dapr sidecar to receive new workflow events and send new workflow commands, but these protocol details will be hidden from the developer. Due to the complexities of the workflow protocol, we are not proposing any HTTP API for the runtime aspect of this feature.

The building block APIs will be exposed over HTTP; just not the workflow runtime APIs. Indeed, it's a deviation from how other Dapr building blocks work, including actors, but I think it's an appropriate tradeoff that allows us to build a fully-featured workflows runtime implementation. It would be very difficult for someone to correctly implement the workflow runtime APIs using an HTTP client.

@yaron2
Member

yaron2 commented May 5, 2022

Indeed, it's a deviation from how other Dapr building blocks work

Not much of a deviation. The Configuration API started out as gRPC only, and the upcoming Distributed Lock API is also gRPC only.

@olitomlinson

Sorry I should have said

User code must use this context object from the SDK? There is no option to use a HTTP API or gRPC API

The API surfaces for the building blocks are simple, which is why you can bring your own HTTP or gRPC client (and avoid using SDKs), but with the workflow SDK the API surface is complex (as Chris just mentioned), so using a language-specific WF SDK is unavoidable.

To me it’s fair to say that this is a deviation. Trade-off? Sure, but still a deviation from the status quo. Not trying to be negative btw, just highlighting the gap and how it lightly challenges my perception and expectations of dapr, as a user/consumer.

@yaron2
Member

yaron2 commented May 5, 2022

The API surfaces for the building blocks are simple, which is why you can bring your own HTTP or gRPC client (and avoid using SDKs), but with the workflow SDK the API surface is complex (as Chris just mentioned), so using a language-specific WF SDK is unavoidable.

True, but you'll find it's the same with actors, which de facto necessitate an Actor SDK to provide a simple programming model that is otherwise very complex when using the APIs directly.

@olitomlinson

Got it. I forget that dapr Actors require an SDK for the programming model. In that case, ignore everything I’ve said!

@johnewart
Contributor Author

johnewart commented May 6, 2022

@olitomlinson -- I think those are great questions so thanks for asking them! I will avoid sounding like an echo chamber since Chris, Hal and Yaron mostly answered everyone's questions (they're so quick they answered before I even saw them!).

That being said, I can see a world where it might be possible to define workflows without using a language SDK, similar to how GitHub declares its workflows as YAML, with built-in predefined actions (or actions people have written similar to components). However, the explicit goal of this design is to avoid yet another workflow language ("yawful?" 😄) and allow developers to use their language of choice.

@yaron2
Member

yaron2 commented May 6, 2022

Got it. I forget that dapr Actors require an SDK for the programming model. In that case, ignore everything I’ve said!

No, you did bring up a valid point. While we aim to keep APIs accessible and usable over standard protocols directly (HTTP REST, gRPC) to increase adoption and be inclusive to all programming languages and frameworks, the first principle of Dapr to make developers successful does allow for more opinionated programming models like Actors and Workflows, provided there is a well reasoned, non-nightmarish way to extend them over HTTP and/or gRPC. So far this has worked very well for actors and I've reason to believe based on this proposal that it'll be the same for workflows.

@beiwei30
Member

beiwei30 commented May 7, 2022

This is a great proposal. I only have one minor question about the API: as you said, there's "A set of APIs for managing workflows (start, schedule, pause, resume, cancel)", but I cannot tell from the "API" section how a workflow can be paused and resumed.

@jplane

jplane commented May 7, 2022

I agree the wording is confusing, but I'm guessing the intent is to use the RaiseEvent API... some external process sends a custom 'PauseEvent' to the workflow, and then later a custom 'ResumeEvent'. The workflow just needs to understand and anticipate those events and react accordingly. Presumably the workflow programming model will expose the events from DaprWorkflowClient or similar (from the code examples above).

Assuming similarity to DTFx (mentioned above as inspiration for this proposal)... application semantics and communication protocol between workflow and outside world are left up to the workflow developer, subject to some constraints imposed by the runtime to provide durability and determinism guarantees.

@cgillum and @johnewart is that the basic idea?
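
For what it's worth, here is a rough sketch of that pattern using only the APIs shown in this proposal (waitForExternalEvent, whenAny, and service invocation); the 'pause'/'resume' event names and the step-driven shape of the workflow are made up for illustration:

workflowClient.addWorkflow('pausable-workflow', function*(context: DaprWorkflowContext, input: any) { 
    // an operator pauses the workflow by raising a 'pause' event via the raiseEvent API 
    let pauseRequested = context.waitForExternalEvent("pause"); 

    for (const step of input.steps) { 
        const work = context.invoker.invoke(step.appId, step.method, HttpMethod.POST, step.payload); 
        const winner = yield context.whenAny([pauseRequested, work]); 

        if (winner === pauseRequested) { 
            // hold here until a 'resume' event is raised, then let the in-flight step finish 
            yield context.waitForExternalEvent("resume"); 
            yield work; 
            pauseRequested = context.waitForExternalEvent("pause"); // re-arm for the next pause 
        } 
    } 
    return "done"; 
}); 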

@cgillum
Contributor

cgillum commented Sep 27, 2022

@olitomlinson the workflow engine depends on the state management organization of the underlying actors subsystem. Workflows will therefore inherit the isolation behavior of actors. We’re otherwise not making any explicit decisions about isolation for workflows. If actors are improved with better isolation guarantees, then workflows should be able to easily inherit the same improvements.

@olitomlinson

@cgillum An Actor is essentially isolated by its Actor ID. Is there any reason that the namespace and app Id can't be included in the Actor ID to get that isolation?

@cgillum
Contributor

cgillum commented Sep 27, 2022

That's certainly an option, though I think it would be cleaner if the actor subsystem could take care of namespace prefixing for us. This is already being done for app IDs based on what I see generated in the state stores for actors, so there shouldn't be any conflict across different (uniquely named) apps.

In the current implementation (that's in active development) the actors simply use the instance IDs of the workflows, which are either user specified or randomly generated. You're right that we could theoretically prepend the namespace to all the actor IDs. However, I'm not yet familiar with namespaces in Dapr and would need to check to see whether we have access to the namespace identifier in all the needed places.

@olitomlinson

That's certainly an option, though I think it would be cleaner if the actor subsystem could take care of namespace prefixing for us.

It definitely would be cleaner, you're right there.

I only raise this because we can't use Dapr Actors directly in our product due to the inability to have namespace isolation of Actors. I would hate for Dapr Workflows to fall into the same trap and, once again, for my team to be unable to access another really valuable programming model.

@cgillum
Contributor

cgillum commented Oct 8, 2022

JFYI, first PR into the feature/workflows branch has been merged: #5301.

It introduces an internal actor concept which is used by the new durable task-based workflow engine. More PRs will be published over the next few weeks that flesh out the full workflow engine feature set.

@amulyavarote
Contributor

Linking dotnet-sdk proposal for dapr workflows:
#5314

@ricardozanini

Friends, if you're interested in the Serverless Workflow specification for this effort, please let us know! We are on the CNCF Slack, in the #serverless-workflow channel.

@olitomlinson

@msfussell

FYI: the search term "dapr workflows" in Google will probably return less favourable results when 1.10 ships - I suspect it would be desirable for it to return the new Workflow building block documentation rather than the sandbox project?

(screenshot: Google search results for "dapr workflows")

@walterlife

Friends, at present I have implemented a Go runtime based on the Serverless Workflow spec, and I have contributed it to the ASF (it will have a separate code repository under Apache). I would also like to talk with you if there is any interest in cooperating.

@dapr-bot
Collaborator

dapr-bot commented Apr 4, 2023

This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

dapr-bot added the stale label on Apr 4, 2023
yaron2 closed this as completed on Apr 4, 2023