Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[workflow] Spec contents that are difficult to understand #140

Closed
manuelstein opened this issue Jan 8, 2020 · 34 comments
Closed

[workflow] Spec contents that are difficult to understand #140

manuelstein opened this issue Jan 8, 2020 · 34 comments

Comments

@manuelstein
Copy link
Contributor

I'm having multiple issues with the spec, but maybe that's just me, so this is an attempt get it sorted out and I'd be happy if you could help me.

Why can Delay state types loop?
Why does Delay state type have error handling when there's no action attached (e.g. Relay state type can't)?
How can the Switch state type loop when its choices might have transitioned the workflow to another state already?

If my understanding of the correlationToken is correct, then a workflow can receive multiple events through separate sources with the same token, meaning that they all belong to the same workflow activation. IIUC, the first time a token with content "session1" appears, the workflow would start at the startsAt state and subsequent events with the same token "session1" can be consumed by an Event-type state if there is any.

  • What if the workflow consuming "session1" tokens has not yet reached the Event-type state that the event is targeted at? (would they need queuing?)
  • Why does an Event state contain more than one event definition? This could be a Switch state evaluation that has multiple choices for the received event
  • Why is there a list of actions plus an actionMode that defines if the actions are run in parallel when there is also the Parallel state type?
  • What does it mean if an Event-type state loops - would it transition to another state as defined in its EventDef or consume more events?
  • What if I have parallel branches waiting for external events? Do they consume events or would non-matching events be put back in the queue?

I also noticed the Event state type condition is a string, the Switch state type choices are JSON encoded logic, but transitions and error handling use a separate expression language.

I'd suggest to

  • reduce the Event state to just wait for an external event and transitions
  • unify the conditions
  • consider replacing Switch with just a set of conditional transitions
  • replace error handling with just evaluating whatever the action output (i.e. not having the engine detect if the result of an action is a stack trace due to aborted execution or a valid result)
  • remove loop from attributes and make it its own construct, maybe merge it with subflow

The step-by-step execution is nice, but maybe the control logic is really independent from triggering functions, e.g. something that allows a more flexible use of control elements - like sleeps, loops, conditions and outside interaction (event send/receive).

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

"Why can Delay state types loop?" -- logic could ask this state to repeat/loop for a certain data input. For example "delay 5 seconds for each open order"

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

"Why does Delay state type have error handling when there's no action attached (e.g. Relay state type can't)?" -- In order to do a delay implementations will have service to keep track of execution/run time, do the actual timer delay operations etc. This services could raise runtime exceptions possibly which might have to be handled.
Relay state just filters states data / injects into input....thinking about it maybe it should have "onError" as well for same reasons :)

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

"How can the Switch state type loop when its choices might have transitioned the workflow to another state already?" -- good point! will update asap

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

"Why does an Event state contain more than one event definition? This could be a Switch state evaluation that has multiple choices for the received event" - you could if you wanted use switch and delay and operation states to do what event state currently does all in one ..but why would you :) There are always multiple ways of doing things,I do see however that the property "events" is confusing ....prob should look into examples to see how to make this better. wdyt?

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

"Why is there a list of actions plus an actionMode that defines if the actions are run in parallel when there is also the Parallel state type?" -- goes back to previous comment about event state,. Parallel state executes a number of states/tasks sequentially or in parallel, for event sate whats being executed are "events" that contain actions. Again you could use switch state with parallel state and operation state i f you wanted. Does not make it wrong either way.

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

"What does it mean if an Event-type state loops - would it transition to another state as defined in its EventDef or consume more events?" -- This will be more clear when we introduce internal vs external transitions. Yes currently it is confusing.

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

"What if I have parallel branches waiting for external events? Do they consume events or would non-matching events be put back in the queue?" -- dont fully understand the question but I would assume this depends on the timeout property of the "event" in event state as well as particular implementation logic for that scenario. Will definitely look into this some more.

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

Suggestion- "reduce the Event state to just wait for an external event and transitions" - this suggestion mas made iirc before so definitely we should look into it. I think there are couple of things to consider:

  • event data vs state data..keep them separate is a good thing (current event state clearly separates the two -- see information passing section)
  • amount of JSON to define the same thing. One "sexy" part of this specification is that not only is it kinda human readable (compared to bpmn2 hehe) but also very light-weight. By doing this you are forced to add more state definitions ..and it is not bad for some states to have overlapping functionality really.

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

Suggestion - "unify the conditions" - definitely. jsonpath for filters, expression language for conditions would be good imo

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

Suggestion - "consider replacing Switch with just a set of conditional transitions" - I looked into that and yes you can do it, however the "choices" make things alot more readable at a low cost. Especially the "or" or "and" ones.

@cathyhongzhang
Copy link
Collaborator

"Why can Delay state types loop?" -- logic could ask this state to repeat/loop for a certain data input. For example "delay 5 seconds for each open order"

I just found out that "loop" is required for each state. Should it really be required? I think this field should be optional.

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

Suggestion - "replace error handling with just evaluating whatever the action output (i.e. not having the engine detect if the result of an action is a stack trace due to aborted execution or a valid result)" - this suggest there is an output :) Handling runtime exception imo is a must-have to be able to properly define business logic in some cases. Comparing stack traces that can change over time is not a good "maintenance" option imo either.

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

@cathyhongzhang that is a typo in the documentation...im already on it . look at https://github.com/cncf/wg-serverless/blob/master/workflow/spec/schema/serverless-workflow-schema-01.json .. "loop" is not defined in any of the required definitions

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

Suggestion - "remove loop from attributes and make it its own construct, maybe merge it with subflow" - this is imo (pls dont be mad :) ) not a good option. Look at aws or conductor they have things like the "map state" or "foreach state" .. it looks ugly and is truly non functional. What we did enhances imo what they have ...just need to make sure we dont add it to things that should not loop :)

** Update ** after discussion it seems that we do want a separate state for looping - ok I will work on that then :)

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

@manuelstein tried to comment on items in your post best I could :) I'm sure others will have their comments as well. Hope it helped and thanks again for raising this issue.

@cathyhongzhang
Copy link
Collaborator

"Why is there a list of actions plus an actionMode that defines if the actions are run in parallel when there is also the Parallel state type?" -- goes back to previous comment about event state,. Parallel state executes a number of states/tasks sequentially or in parallel, for event sate whats being executed are "events" that contain actions. Again you could use switch state with parallel state and operation state i f you wanted. Does not make it wrong either way.

Like to add on Tihomir's reply. Parallel state is mainly used for applications that involves embedded sub-workflow while "parallel actions" just specifies the parallel execution of functions (not states).

@cathyhongzhang
Copy link
Collaborator

Suggestion - "remove loop from attributes and make it its own construct, maybe merge it with subflow" - this is imo (pls dont be mad :) ) not a good option. Look at aws or conductor they have things like the "map state" or "loop state" .. it looks ugly and is truly non functional. What we did enhances imo what they have ...just need to make sure we dont add it to things that should not loop

It seems "loop" causes quite some confusion. I also find it confused. Although it could be used in some scenarios, I don't feel it is a key attribute that will be used a lot. Adding it to the state construct makes each state look kind of complicated and hard to understand. When people feel a spec is complicated and has a lot to interpret and figure out which one is used for what, people tend not to adopt it. At this stage, it is probably better to minimize optional attributes to make it simple and easy to understand. When adoption picks up, and we find a compelling reason to add a new attribute, we can add it. So I think maybe it is a good idea to move it out of the state construct.

@cathyhongzhang
Copy link
Collaborator

If my understanding of the correlationToken is correct, then a workflow can receive multiple events through separate sources with the same token, meaning that they all belong to the same workflow activation. IIUC, the first time a token with content "session1" appears, the workflow would start at the startsAt state and subsequent events with the same token "session1" can be consumed by an Event-type state if there is any.
What if the workflow consuming "session1" tokens has not yet reached the Event-type state that the event is targeted at? (would they need queuing?)
Cathy> Good question. Yes they can be queued. This is an implementation/design decision of the backend workflow engine.

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

@cathyhongzhang Looping imo is a key workflow construct for this spec just like branching event handling, transitions, parallel execution, etc. I would be ok with separating looping into its own state, like the map or foreach state others have if that would make it less confusing but taking it out imo is a bad idea. wdyt?

@cathyhongzhang
Copy link
Collaborator

@tsurdilo I am not saying we should remove it from the spec (sorry that I confuse you). Separating looping into its own state or using another way to move it out of the various state definition will be less confusing and simpler. Thank you for taking care of this!

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

@cathyhongzhang its nice we have several improvements already to do from this discussion. If you don't mind could we just first look at the open prs (especially #138 if you don't mind) because it would really save me alot of time not to have to rebase the fixes that are being worked on identified here. thanks!

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

@cathyhongzhang also suggestions for the "loop state" name are more than welcome

@cathyhongzhang
Copy link
Collaborator

Suggestion- "reduce the Event state to just wait for an external event and transitions" - this suggestion mas made iirc before so definitely we should look into it. I think there are couple of things to consider:

  • event data vs state data..keep them separate is a good thing (current event state clearly separates the two -- see information passing section)
  • amount of JSON to define the same thing. One "sexy" part of this specification is that not only is it kinda human readable (compared to bpmn2 hehe) but also very light-weight. By doing this you are forced to add more state definitions ..and it is not bad for some states to have overlapping functionality really.

To add on Tihomir's point, I think having functions associated with event states makes it compact. The only difference between an event state and an operation state is that execution of functions in the event state are triggered by events(i.e. when workflow transits to this state, it needs to wait there for external events before invoking function execution) while execution of functions in operation state are not triggered by external events (i.e. when workflow transits to this state, it does not need to wait there for external events before invoking function execution).

@cathyhongzhang
Copy link
Collaborator

@tsurdilo will try to do the review of PR as quick as possible

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

@cathyhongzhang i have added the fix for "loop" parameter not to be required as part of existing PR #138. Will work on a separate pr for looping state.

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 8, 2020

Regarding "I also noticed the Event state type condition is a string" -- this was fixed via b01dbd8

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 9, 2020

Regarding looping - this is being addressed via #143. Please take a look and review. thanks.

@tsurdilo
Copy link
Collaborator

tsurdilo commented Jan 16, 2020

@manuelstein any comments/suggestions after all the comments here?

@manuelstein
Copy link
Contributor Author

Yes, first of all thank you for the many replies! Still have to go through all of them.

Quick one - questions on the Delay state:

"Why can Delay state types loop?" -- logic could ask this state to repeat/loop for a certain data input. For example "delay 5 seconds for each open order"
I don't see what you mean. If the open orders are a set of events that each traverse the delay state, then they would each be held up by or up to a certain time (ISO 8601 format). If it's a single event that contains several orders, I don't see how to configure the Delay-type state with a for-each-element logic. It's really just causing a delay "by" or "up to" a certain time with no operation attachable, so the output will always remain the same. But I think this is superfluous now if looping gets its own state.

"Why does Delay state type have error handling when there's no action attached (e.g. Relay state type can't)?" -- In order to do a delay implementations will have service to keep track of execution/run time, do the actual timer delay operations etc. This services could raise runtime exceptions possibly which might have to be handled.
Ok, now I see it, but this means there's an unknown/invisible implementation performing the delay and the WF developer wouldn't know what its output could possibly be. I could imagine that the wallclock time to wait for may have already passed, so a negative wait time might want separate handling - so I wonder, should the Delay-type state specify an actual output (time waited) and possibly some well-known errors (e.g. waiting time too long when time > 100 years)? Would it be possible to specify different preferred implementations (e.g. that provide day or microsecond accuracy)? Or, the spec could simply demand second accuracy, bc in a FaaS environment, the cold-start could kill anything more accurate.

@tsurdilo
Copy link
Collaborator

@manuelstein

  • yes the ForEach state definition PR is open: New State - ForEach #143. Feel free to look and add your comments/suggestions.

  • good points, I think as its defined now delay state means just an explicit halt of workflow execution and should not be used as a "safe" point in stateful orchestrations (workflow cannot resume into delay state). Actions have a timeout parameter as they might be waiting for serverless function replies. We do not have an explicit "ANY/ALL" keyword defined for catching runtime exceptions in onError, but error definition condition body could just return "true" to match any.

  • Currently the workflow definition has this "execStatus" property which is atm useless and we dont even describe it or know how its used :) I think what you mention that states themselves can like self-check would be pretty big! Would you mind looking into that and tell me how you would see that ?

@tsurdilo
Copy link
Collaborator

@manuelstein just wanted to add quickly while thinking about your comments and this self-checking..netflix conductor has this feature for states/tasks data input/output.
There you can define expected data inputs and expected data outputs. If for example
and operation state knows its going to call a "complete order" function, and in its data input json there is no "order" element, whats the point of even attempting the call....
just giving you ideas for maybe a contribution ;)

@salaboy
Copy link
Contributor

salaboy commented Jan 23, 2020

@tsurdilo @manuelstein great thread of discussions and I can see that @tsurdilo already implemented a lot of changes...
What about closing this issue and doing another round of more scoped issues/concerns/questions? So we can keep the threads "short" :)

@tsurdilo
Copy link
Collaborator

@salaboy I think we can close if @manuelstein says his questions have been answered.

@manuelstein
Copy link
Contributor Author

Oops, thought I had replied. @salaboy yes +1, I can see that each of the points raised could have been a short issue on its own. @tsurdilo we can close this to have more scoped issues on the remaining bits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants