Proposal for loops #495

Open
tetron opened this issue Aug 7, 2017 · 20 comments

tetron commented Aug 7, 2017

Design sketch for loops.

The distinction between loops and scatter is that loops are explicitly sequential, whereas scatter is a parallel operation.

A looping step consists of a while condition and a loop specification. The loop specification describes how to update the input object for the next iteration. The output of the step defaults to the output of the last iteration, but could also be constructed using the result field described in #494. (Note from @mr-c: the result field was dropped from how conditionals were implemented in CWL v1.2.)

  step3:
    in:
      a: a
    out: [out]
    while: $(self.a < 5)
    run: blah.cwl
    loop:
      a: $(self.a + 1)
mr-c added this to the v1.1 milestone Jan 9, 2018
tetron removed this from the v1.1 milestone Jan 16, 2018
tetron changed the title from "Proposal for loops in v1.1" to "Proposal for loops" Jan 16, 2018
tetron added this to the post v1.1 milestone Jan 16, 2018
ghost mentioned this issue Nov 1, 2018

mr-c commented Mar 10, 2020

lukasheinrich commented Mar 10, 2020

mr-c commented Mar 10, 2020

Thanks @lukasheinrich, which bit is the loop? Is it "do the same thing to each of these inputs" or is it "run this part until a condition is met"?

@Karel-van-de-Plassche

We are looking at workflow engines for a possible project. This would be a prerequisite for our use case (e.g. time-loops, optimization, etc.).

@DaanVanVugt @nielsdrost

mr-c commented Nov 2, 2020

Hello @Karel-van-de-Plassche ! Do you need loops within the workflow (repeat these steps, folding their outputs into their inputs) or loops over entire workflow (run this simulation workflow until the critical threshold is met)?

mr-c commented Nov 2, 2020

To @lukasheinrich @tiborsimko @clelange @Karel-van-de-Plassche @DaanVanVugt @nielsdrost : the point of my questions is to better understand both where loops are desired and what types of operations between iterations are needed.

I can imagine many different types of scenarios, but we should design for specific and not theoretical needs. If incrementing a counter is sufficient, then @tetron's design is good enough. If the outputs of one round of the loop need to become inputs, perhaps with a complicated transformation, then I'm not sure the current design is sufficient.

There could be other considerations based on real, specific use cases. But we don't have those use cases to study, which makes it difficult to sketch how these loops would work.

However, do not despair! The loop construct would be a small percentage of a typical scientific/research workflow. The easiest way to advance this proposal would be to implement @tetron's design as an extension to CWL v1.2 in a fork of the CWL reference runner. This would give us a lot of experience in what works and what does not.

That extension could look like:

  step3:
    in:
      a: a
    out: [out]
    run: blah.cwl
    requirements:
      cwltool:Loop:
        while: $(self.a < 5)
        loop:
          a: $(self.a + 1)

I (@mr-c) and others would help you implement and test such an extension, but I can't do it alone :-)

cwl-bot commented Apr 18, 2021

This issue has been mentioned on Common Workflow Language Discourse. There might be relevant details there:

https://cwl.discourse.group/t/while-style-reccurent-step-feature-request/349/2

mr-c commented Mar 10, 2022

An open question is how to incorporate the results of the previous iteration in the next round.

One approach using the current proposal: the references to the previous iteration's results could occur in the loop CWL Expression using a new out or result object.

cwl-bot commented Mar 10, 2022

This issue has been mentioned on Common Workflow Language Discourse. There might be relevant details there:

https://cwl.discourse.group/t/access-envdef-of-workflow-in-cwltool/572/10

tetron commented Mar 10, 2022

That's what the loop field is intended to do: it defines the input object of the next iteration, with the ability to reference the input and output objects of the last iteration.

I realize that's not totally clear in the original proposal, so here's a revised one. This feeds the value of "out" back into "b":

  step3:
    in:
      a: a
      b: b
    while: $(inputs.a < 5)
    run: blah.cwl
    out: [out]
    loop:
      a: $(inputs.a + 1)
      b: $(outputs.out)
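
A minimal Python sketch of the semantics described above may help make the data flow concrete. It is not cwltool code: `iterate` and `blah` are hypothetical names, `blah` is an invented stand-in for blah.cwl that returns `a + b`, and carrying over inputs that have no `loop` entry is an assumption.

```python
from typing import Any, Callable, Dict, Iterator, Tuple

Obj = Dict[str, Any]


def iterate(
    step: Callable[[Obj], Obj],
    inputs: Obj,
    cond: Callable[[Obj], bool],
    loop: Dict[str, Callable[[Obj, Obj], Any]],
) -> Iterator[Tuple[Obj, Obj]]:
    """Yield (inputs, outputs) for each iteration, run strictly in sequence."""
    while cond(inputs):  # `while` is checked against the current input object
        outputs = step(inputs)
        yield inputs, outputs
        # `loop` builds the next input object from the last inputs and outputs;
        # inputs without a loop entry are carried over unchanged (assumption).
        inputs = {**inputs, **{k: f(inputs, outputs) for k, f in loop.items()}}


def blah(inputs: Obj) -> Obj:
    """Hypothetical stand-in for blah.cwl: out = a + b."""
    return {"out": inputs["a"] + inputs["b"]}


trace = list(
    iterate(
        blah,
        {"a": 3, "b": 10},
        cond=lambda i: i["a"] < 5,
        loop={"a": lambda i, o: i["a"] + 1, "b": lambda i, o: o["out"]},
    )
)
for step_inputs, step_outputs in trace:
    print(step_inputs, "->", step_outputs)
```

With these inputs the step runs twice: the second iteration receives `b = 13`, the `out` value of the first.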

GlassOfWhiskey commented Mar 10, 2022

I have a slightly more complete proposal that I want to share with you. It comes from both language considerations and real scientific application requirements. First of all, it is necessary to state that here I am only considering iterations with loop-carried dependencies, as loop-independent iterations can be more efficiently expressed with the scatter directive.

Loops as extensions of conditions

The basic propositions that guide this proposal are the following:

  • Proposition 1: A false conditional block is semantically equivalent to an iterative block with zero iterations.
  • Proposition 2: A true conditional block is semantically equivalent to an iterative block with a single iteration.

I think that the loop construct in CWL should preserve these properties, so I propose to use the already existing when condition to express the termination condition. This frees us from having to think about steps with loop+when combinations, and also from (eventually) having to explicitly forbid their combined usage (which could sound a bit artificial). In this setting, we have 3 cases:

  • A loop step with zero iterations will propagate null to all its outputs, according to the false when case;
  • A loop step with a single iteration will simply propagate the outputs of the subworkflow (according to the true when case);
  • A loop step with multiple iterations will propagate its outputs according to what is described below.

Let's examine the first two cases with the previous example (with the while clause substituted by a when clause):

  step3:
    in:
      a: a
      b: b
    when: $(inputs.a < 5)
    run: blah.cwl
    out: [out]
    loop:
      a:
        valueFrom: $(inputs.a + 1)
      b: out

Assume that blah.cwl does echo ${a}. If a = 5 then it will return {out: null}, while if a = 4 it will return {out: 4}.
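
The two cases can be checked with a small Python sketch of the when-driven loop. This is only an illustration under stated assumptions: `run_when_loop` and `echo_a` are hypothetical names, and the initial value of `b` (0) is invented because the example above does not give one.

```python
from typing import Any, Callable, Dict, List

Obj = Dict[str, Any]


def run_when_loop(
    step: Callable[[Obj], Obj],
    out_names: List[str],
    inputs: Obj,
    when: Callable[[Obj], bool],
    loop: Dict[str, Callable[[Obj, Obj], Any]],
) -> Obj:
    """Run `step` while `when` holds and return the outputs of the last run."""
    # Zero iterations behave like a false conditional: null on every output.
    outputs: Obj = {name: None for name in out_names}
    while when(inputs):
        outputs = step(inputs)
        inputs = {**inputs, **{k: f(inputs, outputs) for k, f in loop.items()}}
    return outputs


def echo_a(inputs: Obj) -> Obj:
    """Stand-in for blah.cwl, which echoes a."""
    return {"out": inputs["a"]}


loop = {"a": lambda i, o: i["a"] + 1, "b": lambda i, o: o["out"]}
when = lambda i: i["a"] < 5

skipped = run_when_loop(echo_a, ["out"], {"a": 5, "b": 0}, when, loop)
once = run_when_loop(echo_a, ["out"], {"a": 4, "b": 0}, when, loop)
print(skipped, once)
```

Here `skipped` corresponds to Proposition 1 (zero iterations) and `once` to Proposition 2 (a single iteration).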

Extended when field syntax

First, we need to extend the when directive to deal with step outputs and with multiple step executions. This is the extended version I propose:

The when field controls conditional execution. This is an expression that must be evaluated with inputs bound to the step input object (or individual scatter job) and outputs produced in the last step execution (which evaluate to null before the first execution), and returns a boolean value. It is an error if this expression returns a value other than true or false.

field (required, type): description

  • when (optional, Expression): If defined, only run the step while the expression evaluates to true. If false and no iteration has been performed, the step is skipped. A skipped step produces a null on each output.

loop field syntax

Since the while clause has been removed, we only need to define the syntax of the loop clause. Here I propose something like this:

field (required, type): description

  • loop (optional, array<LoopInput> | map<id, source | LoopInput>): Defines the input parameters of the loop iterations after the first one (inputs of the first iteration are the step input parameters). If no loop rule is specified for a given step in field, the initial value is kept constant among all iterations.

The LoopInput will be shaped as follows. It is basically a reduced version of the WorkflowStepInput structure with the possibility to include outputs of the previous step execution in the valueFrom Expression.

field (required, type): description

  • id (optional, string): It must reference the id of one of the elements in the in field of the step.
  • source (optional, string): Specifies one or more of the step output parameters that will provide input to the loop iterations after the first one (inputs of the first iteration are the step input parameters).
  • linkMerge (optional, LinkMergeMethod): The method to use to merge multiple inbound links into a single array. If not specified, the default method is "merge_nested".
  • pickValue (optional, PickValueMethod): The method to use to choose non-null elements among multiple sources.
  • valueFrom (optional, string | Expression): To use valueFrom, StepInputExpressionRequirement must be specified in the workflow or workflow step requirements. If valueFrom is a constant string value, use this as the value for this input parameter. If valueFrom is a parameter reference or expression, it must be evaluated to yield the actual value to be assigned to the loop input field. The self value in the parameter reference or expression must be null if there is no source field, or the value of the parameter(s) specified in the source field. The value of inputs in the parameter reference or expression must be the input object to the last iteration of the workflow step after assigning the source values. The value of outputs in the parameter reference or expression must be the outputs of the last step execution.
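
As a rough illustration of these evaluation rules, the following Python sketch computes the next input object from a loop specification. It is an assumption-laden model, not the reference runner's code: `next_inputs` is a hypothetical name, expressions are modelled as Python callables taking `(self, inputs, outputs)`, only a single `source` is handled, and linkMerge and pickValue are left out.

```python
from typing import Any, Dict

Obj = Dict[str, Any]


def next_inputs(loop: Dict[str, Any], inputs: Obj, outputs: Obj) -> Obj:
    """Build the input object of the next iteration from a loop specification."""
    # Accept the map shorthand "b: out" as {"source": "out"}.
    specs = {
        name: ({"source": spec} if isinstance(spec, str) else spec)
        for name, spec in loop.items()
    }
    staged = dict(inputs)  # inputs without a loop rule stay constant
    self_values: Dict[str, Any] = {}
    # Pass 1: assign source values; self is null when there is no source.
    for name, spec in specs.items():
        source = spec.get("source")
        self_values[name] = outputs[source] if source is not None else None
        if source is not None:
            staged[name] = self_values[name]
    # Pass 2: evaluate valueFrom against the inputs after source assignment.
    result = dict(staged)
    for name, spec in specs.items():
        if "valueFrom" in spec:
            value_from = spec["valueFrom"]
            result[name] = (
                value_from(self_values[name], staged, outputs)
                if callable(value_from)
                else value_from  # constant value
            )
    return result


loop = {
    "a": {"valueFrom": lambda self, inputs, outputs: inputs["a"] + 1},
    "b": "out",
}
updated = next_inputs(loop, {"a": 1, "b": 0, "c": "keep"}, {"out": 7})
print(updated)
```

The two passes mirror the requirement that inputs, as seen by valueFrom, already reflects the assigned source values.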

Loop output modes

A single way to deal with loop outputs is not enough in my opinion. Therefore, I propose to add a field called outputMethod which behaves similarly to the scatterMethod input. Until now, I have identified three possible values for this field:

symbol: description

  • last: Default. Propagates only the last computed element to the subsequent steps when the loop terminates.
  • all_propagate: Propagates each output value to the subsequent steps after every loop iteration.
  • all_concat: Propagates a single array with all output values to the subsequent steps when the loop terminates.

To simplify things, I suggest having a single outputMethod field for a step instead of specifying a different behaviour for each output element. Note that both last and all_propagate behaviours are transparent when a single iteration is performed (and also when the step is skipped). Conversely, the all_concat option will return arrays with a single value.
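
The difference between last and all_concat can be sketched in a few lines of Python. `gather_outputs` is a hypothetical helper, not part of any CWL implementation; all_propagate is omitted because it streams values to downstream steps rather than returning a final object, and returning empty arrays for a skipped step under all_concat is an assumption, since the text does not define that case.

```python
from typing import Any, Dict, List

Obj = Dict[str, Any]


def gather_outputs(
    iterations: List[Obj], out_names: List[str], output_method: str = "last"
) -> Obj:
    """Combine the per-iteration output objects of one loop step."""
    if output_method == "last":
        if not iterations:
            # Skipped step: null on each output.
            return {name: None for name in out_names}
        return dict(iterations[-1])
    if output_method == "all_concat":
        # One array per output; empty arrays for a skipped step (assumption).
        return {name: [o[name] for o in iterations] for name in out_names}
    raise ValueError(f"unsupported outputMethod: {output_method}")


runs = [{"value": 1}, {"value": 2}, {"value": 3}]
print(gather_outputs(runs, ["value"], "last"))
print(gather_outputs(runs, ["value"], "all_concat"))
print(gather_outputs(runs[:1], ["value"], "all_concat"))
```

The last call shows the single-iteration case mentioned above: all_concat still wraps the one value in an array.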

Here I present a concrete example for each output behaviour:

last output mode

This is the most recurrent behaviour and it is typical of the optimization processes, when I want to iterate until I reach a given precision of my estimate. For example:

  optimization:
    in:
      a: a
      threshold: threshold
    when: $(outputs.error > inputs.threshold)
    run: optimize.cwl
    out: [value, error]
    loop:
      a: value
    outputMethod: last

This loop keeps optimizing the initial a value until the error value falls below a given (constant) threshold. Then, the last values of value and error will be propagated.

all_propagate output mode

This behaviour is needed for example when a recurrent simulation produces loop-carried results, but each result can be processed independently by the rest of the workflow. For example:

  simulation:
    in:
      a: a
      day:
        valueFrom: 0
      max_day: max_day
    when: $(outputs.day < inputs.max_day)
    run: optimize.cwl
    out: [value]
    loop:
      a: value
      day:
        valueFrom: $(inputs.day + 1)
    outputMethod: all_propagate

In this case, subsequent steps can start processing outputs even before the simulation step terminates. Note that the implementation of this second case is a bit more complicated, as the subsequent steps must be notified (under the hood) that the loop has terminated, i.e. that no more input values will arrive.

The all_concat mode

This behaviour is needed when a recurrent simulation produces loop-carried results, but the subsequent steps need to know the total amount of computed values to proceed. The example is very similar to the previous one, so it is not reported.

The Arbitrary Cycles pattern

This proposal is general enough to implement the Arbitrary Cycles pattern in CWL. Here is an example of a full Workflow implementing the reference example.

cwlVersion: v1.3
class: Workflow
inputs:
  i1: Any
outputs:
  o1:
    type: Any
    outputSource: subworkflow/o1
steps:
  A:
    run: A.cwl
    in:
      in1: i1
    out: [p1, p2]
  B:
    run: B.cwl
    in:
      p1: A/p1
    out: [p3]
  C:
    run: C.cwl
    in:
      p2: A/p2
    out: [p4]
  subworkflow:
    when: $(outputs.o1 !== null)
    run:
      class: Workflow
      inputs:
        p3: Any
        p4: Any
      outputs:
        o1:
          type: Any
          outputSource: E/o1
        p3:
          type: Any
          outputSource: F/p3
      steps:
        D:
          run: D.cwl
          in:
            p3: p3
          out: [p4]
        E:
          run: E.cwl
          in:
            p4:
              source:
              - p4
              - D/p4
              pickValue: the_only_non_null
          out: [o1, p5]  
        F:
          run: F.cwl
          in:
            p5: E/p5
          out: [p3]
    in:
      p3: B/p3
      p4: C/p4
    out: [o1, p3]
    loop:
      p3: p3
    outputMethod: last

tetron commented Mar 11, 2022

Thank you for this very thoughtful proposal!

use when for loops

I agree with this

LoopInput based on WorkflowStepInput

The version I proposed was basically only valueFrom but I see value in making it behave like the inputs section. One detail: you probably want to qualify the source with the step name (the same as if you had a 2nd step consuming the output).

  optimization:
    in:
      a: a
      threshold: threshold
    when: $(outputs.error > inputs.threshold)
    run: optimize.cwl
    out: [value, error]
    loop:
      a: optimization/value
    outputMethod: last

last and all_concat output mode

These both make sense. all_concat would be like a serial version of scatter.

all_propagate

This I have concerns about. It has significant implications for the current CWL model of computation. There's certainly value in being able to start computation of downstream steps before all iterations/instances of a loop or scatter step have completed, but I think we should try to address that as a separate proposal.

@GlassOfWhiskey

A few comments here:

Removing the all_propagate output mode

This is not a big problem, as the all_propagate is basically an optimization path for all_concat -> scatter. We can then allow only two values for the outputMethod field: the last value, which behaves as described above, and the all value, which behaves as all_concat. Then the possibility to apply optimization patterns to the all -> scatter combination is left to the WMS implementors, without forcing the behaviour in the standard.

Qualify the source with the step name

I am a bit scared about that, as the next syntactical step here would be to specify outputs from other steps as loop inputs, while I strongly suggest to allow only outputs of the same step to be used in the loop field (users can still define subworkflows as I did in the Arbitrary Cycles example).

Allowing cross-step loops would be obviously more intuitive for users, but from an implementation point of view it would require huge changes in the way CWL processes dependencies, much more than the all_propagate case (I know because I tried several times to "simulate" loops in the current CWL implementation by injecting circular dependencies between steps, but with painful results XD).

GlassOfWhiskey commented Mar 12, 2022

Plus, I would like to start a discussion about two remaining open points in the loop behaviour: default values and scatter behaviour.

The default behaviour

Suppose a user defines a loop like this:

  step3:
    in:
      a: a
      b:
        source: b
        default: 3
    when: $(inputs.a < 5)
    run: blah.cwl
    out: [out]
    loop:
      a:
        valueFrom: $(inputs.a + 1)
      b: out

Suppose now that in a given iteration the out value is null. We have two options here:

  • Pass 3 to the next loop iteration, applying the default value to the output of the previous iteration;
  • Pass null to the next loop iteration, applying the default value only to the first iteration of the loop;

In the second case, we will probably have to allow something like this to explicitly replicate the first behaviour:

  step3:
    in:
      a: a
      b:
        source: b
        default: 3
    when: $(inputs.a < 5)
    run: blah.cwl
    out: [out]
    loop:
      a:
        valueFrom: $(inputs.a + 1)
      b:
        source: out
        default: 3

Honestly I don't have specific use cases to prefer one solution to the other. The only thing I can think of is that the second option, with the default keyword added to the LoopInput schema, is more verbose but more flexible, as it can specify both behaviours. Conversely, with the first, more concise syntax I cannot have default values only in the first loop iteration.
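
The two options differ only in which default, if any, is applied when out comes back null. A small Python sketch of that rule for the b input follows; both function names are hypothetical and exist only to illustrate the comparison.

```python
from typing import Any, Optional


def next_b_option1(out: Optional[Any], in_default: Any) -> Any:
    """Option 1: the default from the step's `in` entry is reapplied each round."""
    return in_default if out is None else out


def next_b_option2(out: Optional[Any], loop_default: Optional[Any] = None) -> Any:
    """Option 2: only a default declared on the loop entry itself is applied."""
    return loop_default if out is None else out


print(next_b_option1(None, 3))     # default reapplied to a null output
print(next_b_option2(None))        # null is passed on to the next iteration
print(next_b_option2(None, 3))     # explicit loop default replicates option 1
```

Under option 2, leaving out the loop default passes null through, while adding it reproduces option 1, which is the extra flexibility described above.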

The scatter behaviour

The loop + scatter combination is maybe the most difficult scenario to deal with.

Case 1: scatter over input variables

The first case is when the scatter directive does not involve variables coming from the step outputs. The idea here is to first scatter, then apply a loop on any combination of the input variables, and then gather on the loop outputs, without applying the scatter over the intermediate values. This means that the same variable will have different types in the in and loop contexts, but the latter type can be automatically inferred from the scatter scheme so this should not undermine the static validation of the workflow.

Here is an example with a scatter over a single variable, but the same reasoning applies whenever none of the involved variables comes from the step outputs.

  step3:
    in:
      a: a
      b: b
    when: $(inputs.a < 5)
    run: blah.cwl
    out: [out]
    scatter: a
    loop:
      a:
        valueFrom: $(inputs.a + 1)
      b: out

This is probably the most straightforward case. The when clause must be evaluated after the scatter operation, so if we have a = [0, 1, 2] we will start three different loops:

  • One iterating over a = [0, 1, 2, 3, 4]
  • One iterating over a = [1, 2, 3, 4]
  • One iterating over a = [2, 3, 4]

If the outputMethod is last we will have {out: [4, 4, 4]}, while if the outputMethod is all we will have {out: [[0, 1, 2, 3, 4], [1, 2, 3, 4], [2, 3, 4]]}.
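
These results can be reproduced with a short Python sketch that scatters first, runs one independent loop per scattered value, and gathers afterwards. The function names are hypothetical, and blah.cwl is again modelled as a step that echoes a.

```python
from typing import Dict, List


def loop_job(a: int) -> List[int]:
    """One scatter-generated job: loop while a < 5, collecting `out` each round."""
    outs = []
    while a < 5:       # when: $(inputs.a < 5), evaluated after the scatter
        outs.append(a)  # blah.cwl is assumed to echo a
        a += 1          # loop: a = inputs.a + 1
    return outs


def scatter_then_loop(a_values: List[int], output_method: str) -> Dict[str, list]:
    """Scatter over a, loop each job independently, then gather the results."""
    gathered = []
    for a in a_values:
        outs = loop_job(a)
        if output_method == "last":
            gathered.append(outs[-1] if outs else None)
        else:  # "all"
            gathered.append(outs)
    return {"out": gathered}


print(scatter_then_loop([0, 1, 2], "last"))
print(scatter_then_loop([0, 1, 2], "all"))
```

The jobs run sequentially here for simplicity; since they are independent of each other, an implementation could run them in parallel.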

The most delicate part here is that the a input will have type int[] in the in schema, while it will have type int in the loop schema. The type can be easily inferred by removing one level of array, but it is something that should be taken into account.

Case 2: scatter over output variables

When the scatter involves at least one variable coming from step outputs, things are less straightforward. Consider this example:

  step3:
    in:
      a: a
      b: b
    when: $(inputs.a < 5)
    run: blah.cwl
    out: [out]
    scatter: b
    loop:
      a:
        valueFrom: $(inputs.a + 1)
      b: out

Assume the initial value of b is again an array [0, 1, 2]. If we scatter only the first time, to be consistent with the previous behaviour, everything works as long as out is coherent with the b type in the loop context, i.e. if out is a single int value. If out is of type int[], we cannot scatter over it. Nevertheless, in this case the b type will change at every loop iteration, so it would be very difficult to support this pattern (the only allowed type for the output should be Any, without the possibility to perform static validation).

Another important reason to scatter only the first time is to avoid deadlocks when scattering with a cartesian scatterMethod involving one of the loop outputs. If the strategy was to scatter every time, this would basically add a new input value after each loop iteration, causing deadlock.

Finally, this behaviour is in line with the when evaluation, which also wants the scatter to be evaluated at the beginning. Still, there could be some reasons to scatter every time that I did not consider, so I am open to comments in that direction.

IMPORTANT: I think that addressing these two issues could be the last step prior to attempting a basic implementation of the feature.

mr-c commented Mar 12, 2022

With regards to default, I think it is simpler to only apply it once, at the beginning. I like your proposal that the default field in the LoopInput can be used to handle other desired behaviors. I don't mind the verbosity as I feel that this is likely an uncommon situation. If a sub-workflow is involved then the default can be specified elsewhere.

With regards to scatter + loop, I would agree that scattering occurs strictly before loop processing and the implicit gathering of the results occurs after outputMethod processing.

If an inner scatter is needed then that can be done in a subworkflow; and other data shaping can be done using subsequent ExpressionTools.

In other words:

For a step with both scatter and loop, an implementation would

  1. First create jobs for the results of the scatter operation, modifying the input object in the normal scatter way
  2. Then execute each of these scatter-generated jobs, obeying the loop and when directives; and finally the outputMethod directive.
  3. Then gather the results of the scatter-generated jobs, just like a non-loop involved scatter

Of course, an implementation is welcome to optimize execution by making elements available to down-stream steps (especially steps that themselves are scattered) even before all the scatter-generated jobs are themselves finished.

Your query about scatter brings up a good point, what are the type rules for outputs, inputs, and the results of valueFrom in the LoopInput?

mr-c commented Mar 12, 2022

The Arbitrary Cycles pattern

Figure 13: Arbitrary cycles pattern

To achieve the concept of more than one entry point, shouldn't there be additional conditionals? Step B should be skipped if p1 is null; Step C should be skipped if p2 is null; subworkflow/D should be skipped if p3 is null.

For another illustration, download http://www.workflowpatterns.com/patterns/control/images/cp10_flash.swf and play it using https://ruffle.rs/demo/

tetron commented Mar 14, 2022

Default behavior

My vote is for the 2nd case. This follows naturally from having LoopInput be a subset of WorkflowStepInput. Then it is clear in is only evaluated once.

Combining scatter and loop

I would strongly prefer to not allow these in the same step. It seems like an unusual case, and if someone does need to do it, they can use a subworkflow to achieve the correct nesting. We can enforce this in schema by having disjoint ScatterWorkflowStep and LoopWorkflowStep types.

all_propagate

I have an idea for a "Channel" type that would make it possible to support various concurrency cases, including this one. I'll write something up in another ticket.

Qualify the source with the step name

As @mr-c mentioned in chat, this mainly has to do with how schema salad manages relative references within the document: the source field is at the workflow-level scope, so you have to refer to step outputs qualified by the step name to have the cross references generated correctly.

mr-c added a commit to common-workflow-language/cwltool that referenced this issue May 25, 2022
mr-c added a commit to common-workflow-language/cwltool that referenced this issue Jun 8, 2022

cwl-bot commented Jun 9, 2022

This issue has been mentioned on Common Workflow Language Discourse. There might be relevant details there:

https://cwl.discourse.group/t/loop-requirement-implementation/611/1

cwl-bot commented Jun 14, 2022

This issue has been mentioned on Common Workflow Language Discourse. There might be relevant details there:

https://cwl.discourse.group/t/loop-requirement-implementation/611/3

GlassOfWhiskey added a commit to common-workflow-language/cwltool that referenced this issue Sep 15, 2022
GlassOfWhiskey pushed a commit to common-workflow-language/cwltool that referenced this issue Sep 15, 2022
GlassOfWhiskey pushed a commit to common-workflow-language/cwltool that referenced this issue Sep 26, 2022
GlassOfWhiskey added a commit to common-workflow-language/cwltool that referenced this issue Sep 26, 2022
mr-c added a commit to common-workflow-language/cwltool that referenced this issue Sep 29, 2022
mr-c pushed a commit to common-workflow-language/cwltool that referenced this issue Sep 29, 2022
GlassOfWhiskey pushed a commit to common-workflow-language/cwltool that referenced this issue Sep 29, 2022
GlassOfWhiskey added a commit to common-workflow-language/cwltool that referenced this issue Sep 29, 2022
GlassOfWhiskey pushed a commit to common-workflow-language/cwltool that referenced this issue Oct 6, 2022
GlassOfWhiskey added a commit to common-workflow-language/cwltool that referenced this issue Oct 6, 2022
tetron pushed a commit to common-workflow-language/cwltool that referenced this issue Oct 8, 2022
Loop construct prototype implemented as an extension, with tests

Based upon @GlassOfWhiskey 's work in common-workflow-language/common-workflow-language#495 (comment)
With comments from @tetron @mr-c

Co-authored-by: GlassOfWhiskey <iacopo.c92@gmail.com>

mr-c commented Feb 13, 2023

2023 update: the loops extension has been implemented in cwltool since version 3.1.20221008225030

https://cwltool.readthedocs.io/en/latest/loop.html

The proposal for adding loops as a built-in construct for a future version of CWL (v1.3) is at common-workflow-language/cwl-v1.3#5
