# Step definition

SoS steps are the building blocks of SoS workflows. Although the input and output can be more general, each step typically has **input**, **output**, and **dependent** files, and executes a **step process** that consists of one or more Python statements and SoS actions (special Python functions). Part or all of the step process, called **tasks**, can be executed and monitored externally.

![sos_step](../media/sos_step.png)


## Step variables

SoS defines multiple variables during the execution of a step. The first type of variables provides information about the step, which includes

* **`step_name`**: name of the step
* **`step_id`**: Hash ID of the step, which is determined by the content of the step
* **`workflow_id`**: Hash ID of the workflow in which the step is defined. It would be the ID of the nested workflow if the workflow is nested.
* **`master_id`**: Hash ID of the entire workflow, regardless of whether the step is defined in a nested workflow.

In the following example, the two steps have different step names, step IDs, and workflow IDs because `nested` is a nested workflow, but they share the same `master_id`, which equals the `workflow_id` of the outermost workflow.

In [1]:
%run
[nested]
print(f'Workflow {workflow_id}: step name={step_name}')
print(f'Workflow {workflow_id}: step id={step_id}')
print(f'Workflow {workflow_id}: workflow id={workflow_id}')
print(f'Workflow {workflow_id}: master id={master_id}')

[default]
print(f'Workflow {workflow_id}: step name={step_name}')
print(f'Workflow {workflow_id}: step id={step_id}')
print(f'Workflow {workflow_id}: workflow id={workflow_id}')
print(f'Workflow {workflow_id}: master id={master_id}')
sos_run('nested')

Workflow 1663b815c3bed9f2: step name=default_0
Workflow 1663b815c3bed9f2: step id=950015e2ddfc10d4
Workflow 1663b815c3bed9f2: workflow id=1663b815c3bed9f2
Workflow 1663b815c3bed9f2: master id=1663b815c3bed9f2
Workflow fecfb67fda2f92df: step name=nested_0
Workflow fecfb67fda2f92df: step id=74bf81da4f5098aa
Workflow fecfb67fda2f92df: workflow id=fecfb67fda2f92df
Workflow fecfb67fda2f92df: master id=1663b815c3bed9f2


These variables can be useful, for example, to save runtime information, as in the following example:

In [8]:
%preview -n default_10.log
%run 

[10]
with open(step_name + '.log', 'w') as log:
    log.write(f'Step specific log message saved in {step_name}.log')

Step specific log message saved in default_10.log

If a section defines multiple steps, the step variables can be used to define (slightly) different steps according to which step is executing. For example,

In [9]:
%run human

[human_10, mouse_10]
if 'human' in step_name:
   print("I am dealing with human")
else:
   print("I am dealing with mouse")


I am dealing with human


The second type of variables are step input, output, and dependent targets. They are of type `sos_targets` and will be explained later. 

## Step options <a id="Step_options"></a>

**Step options** are specified after the step name and assist in the specification of workflows. SoS provides the following options:

### Option `skip`
 <a id="Option_skip"></a>
Option `skip` takes two formats. The first format has no value

```
[10: skip]
```

and is equivalent to

```
[10: skip=True]
```

The whole step will be skipped as if it is not defined at all in the script. This option provides a quick method to disable a step.

The second format takes a value, which is usually an expression that will be evaluated when the step is executed. For example, step 10 is executed by default

In [1]:
%run
parameter: qc = True
[10 (quality check): skip=not qc]
print(f"{step_name} is executed")

default_10 is executed


but will not be executed if the workflow is executed with option `--no-qc` (for `qc=False`)

In [2]:
%run --no-qc
parameter: qc = True
[10 (quality check): skip=not qc]
print(f"{step_name} is executed")

INFO: quality check is ignored due to skip option.


### Option `sigil`
 <a id="Option_sigil"></a>
Option `sigil` accepts a string for an alternative sigil, or `None` to disable string interpolation in the step. The sigil must be two strings separated by a space, such as `%( )`, `< >`, and `#{ }`. Sigils with equal left and right symbol such as `# #` can be used although they do not support features such as nested interpolation. Please refer to section [SoS Syntax](SoS_Syntax.html) for details of this option.

### Option `shared` <a id="Option_shared"></a>

SoS executes each step in a separate process and by default does not return any result to the master SoS process. Option `shared` is used to share variables between steps. This option accepts:

* A string (variable name),
* A map between variable names and expressions (strings) that will be evaluated upon the completion of the step, or
* A sequence of strings (variables) or maps.

For example,

In [1]:
%run
[10: shared='myvar']
myvar = 100

[20]
print(myvar)

100


A map syntax is recommended to share `step_output` of one step with others, because the variable assignment will be evaluated only after the step is complete:

In [2]:
%sandbox
%run
[1: shared = {'test_output': 'step_output'}]
output: 'a.txt'
sh:
    touch a.txt
[2]
print(f"Input file {test_output}")
input: test_output

Input file a.txt


The values of the map are evaluated as expressions, so you can control more finely which specific output, or variation of output, is shared with other steps. For example:

In [3]:
%sandbox
%run
[1: shared={'test_output_1':'step_output[0]', 'test_output_2': 'step_output[1]'}]
output: 'a.txt', 'b.txt'
sh:
    touch a.txt b.txt
[2]
print(f"output 1: {test_output_1}")
print(f"output 2: {test_output_2}")

output 1: a.txt
output 2: b.txt


This shares individual files in `step_output` (for example, the first file `step_output[0]`) instead of the entire output file list.

The `shared` option also provides a `sos_variable` target.

Things become more complicated when there are multiple substeps. For example, when you use option `shared` on the following step with 10 substeps, only one of the random seeds is returned because `seed` represents the last value of the variable after the completion of all substeps.

In [4]:
%run
[1: shared='seed']
input: for_each={'i': range(10)}
import random
seed = random.randint(0, 1000)


[2]
print(seed)

450


If you would like to see the variable in all substeps, you can prefix the variable name with `step_`

In [5]:
%run
[1: shared='step_seed']
input: for_each={'i': range(10)}
import random
seed = random.randint(0, 1000)


[2]
print(step_seed)

[858, 513, 328, 610, 142, 275, 458, 57, 762, 981]


You can also use the `step_*` variables in expressions, as in the following example:

In [8]:
%run
[1: shared={'summed': 'sum(step_rng)', 'rngs': 'step_rng'}]
input: for_each={'i': range(10)}
import random
rng = random.randint(0, 10)


[2]
print(rngs)
print(summed)

[10, 0, 8, 1, 8, 9, 6, 7, 9, 1]
59
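
The relationship between a per-substep variable, its `step_` list, and the shared expressions can be mimicked in plain Python. The following is only a sketch of the idea, not how SoS implements it: the helper names `run_substeps` and `evaluate_shared` are hypothetical, and a seeded `random.Random` is assumed so that the run is repeatable.

```python
import random


def run_substeps(n_substeps, seed=0):
    """Run a toy step body once per substep and collect each value of `rng`."""
    gen = random.Random(seed)      # local generator so results are repeatable
    step_rng = []                  # plays the role of the `step_rng` list
    rng = None
    for _ in range(n_substeps):
        rng = gen.randint(0, 10)   # the "step process" of one substep
        step_rng.append(rng)
    # `rng` now holds only the value from the last substep
    return rng, step_rng


def evaluate_shared(shared, namespace):
    """Evaluate each shared expression once all substeps have finished."""
    return {name: eval(expr, {}, dict(namespace)) for name, expr in shared.items()}


last, step_rng = run_substeps(10)
result = evaluate_shared(
    {'summed': 'sum(step_rng)', 'rngs': 'step_rng'},
    {'rng': last, 'step_rng': step_rng},
)
print(result['rngs'])
print(result['summed'])
```

Sharing plain `rng` would expose only `last`, whereas the `step_rng` list keeps one value per substep.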


Variables generated by external tasks add another layer of complexity because tasks usually do not share variables with the substep to which they belong. To solve this problem, you have to use the `shared` option of `task` to return the variable to the substep:

In [11]:
%run
[1: shared={'summed': 'sum(step_rng)', 'rngs': 'step_rng'}]
input: for_each={'i': range(10)}

task: shared='rng'
import random
rng = random.randint(0, 10*i)


[2]
print(rngs)
print(summed)


[0, 2, 12, 10, 24, 9, 51, 39, 79, 4]
230


### Option `provides` <a id="Option_provides"></a>

This option lists files or targets that a step generates so that the step can be called if the target is required but does not exist. Steps with option `provides` are called **auxiliary steps** and are building blocks of makefile-style workflows.

Option `provides` accepts

* A filename or file pattern such as `"{sample}.bam.idx"`
* Other types of targets such as `executable("ms")`
* A list (sequence) of one or more file patterns and targets.

A file pattern is a filename with optional patterns with variable names enclosed in `{ }`. SoS matches filenames with the patterns and, if successful, assigns the matched parts of the names to variables. For example,

```
[compress: provides = '{filename}.bam']
```

would be triggered with target `sample_A.bam` and `sample_B.bam`. When the step is triggered by `sample_A.bam`, it defines variable `filename` as `sample_A` and sets the output of the step as `sample_A.bam`. Please check tutorial [`Auxiliary Steps`](../tutorials/Auxiliary_Steps.ipynb) for details.
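
To make the matching rule concrete, here is a minimal Python sketch of how a `{name}` pattern could be matched against a target name. It is an illustration built on regular expressions, not SoS's own matcher; the function name `match_pattern` is hypothetical, and the sketch assumes each variable name appears at most once in a pattern.

```python
import re


def match_pattern(pattern, target):
    """Return the variables extracted from `target`, or None if it does not match."""
    # re.split with a capturing group alternates literal text and variable names
    parts = re.split(r'\{(\w+)\}', pattern)
    regex = ''
    for i, part in enumerate(parts):
        if i % 2:
            regex += f'(?P<{part}>.+)'   # variable name becomes a named group
        else:
            regex += re.escape(part)     # literal text is matched exactly
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None


print(match_pattern('{filename}.bam', 'sample_A.bam'))
print(match_pattern('{filename}.bam', 'sample_A.txt'))
```

A successful match returns a dictionary such as `{'filename': 'sample_A'}`, which corresponds to the variable that the step would see; a failed match returns `None`, meaning the step would not be triggered for that target.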

## Step Input <a id="Step_Input"></a>

Step input is determined by the `input` statement of steps. It essentially creates a `sos_targets` variable named `step_input` with groups. SoS iterates through the groups of `step_input`, assigns each group to variable `_input`, and executes the step process as substeps.

What you will learn from the following section:
1. Input statement is optional. In a process-oriented workflow, the default input is the output of its previous step.
2. Variables `step_input` and `step_output` represent the input and output targets of a step.
3. Sources of targets can be specified with named arguments and be used to differentiate input targets.
4. Function `output_from` can be used to represent complete or partial outputs from other steps.
5. Function `named_output` can be used to specify named output from other steps for use as input of current step.

### Default input of a step

In a process-oriented workflow, **the default input of a step is the output of its previous step**. For example, in the following workflow, the first step (step `10`) produces output `a.txt`, which becomes the input of the second step (step `20`).

In [2]:
%run

[10]
output: 'a.txt'
_output.touch()

[20]
print(f'Input of step {step_name} is {step_input}')

Input of step 20 is a.txt


### Unnamed input files

The easiest way to explicitly specify step input is to list input targets (mostly files) directly in the `input:` statement. For example

In [3]:
# create some sample files
!touch a.txt b.txt c.txt d.txt

In [4]:
input: 'a.txt'
print(f'step_input is {step_input}')

step_input is a.txt


Multiple files can be listed as multiple parameters, sequences (`list`, `tuple` etc), or variables of string or sequence types. For example, you can define a parameter `in_files` of type `paths` (list of `path`) and specify input files from command line:

In [5]:
%run --in-files a.txt b.txt
parameter: in_files = paths
input: in_files

print(f'step_input is {step_input}')

step_input is a.txt b.txt


You can also mix string literals with variable names when listing multiple files:

In [5]:
in_files = ['a.txt', 'b.txt']
input: in_files, 'c.txt', 'd.txt'

print(f'step_input is {step_input}')

step_input is a.txt b.txt c.txt d.txt


Note that user-specified input overrides the default input, so a step with explicit input targets does not inherit output from its previous step.

### Named input files

All input targets of SoS steps have a **source**, which specifies where the target comes from. The `sources` of a `step_input` variable can be used to inspect the source of each target. By default, targets specified directly have the source of the name of the step in which they are specified.

In [6]:
%run
[step_10]
in_files = ['a.txt', 'b.txt']
input: in_files, 'c.txt', 'd.txt'

print(f'step_input is {step_input} with sources {step_input.sources}')

step_input is a.txt b.txt c.txt d.txt with sources ['step_10', 'step_10', 'step_10', 'step_10']


You can specify the sources of inputs with keyword arguments. For example, in the following step, the first two files are given a name `grp1` and the latter two are given a name `grp2`.

In [7]:
input: grp1 = ['a.txt', 'b.txt'], grp2=['c.txt', 'd.txt']

print(f'step_input is {step_input} with sources {step_input.sources}')

step_input is a.txt b.txt c.txt d.txt with sources ['grp1', 'grp1', 'grp2', 'grp2']


The **sources of the targets can be used to partition input targets and refer to them separately**. You can access groups of input files with the syntax `step_input[group_name]`.

In [10]:
!touch ref.txt

input: data = ['a.txt', 'b.txt'], reference='ref.txt'

print(f'Input of step is {step_input} with sources {step_input.sources}')
print(f'Data is {step_input["data"]}')
print(f'Reference is {step_input["reference"]}')

Input of step is a.txt b.txt ref.txt with sources ['data', 'data', 'reference']
Data is a.txt b.txt
Reference is ref.txt
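
The bookkeeping behind sources can be pictured as two parallel lists, one of targets and one of source labels. The sketch below is a simplified Python model under that assumption; `make_input` and `select` are hypothetical helpers and do not exist in SoS, where `sos_targets` handles this internally.

```python
def make_input(step_name, *unnamed, **named):
    """Return (targets, sources) with one source label per target."""
    targets, sources = [], []
    for target in unnamed:
        # targets listed directly are labelled with the name of the step
        targets.append(target)
        sources.append(step_name)
    for name, group in named.items():
        # keyword arguments label every target in the group with the keyword
        group = [group] if isinstance(group, str) else list(group)
        targets.extend(group)
        sources.extend([name] * len(group))
    return targets, sources


def select(targets, sources, name):
    """Pick the targets that carry the given source label."""
    return [t for t, s in zip(targets, sources) if s == name]


targets, sources = make_input('step_10', data=['a.txt', 'b.txt'], reference='ref.txt')
print(sources)
print(select(targets, sources, 'data'))
print(select(targets, sources, 'reference'))
```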


### Function `output_from` <a id="output_from"></a>

The default input of a step is the output from its previous step, but you can use function `output_from` to specify the input of a step as the output from one or more other steps. For example,

In [12]:
%run
[step_10]
output: 'a.txt'
_output.touch()

[step_20]
output: 'b.txt'
print(f'input of step {step_name} is {step_input}')
_output.touch()

[step_30]
input:  output_from(['step_10', 'step_20']), 'c.txt'
print(f'input of step {step_name} is {step_input} with sources {step_input.sources}')
print(f'Output of step_20 is {step_input["step_20"]}')

input of step step_20 is a.txt
input of step step_30 is a.txt b.txt c.txt with sources ['step_10', 'step_20', 'step_30']
Output of step_20 is b.txt


In this example, `output_from(['step_10', 'step_20'])` takes the output of steps `step_10` and `step_20` as input. The `sources` of these inputs are `step_10` and `step_20` respectively. In a process-oriented workflow, `output_from(['step_10', 'step_20'])` can be simplified as `output_from([10, 20])` (integers).

You can override the `sources` of input files with keyword arguments

In [13]:
%run
[step_10]
output: 'a.txt'
print(f'input of step {step_name} is {step_input}')
_output.touch()

[step_20]
output: 'b.txt'
print(f'input of step {step_name} is {step_input}')
_output.touch()

[step_30]
input:  output_from(10), s20=output_from(20), s30='c.txt'
print(f'input of step {step_name} is {step_input} with sources {step_input.sources}')


input of step step_10 is 
input of step step_20 is a.txt
input of step step_30 is a.txt b.txt c.txt with sources ['step_10', 's20', 's30']


As you will learn later, output from other steps can also have its own sources. In this case, the `sources` of the output are carried over.

In [15]:
%run
[step_10]
output: output='out.txt', summary='summary.txt'
_output.touch()

[step_30]
input:  output_from(10), 'c.txt'
print(f'input of step {step_name} is {step_input} with sources {step_input.sources}')

input of step step_30 is out.txt summary.txt c.txt with sources ['output', 'summary', 'step_30']


Now, if you are only interested in the `summary` part of the output of `step_10`, you can use `['summary']` to get a subset of the output from `output_from(10)`:

In [18]:
%run
[step_10]
output: output='out.txt', summary='summary.txt'
_output.touch()

[step_30]
input:  output_from(10)['summary'], 'c.txt'
print(f'input of step {step_name} is {step_input} with sources {step_input.sources}')

input of step step_30 is summary.txt c.txt with sources ['summary', 'step_30']


When you use keyword arguments to specify all or part of the outputs, the `sources` are overridden:

In [22]:
%run
[step_10]
output: a='a.txt', b='b.txt'
_output.touch()

[step_20]
output: c='c.txt', d='d.txt'
_output.touch()

[step_30]
input:  s10=output_from(10), c=output_from(20)["c"]
print(f'input of step {step_name} is {step_input} with sources {step_input.sources}')

input of step step_30 is a.txt b.txt c.txt with sources ['s10', 's10', 'c']


Note that both sources `a` and `b` from `output_from(10)` are overridden by `s10`, so you can no longer differentiate sources `a` and `b` from `output_from(10)`.

### Function `named_output` <a id="named_output"></a>

As will be discussed later, step output can be named so that we can access part of the output using the `['name']` syntax.

In [1]:
%run
[A]
output: a='a.txt', b='b.txt'
_output.touch()

[10]
input: output_from('A')
print(step_input.sources)
print(step_input['a'])

['a', 'b']
a.txt


Although `output_from` can be used to refer to output from any step, it makes the workflow less extensible because renaming or changing step `A` will invalidate `output_from('A')`. If the named output is unique to your workflow, it is more convenient to write

In [2]:
%run
[A]
output: a='a.txt', b='b.txt'
_output.touch()

[10]
input: named_output('a')
print(step_input.sources)
print(step_input['a'])

['a']
a.txt


Here `named_output('a')` refers to any step that produces an output with name `a`. Adding `named_output('a')` creates a dependency on that step so that step `10` will be executed only after `A` is executed (so that `named_output('a')` becomes available).

### Parameter `group_by` and substeps <a id="group_by"></a>

Up to now, all input targets have been processed at once by the step. If you need to process input files one by one or in pairs, you can define **substeps** that apply the step to subgroups of input targets, represented by variable `_input`.

In the trivial case when all input targets are processed together, `_input` is the same as `step_input`.

In [23]:
input: 'a.txt', 'b.txt'
print(f'step input is {step_input}')
print(f'substep input is {_input}')

step input is a.txt b.txt
substep input is a.txt b.txt


Using option `group_by`, you can group the input targets in a number of ways, the easiest being group by `1`:

In [3]:
input: 'a.txt', 'b.txt', group_by=1
print(f'input of step is {step_input}')
print(f'input of substep {_index} is {_input}')

input of step is a.txt b.txt
input of substep 0 is a.txt
input of step is a.txt b.txt
input of substep 1 is b.txt


As you can see, the step process is now executed twice. Whereas the `step_input` is the same for both substeps, `_input` is `a.txt` for the first substep, and `b.txt` for the second substep. Here we used an internal variable `_index` to show the index of the substep.

SoS allows you to group input in a number of ways:

| option | group by |
| --- | --- |
| `all` | all in a single group, the default |
| `single` | individual target |
| `pairs` | match first half of files with the second half, take one from each half each time |
| `combinations` | all unordered combinations of 2-sets |
| `pairwise` | all adjacent 2-sets |
| `source` | by source of input |
| `pairsource` | pair input files by their sources and take one from each source each time |
| `N` = `1`, `2`, ... | chunks of size `N` |
| `pairsN`, `N`=`2`, `3`, ... | match first half of files with the second half, take `N` from each half each time | 
| `pairsourceN`, `N`=`2`, `3`, ... | pair input files by their sources and take `N` from each source (if equal size) each time |
| `pairwiseN`, `N`=`2`, `3`, ...| all adjacent 2-sets, but each set has `N` items |
| `combinationsN`, `N`=`2`, `3`, ... | all unordered combinations of `N` items |
| function (e.g. `lambda x: ...`) | a function that returns groups of inputs |

For example, with the following SoS script, the inputs are grouped pairwise:

In [4]:
!touch file1 file2 file3 file4

input: 'file1', 'file2', 'file3', 'file4', group_by='pairwise'
print(f"{_input}")

file1 file2
file2 file3
file3 file4


To demonstrate more acceptable values, the following example uses the `sos_run` action to execute a step with different grouping methods.

In [5]:
!touch file1 file2 file3 file4
%run default

[group]
parameter: group = str
print(f"\ngroup_by={group}")
input: 'file1', 'file2', 'file3', 'file4', group_by=group
print(f"{_index}: {_input}")

[default]
sos_run('group', group=1)
sos_run('group', group=2)
sos_run('group', group='single')
sos_run('group', group='pairs')
sos_run('group', group='pairwise')
sos_run('group', group='combinations')
sos_run('group', group='combinations3')


group_by=1
0: file1
2: file3
3: file4
1: file2

group_by=2
0: file1 file2
1: file3 file4

group_by=single
0: file1
1: file2
3: file4
2: file3

group_by=pairs
0: file1 file3
1: file2 file4

group_by=pairwise
0: file1 file2
2: file3 file4
1: file2 file3

group_by=combinations
0: file1 file2
1: file1 file3
3: file2 file3
4: file2 file4
2: file1 file4
5: file3 file4

group_by=combinations3
0: file1 file2 file3
2: file1 file3 file4
1: file1 file2 file4
3: file2 file3 file4
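
The groupings printed above can be reproduced with a few lines of plain Python, which may help when predicting what a given `group_by` value will do. This is a sketch of the semantics as they appear in the output, not SoS's implementation; the function name `group_inputs` is hypothetical, and only the options used in this example are covered.

```python
from itertools import combinations


def group_inputs(items, by):
    """Split `items` into groups following a subset of the `group_by` options."""
    items = list(items)
    if isinstance(by, int):
        # chunks of size N
        return [items[i:i + by] for i in range(0, len(items), by)]
    if by == 'all':
        return [items]
    if by == 'single':
        return [[x] for x in items]
    if by == 'pairs':
        # match the first half with the second half, one from each at a time
        half = len(items) // 2
        return [[a, b] for a, b in zip(items[:half], items[half:])]
    if by == 'pairwise':
        # all adjacent 2-sets
        return [[a, b] for a, b in zip(items, items[1:])]
    if by.startswith('combinations'):
        # 'combinations' means 2-sets, 'combinationsN' means N-sets
        n = int(by[len('combinations'):] or 2)
        return [list(c) for c in combinations(items, n)]
    raise ValueError(f'unsupported group_by value: {by!r}')


files = ['file1', 'file2', 'file3', 'file4']
for option in (1, 2, 'single', 'pairs', 'pairwise', 'combinations', 'combinations3'):
    print(f'\ngroup_by={option}')
    for index, group in enumerate(group_inputs(files, option)):
        print(f"{index}: {' '.join(group)}")
```

The sketch prints groups in index order, whereas the SoS output above lists substeps in the order in which they finished.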


We did not include options `pairsN` and `pairwiseN` in the example because we need more input files to see what is going on. As you can see from the following example, the `N` groups input targets as small groups of size `N` before `pairs` and `pairwise` are applied.

In [6]:
!touch A1 B1 A2 B2 A3 B3 A4 B4
%run

[group]
parameter: group = str
print(f"\ngroup_by={group}")
input: 'A1', 'B1', 'A2', 'B2', 'A3', 'B3', 'A4', 'B4', group_by=group
print(f"{_index}: {_input}")

[default]
sos_run('group', group='pairs2')
sos_run('group', group='pairwise2')



group_by=pairs2
0: A1 B1 A3 B3
1: A2 B2 A4 B4

group_by=pairwise2
1: A2 B2 A3 B3
0: A1 B1 A2 B2
2: A3 B3 A4 B4
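
The "chunk first, then pair" behavior of `pairsN` and `pairwiseN` can be written out in plain Python as follows. This is a sketch that matches the output shown above; the helper names `chunk`, `pairs_n`, and `pairwise_n` are hypothetical and not part of SoS.

```python
def chunk(items, n):
    """Split a list into consecutive chunks of size n."""
    return [items[i:i + n] for i in range(0, len(items), n)]


def pairs_n(items, n):
    """Chunk by n, then match the first half of the chunks with the second half."""
    chunks = chunk(list(items), n)
    half = len(chunks) // 2
    return [a + b for a, b in zip(chunks[:half], chunks[half:])]


def pairwise_n(items, n):
    """Chunk by n, then join every pair of adjacent chunks."""
    chunks = chunk(list(items), n)
    return [a + b for a, b in zip(chunks, chunks[1:])]


targets = ['A1', 'B1', 'A2', 'B2', 'A3', 'B3', 'A4', 'B4']
for index, group in enumerate(pairs_n(targets, 2)):
    print(f"pairs2 {index}: {' '.join(group)}")
for index, group in enumerate(pairwise_n(targets, 2)):
    print(f"pairwise2 {index}: {' '.join(group)}")
```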


As we recall from the `source` attribute of `sos_targets`, input targets can have the `source` of the present step (if specified directly), or of previously executed steps whose output they are. Option `group_by` allows you to group input by sources (`group_by='source'`), or pair sources (`group_by='pairsource'` and `group_by='pairsourceN'`). For example,

In [8]:
!touch c1 c2 c3 c4
%run
[step_10]
output: 'a1'
_output.touch()

[step_20]
output: 'b1', 'b2'
_output.touch()

[group]
parameter: group = str
print(f"\ngroup_by={group}")
input: 'c1', 'c2', 'c3', 'c4', output_from(['step_10', 'step_20']), group_by=group
print(f"{_index}: {_input} from {_input.sources}")

[default]
sos_run('group', group='source')
sos_run('group', group='pairsource')
sos_run('group', group='pairsource2')


group_by=source
0: c1 c2 c3 c4 from ['group', 'group', 'group', 'group']
2: b1 b2 from ['step_20', 'step_20']
1: a1 from ['step_10']

group_by=pairsource
0: c1 a1 b1 from ['group', 'step_10', 'step_20']
2: c3 a1 b2 from ['group', 'step_10', 'step_20']
1: c2 a1 b1 from ['group', 'step_10', 'step_20']
3: c4 a1 b2 from ['group', 'step_10', 'step_20']

group_by=pairsource2
0: c1 c2 a1 b1 from ['group', 'group', 'step_10', 'step_20']
1: c3 c4 a1 b2 from ['group', 'group', 'step_10', 'step_20']


The options `pairsource` and `pairsource2` need some explanation here because our groups do not have the same size. These options

1. Determine the number of groups `m` from `N` and the longest source.
2. Either group or repeat items in sources to create `m` groups.

For example, with `pairsource2`, we are creating two groups because the largest source has 4 targets (`m=4/2=2`). Then, `a1` is repeated twice, `b1` and `b2` are placed in two groups, and `c1`, `c2` and `c3`, `c4` are placed in two groups.
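
The two-step rule can be sketched in Python as shown below. The exact rules SoS applies to sources of uneven size are not spelled out here, so this sketch is inferred from the output above and makes its own assumption: a source is either split evenly into `m` parts or, if it is shorter, each of its items is repeated an equal number of times. The function name `pair_by_source` is hypothetical.

```python
def pair_by_source(sources, n=1):
    """Group targets by pairing sources.

    `sources` maps a source name to its list of targets, in input order.
    """
    longest = max(len(items) for items in sources.values())
    if longest % n:
        raise ValueError('longest source is not a multiple of N')
    m = longest // n                       # step 1: number of groups
    groups = [[] for _ in range(m)]
    for name, items in sources.items():    # step 2: group or repeat items
        size = len(items)
        if size == 0:
            raise ValueError(f'source {name!r} is empty')
        if size >= m and size % m == 0:
            k = size // m                  # split into m parts of k items
            parts = [items[i * k:(i + 1) * k] for i in range(m)]
        elif m % size == 0:
            rep = m // size                # repeat each item rep times
            parts = [[items[i // rep]] for i in range(m)]
        else:
            raise ValueError(f'cannot spread source {name!r} over {m} groups')
        for group, part in zip(groups, parts):
            group.extend(part)
    return groups


sources = {
    'group': ['c1', 'c2', 'c3', 'c4'],
    'step_10': ['a1'],
    'step_20': ['b1', 'b2'],
}
print(pair_by_source(sources))
print(pair_by_source(sources, n=2))
```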

Finally, if none of the predefined grouping mechanism works, you can specify a function that takes `step_input` and returns a list of `sos_targets`. 

In [1]:
!touch c1 c2 c3 c4 c5 c6
input: 'c1', 'c2', 'c3', 'c4', 'c5', 'c6',
    group_by=lambda x: [x[0], x[1:3], x[3:]]
    
print(f"{_index}: {_input}")


0: c1
1: c2 c3
2: c4 c5 c6


### `group_by` as option to input functions

Pairing input from multiple sources is complicated when we apply `group_by` to a list of targets with different sources. It is actually a lot easier if you apply `group_by` to the sources separately. Fortunately, function `output_from` accepts `group_by` so that you can regroup the targets before merging with other sources.

In the following example, `step_10` has two output files and `step_20` has four. By applying `group_by=1` to `output_from('step_10')` and `group_by=2` to `output_from('step_20')`, we create two `sos_targets`, each with two subgroups. The two `sos_targets` will be joined to create a single `_input` for each substep.

In [2]:
%run group

[step_10]
output: 'a1', 'a2'
_output.touch()

[step_20]
output: 'c1', 'c2', 'c3', 'c4'
_output.touch()

[group]
input: output_from('step_10', group_by=1),
       output_from('step_20', group_by=2)
print(f"{_index}: {_input} from {_input.sources}")

0: a1 c1 c2 from ['step_10', 'step_20', 'step_20']
1: a2 c3 c4 from ['step_10', 'step_20', 'step_20']


As usual, keyword arguments override the sources of targets, so you can assign names to groups with keyword arguments:

In [3]:
%run group

[step_10]
output: 'a1', 'a2'
_output.touch()

[step_20]
output: 'c1', 'c2', 'c3', 'c4'
_output.touch()

[group]
input: output_from('step_10', group_by=1),
       s20=output_from('step_20', group_by=2)
print(f"{_index}: {_input} from {_input.sources}")

0: a1 c1 c2 from ['step_10', 's20', 's20']
1: a2 c3 c4 from ['step_10', 's20', 's20']


Things can become tricky if you specify both "regular" input and grouped targets from `output_from`. In this case, the regular input will be considered as a `sos_targets` with a single group, and be merged to every group of another `sos_targets`. 

In [4]:
!touch e1 e2
%run group

[step_10]
output: 'a1', 'a2'
_output.touch()

[step_20]
output: 'c1', 'c2', 'c3', 'c4'
_output.touch()

[group]
input: output_from('step_10', group_by=1),
       output_from('step_20', group_by=2),
       my=('e1', 'e2')
print(f'\nSubstep {_index}')
print(f"substep input is {_input} from {_input.sources}")


Substep 0
substep input is a1 c1 c2 e1 e2 from ['step_10', 'step_20', 'step_20', 'my', 'my']

Substep 1
substep input is a2 c3 c4 e1 e2 from ['step_10', 'step_20', 'step_20', 'my', 'my']
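
The merging rule, in which a single-group input is attached to every group of the others, can be sketched in plain Python. This models each input as a list of groups and is an illustration only; the function name `merge_grouped` is hypothetical, and the sketch assumes that all multi-group inputs have the same number of groups.

```python
def merge_grouped(*grouped):
    """Join several grouped inputs into one list of substep inputs.

    Each argument is a list of groups; an argument with a single group
    is broadcast to every substep.
    """
    n = max(len(groups) for groups in grouped)
    merged = []
    for i in range(n):
        substep = []
        for groups in grouped:
            if len(groups) == 1:
                substep.extend(groups[0])   # broadcast the only group
            elif len(groups) == n:
                substep.extend(groups[i])   # take the i-th group
            else:
                raise ValueError('inputs have incompatible numbers of groups')
        merged.append(substep)
    return merged


step_10 = [['a1'], ['a2']]                  # output_from('step_10', group_by=1)
step_20 = [['c1', 'c2'], ['c3', 'c4']]      # output_from('step_20', group_by=2)
my = [['e1', 'e2']]                         # regular input, a single group
for index, substep in enumerate(merge_grouped(step_10, step_20, my)):
    print(f"Substep {index}: {' '.join(substep)}")
```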


However, if option `group_by` is specified outside of `output_from`, it will group all targets regardless of original grouping. For example, in the following example, output from `step_10` will be grouped by 2.

In [7]:
!touch e1 e2
%run group

[step_10]
output: 'c1', 'c2', 'c3', 'c4'
_output.touch()

[group]
input: output_from('step_10', group_by=1),
       my=('e1', 'e2'), group_by=2
print(f'\nSubstep {_index}')
print(f"substep input is {_input} from {_input.sources}")


Substep 0
substep input is c1 c2 from ['step_10', 'step_10']

Substep 1
substep input is c3 c4 from ['step_10', 'step_10']

Substep 2
substep input is e1 e2 from ['my', 'my']


### Option `for_each` <a id="Option_for_each"></a>

Option `for_each` allows you to repeat the step process for each value of a variable. For example,

In [1]:
!touch file1 file2
%run

method = ['m1', 'm2']
input: 'file1', 'file2', for_each='method'
print(f"{_index}: {_input} {_method}")

0: file1 file2 m1
1: file1 file2 m2


This repeats the step with each item of variable `method`.

SoS automatically creates a loop variable `_method` for variable `method`, which assumes a slice of the variable at each iteration. 

Nested loops are also allowed. For example,

In [2]:
!touch file1 file2
%run
[0]
method = ['m1', 'm2']
pars = [1, 2]
input: 'file1', 'file2', for_each=['method', 'pars']
print(f"{_index}: _input={_input} _method={_method}, _pars={_pars}")

0: _input=file1 file2 _method=m1, _pars=1
1: _input=file1 file2 _method=m2, _pars=1
3: _input=file1 file2 _method=m2, _pars=2
2: _input=file1 file2 _method=m1, _pars=2
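
The order in which the nested loop assigns substep indexes can be reproduced with `itertools.product`. This sketch only mirrors the output above, where the first variable changes fastest; the generator name `nested_loop` is hypothetical and this is not how SoS builds its substeps.

```python
from itertools import product


def nested_loop(**variables):
    """Yield one dict of loop values per substep; the first variable changes fastest."""
    names = list(variables)
    # product() varies its last argument fastest, so feed the names in reverse
    for combo in product(*(variables[name] for name in reversed(names))):
        yield dict(zip(reversed(names), combo))


for index, values in enumerate(nested_loop(method=['m1', 'm2'], pars=[1, 2])):
    print(f"{index}: _method={values['method']}, _pars={values['pars']}")
```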


If you would like to loop the process with several parameters, you can put them at the same level using the format `'var1,var2'`. For example,

In [4]:
%run
!touch file1 file2

[0]
method = ['m1', 'm2']
pars = [1, 2]
input: 'file1', 'file2', for_each=['method,pars']
print(f"{_index}: _input={_input} _method={_method}, _pars={_pars}")

0: _input=file1 file2 _method=m1, _pars=1
2: _input=file1 file2 _method=m1, _pars=2
3: _input=file1 file2 _method=m2, _pars=2
1: _input=file1 file2 _method=m2, _pars=1


The variable passed to option `for_each` can be a sequence (`list`, `tuple`, `set`, etc.), or a Pandas `Series`, `Index`, or `DataFrame`. In the last case, each `_loop` variable represents a row of the dataframe and you can access single values using the format `_loop["header"]`. For example

In [5]:
%preview data
%run
[0]
import pandas as pd
data = pd.DataFrame([(1, 2, 'Hello'), (2, 4, 'World')], columns=['A', 'B', 'C'])
input: for_each='data'
output: f"{_data['A']}_{_data['B']}_{_data['C']}.txt"
sh: expand=True
    touch {_output}

0: _input=file1 file2 _method=m1, _pars=1
1: _input=file1 file2 _method=m2, _pars=1
2: _input=file1 file2 _method=m1, _pars=2
3: _input=file1 file2 _method=m2, _pars=2


If you would like to define your own loop variable, or if the default loop variable does not work (e.g. loop through `obj.sequence` where `_obj.sequence` is not a valid variable name), you can use a dictionary syntax in the format of `{'varname': sequence}`. Multi-variable and nested loops can be specified in the format of `{'var1': seq1, 'var2': seq2}` (same level) and `[{'var1': seq1}, {'var2': seq2}]`.

For example, the first example for this parameter can be written as

In [6]:
!touch file1 file2

input: 'file1', 'file2', for_each=dict(method=['m1', 'm2'])
print(f"{_index}: {_input} {method}")

0: file1 file2 m1
1: file1 file2 m2


and a later example can be written as

In [7]:
!touch file1 file2
%run
[0]
input: 'file1', 'file2', 
   for_each=dict(method=['m1','m2'], pars=[1, 2])
print(f"{_index}: _input={_input} method={method}, pars={pars}")

0: _input=file1 file2 _method=m1, _pars=1
1: _input=file1 file2 _method=m2, _pars=1
2: _input=file1 file2 _method=m1, _pars=2
3: _input=file1 file2 _method=m2, _pars=2


The dictionary syntax also supports multiple keys, which helps when customizing groups of variables. For example, in the script below we only care about situations where `n` is greater than `p`:

In [8]:
!touch a.txt
%run
[1]
import itertools
parameter: n = [100, 300]
parameter: p = [50, 100, 200]
parameter: outfile = ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt', '6.txt']
input: 'a.txt', for_each= {'_n,_p': [(_n,_p) for _n in n for _p in p if _n > _p]}
print(f"{_index} {outfile[_index]} {_n} {_p}")

0: _input=file1 file2 _method=m1, _pars=1
1: _input=file1 file2 _method=m2, _pars=1
3: _input=file1 file2 _method=m2, _pars=2
2: _input=file1 file2 _method=m1, _pars=2
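
The list comprehension passed to `for_each` in this example is ordinary Python, so the combinations it keeps can be checked on their own, outside SoS. The snippet below uses the default parameter values from the script; the variable name `pairs` is introduced here only for illustration.

```python
n = [100, 300]
p = [50, 100, 200]

# keep only the (n, p) combinations where n is greater than p
pairs = [(_n, _p) for _n in n for _p in p if _n > _p]

for index, (_n, _p) in enumerate(pairs):
    print(index, _n, _p)
```

Four of the six possible combinations survive the filter, so the step would have four substeps and use only the first four entries of `outfile`.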


### Option `paired_with` <a id="Option_paired_with"></a>

Option `paired_with` assigns attributes to each of the targets in `step_input`. For example, 

In [12]:
!mkdir case ctrl
!touch case/A1.bam case/A2.bam ctrl/A1.bam ctrl/A2.bam

bam_files = ['case/A1.bam', 'case/A2.bam', 'ctrl/A1.bam', 'ctrl/A2.bam']
mutated = ['case', 'case', 'ctrl', 'ctrl']

input: bam_files, paired_with=dict(mutated=mutated)
for i in range(4):
    print(f'Sample {_input[i]} is of type {_input[i].mutated}')

Sample case/A1.bam is of type case
Sample case/A2.bam is of type case
Sample ctrl/A1.bam is of type ctrl
Sample ctrl/A2.bam is of type ctrl


Here the dictionary syntax expands to 
```
paired_with={'mutated': ['case', 'case', 'ctrl', 'ctrl']}
```
and assigns each value to the attribute `mutated` of the corresponding target.

Although this example is not particularly exciting, it becomes useful when the `step_input` is grouped,

In [15]:
!touch case/A1.bam case/A2.bam ctrl/A1.bam ctrl/A2.bam

bam_files = ['case/A1.bam', 'case/A2.bam', 'ctrl/A1.bam', 'ctrl/A2.bam']
mutated = ['case', 'case', 'ctrl', 'ctrl']

input: bam_files, paired_with=dict(mutated=mutated), group_by=2
print(f'\nGroup {_index}')
for s in _input:
    print(f'Sample {s} is of type {s.mutated}')


Group 0
Sample case/A1.bam is of type case
Sample case/A2.bam is of type case

Group 1
Sample ctrl/A1.bam is of type ctrl
Sample ctrl/A2.bam is of type ctrl
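The effect of pairing followed by grouping can be approximated in plain Python. This is only a sketch: `pair_and_group` is a hypothetical helper, and `SimpleNamespace` objects stand in for SoS targets.

```python
from types import SimpleNamespace

def pair_and_group(files, group_size, **attributes):
    """Hypothetical helper: attach one value per file, then split into groups."""
    for name, values in attributes.items():
        if len(values) != len(files):
            raise ValueError(f'{name} must have one value per input file')
    targets = [
        SimpleNamespace(path=path, **{name: values[i] for name, values in attributes.items()})
        for i, path in enumerate(files)
    ]
    return [targets[i:i + group_size] for i in range(0, len(targets), group_size)]

bam_files = ['case/A1.bam', 'case/A2.bam', 'ctrl/A1.bam', 'ctrl/A2.bam']
mutated = ['case', 'case', 'ctrl', 'ctrl']

groups = pair_and_group(bam_files, 2, mutated=mutated)
for index, group in enumerate(groups):
    print(index, [(target.path, target.mutated) for target in group])
```

Because the attribute travels with each target, it stays correct no matter how the targets are grouped afterwards.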


The dictionary syntax can be a little long to type so SoS provides a shortcut

```
paired_with='name'
```
which is equivalent to

```
paired_with=dict(_name=name)
```
Note that variables created by SoS have a leading underscore to differentiate them from regular variables.

In [18]:
!touch case/A1.bam case/A2.bam ctrl/A1.bam ctrl/A2.bam

bam_files = ['case/A1.bam', 'case/A2.bam', 'ctrl/A1.bam', 'ctrl/A2.bam']
mutated = ['case', 'case', 'ctrl', 'ctrl']
sample_name = ['A1', 'A2', 'A1', 'A2']

input: bam_files, paired_with=['mutated', 'sample_name'], group_by=1
print(f"{_index}: _input={_input} _mutated={_input._mutated}, _sample_name={_input._sample_name}")

0: _input=case/A1.bam _mutated=case, _sample_name=A1
1: _input=case/A2.bam _mutated=case, _sample_name=A2
2: _input=ctrl/A1.bam _mutated=ctrl, _sample_name=A1
3: _input=ctrl/A2.bam _mutated=ctrl, _sample_name=A2


Another convenience feature is that SoS creates a step level variable from these attributes so that you can access all values at the same time. That is to say, `_mutated` is created as a shortcut for 
```
[x._mutated for x in _input]
```

In [20]:
bam_files = ['case/A1.bam', 'case/A2.bam', 'ctrl/A1.bam', 'ctrl/A2.bam']
mutated = ['case', 'case', 'ctrl', 'ctrl']
sample_name = ['A1', 'A2', 'A1', 'A2']

input: bam_files, paired_with=['mutated', 'sample_name'], group_by=2
print(f"{_index}: _input={_input} _mutated={_mutated}, _sample_name={_sample_name}")

0: _input=case/A1.bam case/A2.bam _mutated=['case', 'case'], _sample_name=['A1', 'A2']
1: _input=ctrl/A1.bam ctrl/A2.bam _mutated=['ctrl', 'ctrl'], _sample_name=['A1', 'A2']


Values passed to option `paired_with` are usually lists of the same length as `step_input`, but they can also be of other types such as `paths` and `sos_targets`, in which case the iterator variables (e.g. `_mutated` for `mutated`) have the same type as the input variable.

### Option `group_with` <a id="Option_group_with"></a>

Similar to option `paired_with`, which associates variables with input files, option `group_with` associates items of a sequence with each substep. This option is applied after `group_by` and before `for_each`, which means that the length of the sequence should equal the number of groups, and the variables will be the same for each `for_each` loop. Also similar to option `paired_with`, option `group_with` can take a string (name of a variable) or a dictionary.

Using the above example, you can assign a label to each group by passing the name of a sequence variable

In [21]:
%sandbox
!mkdir case ctrl
!touch case/A1.bam case/A2.bam ctrl/A1.bam ctrl/A2.bam

mutated = ['case', 'ctrl']
bam_files = ['case/A1.bam', 'case/A2.bam', 'ctrl/A1.bam', 'ctrl/A2.bam']
input: bam_files, group_by=2, group_with='mutated'
print(f"{_index}: _input={_input} _mutated={_mutated}")

0: _input=case/A1.bam case/A2.bam _mutated=case
1: _input=ctrl/A1.bam ctrl/A2.bam _mutated=ctrl


or a dictionary with variable name and values:

In [22]:
%sandbox
!mkdir case ctrl
!touch case/A1.bam case/A2.bam ctrl/A1.bam ctrl/A2.bam

bam_files = ['case/A1.bam', 'case/A2.bam', 'ctrl/A1.bam', 'ctrl/A2.bam']
input: bam_files, group_by=2, group_with={'mutated': ['case', 'ctrl']}
print(f"{_index}: _input={_input} mutated={mutated}")

0: _input=case/A1.bam case/A2.bam mutated=case
1: _input=ctrl/A1.bam ctrl/A2.bam mutated=ctrl
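The length requirement can be illustrated with a small Python sketch. `label_groups` is a hypothetical helper, not an SoS function; it pairs each group with one item of the sequence and rejects sequences of the wrong length.

```python
def label_groups(groups, labels):
    """Hypothetical helper: associate one label with each input group."""
    if len(labels) != len(groups):
        raise ValueError('the sequence must have exactly one item per group')
    return list(zip(groups, labels))

groups = [['case/A1.bam', 'case/A2.bam'], ['ctrl/A1.bam', 'ctrl/A2.bam']]
labelled = label_groups(groups, ['case', 'ctrl'])

for index, (group, mutated) in enumerate(labelled):
    print(f"{index}: _input={' '.join(group)} mutated={mutated}")
```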


### Option `pattern` <a id="Option_pattern"></a>

This option does the reverse of function `expand_pattern`. It uses named wildcards to match pattern to all input files, and creates step variables for these wildcard objects. For example,

In [23]:
%sandbox
!touch a-20.txt b-10.txt
%run
[step]
input:  'a-20.txt', 'b-10.txt', pattern = '{name}-{par}.txt'
output: expand_pattern("{name}-processed-{par}.txt")
sh: expand=True
    echo {_output}
    touch {_output}

a-processed-20.txt b-processed-10.txt


will take all input files and extract `name` and `par` from each file name as variables `name` and `par`. These variables are then used to create output file names by adding the word `processed` between the wildcard objects. The outcome of the SoS script above is the creation of files `a-processed-20.txt` and `b-processed-10.txt`.
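The matching step can be sketched in plain Python with the `re` module. This illustrates the idea only and is not how SoS implements `pattern`; `pattern_to_regex` is a hypothetical helper that turns each `{wildcard}` into a named group.

```python
import re

def pattern_to_regex(pattern):
    """Hypothetical helper: compile '{name}-{par}.txt' into a regex with named groups."""
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r'\{(\w+)\}', pattern)
    regex = ''.join(
        f'(?P<{part}>.+)' if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    return re.compile(regex)

matcher = pattern_to_regex('{name}-{par}.txt')
inputs = ['a-20.txt', 'b-10.txt']
matches = [matcher.fullmatch(path).groupdict() for path in inputs]

# One value per input file, in the same order as the inputs.
name = [m['name'] for m in matches]
par = [m['par'] for m in matches]
outputs = [f'{n}-processed-{p}.txt' for n, p in zip(name, par)]
print(outputs)
```

The last line plays the role of `expand_pattern`, which builds one file name per extracted pair of values.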

When wildcard objects are accessed as step variables, variable names both with and without the `_` prefix are available. In this example, `_name` and `name`, and `_par` and `par`, are all available and hold the same values. The two conventions differ only when `group_by` or `for_each` is also used, in which case the generated pattern variables `name` and `par` are automatically paired with `input` as if they had been paired using option `paired_with=['name', 'par']`.

In [24]:
%sandbox
!touch a-20.txt b-10.txt
%run
[step]
input:  'a-20.txt', 'b-10.txt', pattern = '{name}-{par}.txt', group_by=1
output: expand_pattern("{_name}-processed-{_par}.txt")
sh: expand=True
    echo {_output}
    touch {_output}

a-processed-20.txt
b-processed-10.txt


### Option `concurrent` <a id="Option_concurrent"></a>

Substeps of a step are by default executed concurrently with potential dependencies. For example,

In [5]:
sum = 0
import time
start_time = time.time()
input: for_each={'i': range(4)}
time.sleep(4)
print(f'sum is {sum} at index {_index}, completed in {time.time() - start_time:.1f} seconds')

sum is 0 at index 0, completed in 4.7 seconds
sum is 0 at index 1, completed in 4.7 seconds
sum is 0 at index 2, completed in 4.7 seconds
sum is 0 at index 3, completed in 4.7 seconds


As you can see, `start_time` is the start time shared by all substeps, and all substeps complete at about the same time because they are executed concurrently.

Concurrent execution can cause some unexpected results. For example, there are 4 substeps in the following example. Each of them adds `i` to a shared variable `sum`, but the results are not accumulated because each substep has its own `sum`.

In [7]:
sum = 0
input: for_each=dict(i=range(4))
sum += i
print(f'sum is {sum} at index {_index}')

sum is 0 at index 0
sum is 1 at index 1
sum is 2 at index 2
sum is 3 at index 3


To get the correct `sum` for all substeps, you can execute the substeps sequentially by adding option `concurrent=False`.

In [8]:
sum = 0
input: for_each=dict(i=range(4)), concurrent=False
sum += i
print(f'sum is {sum} at index {_index}')

sum is 0 at index 0
sum is 1 at index 1
sum is 3 at index 2
sum is 6 at index 3
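The difference between the two modes can be reproduced with a toy model in plain Python. This is a sketch under the assumption that each concurrent substep starts from its own copy of the step variables; it does not reflect how SoS actually dispatches substeps.

```python
def run_substeps(values, concurrent):
    """Toy model of substeps that add `i` to a step-level variable `sum`."""
    step_vars = {'sum': 0}
    seen = []
    for i in values:
        # Concurrent substeps each start from their own copy of the step variables;
        # sequential substeps share one namespace and see earlier changes.
        namespace = dict(step_vars) if concurrent else step_vars
        namespace['sum'] += i
        seen.append(namespace['sum'])
    return seen

print(run_substeps(range(4), concurrent=True))
print(run_substeps(range(4), concurrent=False))
```

With copies, every substep adds `i` to the initial value `0`; with a shared namespace, the additions accumulate.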


### `dynamic` input files

In order to determine the best execution strategy, SoS evaluates all expressions for all steps before the execution of a workflow to figure
out input and output of steps. This works most of the time but sometimes the input of a step can only be determined at runtime. For example,
if you would like your workflow to automatically scan an input directory and process all fasta files under it, or if a previous step produces
files that cannot be determined beforehand, you can specify input files as follows,

```python
input: 'input/*.fasta'
```

The problem is that no files, or a wrong set of files, might exist during the planning stage, so SoS might skip this step or start the step
with a wrong set of files. To address this problem, you can declare the input files as **dynamic** by passing a `dynamic` object

```python
input: dynamic('input/*.fasta')
```

This tells SoS that the input of this step can only be determined at runtime, so SoS executes the step only after all its previous
steps have been completed.
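Why the timing matters can be shown with a plain Python sketch that resolves the same wildcard twice, once before and once after a hypothetical upstream step has created its files. The temporary directory and the file names are made up for illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    pattern = os.path.join(workdir, '*.fasta')

    # "Planning": the upstream step has not run yet, so nothing matches.
    planned = [os.path.basename(p) for p in glob.glob(pattern)]

    # A hypothetical upstream step now produces its files.
    for sample in ('s1', 's2'):
        open(os.path.join(workdir, f'{sample}.fasta'), 'w').close()

    # "Runtime": resolving the same pattern after upstream steps gives the real input.
    resolved = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(planned, resolved)
```

Deferring the resolution until runtime, as `dynamic` requests, corresponds to using the second result rather than the first.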

### Inherited input with groups

Remember that the purpose of the `input` statement is to create `step_input` with appropriate targets and groups, but what happens to `sos_targets` inherited from the previous step, or from `output_from` and `named_output`? The simple answer is that the existing groups will be honored.

Let us first look at one example,

In [10]:
%run
[10]
input: for_each=dict(i=range(3))
output: f'out_{i}.txt'
_output.touch()

[20]
print(_input)


out_0.txt
out_2.txt
out_1.txt


What happens here is that there are three substeps in step `10`, creating outputs `out_0.txt`, `out_1.txt` and `out_2.txt`. The `step_output` object contains three targets in three groups.

When this `step_output` target is inherited by step `20` and becomes the `step_input` of the next step, step `20` is executed with the three existing substeps with `_input` being the `_output` of step `10`.

If this is not what you want, you can regroup the input with option `group_by`. For example,

In [11]:
%run
[10]
input: for_each=dict(i=range(3))
output: f'out_{i}.txt'
_output.touch()

[20]
input: group_by='all'
print(_input)


out_0.txt out_1.txt out_2.txt


The same logic applies to output obtained from `output_from` and `named_output`. In the following example, the `output_from` function inherits 4 groups of outputs from step `A` and 2 groups of outputs from step `B`. It then regroups the outputs from step `A` into 2 groups and merges them with the output from `B`. The final `_input` of step `20` therefore has two groups, each with 2 targets from step `A` and 1 target from step `B`.

In [12]:
%run
[A]
input: for_each=dict(i=range(4))
output: f'A_{i}.txt'
_output.touch()

[B]
input: for_each=dict(i=range(2))
output: f'B_{i}.txt'
_output.touch()

[20]
input: output_from('A', group_by=2), output_from('B')
print(_input)


A_0.txt A_1.txt B_0.txt
A_2.txt A_3.txt B_1.txt
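The regrouping and merging can be traced with plain Python lists. This is a sketch of the bookkeeping only: `regroup` is a hypothetical helper, and the check on equal group counts is an assumption made for this illustration.

```python
def regroup(groups, size):
    """Flatten existing groups and split them into groups of `size` targets."""
    flat = [target for group in groups for target in group]
    return [flat[i:i + size] for i in range(0, len(flat), size)]

a_groups = [[f'A_{i}.txt'] for i in range(4)]   # 4 groups inherited from step A
b_groups = [[f'B_{i}.txt'] for i in range(2)]   # 2 groups inherited from step B

a_regrouped = regroup(a_groups, 2)
# Assumption of this sketch: sources are merged group by group, so counts must match.
if len(a_regrouped) != len(b_groups):
    raise ValueError('sources must have the same number of groups to be merged')

merged = [a + b for a, b in zip(a_regrouped, b_groups)]
for group in merged:
    print(' '.join(group))
```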


It is worth noting that `named_output` can have groups as well. Because a step can produce multiple named outputs, `named_output('a')` essentially returns the subset of targets with source `a`, both at the step level and within each group. For example, step `K` generates 8 outputs in 4 groups. With `named_output('B')`, only the targets named `B` are returned, group by group, so `_input` will be `B_0.txt`, `B_1.txt` and so on.

In [8]:
%run
[K]
input: for_each=dict(i=range(4))
output: A=f'A_{i}.txt', B=f'B_{i}.txt'
_output.touch()

[20]
input: named_output('B')
print(_input)


B_0.txt
B_2.txt
B_1.txt
B_3.txt
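The selection can be pictured as filtering each group by name while keeping the group structure. In the plain Python sketch below, dictionaries stand in for the named targets of each substep and `select_named` is a hypothetical helper.

```python
# Each substep of step K yields one group with two named targets.
k_groups = [{'A': f'A_{i}.txt', 'B': f'B_{i}.txt'} for i in range(4)]

def select_named(groups, name):
    """Keep only the targets labelled `name`, preserving the group structure."""
    return [[group[name]] for group in groups if name in group]

b_only = select_named(k_groups, 'B')
for group in b_only:
    print(' '.join(group))
```

The result still has four groups, which is why the downstream step runs four substeps.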


### Summary

Options of step `input` are evaluated in the following order:

1. A list of input files, if specified, replaces `input`, which is by default the output of the previous step.
2. Option `group_by` groups the files into several groups, named `_input`.
3. Option `for_each` repeats `_input` for each loop variable, named `_loopvar` if `for_each='loopvar'`.
4. Option `paired_with` pairs one or more variables with `input`. Variable `paired` is paired with `input`
    and variable `_paired` is paired with `_input` in each loop if `paired_with='paired'`.
5. Option `pattern` extracts variables from filenames in `input`. Variable `extracted` is paired with `input`
    and variable `_extracted` is paired with `_input` in each loop if `pattern='{extracted}_other_part'`.

The differences between looped and non-looped steps are summarized in the following figure

![step_loop](../media/step_loop.jpg)

## Step output <a id="Step_output"></a>

### Output files <a id="Output_files"></a>

Output files of a step can be specified by step `output`. Similar to `input`, step output accepts strings, variables, expressions, and allows wildcard characters. For example, the following are acceptable output files

```python
output:  []

output:  'accepted_hits.bam'

output:  aligned_reads, bam_stats

output:  'aligned/*.bam'

output:  expand_pattern('aligned_{samples}.bam')
```

In the last example, function `expand_pattern` is used to construct a list of files from items of a sequence `samples`.

Step process could be repeated multiple times with **substeps** defined by options `for_each` or `group_by`. Whereas `input` represents the complete set of input files, each substep has its own input files stored in variable `_input`.

When there is no substep, `output` and `_output` are the same. Otherwise, the `output:` statement produces `_output` for each substep, and `_output` for all substeps form `output` after the completion of all substeps.

For example, the following step accepts one or more bam files and indexes them using command `samtools index`. The input files are passed one by one and the output of each substep is determined by `_input`.

```python
[10]
input:
	bamfiles, group_by='single'

output:
	f'{_input}.bai'

run(f'''samtools index {_input} ''')
```

The use of variable `output` in this scenario is discouraged because `output`, as the collection of all `_output`, grows with each substep.
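The relationship between the two variables can be traced with a short Python loop. This is an illustration only: the file names are hypothetical, and plain lists stand in for `sos_targets`.

```python
bamfiles = ['s1.bam', 's2.bam', 's3.bam']   # hypothetical input files

step_output = []    # plays the role of `output`
snapshots = []      # what `output` would look like inside each substep
for _input in bamfiles:             # one iteration per substep
    _output = f'{_input}.bai'       # plays the role of `_output`
    step_output.append(_output)
    snapshots.append(list(step_output))

for index, snapshot in enumerate(snapshots):
    print(index, snapshot)
```

Each substep sees a different, partial collection, which is why the step process should refer to `_output` rather than `output`.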

Specifying the output of the entire step when there are multiple substeps can lead to an error. For example,

```python
[10]
input:
	bamfiles, group_by=1

output:
	[x + '.bai' for x in bamfiles]

task:
run(f'''samtools index {_input} ''')
```

will fail because every substep declares the same set of output files (the same `_output` for all substeps), and none of the tasks produces all of the specified output files.


### Option `group_by` <a id="Option_group_by"></a>

The `output` statement usually generates a `sos_targets` object without groups. This object becomes `_output`, which then becomes one of the groups of `step_output`.

However, the `output` statement can also generate a `sos_targets` object with groups, in which case the group with index `_index` becomes the `_output` of substep `_index`. For this reason, the object should have the same number of groups as `step_input`.

For example, if you have lists of pre-determined input and output files and you would like to pair them, you can list all output files and use option `group_by` to create as many output groups as there are input groups.

In [5]:
%run
in_files = [f'a_{i}.txt' for i in range(4)]
out_files = [f'b_{i}.txt' for i in range(4)]

[1]
input: in_files, group_by=1
output: out_files, group_by=1
_output.touch()

[2]
print(_input)


b_0.txt
b_2.txt
b_3.txt
b_1.txt


Note that the previous example might not produce `b_0.txt`, `b_1.txt` in order because the substeps are executed concurrently so substeps can complete the `print` statement in any order.

### Options `paired_with`, `group_with`, `pattern`, and `for_each`

Just like in `input` statements, options `paired_with`, `group_with`, `pattern`, and `for_each` are all allowed in the `output` statement and can be used to create an `_output` that is passed to the next step or retrieved through `output_from` or `named_output`.

The most useful applications of the `paired_with` and `group_with` options are to attach variables to output targets and to the output group.

Say we know that particular input and output files are associated with a sample name, and we have paired each input file with a sample name using option `paired_with=dict(sample=samples)`. Consequently, `_input.sample` is available for each substep.

The output statement  
```
output: f'{_input}.result'
```
generates `f'{_input}.result'` for each `_input`, but the sample information will be lost. For this reason, option `paired_with=dict(sample=_input.sample)` is added to attach attribute `sample`, with value `_input.sample`, to each output file. With this option, each input file of step `20` has an attribute that points to its sample name.

In [16]:
!touch a.txt b.txt

%run
[10]
samples = ['A', 'B']
input_files = ['a.txt', 'b.txt']

input: input_files, group_by=1, paired_with=dict(sample=samples)
output: f'{_input}.result', paired_with=dict(sample=_input.sample)
_output.touch()

[20]
print(f'{_input} with sample name {_input.sample}')


a.txt.result with sample name A
b.txt.result with sample name B


Similarly, in the following example,

```
output: f'a_{i}.txt', group_with=dict(i=i)
```
creates four `sos_targets` objects with variable `i` attached. These objects become the `_output` of step `10` and the `_input` of the next step. `i` becomes available in step `20` because it was attached to `_input`. This feature is powerful but can be dangerous as well, because variables attached to `_input` can "contaminate" the namespace of step `20`.

In [13]:
%run
[10]
input: for_each=dict(i=range(4))
output: f'a_{i}.txt', group_with=dict(i=i)
_output.touch()

[20]
print(f'{_input} with variable i={i}')


a_0.txt with variable i=0
a_2.txt with variable i=2
a_1.txt with variable i=1
a_3.txt with variable i=3


### `dynamic` output files

Similar to the case of [dynamic input files](#dynamically-determined-input-files-function-dynamic), the output of some steps cannot be determined beforehand. For example, the following script generates `html` files whose names cannot be determined during a dry run, so the step fails:

In [25]:
%sandbox --expect-error
%run
[10]
output: '*.html'

import random
for i in range(2):
    run(f"touch result_{random.randint(1, 20)}.html")

touch result_16.html
touch result_10.html


Output target *.html does not exist after the completion of step default_10


In this case, you will need to define the output as `dynamic` using a `dynamic` function.

In [26]:
%sandbox
%run
[10]
output: dynamic('*.html')

import random
for i in range(2):
    run(f"touch result_{random.randint(1, 20)}.html")


touch result_15.html
touch result_7.html


In this case, SoS knows that the output can only be determined after the completion of the step.

## Step depends <a id="Step_depends"></a>

This item specifies files that are required for the step. Although not required, it is a good practice to list resource files and other dependency files for a particular step. For example

```python
[10]
input: fasta_files
depends: reference_seq, executable('fastqc')
```
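What such a dependency check amounts to can be sketched in plain Python with the standard library. This is not how SoS verifies `depends`; `missing_dependencies` is a hypothetical helper that tests files with `os.path.exists` and commands with `shutil.which`.

```python
import os
import shutil

def missing_dependencies(files=(), executables=()):
    """Hypothetical helper: list dependencies that cannot be found."""
    missing = [path for path in files if not os.path.exists(path)]
    # shutil.which returns None when a command is not on PATH.
    missing += [cmd for cmd in executables if shutil.which(cmd) is None]
    return missing

# Prints an empty list only if `fastqc` is installed on this machine.
print(missing_dependencies(files=[os.curdir], executables=['fastqc']))
```

Listing dependencies explicitly lets such problems surface before the step starts rather than in the middle of a long-running command.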

Similar to `output`, `depends` can be specified after `input` and can include dependent files that are determined from loop variables.

The following figure summarizes the effect of `input` and `output` options and input options `group_by` and `for_each` on the flow
of input and output files and related variables.

![Step options](../media/step_options.jpg)

## Step process <a id="Step_process"></a>

A step process consists of the Python statements that perform certain tasks and produce step output from step input. A step process can contain arbitrary Python statements. For example,

In [27]:
%sandbox
%run
[10]
output: 'a.txt'
with open(_output[0], 'w') as dest:
   dest.write('some text')

and

In [28]:
%sandbox
%run
[10]
output: 'a.txt'
sh: expand=True
    echo "some text" > {_output:q}

use inline Python code or a shell script (both interpreted and executed by SoS) to generate `a.txt`.

Step processes are executed within SoS and are executed sequentially. However, part or all of the step process can be executed externally and potentially in parallel as step `task`. This will be covered in detail in section [External task](External_task.html).