# Step definition

SoS steps are the building blocks of SoS workflows. Although the input and output can be more general, each step typically has its **input**, **output**, and **dependent** files, and executes a **step process** that consists of one or more Python statements and SoS actions (special Python functions). Part or all of the step process, called **tasks**, can be executed and monitored externally.

![sos_step](../media/sos_step.png)


## Step variables

SoS defines multiple variables during the execution of a step. The first type of variables provides information about the step, which includes

* **`step_name`**: name of the step
* **`step_id`**: Hash ID of the step, which is determined by the content of the step
* **`workflow_id`**: Hash ID of the workflow in which the step is defined. It would be the ID of the nested workflow if the workflow is nested.
* **`master_id`**: Hash ID of the entire workflow, regardless of whether the step is defined in a nested workflow.

In the following example, the two steps have different step names, step IDs, and workflow IDs because `nested` is a nested workflow, but they share the same `master_id`, which equals the `workflow_id` of the outermost workflow.

In [1]:
%run
[nested]
print(f'Workflow {workflow_id}: step name={step_name}')
print(f'Workflow {workflow_id}: step id={step_id}')
print(f'Workflow {workflow_id}: workflow id={workflow_id}')
print(f'Workflow {workflow_id}: master id={master_id}')

[default]
print(f'Workflow {workflow_id}: step name={step_name}')
print(f'Workflow {workflow_id}: step id={step_id}')
print(f'Workflow {workflow_id}: workflow id={workflow_id}')
print(f'Workflow {workflow_id}: master id={master_id}')
sos_run('nested')

Workflow 1663b815c3bed9f2: step name=default_0
Workflow 1663b815c3bed9f2: step id=950015e2ddfc10d4
Workflow 1663b815c3bed9f2: workflow id=1663b815c3bed9f2
Workflow 1663b815c3bed9f2: master id=1663b815c3bed9f2
Workflow fecfb67fda2f92df: step name=nested_0
Workflow fecfb67fda2f92df: step id=74bf81da4f5098aa
Workflow fecfb67fda2f92df: workflow id=fecfb67fda2f92df
Workflow fecfb67fda2f92df: master id=1663b815c3bed9f2


These variables can be useful, for example, to save runtime information, as in the following example:

In [8]:
%preview -n default_10.log
%run 

[10]
with open(step_name + '.log', 'w') as log:
    log.write(f'Step specific log message saved in {step_name}.log')

Step specific log message saved in default_10.log

If a section defines multiple steps, the step variables can be used to define (slightly) different steps according to which step is executing. For example,

In [9]:
%run human

[human_10, mouse_10]
if 'human' in step_name:
   print("I am dealing with human")
else:
   print("I am dealing with mouse")


I am dealing with human


The second type of variables are step input, output, and dependent targets. They are of type `sos_targets` and will be explained later. 

## Step options <a id="Step_options"></a>

**Step options** are specified after the step name and assist the specification of workflows. SoS provides the following options:

### Option `skip`
 <a id="Option_skip"></a>
Option `skip` takes two formats. The first format has no value

```
[10: skip]
```

and is equivalent to

```
[10: skip=True]
```

The whole step will be skipped as if it is not defined at all in the script. This option provides a quick method to disable a step.

The second format takes a value, which is usually an expression that will be evaluated when the step is executed. For example, step 10 is executed by default

In [1]:
%run
parameter: qc = True
[10 (quality check): skip=not qc]
print(f"{step_name} is executed")

default_10 is executed


but will not be executed if the workflow is executed with option `--no-qc` (for `qc=False`)

In [2]:
%run --no-qc
parameter: qc = True
[10 (quality check): skip=not qc]
print(f"{step_name} is executed")

INFO: quality check is ignored due to skip option.


### Option `sigil`
 <a id="Option_sigil"></a>
Option `sigil` accepts a string for an alternative sigil, or `None` to disable string interpolation in the step. The sigil must be two strings separated by a space, such as `%( )`, `< >`, and `#{ }`. Sigils with identical left and right symbols such as `# #` can be used, although they do not support features such as nested interpolation. Please refer to section [SoS Syntax](SoS_Syntax.html) for details of this option.

### Option `shared` <a id="Option_shared"></a>

SoS executes each step in a separate process and by default does not return any result to the master SoS process. Option `shared` is used to share variables between steps. This option accepts:

* A string (variable name),
* A map between variable names and expressions (strings) that will be evaluated upon the completion of the step, or
* A sequence of strings (variables) or maps.

For example,

In [1]:
%run
[10: shared='myvar']
myvar = 100

[20]
print(myvar)

100


A map syntax is recommended to share `step_output` of one step with others, because the variable assignment will be evaluated only after the step is complete:

In [2]:
%sandbox
%run
[1: shared = {'test_output': 'step_output'}]
output: 'a.txt'
sh:
    touch a.txt
[2]
print(f"Input file {test_output}")
input: test_output

Input file a.txt


The values of the map are evaluated as expressions, so it is possible to control more precisely which specific output, or variation of output, is shared with other steps. For example:

In [3]:
%sandbox
%run
[1: shared={'test_output_1':'step_output[0]', 'test_output_2': 'step_output[1]'}]
output: 'a.txt', 'b.txt'
sh:
    touch a.txt b.txt
[2]
print(f"output 1: {test_output_1}")
print(f"output 2: {test_output_2}")

output 1: a.txt
output 2: b.txt


This shares individual files of `step_output` (such as `step_output[0]`) instead of the entire output file list.

The `shared` option also provides a `sos_variable` target.

Things become more complicated when there are multiple substeps. For example, when you use option `shared` on the following step with 10 substeps, only one of the random seeds is returned because `seed` represents the last value of the variable after the completion of all substeps.

In [4]:
%run
[1: shared='seed']
input: for_each={'i': range(10)}
import random
seed = random.randint(0, 1000)


[2]
print(seed)

450


If you would like to see the variable in all substeps, you can prefix the variable name with `step_`

In [5]:
%run
[1: shared='step_seed']
input: for_each={'i': range(10)}
import random
seed = random.randint(0, 1000)


[2]
print(step_seed)

[858, 513, 328, 610, 142, 275, 458, 57, 762, 981]


You can also use the `step_*` variables in expressions, as in the following example:

In [8]:
%run
[1: shared={'summed': 'sum(step_rng)', 'rngs': 'step_rng'}]
input: for_each={'i': range(10)}
import random
rng = random.randint(0, 10)


[2]
print(rngs)
print(summed)

[10, 0, 8, 1, 8, 9, 6, 7, 9, 1]
59
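The difference between sharing `seed` and `step_seed` can be pictured in plain Python. This is only an illustrative sketch of the semantics described above, not how SoS implements `shared`; the function `run_substeps` and the arithmetic used to produce values are hypothetical stand-ins for the random numbers.

```python
def run_substeps(values):
    """Mimic a step in which every substep assigns the variable `seed`."""
    seed = None        # what shared='seed' would expose: the last value only
    step_seed = []     # what shared='step_seed' would expose: one value per substep
    for value in values:
        seed = value * 7 % 10   # deterministic stand-in for random.randint
        step_seed.append(seed)
    return seed, step_seed


last_seed, all_seeds = run_substeps(range(5))
print(last_seed)        # 8
print(all_seeds)        # [0, 7, 4, 1, 8]
print(sum(all_seeds))   # 20, comparable to the expression 'sum(step_rng)'
```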


Variables generated by external tasks add another layer of complexity because tasks usually do not share variables with the substep they belong to. To solve this problem, you will have to use the `shared` option of `task` to return the variable to the substep:

In [11]:
%run
[1: shared={'summed': 'sum(step_rng)', 'rngs': 'step_rng'}]
input: for_each={'i': range(10)}

task: shared='rng'
import random
rng = random.randint(0, 10*i)


[2]
print(rngs)
print(summed)


[0, 2, 12, 10, 24, 9, 51, 39, 79, 4]
230


### Option `provides` <a id="Option_provides"></a>

This option lists files or targets that a step generates so that the step can be called if a target is required but does not exist. Steps with option `provides` are called **auxiliary steps** and are the building blocks of makefile-style workflows.

Option `provides` accepts

* A filename or file pattern such as `"{sample}.bam.idx"`
* Other types of targets such as `executable("ms")`
* A list (sequence) of one or more file patterns and targets.

A file pattern is a filename with optional patterns with variable names enclosed in `{ }`. SoS matches filenames with the patterns and, if successful, assigns the matched parts of the names to the variables. For example,

```
[compress: provides = '{filename}.bam']
```

would be triggered with target `sample_A.bam` and `sample_B.bam`. When the step is triggered by `sample_A.bam`, it defines variable `filename` as `sample_A` and sets the output of the step as `sample_A.bam`. Please check tutorial [`Auxiliary Steps`](../tutorials/Auxiliary_Steps.ipynb) for details.
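The matching of a target against a file pattern can be approximated with a regular expression. The following is a minimal sketch in plain Python, assuming each `{name}` placeholder matches one or more characters; the helper `match_pattern` is hypothetical and is not part of SoS.

```python
import re


def match_pattern(pattern, target):
    """Return a dict of wildcard values if target matches pattern, else None."""
    # Splitting on {name} placeholders puts wildcard names at odd positions.
    parts = re.split(r'\{(\w+)\}', pattern)
    regex = ''
    for idx, part in enumerate(parts):
        if idx % 2:
            regex += f'(?P<{part}>.+)'   # named group for a wildcard
        else:
            regex += re.escape(part)     # literal text of the pattern
    matched = re.fullmatch(regex, target)
    return matched.groupdict() if matched else None


print(match_pattern('{filename}.bam', 'sample_A.bam'))  # {'filename': 'sample_A'}
print(match_pattern('{filename}.bam', 'sample_A.vcf'))  # None
```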

## Step Input <a id="Step_Input"></a>

### Input files

The input of a SoS step is determined by the following rules:

* **the input of a SoS step is by default the output of the previous step**, which is `None` for the first step.
* **`input:`**, which could be **a list** of filenames (string literal, variables, or expressions that return filenames).  Wildcard characters (`*`, which matches everything and `?`, which matches any single character) are acceptable. Nested lists are flattened.
* **`from_steps`**, output from specified steps.

Examples of input specification are as follows:

```
input: []

input: 'file1.fasta', 'file2.fasta'

input: 'file*.fasta', 'filename with space.fasta'

input:
    'file*.txt',
    'directory/file2.txt'

input: aligned_reads

input: aligned_reads, reference_genome

input: aligned_reads[2:]

input: 'data/*.fastq'

input: '*/GXT*.fastq'

input: func(parameter)
```

It is worth noting that

* The first example shows that the step does not need any input file (so it does not depend on any other step).
* It does not matter if `aligned_reads` and `reference_genome` are strings or lists of strings because SoS will flatten nested lists to a single list of filenames.
* The `input` option tries to expand filenames with wildcard characters (`*` and `?`). This can be very useful for workflows that, for example, regularly scan a directory and process unprocessed files. However, because the value of this step depends on availability of files, the output of `sos show script` and the execution path will be unpredictable, and even wrong if there is no available file during the execution of `sos show script`.

The input files will be evaluated and form a list of input files. They are by default sent to the step process all at once as variable `_input`, but can also be sent in groups, each time
with a different `_input`. Here `_input` is a temporary variable that is available only within the step.

### Option `from_steps`

Option `from_steps` specifies one or more step names and takes their output as the input of the present step. These inputs are appended to the input files specified for the step. For example, the `step_input` of `step_30` of the following example is `c.txt` (from `step_30`, directly specified), `a.txt` (from `step_10`), and `b.txt` (from `step_20`).

In [2]:
!touch c.txt
%run
[step_10]
output: 'a.txt'
_output.touch()

[step_20]
output: 'b.txt'
_output.touch()


[step_30]
input: 'c.txt', from_steps=['step_10', 'step_20']
print(step_input)


c.txt a.txt b.txt


The source of input targets, meaning the step from which the targets are generated, can be traced by the `source` attribute of `step_input` and `_input`, and can be used to group inputs. For example, `group_by='source'` groups the input by its source steps.

In [5]:
!touch c.txt
%run
[step_10]
output: 'a.txt'
_output.touch()

[step_20]
output: 'b.txt'
_output.touch()


[step_30]
input: 'c.txt', from_steps=['step_10', 'step_20'], group_by='source'
print()
print(f'"step_input" {step_input} has source {step_input.source}')
print(f'substep "_input" {_input} has source {_input.source}')



"step_input" c.txt a.txt b.txt has source ['step_30', 'step_10', 'step_20']
substep "_input" c.txt has source ['step_30']

"step_input" c.txt a.txt b.txt has source ['step_30', 'step_10', 'step_20']
substep "_input" a.txt has source ['step_10']

"step_input" c.txt a.txt b.txt has source ['step_30', 'step_10', 'step_20']
substep "_input" b.txt has source ['step_20']


### Option `group_by` <a id="Option_group_by"></a>

SoS by default passes all input files to the step process as a single list. Option `group_by` passes input files in groups, each time with a subset of input files named `_input`. SoS supports the following `group_by` values:

| option | group by |
| --- | --- |
| `all` | all in a single group, the default |
| `single` | individual target |
| `pairs` | match first half of files with the second half, take one from each half each time |
| `combinations` | all unordered combinations of 2-sets |
| `pairwise` | all adjacent 2-sets |
| `source` | by source of input |
| `pairsource` | pair input files by their sources and take one from each source each time |
| `N` = `1`, `2`, ... | chunks of size `N` |
| `pairsN`, `N`=`2`, `3`, ... | match first half of files with the second half, take `N` from each half each time | 
| `pairsourceN`, `N`=`2`, `3`, ... | pair input files by their sources and take `N` from each source each time |
| `pairwiseN`, `N`=`2`, `3`, ...| all adjacent pairs of chunks, where each chunk has `N` items |
| `combinationsN`, `N`=`2`, `3`, ... | all unordered combinations of `N` items |


For example, with the following sos script

In [8]:
%sandbox
!touch file1 file2 file3 file4
%run
[group]
input: 'file1', 'file2', 'file3', 'file4', group_by='pairwise'
print(f"{_input}")

file1 file2
file2 file3
file3 file4


It is very important to understand that the `group_by` option does not change variable `step_input`. It creates a looping variable `_input` that changes with each substep. To demonstrate more acceptable values, the following example uses the `sos_run` action to execute a step with different grouping methods.

In [2]:
%sandbox
!touch file1 file2 file3 file4
%run

[group]
parameter: group = str
print(f"group_by={group}")
input: 'file1', 'file2', 'file3', 'file4', group_by=group
print(f"{_index}: {_input}")

[default]
sos_run('group', group=1)
sos_run('group', group=2)
sos_run('group', group='single')
sos_run('group', group='pairs')
sos_run('group', group='pairwise')
sos_run('group', group='combinations')
sos_run('group', group='combinations3')

group_by=1
0: file1
1: file2
2: file3
3: file4
group_by=2
0: file1 file2
1: file3 file4
group_by=single
0: file1
1: file2
2: file3
3: file4
group_by=pairs
0: file1 file3
1: file2 file4
group_by=pairwise
0: file1 file2
1: file2 file3
2: file3 file4
group_by=combinations
0: file1 file2
1: file1 file3
2: file1 file4
3: file2 file3
4: file2 file4
5: file3 file4
group_by=combinations3
0: file1 file2 file3
1: file1 file2 file4
2: file1 file3 file4
3: file2 file3 file4
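The groupings printed above can be reproduced with a few lines of plain Python. This is a sketch of the grouping logic only, written to match the output shown; the function `group` is hypothetical and SoS's actual implementation may differ.

```python
from itertools import combinations


def group(files, how):
    """Split a list of files into groups, imitating a few group_by values."""
    if how == 'all':
        return [list(files)]
    if how == 'single':
        return [[f] for f in files]
    if how == 'pairs':
        # match the first half with the second half, one from each
        half = len(files) // 2
        return [list(p) for p in zip(files[:half], files[half:])]
    if how == 'pairwise':
        # all adjacent pairs
        return [list(p) for p in zip(files, files[1:])]
    if how == 'combinations':
        return [list(c) for c in combinations(files, 2)]
    if isinstance(how, int):
        # chunks of a fixed size
        return [files[i:i + how] for i in range(0, len(files), how)]
    raise ValueError(f'unsupported grouping: {how}')


files = ['file1', 'file2', 'file3', 'file4']
for how in (2, 'pairs', 'pairwise'):
    print(how, group(files, how))
```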


We did not include options `pairsN` and `pairwiseN` in the example because we need more input files to see what is going on. As you can see from the following example, the `N` groups input targets as small groups of size `N` before `pairs` and `pairwise` are applied.

In [3]:
%sandbox
!touch A1 B1 A2 B2 A3 B3 A4 B4
%run

[group]
parameter: group = str
print(f"group_by={group}")
input: 'A1', 'B1', 'A2', 'B2', 'A3', 'B3', 'A4', 'B4', group_by=group
print(f"{_index}: {_input}")

[default]
sos_run('group', group='pairs2')
sos_run('group', group='pairwise2')


group_by=pairs2
0: A1 B1 A3 B3
1: A2 B2 A4 B4
group_by=pairwise2
0: A1 B1 A2 B2
1: A2 B2 A3 B3
2: A3 B3 A4 B4
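The effect of the `N` suffix can be sketched in plain Python as "chunk first, then pair". The helpers `chunks`, `pairs_n`, and `pairwise_n` below are hypothetical names used for illustration, written to reproduce the output above rather than to mirror SoS internals.

```python
def chunks(items, size):
    """Split items into consecutive chunks of the given size."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def pairs_n(items, size):
    """Pair chunks of the first half with chunks of the second half."""
    blocks = chunks(items, size)
    half = len(blocks) // 2
    return [a + b for a, b in zip(blocks[:half], blocks[half:])]


def pairwise_n(items, size):
    """Join each chunk with the chunk that follows it."""
    blocks = chunks(items, size)
    return [a + b for a, b in zip(blocks, blocks[1:])]


targets = ['A1', 'B1', 'A2', 'B2', 'A3', 'B3', 'A4', 'B4']
print(pairs_n(targets, 2))
print(pairwise_n(targets, 2))
```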


Finally, as we recall from the `source` attribute of `sos_targets`, input targets can have a `source` of the present step (if specified directly) or of a previously executed step (if taken from its output). `group_by` allows you to group input by source (`group_by='source'`) or to pair sources (`group_by='pairsource'` and `group_by='pairsourceN'`). For example,

In [3]:
!touch c1 c2 c3 c4
%run
[step_10]
output: 'a1', 'a2', 'a3', 'a4'
_output.touch()

[step_20]
output: 'b1', 'b2', 'b3', 'b4'
_output.touch()


[group]
parameter: group = str
print(f"group_by={group}")
input: 'c1', 'c2', 'c3', 'c4', from_steps=['step_10', 'step_20'], group_by=group
print(f"{_index}: {_input} from {_input.source}")

[default]
sos_run('group', group='source')
sos_run('group', group='pairsource')
sos_run('group', group='pairsource2')

group_by=source
0: c1 c2 c3 c4 from ['group', 'group', 'group', 'group']
1: a1 a2 a3 a4 from ['step_10', 'step_10', 'step_10', 'step_10']
2: b1 b2 b3 b4 from ['step_20', 'step_20', 'step_20', 'step_20']
group_by=pairsource
0: c1 a1 b1 from ['group', 'step_10', 'step_20']
1: c2 a2 b2 from ['group', 'step_10', 'step_20']
2: c3 a3 b3 from ['group', 'step_10', 'step_20']
3: c4 a4 b4 from ['group', 'step_10', 'step_20']
group_by=pairsource2
0: c1 c2 a1 a2 b1 b2 from ['group', 'group', 'step_10', 'step_10', 'step_20', 'step_20']
1: c3 c4 a3 a4 b3 b4 from ['group', 'group', 'step_10', 'step_10', 'step_20', 'step_20']


### Option `for_each` <a id="Option_for_each"></a>

Option `for_each` allows you to repeat step process for each value of a variable. For example,

In [11]:
%sandbox
!touch file1 file2
%run

method = ['m1', 'm2']
input: 'file1', 'file2', for_each='method'
print(f"{_index}: {_input} {_method}")

0: file1 file2 m1
1: file1 file2 m2


will repeat the step with each item of variable `method`

SoS automatically creates a loop variable `_method` for variable `method`, which holds one item of the variable at each iteration.

Nested loops are also allowed. For example,

In [12]:
%sandbox
!touch file1 file2
%run
[0]
method = ['m1', 'm2']
pars = [1, 2]
input: 'file1', 'file2', for_each=['method', 'pars']
print(f"{_index}: _input={_input} _method={_method}, _pars={_pars}")

0: _input=file1 file2 _method=m1, _pars=1
1: _input=file1 file2 _method=m2, _pars=1
2: _input=file1 file2 _method=m1, _pars=2
3: _input=file1 file2 _method=m2, _pars=2


If you would like to loop the process with several parameters, you can put them into the same level by 'var1,var2'. For example,

In [13]:
%sandbox
%run
!touch file1 file2

[0]
method = ['m1', 'm2']
pars = [1, 2]
input: 'file1', 'file2', for_each=['method,pars']
print(f"{_index}: _input={_input} _method={_method}, _pars={_pars}")

0: _input=file1 file2 _method=m1, _pars=1
1: _input=file1 file2 _method=m2, _pars=2
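In plain Python terms, the two forms correspond roughly to a nested loop and to `zip`. The sketch below is an analogy written to reproduce the outputs above (note that the first listed variable changes fastest in the nested case); it is not SoS code.

```python
methods = ['m1', 'm2']
pars = [1, 2]

# for_each=['method', 'pars']: every combination, first variable varies fastest
nested = [(m, p) for p in pars for m in methods]

# for_each=['method,pars']: variables advance together, like zip
same_level = list(zip(methods, pars))

print(nested)      # [('m1', 1), ('m2', 1), ('m1', 2), ('m2', 2)]
print(same_level)  # [('m1', 1), ('m2', 2)]
```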


The variable passed to option `for_each` can be a sequence (`list`, `tuple`, `set`, etc.), or a Pandas `Series`, `Index`, or `DataFrame`. In the last case, each `_loop` variable represents a row of the dataframe and you can access single values using the format `_loop["header"]`. For example

In [14]:
%sandbox
%preview data
%run
[0]
import pandas as pd
data = pd.DataFrame([(1, 2, 'Hello'), (2, 4, 'World')], columns=['A', 'B', 'C'])
input: for_each='data'
output: f"{_data['A']}_{_data['B']}_{_data['C']}.txt"
sh: expand=True
    touch {_output}

If you would like to define your own loop variable, or if the default loop variable does not work (e.g. looping through `obj.sequence`, where `_obj.sequence` is not a valid variable name), you can use a dictionary syntax in the format of `{'varname': sequence}`. Multi-variable and nested loops can be specified in the format of `{'var1': seq1, 'var2': seq2}` (same level) and `[{'var1': seq1}, {'var2': seq2}]` (nested).

For example, the first example for this parameter can be written as

In [15]:
%sandbox
!touch file1 file2

input: 'file1', 'file2', for_each={'method': ['m1', 'm2']}
print(f"{_index}: {_input} {method}")

0: file1 file2 m1
1: file1 file2 m2


and a later example can be written as

In [16]:
%sandbox
!touch file1 file2
%run
[0]
input: 'file1', 'file2', for_each={'method': ['m1','m2'],
                                   'pars': [1, 2]}
print(f"{_index}: _input={_input} method={method}, pars={pars}")

0: _input=file1 file2 method=m1, pars=1
1: _input=file1 file2 method=m2, pars=2


The dictionary syntax also supports multiple keys. This helps customize groups of variables. For example, in the script below we only care about situations where `n` is greater than `p`,

In [17]:
%sandbox
!touch a.txt
%run
[1]
import itertools
parameter: n = [100, 300]
parameter: p = [50, 100, 200]
parameter: outfile = ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt', '6.txt']
input: 'a.txt', for_each= {'_n,_p': [(_n,_p) for _n in n for _p in p if _n > _p]}
print(f"{_index} {outfile[_index]} {_n} {_p}")

0 1.txt 100 50
1 2.txt 300 50
2 3.txt 300 100
3 4.txt 300 200


### Option `paired_with` <a id="Option_paired_with"></a>

Option `paired_with` pairs variables with step input (variable `step_input`) so that corresponding information is available for substeps (`_input`). For example,

In [18]:
%sandbox
!mkdir case ctrl
!touch case/A1.bam case/A2.bam ctrl/A1.bam ctrl/A2.bam

bam_files = ['case/A1.bam', 'case/A2.bam', 'ctrl/A1.bam', 'ctrl/A2.bam']
mutated = ['case', 'case', 'ctrl', 'ctrl']
sample_name = ['A1', 'A2', 'A1', 'A2']

input: bam_files, paired_with=['mutated', 'sample_name'], group_by=1
print(f"{_index}: _input={_input} _mutated={_mutated}, _sample_name={_sample_name}")

0: _input=case/A1.bam _mutated=['case'], _sample_name=['A1']
1: _input=case/A2.bam _mutated=['case'], _sample_name=['A2']
2: _input=ctrl/A1.bam _mutated=['ctrl'], _sample_name=['A1']
3: _input=ctrl/A2.bam _mutated=['ctrl'], _sample_name=['A2']


The paired variables stay with the input files, so we get different paired variables with different grouping methods

In [19]:
%sandbox
!mkdir case ctrl
!touch case/A1.bam case/A2.bam ctrl/A1.bam ctrl/A2.bam

bam_files = ['case/A1.bam', 'case/A2.bam', 'ctrl/A1.bam', 'ctrl/A2.bam']
mutated = ['case', 'case', 'ctrl', 'ctrl']
sample_name = ['A1', 'A2', 'A1', 'A2']

input: bam_files, paired_with=['mutated', 'sample_name'], group_by='pairs'
print(f"{_index}: _input={_input} _mutated={_mutated}, _sample_name={_sample_name}")

0: _input=case/A1.bam ctrl/A1.bam _mutated=['case', 'ctrl'], _sample_name=['A1', 'A1']
1: _input=case/A2.bam ctrl/A2.bam _mutated=['case', 'ctrl'], _sample_name=['A2', 'A2']
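One way to picture why paired variables follow the grouping is to attach the values to each file before grouping and split them apart afterwards. The plain-Python sketch below reproduces the `group_by='pairs'` output above; the helper `pairs` and the `records` list are hypothetical and only illustrate the idea.

```python
def pairs(items):
    """Match the first half of items with the second half."""
    half = len(items) // 2
    return [[a, b] for a, b in zip(items[:half], items[half:])]


bam_files = ['case/A1.bam', 'case/A2.bam', 'ctrl/A1.bam', 'ctrl/A2.bam']
mutated = ['case', 'case', 'ctrl', 'ctrl']
sample_name = ['A1', 'A2', 'A1', 'A2']

# Attach the paired values to each file so they travel with it.
records = list(zip(bam_files, mutated, sample_name))

substeps = []
for grp in pairs(records):
    _input = [r[0] for r in grp]
    _mutated = [r[1] for r in grp]
    _sample_name = [r[2] for r in grp]
    substeps.append((_input, _mutated, _sample_name))
    print(_input, _mutated, _sample_name)
```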


Values of option `paired_with` are usually lists of the same length as `step_input`, but they can also be of other types such as `paths` and `sos_targets`. In this case the iterator variables (e.g. `_mutated` for `mutated`) will have the same type as the input variable. For example,

In [3]:
%sandbox
!touch 1.txt 2.txt 3.txt

anno = paths('1.ann', '2.ann', '3.ann')
input: '1.txt', '2.txt', '3.txt', group_by='pairwise',
       paired_with='anno'
sh: expand=True
  echo 'Pair {_anno} of type {type(_anno).__name__} with {_input}'

Pair 1.ann 2.ann of type paths with 1.txt 2.txt
Pair 2.ann 3.ann of type paths with 2.txt 3.txt


If the variable you would like to pair has a name with `.` (e.g. `align.output`, for which the name of the looping variable `_align.output` is invalid), or if you would like to control the name of the looping variable, you can use the full format of this parameter, `{var_name: var_value}`.

In [20]:
%sandbox
!mkdir case ctrl
!touch case/A1.bam case/A2.bam ctrl/A1.bam ctrl/A2.bam

bam_files = ['case/A1.bam', 'case/A2.bam', 'ctrl/A1.bam', 'ctrl/A2.bam']
input: bam_files, paired_with={
    'mutated': ['case', 'case', 'ctrl', 'ctrl'],
    'sample_name': ['A1', 'A2', 'A1', 'A2']
    }, group_by=1
print(f"{_index}: _input={_input} mutated={mutated}, sample_name={sample_name}")

0: _input=case/A1.bam mutated=['case'], sample_name=['A1']
1: _input=case/A2.bam mutated=['case'], sample_name=['A2']
2: _input=ctrl/A1.bam mutated=['ctrl'], sample_name=['A1']
3: _input=ctrl/A2.bam mutated=['ctrl'], sample_name=['A2']


### Option `group_with` <a id="Option_group_with"></a>

Similar to option `paired_with`, which associates variables with input files, option `group_with` associates items of a sequence with each substep. This option is applied after `group_by` and before `for_each`, which means that the length of the sequence should equal the number of substeps, and the variables will be the same for each `for_each` loop. Also similar to option `paired_with`, option `group_with` can take a string (name of variable) or a dictionary.

Using the above example, you can assign a label for each group by passing name of a sequence variable

In [21]:
%sandbox
!mkdir case ctrl
!touch case/A1.bam case/A2.bam ctrl/A1.bam ctrl/A2.bam

mutated = ['case', 'ctrl']
bam_files = ['case/A1.bam', 'case/A2.bam', 'ctrl/A1.bam', 'ctrl/A2.bam']
input: bam_files, group_by=2, group_with='mutated'
print(f"{_index}: _input={_input} _mutated={_mutated}")

0: _input=case/A1.bam case/A2.bam _mutated=case
1: _input=ctrl/A1.bam ctrl/A2.bam _mutated=ctrl


or a dictionary with variable name and values:

In [22]:
%sandbox
!mkdir case ctrl
!touch case/A1.bam case/A2.bam ctrl/A1.bam ctrl/A2.bam

bam_files = ['case/A1.bam', 'case/A2.bam', 'ctrl/A1.bam', 'ctrl/A2.bam']
input: bam_files, group_by=2, group_with={'mutated': ['case', 'ctrl']}
print(f"{_index}: _input={_input} mutated={mutated}")

0: _input=case/A1.bam case/A2.bam mutated=case
1: _input=ctrl/A1.bam ctrl/A2.bam mutated=ctrl


### Option `pattern` <a id="Option_pattern"></a>

This option does the reverse of function `expand_pattern`. It uses named wildcards to match pattern to all input files, and creates step variables for these wildcard objects. For example,

In [23]:
%sandbox
!touch a-20.txt b-10.txt
%run
[step]
input:  'a-20.txt', 'b-10.txt', pattern = '{name}-{par}.txt'
output: expand_pattern("{name}-processed-{par}.txt")
sh: expand=True
    echo {_output}
    touch {_output}

a-processed-20.txt b-processed-10.txt


will take all input files and extract `name` and `par` from each file name as variables `name` and `par`. These variables are then used to create output file names by adding the word `processed` between the wildcard values. The SoS script above creates files `a-processed-20.txt` and `b-processed-10.txt`.

When wildcard objects are accessed as step variables, variable names both with and without the `_` prefix are available; in this example, `_name` and `name`, and `_par` and `par`, are available and are the same. The two conventions differ only when `group_by` or `for_each` is also used, in which case the generated pattern variables `name` and `par` are automatically paired with `input` as if they had been paired using option `paired_with=['name', 'par']`.

In [24]:
%sandbox
!touch a-20.txt b-10.txt
%run
[step]
input:  'a-20.txt', 'b-10.txt', pattern = '{name}-{par}.txt', group_by=1
output: expand_pattern("{_name}-processed-{_par}.txt")
sh: expand=True
    echo {_output}
    touch {_output}

a-processed-20.txt
b-processed-10.txt
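The extract-then-expand round trip can be imitated in plain Python with a regular expression and `str.format`. This is a sketch under the assumption that the wildcards contain no `-` characters; `PATTERN` and `processed_name` are hypothetical names and do not correspond to SoS functions.

```python
import re

# Regular-expression equivalent of the pattern '{name}-{par}.txt'
PATTERN = re.compile(r'(?P<name>[^-]+)-(?P<par>[^-]+)\.txt')


def processed_name(filename):
    """Extract name and par from filename and build the output name."""
    matched = PATTERN.fullmatch(filename)
    if matched is None:
        raise ValueError(f'{filename} does not match the pattern')
    return '{name}-processed-{par}.txt'.format(**matched.groupdict())


outputs = [processed_name(f) for f in ['a-20.txt', 'b-10.txt']]
print(outputs)  # ['a-processed-20.txt', 'b-processed-10.txt']
```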


### Option `concurrent` <a id="Option_concurrent"></a>

Substeps of a step are by default executed sequentially with potential dependencies. For example,

In [6]:
sum = 0
input: for_each={'i': range(4)}
sum += i
print(f'sum is {sum} at index {_index}')

sum is 0 at index 0
sum is 1 at index 1
sum is 3 at index 2
sum is 6 at index 3


However, if substeps are independent, you can use option `concurrent=True` to execute substeps in parallel. As you can see from the following example, all substeps start with `sum=0` and are executed in parallel.

In [7]:
sum = 0
import time
start_time = time.time()
input: for_each={'i': range(4)}, concurrent=True
sum += i
time.sleep(4-i)
print(f'sum is {sum} at index {_index}, completed in {time.time() - start_time:.1f} seconds')

sum is 0 at index 0, completed in 4.0 seconds
sum is 1 at index 1, completed in 3.0 seconds
sum is 2 at index 2, completed in 2.0 seconds
sum is 3 at index 3, completed in 1.0 seconds


Note that `concurrent=True` is ignored if a step contains `task` because tasks will be executed in parallel anyway.
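The two behaviours can be contrasted in plain Python: a sequential loop accumulates into one variable, whereas independent workers each start from the same initial value. The sketch below uses threads from the standard library purely for brevity; how SoS actually dispatches concurrent substeps is not shown here, and `substep` is a hypothetical function.

```python
from concurrent.futures import ThreadPoolExecutor


def substep(i, total=0):
    """Each concurrent substep works on its own copy of the initial total."""
    return total + i


# Sequential substeps share and update one variable.
running_total = 0
sequential = []
for i in range(4):
    running_total += i
    sequential.append(running_total)

# Concurrent substeps do not see each other's updates.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(substep, range(4)))

print(sequential)  # [0, 1, 3, 6]
print(parallel)    # [0, 1, 2, 3]
```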

### `dynamic` input files

In order to determine the best execution strategy, SoS evaluates all expressions for all steps before the execution of a workflow to figure
out input and output of steps. This works most of the time but sometimes the input of a step can only be determined at runtime. For example,
if you would like your workflow to automatically scan an input directory and process all fasta files under it, or if a previous step produces
files that cannot be determined beforehand, you can specify input files as follows,

```python
input: 'input/*.fasta'
```

The problem is that no file or a wrong set of files might exist during the planning stage, so SoS might skip this step or start the step
with a wrong set of files. To address this problem, you can declare the input files as **dynamic** by passing a `dynamic` object

```python
input: dynamic('input/*.fasta')
```

This tells SoS that the input of this step can only be determined at runtime and will execute the step only after all its previous
steps have been completed.
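The reason a wildcard must sometimes be resolved late can be shown with the standard library alone. In this sketch the same glob pattern is expanded once before and once after an earlier "step" has created its files; the temporary directory and file names are hypothetical, and the code only illustrates the timing issue, not SoS's `dynamic` implementation.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    pattern = os.path.join(workdir, '*.fasta')

    # "Planning" time: the previous step has not run, so nothing matches.
    planned = sorted(glob.glob(pattern))

    # The previous step now produces its files.
    for name in ('a.fasta', 'b.fasta'):
        open(os.path.join(workdir, name), 'w').close()

    # "Run" time: expanding the same pattern now finds the files.
    resolved = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(planned)   # []
print(resolved)  # ['a.fasta', 'b.fasta']
```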

### Summary

Options of step `input` are evaluated in the following order:

1. A list of input files, if specified, would replace `input`, which is by default output from the previous step.
2. Option `group_by` groups the files into several groups, named `_input`
3. Option `for_each` repeats `_input` for each loop variable, named `_loopvar` if `for_each='loopvar'`.
4. Option `paired_with` pairs one or more variables with `input`. If `paired_with='paired'`, variable `paired` is paired with `input`
    and variable `_paired` is paired with `_input` in each loop.
5. Option `pattern` extracts variables from filenames in `input`. If `pattern='{extracted}_other_part'`, variable `extracted` is paired with `input`
    and variable `_extracted` is paired with `_input` in each loop.
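How `group_by` and `for_each` combine can be sketched in plain Python as "group first, then repeat every group for each loop value". The order in which the combined substeps are listed below is an assumption made for illustration; only the number of substeps and their contents are meant to reflect the description above.

```python
step_input = ['file1', 'file2', 'file3', 'file4']
methods = ['m1', 'm2']

# group_by=2: split the input into groups; step_input itself is unchanged
groups = [step_input[i:i + 2] for i in range(0, len(step_input), 2)]

# for_each: every group is repeated for each loop value (ordering assumed)
substeps = [(group, method) for method in methods for group in groups]

for index, (_input, _method) in enumerate(substeps):
    print(index, _input, _method)
```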

The differences between looped and non-looped steps are summarized in the following figure

![step_loop](../media/step_loop.jpg)

## Step output <a id="Step_output"></a>

### Output files <a id="Output_files"></a>

Output files of a step can be specified by step `output`. Similar to `input`, step output accepts strings, variables, expressions, and allows wildcard characters. For example, the following are acceptable output files

```python
output:  []

output:  'accepted_hits.bam'

output:  aligned_reads, bam_stats

output:  'aligned/*.bam'

output:  expand_pattern('aligned_{samples}.bam')
```

In the last example, function `expand_pattern` is used to construct a list of files from items of a sequence `samples`.

Step process could be repeated multiple times with **substeps** defined by options `for_each` or `group_by`. Whereas `input` represents the complete set of input files, each substep has its own input files stored in variable `_input`.

When there is no substep, `output` and `_output` are the same. Otherwise, the `output:` statement produces `_output` for each substep, and `_output` for all substeps form `output` after the completion of all substeps.

For example, the following step accepts one or more bam files and indexes them using command `samtools index`. The input files are passed one by one and output from each substep is determined by `_input`. 

```python
[10]
input:
	bamfiles, group_by='single'

output:
	f'{_input}.bai'

run(f'''samtools index {_input} ''')
```

The use of variable `output` in this scenario is discouraged because `output`, as the collection of all `_output`, grows with each substep.

Specifying the output of the entire step when there are multiple substeps can lead to errors. For example,

```python
[10]
input:
	bamfiles, group_by=1

output:
	[x + '.bai' for x in bamfiles]

task:
run(f'''samtools index {_input} ''')
```

will fail because every task in this step declares the same set of output files (`_output` is identical for all substeps), yet no single task produces all of the specified output files.


### Option `group_by` <a id="Option_group_by"></a>

As we have explained above, `_output` of a substep should be determined from `_input`. However, there are cases where `input` and `output` are pre-determined and it is not easy to derive `_output` from `_input`. In this case, option `group_by` can be used to divide the specified outputs among substeps.

For example, the above example would work with a `group_by` output option

```python
[10]
input:
	bamfiles, group_by=1

output:
	[x + '.bai' for x in bamfiles], group_by=1

task:
run(f'''samtools index {_input} ''')
```
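The effect of grouping both input and output can be pictured in plain Python as splitting the two lists in the same way and matching the pieces by position. This is an illustrative sketch with hypothetical file names and a hypothetical helper `chunks`; it does not run `samtools` or use any SoS API.

```python
def chunks(items, size):
    """Split items into consecutive chunks of the given size."""
    return [items[i:i + size] for i in range(0, len(items), size)]


bamfiles = ['a.bam', 'b.bam', 'c.bam']
outputs = [x + '.bai' for x in bamfiles]

# With group_by=1 on both sides, substep k gets the k-th input and output group.
substeps = list(zip(chunks(bamfiles, 1), chunks(outputs, 1)))

for _input, _output in substeps:
    print(_input, '->', _output)
```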


### `dynamic` output files

Similar to the case of [dynamic input files](#dynamically-determined-input-files-function-dynamic), the output of some steps cannot be determined beforehand. For example, the following script generates `html` files whose names cannot be determined during a dry run, 

In [25]:
%sandbox --expect-error
%run
[10]
output: '*.html'

import random
for i in range(2):
    run(f"touch result_{random.randint(1, 20)}.html")

touch result_16.html
touch result_10.html


Output target *.html does not exist after the completion of step default_10


In this case, you will need to define the output as `dynamic` using a `dynamic` function.

In [26]:
%sandbox
%run
[10]
output: dynamic('*.html')

import random
for i in range(2):
    run(f"touch result_{random.randint(1, 20)}.html")


touch result_15.html
touch result_7.html


In this case, SoS knows that the output can only be determined after the completion of the step.

## Step depends <a id="Step_depends"></a>

This item specifies files that are required for the step. Although not required, it is a good practice to list resource files and other dependency files for a particular step. For example

```python
[10]
input: fasta_files
depends: reference_seq, executable('fastqc')
```

Similar to `output` options, dependent files can also be defined after `input` options and consist of dependent files determined from loop variables. 

The following figure summarizes the effect of `input` and `output` options and input options `group_by` and `for_each` on the flow
of input and output files and related variables.

![Step options](../media/step_options.jpg)

## Step process <a id="Step_process"></a>

A step process is the Python statements that perform certain tasks and produce step output from step input. A step process can contain arbitrary Python statements. For example,

In [27]:
%sandbox
%run
[10]
output: 'a.txt'
with open(_output[0], 'w') as dest:
   dest.write('some text')

and

In [28]:
%sandbox
%run
[10]
output: 'a.txt'
sh: expand=True
    echo "some text" > {_output:q}

use inline (interpreted and executed by SoS) python code or shell script to generate `a.txt`. 

Step processes are executed within SoS and are executed sequentially. However, part or all of the step process can be executed externally and potentially in parallel as step `task`. This will be covered in detail in section [External task](External_task.html).