# Title

* **Difficulty level**: easy
* **Time needed to learn**: 10 minutes or less
* **Key points**:
  * a
  

### Output files <a id="Output_files"></a>

The output files of a step are specified with the `output` statement. Like `input`, the `output` statement accepts strings, variables, and expressions, and allows wildcard characters. For example, the following are all acceptable output specifications:

```python
output:  []

output:  'accepted_hits.bam'

output:  aligned_reads, bam_stats

output:  'aligned/*.bam'

output:  expand_pattern('aligned_{samples}.bam')
```

In the last example, the function `expand_pattern` constructs a list of files from the items of a sequence `samples`.
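To make the expansion concrete, here is a minimal pure-Python sketch of the idea. The helper `expand_single_pattern` and the sample names are hypothetical and only illustrate the concept; this is not SoS's `expand_pattern`, which in the example above receives only the pattern.

```python
def expand_single_pattern(pattern, **sequences):
    """Hypothetical helper: fill one {name} placeholder per item of a sequence."""
    # Exactly one named sequence is supported in this sketch.
    (name, items), = sequences.items()
    return [pattern.format(**{name: item}) for item in items]


samples = ['S1', 'S2', 'S3']  # assumed sample names
bam_files = expand_single_pattern('aligned_{samples}.bam', samples=samples)
print(bam_files)
```

Each item of `samples` yields one file name, so the length of the result equals the length of the sequence.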

A step process can be repeated multiple times as **substeps**, which are defined by options such as `for_each` or `group_by`. Whereas `input` represents the complete set of input files, each substep has its own input files stored in variable `_input`.

When there is no substep, `output` and `_output` are the same. Otherwise, the `output:` statement produces an `_output` for each substep, and the `_output` of all substeps together form `output` after all substeps have completed.

For example, the following step accepts one or more bam files and indexes them using the command `samtools index`. The input files are passed one by one, and the output of each substep is determined by `_input`.

```python
[10]
input:
	bamfiles, group_by='single'

output:
	f'{_input}.bai'

run(f'''samtools index {_input} ''')
```

The use of variable `output` in this scenario is discouraged because `output`, as the collection of all `_output`, grows with each substep.
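The relationship between `_output` and `output` can be pictured with a plain Python loop. This is only a toy model under stated assumptions: the file names are hypothetical, the substeps run sequentially here, and the list `step_output` merely plays the role of `output`.

```python
bamfiles = ['a.bam', 'b.bam', 'c.bam']  # hypothetical input files

step_output = []  # plays the role of `output`
sizes = []        # size of `output` as observed after each substep

for _input in bamfiles:          # one substep per file, as with group_by='single'
    _output = f'{_input}.bai'    # the per-substep output
    step_output.append(_output)  # `output` collects every `_output`
    sizes.append(len(step_output))

print(sizes)
print(step_output)
```

Because the collection differs from substep to substep, code inside a substep is better off referring to `_output`.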

Specifying the output of the entire step when there are multiple substeps can lead to an error. For example,

```python
[10]
input:
	bamfiles, group_by=1

output:
	[x + '.bai' for x in bamfiles]

task:
run(f'''samtools index {_input} ''')
```

will fail because every substep declares the same set of output files (equal `_output` for all substeps), and none of the tasks produces all of the specified output files.
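The mismatch can be illustrated without running a workflow. The sketch below is a plain Python model with hypothetical file names; it compares what each substep declares with the single index file that one `samtools index` call would write for its input.

```python
bamfiles = ['a.bam', 'b.bam']  # hypothetical input files

# What every substep declares as its _output in the failing example.
declared = [x + '.bai' for x in bamfiles]

mismatches = []
for _input in bamfiles:
    produced = [f'{_input}.bai']  # one index file per indexed bam file
    missing = sorted(set(declared) - set(produced))
    mismatches.append((_input, missing))

print(mismatches)
```

Every substep is missing at least one declared file, which is why the per-substep form `f'{_input}.bai'` is the appropriate specification.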


### Option `group_by` <a id="Option_group_by"></a>

The `output` statement usually generates a `sos_targets` object without groups. This object becomes `_output`, which then becomes one of the groups of `step_output`.

However, the `output` statement can also generate a `sos_targets` object with groups, in which case group `_index` of that object becomes the `_output` of substep `_index`. For this reason, the object should have the same number of groups as `step_input`.

For example, if you have lists of pre-determined input and output files and you would like to pair them, you can list all output files and use option `group_by` to create as many output groups as there are input groups.

In [5]:
%run
in_files = [f'a_{i}.txt' for i in range(4)]
out_files = [f'b_{i}.txt' for i in range(4)]

[1]
input: in_files, group_by=1
output: out_files, group_by=1
_output.touch()

[2]
print(_input)



b_0.txt
b_2.txt
b_3.txt
b_1.txt


Note that the previous example might not print `b_0.txt`, `b_1.txt`, and so on in order, because the substeps are executed concurrently and can complete the `print` statement in any order.
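The pairing of input and output groups can be sketched in plain Python. The helper `group_by_size` is hypothetical and only mimics the effect of `group_by=1` on a list; it is not part of SoS.

```python
def group_by_size(items, size):
    """Hypothetical helper: split items into consecutive groups of `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


in_files = [f'a_{i}.txt' for i in range(4)]
out_files = [f'b_{i}.txt' for i in range(4)]

input_groups = group_by_size(in_files, 1)
output_groups = group_by_size(out_files, 1)

# One output group is needed for each input group (substep).
assert len(input_groups) == len(output_groups)

pairs = list(zip(input_groups, output_groups))
for _index, (_input, _output) in enumerate(pairs):
    print(_index, _input, _output)
```

In this model, output group `_index` is matched with input group `_index`, which is the pairing described above.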

### Options `paired_with`, `group_with`, `pattern`, and `for_each`

Just like in `input` statements, options `paired_with`, `group_with`, `pattern`, and `for_each` are all allowed in the `output` statement. They can be used to create an `_output` that is passed to the next step or retrieved through `output_from` or `named_output`.

The most useful application of the `paired_with` and `group_with` options is to attach variables to output targets and output groups.

Say we know that particular input and output files are associated with a sample name, and we have paired each input file with a sample name using option `paired_with=dict(sample=samples)`. Consequently, `_input.sample` is available for each substep.

The output statement  
```
output: f'{_input}.result', paired_with=dict(sample=_input.sample)
```
generates `f'{_input}.result'` for each `_input`. The file name alone would lose the sample information, which is why the option `paired_with=dict(sample=_input.sample)` is added to attach attribute `sample` with value `_input.sample` to each output file. With this statement, each input file of step `20` has an attribute that points to its sample name.
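As a rough mental model, a target can be thought of as a path plus a dictionary of attributes. The `Target` class below is a hypothetical stand-in written for illustration, not the SoS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Target:
    """Hypothetical stand-in for a file target that can carry attributes."""
    path: str
    attrs: dict = field(default_factory=dict)


samples = ['A', 'B']
input_files = ['a.txt', 'b.txt']

# Analogous to paired_with=dict(sample=samples) on the input.
inputs = [Target(p, {'sample': s}) for p, s in zip(input_files, samples)]

# Analogous to paired_with=dict(sample=_input.sample) on the output:
# the attribute is copied from each input to its output.
outputs = [Target(f'{t.path}.result', {'sample': t.attrs['sample']}) for t in inputs]

messages = [f"{t.path} with sample name {t.attrs['sample']}" for t in outputs]
print('\n'.join(messages))
```

Without the second step, the outputs would carry only their paths and the downstream step could not recover the sample names.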

In [16]:
!touch a.txt b.txt

%run
[10]
samples = ['A', 'B']
input_files = ['a.txt', 'b.txt']

input: input_files, group_by=1, paired_with=dict(sample=samples)
output: f'{_input}.result', paired_with=dict(sample=_input.sample)
_output.touch()

[20]
print(f'{_input} with sample name {_input.sample}')



a.txt.result with sample name A
b.txt.result with sample name B


Similarly, in the following example,

```
output: f'a_{i}.txt', group_with=dict(i=i)
```
creates four `sos_targets` objects, each with variable `i` attached. These objects become the `_output` of step `10` and the `_input` of the next step. `i` becomes available in step `20` because it was attached to `_input`. This feature is powerful but can also be dangerous, because variables attached to `_input` "contaminate" the namespace of step `20`.
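The namespace effect can be pictured with dictionaries. This is a toy model only: the tuples and the `namespace` dictionary are hypothetical, and the choice that attached variables shadow existing names is an assumption of the sketch, not a statement about how SoS resolves such clashes.

```python
# Each group: (files in the group, variables attached to the group).
groups = [([f'a_{i}.txt'], {'i': i}) for i in range(4)]

# A name that already exists in the downstream step (hypothetical).
namespace = {'i': 'defined earlier'}

seen = []
for files, group_vars in groups:
    # In this sketch, variables attached to the group shadow existing names.
    local = {**namespace, **group_vars}
    seen.append(f"{files[0]} with variable i={local['i']}")

print('\n'.join(seen))
```

The earlier value of `i` is never observed inside the loop, which is the kind of silent override that makes attached variables worth using sparingly.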

In [13]:
%run
[10]
input: for_each=dict(i=range(4))
output: f'a_{i}.txt', group_with=dict(i=i)
_output.touch()

[20]
print(f'{_input} with variable i={i}')



a_0.txt with variable i=0
a_2.txt with variable i=2
a_1.txt with variable i=1
a_3.txt with variable i=3


### Inherited input with groups

Remember that the purpose of the `input` statement is to create `step_input` with appropriate targets and groups. But what happens to `sos_targets` inherited from the previous step, or obtained from `output_from` and `named_output`? The simple answer is that the existing groups will be honored.

Let us first look at one example,

In [10]:
%run
[10]
input: for_each=dict(i=range(3))
output: f'out_{i}.txt'
_output.touch()

[20]
print(_input)



out_0.txt
out_2.txt
out_1.txt


What happens here is that there are three substeps in step `10`, creating outputs `out_0.txt`, `out_1.txt` and `out_2.txt`. The `step_output` object contains three targets in three groups.

When this `step_output` is inherited by step `20` and becomes its `step_input`, step `20` is executed as three substeps, one per existing group, with each `_input` being the corresponding `_output` of step `10`.

If this is not what you want, you can regroup the input with option `group_by`. For example,
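The difference between honoring inherited groups and regrouping can be sketched with nested lists, where each inner list stands for one group. This is a plain Python illustration, not SoS code.

```python
# Three inherited groups, one per substep of the previous step.
inherited_groups = [['out_0.txt'], ['out_1.txt'], ['out_2.txt']]

# Default behavior: one substep per inherited group.
default_substeps = len(inherited_groups)

# Effect of group_by='all' in this model: all targets end up in one group.
regrouped = [[f for group in inherited_groups for f in group]]

print(default_substeps)
print(' '.join(regrouped[0]))
```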

In [11]:
%run
[10]
input: for_each=dict(i=range(3))
output: f'out_{i}.txt'
_output.touch()

[20]
input: group_by='all'
print(_input)



out_0.txt out_1.txt out_2.txt


The same logic applies to output obtained from `output_from` and `named_output`. In the following example, the `output_from` function inherits 4 groups of outputs from step `A` and 2 groups of outputs from step `B`. It then regroups the outputs from step `A` into 2 groups and merges them with the outputs from step `B`. The final `step_input` of step `20` therefore has two groups, each with 2 targets from step `A` and 1 target from step `B`.
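The regroup-and-merge logic can be traced with nested lists. The sketch below is a plain Python model of the description above, with each inner list standing for one group; it does not call SoS.

```python
a_groups = [[f'A_{i}.txt'] for i in range(4)]  # 4 groups inherited from step A
b_groups = [[f'B_{i}.txt'] for i in range(2)]  # 2 groups inherited from step B

# Regroup the outputs of A into groups of 2, as with group_by=2.
flat_a = [f for group in a_groups for f in group]
a_regrouped = [flat_a[i:i + 2] for i in range(0, len(flat_a), 2)]

# Merge group by group: group k of A is joined with group k of B.
merged = [a + b for a, b in zip(a_regrouped, b_groups)]

for group in merged:
    print(' '.join(group))
```

The merge works here because both sources end up with the same number of groups.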

In [12]:
%run
[A]
input: for_each=dict(i=range(4))
output: f'A_{i}.txt'
_output.touch()

[B]
input: for_each=dict(i=range(2))
output: f'B_{i}.txt'
_output.touch()

[20]
input: output_from('A', group_by=2), output_from('B')
print(_input)



A_0.txt A_1.txt B_0.txt
A_2.txt A_3.txt B_1.txt


It is worth noting that `named_output` can have groups as well. Because a step can produce multiple named outputs, `named_output('a')` essentially returns the subset of targets with source `a`, taken from each group that contains such targets. For example, step `K` generates 8 outputs in 4 groups. With `named_output('B')`, only the targets named `B` in each group are returned, so the `_input` of the substeps will be `B_0.txt`, `B_1.txt` and so on.
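Selecting a named subset while keeping the groups can be modeled with a list of dictionaries, one dictionary per group. This representation is an assumption made for illustration and does not reflect how SoS stores targets.

```python
# Four groups, each holding one target named A and one target named B.
groups = [{'A': [f'A_{i}.txt'], 'B': [f'B_{i}.txt']} for i in range(4)]

total_targets = sum(len(files) for g in groups for files in g.values())

# Analogous to named_output('B'): keep only targets named B, group by group.
selected = [g['B'] for g in groups]

print(total_targets)
print(selected)
```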

In [8]:
%run
[K]
input: for_each=dict(i=range(4))
output: A=f'A_{i}.txt', B=f'B_{i}.txt'
_output.touch()

[20]
input: named_output('B')
print(_input)



B_0.txt
B_2.txt
B_1.txt
B_3.txt


## Further reading

* 