# How to use named output in data-flow style workflows

* **Difficulty level**: easy
* **Time needed to learn**: 10 minutes or less
* **Key points**:
  * Output can be grouped by names, which can be referred to by `[name]`
  * Function `named_output(name)` refers to output with `name` in any step
  * Return value of `named_output(name)` can also have groups
  

## Limitations of basic dataflow-based workflows

In our tutorial on [How to define and execute basic SoS workflows](doc/user_guide/forward_workflow.html) we introduced basic dataflow-based workflows as follows:

In [2]:
!rm DEG.csv
%run plot

[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

[convert]
input: excel_file
output: csv_file

run: expand=True
    xlsx2csv {_input} > {_output}

[plot]
input: csv_file
output: figure_file

R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

Workflow `plot` (ID 09ca8ead69135592, index #2): completed, ran for 0 sec


xlsx2csv data/DEG.xlsx > DEG.csv



Basically, when the input of step `plot` (`csv_file`) is unavailable, SoS looks in the script for another step that generates this output. If it can be found, it will execute that step to produce the required input before step `plot` is executed.
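The lookup described above can be pictured with a toy model in plain Python. This is only a sketch of the idea, not how SoS is implemented: the dictionary `static_outputs`, the function `find_provider`, and the target `unknown.csv` are hypothetical names introduced for illustration.

```python
# Toy model only: this is NOT the SoS implementation.
# Each step declares its output as a fixed file name that can be read
# from the script without running anything.
static_outputs = {
    "convert": ["DEG.csv"],
    "plot": ["output.pdf"],
}


def find_provider(target, outputs):
    """Return the name of the step that declares `target`, or None."""
    for step, targets in outputs.items():
        if target in targets:
            return step
    return None


# The input of `plot` is missing, so look for a step that generates it.
provider = find_provider("DEG.csv", static_outputs)
print(provider)

# A target that no step declares cannot be resolved.
missing = find_provider("unknown.csv", static_outputs)
print(missing)
```

In this picture, the lookup only works because the output of `convert` is known before the step runs.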

A limitation of this kind of workflow is that the output of another step has to be determined "easily", either from the `output` statement itself or with variable definitions from the `global` section. The following workflow would fail because the output of step `convert` is defined as

```
output: _input.with_suffix('.csv')
```

which takes the `_input` of the step and replaces its suffix with `.csv`. Because `_output` depends on `_input`, SoS cannot tell from the `output` statement alone that this step generates `data/DEG.csv`, so the step is not selected to produce the input of step `plot`.
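As a rough illustration of why such an output is only known once the input is known, the following sketch uses `pathlib` from the Python standard library. It assumes that `_input.with_suffix` behaves like `pathlib`'s method of the same name for a single file; the variable names are chosen here for illustration.

```python
from pathlib import PurePosixPath

# The output name is computed from the input name, so it does not exist
# as a fixed string anywhere in the script.
excel_file = PurePosixPath("data/DEG.xlsx")
csv_file = excel_file.with_suffix(".csv")
print(csv_file)

# In pathlib, the new suffix must start with a dot; otherwise a
# ValueError is raised.
try:
    excel_file.with_suffix("csv")
    error = None
except ValueError as exc:
    error = str(exc)
print(error is not None)
```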

In [14]:
!rm -f data/DEG.csv
%run plot

[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'data/DEG.csv'
figure_file = 'output.pdf'

[convert]
input: excel_file
output: _input.with_suffix('.csv')

run: expand=True
    xlsx2csv {_input} > {_output}

[plot]
input: csv_file
output: figure_file

R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

Workflow `plot` (ID a6e8c3256681ffe1, index #14): failed, ran for 0 sec


ERROR: No step to generate target data/DEG.csv requested by plot


## Named output

Similar to the input of a step, the output of SoS steps can also be named. In the following example

* 4 substeps are defined with `i=0`, `1`, `2`, and `3`
* The output of each substep is `f'a_{i}.txt'` and `f'b_{i}.txt'` (`a_0.txt`, `b_0.txt` etc).
* The outputs of each substep are labeled with names `a` and `b`.
* The output of the entire step consists of the `_output` of all substeps, which becomes the `_input` of the next step. This is why step `20` can be used to examine the output of step `10`.

In [10]:
%run

[10]
input: for_each=dict(i=range(4))
output: a=f'a_{i}.txt', b=f'b_{i}.txt'
_output.touch()

[20]
print(f'{_input} with sources {_input.sources}')
print(_input['a'])

Workflow `default` (ID c05e8f20c645cdd0, index #10): completed, ran for 0 sec


a_0.txt b_0.txt with sources ['a', 'b']
a_0.txt
a_1.txt b_1.txt with sources ['a', 'b']
a_1.txt
a_2.txt b_2.txt with sources ['a', 'b']
a_2.txt
a_3.txt b_3.txt with sources ['a', 'b']
a_3.txt


As we can see, there are four substeps for step `20`. The `_input` of substeps has two files with names `a` and `b`, and we can refer to the targets with name `a` with `_input['a']`.
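The relationship between names and targets can be mimicked in plain Python. The class `NamedTargets` below is a hypothetical stand-in written for this illustration, not the class SoS uses; it only reproduces the lookup by name and the printed form shown above.

```python
# Hypothetical stand-in for the `_input` of one substep; not the SoS class.
class NamedTargets:
    def __init__(self, **named):
        # Keep (name, file) pairs in the order they were declared.
        self._items = list(named.items())

    @property
    def sources(self):
        """Names attached to the targets, in order."""
        return [name for name, _ in self._items]

    def __getitem__(self, name):
        """Return only the targets labeled with `name`."""
        return [target for label, target in self._items if label == name]

    def __str__(self):
        return " ".join(target for _, target in self._items)


# One group per substep, as produced by for_each over i = 0..3.
groups = [NamedTargets(a=f"a_{i}.txt", b=f"b_{i}.txt") for i in range(4)]
for group in groups:
    print(f"{group} with sources {group.sources}")
    print(" ".join(group["a"]))
```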

## Function `named_output` <a id="named_output"></a>

 <div class="bs-callout bs-callout-primary" role="alert">
    <h4>Function <code>named_output(name, group_by, ...)</code></h4>
    <p>Function <code>named_output</code> refers to the named output of any SoS step defined in the script. Using <code>named_output</code> in the <code>input</code> statement of a step creates a dependency on the step with the named output, and inserts the named output as input of the step.</p>
 </div>

The problem we had with complex output can be resolved by function `named_output()`. For example, the aforementioned workflow can be written as

In [13]:
!rm -f data/DEG.csv
%run plot

[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'data/DEG.csv'
figure_file = 'output.pdf'

[convert]
input: excel_file
output: csv = _input.with_suffix('.csv')

run: expand=True
    xlsx2csv {_input} > {_output}

[plot]
input: named_output('csv')
output: figure_file

R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

Workflow `plot` (ID 4dcd3205dd8ea4c0, index #13): completed, ran for 2 sec


xlsx2csv data/DEG.xlsx > data/DEG.csv



null device 
          1 


Here `named_output('csv')` refers to any step that produces an output with name `csv`, which is the step `convert` in this workflow. The input of step `plot` is the return value of `named_output('csv')` which is `data/DEG.csv`, although its exact name can only be identified after the conversion step is executed.

<div class="bs-callout bs-callout-warning" role="alert">
    <h4>Uniqueness of names of output</h4>
    <p>Although outputs of steps can be identified with arbitrary names and multiple steps can use the same names for their outputs, names referred to by function <code>named_output</code> have to be unique.</p>
 </div>

<div class="bs-callout bs-callout-warning" role="alert">
    <h4><code>named_output()</code> can only be called from input statements</h4>
    <p><code>named_output()</code> is a function provided by SoS to define input of steps and can only be called from input statements.</p>
 </div>
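The lookup by name and the uniqueness requirement can be sketched as follows. This is a toy model under stated assumptions, not SoS internals: the dictionaries `declared_names` and `ambiguous_names`, the step name `convert_again`, and the function `resolve_named_output` are hypothetical, and the error messages are not the ones SoS prints.

```python
# Toy model: names used in the `output` statement of each step.
declared_names = {"convert": ["csv"], "plot": []}


def resolve_named_output(name, declared):
    """Return the single step whose output carries `name`."""
    providers = [step for step, names in declared.items() if name in names]
    if not providers:
        raise ValueError(f"No step produces an output named {name!r}")
    if len(providers) > 1:
        raise ValueError(f"Output name {name!r} is ambiguous: {providers}")
    return providers[0]


# The step is found by the name of its output, not by a file name.
provider = resolve_named_output("csv", declared_names)
print(provider)

# Two steps using the same name cannot be told apart by name alone.
ambiguous_names = {"convert": ["csv"], "convert_again": ["csv"]}
try:
    resolve_named_output("csv", ambiguous_names)
    message = ""
except ValueError as exc:
    message = str(exc)
print(message)
```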

## Groups of output returned by `named_output` *

As we have seen, the output of a step can have multiple groups. In this case the return value of `named_output(name)` consists of the `name` part of all groups.

In the following example, `named_output('a')` obtains the `a` part of the output of step `A`, which consists of 4 groups. During the execution of the workflow, step `A` is executed to generate input for step `default`, which then runs as 4 substeps with `_input` equal to `a_0.txt`, `a_1.txt`, etc.

In [23]:
%run -v0 

[A]
input: for_each=dict(i=range(4))
output: a=f'a_{i}.txt', b=f'b_{i}.txt'
_output.touch()

[default]
input: named_output('a')
output: _input.with_suffix('.bak')
print(f'Generating {_output}')
_output.touch()

Workflow `default` (ID 42a47df1e4ce640a, index #23): completed, ran for 0 sec


Generating a_0.bak
Generating a_1.bak
Generating a_2.bak
Generating a_3.bak


<div class="bs-callout bs-callout-warning" role="alert">
    <h4>Option <code>group_by</code> of function <code>named_output</code></h4>
    <p>Option <code>group_by</code> regroups the groups returned by <code>named_output</code>.</p>
</div>

If you would like to remove the groups or regroup the returned files using another method, you can use the `group_by` option of function `named_output`. For example, the `group_by='all'` option in the following example puts all 4 input files into a single group:

In [21]:
%run -v0 

[A]
input: for_each=dict(i=range(4))
output: a=f'a_{i}.txt', b=f'b_{i}.txt'
_output.touch()

[default]
input: named_output('a', group_by='all')
output: [x.with_suffix('.bak') for x in _input]
print(f'Generating {_output}')
_output.touch()

Generating a_0.bak a_1.bak a_2.bak a_3.bak
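The effect of `group_by='all'` on the four groups can be imitated with plain Python lists. The function `regroup_all` is a hypothetical helper written for this sketch and is not part of SoS; it only shows how four one-file groups collapse into a single group, so that the step body runs once.

```python
from pathlib import PurePosixPath

# Four groups, one file each, mirroring the `a` part of step A's output.
groups = [[f"a_{i}.txt"] for i in range(4)]


def regroup_all(groups):
    """Merge every group into a single group, keeping the file order."""
    return [[name for group in groups for name in group]]


merged = regroup_all(groups)

# One output list per remaining group, with the suffix replaced by .bak.
backups = [
    [str(PurePosixPath(name).with_suffix(".bak")) for name in group]
    for group in merged
]
for group in backups:
    print("Generating", " ".join(group))
```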


## Further reading

* [How to include output from another step in a SoS step](doc/user_guide/output_from.html)
* [How to use Makefile-style rules to generate required files](doc/user_guide/auxiliary_steps.html)
* [How to execute workflow to generate specific output](doc/user_guide/target_oriented.html)