# How to include output from another step in a SoS step

* **Difficulty level**: intermediate
* **Time needed to learn**: 10 minutes or less
* **Key points**:
  * Function `output_from(step)` refers to output from another `step`
  * `output_from(step)[name]` can be used to refer to named output from `step`
  

## Referring to named output from another step

As shown in the example from tutorial [How to use named output in data-flow style workflows](doc/user_guide/named_output.html), function `named_output` can be used to refer to named output from another step:

In [1]:
!rm -f data/DEG.csv
%run plot

[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'data/DEG.csv'
figure_file = 'output.pdf'

[convert]
input: excel_file
output: csv = _input.with_suffix('.csv')

run: expand=True
    xlsx2csv {_input} > {_output}

[plot]
input: named_output('csv')
output: figure_file

R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

Set sos options to "-s force"


plot  Workflow ID 4dcd3205dd8ea4c0  Index #1  completed  Ran for 1 sec


xlsx2csv data/DEG.xlsx > data/DEG.csv

null device 
          1 


One obvious limitation of `named_output()` is that the name has to be unique in the workflow. For example, in the following script, where another step `test_csv` also names its output `csv`, the workflow fails due to ambiguity. This is usually not a concern with small workflows. However, as workflows become more complex, it is sometimes necessary to anchor a named output to a specific step.

In [2]:
!rm -f data/DEG.csv
%run plot

[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'data/DEG.csv'
figure_file = 'output.pdf'

[convert]
input: excel_file
output: csv = _input.with_suffix('.csv')

run: expand=True
    xlsx2csv {_input} > {_output}

[test_csv]
input: excel_file
output: csv = f'{_input:n}_test.csv'

run: expand=True
    xlsx2csv {_input} | head -10 > {_output}
    
[plot]
input: named_output('csv')
output: figure_file

R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

plot  Workflow ID 43ddd46fbd415940  Index #2  failed  Ran for < 5 seconds


ERROR: Multiple steps convert, test_csv to generate target named_output("csv")
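
The ambiguity behind this error can be illustrated with a small plain-Python sketch. This is a toy model, not how SoS is implemented: the dictionary `step_outputs` and the helper `resolve_named_output` are hypothetical names introduced only for illustration, and the second file name is what `f'{_input:n}_test.csv'` would produce for `data/DEG.xlsx`.

```python
# Toy model only: this is NOT SoS internals.
# Map each step to the names it attaches to its output.
step_outputs = {
    "convert": {"csv": "data/DEG.csv"},
    "test_csv": {"csv": "data/DEG_test.csv"},
}


def resolve_named_output(name, step_outputs):
    """Return the file registered under `name`; fail if the name is not unique."""
    providers = [step for step, outputs in step_outputs.items() if name in outputs]
    if not providers:
        raise ValueError(f"No step generates named output {name!r}")
    if len(providers) > 1:
        # Two steps claim the same name, so a lookup by name alone cannot choose.
        raise ValueError(
            f"Multiple steps {', '.join(providers)} generate named output {name!r}"
        )
    return step_outputs[providers[0]][name]


try:
    resolve_named_output("csv", step_outputs)
except ValueError as err:
    message = str(err)
    print(message)
```

A lookup keyed only by the output name has no way to prefer `convert` over `test_csv`, which is why a step-qualified reference is needed.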


## Function  `output_from` <a id="output_from"></a>

 <div class="bs-callout bs-callout-primary" role="alert">
    <h4>Function <code>output_from(steps, group_by, ...)</code></h4>
    <p>Function <code>output_from</code> refers to the output of <code>steps</code>. The returned object is the complete output from <code>steps</code> with its own sources and groups. Therefore,</p>
    <ul>
        <li>More than one step can be specified as a list of step names</li>
        <li>Option <code>group_by</code> can be used to regroup the returned files</li>
        <li><code>output_from(step)[name]</code> refers to all output with source <code>name</code></li>
    </ul>
 </div>

Function `output_from` imports the output from one or more other steps. For example, in the following workflow `output_from(['step_10', 'step_20'])` takes the output from steps `step_10` and `step_20` as input.

In [3]:
%run -v0
[step_10]
output: 'a.txt'
_output.touch()

[step_20]
output: 'b.txt'
_output.touch()

[step_30]
input:  output_from(['step_10', 'step_20'])
print(_input)

a.txt b.txt


The above example is a simple forward workflow with numerically indexed steps. In this case the arguments of `output_from` can be simplified to just the indexes (integers), so the workflow can be written as

In [4]:
%run -v0
[step_10]
output: 'a.txt'
_output.touch()

[step_20]
output: 'b.txt'
_output.touch()

[step_30]
input:  output_from([10, 20])
print(_input)

a.txt b.txt


The `steps` argument of `output_from(steps)` is not limited to numerically indexed steps. For example, the above example can be written as:

In [5]:
%run -v0
[A]
output: 'a.txt'
_output.touch()

[B]
output: 'b.txt'
_output.touch()

[default]
input:  output_from(['A', 'B'])
print(_input)

a.txt b.txt


### `source` of outputs returned from `output_from`

The `sources` of the files returned from `output_from()` are by default the names of the steps, so you can refer to these files separately using the `_input[name]` syntax:

In [6]:
%run -v0
[A]
output: 'a.txt'
_output.touch()

[B]
output: 'b.txt'
_output.touch()

[default]
input:  output_from(['A', 'B'])
print(_input)
print(f'Output from A is {_input["A"]}')
print(f'Output from B is {_input["B"]}')

a.txt b.txt
Output from A is a.txt
Output from B is b.txt


If the output has its own sources (names), the sources will be kept.

In [7]:
%run -v0
[A]
output: A_out = 'a.txt'
_output.touch()

[B]
output: B_out = 'b.txt'
_output.touch()

[default]
input:  output_from(['A', 'B'])
print(_input)
print(f'Output from A is {_input["A_out"]}')
print(f'Output from B is {_input["B_out"]}')

a.txt b.txt
Output from A is a.txt
Output from B is b.txt
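
The labeling rule used in the last two examples can be sketched in plain Python. This is a toy model under the assumption that a named output carries its own name as source and an unnamed output carries the step name; `collect_outputs` and `select` are hypothetical helpers, not part of SoS.

```python
# Toy model only: plain Python, not SoS internals.
def collect_outputs(steps):
    """Flatten step outputs into (source, filename) pairs.

    `steps` maps a step name to either a list of files (unnamed output)
    or a dict of {output_name: [files]} (named output).
    """
    labeled = []
    for step, output in steps.items():
        if isinstance(output, dict):
            # Named output keeps its own names as sources.
            for name, files in output.items():
                labeled.extend((name, f) for f in files)
        else:
            # Unnamed output is labeled with the step name.
            labeled.extend((step, f) for f in output)
    return labeled


def select(labeled, source):
    """Return the files carrying the given source label."""
    return [f for label, f in labeled if label == source]


labeled = collect_outputs({"A": ["a.txt"], "B": {"B_out": ["b.txt"]}})
print(select(labeled, "A"))      # ['a.txt']
print(select(labeled, "B_out"))  # ['b.txt']
```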


As usual, keyword arguments of the input statement override the `sources` of input files:

In [8]:
%run -v0
[step_10]
output: 'a.txt'
_output.touch()

[step_20]
output: 'b.txt'
_output.touch()

[step_30]
input:  s10=output_from(10), s20=output_from(20)
print(f'Output from step_10 is {_input["s10"]}')
print(f'Output from step_20 is {_input["s20"]}')

Output from step_10 is a.txt
Output from step_20 is b.txt


### Groups of output returned from `output_from`

Similar to the case with `named_output`, the returned object from `output_from()` keeps its original groups. For example, 

In [9]:
%run B -v0
[A]
input: for_each=dict(i=range(4))
output: f'a_{i}.txt'
_output.touch()

[B]
input: output_from('A')
output: _input.with_suffix('.bak')
print(f'Converting {_input} to {_output}')
_output.touch()

Converting a_0.txt to a_0.bak
Converting a_1.txt to a_1.bak
Converting a_2.txt to a_2.bak
Converting a_3.txt to a_3.bak


You can override the groups using the `group_by` option of `output_from`. 

In [10]:
%run B -v0
[A]
input: for_each=dict(i=range(4))
output: f'a_{i}.txt'
_output.touch()

[B]
input: output_from('A', group_by=2)
output: [x.with_suffix('.bak') for x in _input]
print(f'Converting {_input} to {_output}')
_output.touch()

Converting a_0.txt a_1.txt to a_0.bak a_1.bak
Converting a_2.txt a_3.txt to a_2.bak a_3.bak


Note that we used 
```
_input.with_suffix('.bak')
```
when `_input` contains only one filename, in which case the statement is equivalent to
```
_input[0].with_suffix('.bak')
```
However, when `_input` contains more than one file, you will have to deal with them one by one as follows:
```
[x.with_suffix('.bak') for x in _input]
```
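
The regrouping and the per-file suffix change can be tried outside SoS with plain Python. The sketch below only mimics the effect of `group_by=2` on four files. The helper `group_by_size` is a hypothetical name, and `pathlib.Path` stands in for the path objects that SoS provides.

```python
from pathlib import Path


def group_by_size(files, size):
    """Split `files` into consecutive groups of at most `size` items."""
    return [files[i:i + size] for i in range(0, len(files), size)]


inputs = [Path(f"a_{i}.txt") for i in range(4)]

for group in group_by_size(inputs, 2):
    # A group holds several paths, so the suffix is changed one file at a time.
    outputs = [p.with_suffix(".bak") for p in group]
    print("Converting", *group, "to", *outputs)
```

Run as is, this prints the same two `Converting ...` lines as the workflow above.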

## Using `output_from` in place of `named_output`

Going back to our `convert` and `plot` example: when another step is added with the same named output, it is no longer possible to use `named_output(name)`. In this case you can explicitly specify the step in which the named output is defined, and use

```
output_from(step)[name]
```
instead of
```
named_output(name)
```
as shown in the following example:

In [11]:
!rm -f data/DEG.csv
%run plot 

[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'data/DEG.csv'
figure_file = 'output.pdf'

[convert]
input: excel_file
output: csv = _input.with_suffix('.csv')

run: expand=True
    xlsx2csv {_input} > {_output}

[test_csv]
input: excel_file
output: csv = f'{_input:n}_test.csv'

run: expand=True
    xlsx2csv {_input} | head -10 > {_output}
    
[plot]
input: output_from('convert')['csv']
output: figure_file

R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

xlsx2csv data/DEG.xlsx > data/DEG.csv

null device 
          1 
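
Why the step-qualified form resolves the ambiguity can be seen in a small plain-Python sketch. Again, this is a toy model rather than SoS internals: the nested dictionary and the helper `output_of` are hypothetical and only mirror the shape of `output_from(step)[name]`.

```python
# Toy model only: a nested dict stands in for the outputs of each step.
step_outputs = {
    "convert": {"csv": ["data/DEG.csv"]},
    "test_csv": {"csv": ["data/DEG_test.csv"]},
}


def output_of(step, step_outputs):
    """Return all named outputs of one step (hypothetical helper)."""
    return step_outputs[step]


# Naming the step first removes the ambiguity of the shared name 'csv'.
print(output_of("convert", step_outputs)["csv"])   # ['data/DEG.csv']
print(output_of("test_csv", step_outputs)["csv"])  # ['data/DEG_test.csv']
```

Because the step is selected before the name, two steps can reuse the name `csv` without conflict.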


Note that `output_from` is more precise than `named_output` because it refers to a specific step, but for the same reason it ties the workflow to step names and makes it more difficult to maintain. We generally recommend the use of `named_output` for its simplicity.

## Further reading
* [How to use named output in data-flow style workflows](doc/user_guide/named_output.html)