# How to define step output

* **Difficulty level**: easy
* **Time needed to learn**: 10 minutes or less
* **Key points**:
  * Step output is defined for each substep and can be derived from substep input (variable `_input`)
  * Variable `step_output` is defined at the completion of the step, and can be passed to other steps

The `output` statement defines the output files or targets of a SoS step. It is optional, but it is fundamental to all but very simple workflows. You can check out the [How to create dependencies between SoS steps](doc/user_guide/step_dependencies.html) tutorial for a quick overview of the use of output statements. This tutorial lists what you can put in the output statement of a step, with simple examples; refer to other tutorials for more in-depth discussions of these topics.

## Steps with no output statement

The `output` statement is optional. When no output file is defined, a step will have undefined output.

For example, the following workflow has a step `A` that executes a simple shell script. No output statement is needed and the workflow will work just fine.

In [1]:
%run A -v0
[A_1]
sh:
  echo do something
  
[A_2]
print(f'The input of step {step_name} is "{step_input}"')

The input of step A_2 is ""


In simple workflows with numerically indexed steps, an empty output will be passed to the next step.

## Unnamed output files

The easiest way to explicitly specify the output of a step is to list output files directly in the `output` statement.


In [2]:
output: 'a.txt'
_output.touch()
print(f'_output is {_output}')

_output is a.txt


Here we call the `touch` function of `_output`, which is of type `sos_targets`. This function creates the file or files listed in `_output`. It is used quite often in these tutorials because SoS checks that the output files exist after the step is executed.

As with the `input` statement, multiple files can be listed as multiple parameters, sequences (list, tuple, etc.), or variables of string or sequence types.
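
The following plain-Python sketch illustrates the idea that separate parameters, sequences, and variables all end up as one flat list of file names. It is only a conceptual model: `flatten_targets` is a hypothetical helper written for this illustration and is not part of SoS or how SoS handles the `output` statement internally.

```python
from collections.abc import Iterable


def flatten_targets(*args):
    """Flatten strings and (possibly nested) sequences of strings into one list."""
    names = []
    for arg in args:
        if isinstance(arg, str):
            # A single file name is kept as is.
            names.append(arg)
        elif isinstance(arg, Iterable):
            # A list, tuple, etc. contributes each of its members in order.
            names.extend(flatten_targets(*arg))
        else:
            raise TypeError(f'unsupported target: {arg!r}')
    return names


more_files = ['c.txt', 'd.txt']  # a variable of sequence type
listed = flatten_targets('a.txt', ('b.txt',), more_files)
print(listed)
```

All three ways of listing files contribute to the same ordered collection, which is why they can be mixed freely in one statement.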

## Output in substeps

It is very important to remember that **output statement defines output for substeps**.

Let us create a few input files,

In [3]:
!touch a.txt b.txt c.txt d.txt

In the following example, option `group_by` creates two substeps with `_input` being `a.txt` and `b.txt` respectively. The `_input` (actually `_input[0]`) is of type `file_target`, which is derived from `pathlib.Path`, so you can use any member function of `pathlib.Path`. Here we use `with_suffix` to obtain `a.bak` from `a.txt`.

In [4]:
input: 'a.txt', 'b.txt', group_by=1
output: _input.with_suffix('.bak')

print(f'Input of substep is {_input}, output of substep is {_output}')
_output.touch()

Input of substep is a.txt, output of substep is a.bak
Input of substep is b.txt, output of substep is b.bak
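
Because `with_suffix` comes from `pathlib.Path`, its behavior can be checked in plain Python outside of SoS. The sketch below uses only the standard library; the path `data/sample.tar.gz` is a made-up name used to show that only the last suffix is replaced.

```python
from pathlib import Path

inputs = [Path('a.txt'), Path('b.txt')]

# with_suffix returns a new path and does not touch the file system.
backups = [p.with_suffix('.bak') for p in inputs]
for src, dst in zip(inputs, backups):
    print(f'{src} -> {dst}')

# Only the final suffix is replaced, so a double extension keeps its first part.
archive = Path('data/sample.tar.gz').with_suffix('.bak')
print(archive.name)
```

For a name such as `sample.tar.gz`, this yields `sample.tar.bak`, so an output derived this way may need extra handling if the input has a compound extension.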


As you can see, `_output` is defined for each substep from `_input`. But what is `step_output`?

`step_output` is defined as an accumulated version of `_output`, with each `_output` as one of its groups. It is useful only when the output is imported to other steps, either implicitly as shown below, or as the output of functions `output_from` and `named_output`.

In [5]:
%run -v0

[10]
input: 'a.txt', 'b.txt', group_by=1
output: _input.with_suffix('.bak')

print(f'Input of substep is {_input}, output of substep is {_output}')
_output.touch()

[20]
print(f'step_input is {step_input}, substep input is {_input}')

Input of substep is a.txt, output of substep is a.bak
Input of substep is b.txt, output of substep is b.bak
step_input is a.bak b.bak, substep input is a.bak
step_input is a.bak b.bak, substep input is b.bak
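
The relationship between `_output`, `step_output`, and the next step's substeps can be modeled in a few lines of plain Python. This is a simplified sketch of the behavior shown above, using ordinary lists in place of `sos_targets`; it is not how SoS is implemented.

```python
from pathlib import Path

# Two substeps, as produced by group_by=1 on two input files.
substep_inputs = [[Path('a.txt')], [Path('b.txt')]]

groups = []  # one entry per substep, in execution order
for _input in substep_inputs:
    _output = [p.with_suffix('.bak') for p in _input]
    groups.append(_output)

# The step-level output is the concatenation of all substep outputs,
# while the substep outputs are remembered as its groups.
step_output = [p for group in groups for p in group]

# The next step receives the whole output and iterates over its groups.
step_input = step_output
lines = []
for _input in groups:
    all_names = ' '.join(p.name for p in step_input)
    group_names = ' '.join(p.name for p in _input)
    lines.append(f'step_input is {all_names}, substep input is {group_names}')
print('\n'.join(lines))
```

In this model the grouping survives the hand-over between steps, which matches the two substeps seen in step `20` even though that step has no `group_by` option of its own.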


## Output with predefined groups

When you have predefined input and output pairs, you can define output groups with option `group_by`. The key here is that the number of output groups must match the number of substeps.

For example,

In [6]:
%run -s force -v0

txt_files = ['a.txt', 'b.txt']
bak_files = ['a.bak', 'b.bak']

input: txt_files, group_by=1
output: bak_files, group_by=1

print(f'Input of substep is {_input}, output of substep is {_output}')
_output.touch()

Input of substep is a.txt, output of substep is a.bak
Input of substep is b.txt, output of substep is b.bak
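
The requirement that the number of output groups matches the number of substeps can be pictured as a strict pairing of two lists. The helper `pair_groups` below is hypothetical and only illustrates the constraint in plain Python; how SoS reports a mismatch is not shown here.

```python
txt_files = ['a.txt', 'b.txt']
bak_files = ['a.bak', 'b.bak']


def pair_groups(inputs, outputs):
    """Pair each input group with the output group at the same position."""
    if len(inputs) != len(outputs):
        raise ValueError(
            f'{len(inputs)} input groups but {len(outputs)} output groups')
    return list(zip(inputs, outputs))


pairs = pair_groups(txt_files, bak_files)
for src, dst in pairs:
    print(f'Input of substep is {src}, output of substep is {dst}')
```

Pairing by position means the order of the two lists matters: reordering `bak_files` would silently pair `a.txt` with `b.bak`.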


## Named output

Similar to named input, you can assign labels to output files and refer to them with `_output["label"]`.

In [7]:
output: A='a.txt', B='b.txt'

print(f"Output with label A is {_output['A']}, with label B is {_output['B']}")
print(f"Output of step is {_output}")
_output.touch()

Output with label A is a.txt, with label B is b.txt
Output of step is a.txt b.txt


More importantly, these labels define named outputs that can be referred to with function `named_output`.

In [8]:
%run -v0

[A]
output: A='a.txt', B='b.txt'

_output.touch()

[default]
input: named_output('A')
print(f'Input of step is {_input}')

Input of step is a.txt


## Attach variables to individual output files

The `paired_with` option can be used to attach variables to individual output files.

In [9]:
output: 'a.txt', 'b.txt', paired_with=dict(sample_name=['A', 'B'])
print(f'Output of substep is {_output}, with sample names {_output[0].sample_name} and {_output[1].sample_name}')
_output.touch()

Output of substep is a.txt b.txt, with sample names A and B
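
Conceptually, `paired_with` matches each output file with the value at the same position in the list. The plain-Python sketch below models that pairing. The `AnnotatedFile` class and `pair_with` function are hypothetical names for this illustration, and the length check is an assumption rather than documented SoS behavior.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedFile:
    path: str
    sample_name: str


def pair_with(files, sample_name):
    """Attach the i-th value to the i-th file."""
    if len(files) != len(sample_name):
        # Assumed constraint: one value per file.
        raise ValueError('need exactly one value per file')
    return [AnnotatedFile(p, s) for p, s in zip(files, sample_name)]


outputs = pair_with(['a.txt', 'b.txt'], ['A', 'B'])
paths = ' '.join(f.path for f in outputs)
print(f'Output is {paths}, with sample names '
      f'{outputs[0].sample_name} and {outputs[1].sample_name}')
```

The annotation belongs to each file rather than to the group, which is why the example above reads it from `_output[0]` and `_output[1]` separately.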


## Attach variables to output

Option `group_with` can be used to attach variables to output groups, which can be useful as annotations for output files when the output is passed to other steps.

A potentially confusing part of the `group_with` option is that it assigns elements of the list to all `_output`, not to a single `_output` that the output statement is creating.

In [10]:
sample_names = ['A', 'B']

input: for_each=dict(sample_name=sample_names)
output:  f'out_{sample_name}.txt', group_with=dict(sample=sample_names)
print(f'Output of substep is {_output}, with sample name {_output.sample}')

_output.touch()

Output of substep is out_A.txt, with sample name A
Output of substep is out_B.txt, with sample name B
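
One way to picture this is that every substep is handed the complete list, and the substep's position selects the element that ends up on its own output group. The following plain-Python sketch models that selection. `run_substep` is a hypothetical function for illustration and does not reflect SoS internals.

```python
sample_names = ['A', 'B']


def run_substep(index, sample_name, group_values):
    """Build one output group and pick the value that belongs to this substep."""
    output_group = [f'out_{sample_name}.txt']
    # group_values is the full list; only element `index` is attached here.
    return {'files': output_group, 'sample': group_values[index]}


results = [run_substep(i, name, sample_names)
           for i, name in enumerate(sample_names)]
for group in results:
    print(f"Output of substep is {group['files'][0]}, "
          f"with sample name {group['sample']}")
```

Passing the full list in every substep is what makes the list length relevant: in this model it needs one element per substep, not one element per file.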


If you would like to attach a result to an individual substep, it can be easier to set the variable on `_output` directly.

In [11]:
%run -v0

[10]
input: for_each=dict(i=range(2))
output:  f'out_{i}.txt'

import random
seed = random.randint(1, 1000)

_output.touch()
_output.set(seed=seed)

[20]
print(f'seed of output {_input} is {seed}')

seed of output out_0.txt is 369
seed of output out_1.txt is 701


## Further reading

* [How to specify input and output files and process input files in groups](doc/user_guide/input_substeps.html) for basics of step and substeps.