# How to define step input

* **Difficulty level**: easy
* **Time needed to learn**: 10 minutes or less
* **Key points**:
  * The input statement accepts regular Python arguments
  * Input files can be grouped to create substeps
  * Input files can be labeled and accessed by their labels
  * A step can include part or all output from other steps
  * You can attach variables to individual input files or substeps  

The `input` statement defines the input files or targets of a SoS step. It is optional, but it is fundamental to the creation of all but very simple workflows. You can check out the [How to create dependencies between SoS steps](doc/user_guide/step_dependencies.html) tutorial for a quick overview of the use of input statements. This tutorial lists what you can put in the `input` statement of a step, with simple examples; refer to the other tutorials for more in-depth discussions of each topic.

## Steps with no input statement

The `input` statement is optional. When a step has no `input` statement, its input is either undefined or the output of its previous step.

For example, the following workflow has a step `A` that executes a simple shell script. No input statement is needed and the workflow works just fine.

In [5]:
%run A -v0
[A]
sh:
  echo do something

In the special case where a workflow is defined with numerically indexed steps, a step without an input statement depends on its previous step and takes that step's output as its input. Here we present only a very simple example; you will see more complex examples in other tutorials.

In [7]:
%run  -v0

[step_10]
output: 'a.txt'
_output.touch()

[step_20]
print(f'The input of this step is {step_input}')

The input of this step is a.txt


## Unnamed input files

The easiest way to explicitly specify the input of a step is to list input files directly in the `input` statement. Because SoS checks the existence of input files when it executes a step, let us first create a few files:

In [9]:
# create some sample files
!touch a.txt b.txt c.txt d.txt

The following is a SoS step (with a default section header) with an `input` statement, which results in a `step_input` variable with a single file `a.txt`:

In [4]:
input: 'a.txt'
print(f'step_input is {step_input}')

step_input is a.txt


Multiple files can be listed as multiple parameters, as sequences (`list`, `tuple`, etc.), or as variables of string or sequence types. For example, you can define a parameter `in_files` of type `paths` (list of `path`) and specify input files from the command line:

In [3]:
%run --in-files a.txt b.txt -v0
parameter: in_files = paths
input: in_files

print(f'step_input is {step_input}')

step_input is a.txt b.txt


You can also list multiple files and mix string literals with variable names:

In [5]:
in_files = ['a.txt', 'b.txt']
input: in_files, 'c.txt', 'd.txt'

print(f'step_input is {step_input}')

step_input is a.txt b.txt c.txt d.txt


Because the steps in these examples have no substeps, using variable `_input` is equivalent to using `step_input`.

## Substep created by option `group_by`

A step can be executed multiple times with different variables; these executions are called **substeps**. The input of each substep is assigned to variable `_input`. The most common way to define substeps is to use option `group_by` to group input files.

For example,

In [10]:
input: 'a.txt', 'b.txt', 'c.txt', 'd.txt', group_by=2
print(f'Input of substep is {_input}')

Input of substep is a.txt b.txt
Input of substep is c.txt d.txt
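
To make the grouping concrete, here is a minimal plain-Python sketch of what `group_by=2` does conceptually. It is an illustration only, not how SoS implements the option: the helper `group_by_size` is a hypothetical name, and the names `step_input` and `_input` are plain lists here rather than `sos_targets`.

```python
# Plain-Python illustration of grouping input files; NOT SoS's implementation.
# `group_by_size` is a hypothetical helper introduced for this sketch.
def group_by_size(files, size):
    """Split `files` into consecutive groups of `size` items.

    In this sketch the last group is simply shorter if the number of
    files is not a multiple of `size`.
    """
    return [files[i:i + size] for i in range(0, len(files), size)]


step_input = ['a.txt', 'b.txt', 'c.txt', 'd.txt']
groups = group_by_size(step_input, 2)

# Each group plays the role of `_input` in one substep.
for _input in groups:
    print(f"Input of substep is {' '.join(_input)}")
```

Each element of `groups` corresponds to the `_input` of one substep, while `step_input` still holds all four files.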


## Substep created by option `for_each`

Another way to create substeps is to repeat the step with different values of a variable. In the following example, a variable `val` iterates through the list `[1, 2]`, which creates two substeps with `val=1` and `val=2` respectively.

In [12]:
input: 'a.txt', for_each=dict(val=[1, 2])
print(f'Processing {_input} with {val}')

Processing a.txt with 1
Processing a.txt with 2
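
The repetition can be pictured with a short plain-Python sketch. This only mimics the behavior shown above and is not SoS code: the list `substeps` and its dictionary layout are assumptions made for illustration.

```python
# Plain-Python illustration of `for_each=dict(val=[1, 2])`; NOT SoS's implementation.
step_input = ['a.txt']
for_each = {'val': [1, 2]}

# One substep per value of `val`; every substep sees the same input files.
substeps = [{'_input': step_input, 'val': val} for val in for_each['val']]

for substep in substeps:
    print(f"Processing {' '.join(substep['_input'])} with {substep['val']}")
```

Unlike `group_by`, the input files are not split: every substep receives the same `_input` and only the value of `val` changes.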


## Named input

You can assign labels to subsets of your input files and refer to these subsets by their labels.

In [13]:
input: A='a.txt', B='b.txt'
print(f'Step input is {_input}. Input with label A is {_input["A"]}. Input with label B is {_input["B"]}')

Step input is a.txt b.txt. Input with label A is a.txt. Input with label B is b.txt
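
As a rough mental model, labeled input behaves like a mapping from labels to files whose values together form the complete input. The plain-Python sketch below illustrates that idea only; the dictionary `labeled` is a hypothetical stand-in and says nothing about how `sos_targets` stores labels internally.

```python
# Plain-Python mental model of labeled input; NOT how sos_targets is implemented.
labeled = {'A': ['a.txt'], 'B': ['b.txt']}

# The complete step input is the concatenation of all labeled subsets,
# in the order the labels were given.
step_input = [path for paths in labeled.values() for path in paths]

print(f"Step input is {' '.join(step_input)}")
print(f"Input with label A is {' '.join(labeled['A'])}")
print(f"Input with label B is {' '.join(labeled['B'])}")
```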


## Attach variables to individual input files

Variables `step_input` and `_input` are of type `sos_targets`, which consists of SoS targets, most of which are of type `file_target`. Every target has a dictionary that can be used to store attributes related to it.

For example, pairing a list of sample names with a list of input files attaches the attribute `sample_name` to each input file, which can then be accessed through `.sample_name`.

In [15]:
input: 'a.txt', 'b.txt', paired_with=dict(sample_name=['A', 'B'])
print(f'Input of substep is {_input}, with sample names {_input[0].sample_name} and {_input[1].sample_name}')


Input of substep is a.txt b.txt, with sample names A and B


The variables are attached to individual input files so they will be available with the files in substeps:

In [18]:
input: 'a.txt', 'b.txt', paired_with=dict(sample_name=['A', 'B']), group_by=1
print(f'Input of substep is {_input}, with sample names {_input[0].sample_name}')

Input of substep is a.txt, with sample names A
Input of substep is b.txt, with sample names B
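
The sketch below mimics in plain Python how per-file attributes travel with their files into substeps. It is an illustration under stated assumptions, not SoS's implementation: `pair_with` is a hypothetical helper, and `SimpleNamespace` objects with a `path` attribute stand in for SoS targets.

```python
from types import SimpleNamespace


# Plain-Python illustration of `paired_with`; NOT SoS's implementation.
# `pair_with` is a hypothetical helper introduced for this sketch.
def pair_with(files, **attrs):
    """Attach the i-th value of every attribute list to the i-th file."""
    for name, values in attrs.items():
        if len(values) != len(files):
            raise ValueError(f'{name} needs exactly one value per input file')
    return [
        SimpleNamespace(path=path, **{name: values[i] for name, values in attrs.items()})
        for i, path in enumerate(files)
    ]


targets = pair_with(['a.txt', 'b.txt'], sample_name=['A', 'B'])

# Processing one target at a time (like `group_by=1`) keeps each attribute
# with the file it was paired with.
for target in targets:
    print(f'Input of substep is {target.path}, with sample name {target.sample_name}')
```

Because the attribute is stored on the target itself rather than on the step, it stays with its file however the files are later grouped.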


## Attach variables to substeps

When we group input files, we can attach attributes to the entire group represented by variable `_input`. This is done through option `group_with`.

For example, in the following workflow, four files are grouped into two groups of two files each. The items of list `['AB', 'CD']` are attached to the two groups under the name `sample_name`, and can be accessed as `_input.sample_name`.

In [16]:
input: 'a.txt', 'b.txt', 'c.txt', 'd.txt', group_with=dict(sample_name=['AB', 'CD']), group_by=2
print(f'Input of substep is {_input}, with sample name {_input.sample_name}')

Input of substep is a.txt b.txt, with sample name AB
Input of substep is c.txt d.txt, with sample name CD
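
In contrast to `paired_with`, the values here are matched to groups rather than to individual files. A minimal plain-Python sketch of that matching follows; it is not SoS code, and the variable names are chosen only for illustration.

```python
# Plain-Python illustration of `group_with` combined with `group_by=2`;
# NOT SoS's implementation.
files = ['a.txt', 'b.txt', 'c.txt', 'd.txt']
sample_names = ['AB', 'CD']

# First form the groups, then attach one value to each group.
groups = [files[i:i + 2] for i in range(0, len(files), 2)]
if len(groups) != len(sample_names):
    raise ValueError('need exactly one sample name per group')
substeps = list(zip(groups, sample_names))

for group, sample_name in substeps:
    print(f"Input of substep is {' '.join(group)}, with sample name {sample_name}")
```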


For simplicity (and backward compatibility), the substep variables can be accessed directly in substeps so that you can use `sample_name` instead of `_input.sample_name`.

In [17]:
input: 'a.txt', 'b.txt', 'c.txt', 'd.txt', group_with=dict(sample_name=['AB', 'CD']), group_by=2
print(f'Input of substep is {_input}, with sample name {sample_name}')

Input of substep is a.txt b.txt, with sample name AB
Input of substep is c.txt d.txt, with sample name CD


## Named input with function `named_output`

The input statement accepts Python functions. A function `named_output` is defined to include named output from another step.

In [21]:
%run -v0

[A]
output: summary='output.html'
_output.touch()

[default]
input: named_output('summary')
print(f'Input of the step is {_input}')

Input of the step is output.html


## Output from another step using function `output_from`

`named_output` can be used to refer to part of the output of a step (if the step has multiple named outputs) or to all of it. Similarly, function `output_from` can be used to include the complete output of a specified step:

In [23]:
%run -v0

[A]
output: 'output.html'
_output.touch()

[default]
input: output_from('A')
print(f'Input of the step is {_input}')

Input of the step is output.html


## Further reading

* [How to specify input and output files and process input files in groups](doc/user_guide/input_substeps.html) for the basics of steps and substeps
* [How to process input files in groups](doc/user_guide/group_by.html) for option `group_by`
* [How to repeat steps for different parameters](doc/user_guide/for_each.html) for option `for_each`