# Title

* **Difficulty level**: easy
* **Time needed to learn**: 10 minutes or less
* **Key points**:
  * a
  

## Step variables

SoS defines several variables during the execution of a step. The first type of variable provides information about the step:

* **`step_name`**: name of the step
* **`step_id`**: Hash ID of the step, which is determined by the content of the step
* **`workflow_id`**: Hash ID of the workflow in which the step is defined. If the step belongs to a nested workflow, this is the ID of the nested workflow.
* **`master_id`**: Hash ID of the entire workflow, regardless of whether the step is defined in a nested workflow.

In the following example, the two steps have different step names, step IDs, and workflow IDs because `nested` is a nested workflow, but they share the same `master_id`, which equals the `workflow_id` of the outermost workflow.

In [1]:
%run
[nested]
print(f'Workflow {workflow_id}: step name={step_name}')
print(f'Workflow {workflow_id}: step id={step_id}')
print(f'Workflow {workflow_id}: workflow id={workflow_id}')
print(f'Workflow {workflow_id}: master id={master_id}')

[default]
print(f'Workflow {workflow_id}: step name={step_name}')
print(f'Workflow {workflow_id}: step id={step_id}')
print(f'Workflow {workflow_id}: workflow id={workflow_id}')
print(f'Workflow {workflow_id}: master id={master_id}')
sos_run('nested')

Workflow 1663b815c3bed9f2: step name=default_0
Workflow 1663b815c3bed9f2: step id=950015e2ddfc10d4
Workflow 1663b815c3bed9f2: workflow id=1663b815c3bed9f2
Workflow 1663b815c3bed9f2: master id=1663b815c3bed9f2
Workflow fecfb67fda2f92df: step name=nested_0
Workflow fecfb67fda2f92df: step id=74bf81da4f5098aa
Workflow fecfb67fda2f92df: workflow id=fecfb67fda2f92df
Workflow fecfb67fda2f92df: master id=1663b815c3bed9f2


These variables can be useful, for example, to save runtime information, as in the following example:

In [8]:
%preview -n default_10.log
%run 

[10]
with open(step_name + '.log', 'w') as log:
    log.write(f'Step specific log message saved in {step_name}.log')

Step specific log message saved in default_10.log

If a section defines multiple steps, the step variables can be used to define (slightly) different steps according to which step is executing. For example,

In [9]:
%run human

[human_10, mouse_10]
if 'human' in step_name:
   print("I am dealing with human")
else:
   print("I am dealing with mouse")


I am dealing with human


The second type of variable covers step input, output, and dependent targets. These are of type `sos_targets` and will be explained later.

#### `step_input` and `_input`

In SoS, the `input` statement mostly creates a `step_input` object with provided parameters. That is to say,

```
input: 'a.txt', 'b.txt', group_by=1
```
is almost equivalent to

```
step_input = sos_targets('a.txt', 'b.txt', group_by=1)
```
and we can use `sos_targets` objects directly in an `input` statement in more complicated cases.

Variable `_input` represents the input targets for each substep (`groups` of `sos_targets` as we will see later).

In the simple case, `step_input` is the same as `_input`. For example, variables `step_input` and `_input` of the following step are `sos_targets` objects with a single `file_target` object:

In [1]:
input: 'SoS_Syntax.ipynb'
print(f"step_input={step_input:r}")
print(f"_input={_input:r}")
sh: expand=True
    wc -l {_input}

step_input='SoS_Syntax.ipynb'
_input='SoS_Syntax.ipynb'
    4626 SoS_Syntax.ipynb


If you have multiple input files, you can pass them all together as a `sos_targets` with two `file_target` objects:

In [2]:
input: 'SoS_Syntax.ipynb', 'SoS_Magics.ipynb'
print(f"step_input={step_input:r}")
print(f"_input={_input:r}")
sh: expand=True
    wc -l {_input[0]}
    wc -l {_input[1]}

step_input='SoS_Syntax.ipynb' 'SoS_Magics.ipynb'
_input='SoS_Syntax.ipynb' 'SoS_Magics.ipynb'
    4626 SoS_Syntax.ipynb
    3747 SoS_Magics.ipynb


or separately as two groups of inputs:

In [3]:
input: 'SoS_Syntax.ipynb', 'SoS_Magics.ipynb', group_by=1
print(f"step_input={step_input:r}")
print(f"_input={_input:r}")
sh: expand=True
    wc -l {_input}

step_input='SoS_Syntax.ipynb' 'SoS_Magics.ipynb'
_input='SoS_Syntax.ipynb'
    4626 SoS_Syntax.ipynb
step_input='SoS_Syntax.ipynb' 'SoS_Magics.ipynb'
_input='SoS_Magics.ipynb'
    3747 SoS_Magics.ipynb


In this case, the step input contains two `file_target` objects,
```
step_input = sos_targets('SoS_Syntax.ipynb', 'SoS_Magics.ipynb')
```
but the step process is executed twice, with
```
_input = sos_targets('SoS_Syntax.ipynb')
_input = sos_targets('SoS_Magics.ipynb')
```
respectively. Because `_input` contains only one element, it is not necessary to use `_input[0]` in the script.

#### `sources` of targets

Each target in a `sos_targets` has an attribute `source`, which corresponds to the step in which the target is specified (input) or generated (output). For example, the `source` of a target that is directly specified in a step is the name of that step.

In [2]:
%run
[hi]
input: 'SoS_Syntax.ipynb'
print(_input.sources)

['hi']


If you have multiple inputs, you can separate them into different groups using keyword arguments:

In [10]:
!touch a.bam b.bam a.bai b.bai

input: bam=['a.bam', 'b.bam'], bai=['a.bai', 'a.bai']

print(_input)
print(_input.sources)

a.bam b.bam a.bai a.bai
['bam', 'bam', 'bai', 'bai']


If the input target is inherited from another step, the source will be the name of that step.

In [3]:
%run
[10]
output: 'a.txt'
_output.touch()

[11]
print(_input.sources)

['10']


In a more complex case, when the targets come from multiple upstream steps and the present step, the `sources` attribute identifies the source of each target:

In [11]:
!touch c.txt
%run
[step_10]
output: 'a.txt'
_output.touch()

[step_20]
output: 'b.txt'
_output.touch()

[step_30]
input: 'c.txt', output_from(['step_10', 'step_20'])
print(_input)
print(_input.sources)


c.txt a.txt b.txt
['step_30', 'step_10', 'step_20']


However, using a keyword argument overrides the default source:

In [12]:
!touch c.txt
%run
[step_10]
output: 'a.txt'
_output.touch()

[step_20]
output: 'b.txt'
_output.touch()

[step_30]
input: 'c.txt', prev=output_from(['step_10', 'step_20'])
print(_input)
print(_input.sources)


c.txt a.txt b.txt
['step_30', 'prev', 'prev']


The `source` information can be used to select subsets of targets according to their sources. For example, `_input['prev']` would generate a `sos_targets` with all targets from source `prev`.

In [14]:
!touch c.txt
%run
[step_10]
output: 'a.txt'
_output.touch()

[step_20]
output: 'b.txt'
_output.touch()

[step_30]
input: 'c.txt', output_from(['step_10', 'step_20'])
print(_input['step_10'])
print(_input['step_10'].sources)


a.txt
['step_10']
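
Conceptually, selecting by source works like filtering a list of (target, source) pairs by label. The following plain-Python sketch models that idea with ordinary tuples. `select_by_source` and `labeled_targets` are hypothetical names written for illustration, and this is not how `sos_targets` is implemented.

```python
# Each target is paired with the label of the step (or keyword) it came from.
labeled_targets = [
    ('c.txt', 'step_30'),
    ('a.txt', 'step_10'),
    ('b.txt', 'step_20'),
]

def select_by_source(targets, source):
    """Return the names of targets whose label equals `source` (hypothetical helper)."""
    return [name for name, label in targets if label == source]

print(select_by_source(labeled_targets, 'step_10'))
```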


#### `groups` of `sos_targets`

As we have seen, targets in a `sos_targets` can be grouped in many ways, and `_input` contains the subset of targets that serves as the input of each substep. In the following example, the 4 input files are split into two groups of the same size, and the step is executed twice, once for each group. `step_input.groups` contains a list of `sos_targets`, each of which becomes the `_input` of a substep.

In [18]:
input: 'a.txt', 'b.txt', 'c.txt', 'd.txt', group_by=2
print(f'\nGroup {_index}')
print(step_input.groups)
print(_input)


Group 0
[[file_target('a.txt'), file_target('b.txt')], [file_target('c.txt'), file_target('d.txt')]]
a.txt b.txt

Group 1
[[file_target('a.txt'), file_target('b.txt')], [file_target('c.txt'), file_target('d.txt')]]
c.txt d.txt
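
The relationship between the full list of inputs and the per-substep groups can be mimicked in plain Python. The sketch below is only an analogy using ordinary lists. `split_into_groups` is a hypothetical helper, not part of SoS, and it assumes that grouping by 2 simply means consecutive chunks of two items.

```python
def split_into_groups(targets, size):
    """Split a list into consecutive chunks of `size` items (hypothetical helper)."""
    return [targets[i:i + size] for i in range(0, len(targets), size)]

all_inputs = ['a.txt', 'b.txt', 'c.txt', 'd.txt']
groups = split_into_groups(all_inputs, 2)

# Each group plays the role that `_input` plays in one substep.
for index, group in enumerate(groups):
    print(f"Group {index}: {' '.join(group)}")
```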


#### `zap` file targets 

`sos_targets` provides the `zap()` function, which zaps all file targets in the list. This technique is usually used to remove large intermediate files during the execution of a workflow. For example, if you have a workflow that downloads and processes large files, you can do something like

```
[download: provides='{file}.fastq']
download: expand=True
    http://some_url/{file}.fastq

[default]
input: [f'{x}.fastq' for x in range(1000)], group_by=1
output: _input.with_suffix('.bam')
sh: expand=True
    process {_input} to {_output}
  
_input.zap()
```

In this example, 1000 `fastq` files are downloaded and processed, but the input files are zapped after they are processed. Although the files have been removed, re-running the workflow will not download and process the files again because the downloaded files are still considered to exist by SoS.
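
The idea that a removed file can still be "considered to exist" can be illustrated with a small stand-alone sketch. This is an analogy only: the helpers `zap_file` and `considered_existing`, the `.zapped` marker name, and the marker's contents are all assumptions made for this example, and SoS's actual mechanism may differ.

```python
import os
import tempfile

def zap_file(path):
    """Replace `path` with a small marker file that records its size (hypothetical)."""
    size = os.path.getsize(path)
    with open(path + '.zapped', 'w') as marker:
        marker.write(f"{os.path.basename(path)}\t{size}\n")
    os.remove(path)

def considered_existing(path):
    """Treat a file as present if either it or its marker file exists (hypothetical)."""
    return os.path.exists(path) or os.path.exists(path + '.zapped')

with tempfile.TemporaryDirectory() as tmpdir:
    fastq = os.path.join(tmpdir, '0.fastq')
    with open(fastq, 'w') as handle:
        handle.write('ACGT' * 1000)

    zap_file(fastq)
    removed = not os.path.exists(fastq)
    still_counts = considered_existing(fastq)
    print(f"removed={removed}, still considered existing={still_counts}")
```

The large file no longer takes up disk space, but a check based on `considered_existing` would not trigger a second download.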

### Using `sos_targets`

Whereas you can always use `step_input`, `_input`, etc. in scripts (format strings) easily, using them outside of scripts is a bit trickier. Let us check a few common usages and see how to use `sos_targets` in SoS.

First, it is a common practice to generate another file with a slightly different name. For example, in the following script, `step_input` is a `sos_targets` with two filenames, `_input` is a `sos_targets` with one filename, and we need to generate `test1.bam.bai` from input file `test1.bam`. In this particular case, you can append the `.bai` extension directly to `_input` using the `+` operator, which passes the `+` operation to the first `file_target` of `_input`:

In [46]:
%sandbox
!touch test1.bam test2.bam

input_files = ['test1.bam', 'test2.bam']
input: input_files, group_by=1
output: _input + '.bai'

print(f"Generating {_output} from {_input}")
_output.touch()

Generating /private/var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmpenoojx6z/test1.bam.bai from test1.bam
Generating /private/var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmpenoojx6z/test2.bam.bai from test2.bam


If the extension needs to be replaced rather than appended, you can use the `with_suffix` function of `pathlib.Path`, which `_input` again passes to its only element `_input[0]`:

In [47]:
%sandbox
!touch test1.bam test2.bam

input_files = ['test1.bam', 'test2.bam']
input: input_files, group_by=1
output: _input.with_suffix('.bai')

print(f"Generating {_output} from {_input}")
_output.touch()

Generating test1.bai from test1.bam
Generating test2.bai from test2.bam
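
The difference between appending an extension and replacing one can be checked with the standard `pathlib` module alone. This sketch uses `PurePosixPath`, which manipulates names without touching the file system. It assumes nothing about SoS beyond the statement above that `with_suffix` follows `pathlib.Path`.

```python
from pathlib import PurePosixPath

bam = PurePosixPath('test1.bam')

# Appending keeps the original extension.
appended = PurePosixPath(str(bam) + '.bai')
# with_suffix replaces the last extension.
replaced = bam.with_suffix('.bai')

print(appended)   # test1.bam.bai
print(replaced)   # test1.bai
```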


You can also use the `:n` formatter to retrieve the part of `_input` without extension, and add `.bam.bai` to it as follows:

In [48]:
%sandbox
!touch test1.bam test2.bam

input_files = ['test1.bam', 'test2.bam']
input: input_files, group_by=1
output: f"{_input:n}.bam.bai"

print(f"Generating {_output} from {_input}")
_output.touch()

Generating test1.bam.bai from test1.bam
Generating test2.bam.bai from test2.bam


Note that the result of `_input + '.bai'` is a `path` and `f'{_input:n}.bam.bai'` is a `str`, but both will be converted to `_output` with type `sos_targets`. You can use either style, although formatters such as `:u` and `:b` can be shorter than `expanduser()` and `.name`, especially when you combine multiple formatters to extract certain parts of a `path`.
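
For comparison, the method-based style can be tried with plain `pathlib`. The sketch below is a rough counterpart to the formatters mentioned above, based only on how the text describes them (`:n` as the name without extension, `:b` alongside `.name`). The path `/data/run1/test1.bam` is a made-up example.

```python
from pathlib import PurePosixPath

bam = PurePosixPath('/data/run1/test1.bam')

base_name = bam.name                # file name without the directory part
without_ext = bam.with_suffix('')   # path with the last extension removed
index_name = f"{without_ext}.bam.bai"

print(base_name)
print(without_ext)
print(index_name)
```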

Second, because `path`, `file_target`, and even `sos_targets` support the `os.PathLike` interface, you can use them as filenames directly in SoS (Python). For example, you can pass a `sos_targets` with one element to the usual `os.path` functions such as `getsize` and open it directly with the `open` function:

In [49]:
input: "SoS_Syntax.ipynb"
print(f"Size of {repr(_input)} is {os.path.getsize(_input)}.")
print(f"The first two lines of {repr(_input)} are:")

with open(_input, 'r') as myself:
    print(myself.readline())
    print(myself.readline())

Size of [file_target('SoS_Syntax.ipynb')] is 1062666.
The first two lines of [file_target('SoS_Syntax.ipynb')] are:
{

 "cells": [



However, although the Python standard libraries support the `os.PathLike` interface, third-party modules might or might not work correctly in this way, so it can be safer to use `str(_input)` to convert the paths to `str` before passing them to these functions.
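
The `os.PathLike` protocol itself is small: an object only needs an `__fspath__` method that returns a `str` or `bytes`. The sketch below defines a hypothetical `SingleTarget` class (not an SoS class) to show how such an object works with `open` and `os.path.getsize`, and how an explicit string can be obtained for libraries that do not accept path-like objects.

```python
import os
import tempfile

class SingleTarget:
    """Hypothetical stand-in for an object that wraps one file name."""

    def __init__(self, filename):
        self._filename = filename

    def __fspath__(self):
        # Called by os.fspath(), open(), and os.path functions.
        return self._filename

    def __str__(self):
        # Explicit conversion for code that only accepts plain strings.
        return self._filename

with tempfile.TemporaryDirectory() as tmpdir:
    target = SingleTarget(os.path.join(tmpdir, 'example.txt'))

    with open(target, 'w') as handle:
        handle.write('hello')

    size = os.path.getsize(target)
    as_string = os.fspath(target)
    print(f"{os.path.basename(as_string)} has {size} bytes")
```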

## Further reading

* 