# SoS Data Types

* **Difficulty level**: easy
* **Time needed to learn**: 10 minutes or less
* **Key points**:
  * `sos_targets` 
  

##  SoS Targets <a id="SoS_Targets"></a>

A **target** is an object that can be created and detected. A SoS step can take a list of targets as input, check the existence of a list of dependent targets, and produce a list of targets as output. In addition, a target can have arbitrary attached "attributes" that provide additional information about the target.

The most common target type is `file_target`, which represents a file on the file system. `file_target` is derived from [pathlib.Path](https://docs.python.org/3.6/library/pathlib.html?highlight=pathlib#module-pathlib) with some additional functions. In addition to `file_target`, SoS provides a number of built-in targets such as `sos_variable` and `executable`, and targets defined by language modules such as `R_Library`. This section describes the general usage of targets and the syntax for the most frequently used target, `file_target`. Please refer to Chapter [Targets and Actions](Targets_and_Actions.html) for descriptions of other target types.

![sos_step](../media/target_classes.png)

<small>Class hierarchy for all target-related classes. All targets are derived from a virtual `BaseTarget` class. A `file_target` stands for a disk file and is represented by a `path` object, which is an extension of the `Path` object from `pathlib`. A `sos_targets` object is a collection of `BaseTarget` objects, which can be created from individual or lists of `BaseTarget` and `path` (class `paths`).</small>

### `BaseTarget` <a id="BaseTarget"></a>

`BaseTarget` is the parent class of all SoS targets. It cannot be used directly but it provides an interface for all targets to get and set attributes.

For example, the `input` statement of the following example creates a `sos_targets` object with two children, `file_target('a.txt')` and `file_target('b.txt')`. Option `paired_with` accepts a dictionary and assigns the attribute `sample` to each of the targets with the specified values.

In [7]:
input: 'a.txt', 'b.txt', paired_with={'sample': ['A', 'B']}
print(_input[0].sample)
print(_input[1].sample)

A
B
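
Conceptually, `paired_with` distributes each list in the dictionary across the targets by position. The plain-Python sketch below illustrates that pairing only; the helper `pair_attributes` and the dictionary representation of a target are hypothetical and not part of SoS, and the sketch assumes one value per target.

```python
def pair_attributes(names, paired_with):
    """Attach the i-th value of every attribute list to the i-th name."""
    for attr, values in paired_with.items():
        if len(values) != len(names):
            raise ValueError(f"attribute {attr!r} needs {len(names)} values")
    return [
        {"name": name, **{attr: values[i] for attr, values in paired_with.items()}}
        for i, name in enumerate(names)
    ]


paired = pair_attributes(["a.txt", "b.txt"], {"sample": ["A", "B"]})
print(paired[0]["sample"])
print(paired[1]["sample"])
```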


Although targets and their attributes are usually set in an `input` statement, you can create targets and set attributes directly. For example:

In [8]:
file_a = file_target('a.txt').set('sample', 'A')
print(file_a.sample)
print(file_a.get('sample'))

A
A


Here the `target.set(name, value)` function sets an attribute on the `target`, and `target.get(name, default=None)` gets the value of attribute `name`, returning `default` if `name` is not a valid attribute. It is therefore a safer way to retrieve an attribute than `target.name` if you are uncertain whether attribute `name` exists for `target`.
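
The difference between attribute access and `get` with a default can be illustrated without SoS. The class `DemoTarget` below is a hypothetical stand-in written for this sketch, not the SoS implementation; it only mimics the `set`/`get` behavior described above.

```python
class DemoTarget:
    """Hypothetical stand-in for a target that carries named attributes."""

    def __init__(self, name):
        self._name = name
        self._attrs = {}

    def set(self, key, value):
        self._attrs[key] = value
        return self  # returning self allows chaining after construction

    def get(self, key, default=None):
        return self._attrs.get(key, default)

    def __getattr__(self, key):
        # Only called when normal lookup fails; unknown attributes raise.
        try:
            return self.__dict__["_attrs"][key]
        except KeyError:
            raise AttributeError(key) from None


t = DemoTarget("a.txt").set("sample", "A")
print(t.sample)
print(t.get("sample"))
print(t.get("batch", "unknown"))  # falls back to the default instead of raising
```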

### File and directory names (`path`) <a id="File_and_directory_names_path"></a>

The `path` type handles the manipulation of file and directory names. It is derived from Python [pathlib.Path](https://docs.python.org/3/library/pathlib.html) but it

1. automatically expands the user home directory for paths starting with `~`
2. allows you to extend a `path` with the `+` operator
3. has a special `zap` operation to replace (large) files with their signatures
4. accepts a list of format options so that a path can easily be rendered in different formats

Let us start from the first feature:

In [1]:
p = path('~/sos/sos-docs/src/documentation/SoS_Syntax.ipynb')

In [2]:
p

path('/Users/bpeng1/sos/sos-docs/src/documentation/SoS_Syntax.ipynb')

In [3]:
p.name

'SoS_Syntax.ipynb'

In [4]:
p.parent

path('/Users/bpeng1/sos/sos-docs/src/documentation')

In [5]:
p.parent / 'something.txt'

path('/Users/bpeng1/sos/sos-docs/src/documentation/something.txt')

In [6]:
p.suffix

'.ipynb'

In [7]:
p.parts

('/',
 'Users',
 'bpeng1',
 'sos',
 'sos-docs',
 'src',
 'documentation',
 'SoS_Syntax.ipynb')

In [8]:
p.exists()

True

In [9]:
p.is_dir()

False

In [10]:
p.with_name('SoS_Syntax.html')

path('/Users/bpeng1/sos/sos-docs/src/documentation/SoS_Syntax.html')

In [11]:
p.with_suffix('.html')

path('/Users/bpeng1/sos/sos-docs/src/documentation/SoS_Syntax.html')

and you can evaluate `path` in format strings as

In [12]:
f"Hello My name is {p.name}"

'Hello My name is SoS_Syntax.ipynb'

Note that `path` offers an [`os.PathLike`](https://docs.python.org/3/library/os.html#os.PathLike) interface and can be used directly with `os.path` functions such as

In [13]:
os.path.getsize(p)

0
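
For comparison, the standard `pathlib.Path` does not expand `~` on construction; the expansion has to be requested with `expanduser()`. The short sketch below uses only the standard library (no SoS objects) to show that difference and the `os.PathLike` protocol that `os.path` functions rely on. The file name is an arbitrary example and does not need to exist.

```python
import os
from pathlib import Path

raw = Path("~/sos/demo.txt")
expanded = raw.expanduser()

print(raw)       # still starts with "~"
print(expanded)  # starts with the current user's home directory

# os.fspath() is what os.path functions call on PathLike objects.
print(os.fspath(expanded) == str(expanded))
```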

Secondly, **a `path` can be extended with a `str` or another `path` using the `+` operator**. For example, with

In [14]:
p = path('test.txt')

`p + '.tmp'` returns a `path` with `.tmp` appended to the path

In [15]:
p + '.tmp'

path('test.txt.tmp')

which is different from the `/` operator, which joins the operand as another piece of the path

In [16]:
p / '.tmp'

path('test.txt/.tmp')

A note of caution, however: because `path` strips the trailing slash from an input path

In [17]:
path('/path/to/')

path('/path/to')

the result of the following can be surprising:

In [18]:
path('/path/to/') + 'a.txt'

path('/path/toa.txt')

so the rule of thumb is that you should use `/` to extend path and `+` to extend name, as in

In [19]:
path('/path/to') / "dir" / "filename" + ".ext"

path('/path/to/dir/filename.ext')
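
The same rule of thumb can be reproduced with the standard library. In the sketch below, `append_text` is a hypothetical helper that imitates the `+` behavior on a `PurePosixPath` (which has no `+` operator of its own); it is not the SoS implementation.

```python
from pathlib import PurePosixPath


def append_text(p, text):
    """Imitate the `+` idea: append text to the last component of the path."""
    return PurePosixPath(str(p) + text)


base = PurePosixPath("/path/to/")  # the trailing slash is dropped
print(base)                        # /path/to
print(append_text(base, "a.txt"))  # /path/toa.txt
print(base / "a.txt")              # /path/to/a.txt
print(append_text(base / "dir" / "filename", ".ext"))  # /path/to/dir/filename.ext
```

In the chained expression with the real `path` type, `/` binds more tightly than `+` in Python, so the directory pieces are joined first and the extension is appended last.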

The third addition of the `path` type is **a `zap()` function that removes the file and creates a `{filename}.zapped` file with file signatures**. This `.zapped` file is considered to be "existent" by the runtime signature system, so a workflow step will not be repeated if some of its input or output files are zapped, unless the actual files are needed. An example of this usage will be given later.
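
To make the idea concrete, here is a minimal sketch of a zap-like operation using only the standard library. The functions `demo_zap` and `demo_exists` are hypothetical, and the content written to the marker file (name, size, and SHA-256 digest) is an assumption made for illustration; the actual layout of a `.zapped` file is determined by SoS.

```python
import hashlib
import tempfile
from pathlib import Path


def demo_zap(filename):
    """Replace `filename` with `filename + '.zapped'` holding a small signature.

    The signature layout used here is an assumption of this sketch.
    """
    src = Path(filename)
    data = src.read_bytes()  # fine for a demo; large files would be read in chunks
    marker = Path(str(src) + ".zapped")
    marker.write_text(f"{src.name}\t{len(data)}\t{hashlib.sha256(data).hexdigest()}\n")
    src.unlink()
    return marker


def demo_exists(filename):
    """Treat a file as present if either the file or its marker exists."""
    return Path(filename).exists() or Path(str(filename) + ".zapped").exists()


with tempfile.TemporaryDirectory() as tmp:
    big = Path(tmp) / "big.fastq"
    big.write_text("ACGT" * 1000)
    marker = demo_zap(big)
    result = (big.exists(), marker.exists(), demo_exists(big))

print(result)
```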

### Path list (`paths`) <a id="Path_list_paths"></a>

The `paths` type keeps a list of `path` objects. It can be initialized from one or more `str`, `path`, or lists of `path`. Lists of paths are flattened and concatenated, so the end result is always a one-dimensional list. A `paths` object behaves like a sequence that can be sliced and iterated. For example, the following statement creates a `paths` object with three filenames from a single filename and a list of two filenames:

In [26]:
targets = paths('a.txt', ['b.txt', 'c.txt'])
targets

[path('a.txt'), path('b.txt'), path('c.txt')]

You can access one or more elements of a `paths` or iterate through it

In [27]:
targets[2]

path('c.txt')

In [28]:
targets[1:]

[path('b.txt'), path('c.txt')]

In [29]:
for t in targets:
   print(t)

a.txt
b.txt
c.txt


To convert a `paths` object to a regular list, you can use function `list`

In [30]:
list(targets)

[path('a.txt'), path('b.txt'), path('c.txt')]

or take part of the `paths` with a slice, which also returns a regular list

In [31]:
type(targets[1:])

list

The first big difference between the formatting of `paths` and regular lists of `path` is that **formatting is applied to each item and the results are joined by spaces or commas**. For example, whereas a regular list is formatted as a list

In [32]:
target_list = ['a.txt', 'b.txt', 'c.txt']
f"{target_list}"

"['a.txt', 'b.txt', 'c.txt']"

a `paths` object is formatted as

In [33]:
f"{targets}"

'a.txt b.txt c.txt'

or separated by `,` with format option `,`

In [34]:
f"{targets:,}"

'a.txt,b.txt,c.txt'

or after formatting each element with specified formatter

In [35]:
f"{targets:r,}"

"'a.txt','b.txt','c.txt'"

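Per-item formatting of this kind can be built on Python's `__format__` protocol. The class `DemoPaths` below is a hypothetical stand-in that handles only the three cases shown above (no option, `,`, and `r,`); how SoS parses its full set of format options is not covered here.

```python
class DemoPaths:
    """Hypothetical list-like container whose format spec applies per item."""

    def __init__(self, *names):
        self._items = list(names)

    def __format__(self, spec):
        # A trailing comma selects "," as the separator; the default is a space.
        sep = "," if spec.endswith(",") else " "
        item_spec = spec.rstrip(",")
        # "r" is treated here as "use repr() of each item".
        if "r" in item_spec:
            items = [repr(x) for x in self._items]
        else:
            items = [str(x) for x in self._items]
        return sep.join(items)


demo = DemoPaths("a.txt", "b.txt", "c.txt")
print(f"{demo}")
print(f"{demo:,}")
print(f"{demo:r,}")
```
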
One particular consequence of this format rule is that a `paths` with only one element behaves exactly like a single path, so you can use `input` (a `paths`) in place of `input[0]` (a `path`) if you know there is only one target inside `input`:

In [36]:
single = paths('SoS_Syntax.ipynb')
f"{single[0]} is the same as {single}"

'SoS_Syntax.ipynb is the same as SoS_Syntax.ipynb'

In [37]:
single.suffix

'.ipynb'

In [38]:
single.resolve()

path('/Users/bpeng1/sos/sos-docs/src/documentation/SoS_Syntax.ipynb')

In [39]:
os.path.getsize(single)

0

Finally, although `paths` usually contains file or directory names, the fact that it formats as space separated items is very useful for passing lists of strings to command lines and SoS magics, so you will see the following usage pattern from time to time:

In [40]:
samples = ['sample_A', 'sample_B', 'sample_C']

In [41]:
!echo {paths(samples)}

sample_A sample_B sample_C


Here a list of strings is converted to `paths` just to be formatted as command line options, which is equivalent to the following but avoids problems when `samples` is a single string such as `'sample_A'`

In [42]:
!echo {' '.join(samples)}

sample_A sample_B sample_C
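
The problem with a single string is plain Python behavior: `str.join` iterates over its argument, and iterating over a string yields its characters. The sketch below shows the pitfall and one way around it; `as_list` is a hypothetical helper used only for this illustration.

```python
samples = ["sample_A", "sample_B", "sample_C"]
single = "sample_A"

print(" ".join(samples))  # sample_A sample_B sample_C
print(" ".join(single))   # s a m p l e _ A


def as_list(value):
    """Wrap a lone string in a list so that it is not split into characters."""
    return [value] if isinstance(value, str) else list(value)


print(" ".join(as_list(single)))  # sample_A
```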


`paths` also has a `zap()` function, which zaps all files in the list.

### File target (`file_target`) <a id="File_target_file_target"></a>

A `file_target` is a `path` with an additional `BaseTarget` interface. Because it is derived from `path`, it can be created from a string or a `path` and behaves exactly like a `path` object. Because it is also derived from `BaseTarget`, you can set and get arbitrary attributes on a `file_target`.

### Target list (`sos_targets`) <a id="Target_list_sos_targets"></a>

`sos_targets` is the type for all SoS system variables (`step_input`, `_input`, `step_output`, `_output`, `step_depends`, and `_depends`). These variables are created by SoS from `input:`, `output:` and `depends:` statements and are used to, for example, compose scripts in different languages.

`sos_targets` shares the same design as `paths`, although its content is a list of `BaseTarget` objects. It can be created from `str`, `list`, `path`, `paths`, `file_target`, `sos_targets`, and many other `BaseTarget` types. In addition, a `sos_targets` can contain `groups`, which is a list of `sos_targets` consisting of subsets of targets.

#### `step_input` and `_input`

In SoS, the `input` statement mostly creates a `step_input` object with provided parameters. That is to say,

```
input: 'a.txt', 'b.txt', group_by=1
```
is almost equivalent to

```
step_input = sos_targets('a.txt', 'b.txt', group_by=1)
```
and we can use `sos_targets` objects directly in an `input` statement in more complicated cases.

Variable `_input` represents the input targets for each substep (`groups` of `sos_targets` as we will see later).

In the simple case, `step_input` is the same as `_input`. For example, variables `step_input` and `_input` of the following step are `sos_targets` objects with a single `file_target` object:

In [1]:
input: 'SoS_Syntax.ipynb'
print(f"step_input={step_input:r}")
print(f"_input={_input:r}")
sh: expand=True
    wc -l {_input}

step_input='SoS_Syntax.ipynb'
_input='SoS_Syntax.ipynb'
    4626 SoS_Syntax.ipynb


and if you have multiple input files, you can pass them all together as a `sos_targets` with two `file_target` objects

In [2]:
input: 'SoS_Syntax.ipynb', 'SoS_Magics.ipynb'
print(f"step_input={step_input:r}")
print(f"_input={_input:r}")
sh: expand=True
    wc -l {_input[0]}
    wc -l {_input[1]}

step_input='SoS_Syntax.ipynb' 'SoS_Magics.ipynb'
_input='SoS_Syntax.ipynb' 'SoS_Magics.ipynb'
    4626 SoS_Syntax.ipynb
    3747 SoS_Magics.ipynb


or separately as two groups of inputs:

In [3]:
input: 'SoS_Syntax.ipynb', 'SoS_Magics.ipynb', group_by=1
print(f"step_input={step_input:r}")
print(f"_input={_input:r}")
sh: expand=True
    wc -l {_input}

step_input='SoS_Syntax.ipynb' 'SoS_Magics.ipynb'
_input='SoS_Syntax.ipynb'
    4626 SoS_Syntax.ipynb
step_input='SoS_Syntax.ipynb' 'SoS_Magics.ipynb'
_input='SoS_Magics.ipynb'
    3747 SoS_Magics.ipynb


In this case, the step input contains two `file_target` objects
```
step_input = sos_targets('SoS_Syntax.ipynb', 'SoS_Magics.ipynb')
```
but the step process is executed twice, with 
```
_input = sos_targets('SoS_Syntax.ipynb')
_input = sos_targets('SoS_Magics.ipynb')
```
respectively. Because `_input` contains only one element, it is not necessary to use `_input[0]` in the script.

#### `sources` of targets

Targets in a `sos_targets` have an attribute `source`, which corresponds to the step in which the target is specified (input) or generated (output). For example, the `source` of a `sos_targets` that is directly specified in a step is the name of the step.

In [2]:
%run
[hi]
input: 'SoS_Syntax.ipynb'
print(_input.sources)

['hi']


If you have multiple inputs, you can separate them into different groups using keyword arguments

In [10]:
!touch a.bam b.bam a.bai b.bai

input: bam=['a.bam', 'b.bam'], bai=['a.bai', 'a.bai']

print(_input)
print(_input.sources)

a.bam b.bam a.bai a.bai
['bam', 'bam', 'bai', 'bai']


If the input target is inherited from another step, the source will be the name of that step.

In [3]:
%run
[10]
output: 'a.txt'
_output.touch()

[11]
print(_input.sources)

['10']


In a more complex case when the source comes from multiple input steps and the present step, the `sources` attribute points out the source of each target:

In [11]:
!touch c.txt
%run
[step_10]
output: 'a.txt'
_output.touch()

[step_20]
output: 'b.txt'
_output.touch()

[step_30]
input: 'c.txt', output_from(['step_10', 'step_20'])
print(_input)
print(_input.sources)


c.txt a.txt b.txt
['step_30', 'step_10', 'step_20']


However, the use of a keyword argument overrides the default source:

In [12]:
!touch c.txt
%run
[step_10]
output: 'a.txt'
_output.touch()

[step_20]
output: 'b.txt'
_output.touch()

[step_30]
input: 'c.txt', prev=output_from(['step_10', 'step_20'])
print(_input)
print(_input.sources)


c.txt a.txt b.txt
['step_30', 'prev', 'prev']


The `source` information can be used to select subsets of targets according to their sources. For example, `_input['prev']` would generate a `sos_targets` with all targets from source `prev`.

In [14]:
!touch c.txt
%run
[step_10]
output: 'a.txt'
_output.touch()

[step_20]
output: 'b.txt'
_output.touch()

[step_30]
input: 'c.txt', output_from(['step_10', 'step_20'])
print(_input['step_10'])
print(_input['step_10'].sources)


a.txt
['step_10']
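
Selection by source amounts to filtering on a label stored next to each target. The class `DemoTargets` below is a hypothetical, simplified stand-in that keeps `(name, source)` pairs; it is meant only to illustrate the lookup, not to reflect how `sos_targets` is implemented.

```python
class DemoTargets:
    """Hypothetical container pairing each target name with a source label."""

    def __init__(self, items):
        self._items = list(items)  # list of (name, source) tuples

    @property
    def sources(self):
        return [source for _, source in self._items]

    def __getitem__(self, key):
        if isinstance(key, str):
            # A string key selects all targets with that source.
            return DemoTargets([it for it in self._items if it[1] == key])
        return self._items[key][0]

    def __str__(self):
        return " ".join(name for name, _ in self._items)


inputs = DemoTargets(
    [("c.txt", "step_30"), ("a.txt", "step_10"), ("b.txt", "step_20")]
)
print(inputs.sources)
print(inputs["step_10"])
print(inputs["step_10"].sources)
```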


#### `groups` of `sos_targets`

As we have seen, targets in a `sos_targets` can be grouped in many ways, and `_input` contains a subset of the targets that serves as the input for each substep. In the following example, the four input files are split into two groups of the same size. The step is executed twice, each time for a different group. `step_input.groups` contains a list of `sos_targets`, each of which becomes the `_input` of one substep.

In [18]:
input: 'a.txt', 'b.txt', 'c.txt', 'd.txt', group_by=2
print(f'\nGroup {_index}')
print(step_input.groups)
print(_input)


Group 0
[[file_target('a.txt'), file_target('b.txt')], [file_target('c.txt'), file_target('d.txt')]]
a.txt b.txt

Group 1
[[file_target('a.txt'), file_target('b.txt')], [file_target('c.txt'), file_target('d.txt')]]
c.txt d.txt
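
The grouping itself can be pictured as chunking a list. The following plain-Python sketch uses a hypothetical helper `group_by_size` on a list of strings; in this sketch a final group with fewer items is simply shorter, and SoS's handling of uneven groups is not covered here.

```python
def group_by_size(items, size):
    """Split `items` into consecutive groups of `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


step_input = ["a.txt", "b.txt", "c.txt", "d.txt"]
groups = group_by_size(step_input, 2)

for index, group in enumerate(groups):
    # `index` and `group` play the roles of `_index` and `_input` in a substep.
    print(f"Group {index}: {' '.join(group)}")
```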


#### `zap` file targets 

`sos_targets` provides a `zap()` function, which zaps all file targets in the list. This technique is usually used to remove large intermediate files during the execution of a workflow. For example, if you have a workflow that downloads and processes large files, you can do something like

```
[download: provides='{file}.fastq']
download: expand=True
    http://some_url/{file}.fastq

[default]
input: [f'{x}.fastq' for x in range(1000)], group_by=1
output: _input.with_suffix('.bam')
sh: expand=True
    process {_input} to {_output}

_input.zap()
```

In this example, 1000 `fastq` files are downloaded and processed, but the input files are zapped after they are processed. Although the files have been removed, re-running the workflow will not download and process the files again because the downloaded files are still considered to exist by SoS.
