# How to specify input and output files and process input files in groups

* **Difficulty level**: easy
* **Time needed to learn**: 10 minutes or less
* **Key points**:
  * Input files are specified with the `input` statement, which defines variable `_input`
  * Output files are specified with the `output` statement, which defines variable `_output`
  * Input files can be processed in groups with the `group_by` option
  

## Specifying step input and output

Returning to the example workflow from [our first tutorial](sos_in_notebook.html), recall that we defined variables such as `excel_file` and used them directly in the scripts.

In [6]:
[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

[plot_10]
run: expand=True
    xlsx2csv {excel_file} > {csv_file}

[plot_20]
R: expand=True
    data <- read.csv('{csv_file}')
    pdf('{figure_file}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

You can add an `input` and an `output` statement to each step and write the workflow as follows:

In [9]:
[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

[plot_10]
input: excel_file
output: csv_file

run: expand=True
    xlsx2csv {_input} > {_output}

[plot_20]
input: csv_file
output: figure_file

R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

Comparing the two workflows, you will notice that steps in the new workflow have `input` and `output` statements that define the input and output of the steps, and two magic variables `_input` and `_output` are used in the scripts.
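As a rough mental model, `expand=True` treats the script like a Python format string and substitutes `{_input}` and `{_output}` before the script runs. The plain-Python sketch below only illustrates that substitution. The helper `expand_script` is hypothetical and not part of SoS, and it uses plain strings where SoS uses `sos_targets` objects.

```python
def expand_script(template: str, **variables) -> str:
    """Substitute {name} placeholders, loosely mimicking expand=True.

    Hypothetical helper for illustration only; not an SoS function.
    """
    return template.format(**variables)


# Plain strings stand in for the _input and _output of step plot_10
command = expand_script(
    "xlsx2csv {_input} > {_output}",
    _input="data/DEG.xlsx",
    _output="DEG.csv",
)
print(command)
```

The script itself never mentions `excel_file` or `csv_file`; it only refers to whatever the `input` and `output` statements declare.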

## Substeps and input option `group_by`

The `input` and `output` statements tell SoS what each step consumes and produces, which allows SoS to handle these files more intelligently. The most useful consequence, however, is the definition of substeps, which allow SoS to process groups of input one by one, and/or the same groups of input with different sets of variables (option `for_each`, which will be discussed later).

Let us assume that we have two input files `data/S20_R1.fastq` and `data/S20_R2.fastq` and we would like to check their quality using a tool called [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). A straightforward approach is to call the `sh` action in a `for` loop:

In [1]:
for infile in ['data/S20_R1.fastq', 'data/S20_R2.fastq']:
    sh(f'fastqc {infile}')

Started analysis of S20_R1.fastq
Analysis complete for S20_R1.fastq
Started analysis of S20_R2.fastq
Analysis complete for S20_R2.fastq


There are a few problems with this approach:

* The action `sh` cannot be written in the script format because it is called in a for loop
* The input files are handled one by one although they are independent and can be processed in parallel

To address these problems, you can write the step as follows:

In [2]:
input: 'data/S20_R1.fastq', 'data/S20_R2.fastq', group_by=1

sh: expand=True
    fastqc {_input}

Started analysis of S20_R1.fastq
Analysis complete for S20_R1.fastq
Started analysis of S20_R2.fastq
Analysis complete for S20_R2.fastq


<div class="bs-callout bs-callout-primary" role="alert">
    <h4>Substeps created by the <code>group_by</code> input option</h4>
    <ul>
        <li>The <code>group_by</code> option groups input files and creates multiple groups of input files</li>  
        <li>One <em>substep</em> is created for each group of input files</li>
        <li>The input of each substep is stored in variable <code>_input</code></li>
        <li>The substeps are by default executed in parallel</li>
    </ul>
</div>

In this example, option `group_by=1` divides the two input files into two groups, each with one input file. Two substeps are created from the groups. They execute the same step process (statements after the `input` statement) but with different values of variable `_input`. The `sh` action is written in the script format, which can be a lot more readable if the script is long. The substeps are executed in parallel, so the step can be completed a lot faster than the `for` loop version.
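The grouping and parallel execution described above can be sketched in plain Python. This is only an analogy under stated assumptions: `group_inputs` and `run_substep` are hypothetical helpers, a thread pool stands in for whatever scheduling SoS actually uses, and the substep returns the command it would run instead of calling `fastqc`.

```python
from concurrent.futures import ThreadPoolExecutor


def group_inputs(files, size):
    """Split files into consecutive groups of `size` items, like group_by=size."""
    return [files[i:i + size] for i in range(0, len(files), size)]


def run_substep(_input):
    """Stand-in for the step body; returns the command instead of running it."""
    return "fastqc " + " ".join(_input)


files = ["data/S20_R1.fastq", "data/S20_R2.fastq"]
groups = group_inputs(files, 1)  # two groups with one file each

# Each group becomes one substep; the substeps are independent of each other
with ThreadPoolExecutor() as pool:
    commands = list(pool.map(run_substep, groups))

for command in commands:
    print(command)
```

Because each substep sees only its own `_input`, the step body does not need a loop at all.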

## Output of substeps

<div class="bs-callout bs-callout-primary" role="alert">
    <h4>The <code>output</code> statement</h4>
    <ul>
        <li>The <code>output</code> statement defines the output of each substep</li>
        <li>The output of the entire step consists of output from each substep</li>
    </ul>
</div>

The `input` statement defines input of the entire step, and optionally input of each substep as variable `_input`. **The `output` statement, however, defines the output of each substep**.

In the following example, the two input files are divided into two groups, represented by `_input` for each substep. The `output` statement defines a variable `_output` for each substep.

In [12]:
input: 'data/S20_R1.fastq', 'data/S20_R2.fastq', group_by=1
output: f'{_input:n}_fastqc.html'

sh: expand=True
    fastqc {_input}

Started analysis of S20_R1.fastq
Started analysis of S20_R2.fastq
Analysis complete for S20_R1.fastq
Analysis complete for S20_R2.fastq


<div class="bs-callout bs-callout-primary" role="alert">
    <h4>Special format specification for <code>_input</code> objects</h4>
    <p>SoS variables <code>_input</code> and <code>_output</code> are of type <code>sos_targets</code> and accept additional <a href="https://docs.python.org/3/reference/lexical_analysis.html#f-strings">format specifications</a>. For example, 
    <ul>
        <li><code>:n</code> is the name of the path, that is, the path without its extension. e.g. <code>f'{_input:n}'</code> returns <code>/path/to/a</code> if <code>_input</code> is <code>/path/to/a.txt</code></li>
         <li><code>:b</code> is the basename of the path. e.g. <code>a.txt</code> from <code>/path/to/a.txt</code></li>
        <li><code>:d</code> is the directory name of the path. e.g. <code>/path/to</code> from <code>/path/to/a.txt</code></li>
    </ul>
</div>

The output statement of this example is

```python
output: f'{_input:n}_fastqc.html'
```
which takes the name of `_input` and adds `_fastqc.html`. For example, if `_input = 'data/S20_R1.fastq'`, the corresponding `_output = 'data/S20_R1_fastqc.html'`.

With this `output` statement, SoS will, among many other things, check that the output is properly generated after the completion of each substep, and return an output object for the step that contains the `_output` of each substep.
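The check and the collection of substep outputs can be pictured with the plain-Python sketch below. It rests on assumptions: `check_outputs` is a hypothetical helper, the report files are created as empty placeholders in a temporary directory instead of by `fastqc`, and SoS's own verification does considerably more than test for file existence.

```python
import os
import tempfile


def check_outputs(substep_outputs):
    """Raise if an expected file is missing; otherwise return the combined list."""
    missing = [path for path in substep_outputs if not os.path.isfile(path)]
    if missing:
        raise RuntimeError(f"Output not generated: {', '.join(missing)}")
    return list(substep_outputs)


with tempfile.TemporaryDirectory() as workdir:
    inputs = [os.path.join(workdir, name) for name in ("S20_R1.fastq", "S20_R2.fastq")]
    # One _output per substep, derived from its _input as in the output statement
    outputs = [os.path.splitext(path)[0] + "_fastqc.html" for path in inputs]
    # Pretend that each substep produced its report
    for path in outputs:
        open(path, "w").close()
    # The step output is the collection of all substep outputs
    step_output = check_outputs(outputs)
    print([os.path.basename(path) for path in step_output])
```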

## Further reading

* [Parameter `group_by` of `sos_targets`](doc/user_guide/ref_group_by.html)
* [SoS Data Types](doc/user_guide/sos_datatypes.html)