# How to use SoS Notebook to organize and share your project

* **Difficulty level**: easy
* **Time needed to learn**: 10 minutes or less
* **Key points**:
  * Use SoS workflow system to record multi-script, multi-language data analysis
  * Adding comments and help messages
  * Using command line options to apply the workflow to new batches of data

This tutorial is a recap of what we have learned from other tutorials, with an emphasis on how to organize your data analysis so that it is easy to share and reproduce.

## Using SoS to record your data analysis

As we have shown in the [Using SoS workflow system in Jupyter and from command line](doc/user_guide/sos_in_notebook.html) and the following tutorials, SoS allows you to perform your data analysis in Jupyter or record the scripts you developed in other environments in a Jupyter notebook, without a steep learning curve.

### Using SoS Notebook for interactive multi-language data analysis

Firstly, you can perform your data analysis in Jupyter using multiple kernels in one notebook. Without going into the details on how SoS Notebook can assist the interactive data analysis, here is what the end result might look like.

<div class="bs-callout bs-callout-primary" role="alert">
    <h4>Multi-language notebook</h4>
  <ul>
      <li>Data analysis is performed by multiple kernels</li>
      <li>Analysis in each kernel can be separated into multiple cells</li>
      <li>The <code>%expand</code> magic can be used to pass variables from SoS to subkernels</li>
      <li>The entire data analysis can be rerun using the <code>Kernel</code> => <code>Restart Kernel and Run All Cells</code></li>
    </ul>
</div>

In [1]:
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

In [2]:
%expand
xlsx2csv {excel_file} > {csv_file}

In [3]:
%expand
data <- read.csv('{csv_file}')
pdf('{figure_file}')
plot(data$log2FoldChange, data$stat)
dev.off()
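
The effect of `%expand` in the cells above can be imitated in plain Python. The sketch below is only an approximation: SoS performs the interpolation itself before handing the script to the subkernel, and `str.format` is used here purely to show what text the shell and R kernels end up receiving. The template strings are copied from the cells above; the names `shell_template`, `r_template`, and `variables` are introduced for illustration.

```python
# Variables as defined in the SoS cell of the notebook.
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

# Script text as typed in the cells that start with %expand.
shell_template = "xlsx2csv {excel_file} > {csv_file}"
r_template = "data <- read.csv('{csv_file}')\npdf('{figure_file}')"

# Approximation of the expansion: {name} is replaced by the variable's value.
variables = dict(excel_file=excel_file, csv_file=csv_file, figure_file=figure_file)
shell_script = shell_template.format(**variables)
r_script = r_template.format(**variables)

print(shell_script)
print(r_script)
```

The subkernels never see the braces, only the expanded file names, which is why the same three variables can drive both the shell and the R cells.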

This tutorial does not introduce any 

## Simple SoS workflow

<div class="bs-callout bs-callout-primary" role="alert">
    <h4>Simple workflows with numerically numbered steps</h4>
  <ul>
      <li>Workflows with numerically numbered steps</li>
      <li>Definition of input and output is optional</li>
      <li>Execute the workflow from within the notebook using magics <code>%run</code>, <code>%sosrun</code>, or from command line using <code>sos run</code></li>
    </ul>
</div>

The multi-language data analysis can be converted almost trivially to the following SoS workflow. In contrast to analysis in SoS Notebook, each step must contain a complete script that can be executed independently of other steps. One of the benefits of the conversion is that the workflow can be executed from the command line.

In [2]:
[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

[plot_10]
run: expand=True
    xlsx2csv {excel_file} > {csv_file}

[plot_20]
R: expand=True
    data <- read.csv('{csv_file}')
    pdf('{figure_file}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

## Allow command line options

<div class="bs-callout bs-callout-primary" role="alert">
    <h4>Adding command line options</h4>
  <ul>
      <li>Command line options are defined with the <code>parameter</code> statement.</li>
      <li>Both optional and mandatory options are supported</li>
    </ul>
</div>

Adding command line options allows you to apply the workflow to other data sets, usually from the command line:

In [None]:
%run --excel-file data/DEG.xlsx

[global]
parameter: excel_file = str
parameter: csv_file = 'DEG.csv'
parameter: figure_file = 'output.pdf'

[plot_10]
run: expand=True
    xlsx2csv {excel_file} > {csv_file}

[plot_20]
R: expand=True
    data <- read.csv('{csv_file}')
    pdf('{figure_file}')
    plot(data$log2FoldChange, data$stat)
    dev.off()
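
Once the workflow accepts `--excel-file`, `--csv-file`, and `--figure-file`, applying it to a batch of files amounts to building one `sos run` command per input. The following is a minimal sketch of that idea, not part of SoS itself. It assumes the notebook is saved as `analysis.ipynb` and that a second input `data/batch2.xlsx` exists; both names are hypothetical, as is the helper `build_sos_command`. The commands are only printed, so the sketch runs even where `sos` is not installed.

```python
import shlex
from pathlib import Path


def build_sos_command(notebook, excel_file, workflow="plot"):
    """Return the argument list for one `sos run` invocation."""
    stem = Path(excel_file).stem  # e.g. 'DEG' for 'data/DEG.xlsx'
    return [
        "sos", "run", notebook, workflow,
        "--excel-file", str(excel_file),
        "--csv-file", f"{stem}.csv",      # one intermediate file per input
        "--figure-file", f"{stem}.pdf",   # one figure per input
    ]


# Hypothetical batch of input files; replace with your own paths.
batch = ["data/DEG.xlsx", "data/batch2.xlsx"]
commands = [build_sos_command("analysis.ipynb", path) for path in batch]

for cmd in commands:
    print(shlex.join(cmd))
    # To execute instead of print: subprocess.run(cmd, check=True)
```

Deriving the `.csv` and `.pdf` names from each input keeps the batches from overwriting each other's default `DEG.csv` and `output.pdf`.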

## Adding help messages

<div class="bs-callout bs-callout-primary" role="alert">
    <h4>Adding comments to scripts</h4>
  <ul>
      <li>The first comment block is the description of the script. This is where you introduce the purpose of the workflows.
</li>
      <li>Comments immediately before section headers and <code>parameter:</code> definitions become the descriptions of the sections and parameters.</li>
      <li>Workflow, step, and parameter descriptions are displayed in the output of the <code>-h</code> option of the script.</li>
    </ul>
</div>



In [1]:
%run -h

# This workflow converts input excel file
# into a .csv file and plot fields log2FoldChange
# again stat

[global]
# input excel file
parameter: excel_file = str

# intermediate csv file
parameter: csv_file = 'DEG.csv'

# output figure file
parameter: figure_file = 'output.pdf'

# Uses command xlsx2csv to convert
# excel file to csv format
[plot_10]
run: expand=True
    xlsx2csv {excel_file} > {csv_file}

# Load data in csv format and plot log2FoldChange
# again stat
[plot_20]
R: expand=True
    data <- read.csv('{csv_file}')
    pdf('{figure_file}')
    plot(data$log2FoldChange, data$stat)
    dev.off()



usage: sos run .sos/interactive.sos [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

This workflow converts input excel file
into a .csv file and plot fields log2FoldChange
again stat

Workflows:
  plot

Global Workflow Options:
  --excel-file VAL (as str, required)
                        input excel file
  --csv-file 'DEG.csv'
                        intermediate csv file
  --figure-file 'output.pdf'
                        output figure file

Sections
  plot_10:              Uses command xlsx2csv to convert excel file to csv
                        format
  plot_20:              Load data in csv format and plot log2FoldChange again
                        stat


<div class="bs-callout bs-callout-info" role="alert">
    <h4>Getting the help message of SoS scripts</h4>
<p>You can get the help message of a notebook with embedded workflows using the command</p>
<pre>
    sos run notebook.ipynb -h
    
</pre>
    <p>The same holds for a SoS workflow in plain text format. That is to say, for any SoS script, you can run</p>
<pre>
    sos run script.sos -h
    
</pre>
<p>to get its usage message. Furthermore, if your SoS script has a shebang line</p>
<pre>
    #!/usr/bin/env sos-runner
    
</pre>
<p>and has executable permission (<code>chmod +x script.sos</code>), you can execute the workflow as a regular command</p>
<pre>
    script.sos [options]
    
</pre>
<p>and get its help message with the command</p>
<pre>
    script.sos -h
    
</pre>
</div>
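
The shebang and permission steps described above can also be scripted. The sketch below writes a small workflow to `script.sos` in a temporary directory and sets the executable bits, roughly what `chmod +x script.sos` does. The workflow body is a shortened copy of the example in this tutorial, and the temporary location is an assumption made so the sketch does not touch your project files.

```python
import os
import stat
import tempfile

# Shebang line from the text, followed by a shortened copy of the workflow.
workflow_text = """#!/usr/bin/env sos-runner
[global]
parameter: excel_file = str
parameter: csv_file = 'DEG.csv'

[plot_10]
run: expand=True
    xlsx2csv {excel_file} > {csv_file}
"""

# Write the script to a temporary directory (assumed location for this sketch).
script_dir = tempfile.mkdtemp()
script_path = os.path.join(script_dir, "script.sos")
with open(script_path, "w") as fh:
    fh.write(workflow_text)

# Add execute permission for user, group, and others.
mode = os.stat(script_path).st_mode
os.chmod(script_path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

user_can_execute = bool(os.stat(script_path).st_mode & stat.S_IXUSR)
print(script_path, user_can_execute)
```

Running the resulting file still requires `sos-runner` to be on your `PATH`, which is the case when SoS is installed.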

## Adding multiple workflows in one SoS notebook

If your analysis contains multiple related or unrelated steps, you can include them all in the notebook and execute them by name.

In [7]:
[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

[convert]
run: expand=True
    xlsx2csv {excel_file} > {csv_file}

[plot]
R: expand=True
    data <- read.csv('{csv_file}')
    pdf('{figure_file}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

## Adding step input and output to analyze input files in parallel

Now that you have learned the basics of SoS, you can go ahead and use them to organize your scripts. However, SoS is a very powerful system that can also be used to write more complex workflows and to execute scripts in containers and on remote hosts. The following example from [How to define and execute basic SoS workflows](doc/user_guide/forward_workflow.html) demonstrates the creation and passing of substeps; you can learn more from other tutorials.

In [8]:
[10]
input: 'data/S20_R1.fastq', 'data/S20_R2.fastq', group_by=1
output: f'{_input:n}_fastqc.html'

sh: expand=True
    fastqc {_input}
    
[20]

from bs4 import BeautifulSoup

with open(_input) as html:
    soup = BeautifulSoup(html)
    for h2 in soup.findAll('h2'):
        if h2.img:
            print(f"{_input:bn} {h2.text}: {h2.img['alt']}")
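
To make the file names in this workflow concrete, the sketch below imitates in plain Python how `group_by=1` splits the input into one group per file and how the `:n` and `:bn` format specifications shape the names. It is an approximation written with `pathlib`, not SoS code: the helper functions `without_extension` and `base_without_extension` are hypothetical stand-ins for the `:n` (name without extension) and `:bn` (base name without extension) formats.

```python
from pathlib import PurePosixPath


def without_extension(path):
    """Imitate ':n': drop the last extension, keep the directory."""
    return str(PurePosixPath(path).with_suffix(""))


def base_without_extension(path):
    """Imitate ':bn': file name without directory or extension."""
    return PurePosixPath(path).stem


inputs = ["data/S20_R1.fastq", "data/S20_R2.fastq"]

# group_by=1 places each input file in its own group, one substep per group.
groups = [[path] for path in inputs]

# Step 10: one output per substep, as in f'{_input:n}_fastqc.html'.
outputs = [f"{without_extension(group[0])}_fastqc.html" for group in groups]

# Step 20: label printed for each report, as in f"{_input:bn}".
labels = [base_without_extension(out) for out in outputs]

print(outputs)
print(labels)
```

Each substep of step `10` therefore produces its own report, and step `20` receives those reports one at a time as `_input`.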