In [6]:
# Disable word wrap in outputs (long URLs stay on one line and scroll out of view)
from IPython.display import display, HTML
display(HTML("<style>div.output_area pre {white-space: pre;}</style>"))

[Structure the pipeline](https://www.conducto.com/docs/basics/pipeline-structure) with a [`Serial`](https://www.conducto.com/docs/basics/pipeline-structure#serial) node at the root and give it some [`Parallel`](https://www.conducto.com/docs/basics/pipeline-structure#parallel) children.

In [1]:
import conducto as co

root = co.Serial()
root["Download"] = co.Parallel()
root["Process"] = co.Parallel()
print(root.pretty())

/
├─0 Download
└─1 Process


`/Process` won't start until `/Download` is complete, but the children of each node will run in parallel with one another.
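
The ordering described above can be illustrated with a toy model in plain Python. This is only a sketch of the semantics, not how Conducto schedules nodes; `run_parallel`, `run_serial`, and the task names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(tasks):
    """Start every task at once and wait until all of them have finished."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


def run_serial(stages):
    """Run the stages one after another; each stage is a group of parallel tasks."""
    return [run_parallel(tasks) for tasks in stages]


log = []
# Hypothetical stand-ins for the children of /Download and /Process.
download = [lambda n=n: log.append(f"download {n}") for n in ("a", "b")]
process = [lambda n=n: log.append(f"process {n}") for n in ("a", "b")]

run_serial([download, process])
print(log)  # every "download" entry appears before any "process" entry
```

Within a stage the order of entries is not guaranteed, which mirrors the fact that siblings under a `Parallel` node have no defined order.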
 
[`Exec`](https://www.conducto.com/docs/basics/pipeline-structure) nodes take shell commands and run them in the environment defined by an [`Image`](https://www.conducto.com/docs/basics/images).

In [2]:
prep_img = co.Image(reqs_packages=["wget", "gzip"])

for name, url in genomes + genes:
    root["Download"][name] = co.Serial(image=prep_img)
    root["Download"][name]["Get"] = co.Exec(f"wget -O {data}/{name}.fna.gz {url}")
    root["Download"][name]["Decompress"] = co.Exec(f"cd {data} && gunzip {name}.fna.gz")
print(root.pretty())

/
├─0 Download
│ ├─ s_cerevisiae
│ │ ├─0 Get   wget -O /conducto/data/pipeline/s_cerevisiae.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
│ │ └─1 Decompress   cd /conducto/data/pipeline && gunzip s_cerevisiae.fna.gz
│ ├─ b_bruxellensis
│ │ ├─0 Get   wget -O /conducto/data/pipeline/b_bruxellensis.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/011/074/885/GCA_011074885.2_ASM1107488v2/GCA_011074885.2_ASM1107488v2_genomic.fna.gz
│ │ └─1 Decompress   cd /conducto/data/pipeline && gunzip b_bruxellensis.fna.gz
│ ...
└─1 Process
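
The loop above relies on `genomes`, `genes`, and `data`, which are defined outside this section. As a rough sketch, the values below are inferred from the printed tree (the two genome entries and the data directory); the contents of `genes` are not visible here, so it is left empty, and `download_commands` is a hypothetical helper that only builds the command strings without touching Conducto.

```python
# Assumption: values copied from the printed tree; the real definitions live elsewhere.
data = "/conducto/data/pipeline"
genomes = [
    (
        "s_cerevisiae",
        "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/"
        "GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz",
    ),
    (
        "b_bruxellensis",
        "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/011/074/885/"
        "GCA_011074885.2_ASM1107488v2/GCA_011074885.2_ASM1107488v2_genomic.fna.gz",
    ),
]
genes = []  # not shown in this section, so left empty in this sketch


def download_commands(name, url):
    """Return the (Get, Decompress) shell commands for one dataset."""
    get = f"wget -O {data}/{name}.fna.gz {url}"
    decompress = f"cd {data} && gunzip {name}.fna.gz"
    return get, decompress


for name, url in genomes + genes:
    get, decompress = download_commands(name, url)
    print(name, get, decompress, sep="\n  ")
```

`gunzip` replaces `{name}.fna.gz` with `{name}.fna`, which is the path the `Process` step reads later.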


[`Exec`](https://www.conducto.com/docs/basics/pipeline-structure) nodes can also call Python functions. For this, they need an image that contains the current file and its dependencies.

Here, the `Process` nodes depend on some Python packages and on [an image from DockerHub](https://hub.docker.com/r/ncbi/blast) that already has the tool we need.

In [5]:
root["Process"].image = co.Image(
    image="ncbi/blast", copy_dir=".", reqs_py=["conducto", "biopython"]
)

def process(target, genes, hits):
    pass  # processing code goes here

for name, _ in genomes:
    root["Process"][name] = co.Exec(
        process,
        f"{data}/{name}.fna",
        f"{data}/S288C.fna",
        f"{data}/{name}.xml",
    )
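
In the printed tree, each function-backed `Exec` node shows up as a command line with one `--name=value` flag per argument of `process`. A minimal sketch of that mapping, using only the standard library, might look like the following. `as_flags` is a hypothetical helper for illustration; it is not part of Conducto, whose actual serialization may differ.

```python
import inspect


def process(target, genes, hits):
    pass  # placeholder, as in the cell above


def as_flags(func, *args, **kwargs):
    """Bind the arguments to func's parameter names and render --name=value flags."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    return [func.__name__] + [f"--{key}={value}" for key, value in bound.arguments.items()]


data = "/conducto/data/pipeline"  # assumption: taken from the paths in the printed tree
print(" ".join(as_flags(
    process,
    f"{data}/s_cerevisiae.fna",
    f"{data}/S288C.fna",
    f"{data}/s_cerevisiae.xml",
)))
```

Binding against the signature is what lets positional arguments be reported under the parameter names `target`, `genes`, and `hits`.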

Finally, we'll add an interactive environment where we can explore the output of the `Process` step. We give this node extra `cpu` and `mem` so that exploration stays responsive.

In [3]:
root["Analyze"] = co.Exec("analyze.ipynb", cpu=8, mem=32)
print(root.pretty())

/
├─0 Download
│ ├─ s_cerevisiae
│ ├─ b_bruxellensis
│ ...
├─1 Process
│ ├─ s_cerevisiae   conducto __conducto_intermediate_path:/home/user/src/conducto/examples/data_science/saccharomyces/pipeline.py:endpath__ process --target=/conducto/data/pipeline/s_cerevisiae.fna --genes=/conducto/data/pipeline/S288C.fna --hits=/conducto/data/pipeline/s_cerevisiae.xml
│ ├─ b_bruxellensis   conducto __conducto_intermediate_path:/home/user/src/conducto/examples/data_science/saccharomyces/pipeline.py:endpath__ process --target=/conducto/data/pipeline/b_bruxellensis.fna --genes=/conducto/data/pipeline/S288C.fna --hits=/conducto/data/pipeline/b_bruxellensis.xml
│ ...
└─2 Analyze   analyze.ipynb


When we launch this pipeline, we get a link to the Conducto web app where we can interact with it.

In [5]:
root._build()

Starting pipeline sac-cer
View at https://conducto.com/app/p/sac-cer
