In [1]:
import json
import conducto as co
from my_experiment import process_data, notebook_pkgs
data_dir = "/conducto/data/pipeline"

Make the root node of the tree.  It's a [Serial](/docs/basics/pipeline-structure#node-types) node, so Conducto runs its children one after another.

In [2]:
root = co.Serial()
root.describe()['id']

/

[Image](/docs/basics/Images) objects define environments.

In [3]:
download_img = co.Image(install_packages=["wget"])

[Exec](/docs/basics/pipeline-structure#node-types) nodes run commands or call functions in those environments.

In [4]:
# f-string, so that {data_dir} is substituted into the command
root["Download"] = co.Exec(f"wget -NcP {data_dir} https://github.com/conducto/examples/raw/drafts/blast/data_science/find_genes/data/genedata.zip",
                           image=download_img)
root.describe()['id']

/
└─0 Download   wget -NcP /conducto/data/pipeline https://github.com/conducto/examples/raw/drafts/blast/data_science/find_genes/data/genedata.zip
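
Python substitutes `{data_dir}` into a command string only when the string is an f-string (or is passed through `str.format`); in a plain string literal the braces stay exactly as typed. A minimal sketch of the difference, using a shortened placeholder URL that is an assumption made for illustration:

```python
data_dir = "/conducto/data/pipeline"
url = "https://example.com/genedata.zip"  # placeholder URL, for illustration only

plain = "wget -NcP {data_dir} " + url      # braces are kept literally
formatted = f"wget -NcP {data_dir} {url}"  # data_dir is substituted

print(plain)
print(formatted)
```

Printing the command before handing it to a node is a cheap way to confirm that the download lands in the directory the later nodes read from.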

The Process and Analyze nodes share an image.  To build it:
 - start with a premade image from Docker Hub
 - include the local directory so we can reference its other files
 - pip install their dependencies

In [5]:
bio_img = co.Image("ncbi/blast",
                   copy_dir=".",
                   install_pip=["pandas", "biopython"] + notebook_pkgs)

This Parallel node makes three calls to my_experiment.process_data(), each with different parameters. Unlike the children of a Serial node, they do not have to wait for one another.
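
As a rough mental model only, the difference between running calls one after another and running them side by side can be sketched with the standard library. This is not how Conducto schedules work: `process_data_stub`, `run_serial`, and `run_parallel` are hypothetical names invented for this sketch, and threads merely stand in for separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def process_data_stub(dataset):
    """Hypothetical stand-in for my_experiment.process_data."""
    return f"processed dataset {dataset}"


def run_serial(func, args):
    # One call after another, like the children of a Serial node.
    return [func(arg) for arg in args]


def run_parallel(func, args):
    # Calls may overlap in time; map() still returns results in input order.
    with ThreadPoolExecutor(max_workers=len(args)) as pool:
        return list(pool.map(func, args))


serial_results = run_serial(process_data_stub, [1, 2, 3])
parallel_results = run_parallel(process_data_stub, [1, 2, 3])
print(parallel_results)
```

Both helpers produce the same results; only the timing differs, which is why independent per-dataset work is a natural fit for a Parallel node.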

In [6]:
process = co.Parallel(image=bio_img)
process["1"] = co.Exec(process_data, dataset=1, data_dir=data_dir)
process["2"] = co.Exec(process_data, dataset=2, data_dir=data_dir)
process["3"] = co.Exec(process_data, dataset=3, data_dir=data_dir)
root["Process"] = process

root.describe()['id']

/
├─0 Download   wget -NcP /conducto/data/pipeline https://github.com/conducto/examples/raw/drafts/blast/data_science/find_genes/data/genedata.zip
└─1 Process
  ├─ 1   conducto my_experiment.py process_data --dataset=1 --data_dir=/conducto/data/pipeline
  ├─ 2   conducto my_experiment.py process_data --dataset=2 --data_dir=/conducto/data/pipeline
  └─ 3   conducto my_experiment.py process_data --dataset=3 --data_dir=/conducto/data/pipeline
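
The tree output shows each function call as a command line, with every keyword argument turned into a `--name=value` flag. The sketch below only mimics that display format to make the mapping explicit; `render_command` is a hypothetical helper and says nothing about how Conducto builds the command internally.

```python
def render_command(script, func_name, **kwargs):
    """Hypothetical helper: format a call the way the tree output displays it."""
    # Keyword arguments keep their call order, so the flags come out in that order.
    flags = " ".join(f"--{key}={value}" for key, value in kwargs.items())
    return f"conducto {script} {func_name} {flags}"


cmd = render_command(
    "my_experiment.py",
    "process_data",
    dataset=1,
    data_dir="/conducto/data/pipeline",
)
print(cmd)
```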

Once all the data is processed, make a node to analyze it interactively.  Notebook nodes that run to completion can be viewed like reports, or you can leave them running and explore the data with code.

In [7]:
analyze = co.Notebook("analyze.ipynb", dir=data_dir, datasets=json.dumps([1,2,3]))
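
The list of datasets is passed as a JSON string rather than as a Python list. How analyze.ipynb receives that value is not shown here, so the decoding step below is an assumption about what the notebook side might do, not a description of its actual code.

```python
import json

# What the pipeline definition passes: a plain string.
datasets_arg = json.dumps([1, 2, 3])
print(repr(datasets_arg))

# Assumed notebook side: decode the string back into a list before use.
datasets = json.loads(datasets_arg)
for dataset in datasets:
    print(f"analyzing dataset {dataset}")
```

Encoding the list this way keeps the argument a single string, which survives being passed between processes unchanged.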

Give the node extra CPU and memory for easy exploration.

In [8]:
analyze.set(image=bio_img, cpu=8, mem=32)
root["Analyze"] = analyze

root.describe()['id']

DuplicateImageError: happy-butterfree already present with a different definition in this repository

This launches the pipeline and prints a link to the Conducto web app, where you can interact with the pipeline.

In [None]:
root.launch()