# **Pipelines**

## 1. Basics of GEOAnalytics Canada Pipelines

The Pipelines tool helps you develop and build portable, scalable machine learning (ML) workflows based on Docker containers.

**The Pipelines platform consists of:**

* A UI for managing and tracking pipelines and their execution
* An engine for scheduling a pipeline’s execution
* An SDK for defining, building, and deploying pipelines in Python

A pipeline is a representation of an ML workflow. It contains the parameters required to run the pipeline and the inputs and outputs of each component. Each pipeline component is a self-contained code block, packaged as a Docker image.
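As a rough mental model only (this is not the SDK's API), the definition above can be sketched with plain Python data classes. The `Component` and `Pipeline` names and their fields are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Component:
    """One self-contained step, packaged as a Docker image (hypothetical model)."""
    name: str
    image: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


@dataclass
class Pipeline:
    """A workflow: run parameters plus the components that make it up."""
    parameters: Dict[str, str]
    components: List[Component]

    def images(self) -> List[str]:
        # Each component runs in its own container image.
        return [component.image for component in self.components]


example = Pipeline(
    parameters={"message": "hello"},
    components=[
        Component(name="say", image="docker/whalesay:latest", inputs=["message"]),
    ],
)
print(example.images())
```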


In this tutorial notebook, we will build our first Pipeline. First, run the following command to install all the packages and dependencies required for this tutorial. 

In [1]:
!python3 -m pip install git+https://github.com/couler-proj/couler --ignore-installed

Collecting git+https://github.com/couler-proj/couler
  Cloning https://github.com/couler-proj/couler to /tmp/pip-req-build-tvbvuw6q
  Running command git clone --filter=blob:none --quiet https://github.com/couler-proj/couler /tmp/pip-req-build-tvbvuw6q
  Resolved https://github.com/couler-proj/couler to commit db7d4c32672315078ee023f0cc5af75164794b4d
  Preparing metadata (setup.py) ... done
Collecting pyaml
  Downloading pyaml-21.10.1-py2.py3-none-any.whl (24 kB)
Collecting kubernetes>=11.0.0
  Downloading kubernetes-24.2.0-py2.py3-none-any.whl (1.5 MB)
Collecting docker>=4.1.0
  Downloading docker-6.0.0-py3-none-any.whl (147 kB)
Collecting Deprecated
  Downloading Deprecated-1.2.13-py2.py3-none-any.whl (9.6 kB)
Colle

## 2. Building A Basic Pipeline

After installing the required dependencies for this tutorial,
we import the necessary modules.
Next, we define a job template that packages each step into its own container.


In [15]:
import couler.argo as couler
from couler.argo_submitter import ArgoSubmitter
from couler.core.templates.toleration import Toleration

In [16]:
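def job(name):
    # Tolerate the 'ga.nodepool/type' taint, whatever its value ('Exists'),
    # so the step can be scheduled on the tainted nodes.
    toleration = Toleration('ga.nodepool/type', 'NoSchedule', 'Exists')
    couler.add_toleration(toleration)
    # Also tolerate the taint that Azure applies to spot node pools.
    toleration2 = Toleration('kubernetes.azure.com/scalesetpriority', 'NoSchedule', 'Exists')
    couler.add_toleration(toleration2)
    # Run the step in its own container: `cowsay <name>` in the whalesay image.
    couler.run_container(
        image="docker/whalesay:latest",
        command=["cowsay"],
        args=[name],
        step_name=name,
        node_selector={'pipeline':'small'}
    )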

The next two functions show two ways to declare dependencies between steps.
A more complex example follows further down. Even simple dependencies
like these, which block a step from starting until the steps it depends on
have finished running, can be a powerful tool when building complex products.

In [17]:
#     A
#    / \
#   B   C
#  /
# D
def linear():
    couler.set_dependencies(lambda: job(name="A"), dependencies=None)
    couler.set_dependencies(lambda: job(name="B"), dependencies=["A"])
    couler.set_dependencies(lambda: job(name="C"), dependencies=["A"])
    couler.set_dependencies(lambda: job(name="D"), dependencies=["B"])

In [18]:
#   A
#  / \
# B   C
#  \ /
#   D
def diamond():
    couler.dag( # DAG: Directed Acyclic Graph
        [
            [lambda: job(name="A")],
            [lambda: job(name="A"), lambda: job(name="B")],  # A -> B
            [lambda: job(name="A"), lambda: job(name="C")],  # A -> C
            [lambda: job(name="B"), lambda: job(name="D")],  # B -> D
            [lambda: job(name="C"), lambda: job(name="D")],  # C -> D
        ]
    )

In [19]:
# linear()
diamond()
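To see what these dependency declarations imply, the sketch below uses only the Python standard library (`graphlib`, Python 3.9+), not couler or Argo, to compute which steps could run at the same time. The `execution_waves` helper and the two graph dictionaries are hypothetical names for illustration; the real scheduling is done by the pipeline engine.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps it depends on.
tree_graph = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B"}}
diamond_graph = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}


def execution_waves(graph):
    """Group steps into waves; every step in a wave has all its dependencies met."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


print(execution_waves(tree_graph))
print(execution_waves(diamond_graph))
```

Both graphs start with `A` and can then run `B` and `C` side by side. The difference is that in the diamond, `D` has to wait for both `B` and `C`, while in the first graph it only waits for `B`.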


Next, we submit our job to the `pipeline` namespace, where jobs are run.
Submitting to any other namespace results in an error. First, we declare which
submitter to use. Because the backend is Argo, we use `ArgoSubmitter`.

In [21]:
submitter = ArgoSubmitter(namespace='pipeline')

INFO:root:Argo submitter namespace: pipeline
INFO:root:Cannot find local k8s config. Trying in-cluster config.
INFO:root:Initialized with in-cluster config.


Finally, we submit the Directed Acyclic Graph (DAG) that represents the pipeline we defined
above to the Executor.

In [22]:
deployment = couler.run(submitter=submitter)
deployment

INFO:root:Checking workflow name/generatedName runpy-
INFO:root:Submitting workflow to Argo
INFO:root:Workflow runpy-j8w5g has been submitted in "pipeline" namespace!


{'apiVersion': 'argoproj.io/v1alpha1',
 'kind': 'Workflow',
 'metadata': {'creationTimestamp': '2022-09-27T02:44:08Z',
  'generateName': 'runpy-',
  'generation': 1,
  'managedFields': [{'apiVersion': 'argoproj.io/v1alpha1',
    'fieldsType': 'FieldsV1',
    'fieldsV1': {'f:metadata': {'f:generateName': {}}, 'f:spec': {}},
    'manager': 'OpenAPI-Generator',
    'operation': 'Update',
    'time': '2022-09-27T02:44:08Z'}],
  'name': 'runpy-j8w5g',
  'namespace': 'pipeline',
  'resourceVersion': '42779483',
  'uid': '6a864b5c-68d4-4bf0-8c2e-e4053f0b02e5'},
 'spec': {'entrypoint': 'runpy',
  'templates': [{'dag': {'tasks': [{'arguments': {'parameters': [{'name': 'para-A-0',
          'value': 'A'}]},
       'name': 'A',
       'template': 'A'},
      {'arguments': {'parameters': [{'name': 'para-B-0', 'value': 'B'}]},
       'dependencies': ['A'],
       'name': 'B',
       'template': 'B'},
      {'arguments': {'parameters': [{'name': 'para-C-0', 'value': 'C'}]},
       'dependencies': ['

The following screenshot shows the successful run of the workflow that the object above describes.

![pipeline DAG](../images/getting_started_images/09_pipeline-dag.png)

And finally, the output of the above pipeline:

![whalesayAB](../images/getting_started_images/09_whalesayAB.png)![whalesayCD](../images/getting_started_images/09_whalesayCD.png)
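The object returned by `couler.run` in this section is printed as a plain Python dictionary, so it can be inspected with ordinary dictionary access. The sketch below assumes only the shape visible in the output above (`spec` → `templates` → `dag` → `tasks`). The `sample_workflow` value is a hand-trimmed excerpt limited to tasks `A` and `B`, and `task_dependencies` is a hypothetical helper.

```python
# Hand-trimmed excerpt of a workflow dictionary (tasks A and B only).
sample_workflow = {
    "kind": "Workflow",
    "spec": {
        "entrypoint": "runpy",
        "templates": [
            {
                "dag": {
                    "tasks": [
                        {"name": "A", "template": "A"},
                        {"name": "B", "template": "B", "dependencies": ["A"]},
                    ]
                }
            }
        ],
    },
}


def task_dependencies(workflow):
    """Map each DAG task name to the list of task names it depends on."""
    result = {}
    for template in workflow["spec"]["templates"]:
        # Only DAG templates carry a task list; others are skipped.
        for task in template.get("dag", {}).get("tasks", []):
            result[task["name"]] = task.get("dependencies", [])
    return result


print(task_dependencies(sample_workflow))
```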

## 3. Building A More Complex Pipeline

In this example, we stub out a pseudo-pipeline that implements a failover mechanism for Sen2Cor processing.

The idea is to begin with the most recent version of Sen2Cor and, if an error occurs while
processing the L1C input, fail over to the next most recent release of Sen2Cor.
If every version fails, an error job runs to perform any cleanup
and to send a notification that processing failed for that input.

This pipeline can be adapted into many other uses. 
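Stripped of the pipeline machinery, the failover idea is a loop over candidates that stops at the first success. The sketch below is plain Python and does not run Sen2Cor; `run_with_failover`, `ProcessingError`, and `fake_sen2cor` are hypothetical names, and the fake processor is hard-coded so that only one candidate works.

```python
from typing import Callable, Optional, Sequence


class ProcessingError(Exception):
    """Raised by the stand-in processor when a candidate fails."""


def run_with_failover(
    candidates: Sequence[str], process: Callable[[str], None]
) -> Optional[str]:
    """Try each candidate in order; return the first that succeeds, else None."""
    for candidate in candidates:
        try:
            process(candidate)
            return candidate
        except ProcessingError:
            continue  # fail over to the next candidate in the list
    return None


def fake_sen2cor(step: str) -> None:
    # Stand-in for a real run: pretend only the 'sen2cor280' step works.
    if step != "sen2cor280":
        raise ProcessingError(step)


chosen = run_with_failover(["sen2cor290", "sen2cor280", "sen2cor255"], fake_sen2cor)
print(chosen or "all candidates failed: run cleanup and notify")
```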

First, after importing our libraries, we build a template job function that accepts a `callable` object to be run inside a Python container image (`python:alpine3.6`).

In [1]:
import sys
import couler.argo as couler
from couler.argo_submitter import ArgoSubmitter
from couler.core.templates.toleration import Toleration
from couler.core.templates.volume_claim import VolumeClaimTemplate
from couler.core.constants import WFStatus

In [10]:
def job(name: str, source: callable):
    toleration = Toleration('ga.nodepool/type', 'NoSchedule', 'Exists')
    couler.add_toleration(toleration) # pipeline/nodepool=pipe:NoSchedule
    toleration2 = Toleration('kubernetes.azure.com/scalesetpriority', 'NoSchedule', 'Exists')
    couler.add_toleration(toleration2)
    return couler.run_script(
        image="python:alpine3.6",
        source=source,
        step_name=name,
        node_selector={'pipeline':'small'}
    )

Next, we define our steps. Each Sen2Cor stub randomly succeeds or fails to simulate the outcome of running the Sen2Cor binary.

In [9]:
def gather_files():
    return ['ras1','ras2','ras3','ras4']

def preprocess():
    print('preprocess')
    
def sen2cor290():
    # Imports are kept inside the function so the step does not rely on
    # names defined elsewhere in the notebook.
    import random
    import sys
    task = ['success', 'fail']
    res = random.randint(0, 1)
    res = task[res]
    print(res)
    if res == 'fail':
        sys.exit(2)  # a non-zero exit code marks the step as failed

def sen2cor280():
    import random
    import sys
    task = ['success', 'fail']
    res = random.randint(0, 1)
    res = task[res]
    print(res)
    if res == 'fail':
        sys.exit(2)
    
def sen2cor255():
    import random
    import sys
    task = ['success', 'fail']
    res = random.randint(0, 1)
    res = task[res]
    print(res)
    if res == 'fail':
        sys.exit(2)

def fin():
    print('fin')

def err():
    print('error')
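The stubs above signal failure through the process exit code: a container that exits with a non-zero code is treated as a failed step. The sketch below reproduces that convention locally with a child Python process instead of a container. `step_status` is a hypothetical helper, and the `Succeeded` and `Failed` strings simply mirror the status names used in the dependency expressions of this tutorial.

```python
import subprocess
import sys


def step_status(script: str) -> str:
    """Run a script in a child Python process and map its exit code to a status."""
    completed = subprocess.run([sys.executable, "-c", script])
    # Exit code 0 means success; anything else counts as a failure.
    return "Succeeded" if completed.returncode == 0 else "Failed"


print(step_status("pass"))
print(step_status("import sys; sys.exit(2)"))
```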

Once declared, we wrap each function in a submittable job.

In [11]:
def preprocess_job():
    return job(name='preprocess', source=preprocess)

def sen2cor290_job():
    return job(name='sen2cor290', source=sen2cor290)

def sen2cor280_job():
    return job(name='sen2cor280', source=sen2cor280)

def sen2cor255_job():
    return job(name='sen2cor255', source=sen2cor255)

def fin_job():
    return job(name='fin', source=fin)

def err_job():
    return job(name='err', source=err)

We now need to build our DAG.

First, we gather our files (generally a list) and run any necessary preprocessing steps.
Once ready, the input is passed into our first step, "Sen2Cor version 2.9.0".
Using Boolean logic on the status of each step, we determine how the failovers are managed.

In [12]:
def run_dag(pth):

    couler.set_dependencies(
        preprocess_job, 
        dependencies=None
    )
    
    couler.set_dependencies(
        sen2cor290_job,
        dependencies='preprocess.Succeeded'
    )

    couler.set_dependencies(
        sen2cor280_job,
        dependencies='sen2cor290.Failed'
    )

    couler.set_dependencies(
        sen2cor255_job,
        dependencies='sen2cor280.Failed'
    )
    
    couler.set_dependencies(
        err_job,
        dependencies='sen2cor280.Failed && sen2cor290.Failed && sen2cor255.Failed'
    )
    
    couler.set_dependencies(
        fin_job,
        dependencies='sen2cor290.Succeeded || sen2cor280.Succeeded || sen2cor255.Succeeded'
    )
    
run_dag('pth')
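To reason about which jobs run for a given outcome, the Boolean expressions above can be evaluated locally. The sketch below is a simplified stand-in for the engine's own evaluation and handles only `&&`, `||`, and parentheses. `depends_satisfied` and the `statuses` mapping are hypothetical, and a task missing from `statuses` is treated as not having run.

```python
import re

_TERM = re.compile(r"([A-Za-z0-9_-]+)\.([A-Za-z]+)")
_SAFE = re.compile(r"(?:True|False|and|or|[()\s])*")


def depends_satisfied(expression: str, statuses: dict) -> bool:
    """Evaluate a 'task.Status' expression against known task statuses."""
    # Replace each 'task.Status' term with the literal True or False.
    expr = _TERM.sub(
        lambda m: str(statuses.get(m.group(1)) == m.group(2)), expression
    )
    expr = expr.replace("&&", " and ").replace("||", " or ")
    # Refuse anything that is not a plain Boolean expression before eval.
    if not _SAFE.fullmatch(expr):
        raise ValueError(f"unsupported expression: {expression!r}")
    return eval(expr, {"__builtins__": {}}, {})


# Example outcome: 2.9.0 failed, 2.8.0 succeeded, so sen2cor255 never ran.
statuses = {
    "preprocess": "Succeeded",
    "sen2cor290": "Failed",
    "sen2cor280": "Succeeded",
}
fin_runs = depends_satisfied(
    "sen2cor290.Succeeded || sen2cor280.Succeeded || sen2cor255.Succeeded", statuses
)
err_runs = depends_satisfied(
    "sen2cor280.Failed && sen2cor290.Failed && sen2cor255.Failed", statuses
)
print(fin_runs, err_runs)
```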

Finally, we submit our job to the Executor! 

In [13]:
submitter = ArgoSubmitter(namespace='pipeline')

INFO:root:Argo submitter namespace: pipeline
INFO:root:Cannot find local k8s config. Trying in-cluster config.
INFO:root:Initialized with in-cluster config.


In [14]:
couler.run(submitter=submitter)

INFO:root:Checking workflow name/generatedName runpy-
INFO:root:Submitting workflow to Argo
INFO:root:Workflow runpy-knlcd has been submitted in "pipeline" namespace!


{'apiVersion': 'argoproj.io/v1alpha1',
 'kind': 'Workflow',
 'metadata': {'creationTimestamp': '2022-09-27T02:37:01Z',
  'generateName': 'runpy-',
  'generation': 1,
  'managedFields': [{'apiVersion': 'argoproj.io/v1alpha1',
    'fieldsType': 'FieldsV1',
    'fieldsV1': {'f:metadata': {'f:generateName': {}}, 'f:spec': {}},
    'manager': 'OpenAPI-Generator',
    'operation': 'Update',
    'time': '2022-09-27T02:37:01Z'}],
  'name': 'runpy-knlcd',
  'namespace': 'pipeline',
  'resourceVersion': '42775569',
  'uid': 'd3fef132-6be1-47f6-bd21-0c8ef710b7d0'},
 'spec': {'entrypoint': 'runpy',
  'templates': [{'dag': {'tasks': [{'name': 'preprocess',
       'template': 'preprocess'},
      {'depends': 'preprocess.Succeeded',
       'name': 'sen2cor290',
       'template': 'sen2cor290'},
      {'depends': 'sen2cor290.Failed',
       'name': 'sen2cor280',
       'template': 'sen2cor280'},
      {'depends': 'sen2cor280.Failed',
       'name': 'sen2cor255',
       'template': 'sen2cor255'},
     