# Wrapping an Application Package using EOEPCA's cwl-wrapper

This notebook uses the Python kernel.

In [1]:

import argparse
import yaml
import json
import graphviz
import pandas as pd
from io import StringIO
from click.testing import CliRunner
from cwl_wrapper.app import main as cwl_wrapper_main
from cwltool.main import main as cwltool_main
from cwl_wrapper.parser import Parser
from cwl_utils.parser import load_document_by_yaml
from ruamel.yaml import YAML

Let's display the Water bodies detection Application Package:

In [None]:
parsed_args = argparse.Namespace(print_dot=True, workflow="cwl-workflow/app-water-bodies.cwl#main", enable_ext=True)

stream_out = StringIO()
stream_err = StringIO()

res = cwltool_main(
    args=parsed_args,
    stdout=stream_out,
    stderr=stream_err,
)
# Surface cwltool's log if the run failed
assert res == 0, stream_err.getvalue()
graphviz.Source(stream_out.getvalue())

This Application Package has a parameter, `main/item` of type `Directory`:

In [6]:
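def read_cwl_file(cwl_file_path):
    """Parse a CWL file into cwl_utils objects."""
    # Named yaml_loader so it does not shadow the imported yaml module
    yaml_loader = YAML(typ="safe")

    # Load the CWL document
    with open(cwl_file_path, "r") as file:
        cwl_data = yaml_loader.load(file)

    cwl_obj = load_document_by_yaml(cwl_data, "io://")
    return cwl_obj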

def get_inputs(cwl_file_path):
    
    cwl_obj = read_cwl_file(cwl_file_path)

    # Collect inputs details
    inputs_data = []
    # cwl_input avoids shadowing the built-in input()
    for cwl_input in cwl_obj.inputs:
        input_data = {
            "ID": cwl_input.id,
            "Type": cwl_input.type_,
            "Doc": cwl_input.doc if cwl_input.doc else "N/A",
            "Label": cwl_input.label if cwl_input.label else "N/A",
        }
        inputs_data.append(input_data)
    
    # Create DataFrame
    df = pd.DataFrame(inputs_data)
    return df

def get_outputs(cwl_file_path):
    
    cwl_obj = read_cwl_file(cwl_file_path)
    
    # Collect outputs details
    outputs_data = []
    for output in cwl_obj.outputs:
        output_data = {
            "ID": output.id,
            "Doc": output.doc if output.doc else "N/A",
            "Label": output.label if output.label else "N/A",
            "Type": output.type_
        }
        outputs_data.append(output_data)
    
    # Create DataFrame
    df = pd.DataFrame(outputs_data)
    return df


In [None]:

cwl_file = "cwl-workflow/app-water-bodies.cwl"

# Display the DataFrame as a table
display(get_inputs(cwl_file))

And we inspect the outputs:

In [None]:
display(get_outputs(cwl_file))

Let's wrap this Application Package using EOEPCA's cwl-wrapper utility.

The cwl-wrapper utility requires a few templates.

We'll look at `cwl-wrapper/conf/stage-in.cwl` and `cwl-wrapper/conf/stage-out.cwl`, as these templates control the stage-in and stage-out steps.

In [None]:
display(get_inputs("cwl-wrapper/conf/stage-in.cwl"))

In [None]:
display(get_outputs("cwl-wrapper/conf/stage-in.cwl"))

In [None]:
display(get_inputs("cwl-wrapper/conf/stage-out.cwl"))

In [None]:
display(get_outputs("cwl-wrapper/conf/stage-out.cwl"))

Now let's check the cwl-wrapper command-line options, then invoke it with this configuration:

In [None]:

runner = CliRunner()
result = runner.invoke(cwl_wrapper_main, ['--help'])

print(result.output)

In [None]:
arguments = ["--maincwl", "cwl-wrapper/conf/main.yaml",
             "--rulez", "cwl-wrapper/conf/rules.yaml", 
             "--stagein", "cwl-wrapper/conf/stage-in.cwl", 
             "--stageout", "cwl-wrapper/conf/stage-out.cwl",
             "cwl-workflow/app-water-bodies.cwl"]

runner = CliRunner()
result = runner.invoke(cwl_wrapper_main, args=arguments)
# Fail early if cwl-wrapper reported an error
assert result.exit_code == 0, result.output

print(result.output)
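
If the same templates are reused to wrap several Application Packages, the argument list above can be built from the configuration folder. The sketch below is only an illustration: `wrapper_arguments` is a hypothetical helper, and it assumes the four template files keep the names used in this notebook.

```python
import posixpath


def wrapper_arguments(conf_dir, app_cwl):
    """Build the cwl-wrapper argument list for a configuration folder (hypothetical helper)."""
    return [
        "--maincwl", posixpath.join(conf_dir, "main.yaml"),
        "--rulez", posixpath.join(conf_dir, "rules.yaml"),
        "--stagein", posixpath.join(conf_dir, "stage-in.cwl"),
        "--stageout", posixpath.join(conf_dir, "stage-out.cwl"),
        app_cwl,  # the Application Package to wrap comes last
    ]


built_arguments = wrapper_arguments(
    "cwl-wrapper/conf", "cwl-workflow/app-water-bodies.cwl"
)
print(built_arguments)
```

The resulting list can be passed to `runner.invoke(cwl_wrapper_main, args=built_arguments)` exactly like the hand-written one.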

In [84]:
# save the output to a file
with open("w.cwl", "w") as f:
    f.write(result.output)

In [None]:
parsed_args = argparse.Namespace(print_dot=True, workflow="w.cwl#wrapped", enable_ext=True)

stream_out = StringIO()
stream_err = StringIO()

res = cwltool_main(
    args=parsed_args,
    stdout=stream_out,
    stderr=stream_err,
)
# Surface cwltool's log if the run failed
assert res == 0, stream_err.getvalue()
graphviz.Source(stream_out.getvalue())

**What happened?**

The wrapped CWL Workflow includes two additional steps:
* `wrapped/node_stage_in` that:
    * reads the `wrapped/item` parameter that is now a `string` (reference to a Landsat-9 acquisition catalog entry)
    * stages the Landsat-9 acquisition catalog entry as a STAC catalog
    * passes the resulting `Directory` to the `Water bodies detection based on NDWI and the otsu threshold` Workflow step
* `wrapped/node_stage_out` that:
    * reads the stage-out parameters:
        * `wrapped/aws_access_key_id`: the Platform AWS access key for the target S3 bucket
        * `wrapped/aws_secret_access_key`: the Platform AWS secret access key for the target S3 bucket
        * `wrapped/endpoint_url`: the Platform S3 object storage service URL
        * `wrapped/region_name`: the Platform S3 object storage region
        * `wrapped/s3_bucket`: the Platform S3 object storage bucket for the results
        * `wrapped/sub_path`: the sub-path (key prefix) within that bucket under which the results are stored
    * reads the `Water bodies detection based on NDWI and the otsu threshold` Workflow step results (type `Directory`)
    * pushes the STAC Catalog to S3
    * produces as output the S3 URL of the staged-out `catalog.json` (alongside the `stac_catalog` result of the `Water bodies detection based on NDWI and the otsu threshold` step)
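
The added steps can also be checked programmatically instead of reading them off the graph. The sketch below lists the step ids of a Workflow in a parsed CWL document. It runs on a hand-written, heavily trimmed stand-in for `w.cwl`: the sample structure and the `app` step id are assumptions for illustration, not the actual cwl-wrapper output.

```python
def list_workflow_steps(cwl_doc, workflow_id):
    """Return the step ids of the Workflow `workflow_id` in a parsed CWL document."""
    # A multi-process CWL document keeps its processes under "$graph"
    graph = cwl_doc.get("$graph", [cwl_doc])
    for process in graph:
        process_id = str(process.get("id", "")).lstrip("#")
        if process.get("class") == "Workflow" and process_id == workflow_id:
            steps = process.get("steps", {})
            # CWL allows steps as a mapping (id -> step) or as a list of dicts with an "id"
            if isinstance(steps, dict):
                return list(steps)
            return [step["id"] for step in steps]
    raise KeyError(f"no Workflow with id {workflow_id!r}")


# Hypothetical stand-in for the parsed content of w.cwl (assumption, trimmed)
sample_doc = {
    "cwlVersion": "v1.0",
    "$graph": [
        {
            "class": "Workflow",
            "id": "wrapped",
            "steps": {"node_stage_in": {}, "app": {}, "node_stage_out": {}},
        },
    ],
}

steps = list_workflow_steps(sample_doc, "wrapped")
print(steps)
```

On the real file, the dictionary would come from loading `w.cwl` with a YAML parser before calling `list_workflow_steps`.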

### Running the wrapped Application Package

At runtime, the Platform provides the additional parameters `aws_access_key_id`, `aws_secret_access_key`, `endpoint_url`, `s3_bucket`, `sub_path` and `region_name`.

The Platform user, in turn, sets `item` (a reference to a Landsat-9 acquisition catalog entry) and the remaining Application Package input parameters, `aoi` and `bands`.

It is up to the Platform to supply and manage the parameters of the stage-in and stage-out steps.
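
One way to picture this split is a merge of two dictionaries, one filled by the user and one by the Platform. The sketch below is an assumption about how a Platform might do it, not part of cwl-wrapper: `build_job_order` is a hypothetical helper, and the values are the test values used in this notebook.

```python
import json

# Platform-managed parameter names, as listed in this notebook
PLATFORM_KEYS = {
    "aws_access_key_id", "aws_secret_access_key", "endpoint_url",
    "s3_bucket", "sub_path", "region_name",
}


def build_job_order(user_params, platform_params):
    """Merge user and Platform parameters into one job order (hypothetical helper)."""
    # The user must not override Platform-managed values
    clash = PLATFORM_KEYS & set(user_params)
    if clash:
        raise ValueError(f"user may not set Platform parameters: {sorted(clash)}")
    # The Platform must provide every stage-out parameter
    missing = PLATFORM_KEYS - set(platform_params)
    if missing:
        raise ValueError(f"Platform parameters missing: {sorted(missing)}")
    return {**user_params, **platform_params}


user_params = {
    "item": "https://planetarycomputer.microsoft.com/api/stac/v1/collections/landsat-c2-l2/items/LC09_L2SP_042033_20231015_02_T1",
    "aoi": "-118.985,38.432,-118.183,38.938",
    "bands": ["green", "nir08"],
}
platform_params = {
    "aws_access_key_id": "test",
    "aws_secret_access_key": "test",
    "endpoint_url": "http://localstack:4566",
    "s3_bucket": "results",
    "sub_path": "run-005",
    "region_name": "us-east-1",
}

job_order = build_job_order(user_params, platform_params)
print(json.dumps(job_order, indent=2))
```

Since JSON is also valid YAML, the printed document could be saved and used as a job order file.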

We'll use `cwltool` to run the wrapped application package.

For that, we'll create a `params.yaml` file with the parameters:

In [30]:
arguments = {
    "item": "https://planetarycomputer.microsoft.com/api/stac/v1/collections/landsat-c2-l2/items/LC09_L2SP_042033_20231015_02_T1",
    "aoi": "-118.985,38.432,-118.183,38.938",
    "bands": ["green", "nir08"],
    "aws_access_key_id": "test",
    "aws_secret_access_key": "test",
    "endpoint_url": "http://localstack:4566",
    "s3_bucket": "results",
    "sub_path": "run-005",
    "region_name": "us-east-1",
}

# create the YAML parameter file
with open("params.yaml", "w") as f:
    f.write(yaml.dump(arguments))

And invoke `cwltool`:

In [None]:
parsed_args = argparse.Namespace(workflow="w.cwl#wrapped", enable_ext=True, job_order=["params.yaml"], podman=True)

stream_out = StringIO()
stream_err = StringIO()

res = cwltool_main(
    args=parsed_args,
    stdout=stream_out,
    stderr=stream_err,
)
# Surface cwltool's log if the run failed
assert res == 0, stream_err.getvalue()

Now we can inspect the results and verify there's an S3 URL pointing to the staged-out STAC `catalog.json`:

In [None]:
results = json.loads(stream_out.getvalue())

results
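
Rather than reading the dictionary by eye, the check can be automated. The sketch below walks a nested results structure and collects every string that starts with `s3://`. It assumes the staged-out URL uses the `s3://` scheme, and the sample dictionary (including its key names and paths) is hypothetical, not actual `cwltool` output.

```python
def find_s3_urls(value):
    """Recursively collect strings starting with 's3://' from nested dicts and lists."""
    if isinstance(value, str):
        return [value] if value.startswith("s3://") else []
    if isinstance(value, dict):
        value = list(value.values())
    if isinstance(value, list):
        return [url for item in value for url in find_s3_urls(item)]
    # Numbers, booleans and None cannot hold a URL
    return []


# Hypothetical results, standing in for json.loads(stream_out.getvalue())
sample_results = {
    "catalog_url": "s3://results/run-005/catalog.json",
    "stac_catalog": {"class": "Directory", "path": "/tmp/stac"},
}

urls = find_s3_urls(sample_results)
print(urls)
```

On the real run, `find_s3_urls(results)` should return at least one URL ending in `catalog.json` if the stage-out succeeded.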

## Extra - Using the cwl-wrapper Python API

`cwl-wrapper` can be invoked in a Python application with:

In [17]:
workflow_id = "main"

wf = Parser(
    cwl="cwl-workflow/app-water-bodies.cwl",
    output=None,
    stagein="cwl-wrapper/conf/stage-in.cwl",
    stageout="cwl-wrapper/conf/stage-out.cwl",
    maincwl="cwl-wrapper/conf/main.yaml",
    rulez="cwl-wrapper/conf/rules.yaml",
    assets=None,
    workflow_id=workflow_id,
)

In [None]:
wf.out