# GeoEDF_Tutorial_01

This notebook demonstrates a GeoEDF hydrologic workflow that acquires data (in HDF format) from a NASA Distributed Active Archive Center (DAAC) and aggregates the data across a provided watershed region. This is often the first step before running a hydrologic model. 

![Workflow](files/img/research.png)

In GeoEDF, this workflow combines a data connector (**NASAInput**) and a processor (**HDFEOSShapefileMask**) as follows:

![Workflow](files/img/mcd.png)

The corresponding YAML GeoEDF workflow [file](./workflow/mcd15.yml) is as follows:

```
$1:
  Input:
    NASAInput:
      url: https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.061/%{filename}
      user: rkalyana
      password:
  Filter:
    filename:
      PathFilter:
        pattern: '%{dtstring}/MCD15A3H.*.h09v07*.hdf'
    dtstring:
      DateTimeFilter:
        pattern: '%Y.%m.%d'
        start: 07/16/2002
        exact_dates: True
$2:
  HDFEOSShapefileMask:
    hdffile: $1
    shapefile: /home/jovyan/CI4FAIR/files/watershed/subs1_projected_171936.shp
    datasets: [Lai]
```

**Note:** 

1. A GeoEDF workflow combines _instances_ of connector or processor classes. The YAML syntax lets the user specify bindings for the arguments of each connector or processor class (e.g., url, user, shapefile).
2. Filters enhance the generality of connectors. In this specific case, HDF data for a different time period can be acquired by modifying the _DateTimeFilter_ appropriately. Similarly, wildcards ('*') in the _PathFilter_ enable a search across all the files hosted in that directory on the repository.
3. Filters provide one or more binding values for variables referenced in other connectors or filters. For example, the _filename_ variable in the _NASAInput_ connector is bound by the _PathFilter_, and the _dtstring_ variable in the _PathFilter_ is bound by the _DateTimeFilter_.
4. Numeric indices ($1, $2) denote the workflow steps and establish output-input linkages between them. Here, the _hdffile_ argument of step $2 is bound to the output of step $1.
5. Fields left blank (e.g., _password_) are instantiated at workflow execution time by prompting the user for a value.
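
The chain of bindings described in the notes above can be pictured with a small standalone sketch. This is not GeoEDF's implementation: it assumes that the _DateTimeFilter_ pattern follows Python's `strftime` conventions, that the start date is written as month/day/year, and that the _PathFilter_ wildcards behave like shell-style globs. The `bind` helper and the directory listing are hypothetical and exist only for illustration.

```python
from datetime import datetime
from fnmatch import fnmatchcase


def bind(template: str, name: str, value: str) -> str:
    """Replace every %{name} placeholder in template with value (hypothetical helper)."""
    return template.replace("%{" + name + "}", value)


# Values copied from the workflow file above.
url_template = "https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.061/%{filename}"
path_pattern = "%{dtstring}/MCD15A3H.*.h09v07*.hdf"

# DateTimeFilter: format the start date with the pattern (assumed strftime-style).
start = datetime.strptime("07/16/2002", "%m/%d/%Y")
dtstring = start.strftime("%Y.%m.%d")

# PathFilter: bind dtstring first, then match the wildcards against a listing.
bound_pattern = bind(path_pattern, "dtstring", dtstring)

# Hypothetical directory listing; real file names on the repository will differ.
listing = [
    "2002.07.16/MCD15A3H.A2002197.h09v07.061.0000000000000.hdf",
    "2002.07.16/MCD15A3H.A2002197.h10v07.061.0000000000000.hdf",
]
filenames = [f for f in listing if fnmatchcase(f, bound_pattern)]

# NASAInput: each matching filename yields one fully bound URL.
urls = [bind(url_template, "filename", f) for f in filenames]
for url in urls:
    print(url)
```

Only the first listing entry matches the `h09v07` tile, so a single URL is produced. A filter that yields several values (for example, a date range) would yield several bound URLs in the same way.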

# Setup

## Library import

**WorkflowEngine** is the primary class that will be used to instantiate and execute the workflow above.

GeoEDF uses the _sregistry_ Singularity client library to interact with the GeoEDF Singularity registry. To suppress the informational messages from this library, we set the _MESSAGELEVEL_ environment variable to _QUIET_ before importing the workflow engine.

In [None]:
import os

os.environ['MESSAGELEVEL'] = 'QUIET'

from geoedfengine.WorkflowEngine import WorkflowEngine

# Workflow Instantiation and Execution

A new workflow is created and submitted by calling the *execute_workflow* method of the _WorkflowEngine_ class with the workflow YAML file path and an ID for tracking the workflow.

**Note:**

1. At this point, the GeoEDF engine validates the workflow file for proper syntax (ensuring that there are no cyclic dependencies, that all variables are bound by filters, etc.).
2. The user is prompted to enter values for any fields that have been left blank (e.g., _password_).
3. Workflow execution is asynchronous; once the workflow is submitted, the *workflow_status* method can be used to track its execution status.
4. The workflow may take a while to run, depending on the resources available to your Jupyter environment.
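
The two validation checks named in note 1 can be illustrated with a short sketch. It is not the GeoEDF engine's actual validator: it assumes the first workflow step has already been loaded into plain Python dictionaries, and the `referenced_vars` and `check_bindings` helpers are hypothetical names introduced here.

```python
import re

# Matches %{name} placeholders; strftime codes such as %Y do not match.
VAR = re.compile(r"%\{(\w+)\}")


def referenced_vars(value) -> set:
    """Collect %{name} placeholders from every string nested under value."""
    if isinstance(value, str):
        return set(VAR.findall(value))
    if isinstance(value, dict):
        found = set()
        for child in value.values():
            found |= referenced_vars(child)
        return found
    return set()


def check_bindings(input_section: dict, filters: dict) -> list:
    """Return an order in which filters can be evaluated, or raise ValueError."""
    deps = {name: referenced_vars(spec) for name, spec in filters.items()}
    needed = referenced_vars(input_section) | set().union(*deps.values())
    unbound = needed - set(filters)
    if unbound:
        raise ValueError(f"unbound variables: {sorted(unbound)}")
    order, done = [], set()
    while len(done) < len(deps):
        # A filter is ready once every variable it references has been bound.
        ready = sorted(n for n in deps if n not in done and deps[n] <= done)
        if not ready:
            raise ValueError("cyclic dependency among filters")
        order.extend(ready)
        done.update(ready)
    return order


# Step $1 of the workflow above, written as dictionaries (credentials omitted).
input_section = {
    "NASAInput": {"url": "https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.061/%{filename}"}
}
filters = {
    "filename": {"PathFilter": {"pattern": "%{dtstring}/MCD15A3H.*.h09v07*.hdf"}},
    "dtstring": {"DateTimeFilter": {"pattern": "%Y.%m.%d", "start": "07/16/2002"}},
}
order = check_bindings(input_section, filters)
print(order)
```

For this workflow the _DateTimeFilter_ must be evaluated before the _PathFilter_, because the latter references _dtstring_. Two filters that referenced each other would never become ready and would be reported as a cycle.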

In [None]:
#WorkflowEngine.execute_workflow(<workflow path>,<id>)

WorkflowEngine.execute_workflow('/home/jovyan/CI4FAIR/workflow/mcd15.yml','mcd')

# Workflow Monitoring

The status of the workflow can be monitored by passing the workflow ID to the *workflow_status* method.

Users can also run the Pegasus command-line tools *pegasus-status* and *pegasus-analyzer* from the Terminal to get more detailed workflow status or to debug failures.

In [None]:
#WorkflowEngine.workflow_status(<id>)

WorkflowEngine.workflow_status('mcd')

# Discussion

If the workflow succeeded, the output is an ESRI Shapefile (**_the output path is printed when the workflow is submitted for execution_**). However, there is no easy way to verify or visualize this result. Python mapping libraries such as Folium and ipyleaflet work with geospatial files, but they require vector data to be in the GeoJSON format.

As a next step, we demonstrate how a new processor can be developed to convert a shapefile into a GeoJSON file and then appended as a third step to the above workflow. This demonstration can be found in the **GeoEDF_Tutorial_02** notebook.
