# RK_01_EarthCube_GeoEDF_Demo

## Author(s)

- Author1 = {"name": "Rajesh Kalyanam", "affiliation": "Purdue Research Computing", "email": "rkalyana@purdue.edu", "orcid": "0000-0002-7026-7419"}
- Author2 = {"name": "Lan Zhao", "affiliation": "Purdue Research Computing", "email": "lanzhao@purdue.edu"}
- Author3 = {"name": "Carol Song", "affiliation": "Purdue Research Computing", "email": "cxsong@purdue.edu"}
    


## Purpose
This notebook demonstrates **GeoEDF**, a new Python-based plug-and-play workflow framework for composing and executing end-to-end geospatial research workflows in a cyberinfrastructure environment. GeoEDF seeks to free researchers from time-consuming data wrangling tasks so that they can focus on their scientific research.

## Technical contributions

* **GeoEDF** (Geospatial **E**xtensible **D**ata **F**ramework)[<sup id="fn1-back">1</sup>](#fn1) is a novel abstraction of geospatial workflows as a sequence of reusable data acquisition (_connector_) and processing (_processor_) steps.
* A YAML-based workflow syntax for composing geospatial workflows out of data _connectors_ and data _processors_.
* A Python-based **GeoEDF** workflow engine for planning and executing the YAML workflows on diverse compute resources.
* A growing repository of community-contributed connectors and processors that can be used in research workflows.

## Methodology

1. GeoEDF data connectors and processors are essentially Python classes that implement a standard interface. 
2. Connectors implement various data access protocols and assist in acquiring data from remote repositories (e.g., NASA, USGS, etc.). Processors implement various domain-agnostic and domain-specific geospatial processing operations.
3. Data connectors and processors are contributed as open source on GitHub, where a CI/CD (continuous integration/continuous deployment) pipeline packages them as Singularity containers and deploys them to a GeoEDF Singularity registry.
4. Users can utilize any of these connectors or processors for their workflows or design and contribute their own.
5. The GeoEDF workflow engine can be used standalone as in this Docker container example, or integrated into a science gateway. 
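
As a rough illustration of points 1 and 2 above, the following sketch shows what a standard connector and processor interface could look like. The class names `Connector`, `Processor`, `InlineTextInput`, and `UppercaseText`, and the method names `get` and `process`, are assumptions made for this example and are not taken from the GeoEDF code base; the toy classes work on a local text file rather than on geospatial data.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import List


class Connector(ABC):
    """Hypothetical connector interface: acquire data into a directory."""

    @abstractmethod
    def get(self, target_dir: str) -> List[str]:
        """Acquire data and return the paths of the files written."""


class Processor(ABC):
    """Hypothetical processor interface: derive new files from inputs."""

    @abstractmethod
    def process(self, input_files: List[str], target_dir: str) -> List[str]:
        """Process the inputs and return the paths of the files written."""


class InlineTextInput(Connector):
    """Toy connector that 'acquires' a string instead of a remote file."""

    def __init__(self, name: str, text: str):
        self.name = name
        self.text = text

    def get(self, target_dir: str) -> List[str]:
        path = os.path.join(target_dir, self.name)
        with open(path, "w") as handle:
            handle.write(self.text)
        return [path]


class UppercaseText(Processor):
    """Toy processor that writes an upper-cased copy of each input file."""

    def process(self, input_files: List[str], target_dir: str) -> List[str]:
        outputs = []
        for source in input_files:
            target = os.path.join(target_dir, "upper_" + os.path.basename(source))
            with open(source) as fin, open(target, "w") as fout:
                fout.write(fin.read().upper())
            outputs.append(target)
        return outputs


def run_pipeline(connector: Connector, processor: Processor, workdir: str) -> List[str]:
    """Feed the files acquired by the connector into the processor."""
    acquired = connector.get(workdir)
    return processor.process(acquired, workdir)


with tempfile.TemporaryDirectory() as workdir:
    results = run_pipeline(
        InlineTextInput("lai.txt", "leaf area index"), UppercaseText(), workdir
    )
    with open(results[0]) as handle:
        pipeline_output = handle.read()
```

Because both toy classes only depend on the abstract interface, either one could be swapped for another implementation without changing `run_pipeline`, which is the kind of reuse the connector/processor abstraction is meant to enable.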

The GeoEDF workflow engine leverages the [Pegasus Workflow Management System](https://pegasus.isi.edu/) for workflow planning and execution on diverse compute resources (local machine, Condor pool, HPC, etc.). This container is based on the [Pegasus Workflow Development Environment](https://github.com/pegasus-isi/pegasus-workflow-development-environment). 


## Results

This notebook demonstrates a GeoEDF hydrologic workflow that acquires data (in HDF format) from a NASA Distributed Active Archive Center (DAAC) and aggregates the data across a provided watershed region. This is often the first step before running a hydrologic model. 

![Workflow](files/img/research.png)

In GeoEDF, this workflow combines a data connector (**NASAInput**) and processor (**HDFEOSShapefileMask**) as follows:

![Workflow](files/img/mcd.png)

The corresponding YAML GeoEDF workflow [file](./workflow/mcd15.yml) is as follows:

```yaml
$1:
  Input:
    NASAInput:
      url: https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/%{filename}
      user: rkalyana
      password:
  Filter:
    filename:
      PathFilter:
        pattern: '%{dtstring}/MCD15A3H.*.h09v07*.hdf'
    dtstring:
      DateTimeFilter:
        pattern: '%Y.%m.%d'
        start: 07/16/2002
$2:
  HDFEOSShapefileMask:
    hdffile: $1
    shapefile: /home/earthcube/geoedf/files/watershed/subs1_projected_171936.shp
    datasets: [Lai]
```

**Note:** 

1. A GeoEDF workflow combines _instances_ of connector or processor classes. The YAML syntax enables the user to specify bindings for the class arguments for the connector or processor (e.g., url, user, shapefile, etc.).
2. Filters enhance the generality of connectors. In this specific case, HDF data for a specific time period can be acquired by modifying the _DateTimeFilter_ appropriately. Similarly, wildcards ('*') in the _PathFilter_ enable search across all the files hosted in that directory on the repository.
3. Filters essentially provide one or more binding values for variables referenced in other connectors or filters. For example, the _filename_ variable in the _NASAInput_ connector is bound by the _PathFilter_; and the _dtstring_ variable in _PathFilter_ is bound by the _DateTimeFilter_.
4. Numeric indices (_$1_, _$2_) denote the workflow steps and establish output-input linkages between them. For example, _hdffile: $1_ passes the output of step 1 to the processor in step 2.
5. Fields left blank (e.g., _password_) are instantiated at workflow execution time by prompting the user to specify a value.
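
The variable binding described in the notes above can be sketched in a few lines of Python. This is only an illustration of the idea, not the GeoEDF implementation: the helper names (`date_strings`, `substitute`, `matching_paths`) are hypothetical, the directory listing is invented, and the sketch assumes that a date filter with only a start date yields that single date.

```python
import re
from datetime import datetime, timedelta
from fnmatch import fnmatchcase

# Matches variable references of the form %{name}.
VARIABLE = re.compile(r"%\{(\w+)\}")


def date_strings(pattern, start, end=None, input_format="%m/%d/%Y"):
    """Return one formatted date string per day from start to end (inclusive)."""
    first = datetime.strptime(start, input_format)
    last = datetime.strptime(end, input_format) if end else first
    return [
        (first + timedelta(days=offset)).strftime(pattern)
        for offset in range((last - first).days + 1)
    ]


def substitute(template, bindings):
    """Replace each %{name} in the template with its bound value."""
    return VARIABLE.sub(lambda match: bindings[match.group(1)], template)


def matching_paths(pattern, listing):
    """Keep the paths from a listing that match a wildcard pattern."""
    return [path for path in listing if fnmatchcase(path, pattern)]


# Templates copied from the workflow file above.
url_template = "https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/%{filename}"
path_template = "%{dtstring}/MCD15A3H.*.h09v07*.hdf"

# Invented listing that stands in for the remote directory contents.
listing = [
    "2002.07.16/MCD15A3H.A2002197.h09v07.006.example.hdf",
    "2002.07.16/MCD15A3H.A2002197.h10v07.006.example.hdf",
]

urls = []
for dtstring in date_strings("%Y.%m.%d", "07/16/2002"):
    # The date filter binds dtstring inside the path filter's pattern ...
    path_pattern = substitute(path_template, {"dtstring": dtstring})
    for filename in matching_paths(path_pattern, listing):
        # ... and each matching path binds filename inside the connector URL.
        urls.append(substitute(url_template, {"filename": filename}))
```

With the invented listing, only the `h09v07` tile matches, so `urls` ends up with a single fully bound URL. Widening the date range or loosening the wildcard would produce more bindings without touching the connector definition.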

## Funding

- Award1 = {"agency": "NSF", "award_code": "1835822", "award_URL": "https://www.nsf.gov/awardsearch/showAward?AWD_ID=1835822"}

## Keywords

keywords=["workflow", "geospatial", "Pegasus", "containers", "cyberinfrastructure"]

## Citation

Kalyanam, R., Zhao, L., Song, C., GeoEDF Demo Notebook, EarthCube 2021 Annual Conference.

## Suggested next steps

* Further documentation on GeoEDF can be found at [GeoEDF Documentation](https://geoedf.readthedocs.io/en/latest/).
* GeoEDF is coming soon to the Jupyter tool environment on the [MyGeoHub Science Gateway](https://mygeohub.org). Users will be able to create and execute workflows, and manage and publish the workflow results on this gateway.
* Current connector and processor definitions can be found in the [GeoEDF GitHub Repositories](https://github.com/geoedf). 

# Setup

## Library import

**GeoEDFWorkflow** is the primary class that will be used to instantiate and execute the workflow above.

GeoEDF uses the _sregistry_ Singularity client library to interact with the GeoEDF Singularity registry. In order to turn off the informational messages from this library, we first set the _MESSAGELEVEL_ environment variable to _QUIET_.

```python
import os

os.environ['MESSAGELEVEL'] = 'QUIET'

from geoedfengine.GeoEDFWorkflow import GeoEDFWorkflow
```

# Workflow Instantiation

A new workflow object is created by instantiating the _GeoEDFWorkflow_ class with the path to the workflow YAML file.

**Note:**

1. At this point, the GeoEDF engine will validate the workflow file for proper syntax (ensuring no cyclic dependencies, all variables are bound by filters, etc.). 
2. The user will be prompted to enter values for any variables that have been left blank (e.g., _password_).
3. Enter the value _NASAservice123_ when prompted for the password.

```python
workflow = GeoEDFWorkflow('/home/earthcube/geoedf/workflow/mcd15.yml')
```
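
To make the validation checks listed in the notes more concrete, here is a minimal sketch of how blank fields, unbound variables, and cyclic step dependencies could be detected. It is not the engine's actual validation code: a plain dictionary stands in for the parsed YAML file, and the function names (`walk`, `blank_fields`, `unbound_variables`, `step_dependencies`, `has_cycle`) are hypothetical.

```python
import re

STEP_REFERENCE = re.compile(r"^\$\d+$")      # e.g. "$1"
VARIABLE = re.compile(r"%\{(\w+)\}")          # e.g. "%{filename}"

# Dictionary equivalent of the workflow file; None marks a blank field.
workflow_spec = {
    "$1": {
        "Input": {
            "NASAInput": {
                "url": "https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/%{filename}",
                "user": "rkalyana",
                "password": None,
            }
        },
        "Filter": {
            "filename": {"PathFilter": {"pattern": "%{dtstring}/MCD15A3H.*.h09v07*.hdf"}},
            "dtstring": {"DateTimeFilter": {"pattern": "%Y.%m.%d", "start": "07/16/2002"}},
        },
    },
    "$2": {
        "HDFEOSShapefileMask": {
            "hdffile": "$1",
            "shapefile": "/home/earthcube/geoedf/files/watershed/subs1_projected_171936.shp",
            "datasets": ["Lai"],
        }
    },
}


def walk(node, path=()):
    """Yield (path, value) for every non-dict value in a nested dictionary."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, path + (key,))
    else:
        yield path, node


def blank_fields(spec):
    """Dotted paths of fields the user would need to be prompted for."""
    return [".".join(path) for path, value in walk(spec) if value is None]


def unbound_variables(step):
    """Variables referenced in a step that no filter in that step binds."""
    bound = set(step.get("Filter", {}))
    used = {
        name
        for _, value in walk(step)
        if isinstance(value, str)
        for name in VARIABLE.findall(value)
    }
    return used - bound


def step_dependencies(spec):
    """Map each step to the set of steps whose output it consumes."""
    return {
        step: {
            value
            for _, value in walk(body)
            if isinstance(value, str) and STEP_REFERENCE.match(value)
        }
        for step, body in spec.items()
    }


def has_cycle(dependencies):
    """Depth-first search for a cycle in the step dependency graph."""
    visiting, done = set(), set()

    def visit(step):
        if step in done:
            return False
        if step in visiting:
            return True
        visiting.add(step)
        cyclic = any(visit(dep) for dep in dependencies.get(step, ()))
        visiting.discard(step)
        done.add(step)
        return cyclic

    return any(visit(step) for step in dependencies)


prompts = blank_fields(workflow_spec)
dependencies = step_dependencies(workflow_spec)
```

For the demonstration workflow, `prompts` contains only the _password_ field of the _NASAInput_ connector, and `dependencies` records that step 2 consumes the output of step 1.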


# Workflow Execution

_GeoEDFWorkflow_ provides the _execute_ method for running a workflow. In this case, the workflow runs locally on a Condor pool inside the container. Other execution sites can be configured in the Pegasus site catalog and provided as input during workflow instantiation. We do not include this feature in this demonstration due to its setup complexity.

**Note:**

1. Workflow execution is synchronous; on execution, a progress bar (0 - 100%) will be displayed.
2. The workflow may take a while to run, depending on the resources available to your local Docker engine.
3. The workflow may occasionally report a _Failure_ because it reaches out to an external NASA DAAC. If this happens, check whether https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006 is reachable in your browser, and then repeat both steps: (a) instantiate the workflow, (b) execute the workflow.

```python
workflow.execute()
```
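
If the workflow reports a failure, the reachability check suggested in the notes can also be done from Python instead of a browser. The helper below is a small sketch using only the standard library; `is_reachable` is a hypothetical name, and treating any HTTP status below 500 as "reachable" is an assumption of this example rather than a rule from GeoEDF.

```python
import urllib.error
import urllib.request

DAAC_URL = "https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006"


def is_reachable(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the server at `url` answers the request.

    `opener` is injectable so the function can be exercised without
    network access; by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout):
            return True
    except urllib.error.HTTPError as error:
        # The server responded; count client-side statuses (e.g. 401)
        # as reachable, but not server-side errors.
        return error.code < 500
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return False
```

Calling `is_reachable(DAAC_URL)` before re-running the instantiation and execution cells gives a quick indication of whether the failure is likely due to the external repository being unavailable.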

# Discussion

If the workflow succeeds, the output is an ESRI Shapefile, which can be found in the output directory reported by the execute step above. However, there is no easy way to verify or visualize this result directly. Python mapping libraries (e.g., Folium or ipyleaflet) work with geospatial files, but they require vector data to be in the GeoJSON format.

As a next step, we demonstrate how a new processor can be developed for converting a shapefile into a GeoJSON file and appended as a third step to the above workflow. This demonstration can be found in the **_RK_02_EarthCube_GeoEDF_Demo_** notebook.


# References

[<sup id="fn1">1</sup>](#fn1-back) Kalyanam, R., Zhao, L., Song, X.C., Merwade, V., Jin, J., Baldos, U. and Smith, J., 2020. GeoEDF: An Extensible Geospatial Data Framework for FAIR Science. In Practice and Experience in Advanced Research Computing (pp. 207-214).