# RK_02_EarthCube_GeoEDF_Demo

## Author(s)

- Author1 = {"name": "Rajesh Kalyanam", "affiliation": "Purdue Research Computing", "email": "rkalyana@purdue.edu", "orcid": "0000-0002-7026-7419"}
- Author2 = {"name": "Lan Zhao", "affiliation": "Purdue Research Computing", "email": "lanzhao@purdue.edu"}
- Author3 = {"name": "Carol Song", "affiliation": "Purdue Research Computing", "email": "cxsong@purdue.edu"}
    


## Purpose
This notebook demonstrates how a new GeoEDF processor can be developed and used in a workflow. It is primarily intended to demonstrate the extensibility and flexibility of the GeoEDF framework.

## Methodology

1. A _Shapefile2GeoJSON_ processor has been included in the container. This domain-agnostic processor takes the directory containing an ESRI Shapefile as input and produces a GeoJSON file as output.
2. This notebook demonstrates the utilities available to package the processor so that it can be used in a workflow.
3. For this demonstration, the new processor is packaged as a Singularity container and saved locally in this container. In a production environment, the GeoEDF GitHub CI/CD pipeline packages a newly contributed processor (submitted via a pull request) and pushes it to the GeoEDF Singularity registry.

## Results

The revised three-step workflow that will be demonstrated in this notebook is as follows:

![Workflow](files/img/mcd-viz.png)

The corresponding YAML GeoEDF workflow [file](./workflow/mcd15-viz.yml) is as follows:

```
$1:
  Input:
    NASAInput:
      url: https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2002.07.16/MCD15A3H.*.h09v07*.hdf
      user: rkalyana
      password:
$2:
  HDFEOSShapefileMask:
    hdffile: $1
    shapefile: /home/earthcube/geoedf/files/watershed/subs1_projected_171936.shp
    datasets: [Lai]
$3:
  Shapefile2GeoJSON:
    inputdir: dir($2)
```
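
The `$N` references are what chain the steps together: step 2 reads the file produced by step 1, and step 3 appears to receive the directory holding step 2's outputs via `dir($2)`. The following is a minimal sketch of how such references imply an execution order. It is not the GeoEDF engine's parser; the workflow is written out by hand as a Python dictionary, and `referenced_steps` is a hypothetical helper introduced only for illustration.

```python
import re

# The workflow above, transcribed by hand into a plain dict (illustration only).
workflow = {
    "$1": {"Input": {"NASAInput": {
        "url": "https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2002.07.16/MCD15A3H.*.h09v07*.hdf",
        "user": "rkalyana",
        "password": None,  # left empty in the YAML; entered at instantiation time
    }}},
    "$2": {"HDFEOSShapefileMask": {
        "hdffile": "$1",
        "shapefile": "/home/earthcube/geoedf/files/watershed/subs1_projected_171936.shp",
        "datasets": ["Lai"],
    }},
    "$3": {"Shapefile2GeoJSON": {"inputdir": "dir($2)"}},
}

STEP_REF = re.compile(r"\$(\d+)")


def referenced_steps(value):
    """Collect the step numbers referenced anywhere inside a nested value."""
    if isinstance(value, dict):
        return set().union(*(referenced_steps(v) for v in value.values()))
    if isinstance(value, list):
        return set().union(*(referenced_steps(v) for v in value))
    if isinstance(value, str):
        return {int(n) for n in STEP_REF.findall(value)}
    return set()


# Map each step number to the earlier steps whose outputs it consumes.
dependencies = {
    int(key[1:]): sorted(referenced_steps(body)) for key, body in workflow.items()
}
print(dependencies)
```

Each step depends only on the one before it, so this particular workflow is a simple linear chain.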

The Python package definition and implementation of the _Shapefile2GeoJSON_ processor can be found [here](./files/shapefile2geojson).
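
To give a sense of what such a processor does, here is a minimal sketch of a Shapefile-to-GeoJSON conversion using GeoPandas. It is not the implementation linked above: the function names are hypothetical, and the output naming (replacing `.shp` with `.json`) is an assumption chosen to match the output file name used later in this notebook.

```python
from pathlib import Path


def geojson_path_for(shapefile, output_dir):
    """Name the output after the Shapefile, swapping the .shp suffix for .json."""
    return Path(output_dir) / (Path(shapefile).stem + ".json")


def shapefiles_to_geojson(input_dir, output_dir):
    """Convert every .shp file found in input_dir into a GeoJSON file."""
    # Third-party dependency, imported here so the naming helper above
    # can be used without GeoPandas installed.
    import geopandas as gpd

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for shp in sorted(Path(input_dir).glob("*.shp")):
        target = geojson_path_for(shp, output_dir)
        gpd.read_file(str(shp)).to_file(str(target), driver="GeoJSON")
        written.append(target)
    return written
```

Calling `shapefiles_to_geojson('/path/to/shapefile_dir', '/path/to/output_dir')` (both paths are placeholders) would write one `.json` file per Shapefile and return the list of paths written.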

## Funding

- Award1 = {"agency": "NSF", "award_code": "1835822", "award_URL": "https://www.nsf.gov/awardsearch/showAward?AWD_ID=1835822"}

## Keywords

keywords=["workflow", "geospatial", "Pegasus", "containers", "cyberinfrastructure"]

## Citation

Kalyanam, R., Zhao, L., Song, C., GeoEDF Custom Processor Demo Notebook, EarthCube 2021 Annual Conference.

## Suggested next steps

* Further documentation on GeoEDF connector and processor templates can be found at [GeoEDF Documentation](https://geoedf.readthedocs.io/en/latest/).
* Current connector and processor definitions can be found in the [GeoEDF Connector Repository](https://github.com/geoedf/connectors) and [GeoEDF Processor Repository](https://github.com/geoedf/processors). 

# Processor Development

Typically, a user would develop a new connector or processor following the template of existing processors or connectors in the GeoEDF repository. We have included an example implementation of the _Shapefile2GeoJSON_ processor [here](./files/shapefile2geojson). Apart from the typical Python library classes, there is also a _recipe.hpccm_ file that is used to create the Singularity definition file and (subsequently) container. 

_recipe.hpccm_ uses the [NVIDIA HPC Container Maker](https://github.com/NVIDIA/hpc-container-maker) utility to simplify writing Singularity and Docker definition files. Contributors can quickly specify the OS and Python library dependencies of a connector or processor without being familiar with the Singularity definition syntax. The _hpccm_ utility then generates a Singularity definition file from this recipe file.

The _build-local-image_ [script](./files/build-local-image.sh) executes the following steps to package the processor into a Singularity container that can then be used in the workflow:

1. Create a Singularity definition file from the recipe.hpccm file.
2. Build a Singularity container from the definition file.
3. Copy the resulting Singularity container image to a local /images directory where GeoEDF can find additional connectors and processors during development (in addition to the GeoEDF Singularity registry server).

We will execute this script next to create the container for the _Shapefile2GeoJSON_ processor.

In [None]:
!./files/build-local-image.sh
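
For readers curious about what such a script involves, the three steps listed above can be sketched in Python as follows. This is not the content of _build-local-image.sh_: the recipe location, definition file name, and image name are hypothetical, and only the `hpccm --recipe ... --format singularity` and `singularity build` invocations are standard usage of those tools. Depending on the installation, `singularity build` may require root privileges or the `--fakeroot` option.

```python
import shutil
import subprocess
from pathlib import Path


def plan_build(recipe_path, image_name, images_dir="/images"):
    """Describe the three packaging steps without running anything."""
    recipe = Path(recipe_path)
    definition = recipe.with_suffix(".def")  # hypothetical definition file name
    image = recipe.with_name(image_name)     # hypothetical image name
    return {
        "definition": definition,
        "image": image,
        # Step 1: hpccm prints the Singularity definition to stdout.
        "generate": ["hpccm", "--recipe", str(recipe), "--format", "singularity"],
        # Step 2: build the container image from the definition file.
        "build": ["singularity", "build", str(image), str(definition)],
        # Step 3: where the finished image is copied for local use.
        "destination": Path(images_dir) / image_name,
    }


def run_build(plan):
    """Execute the plan; requires hpccm and Singularity to be installed."""
    with open(plan["definition"], "w") as handle:
        subprocess.run(plan["generate"], stdout=handle, check=True)
    subprocess.run(plan["build"], check=True)
    shutil.copy2(plan["image"], plan["destination"])


plan = plan_build("files/shapefile2geojson/recipe.hpccm", "shapefile2geojson.sif")
print(" ".join(plan["generate"]))
print(" ".join(plan["build"]))
```

Separating the plan from its execution makes it possible to inspect the commands before running them.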

## Discussion

As noted, the ability to build and test local connector and processor Singularity containers in workflows greatly simplifies the development process. Moreover, a mix of existing and new connectors and processors can be used since the GeoEDF workflow engine can use both containers hosted on the GeoEDF Singularity registry and locally built containers. 

Once a user is satisfied with their container, they can open a GitHub pull request to contribute the new connector or processor to the corresponding GeoEDF repository. The GeoEDF CI/CD pipeline repeats essentially the same steps as the _build-local-image_ script, except that its final step pushes the Singularity container image to the GeoEDF registry instead of copying it to a local directory.

Next, we will use this new processor in our prior workflow to help produce a GeoJSON file that can be visualized in this same Jupyter notebook.

# Setup

## Library import

As before, the **GeoEDFWorkflow** class will be imported and used to execute this new workflow.

In [None]:
import os

os.environ['MESSAGELEVEL'] = 'QUIET'

from geoedfengine.GeoEDFWorkflow import GeoEDFWorkflow

# Workflow Instantiation

A new workflow object for the three-step workflow is created by instantiating the _GeoEDFWorkflow_ class with the path to the new workflow YAML file.

**Note:**

1. Enter the value _NASAservice123_ when prompted for the password.

In [None]:
workflow = GeoEDFWorkflow('/home/earthcube/geoedf/workflow/mcd15-viz.yml')


# Workflow Execution

As before, we use the _execute()_ method to execute this instantiated workflow. 

In [None]:
workflow.execute()

# Results Visualization

## Library Imports

We will first import Python libraries necessary for reading in the GeoJSON output file and then visualizing it on a map.

In [None]:
import geopandas as gpd
import folium

## Data Import 

First, we load the GeoJSON file from the workflow's output directory and create the GeoPandas _GeoDataFrame_ needed for visualization with the Folium library.

*Be sure to fill in the output directory reported by execute() above as the value for the **output_dir** variable below*

In [None]:
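# Set this to the output directory reported by workflow.execute() above
output_dir = ''
geojson_path = '%s/MCD15A3H.A2002197.h09v07.006.2015149103156.hdf.json' % output_dir
geo_df = gpd.read_file(geojson_path)
field_name = 'Lai_500m'
# Keep only the LAI field and the geometry; copy() avoids writing into a view of geo_df
mcd_df = geo_df.loc[:, [field_name, 'geometry']].copy()
# Expose the row index as an 'id' column so that Folium can join data to features
mcd_df['id'] = mcd_df.index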

## Visualization

Finally, we visualize the GeoJSON data as a color-coded map.

In [None]:
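test_map = folium.Map(location=[40, -86], zoom_start=7)

# Map.choropleth() is no longer available in recent Folium releases;
# the Choropleth class takes the same arguments and is added to the map.
folium.Choropleth(geo_data=mcd_df.to_json(),
                  data=mcd_df,
                  columns=['id', field_name],
                  key_on='feature.properties.id',
                  legend_name=field_name,
                  fill_color='YlOrRd',
                  fill_opacity=0.5,
                  line_weight=2).add_to(test_map)
test_map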

# Discussion

If the workflow succeeded, the output is a GeoJSON file that can be visualized inline in Jupyter notebooks using popular map visualization libraries.
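
A quick way to confirm that the output is usable before mapping it is to inspect the file with the standard library alone. The sketch below assumes the output is a GeoJSON `FeatureCollection`; `summarize_geojson` is a hypothetical helper, not part of GeoEDF.

```python
import json


def summarize_geojson(path, field):
    """Return (feature count, number of features carrying `field`)."""
    with open(path, encoding="utf-8") as handle:
        collection = json.load(handle)
    if collection.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    features = collection.get("features", [])
    # A feature's "properties" member may be null, hence the `or {}`.
    with_field = sum(1 for f in features if field in (f.get("properties") or {}))
    return len(features), with_field
```

For example, `summarize_geojson(geojson_path, 'Lai_500m')` reports how many features were written and how many of them carry the LAI value used for the choropleth.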

While this may be a contrived example, it demonstrates how a generic GeoEDF processor developed for one specific workflow can find use across various workflows and provide a useful tool for other researchers to develop their own end-to-end workflows in other domains. 

Previously, researchers would have had to download the resulting Shapefile after the second step to their desktop machine and visualize it on a desktop GIS tool such as QGIS. GeoEDF seeks to facilitate FAIR science by enabling researchers to conduct their end-to-end workflows entirely in a science gateway environment. Specifically, both workflow outputs and the workflow YAML file can be published with appropriate metadata to enable workflow reproducibility and validation by other researchers in the same gateway environments.

