# GeoEDF_Tutorial_02

## Purpose
This notebook demonstrates how a new GeoEDF processor can be developed and used in a workflow. It is primarily intended to demonstrate the extensibility and flexibility of the GeoEDF framework.

## Methodology

1. A _Shapefile2GeoJSON_ processor has been included in the container. This is a domain-agnostic processor that takes the directory containing an ESRI Shapefile as input and produces a GeoJSON file as output.
2. This notebook demonstrates the utilities available to package the processor so that it can be used in a workflow.
3. For this demonstration, the new processor will be packaged as a Singularity container and saved locally in this container. In a production environment, the GeoEDF GitHub CI/CD pipeline packages a processor contributed via a pull request and pushes it to the GeoEDF Singularity registry.
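
The conversion described in item 1 could be sketched as follows. This is not the processor's actual implementation (that is in the package linked later in this notebook). The function names, the `.json` output extension, and the reprojection to EPSG:4326 are assumptions made for illustration, and the conversion step relies on GeoPandas being installed.

```python
from pathlib import Path


def plan_outputs(input_dir, output_dir):
    """Pair each .shp file in input_dir with a GeoJSON path in output_dir."""
    pairs = []
    for shp in sorted(Path(input_dir).glob("*.shp")):
        # Assumption: the output keeps the Shapefile's base name with a .json suffix.
        pairs.append((shp, Path(output_dir) / (shp.stem + ".json")))
    return pairs


def shapefiles_to_geojson(input_dir, output_dir):
    """Convert every Shapefile found in input_dir to GeoJSON (illustrative only)."""
    # Imported lazily so that plan_outputs() can be used without GeoPandas.
    import geopandas as gpd

    written = []
    for shp, out in plan_outputs(input_dir, output_dir):
        gdf = gpd.read_file(str(shp))
        # Assumption: reproject to WGS84 (EPSG:4326), which web maps expect.
        if gdf.crs is not None:
            gdf = gdf.to_crs(epsg=4326)
        gdf.to_file(str(out), driver="GeoJSON")
        written.append(out)
    return written
```

Separating the path planning from the conversion keeps the directory-in, file-out contract easy to check without any geospatial dependencies.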

## Results

The revised three-step workflow demonstrated in this notebook is as follows:

![Workflow](files/img/mcd-viz.png)

The corresponding YAML GeoEDF workflow [file](./workflow/mcd15-viz.yml) is as follows:

```
$1:
  Input:
    NASAInput:
      url: https://opendap.cr.usgs.gov/opendap/hyrax/DP131/MOTA/MCD15A3H.061/2002.07.16/MCD15A3H.*.h09v07*.hdf
      user: rkalyana
      password:
$2:
  HDFEOSShapefileMask:
    hdffile: $1
    shapefile: /home/jovyan/CI4FAIR/files/watershed/subs1_projected_171936.shp
    datasets: [Lai]
$3:
  Shapefile2GeoJSON:
    inputdir: dir($2)

```
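
To make the stage references (`$1`, `dir($2)`) concrete, the following sketch extracts which earlier stages each stage depends on. It is only an illustration of how the references chain the steps together, not the GeoEDF workflow engine's own parsing logic. The workflow is transcribed by hand into a dictionary (credentials omitted), and `find_refs` is a hypothetical helper.

```python
import re

# The workflow above, transcribed as a plain dict (user and password omitted).
workflow = {
    "$1": {"Input": {"NASAInput": {
        "url": "https://opendap.cr.usgs.gov/opendap/hyrax/DP131/MOTA/"
               "MCD15A3H.061/2002.07.16/MCD15A3H.*.h09v07*.hdf",
    }}},
    "$2": {"HDFEOSShapefileMask": {
        "hdffile": "$1",
        "shapefile": "/home/jovyan/CI4FAIR/files/watershed/subs1_projected_171936.shp",
        "datasets": ["Lai"],
    }},
    "$3": {"Shapefile2GeoJSON": {"inputdir": "dir($2)"}},
}

STAGE_REF = re.compile(r"\$(\d+)")


def find_refs(value):
    """Collect the stage numbers referenced anywhere inside a nested value."""
    if isinstance(value, dict):
        return set().union(*(find_refs(v) for v in value.values()))
    if isinstance(value, list):
        return set().union(*(find_refs(v) for v in value))
    if isinstance(value, str):
        return {int(n) for n in STAGE_REF.findall(value)}
    return set()


dependencies = {
    int(stage.lstrip("$")): sorted(find_refs(body))
    for stage, body in workflow.items()
}
print(dependencies)  # {1: [], 2: [1], 3: [2]}
```

Stage 2 consumes the file produced by stage 1, and stage 3 consumes the directory holding the output of stage 2.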

The Python package definition and implementation of the _Shapefile2GeoJSON_ processor can be found [here](./files/shapefile2geojson).

# Processor Development

Typically, a user develops a new connector or processor by following the template of existing connectors or processors in the GeoEDF repository. We have included an example implementation of the _Shapefile2GeoJSON_ processor [here](./files/shapefile2geojson). In addition to the usual Python package classes, it contains a _recipe.hpccm_ file that is used to create the Singularity definition file and, subsequently, the container.

_recipe.hpccm_ uses the [NVIDIA HPC Container Maker](https://github.com/NVIDIA/hpc-container-maker) utility to simplify writing Singularity definition files and Dockerfiles. This allows contributors to quickly specify the OS and Python library dependencies for a connector or processor without having to be familiar with the Singularity definition syntax. The _hpccm_ utility then generates a Singularity definition file from this recipe file.

The _build-local-image_ [script](./files/build-local-image.sh) executes the following steps to package the processor into a Singularity container that can then be used in the workflow:

1. Create a Singularity definition file from the recipe.hpccm file.
2. Build a Singularity container from the definition file.
3. Copy the resulting Singularity container image to a local /images directory where GeoEDF can find additional connectors and processors during development (in addition to the GeoEDF Singularity registry server).
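
The three steps above can be sketched as a list of commands. This is a hedged approximation, not the contents of the _build-local-image_ script: the recipe location, the definition file name, and the image name `shapefile2geojson.sif` are hypothetical. The `hpccm --recipe ... --format singularity` and `singularity build` invocations are the standard forms of those tools; `hpccm` writes the definition to standard output, so it is redirected to a file.

```python
import shlex
import subprocess
from pathlib import Path


def build_steps(recipe_path, image_name, images_dir="/images"):
    """Return (argv, stdout_file) pairs for the three packaging steps."""
    recipe = Path(recipe_path)
    definition = recipe.with_suffix(".def")  # hypothetical definition file name
    image = recipe.with_name(image_name)
    return [
        # 1. Generate the Singularity definition file from the recipe.
        (["hpccm", "--recipe", str(recipe), "--format", "singularity"], definition),
        # 2. Build the container image from the definition file.
        (["singularity", "build", str(image), str(definition)], None),
        # 3. Copy the image to the local images directory.
        (["cp", str(image), str(Path(images_dir) / image_name)], None),
    ]


def run_steps(steps):
    """Execute the steps in order, stopping at the first failure."""
    for argv, stdout_file in steps:
        if stdout_file is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_file, "w") as out:
                subprocess.run(argv, check=True, stdout=out)


# Print the commands instead of running them; the paths are illustrative only.
for argv, stdout_file in build_steps("files/shapefile2geojson/recipe.hpccm",
                                     "shapefile2geojson.sif"):
    print(shlex.join(argv) + (f" > {stdout_file}" if stdout_file else ""))
```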

We will execute this script next to create the container for the _Shapefile2GeoJSON_ processor.

In [None]:
!./files/build-local-image.sh

## Discussion

As noted, the ability to build and test local connector and processor Singularity containers in workflows greatly simplifies the development process. Moreover, a mix of existing and new connectors and processors can be used since the GeoEDF workflow engine can use both containers hosted on the GeoEDF Singularity registry and locally built containers. 

Once a user is satisfied with their container, they can use a GitHub pull request to contribute the new connector or processor to the corresponding GeoEDF repository. The GeoEDF CI/CD pipeline essentially repeats the same steps as the _build-local-image_ script, except that the final step pushes the Singularity container image to the GeoEDF registry instead of copying it to a local directory.

Next, we will use this new processor in our prior workflow to help produce a GeoJSON file that can be visualized in this same Jupyter notebook.

# Setup

## Library import

As before, the **WorkflowEngine** class will be imported and used to execute this new workflow.

In [None]:
import os

os.environ['MESSAGELEVEL'] = 'QUIET'

from geoedfengine.WorkflowEngine import WorkflowEngine

# Workflow Instantiation and Execution

The new three-step workflow can be executed as before using the **_execute_workflow_** method of the _WorkflowEngine_ class.

**Note:** Enter your EarthData password when prompted for the password.

In [None]:
WorkflowEngine.execute_workflow('/home/jovyan/CI4FAIR/workflow/mcd15-viz.yml','mcd-viz')

In [None]:
WorkflowEngine.workflow_status('mcd-viz')

# Results Visualization

If the workflow succeeded, a GeoJSON file is produced that can be visualized on a map. 

**Note:** The workflow engine may take some time to copy outputs to the workflow output path.
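
Because the outputs may arrive with a delay, one option is to poll the output directory until a GeoJSON file appears. The helper below is a hypothetical sketch, not part of GeoEDF. It assumes the outputs land in an `output` subdirectory of the workflow output path and that the file ends in `.json` or `.geojson`.

```python
import time
from pathlib import Path


def wait_for_geojson(output_dir, timeout=60.0, poll=2.0):
    """Return the GeoJSON files under <output_dir>/output, waiting up to `timeout` seconds.

    An empty list means nothing appeared before the deadline.
    """
    folder = Path(output_dir) / "output"
    deadline = time.monotonic() + timeout
    while True:
        entries = folder.iterdir() if folder.is_dir() else []
        found = sorted(
            p for p in entries if p.suffix.lower() in {".json", ".geojson"}
        )
        if found or time.monotonic() >= deadline:
            return found
        time.sleep(poll)
```

The first path in the returned list could then be used as the GeoJSON file path in the data import step.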

## Library Imports

We will first import Python libraries necessary for reading in the GeoJSON output file and then visualizing it on a map.

In [None]:
import geopandas as gpd
import folium

## Data Import 

First, we read the GeoJSON file from the workflow's output directory and create the GeoPandas _GeoDataFrame_ needed for visualization with the Folium library.

**Notes:**
1. Be sure to fill in the output directory reported by execute_workflow() above as the value for the **output_dir** variable below.
2. Be sure to copy the right filename in the **geojson_path** variable below.


In [None]:
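# Fill in the output directory reported by execute_workflow() above
output_dir = '<path>'
# Replace the placeholder with the actual GeoJSON filename
geojson_path = '%s/output/<geoJSON filename>' % output_dir
geo_df = gpd.read_file(geojson_path)
field_name = 'Lai_500m'
# Copy the two needed columns so that adding 'id' does not touch a view of geo_df
mcd_df = geo_df.loc[:, [field_name, 'geometry']].copy()
# The choropleth joins data values to features through this column
mcd_df['id'] = mcd_df.index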

## Visualization

Finally, we visualize the GeoJSON data as a color-coded (choropleth) map.

In [None]:
# Initial map center as [latitude, longitude]
test_map = folium.Map(location=[40, -86], zoom_start=7)

# Shade each feature by its field_name value, joining data to geometry on 'id'
folium.Choropleth(
    geo_data=mcd_df.to_json(),
    data=mcd_df,
    columns=['id', field_name],
    key_on='feature.properties.id',
    legend_name=field_name,
    fill_color='YlOrRd',
    fill_opacity=0.5,
    line_weight=2,
).add_to(test_map)
test_map
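
If the hard-coded map center does not match your data, the center can be derived from the GeoJSON itself. The following dependency-free sketch computes the midpoint of the bounding box. The helper names are hypothetical, and it assumes every feature has a geometry with a `coordinates` member given as longitude/latitude in degrees.

```python
def iter_positions(coords):
    """Yield (lon, lat) pairs from arbitrarily nested GeoJSON coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_positions(part)


def geojson_center(feature_collection):
    """Return [lat, lon] of the bounding-box center, the order folium.Map expects."""
    lons, lats = [], []
    for feature in feature_collection["features"]:
        for lon, lat in iter_positions(feature["geometry"]["coordinates"]):
            lons.append(lon)
            lats.append(lat)
    return [(min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2]


# A made-up square polygon, used only to exercise the helper.
sample = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {},
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[-87, 39], [-85, 39], [-85, 41], [-87, 41], [-87, 39]]],
        },
    }],
}
print(geojson_center(sample))  # [40.0, -86.0]
```

With the workflow output loaded through `json.load`, the result can be passed as the `location` argument of `folium.Map`.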

# Discussion

While this may be a contrived example, it demonstrates how a generic GeoEDF processor developed for one specific workflow can find use across various workflows and provide a useful tool for other researchers to develop their own end-to-end workflows in other domains. 

Previously, researchers would have had to download the Shapefile produced by the second step to their desktop machine and visualize it in a desktop GIS tool such as QGIS. GeoEDF seeks to facilitate FAIR science by enabling researchers to conduct their end-to-end workflows entirely in a science gateway environment. Specifically, both the workflow outputs and the workflow YAML file can be published with appropriate metadata so that other researchers can reproduce and validate the workflow in the same gateway environment.

