# Human-in-the-loop Machine Learning with GroundWork and Raster Vision

This notebook will walk you through a process in which you get a base model from _somewhere_, use it to create predictions, upload those predictions to the [GroundWork](https://groundwork.azavea.com/) application, correct those predictions, and use the corrections to improve your original model. Once you can get from a model to predictions to an improved model, you can repeat that process any number of times. That's the loop. You're the human. Let's rock.

A note on the environment: this notebook is written to run inside a container launched by `docker/run` on `master` in the `raster-vision` repo. If you're not sure how a dependency is available, why we didn't set things up in advance, or why we're in Python 3.6, that's the reason. Additionally, the environment configured by this repository's Ansible scripts makes sure that we have some specific data in a specific location. If you run this outside of that configured environment, you'll need to specify a different path to your data.

## From predictions to a GroundWork project

### Step 1: Set up dependencies

The `raster-vision` container already has many of the things we'll need (PyTorch, the Python scientific stack, a bunch of system dependencies), but we'll need a few more dependencies.

These additional dependencies and what we'll use them for are:

- `geopandas`: to figure out which _tasks_ (more below) to put our predictions in
- `shapely`: for transforming GeoJSON into Python geometry objects
- `rtree`: `geopandas` needs this for spatial joins, but it's an optional dependency, so we have to say we want it

Run the cell below, then restart the notebook kernel (`0 0` or `Kernel` -> `Restart`).

In [None]:
%pip install geopandas shapely rtree

After that, we'll import everything we're going to need over the rest of the notebook:

In [None]:
import json
import functools
import time
from random import random
from os.path import join
from itertools import zip_longest
from copy import deepcopy
from uuid import uuid4

import geopandas as gpd
from geopandas.tools import sjoin
import numpy as np
import requests
from scipy.special import softmax
import shapely
from shapely.geometry import MultiPolygon, shape

### Step 2: Configuration

You'll configure a few values here. The goal of this configuration is to let you talk to the Raster Foundry API (the same API that powers GroundWork) from within the notebook.

The values you'll configure are:

- `bearer_token`: you can get this from network tools while logged in to GroundWork. This is a JSON Web Token. To see what it represents and learn more about JSON Web Tokens, you can decode it at [jwt.io](https://jwt.io/)
- `url_base`: this value configures the scheme and host for requests to the Raster Foundry API. All of our requests will start with `https://app.rasterfoundry.com`
- `source_project_id`: this value is a UUID pointing to a project template. GroundWork has two high-level grouping concepts: a _project_ is a specific image that you'd like to do labeling work in, and a _campaign_ is a group of projects. Since we have predictions over an image, we want to work at the project level in this notebook.

In [None]:
# Your token for interacting with the Raster Foundry API
bearer_token = "<token>"

# common HTTP headers shared by the requests we're going to make
headers = {"Authorization": f"Bearer {bearer_token}"}

# The base URL for the Raster Foundry API
# This will only be different if you're working with a copy of Raster Foundry that lives somewhere else
url_base = "https://app.rasterfoundry.com"

# UUID of your template project
source_project_id = "<your source project id>"
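If you'd rather not paste your token into a website, you can inspect it locally. The sketch below is not part of the original workflow: `decode_jwt_payload` is a hypothetical helper that reads the claims segment of a JSON Web Token using only the standard library. It does not verify the signature, so treat its output as informational only.

```python
import base64
import json


def decode_jwt_payload(token):
    """Return the claims in a JWT's payload segment WITHOUT verifying the signature.

    Hypothetical helper for inspection only; never use it to trust a token.
    """
    # A JWT is three base64url segments separated by dots: header.payload.signature
    payload_b64 = token.split(".")[1]
    # base64url segments drop their "=" padding, so restore it before decoding
    padding = "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64 + padding))
```

Calling `decode_jwt_payload(bearer_token)` should show fields such as the expiry time, which is handy when requests suddenly start failing with authorization errors.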

### Step 3: Create a copy of your template project

In this workshop, each iteration of the human-in-the-loop workflow creates a new project based on the template that we configured above. The steps to copy a project are:

- fetch the existing project
- create a new JSON document from fields in that project
- POST that new project to the Raster Foundry API
- fetch all the tasks in the existing project
- add them to the new project by changing their project reference

In [None]:
# Get the source project
@functools.lru_cache(None)
def fetch_project(project_id):
    return requests.get(join(url_base, "api", "annotation-projects", 
                             project_id), headers=headers).json()

# create a new JSON document from the source project
def make_project_clone_json(source_project, iteration_number=0):
    return {
        # the name is the source project name followed by the iteration number
        "name": f"""{source_project["name"]} {iteration_number}""",
        "projectType": source_project["projectType"],
        "taskSizePixels": 512,
        "aoi": source_project["aoi"],
        "labelersTeamId": source_project["labelersTeamId"],
        "validatorsTeamId": source_project["validatorsTeamId"],
        "projectId": None,
        "campaignId": source_project["campaignId"],
        "status": source_project["status"],
        "tileLayers": source_project["tileLayers"],
        "labelClassGroups": []
    }

# post the project copy to the Raster Foundry API
def post_project(project_json):
    post_hitl_url = join(url_base, "api","annotation-projects")
    post_hitl = requests.post(post_hitl_url, headers=headers, json=project_json)
    return post_hitl.json()
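A note on URL building: the cells in this notebook assemble URLs with `os.path.join`, which produces forward slashes inside the Linux container this notebook targets but would insert backslashes on Windows. If you run the notebook elsewhere, a small helper like the one sketched here may be safer. `api_url` is a hypothetical name, not something Raster Foundry or GroundWork provides.

```python
def api_url(base, *parts):
    """Join URL segments with exactly one "/" between them, on any operating system."""
    # strip stray slashes so "api/annotation-projects/" and "tasks" join cleanly
    segments = [base.rstrip("/")] + [str(part).strip("/") for part in parts]
    return "/".join(segments)
```

For example, `api_url(url_base, "api", "annotation-projects", project_id)` builds the same URL that `fetch_project` requests.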

We can bundle that workflow into a single function that takes `project_id` and `iteration_number` parameters:

In [None]:
def clone_project(project_id, iteration_number):
    source_project = fetch_project(project_id)
    new_project = make_project_clone_json(source_project, iteration_number)
    return post_project(new_project)

We can then make our first copy of the template project like so:

In [None]:
hitl_project = clone_project(source_project_id, 1)
f"""https://groundwork.azavea.com/app/campaign/{hitl_project["campaignId"]}/overview?p=0&f=all"""

You can visit that URL in the GroundWork app and select the project you just created to see that we've created a copy of the project that has the same imagery tile layer, but it doesn't appear to have any tasks.

#### Uploading tasks

GroundWork breaks labeling work into more manageable pieces called _tasks_. A task is a specific window within a larger image. If you look at the overview for the source project, each little box over the imagery is a task.

To fill in tasks, we'll need to:

* grab all of the tasks from the source project
* POST them to the new project

Since the number of tasks in the source project can be huge, we'll work in batches.

In [None]:
# Grab all of the tasks from the source project
def fetch_tasks(annotation_project_id, url_base):
    template_project_tasks_url = join(url_base,"api/annotation-projects/", annotation_project_id, "tasks")
    tasks = requests.get(template_project_tasks_url, headers=headers).json()
    has_next = tasks["hasNext"]
    next_page = 1
    while has_next:
        new_tasks_url = f"{template_project_tasks_url}?page={next_page}"
        next_tasks = requests.get(new_tasks_url, headers=headers).json()
        tasks["features"] += next_tasks["features"]
        has_next = next_tasks["hasNext"]
        next_page += 1
    return tasks

# break all tasks into manageable chunks
# modified from https://docs.python.org/3/library/itertools.html#itertools-recipes
def grouper(iterable, n):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3) --> [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]
    args = [iter(iterable)] * n
    x = zip_longest(*args)
    # workaround to remove the fill values in the chunks
    return [[ii for ii in i if ii is not None] for i in x]

# POST the tasks to the new project in groups
def copy_tasks_to_project(source_tasks, project_id):
    tasks_post_url = join(url_base, "api", "annotation-projects", project_id, "tasks")
    chunks = grouper(source_tasks['features'], 1250)
    out = {"type": "FeatureCollection", "features": []}
    for chunk in chunks:
        chunk_tasks = {"features": [], "type": "FeatureCollection"}
        for task in chunk:
            # set the status to unlabeled, no matter what it was before
            task["properties"]["status"] = "UNLABELED"
            # point the task at the new project
            task["properties"]["annotationProjectId"] = project_id
            chunk_tasks["features"] += [task]
    
        chunk_tasks_response = requests.post(tasks_post_url, headers=headers, json=chunk_tasks)
        # make sure this post request doesn't fail silently
        chunk_tasks_response.raise_for_status()
        out["features"] += chunk_tasks_response.json()["features"]
    return out

# One-shot to grab all the tasks and do the copy
def clone_tasks(from_project_id, to_project_id):
    source_tasks = fetch_tasks(from_project_id, url_base)
    return copy_tasks_to_project(source_tasks, to_project_id)
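The paging loop in `fetch_tasks` follows a common pattern: keep requesting pages until the response says there are no more. Here is that pattern in isolation, as a sketch. It assumes, as `fetch_tasks` does, that each page is a dict with `features` and `hasNext` keys and that pages are numbered from zero. `collect_pages` and `fetch_page` are hypothetical names; `fetch_page` stands in for the HTTP request so the logic can be tried without a network connection.

```python
def collect_pages(fetch_page):
    """Accumulate "features" from every page until "hasNext" is false.

    fetch_page is any callable that takes a page number and returns a dict
    with "features" (a list) and "hasNext" (a bool).
    """
    page = 0
    features = []
    while True:
        body = fetch_page(page)
        features.extend(body["features"])
        if not body["hasNext"]:
            return features
        page += 1
```

Separating the loop from the request also makes it easy to reuse for other paginated endpoints.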

In [None]:
hitl_project_tasks = clone_tasks(source_project_id, hitl_project["id"])
len(hitl_project_tasks["features"])
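The `grouper` function above pads the last chunk with `None` and then filters the padding out, which would also drop any real `None` values in the input. That isn't a problem for task features, but if you reuse the idea elsewhere, a slicing-based variant avoids the padding entirely. This is a sketch of an alternative, not the notebook's method, and `chunked` is a hypothetical name.

```python
from itertools import islice


def chunked(iterable, n):
    """Yield successive lists of at most n items, without any fill values."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, n))
        if not chunk:
            # the iterator is exhausted, so stop instead of yielding an empty chunk
            return
        yield chunk
```

Because it is a generator, it only holds one chunk in memory at a time, which can help when the source project has a very large number of tasks.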

### Step 4: Upload labels

Now that we have a project and tasks, we can upload our labels. To upload the labels, we'll need to complete three steps:

* associate each predicted label with a label class
* associate each predicted label with a task
* POST all the labels to GroundWork

Each label has to be associated with a task within a project and with a label class. GroundWork associates labels with classes via UUIDs, while Raster Vision only speaks string names, so we'll need to make sure we can translate between the names and IDs. Fortunately, we can get that translation from the campaign.

In [None]:
# get a dict to map from label name (from Raster Vision) to label class id (GroundWork / Raster Foundry)
def get_class_map(project):
    campaign_id = project['campaignId']
    get_label_class_url = join(url_base, "api", "campaigns", campaign_id, "label-class-groups")
    label_class_summary = requests.get(get_label_class_url, headers=headers).json()
    return {d['name']: d['id'] for d in label_class_summary[0]['labelClasses']}
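To make the shape of that mapping concrete, here is a sketch that runs the same dict comprehension over mock data. The response layout (a list of groups, each with a `labelClasses` list of `name`/`id` pairs) is inferred from how `get_class_map` indexes the response. The class names and UUIDs below are made up for illustration, and `build_class_map` is a hypothetical name.

```python
def build_class_map(label_class_groups):
    """Map label class names to IDs using the first label class group."""
    first_group = label_class_groups[0]
    return {label_class["name"]: label_class["id"] for label_class in first_group["labelClasses"]}


# Mock data: these names and IDs are invented, not real GroundWork values
example_groups = [
    {
        "labelClasses": [
            {"name": "building", "id": "00000000-0000-0000-0000-000000000001"},
            {"name": "background", "id": "00000000-0000-0000-0000-000000000002"},
        ]
    }
]

example_class_map = build_class_map(example_groups)
# Invert the map when you need to go from GroundWork IDs back to Raster Vision names
example_name_by_id = {class_id: name for name, class_id in example_class_map.items()}
```

Note that only the first label class group is used, so this approach assumes the campaign has a single group.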

Joining labels to tasks is slightly more complex. Our labels in this case are chip classification labels. Those chips won't align perfectly with the task grid that we created. However, both the tasks and the chips happen to be square, so we know that if the centroid of a label is located within a task, that's the most appropriate task for the label. We can do this kind of join with geopandas. We also need to keep the predictions' original geometries, so we return those in a separate mapping.

In [None]:
def associate_tasks(predictions_feature_collection, tasks_feature_collection):
    geom_mapping = {}
    tasks_df = gpd.GeoDataFrame.from_features(tasks_feature_collection["features"], crs="epsg:4326")
    copied = deepcopy(predictions_feature_collection)
    for f in copied["features"]:
        f["properties"]["score"] = np.max(softmax(f["properties"]["scores"]))
        geom = shape(f["geometry"])
        ad_hoc_id = str(uuid4())
        geom_mapping[ad_hoc_id] = geom
        f["properties"]["ad-hoc-id"] = ad_hoc_id
        # find the centroids which we will use for easier joining to task grid
        f["geometry"] = geom.centroid
    return {"original_geometries": geom_mapping,
            "joined": sjoin(
                tasks_df,
                gpd.GeoDataFrame.from_features(copied['features'], crs='EPSG:4326'),
                how="left"
            )}
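If the centroid idea feels abstract, this pure-Python sketch shows it with axis-aligned boxes given as `(minx, miny, maxx, maxy)` tuples. It is an illustration of the matching rule only, not a replacement for the geopandas join above, and `centroid` and `assign_to_task` are hypothetical names.

```python
def centroid(box):
    """Return the center point of an axis-aligned box (minx, miny, maxx, maxy)."""
    minx, miny, maxx, maxy = box
    return ((minx + maxx) / 2, (miny + maxy) / 2)


def assign_to_task(chip_box, task_boxes):
    """Return the ID of the task whose box contains the chip's centroid, or None."""
    cx, cy = centroid(chip_box)
    for task_id, (minx, miny, maxx, maxy) in task_boxes.items():
        # half-open intervals, so a centroid on a shared edge matches one task only
        if minx <= cx < maxx and miny <= cy < maxy:
            return task_id
    return None
```

A chip that straddles two tasks goes to whichever task holds its center. One difference worth knowing: the spatial join uses an "intersects" test by default, so a centroid that lands exactly on a shared task edge could match both neighbors there, whereas this sketch picks one.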

In [None]:
# rv_output_uri must already hold the path to the GeoJSON predictions that Raster Vision produced
with open(rv_output_uri, "r") as inf:
    predictions_feature_collection = json.load(inf)

labels_with_task_ids = associate_tasks(predictions_feature_collection, hitl_project_tasks)

Finally, we can post these predictions to GroundWork:

In [None]:
# collect the json needed to post labels from each row in the labels with task ID table
def features_to_label_post_body(group, geom_mapping, class_map):
    def feature_to_label(r):
        try:
            return {
                "type": "Feature",
                "properties": {
                    "annotationLabelClasses": [class_map[r["class_name"]]],
                    "score": r["score"],
                },
                "geometry": shapely.geometry.mapping(geom_mapping[r["ad-hoc-id"]]),
                "id": r["id"],
            }
        except Exception:
            # show the offending row before re-raising so failures are easy to debug
            print("Failing row:")
            print(r)
            raise

    return {
        "type": "FeatureCollection",
        # tasks with no matching prediction have a missing class_name after the left join; skip them
        "features": [
            feature_to_label(r)
            for _, r in group.iterrows()
            if not gpd.pd.isna(r["class_name"])
        ],
        "nextStatus": "LABELED",
    }

def upload_labels(project_id, joined, original_geometries, class_map):
    for task_id, task_labels in joined.groupby('id'):
        label_upload_body = features_to_label_post_body(task_labels, original_geometries, class_map)
        label_upload_url = join(url_base, "api", "annotation-projects", project_id, "tasks", task_id, "labels")
        # we use a PUT here so that if we fail in the middle, we can try again and replace the labels
        label_upload_response = requests.put(label_upload_url, headers=headers, json=label_upload_body)
        # make sure this PUT request doesn't fail silently
        label_upload_response.raise_for_status()

In [None]:
class_map = get_class_map(hitl_project)
upload_labels(hitl_project["id"], class_map=class_map, **labels_with_task_ids)

### Putting it all together

The above steps are all separated out, but we can instead write a single function that runs through the entire workflow like so:

In [None]:
def create_rv_label_project(source_project_id, iteration_number, rv_output_uri):
    hitl_project = clone_project(source_project_id, iteration_number)
    hitl_project_tasks = clone_tasks(source_project_id, hitl_project["id"])
    
    with open(rv_output_uri, "r") as inf:
        predictions_feature_collection = json.load(inf)
    labels_with_task_ids = associate_tasks(predictions_feature_collection, hitl_project_tasks)
    class_map = get_class_map(hitl_project)
    upload_labels(hitl_project["id"], class_map=class_map, **labels_with_task_ids)
    return hitl_project

In [None]:
create_rv_label_project(source_project_id, 1, rv_output_uri)

We'll return to this step later when we run through the loop again.

## From GroundWork to new training data

Before, we set a task status of `UNLABELED` when we created tasks, then `LABELED` when we uploaded labels to them. There's a more advanced status, `VALIDATED`, that indicates that a human has reviewed some labels and signed off on them. We can use the validation process in GroundWork to correct the predictions produced by Raster Vision.

The validation workflow uses GroundWork. After you're done with that, come back here.

...

...

...

Done validating? Great. Let's create some new training data.

To do that, we'll create a STAC export. STAC is short for the [SpatioTemporal Asset Catalog specification](https://github.com/radiantearth/stac-spec), an open, extensible standard for describing geospatial data. GroundWork knows how to export STAC catalogs that implement the [label extension](https://github.com/stac-extensions/label), and Raster Vision knows how to train models from them. We can create a new export using the `/api/stac` endpoint:

In [None]:
def create_validated_stac_export(campaign_id):
    export_post = {
        "name": "GW HITL Workshop export",
        "license": {"license": "proprietary"},
        "taskStatuses": ["VALIDATED"],
        "exportAssetType": None,
        "campaignId": campaign_id
    }
    resp = requests.post(f"{url_base}/api/stac", headers=headers, json=export_post)
    resp.raise_for_status()
    return resp.json()

In [None]:
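# this is the campaign ID from the original run; substitute your own, for example hitl_project["campaignId"]
export = create_validated_stac_export("9ed79bc2-dafa-4fa9-a4f8-0bce214d070e")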
export

We can wait for the export to complete within the notebook by checking its status and sleeping -- exports don't take very long so this won't run too many times:

In [None]:
def wait_for_export(export):
    export = requests.get(f"""{url_base}/api/stac/{export["id"]}""", headers=headers).json()
    if export["exportStatus"] != "EXPORTED":
        time.sleep(10)
        return wait_for_export(export)
    return export

In [None]:
completed = wait_for_export(export)
completed["downloadUrl"]
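The recursive `wait_for_export` works well for short waits, but it never gives up if an export stalls. A loop with a deadline is one way to guard against that. The sketch below is an alternative under stated assumptions, not the notebook's method: `poll_until` is a hypothetical helper, and the 10-second interval and 10-minute timeout are arbitrary defaults. The request and the "done" check are passed in as callables so the loop can be tried without calling the API.

```python
import time


def poll_until(fetch, is_done, interval_s=10, timeout_s=600,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch() repeatedly until is_done(result) is true or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        result = fetch()
        if is_done(result):
            return result
        if clock() >= deadline:
            raise TimeoutError(f"gave up after {timeout_s} seconds")
        # wait before asking again so we don't hammer the API
        sleep(interval_s)
```

In this notebook, `fetch` could be a lambda wrapping the same `requests.get(...).json()` call that `wait_for_export` makes, with `is_done` set to `lambda e: e["exportStatus"] == "EXPORTED"`.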

We can download and unzip that export easily with bash:

In [None]:
%%bash
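mkdir -p export-data
# replace the URL below with the value of completed["downloadUrl"]; presigned URLs like this one expire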
downloadUrl='https://rasterfoundry-production-data-us-east-1.s3.amazonaws.com/stac-exports/cc1678fd-0f24-476a-be7f-0c3e981ff6d8/catalog.zip?X-Amz-Security-Token=IQoJb3JpZ2luX2VjEJr%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJGMEQCIFkuOkC11O%2Bcl7LsnSqZwcK77f3axuEgszasnww38bbcAiAaemx4vCWZCO90yiDqhBP7q83QXq%2BO1vw4ze%2FidcsZaCqIBAjz%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAIaDDYxNTg3NDc0NjUyMyIMmSLpWhou1O2ONFOJKtwDJZXweWurOROFfU5NrtJ2Mj70rUCatyoGoKikb4mNiHEnaHbRer6Xf8omqn2GJt%2BP82uOhacYhc6OGXrp9nx%2FpEnxBA95bcuGFzgPaE3KiTo%2Beiop6oaoLPkNqKzV2X35ohtvjZF89ycS3jkPSlq%2F0IWPDsgBuFvpGzkHkrLG1x4tV7%2B8rE4IVoRC7E6hmpa0UKoMcG4sFWHgwRi%2BQF8nP81rm7cB%2BN7qRHWzLVCa0f9l%2FuNCR3AVPXZx%2BPe1ySzd3g3QbpFVVf3bejZi1100xYBRf2Ac3gqfvG2J26OwRi34MDJDhDPec9nUwSyX6kpAn9dbgkTwb8657yz5wHf9VYx3Ke0YmOvKBGhVRLzJOYPYsjPtBfbiFSJq4un5QujR6PmcMBJSRZ31AJjNgFGgst5d5VyhEGgwnCH%2BI2S1Qa2gYSlRZg8hVTte0zPNieboAptKPI7LpLZjQy4PzZZqXSplccPj9PdBMv6xBgFCFjulJZ54ACKpZEcypy0uL1q2gFN0SfHyBQQk9ZKDUmk%2FF71e4vHEDVZVn%2BUDjzJUoBAHeTrgF%2BkrdYXad26ekTsmqp%2BbXmi2SQWytiFwawke56PLYEou47c6kooD9Q51zDbS7XE7G2BDFz8fdY0wnbqoigY6pgFN7Wz%2FcYw13v16y7Pku0EywelbudBWAA93sHJkaJzJhUbW0YXA3etDdwCxBh%2FsJFme37U1XdZUxv3gyS4RXs%2BpzVqCGJ1atJwaKQKA0N3KmW0MYwpgTyDDLuyr0TCLQx0zPO3o0pp29b2aONxP1ZuByQ1ANfs3RsUSRifFWOeCaZaDkcFLtXbOkeJrSScHeMYQfQDVbekM2%2FIg%2FMItkuVigKhIdvQN&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20210921T204532Z&X-Amz-SignedHeaders=host&X-Amz-Expires=86399&X-Amz-Credential=ASIAY6ZH63CNRJTL6PWT%2F20210921%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=cedd6bfc5e0aaac69e01d65d435f7e1dc2432966cc47d32f6ef959789f5c176b'
wget -O export-data/stac-export.zip "${downloadUrl}"
cd export-data
unzip stac-export.zip

To see what's in the export, we can explore it using the notebook's file browser. Start with `README.md`. Now we have new training data!
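If you'd rather stay in Python than use the file browser, the unzip step and a quick listing can be sketched with the standard library. `extract_and_list` is a hypothetical helper, and it assumes the zip file has already been downloaded to disk.

```python
import zipfile
from pathlib import Path


def extract_and_list(zip_path, dest_dir):
    """Unzip zip_path into dest_dir and return the extracted files' relative paths, sorted."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # list every file under dest, relative to dest, with forward slashes
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file())
```

Note that the listing covers everything under `dest_dir`, so if the zip file itself lives in that directory it will show up too.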

## From new training data to new predictions

??????