# Computer Vision on the Descartes Labs Platform - Generate Training Data
__________________
This notebook demonstrates how to use the Descartes Labs Python APIs to prototype and iterate on training data generation for an image segmentation model. It is meant to serve _solely as a starting point_, not as a solution for every machine learning need.

The general outline of this sample is as follows:
* Explore our study area interactively with [`Dynamic Compute`](https://docs.descarteslabs.com/api/dynamic-compute.html), including:
    * Reading in a table of training features as a [`Vector`](https://docs.descarteslabs.com/api/vector.html) table. In this sample we will look at wellpads in West Texas
    * Overlaying [National Agricultural Imagery Program (NAIP)](https://app.descarteslabs.com/explorer/datasets/usda:naip:v1) high resolution optical imagery for our model input
* Split up the study area into [`DLTile`](https://docs.descarteslabs.com/descarteslabs/geo/readme.html#descarteslabs.geo.DLTile)s
* Define an asynchronous [`Function`](https://docs.descarteslabs.com/descarteslabs/compute/readme.html#descarteslabs.compute.Function) to map over each tile, which:
    * Searches and retrieves NAIP imagery from [`Catalog`](https://docs.descarteslabs.com/descarteslabs/catalog/readme.html)
    * Masks the imagery to the intersecting training features
    * Returns the corresponding **nir**, **red**, and **green** band values and feature masks
* Retrieve and format the results of the function for input into a TensorFlow model in [03b Training a Segmentation Model.ipynb](03b%20Training%20a%20Segmentation%20Model.ipynb)

In [None]:
import descarteslabs as dl
from descarteslabs.catalog import Blob, Product, Image, properties as p
from descarteslabs.compute import Function, Job

In [None]:
import descarteslabs.dynamic_compute as dc
from descarteslabs.vector import Table

In [None]:
import json, os, rasterio, sys
import geopandas as gpd
import numpy as np

from rasterio.mask import raster_geometry_mask
import matplotlib.pyplot as plt

Defining global variables, including NAIP Product ID and a Function name:

In [None]:
naip_pid = "usda:naip:v1"
wellpad_tid = "descarteslabs:wellpad-example-training-data"
func_name = "Pull Wellpad Training Data"

In [None]:
major = sys.version_info.major
minor = sys.version_info.minor
compute_image = f"python{major}.{minor}:latest"
compute_image

## Setting the Scene with Dynamic Compute

Here we will set up an interactive map to explore our study area in more detail. In the subsequent cells we will visualize the training feature collection, a set of 1000 wellpad outlines in West Texas, and overlay it on 1m resolution NAIP imagery collected in 2016.

Setting up an ipyleaflet map, including center coordinates and zoom level:

In [None]:
m = dc.map

m.center = 33.4730, -101.4974
m.zoom = 14

Create a Mosaic of NAIP Imagery for our time period:

In [None]:
naip_mosaic = dc.Mosaic.from_product_bands(
    naip_pid, "nir red green", start_datetime="2016-01-01", end_datetime="2017-01-01"
)
naip_mosaic.visualize("NAIP FCC", m)

Then get and visualize our training features table:

In [None]:
wellpad_table = Table.get(wellpad_tid)
wellpad_table.visualize("Wellpads", m)

And finally display the map:

In [None]:
m

## Tiling 

Here we will create a list of tiles for our area. First, retrieve the table as a geodataframe:

In [None]:
wellpad_gdf = wellpad_table.collect()
wellpad_gdf.head(2)

In [None]:
dltiles = [
    dl.geo.DLTile.from_latlon(
        i.centroid.y, i.centroid.x, resolution=1.0, tilesize=256, pad=0
    )
    for i in wellpad_gdf.geometry.tolist()
]
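
Because DLTiles lie on a fixed grid, two wellpads that sit close together can resolve to the same tile, which would put duplicate samples into the training set. The sketch below shows one way to drop repeats while keeping the original order. The key strings are placeholders standing in for `[t.key for t in dltiles]`, not real DLTile keys.

```python
# Placeholder strings standing in for [t.key for t in dltiles]
keys = ["tile-a", "tile-b", "tile-a", "tile-c", "tile-b"]

# dict.fromkeys keeps the first occurrence of each key and preserves order
unique_keys = list(dict.fromkeys(keys))

print(unique_keys)
```

The de-duplicated keys could then be turned back into tiles with `dl.geo.DLTile.from_key`, which is the same call the compute function in this notebook uses.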

### _Note on tiling:_
_In the above example we simply demonstrate one way of tiling up your AOIs. In practice the ideal method may vary depending on both the dimensions and distribution of your training data._
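
As an illustration of a different strategy, a study area can be covered with a regular grid instead of one tile per feature. The following is a minimal sketch in plain Python that only computes tile origins in projected coordinates (meters). `grid_origins` is a hypothetical helper, not part of the Descartes Labs API, and the bounding box is made up.

```python
import math


def grid_origins(bounds, tilesize_px=256, resolution_m=1.0):
    """Return lower-left corners of square tiles covering a bounding box.

    bounds is (min_x, min_y, max_x, max_y) in projected meters.
    Hypothetical helper for illustration only.
    """
    min_x, min_y, max_x, max_y = bounds
    step = tilesize_px * resolution_m  # tile edge length in meters
    n_cols = math.ceil((max_x - min_x) / step)
    n_rows = math.ceil((max_y - min_y) / step)
    return [
        (min_x + col * step, min_y + row * step)
        for row in range(n_rows)
        for col in range(n_cols)
    ]


# A made-up 1000 m x 600 m study area
origins = grid_origins((0.0, 0.0, 1000.0, 600.0))
print(len(origins))
```

A grid like this also yields tiles without any features, which may or may not be desirable as negative examples depending on the model.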

## Feature Masking - Generating Training Data with Tiles
Next up we will go through the methodology of generating our training inputs step by step before defining our asynchronous function:
* Catalog search NAIP imagery for each tile
* Download imagery as geotiff
* Clip input training features to the image
* Mask input features to image
* Return mask and array values

Search NAIP imagery over a sample tile:

In [None]:
naip_prod = Product.get(naip_pid)
naip_search = naip_prod.images()
naip_ic = (
    naip_search.intersects(dltiles[0]).filter("2016-01-01" < p.acquired < "2017-01-01")
).collect()
naip_ic

Download mosaic as a GeoTIFF:

In [None]:
naip_ic.download_mosaic(["nir", "red", "green"], dest="temp.tif", format="tif")

Clip training features to DLTile extent:

In [None]:
clip_table = Table.get(wellpad_tid, aoi=dltiles[0])
clip_gdf = clip_table.collect().to_crs(dltiles[0].crs)
clip_gdf.plot()

Mask input feature to the image:

In [None]:
fig, ax = plt.subplots(figsize=(12, 3), nrows=1, ncols=4)
with rasterio.open("temp.tif") as in_ds:
    # out_msk is True outside the features and False inside them
    out_msk, out_trans, out_wind = raster_geometry_mask(
        in_ds,
        clip_gdf.geometry.tolist(),
    )
    # Read all bands as an array of shape (bands, rows, cols)
    arr = in_ds.read()
    for i, band in enumerate(["nir", "red", "green"]):
        band_arr = arr[i, :, :]
        # Hide every pixel that falls outside the features
        msk_band_arr = np.ma.masked_where(out_msk, band_arr)
        ax[i].imshow(msk_band_arr)
        ax[i].set_title(band + " masked")
    # Show the raw boolean mask in the last panel
    ax[3].imshow(out_msk)
    ax[3].set_title("mask")
os.remove("temp.tif")
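
By default `raster_geometry_mask` returns a boolean array that is `True` for pixels outside the input shapes and `False` inside them, so the mask has to be inverted to serve as a segmentation label. The toy example below illustrates that polarity with NumPy alone. The 4x4 band and the hand-made mask are stand-ins for `band_arr` and `out_msk`.

```python
import numpy as np

# Toy 4x4 band with arbitrary values
band = np.arange(16, dtype=np.uint8).reshape(4, 4)

# Hand-made stand-in for out_msk: True outside the feature, False inside
outside = np.ones((4, 4), dtype=bool)
outside[1:3, 1:3] = False  # a 2x2 "wellpad" in the center

# Masked array that hides every pixel outside the feature
masked_band = np.ma.masked_where(outside, band)

# Inverted mask: True where the feature is
label = ~outside

print(masked_band.compressed())  # band values inside the feature only
print(int(label.sum()))  # number of feature pixels
```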

## Scaling with Batch Compute

Now that we've covered the methodology, we can wrap the above code into a self-contained Python function and send it to our Batch Compute service. In the example below, the only input argument is a single tile key, and the function returns a dictionary containing the unmasked band values and the associated feature mask as nested lists.

In [None]:
def pull_training_data(dltile_key):
    import os, rasterio, numpy as np
    from rasterio.mask import raster_geometry_mask

    from descarteslabs.catalog import Product, properties as p
    import descarteslabs as dl
    from descarteslabs.vector import Table

    # Global variables
    naip_pid = "usda:naip:v1"
    bands = ["nir", "red", "green"]
    wellpad_tid = "descarteslabs:wellpad-example-training-data"
    print("Starting process...")
    # Creating tile
    dltile = dl.geo.DLTile.from_key(dltile_key)

    # Retrieving features within the tile
    local_table = Table.get(wellpad_tid, aoi=dltile)
    local_gdf = local_table.collect().to_crs(dltile.crs)

    print("Downloaded GDF...")
    # Search NAIP
    naip_prod = Product.get(naip_pid)
    naip_ic = (
        naip_prod.images()
        .intersects(dltile)
        .filter("2016-01-01" < p.acquired < "2017-01-01")
    ).collect()
    # Download imagery
    naip_ic.download_mosaic(bands, dest="temp.tif")
    print("Downloaded GeoTIFF...")
    # Set up results dict
    data_dict = {"key": dltile_key, "data": {}}
    # Open the GeoTIFF and build the feature mask
    print("Masking to features")
    with rasterio.open("temp.tif") as in_ds:
        # out_msk is True outside the features and False inside them
        out_msk, out_trans, out_wind = raster_geometry_mask(
            in_ds,
            local_gdf.geometry.tolist(),
        )
        # Read all bands as an array of shape (bands, rows, cols)
        arr = in_ds.read()
        # Store the unmasked values of each band as nested lists
        for i, band in enumerate(bands):
            data_dict["data"][band] = arr[i].tolist()

        # Invert the mask so that True marks feature pixels
        data_dict["data"]["mask"] = (~out_msk).tolist()
    print("Complete")
    # Cleaning up after ourselves
    os.remove("temp.tif")
    # Returning our results, which will be saved as a Blob
    return data_dict
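
The function converts every array with `.tolist()` before returning. Since the results are later read back with `json.loads`, this is presumably so that the return value can be serialized as JSON, which NumPy arrays cannot be directly. A small round trip under that assumption, using made-up values and a placeholder tile key:

```python
import json

import numpy as np

band = np.array([[1, 2], [3, 4]], dtype=np.uint8)
mask = np.array([[True, False], [False, True]])

# Same layout as data_dict above, with a placeholder key
payload = {
    "key": "example-tile-key",
    "data": {"red": band.tolist(), "mask": mask.tolist()},
}

# json.dumps would raise TypeError if the arrays had not been converted
encoded = json.dumps(payload)
decoded = json.loads(encoded)

# Dtypes are not preserved by JSON, so they are restored explicitly
restored_band = np.array(decoded["data"]["red"], dtype=np.uint8)
restored_mask = np.array(decoded["data"]["mask"], dtype=bool)
```

Nested lists of numbers are considerably larger than the binary arrays they come from, which is worth keeping in mind when choosing a tile size.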

It is best practice to test out your function locally to ensure things run as expected!

In [None]:
res_dict = pull_training_data(dltiles[0].key)
fig, ax = plt.subplots(figsize=(10, 5), nrows=1, ncols=2)
ax[0].imshow(np.array(res_dict["data"]["red"]).reshape(256, 256))
ax[1].imshow(np.array(res_dict["data"]["mask"]).reshape(256, 256))
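
Beyond plotting, a few structural checks can catch problems before the function is submitted. The sketch below uses a hypothetical helper, `check_result`, and a synthetic result dictionary in place of the real output of `pull_training_data`. The expected shape of 256 x 256 follows from the `tilesize=256` and `pad=0` used when creating the tiles.

```python
import numpy as np


def check_result(res, bands=("nir", "red", "green"), shape=(256, 256)):
    """Raise ValueError if a result dict lacks an entry or has a wrong shape.

    Hypothetical helper for illustration only.
    """
    data = res["data"]
    for name in (*bands, "mask"):
        if name not in data:
            raise ValueError(f"missing entry: {name}")
        arr = np.asarray(data[name])
        if arr.shape != shape:
            raise ValueError(f"{name} has shape {arr.shape}, expected {shape}")
    return True


# Synthetic stand-in for the output of pull_training_data
fake = {
    "key": "example-tile-key",
    "data": {
        name: np.zeros((256, 256)).tolist()
        for name in ("nir", "red", "green", "mask")
    },
}
print(check_result(fake))
```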

## Creating a Compute Function
Now that we've settled on a function we can submit it to our asynchronous Batch Compute service:

In [None]:
async_func = Function(
    pull_training_data,
    name=func_name,
    image=compute_image,
    cpus=1,
    memory=2,
    timeout=300,
    maximum_concurrency=20,
    retry_count=1,
    requirements=[
        "geopandas",
        "rasterio",
    ],
)
async_func.save()
print(f"Saved {async_func.id}")

And submit each tile key to the function to return a list of jobs:

In [None]:
jobs = async_func.map([[dltile.key] for dltile in dltiles])
len(jobs)

We now wait for our function to complete. To track progress visit [app.descarteslabs.com/compute](https://app.descarteslabs.com/compute) or wait programmatically via:

    async_func.wait_for_completion()

## Retrieving Results
Once your function has finished running we can read the results as blobs. 

If you lost your function ID, look it up at [app.descarteslabs.com/compute](https://app.descarteslabs.com/compute), or search for the most recently created function with that name, as below:

In [None]:
func_search = (
    Function.search()
    .filter(p.name.startswith(func_name))
    .sort(-Function.creation_date)
    .limit(1)
).collect()
async_func = func_search[0]
print(async_func.id)
print(async_func.creation_date)

Next we search and retrieve the results of our function. This may take a few minutes:

In [None]:
print(f"Retrieving results for {async_func.id}")
res_list = []
for b in (
    Blob.search()
    .filter(p.name.startswith(async_func.id))
    .filter(p.storage_type == "compute")
):
    res_list.append(json.loads(b.data()))
print(f"Retrieved {len(res_list)} results")

Now we convert each result's nested lists back into NumPy arrays, one band stack and one mask per tile:

In [None]:
rgb_list = [
    np.array(
        (res["data"]["nir"], res["data"]["red"], res["data"]["green"]),
        dtype=np.float64,
    )
    for res in res_list
]
msk_list = [np.array([res["data"]["mask"]]) for res in res_list]
del res_list

And finally stack the arrays, move the channel axis to the end, and save them as input for the TensorFlow model trained in [03b Training a Segmentation Model.ipynb](03b%20Training%20a%20Segmentation%20Model.ipynb):

In [None]:
data_array = np.transpose(np.stack(rgb_list, axis=0), (0, 2, 3, 1))
mask_array = np.transpose(np.stack(msk_list, axis=0), (0, 2, 3, 1))
data_array.shape, mask_array.shape
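
To make the axis bookkeeping explicit, the sketch below repeats the stack and transpose on zero-filled stand-ins for `rgb_list` and `msk_list`. The tile count of 5 is arbitrary. The tile size of 256 matches the tiles created earlier.

```python
import numpy as np

n_tiles, height, width = 5, 256, 256

# Stand-ins for rgb_list and msk_list: one (channels, rows, cols) array per tile
band_stacks = [np.zeros((3, height, width)) for _ in range(n_tiles)]
masks = [np.zeros((1, height, width), dtype=bool) for _ in range(n_tiles)]

# Stacking adds a leading tile axis: (tiles, channels, rows, cols)
stacked = np.stack(band_stacks, axis=0)

# Moving channels to the end gives (tiles, rows, cols, channels)
channels_last = np.transpose(stacked, (0, 2, 3, 1))
mask_channels_last = np.transpose(np.stack(masks, axis=0), (0, 2, 3, 1))

print(channels_last.shape, mask_channels_last.shape)
```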

In [None]:
np.save("data_array.npy", data_array)
np.save("mask_array.npy", mask_array)
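
Before moving on to training, it can be worth reloading the saved files and checking that images and masks still line up. The sketch below does this with small synthetic arrays written to a temporary directory, so the file paths and array sizes are illustrative only.

```python
import os
import tempfile

import numpy as np

# Small synthetic arrays in the same (tiles, rows, cols, channels) layout
data = np.zeros((4, 8, 8, 3), dtype=np.float64)
masks = np.zeros((4, 8, 8, 1), dtype=bool)
masks[:, 2:4, 2:4, 0] = True  # a 2x2 feature in every tile

with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data_array.npy")
    mask_path = os.path.join(tmp, "mask_array.npy")
    np.save(data_path, data)
    np.save(mask_path, masks)
    loaded_data = np.load(data_path)
    loaded_masks = np.load(mask_path)

# Images and masks should agree on everything except the channel axis
same_layout = loaded_data.shape[:3] == loaded_masks.shape[:3]

# Share of pixels labeled as feature; a very small value signals class imbalance
positive_fraction = float(loaded_masks.mean())

print(same_layout, positive_fraction)
```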