## Supervised ML on Descartes Labs Platform: Training a Random Forest Classifier
__________________
This example demonstrates a typical pattern for generating training data for a supervised classifier using Descartes Labs Platform APIs.

The general steps covered in this notebook are:
* Read in a training dataset from [`Vector`](https://docs.descarteslabs.com/api/vector.html) containing simple land cover categories over the Austin, TX area 
* Visualize our study area and input layers in [`Dynamic Compute`](https://docs.descarteslabs.com/api/dynamic-compute.html)
* Split the area into [`DLTile`s](https://docs.descarteslabs.com/descarteslabs/geo/readme.html#descarteslabs.geo.DLTile)
* Explore feature masking methodologies
* Define an asynchronous [`Function`](https://docs.descarteslabs.com/descarteslabs/compute/readme.html#descarteslabs.compute.Function) which takes a tile key as an input and:
    * Searches [`Catalog`](https://docs.descarteslabs.com/descarteslabs/catalog/readme.html) for raster data in the **nir**, **red**, and **green** bands of [National Agricultural Imagery Program (NAIP)](https://app.descarteslabs.com/explorer/datasets/usda:naip:v1) imagery
    * Extracts intersecting features as raster masks
    * Returns associated pixel values as lists
    
Move on to [02b Training a Supervised Classifier.ipynb](02b%20Training%20a%20Supervised%20Classifier.ipynb) to retrieve the results of the completed function and train a simple Random Forest Classifier. 

In [None]:
import descarteslabs as dl
import descarteslabs.dynamic_compute as dc
from descarteslabs.catalog import Blob, Image, Product, properties as p
from descarteslabs.compute import Function, Job
from descarteslabs.vector import Table

In [None]:
import geopandas as gpd
import numpy as np
import pandas as pd

from datetime import datetime
from ipyleaflet import GeoData
from rasterio.mask import raster_geometry_mask
from shapely.geometry import box

import json, os, pickle, rasterio
import matplotlib.pyplot as plt

%matplotlib inline

We define global variables for reference throughout this example: the NAIP product ID, a list of bands, a start and end date, the resolution, and a name for our function:

In [None]:
pid = "usda:naip:v1"
bands = ["nir", "red", "green"]
start = "2020-01-01"
end = "2021-01-01"
resolution = 1.0  # meters
func_name = f"Get RFC Pixel Values {datetime.today().strftime('%Y-%m-%d')}"
func_name

Next we retrieve a table of sample training features:

In [None]:
table_id = "descarteslabs:austin-landcover-training-data"
table = Table.get(table_id)
table

## Study Area - Austin, TX
In the next few cells we will set up an interactive map frame to overlay our training feature collection on the input NAIP imagery. 

First, set up the interactive map with center coordinates and a zoom level:

In [None]:
m = dc.map
m.center = 30.25, -97.74
m.zoom = 12

Create a mosaic of our NAIP imagery and visualize as a false color composite (FCC):

In [None]:
naip_mosaic = dc.Mosaic.from_product_bands(
    pid, bands, start_datetime=start, end_datetime=end
)
naip_mosaic.visualize("FCC", m)

Next visualize our input training table:

In [None]:
table.visualize(
    "Training Polygons",
    m,
)

## Generating Training Data - Tiling
As outlined above, the general steps to extract training data are as follows:
* Splitting up the training AOI into `DLTile`s
* For each `DLTile` we search **NAIP** imagery and use `rasterio` to extract all intersecting feature masks

First we will split our input feature collection's extent into tiles, over which our function will iterate:

In [None]:
gdf = table.collect()
gdf_geom = box(*gdf["geometry"].total_bounds)
dltiles = dl.geo.DLTile.from_shape(
    gdf_geom, resolution=resolution, tilesize=2048, pad=0
)
len(dltiles)

Since our feature collection is sparse, and NAIP is high resolution, we want to omit any tiles that don't intersect any of our input features:

In [None]:
dltiles = [dltile for dltile in dltiles if gdf.intersects(dltile.geometry).any()]
len(dltiles)
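
To make the tile-and-filter idea concrete, here is a minimal sketch in plain Python. It is not how `DLTile.from_shape` works internally (DLTiles follow their own grid definition); `grid_tiles` and `boxes_intersect` are hypothetical helpers, and the extent and feature boxes are made-up values in projected meters. It only shows that a tile edge is `resolution * tilesize` meters long and that tiles touching no feature can be dropped before any imagery is read.

```python
import math


def grid_tiles(bounds, resolution, tilesize):
    """Split (minx, miny, maxx, maxy) in projected meters into square tiles.

    Hypothetical helper: a simplified stand-in for DLTile.from_shape,
    which uses its own grid definition.
    """
    minx, miny, maxx, maxy = bounds
    side = resolution * tilesize  # tile edge length in meters
    # Snap the origin to the grid so tiles line up regardless of the extent
    x0 = math.floor(minx / side) * side
    y0 = math.floor(miny / side) * side
    nx = math.ceil((maxx - x0) / side)
    ny = math.ceil((maxy - y0) / side)
    return [
        (x0 + i * side, y0 + j * side, x0 + (i + 1) * side, y0 + (j + 1) * side)
        for j in range(ny)
        for i in range(nx)
    ]


def boxes_intersect(a, b):
    """True when two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Made-up extent and feature bounding boxes, in meters
extent = (0.0, 0.0, 5000.0, 3000.0)
features = [(100.0, 100.0, 300.0, 250.0), (4500.0, 2500.0, 4600.0, 2600.0)]

tiles = grid_tiles(extent, resolution=1.0, tilesize=2048)
kept = [t for t in tiles if any(boxes_intersect(t, f) for f in features)]
print(len(tiles), len(kept))
```

With sparse features, most of the grid is discarded, which is why the filtering step matters at 1 m resolution.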

Lastly, we can add our tile geometries to the map to visualize our project:

In [None]:
dltile_gdf = gpd.GeoDataFrame(
    {
        "geometry": [dltile.geometry for dltile in dltiles],
    },
    crs=4326,
)
geo_data = GeoData(
    geo_dataframe=dltile_gdf,
    style={"color": "black", "fillOpacity": 0.0},
    name="DLTiles",
)
m.add_layer(geo_data)

In [None]:
m

## Generating Training Data - Masking

Next we will explore feature masking methodologies. In this example we retrieve Catalog imagery over a sample tile as a GeoTIFF and use [rasterio](https://rasterio.readthedocs.io/en/stable/) to mask each feature, row by row, with pandas [`DataFrame.apply()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html).

First we will search NAIP over a sample tile:

In [None]:
dltile = dltiles[0]
dltile

In [None]:
naip_ic = (
    Product.get(pid)
    .images()
    .intersects(dltile)
    .filter(start <= p.acquired < end)
    .sort("acquired")
    .limit(None)
).collect()
naip_ic
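
The filter expression `start <= p.acquired < end` reads as a half-open interval: the start date is included and the end date is excluded. The sketch below mirrors that logic in plain Python, assuming Catalog evaluates the comparison the same way. `in_range` is a hypothetical helper and the acquisition times are made up, not real NAIP metadata.

```python
from datetime import datetime


def in_range(acquired, start, end):
    """Half-open test: start is included, end is excluded."""
    return start <= acquired < end


start = datetime(2020, 1, 1)
end = datetime(2021, 1, 1)

# Made-up acquisition times, not real NAIP metadata
acquired = [
    datetime(2019, 12, 31, 23, 59),
    datetime(2020, 1, 1),
    datetime(2020, 9, 14, 17, 30),
    datetime(2021, 1, 1),
]

# Keep only the in-range times, sorted like .sort("acquired")
selected = sorted(a for a in acquired if in_range(a, start, end))
print([a.isoformat() for a in selected])
```

Using an exclusive end date means consecutive yearly windows never select the same image twice.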

Next we define a `generate_polygon_masks` function. We will use [`rasterio.mask`](https://rasterio.readthedocs.io/en/latest/api/rasterio.mask.html) to efficiently mask the GeoTIFF downloaded through Catalog with each feature in our training dataset and extract the associated band values into our geodataframe.

This function takes 3 arguments:
* An input row in a geodataframe
* An opened raster dataset
* A list of bands for column names

It then passes the row's geometry as a feature mask, including **all_touched** to find all pixels that touch our geometry and **crop** to efficiently open a small _window_ of the whole raster. Iterating over each band name, it finally returns a list of corresponding pixel values for each feature (e.g. **nir**, **red**, and **green**).
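
The masking logic can be tried without any imagery. The sketch below uses a synthetic NumPy array in place of the GeoTIFF and builds the mask by hand, following the `raster_geometry_mask` default convention that `True` marks pixels outside the feature. The array values and the 2x2 "feature" are made up, and the bounding-window step only emulates what `crop=True` returns.

```python
import numpy as np

# Synthetic 3-band raster, shape (bands, rows, cols); values are made up
raster = np.arange(3 * 4 * 5).reshape(3, 4, 5)

# Mask in the raster_geometry_mask convention: True means OUTSIDE the feature
full_mask = np.ones((4, 5), dtype=bool)
full_mask[1:3, 2:4] = False  # a 2x2 "feature"

# Emulate crop=True: shrink to the bounding window of the feature pixels
rows, cols = np.where(~full_mask)
r0, r1 = rows.min(), rows.max() + 1
c0, c1 = cols.min(), cols.max() + 1
crop_mask = full_mask[r0:r1, c0:c1]
window = raster[:, r0:r1, c0:c1]  # stands in for in_ds.read(window=out_wind)

# Same extraction as generate_polygon_masks: invert the mask, index each band
out_arr = np.stack([band[~crop_mask] for band in window])
values = {
    name: out_arr[i].tolist() for i, name in enumerate(["nir", "red", "green"])
}
print(values)
```

Because the mask is inverted with `~` before indexing, only pixels inside the feature survive, and each band yields a flat list of the same length.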

In [None]:
def generate_polygon_masks(x, in_ds, bands):
    """
    Takes input row of a geodataframe with a geometry in the same projection as
    an input dataset. Performs a raster_geometry_mask and populates the row
    with n corresponding unmasked values, one for each band. Also saves the feature's
    mask and window for more efficient reading of the geotiff.
    """
    # Perform the mask--this returns a feature mask, transform (unused),
    # and the window over which to read the input dataset

    out_msk, out_trans, out_wind = raster_geometry_mask(
        in_ds, [x["geometry"]], all_touched=True, crop=True
    )
    x["out_msk"] = out_msk
    x["out_wind"] = out_wind

    # Opens the input dataset at the specified window
    in_window = in_ds.read(window=out_wind)
    # For each band we mask to the feature and return the stack
    # Arr is shape (bands, y, x)
    out_arr = np.stack([a[~out_msk] for a in in_window])
    # Figuring out how to best store this--we just return a list of values
    # for each band in the input dataset
    for i, band in enumerate(bands):
        vals = out_arr[i].tolist()
        x[f"{band}"] = vals

    return x

Here we will test things out and plot the steps below:
1. Download our NAIP imagery as a geotiff from Catalog
2. Open our geotiff in rasterio
3. Reproject our geodataframe to local CRS
4. Apply our `generate_polygon_masks` function to annotate each feature with contained band values

We will also plot the cropped and uncropped masks to illustrate the **crop** argument explained above.

In [None]:
print("Downloading mosaic...")
naip_ic.download_mosaic(bands, dest="naip_temp.tif")
# Performing the feature sampling by applying the function defined above:
with rasterio.open("naip_temp.tif", "r+") as in_ds:
    print("Performing feature sampling...")
    # Generate Polygon Masks function:
    sampled_gdf = gdf.clip(dltile.geometry).to_crs(dltile.crs)
    sampled_gdf = sampled_gdf.apply(
        lambda x: generate_polygon_masks(x, in_ds, bands), axis=1
    )
    # Visualizing methodology:
    # 1. Cropped feature mask:
    out_crp_msk, out_crp_trans, out_crp_window = raster_geometry_mask(
        in_ds, sampled_gdf.geometry.tolist(), all_touched=True, crop=True
    )
    # 2. Uncropped feature mask:
    out_msk, out_trans, out_window = raster_geometry_mask(
        in_ds, sampled_gdf.geometry.tolist(), all_touched=True, crop=False
    )

    fig, ax = plt.subplots(figsize=(9, 3), nrows=1, ncols=3)
    arr = in_ds.read()
    ax[0].imshow(arr.transpose((1, 2, 0))[:, :, :3])
    ax[0].set_title(r"FCC")
    ax[1].imshow(out_msk)
    ax[1].set_title(r"Uncropped Mask")
    ax[2].imshow(out_crp_msk)
    ax[2].set_title(r"Cropped Mask")
plt.tight_layout()
os.remove("naip_temp.tif")

And here we see the new columns on our dataframe:

In [None]:
sampled_gdf

## Wrapping it All Together with Batch Compute
Here we'll wrap all of the previously outlined methodology into a single self-contained Python function. The input is a single tile key, and the overall steps are as follows:
* Re-create a tile from the passed key
* Retrieve the training features clipped to the input tile
* Search NAIP over the input tile and retrieve the imagery as a geotiff
* Perform the feature sampling method outlined above against the clipped features
* Return the associated intersecting band values as lists
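
To show what "band values as lists" looks like downstream, here is a small sketch that flattens one result into per-pixel rows. The payload follows the `{column: {row_index: value}}` layout that `DataFrame.to_dict()` produces, but every value is made up, the tile key is a placeholder, and `category` is an assumed name for the label column, which may differ in the actual table.

```python
# Hypothetical payload in the shape produced by DataFrame.to_dict():
# {column: {row_index: value}}. "category" is an assumed label column name.
result = {
    "dltile": "example-tile-key",  # placeholder, not a real DLTile key
    "data": {
        "category": {0: "water", 1: "trees"},
        "nir": {0: [10, 12], 1: [200, 210, 205]},
        "red": {0: [20, 22], 1: [90, 95, 92]},
        "green": {0: [30, 33], 1: [110, 115, 112]},
    },
}

bands = ["nir", "red", "green"]
data = result["data"]

X, y = [], []
for idx, label in data["category"].items():
    # zip the per-band lists so each pixel becomes one (nir, red, green) row
    for pixel in zip(*(data[band][idx] for band in bands)):
        X.append(list(pixel))
        y.append(label)

print(len(X), X[0], y[0])
```

Each feature contributes one row per pixel, so large polygons carry more weight in the resulting training set than small ones.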

In [None]:
def get_pixel_values(dltile_key):
    import descarteslabs as dl
    from descarteslabs.catalog import Blob, Image, Product, properties as p
    from descarteslabs.vector import Table

    import numpy as np
    import geopandas as gpd
    from rasterio.mask import raster_geometry_mask

    import rasterio
    import os
    from json import loads

    def generate_polygon_masks(x, in_ds, bands):
        """
        Takes input row of a dataframe with a geometry in the same projection as
        an input dataset. Performs a raster_geometry_mask and populates the row
        with n corresponding unmasked values, one for each band.
        """
        # Perform the mask--this returns a feature mask, transform (unused),
        # and the window over which to read the input dataset

        out_msk, out_trans, out_wind = raster_geometry_mask(
            in_ds, [x["geometry"]], all_touched=True, crop=True
        )

        x["out_msk"] = out_msk
        x["out_wind"] = out_wind

        # Opens the input dataset at the specified window
        in_window = in_ds.read(window=out_wind)
        # For each band we mask to the feature and return the stack
        # Arr is shape (bands, y, x)
        out_arr = np.stack([a[~out_msk] for a in in_window])
        # Figuring out how to best store this--we just return a list of values
        # for each band in the input dataset
        for i, band in enumerate(bands):
            vals = out_arr[i].tolist()
            x[f"{band}"] = vals
        return x

    dltile = dl.geo.DLTile.from_key(dltile_key)
    print(f"Processing {dltile_key}")

    table_id = "descarteslabs:austin-landcover-training-data"
    pid = "usda:naip:v1"
    start = "2020-01-01"
    end = "2021-01-01"
    bands = ["nir", "red", "green"]
    # Pulling GDF from Vector

    table = Table.get(table_id, aoi=dltile)
    gdf = table.collect().to_crs(dltile.crs)

    print("Downloaded GDF...")

    # If no training features intersect this tile, end the Job early
    # with an empty result.
    if len(gdf) == 0:
        print(f"No intersections {dltile_key}")
        return {}
    print("Searching Images...")

    naip_ic = (
        Product.get(pid)
        .images()
        .intersects(dltile)
        .filter(start <= p.acquired < end)
        .sort("acquired")
        .limit(None)
    ).collect()
    print(naip_ic)

    naip_ic.download_mosaic(
        bands=bands,
        geocontext=dltile,
        dest="naip_temp.tif",
        format="tif",
    )
    print("Downloaded GeoTIFF...")

    # Opening the geotiff via Rasterio
    with rasterio.open("naip_temp.tif", "r+") as in_ds:
        print("Performing feature sampling...")
        # Generate Polygon Masks function:
        sampled_gdf = gdf.apply(
            lambda x: generate_polygon_masks(x, in_ds, bands), axis=1
        )

    # Returning GDF as a dictionary, dropping the geometry, mask, and window columns:
    out_data = sampled_gdf.drop(columns=["geometry", "out_msk", "out_wind"]).to_dict()

    print("Cleaning up")
    # Deleting the tiff from disk
    os.remove("naip_temp.tif")
    print("Complete")

    return {"dltile": dltile_key, "data": out_data}

Now we format our input arguments:

In [None]:
args = [[dltile.key] for dltile in dltiles]
len(args)
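
Each inner list holds the positional arguments for one call, which is why every tile key is wrapped in its own list. The sketch below emulates that locally, assuming `Function.map` submits one Job per inner list and unpacks it into the function call. `get_key_length` is a hypothetical stand-in for `get_pixel_values`, and the keys are placeholders, not real DLTile keys.

```python
def get_key_length(dltile_key):
    """Hypothetical stand-in for get_pixel_values; just measures the key."""
    return {"dltile": dltile_key, "length": len(dltile_key)}


# Placeholder keys, not real DLTile keys
keys = ["tile-a", "tile-bb", "tile-ccc"]
args = [[key] for key in keys]

# Local emulation: each inner list is unpacked into one call
results = [get_key_length(*arg) for arg in args]
print(results)
```

A function taking two parameters would use inner lists of length two, with one inner list per Job.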

Now that it's all packaged up into a function, we can test it locally:

In [None]:
pd.DataFrame(get_pixel_values(dltiles[0].key)["data"])

Once we are happy with the performance of our function we can save it to our Compute service.

Note that we must pass `descarteslabs-vector` and `geopandas` as requirements:

In [None]:
async_func = Function(
    get_pixel_values,
    name=func_name,
    image="python3.9:latest",
    cpus=1,
    memory=2,
    timeout=900,
    maximum_concurrency=20,
    retry_count=1,
    requirements=["descarteslabs-vector", "geopandas"],
)

async_func.save()
print(f"Saved {async_func.id}")

**_Take note of your Function ID!_**

Finally, map `args` to our Function to submit a set of jobs:

In [None]:
jobs = async_func.map(args)
len(jobs)

Navigate to [app.descarteslabs.com/compute](https://app.descarteslabs.com/compute) to track your progress.

Or wait programmatically via:

In [None]:
# async_func.wait_for_completion()

Once this function completes, you can move on to [02b Training a Supervised Classifier.ipynb](02b%20Training%20a%20Supervised%20Classifier.ipynb) to retrieve the results and train our model! 