## Supervised ML on EarthOne Platform: Training a Random Forest Classifier
__________________
This example demonstrates a typical pattern for generating training data for a supervised classifier using the EarthOne Platform APIs.

The general steps covered in this notebook are:
* Read in a training dataset from [`Vector`](https://docs.earthone.earthdaily.com/earthdaily/earthone/vector/readme.html) containing simple land cover categories over the Austin, TX area 
* Visualize our study area and input layers in [`Dynamic Compute`](https://docs.earthone.earthdaily.com/api/dynamic-compute.html)
* Define an asynchronous [`Function`](https://docs.earthone.earthdaily.com/earthdaily/earthone/compute/readme.html#earthdaily.earthone.compute.Function) which takes a training feature ID as an input and:
    * Searches [`Catalog`](https://docs.earthone.earthdaily.com/earthdaily/earthone/catalog/readme.html) for National Agriculture Imagery Program (NAIP) imagery and rasterizes its **nir**, **red**, and **green** bands
    * Extracts intersecting features as raster masks
    * Returns associated pixel values as lists
    
Move on to [02b Training a Supervised Classifier.ipynb](02b%20Training%20a%20Supervised%20Classifier.ipynb) to retrieve the results of the completed function and train a simple Random Forest Classifier. 

In [None]:
import json
import os
import pickle
import yaml
import sys

import earthdaily.earthone as eo
import earthdaily.earthone.compute
import earthdaily.earthone.vector as eo_vector
import earthdaily.earthone.dynamic_compute as dc

import geopandas as gpd
import ipyleaflet
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shapely.geometry as sgeom
from shapely import remove_repeated_points

Load global variables for reference throughout this example, including the NAIP product ID, a list of bands, a start and end date, resolution, and a name for our function:

In [None]:
with open("config.yaml", "r") as file:
    config = yaml.load(file, yaml.FullLoader)
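
The keys read from `config.yaml` in this notebook are `training_table_name`, `product_id`, `bands`, `start`, `end`, `resolution_m`, and `gen_data_func_name`. As a minimal sketch, a check like the following could catch a missing key early. The helper name `missing_config_keys` and every value in `example_config` are hypothetical placeholders (only the product ID and band names appear in this notebook), not the contents of the real file.

```python
# Keys this notebook reads from config.yaml.
REQUIRED_KEYS = {
    "training_table_name",
    "product_id",
    "bands",
    "start",
    "end",
    "resolution_m",
    "gen_data_func_name",
}


def missing_config_keys(config: dict) -> list:
    """Return the required keys that are absent from config, sorted by name."""
    return sorted(REQUIRED_KEYS - set(config))


# Hypothetical placeholder values, for illustration only.
example_config = {
    "training_table_name": "my-training-table",
    "product_id": "usda:naip:v1",
    "bands": ["nir", "red", "green"],
    "start": "2020-01-01",
    "end": "2021-01-01",
    "resolution_m": 1.0,
    "gen_data_func_name": "my-gen-data-function",
}

print(missing_config_keys(example_config))  # an empty list means nothing is missing
```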

In [None]:
major = sys.version_info.major
minor = sys.version_info.minor
compute_image = f"python{major}.{minor}:latest"
compute_image

Next we retrieve a table of sample training features:

In [None]:
table = eo_vector.Table.get(config["training_table_name"])
table

## Study Area - Austin, TX
In the next few cells we will set up an interactive map to overlay our training features on the input NAIP imagery.

Set up an interactive map with center coordinates and a zoom level:

In [None]:
m = dc.map
m.center = 30.2552, -97.7689
m.zoom = 12

Create a mosaic of our NAIP imagery and visualize as a false color composite (FCC):

In [None]:
naip_mosaic = dc.Mosaic.from_product_bands(
    config["product_id"],
    config["bands"],
    start_datetime=config["start"],
    end_datetime=config["end"],
)

naip_mosaic.visualize("FCC", m)

Next, visualize our input training table:

In [None]:
table.visualize(
    "Training Polygons",
    m,
)

### Generating Training Data - Tiling
As outlined above, the general steps to extract training data are as follows:
* Split the training AOI into tiles
* For each tile, search NAIP imagery and extract all intersecting feature masks

First, let's pull the data from the table and get a feel for it.

In [None]:
gdf = table.collect()
gdf.head()

We have rows of data with a geometry column, a plain-text category, a category integer (for example, water maps to the value 3), and a uuid that uniquely identifies each row. Let's look at what the categories are:

In [None]:
print(
    f"There are {len(gdf)} features with the following categories: {gdf.category.unique()}"
)
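
Because each row carries both a plain-text `category` and a `category_int`, it can be useful to confirm that the two agree before training. The sketch below builds that lookup from plain dictionaries standing in for rows of `gdf`. The helper name `category_lookup` and the sample rows are hypothetical; only water mapping to 3 comes from the description above.

```python
def category_lookup(rows):
    """Map each category name to its integer code, rejecting conflicts."""
    lookup = {}
    for row in rows:
        name, code = row["category"], row["category_int"]
        # setdefault stores the first code seen and returns the stored one.
        if lookup.setdefault(name, code) != code:
            raise ValueError(f"{name!r} maps to both {lookup[name]} and {code}")
    return lookup


# Hypothetical rows; a real table would supply these columns via gdf.
sample_rows = [
    {"category": "water", "category_int": 3},
    {"category": "trees", "category_int": 1},
    {"category": "water", "category_int": 3},
]

print(category_lookup(sample_rows))
```

With the real GeoDataFrame, equivalent rows could be produced with `gdf[["category", "category_int"]].to_dict("records")`.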

In [None]:
m

## Generating Training Data

First we will search NAIP over the geometry of a sample training feature:

In [None]:
geom = gdf.iloc[0]['geometry']
aoi = eo.geo.AOI(geom, resolution=config['resolution_m'], crs='EPSG:3857')
aoi

In [None]:
naip_ic = (
    eo.catalog.Product.get(config["product_id"])
    .images()
    .intersects(aoi)
    .filter(config["start"] <= eo.catalog.properties.acquired < config["end"])
    .sort("acquired")
    .limit(None)
).collect()
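
The filter above is written as `start <= acquired < end`, a half-open window: the start is included and the end is excluded, so back-to-back windows would not count the same image twice. The standard-library sketch below illustrates that comparison on its own. The helper `in_window` and the dates are made up for illustration and do not call Catalog.

```python
from datetime import datetime


def in_window(acquired: str, start: str, end: str) -> bool:
    """True when start <= acquired < end, comparing ISO 8601 dates."""
    t = datetime.fromisoformat(acquired)
    return datetime.fromisoformat(start) <= t < datetime.fromisoformat(end)


# Made-up acquisition dates, for illustration only.
dates = ["2021-01-01", "2020-06-15", "2019-12-31", "2020-01-01"]

# Keep dates inside the window, then sort them like .sort("acquired").
kept = sorted(d for d in dates if in_window(d, "2020-01-01", "2021-01-01"))
print(kept)
```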


In [None]:
naip_arr = naip_ic.mosaic(config["bands"], bands_axis=-1)

In [None]:
fig, ax = plt.subplots()
ax.imshow(naip_arr)

## Putting it All Together with Batch Compute
Here we'll define a function which wraps the steps above into a self-contained Python function. The inputs are a single training feature ID, the table ID, a start and end date, and a list of bands. The overall steps are as follows:
* Retrieve the training feature from Vector
* Build an AOI from the feature's geometry
* Search NAIP over the AOI and mosaic the imagery into a masked array
* Extract the unmasked pixel values for each band
* Return the band values, along with the category and year, as lists

In [None]:
def get_pixel_values(
    FEATURE_ID: str, 
    TABLE_ID: str, 
    START: str,
    END: str, 
    BANDS: list):

    import json
    import os

    import earthdaily.earthone as eo
    import earthdaily.earthone.vector as eo_vector
    import geopandas as gpd
    import numpy as np
    import pandas as pd


    print(f"Processing {FEATURE_ID}")

    PRODUCT_ID = "usda:naip:v1"

    # Pull the single training feature (geometry and attributes) from Vector
    feature = eo_vector.Feature.get(f"{TABLE_ID}:{FEATURE_ID}").values

    aoi = eo.geo.AOI(feature['geometry'], resolution=1.0, crs="EPSG:3857")

    print("Searching Images...")
    naip_ic = (
        eo.catalog.Product.get(PRODUCT_ID)
        .images()
        .intersects(aoi)
        .filter(START <= eo.catalog.properties.acquired < END)
        .sort("acquired")
        .limit(None)
    ).collect()

    print(f"Found {len(naip_ic)} images...")
    naip_ndarr = naip_ic.mosaic(
        bands=BANDS,
    )
    print("Downloaded Imagery...")

    # Return the pixel values as a dictionary keyed by band (the geometry is not included).
    # In practice, you could write these values to your own Table here instead:
    out_data = {"uuid": FEATURE_ID, "year": START[:4],  "data": {}}
    for i, band in enumerate(BANDS):
        out_data['data'][band]=naip_ndarr[i].compressed().tolist()

    out_data['data']['category_int']=np.full(naip_ndarr[0].compressed().shape, feature['category_int']).tolist()
    out_data['data']['year']=np.full(naip_ndarr[0].compressed().shape, int(START[:4])).tolist()

    print("Complete")

    return out_data
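
The per-band extraction relies on NumPy masked arrays: `.compressed()` returns only the unmasked values as a flat array. The sketch below reproduces that step on a small synthetic array. It assumes, as the indexing in the function suggests, that the mosaic has shape `(bands, rows, cols)`, and it further assumes that pixels outside the feature geometry are the masked ones. All values here are synthetic.

```python
import numpy as np

bands = ["nir", "red", "green"]

# Synthetic stand-in for the mosaic: 3 bands of 2 x 3 pixels.
values = np.arange(3 * 2 * 3).reshape(3, 2, 3)

# True marks pixels assumed to fall outside the feature geometry.
outside = np.array([[False, True, False],
                    [True, False, False]])
mask = np.stack([outside] * len(bands))
arr = np.ma.masked_array(values, mask=mask)

# One flat list of unmasked pixel values per band.
data = {band: arr[i].compressed().tolist() for i, band in enumerate(bands)}

# Repeat the label once per unmasked pixel so every column has equal length.
n_pixels = arr[0].compressed().shape[0]
data["category_int"] = np.full(n_pixels, 3).tolist()

print(data)
```

Every list in `data` has the same length, which is what lets the result be loaded directly into a `pandas.DataFrame`.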

Now we format our input arguments:

In [None]:
args = [
    [uuid, config["training_table_name"], config["start"], config["end"], config["bands"]]
    for uuid in gdf.uuid
]
len(args)

Now that it's all packaged up into a function, we can test it locally:

In [None]:
res = get_pixel_values(*args[0])
pd.DataFrame(res['data']).head()
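
Each call returns one dictionary per feature, so training data for a classifier is obtained by concatenating many of them. The following is a minimal sketch of that stacking step in plain Python. The helper `stack_results` and the sample results are hypothetical, and the follow-on notebook may well do this differently.

```python
def stack_results(results, bands):
    """Concatenate per-feature results into feature rows X and labels y."""
    X, y = [], []
    for res in results:
        data = res["data"]
        # zip(*columns) turns per-band columns into one row per pixel.
        X.extend(list(row) for row in zip(*(data[b] for b in bands)))
        y.extend(data["category_int"])
    return X, y


# Hypothetical results shaped like the dictionary returned above.
results = [
    {"uuid": "a", "year": "2020",
     "data": {"nir": [10, 11], "red": [20, 21], "green": [30, 31],
              "category_int": [3, 3], "year": [2020, 2020]}},
    {"uuid": "b", "year": "2020",
     "data": {"nir": [12], "red": [22], "green": [32],
              "category_int": [1], "year": [2020]}},
]

X, y = stack_results(results, ["nir", "red", "green"])
print(X)
print(y)
```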

Once we are happy with the performance of our function we can save it to our Compute service.

Note here that we must pass geopandas as a requirement:

In [None]:
async_func = eo.compute.Function(
    get_pixel_values,
    name=config["gen_data_func_name"],
    image=compute_image,
    cpus=1,
    memory=2,
    timeout=300,
    maximum_concurrency=20,
    retry_count=1,
    requirements=["geopandas"],  # extra package needed by the function, per the note above
)

async_func.save()
print(f"Saved {async_func.id}")

**_Take note of your Function ID!_**

And finally map args to our Function to return a set of jobs:

In [None]:
jobs = async_func.map(args)
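
Conceptually, each inner list in `args` supplies the positional arguments for one job. The sketch below is a local analogue only, built on the standard library's `ThreadPoolExecutor`. It is not how the Compute service is implemented, and `local_map` and `fake_get_pixel_values` are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def local_map(func, args, max_workers=4):
    """Call func(*a) for every entry in args, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(func, *a) for a in args]
        return [f.result() for f in futures]


def fake_get_pixel_values(feature_id, table_id, start, end, bands):
    """Hypothetical stand-in that mimics only the shape of the real return value."""
    return {"uuid": feature_id, "year": start[:4], "data": {b: [] for b in bands}}


# Hypothetical arguments, one inner list per feature.
demo_args = [
    ["a", "my-training-table", "2020-01-01", "2021-01-01", ["nir", "red", "green"]],
    ["b", "my-training-table", "2020-01-01", "2021-01-01", ["nir", "red", "green"]],
]

demo_results = local_map(fake_get_pixel_values, demo_args)
print([r["uuid"] for r in demo_results])
```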

Navigate to [earthone.earthdaily.com/compute](https://earthone.earthdaily.com/compute) to track your progress.

Or wait programmatically via:

In [None]:
# async_func.wait_for_completion()

Once this function completes, you can move on to [02b Training a Supervised Classifier.ipynb](02b%20Training%20a%20Supervised%20Classifier.ipynb) to retrieve the results and train our model! 