## Supervised ML on Descartes Labs Platform: Training a Random Forest Classifier
__________________
This example demonstrates a typical pattern for training a supervised classifier using the Descartes Labs Platform APIs.

The general steps covered in this notebook are:
* Retrieve the active [`Function`](https://docs.descarteslabs.com/descarteslabs/compute/readme.html#descarteslabs.compute.Function) created in [02a Generate Training Data.ipynb](02a%20Generate%20Training%20Data.ipynb) and its associated results
* Reformat the returned pixel data and train a simple [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* Test inference on a sample tile to visualize land cover predictions 
* Save the trained model as a [`Blob`](https://docs.descarteslabs.com/descarteslabs/catalog/docs/blob.html#descarteslabs.catalog.Blob) for reference in [02c Deploying a Supervised Classifier.ipynb](02c%20Deploying%20a%20Supervised%20Classifier.ipynb)

_**Note:**_ In order to run this example you must first complete the steps outlined in [02a Generate Training Data.ipynb](02a%20Generate%20Training%20Data.ipynb).

In [None]:
import json
import os
import pickle
import yaml

import descarteslabs as dl
import descarteslabs.compute
import descarteslabs.vector as dl_vector
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shapely.geometry as sgeom
from numpy.typing import NDArray
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

Load the config file, which includes the NAIP product ID, a list of bands, start and end dates, the resolution, and the name of the function to search for:

In [None]:
with open("config.yaml", "r") as file:
    config = yaml.load(file, yaml.FullLoader)
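
Since later cells index into `config` by key, it can help to fail early when a key is missing. The following is a minimal sketch of such a check. The key names are the ones this notebook reads; the helper `missing_config_keys` and all values in `example_config` are hypothetical and only for illustration.

```python
# Keys that the cells in this notebook read from config.yaml.
REQUIRED_KEYS = {
    "product_id",
    "bands",
    "start",
    "end",
    "resolution_m",
    "gen_data_func_name",
    "trained_model_blob_name",
}


def missing_config_keys(config: dict) -> list[str]:
    """Return the required keys that are absent from ``config``, sorted by name."""
    return sorted(REQUIRED_KEYS - set(config))


# Hypothetical values for illustration only; not the real configuration.
example_config = {
    "product_id": "example:naip-product",
    "bands": ["red", "green", "blue"],
    "start": "2020-01-01",
    "end": "2021-01-01",
    "resolution_m": 1.0,
    "gen_data_func_name": "example-generate-training-data",
}

missing = missing_config_keys(example_config)
print(missing)  # ['trained_model_blob_name']
```

In the notebook itself, the same helper could be called on the loaded `config` right after the cell above.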

## Retrieving an Active Compute Function
If you have lost your Function ID, you can retrieve it at [app.descarteslabs.com/compute](https://app.descarteslabs.com/compute) or search for the most recently created function with that name, as below:

In [None]:
func_search = (
    dl.compute.Function.search()
    .filter(dl.catalog.properties.name == config["gen_data_func_name"])
    .sort(-dl.compute.Function.creation_date)
).collect()
async_func = func_search[0]
async_func.id

## Retrieving Function Results

Next we will loop through each [`Job`](https://docs.descarteslabs.com/descarteslabs/compute/readme.html#descarteslabs.compute.Job) from our function to access its results:

In [None]:
results = []
for job in async_func.jobs:
    results.append(job.result())

Since our function from [02a Generate Training Data.ipynb](02a%20Generate%20Training%20Data.ipynb) simply returned a dictionary, we can load each result into a GeoDataFrame and concatenate them into a single dataframe:

In [None]:
df = pd.concat([gpd.GeoDataFrame(res["data"]) for res in results])
df.head()

## Reshaping Results for Scikit-Learn

In the following cell we'll define a simple function which converts a row's lists of band values into a single NumPy array with one row per pixel and one column per band:

In [None]:
def list_to_array(row: pd.Series, bands: list[str]) -> NDArray:
    val_list = [np.array(y) for y in row[bands].values]
    return np.stack(val_list).T
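
To make the resulting layout concrete, here is a small sketch that applies the same function to a single toy row. The band names and pixel values are made up for illustration, and the function is repeated only so the example runs on its own.

```python
import numpy as np
import pandas as pd


def list_to_array(row: pd.Series, bands: list[str]) -> np.ndarray:
    # Same logic as the cell above, repeated so this example is self-contained.
    val_list = [np.array(y) for y in row[bands].values]
    return np.stack(val_list).T


# Hypothetical band names and pixel values, for illustration only.
bands = ["red", "green", "blue"]
row = pd.Series(
    {
        "red": [10, 11, 12, 13],
        "green": [20, 21, 22, 23],
        "blue": [30, 31, 32, 33],
        "category_int": 1,
    }
)

pixels = list_to_array(row, bands)
print(pixels.shape)  # (4, 3): four pixels, three bands
print(pixels[0])     # [10 20 30]: the band values of the first pixel
```

Stacking the per-band lists gives shape `(n_bands, n_pixels)`; the transpose puts pixels on the first axis, which is the layout scikit-learn expects.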

We then group our dataframe by cover type, apply our ndarray conversion function, and concatenate the results into the two arrays that are accepted by [`.fit(X, y)`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.fit):

In [None]:
X_list = []
y_list = []
for group, group_df in df.groupby("category_int"):
    # Apply the function
    X_arrs = group_df.apply(lambda x: list_to_array(x, config["bands"]), axis=1)
    X_arr = np.concatenate([x for x in X_arrs])
    y_arr = np.full(X_arr.shape[0], group)
    X_list.append(X_arr)
    y_list.append(y_arr)

X = np.concatenate(X_list)
y = np.concatenate(y_list)
X.shape, y.shape

Here **X** has shape **(n_samples, n_features)** and **y** has shape **(n_samples,)**, with one integer class label per sample.

Now we can perform a simple [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html):

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
X_train.shape, y_train.shape
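
The split above is random and unstratified, so results vary from run to run and rare classes may be under-represented in the test set. As a hedged alternative, the sketch below uses the `stratify` and `random_state` parameters of `train_test_split` on hypothetical toy data; whether this is appropriate depends on how balanced the real training samples are.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical, imbalanced toy data: 90 samples of class 0 and 10 of class 1.
X = np.arange(400).reshape(100, 4)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the class proportions similar in both splits;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```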

Declare our classifier:

In [None]:
clf = RandomForestClassifier(n_jobs=-1, verbose=1)

Fit it on our training samples:

In [None]:
clf.fit(X_train, y_train)

And evaluate our performance:

In [None]:
yhat = clf.predict(X_test)
acc = accuracy_score(y_test, yhat)
acc
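
Overall accuracy can hide weak performance on individual cover types. One way to look closer, sketched here with hypothetical labels and predictions rather than the notebook's real outputs, is to build a confusion matrix and derive per-class recall from it.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for three land cover classes.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 2, 0, 2])

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Per-class recall: correct predictions divided by the true count of each class.
recall_per_class = cm.diagonal() / cm.sum(axis=1)

print(acc)
print(cm)
print(recall_per_class)
```

In the notebook, `y_test` and `yhat` would take the place of `y_true` and `y_pred`.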

## Testing Predictions
Now that we've trained the model, we can see how it performs over test imagery. Here we define a single sample tile and search NAIP imagery over it:

In [None]:
dltile = dl.geo.DLTile.from_latlon(
    30.2629, -97.7507, resolution=config["resolution_m"], tilesize=1024, pad=0
)
naip_ic = (
    dl.catalog.Product.get(config["product_id"])
    .images()
    .intersects(dltile)
    .filter(config["start"] <= dl.catalog.properties.acquired < config["end"])
    .sort("acquired")
    .limit(None)
).collect()

Retrieve imagery as an ndarray:

In [None]:
ndarr = naip_ic.mosaic(
    bands=config["bands"],
    bands_axis=-1,
)

Reshape to **(n_samples, n_features)**:

In [None]:
# With bands_axis=-1 the last axis holds the bands (features), not samples
nx, ny, n_bands = ndarr.shape
in_ras_arr = ndarr.reshape(-1, n_bands)
in_ras_arr.shape
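
The reshape flattens the two spatial axes into one, so each row is one pixel, and the per-pixel predictions can later be reshaped back into an image. The sketch below illustrates that round trip on a tiny made-up array; `fake_preds` is a stand-in for the output of `clf.predict`, not a real classifier.

```python
import numpy as np

# Hypothetical tiny "image": 2 rows, 3 columns, 4 bands.
n_rows, n_cols, n_bands = 2, 3, 4
image = np.arange(n_rows * n_cols * n_bands).reshape(n_rows, n_cols, n_bands)

# Flatten the spatial axes so that each row holds one pixel's band values.
flat = image.reshape(-1, n_bands)

# Stand-in for clf.predict: one value per pixel (index of the largest band).
fake_preds = flat.argmax(axis=1)

# Restore the (rows, columns) layout for plotting.
pred_map = fake_preds.reshape(n_rows, n_cols)

print(flat.shape)      # (6, 4)
print(pred_map.shape)  # (2, 3)
```

Because NumPy reshapes in row-major order by default, pixel `(r, c)` of the image ends up at row `r * n_cols + c` of the flattened array, and reshaping the predictions with the same two leading dimensions puts each prediction back at its pixel.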

And predict:

In [None]:
preds = clf.predict(in_ras_arr)

fig, ax = plt.subplots(figsize=(20, 10), nrows=1, ncols=2)
ax[0].imshow(ndarr)
ax[0].set_title("FCC")
ax[1].imshow(preds.reshape(nx, ny), cmap="terrain")
ax[1].set_title("RFC Preds")

We may want to outline building shadows next time! 

## Saving for Later

Once we are happy with the performance of a model, we can save it as a .pickle file and store it as a Blob:

In [None]:
with open("training_rfc.pickle", "wb") as rfc_pkl_file:
    pickle.dump(clf, rfc_pkl_file)
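
Before uploading, it can be worth confirming that the pickled model loads and predicts the same as the original. The following is a minimal sketch of that check using a small model trained on hypothetical random data; in the notebook the fitted `clf` and `X_test` would be used instead.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data and model, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.random((50, 4))
y_toy = (X_toy[:, 0] > 0.5).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Write the model to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pickle")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best loaded with the same scikit-learn version that created it.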

#### _Note on Saving Blobs:_
We do not always need to delete and overwrite our objects on every iteration as in the following cell. This notebook is designed for demonstration purposes where we do not care about preserving each prior model.

In practice, as long as your Blob has a **unique** ID, you can ignore the following cell and simply run:

    blob = dl.catalog.Blob(name="unique-model-name")
    blob.upload("training_rfc.pickle")
    blob.save()

In [None]:
blob_name = config["trained_model_blob_name"]

try:
    # Create a new Blob object
    blob = dl.catalog.Blob(
        name=blob_name,
        tags=["examples"],
    )
    # Upload our pickled model to this Blob:
    blob.upload("training_rfc.pickle")
    blob.save()

except dl.exceptions.ConflictError:
    # A Blob with this name already exists: delete it and upload a fresh copy
    print("Blob already exists")
    blob = dl.catalog.Blob.get(name=blob_name)
    blob.delete()
    blob = dl.catalog.Blob(
        name=blob_name,
        tags=["examples"],
    )
    # Upload our pickled model to this Blob:
    blob.upload("training_rfc.pickle")
    blob.save()

Finally, clean up the local pickle file:

In [None]:
os.remove("training_rfc.pickle")

Next, move on to [02c Deploying a Supervised Classifier.ipynb](02c%20Deploying%20a%20Supervised%20Classifier.ipynb) to scale the inference of the model we just trained!