## Supervised ML on Descartes Labs Platform: Training a Random Forest Classifier
__________________
This example demonstrates a typical pattern for training a supervised classifier using Descartes Labs Platform APIs.

The general steps covered in this notebook are:
* Retrieve a running [`Function`]() and its results
* Reformat results for input into a [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* Save the trained model as a [`Blob`]() for reference in [02c Deploying a Supervised Classifier.ipynb](02c%20Deploying%20a%20Supervised%20Classifier.ipynb)

_Note:_ In order to run this example you must first complete the steps outlined in [02a Generate Training Data.ipynb](02a%20Generate%20Training%20Data.ipynb).

In [None]:
import descarteslabs as dl
from descarteslabs.catalog import Blob, Image, Product, properties as p
from descarteslabs.compute import Function, Job
from descarteslabs.vector import Table

In [None]:
import json, pickle, os
import geopandas as gpd
import numpy as np
import pandas as pd

from shapely.geometry import box
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

We first define the global variables referenced throughout this example: the NAIP product ID, a list of bands, a start and end date, the resolution, and the name of the Function to search for:

In [None]:
pid = "usda:naip:v1"
bands = ["nir", "red", "green"]
start = "2020-01-01"
end = "2021-01-01"
resolution = 1.0  # meters
func_name = "Get RFC Pixel Values"
func_name

As well as the current user's namespace information:

In [None]:
auth = dl.auth.Auth()
org = auth.payload["org"]
user_id = auth.namespace

## Retrieving an Active Compute Function
If you have lost your Function ID, you can look it up at [app.descarteslabs.com/compute](https://app.descarteslabs.com/compute) or search for the most recently created Function with that name, as below:

In [None]:
func_search = (
    Function.search()
    .filter(p.owner == user_id)
    .filter(p.name.startswith(func_name))
    .sort(-Function.creation_date)
    .limit(1)
).collect()
async_func = func_search[0]
async_func.id

## Retrieving Function Results

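Next we will access the results of each [`Job`](https://docs.descarteslabs.com/descarteslabs/compute/readme.html#descarteslabs.compute.Job) from our function. The cell below searches for Blobs with the `compute` storage type whose names start with our Function ID, then loads each one as JSON: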

In [None]:
print(f"Results for {async_func.id}")
res_list = []
for b in (
    Blob.search()
    .filter(p.namespace == f"{org}:{user_id}")
    .filter(p.name.startswith(async_func.id))
    .filter(p.storage_type == "compute")
):
    print(f"ID: {b.id}")
    res_list.append(json.loads(b.data()))

Since our function from [02a Generate Training Data.ipynb](02a%20Generate%20Training%20Data.ipynb) simply returned a dictionary, we can load each result as a dataframe and concatenate them into a single dataframe:

In [None]:
df = pd.concat([gpd.GeoDataFrame(res["data"]) for res in res_list])
df.head(1)
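
To make the loading step concrete, here is a minimal sketch of the same pattern on made-up data. Only the top-level `"data"` key and the column names (`category_int` and the band names) come from this notebook; the payload values and every `toy_` name are hypothetical, and a plain `pandas.DataFrame` stands in for the `GeoDataFrame` to keep the sketch dependency-light.

```python
import json

import pandas as pd

# Hypothetical serialized results standing in for two result Blobs.
toy_blobs = [
    json.dumps(
        {"data": {"category_int": [0], "nir": [[10, 11]], "red": [[20, 21]], "green": [[30, 31]]}}
    ),
    json.dumps(
        {"data": {"category_int": [1], "nir": [[40, 41]], "red": [[50, 51]], "green": [[60, 61]]}}
    ),
]

# Decode each payload, as json.loads(b.data()) does above
toy_results = [json.loads(payload) for payload in toy_blobs]

# Build one dataframe per result and stack them into a single table
toy_df = pd.concat(
    [pd.DataFrame(res["data"]) for res in toy_results], ignore_index=True
)
print(toy_df)
```

Each row holds one labelled sample, and each band cell holds the list of pixel values for that sample. Passing `ignore_index=True` gives the combined table a clean row index, which the original cell does not do.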

Next up we'll define a simple function which converts each list of band values to numpy arrays:

In [None]:
def list_to_array(x, bands):
    val_list = [np.array(y) for y in x[bands].values]
    return np.stack(val_list).T
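
As a quick illustration of what this helper returns, the sketch below applies it to a single hypothetical row. The row contents and the `toy_` names are made up for illustration; the function body is copied from the cell above.

```python
import numpy as np
import pandas as pd


def list_to_array(x, bands):
    # One array per band, stacked to (n_bands, n_pixels), then transposed
    val_list = [np.array(y) for y in x[bands].values]
    return np.stack(val_list).T


toy_bands = ["nir", "red", "green"]

# Hypothetical row: four sampled pixels, with one list of values per band
toy_row = pd.Series(
    {
        "category_int": 1,
        "nir": [10, 11, 12, 13],
        "red": [20, 21, 22, 23],
        "green": [30, 31, 32, 33],
    }
)

toy_arr = list_to_array(toy_row, toy_bands)
print(toy_arr.shape)  # (4, 3): one row per pixel, one column per band
print(toy_arr[0])     # [10 20 30]: the first pixel's nir, red, green values
```

The transpose is what puts pixels on the rows and bands on the columns, which is the layout scikit-learn expects for its feature matrix.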

Next, we group our dataframe by cover type, apply our ndarray conversion function, and concatenate the results into two training arrays that are accepted by [`.fit(X, y)`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.fit). **X** has shape **(n_samples, n_features)**, and **y**, which holds a single class label per sample in this example, has shape **(n_samples,)**:

In [None]:
X_list = []
y_list = []
for group, group_df in df.groupby("category_int"):
    X_arrs = group_df.apply(lambda x: list_to_array(x, bands), axis=1)
    X_arr = np.concatenate([x for x in X_arrs])
    y_arr = np.full(X_arr.shape[0], group)
    X_list.append(X_arr)
    y_list.append(y_arr)

X = np.concatenate(X_list)
y = np.concatenate(y_list)
X.shape, y.shape
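
Before training, it can be worth checking how many samples each class contributes, since a heavily imbalanced label vector can make overall accuracy look better than it is. The sketch below shows one way to do that with `np.unique`; the label vector is hypothetical and would be replaced by `y` in this notebook.

```python
import numpy as np

# Hypothetical labels: 6 pixels of class 0, 3 of class 1, 1 of class 2
toy_y = np.concatenate([np.full(6, 0), np.full(3, 1), np.full(1, 2)])

# Count the number of samples per class
classes, counts = np.unique(toy_y, return_counts=True)
for cls, count in zip(classes, counts):
    print(f"class {cls}: {count} samples ({count / toy_y.size:.0%})")
```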

Now we can perform a simple [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html):

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
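
The split above is random, so the reported accuracy will vary slightly between runs. If reproducibility or class balance matters, one option is to pass `random_state` and `stratify`, as sketched here on synthetic data. The arrays, the seed values, and the `toy_` names are assumptions for illustration, not part of the original workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y: 100 "pixels" with 3 band values each,
# and an imbalanced two-class label vector (80 vs. 20 samples)
rng = np.random.default_rng(0)
toy_X = rng.integers(0, 256, size=(100, 3))
toy_y = np.repeat([0, 1], [80, 20])

# random_state fixes the shuffle; stratify keeps class proportions similar
# in the training and test subsets
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.33, stratify=toy_y, random_state=42
)

print(toy_X_train.shape, toy_X_test.shape)
print(np.bincount(toy_y_train), np.bincount(toy_y_test))
```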

Declare our classifier:

In [None]:
clf = RandomForestClassifier(n_jobs=-1, verbose=3)

Fit it on our training samples:

In [None]:
clf.fit(X_train, y_train)

And evaluate our performance:

In [None]:
yhat = clf.predict(X_test)
acc = accuracy_score(y_test, yhat)
acc
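
Overall accuracy can hide which cover types are being confused with one another. As a hedged sketch, the example below computes a confusion matrix and per-class recall on a small set of hypothetical labels; in this notebook, `y_test` and `yhat` would take the place of the toy arrays.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for three classes
toy_y_true = np.array([0, 0, 1, 1, 2, 2])
toy_y_pred = np.array([0, 1, 1, 1, 2, 0])

# Rows are true classes, columns are predicted classes
toy_cm = confusion_matrix(toy_y_true, toy_y_pred)
print(toy_cm)

# Recall per class: correct predictions divided by the true count per class
per_class_recall = toy_cm.diagonal() / toy_cm.sum(axis=1)
print(per_class_recall)

toy_acc = accuracy_score(toy_y_true, toy_y_pred)
print(toy_acc)
```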

## Testing Predictions
Now that we've trained the model, we can also check how it performs over imagery it has not seen. Here we define a single tile to run predictions over:

In [None]:
dltile = dl.geo.DLTile.from_latlon(
    30.2629, -97.7507, resolution=resolution, tilesize=1024, pad=0
)
dltile

Search NAIP over our sample tile:

In [None]:
naip_ic = (
    Product.get(pid)
    .images()
    .intersects(dltile)
    .filter(start <= p.acquired < end)
    .sort("acquired")
    .limit(None)
).collect()
naip_ic

Retrieve imagery as an ndarray:

In [None]:
ndarr = naip_ic.mosaic(
    bands=bands,
    bands_axis=-1,
)

Reshape to **(n_samples, n_features)**:

In [None]:
# With bands_axis=-1 the last axis holds the bands, i.e. the features
nx, ny, n_bands = ndarr.shape
in_ras_arr = ndarr.reshape(-1, n_bands)
in_ras_arr.shape
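
The reshape flattens the two spatial axes into one so that every pixel becomes a sample, and the predictions can later be reshaped back into an image. The sketch below demonstrates that round trip on a tiny synthetic array; the array and a per-pixel sum standing in for `clf.predict` are hypothetical.

```python
import numpy as np

# Tiny synthetic "image" with shape (rows, cols, bands)
toy_img = np.arange(2 * 3 * 3).reshape(2, 3, 3)
n_rows, n_cols, n_bands = toy_img.shape

# Flatten the spatial axes: one row per pixel, one column per band
toy_flat = toy_img.reshape(-1, n_bands)
print(toy_flat.shape)  # (6, 3)

# Stand-in for clf.predict: any function returning one value per pixel
toy_labels = toy_flat.sum(axis=1)

# Restore the spatial layout; pixel order is preserved by the reshape
toy_map = toy_labels.reshape(n_rows, n_cols)
print(toy_map.shape)  # (2, 3)
```

Because both reshapes use the same row-major order, the value at `toy_map[i, j]` corresponds to the pixel at `toy_img[i, j]`.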

And predict:

In [None]:
preds = clf.predict(in_ras_arr)

In [None]:
fig, ax = plt.subplots(figsize=(20, 10), nrows=1, ncols=2)
ax[0].imshow(ndarr)
ax[0].set_title("FCC")
ax[1].imshow(preds.reshape(nx, ny), cmap="terrain")
ax[1].set_title("RFC Preds")

We may want to outline building shadows next time! 

## Saving for Later

Once happy with the performance of a model we can save it as a .pickle file and store it as a blob:

In [None]:
import pickle

with open("training_rfc.pickle", "wb") as rfc_pkl_file:
    pickle.dump(clf, rfc_pkl_file)
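
Before uploading, it can be reassuring to confirm that the pickled model loads back and predicts identically. The sketch below does this with a small classifier trained on synthetic data in a temporary directory; the data, file name, and `toy_` names are hypothetical. Note that pickled scikit-learn models are generally expected to be loaded with the same scikit-learn version that saved them, and that pickle files should only be loaded from trusted sources.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic training data: 60 "pixels", 3 bands, label derived from band 0
rng = np.random.default_rng(0)
toy_X = rng.integers(0, 256, size=(60, 3))
toy_y = (toy_X[:, 0] > 127).astype(int)
toy_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(toy_X, toy_y)

# Write the model to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_rfc.pickle")
    with open(toy_path, "wb") as f:
        pickle.dump(toy_clf, f)
    with open(toy_path, "rb") as f:
        toy_restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly
same = np.array_equal(toy_clf.predict(toy_X), toy_restored.predict(toy_X))
print(same)
```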

#### _Note on Saving Blobs:_
We do not always need to delete and overwrite our objects on every iteration as in the following cell. This notebook is designed for demonstration purposes where we do not care about preserving each prior model.

In practice, as long as your Blob has a **unique** ID, you can ignore the following cell and simply run:

    blob = Blob(name="unique-model-name")
    blob.upload("rfc_file.pickle")
    blob.save()

In [None]:
try:
    # Create a new Blob object
    blob = Blob(
        name="training_rfc_model",
        tags=["examples"],
    )
    # Upload our pickled model to this Blob:
    blob.upload("training_rfc.pickle")
    blob.save()

except Exception:
    # We assume the failure means a Blob with this name already exists
    print("Blob already exists")
    blob = Blob.get(name="training_rfc_model", namespace=f"{org}:{user_id}")
    blob.delete()
    print("Deleted blob")
    # Recreate the Blob object
    blob = Blob(
        name="training_rfc_model",
        tags=["examples"],
    )
    # Upload our pickled model to this Blob:
    blob.upload("training_rfc.pickle")
    blob.save()
blob

And finally, we clean up the local pickle file:

In [None]:
os.remove("training_rfc.pickle")

Next, move on to [02c Deploying a Supervised Classifier.ipynb](02c%20Deploying%20a%20Supervised%20Classifier.ipynb) to scale the inference of the model we just trained!