# Unsupervised ML on the Descartes Labs Platform: Training a KMeans Classifier
This notebook demonstrates how to train a simple machine learning clustering model using Descartes Labs Platform APIs.

The general steps covered in this notebook are:
* Use [`Catalog`](https://docs.descarteslabs.com/descarteslabs/catalog/readme.html) to search for and rasterize pixel data from the __nir__, __red__, and __green__ bands of Sentinel-2 over the Burlington, VT area
* Train a simple [`KMeans`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) model using a specified number of clusters on spectral data retrieved from Catalog
* Save the pre-trained model as a [`Blob`](https://docs.descarteslabs.com/descarteslabs/catalog/docs/blob.html) to scale inference across the entire US State of Vermont in a Batch Compute [`Function`](https://docs.descarteslabs.com/descarteslabs/compute/readme.html#descarteslabs.compute.Function) defined in [01b Deploying an Unsupervised Classifier.ipynb](01b%20Deploying%20an%20Unsupervised%20Classifier.ipynb)

In [None]:
import descarteslabs as dl
from descarteslabs.catalog import Blob, Product, properties as p

In [None]:
import os, pickle
import numpy as np

from sklearn.cluster import KMeans

import matplotlib.pyplot as plt

First we define global variables used throughout this example, including the Product ID for [Sentinel-2 L2A](https://app.descarteslabs.com/explorer/datasets/esa:sentinel-2:l2a:v1) and a list of bands:

In [None]:
# Input Product ID and list of bands
s2_pid = "esa:sentinel-2:l2a:v1"
bands = ["nir", "red", "green"]

Next, we set the resolution, in meters, and the number of classes for our clustering model:

In [None]:
resolution = 10.0  # meters
n_classes = 5  # Num classes/clusters

## Generating Training Data with Catalog

In this example, we will train our model using imagery collected around Burlington, VT during the summer months of 2023. The first step is to define an [`AOI`](https://docs.descarteslabs.com/descarteslabs/geo/readme.html#descarteslabs.geo.AOI) over which we want to train our model:

In [None]:
geometry = {
    "type": "Polygon",
    "coordinates": [
        [
            [-73.27705090082665, 44.508008292897614],
            [-73.12833936666375, 44.507346829692835],
            [-73.12833936666375, 44.39147517499973],
            [-73.27921510194753, 44.390370530395586],
            [-73.27705090082665, 44.508008292897614],
        ]
    ],
}
aoi = dl.geo.AOI(geometry, resolution=resolution, crs="EPSG:26918")  # NAD83 / UTM zone 18N
aoi
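
Before pulling pixels, it can help to estimate how large the resulting raster will be. The sketch below is a back-of-the-envelope calculation, not the platform's own logic: it assumes a spherical approximation of roughly 111.32 km per degree and ignores the UTM projection, so the actual array shape may differ by some pixels. The names `corners`, `est_nx`, and `est_ny` are illustrative only.

```python
import math

# Corner coordinates (lon, lat) copied from the polygon above.
corners = [
    (-73.27705090082665, 44.508008292897614),
    (-73.12833936666375, 44.507346829692835),
    (-73.12833936666375, 44.39147517499973),
    (-73.27921510194753, 44.390370530395586),
]
resolution = 10.0  # meters per pixel

lons = [lon for lon, _ in corners]
lats = [lat for _, lat in corners]
mid_lat = math.radians((min(lats) + max(lats)) / 2)

# Rough spherical conversion: about 111.32 km per degree of latitude,
# shrinking by cos(latitude) for longitude. This ignores the UTM projection.
meters_per_degree = 111_320.0
width_m = (max(lons) - min(lons)) * meters_per_degree * math.cos(mid_lat)
height_m = (max(lats) - min(lats)) * meters_per_degree

# Approximate pixel counts along x and y, and the total number of samples.
est_nx = round(width_m / resolution)
est_ny = round(height_m / resolution)
print(est_nx, est_ny, est_nx * est_ny)
```

An estimate on the order of a million pixels per band suggests the training array fits comfortably in memory for three bands.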

Next we'll search Sentinel-2 for imagery with low cloud cover (cloud fraction below 0.1) acquired during the summer of 2023:

In [None]:
s2_prod = Product.get(s2_pid)
search = s2_prod.images()
ic = (
    search.intersects(aoi)
    .filter("2023-06-01" < p.acquired < "2023-09-01")
    .filter(p.cloud_fraction < 0.1)
    .limit(None)
).collect()
ic

Now we can mosaic our bands into a 3D ndarray:

In [None]:
mosaic = ic.mosaic(bands)
dl.utils.display(mosaic, size=5)

Next we must reshape our image from **(nbands, ny, nx)** to **(nsamples, nfeatures)**, the layout accepted by [`.fit()`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit). Each pixel becomes one sample and each band becomes one feature:

In [None]:
nbands, ny, nx = mosaic.shape
in_data = mosaic.transpose((1, 2, 0)).reshape((ny * nx, nbands))
in_data.shape
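
To make the index bookkeeping concrete, here is a minimal sketch on a small synthetic array rather than real imagery. The array `cube` and its values are made up for illustration, and it assumes the same band-first layout as the mosaic. It also shows one possible way to drop masked pixels before training, on the assumption that the mosaic is a NumPy masked array.

```python
import numpy as np

# Synthetic stand-in for a mosaic: 3 bands, 2 rows, 4 columns (arbitrary values).
nbands, ny, nx = 3, 2, 4
cube = np.ma.masked_array(
    np.arange(nbands * ny * nx, dtype=float).reshape(nbands, ny, nx),
    mask=np.zeros((nbands, ny, nx), dtype=bool),
)
# Pretend the pixel at row 1, column 2 is nodata in every band.
cube[:, 1, 2] = np.ma.masked

# Move the band axis last, then flatten the two spatial axes into one.
samples = cube.transpose((1, 2, 0)).reshape((ny * nx, nbands))

# Pixel (y, x) ends up in row y * nx + x, with one column per band.
y, x = 1, 1
assert np.array_equal(samples[y * nx + x].data, cube[:, y, x].data)

# Keep only pixels that are valid in every band.
valid = ~np.ma.getmaskarray(samples).any(axis=1)
train = samples[valid].data

print(samples.shape)
print(train.shape)
```

If masked pixels are dropped for training, the `valid` index is what lets predictions be written back into the right positions of the image grid.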

Now we fit a simple model, passing the specified number of clusters as the input argument.

_Note: This example is designed for demonstration purposes and is not designed to be optimally performant or accurate!_

In [None]:
kmeans = KMeans(n_clusters=n_classes, n_init="auto").fit(in_data)
kmeans
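
For intuition about what fitting does, the sketch below implements plain Lloyd's iteration in NumPy on synthetic two-cluster data. It is not scikit-learn's implementation, which adds smarter initialization and convergence checks; the function name `lloyd_kmeans` and the toy data are hypothetical.

```python
import numpy as np


def lloyd_kmeans(data, k, n_iter=20, seed=0):
    """Plain Lloyd's iteration: assign to the nearest centre, then re-average."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random.
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every centre: (n, k).
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            # Leave a centre where it is if no sample was assigned to it.
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels


# Toy data: two well-separated blobs in a 3-feature space (stand-ins for 3 bands).
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 3))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 3))
toy = np.vstack([blob_a, blob_b])

centers, labels = lloyd_kmeans(toy, k=2)
print(np.round(centers, 1))
```

Cluster indices are arbitrary: which blob ends up as cluster 0 depends on the random starting centres, which is also true of the fitted model above.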

And finally we call [`.predict()`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.predict) and reshape our results to display:

In [None]:
preds = kmeans.predict(in_data).reshape(ny, nx)
dl.utils.display(preds, size=5)
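
Prediction with a fitted KMeans model assigns each sample to its nearest cluster centre. The sketch below reproduces that assignment by hand on a tiny made-up image, reshapes the flat labels back to the image grid, and counts pixels per class. The `centers` and `pixels` values are hypothetical and chosen only to make the result easy to follow.

```python
import numpy as np

# Hypothetical cluster centres for 2 classes in a 3-band feature space.
centers = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])

# A tiny 2 x 3 "image" already flattened to (ny * nx, nbands).
ny, nx = 2, 3
pixels = np.array(
    [[1, 0, 1], [9, 10, 11], [0, 1, 0], [10, 9, 10], [11, 10, 9], [1, 1, 1]],
    dtype=float,
)

# Euclidean distance from every pixel to every centre, then pick the nearest.
dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
flat_labels = dists.argmin(axis=1)

# Restore the image grid and count how many pixels fall in each class.
label_map = flat_labels.reshape(ny, nx)
counts = np.bincount(flat_labels, minlength=len(centers))
print(label_map)
print(counts)
```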

## Saving for Later

Lastly, we will save our model as a .pickle file and store it as a blob:

In [None]:
with open("training_kmeans.pickle", "wb") as kmeans_pkl_file:
    pickle.dump(kmeans, kmeans_pkl_file)
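
When a pickle is loaded again later, it is worth confirming that the object survived the round trip. The sketch below uses a plain dictionary, `model_stub`, as a hypothetical stand-in for the fitted model so that it runs without scikit-learn. A real model is dumped and loaded the same way, though loading it generally needs a compatible scikit-learn version installed.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: any picklable Python object works the same way.
model_stub = {"n_clusters": 5, "centers": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]}

# Write to a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_stub.pickle")
    with open(path, "wb") as f:
        pickle.dump(model_stub, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object is an equal copy, not the same object.
print(restored == model_stub)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.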

In [None]:
# Current user's org and ID:
auth = dl.auth.Auth()
org = auth.payload["org"]
user_id = auth.namespace

#### **_Note on Saving Blobs:_** 

We do not always need to delete and overwrite our objects on every iteration as in the following cell. This notebook is designed for demonstration purposes where we do not care about preserving each prior model. 

In practice, as long as your Blob has a **unique** ID, you can skip the following cell and simply run:

    blob = Blob(name="unique-model-name")
    blob.upload("kmeans_file.pickle")
    blob.save()

In [None]:
try:
    # Create a new Blob object
    blob = Blob(
        name="training_kmeans_model",
        tags=["examples"],
    )
    # Upload our model to this Blob:
    blob.upload("training_kmeans.pickle")
    blob.save()
except Exception:  # assumes the failure means the Blob already exists
    print("Blob already exists, deleting old iteration:")
    # Already exists, overwriting
    blob = Blob.get(name="training_kmeans_model", namespace=f"{org}:{user_id}")
    blob.delete()
    print("Deleted blob")
    # Create a new Blob object
    blob = Blob(
        name="training_kmeans_model",
        tags=["examples"],
    )
    # Upload our model to this Blob:
    blob.upload("training_kmeans.pickle")
    blob.save()
print("Uploaded model to Blob")
blob

In [None]:
# Cleaning up our file:
os.remove("training_kmeans.pickle")

Next, move on to [01b Deploying an Unsupervised Classifier.ipynb](01b%20Deploying%20an%20Unsupervised%20Classifier.ipynb) to scale inference with the model we just trained!